├── .gitignore
├── .travis.yml
├── LICENSE
├── MANIFEST.in
├── README.md
├── aile
│   ├── __init__.py
│   ├── _kernel.pyx
│   ├── dtw.pyx
│   ├── kernel.py
│   ├── ptree.py
│   └── slybot_project.py
├── demo1.py
├── demo2.py
├── demo3.py
├── doc
│   ├── Makefile
│   ├── _static
│   │   ├── F_jk.pdf
│   │   ├── F_jk.svg
│   │   ├── F_jk_bars.pdf
│   │   ├── F_jk_bars.svg
│   │   ├── F_jk_graph.py
│   │   ├── F_jk_no_labels.svg
│   │   ├── HMM_1.pdf
│   │   ├── HMM_1.svg
│   │   ├── HMM_2.pdf
│   │   ├── HMM_2.svg
│   │   ├── PHMM_1.pdf
│   │   ├── PHMM_1.svg
│   │   ├── PHMM_2.pdf
│   │   ├── PHMM_2.svg
│   │   ├── PHMM_3.pdf
│   │   ├── PHMM_3.svg
│   │   ├── PHMM_4.pdf
│   │   ├── PHMM_4.svg
│   │   ├── forward.pdf
│   │   ├── forward.svg
│   │   ├── transition_matrix_1.pdf
│   │   ├── transition_matrix_1.svg
│   │   ├── transition_matrix_2.pdf
│   │   ├── transition_matrix_2.svg
│   │   ├── transition_matrix_3.pdf
│   │   ├── transition_matrix_3.svg
│   │   ├── transition_matrix_4.pdf
│   │   ├── transition_matrix_4.svg
│   │   ├── transition_matrix_5.pdf
│   │   └── transition_matrix_5.svg
│   ├── conf.py
│   ├── index.rst
│   ├── make.bat
│   ├── notes.rst
│   └── requirements.txt
├── misc
│   ├── demo1_img.png
│   ├── demo2_img.png
│   └── visual.py
├── requirements.txt
├── scripts
│   └── gen-slybot-project
├── setup.py
├── test
│   ├── Ars Technica.html
│   ├── Monster.html
│   ├── Patch of Land 2.html
│   ├── Patch of Land.html
│   ├── Reddit.html
│   ├── requirements.txt
│   ├── table.css
│   └── test_slybot.py
└── tox.ini
/.gitignore:
--------------------------------------------------------------------------------
1 | *.pyc
2 | *.html
3 | *.png
4 | *.so
5 | *#
6 | *~
7 | build/
--------------------------------------------------------------------------------
/.travis.yml:
--------------------------------------------------------------------------------
1 | language: python
2 | python: 2.7
3 |
4 | install:
5 | - pip install cython
6 | - pip install -U tox
7 |
8 | script:
9 | - travis_wait tox -vvv
10 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | The MIT License (MIT)
2 |
3 | Copyright (c) 2015 Pedro López-Adeva Fernández-Layos
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
23 |
--------------------------------------------------------------------------------
/MANIFEST.in:
--------------------------------------------------------------------------------
1 | include aile/*.pyx
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Automatic Item List Extraction [![Build Status](https://travis-ci.org/scrapinghub/aile.svg?branch=master)](https://travis-ci.org/scrapinghub/aile)
2 |
3 | This repository is a temporary container for experiments in automatic extraction of lists and tables from web pages.
4 | At some later point I will merge the surviving algorithms into either [scrapely](https://github.com/scrapy/scrapely)
5 | or [portia](https://github.com/scrapinghub/portia).
6 |
7 | I document my ideas and algorithm descriptions at [readthedocs](http://aile.readthedocs.org/en/latest/).
8 |
9 | The current approach is based on the HTML code of the page, treated as a stream of HTML tags as processed by
10 | [scrapely](https://github.com/scrapy/scrapely). An alternative approach would be to also use the web page
11 | rendering information ([this script](https://github.com/plafl/aile/blob/master/misc/visual.py) renders a tree
12 | of bounding boxes for each element).
13 |
14 | ## Installation
15 |     pip install -r requirements.txt
16 |     python setup.py develop
17 |
18 | ## Running
19 | If you want to get a feel for how it works, there are two demo scripts included in the repo.
20 |
21 | - demo1.py
22 | Will annotate the HTML code of a web page, marking in red the lines that form part of a repeating item
23 | and prefixing each line with its field number inside the item. The output is written to the file 'annotated.html'.
24 |
25 |     python demo1.py https://news.ycombinator.com
26 |
27 | ![demo1 output](misc/demo1_img.png)
28 |
29 | - demo2.py
30 | Will label, color and draw the HTML tree so that repeating elements are easy to see. The output is interactive
31 | (requires PyQt4).
32 |
33 |     python demo2.py https://news.ycombinator.com
34 |
35 | ![demo2 output](misc/demo2_img.png)
36 |
37 | ## Algorithms
38 |
39 | We are trying to auto-detect repeating patterns in the tags, not necessarily made of *li*, *tr* or *td* tags.
40 |
41 | ### Clustering trees with a measure of similarity
42 | The idea is to compute the distance between all subtrees in the web page and run a clustering algorithm on the resulting distance matrix.
43 | For a web page with N tags this can be achieved in O(N^2) time. The current algorithm actually computes a kernel and from the kernel
44 | derives the distance. The algorithm is based on:
45 |
46 | Kernels for semi-structured data
47 | Hisashi Kashima, Teruo Koyanagi
48 |
49 | Once the distance between all subtrees of the web page has been computed, DBSCAN clustering is run on the distance matrix.
50 | The resulting clusters are then refined a little more to produce the final items.
51 |
52 | ### Markov models
53 | The problem of detecting repeating patterns in streams is known as *motif discovery* and most of the literature about it seems
54 | to be published in the field of genetics. Inspired by this there is [a branch](https://github.com/plafl/aile/tree/markov_model)
55 | implementing MEME and Profile HMM algorithms.
56 |
57 | The Markov model approach has the following problems right now:
58 |
59 | - Requires several web pages for training, depending on the web page type
60 | - Training is performed using the EM algorithm, which requires several attempts until a good optimum is achieved
61 | - The number of hidden states is hard to determine. Some heuristics are applied, but they only work partially
62 |
63 | These problems are not insurmountable (I think) but require a lot of work:
64 |
65 | - Precision could be improved using [conditional random fields](https://en.wikipedia.org/wiki/Conditional_random_field).
66 | These could also alleviate the need for training data.
67 | - Training parallelizes well. This is actually already done using [joblib](https://pythonhosted.org/joblib/parallel.html)
68 | on a single PC, but it could be further improved using a cluster of computers.
69 | - There are some papers about hidden state merging/splitting and even an
70 | [infinite number of states](http://machinelearning.wustl.edu/mlpapers/paper_files/nips02-AA01.pdf)
71 |
--------------------------------------------------------------------------------
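To make the pipeline described in the README concrete, here is a minimal usage sketch of the code in this repository. It assumes the package and its Cython extensions have been built as described in the Installation section; the URL is only an example.

    import scrapely.htmlpage as hp
    import aile.kernel
    import aile.ptree

    # Download the page and build the tag tree (closing tags and empty text are dropped)
    page = hp.url_to_page('https://news.ycombinator.com')
    tree = aile.ptree.PageTree(page)

    # Kernel -> distance -> DBSCAN clustering -> item and field extraction
    ie = aile.kernel.ItemExtract(tree)
    for items, cells in ie.tables:
        print len(items), 'items with', cells.shape[1], 'fields'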
/aile/__init__.py:
--------------------------------------------------------------------------------
1 | import time
2 |
3 | import scrapely
4 |
5 | from . import slybot_project
6 | from . import kernel
7 | from . import ptree
8 |
9 | def generate_slybot_project(url, path='slybot-project', verbose=False):
10 | def _print(s):
11 | if verbose:
12 | print s,
13 |
14 | _print('Downloading URL...')
15 | t1 = time.clock()
16 | page = scrapely.htmlpage.url_to_page(url)
17 | _print('done ({0}s)\n'.format(time.clock() - t1))
18 |
19 | _print('Extracting items...')
20 | t1 = time.clock()
21 | ie = kernel.ItemExtract(ptree.PageTree(page), separate_descendants=True)
22 | _print('done ({0}s)\n'.format(time.clock() - t1))
23 |
24 | _print('Generating slybot project...')
25 | t1 = time.clock()
26 | slybot_project.generate(ie, path)
27 | _print('done ({0}s)\n'.format(time.clock() - t1))
28 |
29 | return ie
30 |
--------------------------------------------------------------------------------
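A quick sketch of how the helper above is meant to be called (the URL and output path are only examples):

    import aile

    # Downloads the page, extracts the repeating items and writes a slybot project
    ie = aile.generate_slybot_project(
        'https://news.ycombinator.com', path='slybot-project', verbose=True)
    print len(ie.items)  # number of groups of repeating items found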
/aile/_kernel.pyx:
--------------------------------------------------------------------------------
1 | # cython: linetrace=True
2 | # distutils: define_macros=CYTHON_TRACE_NOGIL=1
3 |
4 | import collections
5 | import itertools
6 |
7 | import numpy as np
8 | cimport numpy as np
9 | cimport cython
10 |
11 |
12 | def order_pairs(nodes):
13 | """Given a list of fragments return pairs of equal nodes ordered in
14 | such a way that if pair_1 comes before pair_2 then no element of pair_2 is
15 | a descendant of pair_1"""
16 | grouped = collections.defaultdict(list)
17 | for i, node in enumerate(nodes):
18 | grouped[node].append(i)
19 | return sorted(
20 | [pair
21 | for node, indices in grouped.iteritems()
22 | for pair in itertools.combinations_with_replacement(sorted(indices), 2)],
23 | key=lambda x: x[0], reverse=True)
24 |
25 |
26 | def check_order(op, parents):
27 | """Test for order_pairs"""
28 | N = len(parents)
29 | C = np.zeros((N, N), dtype=int)
30 | for i, j in op:
31 | pi = parents[i]
32 | pj = parents[j]
33 | if pi > 0 and pj > 0:
34 | assert C[pi, pj] == 0
35 | C[i, j] = 1
36 |
37 |
38 | def similarity(ptree, max_items=1):
39 | all_classes = list({node.class_attr for node in ptree.nodes})
40 | class_index = {c: i for i, c in enumerate(all_classes)}
41 | class_map = np.array([class_index[node.class_attr] for node in ptree.nodes])
42 | N = len(all_classes)
43 | similarity = np.zeros((N, N), dtype=float)
44 | for i in range(N):
45 | for j in range(N):
46 | li = min(max_items, len(all_classes[i]))
47 | lj = min(max_items, len(all_classes[j]))
48 | lk = min(max_items, len(all_classes[i] & all_classes[j]))
49 | similarity[i, j] = (1.0 + lk)/(1.0 + li + lj - lk)
50 | return class_map, similarity
51 |
52 |
53 | @cython.boundscheck(False)
54 | cpdef build_counts(ptree, int max_depth=4, int max_childs=20):
55 | cdef int N = len(ptree)
56 | if max_childs is None:
57 | max_childs = N
58 | pairs = order_pairs(ptree.nodes)
59 | cdef np.ndarray[np.double_t, ndim=2] sim
60 | cdef np.ndarray[np.int_t, ndim=1] cmap
61 | cmap, sim = similarity(ptree)
62 |
63 | cdef np.ndarray[np.double_t, ndim=3] C = np.zeros((N, N, max_depth), dtype=float)
64 | cdef np.ndarray[np.double_t, ndim=3] S = np.zeros(
65 | (max_childs + 1, max_childs + 1, max_depth), dtype=float)
66 | S[0, :, :] = S[:, 0, :] = 1
67 |
68 | cdef int i1, i2, j1, j2, k1, k2
69 | cdef np.ndarray[np.int_t, ndim=2] children = ptree.children_matrix(max_childs)
70 | for i1, i2 in pairs:
71 | s = sim[cmap[i1], cmap[i2]]
72 | if children[i1, 0] == -1 and children[i2, 0] == -1:
73 | C[i2, i1, :] = C[i1, i2, :] = s
74 | else:
75 | for j1 in range(1, max_childs + 1):
76 | k1 = children[i1, j1 - 1]
77 | if k1 < 0:
78 | break
79 | for j2 in range(1, max_childs + 1):
80 | k2 = children[i2, j2 - 1]
81 | if k2 < 0:
82 | break
83 | S[j1, j2, :] = S[j1 - 1, j2 , : ] +\
84 | S[j1 , j2 - 1, : ] -\
85 | S[j1 - 1, j2 - 1, : ]
86 | S[j1, j2, 1:] += S[j1 - 1, j2 - 1, 1:]*C[k1, k2, :-1]
87 | C[i2, i1, :] = C[i1, i2, :] = s*S[j1 - 1, j2 - 1, :]
88 | return C[:, :, max_depth - 1]
89 |
90 |
91 | @cython.boundscheck(False)
92 | cpdef kernel(ptree, counts=None, int max_depth=4, int max_childs=20, double decay=0.5):
93 | cdef np.ndarray[np.double_t, ndim=2] C
94 | if counts is None:
95 | C = build_counts(ptree, max_depth, max_childs)
96 | else:
97 | C = counts
98 |
99 | cdef np.ndarray[np.double_t, ndim=2] K = np.zeros((C.shape[0], C.shape[1]))
100 | cdef np.ndarray[np.double_t, ndim=2] A = C.copy()
101 | cdef np.ndarray[np.double_t, ndim=2] B = C.copy()
102 | cdef int N = K.shape[0]
103 | cdef int i, j, pi, pj, ri, rj
104 | for i in range(N - 1, -1, -1):
105 | pi = ptree.parents[i]
106 | for j in range(N - 1, -1, -1):
107 | pj = ptree.parents[j]
108 | if pi > 0:
109 | A[pi, j] += decay*A[i, j]
110 | if pj > 0:
111 | B[i, pj] += decay*B[i, j]
112 | for i in range(N - 1, -1, -1):
113 | pi = ptree.parents[i]
114 | for j in range(N - 1, -1, -1):
115 | ri = max(ptree.match[i], i)
116 | rj = max(ptree.match[j], j)
117 | K[i, j] += A[i, j] + B[i, j] - C[i, j]
118 | pj = ptree.parents[j]
119 | if pi > 0 and pj > 0:
120 | K[pi, pj] += decay*K[i, j]
121 | return K
122 |
123 |
124 | cpdef min_dist_complete(np.ndarray[np.double_t, ndim=2] D):
125 | cdef np.ndarray[np.double_t, ndim=2] R = D.copy()
126 | cdef int N = D.shape[0]
127 | cdef int i, j, k
128 | cdef double d
129 | for k in range(N):
130 | for i in range(N):
131 | for j in range(N):
132 | d = max(R[i, k], R[k, j])
133 | if R[i,j] > d:
134 | R[i, j] = d
135 | return R
136 |
--------------------------------------------------------------------------------
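As a side note, `min_dist_complete` above is a Floyd-Warshall style closure where the cost of a path is the largest distance along it. For illustration only (not part of the package), an equivalent plain NumPy sketch:

    import numpy as np

    def min_dist_complete_py(D):
        # R[i, j] = minimum over all paths i -> ... -> j of the largest
        # distance along the path (minimax / bottleneck shortest path)
        R = D.copy()
        for k in range(R.shape[0]):
            np.minimum(R, np.maximum(R[:, k:k+1], R[k:k+1, :]), out=R)
        return R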
/aile/dtw.pyx:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | cimport numpy as np
3 | cimport cython
4 |
5 |
6 | @cython.boundscheck(False)
7 | cpdef from_distance(np.ndarray[np.double_t, ndim=2] D):
8 | """Given a distance matrix compute the dynamic time warp distance.
9 |
10 | If the distance matrix 'D' is between sequences 's' and 't' then:
11 | 1. D[i, j] = |s[i] - t[j]|
12 | 2. DTW[i, j] represents the dynamic time warp distance between
13 | subsequences s[:i+1] and t[:j+1]
14 | 3. DTW[-1, -1] is the dynamic time warp distance between s and t
15 | """
16 | cdef int m = D.shape[0]
17 | cdef int n = D.shape[1]
18 | cdef np.ndarray[np.double_t, ndim=2] DTW = np.zeros((m + 1, n + 1))
19 | DTW[:, 0] = np.inf
20 | DTW[0, :] = np.inf
21 | DTW[0, 0] = 0
22 | cdef int i
23 | cdef int j
24 | for i in range(1, m + 1):
25 | for j in range(1, n + 1):
26 | DTW[i, j] = D[i - 1, j - 1] + min(
27 | DTW[i - 1, j ],
28 | DTW[i , j - 1],
29 | DTW[i - 1, j - 1])
30 | return DTW
31 |
32 |
33 | @cython.boundscheck(False)
34 | cpdef path(np.ndarray[np.double_t, ndim=2] DTW):
35 | """Given a DTW matrix backtrack to find the alignment between two sequences"""
36 | cdef int m = DTW.shape[0] - 1
37 | cdef int n = DTW.shape[1] - 1
38 | cdef int i = m - 1
39 | cdef int j = n - 1
40 | cdef np.ndarray[np.int_t, ndim=1] s = np.zeros((m,), dtype=int)
41 | cdef np.ndarray[np.int_t, ndim=1] t = np.zeros((n,), dtype=int)
42 | while i >= 0 or j >= 0:
43 | s[i] = j
44 | t[j] = i
45 | if DTW[i, j + 1] < DTW[i + 1, j]:
46 | if DTW[i, j + 1] < DTW[i, j]:
47 | i -= 1
48 | else:
49 | i -= 1
50 | j -= 1
51 | elif DTW[i + 1, j] < DTW[i, j]:
52 | j -= 1
53 | else:
54 | i -= 1
55 | j -= 1
56 | return s, t
57 |
58 |
59 | @cython.boundscheck(False)
60 | cpdef match(np.ndarray[np.int_t, ndim=1] s,
61 | np.ndarray[np.int_t, ndim=1] t,
62 | np.ndarray[np.double_t, ndim=2] D):
63 | """Given the alignments from two sequences find the match between elements.
64 |
65 | When aligning an element from one sequence can correspond to several elements
66 | of the other one. Matching resolves this ambiguity forcing unique pairings. If
67 | an element is unpaired then it as assigned -1.
68 | """
69 | s = s.copy()
70 | cdef int i, j, k, m
71 | cdef double d
72 | cdef int N = len(s)
73 | for i in range(N):
74 | j = s[i]
75 | m = k = i
76 | d = D[i, j]
77 | while k < N and s[k] == j:
78 | if D[k, j] < d:
79 | m = k
80 | d = D[k, j]
81 | k += 1
82 | k = i
83 | while k < N and s[k] == j:
84 | if k != m:
85 | s[k] = -1
86 | k += 1
87 | return s
88 |
--------------------------------------------------------------------------------
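A small usage sketch of the three functions above, assuming the Cython modules have been built (e.g. via `python setup.py develop`) so that `aile.dtw` is importable; the sequences are made up for illustration:

    import numpy as np
    from aile import dtw

    s = np.array([1.0, 2.0, 3.0, 3.0])
    t = np.array([1.0, 3.0, 4.0])
    D = np.abs(s[:, None] - t[None, :])  # D[i, j] = |s[i] - t[j]|
    DTW = dtw.from_distance(D)           # DTW[-1, -1] is the DTW distance
    a_s, a_t = dtw.path(DTW)             # alignment of the two sequences
    m = dtw.match(a_s, a_t, D)           # unique pairing, -1 means unpaired
    print DTW[-1, -1], m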
/aile/kernel.py:
--------------------------------------------------------------------------------
1 | import collections
2 | import itertools
3 |
4 | import numpy as np
5 | import sklearn.cluster, sklearn.neighbors
6 | import networkx as nx
7 | import scrapely.htmlpage as hp
8 |
9 | import _kernel as _ker
10 | import dtw
11 |
12 |
13 | def to_rows(d):
14 | """Make a square matrix with rows equal to 'd'.
15 |
16 | >>> print to_rows(np.array([1,2,3,4]))
17 | [[1 2 3 4]
18 | [1 2 3 4]
19 | [1 2 3 4]
20 | [1 2 3 4]]
21 | """
22 | return np.tile(d, (len(d), 1))
23 |
24 |
25 | def to_cols(d):
26 | """Make a square matrix with columns equal to 'd'.
27 |
28 | >>> print to_cols(np.array([1,2,3,4]))
29 | [[1 1 1 1]
30 | [2 2 2 2]
31 | [3 3 3 3]
32 | [4 4 4 4]]
33 | """
34 | return np.tile(d.reshape(len(d), -1), (1, len(d)))
35 |
36 |
37 | def normalize_kernel(K):
38 | """New kernel with unit diagonal.
39 |
40 | K'[i, j] = K[i, j]/sqrt(K[i,i]*K[j,j])
41 | """
42 | d = np.diag(K).copy()
43 | d[d == 0] = 1.0
44 | return K/np.sqrt(to_rows(d)*to_cols(d))
45 |
46 |
47 | def kernel_to_distance(K):
48 | """Build a distance matrix.
49 |
50 | From the dot product:
51 | |u - v|^2 = (u - v)(u - v) = u^2 + v^2 - 2uv
52 | """
53 | d = np.diag(K)
54 | D = to_rows(d) + to_cols(d) - 2*K
55 | D[D < 0] = 0.0 # numerical error can make D go a little below 0
56 | return np.sqrt(D)
57 |
58 |
59 | def tree_size_distance(page_tree):
60 | """Build a distance matrix comparing subtree sizes.
61 |
62 | If T1 and T2 are trees and N1 and N2 the number of nodes within:
63 | |T1 - T2| = |N1 - N2|/(N1 + N2)
64 | Since:
65 | N1 >= 1
66 | N2 >= 1
67 | Then:
68 | 0 <= |T1 - T2| < 1
69 | """
70 | s = page_tree.tree_size()
71 | a = to_cols(s).astype(float)
72 | b = to_rows(s).astype(float)
73 | return np.abs(a - b)/(a + b)
74 |
75 |
76 | def must_separate(nodes, page_tree):
77 | """Given a sequence of nodes and a PageTree return a list of pairs
78 | of nodes such that one is the ascendant/descendant of the other"""
79 | separate = []
80 | for src in nodes:
81 | m = page_tree.match[src]
82 | if m >= 0:
83 | for tgt in range(src+1, m):
84 | if tgt in nodes:
85 | separate.append((src, tgt))
86 | return separate
87 |
88 |
89 | def cut_descendants(D, nodes, page_tree):
90 | """Given the distance matrix D, a set of nodes and a PageTree
91 | perform a multicut of the complete graph of nodes separating
92 | the nodes that are descendant/ascendants of each other according to the
93 | PageTree"""
94 | index = {node: i for i, node in enumerate(nodes)}
95 | separate = [(index[i], index[j])
96 | for i, j in must_separate(nodes, page_tree)]
97 | if separate:
98 | D = D[nodes, :][:, nodes].copy()
99 | for i, j in separate:
100 | D[i, j] = D[j, i] = np.inf
101 | E = _ker.min_dist_complete(D)
102 | eps = min(E[i,j] for i, j in separate)
103 | components = nx.connected_components(
104 | nx.Graph((nodes[i], nodes[j])
105 | for (i, j) in zip(*np.nonzero(E < eps))))
106 | else:
107 | components = [nodes]
108 | return components
109 |
110 |
111 | def labels_to_clusters(labels):
112 | """Given a an assignment of cluster label to each item return the a list
113 | of sets, where each set is a cluster"""
114 | return [np.flatnonzero(labels==label) for label in range(np.max(labels)+1)]
115 |
116 |
117 | def clusters_to_labels(clusters, n_samples):
118 | """Given a list with clusters label each item"""
119 | labels = np.repeat(-1, n_samples)
120 | for i, c in enumerate(clusters):
121 | for j in c:
122 | labels[j] = i
123 | return labels
124 |
125 |
126 | def boost(d, k=2):
127 | """Given a distance between 0 and 1 make it more nonlinear"""
128 | return 1 - (1 - d)**k
129 |
130 |
131 | class TreeClustering(object):
132 | def __init__(self, page_tree):
133 | self.page_tree = page_tree
134 |
135 | def fit_predict(self, X, min_cluster_size=6, d1=1.0, d2=0.1, eps=1.0,
136 | separate_descendants=True):
137 | """Fit the data X and label each sample.
138 |
139 | X is a distance matrix of size (n_samples, n_samples), typically computed
140 | from a kernel. It is combined with the tree size distance (weighted by d1 and d2)
141 | and DBSCAN applied to the result. Finally, we enforce the constraint
142 | that a node cannot be inside the same cluster as any of its ascendants.
143 |
144 | Parameters
145 | ---------
146 | X : np.array
147 | Distance matrix computed from the kernel
148 | min_cluster_size : int
149 | Parameter to DBSCAN
150 | eps : int
151 | Parameter to DBSCAN
152 | d1 : float
153 | Weight of distance computed from X
154 | d2 : float
155 | Weight of distance computed from tree size
156 | separate_descendants : bool
157 | True to enforce the cannot-link constraints
158 |
159 | Returns
160 | -------
161 | np.array
162 | A label for each sample
163 | """
164 | Y = boost(tree_size_distance(self.page_tree), 2)
165 | D = d1*X + d2*Y
166 | clt = sklearn.cluster.DBSCAN(
167 | eps=eps, min_samples=min_cluster_size, metric='precomputed')
168 | self.clusters = []
169 | for c in labels_to_clusters(clt.fit_predict(D)):
170 | if len(c) >= min_cluster_size:
171 | if separate_descendants:
172 | self.clusters += filter(lambda x: len(x) >= min_cluster_size,
173 | cut_descendants(D, c, self.page_tree))
174 | else:
175 | self.clusters.append(c)
176 | self.labels = clusters_to_labels(self.clusters, D.shape[0])
177 | return self.labels
178 |
179 |
180 | def cluster(page_tree, K, eps=1.2, d1=1.0, d2=0.1, separate_descendants=True):
181 | """Asign to each node in the tree a cluster label.
182 |
183 | Returns
184 | -------
185 | np.array
186 | For each node a label id. Label ID -1 means that the node
187 | is an outlier (it isn't part of any cluster).
188 | """
189 | return TreeClustering(page_tree).fit_predict(
190 | kernel_to_distance(normalize_kernel(K)),
191 | eps=eps, d1=d1, d2=d2,
192 | separate_descendants=separate_descendants)
193 |
194 |
195 | def clusters_tournament(ptree, labels):
196 | """A cluster 'wins' if some node inside the cluster is the ascendant
197 | of another node in the other cluster"""
198 | L = np.max(labels) + 1
199 | T = np.zeros((L, L), dtype=int)
200 | for i, m in enumerate(ptree.match):
201 | li = labels[i]
202 | if li != -1:
203 | for j in range(max(i + 1, m)):
204 | lj = labels[j]
205 | if lj != -1:
206 | T[li, lj] += 1
207 | return T
208 |
209 |
210 | def _make_acyclic(T, labels):
211 | """See https://en.wikipedia.org/wiki/Feedback_arc_set"""
212 | n = T.shape[0]
213 | if n == 0:
214 | return []
215 | i = np.random.randint(0, n)
216 | L = []
217 | R = []
218 | for j in range(n):
219 | if j != i:
220 | if T[i, j] > T[j, i]:
221 | R.append(j)
222 | else:
223 | L.append(j)
224 | return (make_acyclic(T[L, :][:, L], labels[L]) +
225 | [labels[i]] +
226 | make_acyclic(T[R, :][:, R], labels[R]))
227 |
228 |
229 | def make_acyclic(T, labels=None):
230 | """Tiven a tournament T, try to rank the clusters in a consisten
231 | way"""
232 | if labels is None:
233 | labels = np.arange(T.shape[0])
234 | return _make_acyclic(T, labels)
235 |
236 |
237 | def separate_clusters(ptree, labels):
238 | """Make sure no tree node is contained in two different clusters"""
239 | ranking = make_acyclic(clusters_tournament(ptree, labels))
240 | clusters = labels_to_clusters(labels)
241 | labels = labels.copy()
242 | for i in ranking:
243 | for node in clusters[i]:
244 | labels[node+1:max(node+1, ptree.match[node])] = -1
245 | return labels
246 |
247 |
248 | def score_cluster(ptree, cluster, k=4):
249 | """Given a cluster assign a score. The higher the score the more probable
250 | that the cluster truly represents a repeating item"""
251 | if len(cluster) <= 1:
252 | return 0.0
253 | D = sklearn.neighbors.kneighbors_graph(
254 | ptree.distance[cluster, :][:, cluster], min(len(cluster) - 1, k),
255 | metric='precomputed', mode='distance')
256 | score = 0.0
257 | for i, j in zip(*D.nonzero()):
258 | a = cluster[i]
259 | b = cluster[j]
260 | si = max(a+1, ptree.match[a]) - a
261 | sj = max(b+1, ptree.match[b]) - b
262 | score += min(si, sj)/D[i, j]**2
263 | return score
264 |
265 |
266 | def some_root_has_label(labels, item, label):
267 | for root in item:
268 | if labels[root] == label:
269 | return True
270 | return False
271 |
272 |
273 | def extract_items_with_label(ptree, labels, label_to_extract):
274 | """Extract all items inside the labeled PageTree that are marked or have
275 | a sibling that is marked with label_to_extract.
276 |
277 | Returns
278 | -------
279 | List[tuple]
280 | Where each tuple is the roots of the extracted subtrees.
281 | """
282 | items = []
283 | i = 0
284 | while i < len(labels):
285 | children = ptree.children(i)
286 | if np.any(labels[children] == label_to_extract):
287 | first = None
288 | item = []
289 | for c in children:
290 | m = labels[c]
291 | if m != -1:
292 | if first is None:
293 | first = m
294 | elif m == first:
295 | if item:
296 | items.append(tuple(item))
297 | item = []
298 | # Only append tags as item roots
299 | if isinstance(ptree.page.parsed_body[ptree.index[c]], hp.HtmlTag):
300 | item.append(c)
301 | if item:
302 | items.append(tuple(item))
303 | i = ptree.match[i]
304 | else:
305 | i += 1
306 | return filter(lambda item: some_root_has_label(labels, item, label_to_extract),
307 | items)
308 |
309 |
310 | def vote(sequence):
311 | """Return the most frequent item in sequence"""
312 | return max(collections.Counter(sequence).iteritems(),
313 | key=lambda kv: kv[1])[0]
314 |
315 |
316 | def regularize_item_length(ptree, labels, item_locations, max_items_cut_per=0.33):
317 | """Make sure all item locations have the same number of roots"""
318 | if not item_locations:
319 | return item_locations
320 | min_item_length = vote(len(item_location) for item_location in item_locations)
321 | cut_items = sum(len(item_location) > min_item_length
322 | for item_location in item_locations)
323 | if cut_items > max_items_cut_per*len(item_locations):
324 | return []
325 | item_locations = filter(lambda x: len(x) >= min_item_length,
326 | item_locations)
327 | if cut_items > 0:
328 | label_count = collections.Counter(
329 | labels[root] for item_location in item_locations
330 | for root in item_location)
331 | new_item_locations = []
332 | for item_location in item_locations:
333 | if len(item_location) > min_item_length:
334 | scored = sorted(
335 | ((label_count[labels[root]], root) for root in item_location),
336 | reverse=True)
337 | keep = set(x[1] for x in scored[:min_item_length])
338 | new_item_location = tuple(
339 | root
340 | for root in item_location
341 | if root in keep)
342 | else:
343 | new_item_location = item_location
344 | new_item_locations.append(new_item_location)
345 | else:
346 | new_item_locations = item_locations
347 | return new_item_locations
348 |
349 |
350 | def extract_items(ptree, labels, min_n_items=6):
351 | """Extract the repeating items.
352 |
353 | The algorithm to extract the repeating items goes as follows:
354 | 1. Determine the label that covers most children on the page
355 | 2. If a node with that label has siblings, extract the siblings too,
356 | even if they have other labels.
357 |
358 | The output is a list of lists of items
359 | """
360 | labels = separate_clusters(ptree, labels)
361 | scores = sorted(
362 | enumerate(score_cluster(ptree, cluster)
363 | for cluster in labels_to_clusters(labels)),
364 | key=lambda kv: kv[1], reverse=True)
365 | items = []
366 | for label, score in scores:
367 | cluster = extract_items_with_label(ptree, labels, label)
368 | if len(cluster) < min_n_items:
369 | continue
370 | t = regularize_item_length(ptree, labels, cluster)
371 | if len(t) >= min_n_items:
372 | items.append(t)
373 | return items
374 |
375 |
376 | def path_distance(path_1, path_2):
377 | """Compute the prefix distance between the two paths.
378 |
379 | >>> p1 = [1, 0, 3, 4, 5, 6]
380 | >>> p2 = [1, 0, 2, 2, 2, 2, 2, 2]
381 | >>> print path_distance(p1, p2)
382 | 6
383 | """
384 | d = max(len(path_1), len(path_2))
385 | for a, b in zip(path_1, path_2):
386 | if a != b:
387 | break
388 | d -= 1
389 | return d
390 |
391 |
392 | def pairwise_path_distance(path_seq_1, path_seq_2):
393 | """Compute all pairwise distances between paths in path_seq_1 and
394 | path_seq_2"""
395 | N1 = len(path_seq_1)
396 | N2 = len(path_seq_2)
397 | D = np.zeros((N1, N2))
398 | for i in range(N1):
399 | q1 = path_seq_1[i]
400 | for j in range(N2):
401 | D[i, j] = path_distance(q1, path_seq_2[j])
402 | return D
403 |
404 |
405 | def extract_path_seq_1(ptree, item):
406 | paths = []
407 | for root in item:
408 | for path in ptree.prefixes_at(root):
409 | paths.append((path[0], path))
410 | return paths
411 |
412 |
413 | def extract_path_seq(ptree, items):
414 | all_paths = []
415 | for item in items:
416 | paths = extract_path_seq_1(ptree, item)
417 | all_paths.append(paths)
418 | return all_paths
419 |
420 |
421 | def map_paths_1(func, paths):
422 | return [(leaf, [func(node) for node in path])
423 | for leaf, path in paths]
424 |
425 |
426 | def map_paths(func, paths):
427 | return [map_paths_1(func, path_set) for path_set in paths]
428 |
429 |
430 | def find_cliques(G, min_size):
431 | """Find all cliques in G above a given size.
432 |
433 | If a node is part of a larger clique it is deleted from the smaller ones.
434 |
435 | Returns
436 | -------
437 | dict
438 | Mapping nodes to clique ID
439 | """
440 | cliques = []
441 | for K in nx.find_cliques(G):
442 | if len(K) >= min_size:
443 | cliques.append(set(K))
444 | cliques.sort(reverse=True, key=lambda x: len(x))
445 | L = set()
446 | for K in cliques:
447 | K -= L
448 | L |= K
449 | cliques = [J for J in cliques if len(J) >= min_size]
450 | node_to_clique = {}
451 | for i, K in enumerate(cliques):
452 | for node in K:
453 | if node not in node_to_clique:
454 | node_to_clique[node] = i
455 | return node_to_clique
456 |
457 |
458 | def match_graph(all_paths):
459 | """Build a graph where n1 and n2 share an edge if they have
460 | been matched using DTW"""
461 | G = nx.Graph()
462 | for path_set_1, path_set_2 in itertools.combinations(all_paths, 2):
463 | n1, p1 = zip(*path_set_1)
464 | n2, p2 = zip(*path_set_2)
465 | D = pairwise_path_distance(p1, p2)
466 | DTW = dtw.from_distance(D)
467 | a1, a2 = dtw.path(DTW)
468 | m = dtw.match(a1, a2, D)
469 | for i, j in enumerate(m):
470 | if j != -1:
471 | G.add_edge(n1[i], n2[j])
472 | return G
473 |
474 |
475 | def align_items(ptree, items, node_to_clique):
476 | n_cols = max(node_to_clique.values()) + 1
477 | table = np.zeros((len(items), n_cols), dtype=int) - 1
478 | for i, item in enumerate(items):
479 | for root in item:
480 | for c in range(root, max(root + 1, ptree.match[root])):
481 | try:
482 | table[i, node_to_clique[c]] = c
483 | except KeyError:
484 | pass
485 | return table
486 |
487 |
488 | def extract_item_table(ptree, items, labels):
489 | return align_items(
490 | ptree,
491 | items,
492 | find_cliques(
493 | match_graph(map_paths(
494 | lambda x: labels[x], extract_path_seq(ptree, items))),
495 | 0.5*len(items))
496 | )
497 |
498 |
499 | ItemTable = collections.namedtuple('ItemTable', ['items', 'cells'])
500 |
501 |
502 | class ItemExtract(object):
503 | def __init__(self, page_tree, k_max_depth=2, k_decay=0.5,
504 | c_eps=1.2, c_d1=1.0, c_d2=1.0, separate_descendants=True):
505 | """Perform all extraction operations in sequence.
506 |
507 | Parameters
508 | ----------
509 | k_max_depth : int
510 | Parameter to kernel computation
511 | k_decay : float
512 | Parameter to kernel computation
513 | c_eps : float
514 | Parameter to clustering
515 | c_d1 : float
516 | Parameter to clustering
517 | c_d2 : float
518 | Parameter to clustering
519 | separate_descendants : bool
520 | Parameter to clustering
521 | """
522 | self.page_tree = page_tree
523 | self.kernel = _ker.kernel(page_tree, max_depth=k_max_depth, decay=k_decay)
524 | self.labels = cluster(
525 | page_tree, self.kernel, eps=c_eps, d1=c_d1, d2=c_d2,
526 | separate_descendants=separate_descendants)
527 | self.items = extract_items(page_tree, self.labels)
528 | self.tables = [ItemTable(items, extract_item_table(page_tree, items, self.labels))
529 | for items in self.items]
530 | self.table_fragments = [
531 | ItemTable([page_tree.fragment_index(np.array(root)) for root in item],
532 | page_tree.fragment_index(fields))
533 | for item, fields in self.tables]
534 |
--------------------------------------------------------------------------------
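To see the relation between `normalize_kernel` and `kernel_to_distance` stated in the docstrings above, here is a tiny check with a toy kernel (a sketch; it assumes the package is installed so that `aile.kernel` and its compiled dependencies are importable):

    import numpy as np
    from aile import kernel as ker

    # Toy kernel: dot products of three 2-d vectors
    X = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
    K = X.dot(X.T)
    # kernel_to_distance recovers the Euclidean distances |x_i - x_j|
    D = ker.kernel_to_distance(K)
    E = np.sqrt(((X[:, None, :] - X[None, :, :])**2).sum(axis=-1))
    assert np.allclose(D, E)
    # normalize_kernel rescales K so that its diagonal is 1 (cosine similarity here)
    print ker.normalize_kernel(K)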
/aile/ptree.py:
--------------------------------------------------------------------------------
1 | import itertools
2 |
3 | import numpy as np
4 | import scrapely.htmlpage as hp
5 |
6 |
7 | def match_fragments(fragments, max_backtrack=20):
8 | """Find the closing fragment for every fragment.
9 |
10 | Returns
11 | -------
12 | numpy.array
13 | With as many elements as fragments. If the fragment has no
14 | closing pair then the array contains -1 at that position
15 | otherwise it contains the index of the closing pair
16 | """
17 | match = np.repeat(-1, len(fragments))
18 | stack = []
19 | for i, fragment in enumerate(fragments):
20 | if isinstance(fragment, hp.HtmlTag):
21 | if fragment.tag_type == hp.HtmlTagType.OPEN_TAG:
22 | stack.append((i, fragment))
23 | elif (fragment.tag_type == hp.HtmlTagType.CLOSE_TAG):
24 | if max_backtrack is None:
25 | max_j = len(stack)
26 | else:
27 | max_j = min(max_backtrack, len(stack))
28 | for j in range(1, max_j + 1):
29 | last_i, last_tag = stack[-j]
30 | if (last_tag.tag == fragment.tag):
31 | match[last_i] = i
32 | match[i] = last_i
33 | stack[-j:] = []
34 | break
35 | return match
36 |
37 |
38 | def is_tag(fragment):
39 | """Check if a fragment is also an HTML tag"""
40 | return isinstance(fragment, hp.HtmlTag)
41 |
42 |
43 | def get_class(fragment):
44 | """Return a set with class attributes for a given fragment"""
45 | if is_tag(fragment):
46 | return frozenset((fragment.attributes.get('class') or '').split())
47 | else:
48 | return frozenset()
49 |
50 |
51 | class TreeNode(object):
52 | __slots__ = ('tag', 'class_attr')
53 |
54 | def __init__(self, tag, class_attr=frozenset()):
55 | self.tag = tag
56 | self.class_attr = class_attr
57 |
58 | def __hash__(self):
59 | return hash(self.tag)
60 |
61 | def __eq__(self, other):
62 | return self.tag == other.tag
63 |
64 | def __str__(self):
65 | return self.__repr__().encode('ascii', 'backslashreplace')
66 |
67 | def __repr__(self):
68 | s = unicode(self.tag)
69 | if self.class_attr:
70 | s += u'['
71 | s += u','.join(self.class_attr)
72 | s += u']'
73 | return s
74 |
75 |
76 | def non_empty_text(page, fragment):
77 | return fragment.is_text_content and\
78 | page.body[fragment.start:fragment.end].strip()
79 |
80 |
81 | def fragment_to_node(page, fragment):
82 | """Convert a fragment to a node inside a tree where we are going
83 | to compute the kernel"""
84 | if non_empty_text(page, fragment):
85 | return TreeNode('[T]')
86 | elif (is_tag(fragment) and
87 | fragment.tag_type != hp.HtmlTagType.CLOSE_TAG):
88 | return TreeNode(fragment.tag, get_class(fragment))
89 | return None
90 |
91 |
92 | def tree_nodes(page):
93 | """Return a list of fragments from page where empty text has been deleted"""
94 | for i, fragment in enumerate(page.parsed_body):
95 | node = fragment_to_node(page, fragment)
96 | if node is not None:
97 | yield (i, node)
98 |
99 |
100 | class PageTree(object):
101 | def __init__(self, page):
102 | self.page = page
103 | index, self.nodes = zip(*tree_nodes(page))
104 | self.index = np.array(index)
105 | self.n_nodes = len(self.nodes)
106 | self.reverse_index = np.repeat(-1, len(page.parsed_body))
107 | for i, idx in enumerate(self.index):
108 | self.reverse_index[idx] = i
109 | match = match_fragments(page.parsed_body)
110 | self.match = np.repeat(-1, self.n_nodes)
111 | self.parents = np.repeat(-1, self.n_nodes)
112 | for i, m in enumerate(match):
113 | j = self.reverse_index[i]
114 | if j >= 0:
115 | if m >= 0:
116 | k = -1
117 | while k < 0:
118 | k = self.reverse_index[m]
119 | m += 1
120 | if m == len(match):
121 | k = len(self.match)
122 | break
123 | assert k >= 0
124 | else:
125 | k = j # no children
126 | self.match[j] = k
127 | for i, m in enumerate(self.match):
128 | self.parents[i+1:m] = i
129 |
130 | self.n_children = np.zeros((self.n_nodes,), dtype=int)
131 | self.i_child = np.zeros((self.n_nodes,), dtype=int)
132 | for i, p in enumerate(self.parents):
133 | if p > -1:
134 | self.i_child[i] = self.n_children[p]
135 | self.n_children[p] += 1
136 | self.max_childs = np.max(self.n_children)
137 |
138 | self.distance = np.ones((self.n_nodes, self.n_nodes), dtype=int)
139 | for i in range(self.n_nodes - 1, -1, -1):
140 | self.distance[i, i] = 0
141 | for a, b in itertools.combinations(self.children(i), 2):
142 | for j in range(a, max(a + 1, self.match[a])):
143 | for k in range(b, max(b + 1, self.match[b])):
144 | self.distance[j, k] = self.distance[j, a] + 2 + self.distance[b, k]
145 | self.distance[k, j] = self.distance[j, k]
146 |
147 | def __len__(self):
148 | """Number of nodes in tree"""
149 | return len(self.index)
150 |
151 | def children(self, i):
152 | """An array with the indices of the direct children of node 'i'"""
153 | return i + 1 + np.flatnonzero(self.parents[i+1:max(i+1, self.match[i])] == i)
154 |
155 | def children_matrix(self, max_childs=None):
156 | """A matrix of shape (len(tree), max_childs) where row 'i' contains the
157 | children of node 'i'"""
158 | if max_childs is None:
159 | max_childs = self.max_childs
160 | N = len(self.parents)
161 | C = np.repeat(-1, N*max_childs).reshape(N, max_childs)
162 | for i in range(N - 1, -1, -1):
163 | p = self.parents[i]
164 | if p >= 0:
165 | for j in range(max_childs):
166 | if C[p, j] == -1:
167 | C[p, j] = i
168 | break
169 | return C
170 |
171 | def siblings(self, i):
172 | """Siblings of node 'i'"""
173 | p = self.parents[i]
174 | if p != -1:
175 | return self.children(p)
176 | else:
177 | return np.flatnonzero(self.parents == -1)
178 |
179 | def prefix(self, i, stop_at=-1):
180 | """A path from 'i' going upwards up to 'stop_at'"""
181 | path = []
182 | p = i
183 | while p >= stop_at and p != -1:
184 | path.append(p)
185 | p = self.parents[p]
186 | return path
187 |
188 | def prefixes_at(self, i):
189 | """A list of paths going upwards that start at a descendant of 'i' and
190 | end at 'i'"""
191 | paths = []
192 | for j in range(i, max(i+1, self.match[i])):
193 | paths.append(self.prefix(j, i))
194 | return paths
195 |
196 | def tree_size(self):
197 | """Return an array where the i-th entry is the size of subtree 'i'"""
198 | r = np.arange(len(self.match))
199 | s = r + 1
200 | return np.where(s > self.match, s, self.match) - r
201 |
202 | def fragment_index(self, tree_index):
203 | """Convert from tree node numbering to original fragment numbers"""
204 | return np.where(
205 | tree_index > 0, self.index[tree_index], -1)
206 |
207 | def is_descendant(self, parent, descendant):
208 | return descendant >= parent and \
209 | descendant < max(parent + 1, self.match[parent])
210 |
211 | def common_ascendant(self, nodes):
212 | s = set(range(self.n_nodes))
213 | for node in nodes:
214 | s &= set(self.prefix(node))
215 | return max(s) if s else -1
216 |
--------------------------------------------------------------------------------
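A short sketch of how `PageTree` can be explored on a hand-written page. The HTML snippet is made up; it assumes scrapely is installed and that `scrapely.htmlpage.HtmlPage` is built from a unicode body as in the rest of the repo:

    import scrapely.htmlpage as hp
    from aile import ptree

    body = u"""
    <html><body><ul>
    <li class="item">A</li>
    <li class="item">B</li>
    </ul></body></html>
    """
    page = hp.HtmlPage(url='http://example.com', body=body)
    tree = ptree.PageTree(page)
    print len(tree)         # number of nodes (open tags and non-empty text)
    print tree.parents      # index of each node's parent (-1 for roots)
    print tree.children(0)  # direct children of the first node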
/demo1.py:
--------------------------------------------------------------------------------
1 | import time
2 | import sys
3 | import codecs
4 | import cgi
5 |
6 | import scrapely.htmlpage as hp
7 | import numpy as np
8 |
9 | import aile.kernel
10 | import aile.ptree
11 |
12 |
13 | def annotate(page, labels, out_path="annotated.html"):
14 | match = aile.ptree.match_fragments(page.parsed_body)
15 | with codecs.open(out_path, 'w', encoding='utf-8') as out:
16 | out.write("""
17 |
18 |
19 |