├── .gitignore ├── .travis.yml ├── LICENSE ├── MANIFEST.in ├── README.md ├── aile ├── __init__.py ├── _kernel.pyx ├── dtw.pyx ├── kernel.py ├── ptree.py └── slybot_project.py ├── demo1.py ├── demo2.py ├── demo3.py ├── doc ├── Makefile ├── _static │ ├── F_jk.pdf │ ├── F_jk.svg │ ├── F_jk_bars.pdf │ ├── F_jk_bars.svg │ ├── F_jk_graph.py │ ├── F_jk_no_labels.svg │ ├── HMM_1.pdf │ ├── HMM_1.svg │ ├── HMM_2.pdf │ ├── HMM_2.svg │ ├── PHMM_1.pdf │ ├── PHMM_1.svg │ ├── PHMM_2.pdf │ ├── PHMM_2.svg │ ├── PHMM_3.pdf │ ├── PHMM_3.svg │ ├── PHMM_4.pdf │ ├── PHMM_4.svg │ ├── forward.pdf │ ├── forward.svg │ ├── transition_matrix_1.pdf │ ├── transition_matrix_1.svg │ ├── transition_matrix_2.pdf │ ├── transition_matrix_2.svg │ ├── transition_matrix_3.pdf │ ├── transition_matrix_3.svg │ ├── transition_matrix_4.pdf │ ├── transition_matrix_4.svg │ ├── transition_matrix_5.pdf │ └── transition_matrix_5.svg ├── conf.py ├── index.rst ├── make.bat ├── notes.rst └── requirements.txt ├── misc ├── demo1_img.png ├── demo2_img.png └── visual.py ├── requirements.txt ├── scripts └── gen-slybot-project ├── setup.py ├── test ├── Ars Technica.html ├── Monster.html ├── Patch of Land 2.html ├── Patch of Land.html ├── Reddit.html ├── requirements.txt ├── table.css └── test_slybot.py └── tox.ini /.gitignore: -------------------------------------------------------------------------------- 1 | *.pyc 2 | *.html 3 | *.png 4 | *.so 5 | *# 6 | *~ 7 | build/ -------------------------------------------------------------------------------- /.travis.yml: -------------------------------------------------------------------------------- 1 | language: python 2 | python: 2.7 3 | 4 | install: 5 | - pip install cython 6 | - pip install -U tox 7 | 8 | script: 9 | - travis_wait tox -vvv 10 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | The MIT License (MIT) 2 | 3 | Copyright (c) 2015 Pedro López-Adeva Fernández-Layos 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
22 | 23 | -------------------------------------------------------------------------------- /MANIFEST.in: -------------------------------------------------------------------------------- 1 | include aile/*.pyx -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Automatic Item List Extraction [![Build Status](https://travis-ci.org/scrapinghub/aile.svg?branch=master)](https://travis-ci.org/scrapinghub/aile) 2 | 3 | This repository is a temporary container for experiments in the automatic extraction of lists and tables from web pages. 4 | At some later point I will merge the surviving algorithms into either [scrapely](https://github.com/scrapy/scrapely) 5 | or [portia](https://github.com/scrapinghub/portia). 6 | 7 | I document my ideas and algorithm descriptions at [readthedocs](http://aile.readthedocs.org/en/latest/). 8 | 9 | The current approach is based on the HTML code of the page, treated as a stream of HTML tags as processed by 10 | [scrapely](https://github.com/scrapy/scrapely). An alternative approach would be to also use the web page 11 | rendering information ([this script](https://github.com/plafl/aile/blob/master/misc/visual.py) renders a tree 12 | of bounding boxes for each element). 13 | 14 | ## Installation 15 | pip install -r requirements.txt 16 | python setup.py develop 17 | 18 | ## Running 19 | If you want to get a feel for how it works, there are two demo scripts included in the repo. 20 | 21 | - demo1.py 22 | Will annotate the HTML code of a web page, marking in red the lines that form part of the repeating item 23 | and prefixing each line with its field number inside the item. The output is written to the file 'annotated.html'. 24 | 25 | python demo1.py https://news.ycombinator.com 26 | 27 | ![annotated HTML](https://github.com/plafl/aile/blob/master/misc/demo1_img.png) 28 | 29 | - demo2.py 30 | Will label, color and draw the HTML tree so that repeating elements are easy to see. The output is interactive 31 | (requires PyQt4). 32 | 33 | python demo2.py https://news.ycombinator.com 34 | 35 | ![annotated tree](https://github.com/plafl/aile/blob/master/misc/demo2_img.png) 36 | 37 | ## Algorithms 38 | 39 | We are trying to auto-detect repeating patterns in the tags, not necessarily made up of *li*, *tr* or *td* tags. 40 | 41 | ### Clustering trees with a measure of similarity 42 | The idea is to compute the distance between all subtrees in the web page and run a clustering algorithm on this distance matrix. 43 | For a web page of size N this can be achieved in time O(N^2). The current algorithm actually computes a kernel and from the kernel 44 | derives the distance. The algorithm is based on: 45 | 46 | Kernels for semi-structured data 47 | Hisashi Kashima, Teruo Koyanagi 48 | 49 | Once the distance between all subtrees of the web page has been computed, DBSCAN clustering is run on the distance matrix. 50 | The clusters are then massaged a little more to produce the final result. 51 |
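If you prefer to drive it from code, this is roughly the pipeline that both demos run (a sketch; `ItemExtract` computes the kernel, derives the distance matrix and runs the clustering in one call):

    import scrapely.htmlpage as hp
    import aile.kernel
    import aile.ptree

    page = hp.url_to_page('https://news.ycombinator.com')
    ie = aile.kernel.ItemExtract(aile.ptree.PageTree(page))
    # ie.labels: cluster label of every tree node (-1 = no cluster)
    # ie.items:  the extracted repeating items
    # ie.tables: the items aligned into rows and columns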
52 | ### Markov models 53 | The problem of detecting repeating patterns in streams is known as *motif discovery* and most of the literature about it seems 54 | to be published in the field of genetics. Inspired by this, there is [a branch](https://github.com/plafl/aile/tree/markov_model) 55 | (MEME and Profile HMM algorithms). 56 | 57 | The Markov model approach has the following problems right now: 58 | 59 | - Requires several web pages for training, depending on the web page type 60 | - Training is performed using the EM algorithm, which requires several attempts before a good optimum is achieved 61 | - The number of hidden states is hard to determine. There are some heuristics applied that only work partially 62 | 63 | These problems are not insurmountable (I think) but they require a lot of work: 64 | 65 | - Precision could be improved using [conditional random fields](https://en.wikipedia.org/wiki/Conditional_random_field). These could also alleviate the need for training data. 66 | - Training can largely run in parallel. This is already done using [joblib](https://pythonhosted.org/joblib/parallel.html) 67 | on a single PC, but it could be further improved using a cluster of computers 68 | - There are some papers about hidden state merging/splitting and even an 69 | [infinite number of states](http://machinelearning.wustl.edu/mlpapers/paper_files/nips02-AA01.pdf) 70 | -------------------------------------------------------------------------------- /aile/__init__.py: -------------------------------------------------------------------------------- 1 | import time 2 | 3 | import scrapely 4 | 5 | from . import slybot_project 6 | from . import kernel 7 | from . import ptree 8 | 9 | def generate_slybot_project(url, path='slybot-project', verbose=False): 10 | def _print(s): 11 | if verbose: 12 | print s, 13 | 14 | _print('Downloading URL...') 15 | t1 = time.clock() 16 | page = scrapely.htmlpage.url_to_page(url) 17 | _print('done ({0}s)\n'.format(time.clock() - t1)) 18 | 19 | _print('Extracting items...') 20 | t1 = time.clock() 21 | ie = kernel.ItemExtract(ptree.PageTree(page), separate_descendants=True) 22 | _print('done ({0}s)\n'.format(time.clock() - t1)) 23 | 24 | _print('Generating slybot project...') 25 | t1 = time.clock() 26 | slybot_project.generate(ie, path) 27 | _print('done ({0}s)\n'.format(time.clock() - t1)) 28 | 29 | return ie 30 | -------------------------------------------------------------------------------- /aile/_kernel.pyx: -------------------------------------------------------------------------------- 1 | # cython: linetrace=True 2 | # distutils: define_macros=CYTHON_TRACE_NOGIL=1 3 | 4 | import collections 5 | import itertools 6 | 7 | import numpy as np 8 | cimport numpy as np 9 | cimport cython 10 | 11 | 12 | def order_pairs(nodes): 13 | """Given a list of nodes return pairs of equal nodes ordered in 14 | such a way that if pair_1 comes before pair_2 then no element of pair_2 is 15 | a descendant of pair_1""" 16 | grouped = collections.defaultdict(list) 17 | for i, node in enumerate(nodes): 18 | grouped[node].append(i) 19 | return sorted( 20 | [pair 21 | for node, indices in grouped.iteritems() 22 | for pair in itertools.combinations_with_replacement(sorted(indices), 2)], 23 | key=lambda x: x[0], reverse=True) 24 | 25 | 26 | def check_order(op, parents): 27 | """Test for order_pairs""" 28 | N = len(parents) 29 | C = np.zeros((N, N), dtype=int) 30 | for i, j in op: 31 | pi = parents[i] 32 | pj = parents[j] 33 | if pi > 0 and pj > 0: 34 | assert C[pi, pj] == 0 35 | C[i, j] = 1 36 | 37 | 38 | def similarity(ptree, max_items=1): 39 | all_classes = list({node.class_attr for node in ptree.nodes}) 40 | class_index = {c: i for i, c in enumerate(all_classes)} 41 | class_map = np.array([class_index[node.class_attr] for node in ptree.nodes]) 42 | N = len(all_classes) 43 | similarity = np.zeros((N, N), dtype=float)
44 | for i in range(N): 45 | for j in range(N): 46 | li = min(max_items, len(all_classes[i])) 47 | lj = min(max_items, len(all_classes[j])) 48 | lk = min(max_items, len(all_classes[i] & all_classes[j])) 49 | similarity[i, j] = (1.0 + lk)/(1.0 + li + lj - lk) 50 | return class_map, similarity 51 | 52 | 53 | @cython.boundscheck(False) 54 | cpdef build_counts(ptree, int max_depth=4, int max_childs=20): 55 | cdef int N = len(ptree) 56 | if max_childs is None: 57 | max_childs = N 58 | pairs = order_pairs(ptree.nodes) 59 | cdef np.ndarray[np.double_t, ndim=2] sim 60 | cdef np.ndarray[np.int_t, ndim=1] cmap 61 | cmap, sim = similarity(ptree) 62 | 63 | cdef np.ndarray[np.double_t, ndim=3] C = np.zeros((N, N, max_depth), dtype=float) 64 | cdef np.ndarray[np.double_t, ndim=3] S = np.zeros( 65 | (max_childs + 1, max_childs + 1, max_depth), dtype=float) 66 | S[0, :, :] = S[:, 0, :] = 1 67 | 68 | cdef int i1, i2, j1, j2, k1, k2 69 | cdef np.ndarray[np.int_t, ndim=2] children = ptree.children_matrix(max_childs) 70 | for i1, i2 in pairs: 71 | s = sim[cmap[i1], cmap[i2]] 72 | if children[i1, 0] == -1 and children[i2, 0] == -1: 73 | C[i2, i1, :] = C[i1, i2, :] = s 74 | else: 75 | for j1 in range(1, max_childs + 1): 76 | k1 = children[i1, j1 - 1] 77 | if k1 < 0: 78 | break 79 | for j2 in range(1, max_childs + 1): 80 | k2 = children[i2, j2 - 1] 81 | if k2 < 0: 82 | break 83 | S[j1, j2, :] = S[j1 - 1, j2 , : ] +\ 84 | S[j1 , j2 - 1, : ] -\ 85 | S[j1 - 1, j2 - 1, : ] 86 | S[j1, j2, 1:] += S[j1 - 1, j2 - 1, 1:]*C[k1, k2, :-1] 87 | C[i2, i1, :] = C[i1, i2, :] = s*S[j1 - 1, j2 - 1, :] 88 | return C[:, :, max_depth - 1] 89 | 90 | 91 | @cython.boundscheck(False) 92 | cpdef kernel(ptree, counts=None, int max_depth=4, int max_childs=20, double decay=0.5): 93 | cdef np.ndarray[np.double_t, ndim=2] C 94 | if counts is None: 95 | C = build_counts(ptree, max_depth, max_childs) 96 | else: 97 | C = counts 98 | 99 | cdef np.ndarray[np.double_t, ndim=2] K = np.zeros((C.shape[0], C.shape[1])) 100 | cdef np.ndarray[np.double_t, ndim=2] A = C.copy() 101 | cdef np.ndarray[np.double_t, ndim=2] B = C.copy() 102 | cdef int N = K.shape[0] 103 | cdef int i, j, pi, pj, ri, rj 104 | for i in range(N - 1, -1, -1): 105 | pi = ptree.parents[i] 106 | for j in range(N - 1, -1, -1): 107 | pj = ptree.parents[j] 108 | if pi > 0: 109 | A[pi, j] += decay*A[i, j] 110 | if pj > 0: 111 | B[i, pj] += decay*B[i, j] 112 | for i in range(N - 1, -1, -1): 113 | pi = ptree.parents[i] 114 | for j in range(N - 1, -1, -1): 115 | ri = max(ptree.match[i], i) 116 | rj = max(ptree.match[j], j) 117 | K[i, j] += A[i, j] + B[i, j] - C[i, j] 118 | pj = ptree.parents[j] 119 | if pi > 0 and pj > 0: 120 | K[pi, pj] += decay*K[i, j] 121 | return K 122 | 123 | 124 | cpdef min_dist_complete(np.ndarray[np.double_t, ndim=2] D): 125 | cdef np.ndarray[np.double_t, ndim=2] R = D.copy() 126 | cdef int N = D.shape[0] 127 | cdef int i, j, k 128 | cdef double d 129 | for k in range(N): 130 | for i in range(N): 131 | for j in range(N): 132 | d = max(R[i, k], R[k, j]) 133 | if R[i,j] > d: 134 | R[i, j] = d 135 | return R 136 | -------------------------------------------------------------------------------- /aile/dtw.pyx: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | cimport numpy as np 3 | cimport cython 4 | 5 | 6 | @cython.boundscheck(False) 7 | cpdef from_distance(np.ndarray[np.double_t, ndim=2] D): 8 | """Given a distance matrix compute the dynamic time warp distance. 
9 | 10 | If the distance matrix 'D' is between sequences 's' and 't' then: 11 | 1. D[i, j] = |s[i] - t[j]| 12 | 2. DTW[i, j] represents the dynamic time warp distance between 13 | subsequences s[:i+1] and t[:j+1] 14 | 3. DTW[-1, -1] is the dynamic time warp distance between s and t 15 | """ 16 | cdef int m = D.shape[0] 17 | cdef int n = D.shape[1] 18 | cdef np.ndarray[np.double_t, ndim=2] DTW = np.zeros((m + 1, n + 1)) 19 | DTW[:, 0] = np.inf 20 | DTW[0, :] = np.inf 21 | DTW[0, 0] = 0 22 | cdef int i 23 | cdef int j 24 | for i in range(1, m + 1): 25 | for j in range(1, n + 1): 26 | DTW[i, j] = D[i - 1, j - 1] + min( 27 | DTW[i - 1, j ], 28 | DTW[i , j - 1], 29 | DTW[i - 1, j - 1]) 30 | return DTW 31 | 32 | 33 | @cython.boundscheck(False) 34 | cpdef path(np.ndarray[np.double_t, ndim=2] DTW): 35 | """Given a DTW matrix backtrack to find the alignment between two sequences""" 36 | cdef int m = DTW.shape[0] - 1 37 | cdef int n = DTW.shape[1] - 1 38 | cdef int i = m - 1 39 | cdef int j = n - 1 40 | cdef np.ndarray[np.int_t, ndim=1] s = np.zeros((m,), dtype=int) 41 | cdef np.ndarray[np.int_t, ndim=1] t = np.zeros((n,), dtype=int) 42 | while i >= 0 or j >= 0: 43 | s[i] = j 44 | t[j] = i 45 | if DTW[i, j + 1] < DTW[i + 1, j]: 46 | if DTW[i, j + 1] < DTW[i, j]: 47 | i -= 1 48 | else: 49 | i -= 1 50 | j -= 1 51 | elif DTW[i + 1, j] < DTW[i, j]: 52 | j -= 1 53 | else: 54 | i -= 1 55 | j -= 1 56 | return s, t 57 | 58 | 59 | @cython.boundscheck(False) 60 | cpdef match(np.ndarray[np.int_t, ndim=1] s, 61 | np.ndarray[np.int_t, ndim=1] t, 62 | np.ndarray[np.double_t, ndim=2] D): 63 | """Given the alignments from two sequences find the match between elements. 64 | 65 | When aligning, an element from one sequence can correspond to several elements 66 | of the other one. Matching resolves this ambiguity by forcing unique pairings. If 67 | an element is unpaired then it is assigned -1. 68 | """ 69 | s = s.copy() 70 | cdef int i, j, k, m 71 | cdef double d 72 | cdef int N = len(s) 73 | for i in range(N): 74 | j = s[i] 75 | m = k = i 76 | d = D[i, j] 77 | while k < N and s[k] == j: 78 | if D[k, j] < d: 79 | m = k 80 | d = D[k, j] 81 | k += 1 82 | k = i 83 | while k < N and s[k] == j: 84 | if k != m: 85 | s[k] = -1 86 | k += 1 87 | return s 88 | -------------------------------------------------------------------------------- /aile/kernel.py: -------------------------------------------------------------------------------- 1 | import collections 2 | import itertools 3 | 4 | import numpy as np 5 | import sklearn.cluster; import sklearn.neighbors  # neighbors is used by score_cluster 6 | import networkx as nx 7 | import scrapely.htmlpage as hp 8 | 9 | import _kernel as _ker 10 | import dtw 11 | 12 | 13 | def to_rows(d): 14 | """Make a square matrix with rows equal to 'd'. 15 | 16 | >>> print to_rows(np.array([1,2,3,4])) 17 | [[1 2 3 4] 18 | [1 2 3 4] 19 | [1 2 3 4] 20 | [1 2 3 4]] 21 | """ 22 | return np.tile(d, (len(d), 1)) 23 | 24 | 25 | def to_cols(d): 26 | """Make a square matrix with columns equal to 'd'. 27 | 28 | >>> print to_cols(np.array([1,2,3,4])) 29 | [[1 1 1 1] 30 | [2 2 2 2] 31 | [3 3 3 3] 32 | [4 4 4 4]] 33 | """ 34 | return np.tile(d.reshape(len(d), -1), (1, len(d))) 35 | 36 | 37 | def normalize_kernel(K): 38 | """New kernel with unit diagonal. 39 | 40 | K'[i, j] = K[i, j]/sqrt(K[i,i]*K[j,j]) 41 | """ 42 | d = np.diag(K).copy() 43 | d[d == 0] = 1.0 44 | return K/np.sqrt(to_rows(d)*to_cols(d)) 45 | 46 |
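# Example (hypothetical helper, for illustration only): normalize_kernel
# rescales a symmetric kernel to unit diagonal, and kernel_to_distance
# (defined next) turns the result into a distance matrix.
def _kernel_distance_example():
    K = np.array([[4.0, 2.0],
                  [2.0, 9.0]])
    Kn = normalize_kernel(K)    # off-diagonal becomes 2/sqrt(4*9) == 1/3
    D = kernel_to_distance(Kn)  # D[0, 1] == sqrt(1 + 1 - 2/3.)
    return Kn, D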
47 | def kernel_to_distance(K): 48 | """Build a distance matrix. 49 | 50 | From the dot product: 51 | |u - v|^2 = (u - v)(u - v) = u^2 + v^2 - 2uv 52 | """ 53 | d = np.diag(K) 54 | D = to_rows(d) + to_cols(d) - 2*K 55 | D[D < 0] = 0.0 # numerical error can make D go a little below 0 56 | return np.sqrt(D) 57 | 58 | 59 | def tree_size_distance(page_tree): 60 | """Build a distance matrix comparing subtree sizes. 61 | 62 | If T1 and T2 are trees and N1 and N2 the number of nodes within: 63 | |T1 - T2| = |N1 - N2|/(N1 + N2) 64 | Since: 65 | N1 >= 1 66 | N2 >= 1 67 | Then: 68 | 0 <= |T1 - T2| < 1 69 | """ 70 | s = page_tree.tree_size() 71 | a = to_cols(s).astype(float) 72 | b = to_rows(s).astype(float) 73 | return np.abs(a - b)/(a + b) 74 | 75 | 76 | def must_separate(nodes, page_tree): 77 | """Given a sequence of nodes and a PageTree return a list of pairs 78 | of nodes such that one is the ascendant/descendant of the other""" 79 | separate = [] 80 | for src in nodes: 81 | m = page_tree.match[src] 82 | if m >= 0: 83 | for tgt in range(src+1, m): 84 | if tgt in nodes: 85 | separate.append((src, tgt)) 86 | return separate 87 | 88 | 89 | def cut_descendants(D, nodes, page_tree): 90 | """Given the distance matrix D, a set of nodes and a PageTree 91 | perform a multicut of the complete graph of nodes separating 92 | the nodes that are descendants/ascendants of each other according to the 93 | PageTree""" 94 | index = {node: i for i, node in enumerate(nodes)} 95 | separate = [(index[i], index[j]) 96 | for i, j in must_separate(nodes, page_tree)] 97 | if separate: 98 | D = D[nodes, :][:, nodes].copy() 99 | for i, j in separate: 100 | D[i, j] = D[j, i] = np.inf 101 | E = _ker.min_dist_complete(D) 102 | eps = min(E[i,j] for i, j in separate) 103 | components = nx.connected_components( 104 | nx.Graph((nodes[i], nodes[j]) 105 | for (i, j) in zip(*np.nonzero(E < eps)))) 106 | else: 107 | components = [nodes] 108 | return components 109 | 110 | 111 | def labels_to_clusters(labels): 112 | """Given an assignment of cluster labels to items return a list of 113 | clusters, where each cluster is the array of indices of its items""" 114 | return [np.flatnonzero(labels==label) for label in range(np.max(labels)+1)] 115 | 116 | 117 | def clusters_to_labels(clusters, n_samples): 118 | """Given a list of clusters, label each item""" 119 | labels = np.repeat(-1, n_samples) 120 | for i, c in enumerate(clusters): 121 | for j in c: 122 | labels[j] = i 123 | return labels 124 | 125 | 126 | def boost(d, k=2): 127 | """Given a distance between 0 and 1 make it more nonlinear""" 128 | return 1 - (1 - d)**k 129 | 130 |
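# Example (hypothetical helper, for illustration only): labels_to_clusters
# and clusters_to_labels are inverse views of the same assignment; -1 marks
# outliers and survives the round trip.
def _cluster_labels_example():
    labels = np.array([0, 0, 1, -1, 1])
    clusters = labels_to_clusters(labels)   # [array([0, 1]), array([2, 4])]
    back = clusters_to_labels(clusters, len(labels))
    assert list(back) == [0, 0, 1, -1, 1]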
131 | class TreeClustering(object): 132 | def __init__(self, page_tree): 133 | self.page_tree = page_tree 134 | 135 | def fit_predict(self, X, min_cluster_size=6, d1=1.0, d2=0.1, eps=1.0, 136 | separate_descendants=True): 137 | """Fit the data X and label each sample. 138 | 139 | X is a kernel of size (n_samples, n_samples). From this kernel the 140 | distance matrix is computed and averaged with the tree size distance, 141 | and DBSCAN applied to the result. Finally, we enforce the constraint 142 | that a node cannot be inside the same cluster as any of its ascendants. 143 | 144 | Parameters 145 | ---------- 146 | X : np.array 147 | Kernel matrix 148 | min_cluster_size : int 149 | Parameter to DBSCAN 150 | eps : float 151 | Parameter to DBSCAN 152 | d1 : float 153 | Weight of distance computed from X 154 | d2 : float 155 | Weight of distance computed from tree size 156 | separate_descendants : bool 157 | True to enforce the cannot-link constraints 158 | 159 | Returns 160 | ------- 161 | np.array 162 | A label for each sample 163 | """ 164 | Y = boost(tree_size_distance(self.page_tree), 2) 165 | D = d1*X + d2*Y 166 | clt = sklearn.cluster.DBSCAN( 167 | eps=eps, min_samples=min_cluster_size, metric='precomputed') 168 | self.clusters = [] 169 | for c in labels_to_clusters(clt.fit_predict(D)): 170 | if len(c) >= min_cluster_size: 171 | if separate_descendants: 172 | self.clusters += filter(lambda x: len(x) >= min_cluster_size, 173 | cut_descendants(D, c, self.page_tree)) 174 | else: 175 | self.clusters.append(c) 176 | self.labels = clusters_to_labels(self.clusters, D.shape[0]) 177 | return self.labels 178 | 179 | 180 | def cluster(page_tree, K, eps=1.2, d1=1.0, d2=0.1, separate_descendants=True): 181 | """Assign to each node in the tree a cluster label. 182 | 183 | Returns 184 | ------- 185 | np.array 186 | For each node a label id. Label ID -1 means that the node 187 | is an outlier (it isn't part of any cluster). 188 | """ 189 | return TreeClustering(page_tree).fit_predict( 190 | kernel_to_distance(normalize_kernel(K)), 191 | eps=eps, d1=d1, d2=d2, 192 | separate_descendants=separate_descendants) 193 | 194 | 195 | def clusters_tournament(ptree, labels): 196 | """A cluster 'wins' if some node inside the cluster is the ascendant 197 | of another node in the other cluster""" 198 | L = np.max(labels) + 1 199 | T = np.zeros((L, L), dtype=int) 200 | for i, m in enumerate(ptree.match): 201 | li = labels[i] 202 | if li != -1: 203 | for j in range(max(i + 1, m)): 204 | lj = labels[j] 205 | if lj != -1: 206 | T[li, lj] += 1 207 | return T 208 | 209 | 210 | def _make_acyclic(T, labels): 211 | """See https://en.wikipedia.org/wiki/Feedback_arc_set""" 212 | n = T.shape[0] 213 | if n == 0: 214 | return [] 215 | i = np.random.randint(0, n) 216 | L = [] 217 | R = [] 218 | for j in range(n): 219 | if j != i: 220 | if T[i, j] > T[j, i]: 221 | R.append(j) 222 | else: 223 | L.append(j) 224 | return (make_acyclic(T[L, :][:, L], labels[L]) + 225 | [labels[i]] + 226 | make_acyclic(T[R, :][:, R], labels[R])) 227 | 228 | 229 | def make_acyclic(T, labels=None): 230 | """Given a tournament T, try to rank the clusters in a consistent 231 | way""" 232 | if labels is None: 233 | labels = np.arange(T.shape[0]) 234 | return _make_acyclic(T, labels) 235 | 236 | 237 | def separate_clusters(ptree, labels): 238 | """Make sure no tree node is contained in two different clusters""" 239 | ranking = make_acyclic(clusters_tournament(ptree, labels)) 240 | clusters = labels_to_clusters(labels) 241 | labels = labels.copy() 242 | for i in ranking: 243 | for node in clusters[i]: 244 | labels[node+1:max(node+1, ptree.match[node])] = -1 245 | return labels 246 | 247 |
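# Example (hypothetical helper, for illustration only): on a consistent
# tournament, make_acyclic recovers the dominance order no matter which
# random pivot is drawn. Here cluster 0 beats 1 and 2, and 1 beats 2.
def _make_acyclic_example():
    T = np.array([[0, 5, 4],
                  [1, 0, 3],
                  [0, 1, 0]])
    assert make_acyclic(T) == [0, 1, 2]  # most dominant cluster first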
248 | def score_cluster(ptree, cluster, k=4): 249 | """Given a cluster, assign a score. The higher the score the more probable 250 | that the cluster truly represents a repeating item""" 251 | if len(cluster) <= 1: 252 | return 0.0 253 | D = sklearn.neighbors.kneighbors_graph( 254 | ptree.distance[cluster, :][:, cluster], min(len(cluster) - 1, k), 255 | metric='precomputed', mode='distance') 256 | score = 0.0 257 | for i, j in zip(*D.nonzero()): 258 | a = cluster[i] 259 | b = cluster[j] 260 | si = max(a+1, ptree.match[a]) - a 261 | sj = max(b+1, ptree.match[b]) - b 262 | score += min(si, sj)/D[i, j]**2 263 | return score 264 | 265 | 266 | def some_root_has_label(labels, item, label): 267 | for root in item: 268 | if labels[root] == label: 269 | return True 270 | return False 271 | 272 | 273 | def extract_items_with_label(ptree, labels, label_to_extract): 274 | """Extract all items inside the labeled PageTree that are marked or have 275 | a sibling that is marked with label_to_extract. 276 | 277 | Returns 278 | ------- 279 | List[tuple] 280 | Where each tuple is the roots of the extracted subtrees. 281 | """ 282 | items = [] 283 | i = 0 284 | while i < len(labels): 285 | children = ptree.children(i) 286 | if np.any(labels[children] == label_to_extract): 287 | first = None 288 | item = [] 289 | for c in children: 290 | m = labels[c] 291 | if m != -1: 292 | if first is None: 293 | first = m 294 | elif m == first: 295 | if item: 296 | items.append(tuple(item)) 297 | item = [] 298 | # Only append tags as item roots 299 | if isinstance(ptree.page.parsed_body[ptree.index[c]], hp.HtmlTag): 300 | item.append(c) 301 | if item: 302 | items.append(tuple(item)) 303 | i = ptree.match[i] 304 | else: 305 | i += 1 306 | return filter(lambda item: some_root_has_label(labels, item, label_to_extract), 307 | items) 308 | 309 | 310 | def vote(sequence): 311 | """Return the most frequent item in sequence""" 312 | return max(collections.Counter(sequence).iteritems(), 313 | key=lambda kv: kv[1])[0] 314 | 315 | 316 | def regularize_item_length(ptree, labels, item_locations, max_items_cut_per=0.33): 317 | """Make sure all item locations have the same number of roots""" 318 | if not item_locations: 319 | return item_locations 320 | min_item_length = vote(len(item_location) for item_location in item_locations) 321 | cut_items = sum(len(item_location) > min_item_length 322 | for item_location in item_locations) 323 | if cut_items > max_items_cut_per*len(item_locations): 324 | return [] 325 | item_locations = filter(lambda x: len(x) >= min_item_length, 326 | item_locations) 327 | if cut_items > 0: 328 | label_count = collections.Counter( 329 | labels[root] for item_location in item_locations 330 | for root in item_location) 331 | new_item_locations = [] 332 | for item_location in item_locations: 333 | if len(item_location) > min_item_length: 334 | scored = sorted( 335 | ((label_count[labels[root]], root) for root in item_location), 336 | reverse=True) 337 | keep = set(x[1] for x in scored[:min_item_length]) 338 | new_item_location = tuple( 339 | root 340 | for root in item_location 341 | if root in keep) 342 | else: 343 | new_item_location = item_location 344 | new_item_locations.append(new_item_location) 345 | else: 346 | new_item_locations = item_locations 347 | return new_item_locations 348 | 349 | 350 | def extract_items(ptree, labels, min_n_items=6): 351 | """Extract the repeating items. 352 | 353 | The algorithm to extract the repeating items goes as follows: 354 | 1. Determine the label that covers most children on the page 355 | 2.
If a node with that label has siblings, extract the siblings too, 356 | even if they have other labels. 357 | 358 | The output is a list of lists of items 359 | """ 360 | labels = separate_clusters(ptree, labels) 361 | scores = sorted( 362 | enumerate(score_cluster(ptree, cluster) 363 | for cluster in labels_to_clusters(labels)), 364 | key=lambda kv: kv[1], reverse=True) 365 | items = [] 366 | for label, score in scores: 367 | cluster = extract_items_with_label(ptree, labels, label) 368 | if len(cluster) < min_n_items: 369 | continue 370 | t = regularize_item_length(ptree, labels, cluster) 371 | if len(t) >= min_n_items: 372 | items.append(t) 373 | return items 374 | 375 | 376 | def path_distance(path_1, path_2): 377 | """Compute the prefix distance between the two paths. 378 | 379 | >>> p1 = [1, 0, 3, 4, 5, 6] 380 | >>> p2 = [1, 0, 2, 2, 2, 2, 2, 2] 381 | >>> print path_distance(p1, p2) 382 | 6 383 | """ 384 | d = max(len(path_1), len(path_2)) 385 | for a, b in zip(path_1, path_2): 386 | if a != b: 387 | break 388 | d -= 1 389 | return d 390 | 391 | 392 | def pairwise_path_distance(path_seq_1, path_seq_2): 393 | """Compute all pairwise distances between paths in path_seq_1 and 394 | path_seq_2""" 395 | N1 = len(path_seq_1) 396 | N2 = len(path_seq_2) 397 | D = np.zeros((N1, N2)) 398 | for i in range(N1): 399 | q1 = path_seq_1[i] 400 | for j in range(N2): 401 | D[i, j] = path_distance(q1, path_seq_2[j]) 402 | return D 403 | 404 | 405 | def extract_path_seq_1(ptree, item): 406 | paths = [] 407 | for root in item: 408 | for path in ptree.prefixes_at(root): 409 | paths.append((path[0], path)) 410 | return paths 411 | 412 | 413 | def extract_path_seq(ptree, items): 414 | all_paths = [] 415 | for item in items: 416 | paths = extract_path_seq_1(ptree, item) 417 | all_paths.append(paths) 418 | return all_paths 419 | 420 | 421 | def map_paths_1(func, paths): 422 | return [(leaf, [func(node) for node in path]) 423 | for leaf, path in paths] 424 | 425 | 426 | def map_paths(func, paths): 427 | return [map_paths_1(func, path_set) for path_set in paths] 428 | 429 | 430 | def find_cliques(G, min_size): 431 | """Find all cliques in G above a given size. 432 | 433 | If a node is part of a larger clique, it is deleted from the smaller ones.
434 | 435 | Returns 436 | ------- 437 | dict 438 | Mapping nodes to clique ID 439 | """ 440 | cliques = [] 441 | for K in nx.find_cliques(G): 442 | if len(K) >= min_size: 443 | cliques.append(set(K)) 444 | cliques.sort(reverse=True, key=lambda x: len(x)) 445 | L = set() 446 | for K in cliques: 447 | K -= L 448 | L |= K 449 | cliques = [J for J in cliques if len(J) >= min_size] 450 | node_to_clique = {} 451 | for i, K in enumerate(cliques): 452 | for node in K: 453 | if node not in node_to_clique: 454 | node_to_clique[node] = i 455 | return node_to_clique 456 | 457 | 458 | def match_graph(all_paths): 459 | """Build a graph where n1 and n2 share an edge if they have 460 | been matched using DTW""" 461 | G = nx.Graph() 462 | for path_set_1, path_set_2 in itertools.combinations(all_paths, 2): 463 | n1, p1 = zip(*path_set_1) 464 | n2, p2 = zip(*path_set_2) 465 | D = pairwise_path_distance(p1, p2) 466 | DTW = dtw.from_distance(D) 467 | a1, a2 = dtw.path(DTW) 468 | m = dtw.match(a1, a2, D) 469 | for i, j in enumerate(m): 470 | if j != -1: 471 | G.add_edge(n1[i], n2[j]) 472 | return G 473 | 474 | 475 | def align_items(ptree, items, node_to_clique): 476 | n_cols = max(node_to_clique.values()) + 1 477 | table = np.zeros((len(items), n_cols), dtype=int) - 1 478 | for i, item in enumerate(items): 479 | for root in item: 480 | for c in range(root, max(root + 1, ptree.match[root])): 481 | try: 482 | table[i, node_to_clique[c]] = c 483 | except KeyError: 484 | pass 485 | return table 486 | 487 | 488 | def extract_item_table(ptree, items, labels): 489 | return align_items( 490 | ptree, 491 | items, 492 | find_cliques( 493 | match_graph(map_paths( 494 | lambda x: labels[x], extract_path_seq(ptree, items))), 495 | 0.5*len(items)) 496 | ) 497 | 498 | 499 | ItemTable = collections.namedtuple('ItemTable', ['items', 'cells']) 500 | 501 | 502 | class ItemExtract(object): 503 | def __init__(self, page_tree, k_max_depth=2, k_decay=0.5, 504 | c_eps=1.2, c_d1=1.0, c_d2=1.0, separate_descendants=True): 505 | """Perform all extraction operations in sequence. 506 | 507 | Parameters 508 | ---------- 509 | k_max_depth : int 510 | Parameter to kernel computation 511 | k_decay : float 512 | Parameter to kernel computation 513 | c_eps : float 514 | Parameter to clustering 515 | c_d1 : float 516 | Parameter to clustering 517 | c_d2 : float 518 | Parameter to clustering 519 | separate_descendants : bool 520 | Parameter to clustering 521 | """ 522 | self.page_tree = page_tree 523 | self.kernel = _ker.kernel(page_tree, max_depth=k_max_depth, decay=k_decay) 524 | self.labels = cluster( 525 | page_tree, self.kernel, eps=c_eps, d1=c_d1, d2=c_d2, 526 | separate_descendants=separate_descendants) 527 | self.items = extract_items(page_tree, self.labels) 528 | self.tables = [ItemTable(items, extract_item_table(page_tree, items, self.labels)) 529 | for items in self.items] 530 | self.table_fragments = [ 531 | ItemTable([page_tree.fragment_index(np.array(root)) for root in item], 532 | page_tree.fragment_index(fields)) 533 | for item, fields in self.tables] 534 | -------------------------------------------------------------------------------- /aile/ptree.py: -------------------------------------------------------------------------------- 1 | import itertools 2 | 3 | import numpy as np 4 | import scrapely.htmlpage as hp 5 | 6 | 7 | def match_fragments(fragments, max_backtrack=20): 8 | """Find the closing fragment for every fragment. 9 | 10 | Returns 11 | ------- 12 | numpy.array 13 | With as many elements as fragments. 
If the fragment has no 14 | closing pair then the array contains -1 at that position 15 | otherwise it contains the index of the closing pair 16 | """ 17 | match = np.repeat(-1, len(fragments)) 18 | stack = [] 19 | for i, fragment in enumerate(fragments): 20 | if isinstance(fragment, hp.HtmlTag): 21 | if fragment.tag_type == hp.HtmlTagType.OPEN_TAG: 22 | stack.append((i, fragment)) 23 | elif (fragment.tag_type == hp.HtmlTagType.CLOSE_TAG): 24 | if max_backtrack is None: 25 | max_j = len(stack) 26 | else: 27 | max_j = min(max_backtrack, len(stack)) 28 | for j in range(1, max_j + 1): 29 | last_i, last_tag = stack[-j] 30 | if (last_tag.tag == fragment.tag): 31 | match[last_i] = i 32 | match[i] = last_i 33 | stack[-j:] = [] 34 | break 35 | return match 36 | 37 | 38 | def is_tag(fragment): 39 | """Check if a fragment is also an HTML tag""" 40 | return isinstance(fragment, hp.HtmlTag) 41 | 42 | 43 | def get_class(fragment): 44 | """Return a set with class attributes for a given fragment""" 45 | if is_tag(fragment): 46 | return frozenset((fragment.attributes.get('class') or '').split()) 47 | else: 48 | return frozenset() 49 | 50 | 51 | class TreeNode(object): 52 | __slots__ = ('tag', 'class_attr') 53 | 54 | def __init__(self, tag, class_attr=frozenset()): 55 | self.tag = tag 56 | self.class_attr = class_attr 57 | 58 | def __hash__(self): 59 | return hash(self.tag) 60 | 61 | def __eq__(self, other): 62 | return self.tag == other.tag 63 | 64 | def __str__(self): 65 | return self.__repr__().encode('ascii', 'backslashreplace') 66 | 67 | def __repr__(self): 68 | s = unicode(self.tag) 69 | if self.class_attr: 70 | s += u'[' 71 | s += u','.join(self.class_attr) 72 | s += u']' 73 | return s 74 | 75 | 76 | def non_empty_text(page, fragment): 77 | return fragment.is_text_content and\ 78 | page.body[fragment.start:fragment.end].strip() 79 | 80 | 81 | def fragment_to_node(page, fragment): 82 | """Convert a fragment to a node inside a tree where we are going 83 | to compute the kernel""" 84 | if non_empty_text(page, fragment): 85 | return TreeNode('[T]') 86 | elif (is_tag(fragment) and 87 | fragment.tag_type != hp.HtmlTagType.CLOSE_TAG): 88 | return TreeNode(fragment.tag, get_class(fragment)) 89 | return None 90 | 91 | 92 | def tree_nodes(page): 93 | """Return a list of fragments from page where empty text has been deleted""" 94 | for i, fragment in enumerate(page.parsed_body): 95 | node = fragment_to_node(page, fragment) 96 | if node is not None: 97 | yield (i, node) 98 | 99 | 100 | class PageTree(object): 101 | def __init__(self, page): 102 | self.page = page 103 | index, self.nodes = zip(*tree_nodes(page)) 104 | self.index = np.array(index) 105 | self.n_nodes = len(self.nodes) 106 | self.reverse_index = np.repeat(-1, len(page.parsed_body)) 107 | for i, idx in enumerate(self.index): 108 | self.reverse_index[idx] = i 109 | match = match_fragments(page.parsed_body) 110 | self.match = np.repeat(-1, self.n_nodes) 111 | self.parents = np.repeat(-1, self.n_nodes) 112 | for i, m in enumerate(match): 113 | j = self.reverse_index[i] 114 | if j >= 0: 115 | if m >= 0: 116 | k = -1 117 | while k < 0: 118 | k = self.reverse_index[m] 119 | m += 1 120 | if m == len(match): 121 | k = len(self.match) 122 | break 123 | assert k >= 0 124 | else: 125 | k = j # no children 126 | self.match[j] = k 127 | for i, m in enumerate(self.match): 128 | self.parents[i+1:m] = i 129 | 130 | self.n_children = np.zeros((self.n_nodes,), dtype=int) 131 | self.i_child = np.zeros((self.n_nodes,), dtype=int) 132 | for i, p in 
enumerate(self.parents): 133 | if p > -1: 134 | self.i_child[i] = self.n_children[p] 135 | self.n_children[p] += 1 136 | self.max_childs = np.max(self.n_children) 137 | 138 | self.distance = np.ones((self.n_nodes, self.n_nodes), dtype=int) 139 | for i in range(self.n_nodes - 1, -1, -1): 140 | self.distance[i, i] = 0 141 | for a, b in itertools.combinations(self.children(i), 2): 142 | for j in range(a, max(a + 1, self.match[a])): 143 | for k in range(b, max(b + 1, self.match[b])): 144 | self.distance[j, k] = self.distance[j, a] + 2 + self.distance[b, k] 145 | self.distance[k, j] = self.distance[j, k] 146 | 147 | def __len__(self): 148 | """Number of nodes in tree""" 149 | return len(self.index) 150 | 151 | def children(self, i): 152 | """An array with the indices of the direct children of node 'i'""" 153 | return i + 1 + np.flatnonzero(self.parents[i+1:max(i+1, self.match[i])] == i) 154 | 155 | def children_matrix(self, max_childs=None): 156 | """A matrix of shape (len(tree), max_childs) where row 'i' contains the 157 | children of node 'i'""" 158 | if max_childs is None: 159 | max_childs = self.max_childs 160 | N = len(self.parents) 161 | C = np.repeat(-1, N*max_childs).reshape(N, max_childs) 162 | for i in range(N - 1, -1, -1): 163 | p = self.parents[i] 164 | if p >= 0: 165 | for j in range(max_childs): 166 | if C[p, j] == -1: 167 | C[p, j] = i 168 | break 169 | return C 170 | 171 | def siblings(self, i): 172 | """Siblings of node 'i'""" 173 | p = self.parents[i] 174 | if p != -1: 175 | return self.children(p) 176 | else: 177 | return np.flatnonzero(self.parents == -1) 178 | 179 | def prefix(self, i, stop_at=-1): 180 | """A path from 'i' going upwards up to 'stop_at'""" 181 | path = [] 182 | p = i 183 | while p >= stop_at and p != -1: 184 | path.append(p) 185 | p = self.parents[p] 186 | return path 187 | 188 | def prefixes_at(self, i): 189 | """A list of paths going upwards that start at a descendant of 'i' and 190 | end at 'i'""" 191 | paths = [] 192 | for j in range(i, max(i+1, self.match[i])): 193 | paths.append(self.prefix(j, i)) 194 | return paths 195 | 196 | def tree_size(self): 197 | """Return an array where the i-th entry is the size of subtree 'i'""" 198 | r = np.arange(len(self.match)) 199 | s = r + 1 200 | return np.where(s > self.match, s, self.match) - r 201 | 202 | def fragment_index(self, tree_index): 203 | """Convert from tree node numbering to original fragment numbers""" 204 | return np.where( 205 | tree_index > 0, self.index[tree_index], -1) 206 | 207 | def is_descendant(self, parent, descendant): 208 | return descendant >= parent and \ 209 | descendant < max(parent + 1, self.match[parent]) 210 | 211 | def common_ascendant(self, nodes): 212 | s = set(range(self.n_nodes)) 213 | for node in nodes: 214 | s &= set(self.prefix(node)) 215 | return max(s) if s else -1 216 | -------------------------------------------------------------------------------- /demo1.py: -------------------------------------------------------------------------------- 1 | import time 2 | import sys 3 | import codecs 4 | import cgi 5 | 6 | import scrapely.htmlpage as hp 7 | import numpy as np 8 | 9 | import aile.kernel 10 | import aile.ptree 11 | 12 | 13 | def annotate(page, labels, out_path="annotated.html"): 14 | match = aile.ptree.match_fragments(page.parsed_body) 15 | with codecs.open(out_path, 'w', encoding='utf-8') as out: 16 | out.write(""" 17 | <html> 18 | <head> 19 | <style> span.label { color: red; } </style> 20 | </head> 36 | <body> 37 | <pre> 38 |
39 | """)
40 |         indent = 0
41 |         def write(s):
42 |             out.write(indent*'    ')
43 |             out.write(s)
44 | 
45 |         for i, (fragment, label) in enumerate(
46 |                 zip(page.parsed_body, labels)):
47 |             if label >= 0:
48 |                 out.write('<span class="label">')
49 |             else:
50 |                 out.write('<span>')
51 |             if isinstance(fragment, hp.HtmlTag):
52 |                 if fragment.tag_type == hp.HtmlTagType.CLOSE_TAG:
53 |                     if match[i] >= 0 and indent > 0:
54 |                         indent -= 1
55 |                     write(u'{0:3d}|&lt;/{1}&gt;'.format(label, fragment.tag))
56 |                 else:
57 |                     write(u'{0:3d}|&lt;{1}'.format(label, fragment.tag))
58 |                     for k,v in fragment.attributes.iteritems():
59 |                         out.write(u' {0}="{1}"'.format(k, v))
60 |                     if fragment.tag_type == hp.HtmlTagType.UNPAIRED_TAG:
61 |                         out.write('/')
62 |                     out.write('&gt;')
63 |                     if match[i] >= 0:
64 |                         indent += 1
65 |             else:
66 |                 write(u'{0:3d}|{1}'.format(
67 |                     label,
68 |                     cgi.escape(page.body[fragment.start:fragment.end].strip())))
69 |             out.write('</span>\n')
70 |         out.write("""
71 | </pre></body></html>
72 | 73 | """) 74 | 75 | 76 | if __name__ == '__main__': 77 | url = sys.argv[1] 78 | 79 | print 'Downloading URL...', 80 | t1 = time.clock() 81 | page = hp.url_to_page(url) 82 | print 'done ({0}s)'.format(time.clock() - t1) 83 | 84 | print 'Extracting items...', 85 | t1 = time.clock() 86 | ie = aile.kernel.ItemExtract(aile.ptree.PageTree(page)) 87 | print 'done ({0}s)'.format(time.clock() - t1) 88 | 89 | print 'Annotating HTML' 90 | labels = np.repeat(-1, len(ie.page_tree.page.parsed_body)) 91 | items, cells = ie.table_fragments[0] 92 | for i in range(cells.shape[0]): 93 | for j in range(cells.shape[1]): 94 | labels[cells[i, j]] = j 95 | annotate(ie.page_tree.page, labels) 96 | -------------------------------------------------------------------------------- /demo2.py: -------------------------------------------------------------------------------- 1 | import time 2 | import sys 3 | 4 | import scrapely.htmlpage as hp 5 | import numpy as np 6 | import ete2 7 | import matplotlib.pyplot as plt 8 | 9 | import aile.kernel 10 | import aile.ptree 11 | 12 | 13 | def color_map(n_colors): 14 | cmap = plt.cm.Set3(np.linspace(0, 1, n_colors)) 15 | cmap = np.round(cmap[:,:-1]*255).astype(int) 16 | def to_hex(c): 17 | return hex(c)[2:-1] 18 | return ['#' + ''.join(map(to_hex, row)) for row in cmap] 19 | 20 | 21 | def draw_tree(ptree, labels=None): 22 | root = ete2.Tree(name='root') 23 | T = [ete2.Tree(name=(str(node) + '[' + str(i) + ']')) 24 | for i, node in enumerate(ptree.nodes)] 25 | if labels is not None: 26 | for t, lab in zip(T, labels): 27 | t.name += '{' + str(lab) + '}' 28 | for i, p in enumerate(ptree.parents): 29 | if p > 0: 30 | T[p].add_child(T[i]) 31 | else: 32 | root.add_child(T[i]) 33 | cmap = color_map(max(labels) + 2) 34 | for t, l in zip(T, labels): 35 | ns = ete2.NodeStyle() 36 | ns['bgcolor'] = cmap[l] 37 | t.set_style(ns) 38 | if not t.is_leaf(): 39 | t.add_face(ete2.TextFace(t.name), column=0, position='branch-top') 40 | root.show() 41 | 42 | 43 | if __name__ == '__main__': 44 | url = sys.argv[1] 45 | 46 | print 'Downloading URL...', 47 | t1 = time.clock() 48 | page = hp.url_to_page(url) 49 | print 'done ({0}s)'.format(time.clock() - t1) 50 | 51 | print 'Extracting items...', 52 | t1 = time.clock() 53 | ie = aile.kernel.ItemExtract(aile.ptree.PageTree(page)) 54 | print 'done ({0}s)'.format(time.clock() - t1) 55 | 56 | print 'Drawing HTML tree' 57 | draw_tree(ie.page_tree, ie.labels) 58 | -------------------------------------------------------------------------------- /demo3.py: -------------------------------------------------------------------------------- 1 | import time 2 | import sys 3 | 4 | import scrapely.htmlpage as hp 5 | 6 | import aile.kernel 7 | import aile.ptree 8 | import aile.slybot_project 9 | 10 | if __name__ == '__main__': 11 | url = sys.argv[1] 12 | if len(sys.argv) > 2: 13 | out_path = sys.argv[2] 14 | else: 15 | out_path = './slybot-project' 16 | 17 | print 'Downloading URL...', 18 | t1 = time.clock() 19 | page = hp.url_to_page(url) 20 | print 'done ({0}s)'.format(time.clock() - t1) 21 | 22 | print 'Extracting items...', 23 | t1 = time.clock() 24 | ie = aile.kernel.ItemExtract(aile.ptree.PageTree(page)) 25 | print 'done ({0}s)'.format(time.clock() - t1) 26 | 27 | print 'Generating slybot project...', 28 | t1 = time.clock() 29 | aile.slybot_project.generate(ie, out_path) 30 | print 'done ({0}s)'.format(time.clock() - t1) 31 | -------------------------------------------------------------------------------- /doc/Makefile: 
-------------------------------------------------------------------------------- 1 | # Makefile for Sphinx documentation 2 | # 3 | 4 | # You can set these variables from the command line. 5 | SPHINXOPTS = 6 | SPHINXBUILD = sphinx-build 7 | PAPER = 8 | BUILDDIR = _build 9 | 10 | # User-friendly check for sphinx-build 11 | ifeq ($(shell which $(SPHINXBUILD) >/dev/null 2>&1; echo $$?), 1) 12 | $(error The '$(SPHINXBUILD)' command was not found. Make sure you have Sphinx installed, then set the SPHINXBUILD environment variable to point to the full path of the '$(SPHINXBUILD)' executable. Alternatively you can add the directory with the executable to your PATH. If you don't have Sphinx installed, grab it from http://sphinx-doc.org/) 13 | endif 14 | 15 | # Internal variables. 16 | PAPEROPT_a4 = -D latex_paper_size=a4 17 | PAPEROPT_letter = -D latex_paper_size=letter 18 | ALLSPHINXOPTS = -d $(BUILDDIR)/doctrees $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) . 19 | # the i18n builder cannot share the environment and doctrees with the others 20 | I18NSPHINXOPTS = $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) . 21 | 22 | .PHONY: help clean html dirhtml singlehtml pickle json htmlhelp qthelp devhelp epub latex latexpdf text man changes linkcheck doctest coverage gettext 23 | 24 | help: 25 | @echo "Please use \`make ' where is one of" 26 | @echo " html to make standalone HTML files" 27 | @echo " dirhtml to make HTML files named index.html in directories" 28 | @echo " singlehtml to make a single large HTML file" 29 | @echo " pickle to make pickle files" 30 | @echo " json to make JSON files" 31 | @echo " htmlhelp to make HTML files and a HTML help project" 32 | @echo " qthelp to make HTML files and a qthelp project" 33 | @echo " applehelp to make an Apple Help Book" 34 | @echo " devhelp to make HTML files and a Devhelp project" 35 | @echo " epub to make an epub" 36 | @echo " latex to make LaTeX files, you can set PAPER=a4 or PAPER=letter" 37 | @echo " latexpdf to make LaTeX files and run them through pdflatex" 38 | @echo " latexpdfja to make LaTeX files and run them through platex/dvipdfmx" 39 | @echo " text to make text files" 40 | @echo " man to make manual pages" 41 | @echo " texinfo to make Texinfo files" 42 | @echo " info to make Texinfo files and run them through makeinfo" 43 | @echo " gettext to make PO message catalogs" 44 | @echo " changes to make an overview of all changed/added/deprecated items" 45 | @echo " xml to make Docutils-native XML files" 46 | @echo " pseudoxml to make pseudoxml-XML files for display purposes" 47 | @echo " linkcheck to check all external links for integrity" 48 | @echo " doctest to run all doctests embedded in the documentation (if enabled)" 49 | @echo " coverage to run coverage check of the documentation (if enabled)" 50 | 51 | clean: 52 | rm -rf $(BUILDDIR)/* 53 | 54 | html: 55 | $(SPHINXBUILD) -b html $(ALLSPHINXOPTS) $(BUILDDIR)/html 56 | @echo 57 | @echo "Build finished. The HTML pages are in $(BUILDDIR)/html." 58 | 59 | dirhtml: 60 | $(SPHINXBUILD) -b dirhtml $(ALLSPHINXOPTS) $(BUILDDIR)/dirhtml 61 | @echo 62 | @echo "Build finished. The HTML pages are in $(BUILDDIR)/dirhtml." 63 | 64 | singlehtml: 65 | $(SPHINXBUILD) -b singlehtml $(ALLSPHINXOPTS) $(BUILDDIR)/singlehtml 66 | @echo 67 | @echo "Build finished. The HTML page is in $(BUILDDIR)/singlehtml." 68 | 69 | pickle: 70 | $(SPHINXBUILD) -b pickle $(ALLSPHINXOPTS) $(BUILDDIR)/pickle 71 | @echo 72 | @echo "Build finished; now you can process the pickle files." 
73 | 74 | json: 75 | $(SPHINXBUILD) -b json $(ALLSPHINXOPTS) $(BUILDDIR)/json 76 | @echo 77 | @echo "Build finished; now you can process the JSON files." 78 | 79 | htmlhelp: 80 | $(SPHINXBUILD) -b htmlhelp $(ALLSPHINXOPTS) $(BUILDDIR)/htmlhelp 81 | @echo 82 | @echo "Build finished; now you can run HTML Help Workshop with the" \ 83 | ".hhp project file in $(BUILDDIR)/htmlhelp." 84 | 85 | qthelp: 86 | $(SPHINXBUILD) -b qthelp $(ALLSPHINXOPTS) $(BUILDDIR)/qthelp 87 | @echo 88 | @echo "Build finished; now you can run "qcollectiongenerator" with the" \ 89 | ".qhcp project file in $(BUILDDIR)/qthelp, like this:" 90 | @echo "# qcollectiongenerator $(BUILDDIR)/qthelp/AILE.qhcp" 91 | @echo "To view the help file:" 92 | @echo "# assistant -collectionFile $(BUILDDIR)/qthelp/AILE.qhc" 93 | 94 | applehelp: 95 | $(SPHINXBUILD) -b applehelp $(ALLSPHINXOPTS) $(BUILDDIR)/applehelp 96 | @echo 97 | @echo "Build finished. The help book is in $(BUILDDIR)/applehelp." 98 | @echo "N.B. You won't be able to view it unless you put it in" \ 99 | "~/Library/Documentation/Help or install it in your application" \ 100 | "bundle." 101 | 102 | devhelp: 103 | $(SPHINXBUILD) -b devhelp $(ALLSPHINXOPTS) $(BUILDDIR)/devhelp 104 | @echo 105 | @echo "Build finished." 106 | @echo "To view the help file:" 107 | @echo "# mkdir -p $$HOME/.local/share/devhelp/AILE" 108 | @echo "# ln -s $(BUILDDIR)/devhelp $$HOME/.local/share/devhelp/AILE" 109 | @echo "# devhelp" 110 | 111 | epub: 112 | $(SPHINXBUILD) -b epub $(ALLSPHINXOPTS) $(BUILDDIR)/epub 113 | @echo 114 | @echo "Build finished. The epub file is in $(BUILDDIR)/epub." 115 | 116 | latex: 117 | $(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex 118 | @echo 119 | @echo "Build finished; the LaTeX files are in $(BUILDDIR)/latex." 120 | @echo "Run \`make' in that directory to run these through (pdf)latex" \ 121 | "(use \`make latexpdf' here to do that automatically)." 122 | 123 | latexpdf: 124 | $(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex 125 | @echo "Running LaTeX files through pdflatex..." 126 | $(MAKE) -C $(BUILDDIR)/latex all-pdf 127 | @echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex." 128 | 129 | latexpdfja: 130 | $(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex 131 | @echo "Running LaTeX files through platex and dvipdfmx..." 132 | $(MAKE) -C $(BUILDDIR)/latex all-pdf-ja 133 | @echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex." 134 | 135 | text: 136 | $(SPHINXBUILD) -b text $(ALLSPHINXOPTS) $(BUILDDIR)/text 137 | @echo 138 | @echo "Build finished. The text files are in $(BUILDDIR)/text." 139 | 140 | man: 141 | $(SPHINXBUILD) -b man $(ALLSPHINXOPTS) $(BUILDDIR)/man 142 | @echo 143 | @echo "Build finished. The manual pages are in $(BUILDDIR)/man." 144 | 145 | texinfo: 146 | $(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo 147 | @echo 148 | @echo "Build finished. The Texinfo files are in $(BUILDDIR)/texinfo." 149 | @echo "Run \`make' in that directory to run these through makeinfo" \ 150 | "(use \`make info' here to do that automatically)." 151 | 152 | info: 153 | $(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo 154 | @echo "Running Texinfo files through makeinfo..." 155 | make -C $(BUILDDIR)/texinfo info 156 | @echo "makeinfo finished; the Info files are in $(BUILDDIR)/texinfo." 157 | 158 | gettext: 159 | $(SPHINXBUILD) -b gettext $(I18NSPHINXOPTS) $(BUILDDIR)/locale 160 | @echo 161 | @echo "Build finished. The message catalogs are in $(BUILDDIR)/locale." 
162 | 163 | changes: 164 | $(SPHINXBUILD) -b changes $(ALLSPHINXOPTS) $(BUILDDIR)/changes 165 | @echo 166 | @echo "The overview file is in $(BUILDDIR)/changes." 167 | 168 | linkcheck: 169 | $(SPHINXBUILD) -b linkcheck $(ALLSPHINXOPTS) $(BUILDDIR)/linkcheck 170 | @echo 171 | @echo "Link check complete; look for any errors in the above output " \ 172 | "or in $(BUILDDIR)/linkcheck/output.txt." 173 | 174 | doctest: 175 | $(SPHINXBUILD) -b doctest $(ALLSPHINXOPTS) $(BUILDDIR)/doctest 176 | @echo "Testing of doctests in the sources finished, look at the " \ 177 | "results in $(BUILDDIR)/doctest/output.txt." 178 | 179 | coverage: 180 | $(SPHINXBUILD) -b coverage $(ALLSPHINXOPTS) $(BUILDDIR)/coverage 181 | @echo "Testing of coverage in the sources finished, look at the " \ 182 | "results in $(BUILDDIR)/coverage/python.txt." 183 | 184 | xml: 185 | $(SPHINXBUILD) -b xml $(ALLSPHINXOPTS) $(BUILDDIR)/xml 186 | @echo 187 | @echo "Build finished. The XML files are in $(BUILDDIR)/xml." 188 | 189 | pseudoxml: 190 | $(SPHINXBUILD) -b pseudoxml $(ALLSPHINXOPTS) $(BUILDDIR)/pseudoxml 191 | @echo 192 | @echo "Build finished. The pseudo-XML files are in $(BUILDDIR)/pseudoxml." 193 | -------------------------------------------------------------------------------- /doc/_static/F_jk.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/scrapinghub/aile/251a98a762db470a7a191311193513350218dbba/doc/_static/F_jk.pdf -------------------------------------------------------------------------------- /doc/_static/F_jk_bars.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/scrapinghub/aile/251a98a762db470a7a191311193513350218dbba/doc/_static/F_jk_bars.pdf -------------------------------------------------------------------------------- /doc/_static/F_jk_graph.py: -------------------------------------------------------------------------------- 1 | import matplotlib.pyplot as plt 2 | import numpy as np 3 | 4 | def F(W, L, d): 5 | return d**L*(1 - d)/(1 - d**W) 6 | 7 | W = 6 8 | d = np.linspace(0.0, 1.0, 1000, endpoint=False) 9 | 10 | plt.figure() 11 | for L in xrange(0, W): 12 | plt.plot(d, F(W, L, d)) 13 | plt.savefig('F_jk_no_labels.svg') 14 | 15 | plt.figure() 16 | W = 10 17 | L = np.arange(W) 18 | j = L - 0.4 19 | d = 0.8 20 | v = np.roll(F(W, L, d), 2) 21 | plt.bar(j, v, width=0.8, color='r', label='d=0.8') 22 | plt.hold(True) 23 | j = L - 0.3 24 | d = 0.3 25 | v = np.roll(F(W, L, d), 2) 26 | plt.bar(j, v, width=0.6, color='b', label='d=0.3') 27 | plt.xlim(-0.5, W - 0.5) 28 | plt.legend() 29 | plt.title(r'$F(l,d)$ $W=10$') 30 | plt.xlabel('Motif state') 31 | plt.ylabel(r'$F$') 32 | plt.savefig('F_jk_bars.svg') 33 | plt.savefig('F_jk_bars.pdf') 34 | -------------------------------------------------------------------------------- /doc/_static/F_jk_no_labels.svg: -------------------------------------------------------------------------------- [SVG: curves of F(L, d) over d for L = 0..5 with W = 6, generated by F_jk_graph.py above]
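A quick aside on the `F` defined in F_jk_graph.py above (a sketch of ours, not part of the repo): for fixed `W` and `0 < d < 1`, the `W` values `F(W, 0, d) ... F(W, W-1, d)` form a truncated geometric distribution and sum to one, since `(1 - d)/(1 - d**W)` is exactly the normalizing constant; this is what makes the bars in F_jk_bars comparable across the two values of `d`:

    import numpy as np

    def F(W, L, d):
        return d**L*(1 - d)/(1 - d**W)

    W, d = 10, 0.8
    assert abs(F(W, np.arange(W), d).sum() - 1.0) < 1e-12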
-------------------------------------------------------------------------------- /doc/_static/HMM_1.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/scrapinghub/aile/251a98a762db470a7a191311193513350218dbba/doc/_static/HMM_1.pdf -------------------------------------------------------------------------------- /doc/_static/HMM_1.svg: -------------------------------------------------------------------------------- [SVG: hidden Markov model diagram; see HMM_1.pdf for a rendered version] -------------------------------------------------------------------------------- /doc/_static/HMM_2.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/scrapinghub/aile/251a98a762db470a7a191311193513350218dbba/doc/_static/HMM_2.pdf -------------------------------------------------------------------------------- /doc/_static/PHMM_1.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/scrapinghub/aile/251a98a762db470a7a191311193513350218dbba/doc/_static/PHMM_1.pdf -------------------------------------------------------------------------------- /doc/_static/PHMM_2.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/scrapinghub/aile/251a98a762db470a7a191311193513350218dbba/doc/_static/PHMM_2.pdf -------------------------------------------------------------------------------- /doc/_static/PHMM_3.pdf: --------------------------------------------------------------------------------
https://raw.githubusercontent.com/scrapinghub/aile/251a98a762db470a7a191311193513350218dbba/doc/_static/PHMM_3.pdf
--------------------------------------------------------------------------------
/doc/_static/PHMM_4.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/scrapinghub/aile/251a98a762db470a7a191311193513350218dbba/doc/_static/PHMM_4.pdf
--------------------------------------------------------------------------------
/doc/_static/forward.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/scrapinghub/aile/251a98a762db470a7a191311193513350218dbba/doc/_static/forward.pdf
--------------------------------------------------------------------------------
/doc/_static/transition_matrix_1.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/scrapinghub/aile/251a98a762db470a7a191311193513350218dbba/doc/_static/transition_matrix_1.pdf
--------------------------------------------------------------------------------
/doc/_static/transition_matrix_1.svg:
--------------------------------------------------------------------------------
[SVG markup omitted: sparsity pattern of the full transition matrix, with "Background" and "Motif" block labels]
--------------------------------------------------------------------------------
/doc/_static/transition_matrix_2.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/scrapinghub/aile/251a98a762db470a7a191311193513350218dbba/doc/_static/transition_matrix_2.pdf
--------------------------------------------------------------------------------
/doc/_static/transition_matrix_2.svg:
--------------------------------------------------------------------------------
[SVG markup omitted: sparsity pattern of the transition matrix without deletions, with "Background" and "Motif" block labels]
--------------------------------------------------------------------------------
/doc/_static/transition_matrix_3.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/scrapinghub/aile/251a98a762db470a7a191311193513350218dbba/doc/_static/transition_matrix_3.pdf
--------------------------------------------------------------------------------
/doc/_static/transition_matrix_4.pdf:
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/scrapinghub/aile/251a98a762db470a7a191311193513350218dbba/doc/_static/transition_matrix_4.pdf -------------------------------------------------------------------------------- /doc/_static/transition_matrix_5.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/scrapinghub/aile/251a98a762db470a7a191311193513350218dbba/doc/_static/transition_matrix_5.pdf -------------------------------------------------------------------------------- /doc/conf.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # 3 | # AILE documentation build configuration file, created by 4 | # sphinx-quickstart on Wed Sep 23 10:38:15 2015. 5 | # 6 | # This file is execfile()d with the current directory set to its 7 | # containing dir. 8 | # 9 | # Note that not all possible configuration values are present in this 10 | # autogenerated file. 11 | # 12 | # All configuration values have a default; values that are commented out 13 | # serve to show the default. 14 | 15 | import sys 16 | import os 17 | import shlex 18 | 19 | # If extensions (or modules to document with autodoc) are in another directory, 20 | # add these directories to sys.path here. If the directory is relative to the 21 | # documentation root, use os.path.abspath to make it absolute, like shown here. 22 | #sys.path.insert(0, os.path.abspath('.')) 23 | 24 | # -- General configuration ------------------------------------------------ 25 | 26 | # If your documentation needs a minimal Sphinx version, state it here. 27 | #needs_sphinx = '1.0' 28 | 29 | # Add any Sphinx extension module names here, as strings. They can be 30 | # extensions coming with Sphinx (named 'sphinx.ext.*') or your custom 31 | # ones. 32 | extensions = [ 33 | 'sphinx.ext.autodoc', 34 | 'sphinx.ext.mathjax', 35 | ] 36 | 37 | # Add any paths that contain templates here, relative to this directory. 38 | templates_path = ['_templates'] 39 | 40 | # The suffix(es) of source filenames. 41 | # You can specify multiple suffix as a list of string: 42 | # source_suffix = ['.rst', '.md'] 43 | source_suffix = '.rst' 44 | 45 | # The encoding of source files. 46 | #source_encoding = 'utf-8-sig' 47 | 48 | # The master toctree document. 49 | master_doc = 'index' 50 | 51 | # General information about the project. 52 | project = u'AILE' 53 | copyright = u'2015, Pedro López-Adeva' 54 | author = u'Pedro López-Adeva' 55 | 56 | # The version info for the project you're documenting, acts as replacement for 57 | # |version| and |release|, also used in various other places throughout the 58 | # built documents. 59 | # 60 | # The short X.Y version. 61 | version = '0.0.1' 62 | # The full version, including alpha/beta/rc tags. 63 | release = '0.0.1' 64 | 65 | # The language for content autogenerated by Sphinx. Refer to documentation 66 | # for a list of supported languages. 67 | # 68 | # This is also used if you do content translation via gettext catalogs. 69 | # Usually you set "language" from the command line for these cases. 70 | language = None 71 | 72 | # There are two options for replacing |today|: either, you set today to some 73 | # non-false value, then it is used: 74 | #today = '' 75 | # Else, today_fmt is used as the format for a strftime call. 
76 | #today_fmt = '%B %d, %Y' 77 | 78 | # List of patterns, relative to source directory, that match files and 79 | # directories to ignore when looking for source files. 80 | exclude_patterns = ['_build'] 81 | 82 | # The reST default role (used for this markup: `text`) to use for all 83 | # documents. 84 | #default_role = None 85 | 86 | # If true, '()' will be appended to :func: etc. cross-reference text. 87 | #add_function_parentheses = True 88 | 89 | # If true, the current module name will be prepended to all description 90 | # unit titles (such as .. function::). 91 | #add_module_names = True 92 | 93 | # If true, sectionauthor and moduleauthor directives will be shown in the 94 | # output. They are ignored by default. 95 | #show_authors = False 96 | 97 | # The name of the Pygments (syntax highlighting) style to use. 98 | pygments_style = 'sphinx' 99 | 100 | # A list of ignored prefixes for module index sorting. 101 | #modindex_common_prefix = [] 102 | 103 | # If true, keep warnings as "system message" paragraphs in the built documents. 104 | #keep_warnings = False 105 | 106 | # If true, `todo` and `todoList` produce output, else they produce nothing. 107 | todo_include_todos = False 108 | 109 | 110 | # -- Options for HTML output ---------------------------------------------- 111 | 112 | # The theme to use for HTML and HTML Help pages. See the documentation for 113 | # a list of builtin themes. 114 | html_theme = 'alabaster' 115 | 116 | # Theme options are theme-specific and customize the look and feel of a theme 117 | # further. For a list of options available for each theme, see the 118 | # documentation. 119 | #html_theme_options = {} 120 | 121 | # Add any paths that contain custom themes here, relative to this directory. 122 | #html_theme_path = [] 123 | 124 | # The name for this set of Sphinx documents. If None, it defaults to 125 | # " v documentation". 126 | #html_title = None 127 | 128 | # A shorter title for the navigation bar. Default is the same as html_title. 129 | #html_short_title = None 130 | 131 | # The name of an image file (relative to this directory) to place at the top 132 | # of the sidebar. 133 | #html_logo = None 134 | 135 | # The name of an image file (within the static path) to use as favicon of the 136 | # docs. This file should be a Windows icon file (.ico) being 16x16 or 32x32 137 | # pixels large. 138 | #html_favicon = None 139 | 140 | # Add any paths that contain custom static files (such as style sheets) here, 141 | # relative to this directory. They are copied after the builtin static files, 142 | # so a file named "default.css" will overwrite the builtin "default.css". 143 | html_static_path = ['_static'] 144 | 145 | # Add any extra paths that contain custom files (such as robots.txt or 146 | # .htaccess) here, relative to this directory. These files are copied 147 | # directly to the root of the documentation. 148 | #html_extra_path = [] 149 | 150 | # If not '', a 'Last updated on:' timestamp is inserted at every page bottom, 151 | # using the given strftime format. 152 | #html_last_updated_fmt = '%b %d, %Y' 153 | 154 | # If true, SmartyPants will be used to convert quotes and dashes to 155 | # typographically correct entities. 156 | #html_use_smartypants = True 157 | 158 | # Custom sidebar templates, maps document names to template names. 159 | #html_sidebars = {} 160 | 161 | # Additional templates that should be rendered to pages, maps page names to 162 | # template names. 
163 | #html_additional_pages = {} 164 | 165 | # If false, no module index is generated. 166 | #html_domain_indices = True 167 | 168 | # If false, no index is generated. 169 | #html_use_index = True 170 | 171 | # If true, the index is split into individual pages for each letter. 172 | #html_split_index = False 173 | 174 | # If true, links to the reST sources are added to the pages. 175 | #html_show_sourcelink = True 176 | 177 | # If true, "Created using Sphinx" is shown in the HTML footer. Default is True. 178 | #html_show_sphinx = True 179 | 180 | # If true, "(C) Copyright ..." is shown in the HTML footer. Default is True. 181 | #html_show_copyright = True 182 | 183 | # If true, an OpenSearch description file will be output, and all pages will 184 | # contain a tag referring to it. The value of this option must be the 185 | # base URL from which the finished HTML is served. 186 | #html_use_opensearch = '' 187 | 188 | # This is the file name suffix for HTML files (e.g. ".xhtml"). 189 | #html_file_suffix = None 190 | 191 | # Language to be used for generating the HTML full-text search index. 192 | # Sphinx supports the following languages: 193 | # 'da', 'de', 'en', 'es', 'fi', 'fr', 'hu', 'it', 'ja' 194 | # 'nl', 'no', 'pt', 'ro', 'ru', 'sv', 'tr' 195 | #html_search_language = 'en' 196 | 197 | # A dictionary with options for the search language support, empty by default. 198 | # Now only 'ja' uses this config value 199 | #html_search_options = {'type': 'default'} 200 | 201 | # The name of a javascript file (relative to the configuration directory) that 202 | # implements a search results scorer. If empty, the default will be used. 203 | #html_search_scorer = 'scorer.js' 204 | 205 | # Output file base name for HTML help builder. 206 | htmlhelp_basename = 'AILEdoc' 207 | 208 | # -- Options for LaTeX output --------------------------------------------- 209 | 210 | latex_elements = { 211 | # The paper size ('letterpaper' or 'a4paper'). 212 | #'papersize': 'letterpaper', 213 | 214 | # The font size ('10pt', '11pt' or '12pt'). 215 | #'pointsize': '10pt', 216 | 217 | # Additional stuff for the LaTeX preamble. 218 | #'preamble': '', 219 | 220 | # Latex figure (float) alignment 221 | #'figure_align': 'htbp', 222 | } 223 | 224 | # Grouping the document tree into LaTeX files. List of tuples 225 | # (source start file, target name, title, 226 | # author, documentclass [howto, manual, or own class]). 227 | latex_documents = [ 228 | (master_doc, 'AILE.tex', u'AILE Documentation', 229 | u'Pedro López-Adeva', 'manual'), 230 | ] 231 | 232 | # The name of an image file (relative to this directory) to place at the top of 233 | # the title page. 234 | #latex_logo = None 235 | 236 | # For "manual" documents, if this is true, then toplevel headings are parts, 237 | # not chapters. 238 | #latex_use_parts = False 239 | 240 | # If true, show page references after internal links. 241 | #latex_show_pagerefs = False 242 | 243 | # If true, show URL addresses after external links. 244 | #latex_show_urls = False 245 | 246 | # Documents to append as an appendix to all manuals. 247 | #latex_appendices = [] 248 | 249 | # If false, no module index is generated. 250 | #latex_domain_indices = True 251 | 252 | 253 | # -- Options for manual page output --------------------------------------- 254 | 255 | # One entry per manual page. List of tuples 256 | # (source start file, name, description, authors, manual section). 
257 | man_pages = [ 258 | (master_doc, 'aile', u'AILE Documentation', 259 | [author], 1) 260 | ] 261 | 262 | # If true, show URL addresses after external links. 263 | #man_show_urls = False 264 | 265 | 266 | # -- Options for Texinfo output ------------------------------------------- 267 | 268 | # Grouping the document tree into Texinfo files. List of tuples 269 | # (source start file, target name, title, author, 270 | # dir menu entry, description, category) 271 | texinfo_documents = [ 272 | (master_doc, 'AILE', u'AILE Documentation', 273 | author, 'AILE', 'One line description of project.', 274 | 'Miscellaneous'), 275 | ] 276 | 277 | # Documents to append as an appendix to all manuals. 278 | #texinfo_appendices = [] 279 | 280 | # If false, no module index is generated. 281 | #texinfo_domain_indices = True 282 | 283 | # How to display URL addresses: 'footnote', 'no', or 'inline'. 284 | #texinfo_show_urls = 'footnote' 285 | 286 | # If true, do not generate a @detailmenu in the "Top" node's menu. 287 | #texinfo_no_detailmenu = False 288 | -------------------------------------------------------------------------------- /doc/index.rst: -------------------------------------------------------------------------------- 1 | .. AILE documentation master file, created by 2 | sphinx-quickstart on Wed Sep 23 10:38:15 2015. 3 | You can adapt this file completely to your liking, but it should at least 4 | contain the root `toctree` directive. 5 | 6 | Welcome to AILE's documentation! 7 | ================================ 8 | 9 | Contents: 10 | 11 | .. toctree:: 12 | :maxdepth: 2 13 | 14 | notes 15 | 16 | Indices and tables 17 | ================== 18 | 19 | * :ref:`genindex` 20 | * :ref:`modindex` 21 | * :ref:`search` 22 | 23 | -------------------------------------------------------------------------------- /doc/make.bat: -------------------------------------------------------------------------------- 1 | @ECHO OFF 2 | 3 | REM Command file for Sphinx documentation 4 | 5 | if "%SPHINXBUILD%" == "" ( 6 | set SPHINXBUILD=sphinx-build 7 | ) 8 | set BUILDDIR=_build 9 | set ALLSPHINXOPTS=-d %BUILDDIR%/doctrees %SPHINXOPTS% . 10 | set I18NSPHINXOPTS=%SPHINXOPTS% . 11 | if NOT "%PAPER%" == "" ( 12 | set ALLSPHINXOPTS=-D latex_paper_size=%PAPER% %ALLSPHINXOPTS% 13 | set I18NSPHINXOPTS=-D latex_paper_size=%PAPER% %I18NSPHINXOPTS% 14 | ) 15 | 16 | if "%1" == "" goto help 17 | 18 | if "%1" == "help" ( 19 | :help 20 | echo.Please use `make ^` where ^ is one of 21 | echo. html to make standalone HTML files 22 | echo. dirhtml to make HTML files named index.html in directories 23 | echo. singlehtml to make a single large HTML file 24 | echo. pickle to make pickle files 25 | echo. json to make JSON files 26 | echo. htmlhelp to make HTML files and a HTML help project 27 | echo. qthelp to make HTML files and a qthelp project 28 | echo. devhelp to make HTML files and a Devhelp project 29 | echo. epub to make an epub 30 | echo. latex to make LaTeX files, you can set PAPER=a4 or PAPER=letter 31 | echo. text to make text files 32 | echo. man to make manual pages 33 | echo. texinfo to make Texinfo files 34 | echo. gettext to make PO message catalogs 35 | echo. changes to make an overview over all changed/added/deprecated items 36 | echo. xml to make Docutils-native XML files 37 | echo. pseudoxml to make pseudoxml-XML files for display purposes 38 | echo. linkcheck to check all external links for integrity 39 | echo. doctest to run all doctests embedded in the documentation if enabled 40 | echo. 
coverage to run coverage check of the documentation if enabled 41 | goto end 42 | ) 43 | 44 | if "%1" == "clean" ( 45 | for /d %%i in (%BUILDDIR%\*) do rmdir /q /s %%i 46 | del /q /s %BUILDDIR%\* 47 | goto end 48 | ) 49 | 50 | 51 | REM Check if sphinx-build is available and fallback to Python version if any 52 | %SPHINXBUILD% 2> nul 53 | if errorlevel 9009 goto sphinx_python 54 | goto sphinx_ok 55 | 56 | :sphinx_python 57 | 58 | set SPHINXBUILD=python -m sphinx.__init__ 59 | %SPHINXBUILD% 2> nul 60 | if errorlevel 9009 ( 61 | echo. 62 | echo.The 'sphinx-build' command was not found. Make sure you have Sphinx 63 | echo.installed, then set the SPHINXBUILD environment variable to point 64 | echo.to the full path of the 'sphinx-build' executable. Alternatively you 65 | echo.may add the Sphinx directory to PATH. 66 | echo. 67 | echo.If you don't have Sphinx installed, grab it from 68 | echo.http://sphinx-doc.org/ 69 | exit /b 1 70 | ) 71 | 72 | :sphinx_ok 73 | 74 | 75 | if "%1" == "html" ( 76 | %SPHINXBUILD% -b html %ALLSPHINXOPTS% %BUILDDIR%/html 77 | if errorlevel 1 exit /b 1 78 | echo. 79 | echo.Build finished. The HTML pages are in %BUILDDIR%/html. 80 | goto end 81 | ) 82 | 83 | if "%1" == "dirhtml" ( 84 | %SPHINXBUILD% -b dirhtml %ALLSPHINXOPTS% %BUILDDIR%/dirhtml 85 | if errorlevel 1 exit /b 1 86 | echo. 87 | echo.Build finished. The HTML pages are in %BUILDDIR%/dirhtml. 88 | goto end 89 | ) 90 | 91 | if "%1" == "singlehtml" ( 92 | %SPHINXBUILD% -b singlehtml %ALLSPHINXOPTS% %BUILDDIR%/singlehtml 93 | if errorlevel 1 exit /b 1 94 | echo. 95 | echo.Build finished. The HTML pages are in %BUILDDIR%/singlehtml. 96 | goto end 97 | ) 98 | 99 | if "%1" == "pickle" ( 100 | %SPHINXBUILD% -b pickle %ALLSPHINXOPTS% %BUILDDIR%/pickle 101 | if errorlevel 1 exit /b 1 102 | echo. 103 | echo.Build finished; now you can process the pickle files. 104 | goto end 105 | ) 106 | 107 | if "%1" == "json" ( 108 | %SPHINXBUILD% -b json %ALLSPHINXOPTS% %BUILDDIR%/json 109 | if errorlevel 1 exit /b 1 110 | echo. 111 | echo.Build finished; now you can process the JSON files. 112 | goto end 113 | ) 114 | 115 | if "%1" == "htmlhelp" ( 116 | %SPHINXBUILD% -b htmlhelp %ALLSPHINXOPTS% %BUILDDIR%/htmlhelp 117 | if errorlevel 1 exit /b 1 118 | echo. 119 | echo.Build finished; now you can run HTML Help Workshop with the ^ 120 | .hhp project file in %BUILDDIR%/htmlhelp. 121 | goto end 122 | ) 123 | 124 | if "%1" == "qthelp" ( 125 | %SPHINXBUILD% -b qthelp %ALLSPHINXOPTS% %BUILDDIR%/qthelp 126 | if errorlevel 1 exit /b 1 127 | echo. 128 | echo.Build finished; now you can run "qcollectiongenerator" with the ^ 129 | .qhcp project file in %BUILDDIR%/qthelp, like this: 130 | echo.^> qcollectiongenerator %BUILDDIR%\qthelp\AILE.qhcp 131 | echo.To view the help file: 132 | echo.^> assistant -collectionFile %BUILDDIR%\qthelp\AILE.ghc 133 | goto end 134 | ) 135 | 136 | if "%1" == "devhelp" ( 137 | %SPHINXBUILD% -b devhelp %ALLSPHINXOPTS% %BUILDDIR%/devhelp 138 | if errorlevel 1 exit /b 1 139 | echo. 140 | echo.Build finished. 141 | goto end 142 | ) 143 | 144 | if "%1" == "epub" ( 145 | %SPHINXBUILD% -b epub %ALLSPHINXOPTS% %BUILDDIR%/epub 146 | if errorlevel 1 exit /b 1 147 | echo. 148 | echo.Build finished. The epub file is in %BUILDDIR%/epub. 149 | goto end 150 | ) 151 | 152 | if "%1" == "latex" ( 153 | %SPHINXBUILD% -b latex %ALLSPHINXOPTS% %BUILDDIR%/latex 154 | if errorlevel 1 exit /b 1 155 | echo. 156 | echo.Build finished; the LaTeX files are in %BUILDDIR%/latex. 
157 | goto end 158 | ) 159 | 160 | if "%1" == "latexpdf" ( 161 | %SPHINXBUILD% -b latex %ALLSPHINXOPTS% %BUILDDIR%/latex 162 | cd %BUILDDIR%/latex 163 | make all-pdf 164 | cd %~dp0 165 | echo. 166 | echo.Build finished; the PDF files are in %BUILDDIR%/latex. 167 | goto end 168 | ) 169 | 170 | if "%1" == "latexpdfja" ( 171 | %SPHINXBUILD% -b latex %ALLSPHINXOPTS% %BUILDDIR%/latex 172 | cd %BUILDDIR%/latex 173 | make all-pdf-ja 174 | cd %~dp0 175 | echo. 176 | echo.Build finished; the PDF files are in %BUILDDIR%/latex. 177 | goto end 178 | ) 179 | 180 | if "%1" == "text" ( 181 | %SPHINXBUILD% -b text %ALLSPHINXOPTS% %BUILDDIR%/text 182 | if errorlevel 1 exit /b 1 183 | echo. 184 | echo.Build finished. The text files are in %BUILDDIR%/text. 185 | goto end 186 | ) 187 | 188 | if "%1" == "man" ( 189 | %SPHINXBUILD% -b man %ALLSPHINXOPTS% %BUILDDIR%/man 190 | if errorlevel 1 exit /b 1 191 | echo. 192 | echo.Build finished. The manual pages are in %BUILDDIR%/man. 193 | goto end 194 | ) 195 | 196 | if "%1" == "texinfo" ( 197 | %SPHINXBUILD% -b texinfo %ALLSPHINXOPTS% %BUILDDIR%/texinfo 198 | if errorlevel 1 exit /b 1 199 | echo. 200 | echo.Build finished. The Texinfo files are in %BUILDDIR%/texinfo. 201 | goto end 202 | ) 203 | 204 | if "%1" == "gettext" ( 205 | %SPHINXBUILD% -b gettext %I18NSPHINXOPTS% %BUILDDIR%/locale 206 | if errorlevel 1 exit /b 1 207 | echo. 208 | echo.Build finished. The message catalogs are in %BUILDDIR%/locale. 209 | goto end 210 | ) 211 | 212 | if "%1" == "changes" ( 213 | %SPHINXBUILD% -b changes %ALLSPHINXOPTS% %BUILDDIR%/changes 214 | if errorlevel 1 exit /b 1 215 | echo. 216 | echo.The overview file is in %BUILDDIR%/changes. 217 | goto end 218 | ) 219 | 220 | if "%1" == "linkcheck" ( 221 | %SPHINXBUILD% -b linkcheck %ALLSPHINXOPTS% %BUILDDIR%/linkcheck 222 | if errorlevel 1 exit /b 1 223 | echo. 224 | echo.Link check complete; look for any errors in the above output ^ 225 | or in %BUILDDIR%/linkcheck/output.txt. 226 | goto end 227 | ) 228 | 229 | if "%1" == "doctest" ( 230 | %SPHINXBUILD% -b doctest %ALLSPHINXOPTS% %BUILDDIR%/doctest 231 | if errorlevel 1 exit /b 1 232 | echo. 233 | echo.Testing of doctests in the sources finished, look at the ^ 234 | results in %BUILDDIR%/doctest/output.txt. 235 | goto end 236 | ) 237 | 238 | if "%1" == "coverage" ( 239 | %SPHINXBUILD% -b coverage %ALLSPHINXOPTS% %BUILDDIR%/coverage 240 | if errorlevel 1 exit /b 1 241 | echo. 242 | echo.Testing of coverage in the sources finished, look at the ^ 243 | results in %BUILDDIR%/coverage/python.txt. 244 | goto end 245 | ) 246 | 247 | if "%1" == "xml" ( 248 | %SPHINXBUILD% -b xml %ALLSPHINXOPTS% %BUILDDIR%/xml 249 | if errorlevel 1 exit /b 1 250 | echo. 251 | echo.Build finished. The XML files are in %BUILDDIR%/xml. 252 | goto end 253 | ) 254 | 255 | if "%1" == "pseudoxml" ( 256 | %SPHINXBUILD% -b pseudoxml %ALLSPHINXOPTS% %BUILDDIR%/pseudoxml 257 | if errorlevel 1 exit /b 1 258 | echo. 259 | echo.Build finished. The pseudo-XML files are in %BUILDDIR%/pseudoxml. 260 | goto end 261 | ) 262 | 263 | :end 264 | -------------------------------------------------------------------------------- /doc/notes.rst: -------------------------------------------------------------------------------- 1 | Motif detection using a Hidden Markov Model 2 | =========================================== 3 | 4 | Introduction 5 | ------------ 6 | 7 | This notes started as an adaptation of `Rabiner's tutorial 8 | `_ to 9 | Profile HMM. 
10 | 
11 | 
12 | Problem description
13 | -------------------
14 | We have a sequence of length :math:`n` over some alphabet
15 | :math:`\pmb{A}`. This sequence is generated by a random process, so the
16 | sequence is a random variable :math:`\pmb{X}`:
17 | 
18 | .. math::
19 | 
20 |    \pmb{X} = X_1X_2...X_n
21 | 
22 | The range of each :math:`X_i` is the alphabet:
23 | 
24 | .. math::
25 | 
26 |    Val(X_i) = \pmb{A}
27 | 
28 | We will call each element of :math:`\pmb{A}` a symbol.
29 | The size of the alphabet is :math:`|\pmb{A}|`.
30 | 
31 | We assume that each :math:`X_i` can be part either of some repeating
32 | motif of length :math:`W` or of plain background noise. When
33 | considering repeating motifs we allow gaps and deletions in the
34 | motif. For example, consider the following sequences made of the
35 | letters A, B, C, D, E (spaces added to aid the reader):
36 | 
37 | - Repeating motif
38 | 
39 |   ``ABCDE ABCDE ABCDE ABCDE ABCDE``
40 | 
41 | - With some background noise between motif repetitions
42 | 
43 |   ``ABCDE AB ABCDE BB ABCDE C ABCDE CCCC ABCDE``
44 | 
45 | - With deletions
46 | 
47 |   ``ABDE AB ABCDE BB BCDE C ABCDE CCCC ABDE``
48 | 
49 | - With gaps
50 | 
51 |   ``ABAACDE ABCDE A ACCCBCDE C ABCDE ABCE``
52 | 
53 | 
54 | Model
55 | -----
56 | 
57 | To account for background noise, deletions and gaps we model the
58 | generation of the sequence with a `Profile Hidden Markov Model
59 | `_
60 | (PHMM). In a PHMM the hidden process generating the sequence changes
61 | between states according to the following state diagram, where each node
62 | represents a state and the process jumps randomly from one state to
63 | another following the arrows:
64 | 
65 | .. only:: html
66 | 
67 |    .. figure:: _static/PHMM_1.svg
68 |       :align: center
69 | 
70 | .. only:: latex
71 | 
72 |    .. figure:: _static/PHMM_1.pdf
73 |       :align: center
74 | 
75 | There are three types of states:
76 | 
77 | - Background states
78 | 
79 |   .. math::
80 | 
81 |      \pmb{S}^B=\left\{b_1,...,b_W\right\}
82 | 
83 | - Motif states
84 | 
85 |   .. math::
86 | 
87 |      \pmb{S}^M=\left\{m_1,...,m_W\right\}
88 | 
89 | 
90 | - Delete states
91 | 
92 |   .. math::
93 | 
94 |      \pmb{S}^D=\left\{d_1,...,d_W\right\}
95 | 
96 | 
97 | When the process jumps to a background state it emits a letter of the
98 | sequence according to the background distribution. We assume that all
99 | background states follow the same categorical distribution, where each
100 | symbol :math:`a \in \pmb{A}` is emitted with probability :math:`f^B_a`.
101 | 
102 | When the process jumps to a motif state it emits a letter of the
103 | sequence according to the motif distribution. We assume that the
104 | emission probability depends on which motif state we are in, so if
105 | we are in state :math:`m_j` each symbol :math:`a` is emitted with
106 | probability :math:`f^M_{ja}`.
107 | 
108 | When the process jumps to a delete state it emits nothing. These are
109 | silent states introduced only for the purpose of allowing jumps
110 | between motif states :math:`m_j` and :math:`m_k` for :math:`k>j+1`.
111 | 
112 | Notice that the state diagram has loops. This is because it models the motif,
113 | which is allowed to repeat.
114 | 
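The generative story above can be made concrete with a few lines of code. The following is a minimal simulation sketch, not code from this repository: the alphabet, the motif width and all probabilities are made-up illustrative values, and the per-position background states :math:`b_1,...,b_W` are collapsed into a single background between motif repetitions.

.. code-block:: python

    import random

    ALPHABET = 'ABCDE'
    W = 5  # motif width

    # Illustrative emissions: uniform background, motif state m_j
    # strongly prefers the j-th letter of the alphabet.
    f_B = dict((a, 1.0/len(ALPHABET)) for a in ALPHABET)
    f_M = [dict((a, 0.9 if i == j else 0.1/(len(ALPHABET) - 1))
                for i, a in enumerate(ALPHABET))
           for j in range(W)]

    def sample(dist):
        r, acc = random.random(), 0.0
        for symbol, p in dist.items():
            acc += p
            if r <= acc:
                return symbol
        return symbol  # guard against floating point round-off

    def generate(n, p_motif=0.5, p_delete=0.1):
        out = []
        while len(out) < n:
            if random.random() < p_motif:           # start a motif repetition
                for j in range(W):
                    if random.random() < p_delete:  # silent delete state: skip m_j
                        continue
                    out.append(sample(f_M[j]))      # motif emission
            else:
                out.append(sample(f_B))             # background emission
        return ''.join(out)

    print(generate(40))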
115 | The above diagram is incomplete if we don't specify the probabilities of
116 | state transitions. The following table gives them for direct transitions
117 | between states:
118 | 
119 | +------------+------------+---------------------------------------------------------------+
120 | |            |            |                            To state                           |
121 | |            |            +------------+----------------+----------------+----------------+
122 | |            |            | :math:`b_j`| :math:`b_{j+1}`| :math:`m_{j+1}`| :math:`d_{j+1}`|
123 | +------------+------------+------------+----------------+----------------+----------------+
124 | |            | :math:`b_j`| b(j)       | 0              | 1 - b(j)       | 0              |
125 | |            +------------+------------+----------------+----------------+----------------+
126 | | From state | :math:`m_j`| 0          | :math:`m_b`    | :math:`m`      | :math:`m_d`    |
127 | |            +------------+------------+----------------+----------------+----------------+
128 | |            | :math:`d_j`| 0          | 0              | 1 - d          | d              |
129 | +------------+------------+------------+----------------+----------------+----------------+
130 | 
131 | 
132 | To avoid introducing too many parameters we have assumed that the
133 | transition probabilities are stationary, the only exception being
134 | (:math:`b_1` is the background state between motif repetitions):
135 | 
136 | .. math::
137 | 
138 |    b(j) = b_{out}[j=1] + b_{in}[j\neq 1]
139 | 
139 | Now let's define the random variable :math:`Z_i` as the state of the
140 | process when emitting symbol :math:`X_i`. The range of :math:`Z_i` is:
141 | 
142 | .. math::
143 | 
144 |    \pmb{S} = Val(Z_i) = \pmb{S}^B \cup \pmb{S}^M
145 | 
146 | With these variables we have the familiar HMM. Notice that although
147 | similar to the graph of the PHMM, this one has a different meaning: the
148 | former was a state diagram while this one is a Bayesian network. Each
149 | node is now a random variable instead of a concrete state:
150 | 
151 | .. only:: html
152 | 
153 |    .. figure:: _static/HMM_1.svg
154 |       :align: center
155 |       :figwidth: 60 %
156 | 
157 | .. only:: latex
158 | 
159 |    .. figure:: _static/HMM_1.pdf
160 |       :align: center
161 |       :figwidth: 60 %
162 | 
163 | We can compute the transition probabilities between the emitting
164 | states :math:`Z_i`. For this we must consider every possible path that
165 | connects two states, compute the probability of each path and sum
166 | all the probabilities. To make this more concrete, consider the case of
167 | motif width :math:`W=6`, starting from state :math:`m_2`:
168 | 
169 | .. only:: html
170 | 
171 |    .. figure:: _static/PHMM_2.svg
172 |       :align: center
173 | 
174 | .. only:: latex
175 | 
176 |    .. figure:: _static/PHMM_2.pdf
177 |       :align: center
178 | 
179 | We can see the pattern in the above formulas. If we consider the path going
180 | from :math:`m_j` to :math:`m_k`, the number of delete states we must
181 | cross is:
182 | 
183 | .. math::
184 | 
185 |    L_{jk} = (k - j - 2) \mod W
186 | 
187 | Then the probability of the path is:
188 | 
189 | .. math::
190 | 
191 |    m_dd^{L_{jk}}(1-d) + [k=j+1]m
192 | 
193 | Where :math:`[k=j+1]` is the `Iverson bracket
194 | `_, which takes the
195 | value 1 only when the condition inside the brackets is true, and zero
196 | otherwise.
197 | 
198 | Notice that given the value of :math:`j` and :math:`L_{jk}` we can
199 | recover :math:`k` as:
200 | 
201 | .. math::
202 | 
203 |    k = 1 + (j + L_{jk} + 1) \mod W
204 | 
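As a quick sanity check (a throwaway snippet, not code from the repository), we can verify that the two index formulas are mutual inverses for every pair of motif states:

.. code-block:: python

    W = 6
    for j in range(1, W + 1):
        for k in range(1, W + 1):
            # number of delete states crossed going from m_j to m_k
            L_jk = (k - j - 2) % W
            assert k == 1 + (j + L_jk + 1) % W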
205 | To compute the final probability we need to take into account the
206 | cases where we make several loops over all the delete states. Each
207 | loop has probability :math:`d^W`, so :math:`l` loops have probability
208 | :math:`(d^W)^l`. The following figure shows the possible paths between
209 | two states depending on the number of loops:
210 | 
211 | .. only:: html
212 | 
213 |    .. figure:: _static/PHMM_3.svg
214 |       :align: center
215 | 
216 | .. only:: latex
217 | 
218 |    .. figure:: _static/PHMM_3.pdf
219 |       :align: center
220 | 
221 | The final probability is then:
222 | 
223 | .. math::
224 | 
225 |    P\left(Z_{i+1}=m_k|Z_i=m_j\right) = m_dd^{L_{jk}}\left(\sum_{l=0}^\infty(d^W)^l\right)(1-d) + [k=j+1]m
226 | 
227 | Summing the geometric series we get the final transition probabilities
228 | when starting from state :math:`m_j`:
229 | 
230 | .. math::
231 | 
232 |    P\left(Z_{i+1}=b_{j+1}|Z_i=m_j\right) &= m_b \\
233 |    P\left(Z_{i+1}=m_k|Z_i=m_j\right) &= m_dF(L_{jk}, d) + [k=j+1]m \\
234 |    F(l, d) &= d^l\frac{1 - d}{1 - d^W}
235 | 
236 | The behavior of :math:`F(l, d)` near the boundaries is:
237 | 
238 | .. math::
239 | 
240 |    \underset{d \to 0}{\lim} F &= [l=0] \\
241 |    \underset{d \to 1}{\lim} F &= 1/W
242 | 
243 | The following figure shows a plot for :math:`W=6`:
244 | 
245 | .. only:: html
246 | 
247 |    .. figure:: _static/F_jk.svg
248 |       :align: center
249 | 
250 | .. only:: latex
251 | 
252 |    .. figure:: _static/F_jk.pdf
253 |       :align: center
254 | 
255 | We can see in the above graph that lower values of :math:`d`
256 | concentrate the probabilities on the state two steps ahead, and
257 | higher values spread the probabilities more evenly over all the states.
258 | 
259 | .. only:: html
260 | 
261 |    .. figure:: _static/F_jk_bars.svg
262 |       :align: center
263 | 
264 | .. only:: latex
265 | 
266 |    .. figure:: _static/F_jk_bars.pdf
267 |       :align: center
268 | 
269 | The transitions starting from a background state are trivial:
270 | 
271 | .. math::
272 | 
273 |    P\left(Z_{i+1}=b_j|Z_i=b_j\right) &= b(j) \\
274 |    P\left(Z_{i+1}=m_j|Z_i=b_j\right) &= 1 - b(j)
275 | 
276 | The emission probabilities are:
277 | 
278 | .. math::
279 | 
280 |    P\left(X_i=a|Z_i=b_j\right) &= f^B_a \\
281 |    P\left(X_i=a|Z_i=m_j\right) &= f^M_{ja}
282 | 
283 | 
284 | Manual annotations
285 | ~~~~~~~~~~~~~~~~~~
286 | We extend the model with a random variable :math:`U_i` for each :math:`Z_i`:
287 | 
288 | .. only:: html
289 | 
290 |    .. figure:: _static/HMM_2.svg
291 |       :align: center
292 |       :figwidth: 60 %
293 | 
294 | .. only:: latex
295 | 
296 |    .. figure:: _static/HMM_2.pdf
297 |       :align: center
298 |       :figwidth: 60 %
299 | 
300 | These random variables are direct observations of the state variables, flagging
301 | whether or not a hidden state is part of a motif:
302 | 
303 | .. math::
304 | 
305 |    U_i = [Z_i \in \pmb{S}^M]
306 | 
307 | We can use this to aid in detecting motifs. For example, a user can manually
308 | annotate where motifs are by specifying the value of :math:`U_i`.
309 | 
310 | Parameter estimation
311 | --------------------
312 | 
313 | Expectation-Maximization (EM)
314 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
315 | 
316 | For convenience let's aggregate all the model parameters into a single
317 | vector:
318 | 
319 | .. math::
320 | 
321 |    \pmb{\theta} &= \left(\pmb{t}, \pmb{f}^B, \pmb{f}^M\right) \\
322 |    \pmb{t} &= \left(\pmb{b}, \pmb{t}^M\right) \\
323 |    \pmb{b} &= (b_{in}, b_{out}) \\
324 |    \pmb{t}^M &= \left(d, m_d, m, m_b\right)
325 | 
326 | Using `EM
327 | `_,
328 | each step takes the following form, where :math:`\pmb{\theta}^0`
329 | and :math:`\pmb{\theta}^1` are the current and next estimates of the
330 | parameters respectively:
331 | 
332 | .. 
math:: 333 | 334 | \pmb{\theta}^1 &= \underset{\pmb{\theta}}{\arg\max} 335 | Q(\pmb{\theta}|\pmb{\theta}^0) \\ 336 | Q(\pmb{\theta}|\pmb{\theta}^0) &= \log P(\pmb{\theta}) + 337 | E_{\pmb{Z}|\pmb{x}^D,\pmb{\theta}^0}\left[\log P(\pmb{x}^D,\pmb{Z}|\pmb{\theta})\right] 338 | 339 | :math:`\pmb{x}^D` is the training data, a particular realization of :math:`\pmb{X}`. 340 | 341 | :math:`P(\pmb{\theta})` are the prior probabilities on the 342 | parameters. We are going to consider priors only on :math:`\pmb{f}^M`. 343 | 344 | EM on a HMM 345 | ~~~~~~~~~~~ 346 | 347 | Taking into account that in a HMM the joint probability distribution 348 | factors as: 349 | 350 | .. math:: 351 | 352 | P(\pmb{X}, \pmb{Z}|\pmb{\theta}) = \prod_{i=1}^nP(Z_i|Z_{i-1}, \pmb{\theta})P(X_i|Z_i, 353 | \pmb{\theta}) 354 | 355 | 356 | We expand the expectation in the EM step as: 357 | 358 | .. math:: 359 | 360 | E_{\pmb{Z}|\pmb{x}^D,\pmb{\theta}^0}\left[\log P(\pmb{x}^D,\pmb{Z}|\pmb{\theta})\right] = 361 | \sum_{\pmb{z}}P(\pmb{z}|\pmb{x}^D,\pmb{\theta}^0)\sum_{i=1}^n 362 | \left[\log P(x^D_i|z_i, \pmb{\theta}) + \log P(z_i|z_{i-1}, \pmb{\theta})\right] 363 | 364 | Interchanging the summations: 365 | 366 | .. math:: 367 | 368 | E_{\pmb{Z}|\pmb{x}^D,\pmb{\theta}^0}\left[\log P(\pmb{x}^D,\pmb{Z}|\pmb{\theta})\right] = 369 | \sum_{i=1}^n\sum_{\pmb{z}}P(\pmb{z}|X,\pmb{\theta}^0) 370 | \left[\log P(x^D_i|z_i, \pmb{\theta}) + \log P(z_i|z_{i-1}, \pmb{\theta})\right] 371 | 372 | Defining the set :math:`\pmb{C}_i=\left\{Z_{i-1}, Z_i\right\}` we can always do: 373 | 374 | .. math:: 375 | 376 | P(\pmb{Z}|\pmb{x}^D,\pmb{\theta}^0) = P(\pmb{Z} - \pmb{C}_i|\pmb{C}_i, \pmb{x}^D, \pmb{\theta}^0)P(\pmb{C}_i|\pmb{x}^D,\pmb{\theta}^0) 377 | 378 | Now the summation over :math:`\pmb{Z}` can be decomposed as: 379 | 380 | .. math:: 381 | 382 | E_{\pmb{Z}|\pmb{x}^D,\pmb{\theta}^0}\left[\log P(\pmb{x}^D,\pmb{Z}|\pmb{\theta})\right] = 383 | \sum_{i=1}^n\left(\sum_{\pmb{z} - \pmb{c}_i}P(\pmb{z} - \pmb{c}_i|\pmb{x}^D,\pmb{\theta}^0)\right) 384 | \sum_{\pmb{c}_i}P(\pmb{c_i}|\pmb{x}^D, \pmb{\theta}^0)\left[\log P(x^D_i|z_i, \pmb{\theta}) + \log P(z_i|z_{i-1}, \pmb{\theta})\right] 385 | 386 | Of course: 387 | 388 | .. math:: 389 | 390 | \sum_{\pmb{z} - \pmb{c}_i}P(\pmb{z} - 391 | \pmb{c}_i|\pmb{x}^D,\pmb{\theta}^0) = 1 392 | 393 | And so we finally get: 394 | 395 | .. math:: 396 | 397 | E_{\pmb{Z}|\pmb{x}^D,\pmb{\theta}^0}\left[\log P(\pmb{x}^D,\pmb{Z}|\pmb{\theta})\right] = 398 | \sum_{i=1}^n 399 | \sum_{z_{i-1},z_i}P(z_{i-1},z_i|\pmb{x}^D, \pmb{\theta}^0)\left[\log P(x^D_i|z_i, \pmb{\theta}) + \log P(z_i|z_{i-1}, \pmb{\theta})\right] 400 | 401 | Computing :math:`P(Z_{i-1}, Z_i|\pmb{X},\pmb{\theta})` 402 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 403 | 404 | To unclutter a little let's call: 405 | 406 | .. math:: 407 | 408 | \xi_i(s_1, s_2) &= P(Z_{i-1}=s_1, Z_i=s_2|\pmb{x}^D, \pmb{\theta}^0) \\ 409 | \gamma_i(s) &= P(Z_i=s|\pmb{x}^D, \pmb{\theta}^0) 410 | 411 | Notice that: 412 | 413 | .. math:: 414 | 415 | \gamma_i(s) = \sum_{s_1} \xi_i(s_1, s) 416 | 417 | And so we just need to compute :math:`\xi_i`. To compute it we first 418 | apply Bayes theorem: 419 | 420 | .. math:: 421 | 422 | P(Z_{i-1}, Z_i|\pmb{X},\pmb{\theta}^0) = 423 | \frac{P(\pmb{X}|Z_{i-1},Z_i,\pmb{\theta}^0)P(Z_{i-1}, Z_i|\pmb{\theta}^0)}{P(\pmb{X}|\pmb{\theta}^0)} 424 | 425 | Now, thanks to the Markov structure of the probabilities we can factor 426 | things: 427 | 428 | .. 
math:: 429 | 430 | P(\pmb{X}|Z_{i-1},Z_i,\pmb{\theta}^0) &= 431 | P(X_1...X_{i-1}|Z_{i-1},\pmb{\theta}^0)P(X_i...X_n|Z_i,\pmb{\theta}^0) 432 | \\ 433 | P(Z_{i-1},Z_i|\pmb{\theta}^0) &= P(Z_{i-1}|\pmb{\theta}^0)P(Z_i|Z_{i-1},\pmb{\theta}^0) 434 | 435 | Renaming things a bit again: 436 | 437 | .. math:: 438 | 439 | \alpha_i(s) &= P(X_1...X_i,Z_i=s|\pmb{\theta}^0) \\ 440 | \beta_i(s) &= P(X_i...X_n|Z_i=s,\pmb{\theta}^0) 441 | 442 | We get that: 443 | 444 | .. math:: 445 | 446 | \tilde{\xi}_i(s_1, s_2) &= 447 | \alpha_{i-1}(s_1)\beta_i(s_2)P(Z_i=s_2|Z_{i-1}=s_1,\pmb{\theta}^0) 448 | \\ 449 | \xi_i(s_1, s_2) &= \frac{\tilde{\xi}_i(s_1, s_2)}{\sum_{s_1,s_2}\tilde{\xi}_i(s_1, s_2)} 450 | 451 | We compute the first values of :math:`\alpha` to see the pattern: 452 | 453 | .. only:: html 454 | 455 | .. figure:: _static/forward.svg 456 | :align: center 457 | :figwidth: 60 % 458 | 459 | .. only:: latex 460 | 461 | .. figure:: _static/forward.pdf 462 | :align: center 463 | :figwidth: 60 % 464 | 465 | In general: 466 | 467 | .. math:: 468 | 469 | \alpha_1(s) &= P(Z_1=s)P(X_1|Z_1=s) \\ 470 | \alpha_i(s_2) &= 471 | \left[\sum_{s_1}\alpha_{i-1}(s_1)P(Z_i=s_2|Z_{i-1}=s_1)\right]P(X_i|Z_i=s_2) 472 | 473 | 474 | Following the same process but starting from the end of the HMM we 475 | get: 476 | 477 | .. math:: 478 | 479 | \beta_n(s) &= P(X_n|Z_n=s) \\ 480 | \beta_{i-1}(s_1) &= 481 | \left[\sum_{s_2}\beta_i(s_2)P(Z_i=s_2|Z_{i-1}=s_1)\right]P(X_{i-1}|Z_{i-1}=s_1) 482 | 483 | 484 | We can take into account in this step any information the user has provided about the 485 | states. If we know that: 486 | 487 | .. math:: 488 | U_i = 0 &\implies \underset{s \in \pmb{S}^M}{\alpha (s)} = 0 \\ 489 | U_i = 1 &\implies \underset{s \in \pmb{S}^B}{\alpha (s)} = 0 490 | 491 | And of course renormalize: 492 | 493 | .. math:: 494 | \alpha(s) = \frac{\alpha(s)}{\sum_s \alpha(s)} 495 | 496 | And also we do the same with :math:`\beta`. 497 | 498 | Re-estimation equations 499 | ~~~~~~~~~~~~~~~~~~~~~~~ 500 | 501 | We now separate the expectation in two independent parts: 502 | 503 | .. math:: 504 | 505 | E_{\pmb{Z}|\pmb{x}^D,\pmb{\theta}^0}\left[\log P(\pmb{x}^D,\pmb{Z}|\pmb{\theta})\right] &= 506 | E^1(\pmb{f}^M, \pmb{f}^B) + E^2(\pmb{t}) \\ 507 | E^1(\pmb{f}^M, \pmb{f}^B) &= \sum_{i=1}^n \sum_{s \in \pmb{S}}\gamma_i(s)\log P(x^D_i|s, \pmb{f}^M, \pmb{f}^B) \\ 508 | E^2(\pmb{t}) &= \sum_{i=1}^n \sum_{s_1,s_2 \in \pmb{S}}\xi_i(s_1,s_2)\log P(s_2|s_1, \pmb{t}) 509 | 510 | We can continue splitting into independent parts: 511 | 512 | .. math:: 513 | 514 | E^1(\pmb{f}^M, \pmb{f}^B) &= E^{1B}(\pmb{f}^B) + E^{1M}(\pmb{f}^M) \\ 515 | E^{1B}(\pmb{f}^B) &= \sum_{i=1}^n \sum_{s \in \pmb{S}^B}\gamma_i(s)\log P(x^D_i|s, \pmb{f}^B) \\ 516 | E^{1M}(\pmb{f}^M) &= \sum_{i=1}^n \sum_{s \in \pmb{S}^M}\gamma_i(s)\log P(x^D_i|s, \pmb{f}^M) 517 | 518 | And also: 519 | 520 | .. math:: 521 | 522 | E^2(\pmb{t}) &= E^{2B}(b) + E^{2M}(\pmb{t}^M) \\ 523 | E^{2B}(\pmb{b}) &= \sum_{i=1}^n\sum_{s_1 \in \pmb{S}^B}\sum_{s_2 \in \pmb{S}}\xi_i(s_1,s_2)\log P(s_2|s_1, \pmb{b}) \\ 524 | E^{2M}(\pmb{t}^M) &= \sum_{i=1}^n\sum_{s_1 \in \pmb{S}^M}\sum_{s_2 \in \pmb{S}}\xi_i(s_1,s_2)\log P(s_2|s_1, \pmb{t}^M) 525 | 526 | 527 | We have now 4 independent maximization problems: 528 | 529 | .. 
math::
530 | 
531 |    \left(\pmb{f}^B\right)^1 &= \underset{\pmb{f}^B}{\arg \max}E^{1B}\\
532 |    \left(\pmb{f}^M\right)^1 &= \underset{\pmb{f}^M}{\arg \max}\left\{\log P(\pmb{f}^M) + E^{1M}\right\} \\
533 |    \pmb{b}^1 &= \underset{\pmb{b}}{\arg \max} E^{2B} \\
534 |    \left(\pmb{t}^M\right)^1 &= \underset{\pmb{t}^M}{\arg \max} E^{2M}
535 | 
536 | 
537 | Estimation of :math:`\pmb{f}^B`
538 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
539 | 
540 | We need to solve:
541 | 
542 | .. math::
543 | 
544 |    \left(\pmb{f}^B\right)^1 = \underset{\pmb{f}^B}{\arg \max}E^{1B}
545 | 
546 | With the constraint:
547 | 
548 | .. math::
549 | 
550 |    g_B = \sum_{a \in \pmb{A}} f^B_a - 1 = 0
551 | 
552 | We enforce the constraint using `Lagrange multipliers
553 | `_ and take derivatives:
554 | 
555 | .. math::
556 | 
557 |    \frac{\partial}{\partial \pmb{f}^B,\lambda_B}\left\{E^{1B} - \lambda_Bg_B\right\} = 0
558 | 
559 | From this we get the closed form solution:
560 | 
561 | .. math::
562 | 
563 |    \tilde{f}^B_a &= \sum_{i=1}^n[x_i^D=a]\sum_{j=1}^W \gamma_i(b_j) \\
564 |    f^B_a &= \frac{\tilde{f}^B_a}{\sum_{a \in \pmb{A}}\tilde{f}^B_a}
565 | 
566 | Estimation of :math:`\pmb{f}^M`
567 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
568 | 
569 | We need to solve:
570 | 
571 | .. math::
572 | 
573 |    \left(\pmb{f}^M\right)^1 = \underset{\pmb{f}^M}{\arg \max}\left\{\log P(\pmb{f}^M) + E^{1M}\right\}
574 | 
575 | With the constraint:
576 | 
577 | .. math::
578 | 
579 |    g_j = \sum_{a \in \pmb{A}} f^M_{ja} - 1 = 0
580 | 
581 | Again we enforce the constraints with Lagrange multipliers and take
582 | derivatives:
583 | 
584 | .. math::
585 | 
586 |    \frac{\partial}{\partial \pmb{f}^M,\lambda_j}\left\{\log P(\pmb{f}^M) + E^{1M} - \lambda_jg_j\right\} = 0
587 | 
588 | We use a `Dirichlet distribution
589 | `_ over the
590 | symbols of each :math:`\pmb{f}^M_j` as the prior. Its log-probability is
591 | then:
592 | 
593 | .. math::
594 | 
595 |    \log P(\pmb{f}^M_j|\pmb{\varepsilon}) = -\log B(\pmb{\varepsilon}) + \sum_{a \in \pmb{A}}(\varepsilon_a - 1)\log
596 |    f^M_{ja}
597 | 
598 | And the derivative:
599 | 
600 | .. math::
601 | 
602 |    \frac{\partial}{\partial f^M_{ja}}\log P(\pmb{f}^M_j|\pmb{\varepsilon}) =
603 |    \frac{\varepsilon_a - 1}{f^M_{ja}}
604 | 
605 | From this we get the closed form solution:
606 | 
607 | .. math::
608 | 
609 |    \tilde{f}^M_{ja} &= \sum_{i=1}^n[x_i^D=a]\gamma_i(m_j) \\
610 |    f^M_{ja} &=
611 |    \frac{ \varepsilon_a - 1 + \tilde{f}^M_{ja}}{\varepsilon_0 - |\pmb{A}| +
612 |    \sum_{a \in \pmb{A}}\tilde{f}^M_{ja}}
613 | 
614 | Where :math:`\varepsilon_0` is called the concentration parameter and is:
615 | 
616 | .. math::
617 | 
618 |    \varepsilon_0 = \sum_{a \in \pmb{A}}\varepsilon_a
619 | 
620 | 
621 | Estimation of :math:`b`
622 | ~~~~~~~~~~~~~~~~~~~~~~~
623 | 
624 | We need to solve:
625 | 
626 | .. math::
627 | 
628 |    \pmb{b}^1 = \underset{\pmb{b}}{\arg \max} E^{2B}
629 | 
630 | Taking derivatives we get:
631 | 
632 | .. math::
633 | 
634 |    b_{in}^1 &= \frac{\sum_{i=1}^n\sum_{j=2}^W \xi_i(b_j, b_j)}{
635 |    \sum_{i=1}^n\sum_{j=2}^W \xi_i(b_j, b_j) +
636 |    \sum_{i=1}^n\sum_{j=2}^W \xi_i(b_j, m_j)} \\
637 |    b_{out}^1 &= \frac{\sum_{i=1}^n\xi_i(b_1, b_1)}{
638 |    \sum_{i=1}^n \xi_i(b_1, b_1) +
639 |    \sum_{i=1}^n \xi_i(b_1, m_1)}
640 | 
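Before moving on to :math:`\pmb{t}^M`, note that the closed form updates for :math:`\pmb{f}^B` and :math:`\pmb{f}^M` translate directly into a few lines of numpy. The sketch below is illustrative and not code from this repository; in particular the array shapes (a one-hot ``x`` and the two posterior matrices) are assumptions made for the example.

.. code-block:: python

    import numpy as np

    def reestimate_emissions(x, gamma_b, gamma_m, eps):
        """One M-step update of the emission parameters.

        x       -- (n, A) one-hot encoding of the observed sequence
        gamma_b -- (n, W) posteriors gamma_i(b_j) from the E-step
        gamma_m -- (n, W) posteriors gamma_i(m_j) from the E-step
        eps     -- (A,) Dirichlet pseudo-counts for the motif prior
        """
        # Background: pool the posterior mass of all background states,
        # then normalize over the alphabet.
        f_B = x.T.dot(gamma_b.sum(axis=1))    # shape (A,)
        f_B /= f_B.sum()
        # Motif: one categorical per motif state; adding (eps - 1) and
        # renormalizing each row reproduces the MAP formula above.
        f_M = gamma_m.T.dot(x) + (eps - 1.0)  # shape (W, A)
        f_M /= f_M.sum(axis=1, keepdims=True)
        return f_B, f_M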
641 | Estimation of :math:`\pmb{t}^M`
642 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
643 | 
644 | We need to solve:
645 | 
646 | .. math::
647 | 
648 |    \left(\pmb{t}^M\right)^1 = \underset{\pmb{t}^M}{\arg \max} E^{2M}
649 | 
650 | Subject to the constraint:
651 | 
652 | .. math::
653 | 
654 |    g_m = m_b + m + m_d - 1 = 0
655 | 
656 | Trying to solve the above problem with Lagrange multipliers yields
657 | a set of highly non-linear equations in :math:`d`, and since
658 | the rest of the parameters are coupled to :math:`d`, either directly or
659 | through the constraint, it is not possible to find a closed form
660 | solution. Because of this we drop the Lagrange multipliers and
661 | solve the maximization problem numerically.
662 | 
663 | Let's recap the shape of our problem:
664 | 
665 | .. math::
666 | 
667 |    E^{2M}(\pmb{t}^M) = \sum_{i=1}^n\sum_{s_1 \in \pmb{S}^M}\sum_{s_2
668 |    \in \pmb{S}}\xi_i(s_1,s_2)\log P(s_2|s_1, \pmb{t}^M)
669 | 
670 | .. math::
671 | 
672 |    P\left(Z_i=b_{j+1}|Z_{i-1}=m_j\right) &= m_b \\
673 |    P\left(Z_i=m_k|Z_{i-1}=m_j\right) &= m_dF(L_{jk}, d) + [k=j+1]m \\
674 |    F(l, d) &= d^l\frac{1 - d}{1 - d^W} \\
675 |    L_{jk} &= (k - j - 2) \mod W \\
676 |    K_{jl} &= 1 + \left[(j + l + 1) \mod W\right]
677 | 
678 | We rewrite the summation as:
679 | 
680 | .. math::
681 | 
682 |    E^{2M}(\pmb{t}^M) &= \sum_{i=1}^n\sum_{j=1}^W
683 |    \left\{
684 |    \xi_i(m_j, b_{j+1})\log P(b_{j+1}|m_j, \pmb{t}^M) +
685 |    \sum_{l=0}^{W-1}\xi_i(m_j,m_{K_{jl}})\log P(m_{K_{jl}}|m_j,
686 |    \pmb{t}^M)
687 |    \right\} \\
688 |    &= \sum_{i=1}^n\sum_{j=1}^W
689 |    \left\{
690 |    \xi_i(m_j, b_{j+1})\log m_b +
691 |    \sum_{l=0}^{W-1}\xi_i(m_j,m_{K_{jl}})
692 |    \log \left[m_dF(l, d) + [l=W-1]m \right]
693 |    \right\} \\
694 |    &= \left(\sum_{i=1}^n\sum_{j=1}^W \xi_i(m_j, b_{j+1})\right)\log m_b + \\
695 |    & \quad
696 |    \sum_{l=0}^{W-1}\left(\sum_{i=1}^n\sum_{j=1}^W\xi_i(m_j,
697 |    m_{K_{jl}})\right)\log \left[m_d F(l,d) +
698 |    [l=W-1]m\right]
699 | 
700 | Since we solve the problem numerically we may only reach a local
701 | maximum, but that is acceptable as long as we improve on the starting
702 | point.
703 | 
704 | 
705 | Time Complexity
706 | ---------------
707 | 
708 | The time complexity of the parameter estimation problem is dominated
709 | by the estimation of :math:`\alpha`, :math:`\beta` and :math:`\xi`.
710 | We have a triple loop of this form:
711 | 
712 | .. code-block:: python
713 | 
714 |     for i in range(n):
715 |         for s in range(2*W):
716 |             for t in range(2*W):
717 |                 xi[s, t, i] = alpha[s, i - 1]*beta[t, i]*pT[s, t]
718 | 
719 | 
720 | In the above code the math symbols have been rewritten as numpy arrays:
721 | 
722 | +----------------------------+--------------------+
723 | | Math                       | Python             |
724 | +============================+====================+
725 | | :math:`\xi_i(s, t)`        | ``xi[s, t, i]``    |
726 | +----------------------------+--------------------+
727 | | :math:`\alpha_i(s)`        | ``alpha[s, i]``    |
728 | +----------------------------+--------------------+
729 | | :math:`\beta_i(s)`         | ``beta[s, i]``     |
730 | +----------------------------+--------------------+
731 | | :math:`P(Z_{i+1}=t|Z_i=s)` | ``pT[s, t]``       |
732 | +----------------------------+--------------------+
733 | 
734 | 
735 | Time complexity is therefore :math:`O(nW^2)`. We can speed up the
736 | above triple loop by noticing that a lot of entries in the ``pT``
737 | matrix are zeros:
738 | 
739 | .. only:: html
740 | 
741 |    .. figure:: _static/transition_matrix_1.svg
742 |       :align: center
743 |       :figwidth: 60 %
744 | 
745 | .. only:: latex
746 | 
747 |    .. figure:: _static/transition_matrix_1.pdf
748 |       :align: center
749 |       :figwidth: 60 %
750 | 
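The figure shows that each background row has only two non-zero entries, while each motif row has :math:`W+1`. Here is a sketch of the same update restricted to those entries; it is illustrative, not the repository's implementation, and it assumes a particular state layout (indices ``0..W-1`` for :math:`b_1,...,b_W` and ``W..2W-1`` for :math:`m_1,...,m_W`):

.. code-block:: python

    for i in range(n):
        for j in range(W):
            b, m = j, W + j              # b_{j+1} and m_{j+1} in 0-based indexing
            # From a background state only two transitions are non-zero:
            xi[b, b, i] = alpha[b, i - 1]*beta[b, i]*pT[b, b]  # stay in background
            xi[b, m, i] = alpha[b, i - 1]*beta[m, i]*pT[b, m]  # enter the motif
            # From a motif state: one background target plus W motif targets.
            b_next = (j + 1) % W         # next background state, wrapping around
            xi[m, b_next, i] = alpha[m, i - 1]*beta[b_next, i]*pT[m, b_next]
            for k in range(W):           # motif row is full because of deletions
                xi[m, W + k, i] = alpha[m, i - 1]*beta[W + k, i]*pT[m, W + k]

All other entries of ``xi`` stay at zero, which gives the :math:`W(W+3)` updates per position counted next.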
751 | Taking only the non-zero entries into account we could move complexity
752 | from :math:`4nW^2` to :math:`nW(W+3)`, which would divide time
753 | almost by a factor of 4.
754 | 
755 | The reason the lower right part of the matrix, that is, the
756 | transition probabilities between motif states, is full is that the
757 | possibility of deleting motif states means that we can jump
758 | from any motif state to any other motif state. If on the other hand we
759 | didn't allow deletions, the state transition diagram would look like this:
760 | 
761 | .. only:: html
762 | 
763 |    .. figure:: _static/PHMM_4.svg
764 |       :align: center
765 | 
766 | .. only:: latex
767 | 
768 |    .. figure:: _static/PHMM_4.pdf
769 |       :align: center
770 | 
771 | Transition probabilities from the background states would remain
772 | unchanged and transition probabilities from the motif states would simply be:
773 | 
774 | .. math::
775 | 
776 |    P\left(Z_{i+1}=b_{j+1}|Z_i=m_j\right) &= m_b \\
777 |    P\left(Z_{i+1}=m_{j+1}|Z_i=m_j\right) &= m
778 | 
779 | Now the matrix of transition probabilities has the following non-zero
780 | entries:
781 | 
782 | .. only:: html
783 | 
784 |    .. figure:: _static/transition_matrix_2.svg
785 |       :align: center
786 |       :figwidth: 60 %
787 | 
788 | .. only:: latex
789 | 
790 |    .. figure:: _static/transition_matrix_2.pdf
791 |       :align: center
792 |       :figwidth: 60 %
793 | 
794 | And time complexity is just :math:`4nW`. This simplified model is
795 | actually a very good approximation of the more complex one. The
796 | following plot shows how the values of the full model are distributed
797 | inside the transition matrix for typical parameters:
798 | 
799 | 
800 | .. only:: html
801 | 
802 |    .. figure:: _static/transition_matrix_3.svg
803 |       :align: center
804 |       :figwidth: 60 %
805 | 
806 | .. only:: latex
807 | 
808 |    .. figure:: _static/transition_matrix_3.pdf
809 |       :align: center
810 |       :figwidth: 60 %
811 | 
812 | To see how fast probabilities decay away from the diagonal, consider
813 | the following plot showing the values around state 60 (motif state
814 | 20):
815 | 
816 | .. only:: html
817 | 
818 |    .. figure:: _static/transition_matrix_4.svg
819 |       :align: center
820 | 
821 | .. only:: latex
822 | 
823 |    .. figure:: _static/transition_matrix_4.pdf
824 |       :align: center
825 | 
826 | We can make a compromise between the full model and the fully diagonal
827 | one by retaining only the elements up to :math:`w` positions away from
828 | the diagonal. The transition matrix would look like:
829 | 
830 | .. only:: html
831 | 
832 |    .. figure:: _static/transition_matrix_5.svg
833 |       :align: center
834 |       :figwidth: 60 %
835 | 
836 | .. only:: latex
837 | 
838 |    .. figure:: _static/transition_matrix_5.pdf
839 |       :align: center
840 |       :figwidth: 60 %
841 | 
842 | Time complexity would then be :math:`nW(3 + w)`.
843 | Let's call :math:`\tilde{P}(\pmb{X}, \pmb{Z})` the probability
844 | distribution of the sequence under this new approximate transition
845 | matrix. We can get a modified EM algorithm:
846 | 
847 | .. math::
848 | 
849 |    \log \left[\sum_{\pmb{Z}} P(\pmb{X}, \pmb{Z})\right] &=
850 |    \log \left[\sum_{\pmb{Z}} \tilde{P}(\pmb{Z}|\pmb{X})
851 |    \frac{P(\pmb{X}, \pmb{Z})}{\tilde{P}(\pmb{Z}|\pmb{X})}\right] \\
852 |    & \geq \sum_{\pmb{Z}}\tilde{P}(\pmb{Z}|\pmb{X})\log\frac{P(\pmb{X}, \pmb{Z})}{\tilde{P}(\pmb{Z}|\pmb{X})}
853 | 
854 | The EM algorithm proceeds as usual but with a modified function to
855 | maximize:
856 | 
857 | .. 
math:: 858 | 859 | Q(\pmb{\theta}|\pmb{\theta}^0) = \log P(\pmb{\theta}) + 860 | E_{\tilde{P}\left(\pmb{Z}|\pmb{x}^D,\pmb{\theta}^0\right)}\left[\log P(\pmb{x}^D,\pmb{Z}|\pmb{\theta})\right] 861 | 862 | The only modification in the forward-backward algorithm is that there 863 | are more zeros now on the transition matrix. We get halfway to a 864 | variational approach like the one in `Factorial hidden Markov models `_. 865 | -------------------------------------------------------------------------------- /doc/requirements.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/scrapinghub/aile/251a98a762db470a7a191311193513350218dbba/doc/requirements.txt -------------------------------------------------------------------------------- /misc/demo1_img.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/scrapinghub/aile/251a98a762db470a7a191311193513350218dbba/misc/demo1_img.png -------------------------------------------------------------------------------- /misc/demo2_img.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/scrapinghub/aile/251a98a762db470a7a191311193513350218dbba/misc/demo2_img.png -------------------------------------------------------------------------------- /misc/visual.py: -------------------------------------------------------------------------------- 1 | import selenium 2 | import selenium.webdriver 3 | 4 | import matplotlib.pyplot as plt 5 | import matplotlib.patches as patches 6 | 7 | 8 | def equal_delta(a, b, delta): 9 | return (a - delta <= b) and (b <= a + delta) 10 | 11 | 12 | class BBox(object): 13 | def __init__(self): 14 | self._empty = True 15 | 16 | self.x1 = None 17 | self.y1 = None 18 | self.x2 = None 19 | self.y2 = None 20 | 21 | def wrap(self, element): 22 | if self._empty: 23 | self.x1 = element.x 24 | self.y1 = element.y 25 | self.x2 = element.x + element.width 26 | self.y2 = element.y + element.height 27 | else: 28 | self.x1 = min(self.x1, element.x) 29 | self.x2 = max(self.x2, element.x + element.width) 30 | self.y1 = min(self.y1, element.y) 31 | self.y2 = max(self.y2, element.y + element.height) 32 | 33 | self._empty = False 34 | 35 | def contains(self, other): 36 | if self._empty or other._empty: 37 | return False 38 | 39 | return (self.x1 <= other.x1 and 40 | self.x2 >= other.x2 and 41 | self.y1 <= other.y1 and 42 | self.y2 >= other.y2) 43 | 44 | def halign(self, other, margin=5): 45 | return (equal_delta(self.y1, other.y1, margin) and 46 | equal_delta(self.y2, other.y2, margin)) 47 | 48 | def valign(self, other, margin=5): 49 | return (equal_delta(self.x1, other.x1, margin) and 50 | equal_delta(self.x2, other.x2, margin)) 51 | 52 | 53 | class DOM(object): 54 | class Element(object): 55 | def __init__(self, parent=None, children=None): 56 | self.parent = parent 57 | self.children = children or [] 58 | 59 | def __init__(self, browser, flat=False): 60 | def make_element(node, parent): 61 | element = DOM.Element(parent=parent) 62 | for k, v in node.rect.iteritems(): 63 | setattr(element, k, v) 64 | element.x = max(0, element.x) 65 | element.y = max(0, element.y) 66 | return element 67 | 68 | root_node = browser.find_elements_by_xpath('*')[0] 69 | if flat: 70 | self.root = make_element(root_node, None) 71 | for child in root_node.find_elements_by_xpath('//*'): 72 | self.root.children.append(make_element(child, self.root)) 73 | else: 74 | def fill(node, 
parent=None): 75 | element = make_element(node, parent) 76 | for child in node.find_elements_by_xpath('*'): 77 | element.children.append(fill(child, parent=element)) 78 | return element 79 | 80 | self.root = fill(root_node) 81 | 82 | def draw(self, ax=None): 83 | if ax is None: 84 | fig = plt.figure() 85 | ax = fig.add_subplot(111, aspect='equal') 86 | ax.invert_yaxis() 87 | 88 | def _draw(element, bbox): 89 | ax.add_patch( 90 | patches.Rectangle( 91 | (element.x, element.y), 92 | element.width, 93 | element.height, 94 | fill=False 95 | ) 96 | ) 97 | 98 | bbox.wrap(element) 99 | for child in element.children: 100 | _draw(child, bbox) 101 | 102 | bbox = BBox() 103 | _draw(self.root, bbox) 104 | 105 | ax.set_xlim(bbox.x1, bbox.x2) 106 | ax.set_ylim(bbox.y1, bbox.y2) 107 | 108 | 109 | def get_dom(url): 110 | browser = selenium.webdriver.Firefox() 111 | browser.get(url) 112 | dom = DOM(browser, flat=True) 113 | browser.close() 114 | return dom 115 | 116 | 117 | if __name__ == '__main__': 118 | dom = get_dom('http://edition.cnn.com/') 119 | dom.draw() 120 | plt.show() 121 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | scrapely>=0.12.0 2 | numpy>=1.9.2 3 | scipy>=0.15.1 4 | scikit-learn>=0.17 5 | cython>=0.23.3 6 | networkx>=1.10 7 | pulp>=1.6.0 8 | -e git+https://github.com/scrapinghub/portia.git@nui-develop#egg=slyd&subdirectory=slyd 9 | -e git+https://github.com/scrapinghub/portia.git@nui-develop#egg=slybot&subdirectory=slybot 10 | # For demo2 11 | matplotlib>=1.4.3 12 | ete2>=2.3.10 13 | -------------------------------------------------------------------------------- /scripts/gen-slybot-project: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | import sys 4 | 5 | import aile 6 | 7 | usage = """ 8 | {0} url 9 | 10 | Will generate a directory 'slybot-project' with the necessary files. 11 | Execute after this: 12 | 13 | slybot crawl aile 14 | 15 | Edit the project files to add other urls to be crawled and rename fields, etc... 
16 | """
17 | 
18 | if __name__ == '__main__':
19 |     if len(sys.argv) != 2:
20 |         sys.exit(usage.format(sys.argv[0]))
21 |     url = sys.argv[1]
22 | 
23 |     aile.generate_slybot_project(url, verbose=True)
24 | 
--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
1 | from setuptools import setup
2 | from setuptools.extension import Extension
3 | from Cython.Build import cythonize
4 | 
5 | extra_compile_args = ["-O3"]
6 | extensions = [
7 |     Extension("aile._kernel", ["aile/_kernel.pyx"],
8 |               extra_compile_args=extra_compile_args
9 |     ),
10 |     Extension("aile.dtw", ["aile/dtw.pyx"],
11 |               extra_compile_args=extra_compile_args
12 |     ),
13 | ]
14 | 
15 | setup(
16 |     name = 'AILE',
17 |     version = '0.0.2',
18 |     packages = ['aile'],
19 |     install_requires = [
20 |         'numpy',
21 |         'scipy',
22 |         'scikit-learn',
23 |         'scrapely',
24 |         'cython',
25 |         'networkx',
26 |         'pulp'
27 |     ],
28 |     dependency_links = [
29 |         'git+https://github.com/scrapinghub/portia.git@multiple-item-extraction#egg=slyd&subdirectory=slyd',
30 |         'git+https://github.com/scrapinghub/portia.git@multiple-item-extraction#egg=slybot&subdirectory=slybot'
31 |     ],
32 |     tests_require = [
33 |         'pytest'
34 |     ],
35 |     ext_modules = cythonize(extensions),
36 |     scripts = ['scripts/gen-slybot-project']
37 | )
38 | 
--------------------------------------------------------------------------------
/test/requirements.txt:
--------------------------------------------------------------------------------
1 | pytest>=2.7.2
--------------------------------------------------------------------------------
/test/table.css:
--------------------------------------------------------------------------------
1 | table a:link {
2 |     color: #666;
3 |     font-weight: bold;
4 |     text-decoration:none;
5 | }
6 | table a:visited {
7 |     color: #999999;
8 |     font-weight:bold;
9 |     text-decoration:none;
10 | }
11 | table a:active,
12 | table a:hover {
13 |     color: #bd5a35;
14 |     text-decoration:underline;
15 | }
16 | table {
17 |     font-family:Arial, Helvetica, sans-serif;
18 |     color:#666;
19 |     font-size:12px;
20 |     text-shadow: 1px 1px 0px #fff;
21 |     background:#eaebec;
22 |     margin:20px;
23 |     border:#ccc 1px solid;
24 | 
25 |     -moz-border-radius:3px;
26 |     -webkit-border-radius:3px;
27 |     border-radius:3px;
28 | 
29 |     -moz-box-shadow: 0 1px 2px #d1d1d1;
30 |     -webkit-box-shadow: 0 1px 2px #d1d1d1;
31 |     box-shadow: 0 1px 2px #d1d1d1;
32 | }
33 | table th {
34 |     padding:21px 25px 22px 25px;
35 |     border-top:1px solid #fafafa;
36 |     border-bottom:1px solid #e0e0e0;
37 | 
38 |     background: #ededed;
39 |     background: -webkit-gradient(linear, left top, left bottom, from(#ededed), to(#ebebeb));
40 |     background: -moz-linear-gradient(top, #ededed, #ebebeb);
41 | }
42 | table th:first-child {
43 |     text-align: left;
44 |     padding-left:20px;
45 | }
46 | table tr:first-child th:first-child {
47 |     -moz-border-radius-topleft:3px;
48 |     -webkit-border-top-left-radius:3px;
49 |     border-top-left-radius:3px;
50 | }
51 | table tr:first-child th:last-child {
52 |     -moz-border-radius-topright:3px;
53 |     -webkit-border-top-right-radius:3px;
54 |     border-top-right-radius:3px;
55 | }
56 | table tr {
57 |     text-align: center;
58 |     padding-left:20px;
59 | }
60 | table td:first-child {
61 |     text-align: left;
62 |     padding-left:20px;
63 |     border-left: 0;
64 | }
65 | table td {
66 |     padding:18px;
67 |     border-top: 1px solid #ffffff;
68 |     border-bottom:1px solid #e0e0e0;
69 |     border-left: 1px solid #e0e0e0;
70 | 
71 |     background: #fafafa;
72 |     background: -webkit-gradient(linear, left top, left bottom, from(#fbfbfb), to(#fafafa));
73 |     background: -moz-linear-gradient(top, #fbfbfb, #fafafa);
74 | }
75 | table tr.even td {
76 |     background: #f6f6f6;
77 |     background: -webkit-gradient(linear, left top, left bottom, from(#f8f8f8), to(#f6f6f6));
78 |     background: -moz-linear-gradient(top, #f8f8f8, #f6f6f6);
79 | }
80 | table tr:last-child td {
81 |     border-bottom:0;
82 | }
83 | table tr:last-child td:first-child {
84 |     -moz-border-radius-bottomleft:3px;
85 |     -webkit-border-bottom-left-radius:3px;
86 |     border-bottom-left-radius:3px;
87 | }
88 | table tr:last-child td:last-child {
89 |     -moz-border-radius-bottomright:3px;
90 |     -webkit-border-bottom-right-radius:3px;
91 |     border-bottom-right-radius:3px;
92 | }
93 | table tr:hover td {
94 |     background: #f2f2f2;
95 |     background: -webkit-gradient(linear, left top, left bottom, from(#f2f2f2), to(#f0f0f0));
96 |     background: -moz-linear-gradient(top, #f2f2f2, #f0f0f0);
97 | }
--------------------------------------------------------------------------------
/test/test_slybot.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | import os
3 | import tempfile
4 | import json
5 | import shutil
6 | import collections
7 | import contextlib
8 | 
9 | import aile
10 | 
11 | 
12 | try:
13 |     FILE = __file__
14 | except NameError:
15 |     FILE = './test'
16 | 
17 | TESTDIR = os.getenv('DATAPATH',
18 |                     os.path.dirname(os.path.realpath(FILE)))
19 | 
20 | 
21 | def get_local_url(filename):
22 |     return 'file:///{0}/{1}'.format(TESTDIR, filename)
23 | 
24 | 
25 | def item_name(schema, item):
26 |     for field in item.keys():
27 |         for name, fields in schema.iteritems():
28 |             if field in fields:
29 |                 return name
30 |     return None
31 | 
32 | 
33 | 
34 | class ExtractTest(object):
35 |     def __init__(self, train_url):
36 |         self.train_url = get_local_url(train_url)
37 |         self.project_path = tempfile.mkdtemp()
38 |         self.item_extract = aile.generate_slybot_project(
39 |             self.train_url, path=self.project_path, verbose=False)
40 |         with open(os.path.join(self.project_path, 'items.json'), 'r') as schema_file:
41 |             schema = json.load(schema_file)
42 |         self.schema = {
43 |             item_name: set(item['fields'])
44 |             for item_name, item in schema.iteritems()}
45 | 
46 |     def run(self, url=None):
47 |         if url is None:
48 |             url = self.train_url
49 |         else:
50 |             url = get_local_url(url)
51 |         extract_path = tempfile.mktemp(suffix='.json')
52 |         opt = [
53 |             '-s LOG_LEVEL=CRITICAL',
54 |             '-s SLYDUPEFILTER_ENABLED=0',
55 |             '-s PROJECT_DIR={0}'.format(self.project_path),
56 |             '-o {0}'.format(extract_path)
57 |         ]
58 |         cmd = 'slybot crawl {1} aile -a start_urls="{0}"'.format(url, ' '.join(opt))
59 |         if os.system(cmd) != 0:
60 |             return None
61 |         with open(extract_path, 'r') as extract_file:
62 |             items = json.load(extract_file)
63 |         os.remove(extract_path)
64 |         grouped_items = collections.defaultdict(list)
65 |         for item in items:
66 |             name = item_name(self.schema, item)
67 |             if name:
68 |                 grouped_items[name].append(item)
69 |         return grouped_items.values()
70 | 
71 |     def close(self):
72 |         shutil.rmtree(self.project_path)
73 | 
74 | 
75 | def find_fields(item, true_item):  # map each expected value to the field that holds it
76 |     fields = []
77 |     for true_field_value in true_item:
78 |         found = False
79 |         for field_name, field_value in item.iteritems():
80 |             if true_field_value == field_value[0].strip():
81 |                 found = True
82 |                 fields.append(field_name); break  # first match only, keeps fields aligned with true_item
83 |         if not found:
84 |             fields.append(None)
85 |     return fields
86 | 
87 | 
88 | class CheckExtractionCannotFindField(Exception):
89 |     pass
90 | 
91 | 
92 | class CheckExtractionDifferentNumberOfItems(Exception):
93 |     def __init__(self, expected, found):
94 |         self.expected = expected
95 |         self.found = found
96 | 
97 |     def __str__(self):
98 |         return 'Different number of items. Expected: {0}, Found: {1}'.format(
99 |             self.expected, self.found)
100 | 
101 | 
102 | class CheckExtractionCannotFindItem(Exception):
103 |     def __init__(self, item):
104 |         self.item = item
105 | 
106 |     def __str__(self):
107 |         return "Couldn't extract: {0}".format(self.item)
108 | 
109 | 
110 | def _check_extraction(items, true_items):
111 |     fields = find_fields(items[0], true_items[0])
112 |     if not all(fields):
113 |         raise CheckExtractionCannotFindField()
114 |     if len(items) != len(true_items):
115 |         raise CheckExtractionDifferentNumberOfItems(len(true_items), len(items))
116 |     for true in true_items:
117 |         any_match = False
118 |         for extracted in items:
119 |             match = True
120 |             for field, true_value in zip(fields, true):
121 |                 if extracted[field][0].strip() != true_value:
122 |                     match = False
123 |                     break
124 |             if match:
125 |                 any_match = True
126 |         if not any_match:
127 |             raise CheckExtractionCannotFindItem(true)
128 |     return True
129 | 
130 | def check_extraction(all_items, true_items):
131 |     found = False
132 |     for items in all_items:
133 |         try:
134 |             found = _check_extraction(items, true_items)
135 |             if found:
136 |                 break
137 |         except CheckExtractionCannotFindField:
138 |             pass  # wrong group of items for these expected values; try the next one
139 |     assert found
140 | 
141 | 
142 | PATCH_OF_LAND_1 = [
143 |     ['158 Halsey Street, Brooklyn, New York'],
144 |     ['695 Monroe Street, Brooklyn, New York'],
145 |     ['138 Wood Road, Los Gatos, California'],
146 |     ['Multiple Addresses, Sacramento, California'],
147 |     ['438 29th St, San Francisco, California'],
148 |     ['747 Kingston Road, Princeton, New Jersey'],
149 |     ['2459 Ketchum Rd, Memphis, Tennessee'],
150 |     ['158 Halsey Street, Brooklyn, New York'],
151 |     ['697 Monroe St., Brooklyn, New York'],
152 |     ['2357 Greenfield Ave, Los Angeles, California'],
153 |     ['5567 Colwell Road, Penryn, California'],
154 |     ['2357 Greenfield Ave, Los Angeles, California']
155 | ]
156 | 
157 | 
158 | def test_patchofland_1():
159 |     with contextlib.closing(ExtractTest('Patch of Land.html')) as test:
160 |         check_extraction(
161 |             test.run('Patch of Land.html'), PATCH_OF_LAND_1)
162 | 
163 | MONSTER_1 = [
164 |     [u'Java Developer (Graduate) Trading'],
165 |     [u'C# Developer / C# .Net Programmer - R&D'],
166 |     [u'Cyber Security Analyst - SIEM, CISSP, Vulnerability'],
167 |     [u'UKEN Std Apply Auto Test Job With no RF and Q - Do not Apply'],
168 |     [u'UKEN Std & Exp Apply Auto Test Job Without Questions - Do not Apply'],
169 |     [u'UKEN Std Apply Auto Test Job Without Questions - Do not Apply'],
170 |     [u'UKEN Std & Exp Apply Auto Test Job With Questions - Do not Apply'],
171 |     [u'UKEN Express Apply Automation Test Job without Questions - Do not Apply'],
172 |     [u'UKEN Company Confidential Test Job With Questions - Do not Apply'],
173 |     [u'UKEN Std Apply Auto Test Job With Questions - Do not Apply'],
174 |     [u'UKEN Shared Std Apply Auto Test Job - Do not Apply'],
175 |     [u'UKEN Shared Apply Automation Test Job - Do not Apply'],
176 |     [u'Front-End Developer , £30-35K'],
177 |     [u'Ruby Developer - Fastest Growing Healthcare Startup'],
178 |     [u'C#, MVC, Sharepoint, ASP.NET System Analyst/Developers'],
179 |     [u'Oracle Applications DBA'],
180 |     [u'C++ Developer Contract'],
181 |     [u'Network Security Engineer -SolarWinds Specialist'],
182 |     [u'Software Development Manager/CTO - Unified Communications'],
183 |     [u'Senior C++ EFX Developer']
184 | ]
185 | 
186 | 
187 | def test_monster_1():
188 |     with contextlib.closing(ExtractTest('Monster.html')) as test:
189 |         check_extraction(
190 |             test.run('Monster.html'), MONSTER_1)
191 | 
192 | 
193 | ARS_TECHNICA_1 = [
194 |     [u'“Unauthorized code” in Juniper firewalls decrypts encrypted VPN traffic'],
195 |     [u'LifeLock ID protection service to pay record $100 million for failing customers'],
196 |     [u'Hacker hacks off Tesla with claims of self-driving car'],
197 |     [u'Windows 10 Mobile upgrade won’t hit older phones until 2016'],
198 |     [u'Video memories, storytelling, and Star Wars spoilers (no actual spoilers!)'],
199 |     [u'Blackberry CEO says Apple has gone to a “dark place” with pro-privacy stance'],
200 |     [u'Google ramps up EU lobbying as antitrust charges proceed'],
201 |     [u'Dealmaster: Get a 32GB Moto X Pure Edition unlocked smartphone for $349'],
202 |     [u'Apple gets a new COO, puts Phil Schiller in charge of the App Store'],
203 |     [u'Microsoft makes 16 more Xbox 360 games playable on Xbox One'],
204 |     [u'League of Legends now owned entirely by Chinese giant Tencent'],
205 |     [u'Busted by online package tracking, drug dealer gets more than 8 years in prison'],
206 |     [u'OneDrive for Business to get unlimited storage for enterprise customers'],
207 |     [u'Germany approves 30-minute software update fix for cheating Volkswagen diesels'],
208 |     [u'''Turing’s Shkreli on drug price-hike: “It gets people talking… that’s what art is”'''],
209 |     [u'Self-driving Ford Fusions are coming to California next year'],
210 |     [u'Cop who wanted to photograph teen’s erection in sexting case commits suicide'],
211 |     [u'Republicans in Congress let net neutrality rules live on (for now)'],
212 |     [u'Confirmed: Kojima leaves Konami to work on PS4 console exclusive [Updated]'],
213 |     [u'Netflix to offer less bandwidth for My Little Pony , more for Avengers'],
214 |     [u'Final NASA budget bill fully funds commercial crew and Earth science'],
215 |     [u'Firefox for Windows finally has an official, stable 64-bit build'],
216 |     [u'Smash Bros. DLC concludes with Bayonetta, Super Mario RPG Geno costume'],
217 |     [u'New XPRIZE competition looks for a better underwater robot'],
218 |     [u'Google’s new data-only Project Fi tablet plans don’t charge device fees'],
219 |     [u'Tech firms could owe up to 4% of global revenue if they violate new EU data law'],
220 |     [u'Android Pay adds in-app purchasing feature, catches up to Apple Pay'],
221 |     [u'Pebble’s new Health app integrates with Timeline, suggests tips to get healthier'],
222 |     [u'Websites may soon know if you’re mad—a little mouse will tell them'],
223 |     [u'Dust Bowl returns as an Expedition in Oath of the Gatewatch'],
224 |     [u'Orbitar, really? Some new exoplanet names are downright weird'],
225 |     [u'13 million MacKeeper users exposed after MongoDB door was left open'],
226 | ]
227 | 
228 | 
229 | def test_ars_technica_1():
230 |     with contextlib.closing(ExtractTest('Ars Technica.html')) as test:
231 |         check_extraction(
232 |             test.run('Ars Technica.html'), ARS_TECHNICA_1)
233 | 
--------------------------------------------------------------------------------
/tox.ini:
--------------------------------------------------------------------------------
1 | [tox]
2 | envlist = py27
3 | 
4 | [testenv]
5 | install_command =
6 |     pip install --process-dependency-links {opts} {packages}
7 | deps =
8 |     -r{toxinidir}/requirements.txt
9 |     -r{toxinidir}/test/requirements.txt
10 | commands =
11 |     py.test
12 | 
13 | setenv =
14 |     DATAPATH={toxinidir}/test
--------------------------------------------------------------------------------
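Taken together, the files above close the loop end to end: aile.generate_slybot_project writes a slybot project for a page, slybot crawls it, and the tests compare the extracted JSON items against hand-checked lists, with tox exporting DATAPATH so the local test pages resolve. A sketch of driving the same loop by hand, assuming the dependencies are installed and slybot is on the PATH; the URL and output file name are placeholders, and the crawl line mirrors the invocation built in test_slybot.py::

    import os

    import aile

    # Write a slybot project for a page with a repeating item list; per the
    # usage text of scripts/gen-slybot-project, the default output directory
    # is 'slybot-project'.
    aile.generate_slybot_project('https://news.ycombinator.com', verbose=True)

    # Crawl with the generated spider and collect the extracted items as JSON.
    os.system('slybot crawl -s PROJECT_DIR=slybot-project -o items.json aile')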