├── .gitignore
├── .travis.yml
├── LICENSE
├── MANIFEST.in
├── README.md
├── aile
│   ├── __init__.py
│   ├── _kernel.pyx
│   ├── dtw.pyx
│   ├── kernel.py
│   ├── ptree.py
│   └── slybot_project.py
├── demo1.py
├── demo2.py
├── demo3.py
├── doc
│   ├── Makefile
│   ├── _static
│   │   ├── F_jk.pdf
│   │   ├── F_jk.svg
│   │   ├── F_jk_bars.pdf
│   │   ├── F_jk_bars.svg
│   │   ├── F_jk_graph.py
│   │   ├── F_jk_no_labels.svg
│   │   ├── HMM_1.pdf
│   │   ├── HMM_1.svg
│   │   ├── HMM_2.pdf
│   │   ├── HMM_2.svg
│   │   ├── PHMM_1.pdf
│   │   ├── PHMM_1.svg
│   │   ├── PHMM_2.pdf
│   │   ├── PHMM_2.svg
│   │   ├── PHMM_3.pdf
│   │   ├── PHMM_3.svg
│   │   ├── PHMM_4.pdf
│   │   ├── PHMM_4.svg
│   │   ├── forward.pdf
│   │   ├── forward.svg
│   │   ├── transition_matrix_1.pdf
│   │   ├── transition_matrix_1.svg
│   │   ├── transition_matrix_2.pdf
│   │   ├── transition_matrix_2.svg
│   │   ├── transition_matrix_3.pdf
│   │   ├── transition_matrix_3.svg
│   │   ├── transition_matrix_4.pdf
│   │   ├── transition_matrix_4.svg
│   │   ├── transition_matrix_5.pdf
│   │   └── transition_matrix_5.svg
│   ├── conf.py
│   ├── index.rst
│   ├── make.bat
│   ├── notes.rst
│   └── requirements.txt
├── misc
│   ├── demo1_img.png
│   ├── demo2_img.png
│   └── visual.py
├── requirements.txt
├── scripts
│   └── gen-slybot-project
├── setup.py
├── test
│   ├── Ars Technica.html
│   ├── Monster.html
│   ├── Patch of Land 2.html
│   ├── Patch of Land.html
│   ├── Reddit.html
│   ├── requirements.txt
│   ├── table.css
│   └── test_slybot.py
└── tox.ini
/.gitignore:
--------------------------------------------------------------------------------
1 | *.pyc
2 | *.html
3 | *.png
4 | *.so
5 | *#
6 | *~
7 | build/
--------------------------------------------------------------------------------
/.travis.yml:
--------------------------------------------------------------------------------
1 | language: python
2 | python: 2.7
3 |
4 | install:
5 | - pip install cython
6 | - pip install -U tox
7 |
8 | script:
9 | - travis_wait tox -vvv
10 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | The MIT License (MIT)
2 |
3 | Copyright (c) 2015 Pedro López-Adeva Fernández-Layos
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
23 |
--------------------------------------------------------------------------------
/MANIFEST.in:
--------------------------------------------------------------------------------
1 | include aile/*.pyx
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Automatic Item List Extraction [![Build Status](https://travis-ci.org/scrapinghub/aile.svg?branch=master)](https://travis-ci.org/scrapinghub/aile)
2 |
3 | This repository is a temporary container for experiments in automatic extraction of lists and tables from web pages.
4 | At some later point I will merge the surviving algorithms into either [scrapely](https://github.com/scrapy/scrapely)
5 | or [portia](https://github.com/scrapinghub/portia).
6 |
7 | I document my ideas and algorithm descriptions at [readthedocs](http://aile.readthedocs.org/en/latest/).
8 |
9 | The current approach is based on the HTML code of the page, treated as a stream of HTML tags as processed by
10 | [scrapely](https://github.com/scrapy/scrapely). An alternative approach would be to also use the web page
11 | rendering information ([this script](https://github.com/plafl/aile/blob/master/misc/visual.py) renders a tree
12 | of bounding boxes for each element).
13 |
14 | ## Installation
15 |     pip install -r requirements.txt
16 |     python setup.py develop
17 |
18 | ## Running
19 | If you want to get a feel for how it works, there are two demo scripts included in the repo.
20 |
21 | - demo1.py
22 | Will annotate the HTML code of a web page, marking in red the lines that form part of a repeating item
23 | and prefixing each line with its field number inside the item. The output is written to the file 'annotated.html'.
24 |
25 |     python demo1.py https://news.ycombinator.com
26 |
27 | ![demo1 output](misc/demo1_img.png)
28 |
29 | - demo2.py
30 | Will label, color and draw the HTML tree so that repeating elements are easy to see. The output is interactive
31 | (requires PyQt4).
32 |
33 |     python demo2.py https://news.ycombinator.com
34 |
35 | ![demo2 output](misc/demo2_img.png)
36 |
37 | ## Algorithms
38 |
39 | We are trying to auto-detect repeating patterns in the tags, not necessarily made of *li*, *tr* or *td* tags.
40 |
41 | ### Clustering trees with a measure of similarity
42 | The idea is to compute the distance between all subtrees in the web page and run a clustering algorithm on the resulting distance matrix.
43 | For a web page with N tags this can be achieved in O(N^2) time. The current algorithm actually computes a kernel and from the kernel
44 | derives the distance. The algorithm is based on:
45 |
46 | Kernels for semi-structured data
47 | Hisashi Kashima, Teruo Koyanagi
48 |
49 | Once the distance between all subtrees of the web page has been computed, DBSCAN clustering is run on the distance matrix.
50 | The resulting clusters are then refined a little more to produce the final items.
51 |
52 | ### Markov models
53 | The problem of detecting repeating patterns in streams is known as *motif discovery* and most of the literature about it seems
54 | to be published in the field of genetics. Inspired by this there is [a branch](https://github.com/plafl/aile/tree/markov_model)
55 | implementing MEME and Profile HMM algorithms.
56 |
57 | The Markov model approach has the following problems right now:
58 |
59 | - Requires several web pages for training, depending on the web page type
60 | - Training is performed using the EM algorithm, which requires several attempts until a good optimum is achieved
61 | - The number of hidden states is hard to determine. Some heuristics are applied, but they only work partially
62 |
63 | These problems are not insurmountable (I think) but require a lot of work:
64 |
65 | - Precision could be improved using [conditional random fields](https://en.wikipedia.org/wiki/Conditional_random_field).
66 | These could also alleviate the need for training data.
67 | - Training parallelizes well. This is actually already done using [joblib](https://pythonhosted.org/joblib/parallel.html)
68 | on a single PC, but it could be further improved using a cluster of computers.
69 | - There are some papers about hidden state merging/splitting and even an
70 | [infinite number of states](http://machinelearning.wustl.edu/mlpapers/paper_files/nips02-AA01.pdf)
71 |
--------------------------------------------------------------------------------
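To make the pipeline described in the README concrete, here is a minimal usage sketch of the code in this repository. It assumes the package and its Cython extensions have been built as described in the Installation section; the URL is only an example.

    import scrapely.htmlpage as hp
    import aile.kernel
    import aile.ptree

    # Download the page and build the tag tree (closing tags and empty text are dropped)
    page = hp.url_to_page('https://news.ycombinator.com')
    tree = aile.ptree.PageTree(page)

    # Kernel -> distance -> DBSCAN clustering -> item and field extraction
    ie = aile.kernel.ItemExtract(tree)
    for items, cells in ie.tables:
        print len(items), 'items with', cells.shape[1], 'fields'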
/aile/__init__.py:
--------------------------------------------------------------------------------
1 | import time
2 |
3 | import scrapely
4 |
5 | from . import slybot_project
6 | from . import kernel
7 | from . import ptree
8 |
9 | def generate_slybot_project(url, path='slybot-project', verbose=False):
10 | def _print(s):
11 | if verbose:
12 | print s,
13 |
14 | _print('Downloading URL...')
15 | t1 = time.clock()
16 | page = scrapely.htmlpage.url_to_page(url)
17 | _print('done ({0}s)\n'.format(time.clock() - t1))
18 |
19 | _print('Extracting items...')
20 | t1 = time.clock()
21 | ie = kernel.ItemExtract(ptree.PageTree(page), separate_descendants=True)
22 | _print('done ({0}s)\n'.format(time.clock() - t1))
23 |
24 | _print('Generating slybot project...')
25 | t1 = time.clock()
26 | slybot_project.generate(ie, path)
27 | _print('done ({0}s)\n'.format(time.clock() - t1))
28 |
29 | return ie
30 |
--------------------------------------------------------------------------------
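A quick sketch of how the helper above is meant to be called (the URL and output path are only examples):

    import aile

    # Downloads the page, extracts the repeating items and writes a slybot project
    ie = aile.generate_slybot_project(
        'https://news.ycombinator.com', path='slybot-project', verbose=True)
    print len(ie.items)  # number of groups of repeating items found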
/aile/_kernel.pyx:
--------------------------------------------------------------------------------
1 | # cython: linetrace=True
2 | # distutils: define_macros=CYTHON_TRACE_NOGIL=1
3 |
4 | import collections
5 | import itertools
6 |
7 | import numpy as np
8 | cimport numpy as np
9 | cimport cython
10 |
11 |
12 | def order_pairs(nodes):
13 | """Given a list of fragments return pairs of equal nodes ordered in
14 | such a way that if pair_1 comes before pair_2 then no element of pair_2 is
15 | a descendant of pair_1"""
16 | grouped = collections.defaultdict(list)
17 | for i, node in enumerate(nodes):
18 | grouped[node].append(i)
19 | return sorted(
20 | [pair
21 | for node, indices in grouped.iteritems()
22 | for pair in itertools.combinations_with_replacement(sorted(indices), 2)],
23 | key=lambda x: x[0], reverse=True)
24 |
25 |
26 | def check_order(op, parents):
27 | """Test for order_pairs"""
28 | N = len(parents)
29 | C = np.zeros((N, N), dtype=int)
30 | for i, j in op:
31 | pi = parents[i]
32 | pj = parents[j]
33 | if pi > 0 and pj > 0:
34 | assert C[pi, pj] == 0
35 | C[i, j] = 1
36 |
37 |
38 | def similarity(ptree, max_items=1):
39 | all_classes = list({node.class_attr for node in ptree.nodes})
40 | class_index = {c: i for i, c in enumerate(all_classes)}
41 | class_map = np.array([class_index[node.class_attr] for node in ptree.nodes])
42 | N = len(all_classes)
43 | similarity = np.zeros((N, N), dtype=float)
44 | for i in range(N):
45 | for j in range(N):
46 | li = min(max_items, len(all_classes[i]))
47 | lj = min(max_items, len(all_classes[j]))
48 | lk = min(max_items, len(all_classes[i] & all_classes[j]))
49 | similarity[i, j] = (1.0 + lk)/(1.0 + li + lj - lk)
50 | return class_map, similarity
51 |
52 |
53 | @cython.boundscheck(False)
54 | cpdef build_counts(ptree, int max_depth=4, int max_childs=20):
55 | cdef int N = len(ptree)
56 | if max_childs is None:
57 | max_childs = N
58 | pairs = order_pairs(ptree.nodes)
59 | cdef np.ndarray[np.double_t, ndim=2] sim
60 | cdef np.ndarray[np.int_t, ndim=1] cmap
61 | cmap, sim = similarity(ptree)
62 |
63 | cdef np.ndarray[np.double_t, ndim=3] C = np.zeros((N, N, max_depth), dtype=float)
64 | cdef np.ndarray[np.double_t, ndim=3] S = np.zeros(
65 | (max_childs + 1, max_childs + 1, max_depth), dtype=float)
66 | S[0, :, :] = S[:, 0, :] = 1
67 |
68 | cdef int i1, i2, j1, j2, k1, k2
69 | cdef np.ndarray[np.int_t, ndim=2] children = ptree.children_matrix(max_childs)
70 | for i1, i2 in pairs:
71 | s = sim[cmap[i1], cmap[i2]]
72 | if children[i1, 0] == -1 and children[i2, 0] == -1:
73 | C[i2, i1, :] = C[i1, i2, :] = s
74 | else:
75 | for j1 in range(1, max_childs + 1):
76 | k1 = children[i1, j1 - 1]
77 | if k1 < 0:
78 | break
79 | for j2 in range(1, max_childs + 1):
80 | k2 = children[i2, j2 - 1]
81 | if k2 < 0:
82 | break
83 | S[j1, j2, :] = S[j1 - 1, j2 , : ] +\
84 | S[j1 , j2 - 1, : ] -\
85 | S[j1 - 1, j2 - 1, : ]
86 | S[j1, j2, 1:] += S[j1 - 1, j2 - 1, 1:]*C[k1, k2, :-1]
87 | C[i2, i1, :] = C[i1, i2, :] = s*S[j1 - 1, j2 - 1, :]
88 | return C[:, :, max_depth - 1]
89 |
90 |
91 | @cython.boundscheck(False)
92 | cpdef kernel(ptree, counts=None, int max_depth=4, int max_childs=20, double decay=0.5):
93 | cdef np.ndarray[np.double_t, ndim=2] C
94 | if counts is None:
95 | C = build_counts(ptree, max_depth, max_childs)
96 | else:
97 | C = counts
98 |
99 | cdef np.ndarray[np.double_t, ndim=2] K = np.zeros((C.shape[0], C.shape[1]))
100 | cdef np.ndarray[np.double_t, ndim=2] A = C.copy()
101 | cdef np.ndarray[np.double_t, ndim=2] B = C.copy()
102 | cdef int N = K.shape[0]
103 | cdef int i, j, pi, pj, ri, rj
104 | for i in range(N - 1, -1, -1):
105 | pi = ptree.parents[i]
106 | for j in range(N - 1, -1, -1):
107 | pj = ptree.parents[j]
108 | if pi > 0:
109 | A[pi, j] += decay*A[i, j]
110 | if pj > 0:
111 | B[i, pj] += decay*B[i, j]
112 | for i in range(N - 1, -1, -1):
113 | pi = ptree.parents[i]
114 | for j in range(N - 1, -1, -1):
115 | ri = max(ptree.match[i], i)
116 | rj = max(ptree.match[j], j)
117 | K[i, j] += A[i, j] + B[i, j] - C[i, j]
118 | pj = ptree.parents[j]
119 | if pi > 0 and pj > 0:
120 | K[pi, pj] += decay*K[i, j]
121 | return K
122 |
123 |
124 | cpdef min_dist_complete(np.ndarray[np.double_t, ndim=2] D):
125 | cdef np.ndarray[np.double_t, ndim=2] R = D.copy()
126 | cdef int N = D.shape[0]
127 | cdef int i, j, k
128 | cdef double d
129 | for k in range(N):
130 | for i in range(N):
131 | for j in range(N):
132 | d = max(R[i, k], R[k, j])
133 | if R[i,j] > d:
134 | R[i, j] = d
135 | return R
136 |
--------------------------------------------------------------------------------
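As a side note, `min_dist_complete` above is a Floyd-Warshall style closure where the cost of a path is the largest distance along it. For illustration only (not part of the package), an equivalent plain NumPy sketch:

    import numpy as np

    def min_dist_complete_py(D):
        # R[i, j] = minimum over all paths i -> ... -> j of the largest
        # distance along the path (minimax / bottleneck shortest path)
        R = D.copy()
        for k in range(R.shape[0]):
            np.minimum(R, np.maximum(R[:, k:k+1], R[k:k+1, :]), out=R)
        return R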
/aile/dtw.pyx:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | cimport numpy as np
3 | cimport cython
4 |
5 |
6 | @cython.boundscheck(False)
7 | cpdef from_distance(np.ndarray[np.double_t, ndim=2] D):
8 | """Given a distance matrix compute the dynamic time warp distance.
9 |
10 | If the distance matrix 'D' is between sequences 's' and 't' then:
11 | 1. D[i, j] = |s[i] - t[j]|
12 | 2. DTW[i, j] represents the dynamic time warp distance between
13 | subsequences s[:i+1] and t[:j+1]
14 | 3. DTW[-1, -1] is the dynamic time warp distance between s and t
15 | """
16 | cdef int m = D.shape[0]
17 | cdef int n = D.shape[1]
18 | cdef np.ndarray[np.double_t, ndim=2] DTW = np.zeros((m + 1, n + 1))
19 | DTW[:, 0] = np.inf
20 | DTW[0, :] = np.inf
21 | DTW[0, 0] = 0
22 | cdef int i
23 | cdef int j
24 | for i in range(1, m + 1):
25 | for j in range(1, n + 1):
26 | DTW[i, j] = D[i - 1, j - 1] + min(
27 | DTW[i - 1, j ],
28 | DTW[i , j - 1],
29 | DTW[i - 1, j - 1])
30 | return DTW
31 |
32 |
33 | @cython.boundscheck(False)
34 | cpdef path(np.ndarray[np.double_t, ndim=2] DTW):
35 | """Given a DTW matrix backtrack to find the alignment between two sequences"""
36 | cdef int m = DTW.shape[0] - 1
37 | cdef int n = DTW.shape[1] - 1
38 | cdef int i = m - 1
39 | cdef int j = n - 1
40 | cdef np.ndarray[np.int_t, ndim=1] s = np.zeros((m,), dtype=int)
41 | cdef np.ndarray[np.int_t, ndim=1] t = np.zeros((n,), dtype=int)
42 | while i >= 0 or j >= 0:
43 | s[i] = j
44 | t[j] = i
45 | if DTW[i, j + 1] < DTW[i + 1, j]:
46 | if DTW[i, j + 1] < DTW[i, j]:
47 | i -= 1
48 | else:
49 | i -= 1
50 | j -= 1
51 | elif DTW[i + 1, j] < DTW[i, j]:
52 | j -= 1
53 | else:
54 | i -= 1
55 | j -= 1
56 | return s, t
57 |
58 |
59 | @cython.boundscheck(False)
60 | cpdef match(np.ndarray[np.int_t, ndim=1] s,
61 | np.ndarray[np.int_t, ndim=1] t,
62 | np.ndarray[np.double_t, ndim=2] D):
63 | """Given the alignments from two sequences find the match between elements.
64 |
65 | When aligning an element from one sequence can correspond to several elements
66 | of the other one. Matching resolves this ambiguity forcing unique pairings. If
67 | an element is unpaired then it as assigned -1.
68 | """
69 | s = s.copy()
70 | cdef int i, j, k, m
71 | cdef double d
72 | cdef int N = len(s)
73 | for i in range(N):
74 | j = s[i]
75 | m = k = i
76 | d = D[i, j]
77 | while k < N and s[k] == j:
78 | if D[k, j] < d:
79 | m = k
80 | d = D[k, j]
81 | k += 1
82 | k = i
83 | while k < N and s[k] == j:
84 | if k != m:
85 | s[k] = -1
86 | k += 1
87 | return s
88 |
--------------------------------------------------------------------------------
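A small usage sketch of the three functions above, assuming the Cython modules have been built (e.g. via `python setup.py develop`) so that `aile.dtw` is importable; the sequences are made up for illustration:

    import numpy as np
    from aile import dtw

    s = np.array([1.0, 2.0, 3.0, 3.0])
    t = np.array([1.0, 3.0, 4.0])
    D = np.abs(s[:, None] - t[None, :])  # D[i, j] = |s[i] - t[j]|
    DTW = dtw.from_distance(D)           # DTW[-1, -1] is the DTW distance
    a_s, a_t = dtw.path(DTW)             # alignment of the two sequences
    m = dtw.match(a_s, a_t, D)           # unique pairing, -1 means unpaired
    print DTW[-1, -1], m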
/aile/kernel.py:
--------------------------------------------------------------------------------
1 | import collections
2 | import itertools
3 |
4 | import numpy as np
5 | import sklearn.cluster, sklearn.neighbors
6 | import networkx as nx
7 | import scrapely.htmlpage as hp
8 |
9 | import _kernel as _ker
10 | import dtw
11 |
12 |
13 | def to_rows(d):
14 | """Make a square matrix with rows equal to 'd'.
15 |
16 | >>> print to_rows(np.array([1,2,3,4]))
17 | [[1 2 3 4]
18 | [1 2 3 4]
19 | [1 2 3 4]
20 | [1 2 3 4]]
21 | """
22 | return np.tile(d, (len(d), 1))
23 |
24 |
25 | def to_cols(d):
26 | """Make a square matrix with columns equal to 'd'.
27 |
28 | >>> print to_cols(np.array([1,2,3,4]))
29 | [[1 1 1 1]
30 | [2 2 2 2]
31 | [3 3 3 3]
32 | [4 4 4 4]]
33 | """
34 | return np.tile(d.reshape(len(d), -1), (1, len(d)))
35 |
36 |
37 | def normalize_kernel(K):
38 | """New kernel with unit diagonal.
39 |
40 | K'[i, j] = K[i, j]/sqrt(K[i,i]*K[j,j])
41 | """
42 | d = np.diag(K).copy()
43 | d[d == 0] = 1.0
44 | return K/np.sqrt(to_rows(d)*to_cols(d))
45 |
46 |
47 | def kernel_to_distance(K):
48 | """Build a distance matrix.
49 |
50 | From the dot product:
51 | |u - v|^2 = (u - v)(u - v) = u^2 + v^2 - 2uv
52 | """
53 | d = np.diag(K)
54 | D = to_rows(d) + to_cols(d) - 2*K
55 | D[D < 0] = 0.0 # numerical error can make D go a little below 0
56 | return np.sqrt(D)
57 |
58 |
59 | def tree_size_distance(page_tree):
60 | """Build a distance matrix comparing subtree sizes.
61 |
62 | If T1 and T2 are trees and N1 and N2 the number of nodes within:
63 | |T1 - T2| = |N1 - N2|/(N1 + N2)
64 | Since:
65 | N1 >= 1
66 | N2 >= 1
67 | Then:
68 | 0 <= |T1 - T2| < 1
69 | """
70 | s = page_tree.tree_size()
71 | a = to_cols(s).astype(float)
72 | b = to_rows(s).astype(float)
73 | return np.abs(a - b)/(a + b)
74 |
75 |
76 | def must_separate(nodes, page_tree):
77 | """Given a sequence of nodes and a PageTree return a list of pairs
78 | of nodes such that one is the ascendant/descendant of the other"""
79 | separate = []
80 | for src in nodes:
81 | m = page_tree.match[src]
82 | if m >= 0:
83 | for tgt in range(src+1, m):
84 | if tgt in nodes:
85 | separate.append((src, tgt))
86 | return separate
87 |
88 |
89 | def cut_descendants(D, nodes, page_tree):
90 | """Given the distance matrix D, a set of nodes and a PageTree
91 | perform a multicut of the complete graph of nodes separating
92 | the nodes that are descendant/ascendants of each other according to the
93 | PageTree"""
94 | index = {node: i for i, node in enumerate(nodes)}
95 | separate = [(index[i], index[j])
96 | for i, j in must_separate(nodes, page_tree)]
97 | if separate:
98 | D = D[nodes, :][:, nodes].copy()
99 | for i, j in separate:
100 | D[i, j] = D[j, i] = np.inf
101 | E = _ker.min_dist_complete(D)
102 | eps = min(E[i,j] for i, j in separate)
103 | components = nx.connected_components(
104 | nx.Graph((nodes[i], nodes[j])
105 | for (i, j) in zip(*np.nonzero(E < eps))))
106 | else:
107 | components = [nodes]
108 | return components
109 |
110 |
111 | def labels_to_clusters(labels):
112 | """Given a an assignment of cluster label to each item return the a list
113 | of sets, where each set is a cluster"""
114 | return [np.flatnonzero(labels==label) for label in range(np.max(labels)+1)]
115 |
116 |
117 | def clusters_to_labels(clusters, n_samples):
118 | """Given a list with clusters label each item"""
119 | labels = np.repeat(-1, n_samples)
120 | for i, c in enumerate(clusters):
121 | for j in c:
122 | labels[j] = i
123 | return labels
124 |
125 |
126 | def boost(d, k=2):
127 | """Given a distance between 0 and 1 make it more nonlinear"""
128 | return 1 - (1 - d)**k
129 |
130 |
131 | class TreeClustering(object):
132 | def __init__(self, page_tree):
133 | self.page_tree = page_tree
134 |
135 | def fit_predict(self, X, min_cluster_size=6, d1=1.0, d2=0.1, eps=1.0,
136 | separate_descendants=True):
137 | """Fit the data X and label each sample.
138 |
139 | X is a distance matrix of size (n_samples, n_samples), typically computed
140 | from a kernel. It is combined with the tree size distance (weighted by d1 and d2)
141 | and DBSCAN applied to the result. Finally, we enforce the constraint
142 | that a node cannot be inside the same cluster as any of its ascendants.
143 |
144 | Parameters
145 | ---------
146 | X : np.array
147 | Distance matrix computed from the kernel
148 | min_cluster_size : int
149 | Parameter to DBSCAN
150 | eps : int
151 | Parameter to DBSCAN
152 | d1 : float
153 | Weight of distance computed from X
154 | d2 : float
155 | Weight of distance computed from tree size
156 | separate_descendants : bool
157 | True to enforce the cannot-link constraints
158 |
159 | Returns
160 | -------
161 | np.array
162 | A label for each sample
163 | """
164 | Y = boost(tree_size_distance(self.page_tree), 2)
165 | D = d1*X + d2*Y
166 | clt = sklearn.cluster.DBSCAN(
167 | eps=eps, min_samples=min_cluster_size, metric='precomputed')
168 | self.clusters = []
169 | for c in labels_to_clusters(clt.fit_predict(D)):
170 | if len(c) >= min_cluster_size:
171 | if separate_descendants:
172 | self.clusters += filter(lambda x: len(x) >= min_cluster_size,
173 | cut_descendants(D, c, self.page_tree))
174 | else:
175 | self.clusters.append(c)
176 | self.labels = clusters_to_labels(self.clusters, D.shape[0])
177 | return self.labels
178 |
179 |
180 | def cluster(page_tree, K, eps=1.2, d1=1.0, d2=0.1, separate_descendants=True):
181 | """Asign to each node in the tree a cluster label.
182 |
183 | Returns
184 | -------
185 | np.array
186 | For each node a label id. Label ID -1 means that the node
187 | is an outlier (it isn't part of any cluster).
188 | """
189 | return TreeClustering(page_tree).fit_predict(
190 | kernel_to_distance(normalize_kernel(K)),
191 | eps=eps, d1=d1, d2=d2,
192 | separate_descendants=separate_descendants)
193 |
194 |
195 | def clusters_tournament(ptree, labels):
196 | """A cluster 'wins' if some node inside the cluster is the ascendant
197 | of another node in the other cluster"""
198 | L = np.max(labels) + 1
199 | T = np.zeros((L, L), dtype=int)
200 | for i, m in enumerate(ptree.match):
201 | li = labels[i]
202 | if li != -1:
203 | for j in range(max(i + 1, m)):
204 | lj = labels[j]
205 | if lj != -1:
206 | T[li, lj] += 1
207 | return T
208 |
209 |
210 | def _make_acyclic(T, labels):
211 | """See https://en.wikipedia.org/wiki/Feedback_arc_set"""
212 | n = T.shape[0]
213 | if n == 0:
214 | return []
215 | i = np.random.randint(0, n)
216 | L = []
217 | R = []
218 | for j in range(n):
219 | if j != i:
220 | if T[i, j] > T[j, i]:
221 | R.append(j)
222 | else:
223 | L.append(j)
224 | return (make_acyclic(T[L, :][:, L], labels[L]) +
225 | [labels[i]] +
226 | make_acyclic(T[R, :][:, R], labels[R]))
227 |
228 |
229 | def make_acyclic(T, labels=None):
230 | """Tiven a tournament T, try to rank the clusters in a consisten
231 | way"""
232 | if labels is None:
233 | labels = np.arange(T.shape[0])
234 | return _make_acyclic(T, labels)
235 |
236 |
237 | def separate_clusters(ptree, labels):
238 | """Make sure no tree node is contained in two different clusters"""
239 | ranking = make_acyclic(clusters_tournament(ptree, labels))
240 | clusters = labels_to_clusters(labels)
241 | labels = labels.copy()
242 | for i in ranking:
243 | for node in clusters[i]:
244 | labels[node+1:max(node+1, ptree.match[node])] = -1
245 | return labels
246 |
247 |
248 | def score_cluster(ptree, cluster, k=4):
249 | """Given a cluster assign a score. The higher the score the more probable
250 | that the cluster truly represents a repeating item"""
251 | if len(cluster) <= 1:
252 | return 0.0
253 | D = sklearn.neighbors.kneighbors_graph(
254 | ptree.distance[cluster, :][:, cluster], min(len(cluster) - 1, k),
255 | metric='precomputed', mode='distance')
256 | score = 0.0
257 | for i, j in zip(*D.nonzero()):
258 | a = cluster[i]
259 | b = cluster[j]
260 | si = max(a+1, ptree.match[a]) - a
261 | sj = max(b+1, ptree.match[b]) - b
262 | score += min(si, sj)/D[i, j]**2
263 | return score
264 |
265 |
266 | def some_root_has_label(labels, item, label):
267 | for root in item:
268 | if labels[root] == label:
269 | return True
270 | return False
271 |
272 |
273 | def extract_items_with_label(ptree, labels, label_to_extract):
274 | """Extract all items inside the labeled PageTree that are marked or have
275 | a sibling that is marked with label_to_extract.
276 |
277 | Returns
278 | -------
279 | List[tuple]
280 | Where each tuple is the roots of the extracted subtrees.
281 | """
282 | items = []
283 | i = 0
284 | while i < len(labels):
285 | children = ptree.children(i)
286 | if np.any(labels[children] == label_to_extract):
287 | first = None
288 | item = []
289 | for c in children:
290 | m = labels[c]
291 | if m != -1:
292 | if first is None:
293 | first = m
294 | elif m == first:
295 | if item:
296 | items.append(tuple(item))
297 | item = []
298 | # Only append tags as item roots
299 | if isinstance(ptree.page.parsed_body[ptree.index[c]], hp.HtmlTag):
300 | item.append(c)
301 | if item:
302 | items.append(tuple(item))
303 | i = ptree.match[i]
304 | else:
305 | i += 1
306 | return filter(lambda item: some_root_has_label(labels, item, label_to_extract),
307 | items)
308 |
309 |
310 | def vote(sequence):
311 | """Return the most frequent item in sequence"""
312 | return max(collections.Counter(sequence).iteritems(),
313 | key=lambda kv: kv[1])[0]
314 |
315 |
316 | def regularize_item_length(ptree, labels, item_locations, max_items_cut_per=0.33):
317 | """Make sure all item locations have the same number of roots"""
318 | if not item_locations:
319 | return item_locations
320 | min_item_length = vote(len(item_location) for item_location in item_locations)
321 | cut_items = sum(len(item_location) > min_item_length
322 | for item_location in item_locations)
323 | if cut_items > max_items_cut_per*len(item_locations):
324 | return []
325 | item_locations = filter(lambda x: len(x) >= min_item_length,
326 | item_locations)
327 | if cut_items > 0:
328 | label_count = collections.Counter(
329 | labels[root] for item_location in item_locations
330 | for root in item_location)
331 | new_item_locations = []
332 | for item_location in item_locations:
333 | if len(item_location) > min_item_length:
334 | scored = sorted(
335 | ((label_count[labels[root]], root) for root in item_location),
336 | reverse=True)
337 | keep = set(x[1] for x in scored[:min_item_length])
338 | new_item_location = tuple(
339 | root
340 | for root in item_location
341 | if root in keep)
342 | else:
343 | new_item_location = item_location
344 | new_item_locations.append(new_item_location)
345 | else:
346 | new_item_locations = item_locations
347 | return new_item_locations
348 |
349 |
350 | def extract_items(ptree, labels, min_n_items=6):
351 | """Extract the repeating items.
352 |
353 | The algorithm to extract the repeating items goes as follows:
354 | 1. Determine the label that covers most children on the page
355 | 2. If a node with that label has siblings, extract the siblings too,
356 | even if they have other labels.
357 |
358 | The output is a list of lists of items
359 | """
360 | labels = separate_clusters(ptree, labels)
361 | scores = sorted(
362 | enumerate(score_cluster(ptree, cluster)
363 | for cluster in labels_to_clusters(labels)),
364 | key=lambda kv: kv[1], reverse=True)
365 | items = []
366 | for label, score in scores:
367 | cluster = extract_items_with_label(ptree, labels, label)
368 | if len(cluster) < min_n_items:
369 | continue
370 | t = regularize_item_length(ptree, labels, cluster)
371 | if len(t) >= min_n_items:
372 | items.append(t)
373 | return items
374 |
375 |
376 | def path_distance(path_1, path_2):
377 | """Compute the prefix distance between the two paths.
378 |
379 | >>> p1 = [1, 0, 3, 4, 5, 6]
380 | >>> p2 = [1, 0, 2, 2, 2, 2, 2, 2]
381 | >>> print path_distance(p1, p2)
382 | 6
383 | """
384 | d = max(len(path_1), len(path_2))
385 | for a, b in zip(path_1, path_2):
386 | if a != b:
387 | break
388 | d -= 1
389 | return d
390 |
391 |
392 | def pairwise_path_distance(path_seq_1, path_seq_2):
393 | """Compute all pairwise distances between paths in path_seq_1 and
394 | path_seq_2"""
395 | N1 = len(path_seq_1)
396 | N2 = len(path_seq_2)
397 | D = np.zeros((N1, N2))
398 | for i in range(N1):
399 | q1 = path_seq_1[i]
400 | for j in range(N2):
401 | D[i, j] = path_distance(q1, path_seq_2[j])
402 | return D
403 |
404 |
405 | def extract_path_seq_1(ptree, item):
406 | paths = []
407 | for root in item:
408 | for path in ptree.prefixes_at(root):
409 | paths.append((path[0], path))
410 | return paths
411 |
412 |
413 | def extract_path_seq(ptree, items):
414 | all_paths = []
415 | for item in items:
416 | paths = extract_path_seq_1(ptree, item)
417 | all_paths.append(paths)
418 | return all_paths
419 |
420 |
421 | def map_paths_1(func, paths):
422 | return [(leaf, [func(node) for node in path])
423 | for leaf, path in paths]
424 |
425 |
426 | def map_paths(func, paths):
427 | return [map_paths_1(func, path_set) for path_set in paths]
428 |
429 |
430 | def find_cliques(G, min_size):
431 | """Find all cliques in G above a given size.
432 |
433 | If a node is part of a larger clique it is deleted from the smaller ones.
434 |
435 | Returns
436 | -------
437 | dict
438 | Mapping nodes to clique ID
439 | """
440 | cliques = []
441 | for K in nx.find_cliques(G):
442 | if len(K) >= min_size:
443 | cliques.append(set(K))
444 | cliques.sort(reverse=True, key=lambda x: len(x))
445 | L = set()
446 | for K in cliques:
447 | K -= L
448 | L |= K
449 | cliques = [J for J in cliques if len(J) >= min_size]
450 | node_to_clique = {}
451 | for i, K in enumerate(cliques):
452 | for node in K:
453 | if node not in node_to_clique:
454 | node_to_clique[node] = i
455 | return node_to_clique
456 |
457 |
458 | def match_graph(all_paths):
459 | """Build a graph where n1 and n2 share an edge if they have
460 | been matched using DTW"""
461 | G = nx.Graph()
462 | for path_set_1, path_set_2 in itertools.combinations(all_paths, 2):
463 | n1, p1 = zip(*path_set_1)
464 | n2, p2 = zip(*path_set_2)
465 | D = pairwise_path_distance(p1, p2)
466 | DTW = dtw.from_distance(D)
467 | a1, a2 = dtw.path(DTW)
468 | m = dtw.match(a1, a2, D)
469 | for i, j in enumerate(m):
470 | if j != -1:
471 | G.add_edge(n1[i], n2[j])
472 | return G
473 |
474 |
475 | def align_items(ptree, items, node_to_clique):
476 | n_cols = max(node_to_clique.values()) + 1
477 | table = np.zeros((len(items), n_cols), dtype=int) - 1
478 | for i, item in enumerate(items):
479 | for root in item:
480 | for c in range(root, max(root + 1, ptree.match[root])):
481 | try:
482 | table[i, node_to_clique[c]] = c
483 | except KeyError:
484 | pass
485 | return table
486 |
487 |
488 | def extract_item_table(ptree, items, labels):
489 | return align_items(
490 | ptree,
491 | items,
492 | find_cliques(
493 | match_graph(map_paths(
494 | lambda x: labels[x], extract_path_seq(ptree, items))),
495 | 0.5*len(items))
496 | )
497 |
498 |
499 | ItemTable = collections.namedtuple('ItemTable', ['items', 'cells'])
500 |
501 |
502 | class ItemExtract(object):
503 | def __init__(self, page_tree, k_max_depth=2, k_decay=0.5,
504 | c_eps=1.2, c_d1=1.0, c_d2=1.0, separate_descendants=True):
505 | """Perform all extraction operations in sequence.
506 |
507 | Parameters
508 | ----------
509 | k_max_depth : int
510 | Parameter to kernel computation
511 | k_decay : float
512 | Parameter to kernel computation
513 | c_eps : float
514 | Parameter to clustering
515 | c_d1 : float
516 | Parameter to clustering
517 | c_d2 : float
518 | Parameter to clustering
519 | separate_descendants : bool
520 | Parameter to clustering
521 | """
522 | self.page_tree = page_tree
523 | self.kernel = _ker.kernel(page_tree, max_depth=k_max_depth, decay=k_decay)
524 | self.labels = cluster(
525 | page_tree, self.kernel, eps=c_eps, d1=c_d1, d2=c_d2,
526 | separate_descendants=separate_descendants)
527 | self.items = extract_items(page_tree, self.labels)
528 | self.tables = [ItemTable(items, extract_item_table(page_tree, items, self.labels))
529 | for items in self.items]
530 | self.table_fragments = [
531 | ItemTable([page_tree.fragment_index(np.array(root)) for root in item],
532 | page_tree.fragment_index(fields))
533 | for item, fields in self.tables]
534 |
--------------------------------------------------------------------------------
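To see the relation between `normalize_kernel` and `kernel_to_distance` stated in the docstrings above, here is a tiny check with a toy kernel (a sketch; it assumes the package is installed so that `aile.kernel` and its compiled dependencies are importable):

    import numpy as np
    from aile import kernel as ker

    # Toy kernel: dot products of three 2-d vectors
    X = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
    K = X.dot(X.T)
    # kernel_to_distance recovers the Euclidean distances |x_i - x_j|
    D = ker.kernel_to_distance(K)
    E = np.sqrt(((X[:, None, :] - X[None, :, :])**2).sum(axis=-1))
    assert np.allclose(D, E)
    # normalize_kernel rescales K so that its diagonal is 1 (cosine similarity here)
    print ker.normalize_kernel(K)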
/aile/ptree.py:
--------------------------------------------------------------------------------
1 | import itertools
2 |
3 | import numpy as np
4 | import scrapely.htmlpage as hp
5 |
6 |
7 | def match_fragments(fragments, max_backtrack=20):
8 | """Find the closing fragment for every fragment.
9 |
10 | Returns
11 | -------
12 | numpy.array
13 | With as many elements as fragments. If the fragment has no
14 | closing pair then the array contains -1 at that position
15 | otherwise it contains the index of the closing pair
16 | """
17 | match = np.repeat(-1, len(fragments))
18 | stack = []
19 | for i, fragment in enumerate(fragments):
20 | if isinstance(fragment, hp.HtmlTag):
21 | if fragment.tag_type == hp.HtmlTagType.OPEN_TAG:
22 | stack.append((i, fragment))
23 | elif (fragment.tag_type == hp.HtmlTagType.CLOSE_TAG):
24 | if max_backtrack is None:
25 | max_j = len(stack)
26 | else:
27 | max_j = min(max_backtrack, len(stack))
28 | for j in range(1, max_j + 1):
29 | last_i, last_tag = stack[-j]
30 | if (last_tag.tag == fragment.tag):
31 | match[last_i] = i
32 | match[i] = last_i
33 | stack[-j:] = []
34 | break
35 | return match
36 |
37 |
38 | def is_tag(fragment):
39 | """Check if a fragment is also an HTML tag"""
40 | return isinstance(fragment, hp.HtmlTag)
41 |
42 |
43 | def get_class(fragment):
44 | """Return a set with class attributes for a given fragment"""
45 | if is_tag(fragment):
46 | return frozenset((fragment.attributes.get('class') or '').split())
47 | else:
48 | return frozenset()
49 |
50 |
51 | class TreeNode(object):
52 | __slots__ = ('tag', 'class_attr')
53 |
54 | def __init__(self, tag, class_attr=frozenset()):
55 | self.tag = tag
56 | self.class_attr = class_attr
57 |
58 | def __hash__(self):
59 | return hash(self.tag)
60 |
61 | def __eq__(self, other):
62 | return self.tag == other.tag
63 |
64 | def __str__(self):
65 | return self.__repr__().encode('ascii', 'backslashreplace')
66 |
67 | def __repr__(self):
68 | s = unicode(self.tag)
69 | if self.class_attr:
70 | s += u'['
71 | s += u','.join(self.class_attr)
72 | s += u']'
73 | return s
74 |
75 |
76 | def non_empty_text(page, fragment):
77 | return fragment.is_text_content and\
78 | page.body[fragment.start:fragment.end].strip()
79 |
80 |
81 | def fragment_to_node(page, fragment):
82 | """Convert a fragment to a node inside a tree where we are going
83 | to compute the kernel"""
84 | if non_empty_text(page, fragment):
85 | return TreeNode('[T]')
86 | elif (is_tag(fragment) and
87 | fragment.tag_type != hp.HtmlTagType.CLOSE_TAG):
88 | return TreeNode(fragment.tag, get_class(fragment))
89 | return None
90 |
91 |
92 | def tree_nodes(page):
93 | """Return a list of fragments from page where empty text has been deleted"""
94 | for i, fragment in enumerate(page.parsed_body):
95 | node = fragment_to_node(page, fragment)
96 | if node is not None:
97 | yield (i, node)
98 |
99 |
100 | class PageTree(object):
101 | def __init__(self, page):
102 | self.page = page
103 | index, self.nodes = zip(*tree_nodes(page))
104 | self.index = np.array(index)
105 | self.n_nodes = len(self.nodes)
106 | self.reverse_index = np.repeat(-1, len(page.parsed_body))
107 | for i, idx in enumerate(self.index):
108 | self.reverse_index[idx] = i
109 | match = match_fragments(page.parsed_body)
110 | self.match = np.repeat(-1, self.n_nodes)
111 | self.parents = np.repeat(-1, self.n_nodes)
112 | for i, m in enumerate(match):
113 | j = self.reverse_index[i]
114 | if j >= 0:
115 | if m >= 0:
116 | k = -1
117 | while k < 0:
118 | k = self.reverse_index[m]
119 | m += 1
120 | if m == len(match):
121 | k = len(self.match)
122 | break
123 | assert k >= 0
124 | else:
125 | k = j # no children
126 | self.match[j] = k
127 | for i, m in enumerate(self.match):
128 | self.parents[i+1:m] = i
129 |
130 | self.n_children = np.zeros((self.n_nodes,), dtype=int)
131 | self.i_child = np.zeros((self.n_nodes,), dtype=int)
132 | for i, p in enumerate(self.parents):
133 | if p > -1:
134 | self.i_child[i] = self.n_children[p]
135 | self.n_children[p] += 1
136 | self.max_childs = np.max(self.n_children)
137 |
138 | self.distance = np.ones((self.n_nodes, self.n_nodes), dtype=int)
139 | for i in range(self.n_nodes - 1, -1, -1):
140 | self.distance[i, i] = 0
141 | for a, b in itertools.combinations(self.children(i), 2):
142 | for j in range(a, max(a + 1, self.match[a])):
143 | for k in range(b, max(b + 1, self.match[b])):
144 | self.distance[j, k] = self.distance[j, a] + 2 + self.distance[b, k]
145 | self.distance[k, j] = self.distance[j, k]
146 |
147 | def __len__(self):
148 | """Number of nodes in tree"""
149 | return len(self.index)
150 |
151 | def children(self, i):
152 | """An array with the indices of the direct children of node 'i'"""
153 | return i + 1 + np.flatnonzero(self.parents[i+1:max(i+1, self.match[i])] == i)
154 |
155 | def children_matrix(self, max_childs=None):
156 | """A matrix of shape (len(tree), max_childs) where row 'i' contains the
157 | children of node 'i'"""
158 | if max_childs is None:
159 | max_childs = self.max_childs
160 | N = len(self.parents)
161 | C = np.repeat(-1, N*max_childs).reshape(N, max_childs)
162 | for i in range(N - 1, -1, -1):
163 | p = self.parents[i]
164 | if p >= 0:
165 | for j in range(max_childs):
166 | if C[p, j] == -1:
167 | C[p, j] = i
168 | break
169 | return C
170 |
171 | def siblings(self, i):
172 | """Siblings of node 'i'"""
173 | p = self.parents[i]
174 | if p != -1:
175 | return self.children(p)
176 | else:
177 | return np.flatnonzero(self.parents == -1)
178 |
179 | def prefix(self, i, stop_at=-1):
180 | """A path from 'i' going upwards up to 'stop_at'"""
181 | path = []
182 | p = i
183 | while p >= stop_at and p != -1:
184 | path.append(p)
185 | p = self.parents[p]
186 | return path
187 |
188 | def prefixes_at(self, i):
189 | """A list of paths going upwards that start at a descendant of 'i' and
190 | end at 'i'"""
191 | paths = []
192 | for j in range(i, max(i+1, self.match[i])):
193 | paths.append(self.prefix(j, i))
194 | return paths
195 |
196 | def tree_size(self):
197 | """Return an array where the i-th entry is the size of subtree 'i'"""
198 | r = np.arange(len(self.match))
199 | s = r + 1
200 | return np.where(s > self.match, s, self.match) - r
201 |
202 | def fragment_index(self, tree_index):
203 | """Convert from tree node numbering to original fragment numbers"""
204 | return np.where(
205 | tree_index > 0, self.index[tree_index], -1)
206 |
207 | def is_descendant(self, parent, descendant):
208 | return descendant >= parent and \
209 | descendant < max(parent + 1, self.match[parent])
210 |
211 | def common_ascendant(self, nodes):
212 | s = set(range(self.n_nodes))
213 | for node in nodes:
214 | s &= set(self.prefix(node))
215 | return max(s) if s else -1
216 |
--------------------------------------------------------------------------------
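A short sketch of how `PageTree` can be explored on a hand-written page. The HTML snippet is made up; it assumes scrapely is installed and that `scrapely.htmlpage.HtmlPage` is built from a unicode body as in the rest of the repo:

    import scrapely.htmlpage as hp
    from aile import ptree

    body = u"""
    <html><body><ul>
    <li class="item">A</li>
    <li class="item">B</li>
    </ul></body></html>
    """
    page = hp.HtmlPage(url='http://example.com', body=body)
    tree = ptree.PageTree(page)
    print len(tree)         # number of nodes (open tags and non-empty text)
    print tree.parents      # index of each node's parent (-1 for roots)
    print tree.children(0)  # direct children of the first node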
/demo1.py:
--------------------------------------------------------------------------------
1 | import time
2 | import sys
3 | import codecs
4 | import cgi
5 |
6 | import scrapely.htmlpage as hp
7 | import numpy as np
8 |
9 | import aile.kernel
10 | import aile.ptree
11 |
12 |
13 | def annotate(page, labels, out_path="annotated.html"):
14 | match = aile.ptree.match_fragments(page.parsed_body)
15 | with codecs.open(out_path, 'w', encoding='utf-8') as out:
16 | out.write("""
17 |
18 |
19 |