├── .gitignore ├── LICENSE ├── README.md ├── __init__.py ├── notebooks ├── HowTo.ipynb └── __init__.py ├── setup.py ├── tests ├── __init__.py └── test_textsplit.py └── textsplit ├── __init__.py ├── algorithm.py └── tools.py /.gitignore: -------------------------------------------------------------------------------- 1 | wvtool.egg-info/ 2 | **/*.pyc 3 | **/.ipynb_checkpoints/ 4 | **/data 5 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2017 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Introduction 2 | This library contains simple functionality to tackle the problem of segmenting 3 | documents into coherent parts. Imagine you don't have a good paragraph 4 | annotation in your documents, as is often the case for scraped PDFs or HTML 5 | documents. For NLP tasks you want to split them at points where the topic 6 | changes. Good results have been achieved using topic representations, but they 7 | involve a further step of topic modeling which is quite domain dependent. This 8 | approach uses only word embeddings which are assumed to be less domain specific. 9 | See [https://arxiv.org/pdf/1503.05543.pdf] for an overview and an approach very 10 | similar to the one presented here. 11 | 12 | 13 | The algorithm uses word embeddings to find a segmentation where the splits are 14 | chosen such that the segments are coherent. This coherence can be described as 15 | accumulated weighted cosine similarity of the words of a segment to the mean 16 | vector of that segment. More formally, segments are chosen so as to maximize the 17 | quantity |v|, where v is a segment vector and |.| denotes the l2-norm. The 18 | accumulated weighted cosine similarity turns up by a simple transformation: 19 | |v| = 1/|v| \langle v, v \rangle = \langle v/|v|, \sum_i w_i \rangle = \sum_i \langle v/|v|, w_i \rangle = \sum_i |w_i| \langle v/|v|, w_i/|w_i| \rangle, 20 | where v = \sum_i w_i is the definition of the segment vector from word vectors 21 | w_i. The expansion gives a good intuition of what we try to achieve. As we 22 | usually compare word embeddings with cosine similarity, the last scalar product 23 | \langle v/|v|, w_i/|w_i| \rangle is just the cosine similarity of a word w_i to the segment 24 | vector v. The weighting with the length of w_i suppresses frequent noise words, 25 | which typically have shorter vectors. 
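
The identity behind this paragraph (the norm of a segment vector equals the sum, over its words, of each word vector's length times its cosine similarity to the segment vector) can be checked numerically. The following is only a minimal sketch in plain Python; the three toy vectors are made-up numbers rather than real embeddings, and the helper names `norm` and `cosine` are chosen here for illustration.

```python
import math

def norm(vec):
    """Euclidean (l2) length of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in vec))

def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    return sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))

# Toy "word vectors" of one segment (made-up numbers, not real embeddings).
words = [[1.0, 2.0, 0.5], [0.8, 1.5, 0.1], [0.2, 0.1, 0.9]]

# Segment vector v = sum_i w_i, summed component-wise.
segment = [sum(column) for column in zip(*words)]

# Right-hand side of the identity: sum_i |w_i| * cos(w_i, v).
weighted_cosines = sum(norm(w) * cosine(w, segment) for w in words)

# Both printed numbers agree up to floating point rounding.
print(norm(segment), weighted_cosines)
```

Words that point in the same direction as the segment vector contribute their full length, while off-topic or short words contribute little, which is the intuition the paragraph describes.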
26 | 27 | This leads to the interpretation that coherence corresponds to segment vector 28 | length, in the sense that two segment vectors of the same length contain the same 29 | amount of information. This interpretation of course only captures 30 | information that we are given as input by means of the word embeddings, but it 31 | serves as an abstraction. 32 | 33 | # Formalization 34 | 35 | To optimize for segment vector length |v|, we look for a sequence of split 36 | positions such that the sum of l2-norms of the segment vectors formed by summing 37 | the words between the splits is maximal. Given this objective without 38 | constraints, the optimal solution is to split the document between every two 39 | subsequent words (triangle inequality). We have to impose some limit on the 40 | granularity of the segmentation to get useful results. This is done by a penalty 41 | for every split made, which counts against the vector norms, i.e. is subtracted 42 | from the sum of vector norms. 43 | 44 | Let Seg := {(0 = t_0 < t_1 < ... < t_n = L) | t_i natural number} where L is a 45 | document's length. A segment [a, b[ comprises the words at positions a, a+1, ..., 46 | b-1. Let l(j, k) := |\sum_{i=j}^{k-1} w_i| denote the norm of the segment vector of [j, k[. We 47 | optimize the function f mapping elements of Seg to the real numbers with 48 | f: (t_0, ..., t_n) \mapsto \sum_{i=0}^{n-1} l(t_i, t_{i+1}) - (n-1) * penalty, i.e. the sum of the n segment norms minus one penalty for each of the n-1 splits. 49 | 50 | # Algorithms 51 | 52 | There are two variants, a greedy one that is fast and a dynamic programming approach 53 | that computes the optimal segmentation. Both depend on a penalty hyperparameter, 54 | which defines the granularity of the split. 55 | 56 | ## Greedy 57 | Split the text iteratively at the position where the gain is highest until this 58 | gain would be below a given penalty threshold. The gain is the sum of norms of 59 | the left and right segments minus the norm of the segment that is to be split. 
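
The gain of a single greedy step can be illustrated without the library. The following is a minimal sketch in plain Python, not the implementation in `textsplit/algorithm.py`; the helper names `vec_sum` and `split_gain`, the toy document and the penalty value are assumptions made for this example.

```python
import math

def norm(vec):
    """Euclidean (l2) length of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in vec))

def vec_sum(vectors):
    """Component-wise sum of a non-empty list of vectors."""
    return [sum(column) for column in zip(*vectors)]

def split_gain(vectors, pos):
    """Norm of the left part plus norm of the right part minus norm of the whole."""
    left, right = vectors[:pos], vectors[pos:]
    return norm(vec_sum(left)) + norm(vec_sum(right)) - norm(vec_sum(vectors))

# Toy document: three "words" on one topic followed by three on another.
doc = [[1.0, 0.0]] * 3 + [[0.0, 1.0]] * 3

# Gain of every admissible split position inside the single initial segment.
gains = {pos: split_gain(doc, pos) for pos in range(1, len(doc))}
best = max(gains, key=gains.get)

penalty = 1.0  # hypothetical threshold; in practice chosen per corpus
accept = gains[best] >= penalty
print(best, round(gains[best], 3), accept)
```

In this toy document the highest gain is at position 3, the topic boundary, and every gain is non-negative by the triangle inequality. A greedy run would repeat this step on the resulting segments until the best remaining gain drops below the penalty.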
60 | 61 | ## Optimal (Dynamic Programming) 62 | Iteratively construct a data structure storing the results of optimally 63 | splitting a prefix of the document. This results in a matrix storing a score 64 | for making a segment from position i to j, given an optimal segmentation up to i. 65 | 66 | # Tools 67 | 68 | ## Penalty hyperparameter choice 69 | The greedy implementation does not require the penalty parameter; it can instead be 70 | run with a limit on the number of splits (`max_splits`). This is leveraged by the `get_penalty` 71 | function to approximately determine a penalty parameter for a desired average 72 | segment length computed over a set of documents. 73 | 74 | ## Measure accuracy of segmentation against reference 75 | To measure the accuracy of an algorithm against a given reference segmentation, 76 | `P_k` is a commonly used metric, described e.g. in the paper linked above. 77 | 78 | ## Apply segmentation definition to document 79 | The function `get_segments` simply applies a segmentation determined by one of 80 | the algorithms to e.g. the sentences of a text used when generating the 81 | segmentation. 82 | 83 | # Usage 84 | 85 | ## Input 86 | The algorithms are fed a matrix `docmat` containing vectors representing the 87 | content of a text. These vectors are supposed to have cosine similarity as a 88 | natural similarity measure and length roughly corresponding to the content 89 | length of a text particle. Particles could be words, in which case word2vec 90 | embeddings are a good choice as vectors. The width of `docmat` is the embedding 91 | dimension and the height the number of particles. 92 | 93 | ## Split along sentence borders 94 | If you want to split text into paragraphs, you most likely already have a good 95 | idea of what potential sentence borders are. It makes sense not to give the word 96 | vectors as input but sentence vectors formed by e.g. the sum of word vectors, as 97 | is common practice. 
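
Forming sentence vectors as sums of word vectors can be sketched as follows. This is a plain Python illustration under stated assumptions: the `embeddings` table, the dimension `DIM`, whitespace tokenization and the helper `sentence_vector` are all hypothetical; real vectors would come from word2vec or a similar model.

```python
# Hypothetical toy embedding table with made-up numbers.
embeddings = {
    "cats": [0.9, 0.1],
    "purr": [0.8, 0.2],
    "stocks": [0.1, 0.9],
    "fell": [0.2, 0.7],
}
DIM = 2  # embedding dimension of the toy table

def sentence_vector(sentence):
    """Sum the vectors of known words; unknown words contribute a zero vector."""
    total = [0.0] * DIM
    for token in sentence.lower().split():
        for k, value in enumerate(embeddings.get(token, [0.0] * DIM)):
            total[k] += value
    return total

sentences = ["Cats purr", "Stocks fell", "Unknown words only"]

# One row per sentence: height = number of particles, width = embedding dimension.
docmat = [sentence_vector(s) for s in sentences]
print(docmat)
```

A matrix built this way has one row per sentence, so any split position found by the algorithms falls on a sentence border by construction.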
98 | 99 | # Getting Started 100 | 101 | ``` 102 | pip install textsplit 103 | ``` 104 | 105 | In the Jupyter notebook HowTo.ipynb you will find code that demonstrates the use of 106 | the module. It downloads a corpus to train word2vec vectors on and an example 107 | text for segmentation. You will achieve better results if you compute word vectors on 108 | a larger corpus. 109 | -------------------------------------------------------------------------------- /__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chschock/textsplit/27eff3f0a6d32e591db435910caf7fc2145cc6c7/__init__.py -------------------------------------------------------------------------------- /notebooks/HowTo.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [ 8 | { 9 | "name": "stderr", 10 | "output_type": "stream", 11 | "text": [ 12 | "/home/chris/.pyenv/versions/3.7.5/envs/work/lib/python3.7/site-packages/sklearn/externals/joblib/__init__.py:15: DeprecationWarning: sklearn.externals.joblib is deprecated in 0.21 and will be removed in 0.23. Please import this functionality directly from joblib, which can be installed with: pip install joblib. 
If this warning is raised when loading pickled models, you may need to re-serialize those models with scikit-learn 0.21+.\n", 13 | " warnings.warn(msg, category=DeprecationWarning)\n" 14 | ] 15 | } 16 | ], 17 | "source": [ 18 | "import os\n", 19 | "import word2vec\n", 20 | "import pandas as pd\n", 21 | "import numpy as np\n", 22 | "from sklearn.feature_extraction.text import CountVectorizer" 23 | ] 24 | }, 25 | { 26 | "cell_type": "markdown", 27 | "metadata": {}, 28 | "source": [ 29 | "## Download toy corpus for wordvector training and example text" 30 | ] 31 | }, 32 | { 33 | "cell_type": "code", 34 | "execution_count": 2, 35 | "metadata": {}, 36 | "outputs": [], 37 | "source": [ 38 | "corpus_path = './text8' # be sure your corpus is cleaned from punctuation and lowercased\n", 39 | "if not os.path.exists(corpus_path):\n", 40 | " !wget http://mattmahoney.net/dc/text8.zip\n", 41 | " !unzip {corpus_path}\n", 42 | "\n", 43 | "links = {'tale2cities': 'https://www.gutenberg.org/files/98/98-0.txt', # a tale of two cities\n", 44 | " 'siddartha': 'http://www.gutenberg.org/cache/epub/2500/pg2500.txt'} # siddartha\n", 45 | "\n", 46 | "for link in links.values():\n", 47 | " text_path = os.path.basename(link)\n", 48 | " if not os.path.exists(text_path):\n", 49 | " !wget {link}" 50 | ] 51 | }, 52 | { 53 | "cell_type": "markdown", 54 | "metadata": {}, 55 | "source": [ 56 | "## Train wordvectors" 57 | ] 58 | }, 59 | { 60 | "cell_type": "code", 61 | "execution_count": 3, 62 | "metadata": {}, 63 | "outputs": [], 64 | "source": [ 65 | "wrdvec_path = 'wrdvecs.bin'\n", 66 | "if not os.path.exists(wrdvec_path):\n", 67 | " %time word2vec.word2vec(corpus_path, wrdvec_path, cbow=1, iter_=5, hs=1, threads=8, sample='1e-5', window=15, size=200, binary=1)" 68 | ] 69 | }, 70 | { 71 | "cell_type": "code", 72 | "execution_count": 4, 73 | "metadata": {}, 74 | "outputs": [ 75 | { 76 | "name": "stdout", 77 | "output_type": "stream", 78 | "text": [ 79 | "(71291, 200)\n" 80 | ] 81 | } 82 | ], 83 | 
"source": [ 84 | "model = word2vec.load(wrdvec_path)\n", 85 | "wrdvecs = pd.DataFrame(model.vectors, index=model.vocab)\n", 86 | "del model\n", 87 | "print(wrdvecs.shape)" 88 | ] 89 | }, 90 | { 91 | "cell_type": "markdown", 92 | "metadata": {}, 93 | "source": [ 94 | "## get sentence tokenizer" 95 | ] 96 | }, 97 | { 98 | "cell_type": "code", 99 | "execution_count": 5, 100 | "metadata": {}, 101 | "outputs": [], 102 | "source": [ 103 | "from textsplit.tools import SimpleSentenceTokenizer\n", 104 | "sentence_tokenizer = SimpleSentenceTokenizer()" 105 | ] 106 | }, 107 | { 108 | "cell_type": "markdown", 109 | "metadata": {}, 110 | "source": [ 111 | "## run get_penalty and split_optimal" 112 | ] 113 | }, 114 | { 115 | "cell_type": "code", 116 | "execution_count": 6, 117 | "metadata": {}, 118 | "outputs": [], 119 | "source": [ 120 | "%matplotlib inline\n", 121 | "from textsplit.tools import get_penalty, get_segments\n", 122 | "from textsplit.algorithm import split_optimal, split_greedy, get_total" 123 | ] 124 | }, 125 | { 126 | "cell_type": "code", 127 | "execution_count": 7, 128 | "metadata": {}, 129 | "outputs": [ 130 | { 131 | "name": "stdout", 132 | "output_type": "stream", 133 | "text": [ 134 | "penalty 3.02\n", 135 | "8591 sentences, 391 segments, avg 21.97 sentences per segment\n", 136 | "optimal score 51703.94, greedy score 51615.21\n", 137 | "ratio of scores 1.0017\n" 138 | ] 139 | }, 140 | { 141 | "data": { 142 | "image/png": "\n", 143 | "text/plain": [ 144 | "
" 145 | ] 146 | }, 147 | "metadata": { 148 | "needs_background": "light" 149 | }, 150 | "output_type": "display_data" 151 | }, 152 | { 153 | "data": { 154 | "image/png": "\n", 155 | "text/plain": [ 156 | "
" 157 | ] 158 | }, 159 | "metadata": { 160 | "needs_background": "light" 161 | }, 162 | "output_type": "display_data" 163 | } 164 | ], 165 | "source": [ 166 | "# link = links['siddartha']\n", 167 | "link = links['tale2cities']\n", 168 | "segment_len = 30 # segment target length in sentences\n", 169 | "book_path = os.path.basename(link)\n", 170 | "\n", 171 | "with open(book_path, 'rt') as f:\n", 172 | " text = f.read() #.replace('\\n', ' ') # punkt tokenizer handles newlines not so nice\n", 173 | "\n", 174 | "sentenced_text = sentence_tokenizer(text)\n", 175 | "vecr = CountVectorizer(vocabulary=wrdvecs.index)\n", 176 | "\n", 177 | "sentence_vectors = vecr.transform(sentenced_text).dot(wrdvecs)\n", 178 | "\n", 179 | "penalty = get_penalty([sentence_vectors], segment_len)\n", 180 | "print('penalty %4.2f' % penalty)\n", 181 | "\n", 182 | "optimal_segmentation = split_optimal(sentence_vectors, penalty, seg_limit=250)\n", 183 | "segmented_text = get_segments(sentenced_text, optimal_segmentation)\n", 184 | "\n", 185 | "print('%d sentences, %d segments, avg %4.2f sentences per segment' % (\n", 186 | " len(sentenced_text), len(segmented_text), len(sentenced_text) / len(segmented_text)))\n", 187 | "\n", 188 | "with open(book_path + '.seg', 'wt') as f:\n", 189 | " for i, segment_sentences in enumerate(segmented_text):\n", 190 | " segment_str = ' // '.join(segment_sentences)\n", 191 | " gain = optimal_segmentation.gains[i] if i < len(segmented_text) - 1 else 0\n", 192 | " segment_info = ' [%d sentences, %4.3f] ' % (len(segment_sentences), gain) \n", 193 | " print(segment_str + '\\n8<' + '=' * 30 + segment_info + \"=\" * 30, file=f)\n", 194 | "\n", 195 | "greedy_segmentation = split_greedy(sentence_vectors, max_splits=len(optimal_segmentation.splits))\n", 196 | "greedy_segmented_text = get_segments(sentenced_text, greedy_segmentation)\n", 197 | "lengths_optimal = [len(segment) for segment in segmented_text for sentence in segment]\n", 198 | "lengths_greedy = [len(segment) for 
segment in greedy_segmented_text for sentence in segment]\n", 199 | "df = pd.DataFrame({'greedy':lengths_greedy, 'optimal': lengths_optimal})\n", 200 | "df.plot.line(figsize=(18, 3), title='Segment lengths over text')\n", 201 | "df.plot.hist(bins=30, alpha=0.5, figsize=(10, 3), title='Histogram of segment lengths')\n", 202 | "\n", 203 | "totals = [get_total(sentence_vectors, seg.splits, penalty) \n", 204 | " for seg in [optimal_segmentation, greedy_segmentation]]\n", 205 | "print('optimal score %4.2f, greedy score %4.2f' % tuple(totals))\n", 206 | "print('ratio of scores %5.4f' % (totals[0] / totals[1]))" 207 | ] 208 | }, 209 | { 210 | "cell_type": "markdown", 211 | "metadata": {}, 212 | "source": [ 213 | "## Evaluation" 214 | ] 215 | }, 216 | { 217 | "cell_type": "markdown", 218 | "metadata": {}, 219 | "source": [ 220 | "If you look into the written file `book_path`.seg, the cut line (`8<====`) is often at the boundary of a paragraph. The word embeddings computed above are neither very good nor adapted to the text. Every unknown word has a zero vector. Choosing some more or less random vector for unknown words might improve the accuracy, provided those unknown terms appear repeatedly within a section." 
221 | ] 222 | } 223 | ], 224 | "metadata": { 225 | "kernelspec": { 226 | "display_name": "Python 3", 227 | "language": "python", 228 | "name": "python3" 229 | }, 230 | "language_info": { 231 | "codemirror_mode": { 232 | "name": "ipython", 233 | "version": 3 234 | }, 235 | "file_extension": ".py", 236 | "mimetype": "text/x-python", 237 | "name": "python", 238 | "nbconvert_exporter": "python", 239 | "pygments_lexer": "ipython3", 240 | "version": "3.7.5" 241 | } 242 | }, 243 | "nbformat": 4, 244 | "nbformat_minor": 4 245 | } 246 | -------------------------------------------------------------------------------- /notebooks/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chschock/textsplit/27eff3f0a6d32e591db435910caf7fc2145cc6c7/notebooks/__init__.py -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | from setuptools import setup, find_packages 2 | 3 | setup( 4 | name='textsplit', 5 | version=0.5, 6 | description='Segment documents into coherent parts using wordembeddings.', 7 | url='https://github.com/chschock/textsplit', 8 | long_description=open('README.md', 'r').read(), 9 | long_description_content_type='text/markdown', 10 | author='Christoph Schock', 11 | author_email='chschock@gmail.com', 12 | license='MIT', 13 | packages=find_packages(), 14 | zip_safe=False, 15 | install_requires=[ 16 | 'nose>=1.3.7', 17 | 'numpy>=1.13.1', 18 | ], 19 | classifiers=( 20 | 'Programming Language :: Python :: 3.6', 21 | 'License :: OSI Approved :: MIT License', 22 | 'Operating System :: OS Independent', 23 | 'Topic :: Scientific/Engineering :: Artificial Intelligence', 24 | ), 25 | keywords='nlp text segmentation paragraph embeddings', 26 | ) 27 | -------------------------------------------------------------------------------- /tests/__init__.py: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/chschock/textsplit/27eff3f0a6d32e591db435910caf7fc2145cc6c7/tests/__init__.py -------------------------------------------------------------------------------- /tests/test_textsplit.py: -------------------------------------------------------------------------------- 1 | import unittest 2 | import numpy as np 3 | from ..textsplit.algorithm import split_greedy, split_optimal, get_total, get_gains 4 | from ..textsplit.tools import get_penalty, P_k 5 | 6 | DIM = 20 7 | 8 | def getDoc(segment_len, n_seg): 9 | return np.vstack([np.tile(w, (segment_len, 1)) 10 | for w in np.random.random((n_seg, DIM))]) 11 | 12 | 13 | docA = getDoc(20, 10) 14 | penaltyA = get_penalty([docA], 20) # get_penalty is deterministic here 15 | 16 | class TestTextSplit(unittest.TestCase): 17 | 18 | def test_get_penalty(self): 19 | seg = split_greedy(docA, penalty=penaltyA) 20 | self.assertEqual(len(seg.splits), 9) 21 | 22 | def test_split_greedy_penalty(self): 23 | seg = split_greedy(docA, penalty=penaltyA) 24 | self.assertEqual(len(seg.splits), len(seg.gains)) 25 | self.assertGreater(np.percentile(seg.gains, 25), penaltyA) 26 | gains2 = get_gains(docA, seg.splits) 27 | self.assertTrue(all(np.isclose(seg.gains, gains2))) 28 | 29 | def test_split_greedy_max_splits(self): 30 | seg = split_greedy(docA, max_splits=5) 31 | self.assertEqual(len(seg.splits), len(seg.gains)) 32 | self.assertTrue(len(seg.splits) == len(seg.gains) == 5) 33 | 34 | def test_split_greedy_penalty_max_splits(self): 35 | seg = split_greedy(docA, penalty=penaltyA, max_splits=5) 36 | self.assertEqual(len(seg.splits), len(seg.gains)) 37 | self.assertEqual(len(seg.splits), 5) 38 | self.assertGreater(np.percentile(seg.gains, 25), penaltyA) 39 | 40 | def test_split_optimal(self): 41 | seg = split_optimal(docA, penalty=penaltyA) 42 | self.assertEqual(len(seg.splits), len(seg.gains)) 43 | print(len(seg.splits)) 44 | 
self.assertGreater(np.min(seg.gains) + 0.00001, penaltyA) 45 | 46 | def test_split_optimal_vs_greedy(self): 47 | docs = [np.random.random((100, DIM)) for _ in range(100)] 48 | penalty = get_penalty(docs, 10) 49 | for i, doc in enumerate(docs): 50 | seg_o = split_optimal(doc, penalty=penalty) 51 | seg_g = split_greedy(doc, penalty=penalty) 52 | self.assertAlmostEqual(seg_o.total, get_total(doc, seg_o.splits, penalty), places=3) 53 | self.assertAlmostEqual(seg_g.total, get_total(doc, seg_g.splits, penalty), places=3) 54 | self.assertGreaterEqual(seg_o.total + 0.001, seg_g.total) 55 | 56 | def test_split_optimal_with_seg_limit(self): 57 | docs = [np.random.random((100, DIM)) for _ in range(10)] 58 | penalty = get_penalty(docs, 20) 59 | for i, doc in enumerate(docs): 60 | seg = split_optimal(doc, penalty=penalty) 61 | cuts = [0] + seg.splits + [100] 62 | seg2 = split_optimal( 63 | doc, penalty=penalty, seg_limit=np.diff(cuts).max()+1) 64 | self.assertTrue(seg2.optimal) 65 | self.assertEqual(seg.splits, seg2.splits) 66 | self.assertAlmostEqual(seg.total, seg2.total) 67 | 68 | def test_P_k(self): 69 | docs = [np.random.random((100, DIM)) for _ in range(10)] 70 | penalty = get_penalty(docs, 10) 71 | for i, doc in enumerate(docs): 72 | seg_o = split_optimal(doc, penalty=penalty) 73 | seg_g = split_greedy(doc, penalty=penalty) 74 | pk = P_k(seg_o.splits, seg_g.splits, len(doc)) 75 | self.assertGreaterEqual(pk, 0) 76 | self.assertGreaterEqual(1, pk) 77 | -------------------------------------------------------------------------------- /textsplit/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chschock/textsplit/27eff3f0a6d32e591db435910caf7fc2145cc6c7/textsplit/__init__.py -------------------------------------------------------------------------------- /textsplit/algorithm.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from 
numpy.linalg import norm 3 | 4 | from collections import namedtuple 5 | Segmentation = namedtuple('Segmentation', 6 | 'total splits gains min_gain optimal') 7 | 8 | def split_greedy(docmat, penalty=None, max_splits=None): 9 | """ 10 | Iteratively segment a document into segments being greedy about the 11 | next choice. This gives very accurate results on crafted documents, i.e. 12 | artificial concatenations of random documents. 13 | 14 | `penalty` is the minimum quantity a split has to improve the score to be 15 | made. If not given `total` is not computed. 16 | `max_splits` is a limit on the number of splits. 17 | Either `penalty` or `max_splits` have to be given. 18 | 19 | Whenever the iteration reaches the while block the following holds: 20 | `cuts` == splits + [L] where splits are the segment start indices 21 | `segscore` maps all segment start indices to segment vector lengths 22 | `score_l[i]` is the cumulated vector length from the cut left of i to i 23 | `score_r[i]` is the cumulated vector length from i to the cut right of i 24 | `score_out[i]` is the sum of all segscores not including the segment at i 25 | `scores[i]` is the sum of all segment vector lengths if we split at i 26 | 27 | These quantities are repaired after determining a next split from `scores`. 28 | 29 | Returns `total`, `splits`, `gains` where 30 | - `total` is the score diminished by len(splits) * penalty to make it 31 | continuous in the input. It is comparable to the output of split_optimal. 32 | - `splits` is the list of splits 33 | - `gains` is a list of uplift each split contributes vs. leaving it out 34 | 35 | Note: The splitting strategy suggests all resulting splits will have gain at 36 | least `penalty`. This is not the case as new splits can decrease the gain 37 | of others. This can be repaired by blocking positions where a split would 38 | decrease the gain of an existing one to less than `penalty` but is not 39 | implemented here. 
40 | """ 41 | L, dim = docmat.shape 42 | 43 | assert max_splits is not None or (penalty is not None and penalty > 0) 44 | 45 | # norm(cumvecs[j] - cumvecs[i]) == norm(w_i + ... + w_{j-1}) 46 | cumvecs = np.cumsum(np.vstack((np.zeros((1, dim)), docmat)), axis=0) 47 | 48 | # cut[0] seg[0] cut[1] seg[1] ... seg[L-1] cut[L] 49 | cuts = [0, L] 50 | segscore = dict() 51 | segscore[0] = norm(cumvecs[L, :] - cumvecs[0, :], ord=2) 52 | segscore[L] = 0 # corner case, always 0 53 | score_l = norm(cumvecs[:L, :] - cumvecs[0, :], axis=1, ord=2) 54 | score_r = norm(cumvecs[L, :] - cumvecs[:L, :], axis=1, ord=2) 55 | score_out = np.zeros(L) 56 | score_out[0] = -np.inf # forbidden split position 57 | score = score_out + score_l + score_r 58 | 59 | min_gain = np.inf 60 | while True: 61 | split = np.argmax(score) 62 | 63 | if score[split] == - np.inf: 64 | break 65 | 66 | cut_l = max([c for c in cuts if c < split]) 67 | cut_r = min([c for c in cuts if split < c]) 68 | split_gain = score_l[split] + score_r[split] - segscore[cut_l] 69 | if penalty is not None: 70 | if split_gain < penalty: 71 | break 72 | 73 | min_gain = min(min_gain, split_gain) 74 | 75 | segscore[cut_l] = score_l[split] 76 | segscore[split] = score_r[split] 77 | 78 | cuts.append(split) 79 | cuts = sorted(cuts) 80 | 81 | if max_splits is not None: 82 | if len(cuts) >= max_splits + 2: 83 | break 84 | 85 | # differential changes to score arrays 86 | score_l[split:cut_r] = norm( 87 | cumvecs[split:cut_r, :] - cumvecs[split, :], axis=1, ord=2) 88 | score_r[cut_l:split] = norm( 89 | cumvecs[split, :] - cumvecs[cut_l:split, :], axis=1, ord=2) 90 | 91 | # adding following constant not necessary, only for score semantics 92 | score_out += split_gain 93 | score_out[cut_l:split] += segscore[split] - split_gain 94 | score_out[split:cut_r] += segscore[cut_l] - split_gain 95 | score_out[split] = -np.inf 96 | 97 | # update score 98 | score = score_out + score_l + score_r 99 | 100 | cuts = sorted(cuts) 101 | splits = cuts[1:-1] 102 
| if penalty is None: 103 | total = None 104 | else: 105 | total = sum( 106 | norm(cumvecs[l, :] - cumvecs[r, :], ord=2) 107 | for l, r in zip(cuts[: -1], cuts[1:])) - len(splits) * penalty 108 | gains = [] 109 | for beg, cen, end in zip(cuts[:-2], cuts[1:-1], cuts[2:]): 110 | no_split_score = norm(cumvecs[end, :] - cumvecs[beg, :], ord=2) 111 | gains.append(segscore[beg] + segscore[cen] - no_split_score) 112 | 113 | return Segmentation(total, splits, gains, 114 | min_gain=min_gain, optimal=None) 115 | 116 | 117 | def split_optimal(docmat, penalty, seg_limit=None): 118 | """ 119 | Determine the configuration of splits with the highest score, given that 120 | splitting has a cost of `penalty`. `seg_limit` is a limitation on the length 121 | of a segment that saves memory and computation, but gives poor results 122 | should there be no split within the range. 123 | The algorithm is built upon the idea that there is an accumulated score 124 | matrix containing the maximal score of creating a segment (i, j), containing 125 | all words [w_i, ..., w_j] at position i, j. The matrix `acc` is indexed to 126 | contain the first `seg_limit` elements of each row of the score matrix. 127 | `colmax` contains the column maxima of the score matrix. 128 | `ptr` is a backtracking pointer to determine the splits made while 129 | forward accumulating the highest score in the score matrix. 130 | """ 131 | L, dim = docmat.shape 132 | lim = L if seg_limit is None else seg_limit 133 | assert lim > 0 134 | assert penalty > 0 135 | 136 | acc = np.full((L, lim), -np.inf, dtype=np.float32) 137 | colmax = np.full((L,), -np.inf, dtype=np.float32) 138 | ptr = np.zeros(L, dtype=np.int32) 139 | 140 | for i in range(L): 141 | score_so_far = colmax[i-1] if i > 0 else 0. 
142 | 143 | ctxvecs = np.cumsum(docmat[i:i+lim, :], axis=0) 144 | winsz = ctxvecs.shape[0] 145 | score = norm(ctxvecs, axis=1, ord=2) 146 | acc[i, :winsz] = score_so_far - penalty + score 147 | 148 | deltas = np.where(acc[i, :winsz] > colmax[i:i+lim])[0] 149 | js = i + deltas 150 | colmax[js] = acc[i, deltas] 151 | ptr[js] = i 152 | 153 | path = [ptr[-1]] 154 | while path[0] != 0: 155 | path.insert(0, ptr[path[0] - 1]) 156 | 157 | splits = path[1:] 158 | gains = get_gains(docmat, path[1:]) 159 | optimal = all(np.diff([0] + splits + [L]) < lim) 160 | 161 | total = colmax[-1] + penalty 162 | 163 | return Segmentation(total, splits, gains, 164 | min_gain=None, optimal=optimal) 165 | 166 | 167 | def get_total(docmat, splits, penalty): 168 | """ 169 | Compute the total score of a split configuration with given penalty. 170 | """ 171 | L, dim = docmat.shape 172 | cuts = [0] + list(splits) + [L] 173 | cumvecs = np.cumsum(np.vstack((np.zeros((1, dim)), docmat)), axis=0) 174 | return sum( 175 | norm(cumvecs[l, :] - cumvecs[r, :], ord=2) 176 | for l, r in zip(cuts[:-1], cuts[1:])) - len(splits) * penalty 177 | 178 | 179 | def get_gains(docmat, splits, width=None): 180 | """ 181 | Calculate gains of the splits towards the left and right neighbouring 182 | split. 183 | If `width` is given, calculate gains of the splits towards a centered window 184 | of length 2 * `width`. 
185 | """ 186 | gains = [] 187 | L = docmat.shape[0] 188 | for beg, cen, end in zip([0] + splits[:-1], splits, splits[1:] + [L]): 189 | if width is not None and width > 0: 190 | beg, end = max(cen - width, 0), min(cen + width, L) 191 | 192 | slice_l, slice_r, slice_t = [slice(beg, cen), # left context 193 | slice(cen, end), # right context 194 | slice(beg, end)] # total context 195 | 196 | gains.append(norm(docmat[slice_l, :].sum(axis=0), ord=2) + 197 | norm(docmat[slice_r, :].sum(axis=0), ord=2) - 198 | norm(docmat[slice_t, :].sum(axis=0), ord=2)) 199 | return gains 200 | -------------------------------------------------------------------------------- /textsplit/tools.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import re 3 | import random 4 | from .algorithm import split_greedy 5 | 6 | def get_segments(text_particles, segmentation): 7 | """ 8 | Reorganize text particles by aggregating them to arrays described by the 9 | provided `segmentation`. 10 | """ 11 | segmented_text = [] 12 | L = len(text_particles) 13 | for beg, end in zip([0] + segmentation.splits, segmentation.splits + [L]): 14 | segmented_text.append(text_particles[beg:end]) 15 | return segmented_text 16 | 17 | def get_penalty(docmats, segment_len): 18 | """ 19 | Determine penalty for segments having length `segment_len` on average. 20 | This is achieved by stochastically rounding the expected number 21 | of splits per document `max_splits` and taking the minimal split_gain that 22 | occurs in split_greedy given `max_splits`. 
23 | """ 24 | penalties = [] 25 | for docmat in docmats: 26 | avg_n_seg = docmat.shape[0] / segment_len 27 | max_splits = int(avg_n_seg) + (random.random() < avg_n_seg % 1) - 1 28 | if max_splits >= 1: 29 | seg = split_greedy(docmat, max_splits=max_splits) 30 | if seg.min_gain < np.inf: 31 | penalties.append(seg.min_gain) 32 | if len(penalties) > 0: 33 | return np.mean(penalties) 34 | raise ValueError('All documents too short for given segment_len.') 35 | 36 | 37 | def P_k(splits_ref, splits_hyp, N): 38 | """ 39 | Metric to evaluate reference splits against hypothesised splits. 40 | Lower is better. 41 | `N` is the text length. 42 | """ 43 | k = round(N / (len(splits_ref) + 1) / 2 - 1) 44 | ref = np.array(splits_ref, dtype=np.int32) 45 | hyp = np.array(splits_hyp, dtype=np.int32) 46 | 47 | def is_split_between(splits, l, r): 48 | return np.any(np.logical_and(splits - l >= 0, splits - r < 0)) 49 | 50 | acc = 0 51 | for i in range(N-k): 52 | acc += is_split_between(ref, i, i+k) != is_split_between(hyp, i, i+k) 53 | 54 | return acc / (N-k) 55 | 56 | 57 | class SimpleSentenceTokenizer: 58 | 59 | def __init__(self, breaking_chars='.!?'): 60 | assert len(breaking_chars) > 0 61 | self.breaking_chars = breaking_chars 62 | self.prog = re.compile(r".+?[{}]\W+".format(breaking_chars), re.DOTALL) 63 | 64 | def __call__(self, text): 65 | return self.prog.findall(text) 66 | -------------------------------------------------------------------------------- 