├── .gitignore ├── .travis.yml ├── LICENSE ├── README.md ├── SRP ├── SRP.py ├── SRP_files.py ├── __init__.py └── flycheck_SRP.py ├── bld.bat ├── build.sh ├── docs ├── Build Fiction Set.ipynb ├── Classification Using Tensorflow Estimators.ipynb ├── Find Text Lab Books in Hathi.ipynb ├── Hash a corpus of text files into SRP space.ipynb ├── Increasing Speed through batch processing.ipynb ├── Recursive SRP tests.ipynb └── Splitting Ids.ipynb ├── meta.yaml ├── pyproject.toml ├── requirements.txt ├── run_test.sh ├── setup.py ├── tests └── test.py └── utils ├── clean_file.py └── expand_half-precision.py /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | env/ 12 | build/ 13 | develop-eggs/ 14 | dist/ 15 | downloads/ 16 | eggs/ 17 | .eggs/ 18 | lib/ 19 | lib64/ 20 | parts/ 21 | sdist/ 22 | var/ 23 | *.egg-info/ 24 | .installed.cfg 25 | *.egg 26 | 27 | # PyInstaller 28 | # Usually these files are written by a python script from a template 29 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 30 | *.manifest 31 | *.spec 32 | 33 | # Installer logs 34 | pip-log.txt 35 | pip-delete-this-directory.txt 36 | 37 | # Unit test / coverage reports 38 | htmlcov/ 39 | .tox/ 40 | .coverage 41 | .coverage.* 42 | .cache 43 | nosetests.xml 44 | coverage.xml 45 | *,cover 46 | .hypothesis/ 47 | 48 | # Translations 49 | *.mo 50 | *.pot 51 | 52 | # Django stuff: 53 | *.log 54 | local_settings.py 55 | 56 | # Flask instance folder 57 | instance/ 58 | 59 | # Scrapy stuff: 60 | .scrapy 61 | 62 | # Sphinx documentation 63 | docs/_build/ 64 | 65 | # PyBuilder 66 | target/ 67 | 68 | # IPython Notebook 69 | .ipynb_checkpoints 70 | 71 | # pyenv 72 | .python-version 73 | 74 | # celery beat schedule file 75 | celerybeat-schedule 76 | 77 | # dotenv 78 | .env 79 | 80 | # virtualenv 81 | venv/ 82 | ENV/ 83 | 84 | # Spyder project settings 85 | .spyderprojecttests/ 86 | test.bin 87 | .DS_Store 88 | nohup.out 89 | -------------------------------------------------------------------------------- /.travis.yml: -------------------------------------------------------------------------------- 1 | language: python 2 | python: 3 | - "3.6" 4 | - "3.7" 5 | - "3.8" 6 | # command to install dependencies 7 | install: 8 | - pip install -r requirements.txt 9 | - pip install . 10 | # command to run tests 11 | script: cd tests && python test.py 12 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT LICENSE 2 | 3 | Copyright (c) 2016-2021 Benjamin Schmidt 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 
14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # pySRP 2 | 3 | Python module implementing Stable Random Projections, as described in 4 | [Cultural Analytics Vol. 1, Issue 2 (October 04, 2018): Stable Random Projection: Lightweight, General-Purpose Dimensionality Reduction for Digitized Libraries](https://doi.org/10.22148/16.025) 5 | 6 | These create interchangeable, data-agnostic vectorized representations of text suitable for a variety of contexts. Unlike most vectorizations, they can represent texts in any language that uses space tokenization, or even non-linguistic content, since they contain no implicit language model beyond the words themselves. 7 | 8 | You may want to use them in concert with the pre-distributed Hathi SRP features 9 | described further here. 10 | 11 | ## Installation 12 | 13 | Requires Python 3. 14 | 15 | ```bash 16 | pip install pysrp 17 | ``` 18 | ## Changelog 19 | 20 | **Version 2.0 (July 2022) slightly changes the default tokenization algorithm!** 21 | 22 | Previously it was `\w`; now it is `[\p{L}\p{Pc}\p{N}\p{M}]+`. 23 | 24 | I also no longer recommend use 25 | 26 | ## Usage 27 | 28 | ### Examples 29 | 30 | See the [docs folder](https://github.com/bmschmidt/pySRP/tree/master/docs) 31 | for some IPython notebooks demonstrating: 32 | 33 | * [Taking a subset of the full Hathi collection (100,000 works of fiction) based on 34 | identifiers, and exploring the major clusters within fiction.](https://github.com/bmschmidt/pySRP/blob/master/docs/Build%20Fiction%20Set.ipynb) 35 | * [Creating a new SRP representation of text files and plotting dimensionality reductions of them by language and time](https://github.com/bmschmidt/pySRP/blob/master/docs/Hash%20a%20corpus%20of%20text%20files%20into%20SRP%20space.ipynb) 36 | * [Searching for copies of one set of books in the full HathiTrust collection, and using Hathi metadata to identify duplicates and find errors in local item descriptions.](https://github.com/bmschmidt/pySRP/blob/master/docs/Find%20Text%20Lab%20Books%20in%20Hathi.ipynb) 37 | * [Training a classifier based on library metadata using TensorFlow, and then applying that classification to other sorts of text.](https://github.com/bmschmidt/pySRP/blob/master/docs/Classification%20Using%20Tensorflow%20Estimators.ipynb) 38 | 39 | ### Basic Usage 40 | 41 | Use the SRP class to build an object that performs transformations. 42 | 43 | This is a class, rather than a function, because the object builds a cache of previously seen words. 44 | 45 | ```python 46 | import srp 47 | # initialize with desired number of dimensions 48 | hasher = srp.SRP(640) 49 | ``` 50 | 51 | The most important method is `stable_transform`. 52 | 53 | It can tokenize a string and then compute its SRP. 54 | 55 | ```python 56 | hasher.stable_transform(words = "foo bar bar") 57 | ``` 58 | 59 | If counts are already computed, word and count vectors can be passed separately.
60 | 61 | ```python 62 | hasher.stable_transform(words = ["foo","bar"],counts = [1,2]) 63 | ``` 64 | 65 | 66 | ## Read/write tools 67 | 68 | SRP files are stored in a binary file format to save space. 69 | This format is the same used by the binary word2vec format. 70 | 71 | **DEPRECATION NOTICE** 72 | 73 | This format is now deprecated--I recommend the Apache Arrow binary serialization format instead. 74 | 75 | ```python 76 | file = SRP.Vector_file("hathivectors.bin") 77 | 78 | for (key, vector) in file: 79 | pass 80 | # 'key' is a unique identifier for a document in a corpus 81 | # 'vector' is a `numpy.array` of type `= self.cache_limit: 126 | # Clear the cache; maybe things have changed. 127 | self._hash_dict = self._last_hash_dict 128 | self._hash_dict = {} 129 | 130 | if cache and self._cache_size() < self.cache_limit: 131 | self._hash_dict[string] = value 132 | 133 | return value 134 | 135 | def tokenlist(self, string, regex = tokenregex, lower = True): 136 | if isinstance(string, bytes): 137 | string = string.decode("utf-8") 138 | return regex.findall(string) 139 | 140 | def tokenize(self, string, regex=tokenregex, lower = True): 141 | parts = self.tokenlist(string, regex, lower) 142 | count = dict() 143 | for part in parts: 144 | if lower: 145 | part = part.lower() 146 | try: 147 | count[part] += 1 148 | except KeyError: 149 | count[part] = 1 150 | return count 151 | 152 | def standardize(self, words, counts, unzip = True): 153 | full = dict() 154 | 155 | for i in range(len(words)): 156 | """ 157 | Here we retokenize each token. A full text can be tokenized 158 | at a single pass 159 | by passing words = [string], counts=[1] 160 | """ 161 | subCounts = self.tokenize(words[i]) 162 | for (part, partCounts) in subCounts.items(): 163 | part = regex.sub(u'\d', "#", part) 164 | addition = counts[i] * partCounts 165 | try: 166 | full[part] += addition 167 | except KeyError: 168 | full[part] = addition 169 | words = [] 170 | counts = np.zeros(len(full), "= limit: 76 | break 77 | 78 | output.close() 79 | 80 | import pyarrow as pa 81 | from pyarrow import ipc 82 | from pyarrow import feather 83 | 84 | class Arrow_File(object): 85 | """ 86 | Store in an arrow file. 87 | """ 88 | 89 | def __init__(self, filename, dims=float("Inf"), mode="r", max_rows=float("Inf"), precision = "float", offset_cache = False): 90 | """ 91 | Creates an SRP object. 92 | 93 | filename: The location on disk. 94 | dims: The number of vectors to store for each document. Typically ~100 to ~1000. 95 | Need not be specified if working with an existing file. 96 | mode: One of: 'r' (read an existing file); 'w' (create a new file); 'a' (append to the 97 | end of an existing file) 98 | max_rows: clip the document to a fixed length. Best left unused. 99 | precision: bytes to use for each. 4 (single-precision) is standard; 2 (half precision) is also reasonable. 0 embeds not as floats, but instead into binary hamming space. 100 | offset_cache: Whether to store the byte offset lookup information for vectors. By default, 101 | this is False, which means the offset table is built on load and kept in memory. 
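        Example (an illustrative sketch only; it assumes this class mirrors the
        `Vector_file` interface used elsewhere in this module -- `add_row`,
        iteration, and `with`-statement support -- and uses a made-up filename):

            with Arrow_File("vectors.feather", dims=640, mode="w") as outfile:
                outfile.add_row("doc-1", np.ones(640, dtype="<f4"))

            for key, vector in Arrow_File("vectors.feather", mode="r"):
                print(key, vector.shape)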
102 | """ 103 | 104 | self.filename = filename 105 | self.dims = dims 106 | self.mode = mode 107 | self.max_rows = max_rows 108 | if precision == 2: 109 | precision = "half" 110 | elif precision == 4: 111 | precision = "float" 112 | try: 113 | assert precision in {"half", "float", "binary"} 114 | except: 115 | e = "Only `4` (single) and `2` (half) bytes are valid options for `precision`" 116 | raise ValueError(e) 117 | self.precision = precision 118 | if self.precision == "half": 119 | self.float_format = ' 0) 267 | else: 268 | raise TypeError("Numpy array must be of type '= self._BATCH_SIZE: 277 | self.flush() 278 | 279 | try: 280 | self._prefix_lookup[identifier.split(self.sep, 1)[0]].append((identifier, self.file.tell())) 281 | except AttributeError: 282 | pass 283 | try: 284 | self._offset_lookup[identifier] = self.file.tell() 285 | except AttributeError: 286 | pass 287 | 288 | def close(self): 289 | """ 290 | Close the file. It's extremely important to call this method in write modes: 291 | not just that the last few files will be missing. 292 | If it isn't, the header will have out-of-date information and files won't be read. 293 | """ 294 | self.flush() 295 | self.file.close() 296 | 297 | if self.offset_cache: 298 | self._offset_lookup.close() 299 | 300 | def _regex_search(self, regex): 301 | 302 | self._build_offset_lookup() 303 | values = [(i, k) for k, i in self._offset_lookup.items() if re.search(regex, k)] 304 | # Sort to ensure values are returned in disk order. 305 | values.sort() 306 | for i, k in values: 307 | yield (k, self[k]) 308 | 309 | 310 | self._build_offset_lookup() 311 | values = [(i, k) for k, i in self._offset_lookup.items() if re.search(regex, k)] 312 | # Sort to ensure values are returned in disk order. 313 | values.sort() 314 | for i, k in values: 315 | yield (k, self[k]) 316 | 317 | def __getitem__(self, label): 318 | """ 319 | Attributes can be accessed in three ways. 320 | 321 | 322 | With a string: this returns just the vector for that string. 323 | With a list of strings: this returns a multidimensional array for each query passed. 324 | If any of the requested items do not exist, this will fail. 325 | With a single *compiled* regular expression (from either the regex or re module). This 326 | will return an iterator over key, value pairs of keys that match the regex. 327 | """ 328 | 329 | self._build_offset_lookup() 330 | 331 | if self.mode == 'a' or self.mode == 'w': 332 | self.file.flush() 333 | 334 | if isinstance(label, original_regex_type): 335 | # Convert from re type since that's 336 | # more standard 337 | label = re.compile(label.pattern) 338 | 339 | if isinstance(label, regex_type): 340 | return self._regex_search(label) 341 | 342 | if isinstance(label, original_regex_type): 343 | label = re.compile(label.pattern) 344 | 345 | if isinstance(label, regex_type): 346 | return self._regex_search(label) 347 | 348 | if isinstance(label, MutableSequence): 349 | is_iterable = True 350 | else: 351 | is_iterable = False 352 | label = [label] 353 | 354 | vecs = [] 355 | # Will fail on any missing labels 356 | 357 | # Prefill and sort so that any block are done in disk-order. 358 | # This may make a big difference if you're on a tape drive! 
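        # An illustrative sketch of that idea (hedged; the code below differs in
        # detail): resolve each requested label to its byte offset, visit the
        # offsets in increasing order so the file is read in a single forward
        # sweep, then place each row back at the caller's requested position, e.g.
        #
        #     order = sorted(range(len(label)),
        #                    key=lambda i: self._offset_lookup[label[i]])
        #     for i in order:
        #         self.file.seek(self._offset_lookup[label[i]])
        #         vecs[i] = ...  # read one row at the current position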
359 | 360 | vecs = np.zeros((len(label), self.vector_size), ' 0) 702 | else: 703 | raise TypeError("Numpy array must be of type ' self.vector_size: 758 | warnings.warn( 759 | "WARNING: data has only {} columns but call requested top {}".format( 760 | self.vector_size, self.dims)) 761 | if self.dims == float("Inf") or self.dims == self.vector_size: 762 | self.dims = self.vector_size 763 | self.slice_and_dice = False 764 | else: 765 | self.slice_and_dice = True 766 | 767 | self.set_binary_len() 768 | 769 | self.remaining_words = min([self.vocab_size, self.max_rows]) 770 | 771 | def _check_if_half_precision(self): 772 | 773 | body_start = self.file.tell() 774 | word, weights = (self._read_row_name(), self._read_binary_row()) 775 | 776 | meanval = np.mean(np.abs(weights)) 777 | 778 | if meanval > 1e10: 779 | warning = "Average size is extremely large" + \ 780 | "did you mean to specify 'precision = half or precision = binary'?" 781 | warnings.warn(warning) 782 | 783 | def _read_row_name(self): 784 | buffer = [] 785 | while True: 786 | ch = self.file.read(1) 787 | if not ch and self.remaining_words > 0: 788 | print("Ran out of data with {} words left".format( 789 | self.remaining_words)) 790 | return 791 | if ch == b' ': 792 | break 793 | if ch != b'\n': 794 | # ignore newlines in front of words (some binary files have em) 795 | buffer.append(ch) 796 | try: 797 | word = b''.join(buffer).decode() 798 | except: 799 | print("Couldn't export:") 800 | print(buffer) 801 | raise 802 | return word 803 | 804 | 805 | def _build_offset_lookup(self, force=False, sep = None): 806 | if hasattr(self, "_offset_lookup") and not force and not sep: 807 | return 808 | if hasattr(self, "_prefix_lookup") and not force and sep: 809 | return 810 | 811 | if sep is not None: 812 | prefix_lookup = defaultdict(list) 813 | else: 814 | offset_lookup = {} 815 | 816 | self._preload_metadata() 817 | # Add warning for duplicate ids. 818 | i = 0 819 | while i < self.vocab_size: 820 | label = self._read_row_name() 821 | if sep is None and label in offset_lookup: 822 | warnings.warn( 823 | "Warning: this vector file has duplicate identifiers " + 824 | "(words) The last vector representation of each " + 825 | "identifier will be used, and earlier ones ignored.") 826 | if sep: 827 | key = label.split(sep, 1)[0] 828 | loc = self.file.tell() 829 | prefix_lookup[key].append((label, loc)) 830 | else: 831 | offset_lookup[label] = self.file.tell() 832 | # Skip to the next name without reading. 833 | self.file.seek(self.binary_len, 1) 834 | i += 1 835 | 836 | if self.offset_cache: 837 | # While building the full dict in memory then saving to cache should be quicker 838 | # (for prefix lookup), this defeats the primary value of the cache in avoid holding 839 | # huge objects in memory. An intermediate write when the dict is getting big 840 | # will be needed for scale. 
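            # One possible shape for that intermediate write (a hedged sketch only;
            # it is not implemented here): write into the SqliteDict inside the scan
            # loop above and commit every N rows, so no full dict is ever held in
            # memory:
            #
            #     cache[label] = self.file.tell()
            #     if i and i % 100_000 == 0:
            #         cache.commit()   # flush accumulated offsets to disk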
841 | if sep: 842 | self._prefix_lookup = SqliteDict(self.filename + '.prefix.db', 843 | autocommit=False, journal_mode ='OFF') 844 | for key, value in prefix_lookup.items(): 845 | self._prefix_lookup[key] = value 846 | self._prefix_lookup.commit() 847 | else: 848 | self._offset_lookup = SqliteDict(self.filename + '.offset.db', encode=int, decode=int, 849 | autocommit=False, journal_mode ='OFF') 850 | for key, value in offset_lookup.items(): 851 | self._offset_lookup[key] = value 852 | self._offset_lookup.commit() 853 | else: 854 | if sep: 855 | self._prefix_lookup = prefix_lookup 856 | else: 857 | self._offset_lookup = offset_lookup 858 | 859 | def sort(self, destination, sort = "names", safe = True, chunk_size = 2000): 860 | """ 861 | This method sorts a vector file by its keys without reading it into memory. 862 | 863 | It also cleans 864 | 865 | destination: A new file to be written. 866 | 867 | sort: one of 'names' (default sort by the filenames), 'random' 868 | (sort randomly), or 'none' (keep the current order) 869 | 870 | safe: whether to check for (and eliminate) duplicate keys and 871 | 872 | chunk_size: How many vectors to read into memory at a time. Larger numbers 873 | may improve performance, especially on hard drives, 874 | by keeping the disk head from moving around. 875 | """ 876 | 877 | self._build_offset_lookup() 878 | ks = list(self._offset_lookup.keys()) 879 | if sort == 'names': 880 | ks.sort() 881 | elif sort == 'random': 882 | random.shuffle(ks) 883 | elif sort == 'none': 884 | pass 885 | else: 886 | raise NotImplementedError("sort type must be one of [names, random, none]") 887 | # Chunk size matters because we can pull the vectors 888 | # from the disk in order within each chunk. 889 | 890 | last_written = None 891 | with Vector_file(destination, 892 | dims = self.dims, 893 | mode = "w", 894 | precision = self.precision) as output: 895 | for i in range(0, len(ks), chunk_size): 896 | keys = ks[i:(i + chunk_size)] 897 | for key, row in zip(keys, self[keys]): 898 | if safe: 899 | norm = np.linalg.norm(row) 900 | if np.isinf(norm) or np.isnan(norm) or norm == 0: 901 | continue 902 | if key == last_written: 903 | continue 904 | last_written = key 905 | output.add_row(key, row) 906 | 907 | def _regex_search(self, regex): 908 | 909 | self._build_offset_lookup() 910 | values = [(i, k) for k, i in self._offset_lookup.items() if re.search(regex, k)] 911 | # Sort to ensure values are returned in disk order. 912 | values.sort() 913 | for i, k in values: 914 | yield (k, self[k]) 915 | 916 | 917 | self._build_offset_lookup() 918 | values = [(i, k) for k, i in self._offset_lookup.items() if re.search(regex, k)] 919 | # Sort to ensure values are returned in disk order. 920 | values.sort() 921 | for i, k in values: 922 | yield (k, self[k]) 923 | 924 | def flush(self): 925 | """ 926 | Flushing requires rewriting the metadata at the head as well as flushing the file buffer. 927 | """ 928 | self.file.flush() 929 | self._rewrite_header() 930 | 931 | def __getitem__(self, label): 932 | """ 933 | Attributes can be accessed in three ways. 934 | 935 | 936 | With a string: this returns just the vector for that string. 937 | With a list of strings: this returns a multidimensional array for each query passed. 938 | If any of the requested items do not exist, this will fail. 939 | With a single *compiled* regular expression (from either the regex or re module). This 940 | will return an iterator over key, value pairs of keys that match the regex. 
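        An illustrative sketch of the three access modes (assuming `vf` is an
        already-open, read-mode instance of this class):

            one = vf["doc-1"]                         # a single vector
            block = vf[["doc-1", "doc-2"]]            # 2-D array, one row per key
            for key, vec in vf[re.compile("^doc-")]:  # iterate over regex matches
                print(key, vec[:5])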
941 | """ 942 | 943 | self._build_offset_lookup() 944 | 945 | if self.mode == 'a' or self.mode == 'w': 946 | self.file.flush() 947 | 948 | if isinstance(label, original_regex_type): 949 | # Convert from re type since that's 950 | # more standard 951 | label = re.compile(label.pattern) 952 | 953 | if isinstance(label, regex_type): 954 | return self._regex_search(label) 955 | 956 | if isinstance(label, original_regex_type): 957 | label = re.compile(label.pattern) 958 | 959 | if isinstance(label, regex_type): 960 | return self._regex_search(label) 961 | 962 | if isinstance(label, MutableSequence): 963 | is_iterable = True 964 | else: 965 | is_iterable = False 966 | label = [label] 967 | 968 | vecs = [] 969 | # Will fail on any missing labels 970 | 971 | # Prefill and sort so that any block are done in disk-order. 972 | # This may make a big difference if you're on a tape drive! 973 | 974 | vecs = np.zeros((len(label), self.vector_size), '= 2 and what != \"train\":\n", 171 | " continue\n", 172 | " if id in lookup:\n", 173 | " cat = lookup[id]\n", 174 | " # Normalize vectors to unit length.\n", 175 | " row = row/np.linalg.norm(row.astype('
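The notebook cell above normalizes each SRP row to unit length before it is handed to the classifier. A minimal, self-contained sketch of that step (the variable names and the `<f4` target dtype are assumptions for illustration, not taken from the notebook):

```python
import numpy as np

def unit_normalize(rows):
    """L2-normalize each row so that dot products between rows are cosine similarities."""
    rows = np.asarray(rows, dtype="<f4")                # promote e.g. half-precision input
    norms = np.linalg.norm(rows, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                             # leave any all-zero rows unchanged
    return rows / norms

# e.g. a (n_documents, 640) block of SRP features
vectors = np.random.randn(3, 640).astype("<f4")
normalized = unit_normalize(vectors)
print(np.allclose(np.linalg.norm(normalized, axis=1), 1.0))   # True
```

Normalizing first makes downstream dot products behave like cosine similarities, which is the usual way SRP vectors are compared.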