├── .gitignore ├── .travis.yml ├── Cargo.lock ├── Cargo.toml ├── FORMAT.md ├── LICENSE ├── NOTICE ├── README.md ├── rust2vec-utils ├── Cargo.toml └── src │ ├── bin │ ├── r2v-compute-accuracy.rs │ ├── r2v-convert.rs │ ├── r2v-metadata.rs │ ├── r2v-quantize.rs │ └── r2v-similar.rs │ └── lib.rs └── rust2vec ├── Cargo.toml ├── src ├── embeddings.rs ├── io.rs ├── lib.rs ├── metadata.rs ├── prelude.rs ├── similarity.rs ├── storage.rs ├── subword.rs ├── tests.rs ├── text.rs ├── util.rs ├── vocab.rs └── word2vec.rs └── testdata ├── analogy.bin ├── similarity.bin ├── similarity.fifu ├── similarity.nodims └── similarity.txt /.gitignore: -------------------------------------------------------------------------------- 1 | .* 2 | *.bk 3 | target 4 | -------------------------------------------------------------------------------- /.travis.yml: -------------------------------------------------------------------------------- 1 | language: rust 2 | rust: 3 | - stable 4 | - beta 5 | - nightly 6 | script: 7 | - rustup component add clippy 8 | - cargo build --verbose 9 | - cargo test --verbose 10 | - ( cd rust2vec-utils ; cargo build --verbose --features "opq reductive/openblas" ) 11 | - cargo clippy 12 | matrix: 13 | allow_failures: 14 | - rust: nightly 15 | sudo: required 16 | dist: trusty 17 | addons: 18 | apt: 19 | packages: 20 | - libopenblas-dev 21 | - gfortran 22 | -------------------------------------------------------------------------------- /Cargo.toml: -------------------------------------------------------------------------------- 1 | [workspace] 2 | 3 | members = [ 4 | "rust2vec", 5 | "rust2vec-utils", 6 | ] 7 | -------------------------------------------------------------------------------- /FORMAT.md: -------------------------------------------------------------------------------- 1 | # The finalfusion format v0 2 | 3 | ## Goals 4 | 5 | `finalfusion` is a format for storing word embeddings. The goals of the 6 | first version of the finalfusion format are: 7 | 8 | 1. 
Easy to parse 9 | 2. Fast to parse 10 | 3. Extensible 11 | 4. Support for: 12 | * Memory mapping 13 | * Tokens with spaces 14 | * Subword units 15 | * Quantized matrices 16 | 5. Existing embeddings should be convertible 17 | 18 | ## File format 19 | 20 | Each `finalfusion` file consists of a header, followed by chunks. Currently, 21 | a `finalfusion` file must contain the following chunks, in this order: 22 | 23 | 1. Optional metadata chunk 24 | 2. Vocabulary chunk 25 | 3. Storage chunk 26 | 27 | The permitted chunks may be extended in a future version of the 28 | specification. In particular, we would like to make it possible: 29 | 30 | * To have multiple storage chunks per vocabulary. 31 | * To have multiple vocab-storage pairs. 32 | 33 | All data must be in little-endian byte order. 34 | 35 | ## Header 36 | 37 | The header consists of: 38 | 39 | - 4 bytes of magic: `['F', 'i', 'F', 'u']` 40 | - Format version number: u32 41 | - Number of chunks: u32 (`n_chunks`) 42 | - Chunk identifiers: `[u32; n_chunks]` 43 | 44 | ## Data types 45 | 46 | ``` 47 | 0: i8 48 | 1: u8 49 | 2: i16 50 | 3: u16 51 | 4: i32 52 | 5: u32 53 | 6: i64 54 | 7: u64 55 | 8: i128 56 | 9: u128 57 | 10: f32 58 | 11: f64 59 | ``` 60 | 61 | ## Chunks 62 | 63 | ### Chunk format 64 | 65 | The chunk format is as follows: 66 | 67 | - Chunk identifier: u32 68 | - Chunk data length: u64 69 | - Chunk data: n bytes 70 | 71 | ### Vocab 72 | 73 | - Chunk identifier: 0 74 | - Vocab length: u64 (`vocab_len`) 75 | - `vocab_len` times: 76 | - word length in bytes: u32 (`word_len`) 77 | - `word_len` times u8. 78 | 79 | ### Subword vocab 80 | 81 | - Chunk identifier: 3 82 | - Minimum n-gram length: u32 83 | - Maximum n-gram length: u32 84 | - Bucket exponent: u32 85 | - Vocab length: u64 (`vocab_len`) 86 | - `vocab_len` times: 87 | - word length in bytes: u32 (`word_len`) 88 | - `word_len` times u8.
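To make the header layout above concrete, here is a minimal sketch of parsing it with only the Rust standard library. This is not the rust2vec API; the `Header` struct and `read_header` function are illustrative names, and real readers would of course go on to parse the chunks as well:

```rust
use std::io::{self, Read};

/// Parsed `finalfusion` header (illustrative sketch, not the rust2vec API).
#[derive(Debug, PartialEq)]
struct Header {
    version: u32,
    chunk_ids: Vec<u32>,
}

fn read_header(mut read: impl Read) -> io::Result<Header> {
    // 4 bytes of magic: ['F', 'i', 'F', 'u'].
    let mut magic = [0u8; 4];
    read.read_exact(&mut magic)?;
    if &magic != b"FiFu" {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "invalid magic"));
    }

    // All integers are in little-endian byte order.
    let mut word = [0u8; 4];
    read.read_exact(&mut word)?;
    let version = u32::from_le_bytes(word);

    read.read_exact(&mut word)?;
    let n_chunks = u32::from_le_bytes(word);

    // Chunk identifiers: [u32; n_chunks].
    let mut chunk_ids = Vec::with_capacity(n_chunks as usize);
    for _ in 0..n_chunks {
        read.read_exact(&mut word)?;
        chunk_ids.push(u32::from_le_bytes(word));
    }

    Ok(Header { version, chunk_ids })
}
```

Since `read_header` takes any `impl Read`, it works equally for a `File`, a buffered reader, or an in-memory byte slice.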
89 | 90 | 91 | ### Embedding matrix 92 | 93 | - Chunk identifier: 1 94 | - Shape: 95 | - Rows: u64 (`n_rows`) 96 | - Cols: u32 (`n_cols`) 97 | - Data type: u32 (`data_type`) 98 | - Padding, such that data is at a multiple of `sizeof(data_type)`. 99 | - Data: `n_rows` * `n_cols` * `sizeof(data_type)` 100 | 101 | ### Quantized embedding matrix 102 | 103 | - Chunk identifier: 1 104 | - Use projection (0 or 1): u32 105 | - Use norms (0 or 1): u32 106 | - Quantized embedding length: u32 (`quantized_len`) 107 | - Reconstructed embedding length: u32 (`reconstructed_len`) 108 | - Number of quantizer centroids: u32 109 | - Quantized matrix rows: u64 (`matrix_rows`) 110 | - Quantized matrix type: u32 (`quantized_type`) 111 | - Reconstructed matrix type: u32 (`reconstructed_type`) 112 | - Padding, such that data is at a multiple of the largest matrix data type. 113 | - Projection matrix: `reconstructed_len` x `reconstructed_len` x `sizeof(reconstructed_type)` 114 | - Subquantizers: `quantized_len` x (`reconstructed_len` / `quantized_len`) x `sizeof(quantized_type)` 115 | - Norms: `matrix_rows` x `sizeof(reconstructed_type)` 116 | - Quantized embedding matrix: `matrix_rows` x `quantized_len` x `sizeof(quantized_type)` 117 | 118 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | 2 | Apache License 3 | Version 2.0, January 2004 4 | http://www.apache.org/licenses/ 5 | 6 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 7 | 8 | 1. Definitions. 9 | 10 | "License" shall mean the terms and conditions for use, reproduction, 11 | and distribution as defined by Sections 1 through 9 of this document. 12 | 13 | "Licensor" shall mean the copyright owner or entity authorized by 14 | the copyright owner that is granting the License.
15 | 16 | "Legal Entity" shall mean the union of the acting entity and all 17 | other entities that control, are controlled by, or are under common 18 | control with that entity. For the purposes of this definition, 19 | "control" means (i) the power, direct or indirect, to cause the 20 | direction or management of such entity, whether by contract or 21 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 22 | outstanding shares, or (iii) beneficial ownership of such entity. 23 | 24 | "You" (or "Your") shall mean an individual or Legal Entity 25 | exercising permissions granted by this License. 26 | 27 | "Source" form shall mean the preferred form for making modifications, 28 | including but not limited to software source code, documentation 29 | source, and configuration files. 30 | 31 | "Object" form shall mean any form resulting from mechanical 32 | transformation or translation of a Source form, including but 33 | not limited to compiled object code, generated documentation, 34 | and conversions to other media types. 35 | 36 | "Work" shall mean the work of authorship, whether in Source or 37 | Object form, made available under the License, as indicated by a 38 | copyright notice that is included in or attached to the work 39 | (an example is provided in the Appendix below). 40 | 41 | "Derivative Works" shall mean any work, whether in Source or Object 42 | form, that is based on (or derived from) the Work and for which the 43 | editorial revisions, annotations, elaborations, or other modifications 44 | represent, as a whole, an original work of authorship. For the purposes 45 | of this License, Derivative Works shall not include works that remain 46 | separable from, or merely link (or bind by name) to the interfaces of, 47 | the Work and Derivative Works thereof. 
48 | 49 | "Contribution" shall mean any work of authorship, including 50 | the original version of the Work and any modifications or additions 51 | to that Work or Derivative Works thereof, that is intentionally 52 | submitted to Licensor for inclusion in the Work by the copyright owner 53 | or by an individual or Legal Entity authorized to submit on behalf of 54 | the copyright owner. For the purposes of this definition, "submitted" 55 | means any form of electronic, verbal, or written communication sent 56 | to the Licensor or its representatives, including but not limited to 57 | communication on electronic mailing lists, source code control systems, 58 | and issue tracking systems that are managed by, or on behalf of, the 59 | Licensor for the purpose of discussing and improving the Work, but 60 | excluding communication that is conspicuously marked or otherwise 61 | designated in writing by the copyright owner as "Not a Contribution." 62 | 63 | "Contributor" shall mean Licensor and any individual or Legal Entity 64 | on behalf of whom a Contribution has been received by Licensor and 65 | subsequently incorporated within the Work. 66 | 67 | 2. Grant of Copyright License. Subject to the terms and conditions of 68 | this License, each Contributor hereby grants to You a perpetual, 69 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 70 | copyright license to reproduce, prepare Derivative Works of, 71 | publicly display, publicly perform, sublicense, and distribute the 72 | Work and such Derivative Works in Source or Object form. 73 | 74 | 3. Grant of Patent License. 
Subject to the terms and conditions of 75 | this License, each Contributor hereby grants to You a perpetual, 76 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 77 | (except as stated in this section) patent license to make, have made, 78 | use, offer to sell, sell, import, and otherwise transfer the Work, 79 | where such license applies only to those patent claims licensable 80 | by such Contributor that are necessarily infringed by their 81 | Contribution(s) alone or by combination of their Contribution(s) 82 | with the Work to which such Contribution(s) was submitted. If You 83 | institute patent litigation against any entity (including a 84 | cross-claim or counterclaim in a lawsuit) alleging that the Work 85 | or a Contribution incorporated within the Work constitutes direct 86 | or contributory patent infringement, then any patent licenses 87 | granted to You under this License for that Work shall terminate 88 | as of the date such litigation is filed. 89 | 90 | 4. Redistribution. 
You may reproduce and distribute copies of the 91 | Work or Derivative Works thereof in any medium, with or without 92 | modifications, and in Source or Object form, provided that You 93 | meet the following conditions: 94 | 95 | (a) You must give any other recipients of the Work or 96 | Derivative Works a copy of this License; and 97 | 98 | (b) You must cause any modified files to carry prominent notices 99 | stating that You changed the files; and 100 | 101 | (c) You must retain, in the Source form of any Derivative Works 102 | that You distribute, all copyright, patent, trademark, and 103 | attribution notices from the Source form of the Work, 104 | excluding those notices that do not pertain to any part of 105 | the Derivative Works; and 106 | 107 | (d) If the Work includes a "NOTICE" text file as part of its 108 | distribution, then any Derivative Works that You distribute must 109 | include a readable copy of the attribution notices contained 110 | within such NOTICE file, excluding those notices that do not 111 | pertain to any part of the Derivative Works, in at least one 112 | of the following places: within a NOTICE text file distributed 113 | as part of the Derivative Works; within the Source form or 114 | documentation, if provided along with the Derivative Works; or, 115 | within a display generated by the Derivative Works, if and 116 | wherever such third-party notices normally appear. The contents 117 | of the NOTICE file are for informational purposes only and 118 | do not modify the License. You may add Your own attribution 119 | notices within Derivative Works that You distribute, alongside 120 | or as an addendum to the NOTICE text from the Work, provided 121 | that such additional attribution notices cannot be construed 122 | as modifying the License. 
123 | 124 | You may add Your own copyright statement to Your modifications and 125 | may provide additional or different license terms and conditions 126 | for use, reproduction, or distribution of Your modifications, or 127 | for any such Derivative Works as a whole, provided Your use, 128 | reproduction, and distribution of the Work otherwise complies with 129 | the conditions stated in this License. 130 | 131 | 5. Submission of Contributions. Unless You explicitly state otherwise, 132 | any Contribution intentionally submitted for inclusion in the Work 133 | by You to the Licensor shall be under the terms and conditions of 134 | this License, without any additional terms or conditions. 135 | Notwithstanding the above, nothing herein shall supersede or modify 136 | the terms of any separate license agreement you may have executed 137 | with Licensor regarding such Contributions. 138 | 139 | 6. Trademarks. This License does not grant permission to use the trade 140 | names, trademarks, service marks, or product names of the Licensor, 141 | except as required for reasonable and customary use in describing the 142 | origin of the Work and reproducing the content of the NOTICE file. 143 | 144 | 7. Disclaimer of Warranty. Unless required by applicable law or 145 | agreed to in writing, Licensor provides the Work (and each 146 | Contributor provides its Contributions) on an "AS IS" BASIS, 147 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 148 | implied, including, without limitation, any warranties or conditions 149 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 150 | PARTICULAR PURPOSE. You are solely responsible for determining the 151 | appropriateness of using or redistributing the Work and assume any 152 | risks associated with Your exercise of permissions under this License. 153 | 154 | 8. Limitation of Liability. 
In no event and under no legal theory, 155 | whether in tort (including negligence), contract, or otherwise, 156 | unless required by applicable law (such as deliberate and grossly 157 | negligent acts) or agreed to in writing, shall any Contributor be 158 | liable to You for damages, including any direct, indirect, special, 159 | incidental, or consequential damages of any character arising as a 160 | result of this License or out of the use or inability to use the 161 | Work (including but not limited to damages for loss of goodwill, 162 | work stoppage, computer failure or malfunction, or any and all 163 | other commercial damages or losses), even if such Contributor 164 | has been advised of the possibility of such damages. 165 | 166 | 9. Accepting Warranty or Additional Liability. While redistributing 167 | the Work or Derivative Works thereof, You may choose to offer, 168 | and charge a fee for, acceptance of support, warranty, indemnity, 169 | or other liability obligations and/or rights consistent with this 170 | License. However, in accepting such obligations, You may act only 171 | on Your own behalf and on Your sole responsibility, not on behalf 172 | of any other Contributor, and only if You agree to indemnify, 173 | defend, and hold each Contributor harmless for any liability 174 | incurred by, or claims asserted against, such Contributor by reason 175 | of your accepting any such warranty or additional liability. 176 | 177 | END OF TERMS AND CONDITIONS 178 | 179 | APPENDIX: How to apply the Apache License to your work. 180 | 181 | To apply the Apache License to your work, attach the following 182 | boilerplate notice, with the fields enclosed by brackets "[]" 183 | replaced with your own identifying information. (Don't include 184 | the brackets!) The text should be enclosed in the appropriate 185 | comment syntax for the file format. 
We also recommend that a 186 | file or class name and description of purpose be included on the 187 | same "printed page" as the copyright notice for easier 188 | identification within third-party archives. 189 | 190 | Copyright [yyyy] [name of copyright owner] 191 | 192 | Licensed under the Apache License, Version 2.0 (the "License"); 193 | you may not use this file except in compliance with the License. 194 | You may obtain a copy of the License at 195 | 196 | http://www.apache.org/licenses/LICENSE-2.0 197 | 198 | Unless required by applicable law or agreed to in writing, software 199 | distributed under the License is distributed on an "AS IS" BASIS, 200 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 201 | See the License for the specific language governing permissions and 202 | limitations under the License. 203 | -------------------------------------------------------------------------------- /NOTICE: -------------------------------------------------------------------------------- 1 | This software contains portions of rust2vec, developed by 2 | Daniël de Kok . 3 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | **Warning:** The 2 | [finalfusion](https://finalfusion.github.com/finalfusion-rust) crate 3 | supersedes the rust2vec crate. 
4 | -------------------------------------------------------------------------------- /rust2vec-utils/Cargo.toml: -------------------------------------------------------------------------------- 1 | [package] 2 | name = "rust2vec-utils" 3 | version = "0.5.0" 4 | authors = ["Daniël de Kok "] 5 | edition = "2018" 6 | description = "rust2vec utilities" 7 | documentation = "https://docs.rs/rust2vec/" 8 | homepage = "https://github.com/danieldk/rust2vec" 9 | repository = "https://github.com/danieldk/rust2vec" 10 | license = "Apache-2.0" 11 | readme = "../README.md" 12 | 13 | [badges] 14 | maintenance = { status = "deprecated" } 15 | 16 | [dependencies] 17 | clap = "2" 18 | env_logger = "0.6" 19 | failure = "0.1" 20 | getopts = "0.2" 21 | ndarray = "0.12" 22 | num_cpus = "1" 23 | rayon = "1" 24 | reductive = "0.2" 25 | rust2vec = { path = "../rust2vec", version = "0.5" } 26 | stdinout = "0.4" 27 | toml = "0.4" 28 | 29 | [features] 30 | default = [] 31 | opq = ["reductive/opq-train"] 32 | -------------------------------------------------------------------------------- /rust2vec-utils/src/bin/r2v-compute-accuracy.rs: -------------------------------------------------------------------------------- 1 | use std::collections::BTreeMap; 2 | use std::io::BufRead; 3 | use std::sync::{Arc, Mutex}; 4 | 5 | use clap::{App, AppSettings, Arg, ArgMatches}; 6 | use rayon::prelude::*; 7 | use rayon::ThreadPoolBuilder; 8 | use rust2vec::prelude::*; 9 | use rust2vec::similarity::Analogy; 10 | use rust2vec_utils::{read_embeddings_view, EmbeddingFormat}; 11 | use stdinout::{Input, OrExit}; 12 | 13 | static DEFAULT_CLAP_SETTINGS: &[AppSettings] = &[ 14 | AppSettings::DontCollapseArgsInUsage, 15 | AppSettings::UnifiedHelpMessage, 16 | ]; 17 | 18 | fn main() { 19 | let matches = parse_args(); 20 | let config = config_from_matches(&matches); 21 | 22 | ThreadPoolBuilder::new() 23 | .num_threads(config.n_threads) 24 | .build_global() 25 | .unwrap(); 26 | 27 | let embeddings = 28 | 
read_embeddings_view(&config.embeddings_filename, EmbeddingFormat::FinalFusion) 29 | .or_exit("Cannot read embeddings", 1); 30 | 31 | let analogies_file = Input::from(config.analogies_filename); 32 | let reader = analogies_file 33 | .buf_read() 34 | .or_exit("Cannot open analogy file for reading", 1); 35 | 36 | let instances = read_analogies(reader); 37 | process_analogies(&embeddings, &instances); 38 | } 39 | 40 | // Option constants 41 | static EMBEDDINGS: &str = "EMBEDDINGS"; 42 | static ANALOGIES: &str = "ANALOGIES"; 43 | static THREADS: &str = "threads"; 44 | 45 | fn parse_args() -> ArgMatches<'static> { 46 | App::new("r2v-compute-accuracy") 47 | .settings(DEFAULT_CLAP_SETTINGS) 48 | .arg( 49 | Arg::with_name(THREADS) 50 | .long("threads") 51 | .value_name("N") 52 | .help("Number of threads (default: logical_cpus / 2)") 53 | .takes_value(true), 54 | ) 55 | .arg( 56 | Arg::with_name(EMBEDDINGS) 57 | .help("Embedding file") 58 | .index(1) 59 | .required(true), 60 | ) 61 | .arg(Arg::with_name(ANALOGIES).help("Analogy file").index(2)) 62 | .get_matches() 63 | } 64 | 65 | struct Config { 66 | analogies_filename: Option<String>, 67 | embeddings_filename: String, 68 | n_threads: usize, 69 | } 70 | 71 | fn config_from_matches(matches: &ArgMatches) -> Config { 72 | let embeddings_filename = matches.value_of(EMBEDDINGS).unwrap().to_owned(); 73 | let analogies_filename = matches.value_of(ANALOGIES).map(ToOwned::to_owned); 74 | let n_threads = matches 75 | .value_of(THREADS) 76 | .map(|v| v.parse().or_exit("Cannot parse number of threads", 1)) 77 | .unwrap_or(num_cpus::get() / 2); 78 | 79 | Config { 80 | analogies_filename, 81 | embeddings_filename, 82 | n_threads, 83 | } 84 | } 85 | 86 | struct Counts { 87 | n_correct: usize, 88 | n_instances: usize, 89 | n_skipped: usize, 90 | } 91 | 92 | impl Default for Counts { 93 | fn default() -> Self { 94 | Counts { 95 | n_correct: 0, 96 | n_instances: 0, 97 | n_skipped: 0, 98 | } 99 | } 100 | } 101 | 102 | #[derive(Clone)] 103 | struct
Eval<'a> { 104 | embeddings: &'a Embeddings, 105 | section_counts: Arc<Mutex<BTreeMap<String, Counts>>>, 106 | } 107 | 108 | impl<'a> Eval<'a> { 109 | fn new(embeddings: &'a Embeddings) -> Self { 110 | Eval { 111 | embeddings, 112 | section_counts: Arc::new(Mutex::new(BTreeMap::new())), 113 | } 114 | } 115 | 116 | /// Evaluate an analogy. 117 | fn eval_analogy(&self, instance: &Instance) { 118 | // Skip instances where the to-be-predicted word is not in the 119 | // vocab. This is a shortcoming of the vocab size and not of the 120 | // embedding model itself. 121 | if self.embeddings.vocab().idx(&instance.answer).is_none() { 122 | let mut section_counts = self.section_counts.lock().unwrap(); 123 | let counts = section_counts.entry(instance.section.clone()).or_default(); 124 | counts.n_skipped += 1; 125 | return; 126 | } 127 | 128 | // If the model is not able to provide a query result, it is counted 129 | // as an error. 130 | let is_correct = self 131 | .embeddings 132 | .analogy(&instance.query.0, &instance.query.1, &instance.query.2, 1) 133 | .map(|r| r.first().unwrap().word == instance.answer) 134 | .unwrap_or(false); 135 | 136 | let mut section_counts = self.section_counts.lock().unwrap(); 137 | let counts = section_counts.entry(instance.section.clone()).or_default(); 138 | counts.n_instances += 1; 139 | if is_correct { 140 | counts.n_correct += 1; 141 | } 142 | } 143 | 144 | /// Print the accuracy for a section.
145 | fn print_section_accuracy(&self, section: &str, counts: &Counts) { 146 | if counts.n_instances == 0 { 147 | eprintln!("{}: no evaluation instances", section); 148 | return; 149 | } 150 | 151 | println!( 152 | "{}: {}/{} correct, accuracy: {:.2}, skipped: {}", 153 | section, 154 | counts.n_correct, 155 | counts.n_instances, 156 | (counts.n_correct as f64 / counts.n_instances as f64) * 100., 157 | counts.n_skipped, 158 | ); 159 | } 160 | } 161 | 162 | impl<'a> Drop for Eval<'a> { 163 | fn drop(&mut self) { 164 | let section_counts = self.section_counts.lock().unwrap(); 165 | 166 | // Print out counts for all sections. 167 | for (section, counts) in section_counts.iter() { 168 | self.print_section_accuracy(section, counts); 169 | } 170 | 171 | let n_correct = section_counts.values().map(|c| c.n_correct).sum::<usize>(); 172 | let n_instances = section_counts 173 | .values() 174 | .map(|c| c.n_instances) 175 | .sum::<usize>(); 176 | let n_skipped = section_counts.values().map(|c| c.n_skipped).sum::<usize>(); 177 | let n_instances_with_skipped = n_instances + n_skipped; 178 | 179 | // Print out overall counts. 180 | println!( 181 | "Total: {}/{} correct, accuracy: {:.2}", 182 | n_correct, 183 | n_instances, 184 | (n_correct as f64 / n_instances as f64) * 100. 185 | ); 186 | 187 | // Print skip counts. 188 | println!( 189 | "Skipped: {}/{} ({}%)", 190 | n_skipped, 191 | n_instances_with_skipped, 192 | (n_skipped as f64 / n_instances_with_skipped as f64) * 100.
193 | ); 194 | } 195 | } 196 | 197 | struct Instance { 198 | section: String, 199 | query: (String, String, String), 200 | answer: String, 201 | } 202 | 203 | fn read_analogies(reader: impl BufRead) -> Vec<Instance> { 204 | let mut section = String::new(); 205 | 206 | let mut instances = Vec::new(); 207 | 208 | for line in reader.lines() { 209 | let line = line.or_exit("Cannot read line.", 1); 210 | 211 | if line.starts_with(": ") { 212 | section = line.chars().skip(2).collect::<String>(); 213 | continue; 214 | } 215 | 216 | let quadruple: Vec<_> = line.split_whitespace().collect(); 217 | 218 | instances.push(Instance { 219 | section: section.clone(), 220 | query: ( 221 | quadruple[0].to_owned(), 222 | quadruple[1].to_owned(), 223 | quadruple[2].to_owned(), 224 | ), 225 | answer: quadruple[3].to_owned(), 226 | }); 227 | } 228 | 229 | instances 230 | } 231 | 232 | fn process_analogies(embeddings: &Embeddings, instances: &[Instance]) { 233 | let eval = Eval::new(&embeddings); 234 | instances 235 | .par_iter() 236 | .for_each(|instance| eval.eval_analogy(instance)); 237 | } 238 | -------------------------------------------------------------------------------- /rust2vec-utils/src/bin/r2v-convert.rs: -------------------------------------------------------------------------------- 1 | use std::fs::File; 2 | use std::io::{BufReader, BufWriter, Read}; 3 | 4 | use clap::{App, AppSettings, Arg, ArgMatches}; 5 | use failure::err_msg; 6 | use rust2vec::prelude::*; 7 | use rust2vec_utils::EmbeddingFormat; 8 | use stdinout::OrExit; 9 | use toml::Value; 10 | 11 | static DEFAULT_CLAP_SETTINGS: &[AppSettings] = &[ 12 | AppSettings::DontCollapseArgsInUsage, 13 | AppSettings::UnifiedHelpMessage, 14 | ]; 15 | 16 | struct Config { 17 | input_filename: String, 18 | output_filename: String, 19 | metadata_filename: Option<String>, 20 | input_format: EmbeddingFormat, 21 | output_format: EmbeddingFormat, 22 | normalization: bool, 23 | } 24 | 25 | // Option constants 26 | static INPUT_FORMAT: &str = "input_format";
27 | static METADATA_FILENAME: &str = "metadata_filename"; 28 | static NO_NORMALIZATION: &str = "no_normalization"; 29 | static OUTPUT_FORMAT: &str = "output_format"; 30 | 31 | // Argument constants 32 | static INPUT: &str = "INPUT"; 33 | static OUTPUT: &str = "OUTPUT"; 34 | 35 | fn parse_args() -> ArgMatches<'static> { 36 | App::new("r2v-convert") 37 | .settings(DEFAULT_CLAP_SETTINGS) 38 | .arg( 39 | Arg::with_name(INPUT) 40 | .help("Finalfrontier model") 41 | .index(1) 42 | .required(true), 43 | ) 44 | .arg(Arg::with_name(OUTPUT).help("Output file").index(2)) 45 | .arg( 46 | Arg::with_name(INPUT_FORMAT) 47 | .short("f") 48 | .long("from") 49 | .value_name("FORMAT") 50 | .help("Input format: finalfusion, text, textdims, word2vec (default: word2vec)") 51 | .takes_value(true), 52 | ) 53 | .arg( 54 | Arg::with_name(METADATA_FILENAME) 55 | .short("m") 56 | .long("metadata") 57 | .value_name("FILENAME") 58 | .help("TOML metadata to add to the embeddings") 59 | .takes_value(true), 60 | ) 61 | .arg( 62 | Arg::with_name(NO_NORMALIZATION) 63 | .short("n") 64 | .long("no-normalization") 65 | .help("Do not normalize embeddings during conversion."), 66 | ) 67 | .arg( 68 | Arg::with_name(OUTPUT_FORMAT) 69 | .short("t") 70 | .long("to") 71 | .value_name("FORMAT") 72 | .help("Output format: finalfusion, text, textdims, word2vec (default: finalfusion)") 73 | .takes_value(true), 74 | ) 75 | .get_matches() 76 | } 77 | 78 | fn config_from_matches(matches: &ArgMatches) -> Config { 79 | let input_filename = matches.value_of(INPUT).unwrap().to_owned(); 80 | let input_format = matches 81 | .value_of(INPUT_FORMAT) 82 | .map(|v| EmbeddingFormat::try_from(v).or_exit("Cannot parse input format", 1)) 83 | .unwrap_or(EmbeddingFormat::Word2Vec); 84 | let output_filename = matches.value_of(OUTPUT).unwrap().to_owned(); 85 | let output_format = matches 86 | .value_of(OUTPUT_FORMAT) 87 | .map(|v| EmbeddingFormat::try_from(v).or_exit("Cannot parse output format", 1)) 88 |
.unwrap_or(EmbeddingFormat::FinalFusion); 89 | 90 | let metadata_filename = matches.value_of(METADATA_FILENAME).map(ToOwned::to_owned); 91 | 92 | let normalization = !matches.is_present(NO_NORMALIZATION); 93 | 94 | Config { 95 | input_filename, 96 | output_filename, 97 | input_format, 98 | output_format, 99 | metadata_filename, 100 | normalization, 101 | } 102 | } 103 | 104 | fn main() { 105 | let matches = parse_args(); 106 | let config = config_from_matches(&matches); 107 | 108 | let metadata = config.metadata_filename.map(read_metadata).map(Metadata); 109 | 110 | let mut embeddings = read_embeddings( 111 | &config.input_filename, 112 | config.input_format, 113 | config.normalization, 114 | ); 115 | 116 | // Overwrite metadata if provided, otherwise retain existing metadata. 117 | if metadata.is_some() { 118 | embeddings.set_metadata(metadata); 119 | } 120 | 121 | write_embeddings(embeddings, &config.output_filename, config.output_format); 122 | } 123 | 124 | fn read_metadata(filename: impl AsRef) -> Value { 125 | let f = File::open(filename.as_ref()).or_exit("Cannot open metadata file", 1); 126 | let mut reader = BufReader::new(f); 127 | let mut buf = String::new(); 128 | reader 129 | .read_to_string(&mut buf) 130 | .or_exit("Cannot read metadata", 1); 131 | buf.parse::() 132 | .or_exit("Cannot parse metadata TOML", 1) 133 | } 134 | 135 | fn read_embeddings( 136 | filename: &str, 137 | embedding_format: EmbeddingFormat, 138 | normalization: bool, 139 | ) -> Embeddings { 140 | let f = File::open(filename).or_exit("Cannot open embeddings file", 1); 141 | let mut reader = BufReader::new(f); 142 | 143 | use EmbeddingFormat::*; 144 | match embedding_format { 145 | FinalFusion => ReadEmbeddings::read_embeddings(&mut reader), 146 | FinalFusionMmap => MmapEmbeddings::mmap_embeddings(&mut reader), 147 | Word2Vec => { 148 | ReadWord2Vec::read_word2vec_binary(&mut reader, normalization).map(Embeddings::into) 149 | } 150 | Text => ReadText::read_text(&mut reader, 
normalization).map(Embeddings::into), 151 | TextDims => ReadTextDims::read_text_dims(&mut reader, normalization).map(Embeddings::into), 152 | } 153 | .or_exit("Cannot read embeddings", 1) 154 | } 155 | 156 | fn write_embeddings( 157 | embeddings: Embeddings, 158 | filename: &str, 159 | embedding_format: EmbeddingFormat, 160 | ) { 161 | let f = File::create(filename).or_exit("Cannot create embeddings file", 1); 162 | let mut writer = BufWriter::new(f); 163 | 164 | use EmbeddingFormat::*; 165 | match embedding_format { 166 | FinalFusion => embeddings.write_embeddings(&mut writer), 167 | FinalFusionMmap => Err(err_msg("Writing to this format is not supported")), 168 | Word2Vec => embeddings.write_word2vec_binary(&mut writer), 169 | Text => embeddings.write_text(&mut writer), 170 | TextDims => embeddings.write_text_dims(&mut writer), 171 | } 172 | .or_exit("Cannot write embeddings", 1) 173 | } 174 | -------------------------------------------------------------------------------- /rust2vec-utils/src/bin/r2v-metadata.rs: -------------------------------------------------------------------------------- 1 | use std::fs::File; 2 | use std::io::{BufReader, BufWriter, Write}; 3 | 4 | use clap::{App, AppSettings, Arg, ArgMatches}; 5 | use rust2vec::prelude::*; 6 | use stdinout::{OrExit, Output}; 7 | use toml::ser::to_string_pretty; 8 | 9 | static DEFAULT_CLAP_SETTINGS: &[AppSettings] = &[ 10 | AppSettings::DontCollapseArgsInUsage, 11 | AppSettings::UnifiedHelpMessage, 12 | ]; 13 | 14 | struct Config { 15 | input_filename: String, 16 | output_filename: Option<String>, 17 | } 18 | 19 | // Argument constants 20 | static INPUT: &str = "INPUT"; 21 | static OUTPUT: &str = "OUTPUT"; 22 | 23 | fn parse_args() -> ArgMatches<'static> { 24 | App::new("r2v-metadata") 25 | .settings(DEFAULT_CLAP_SETTINGS) 26 | .arg( 27 | Arg::with_name(INPUT) 28 | .help("finalfusion model") 29 | .index(1) 30 | .required(true), 31 | ) 32 | .arg(Arg::with_name(OUTPUT).help("Output file").index(2)) 33 | .get_matches()
34 | } 35 | 36 | fn config_from_matches(matches: &ArgMatches) -> Config { 37 | let input_filename = matches.value_of(INPUT).unwrap().to_owned(); 38 | let output_filename = matches.value_of(OUTPUT).map(ToOwned::to_owned); 39 | 40 | Config { 41 | input_filename, 42 | output_filename, 43 | } 44 | } 45 | 46 | fn main() { 47 | let matches = parse_args(); 48 | let config = config_from_matches(&matches); 49 | 50 | let output = Output::from(config.output_filename); 51 | let mut writer = BufWriter::new(output.write().or_exit("Cannot open output for writing", 1)); 52 | 53 | if let Some(metadata) = read_metadata(&config.input_filename) { 54 | writer 55 | .write_all( 56 | to_string_pretty(&metadata.0) 57 | .or_exit("Cannot serialize metadata to TOML", 1) 58 | .as_bytes(), 59 | ) 60 | .or_exit("Cannot write metadata", 1); 61 | } 62 | } 63 | 64 | fn read_metadata(filename: &str) -> Option<Metadata> { 65 | let f = File::open(filename).or_exit("Cannot open embeddings file", 1); 66 | let mut reader = BufReader::new(f); 67 | ReadMetadata::read_metadata(&mut reader).or_exit("Cannot read metadata", 1) 68 | } 69 | -------------------------------------------------------------------------------- /rust2vec-utils/src/bin/r2v-quantize.rs: -------------------------------------------------------------------------------- 1 | use std::fs::File; 2 | use std::io::BufWriter; 3 | use std::process; 4 | 5 | use clap::{App, AppSettings, Arg, ArgMatches}; 6 | use ndarray::ArrayView1; 7 | use rayon::ThreadPoolBuilder; 8 | use reductive::pq::PQ; 9 | #[cfg(feature = "opq")] 10 | use reductive::pq::{GaussianOPQ, OPQ}; 11 | use rust2vec::prelude::*; 12 | use rust2vec_utils::{read_embeddings_view, EmbeddingFormat}; 13 | use stdinout::OrExit; 14 | 15 | static DEFAULT_CLAP_SETTINGS: &[AppSettings] = &[ 16 | AppSettings::DontCollapseArgsInUsage, 17 | AppSettings::UnifiedHelpMessage, 18 | ]; 19 | 20 | struct Config { 21 | input_filename: String, 22 | input_format: EmbeddingFormat, 23 | n_attempts: usize, 24 |
n_iterations: usize, 25 | n_subquantizers: Option<usize>, 26 | n_threads: usize, 27 | output_filename: String, 28 | quantizer: String, 29 | quantizer_bits: u32, 30 | } 31 | 32 | // Option constants 33 | static INPUT_FORMAT: &str = "input_format"; 34 | static N_ATTEMPTS: &str = "n_attempts"; 35 | static N_ITERATIONS: &str = "n_iterations"; 36 | static N_SUBQUANTIZERS: &str = "n_subquantizers"; 37 | static N_THREADS: &str = "n_threads"; 38 | static QUANTIZER: &str = "quantizer"; 39 | static QUANTIZER_BITS: &str = "quantizer_bits"; 40 | 41 | // Argument constants 42 | static INPUT: &str = "INPUT"; 43 | static OUTPUT: &str = "OUTPUT"; 44 | 45 | fn config_from_matches(matches: &ArgMatches) -> Config { 46 | // Arguments 47 | let input_filename = matches.value_of(INPUT).unwrap().to_owned(); 48 | let output_filename = matches.value_of(OUTPUT).unwrap().to_owned(); 49 | 50 | // Options 51 | let input_format = matches 52 | .value_of(INPUT_FORMAT) 53 | .map(|v| EmbeddingFormat::try_from(v).or_exit("Cannot parse input format", 1)) 54 | .unwrap_or(EmbeddingFormat::Word2Vec); 55 | let n_attempts = matches 56 | .value_of(N_ATTEMPTS) 57 | .map(|a| a.parse().or_exit("Cannot parse number of attempts", 1)) 58 | .unwrap_or(1); 59 | let n_iterations = matches 60 | .value_of(N_ITERATIONS) 61 | .map(|a| a.parse().or_exit("Cannot parse number of iterations", 1)) 62 | .unwrap_or(100); 63 | let n_subquantizers = matches 64 | .value_of(N_SUBQUANTIZERS) 65 | .map(|a| a.parse().or_exit("Cannot parse number of subquantizers", 1)); 66 | let n_threads = matches 67 | .value_of(N_THREADS) 68 | .map(|a| a.parse().or_exit("Cannot parse number of threads", 1)) 69 | .unwrap_or(num_cpus::get() / 2); 70 | let quantizer = matches 71 | .value_of(QUANTIZER) 72 | .map(ToOwned::to_owned) 73 | .unwrap_or_else(|| "pq".to_owned()); 74 | let quantizer_bits = matches 75 | .value_of(QUANTIZER_BITS) 76 | .map(|a| { 77 | a.parse() 78 | .or_exit("Cannot parse number of quantizer bits", 1) 79 | }) 80 | .unwrap_or(8); 81 | if
quantizer_bits > 8 { 82 | eprintln!( 83 | "Maximum number of quantizer bits: 8, was: {}", 84 | quantizer_bits 85 | ); 86 | process::exit(1); 87 | } 88 | 89 | Config { 90 | input_filename, 91 | input_format, 92 | n_attempts, 93 | n_iterations, 94 | n_subquantizers, 95 | n_threads, 96 | output_filename, 97 | quantizer, 98 | quantizer_bits, 99 | } 100 | } 101 | 102 | fn cosine_similarity(u: ArrayView1<f32>, v: ArrayView1<f32>) -> f32 { 103 | let u_norm = u.dot(&u).sqrt(); 104 | let v_norm = v.dot(&v).sqrt(); 105 | u.dot(&v) / (u_norm * v_norm) 106 | } 107 | 108 | fn euclidean_distance(u: ArrayView1<f32>, v: ArrayView1<f32>) -> f32 { 109 | let dist_vec = &u - &v; 110 | dist_vec.dot(&dist_vec).sqrt() 111 | } 112 | 113 | fn parse_args() -> ArgMatches<'static> { 114 | App::new("r2v-quantize") 115 | .settings(DEFAULT_CLAP_SETTINGS) 116 | .arg( 117 | Arg::with_name(INPUT) 118 | .help("Embeddings file") 119 | .index(1) 120 | .required(true), 121 | ) 122 | .arg(Arg::with_name(OUTPUT).help("Output file").index(2)) 123 | .arg( 124 | Arg::with_name(N_ATTEMPTS) 125 | .short("a") 126 | .long("attempts") 127 | .value_name("N") 128 | .help("Number of quantization attempts (default: 1)") 129 | .takes_value(true), 130 | ) 131 | .arg( 132 | Arg::with_name(QUANTIZER_BITS) 133 | .short("b") 134 | .long("bits") 135 | .value_name("N") 136 | .help("Number of quantizer bits (default: 8, max: 8)") 137 | .takes_value(true), 138 | ) 139 | .arg( 140 | Arg::with_name(INPUT_FORMAT) 141 | .short("f") 142 | .long("from") 143 | .value_name("FORMAT") 144 | .help("Input format: finalfusion, text, textdims, word2vec (default: word2vec)") 145 | .takes_value(true), 146 | ) 147 | .arg( 148 | Arg::with_name(N_ITERATIONS) 149 | .short("i") 150 | .long("iter") 151 | .value_name("N") 152 | .help("Number of iterations (default: 100)") 153 | .takes_value(true), 154 | ) 155 | .arg( 156 | Arg::with_name(QUANTIZER) 157 | .short("q") 158 | .long("quantizer") 159 | .value_name("QUANTIZER") 160 | .help("Quantizer: opq, pq, or
gaussian_opq (default: pq)") 161 | .takes_value(true), 162 | ) 163 | .arg( 164 | Arg::with_name(N_SUBQUANTIZERS) 165 | .short("s") 166 | .long("subquantizers") 167 | .value_name("N") 168 | .help("Number of subquantizers (default: d/2)") 169 | .takes_value(true), 170 | ) 171 | .arg( 172 | Arg::with_name(N_THREADS) 173 | .short("t") 174 | .long("threads") 175 | .value_name("N") 176 | .help("Number of threads (default: logical_cpus / 2)") 177 | .takes_value(true), 178 | ) 179 | .get_matches() 180 | } 181 | 182 | fn print_loss(storage: &StorageView, quantized_storage: &Storage) { 183 | let mut cosine_similarity_sum = 0f32; 184 | let mut euclidean_distance_sum = 0f32; 185 | 186 | for (idx, embedding) in storage.view().outer_iter().enumerate() { 187 | let reconstruction = quantized_storage.embedding(idx); 188 | cosine_similarity_sum += cosine_similarity(embedding, reconstruction.as_view()); 189 | euclidean_distance_sum += euclidean_distance(embedding, reconstruction.as_view()); 190 | } 191 | 192 | eprintln!( 193 | "Average cosine similarity: {}", 194 | cosine_similarity_sum / storage.view().rows() as f32 195 | ); 196 | 197 | eprintln!( 198 | "Average euclidean distance: {}", 199 | euclidean_distance_sum / storage.view().rows() as f32 200 | ); 201 | } 202 | 203 | #[cfg(not(feature = "opq"))] 204 | fn quantize_storage(config: &Config, storage: &impl StorageView) -> QuantizedArray { 205 | let n_subquantizers = config.n_subquantizers.unwrap_or(storage.shape().1 / 2); 206 | 207 | match config.quantizer.as_str() { 208 | "pq" => storage.quantize::<PQ<f32>>( 209 | n_subquantizers, 210 | config.quantizer_bits, 211 | config.n_iterations, 212 | config.n_attempts, 213 | true, 214 | ), 215 | quantizer => { 216 | eprintln!("Unknown quantizer: {}", quantizer); 217 | process::exit(1); 218 | } 219 | } 220 | } 221 | 222 | #[cfg(feature = "opq")] 223 | fn quantize_storage(config: &Config, storage: &impl StorageView) -> QuantizedArray { 224 | let n_subquantizers =
config.n_subquantizers.unwrap_or(storage.shape().1 / 2); 225 | 226 | match config.quantizer.as_str() { 227 | "pq" => storage.quantize::<PQ<f32>>( 228 | n_subquantizers, 229 | config.quantizer_bits, 230 | config.n_iterations, 231 | config.n_attempts, 232 | true, 233 | ), 234 | "opq" => storage.quantize::<OPQ>( 235 | n_subquantizers, 236 | config.quantizer_bits, 237 | config.n_iterations, 238 | config.n_attempts, 239 | true, 240 | ), 241 | "gaussian_opq" => storage.quantize::<GaussianOPQ>( 242 | n_subquantizers, 243 | config.quantizer_bits, 244 | config.n_iterations, 245 | config.n_attempts, 246 | true, 247 | ), 248 | quantizer => { 249 | eprintln!("Unknown quantizer: {}", quantizer); 250 | process::exit(1); 251 | } 252 | } 253 | } 254 | 255 | fn write_embeddings(embeddings: &Embeddings<VocabWrap, QuantizedArray>, filename: &str) { 256 | let f = File::create(filename).or_exit("Cannot create embeddings file", 1); 257 | let mut writer = BufWriter::new(f); 258 | embeddings 259 | .write_embeddings(&mut writer) 260 | .or_exit("Cannot write embeddings", 1) 261 | } 262 | 263 | fn main() { 264 | env_logger::init(); 265 | 266 | let matches = parse_args(); 267 | let config = config_from_matches(&matches); 268 | 269 | ThreadPoolBuilder::new() 270 | .num_threads(config.n_threads) 271 | .build_global() 272 | .unwrap(); 273 | 274 | let embeddings = read_embeddings_view(&config.input_filename, config.input_format) 275 | .or_exit("Cannot read embeddings", 1); 276 | 277 | // Quantize 278 | let quantized_storage = quantize_storage(&config, embeddings.storage()); 279 | let quantized_embeddings = Embeddings::new( 280 | embeddings.metadata().cloned(), 281 | embeddings.vocab().clone(), 282 | quantized_storage, 283 | ); 284 | 285 | write_embeddings(&quantized_embeddings, &config.output_filename); 286 | 287 | print_loss(embeddings.storage(), quantized_embeddings.storage()); 288 | } 289 | -------------------------------------------------------------------------------- /rust2vec-utils/src/bin/r2v-similar.rs:
-------------------------------------------------------------------------------- 1 | use std::io::BufRead; 2 | 3 | use clap::{App, AppSettings, Arg, ArgMatches}; 4 | use rust2vec::similarity::Similarity; 5 | use rust2vec_utils::{read_embeddings_view, EmbeddingFormat}; 6 | use stdinout::{Input, OrExit}; 7 | 8 | static DEFAULT_CLAP_SETTINGS: &[AppSettings] = &[ 9 | AppSettings::DontCollapseArgsInUsage, 10 | AppSettings::UnifiedHelpMessage, 11 | ]; 12 | 13 | fn parse_args() -> ArgMatches<'static> { 14 | App::new("r2v-similar") 15 | .settings(DEFAULT_CLAP_SETTINGS) 16 | .arg( 17 | Arg::with_name("format") 18 | .short("f") 19 | .value_name("FORMAT") 20 | .help("Embedding format: finalfusion, finalfusion_mmap, word2vec, text, or textdims (default: finalfusion)") 21 | .takes_value(true), 22 | ) 23 | .arg( 24 | Arg::with_name("neighbors") 25 | .short("k") 26 | .value_name("K") 27 | .help("Return K nearest neighbors (default: 10)") 28 | .takes_value(true), 29 | ) 30 | .arg( 31 | Arg::with_name("EMBEDDINGS") 32 | .help("Embeddings file") 33 | .index(1) 34 | .required(true), 35 | ) 36 | .arg(Arg::with_name("INPUT").help("Input words").index(2)) 37 | .get_matches() 38 | } 39 | 40 | struct Config { 41 | embeddings_filename: String, 42 | embedding_format: EmbeddingFormat, 43 | k: usize, 44 | } 45 | 46 | fn config_from_matches<'a>(matches: &ArgMatches<'a>) -> Config { 47 | let embeddings_filename = matches.value_of("EMBEDDINGS").unwrap().to_owned(); 48 | 49 | let embedding_format = matches 50 | .value_of("format") 51 | .map(|f| EmbeddingFormat::try_from(f).or_exit("Cannot parse embedding format", 1)) 52 | .unwrap_or(EmbeddingFormat::FinalFusion); 53 | 54 | let k = matches 55 | .value_of("neighbors") 56 | .map(|v| v.parse().or_exit("Cannot parse k", 1)) 57 | .unwrap_or(10); 58 | 59 | Config { 60 | embeddings_filename, 61 | embedding_format, 62 | k, 63 | } 64 | } 65 | 66 | fn main() { 67 | let matches = parse_args(); 68 | let config = config_from_matches(&matches); 69 | 70 | let 
embeddings = read_embeddings_view(&config.embeddings_filename, config.embedding_format) 71 | .or_exit("Cannot read embeddings", 1); 72 | 73 | let input = Input::from(matches.value_of("INPUT")); 74 | let reader = input.buf_read().or_exit("Cannot open input for reading", 1); 75 | 76 | for line in reader.lines() { 77 | let line = line.or_exit("Cannot read line", 1).trim().to_owned(); 78 | if line.is_empty() { 79 | continue; 80 | } 81 | 82 | let results = match embeddings.similarity(&line, config.k) { 83 | Some(results) => results, 84 | None => continue, 85 | }; 86 | 87 | for similar in results { 88 | println!("{}\t{}", similar.word, similar.similarity); 89 | } 90 | } 91 | } 92 | -------------------------------------------------------------------------------- /rust2vec-utils/src/lib.rs: -------------------------------------------------------------------------------- 1 | use std::fs::File; 2 | use std::io::BufReader; 3 | 4 | use failure::{format_err, Error, ResultExt}; 5 | 6 | use rust2vec::prelude::*; 7 | 8 | #[derive(Clone, Copy, PartialEq, Eq)] 9 | pub enum EmbeddingFormat { 10 | FinalFusion, 11 | FinalFusionMmap, 12 | Word2Vec, 13 | Text, 14 | TextDims, 15 | } 16 | 17 | impl EmbeddingFormat { 18 | pub fn try_from(format: impl AsRef<str>) -> Result<Self, Error> { 19 | use EmbeddingFormat::*; 20 | 21 | match format.as_ref() { 22 | "finalfusion" => Ok(FinalFusion), 23 | "finalfusion_mmap" => Ok(FinalFusionMmap), 24 | "word2vec" => Ok(Word2Vec), 25 | "text" => Ok(Text), 26 | "textdims" => Ok(TextDims), 27 | unknown => Err(format_err!("Unknown embedding format: {}", unknown)), 28 | } 29 | } 30 | } 31 | 32 | pub fn read_embeddings_view( 33 | filename: &str, 34 | embedding_format: EmbeddingFormat, 35 | ) -> Result<Embeddings<VocabWrap, StorageViewWrap>, Error> { 36 | let f = File::open(filename).context("Cannot open embeddings file")?; 37 | let mut reader = BufReader::new(f); 38 | 39 | use EmbeddingFormat::*; 40 | let embeddings = match embedding_format { 41 | FinalFusion => ReadEmbeddings::read_embeddings(&mut reader), 42 |
FinalFusionMmap => MmapEmbeddings::mmap_embeddings(&mut reader), 43 | Word2Vec => ReadWord2Vec::read_word2vec_binary(&mut reader, true).map(Embeddings::into), 44 | Text => ReadText::read_text(&mut reader, true).map(Embeddings::into), 45 | TextDims => ReadTextDims::read_text_dims(&mut reader, true).map(Embeddings::into), 46 | } 47 | .context("Cannot read embeddings")?; 48 | 49 | Ok(embeddings) 50 | } 51 | -------------------------------------------------------------------------------- /rust2vec/Cargo.toml: -------------------------------------------------------------------------------- 1 | [package] 2 | name = "rust2vec" 3 | version = "0.5.0" 4 | edition = "2018" 5 | authors = ["Daniël de Kok "] 6 | description = "Reader and writer for common word embedding formats" 7 | documentation = "https://docs.rs/rust2vec/" 8 | keywords = ["embeddings", "word2vec", "glove", "finalfusion", "subword"] 9 | homepage = "https://github.com/danieldk/rust2vec" 10 | repository = "https://github.com/danieldk/rust2vec" 11 | license = "Apache-2.0" 12 | readme = "../README.md" 13 | exclude = [ 14 | ".gitignore", 15 | ".travis.yml" 16 | ] 17 | 18 | [badges] 19 | maintenance = { status = "deprecated" } 20 | 21 | [dependencies] 22 | byteorder = "1" 23 | failure = "0.1" 24 | fnv = "1" 25 | itertools = "0.8" 26 | memmap = "0.7" 27 | ndarray = "0.12" 28 | ordered-float = "1" 29 | rand = "0.6" 30 | rand_xorshift = "0.1" 31 | reductive = "0.2" 32 | toml = "0.4" 33 | 34 | [dev-dependencies] 35 | maplit = "1" 36 | lazy_static = "1" 37 | -------------------------------------------------------------------------------- /rust2vec/src/embeddings.rs: -------------------------------------------------------------------------------- 1 | //! Word embeddings. 
2 | 3 | use std::fs::File; 4 | use std::io::{BufReader, Read, Seek, Write}; 5 | use std::iter::Enumerate; 6 | use std::mem; 7 | use std::slice; 8 | 9 | use failure::{ensure, Error}; 10 | use ndarray::Array1; 11 | 12 | use crate::io::{ 13 | private::{ChunkIdentifier, Header, MmapChunk, ReadChunk, WriteChunk}, 14 | MmapEmbeddings, ReadEmbeddings, WriteEmbeddings, 15 | }; 16 | use crate::metadata::Metadata; 17 | use crate::storage::{ 18 | CowArray, CowArray1, MmapArray, NdArray, QuantizedArray, Storage, StorageViewWrap, StorageWrap, 19 | }; 20 | use crate::util::l2_normalize; 21 | use crate::vocab::{SimpleVocab, SubwordVocab, Vocab, VocabWrap, WordIndex}; 22 | 23 | /// Word embeddings. 24 | /// 25 | /// This data structure stores word embeddings (also known as *word vectors*) 26 | /// and provides some useful methods on the embeddings, such as similarity 27 | /// and analogy queries. 28 | #[derive(Debug)] 29 | pub struct Embeddings<V, S> { 30 | metadata: Option<Metadata>, 31 | storage: S, 32 | vocab: V, 33 | } 34 | 35 | impl<V, S> Embeddings<V, S> { 36 | /// Construct embeddings from a vocabulary and storage. 37 | pub fn new(metadata: Option<Metadata>, vocab: V, storage: S) -> Self { 38 | Embeddings { 39 | metadata, 40 | vocab, 41 | storage, 42 | } 43 | } 44 | 45 | /// Decompose embeddings into its metadata, vocabulary, and storage. 46 | pub fn into_parts(self) -> (Option<Metadata>, V, S) { 47 | (self.metadata, self.vocab, self.storage) 48 | } 49 | 50 | /// Get metadata. 51 | pub fn metadata(&self) -> Option<&Metadata> { 52 | self.metadata.as_ref() 53 | } 54 | 55 | /// Get metadata mutably. 56 | pub fn metadata_mut(&mut self) -> Option<&mut Metadata> { 57 | self.metadata.as_mut() 58 | } 59 | 60 | /// Set metadata. 61 | /// 62 | /// Returns the previously-stored metadata. 63 | pub fn set_metadata(&mut self, mut metadata: Option<Metadata>) -> Option<Metadata> { 64 | mem::swap(&mut self.metadata, &mut metadata); 65 | metadata 66 | } 67 | 68 | /// Get the embedding storage.
69 | pub fn storage(&self) -> &S { 70 | &self.storage 71 | } 72 | 73 | /// Get the vocabulary. 74 | pub fn vocab(&self) -> &V { 75 | &self.vocab 76 | } 77 | } 78 | 79 | #[allow(clippy::len_without_is_empty)] 80 | impl<V, S> Embeddings<V, S> 81 | where 82 | V: Vocab, 83 | S: Storage, 84 | { 85 | /// Return the length (in vector components) of the word embeddings. 86 | pub fn dims(&self) -> usize { 87 | self.storage.shape().1 88 | } 89 | 90 | /// Get the embedding of a word. 91 | pub fn embedding(&self, word: &str) -> Option<CowArray1<f32>> { 92 | match self.vocab.idx(word)? { 93 | WordIndex::Word(idx) => Some(self.storage.embedding(idx)), 94 | WordIndex::Subword(indices) => { 95 | let mut embed = Array1::zeros((self.storage.shape().1,)); 96 | for idx in indices { 97 | embed += &self.storage.embedding(idx).as_view(); 98 | } 99 | 100 | l2_normalize(embed.view_mut()); 101 | 102 | Some(CowArray::Owned(embed)) 103 | } 104 | } 105 | } 106 | 107 | /// Get an iterator over pairs of words and the corresponding embeddings. 108 | pub fn iter(&self) -> Iter { 109 | Iter { 110 | storage: &self.storage, 111 | inner: self.vocab.words().iter().enumerate(), 112 | } 113 | } 114 | 115 | /// Get the vocabulary size. 116 | /// 117 | /// The vocabulary size excludes subword units. 118 | pub fn len(&self) -> usize { 119 | self.vocab.len() 120 | } 121 | } 122 | 123 | macro_rules! impl_embeddings_from( 124 | ($vocab:ty, $storage:ty, $storage_wrap:ty) => { 125 | impl From<Embeddings<$vocab, $storage>> for Embeddings<$vocab, $storage_wrap> { 126 | fn from(from: Embeddings<$vocab, $storage>) -> Self { 127 | let (metadata, vocab, storage) = from.into_parts(); 128 | Embeddings::new(metadata, vocab.into(), storage.into()) 129 | } 130 | } 131 | } 132 | ); 133 | 134 | // Hmpf. With the blanket From<T> for T implementation, we need 135 | // specialization to generalize this.
136 | impl_embeddings_from!(SimpleVocab, NdArray, StorageWrap); 137 | impl_embeddings_from!(SimpleVocab, NdArray, StorageViewWrap); 138 | impl_embeddings_from!(SimpleVocab, MmapArray, StorageWrap); 139 | impl_embeddings_from!(SimpleVocab, MmapArray, StorageViewWrap); 140 | impl_embeddings_from!(SimpleVocab, QuantizedArray, StorageWrap); 141 | impl_embeddings_from!(SubwordVocab, NdArray, StorageWrap); 142 | impl_embeddings_from!(SubwordVocab, NdArray, StorageViewWrap); 143 | impl_embeddings_from!(SubwordVocab, MmapArray, StorageWrap); 144 | impl_embeddings_from!(SubwordVocab, MmapArray, StorageViewWrap); 145 | impl_embeddings_from!(SubwordVocab, QuantizedArray, StorageWrap); 146 | 147 | impl<'a, V, S> IntoIterator for &'a Embeddings<V, S> 148 | where 149 | V: Vocab, 150 | S: Storage, 151 | { 152 | type Item = (&'a str, CowArray1<'a, f32>); 153 | type IntoIter = Iter<'a>; 154 | 155 | fn into_iter(self) -> Self::IntoIter { 156 | self.iter() 157 | } 158 | } 159 | 160 | impl<V, S> MmapEmbeddings for Embeddings<V, S> 161 | where 162 | Self: Sized, 163 | V: ReadChunk, 164 | S: MmapChunk, 165 | { 166 | fn mmap_embeddings(read: &mut BufReader<File>) -> Result<Self, Error> { 167 | let header = Header::read_chunk(read)?; 168 | let chunks = header.chunk_identifiers(); 169 | ensure!(!chunks.is_empty(), "Embedding file without chunks."); 170 | 171 | let metadata = if header.chunk_identifiers()[0] == ChunkIdentifier::Metadata { 172 | Some(Metadata::read_chunk(read)?)
173 | } else { 174 | None 175 | }; 176 | 177 | let vocab = V::read_chunk(read)?; 178 | let storage = S::mmap_chunk(read)?; 179 | 180 | Ok(Embeddings { 181 | metadata, 182 | vocab, 183 | storage, 184 | }) 185 | } 186 | } 187 | 188 | impl<V, S> ReadEmbeddings for Embeddings<V, S> 189 | where 190 | V: ReadChunk, 191 | S: ReadChunk, 192 | { 193 | fn read_embeddings<R>(read: &mut R) -> Result<Self, Error> 194 | where 195 | R: Read + Seek, 196 | { 197 | let header = Header::read_chunk(read)?; 198 | let chunks = header.chunk_identifiers(); 199 | ensure!(!chunks.is_empty(), "Embedding file without chunks."); 200 | 201 | let metadata = if header.chunk_identifiers()[0] == ChunkIdentifier::Metadata { 202 | Some(Metadata::read_chunk(read)?) 203 | } else { 204 | None 205 | }; 206 | 207 | let vocab = V::read_chunk(read)?; 208 | let storage = S::read_chunk(read)?; 209 | 210 | Ok(Embeddings { 211 | metadata, 212 | vocab, 213 | storage, 214 | }) 215 | } 216 | } 217 | 218 | impl<V, S> WriteEmbeddings for Embeddings<V, S> 219 | where 220 | V: WriteChunk, 221 | S: WriteChunk, 222 | { 223 | fn write_embeddings<W>(&self, write: &mut W) -> Result<(), Error> 224 | where 225 | W: Write + Seek, 226 | { 227 | let mut chunks = match self.metadata { 228 | Some(ref metadata) => vec![metadata.chunk_identifier()], 229 | None => vec![], 230 | }; 231 | 232 | chunks.extend_from_slice(&[ 233 | self.vocab.chunk_identifier(), 234 | self.storage.chunk_identifier(), 235 | ]); 236 | 237 | Header::new(chunks).write_chunk(write)?; 238 | if let Some(ref metadata) = self.metadata { 239 | metadata.write_chunk(write)?; 240 | } 241 | 242 | self.vocab.write_chunk(write)?; 243 | self.storage.write_chunk(write)?; 244 | Ok(()) 245 | } 246 | } 247 | 248 | /// Iterator over embeddings.
249 | pub struct Iter<'a> { 250 | storage: &'a Storage, 251 | inner: Enumerate<slice::Iter<'a, String>>, 252 | } 253 | 254 | impl<'a> Iterator for Iter<'a> { 255 | type Item = (&'a str, CowArray1<'a, f32>); 256 | 257 | fn next(&mut self) -> Option<Self::Item> { 258 | self.inner 259 | .next() 260 | .map(|(idx, word)| (word.as_str(), self.storage.embedding(idx))) 261 | } 262 | } 263 | 264 | #[cfg(test)] 265 | mod tests { 266 | use std::fs::File; 267 | use std::io::{BufReader, Cursor, Seek, SeekFrom}; 268 | 269 | use toml::{toml, toml_internal}; 270 | 271 | use super::Embeddings; 272 | use crate::io::{MmapEmbeddings, ReadEmbeddings, WriteEmbeddings}; 273 | use crate::metadata::Metadata; 274 | use crate::storage::{MmapArray, NdArray, StorageView}; 275 | use crate::vocab::SimpleVocab; 276 | use crate::word2vec::ReadWord2Vec; 277 | 278 | fn test_embeddings() -> Embeddings<SimpleVocab, NdArray> { 279 | let mut reader = BufReader::new(File::open("testdata/similarity.bin").unwrap()); 280 | Embeddings::read_word2vec_binary(&mut reader, false).unwrap() 281 | } 282 | 283 | fn test_metadata() -> Metadata { 284 | Metadata(toml!
{ 285 | [hyperparameters] 286 | dims = 300 287 | ns = 5 288 | 289 | [description] 290 | description = "Test model" 291 | language = "de" 292 | }) 293 | } 294 | 295 | #[test] 296 | fn mmap() { 297 | let check_embeds = test_embeddings(); 298 | let mut reader = BufReader::new(File::open("testdata/similarity.fifu").unwrap()); 299 | let embeds: Embeddings<SimpleVocab, MmapArray> = 300 | Embeddings::mmap_embeddings(&mut reader).unwrap(); 301 | assert_eq!(embeds.vocab(), check_embeds.vocab()); 302 | assert_eq!(embeds.storage().view(), check_embeds.storage().view()); 303 | } 304 | 305 | #[test] 306 | fn write_read_simple_roundtrip() { 307 | let check_embeds = test_embeddings(); 308 | let mut cursor = Cursor::new(Vec::new()); 309 | check_embeds.write_embeddings(&mut cursor).unwrap(); 310 | cursor.seek(SeekFrom::Start(0)).unwrap(); 311 | let embeds: Embeddings<SimpleVocab, NdArray> = 312 | Embeddings::read_embeddings(&mut cursor).unwrap(); 313 | assert_eq!(embeds.storage().view(), check_embeds.storage().view()); 314 | assert_eq!(embeds.vocab(), check_embeds.vocab()); 315 | } 316 | 317 | #[test] 318 | fn write_read_simple_metadata_roundtrip() { 319 | let mut check_embeds = test_embeddings(); 320 | check_embeds.set_metadata(Some(test_metadata())); 321 | 322 | let mut cursor = Cursor::new(Vec::new()); 323 | check_embeds.write_embeddings(&mut cursor).unwrap(); 324 | cursor.seek(SeekFrom::Start(0)).unwrap(); 325 | let embeds: Embeddings<SimpleVocab, NdArray> = 326 | Embeddings::read_embeddings(&mut cursor).unwrap(); 327 | assert_eq!(embeds.storage().view(), check_embeds.storage().view()); 328 | assert_eq!(embeds.vocab(), check_embeds.vocab()); 329 | } 330 | } 331 | -------------------------------------------------------------------------------- /rust2vec/src/io.rs: -------------------------------------------------------------------------------- 1 | //! Traits for I/O. 2 | //! 3 | //! This module provides traits for reading embeddings 4 | //! (`ReadEmbeddings`), memory mapping embeddings (`MmapEmbeddings`), 5 | //!
and writing embeddings (`WriteEmbeddings`). 6 | 7 | use std::fs::File; 8 | use std::io::{BufReader, Read, Seek, Write}; 9 | 10 | use failure::Error; 11 | 12 | /// Read finalfusion embeddings. 13 | /// 14 | /// This trait is used to read embeddings in the finalfusion format. 15 | /// Implementations are provided for the vocabulary and storage types 16 | /// in this crate. 17 | /// 18 | /// ``` 19 | /// use std::fs::File; 20 | /// 21 | /// use rust2vec::prelude::*; 22 | /// 23 | /// let mut f = File::open("testdata/similarity.fifu").unwrap(); 24 | /// let embeddings: Embeddings<VocabWrap, StorageWrap> = 25 | /// Embeddings::read_embeddings(&mut f).unwrap(); 26 | /// ``` 27 | pub trait ReadEmbeddings 28 | where 29 | Self: Sized, 30 | { 31 | /// Read the embeddings. 32 | fn read_embeddings<R>(read: &mut R) -> Result<Self, Error> 33 | where 34 | R: Read + Seek; 35 | } 36 | 37 | /// Read finalfusion embeddings metadata. 38 | /// 39 | /// This trait is used to read the metadata of embeddings in the 40 | /// finalfusion format. This is typically faster than 41 | /// `ReadEmbeddings::read_embeddings`. 42 | /// 43 | /// ``` 44 | /// use std::fs::File; 45 | /// 46 | /// use rust2vec::prelude::*; 47 | /// 48 | /// let mut f = File::open("testdata/similarity.fifu").unwrap(); 49 | /// let metadata: Option<Metadata> = 50 | /// ReadMetadata::read_metadata(&mut f).unwrap(); 51 | /// ``` 52 | pub trait ReadMetadata 53 | where 54 | Self: Sized, 55 | { 56 | /// Read the metadata. 57 | fn read_metadata<R>(read: &mut R) -> Result<Self, Error> 58 | where 59 | R: Read + Seek; 60 | } 61 | 62 | /// Memory-map finalfusion embeddings. 63 | /// 64 | /// This trait is used to read finalfusion embeddings while [memory 65 | /// mapping](https://en.wikipedia.org/wiki/Mmap) the embedding matrix. 66 | /// This leads to considerable memory savings, since the operating 67 | /// system will load the relevant pages from disk on demand. 68 | /// 69 | /// Memory mapping is currently not implemented for quantized 70 | /// matrices.
71 | pub trait MmapEmbeddings 72 | where 73 | Self: Sized, 74 | { 75 | fn mmap_embeddings(read: &mut BufReader<File>) -> Result<Self, Error>; 76 | } 77 | 78 | /// Write embeddings in finalfusion format. 79 | /// 80 | /// This trait is used to write embeddings in finalfusion 81 | /// format. Writing in finalfusion format is supported regardless of 82 | /// the original format of the embeddings. 83 | pub trait WriteEmbeddings { 84 | fn write_embeddings<W>(&self, write: &mut W) -> Result<(), Error> 85 | where 86 | W: Write + Seek; 87 | } 88 | 89 | pub(crate) mod private { 90 | use std::fs::File; 91 | use std::io::{BufReader, Read, Seek, Write}; 92 | 93 | use byteorder::{LittleEndian, ReadBytesExt, WriteBytesExt}; 94 | use failure::{ensure, format_err, Error, ResultExt}; 95 | 96 | const MODEL_VERSION: u32 = 0; 97 | 98 | const MAGIC: [u8; 4] = [b'F', b'i', b'F', b'u']; 99 | 100 | #[derive(Clone, Copy, Debug, PartialEq, Eq)] 101 | #[repr(u32)] 102 | pub enum ChunkIdentifier { 103 | Header = 0, 104 | SimpleVocab = 1, 105 | NdArray = 2, 106 | SubwordVocab = 3, 107 | QuantizedArray = 4, 108 | Metadata = 5, 109 | } 110 | 111 | impl ChunkIdentifier { 112 | pub fn try_from(identifier: u32) -> Option<Self> { 113 | use ChunkIdentifier::*; 114 | 115 | match identifier { 116 | 1 => Some(SimpleVocab), 117 | 2 => Some(NdArray), 118 | 3 => Some(SubwordVocab), 119 | 4 => Some(QuantizedArray), 120 | 5 => Some(Metadata), 121 | _ => None, 122 | } 123 | } 124 | } 125 | 126 | pub trait TypeId { 127 | fn type_id() -> u32; 128 | } 129 | 130 | macro_rules! typeid_impl { 131 | ($type:ty, $id:expr) => { 132 | impl TypeId for $type { 133 | fn type_id() -> u32 { 134 | $id 135 | } 136 | } 137 | }; 138 | } 139 | 140 | typeid_impl!(f32, 10); 141 | typeid_impl!(u8, 1); 142 | 143 | pub trait ReadChunk 144 | where 145 | Self: Sized, 146 | { 147 | fn read_chunk<R>(read: &mut R) -> Result<Self, Error> 148 | where 149 | R: Read + Seek; 150 | } 151 | 152 | /// Memory-mappable chunks.
153 | pub trait MmapChunk 154 | where 155 | Self: Sized, 156 | { 157 | /// Memory map a chunk. 158 | /// 159 | /// The given `File` object should be positioned at the start of the chunk. 160 | fn mmap_chunk(read: &mut BufReader<File>) -> Result<Self, Error>; 161 | } 162 | 163 | pub trait WriteChunk { 164 | /// Get the identifier of a chunk. 165 | fn chunk_identifier(&self) -> ChunkIdentifier; 166 | 167 | fn write_chunk<W>(&self, write: &mut W) -> Result<(), Error> 168 | where 169 | W: Write + Seek; 170 | } 171 | 172 | #[derive(Debug, Eq, PartialEq)] 173 | pub(crate) struct Header { 174 | chunk_identifiers: Vec<ChunkIdentifier>, 175 | } 176 | 177 | impl Header { 178 | pub fn new(chunk_identifiers: impl Into<Vec<ChunkIdentifier>>) -> Self { 179 | Header { 180 | chunk_identifiers: chunk_identifiers.into(), 181 | } 182 | } 183 | 184 | pub fn chunk_identifiers(&self) -> &[ChunkIdentifier] { 185 | &self.chunk_identifiers 186 | } 187 | } 188 | 189 | impl WriteChunk for Header { 190 | fn chunk_identifier(&self) -> ChunkIdentifier { 191 | ChunkIdentifier::Header 192 | } 193 | 194 | fn write_chunk<W>(&self, write: &mut W) -> Result<(), Error> 195 | where 196 | W: Write + Seek, 197 | { 198 | write.write_all(&MAGIC)?; 199 | write.write_u32::<LittleEndian>(MODEL_VERSION)?; 200 | write.write_u32::<LittleEndian>(self.chunk_identifiers.len() as u32)?; 201 | 202 | for &identifier in &self.chunk_identifiers { 203 | write.write_u32::<LittleEndian>(identifier as u32)? 204 | } 205 | 206 | Ok(()) 207 | } 208 | } 209 | 210 | impl ReadChunk for Header { 211 | fn read_chunk<R>(read: &mut R) -> Result<Self, Error> 212 | where 213 | R: Read + Seek, 214 | { 215 | // Magic and version ceremony.
216 | let mut magic = [0u8; 4]; 217 | read.read_exact(&mut magic)?; 218 | ensure!( 219 | magic == MAGIC, 220 | "File does not have finalfusion magic, expected: {}, was: {}", 221 | String::from_utf8_lossy(&MAGIC), 222 | String::from_utf8_lossy(&magic) 223 | ); 224 | let version = read.read_u32::<LittleEndian>()?; 225 | ensure!( 226 | version == MODEL_VERSION, 227 | "Unknown model version, expected: {}, was: {}", 228 | MODEL_VERSION, 229 | version 230 | ); 231 | 232 | // Read chunk identifiers. 233 | let chunk_identifiers_len = read.read_u32::<LittleEndian>()? as usize; 234 | let mut chunk_identifiers = Vec::with_capacity(chunk_identifiers_len); 235 | for _ in 0..chunk_identifiers_len { 236 | let identifier = read 237 | .read_u32::<LittleEndian>() 238 | .with_context(|e| format!("Cannot read chunk identifier: {}", e))?; 239 | let chunk_identifier = ChunkIdentifier::try_from(identifier) 240 | .ok_or_else(|| format_err!("Unknown chunk identifier: {}", identifier))?; 241 | chunk_identifiers.push(chunk_identifier); 242 | } 243 | 244 | Ok(Header { chunk_identifiers }) 245 | } 246 | } 247 | 248 | } 249 | 250 | #[cfg(test)] 251 | mod tests { 252 | use std::io::{Cursor, Seek, SeekFrom}; 253 | 254 | use crate::io::private::{ChunkIdentifier, Header, ReadChunk, WriteChunk}; 255 | 256 | #[test] 257 | fn header_write_read_roundtrip() { 258 | let check_header = 259 | Header::new(vec![ChunkIdentifier::SimpleVocab, ChunkIdentifier::NdArray]); 260 | let mut cursor = Cursor::new(Vec::new()); 261 | check_header.write_chunk(&mut cursor).unwrap(); 262 | cursor.seek(SeekFrom::Start(0)).unwrap(); 263 | let header = Header::read_chunk(&mut cursor).unwrap(); 264 | assert_eq!(header, check_header); 265 | } 266 | } 267 | -------------------------------------------------------------------------------- /rust2vec/src/lib.rs: -------------------------------------------------------------------------------- 1 | //! A library for reading, writing, and using word embeddings. 2 | //! 3 | //!
rust2vec allows you to read, write, and use word2vec and GloVe 4 | //! embeddings. rust2vec uses *finalfusion* as its native data 5 | //! format, which has several benefits over the word2vec and GloVe 6 | //! formats. 7 | 8 | #[deprecated(note = "rust2vec is superseded by the finalfusion crate")] 9 | pub mod embeddings; 10 | 11 | pub mod io; 12 | 13 | pub mod metadata; 14 | 15 | pub mod prelude; 16 | 17 | pub mod similarity; 18 | 19 | pub mod storage; 20 | 21 | pub(crate) mod subword; 22 | 23 | pub mod text; 24 | 25 | pub(crate) mod util; 26 | 27 | pub mod vocab; 28 | 29 | pub mod word2vec; 30 | 31 | #[cfg(test)] 32 | mod tests; 33 | -------------------------------------------------------------------------------- /rust2vec/src/metadata.rs: -------------------------------------------------------------------------------- 1 | //! Metadata 2 | 3 | use std::io::{Read, Seek, Write}; 4 | 5 | use byteorder::{LittleEndian, ReadBytesExt, WriteBytesExt}; 6 | use failure::{ensure, err_msg, Error}; 7 | use toml::Value; 8 | 9 | use crate::io::{ 10 | private::{ChunkIdentifier, Header, ReadChunk, WriteChunk}, 11 | ReadMetadata, 12 | }; 13 | 14 | /// Embeddings metadata. 15 | /// 16 | /// finalfusion metadata in TOML format. 17 | #[derive(Clone, Debug, PartialEq)] 18 | pub struct Metadata(pub Value); 19 | 20 | impl ReadChunk for Metadata { 21 | fn read_chunk<R>(read: &mut R) -> Result<Self, Error> 22 | where 23 | R: Read + Seek, 24 | { 25 | let chunk_id = ChunkIdentifier::try_from(read.read_u32::<LittleEndian>()?) 26 | .ok_or_else(|| err_msg("Unknown chunk identifier"))?; 27 | ensure!( 28 | chunk_id == ChunkIdentifier::Metadata, 29 | "Cannot read chunk {:?} as Metadata", 30 | chunk_id 31 | ); 32 | 33 | // Read chunk length. 34 | let chunk_len = read.read_u64::<LittleEndian>()? as usize; 35 | 36 | // Read TOML data.
37 | let mut buf = vec![0; chunk_len]; 38 | read.read_exact(&mut buf)?; 39 | let buf_str = String::from_utf8(buf)?; 40 | 41 | Ok(Metadata(buf_str.parse::()?)) 42 | } 43 | } 44 | 45 | impl WriteChunk for Metadata { 46 | fn chunk_identifier(&self) -> ChunkIdentifier { 47 | ChunkIdentifier::Metadata 48 | } 49 | 50 | fn write_chunk(&self, write: &mut W) -> Result<(), Error> 51 | where 52 | W: Write + Seek, 53 | { 54 | let metadata_str = self.0.to_string(); 55 | 56 | write.write_u32::(self.chunk_identifier() as u32)?; 57 | write.write_u64::(metadata_str.len() as u64)?; 58 | write.write_all(metadata_str.as_bytes())?; 59 | 60 | Ok(()) 61 | } 62 | } 63 | 64 | impl ReadMetadata for Option { 65 | fn read_metadata(read: &mut R) -> Result 66 | where 67 | R: Read + Seek, 68 | { 69 | let header = Header::read_chunk(read)?; 70 | let chunks = header.chunk_identifiers(); 71 | ensure!(!chunks.is_empty(), "Embedding file without chunks."); 72 | 73 | if header.chunk_identifiers()[0] == ChunkIdentifier::Metadata { 74 | Ok(Some(Metadata::read_chunk(read)?)) 75 | } else { 76 | Ok(None) 77 | } 78 | } 79 | } 80 | 81 | #[cfg(test)] 82 | mod tests { 83 | use std::io::{Cursor, Read, Seek, SeekFrom}; 84 | 85 | use byteorder::{LittleEndian, ReadBytesExt}; 86 | use toml::{toml, toml_internal}; 87 | 88 | use super::Metadata; 89 | use crate::io::private::{ReadChunk, WriteChunk}; 90 | 91 | fn read_chunk_size(read: &mut impl Read) -> u64 { 92 | // Skip identifier. 93 | read.read_u32::().unwrap(); 94 | 95 | // Return chunk length. 96 | read.read_u64::().unwrap() 97 | } 98 | 99 | fn test_metadata() -> Metadata { 100 | Metadata(toml! 
{ 101 | [hyperparameters] 102 | dims = 300 103 | ns = 5 104 | 105 | [description] 106 | description = "Test model" 107 | language = "de" 108 | }) 109 | } 110 | 111 | #[test] 112 | fn metadata_correct_chunk_size() { 113 | let check_metadata = test_metadata(); 114 | let mut cursor = Cursor::new(Vec::new()); 115 | check_metadata.write_chunk(&mut cursor).unwrap(); 116 | cursor.seek(SeekFrom::Start(0)).unwrap(); 117 | 118 | let chunk_size = read_chunk_size(&mut cursor); 119 | assert_eq!( 120 | cursor.read_to_end(&mut Vec::new()).unwrap(), 121 | chunk_size as usize 122 | ); 123 | } 124 | 125 | #[test] 126 | fn metadata_write_read_roundtrip() { 127 | let check_metadata = test_metadata(); 128 | let mut cursor = Cursor::new(Vec::new()); 129 | check_metadata.write_chunk(&mut cursor).unwrap(); 130 | cursor.seek(SeekFrom::Start(0)).unwrap(); 131 | let metadata = Metadata::read_chunk(&mut cursor).unwrap(); 132 | assert_eq!(metadata, check_metadata); 133 | } 134 | } 135 | -------------------------------------------------------------------------------- /rust2vec/src/prelude.rs: -------------------------------------------------------------------------------- 1 | //! Prelude exports the most commonly-used types and traits. 2 | 3 | pub use crate::embeddings::Embeddings; 4 | 5 | pub use crate::io::{MmapEmbeddings, ReadEmbeddings, ReadMetadata, WriteEmbeddings}; 6 | 7 | pub use crate::metadata::Metadata; 8 | 9 | pub use crate::storage::{ 10 | MmapArray, NdArray, Quantize, QuantizedArray, Storage, StorageView, StorageViewWrap, 11 | StorageWrap, 12 | }; 13 | 14 | pub use crate::text::{ReadText, ReadTextDims, WriteText, WriteTextDims}; 15 | 16 | pub use crate::word2vec::{ReadWord2Vec, WriteWord2Vec}; 17 | 18 | pub use crate::vocab::{SimpleVocab, SubwordVocab, Vocab, VocabWrap}; 19 | -------------------------------------------------------------------------------- /rust2vec/src/similarity.rs: -------------------------------------------------------------------------------- 1 | //! 
Traits and trait implementations for similarity queries.

use std::cmp::Ordering;
use std::collections::{BinaryHeap, HashSet};

use ndarray::{s, Array1, ArrayView1, ArrayView2};
use ordered_float::NotNan;

use crate::embeddings::Embeddings;
use crate::storage::StorageView;
use crate::util::l2_normalize;
use crate::vocab::Vocab;

/// A word with its similarity.
///
/// This data structure is used to store a pair consisting of a word and
/// its similarity to a query word.
#[derive(Debug, Eq, PartialEq)]
pub struct WordSimilarity<'a> {
    pub similarity: NotNan<f32>,
    pub word: &'a str,
}

impl<'a> Ord for WordSimilarity<'a> {
    fn cmp(&self, other: &Self) -> Ordering {
        match other.similarity.cmp(&self.similarity) {
            Ordering::Equal => self.word.cmp(other.word),
            ordering => ordering,
        }
    }
}

impl<'a> PartialOrd for WordSimilarity<'a> {
    fn partial_cmp(&self, other: &WordSimilarity) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

/// Trait for analogy queries.
pub trait Analogy {
    /// Perform an analogy query.
    ///
    /// This method returns words that complete the analogy query
    /// `word1` is to `word2` as `word3` is to `?`. More concretely,
    /// it searches for embeddings that are similar to:
    ///
    /// *embedding(word2) - embedding(word1) + embedding(word3)*
    ///
    /// At most `limit` results are returned.
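The offset arithmetic in the doc comment above can be sketched with plain vectors, independent of this crate's types. A minimal illustration with toy 2-dimensional embeddings; `analogy_target` and `cosine` are hypothetical helpers for this sketch, not part of rust2vec:

```rust
/// Cosine similarity between two vectors.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (na * nb)
}

/// embedding(word2) - embedding(word1) + embedding(word3)
fn analogy_target(e1: &[f32], e2: &[f32], e3: &[f32]) -> Vec<f32> {
    e1.iter()
        .zip(e2)
        .zip(e3)
        .map(|((&a, &b), &c)| b - a + c)
        .collect()
}

fn main() {
    // Toy embeddings; in a real model these come from the storage chunk.
    let paris = [1.0, 0.0];
    let france = [1.0, 1.0];
    let berlin = [0.0, 1.0];

    // france - paris + berlin = [0.0, 2.0]
    let target = analogy_target(&paris, &france, &berlin);
    assert_eq!(target, vec![0.0, 2.0]);

    // The answer is the vocabulary item closest to `target`.
    println!("cosine to candidate: {}", cosine(&target, &berlin));
}
```

A real query then ranks every vocabulary item by its (dot-product or cosine) similarity to `target`, skipping the three query words, which is exactly what `analogy_by` delegates to below.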
    fn analogy(
        &self,
        word1: &str,
        word2: &str,
        word3: &str,
        limit: usize,
    ) -> Option<Vec<WordSimilarity>>;
}

impl<V, S> Analogy for Embeddings<V, S>
where
    V: Vocab,
    S: StorageView,
{
    fn analogy(
        &self,
        word1: &str,
        word2: &str,
        word3: &str,
        limit: usize,
    ) -> Option<Vec<WordSimilarity>> {
        self.analogy_by(word1, word2, word3, limit, |embeds, embed| {
            embeds.dot(&embed)
        })
    }
}

/// Trait for analogy queries with a custom similarity function.
pub trait AnalogyBy {
    /// Perform an analogy query using the given similarity function.
    ///
    /// This method returns words that complete the analogy query
    /// `word1` is to `word2` as `word3` is to `?`. More concretely,
    /// it searches for embeddings that are similar to:
    ///
    /// *embedding(word2) - embedding(word1) + embedding(word3)*
    ///
    /// At most `limit` results are returned.
    fn analogy_by<F>(
        &self,
        word1: &str,
        word2: &str,
        word3: &str,
        limit: usize,
        similarity: F,
    ) -> Option<Vec<WordSimilarity>>
    where
        F: FnMut(ArrayView2<f32>, ArrayView1<f32>) -> Array1<f32>;
}

impl<V, S> AnalogyBy for Embeddings<V, S>
where
    V: Vocab,
    S: StorageView,
{
    fn analogy_by<F>(
        &self,
        word1: &str,
        word2: &str,
        word3: &str,
        limit: usize,
        similarity: F,
    ) -> Option<Vec<WordSimilarity>>
    where
        F: FnMut(ArrayView2<f32>, ArrayView1<f32>) -> Array1<f32>,
    {
        let embedding1 = self.embedding(word1)?;
        let embedding2 = self.embedding(word2)?;
        let embedding3 = self.embedding(word3)?;

        let mut embedding = (&embedding2.as_view() - &embedding1.as_view()) + embedding3.as_view();
        l2_normalize(embedding.view_mut());

        let skip = [word1, word2, word3].iter().cloned().collect();

        Some(self.similarity_(embedding.view(), &skip, limit, similarity))
    }
}

/// Trait for similarity queries.
pub trait Similarity {
    /// Find words that are similar to the query word.
    ///
    /// The similarity between two words is defined by the dot product of
    /// the embeddings. If the vectors are unit vectors (e.g. by virtue of
    /// calling `normalize`), this is the cosine similarity. At most
    /// `limit` results are returned.
    fn similarity(&self, word: &str, limit: usize) -> Option<Vec<WordSimilarity>>;
}

impl<V, S> Similarity for Embeddings<V, S>
where
    V: Vocab,
    S: StorageView,
{
    fn similarity(&self, word: &str, limit: usize) -> Option<Vec<WordSimilarity>> {
        self.similarity_by(word, limit, |embeds, embed| embeds.dot(&embed))
    }
}

/// Trait for similarity queries with a custom similarity function.
pub trait SimilarityBy {
    /// Find words that are similar to the query word using the given
    /// similarity function.
    ///
    /// The similarity function should return, given the embedding matrix
    /// and the query embedding, a vector of similarity scores. At most
    /// `limit` results are returned.
    fn similarity_by<F>(
        &self,
        word: &str,
        limit: usize,
        similarity: F,
    ) -> Option<Vec<WordSimilarity>>
    where
        F: FnMut(ArrayView2<f32>, ArrayView1<f32>) -> Array1<f32>;
}

impl<V, S> SimilarityBy for Embeddings<V, S>
where
    V: Vocab,
    S: StorageView,
{
    fn similarity_by<F>(
        &self,
        word: &str,
        limit: usize,
        similarity: F,
    ) -> Option<Vec<WordSimilarity>>
    where
        F: FnMut(ArrayView2<f32>, ArrayView1<f32>) -> Array1<f32>,
    {
        let embed = self.embedding(word)?;
        let mut skip = HashSet::new();
        skip.insert(word);

        Some(self.similarity_(embed.as_view(), &skip, limit, similarity))
    }
}

trait SimilarityPrivate {
    fn similarity_<F>(
        &self,
        embed: ArrayView1<f32>,
        skip: &HashSet<&str>,
        limit: usize,
        similarity: F,
    ) -> Vec<WordSimilarity>
    where
        F: FnMut(ArrayView2<f32>, ArrayView1<f32>) -> Array1<f32>;
}

impl<V, S> SimilarityPrivate for Embeddings<V, S>
where
    V: Vocab,
    S: StorageView,
{
    fn similarity_<F>(
        &self,
        embed: ArrayView1<f32>,
        skip: &HashSet<&str>,
        limit: usize,
        mut similarity: F,
    ) -> Vec<WordSimilarity>
    where
        F: FnMut(ArrayView2<f32>, ArrayView1<f32>) -> Array1<f32>,
    {
        // ndarray#474
        #[allow(clippy::deref_addrof)]
        let sims = similarity(
            self.storage().view().slice(s![0..self.vocab().len(), ..]),
            embed.view(),
        );

        let mut results = BinaryHeap::with_capacity(limit);
        for (idx, &sim) in sims.iter().enumerate() {
            let word = &self.vocab().words()[idx];

            // Don't add words that we are explicitly asked to skip.
229 | if skip.contains(word.as_str()) { 230 | continue; 231 | } 232 | 233 | let word_similarity = WordSimilarity { 234 | word, 235 | similarity: NotNan::new(sim).expect("Encountered NaN"), 236 | }; 237 | 238 | if results.len() < limit { 239 | results.push(word_similarity); 240 | } else { 241 | let mut peek = results.peek_mut().expect("Cannot peek non-empty heap"); 242 | if word_similarity < *peek { 243 | *peek = word_similarity 244 | } 245 | } 246 | } 247 | 248 | results.into_sorted_vec() 249 | } 250 | } 251 | 252 | #[cfg(test)] 253 | mod tests { 254 | 255 | use std::fs::File; 256 | use std::io::BufReader; 257 | 258 | use crate::embeddings::Embeddings; 259 | use crate::similarity::{Analogy, Similarity}; 260 | use crate::word2vec::ReadWord2Vec; 261 | 262 | static SIMILARITY_ORDER_STUTTGART_10: &'static [&'static str] = &[ 263 | "Karlsruhe", 264 | "Mannheim", 265 | "München", 266 | "Darmstadt", 267 | "Heidelberg", 268 | "Wiesbaden", 269 | "Kassel", 270 | "Düsseldorf", 271 | "Leipzig", 272 | "Berlin", 273 | ]; 274 | 275 | static SIMILARITY_ORDER: &'static [&'static str] = &[ 276 | "Potsdam", 277 | "Hamburg", 278 | "Leipzig", 279 | "Dresden", 280 | "München", 281 | "Düsseldorf", 282 | "Bonn", 283 | "Stuttgart", 284 | "Weimar", 285 | "Berlin-Charlottenburg", 286 | "Rostock", 287 | "Karlsruhe", 288 | "Chemnitz", 289 | "Breslau", 290 | "Wiesbaden", 291 | "Hannover", 292 | "Mannheim", 293 | "Kassel", 294 | "Köln", 295 | "Danzig", 296 | "Erfurt", 297 | "Dessau", 298 | "Bremen", 299 | "Charlottenburg", 300 | "Magdeburg", 301 | "Neuruppin", 302 | "Darmstadt", 303 | "Jena", 304 | "Wien", 305 | "Heidelberg", 306 | "Dortmund", 307 | "Stettin", 308 | "Schwerin", 309 | "Neubrandenburg", 310 | "Greifswald", 311 | "Göttingen", 312 | "Braunschweig", 313 | "Berliner", 314 | "Warschau", 315 | "Berlin-Spandau", 316 | ]; 317 | 318 | static ANALOGY_ORDER: &'static [&'static str] = &[ 319 | "Deutschland", 320 | "Westdeutschland", 321 | "Sachsen", 322 | "Mitteldeutschland", 323 | 
"Brandenburg", 324 | "Polen", 325 | "Norddeutschland", 326 | "Dänemark", 327 | "Schleswig-Holstein", 328 | "Österreich", 329 | "Bayern", 330 | "Thüringen", 331 | "Bundesrepublik", 332 | "Ostdeutschland", 333 | "Preußen", 334 | "Deutschen", 335 | "Hessen", 336 | "Potsdam", 337 | "Mecklenburg", 338 | "Niedersachsen", 339 | "Hamburg", 340 | "Süddeutschland", 341 | "Bremen", 342 | "Russland", 343 | "Deutschlands", 344 | "BRD", 345 | "Litauen", 346 | "Mecklenburg-Vorpommern", 347 | "DDR", 348 | "West-Berlin", 349 | "Saarland", 350 | "Lettland", 351 | "Hannover", 352 | "Rostock", 353 | "Sachsen-Anhalt", 354 | "Pommern", 355 | "Schweden", 356 | "Deutsche", 357 | "deutschen", 358 | "Westfalen", 359 | ]; 360 | 361 | #[test] 362 | fn test_similarity() { 363 | let f = File::open("testdata/similarity.bin").unwrap(); 364 | let mut reader = BufReader::new(f); 365 | let embeddings = Embeddings::read_word2vec_binary(&mut reader, true).unwrap(); 366 | 367 | let result = embeddings.similarity("Berlin", 40); 368 | assert!(result.is_some()); 369 | let result = result.unwrap(); 370 | assert_eq!(40, result.len()); 371 | 372 | for (idx, word_similarity) in result.iter().enumerate() { 373 | assert_eq!(SIMILARITY_ORDER[idx], word_similarity.word) 374 | } 375 | 376 | let result = embeddings.similarity("Berlin", 10); 377 | assert!(result.is_some()); 378 | let result = result.unwrap(); 379 | assert_eq!(10, result.len()); 380 | 381 | println!("{:?}", result); 382 | 383 | for (idx, word_similarity) in result.iter().enumerate() { 384 | assert_eq!(SIMILARITY_ORDER[idx], word_similarity.word) 385 | } 386 | } 387 | 388 | #[test] 389 | fn test_similarity_limit() { 390 | let f = File::open("testdata/similarity.bin").unwrap(); 391 | let mut reader = BufReader::new(f); 392 | let embeddings = Embeddings::read_word2vec_binary(&mut reader, true).unwrap(); 393 | 394 | let result = embeddings.similarity("Stuttgart", 10); 395 | assert!(result.is_some()); 396 | let result = result.unwrap(); 397 | 
assert_eq!(10, result.len()); 398 | 399 | println!("{:?}", result); 400 | 401 | for (idx, word_similarity) in result.iter().enumerate() { 402 | assert_eq!(SIMILARITY_ORDER_STUTTGART_10[idx], word_similarity.word) 403 | } 404 | } 405 | 406 | #[test] 407 | fn test_analogy() { 408 | let f = File::open("testdata/analogy.bin").unwrap(); 409 | let mut reader = BufReader::new(f); 410 | let embeddings = Embeddings::read_word2vec_binary(&mut reader, true).unwrap(); 411 | 412 | let result = embeddings.analogy("Paris", "Frankreich", "Berlin", 40); 413 | assert!(result.is_some()); 414 | let result = result.unwrap(); 415 | assert_eq!(40, result.len()); 416 | 417 | for (idx, word_similarity) in result.iter().enumerate() { 418 | assert_eq!(ANALOGY_ORDER[idx], word_similarity.word) 419 | } 420 | } 421 | 422 | } 423 | -------------------------------------------------------------------------------- /rust2vec/src/storage.rs: -------------------------------------------------------------------------------- 1 | //! Embedding matrix representations. 2 | 3 | use std::fs::File; 4 | use std::io::{BufReader, Read, Seek, SeekFrom, Write}; 5 | use std::mem::size_of; 6 | 7 | use byteorder::{LittleEndian, ReadBytesExt, WriteBytesExt}; 8 | use failure::{ensure, format_err, Error}; 9 | use memmap::{Mmap, MmapOptions}; 10 | use ndarray::{Array, Array1, Array2, ArrayView, ArrayView2, Dimension, Ix1, Ix2}; 11 | use rand::{FromEntropy, Rng}; 12 | use rand_xorshift::XorShiftRng; 13 | use reductive::pq::{QuantizeVector, ReconstructVector, TrainPQ, PQ}; 14 | 15 | use crate::io::private::{ChunkIdentifier, MmapChunk, ReadChunk, TypeId, WriteChunk}; 16 | 17 | /// Copy-on-write wrapper for `Array`/`ArrayView`. 18 | /// 19 | /// The `CowArray` type stores an owned array or an array view. In 20 | /// both cases a view (`as_view`) or an owned array (`into_owned`) can 21 | /// be obtained. If the wrapped array is a view, retrieving an owned 22 | /// array will copy the underlying data. 
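The pattern described above mirrors the standard library's `std::borrow::Cow`: hand out borrows, and copy only at the point where ownership is actually requested. A stripped-down sketch of the same idea with plain slices (illustrative only, not this crate's API):

```rust
use std::borrow::Cow;

/// Hand out a borrowed row; the caller pays for a copy only when it
/// actually needs ownership, as with `CowArray::into_owned`.
fn row(matrix: &[Vec<f32>], idx: usize) -> Cow<'_, [f32]> {
    Cow::Borrowed(&matrix[idx])
}

fn main() {
    let matrix = vec![vec![1.0f32, 2.0], vec![3.0, 4.0]];

    let view = row(&matrix, 1);
    // Still a zero-copy borrow at this point.
    assert!(matches!(view, Cow::Borrowed(_)));

    // Requesting ownership copies the underlying data.
    let owned: Vec<f32> = view.into_owned();
    assert_eq!(owned, vec![3.0, 4.0]);
}
```

The crate defines its own `CowArray` rather than using `std::borrow::Cow` because `ndarray`'s `ArrayView` is not the borrowed form of `Array` in the `ToOwned` sense that `Cow` requires.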
pub enum CowArray<'a, A, D> {
    Borrowed(ArrayView<'a, A, D>),
    Owned(Array<A, D>),
}

impl<'a, A, D> CowArray<'a, A, D>
where
    D: Dimension,
{
    pub fn as_view(&self) -> ArrayView<A, D> {
        match self {
            CowArray::Borrowed(borrow) => borrow.view(),
            CowArray::Owned(owned) => owned.view(),
        }
    }
}

impl<'a, A, D> CowArray<'a, A, D>
where
    A: Clone,
    D: Dimension,
{
    pub fn into_owned(self) -> Array<A, D> {
        match self {
            CowArray::Borrowed(borrow) => borrow.to_owned(),
            CowArray::Owned(owned) => owned,
        }
    }
}

/// 1D copy-on-write array.
pub type CowArray1<'a, A> = CowArray<'a, A, Ix1>;

/// Memory-mapped matrix.
pub struct MmapArray {
    map: Mmap,
    shape: Ix2,
}

impl MmapChunk for MmapArray {
    fn mmap_chunk(read: &mut BufReader<File>) -> Result<Self, Error> {
        ensure!(
            read.read_u32::<LittleEndian>()? == ChunkIdentifier::NdArray as u32,
            "invalid chunk identifier for NdArray"
        );

        // Read and discard chunk length.
        read.read_u64::<LittleEndian>()?;

        let rows = read.read_u64::<LittleEndian>()? as usize;
        let cols = read.read_u32::<LittleEndian>()? as usize;
        let shape = Ix2(rows, cols);

        ensure!(
            read.read_u32::<LittleEndian>()? == f32::type_id(),
            "Expected single precision floating point matrix for NdArray."
        );

        let n_padding = padding::<f32>(read.seek(SeekFrom::Current(0))?);
        read.seek(SeekFrom::Current(n_padding as i64))?;

        // Set up memory mapping.
        let matrix_len = shape.size() * size_of::<f32>();
        let offset = read.seek(SeekFrom::Current(0))?;
        let mut mmap_opts = MmapOptions::new();
        let map = unsafe {
            mmap_opts
                .offset(offset)
                .len(matrix_len)
                .map(&read.get_ref())?
        };

        // Position the reader after the matrix.
        read.seek(SeekFrom::Current(matrix_len as i64))?;

        Ok(MmapArray { map, shape })
    }
}

impl WriteChunk for MmapArray {
    fn chunk_identifier(&self) -> ChunkIdentifier {
        ChunkIdentifier::NdArray
    }

    fn write_chunk<W>(&self, write: &mut W) -> Result<(), Error>
    where
        W: Write + Seek,
    {
        NdArray::write_ndarray_chunk(self.view(), write)
    }
}

/// In-memory `ndarray` matrix.
#[derive(Debug)]
pub struct NdArray(pub Array2<f32>);

impl NdArray {
    fn write_ndarray_chunk<W>(data: ArrayView2<f32>, write: &mut W) -> Result<(), Error>
    where
        W: Write + Seek,
    {
        write.write_u32::<LittleEndian>(ChunkIdentifier::NdArray as u32)?;
        let n_padding = padding::<f32>(write.seek(SeekFrom::Current(0))?);
        // Chunk size: rows (u64), columns (u32), type id (u32),
        // padding ([0,4) bytes), matrix.
        let chunk_len = size_of::<u64>()
            + size_of::<u32>()
            + size_of::<u32>()
            + n_padding as usize
            + (data.rows() * data.cols() * size_of::<f32>());
        write.write_u64::<LittleEndian>(chunk_len as u64)?;
        write.write_u64::<LittleEndian>(data.rows() as u64)?;
        write.write_u32::<LittleEndian>(data.cols() as u32)?;
        write.write_u32::<LittleEndian>(f32::type_id())?;

        // Write padding, such that the embedding matrix starts at a
        // multiple of the size of f32 (4 bytes). This is necessary
        // for memory mapping a matrix. Interpreting the raw u8 data
        // as a proper f32 array requires that the data is aligned in
        // memory. However, we cannot always memory map the starting
        // offset of the matrix directly, since mmap(2) requires a
        // file offset that is page-aligned. Since the page size is
        // always a larger power of 2 (e.g. 2^12), which is divisible
        // by 4, the offset of the matrix with regard to the page
        // boundary is also a multiple of 4.
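The alignment requirement described above amounts to padding the current position to the next multiple of `size_of::<f32>()`, i.e. `(4 - pos % 4) % 4` bytes. A hedged reconstruction of what a `padding` helper like the one this module calls could look like (the crate's actual definition is not shown in this excerpt):

```rust
use std::mem::size_of;

/// Bytes needed to advance `pos` to the next multiple of
/// `size_of::<T>()`; zero when `pos` is already aligned.
fn padding<T>(pos: u64) -> u64 {
    let size = size_of::<T>() as u64;
    (size - (pos % size)) % size
}

fn main() {
    assert_eq!(padding::<f32>(16), 0); // already 4-byte aligned
    assert_eq!(padding::<f32>(17), 3); // pad to offset 20
    assert_eq!(padding::<f64>(17), 7); // 8-byte alignment for f64
}
```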

        let padding = vec![0; n_padding as usize];
        write.write_all(&padding)?;

        for row in data.outer_iter() {
            for col in row.iter() {
                write.write_f32::<LittleEndian>(*col)?;
            }
        }

        Ok(())
    }
}

impl ReadChunk for NdArray {
    fn read_chunk<R>(read: &mut R) -> Result<Self, Error>
    where
        R: Read + Seek,
    {
        let chunk_id = read.read_u32::<LittleEndian>()?;
        let chunk_id = ChunkIdentifier::try_from(chunk_id)
            .ok_or_else(|| format_err!("Unknown chunk identifier: {}", chunk_id))?;
        ensure!(
            chunk_id == ChunkIdentifier::NdArray,
            "Cannot read chunk {:?} as NdArray",
            chunk_id
        );

        // Read and discard chunk length.
        read.read_u64::<LittleEndian>()?;

        let rows = read.read_u64::<LittleEndian>()? as usize;
        let cols = read.read_u32::<LittleEndian>()? as usize;

        ensure!(
            read.read_u32::<LittleEndian>()? == f32::type_id(),
            "Expected single precision floating point matrix for NdArray."
        );

        let n_padding = padding::<f32>(read.seek(SeekFrom::Current(0))?);
        read.seek(SeekFrom::Current(n_padding as i64))?;

        let mut data = vec![0f32; rows * cols];
        read.read_f32_into::<LittleEndian>(&mut data)?;

        Ok(NdArray(Array2::from_shape_vec((rows, cols), data)?))
    }
}

impl WriteChunk for NdArray {
    fn chunk_identifier(&self) -> ChunkIdentifier {
        ChunkIdentifier::NdArray
    }

    fn write_chunk<W>(&self, write: &mut W) -> Result<(), Error>
    where
        W: Write + Seek,
    {
        Self::write_ndarray_chunk(self.0.view(), write)
    }
}

/// Quantized embedding matrix.
pub struct QuantizedArray {
    quantizer: PQ<f32>,
    quantized: Array2<u8>,
    norms: Option<Array1<f32>>,
}

impl ReadChunk for QuantizedArray {
    fn read_chunk<R>(read: &mut R) -> Result<Self, Error>
    where
        R: Read + Seek,
    {
        let chunk_id = read.read_u32::<LittleEndian>()?;
        let chunk_id = ChunkIdentifier::try_from(chunk_id)
            .ok_or_else(|| format_err!("Unknown chunk identifier: {}", chunk_id))?;
        ensure!(
            chunk_id == ChunkIdentifier::QuantizedArray,
            "Cannot read chunk {:?} as QuantizedArray",
            chunk_id
        );

        // Read and discard chunk length.
        read.read_u64::<LittleEndian>()?;

        let projection = read.read_u32::<LittleEndian>()? != 0;
        let read_norms = read.read_u32::<LittleEndian>()? != 0;
        let quantized_len = read.read_u32::<LittleEndian>()? as usize;
        let reconstructed_len = read.read_u32::<LittleEndian>()? as usize;
        let n_centroids = read.read_u32::<LittleEndian>()? as usize;
        let n_embeddings = read.read_u64::<LittleEndian>()? as usize;

        ensure!(
            read.read_u32::<LittleEndian>()? == u8::type_id(),
            "Expected unsigned byte quantized embedding matrices."
        );

        ensure!(
            read.read_u32::<LittleEndian>()? == f32::type_id(),
            "Expected single precision floating point quantizer matrices."
        );

        let n_padding = padding::<f32>(read.seek(SeekFrom::Current(0))?);
        read.seek(SeekFrom::Current(n_padding as i64))?;

        let projection = if projection {
            let mut projection_vec = vec![0f32; reconstructed_len * reconstructed_len];
            read.read_f32_into::<LittleEndian>(&mut projection_vec)?;
            Some(Array2::from_shape_vec(
                (reconstructed_len, reconstructed_len),
                projection_vec,
            )?)
        } else {
            None
        };

        let mut quantizers = Vec::with_capacity(quantized_len);
        for _ in 0..quantized_len {
            let mut subquantizer_vec =
                vec![0f32; n_centroids * (reconstructed_len / quantized_len)];
            read.read_f32_into::<LittleEndian>(&mut subquantizer_vec)?;
            let subquantizer = Array2::from_shape_vec(
                (n_centroids, reconstructed_len / quantized_len),
                subquantizer_vec,
            )?;
            quantizers.push(subquantizer);
        }

        let norms = if read_norms {
            let mut norms_vec = vec![0f32; n_embeddings];
            read.read_f32_into::<LittleEndian>(&mut norms_vec)?;
            Some(Array1::from_vec(norms_vec))
        } else {
            None
        };

        let mut quantized_embeddings_vec = vec![0u8; n_embeddings * quantized_len];
        read.read_exact(&mut quantized_embeddings_vec)?;
        let quantized =
            Array2::from_shape_vec((n_embeddings, quantized_len), quantized_embeddings_vec)?;

        Ok(QuantizedArray {
            quantizer: PQ::new(projection, quantizers),
            quantized,
            norms,
        })
    }
}

impl WriteChunk for QuantizedArray {
    fn chunk_identifier(&self) -> ChunkIdentifier {
        ChunkIdentifier::QuantizedArray
    }

    fn write_chunk<W>(&self, write: &mut W) -> Result<(), Error>
    where
        W: Write + Seek,
    {
        write.write_u32::<LittleEndian>(ChunkIdentifier::QuantizedArray as u32)?;

        // projection (u32), use_norms (u32), quantized_len (u32),
        // reconstructed_len (u32), n_centroids (u32), rows (u64),
        // types (2 x u32 bytes), padding, projection matrix,
        // centroids, norms, quantized data.
        let n_padding = padding::<f32>(write.seek(SeekFrom::Current(0))?);
        let chunk_size = size_of::<u32>()
            + size_of::<u32>()
            + size_of::<u32>()
            + size_of::<u32>()
            + size_of::<u32>()
            + size_of::<u64>()
            + 2 * size_of::<u32>()
            + n_padding as usize
            + self.quantizer.projection().is_some() as usize
                * self.quantizer.reconstructed_len()
                * self.quantizer.reconstructed_len()
                * size_of::<f32>()
            + self.quantizer.quantized_len()
                * self.quantizer.n_quantizer_centroids()
                * (self.quantizer.reconstructed_len() / self.quantizer.quantized_len())
                * size_of::<f32>()
            + self.norms.is_some() as usize * self.quantized.rows() * size_of::<f32>()
            + self.quantized.rows() * self.quantizer.quantized_len();

        write.write_u64::<LittleEndian>(chunk_size as u64)?;

        write.write_u32::<LittleEndian>(self.quantizer.projection().is_some() as u32)?;
        write.write_u32::<LittleEndian>(self.norms.is_some() as u32)?;
        write.write_u32::<LittleEndian>(self.quantizer.quantized_len() as u32)?;
        write.write_u32::<LittleEndian>(self.quantizer.reconstructed_len() as u32)?;
        write.write_u32::<LittleEndian>(self.quantizer.n_quantizer_centroids() as u32)?;
        write.write_u64::<LittleEndian>(self.quantized.rows() as u64)?;

        // Quantized and reconstruction types.
        write.write_u32::<LittleEndian>(u8::type_id())?;
        write.write_u32::<LittleEndian>(f32::type_id())?;

        let padding = vec![0u8; n_padding as usize];
        write.write_all(&padding)?;

        // Write projection matrix.
        if let Some(projection) = self.quantizer.projection() {
            for row in projection.outer_iter() {
                for &col in row {
                    write.write_f32::<LittleEndian>(col)?;
                }
            }
        }

        // Write subquantizers.
        for subquantizer in self.quantizer.subquantizers() {
            for row in subquantizer.outer_iter() {
                for &col in row {
                    write.write_f32::<LittleEndian>(col)?;
                }
            }
        }

        // Write norms.
        if let Some(ref norms) = self.norms {
            for row in norms.outer_iter() {
                for &col in row {
                    write.write_f32::<LittleEndian>(col)?;
                }
            }
        }

        // Write quantized embedding matrix.
        for row in self.quantized.outer_iter() {
            for &col in row {
                write.write_u8(col)?;
            }
        }

        Ok(())
    }
}

/// Storage types wrapper.
///
/// This crate makes it possible to create fine-grained embedding
/// types, such as `Embeddings<SimpleVocab, NdArray>` or
/// `Embeddings<SubwordVocab, QuantizedArray>`. However, in some cases
/// it is more pleasant to have a single type that covers all
/// vocabulary and storage types. `VocabWrap` and `StorageWrap` wrap
/// all the vocabularies and storage types known to this crate such
/// that the type `Embeddings<VocabWrap, StorageWrap>` covers all
/// variations.
pub enum StorageWrap {
    NdArray(NdArray),
    QuantizedArray(QuantizedArray),
    MmapArray(MmapArray),
}

impl From<MmapArray> for StorageWrap {
    fn from(s: MmapArray) -> Self {
        StorageWrap::MmapArray(s)
    }
}

impl From<NdArray> for StorageWrap {
    fn from(s: NdArray) -> Self {
        StorageWrap::NdArray(s)
    }
}

impl From<QuantizedArray> for StorageWrap {
    fn from(s: QuantizedArray) -> Self {
        StorageWrap::QuantizedArray(s)
    }
}

impl ReadChunk for StorageWrap {
    fn read_chunk<R>(read: &mut R) -> Result<Self, Error>
    where
        R: Read + Seek,
    {
        let chunk_start_pos = read.seek(SeekFrom::Current(0))?;

        let chunk_id = read.read_u32::<LittleEndian>()?;
        let chunk_id = ChunkIdentifier::try_from(chunk_id)
            .ok_or_else(|| format_err!("Unknown chunk identifier: {}", chunk_id))?;

        read.seek(SeekFrom::Start(chunk_start_pos))?;

        match chunk_id {
            ChunkIdentifier::NdArray => NdArray::read_chunk(read).map(StorageWrap::NdArray),
            ChunkIdentifier::QuantizedArray => {
                QuantizedArray::read_chunk(read).map(StorageWrap::QuantizedArray)
            }
            _ => Err(format_err!(
                "Chunk type {:?} cannot be read as storage",
                chunk_id
            )),
        }
    }
}

impl MmapChunk for StorageWrap {
    fn mmap_chunk(read: &mut BufReader<File>) -> Result<Self, Error> {
        let chunk_start_pos = read.seek(SeekFrom::Current(0))?;

        let chunk_id = read.read_u32::<LittleEndian>()?;
        let chunk_id = ChunkIdentifier::try_from(chunk_id)
            .ok_or_else(|| format_err!("Unknown chunk identifier: {}", chunk_id))?;

        read.seek(SeekFrom::Start(chunk_start_pos))?;

        match chunk_id {
            ChunkIdentifier::NdArray => MmapArray::mmap_chunk(read).map(StorageWrap::MmapArray),
            _ => Err(format_err!(
                "Chunk type {:?} cannot be memory mapped as storage",
                chunk_id
            )),
        }
    }
}

impl WriteChunk for StorageWrap {
    fn chunk_identifier(&self) -> ChunkIdentifier {
        match self {
            StorageWrap::MmapArray(inner) => inner.chunk_identifier(),
            StorageWrap::NdArray(inner) => inner.chunk_identifier(),
            StorageWrap::QuantizedArray(inner) => inner.chunk_identifier(),
        }
    }

    fn write_chunk<W>(&self, write: &mut W) -> Result<(), Error>
    where
        W: Write + Seek,
    {
        match self {
            StorageWrap::MmapArray(inner) => inner.write_chunk(write),
            StorageWrap::NdArray(inner) => inner.write_chunk(write),
            StorageWrap::QuantizedArray(inner) => inner.write_chunk(write),
        }
    }
}

/// Wrapper for storage types that implement views.
///
/// This type covers the subset of storage types that implement
/// `StorageView`. See the `StorageWrap` type for more information.
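The wrapper enums in this module all follow the same design: `From` impls for ergonomic construction, plus `match`-based dispatch of trait methods to the inner type. The pattern in miniature, with hypothetical toy types rather than this crate's:

```rust
trait Shape {
    fn shape(&self) -> (usize, usize);
}

struct Dense(Vec<Vec<f32>>);

struct Quantized {
    rows: usize,
    dims: usize,
}

impl Shape for Dense {
    fn shape(&self) -> (usize, usize) {
        (self.0.len(), self.0.first().map_or(0, |r| r.len()))
    }
}

impl Shape for Quantized {
    fn shape(&self) -> (usize, usize) {
        (self.rows, self.dims)
    }
}

/// Single concrete type covering all storage variants, like `StorageWrap`.
enum StorageKind {
    Dense(Dense),
    Quantized(Quantized),
}

impl From<Dense> for StorageKind {
    fn from(s: Dense) -> Self {
        StorageKind::Dense(s)
    }
}

impl From<Quantized> for StorageKind {
    fn from(s: Quantized) -> Self {
        StorageKind::Quantized(s)
    }
}

impl Shape for StorageKind {
    fn shape(&self) -> (usize, usize) {
        // Dispatch to the wrapped value, as StorageWrap does.
        match self {
            StorageKind::Dense(inner) => inner.shape(),
            StorageKind::Quantized(inner) => inner.shape(),
        }
    }
}

fn main() {
    let wrapped: StorageKind = Dense(vec![vec![0.0; 3]; 2]).into();
    assert_eq!(wrapped.shape(), (2, 3));
}
```

The cost of this design is one `match` per call; the benefit is a single concrete type that can hold any storage variant, which is what an embeddings file reader needs when the chunk type is only known at runtime.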
492 | pub enum StorageViewWrap { 493 | MmapArray(MmapArray), 494 | NdArray(NdArray), 495 | } 496 | 497 | impl From for StorageViewWrap { 498 | fn from(s: MmapArray) -> Self { 499 | StorageViewWrap::MmapArray(s) 500 | } 501 | } 502 | 503 | impl From for StorageViewWrap { 504 | fn from(s: NdArray) -> Self { 505 | StorageViewWrap::NdArray(s) 506 | } 507 | } 508 | 509 | impl ReadChunk for StorageViewWrap { 510 | fn read_chunk(read: &mut R) -> Result 511 | where 512 | R: Read + Seek, 513 | { 514 | let chunk_start_pos = read.seek(SeekFrom::Current(0))?; 515 | 516 | let chunk_id = read.read_u32::()?; 517 | let chunk_id = ChunkIdentifier::try_from(chunk_id) 518 | .ok_or_else(|| format_err!("Unknown chunk identifier: {}", chunk_id))?; 519 | 520 | read.seek(SeekFrom::Start(chunk_start_pos))?; 521 | 522 | match chunk_id { 523 | ChunkIdentifier::NdArray => NdArray::read_chunk(read).map(StorageViewWrap::NdArray), 524 | _ => Err(format_err!( 525 | "Chunk type {:?} cannot be read as viewable storage", 526 | chunk_id 527 | )), 528 | } 529 | } 530 | } 531 | 532 | impl WriteChunk for StorageViewWrap { 533 | fn chunk_identifier(&self) -> ChunkIdentifier { 534 | match self { 535 | StorageViewWrap::MmapArray(inner) => inner.chunk_identifier(), 536 | StorageViewWrap::NdArray(inner) => inner.chunk_identifier(), 537 | } 538 | } 539 | 540 | fn write_chunk(&self, write: &mut W) -> Result<(), Error> 541 | where 542 | W: Write + Seek, 543 | { 544 | match self { 545 | StorageViewWrap::MmapArray(inner) => inner.write_chunk(write), 546 | StorageViewWrap::NdArray(inner) => inner.write_chunk(write), 547 | } 548 | } 549 | } 550 | 551 | impl MmapChunk for StorageViewWrap { 552 | fn mmap_chunk(read: &mut BufReader) -> Result { 553 | let chunk_start_pos = read.seek(SeekFrom::Current(0))?; 554 | 555 | let chunk_id = read.read_u32::()?; 556 | let chunk_id = ChunkIdentifier::try_from(chunk_id) 557 | .ok_or_else(|| format_err!("Unknown chunk identifier: {}", chunk_id))?; 558 | 559 | 
read.seek(SeekFrom::Start(chunk_start_pos))?; 560 | 561 | match chunk_id { 562 | ChunkIdentifier::NdArray => MmapArray::mmap_chunk(read).map(StorageViewWrap::MmapArray), 563 | _ => Err(format_err!( 564 | "Chunk type {:?} cannot be memory mapped as viewable storage", 565 | chunk_id 566 | )), 567 | } 568 | } 569 | } 570 | 571 | /// Embedding matrix storage. 572 | /// 573 | /// To allow for embeddings to be stored in different manners (e.g. 574 | /// regular *n x d* matrix or as quantized vectors), this trait 575 | /// abstracts over concrete storage types. 576 | pub trait Storage { 577 | fn embedding(&self, idx: usize) -> CowArray1<f32>; 578 | 579 | fn shape(&self) -> (usize, usize); 580 | } 581 | 582 | impl Storage for MmapArray { 583 | fn embedding(&self, idx: usize) -> CowArray1<f32> { 584 | CowArray::Owned( 585 | // Alignment is ok, padding guarantees that the pointer is at 586 | // a multiple of 4. 587 | #[allow(clippy::cast_ptr_alignment)] 588 | unsafe { ArrayView2::from_shape_ptr(self.shape, self.map.as_ptr() as *const f32) } 589 | .row(idx) 590 | .to_owned(), 591 | ) 592 | } 593 | 594 | fn shape(&self) -> (usize, usize) { 595 | self.shape.into_pattern() 596 | } 597 | } 598 | 599 | impl Storage for NdArray { 600 | fn embedding(&self, idx: usize) -> CowArray1<f32> { 601 | CowArray::Borrowed(self.0.row(idx)) 602 | } 603 | 604 | fn shape(&self) -> (usize, usize) { 605 | self.0.dim() 606 | } 607 | } 608 | 609 | impl Storage for QuantizedArray { 610 | fn embedding(&self, idx: usize) -> CowArray1<f32> { 611 | let mut reconstructed = self.quantizer.reconstruct_vector(self.quantized.row(idx)); 612 | if let Some(ref norms) = self.norms { 613 | reconstructed *= norms[idx]; 614 | } 615 | 616 | CowArray::Owned(reconstructed) 617 | } 618 | 619 | fn shape(&self) -> (usize, usize) { 620 | (self.quantized.rows(), self.quantizer.reconstructed_len()) 621 | } 622 | } 623 | 624 | impl Storage for StorageWrap { 625 | fn embedding(&self, idx: usize) -> CowArray1<f32> { 626 | match self { 627 |
StorageWrap::MmapArray(inner) => inner.embedding(idx), 628 | StorageWrap::NdArray(inner) => inner.embedding(idx), 629 | StorageWrap::QuantizedArray(inner) => inner.embedding(idx), 630 | } 631 | } 632 | 633 | fn shape(&self) -> (usize, usize) { 634 | match self { 635 | StorageWrap::MmapArray(inner) => inner.shape(), 636 | StorageWrap::NdArray(inner) => inner.shape(), 637 | StorageWrap::QuantizedArray(inner) => inner.shape(), 638 | } 639 | } 640 | } 641 | 642 | impl Storage for StorageViewWrap { 643 | fn embedding(&self, idx: usize) -> CowArray1<f32> { 644 | match self { 645 | StorageViewWrap::MmapArray(inner) => inner.embedding(idx), 646 | StorageViewWrap::NdArray(inner) => inner.embedding(idx), 647 | } 648 | } 649 | 650 | fn shape(&self) -> (usize, usize) { 651 | match self { 652 | StorageViewWrap::MmapArray(inner) => inner.shape(), 653 | StorageViewWrap::NdArray(inner) => inner.shape(), 654 | } 655 | } 656 | } 657 | 658 | /// Storage that provides a view of the embedding matrix. 659 | pub trait StorageView: Storage { 660 | /// Get a view of the embedding matrix. 661 | fn view(&self) -> ArrayView2<f32>; 662 | } 663 | 664 | impl StorageView for NdArray { 665 | fn view(&self) -> ArrayView2<f32> { 666 | self.0.view() 667 | } 668 | } 669 | 670 | impl StorageView for MmapArray { 671 | fn view(&self) -> ArrayView2<f32> { 672 | // Alignment is ok, padding guarantees that the pointer is at 673 | // a multiple of 4. 674 | #[allow(clippy::cast_ptr_alignment)] 675 | unsafe { 676 | ArrayView2::from_shape_ptr(self.shape, self.map.as_ptr() as *const f32) 677 | } 678 | } 679 | } 680 | 681 | impl StorageView for StorageViewWrap { 682 | fn view(&self) -> ArrayView2<f32> { 683 | match self { 684 | StorageViewWrap::MmapArray(inner) => inner.view(), 685 | StorageViewWrap::NdArray(inner) => inner.view(), 686 | } 687 | } 688 | } 689 | 690 | /// Quantizable embedding matrix. 691 | pub trait Quantize { 692 | /// Quantize the embedding matrix.
693 | /// 694 | /// This method trains a quantizer for the embedding matrix and 695 | /// then quantizes the matrix using this quantizer. 696 | /// 697 | /// The xorshift PRNG is used for picking the initial quantizer 698 | /// centroids. 699 | fn quantize<T>( 700 | &self, 701 | n_subquantizers: usize, 702 | n_subquantizer_bits: u32, 703 | n_iterations: usize, 704 | n_attempts: usize, 705 | normalize: bool, 706 | ) -> QuantizedArray 707 | where 708 | T: TrainPQ<f32>, 709 | { 710 | self.quantize_using::<T, _>( 711 | n_subquantizers, 712 | n_subquantizer_bits, 713 | n_iterations, 714 | n_attempts, 715 | normalize, 716 | &mut XorShiftRng::from_entropy(), 717 | ) 718 | } 719 | 720 | /// Quantize the embedding matrix using the provided RNG. 721 | /// 722 | /// This method trains a quantizer for the embedding matrix and 723 | /// then quantizes the matrix using this quantizer. 724 | fn quantize_using<T, R>( 725 | &self, 726 | n_subquantizers: usize, 727 | n_subquantizer_bits: u32, 728 | n_iterations: usize, 729 | n_attempts: usize, 730 | normalize: bool, 731 | rng: &mut R, 732 | ) -> QuantizedArray 733 | where 734 | T: TrainPQ<f32>, 735 | R: Rng; 736 | } 737 | 738 | impl<S> Quantize for S 739 | where 740 | S: StorageView, 741 | { 742 | /// Quantize the embedding matrix. 743 | /// 744 | /// This method trains a quantizer for the embedding matrix and 745 | /// then quantizes the matrix using this quantizer.
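Before training the quantizer, `quantize_using` optionally L2-normalizes each row and keeps the norms so that reconstructed embeddings can be rescaled to their original magnitude. That step can be sketched with plain vectors, no ndarray (this sketch guards zero rows the way `util::l2_normalize` does; the implementation above divides unconditionally):

```rust
// L2-normalize each row in place and return the original norms,
// mirroring the `normalize` branch of `quantize_using`.
fn normalize_rows(rows: &mut [Vec<f32>]) -> Vec<f32> {
    rows.iter_mut()
        .map(|row| {
            let norm = row.iter().map(|x| x * x).sum::<f32>().sqrt();
            if norm != 0.0 {
                for x in row.iter_mut() {
                    *x /= norm;
                }
            }
            norm
        })
        .collect()
}

fn main() {
    let mut rows = vec![vec![3.0, 4.0], vec![0.0, 0.0]];
    let norms = normalize_rows(&mut rows);
    assert_eq!(norms, vec![5.0, 0.0]);
    // A stored norm recovers the original magnitude: 0.6 * 5.0 == 3.0.
    assert!((rows[0][0] * norms[0] - 3.0).abs() < 1e-6);
}
```

Keeping the norms separate lets the quantizer spend its codebook capacity on direction only, which is why `QuantizedArray::embedding` multiplies the reconstruction by `norms[idx]`.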
fn quantize_using<T, R>( 747 | &self, 748 | n_subquantizers: usize, 749 | n_subquantizer_bits: u32, 750 | n_iterations: usize, 751 | n_attempts: usize, 752 | normalize: bool, 753 | rng: &mut R, 754 | ) -> QuantizedArray 755 | where 756 | T: TrainPQ<f32>, 757 | R: Rng, 758 | { 759 | let (embeds, norms) = if normalize { 760 | let norms = self.view().outer_iter().map(|e| e.dot(&e).sqrt()).collect(); 761 | let mut normalized = self.view().to_owned(); 762 | for (mut embedding, &norm) in normalized.outer_iter_mut().zip(&norms) { 763 | embedding /= norm; 764 | } 765 | (CowArray::Owned(normalized), Some(norms)) 766 | } else { 767 | (CowArray::Borrowed(self.view()), None) 768 | }; 769 | 770 | let quantizer = T::train_pq_using( 771 | n_subquantizers, 772 | n_subquantizer_bits, 773 | n_iterations, 774 | n_attempts, 775 | embeds.as_view(), 776 | rng, 777 | ); 778 | 779 | let quantized = quantizer.quantize_batch(embeds.as_view()); 780 | 781 | QuantizedArray { 782 | quantizer, 783 | quantized, 784 | norms, 785 | } 786 | } 787 | } 788 | 789 | fn padding<T>(pos: u64) -> u64 { 790 | let size = size_of::<T>() as u64; 791 | size - (pos % size) 792 | } 793 | 794 | #[cfg(test)] 795 | mod tests { 796 | use std::io::{Cursor, Read, Seek, SeekFrom}; 797 | 798 | use byteorder::{LittleEndian, ReadBytesExt}; 799 | use ndarray::Array2; 800 | use reductive::pq::PQ; 801 | 802 | use crate::io::private::{ReadChunk, WriteChunk}; 803 | use crate::storage::{NdArray, Quantize, QuantizedArray, StorageView}; 804 | 805 | const N_ROWS: usize = 100; 806 | const N_COLS: usize = 100; 807 | 808 | fn test_ndarray() -> NdArray { 809 | let test_data = Array2::from_shape_fn((N_ROWS, N_COLS), |(r, c)| { 810 | r as f32 * N_COLS as f32 + c as f32 811 | }); 812 | 813 | NdArray(test_data) 814 | } 815 | 816 | fn test_quantized_array(norms: bool) -> QuantizedArray { 817 | let ndarray = test_ndarray(); 818 | ndarray.quantize::<PQ<f32>>(10, 4, 5, 1, norms) 819 | } 820 | 821 | fn read_chunk_size(read: &mut impl Read) -> u64 { 822 | // Skip
identifier. 823 | read.read_u32::<LittleEndian>().unwrap(); 824 | 825 | // Return chunk length. 826 | read.read_u64::<LittleEndian>().unwrap() 827 | } 828 | 829 | #[test] 830 | fn ndarray_correct_chunk_size() { 831 | let check_arr = test_ndarray(); 832 | let mut cursor = Cursor::new(Vec::new()); 833 | check_arr.write_chunk(&mut cursor).unwrap(); 834 | cursor.seek(SeekFrom::Start(0)).unwrap(); 835 | 836 | let chunk_size = read_chunk_size(&mut cursor); 837 | assert_eq!( 838 | cursor.read_to_end(&mut Vec::new()).unwrap(), 839 | chunk_size as usize 840 | ); 841 | } 842 | 843 | #[test] 844 | fn ndarray_write_read_roundtrip() { 845 | let check_arr = test_ndarray(); 846 | let mut cursor = Cursor::new(Vec::new()); 847 | check_arr.write_chunk(&mut cursor).unwrap(); 848 | cursor.seek(SeekFrom::Start(0)).unwrap(); 849 | let arr = NdArray::read_chunk(&mut cursor).unwrap(); 850 | assert_eq!(arr.view(), check_arr.view()); 851 | } 852 | 853 | #[test] 854 | fn quantized_array_correct_chunk_size() { 855 | let check_arr = test_quantized_array(false); 856 | let mut cursor = Cursor::new(Vec::new()); 857 | check_arr.write_chunk(&mut cursor).unwrap(); 858 | cursor.seek(SeekFrom::Start(0)).unwrap(); 859 | 860 | let chunk_size = read_chunk_size(&mut cursor); 861 | assert_eq!( 862 | cursor.read_to_end(&mut Vec::new()).unwrap(), 863 | chunk_size as usize 864 | ); 865 | } 866 | 867 | #[test] 868 | fn quantized_array_norms_correct_chunk_size() { 869 | let check_arr = test_quantized_array(true); 870 | let mut cursor = Cursor::new(Vec::new()); 871 | check_arr.write_chunk(&mut cursor).unwrap(); 872 | cursor.seek(SeekFrom::Start(0)).unwrap(); 873 | 874 | let chunk_size = read_chunk_size(&mut cursor); 875 | assert_eq!( 876 | cursor.read_to_end(&mut Vec::new()).unwrap(), 877 | chunk_size as usize 878 | ); 879 | } 880 | 881 | #[test] 882 | fn quantized_array_read_write_roundtrip() { 883 | let check_arr = test_quantized_array(true); 884 | let mut cursor = Cursor::new(Vec::new()); 885 | check_arr.write_chunk(&mut
cursor).unwrap(); 886 | cursor.seek(SeekFrom::Start(0)).unwrap(); 887 | let arr = QuantizedArray::read_chunk(&mut cursor).unwrap(); 888 | assert_eq!(arr.quantizer, check_arr.quantizer); 889 | assert_eq!(arr.quantized, check_arr.quantized); 890 | } 891 | } 892 | -------------------------------------------------------------------------------- /rust2vec/src/subword.rs: -------------------------------------------------------------------------------- 1 | use std::cmp; 2 | use std::hash::{Hash, Hasher}; 3 | 4 | use fnv::FnvHasher; 5 | 6 | /// Iterator over n-grams in a sequence. 7 | /// 8 | /// `NGrams` provides an iterator over the n-grams in a sequence between a 9 | /// minimum and maximum length. 10 | /// 11 | /// **Warning:** no guarantee is provided with regard to the iteration 12 | /// order. The iterator only guarantees that all n-grams are produced. 13 | pub struct NGrams<'a, T> 14 | where 15 | T: 'a, 16 | { 17 | max_n: usize, 18 | min_n: usize, 19 | seq: &'a [T], 20 | ngram: &'a [T], 21 | } 22 | 23 | impl<'a, T> NGrams<'a, T> { 24 | /// Create a new n-gram iterator. 25 | /// 26 | /// The iterator will create n-grams of length *[min_n, max_n]*. 27 | pub fn new(seq: &'a [T], min_n: usize, max_n: usize) -> Self { 28 | assert!(min_n != 0, "The minimum n-gram length cannot be zero."); 29 | assert!( 30 | min_n <= max_n, 31 | "The maximum length should be equal to or greater than the minimum length."
32 | ); 33 | 34 | let upper = cmp::min(max_n, seq.len()); 35 | 36 | NGrams { 37 | min_n, 38 | max_n, 39 | seq, 40 | ngram: &seq[..upper], 41 | } 42 | } 43 | } 44 | 45 | impl<'a, T> Iterator for NGrams<'a, T> { 46 | type Item = &'a [T]; 47 | 48 | fn next(&mut self) -> Option<Self::Item> { 49 | if self.ngram.len() < self.min_n { 50 | if self.seq.len() <= self.min_n { 51 | return None; 52 | } 53 | 54 | self.seq = &self.seq[1..]; 55 | 56 | let upper = cmp::min(self.max_n, self.seq.len()); 57 | self.ngram = &self.seq[..upper]; 58 | } 59 | 60 | let ngram = self.ngram; 61 | 62 | self.ngram = &self.ngram[..self.ngram.len() - 1]; 63 | 64 | Some(ngram) 65 | } 66 | } 67 | 68 | /// Extension trait for computing subword indices. 69 | /// 70 | /// Subword indexing assigns an identifier to each subword (n-gram) of a 71 | /// string. A subword is indexed by computing its hash and then mapping 72 | /// the hash to a bucket. 73 | /// 74 | /// Since a non-perfect hash function is used, multiple subwords can 75 | /// map to the same index. 76 | pub trait SubwordIndices { 77 | /// Return the subword indices of the subwords of a string. 78 | /// 79 | /// The n-grams that are used are of length *[min_n, max_n]*; these are 80 | /// mapped to indices into *2^buckets_exp* buckets. 81 | /// 82 | /// The largest possible bucket exponent is 64. 83 | fn subword_indices(&self, min_n: usize, max_n: usize, buckets_exp: usize) -> Vec<u64>; 84 | } 85 | 86 | impl SubwordIndices for str { 87 | fn subword_indices(&self, min_n: usize, max_n: usize, buckets_exp: usize) -> Vec<u64> { 88 | assert!( 89 | buckets_exp <= 64, 90 | "The largest possible buckets exponent is 64."
91 | ); 92 | 93 | let mask = if buckets_exp == 64 { 94 | !0 95 | } else { 96 | (1 << buckets_exp) - 1 97 | }; 98 | 99 | let chars: Vec<_> = self.chars().collect(); 100 | 101 | let mut indices = Vec::with_capacity((max_n - min_n + 1) * chars.len()); 102 | for ngram in NGrams::new(&chars, min_n, max_n) { 103 | let mut hasher = FnvHasher::default(); 104 | ngram.hash(&mut hasher); 105 | indices.push(hasher.finish() & mask); 106 | } 107 | 108 | indices 109 | } 110 | } 111 | 112 | #[cfg(test)] 113 | mod tests { 114 | use lazy_static::lazy_static; 115 | use maplit::hashmap; 116 | use std::collections::HashMap; 117 | 118 | use super::{NGrams, SubwordIndices}; 119 | 120 | #[test] 121 | fn ngrams_test() { 122 | let hello_chars: Vec<_> = "hellö world".chars().collect(); 123 | let mut hello_check: Vec<&[char]> = vec![ 124 | &['h'], 125 | &['h', 'e'], 126 | &['h', 'e', 'l'], 127 | &['e'], 128 | &['e', 'l'], 129 | &['e', 'l', 'l'], 130 | &['l'], 131 | &['l', 'l'], 132 | &['l', 'l', 'ö'], 133 | &['l'], 134 | &['l', 'ö'], 135 | &['l', 'ö', ' '], 136 | &['ö'], 137 | &['ö', ' '], 138 | &['ö', ' ', 'w'], 139 | &[' '], 140 | &[' ', 'w'], 141 | &[' ', 'w', 'o'], 142 | &['w'], 143 | &['w', 'o'], 144 | &['w', 'o', 'r'], 145 | &['o'], 146 | &['o', 'r'], 147 | &['o', 'r', 'l'], 148 | &['r'], 149 | &['r', 'l'], 150 | &['r', 'l', 'd'], 151 | &['l'], 152 | &['l', 'd'], 153 | &['d'], 154 | ]; 155 | 156 | hello_check.sort(); 157 | 158 | let mut hello_ngrams: Vec<_> = NGrams::new(&hello_chars, 1, 3).collect(); 159 | hello_ngrams.sort(); 160 | 161 | assert_eq!(hello_check, hello_ngrams); 162 | } 163 | 164 | #[test] 165 | fn ngrams_23_test() { 166 | let hello_chars: Vec<_> = "hello world".chars().collect(); 167 | let mut hello_check: Vec<&[char]> = vec![ 168 | &['h', 'e'], 169 | &['h', 'e', 'l'], 170 | &['e', 'l'], 171 | &['e', 'l', 'l'], 172 | &['l', 'l'], 173 | &['l', 'l', 'o'], 174 | &['l', 'o'], 175 | &['l', 'o', ' '], 176 | &['o', ' '], 177 | &['o', ' ', 'w'], 178 | &[' ', 'w'], 179 | &[' ', 
'w', 'o'], 180 | &['w', 'o'], 181 | &['w', 'o', 'r'], 182 | &['o', 'r'], 183 | &['o', 'r', 'l'], 184 | &['r', 'l'], 185 | &['r', 'l', 'd'], 186 | &['l', 'd'], 187 | ]; 188 | hello_check.sort(); 189 | 190 | let mut hello_ngrams: Vec<_> = NGrams::new(&hello_chars, 2, 3).collect(); 191 | hello_ngrams.sort(); 192 | 193 | assert_eq!(hello_check, hello_ngrams); 194 | } 195 | 196 | #[test] 197 | fn empty_ngram_test() { 198 | let check: &[&[char]] = &[]; 199 | assert_eq!(NGrams::<char>::new(&[], 1, 3).collect::<Vec<_>>(), check); 200 | } 201 | 202 | #[test] 203 | #[should_panic] 204 | fn incorrect_min_n_test() { 205 | NGrams::<char>::new(&[], 0, 3); 206 | } 207 | 208 | #[test] 209 | #[should_panic] 210 | fn incorrect_max_n_test() { 211 | NGrams::<char>::new(&[], 2, 1); 212 | } 213 | 214 | lazy_static! { 215 | static ref SUBWORD_TESTS_2: HashMap<&'static str, Vec<u64>> = hashmap! { 216 | "" => 217 | vec![0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3], 218 | "" => 219 | vec![0, 0, 0, 0, 1, 1, 1, 2, 3, 3, 3, 3, 3, 3], 220 | }; 221 | } 222 | 223 | lazy_static! { 224 | static ref SUBWORD_TESTS_21: HashMap<&'static str, Vec<u64>> = hashmap! { 225 | "" => 226 | vec![214157, 233912, 311961, 488897, 620206, 741276, 841219, 227 | 1167494, 1192256, 1489905, 1532271, 1644730, 1666166, 228 | 1679745, 1680294, 1693100, 2026735, 2065822], 229 | "" => 230 | vec![75867, 104120, 136555, 456131, 599360, 722393, 938007, 231 | 985859, 1006102, 1163391, 1218704, 1321513, 1505861, 232 | 1892376], 233 | }; 234 | } 235 | 236 | #[test] 237 | fn subword_indices_4_test() { 238 | // The goal of this test is to ensure that we are correctly bucketing 239 | // subwords. With a bucket exponent of 2, there are 2^2 = 4 buckets, 240 | // so we should see bucket numbers [0..3].
241 | 242 | for (word, indices_check) in SUBWORD_TESTS_2.iter() { 243 | let mut indices = word.subword_indices(3, 6, 2); 244 | indices.sort(); 245 | assert_eq!(indices_check, &indices); 246 | } 247 | } 248 | 249 | #[test] 250 | fn subword_indices_2m_test() { 251 | // This test checks against precomputed bucket numbers. The goal of 252 | // this test is to ensure that the subword_indices() method hashes 253 | // to the same buckets in the future. 254 | 255 | for (word, indices_check) in SUBWORD_TESTS_21.iter() { 256 | let mut indices = word.subword_indices(3, 6, 21); 257 | indices.sort(); 258 | assert_eq!(indices_check, &indices); 259 | } 260 | } 261 | } 262 | -------------------------------------------------------------------------------- /rust2vec/src/tests.rs: -------------------------------------------------------------------------------- 1 | use std::fs::File; 2 | use std::io::{BufReader, Read, Seek, SeekFrom}; 3 | 4 | use crate::embeddings::Embeddings; 5 | use crate::vocab::Vocab; 6 | use crate::word2vec::{ReadWord2Vec, WriteWord2Vec}; 7 | 8 | #[test] 9 | fn test_read_word2vec_binary() { 10 | let f = File::open("testdata/similarity.bin").unwrap(); 11 | let mut reader = BufReader::new(f); 12 | let embeddings = Embeddings::read_word2vec_binary(&mut reader, false).unwrap(); 13 | assert_eq!(41, embeddings.vocab().len()); 14 | assert_eq!(100, embeddings.dims()); 15 | } 16 | 17 | #[test] 18 | fn test_word2vec_binary_roundtrip() { 19 | let mut reader = BufReader::new(File::open("testdata/similarity.bin").unwrap()); 20 | let mut check = Vec::new(); 21 | reader.read_to_end(&mut check).unwrap(); 22 | 23 | // Read embeddings. 24 | reader.seek(SeekFrom::Start(0)).unwrap(); 25 | let embeddings = Embeddings::read_word2vec_binary(&mut reader, false).unwrap(); 26 | 27 | // Write embeddings to a byte vector.
28 | let mut output = Vec::new(); 29 | embeddings.write_word2vec_binary(&mut output).unwrap(); 30 | 31 | assert_eq!(check, output); 32 | } 33 | -------------------------------------------------------------------------------- /rust2vec/src/text.rs: -------------------------------------------------------------------------------- 1 | //! Readers and writers for text formats. 2 | //! 3 | //! This module provides two readers/writers: 4 | //! 5 | //! 1. `ReadText`/`WriteText`: word embeddings in text format. In this 6 | //! format, each line contains a word followed by its 7 | //! embedding. The word and the embedding vector components are 8 | //! separated by a space. This format is used by GloVe. 9 | //! 2. `ReadTextDims`/`WriteTextDims`: this format is the same as (1), 10 | //! but the data is preceded by a line with the shape of the 11 | //! embedding matrix. This format is used by word2vec's text 12 | //! output. 13 | //! 14 | //! For example: 15 | //! 16 | //! ``` 17 | //! use std::fs::File; 18 | //! use std::io::BufReader; 19 | //! 20 | //! use rust2vec::prelude::*; 21 | //! 22 | //! let mut reader = BufReader::new(File::open("testdata/similarity.txt").unwrap()); 23 | //! 24 | //! // Read the embeddings. The second argument specifies whether 25 | //! // the embeddings should be normalized to unit vectors. 26 | //! let embeddings = Embeddings::read_text_dims(&mut reader, true) 27 | //! .unwrap(); 28 | //! 29 | //! // Look up an embedding. 30 | //! let embedding = embeddings.embedding("Berlin"); 31 | //! ``` 32 | 33 | use std::io::{BufRead, Write}; 34 | 35 | use failure::{ensure, err_msg, Error, ResultExt}; 36 | use itertools::Itertools; 37 | use ndarray::Array2; 38 | 39 | use crate::embeddings::Embeddings; 40 | use crate::storage::{NdArray, Storage}; 41 | use crate::util::l2_normalize; 42 | use crate::vocab::{SimpleVocab, Vocab}; 43 | 44 | /// Method to construct `Embeddings` from a text file.
45 | /// 46 | /// This trait defines an extension to `Embeddings` to read the word embeddings 47 | /// from a text stream. The text should contain one word embedding per line in 48 | /// the following format: 49 | /// 50 | /// *word0 component_1 component_2 ... component_n* 51 | pub trait ReadText<R> 52 | where 53 | Self: Sized, 54 | R: BufRead, 55 | { 56 | /// Read the embeddings from the given buffered reader. 57 | fn read_text(reader: &mut R, normalize: bool) -> Result<Self, Error>; 58 | } 59 | 60 | impl<R> ReadText<R> for Embeddings<SimpleVocab, NdArray> 61 | where 62 | R: BufRead, 63 | { 64 | fn read_text(reader: &mut R, normalize: bool) -> Result<Self, Error> { 65 | read_embeds(reader, None, normalize) 66 | } 67 | } 68 | 69 | /// Method to construct `Embeddings` from a text file with dimensions. 70 | /// 71 | /// This trait defines an extension to `Embeddings` to read the word embeddings 72 | /// from a text stream. The text must contain as the first line the shape of 73 | /// the embedding matrix: 74 | /// 75 | /// *vocab_size n_components* 76 | /// 77 | /// The remainder of the stream should contain one word embedding per line in 78 | /// the following format: 79 | /// 80 | /// *word0 component_1 component_2 ... component_n* 81 | pub trait ReadTextDims<R> 82 | where 83 | Self: Sized, 84 | R: BufRead, 85 | { 86 | /// Read the embeddings from the given buffered reader. 87 | fn read_text_dims(reader: &mut R, normalize: bool) -> Result<Self, Error>; 88 | } 89 | 90 | impl<R> ReadTextDims<R> for Embeddings<SimpleVocab, NdArray> 91 | where 92 | R: BufRead, 93 | { 94 | fn read_text_dims(reader: &mut R, normalize: bool) -> Result<Self, Error> { 95 | let mut dims = String::new(); 96 | reader.read_line(&mut dims)?; 97 | 98 | let mut dims_iter = dims.split_whitespace(); 99 | let vocab_len = dims_iter 100 | .next() 101 | .ok_or_else(|| failure::err_msg("Missing vocabulary size"))? 102 | .parse::<usize>() 103 | .context("Cannot parse vocabulary size")?; 104 | let embed_len = dims_iter 105 | .next() 106 | .ok_or_else(|| failure::err_msg("Missing embedding dimensionality"))?
107 | .parse::<usize>() 108 | .context("Cannot parse embedding dimensionality")?; 109 | 110 | read_embeds(reader, Some((vocab_len, embed_len)), normalize) 111 | } 112 | } 113 | 114 | fn read_embeds<R>( 115 | reader: &mut R, 116 | shape: Option<(usize, usize)>, 117 | normalize: bool, 118 | ) -> Result<Embeddings<SimpleVocab, NdArray>, Error> 119 | where 120 | R: BufRead, 121 | { 122 | let (mut words, mut data) = if let Some((n_words, dims)) = shape { 123 | ( 124 | Vec::with_capacity(n_words), 125 | Vec::with_capacity(n_words * dims), 126 | ) 127 | } else { 128 | (Vec::new(), Vec::new()) 129 | }; 130 | 131 | for line in reader.lines() { 132 | let line = line?; 133 | let mut parts = line.split_whitespace(); 134 | 135 | let word = parts.next().ok_or_else(|| err_msg("Empty line"))?.trim(); 136 | words.push(word.to_owned()); 137 | 138 | for part in parts { 139 | data.push(part.parse()?); 140 | } 141 | } 142 | 143 | let shape = if let Some((n_words, dims)) = shape { 144 | ensure!( 145 | words.len() == n_words, 146 | "Expected {} words, got: {}", 147 | n_words, 148 | words.len() 149 | ); 150 | ensure!( 151 | data.len() / n_words == dims, 152 | "Expected {} dimensions, got: {}", 153 | dims, 154 | data.len() / n_words 155 | ); 156 | (n_words, dims) 157 | } else { 158 | let dims = data.len() / words.len(); 159 | (words.len(), dims) 160 | }; 161 | 162 | ensure!( 163 | data.len() % shape.1 == 0, 164 | "Number of dimensions per vector is not constant" 165 | ); 166 | 167 | let mut matrix = Array2::from_shape_vec(shape, data)?; 168 | 169 | if normalize { 170 | for mut embedding in matrix.outer_iter_mut() { 171 | l2_normalize(embedding.view_mut()); 172 | } 173 | } 174 | 175 | Ok(Embeddings::new( 176 | None, 177 | SimpleVocab::new(words), 178 | NdArray(matrix), 179 | )) 180 | } 181 | 182 | /// Method to write `Embeddings` to a text file. 183 | /// 184 | /// This trait defines an extension to `Embeddings` to write the word embeddings 185 | /// as text.
The text will contain one word embedding per line in the following 186 | /// format: 187 | /// 188 | /// *word0 component_1 component_2 ... component_n* 189 | pub trait WriteText<W> 190 | where 191 | W: Write, 192 | { 193 | /// Write the embeddings to the given writer. 194 | fn write_text(&self, writer: &mut W) -> Result<(), Error>; 195 | } 196 | 197 | impl<W, V, S> WriteText<W> for Embeddings<V, S> 198 | where 199 | W: Write, 200 | V: Vocab, 201 | S: Storage, 202 | { 203 | /// Write the embeddings to the given writer. 204 | fn write_text(&self, write: &mut W) -> Result<(), Error> { 205 | for (word, embed) in self.iter() { 206 | let embed_str = embed.as_view().iter().map(ToString::to_string).join(" "); 207 | writeln!(write, "{} {}", word, embed_str)?; 208 | } 209 | 210 | Ok(()) 211 | } 212 | } 213 | 214 | /// Method to write `Embeddings` to a text file. 215 | /// 216 | /// This trait defines an extension to `Embeddings` to write the word embeddings 217 | /// as text. The first line contains the shape of the embedding matrix, followed 218 | /// by one word embedding per line in the following format: 219 | /// 220 | /// *word0 component_1 component_2 ... component_n* 221 | pub trait WriteTextDims<W> 222 | where 223 | W: Write, 224 | { 225 | /// Write the embeddings to the given writer.
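The text output is simple enough to produce by hand; this std-only sketch mirrors the format `write_text_dims` emits (a `vocab_size dims` header followed by one `word c1 c2 ...` line per embedding; the helper name `to_text_dims` is invented for illustration):

```rust
// Produce the word2vec-style text format: a "vocab dims" header,
// then one "word c1 c2 ..." line per embedding.
fn to_text_dims(embeddings: &[(&str, Vec<f32>)]) -> String {
    let dims = embeddings.first().map_or(0, |(_, v)| v.len());
    let mut out = format!("{} {}\n", embeddings.len(), dims);
    for (word, embed) in embeddings {
        let components: Vec<String> = embed.iter().map(|c| c.to_string()).collect();
        out.push_str(&format!("{} {}\n", word, components.join(" ")));
    }
    out
}

fn main() {
    let embeds = vec![("Berlin", vec![1.0, 0.0]), ("Potsdam", vec![0.5, 0.5])];
    let text = to_text_dims(&embeds);
    assert_eq!(text.lines().next(), Some("2 2"));
    assert_eq!(text.lines().nth(1), Some("Berlin 1 0"));
}
```

Note that, as in `write_text`, components are emitted via `to_string`, so a value like `1.0` is written as `1`; readers must accept both spellings.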
226 | fn write_text_dims(&self, writer: &mut W) -> Result<(), Error>; 227 | } 228 | 229 | impl<W, V, S> WriteTextDims<W> for Embeddings<V, S> 230 | where 231 | W: Write, 232 | V: Vocab, 233 | S: Storage, 234 | { 235 | fn write_text_dims(&self, write: &mut W) -> Result<(), Error> { 236 | writeln!(write, "{} {}", self.vocab().len(), self.dims())?; 237 | self.write_text(write) 238 | } 239 | } 240 | 241 | #[cfg(test)] 242 | mod tests { 243 | use std::fs::File; 244 | use std::io::{BufReader, Read, Seek, SeekFrom}; 245 | 246 | use crate::embeddings::Embeddings; 247 | use crate::storage::{NdArray, StorageView}; 248 | use crate::vocab::{SimpleVocab, Vocab}; 249 | use crate::word2vec::ReadWord2Vec; 250 | 251 | use super::{ReadText, ReadTextDims, WriteText, WriteTextDims}; 252 | 253 | fn read_word2vec() -> Embeddings<SimpleVocab, NdArray> { 254 | let f = File::open("testdata/similarity.bin").unwrap(); 255 | let mut reader = BufReader::new(f); 256 | Embeddings::read_word2vec_binary(&mut reader, false).unwrap() 257 | } 258 | 259 | #[test] 260 | fn read_text() { 261 | let f = File::open("testdata/similarity.nodims").unwrap(); 262 | let mut reader = BufReader::new(f); 263 | let text_embeddings = Embeddings::read_text(&mut reader, false).unwrap(); 264 | 265 | let embeddings = read_word2vec(); 266 | assert_eq!(text_embeddings.vocab().words(), embeddings.vocab().words()); 267 | assert_eq!( 268 | text_embeddings.storage().view(), 269 | embeddings.storage().view() 270 | ); 271 | } 272 | 273 | #[test] 274 | fn read_text_dims() { 275 | let f = File::open("testdata/similarity.txt").unwrap(); 276 | let mut reader = BufReader::new(f); 277 | let text_embeddings = Embeddings::read_text_dims(&mut reader, false).unwrap(); 278 | 279 | let embeddings = read_word2vec(); 280 | assert_eq!(text_embeddings.vocab().words(), embeddings.vocab().words()); 281 | assert_eq!( 282 | text_embeddings.storage().view(), 283 | embeddings.storage().view() 284 | ); 285 | } 286 | 287 | #[test] 288 | fn test_word2vec_text_roundtrip() { 289 | let mut
reader = BufReader::new(File::open("testdata/similarity.nodims").unwrap()); 290 | let mut check = String::new(); 291 | reader.read_to_string(&mut check).unwrap(); 292 | 293 | // Read embeddings. 294 | reader.seek(SeekFrom::Start(0)).unwrap(); 295 | let embeddings = Embeddings::read_text(&mut reader, false).unwrap(); 296 | 297 | // Write embeddings to a byte vector. 298 | let mut output = Vec::new(); 299 | embeddings.write_text(&mut output).unwrap(); 300 | 301 | assert_eq!(check, String::from_utf8_lossy(&output)); 302 | } 303 | 304 | #[test] 305 | fn test_word2vec_text_dims_roundtrip() { 306 | let mut reader = BufReader::new(File::open("testdata/similarity.txt").unwrap()); 307 | let mut check = String::new(); 308 | reader.read_to_string(&mut check).unwrap(); 309 | 310 | // Read embeddings. 311 | reader.seek(SeekFrom::Start(0)).unwrap(); 312 | let embeddings = Embeddings::read_text_dims(&mut reader, false).unwrap(); 313 | 314 | // Write embeddings to a byte vector. 315 | let mut output = Vec::new(); 316 | embeddings.write_text_dims(&mut output).unwrap(); 317 | 318 | assert_eq!(check, String::from_utf8_lossy(&output)); 319 | } 320 | } 321 | -------------------------------------------------------------------------------- /rust2vec/src/util.rs: -------------------------------------------------------------------------------- 1 | use ndarray::ArrayViewMut1; 2 | 3 | pub fn l2_normalize(mut v: ArrayViewMut1<f32>) -> f32 { 4 | let norm = v.dot(&v).sqrt(); 5 | 6 | if norm != 0. { 7 | v /= norm; 8 | } 9 | 10 | norm 11 | } 12 | -------------------------------------------------------------------------------- /rust2vec/src/vocab.rs: -------------------------------------------------------------------------------- 1 | //!
Embedding vocabularies 2 | 3 | use std::collections::HashMap; 4 | use std::io::{Read, Seek, SeekFrom, Write}; 5 | use std::mem::size_of; 6 | 7 | use byteorder::{LittleEndian, ReadBytesExt, WriteBytesExt}; 8 | use failure::{ensure, err_msg, format_err, Error}; 9 | 10 | use crate::io::private::{ChunkIdentifier, ReadChunk, WriteChunk}; 11 | use crate::subword::SubwordIndices; 12 | 13 | #[derive(Clone, Debug, Eq, PartialEq)] 14 | /// Index of a vocabulary word. 15 | pub enum WordIndex { 16 | /// The index of an in-vocabulary word. 17 | Word(usize), 18 | 19 | /// The subword indices of out-of-vocabulary words. 20 | Subword(Vec<usize>), 21 | } 22 | 23 | /// Vocabulary without subword units. 24 | #[derive(Clone, Debug, Eq, PartialEq)] 25 | pub struct SimpleVocab { 26 | indices: HashMap<String, usize>, 27 | words: Vec<String>, 28 | } 29 | 30 | impl SimpleVocab { 31 | pub fn new(words: impl Into<Vec<String>>) -> Self { 32 | let words = words.into(); 33 | let indices = create_indices(&words); 34 | SimpleVocab { words, indices } 35 | } 36 | } 37 | 38 | impl ReadChunk for SimpleVocab { 39 | fn read_chunk<R>(read: &mut R) -> Result<Self, Error> 40 | where 41 | R: Read + Seek, 42 | { 43 | let chunk_id = ChunkIdentifier::try_from(read.read_u32::<LittleEndian>()?) 44 | .ok_or_else(|| err_msg("Unknown chunk identifier"))?; 45 | ensure!( 46 | chunk_id == ChunkIdentifier::SimpleVocab, 47 | "Cannot read chunk {:?} as SimpleVocab", 48 | chunk_id 49 | ); 50 | 51 | // Read and discard chunk length. 52 | read.read_u64::<LittleEndian>()?; 53 | 54 | let vocab_len = read.read_u64::<LittleEndian>()? as usize; 55 | let mut words = Vec::with_capacity(vocab_len); 56 | for _ in 0..vocab_len { 57 | let word_len = read.read_u32::<LittleEndian>()?
as usize; 58 | let mut bytes = vec![0; word_len]; 59 | read.read_exact(&mut bytes)?; 60 | let word = String::from_utf8(bytes)?; 61 | words.push(word); 62 | } 63 | 64 | Ok(SimpleVocab::new(words)) 65 | } 66 | } 67 | 68 | impl WriteChunk for SimpleVocab { 69 | fn chunk_identifier(&self) -> ChunkIdentifier { 70 | ChunkIdentifier::SimpleVocab 71 | } 72 | 73 | fn write_chunk<W>(&self, write: &mut W) -> Result<(), Error> 74 | where 75 | W: Write + Seek, 76 | { 77 | // Chunk size: vocabulary size (u64), for each word: 78 | // word length in bytes (4 bytes), word bytes (variable-length). 79 | let chunk_len = size_of::<u64>() 80 | + self 81 | .words 82 | .iter() 83 | .map(|w| w.len() + size_of::<u32>()) 84 | .sum::<usize>(); 85 | 86 | write.write_u32::<LittleEndian>(ChunkIdentifier::SimpleVocab as u32)?; 87 | write.write_u64::<LittleEndian>(chunk_len as u64)?; 88 | write.write_u64::<LittleEndian>(self.words.len() as u64)?; 89 | 90 | for word in &self.words { 91 | write.write_u32::<LittleEndian>(word.len() as u32)?; 92 | write.write_all(word.as_bytes())?; 93 | } 94 | 95 | Ok(()) 96 | } 97 | } 98 | 99 | /// Vocabulary with subword units. 100 | #[derive(Clone, Debug, Eq, PartialEq)] 101 | pub struct SubwordVocab { 102 | indices: HashMap<String, usize>, 103 | words: Vec<String>, 104 | min_n: u32, 105 | max_n: u32, 106 | buckets_exp: u32, 107 | } 108 | 109 | impl SubwordVocab { 110 | const BOW: char = '<'; 111 | const EOW: char = '>'; 112 | 113 | pub fn new(words: impl Into<Vec<String>>, min_n: u32, max_n: u32, buckets_exp: u32) -> Self { 114 | let words = words.into(); 115 | let indices = create_indices(&words); 116 | 117 | SubwordVocab { 118 | indices, 119 | words, 120 | min_n, 121 | max_n, 122 | buckets_exp, 123 | } 124 | } 125 | 126 | fn bracket(word: impl AsRef<str>) -> String { 127 | let mut bracketed = String::new(); 128 | bracketed.push(Self::BOW); 129 | bracketed.push_str(word.as_ref()); 130 | bracketed.push(Self::EOW); 131 | 132 | bracketed 133 | } 134 | 135 | /// Get the subword indices of a token.
    ///
    /// Returns `None` when the model does not support subwords or
    /// when no subwords could be extracted.
    fn subword_indices(&self, word: &str) -> Option<Vec<usize>> {
        let indices = Self::bracket(word)
            .as_str()
            .subword_indices(
                self.min_n as usize,
                self.max_n as usize,
                self.buckets_exp as usize,
            )
            .into_iter()
            .map(|idx| idx as usize + self.len())
            .collect::<Vec<_>>();
        if indices.is_empty() {
            None
        } else {
            Some(indices)
        }
    }
}

impl ReadChunk for SubwordVocab {
    fn read_chunk<R>(read: &mut R) -> Result<Self, Error>
    where
        R: Read + Seek,
    {
        let chunk_id = ChunkIdentifier::try_from(read.read_u32::<LittleEndian>()?)
            .ok_or_else(|| err_msg("Unknown chunk identifier"))?;
        ensure!(
            chunk_id == ChunkIdentifier::SubwordVocab,
            "Cannot read chunk {:?} as SubwordVocab",
            chunk_id
        );

        // Read and discard chunk length.
        read.read_u64::<LittleEndian>()?;

        let vocab_len = read.read_u64::<LittleEndian>()? as usize;
        let min_n = read.read_u32::<LittleEndian>()?;
        let max_n = read.read_u32::<LittleEndian>()?;
        let buckets_exp = read.read_u32::<LittleEndian>()?;

        let mut words = Vec::with_capacity(vocab_len);
        for _ in 0..vocab_len {
            let word_len = read.read_u32::<LittleEndian>()? as usize;
            let mut bytes = vec![0; word_len];
            read.read_exact(&mut bytes)?;
            let word = String::from_utf8(bytes)?;
            words.push(word);
        }

        Ok(SubwordVocab::new(words, min_n, max_n, buckets_exp))
    }
}

impl WriteChunk for SubwordVocab {
    fn chunk_identifier(&self) -> ChunkIdentifier {
        ChunkIdentifier::SubwordVocab
    }

    fn write_chunk<W>(&self, write: &mut W) -> Result<(), Error>
    where
        W: Write + Seek,
    {
        // Chunk size: vocab size (u64), minimum n-gram length (u32),
        // maximum n-gram length (u32), bucket exponent (u32), for
        // each word: word length in bytes (u32), word bytes
        // (variable-length).
        let chunk_len = size_of::<u64>()
            + size_of::<u32>()
            + size_of::<u32>()
            + size_of::<u32>()
            + self
                .words()
                .iter()
                .map(|w| w.len() + size_of::<u32>())
                .sum::<usize>();

        write.write_u32::<LittleEndian>(ChunkIdentifier::SubwordVocab as u32)?;
        write.write_u64::<LittleEndian>(chunk_len as u64)?;
        write.write_u64::<LittleEndian>(self.words.len() as u64)?;
        write.write_u32::<LittleEndian>(self.min_n)?;
        write.write_u32::<LittleEndian>(self.max_n)?;
        write.write_u32::<LittleEndian>(self.buckets_exp)?;

        for word in self.words() {
            write.write_u32::<LittleEndian>(word.len() as u32)?;
            write.write_all(word.as_bytes())?;
        }

        Ok(())
    }
}

/// Vocabulary types wrapper.
///
/// This crate makes it possible to create fine-grained embedding
/// types, such as `Embeddings<SimpleVocab, NdArray>` or
/// `Embeddings<SubwordVocab, QuantizedArray>`. However, in some cases
/// it is more pleasant to have a single type that covers all
/// vocabulary and storage types. `VocabWrap` and `StorageWrap` wrap
/// all the vocabularies and storage types known to this crate such
/// that the type `Embeddings<VocabWrap, StorageWrap>` covers all
/// variations.
#[derive(Clone, Debug)]
pub enum VocabWrap {
    SimpleVocab(SimpleVocab),
    SubwordVocab(SubwordVocab),
}

impl From<SimpleVocab> for VocabWrap {
    fn from(v: SimpleVocab) -> Self {
        VocabWrap::SimpleVocab(v)
    }
}

impl From<SubwordVocab> for VocabWrap {
    fn from(v: SubwordVocab) -> Self {
        VocabWrap::SubwordVocab(v)
    }
}

impl ReadChunk for VocabWrap {
    fn read_chunk<R>(read: &mut R) -> Result<Self, Error>
    where
        R: Read + Seek,
    {
        let chunk_start_pos = read.seek(SeekFrom::Current(0))?;
        let chunk_id = ChunkIdentifier::try_from(read.read_u32::<LittleEndian>()?)
            .ok_or_else(|| err_msg("Unknown chunk identifier"))?;

        read.seek(SeekFrom::Start(chunk_start_pos))?;

        match chunk_id {
            ChunkIdentifier::SimpleVocab => {
                SimpleVocab::read_chunk(read).map(VocabWrap::SimpleVocab)
            }
            ChunkIdentifier::SubwordVocab => {
                SubwordVocab::read_chunk(read).map(VocabWrap::SubwordVocab)
            }
            _ => Err(format_err!(
                "Chunk type {:?} cannot be read as a vocabulary",
                chunk_id
            )),
        }
    }
}

impl WriteChunk for VocabWrap {
    fn chunk_identifier(&self) -> ChunkIdentifier {
        match self {
            VocabWrap::SimpleVocab(inner) => inner.chunk_identifier(),
            VocabWrap::SubwordVocab(inner) => inner.chunk_identifier(),
        }
    }

    fn write_chunk<W>(&self, write: &mut W) -> Result<(), Error>
    where
        W: Write + Seek,
    {
        match self {
            VocabWrap::SimpleVocab(inner) => inner.write_chunk(write),
            VocabWrap::SubwordVocab(inner) => inner.write_chunk(write),
        }
    }
}

/// Embedding vocabularies.
#[allow(clippy::len_without_is_empty)]
pub trait Vocab: Clone {
    /// Get the index of a token.
    fn idx(&self, word: &str) -> Option<WordIndex>;

    /// Get the vocabulary size.
    fn len(&self) -> usize;

    /// Get the words in the vocabulary.
    fn words(&self) -> &[String];
}

impl Vocab for SimpleVocab {
    fn idx(&self, word: &str) -> Option<WordIndex> {
        self.indices.get(word).cloned().map(WordIndex::Word)
    }

    fn len(&self) -> usize {
        self.indices.len()
    }

    fn words(&self) -> &[String] {
        &self.words
    }
}

impl Vocab for SubwordVocab {
    fn idx(&self, word: &str) -> Option<WordIndex> {
        // If the word is known, return its index.
        if let Some(idx) = self.indices.get(word).cloned() {
            return Some(WordIndex::Word(idx));
        }

        // Otherwise, return the subword indices.
        self.subword_indices(word).map(WordIndex::Subword)
    }

    fn len(&self) -> usize {
        self.indices.len()
    }

    fn words(&self) -> &[String] {
        &self.words
    }
}

impl Vocab for VocabWrap {
    fn idx(&self, word: &str) -> Option<WordIndex> {
        match self {
            VocabWrap::SimpleVocab(inner) => inner.idx(word),
            VocabWrap::SubwordVocab(inner) => inner.idx(word),
        }
    }

    /// Get the vocabulary size.
    fn len(&self) -> usize {
        match self {
            VocabWrap::SimpleVocab(inner) => inner.len(),
            VocabWrap::SubwordVocab(inner) => inner.len(),
        }
    }

    /// Get the words in the vocabulary.
    fn words(&self) -> &[String] {
        match self {
            VocabWrap::SimpleVocab(inner) => inner.words(),
            VocabWrap::SubwordVocab(inner) => inner.words(),
        }
    }
}

fn create_indices(words: &[String]) -> HashMap<String, usize> {
    let mut indices = HashMap::new();

    for (idx, word) in words.iter().enumerate() {
        indices.insert(word.to_owned(), idx);
    }

    indices
}

#[cfg(test)]
mod tests {
    use std::io::{Cursor, Read, Seek, SeekFrom};

    use byteorder::{LittleEndian, ReadBytesExt};

    use super::{SimpleVocab, SubwordVocab};
    use crate::io::private::{ReadChunk, WriteChunk};

    fn test_simple_vocab() -> SimpleVocab {
        let words = vec![
            "this".to_owned(),
            "is".to_owned(),
            "a".to_owned(),
            "test".to_owned(),
        ];

        SimpleVocab::new(words)
    }

    fn test_subword_vocab() -> SubwordVocab {
        let words = vec![
            "this".to_owned(),
            "is".to_owned(),
            "a".to_owned(),
            "test".to_owned(),
        ];
        SubwordVocab::new(words, 3, 6, 20)
    }

    fn read_chunk_size(read: &mut impl Read) -> u64 {
        // Skip identifier.
        read.read_u32::<LittleEndian>().unwrap();

        // Return chunk length.
        read.read_u64::<LittleEndian>().unwrap()
    }

    #[test]
    fn simple_vocab_write_read_roundtrip() {
        let check_vocab = test_simple_vocab();
        let mut cursor = Cursor::new(Vec::new());
        check_vocab.write_chunk(&mut cursor).unwrap();
        cursor.seek(SeekFrom::Start(0)).unwrap();
        let vocab = SimpleVocab::read_chunk(&mut cursor).unwrap();
        assert_eq!(vocab, check_vocab);
    }

    #[test]
    fn simple_vocab_correct_chunk_size() {
        let check_vocab = test_simple_vocab();
        let mut cursor = Cursor::new(Vec::new());
        check_vocab.write_chunk(&mut cursor).unwrap();
        cursor.seek(SeekFrom::Start(0)).unwrap();

        let chunk_size = read_chunk_size(&mut cursor);
        assert_eq!(
            cursor.read_to_end(&mut Vec::new()).unwrap(),
            chunk_size as usize
        );
    }

    #[test]
    fn subword_vocab_write_read_roundtrip() {
        let check_vocab = test_subword_vocab();
        let mut cursor = Cursor::new(Vec::new());
        check_vocab.write_chunk(&mut cursor).unwrap();
        cursor.seek(SeekFrom::Start(0)).unwrap();
        let vocab = SubwordVocab::read_chunk(&mut cursor).unwrap();
        assert_eq!(vocab, check_vocab);
    }

    #[test]
    fn subword_vocab_correct_chunk_size() {
        let check_vocab = test_subword_vocab();
        let mut cursor = Cursor::new(Vec::new());
        check_vocab.write_chunk(&mut cursor).unwrap();
        cursor.seek(SeekFrom::Start(0)).unwrap();

        let chunk_size = read_chunk_size(&mut cursor);
        assert_eq!(
            cursor.read_to_end(&mut Vec::new()).unwrap(),
            chunk_size as usize
        );
    }
}
--------------------------------------------------------------------------------
/rust2vec/src/word2vec.rs:
--------------------------------------------------------------------------------
//! Reader and writer for the word2vec binary format.
//!
//! Embeddings in the word2vec binary format are read as follows:
//!
//! ```
//! use std::fs::File;
//! use std::io::BufReader;
//!
//! use rust2vec::prelude::*;
//!
//! let mut reader = BufReader::new(File::open("testdata/similarity.bin").unwrap());
//!
//! // Read the embeddings. The second argument specifies whether
//! // the embeddings should be normalized to unit vectors.
//! let embeddings = Embeddings::read_word2vec_binary(&mut reader, true)
//!     .unwrap();
//!
//! // Look up an embedding.
//! let embedding = embeddings.embedding("Berlin");
//! ```

use std::io::{BufRead, Write};
use std::mem;
use std::slice::from_raw_parts_mut;

use byteorder::{LittleEndian, WriteBytesExt};
use failure::{err_msg, Error};
use ndarray::{Array2, Axis};

use crate::embeddings::Embeddings;
use crate::storage::{NdArray, Storage};
use crate::util::l2_normalize;
use crate::vocab::{SimpleVocab, Vocab};

/// Method to construct `Embeddings` from a word2vec binary file.
///
/// This trait defines an extension to `Embeddings` to read the word embeddings
/// from a file in word2vec binary format.
pub trait ReadWord2Vec<R>
where
    Self: Sized,
    R: BufRead,
{
    /// Read the embeddings from the given buffered reader.
    fn read_word2vec_binary(reader: &mut R, normalize: bool) -> Result<Self, Error>;
}

impl<R> ReadWord2Vec<R> for Embeddings<SimpleVocab, NdArray>
where
    R: BufRead,
{
    fn read_word2vec_binary(reader: &mut R, normalize: bool) -> Result<Self, Error> {
        let n_words = read_number(reader, b' ')?;
        let embed_len = read_number(reader, b'\n')?;

        let mut matrix = Array2::zeros((n_words, embed_len));
        let mut words = Vec::with_capacity(n_words);

        for idx in 0..n_words {
            let word = read_string(reader, b' ')?;
            let word = word.trim();
            words.push(word.to_owned());

            let mut embedding = matrix.index_axis_mut(Axis(0), idx);

            {
                let embedding_raw = match embedding.as_slice_mut() {
                    Some(s) => unsafe { typed_to_bytes(s) },
                    None => return Err(err_msg("Matrix not contiguous")),
                };
                reader.read_exact(embedding_raw)?;
            }
        }

        if normalize {
            for mut embedding in matrix.outer_iter_mut() {
                l2_normalize(embedding.view_mut());
            }
        }

        Ok(Embeddings::new(
            None,
            SimpleVocab::new(words),
            NdArray(matrix),
        ))
    }
}

fn read_number(reader: &mut dyn BufRead, delim: u8) -> Result<usize, Error> {
    let field_str = read_string(reader, delim)?;
    Ok(field_str.parse()?)
}

fn read_string(reader: &mut dyn BufRead, delim: u8) -> Result<String, Error> {
    let mut buf = Vec::new();
    reader.read_until(delim, &mut buf)?;
    buf.pop();
    Ok(String::from_utf8(buf)?)
}

unsafe fn typed_to_bytes<T>(slice: &mut [T]) -> &mut [u8] {
    from_raw_parts_mut(
        slice.as_mut_ptr() as *mut u8,
        slice.len() * mem::size_of::<T>(),
    )
}

/// Method to write `Embeddings` to a word2vec binary file.
///
/// This trait defines an extension to `Embeddings` to write the word embeddings
/// to a file in word2vec binary format.
pub trait WriteWord2Vec<W>
where
    W: Write,
{
    /// Write the embeddings to the given writer.
    fn write_word2vec_binary(&self, w: &mut W) -> Result<(), Error>;
}

impl<W, V, S> WriteWord2Vec<W> for Embeddings<V, S>
where
    W: Write,
    V: Vocab,
    S: Storage,
{
    fn write_word2vec_binary(&self, w: &mut W) -> Result<(), Error> {
        writeln!(w, "{} {}", self.vocab().len(), self.dims())?;

        for (word, embed) in self.iter() {
            write!(w, "{} ", word)?;

            // Write the embedding with little-endian encoding.
            for v in embed.as_view() {
                w.write_f32::<LittleEndian>(*v)?;
            }

            w.write_all(&[0x0a])?;
        }

        Ok(())
    }
}
--------------------------------------------------------------------------------
/rust2vec/testdata/analogy.bin:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/danieldk/rust2vec/41e0632f51eb51d941e7860895be418705bc678d/rust2vec/testdata/analogy.bin
--------------------------------------------------------------------------------
/rust2vec/testdata/similarity.bin:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/danieldk/rust2vec/41e0632f51eb51d941e7860895be418705bc678d/rust2vec/testdata/similarity.bin
--------------------------------------------------------------------------------
/rust2vec/testdata/similarity.fifu:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/danieldk/rust2vec/41e0632f51eb51d941e7860895be418705bc678d/rust2vec/testdata/similarity.fifu
--------------------------------------------------------------------------------
/rust2vec/testdata/similarity.nodims:
--------------------------------------------------------------------------------
1 | Berlin 1.4724287 -3.0660472 1.2919313 3.178568
1.90044 -2.5189524 1.999584 -0.9336575 -2.672562 -2.4814026 -1.6819122 -4.8213744 1.0391914 0.5594078 0.79230624 -1.9676353 -1.6043797 0.14731598 -1.3272952 0.35020277 0.3601632 -0.3288135 0.81085825 -0.15568691 -3.970952 -0.9991984 1.0356499 -0.87879723 -1.6305407 -0.13331155 -0.5631624 0.8695107 0.37236077 0.08698855 -1.866027 1.9221265 0.09188359 -0.87970036 0.19700529 0.28974667 3.5789065 0.67385364 0.09417508 2.000851 1.1188437 -3.0917869 -0.31632495 -1.4203465 3.27164 -0.4145638 -1.7861257 -0.9354539 1.0743507 0.8758309 2.8281496 -1.7598739 -0.3187864 0.029769417 -2.3000247 1.0032835 -2.618711 5.037977 -1.7888905 2.0828536 0.7878193 0.029274292 1.2881647 0.5874792 2.7693577 1.2624205 1.39799 -0.03126408 0.5335201 1.0715518 -0.35350195 0.7236325 -2.1143568 -0.65899295 -0.9009533 0.27017146 1.4241284 -1.6128805 0.47837606 0.92541206 -0.82395524 -0.8642171 1.7095582 0.44955313 0.51228195 -0.9758797 -0.56663215 -0.92973757 2.0861468 3.0681355 -2.5254073 3.0756123 -1.1215981 -1.6008509 1.2285731 0.9783312 2 | Potsdam 2.1806493 -2.1420271 0.39108157 2.6696212 2.1306365 -1.620653 1.11783 -0.48762217 -2.740397 -1.5487118 -1.9752603 -3.6563969 0.5549982 1.9216422 0.92049307 -1.012398 -1.802143 -0.9905557 -2.8306844 -0.4000288 -0.4434338 -0.16391866 1.1310339 0.77226526 -4.2902603 -1.917334 1.2348615 -0.8492573 -2.443106 -0.5470828 -2.1903002 0.9924115 0.8337637 0.008502694 -2.012811 0.39595807 -0.12880456 -0.9881574 0.37015662 -0.63543737 3.6737576 0.06061727 -1.0504802 1.6499404 1.2415177 -2.8529696 -1.1279812 -2.7283988 1.8690581 0.19301386 -0.2834016 -1.5914469 2.6947787 0.79130524 2.5396674 -1.4818038 0.29525942 -1.8913345 -3.6693594 0.50415343 -2.3017197 4.589723 -1.8428047 1.4981691 0.49011025 0.32169512 3.4563036 -0.24081603 2.2460399 1.3189701 0.35500714 -0.636182 0.9643915 -0.5861548 -0.99294496 0.5341161 -1.6744504 -0.56519824 -1.9197882 1.5675427 2.3801298 0.32737896 0.23349418 0.9933065 0.10999042 -0.97541165 0.94789284 0.90614897 0.696443 0.037369933 
-0.19528356 -0.5509587 1.7404714 2.655653 -3.0760105 2.6993978 -0.2965713 -1.3025614 1.4548284 -0.6903138 3 | Hamburg 0.4788255 -3.967343 1.1047715 2.6440225 0.60471845 -2.5118766 1.4696972 -0.4973323 -1.6615546 -2.4239113 -0.59188414 -3.502221 1.6147016 1.2382033 1.0770555 -1.4295287 -0.8796352 0.14159687 -0.6082031 0.1284982 0.7834193 -0.38857833 1.6316457 0.259674 -2.8446531 -1.1515193 -0.8148757 -0.31960547 -2.9239964 0.88809747 -0.84069586 0.4803386 0.21044937 0.79038495 -2.6835477 1.6357746 -0.42380887 -1.1495888 1.6994416 -0.34663934 3.9783978 0.8170817 0.15714933 3.3729765 1.4110607 -2.4280055 -0.70826614 -0.99382937 3.2502427 -0.4775877 -1.734122 1.576659 0.21289098 -0.70036286 2.40948 -0.7965453 0.14933442 -0.22131115 -1.3322595 0.40128574 -2.5444393 2.1805856 -2.2447803 1.6169271 0.83752316 0.984588 0.25304157 1.8828899 1.0015125 2.133122 0.52751476 1.2244046 1.0903093 1.7118615 0.5912307 1.1165092 -1.7309636 -1.0335245 -0.4508689 0.49650037 2.1963146 -2.4764233 -0.3523233 0.2862572 -0.16439326 -1.3313167 0.2201799 1.1139678 0.16551498 -0.8288779 -0.888256 -0.7978216 1.1053181 4.6329308 -3.1685114 3.6558213 -1.7478929 -0.80149263 1.0683511 0.9312934 4 | Leipzig 0.76293504 -3.9712656 1.439183 3.3895504 3.3073556 -2.8665438 3.175805 0.43280777 -2.5932677 -2.289003 -2.0658562 -4.2276464 1.778095 2.7869446 1.3783438 -0.88617224 0.19531386 0.53868717 -2.6819577 1.7986319 0.012435815 -0.91381717 0.2762175 1.0501877 -4.040192 -1.4964848 0.7272464 -0.53534126 -2.1375997 0.69183654 0.91598064 1.52499 -0.7642817 0.6103151 -1.60211 1.9885116 -1.1169372 -1.8181924 0.52587205 0.67105275 3.886781 -0.6161123 -0.75117207 1.990302 1.6462587 -2.8843968 -0.87020373 -3.499557 2.7631874 0.10435269 -2.1606202 -1.5144792 2.0407827 2.4870102 1.4161308 -1.1555979 1.1070981 -0.21489511 -3.0637407 1.8440766 -1.941054 3.932652 -1.2583886 2.5001783 0.7809312 -0.37164783 2.2988632 -1.1895068 1.8822465 1.2318531 -1.1575056 -0.9978991 2.9044764 -0.16954686 -0.42839298 1.321524 
-2.1555204 -1.7755115 -2.0158286 -1.1899067 2.9957414 -1.456828 0.014243404 1.2383107 -0.6006595 0.507483 2.203124 1.6735793 0.61495835 -2.2341802 -1.4414783 -1.215616 2.436923 3.307275 -2.7870371 0.6347615 -0.5415665 -1.1946487 -0.93109924 -0.75525737 5 | Dresden 1.4830393 -2.1041064 1.5380704 3.0213842 3.5982454 -2.830215 2.7828872 0.7929602 -2.794067 -2.2185907 -1.7122478 -4.1100373 1.8872013 3.4449346 0.68826157 -1.7484419 -0.5578897 -0.7990673 -1.7905833 0.78482324 -0.115760714 0.033562 -0.47313848 1.0816466 -3.4885328 -1.3895888 1.5654894 -2.1535795 -0.91556585 0.14537974 -0.35413167 1.3126079 -0.3496932 0.84601784 -1.4647814 1.7066644 -0.8741384 -2.15035 0.53159475 -0.4259317 4.3474813 -1.491072 -1.4061366 0.5328625 1.4400252 -2.366021 -1.9429433 -2.9716759 1.8089212 0.07723984 -1.7319798 -1.3349689 2.6314526 2.6112654 1.9114788 -1.2875332 1.4027938 -0.89346695 -3.9943836 0.4749197 -1.9437139 5.1928263 -2.0613582 2.4165585 -0.8855015 -0.022410275 2.344985 -0.90404487 1.7291049 1.5397507 -0.5362272 -1.0805588 2.6596174 0.11570591 -0.007056159 1.2742046 -2.4798312 -0.5189481 -1.9706321 0.46611267 1.2759173 -0.4419072 -1.1042553 0.599357 0.16299419 0.5400928 0.19667199 1.884171 1.2272686 -1.2424859 -0.68182045 -1.6905276 2.3942795 2.5897222 -2.447481 1.9752814 -1.7427028 -1.7839674 -0.63109744 -0.1590475 6 | München 1.5581971 -2.694572 2.0958502 2.8834038 0.8114807 -2.685636 1.5643935 -1.4720266 -1.3652112 -2.7260942 -1.690674 -5.457859 -0.79455656 1.8528954 0.8194059 -2.432036 -2.112506 -0.63779604 0.39192328 1.0948844 -1.2697182 0.33826646 -0.8565712 0.9969374 -3.19768 -0.90511006 -1.6199468 -2.2043524 -1.3684238 0.43872362 1.2488184 1.3292778 -0.73115426 1.851259 -3.191608 1.447241 0.0036239403 -1.5151725 0.7690032 -1.1638637 3.512945 -1.1420819 0.25593963 1.5016538 2.0236127 -2.1627603 -1.2536234 -2.2722287 4.049661 0.024156798 -0.9842732 0.6380802 -0.5649606 2.3362699 2.5375628 -1.3744534 -0.28534806 -0.7478447 -1.3420091 1.7010773 -1.1509776 4.2620177 
-0.25741062 2.432371 1.3142465 1.8221841 0.70770067 -0.18285336 2.2706513 2.8187113 -0.5358914 -0.33096182 1.6965461 2.1975157 -2.2036457 0.16775945 -2.139775 -0.93282837 -0.24334294 -1.9675298 1.3964478 -2.1778998 0.72122544 1.5923612 -0.35344982 1.553124 0.38507807 -0.6632246 1.3712205 -2.2499042 0.67210513 -2.019939 2.1168294 2.640666 -2.8597682 2.6500952 -1.7529404 -1.9723235 -0.6133123 0.91012543 7 | Düsseldorf 1.5212423 -3.9001079 1.7847265 3.2833288 0.3844156 -2.1836567 0.975444 -0.7471421 -0.49933934 -2.8120863 -0.34014714 -4.145453 1.2113757 0.35558903 1.6788709 -3.4618826 -1.6455655 -0.63275194 0.21100935 -2.0480332 -0.52259064 -1.0434821 0.20867386 0.761265 -3.6849155 -1.3287961 -0.23359175 -1.0305196 -1.7882146 0.9501958 0.38715315 1.2582024 1.2610508 0.661611 -1.6118804 -0.3079202 -0.6327049 -1.4893847 1.9941448 -2.2638383 4.49119 -0.80002576 -0.17894669 2.3914015 2.6378434 -0.3556171 -1.8294861 0.2760385 3.9054818 -0.32086903 -1.8745781 0.40066805 0.3364297 0.19304688 1.7238512 -1.5422405 -0.4791712 -0.7788698 -1.2188956 0.4910014 -1.297487 4.137419 -1.4034173 2.741027 -0.13779528 0.54068 -0.4948545 1.9248078 1.6319622 2.5900772 -1.366469 -0.561143 0.645385 0.38225615 -1.9899735 -0.07527464 -1.0381002 -0.41467375 -1.195821 -1.5798789 2.0088217 -1.7918046 0.6025667 0.072079524 0.49446368 -0.6943376 0.75301003 0.39142188 1.4027998 -0.7229614 -0.8084729 0.125882 1.223775 2.923515 -1.6562092 2.9426985 -0.49647766 -1.6906776 -0.15633585 0.6956885 8 | Bonn 1.7732444 -4.0663724 2.072197 4.2958775 0.18073265 -1.360263 1.3787003 -2.6878824 -0.06871051 -2.274103 -0.003273571 -3.6256404 1.1025541 -0.1968193 1.2135613 -2.3674445 -2.793173 1.2752544 -1.2341235 -0.18676393 -1.331118 -0.91930425 -0.6074049 0.9812637 -4.4210143 -1.5624912 0.5330766 -0.09736781 -0.9911854 -0.14593926 -0.093018785 0.8693645 -0.43050107 -0.5296043 -1.6157752 0.96951985 -0.33930546 -2.0170388 1.0264121 -1.3959492 3.6000252 -0.04480099 -0.052739866 2.3069725 1.4800119 -1.2987778 
-0.20072353 -0.61382055 3.5290253 -0.61366093 -0.9096922 -1.1583596 -0.5218217 1.8625399 1.4897516 -2.2108228 -0.4911918 0.24198373 -0.4253451 0.3577809 -1.0172968 2.8009262 -0.49275985 2.0446684 1.57874 0.055591706 -0.65575784 2.097634 1.6710955 1.3335977 -0.8540007 -0.4798934 0.29218873 -0.40195554 -3.4654505 0.5404215 -0.22525021 -2.2620277 -0.33007228 -1.2447248 3.4830062 -1.3223616 2.025733 1.1014658 1.3948184 -1.8369954 1.0099273 0.58137643 0.98908925 0.015818706 -0.67495096 0.35997796 1.423693 2.9020085 -3.3476353 2.236684 -0.32532597 -2.5610816 -0.2984117 -0.1103389 9 | Stuttgart 0.566494 -3.3570497 0.20334707 2.7772198 1.6102041 -2.6712117 1.6524634 -0.57150054 -1.799484 -4.1122923 -1.8026357 -4.316234 0.24488114 1.6587019 0.63662624 -2.6119182 -0.87825954 1.6793637 -0.7267286 1.1920238 -1.1784716 0.58053046 0.021910671 1.4142443 -3.0392811 -0.6221683 -0.22543208 -1.1319795 -2.3557017 0.16795665 1.4316692 1.2761024 -0.4635025 1.5018222 -1.8756893 0.4774182 -0.27453715 -2.166672 1.3811263 -0.664769 4.758611 0.020260576 -0.03183456 1.4845581 1.1988331 -2.0244591 -1.8282847 -0.20666443 2.8957145 -0.35303822 0.17230172 1.4478674 -2.0968559 2.4509158 0.7593037 -0.43448827 -0.9298496 -2.4931943 -2.5525696 1.3863871 -0.77869016 4.6269565 -1.5415394 2.8452494 -0.62160087 1.0701655 1.1247526 0.7991183 1.4733721 2.6291966 -1.0968398 -0.30187252 2.3760073 1.5041481 -2.6256216 0.59593666 -1.273372 -0.6450869 -1.1046237 -1.9701321 3.2812998 -0.36661112 -0.028649885 0.48226166 0.24222066 0.031353075 -0.12167936 0.07345636 0.1060736 -1.813918 -0.1482184 -0.16733412 1.9059938 2.9482884 -2.3108249 0.11083524 -1.6019679 -1.7852423 -0.75115997 -0.21917112 10 | Weimar 0.81683314 -1.7131653 2.436876 2.9921494 1.845896 -0.29539084 2.6221786 -1.1147174 -2.5372846 -2.334248 -2.3316758 -3.4694836 0.3528517 4.3798513 1.2068924 -1.1165236 1.4166012 1.2022911 -2.4026651 1.1520902 0.3682473 0.26225334 -0.13478528 0.9253019 -3.921975 -1.4104351 0.93902147 -1.2302884 0.16889736 
-0.2928235 0.5133384 1.7373384 -0.35981962 1.8200374 -1.4862964 1.1422827 -0.46242154 -0.98481894 1.3346587 -0.75390804 2.93283 -2.214732 -0.15668304 1.0968354 0.80897737 -2.3562891 -0.5104329 -2.7905853 1.4891392 0.2564415 -2.0071492 -1.1218625 1.6000528 2.6111226 0.75279814 -3.2125823 1.0141833 -0.94432616 -1.8565971 0.19892144 -1.2871708 4.8923583 -2.1467447 2.49729 -0.6899097 -1.9306469 3.0746555 -1.0212816 1.7709491 0.9943837 -0.46632946 0.35771576 1.1900855 0.41690326 -0.098320544 0.7506466 -2.0931897 -1.4453712 -1.5266876 0.3486694 1.5608495 1.5761656 -0.024269057 1.0271972 0.9672212 -2.3567986 0.4470294 1.6981502 1.1327515 -1.7120334 -0.6982484 -2.2307353 1.3626552 1.8623816 -1.4885651 1.7773243 0.10906778 -3.4978728 -0.04031212 0.92997164 11 | Berlin-Charlottenburg 1.5128573 -1.5925028 0.6586595 2.737699 2.7336202 -3.3243005 2.3284876 -0.4010127 -2.5603123 -1.236744 1.390698 -3.4452677 0.10947353 1.6791242 1.7465216 -0.76290476 -2.2212007 -1.7264746 -1.6744612 -3.056577 -2.49312 -0.07676466 0.8911905 0.6336585 -3.212906 -2.2531605 0.083335854 0.24141365 -0.59732366 -0.98398954 -0.6318233 1.0299761 -0.26325086 0.02371216 -0.9989773 1.6258873 -2.2719183 -0.15970646 0.68009555 0.83962363 4.3310847 1.0046169 -0.97362506 0.4421385 1.035217 -3.0715208 -0.005432196 -1.245041 1.9522611 0.1145647 -1.1913565 -1.1630477 1.0894196 0.71561384 3.754566 -2.4362912 -1.285735 -2.2569542 -3.184358 1.0847605 -0.94772005 4.8925624 -0.5711599 1.2855332 0.7056007 -0.18500423 0.63779205 -0.8603174 1.1504306 2.7624261 -0.2313923 0.72988546 0.86560756 1.2779663 -0.64521337 -0.7648987 -2.4959733 0.72240245 -2.578197 -0.26208177 0.54461426 0.0914198 0.5369705 1.3007053 1.4684235 -0.6077469 0.2473778 -1.4296608 0.31666595 0.999417 -0.5698435 -1.8468993 0.5013717 0.90414715 -1.0855249 1.2087494 -0.5743711 -0.8130733 1.8982494 1.5004699 12 | Rostock -0.052447423 -3.60559 1.6772109 2.86706 1.9994404 -2.4348578 1.6227752 0.8745734 -1.6735575 -1.1641387 -1.2109839 -2.7128108 2.6931345 
2.4101596 1.7361269 -0.3911597 0.28977996 -1.826342 -2.747718 0.40901753 1.0916499 -0.70737875 0.17943698 2.0854406 -3.933727 -3.1403418 -0.19658765 0.86128414 -1.990467 0.52544236 -0.68965673 0.2678122 -0.4056638 -0.22843067 -2.2503095 1.3350903 -0.87765706 -1.1228162 0.6979005 0.0049363077 3.679995 0.3805372 -2.3832633 2.8149588 1.0458624 -3.4371023 -1.9802814 -2.762811 2.9123113 0.10116282 -1.2954397 0.08389999 3.2545602 0.28438094 2.5642972 0.47498396 0.16124012 -0.11982153 -2.2074656 -0.5093451 -1.7396588 2.801415 -1.7982526 1.1646467 0.42070258 1.4792393 2.8597028 -0.8791618 0.25874716 1.6536496 -0.70525676 0.16340649 1.8989698 -0.3339532 -0.095169045 1.763818 -1.5596108 -3.0067837 -2.4666238 0.4830095 3.154445 -1.3701065 -1.0491397 -0.06306716 0.46231583 1.5220621 0.7570933 2.8409429 0.10693448 -1.1121051 -0.9337621 -1.1660988 0.8417191 4.024915 -4.042802 2.2258606 -1.2680017 -0.31036356 0.83193445 -1.2228782 13 | Karlsruhe 1.5838838 -2.529165 0.9414354 3.7976122 1.6307899 -2.4193742 2.354482 -1.1965469 -1.6334033 -3.3886178 -1.8650794 -3.5369074 0.33077472 2.2121544 0.26852635 -1.5807775 -1.8900567 1.4273667 -0.48683122 0.96753025 -2.2271042 0.4032111 -0.084693246 1.3407979 -3.3360238 -1.1322114 0.32269242 -1.2017418 -2.0599213 0.114292204 0.4215994 0.79854107 -0.61375797 1.3691764 -1.80406 1.0086576 -0.38792765 -3.078458 1.4099476 -1.2147843 4.136998 -0.6958336 -0.38451162 2.6059084 1.5553446 -1.7188534 -1.5271364 0.79726213 2.5949821 -0.39603162 0.21297146 0.28881106 -1.3120706 2.9030619 1.1491185 -1.0110502 -1.576977 -2.4263206 -1.5825466 1.4203827 -0.6895136 3.9579902 -1.6935902 2.7911727 -0.9775646 1.254181 1.9612513 0.23665561 1.0309166 2.9538777 -0.3533035 -0.79715294 2.3807254 0.6430419 -3.1850677 -0.39104682 -1.2210801 -0.9732431 -0.2532587 -0.6335479 2.77573 -0.042449255 0.36324272 0.86438656 1.6090865 -0.031009398 -0.48638025 0.31716207 0.5224234 -1.1756518 -0.17951767 -0.09912228 2.530875 2.7977724 -2.0488753 0.20153423 -1.1002016 -2.4104908 
-0.3136595 0.11849096 14 | Chemnitz -0.071821 -2.078619 0.96774507 2.2073932 2.5093505 -2.2785385 1.9068918 1.747782 -1.5290073 -2.1504567 -1.9481535 -3.25365 1.1573522 2.8399737 0.9552885 -0.50837624 -1.0282772 0.6313287 -2.3201306 -0.9153467 -0.24772462 -1.171374 -0.06752727 0.1618347 -2.9034374 -1.274849 1.0693676 -2.2402093 -0.9883827 0.26384512 0.3823496 1.0184706 -0.18779366 -0.024206221 -0.8312848 0.9084917 -0.9396645 -1.2581304 1.1011178 -0.6893514 4.0972996 -0.5598103 -0.9826887 0.7696983 1.2559915 -2.526354 -1.6847618 -1.8165182 0.4604122 -0.63151234 -1.5176642 -1.2501324 2.1910279 2.5304148 0.9832758 -0.07615802 0.8814643 0.10660285 -2.8825188 1.7149248 -1.8577533 3.8233044 -2.4479718 1.8349282 -1.4475033 0.12769814 2.8790116 -0.8553676 2.7400064 2.0031266 -0.84577686 -0.8724914 2.9834616 0.21649545 0.037663236 0.42180708 -2.3219974 -0.3660082 -2.2743728 -0.43197122 1.7215517 0.7759706 -0.8745138 1.6466581 1.0030715 1.0787317 0.8280506 1.0751222 0.82671565 -1.9481515 -1.3891073 -1.2558906 1.3042547 1.268454 -2.6396513 1.6141497 -1.6413977 0.6219629 -1.8313429 0.0176474 15 | Breslau 0.66671115 -1.4477109 1.0112724 4.2330403 2.26337 -2.566192 1.6472995 -0.94408286 -2.763655 -0.96037793 -2.1646404 -2.903815 1.8353232 1.119287 2.0169706 0.109428614 -1.1536467 -1.7285823 -1.703771 0.71207243 -1.2090474 -1.3298646 -0.7168917 0.7118372 -4.754815 -3.5631673 1.5013397 -0.40079096 0.89500564 -1.0728468 1.462173 0.58924294 -2.7559283 0.21749988 -1.6912439 0.64358234 -2.0710313 -0.9739145 2.2502794 -0.29603416 1.6878437 -1.9989101 -0.12949786 1.4111143 2.1771693 -4.3933425 0.45143798 -1.1235601 3.9444873 -0.72424275 -3.2929354 -0.9749108 2.5522845 0.6606583 2.6612196 -2.2223554 0.89602685 0.9460411 -0.6462083 1.5754563 -1.8068516 4.975164 -2.313616 2.444937 1.5838213 -0.62329596 1.585351 -1.0637891 0.42865443 1.4238337 -1.9904886 -2.2281165 3.041281 1.4201051 -2.0143278 0.6113068 -2.4830418 -0.9999673 0.2892749 0.8584928 2.4978707 -1.0638367 -0.34895396 0.14908566 
-1.7034122 1.941026 1.8985492 1.2984878 -0.0162009 -1.5578259 -0.008550553 -2.187807 2.0593472 1.5489694 -0.97414887 1.9475485 0.24402145 -0.8469044 0.0606515 -0.5003547 16 | Wiesbaden 1.4199681 -2.5668006 0.39140195 2.9957764 1.0183543 -1.4078726 0.7635888 -0.41390583 -1.5961545 -2.699855 -1.1129898 -3.5326867 -0.72821206 1.1345066 0.8207142 -1.2185962 -1.9531763 1.2046174 -0.4504608 -0.7006502 -1.534066 -1.0847932 0.77006334 0.87105274 -3.6663823 -1.6704682 -0.36631128 -1.9595237 -1.2353991 0.82079273 0.31586438 1.099766 -0.01171402 1.5779866 -1.5359977 -0.26175913 -0.8228132 -2.7640197 1.208897 -2.4825473 4.4543166 -1.9231594 -0.02933743 2.2114215 0.80981475 -0.3680666 -0.58253014 0.5325622 2.2461965 0.52198386 -0.7502252 0.40309754 -0.6810579 0.44405124 0.58179885 -1.7256589 0.087126754 0.3616069 -1.0917072 0.6464616 -0.64715344 3.034295 -1.0669016 3.0844803 0.23103613 1.0196046 1.3634642 2.1255064 1.3838927 3.629445 -0.6646856 -0.50577396 -0.24480678 0.70948344 -2.4192917 -0.059503227 -1.9207587 0.33785114 0.39449245 0.54787153 1.6864594 0.5424148 0.74976885 1.2223327 0.3308578 -0.9897992 0.005390374 -0.07435875 1.691617 -1.1103004 -1.0191981 0.6588127 1.7831817 2.5723722 -2.276287 2.0083318 -0.15307742 -1.6040094 0.49362868 0.08535702 17 | Hannover -0.69787604 -2.8318765 0.7091279 3.5683258 0.628702 -1.2518525 1.3134501 1.261545 -2.958457 -2.6016433 -2.7561047 -3.0847988 1.0172442 2.5681057 2.4131522 -0.68230635 -1.5334098 -0.58935624 -0.8431034 1.6505982 1.4996935 -0.64220685 1.1116873 1.8012233 -2.033421 -2.3749971 -0.6984424 -0.3277968 -2.6966162 1.0468854 -0.79633635 -0.42681897 -0.7734028 -0.030661095 -3.3196568 0.30523533 -1.2223358 -0.5729508 3.0617197 -2.9068668 5.124666 0.023640223 -0.2518654 3.486461 0.13928273 -2.251292 -1.9075116 -1.5468445 2.4325805 -0.7512338 -1.6511614 0.5302179 0.5033397 -1.2068863 1.7543223 -2.639877 -0.13291581 -1.1188376 -2.846398 1.4638711 -1.5446904 2.002745 -4.279742 1.7247477 -0.3494775 0.27441013 1.2180074 2.1948845 
0.6781029 3.4813664 -0.3927816 0.64181846 1.0156127 0.090220876 -2.145199 0.7525145 -1.1697323 0.36291173 -1.0443352 -0.13443908 2.818602 -0.26965895 0.48110953 0.36181536 1.2128823 -1.0970176 1.1327068 1.5118134 0.41952717 -1.3881661 0.001177523 -1.2824947 1.2979654 3.0024998 -4.0008774 1.570004 -1.4554096 -1.9601312 0.14944243 -0.4392077 18 | Mannheim 1.6853468 -3.2835474 0.5989641 2.9801364 1.1810145 -1.3968529 0.9408627 -0.31254974 -1.6122487 -3.2608855 -1.8724927 -3.6118238 0.4104572 1.0427544 0.30374062 -1.9935431 -1.5625105 1.9390556 0.2802701 0.7135401 -1.2754225 0.86246777 -0.5044709 0.742776 -3.7144203 -1.1937207 0.6574015 -1.5149904 -2.0845299 1.5331029 0.5528398 0.8555008 0.35484433 1.0747472 -1.9781708 0.7532778 -1.9935217 -2.8789897 0.15468112 0.01772207 3.759633 -1.2340535 -0.7571003 2.9222007 1.5944724 -1.7833782 -1.9419789 0.5366383 2.66199 -0.18022706 -0.22188514 0.94885194 -0.94251925 1.803573 0.027757896 -1.0817147 -1.053687 -1.6144716 -1.0843568 1.1414813 -1.1633259 3.2404246 -1.8814781 2.945782 0.7031601 1.4079549 0.46572563 0.3180502 0.67708963 1.772458 -1.7707399 -0.23093641 1.2238898 -0.1963919 -3.496091 -0.6290872 -1.9636999 0.653173 0.046926603 0.30976585 2.6482317 -1.1836964 0.18703046 1.8312885 -0.19829342 0.31099528 0.061475094 -0.00835403 0.07782858 -0.63588816 -0.9380958 0.9327447 2.840647 2.300486 -2.8627465 0.6520051 -1.6713516 -0.90254253 -0.9834176 -0.17870733 19 | Kassel 1.5573624 -2.7414863 1.5721854 2.4153137 2.0354354 -0.6525211 1.6377213 -0.16256155 -1.8441646 -3.4728076 -2.572633 -3.7617068 0.23523624 1.2710865 1.2162752 -0.9205896 -1.0587155 0.9277047 -2.8340244 0.3071252 0.47764614 -1.3377731 0.91559184 0.9001432 -1.9663873 -1.7913288 0.10947271 -2.24387 -1.1993378 1.057816 -0.20420754 1.7875043 -0.22031724 1.3919324 -2.4828248 -0.32803512 -1.6879225 -1.686564 2.3967042 -2.0687506 4.3484454 -1.5348932 -0.3441588 2.5623853 0.71311176 -0.13994598 -1.1567454 -0.6108672 2.0728295 0.107921764 -1.7342259 0.020487288 -0.06484621 
1.0478864 0.16133566 -2.6260536 -0.20000869 -0.59325665 -1.6615988 0.9442255 -0.9971362 3.3918362 -2.491314 2.04659 0.23428509 0.6876007 2.062595 1.6268865 1.1007411 3.1196811 -1.7570467 -0.7812416 1.0148875 -0.22405078 -2.5530744 -1.7251579 -1.5185262 -0.18235597 -0.8886198 -2.190613 2.6763387 -0.103315204 -0.683338 1.2251213 0.7690897 -1.1588682 1.0098182 0.71360654 1.3288852 -0.2081421 0.34807426 -1.4329259 0.93843347 3.3448193 -3.1033094 1.558151 0.36050677 -1.5369536 -0.007639308 -0.91065574 20 | Köln 1.2908704 -3.3767362 1.4670706 3.8735557 -1.9078406 -1.6077236 1.331111 -0.29959556 -0.7380596 -2.6523669 -0.8157046 -3.6321323 1.5686741 0.67384714 1.4079194 -2.3653526 -0.74354786 -0.61259615 -0.5996697 -0.8911651 -0.37995145 -1.440093 -0.7279234 0.8999324 -3.6521199 -1.612156 -0.84070253 0.30401257 -2.1488779 1.4132677 0.61849993 0.09630572 -0.06868635 0.2913071 -1.2995808 0.52277637 -0.8921397 -1.711682 1.2151649 -2.328338 4.783667 -2.0333195 -1.2947583 1.402545 1.9636314 -1.3122685 -1.2194855 -1.1223034 5.143693 -0.122294925 -1.7692726 0.06915681 0.24433112 0.4655718 0.30582476 -0.9042114 0.40114832 0.64410394 0.41263345 0.3226011 -1.2358178 3.9019098 -1.5221609 2.6278183 0.6591511 1.1480902 -0.73108625 2.197401 1.5581341 2.0085595 -2.032876 -0.522519 0.7546398 -0.00058358227 -3.252497 -0.2191995 0.21831991 -2.148597 -1.085623 -1.0190282 2.8564777 -2.6440372 0.50124097 0.83531433 0.2787406 -1.2303942 1.0826375 1.2501755 1.0798703 -1.5076725 -0.16757096 -0.14564253 1.5245255 2.7275012 -2.7803981 3.1123276 0.13020447 -0.8966806 -0.51066583 0.99631786 21 | Danzig 0.3434559 -0.43505847 0.02619345 3.6308215 0.8936174 -1.6695503 0.5677632 -0.55206066 -2.2976527 0.38730383 -1.474245 -1.8394965 1.6696627 0.4256326 1.9035703 -0.3328135 -0.60560215 -3.0340357 -0.2385389 -0.08148415 0.56776917 -0.8746059 0.5579435 -0.15805362 -2.7945879 -0.8971243 2.3140311 0.9382688 -0.9135117 -0.98293906 -0.54332334 1.5357034 -1.8185693 -0.15237619 -1.6972352 2.3965542 -1.8655112 
-0.9251705 1.1926954 -1.45271 1.5386003 -0.23270151 -0.80040723 2.4281464 2.188324 -3.9811633 -0.29961133 -0.3741601 3.5476863 -1.4168851 -2.7655072 0.29077634 3.2625425 -0.96680015 3.5290577 -1.9322486 0.6610368 0.60808754 -0.8311669 -0.11044524 -2.4496436 3.3420672 -1.9751523 0.47625235 0.32710505 -1.0272101 1.3547778 0.13447116 0.13154672 0.28813428 -1.3289757 -0.8861751 1.4825644 1.3329834 -0.5213931 1.5170165 -1.668272 -0.77056 0.780198 2.1736033 2.6864924 -1.2351998 -1.5128056 1.3868883 -2.0170314 -0.06513051 1.1191978 1.8304222 0.7731705 -1.1966338 0.18132062 -0.7299618 1.2886244 2.2685423 -2.0956504 4.2226095 0.083444014 -0.8096383 2.1072474 -0.40903544 22 | Erfurt 0.80782825 -2.3669503 1.1096654 3.270655 0.7745064 -1.4519039 1.9219042 0.8311167 -1.8042164 -2.2313366 -1.7731649 -2.9549952 1.3400627 3.8829648 1.4899487 -0.84751326 0.11752469 -0.113706656 -3.7200713 0.29025638 1.7293645 -1.8050618 -0.6477939 1.1124035 -2.2349725 -1.6650132 0.16187541 -1.4247701 -1.110588 1.782536 -0.20283519 1.8115572 -0.97517 -0.056721106 -1.1387268 0.82461 -1.2046744 -0.44286248 0.8746458 -1.0310925 3.6327696 -2.0560498 -1.8590186 0.91231483 0.23903866 -2.6812892 -0.9170981 -2.7532418 1.3944647 -0.18646018 -1.4106637 -1.1197873 2.4322257 2.0784647 0.43617037 -0.34893444 0.8888223 0.213287 -2.1159236 1.1269192 -1.5725394 4.0761075 -1.6051449 2.4778018 -0.22551571 0.35592985 3.0320697 -0.2609272 1.8677168 1.7028247 -1.7724258 -0.21624248 2.085475 -0.7155931 -2.193457 -0.074285775 -1.4831631 -1.6593729 -1.9956537 -1.0582907 2.9888003 -0.07701468 -0.15102485 1.477318 0.7021823 1.0834221 1.4207053 1.6859922 1.4054524 -2.4089198 -0.7524289 -0.6321149 2.283272 2.055233 -3.406661 2.2726152 -0.6259773 -0.43021062 -0.91989833 -1.3181835 23 | Dessau 0.1655563 -1.3956314 0.90144336 1.5295044 2.1083896 -1.5131856 3.11738 1.23529 -2.8733816 -2.1547744 -2.9989424 -2.7864363 1.2590435 4.0247183 1.2933178 0.08136682 0.2159727 0.023950133 -2.622547 -0.30600816 0.5474296 -1.4347308 1.3657694 
0.19545676 -2.600723 -2.0367246 1.8279332 -0.57736945 0.10621047 0.81651056 -0.01658237 1.6952422 0.62712497 -0.06848756 -0.23226205 0.45668364 -2.4231057 -0.58289593 1.6418141 -0.34882122 2.8821259 -1.2120405 -1.3069633 1.4901137 0.801778 -3.2862923 -1.6011274 -1.1307199 1.5939882 0.7571907 -1.7293857 0.70702064 2.5279675 1.5347164 2.2690432 -2.343719 0.96801174 -1.8281279 -2.00432 0.7275825 -0.07954622 5.1753235 -2.5561414 1.8409889 -0.23537984 -0.83301705 1.4757801 -1.6028922 1.7116698 0.86495364 -0.8482321 -0.053865537 1.2644356 0.100376725 0.4115513 0.3793352 -1.7280862 -0.13078277 -2.2270527 0.4125044 1.671268 1.2934842 -0.99142736 0.64955485 0.25455117 -0.5455502 -0.17631105 1.8463999 -0.21313727 -0.9188718 -0.9872116 -0.35021922 1.9959435 1.3820934 -1.8582863 2.0353177 -0.92023367 -0.32049862 -0.35664362 -0.45437723 24 | Bremen -0.056817725 -4.0306034 0.28295764 4.138671 -0.46787897 -2.3579674 0.9685631 0.43590134 -1.3404713 -1.359564 -0.82277346 -2.200871 1.7565475 2.3212283 1.4477017 -1.3962991 0.17130011 -1.3545963 -0.90605104 -0.4249804 1.5688779 -0.99374324 1.4624224 0.7877644 -2.3831103 -2.1295328 -1.6660641 0.2232544 -2.6901212 0.72590595 -0.94406307 0.5009411 0.4212869 0.42321622 -3.0735166 1.0214167 -0.7041424 -0.75787455 1.7760745 -2.6773489 3.53388 0.14633165 -0.5842844 4.046178 0.89186704 -2.8787107 -1.4739583 -1.6824589 3.0122874 -0.60311264 -2.085615 1.0714922 0.50133646 -1.3832761 0.50981027 -0.37709832 -0.27450308 -0.18078382 -1.0240859 -0.46115756 -1.5725161 1.2821352 -3.427919 1.2404199 -0.105012506 1.4428252 1.5813133 2.5607285 0.67731065 2.4371612 0.05218624 1.4794445 0.71612155 0.647879 -0.5704556 0.49217856 -1.0569705 -2.0387053 -2.0052538 0.60034734 2.4288895 -1.2816066 -0.0058241324 0.5419774 0.052929502 -1.0115052 -0.79919124 1.1636823 0.1553158 -1.325528 -1.1693679 -0.18858348 0.76196253 3.9781427 -3.962511 2.4247482 -0.5868158 -1.2412579 1.3530936 0.8364813 25 | Charlottenburg 1.4314452 -2.0112507 0.15726177 3.1659262 2.5931208 
-1.7140193 1.7628257 0.9795751 -2.6923652 -1.5155613 0.14690244 -2.9725266 0.21247612 1.8636783 1.1395919 -0.7462718 -2.5799801 -2.4860883 -1.0964936 -2.0945902 -2.031724 0.91948587 -0.94041955 0.87401015 -2.8062131 -2.03308 1.4667398 -1.6644052 -1.4544762 -1.5358213 -2.6417744 1.027692 0.6337462 -0.04204235 -1.0423899 0.30470085 -1.0266916 -0.29810634 1.0189035 0.78062224 3.6740541 -0.25270796 -1.3284174 2.0838847 2.4714863 -1.8952535 -0.62455285 -2.240672 1.1648048 -1.1860713 -1.2601839 -1.2649171 2.6012642 0.8949726 3.5715735 -1.9717021 -2.32372 -1.5937723 -4.7061763 -0.8585368 -1.234185 4.544065 -1.3224397 2.2238266 -0.2457188 -1.1923413 4.0377336 0.2114308 2.693322 1.8623192 -0.64286083 -0.44151437 0.76619583 -0.077199295 -1.0291207 0.093187936 -1.6077003 2.8273823 -1.2799146 1.1930313 0.52900743 0.9238819 -0.10265649 2.496671 -0.25731698 -0.15532756 -0.23765275 -0.63343656 1.2745422 0.64967775 -0.15446915 -1.3178043 1.3338369 2.5326035 -1.3882474 1.3841184 0.6642203 0.061416734 -0.18116742 0.37801945 26 | Magdeburg -0.111748755 -2.24759 0.11168967 4.45879 0.38540328 -1.5804166 1.8756238 0.39969856 -2.2514343 -1.716332 -2.2404668 -2.8578625 2.091869 3.6410477 2.4087481 -0.06400154 0.22710498 -2.0958865 -3.6809313 -0.3861755 1.822062 -1.7394545 0.20696157 1.4365294 -3.2243042 -2.8954718 0.10165746 -0.041837998 -1.4161025 0.92310804 -1.1862137 1.3374597 0.6387167 -0.5271952 -1.8884375 0.45582685 -1.8371872 -1.1202972 1.294254 -2.3937535 4.215464 -1.6285824 -2.4546711 1.766155 1.4688944 -3.2269146 -0.91308457 -2.319828 2.783319 0.075614624 -2.0303493 -1.125721 3.4136958 1.3943502 0.64361435 -0.32597852 0.3090037 0.3711765 -2.137608 0.7852026 -0.7969678 3.779312 -3.2298493 1.9387023 0.02844188 0.7776044 3.4220507 -0.30752668 1.2521039 0.73021215 -1.043963 -0.4769559 1.8335025 -0.78244656 -1.7536565 0.53932756 -1.2387652 -1.1620597 -1.9672645 0.6698811 3.7666326 0.10227215 -1.3601748 0.156355 0.08152423 0.93103224 1.4799974 2.6911323 0.9215576 -1.2821338 
-0.31545508 -1.1342897 2.2280722 1.8452109 -3.7362247 2.4601543 -0.49386513 -0.12730736 0.52896893 -1.0381876 27 | Neuruppin 0.7938417 -1.4199102 -0.29886818 1.9067128 1.5843079 -2.3157525 1.0817573 0.43618572 -3.074589 -0.7706987 -2.1261826 -2.4559512 0.74530095 1.0889645 0.36465564 -0.086909376 -1.1237684 -0.29044613 -2.6431572 -1.4860691 0.83298266 -1.3829392 1.8363639 1.5619378 -2.6269846 -2.090307 0.5364342 -0.8800139 -2.366494 1.1199762 -0.95106214 0.89785904 0.26504895 -0.8575919 -0.0642335 0.21264197 -0.54398566 -0.55889726 0.059133522 -0.38284835 3.7620852 -0.8573412 -0.91275114 1.6976646 0.5558987 -2.3734746 -0.51133555 -1.0483861 0.9167707 -0.6801628 -0.7685203 -1.1650672 3.5943456 -0.26206577 2.0313137 -1.3207816 0.39841977 -0.10597091 -1.4641358 -0.14300579 -0.22713313 3.6090295 -0.9419036 1.4355401 -0.14827976 0.9908568 3.3576007 0.093565874 1.8190674 1.61832 -0.10078941 -0.81375796 1.3892626 -0.8166351 0.3403556 -0.4397914 -2.1620357 -0.26649517 -1.1201472 1.6570412 1.0037491 0.9714701 -1.1341999 0.7290102 0.8637159 0.58211744 1.0401824 2.1678882 0.24826486 0.8230898 -1.5858092 -1.0280147 0.41382217 2.5091665 -1.6362106 2.7815018 -0.2850932 1.6758252 0.012874184 -0.40856394 28 | Darmstadt 2.1785948 -3.0084906 0.9053276 1.9703248 2.3588898 -1.8440421 2.237687 0.23525219 -1.7640982 -2.363411 -1.3782189 -3.588947 -0.44801024 2.2511523 1.1648772 -0.8354481 -1.7246238 0.9661168 -1.2154392 0.2956995 -1.5263938 -0.5478666 0.14722441 1.5254953 -3.0831766 -2.008154 0.09313446 -1.650825 -1.4124466 -0.52827173 0.37653273 0.8195437 -0.9399349 1.4341503 -2.0733914 0.6823288 -1.0809618 -2.638556 2.0018249 -1.6989331 3.416184 -1.4131893 -0.17068532 2.353585 0.287574 -0.22151496 -0.8280962 0.03599766 2.532272 0.27132946 -0.82180786 -0.4635075 -1.7591395 1.8958253 0.36986142 -2.3455107 -0.9025554 -0.12911968 -1.0992248 1.0318556 -1.0564029 3.900211 -1.529202 2.8833652 -0.8231274 0.13364689 2.0224166 0.9488081 0.07936948 3.7267098 -1.211716 -0.48009595 1.9288377 
0.048756894 -3.1609192 -1.4912387 -1.726199 0.5922906 -0.59022456 -0.48867622 1.8906224 0.44138446 0.7798785 0.49933106 1.690574 0.356895 0.49697328 -0.392594 1.742188 -1.636731 -0.20036502 -0.7651072 1.4105887 3.1679807 -3.1402793 0.09239923 -0.6187876 -1.9657005 0.5367421 0.20516641 29 | Jena 0.36548987 -4.4266157 2.404663 3.417021 2.6990235 -1.8708422 1.5725478 -0.32197574 -1.7296216 -1.411577 -2.2151818 -3.3211722 0.8895865 3.8782103 1.5885187 0.063262716 0.72666234 0.28054917 -3.567375 2.2810423 0.73989064 0.045885984 -0.44482374 2.4907873 -3.1560278 -3.1205337 1.1144172 -0.19041239 -0.7317541 -0.013166833 0.71129936 1.2701628 -0.77981687 0.3232006 -1.9194636 1.1054618 -0.83354324 -1.3860487 0.71039045 1.7656512 3.1134346 -1.1821804 -1.1095362 2.2711847 1.0612683 -2.9040024 0.80669403 -3.5129268 1.9777881 0.89234304 -1.9588604 -2.2599657 1.5907444 1.812148 -0.39461204 -0.7725356 -0.6612171 -0.4639231 -2.5922878 2.9657912 -1.5687541 2.2780232 -1.8262712 0.9486923 0.46188048 -0.40299138 2.9846742 -1.9050412 1.2097722 0.73007745 -0.70580184 0.22523287 2.360982 -1.6112047 -1.1594934 0.28057626 -1.6908371 -1.45007 -1.7296367 -1.7059181 2.4509165 -0.1001473 0.4494566 0.93004006 0.9120924 0.5360778 1.6031588 1.1098434 0.49805385 -1.5912942 -0.7888077 -1.1849483 0.88856214 3.6105285 -3.1564763 0.5891592 -1.6905077 -0.61891943 -1.4020185 -0.64932656 30 | Wien 1.6993456 -0.40865183 0.5797754 3.4815657 1.7573857 -3.1886191 0.3566416 -0.6975592 -3.859243 -3.155961 -1.0740143 -3.58529 0.67661256 0.29185933 1.3465438 -1.1531074 -1.1902778 -0.18237594 1.6637058 1.9740822 -3.0145948 1.723712 -2.8370864 1.1359831 -3.8882313 -0.5695027 0.4884211 -1.9920927 0.1649188 -1.0625001 1.3705724 -1.2098595 -0.64773774 0.61239684 -2.684736 1.8161288 0.30268002 -0.6789807 -0.003797051 0.114639334 2.8872168 -1.6054888 0.86912966 0.5515413 2.6305335 -2.170429 -0.8399388 -3.142541 2.4221284 0.20356885 -1.1745616 -0.2810759 -0.09192879 1.9158556 2.1861286 -1.0327406 1.7885467 0.0032725662 
-1.7075858 0.004278481 -3.5037134 4.7563796 0.692135 1.9845288 1.4111327 0.63193256 0.27833036 0.340828 -0.37938273 2.6420143 0.56455994 0.39525995 2.2772179 2.655672 -3.074566 2.421852 -0.39430264 -1.5122169 -0.090939164 0.34043363 0.5385451 -2.0365355 0.68616474 -0.12834004 -1.8848356 0.8590185 -0.40686676 0.35849574 2.596493 -2.8829587 0.66155344 -1.6615717 2.7336633 2.0063124 -2.103265 2.6897328 -2.7216613 -1.3637185 -1.3217815 1.9726645 31 | Heidelberg 0.95605487 -4.077256 1.9794812 3.2241185 0.8397046 -2.37968 1.4302995 -1.3703109 -0.88826466 -2.7481925 -1.3715233 -3.9541984 0.16763079 1.177009 1.1446624 -0.6578606 -1.4168125 1.7505691 -1.6498784 2.1174672 -1.2620153 1.1109424 -0.16721016 1.6466146 -4.753744 -1.9300082 -0.33637124 -0.39937449 -0.6664219 -0.16009551 1.2595451 -0.29816666 -1.8596368 1.4278371 -0.9931706 0.80801076 -1.7195141 -3.0194263 1.277873 0.6937594 2.2662117 -0.8939052 0.6723852 2.1953204 0.94570893 -2.31718 1.4580711 -0.861772 2.9102674 -0.13490906 -0.36932224 0.0063912272 -0.7617989 2.358036 -0.06586548 -2.050144 -1.4483451 -0.9548961 -0.764383 2.299293 -1.2434621 2.8987014 -0.857512 2.894767 1.6656187 0.04730405 0.49203664 0.7141799 -0.23050547 0.91149914 -1.5635058 -0.37247837 2.1658978 -0.21404417 -3.367783 -0.5827673 -1.2216394 -1.808152 0.56231254 -0.8315598 3.2911878 -1.7587705 0.39581776 1.1958071 0.4194178 -0.26361084 0.46990255 0.447435 0.7350322 -0.78251904 -0.7560631 -1.3411317 0.39994267 2.9115274 -2.9446468 0.40293023 -1.592829 -1.4177874 -0.9813924 -0.50021124 32 | Dortmund 0.68851846 -3.7168872 1.030971 3.025947 -0.090976216 -0.6902239 0.88207036 0.59306043 -0.9139919 -2.0079331 -0.25084594 -3.4633718 1.5611578 0.79087114 1.475983 -2.5823216 -0.9479045 -1.2950747 -0.4798476 -1.5018681 0.84693044 -1.6320845 -0.4360999 1.636944 -3.0910478 -1.8877461 -0.23549005 0.22897239 -1.7697595 1.0345008 -0.41668475 1.0463299 0.79172236 -0.35555798 -0.5951941 -0.8528993 -1.9537778 -1.0680767 1.3289938 -1.7366859 4.9866576 -0.29085052 
-1.9361976 2.4613261 1.4037215 -0.25600204 -2.61446 0.34836322 4.0891976 -1.2641752 -1.1852937 -0.6659593 0.19036418 0.55924386 1.2418216 -0.6012969 0.09240304 0.6673061 -0.7837457 0.7181358 -1.6238198 2.6891775 -2.073313 0.44234467 -0.76230717 1.6794713 -0.5436234 2.1253817 1.688478 2.1973119 -1.7909006 -0.2591567 0.23928462 1.0051067 -2.517045 -0.050783694 -1.6954061 -0.108008526 -1.8975257 -2.130573 2.188023 -1.1880983 0.886349 0.29565653 1.5573137 0.11981172 2.0169466 0.43243888 0.78264254 -1.7004241 -0.02813352 -0.20374432 1.1624932 3.8423135 -2.3108947 2.1682165 -1.226625 -0.36796907 -0.52189887 -0.31769487 33 | Stettin -0.054340772 -1.2635287 0.03148213 3.5059521 1.4322209 -2.3846755 1.5993967 0.13821866 -3.1656337 -0.42710593 -2.1200912 -1.9455518 2.0161607 0.69416994 1.763165 -0.017210148 -0.45016348 -2.8045368 -1.4200227 0.07205085 0.7343193 -2.3224533 0.20110615 0.35741618 -3.77711 -2.0923815 1.5150994 -0.06526995 -0.9375911 -0.83235 -0.3384462 1.6533086 -2.3059387 1.225535 -2.0261662 0.04333439 -1.9986398 0.26196086 2.6975207 -1.2180759 2.660996 -0.18017015 -0.63837755 1.6475778 1.6913066 -3.2451262 -0.9683815 -0.16509525 3.458229 -1.4877906 -1.8902271 -0.0679146 3.6293502 -1.0235803 3.329278 -1.6619351 1.7220277 -0.010675311 -1.5472616 -1.2673485 -0.26966333 3.3198526 -2.685923 1.3483436 1.2357507 -0.3303389 2.2160459 -0.19286108 0.5660989 1.1028975 -1.0221868 -1.7723488 2.3031785 0.71726346 -1.0630293 0.6245848 -2.3799093 -1.2330412 -0.097322196 2.676756 2.731589 -0.78263694 -1.460205 -0.32363698 -0.93262833 1.4364008 0.25051886 1.9216368 -1.1213309 -0.2498429 -0.27549142 -0.97610384 2.4982333 2.4083498 -1.7849033 3.2259789 0.3372887 0.041479632 2.2055006 -1.0527576 34 | Schwerin 0.83079773 -0.8924859 0.65041333 2.8386643 0.7753815 -1.1645408 1.2349639 1.0168655 -2.3385434 -1.2468014 -2.3517914 -1.2932155 1.1319875 2.1880314 2.097139 -0.10361721 -0.54267454 -1.8080263 -3.0377347 0.23861113 1.5164828 -0.7138678 0.30331564 0.95972043 -4.4699316 
-3.6566439 -0.33158284 0.11869797 -0.17549531 0.33105826 -1.8652042 0.576428 0.46529508 0.6997144 -2.7190042 -1.0353235 -1.0139047 -0.6397547 1.6120995 -2.593996 3.2474892 0.10928622 -1.6773603 0.8260087 1.3585242 -3.1278865 -1.9273988 -1.512383 2.1514223 0.1258143 -2.1222727 0.252223 3.1697137 -0.14386712 1.8028492 0.13507245 1.2949926 -0.206384 -2.4632733 -1.4238125 -0.502773 4.181763 -2.8247986 2.2398696 1.000836 0.99052197 4.7929716 0.15693314 1.3981422 0.7635068 0.71248424 0.40644673 1.4972076 -0.052207485 -0.29409662 1.210332 -1.6117026 -1.8968577 -1.9136035 1.5949801 1.9474542 1.409596 -0.95737046 -0.5069543 -0.015407094 -0.08477078 0.6478471 2.7474868 0.13648887 -0.21730044 -0.1652914 -1.5711025 2.0300932 2.9598205 -2.6143978 2.0712955 -1.0636466 -1.2238 1.0401605 -0.4937262 35 | Neubrandenburg 0.5337957 -1.4917641 -0.25133058 1.0414242 1.1787254 -1.8096468 0.4546967 0.6456694 -2.5179164 -0.84661925 -1.603039 -2.179942 2.1898844 1.7489692 0.9218151 0.0265563 -1.0405837 -1.3072684 -3.0200222 -1.15086 1.4068086 -0.6853456 0.9558094 0.36950547 -2.4637325 -2.890407 0.12548387 -0.33946538 -1.742703 0.024779577 -1.0787203 0.3316309 0.7497587 0.46028283 -1.661405 0.71384245 -0.25566822 -0.12007549 0.22011502 -1.1505278 3.4172606 0.572205 -1.4951712 2.8447897 0.8894803 -2.1484234 -1.4682385 -1.7924082 0.17189023 0.37384236 -0.79067415 -0.3109392 3.0089211 0.29409957 1.7833916 1.0597328 -0.32242438 0.028875794 -3.5940561 -1.3246691 -1.8284535 3.2316635 -1.7540749 0.96433896 -1.6236311 0.9805822 4.988431 0.25574964 2.1778357 1.5061827 0.23732321 0.036576133 0.8514585 -0.28041282 0.25926977 1.7919458 -2.9261863 -1.5926665 -2.1906352 1.1202184 2.3838425 0.6946129 -0.74633116 0.28667426 1.2182994 0.85997474 1.0849514 2.6579041 0.052525338 -1.0835707 -1.0752919 -1.7102283 2.3572314 2.9711668 -3.6870399 1.9474386 -1.3372123 0.5343863 0.9682266 -1.6537476 36 | Greifswald 0.1652766 -3.5949454 2.1336477 4.1200643 2.2866993 -2.463635 2.048972 -0.5040621 -1.8794425 -0.06421502 
-1.5513515 -2.4307077 2.0569189 2.3177984 1.4994664 0.2575602 -0.025734236 -2.790341 -3.5343769 0.9497378 0.081635915 -0.74197423 0.73926663 1.9250898 -5.1364846 -3.9089255 0.72733 0.7961628 -0.8696721 -1.0500472 0.03268303 0.9725387 -2.1241002 -0.5890719 -1.5035678 1.1408298 -0.99336207 -1.4618317 1.3378011 -0.093261994 2.7690227 0.81153685 -1.6681685 2.5312011 1.3357638 -3.4964762 -1.0661113 -2.8256094 2.4806714 -1.2465186 -2.2604566 -0.4616819 3.2676392 0.6601436 2.271021 -0.7424759 -0.073794246 -1.0668892 -1.3551625 0.29630437 -1.1881994 1.3426702 -1.487533 1.200032 1.3716654 0.2137088 1.6763198 -0.7122881 -0.39352697 0.26784614 -1.1138753 -0.5583302 2.3048165 -0.38951385 -0.98190403 1.3499286 -1.8438789 -3.2866073 -1.5057027 0.9614518 3.8960872 -0.58216655 -0.3801048 -0.07488416 1.4002591 0.5406893 0.7329908 2.8021164 0.42899436 -0.16141252 -0.28739074 -1.881642 -0.59934366 4.892059 -2.5251484 1.4294227 -1.8354392 -1.2819895 0.44077408 -1.6484693 37 | Göttingen -0.122977376 -4.6110187 2.7388806 2.8336575 1.3883183 -1.4654944 2.3721359 -1.7411104 -1.193904 -1.4967352 -2.5326424 -3.3122282 1.1478269 2.834303 1.4258072 0.8935621 -1.0141217 0.78006244 -2.9491022 1.6486821 -0.086301155 -0.944451 1.6648539 1.5473653 -4.002028 -3.101649 -0.26449785 0.53808665 -0.6188652 -0.43683887 0.2877977 1.1222014 -2.112639 0.64383477 -1.9440013 0.88001245 -1.8034527 -2.404448 2.2434692 0.90443194 2.577597 -0.62143743 0.17547432 4.315954 0.5258403 -1.8181914 1.447458 -1.5547183 2.2681353 -0.17019042 -1.6164857 -0.59390116 0.9682918 1.267776 0.9324052 -2.6000106 -1.3031914 -1.1148666 -0.2895839 2.4207237 -1.2336751 0.47808287 -1.6611596 1.1473874 2.081935 -0.7481806 1.5208572 -0.19777593 0.78129643 1.7016252 -1.705271 -0.15058106 1.4956691 -0.63897485 -1.901858 0.41566655 -1.8587918 -1.8882244 -0.25810224 -1.1748425 3.5915368 -0.5357882 0.6541734 1.5102884 1.0257877 -0.5122612 1.3937393 1.3419125 1.0535557 -1.0727713 -0.15176319 -2.0055702 -0.08776956 3.8620412 -3.9716406 
0.5049128 -1.4670595 -1.7259696 -0.16387616 -0.4768521 38 | Braunschweig 0.070235215 -3.3072174 0.96919 3.6504455 0.83147234 -0.9355551 0.91183656 1.3919315 -2.2620707 -1.7589343 -2.6636453 -2.5505173 0.87088704 2.9597576 2.6366763 0.45481706 -0.4270969 -1.4037206 -2.2106411 1.5711362 2.0408337 -1.1279006 0.7976403 2.116148 -2.4313078 -2.6761973 -1.1701454 -0.43038902 -1.2154555 0.5626676 -1.4212137 0.97650164 -0.776503 -0.30748147 -2.576778 -0.11043852 -2.5005887 -1.6627984 2.0436442 -2.766603 4.510436 -1.1256311 -1.169368 2.3623164 1.2347726 -1.7047348 -0.8959036 -1.85159 1.7977918 -0.57307225 -1.5166203 0.3095703 0.65960217 -0.057881176 0.0363691 -2.102169 0.18580666 -0.4247576 -1.9678447 1.0175797 -1.3916875 2.0439641 -3.8498597 1.6544343 0.28701076 0.13320003 3.4759657 0.7000201 0.100731105 3.2806695 -2.0567603 0.37438932 0.5724269 0.34349224 -0.8257335 0.43675777 -1.2818834 -0.66438943 -0.9734074 0.539733 2.9298987 -0.24684793 0.3553779 0.4439978 -0.45271614 -1.2657002 0.79255223 1.6031727 0.48184 -1.6000599 0.21821561 -1.0750562 0.8820038 3.057838 -3.1418607 1.5231175 -0.9102674 -1.5785712 0.37466288 -0.48154473 39 | Berliner 2.5021896 -1.93324 0.62058073 4.6480837 4.4127827 -1.523275 1.1078434 2.1851509 -2.9182823 -0.37328464 -0.47262973 -3.8505764 1.3183446 0.91085815 0.8714414 -2.4382932 -2.5759513 0.6540026 -0.012791386 -1.9180672 1.026707 2.5738692 1.0055073 -1.0430847 -4.2999935 -1.5705029 1.4127455 -1.5199605 -3.931993 -0.41933867 0.27617005 0.8046653 0.8166263 -0.4015745 0.30025792 2.2572944 0.99583864 -0.9169119 -1.6128504 -0.5664827 4.188296 -1.2985296 -1.8234825 1.0170832 0.7866375 -1.2588413 -2.0308022 -2.639482 2.272782 1.515848 -1.5250673 -0.7432919 0.91581887 -0.81104904 1.7992172 -0.10896411 -1.8980829 -0.53088236 -3.846311 -2.424622 -1.2538743 5.984656 -3.3172991 0.12078718 1.9760884 -0.7840208 3.0020523 0.43757495 2.258775 1.2697811 1.7896241 0.1640398 0.54885143 0.39416325 -0.6555018 0.8664546 -1.7527426 1.2593247 -6.214188 1.2876759 
0.65667766 -1.1322315 1.0171529 1.396287 -0.38880938 0.2110226 0.3926466 1.8455745 0.7311767 -0.5988452 -2.1766667 -0.29076427 -1.64551 1.5849361 -2.7349072 -0.42576337 0.8060806 -0.8787876 -1.2732279 1.2627565 40 | Warschau 1.7711434 1.755447 1.4286655 2.5262399 2.0234625 -2.1741664 2.111254 -1.7445623 -5.018872 -1.576659 -2.6320632 -4.514658 1.8665249 -0.5658911 0.34365138 0.017336166 -1.8486159 -2.4962223 0.24756038 1.4055457 0.8588321 0.35199225 -1.43941 -1.2252843 -2.420198 -0.78274983 3.0546646 0.62422067 -1.0041724 -0.2443084 0.92575693 0.54317707 -0.2752779 -0.67511445 -1.6013588 2.5263104 -1.2210354 0.46443388 -1.0302343 -0.054171536 0.46670872 -0.83005935 -0.20407055 1.7997283 2.118658 -3.9629784 0.35601822 -1.0754963 3.599026 -0.5747844 -2.4134588 -1.797236 2.7411623 0.49563867 2.0865214 -1.8215057 3.1970396 1.0357429 -2.2994704 -0.8102771 -2.5109003 3.7647011 -0.61976844 1.7216551 0.21010154 -1.1537349 0.08629706 1.4690574 -0.20090978 1.6530685 -0.4279085 -0.47829515 0.8595145 1.1116269 -2.388109 0.68994623 -1.6641881 -0.8519223 1.6952112 -0.40065882 1.3539343 -1.4072578 -0.35652128 -1.6083196 -1.6132793 1.169395 0.74953365 1.8232499 2.001881 -0.49464163 0.9751009 0.16553725 2.4203427 1.2875725 -0.5475174 4.190866 1.1683273 0.0885413 1.4156662 -0.67190605 41 | Berlin-Spandau 1.5107871 -0.76364285 0.9072823 1.9016161 1.2115762 -1.5719041 1.2324407 -1.159568 -1.8591305 -1.3239199 -0.868089 -2.4689415 0.018907672 1.4944386 0.8891671 0.8476858 -2.9271998 -1.547096 -1.5533094 -2.4536703 -0.61805004 -0.020517787 1.6072648 0.058831472 -2.2733688 -1.7421594 0.19086894 -1.3981186 -2.022411 0.64980876 -0.4192276 0.39129508 -0.6865307 0.18565977 -0.042095687 1.8122067 -1.607078 -0.2702758 -0.4970061 -0.061173443 4.517128 2.0872662 -1.2874446 0.39391938 1.0150174 -2.996208 -0.09027219 0.7794083 1.3744642 -0.32250872 -0.38681895 -0.65707433 2.0306628 1.0638795 2.244114 -0.6849804 0.123426214 -0.32418516 -1.7535619 0.06498341 -0.028026368 3.7052617 -1.6668594 
1.2728359 0.36765975 0.3088829 1.2758119 -0.073761486 1.0732712 -0.4793625 -0.5099331 0.1359661 -0.37006935 0.8534095 0.13102347 -0.9747145 -2.61755 -0.31107 -2.072855 0.6390753 1.0684209 0.34763482 0.62938184 2.257483 1.6696324 -0.41071332 -0.50958544 0.60526407 -0.27884632 0.593429 -0.9150496 -1.8200486 -0.39855394 0.2722582 -0.85434484 2.4347494 -0.7615509 0.56471443 0.9812795 -1.6216714 42 | --------------------------------------------------------------------------------