├── .travis.yml ├── LICENSE ├── Makefile ├── README.md ├── closestmatch.go ├── closestmatch_test.go ├── cmclient ├── client.go └── client_test.go ├── cmserver └── server.go ├── levenshtein ├── levenshtein.go └── levenshtein_test.go └── test ├── books.list ├── catcher.txt ├── data.go ├── popular.txt └── potter.txt /.travis.yml: -------------------------------------------------------------------------------- 1 | language: go 2 | 3 | go: 4 | - 1.8 -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2017 Zack 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
22 | -------------------------------------------------------------------------------- /Makefile: -------------------------------------------------------------------------------- 1 | .PHONY: test 2 | test: 3 | go test -cover -run=. 4 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | 2 | # closestmatch :page_with_curl: 3 | 4 | Version 5 | Build Status 6 | Code Coverage 7 | GoDoc 8 | 9 | *closestmatch* is a simple and fast Go library for fuzzy matching an input string to a list of target strings. *closestmatch* is useful for handling input from a user where the input (which could be misspelled or out of order) needs to match a key in a database. *closestmatch* uses a [bag-of-words approach](https://en.wikipedia.org/wiki/Bag-of-words_model) to precompute character n-grams to represent each possible target string. The closest matches have the highest overlap between the sets of n-grams. The precomputation scales well and is much faster and more accurate than Levenshtein for long strings. 
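As a rough illustration of the n-gram overlap idea described above, the following sketch builds sets of character bigrams and counts how many a query shares with each target. The helpers `ngrams` and `overlap` are hypothetical names used only for this example; they are not part of the *closestmatch* API, and the library itself also weights the matches, so treat this as a simplified model rather than the actual scoring.

```go
package main

import (
	"fmt"
	"strings"
)

// ngrams returns the set of byte-wise character n-grams of s, lowercased.
// Hypothetical helper for illustration; not part of the closestmatch API.
func ngrams(s string, n int) map[string]struct{} {
	s = strings.ToLower(s)
	set := make(map[string]struct{})
	for i := 0; i+n <= len(s); i++ {
		set[s[i:i+n]] = struct{}{}
	}
	return set
}

// overlap counts the n-grams that appear in both sets.
// Hypothetical helper for illustration; the library uses a weighted score.
func overlap(a, b map[string]struct{}) int {
	count := 0
	for g := range a {
		if _, ok := b[g]; ok {
			count++
		}
	}
	return count
}

func main() {
	query := ngrams("kind gizard", 2)
	for _, target := range []string{"King Gizzard", "The Lizard Wizard"} {
		shared := overlap(query, ngrams(target, 2))
		fmt.Printf("%q shares %d bigrams with the query\n", target, shared)
	}
}
```

With a bag size of 2, the misspelled query shares more bigrams with "King Gizzard" than with "The Lizard Wizard", which is the intuition behind ranking targets by n-gram overlap.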
10 | 11 | 12 | Getting Started 13 | =============== 14 | 15 | ## Install 16 | 17 | ``` 18 | go get -u -v github.com/schollz/closestmatch 19 | ``` 20 | 21 | ## Use 22 | 23 | #### Create a *closestmatch* object from a list of words 24 | 25 | ```golang 26 | // Take a slice of keys, say band names that are similar 27 | // http://www.tonedeaf.com.au/412720/38-bands-annoyingly-similar-names.htm 28 | wordsToTest := []string{"King Gizzard", "The Lizard Wizard", "Lizzard Wizzard"} 29 | 30 | // Choose a set of bag sizes; more sizes are more accurate but slower 31 | bagSizes := []int{2} 32 | 33 | // Create a closestmatch object 34 | cm := closestmatch.New(wordsToTest, bagSizes) 35 | ``` 36 | 37 | #### Find the closest match, or find the *N* closest matches 38 | 39 | ```golang 40 | fmt.Println(cm.Closest("kind gizard")) 41 | // returns 'King Gizzard' 42 | 43 | fmt.Println(cm.ClosestN("kind gizard", 3)) 44 | // returns [King Gizzard Lizzard Wizzard The Lizard Wizard] 45 | ``` 46 | 47 | #### Calculate the accuracy 48 | 49 | ```golang 50 | // Calculate accuracy 51 | fmt.Println(cm.AccuracyMutatingWords()) 52 | // ~ 66 % (still way better than Levenshtein which hits 0% with this particular set) 53 | 54 | // Improve accuracy by adding more bags 55 | bagSizes = []int{2, 3, 4} 56 | cm = closestmatch.New(wordsToTest, bagSizes) 57 | fmt.Println(cm.AccuracyMutatingWords()) 58 | // accuracy improves to ~ 76 % 59 | ``` 60 | 61 | #### Save/Load 62 | 63 | ```golang 64 | // Save your current calculated bags 65 | cm.Save("closestmatches.gob") 66 | 67 | // Open it again 68 | cm2, _ := closestmatch.Load("closestmatches.gob") 69 | fmt.Println(cm2.Closest("lizard wizard")) 70 | // prints "The Lizard Wizard" 71 | ``` 72 | 73 | ### Advantages 74 | 75 | *closestmatch* is more accurate than Levenshtein for long strings (like in the test corpus). 76 | 77 | *closestmatch* is ~20x faster than [a fast implementation of Levenshtein](https://groups.google.com/forum/#!topic/golang-nuts/YyH1f_qCZVc). 
Try it yourself with the benchmarks: 78 | 79 | ```bash 80 | cd $GOPATH/src/github.com/schollz/closestmatch && go test -run=None -bench=. > closestmatch.bench 81 | cd $GOPATH/src/github.com/schollz/closestmatch/levenshtein && go test -run=None -bench=. > levenshtein.bench 82 | benchcmp levenshtein.bench ../closestmatch.bench 83 | ``` 84 | 85 | which gives the following benchmark (on Intel i7-3770 CPU @ 3.40GHz w/ 8 processors): 86 | 87 | ```bash 88 | benchmark old ns/op new ns/op delta 89 | BenchmarkNew-8 1.47 1933870 +131555682.31% 90 | BenchmarkClosestOne-8 104603530 4855916 -95.36% 91 | ``` 92 | 93 | The `New()` function in *closestmatch* is much slower than in *levenshtein* because it precomputes the n-gram lookup tables. 94 | 95 | ### Disadvantages 96 | 97 | *closestmatch* does worse for matching lists of single words, like a dictionary. For comparison: 98 | 99 | 100 | ``` 101 | $ cd $GOPATH/src/github.com/schollz/closestmatch && go test 102 | Accuracy with mutating words in book list: 90.0% 103 | Accuracy with mutating letters in book list: 100.0% 104 | Accuracy with mutating letters in dictionary: 38.9% 105 | ``` 106 | 107 | while levenshtein performs better for a single-word dictionary (but worse for longer names, like book titles): 108 | 109 | ``` 110 | $ cd $GOPATH/src/github.com/schollz/closestmatch/levenshtein && go test 111 | Accuracy with mutating words in book list: 40.0% 112 | Accuracy with mutating letters in book list: 100.0% 113 | Accuracy with mutating letters in dictionary: 64.8% 114 | ``` 115 | 116 | ## License 117 | 118 | MIT 119 | -------------------------------------------------------------------------------- /closestmatch.go: -------------------------------------------------------------------------------- 1 | package closestmatch 2 | 3 | import ( 4 | "compress/gzip" 5 | "encoding/json" 6 | "math/rand" 7 | "os" 8 | "sort" 9 | "strings" 10 | "sync" 11 | ) 12 | 13 | // ClosestMatch is the structure that contains the 14 | // substring sizes and 
carries a map of the substrings for 15 | // easy lookup 16 | type ClosestMatch struct { 17 | SubstringSizes []int 18 | SubstringToID map[string]map[uint32]struct{} 19 | ID map[uint32]IDInfo 20 | mux sync.Mutex 21 | } 22 | 23 | // IDInfo carries the information about the keys 24 | type IDInfo struct { 25 | Key string 26 | NumSubstrings int 27 | } 28 | 29 | // New returns a new structure for performing closest matches 30 | func New(possible []string, subsetSize []int) *ClosestMatch { 31 | cm := new(ClosestMatch) 32 | cm.SubstringSizes = subsetSize 33 | cm.SubstringToID = make(map[string]map[uint32]struct{}) 34 | cm.ID = make(map[uint32]IDInfo) 35 | for i, s := range possible { 36 | substrings := cm.splitWord(strings.ToLower(s)) 37 | cm.ID[uint32(i)] = IDInfo{Key: s, NumSubstrings: len(substrings)} 38 | for substring := range substrings { 39 | if _, ok := cm.SubstringToID[substring]; !ok { 40 | cm.SubstringToID[substring] = make(map[uint32]struct{}) 41 | } 42 | cm.SubstringToID[substring][uint32(i)] = struct{}{} 43 | } 44 | } 45 | 46 | return cm 47 | } 48 | 49 | // Load reads a previously saved ClosestMatch object from disk 50 | func Load(filename string) (*ClosestMatch, error) { 51 | cm := new(ClosestMatch) 52 | 53 | f, err := os.Open(filename) 54 | if err != nil { 55 | return cm, err 56 | } 57 | defer f.Close() 58 | 59 | w, err := gzip.NewReader(f) 60 | if err != nil { 61 | return cm, err 62 | } 63 | 64 | err = json.NewDecoder(w).Decode(&cm) 65 | return cm, err 66 | } 67 | 68 | // Add appends more words to the ClosestMatch structure, giving them IDs after the existing ones 69 | func (cm *ClosestMatch) Add(possible []string) { 70 | cm.mux.Lock() 71 | for i, n := 0, len(cm.ID); i < len(possible); i++ { 72 | substrings := cm.splitWord(strings.ToLower(possible[i])) 73 | cm.ID[uint32(n+i)] = IDInfo{Key: possible[i], NumSubstrings: len(substrings)} 74 | for substring := range substrings { 75 | if _, ok := cm.SubstringToID[substring]; !ok { 76 | cm.SubstringToID[substring] = make(map[uint32]struct{}) 77 | } 78 | cm.SubstringToID[substring][uint32(n+i)] = struct{}{} 79 
| } 80 | } 81 | cm.mux.Unlock() 82 | } 83 | 84 | // Save writes the current ClosestMatch object as a gzipped JSON file 85 | func (cm *ClosestMatch) Save(filename string) error { 86 | f, err := os.Create(filename) 87 | if err != nil { 88 | return err 89 | } 90 | defer f.Close() 91 | w := gzip.NewWriter(f) 92 | defer w.Close() 93 | enc := json.NewEncoder(w) 94 | // enc.SetIndent("", " ") 95 | return enc.Encode(cm) 96 | } 97 | 98 | func (cm *ClosestMatch) worker(id int, jobs <-chan job, results chan<- result) { 99 | for j := range jobs { 100 | m := make(map[string]int) 101 | cm.mux.Lock() 102 | if ids, ok := cm.SubstringToID[j.substring]; ok { 103 | weight := 1000 / len(ids) 104 | for id := range ids { 105 | if _, ok2 := m[cm.ID[id].Key]; !ok2 { 106 | m[cm.ID[id].Key] = 0 107 | } 108 | m[cm.ID[id].Key] += 1 + 1000/len(cm.ID[id].Key) + weight 109 | } 110 | } 111 | cm.mux.Unlock() 112 | results <- result{m: m} 113 | } 114 | } 115 | 116 | type job struct { 117 | substring string 118 | } 119 | 120 | type result struct { 121 | m map[string]int 122 | } 123 | 124 | func (cm *ClosestMatch) match(searchWord string) map[string]int { 125 | searchSubstrings := cm.splitWord(searchWord) 126 | searchSubstringsLen := len(searchSubstrings) 127 | 128 | jobs := make(chan job, searchSubstringsLen) 129 | results := make(chan result, searchSubstringsLen) 130 | workers := 8 131 | 132 | for w := 1; w <= workers; w++ { 133 | go cm.worker(w, jobs, results) 134 | } 135 | 136 | for substring := range searchSubstrings { 137 | jobs <- job{substring: substring} 138 | } 139 | close(jobs) 140 | 141 | m := make(map[string]int) 142 | for a := 1; a <= searchSubstringsLen; a++ { 143 | r := <-results 144 | for key := range r.m { 145 | if _, ok := m[key]; ok { 146 | m[key] += r.m[key] 147 | } else { 148 | m[key] = r.m[key] 149 | } 150 | } 151 | } 152 | 153 | return m 154 | } 155 | 156 | // Closest searches for the `searchWord` and returns the closest match 157 | func (cm *ClosestMatch) Closest(searchWord 
string) string { 158 | for _, pair := range rankByWordCount(cm.match(searchWord)) { 159 | return pair.Key 160 | } 161 | return "" 162 | } 163 | 164 | // ClosestN searches for the `searchWord` and returns the n closest matches 165 | func (cm *ClosestMatch) ClosestN(searchWord string, max int) []string { 166 | matches := make([]string, 0, max) 167 | for i, pair := range rankByWordCount(cm.match(searchWord)) { 168 | if i >= max { 169 | break 170 | } 171 | matches = append(matches, pair.Key) 172 | } 173 | return matches 174 | } 175 | 176 | func rankByWordCount(wordFrequencies map[string]int) PairList { 177 | pl := make(PairList, len(wordFrequencies)) 178 | i := 0 179 | for k, v := range wordFrequencies { 180 | pl[i] = Pair{k, v} 181 | i++ 182 | } 183 | sort.Sort(sort.Reverse(pl)) 184 | return pl 185 | } 186 | 187 | type Pair struct { 188 | Key string 189 | Value int 190 | } 191 | 192 | type PairList []Pair 193 | 194 | func (p PairList) Len() int { return len(p) } 195 | func (p PairList) Less(i, j int) bool { return p[i].Value < p[j].Value } 196 | func (p PairList) Swap(i, j int) { p[i], p[j] = p[j], p[i] } 197 | 198 | func (cm *ClosestMatch) splitWord(word string) map[string]struct{} { 199 | wordHash := make(map[string]struct{}) 200 | for _, j := range cm.SubstringSizes { 201 | for i := 0; i < len(word)-j+1; i++ { 202 | substring := word[i : i+j] 203 | if len(strings.TrimSpace(substring)) > 0 { 204 | wordHash[substring] = struct{}{} 205 | } 206 | } 207 | } 208 | if len(wordHash) == 0 { 209 | wordHash[word] = struct{}{} 210 | } 211 | return wordHash 212 | } 213 | 214 | // AccuracyMutatingWords runs some basic tests against the wordlist to 215 | // see how accurate this bag-of-characters method is against 216 | // the target dataset 217 | func (cm *ClosestMatch) AccuracyMutatingWords() float64 { 218 | rand.Seed(1) 219 | percentCorrect := 0.0 220 | numTrials := 0.0 221 | 222 | for wordTrials := 0; wordTrials < 200; wordTrials++ { 223 | 224 | var 
testString, originalTestString string 225 | cm.mux.Lock() 226 | testStringNum := rand.Intn(len(cm.ID)) + 1 // 1-based: i is incremented before the comparison below 227 | i := 0 228 | for id := range cm.ID { 229 | i++ 230 | if i != testStringNum { 231 | continue 232 | } 233 | originalTestString = cm.ID[id].Key 234 | break 235 | } 236 | cm.mux.Unlock() 237 | 238 | var words []string 239 | choice := rand.Intn(3) 240 | if choice == 0 { 241 | // remove a random word 242 | words = strings.Split(originalTestString, " ") 243 | if len(words) < 3 { 244 | continue 245 | } 246 | deleteWordI := rand.Intn(len(words)) 247 | words = append(words[:deleteWordI], words[deleteWordI+1:]...) 248 | testString = strings.Join(words, " ") 249 | } else if choice == 1 { 250 | // remove a random word and reverse 251 | words = strings.Split(originalTestString, " ") 252 | if len(words) > 1 { 253 | deleteWordI := rand.Intn(len(words)) 254 | words = append(words[:deleteWordI], words[deleteWordI+1:]...) 255 | for left, right := 0, len(words)-1; left < right; left, right = left+1, right-1 { 256 | words[left], words[right] = words[right], words[left] 257 | } 258 | } else { 259 | continue 260 | } 261 | testString = strings.Join(words, " ") 262 | } else { 263 | // remove a random word and shuffle and replace 2 random letters 264 | words = strings.Split(originalTestString, " ") 265 | if len(words) > 1 { 266 | deleteWordI := rand.Intn(len(words)) 267 | words = append(words[:deleteWordI], words[deleteWordI+1:]...) 
268 | for i := range words { 269 | j := rand.Intn(i + 1) 270 | words[i], words[j] = words[j], words[i] 271 | } 272 | } 273 | testString = strings.Join(words, " ") 274 | letters := "abcdefghijklmnopqrstuvwxyz" 275 | if len(testString) == 0 { 276 | continue 277 | } 278 | ii := rand.Intn(len(testString)) 279 | testString = testString[:ii] + string(letters[rand.Intn(len(letters))]) + testString[ii+1:] 280 | ii = rand.Intn(len(testString)) 281 | testString = testString[:ii] + string(letters[rand.Intn(len(letters))]) + testString[ii+1:] 282 | } 283 | closest := cm.Closest(testString) 284 | if closest == originalTestString { 285 | percentCorrect += 1.0 286 | } else { 287 | //fmt.Printf("Original: %s, Mutated: %s, Match: %s\n", originalTestString, testString, closest) 288 | } 289 | numTrials += 1.0 290 | } 291 | return 100.0 * percentCorrect / numTrials 292 | } 293 | 294 | // AccuracyMutatingLetters runs some basic tests against the wordlist to 295 | // see how accurate this bag-of-characters method is against 296 | // the target dataset when mutating individual letters (adding, removing, changing) 297 | func (cm *ClosestMatch) AccuracyMutatingLetters() float64 { 298 | rand.Seed(1) 299 | percentCorrect := 0.0 300 | numTrials := 0.0 301 | 302 | for wordTrials := 0; wordTrials < 200; wordTrials++ { 303 | 304 | var testString, originalTestString string 305 | cm.mux.Lock() 306 | testStringNum := rand.Intn(len(cm.ID)) + 1 // 1-based: i is incremented before the comparison below 307 | i := 0 308 | for id := range cm.ID { 309 | i++ 310 | if i != testStringNum { 311 | continue 312 | } 313 | originalTestString = cm.ID[id].Key 314 | break 315 | } 316 | cm.mux.Unlock() 317 | testString = originalTestString 318 | 319 | // letters to replace with 320 | letters := "abcdefghijklmnopqrstuvwxyz" 321 | 322 | choice := rand.Intn(3) 323 | if choice == 0 { 324 | // replace random letter 325 | ii := rand.Intn(len(testString)) 326 | testString = testString[:ii] + string(letters[rand.Intn(len(letters))]) + testString[ii+1:] 327 | } else if choice == 1 { 
328 | // delete random letter 329 | ii := rand.Intn(len(testString)) 330 | testString = testString[:ii] + testString[ii+1:] 331 | } else { 332 | // add random letter 333 | ii := rand.Intn(len(testString)) 334 | testString = testString[:ii] + string(letters[rand.Intn(len(letters))]) + testString[ii:] 335 | } 336 | closest := cm.Closest(testString) 337 | if closest == originalTestString { 338 | percentCorrect += 1.0 339 | } else { 340 | //fmt.Printf("Original: %s, Mutilated: %s, Match: %s\n", originalTestString, testString, closest) 341 | } 342 | numTrials += 1.0 343 | } 344 | 345 | return 100.0 * percentCorrect / numTrials 346 | } 347 | -------------------------------------------------------------------------------- /closestmatch_test.go: -------------------------------------------------------------------------------- 1 | package closestmatch 2 | 3 | import ( 4 | "fmt" 5 | "io/ioutil" 6 | "strings" 7 | "testing" 8 | 9 | "github.com/schollz/closestmatch/test" 10 | ) 11 | 12 | func BenchmarkNew(b *testing.B) { 13 | for i := 0; i < b.N; i++ { 14 | New(test.WordsToTest, []int{3}) 15 | } 16 | } 17 | 18 | func BenchmarkSplitOne(b *testing.B) { 19 | cm := New(test.WordsToTest, []int{3}) 20 | searchWord := test.SearchWords[0] 21 | b.ResetTimer() 22 | for i := 0; i < b.N; i++ { 23 | cm.splitWord(searchWord) 24 | } 25 | } 26 | 27 | func BenchmarkClosestOne(b *testing.B) { 28 | bText, _ := ioutil.ReadFile("test/books.list") 29 | wordsToTest := strings.Split(strings.ToLower(string(bText)), "\n") 30 | cm := New(wordsToTest, []int{3}) 31 | searchWord := test.SearchWords[0] 32 | b.ResetTimer() 33 | for i := 0; i < b.N; i++ { 34 | cm.Closest(searchWord) 35 | } 36 | } 37 | 38 | func BenchmarkClosest3(b *testing.B) { 39 | bText, _ := ioutil.ReadFile("test/books.list") 40 | wordsToTest := strings.Split(strings.ToLower(string(bText)), "\n") 41 | cm := New(wordsToTest, []int{3}) 42 | searchWord := test.SearchWords[0] 43 | b.ResetTimer() 44 | for i := 0; i < b.N; i++ { 45 | 
cm.ClosestN(searchWord, 3) 46 | } 47 | } 48 | 49 | func BenchmarkClosest30(b *testing.B) { 50 | bText, _ := ioutil.ReadFile("test/books.list") 51 | wordsToTest := strings.Split(strings.ToLower(string(bText)), "\n") 52 | cm := New(wordsToTest, []int{3}) 53 | searchWord := test.SearchWords[0] 54 | b.ResetTimer() 55 | for i := 0; i < b.N; i++ { 56 | cm.ClosestN(searchWord, 30) 57 | } 58 | } 59 | 60 | func BenchmarkFileLoad(b *testing.B) { 61 | bText, _ := ioutil.ReadFile("test/books.list") 62 | wordsToTest := strings.Split(strings.ToLower(string(bText)), "\n") 63 | cm := New(wordsToTest, []int{3, 4}) 64 | cm.Save("test/books.list.cm.gz") 65 | b.ResetTimer() 66 | for i := 0; i < b.N; i++ { 67 | Load("test/books.list.cm.gz") 68 | } 69 | } 70 | 71 | func BenchmarkFileSave(b *testing.B) { 72 | bText, _ := ioutil.ReadFile("test/books.list") 73 | wordsToTest := strings.Split(strings.ToLower(string(bText)), "\n") 74 | cm := New(wordsToTest, []int{3, 4}) 75 | b.ResetTimer() 76 | for i := 0; i < b.N; i++ { 77 | cm.Save("test/books.list.cm.gz") 78 | } 79 | } 80 | 81 | func ExampleMatchingSmall() { 82 | cm := New([]string{"love", "loving", "cat", "kit", "cats"}, []int{4}) 83 | fmt.Println(cm.splitWord("love")) 84 | fmt.Println(cm.splitWord("kit")) 85 | fmt.Println(cm.Closest("kit")) 86 | // Output: 87 | // map[love:{}] 88 | // map[kit:{}] 89 | // kit 90 | 91 | } 92 | 93 | func ExampleMatchingSimple() { 94 | cm := New(test.WordsToTest, []int{3}) 95 | for _, searchWord := range test.SearchWords { 96 | fmt.Printf("'%s' matched '%s'\n", searchWord, cm.Closest(searchWord)) 97 | } 98 | // Output: 99 | // 'cervantes don quixote' matched 'don quixote by miguel de cervantes saavedra' 100 | // 'mysterious afur at styles by christie' matched 'the mysterious affair at styles by agatha christie' 101 | // 'hard times by charles dickens' matched 'hard times by charles dickens' 102 | // 'complete william shakespeare' matched 'the complete works of william shakespeare by william shakespeare' 103 
| // 'war by hg wells' matched 'the war of the worlds by h. g. wells' 104 | 105 | } 106 | 107 | func ExampleMatchingN() { 108 | cm := New(test.WordsToTest, []int{4}) 109 | fmt.Println(cm.ClosestN("war h.g. wells", 3)) 110 | // Output: 111 | // [the war of the worlds by h. g. wells the time machine by h. g. wells war and peace by graf leo tolstoy] 112 | } 113 | 114 | func ExampleMatchingBigList() { 115 | bText, _ := ioutil.ReadFile("test/books.list") 116 | wordsToTest := strings.Split(strings.ToLower(string(bText)), "\n") 117 | cm := New(wordsToTest, []int{3}) 118 | searchWord := "island of a thod mirrors" 119 | fmt.Println(cm.Closest(searchWord)) 120 | // Output: 121 | // island of a thousand mirrors by nayomi munaweera 122 | } 123 | 124 | func ExampleMatchingCatcher() { 125 | bText, _ := ioutil.ReadFile("test/catcher.txt") 126 | wordsToTest := strings.Split(strings.ToLower(string(bText)), "\n") 127 | cm := New(wordsToTest, []int{5}) 128 | searchWord := "catcher in the rye by jd salinger" 129 | for i, match := range cm.ClosestN(searchWord, 3) { 130 | if i == 2 { 131 | fmt.Println(match) 132 | } 133 | } 134 | // Output: 135 | // the catcher in the rye by j.d. salinger 136 | } 137 | 138 | func ExampleMatchingPotter() { 139 | bText, _ := ioutil.ReadFile("test/potter.txt") 140 | wordsToTest := strings.Split(strings.ToLower(string(bText)), "\n") 141 | cm := New(wordsToTest, []int{5}) 142 | searchWord := "harry potter and the half blood prince by j.k. rowling" 143 | for i, match := range cm.ClosestN(searchWord, 3) { 144 | if i == 1 { 145 | fmt.Println(match) 146 | } 147 | } 148 | // Output: 149 | // harry potter and the order of the phoenix (harry potter, #5, part 1) by j.k. 
rowling 150 | } 151 | 152 | func TestAccuracyBookWords(t *testing.T) { 153 | bText, _ := ioutil.ReadFile("test/books.list") 154 | wordsToTest := strings.Split(strings.ToLower(string(bText)), "\n") 155 | cm := New(wordsToTest, []int{4, 5}) 156 | accuracy := cm.AccuracyMutatingWords() 157 | fmt.Printf("Accuracy with mutating words in book list:\t%2.1f%%\n", accuracy) 158 | } 159 | 160 | func TestAccuracyBookLetters(t *testing.T) { 161 | bText, _ := ioutil.ReadFile("test/books.list") 162 | wordsToTest := strings.Split(strings.ToLower(string(bText)), "\n") 163 | cm := New(wordsToTest, []int{5}) 164 | accuracy := cm.AccuracyMutatingLetters() 165 | fmt.Printf("Accuracy with mutating letters in book list:\t%2.1f%%\n", accuracy) 166 | } 167 | 168 | func TestAccuracyDictionaryLetters(t *testing.T) { 169 | bText, _ := ioutil.ReadFile("test/popular.txt") 170 | wordsToTest := strings.Split(strings.ToLower(string(bText)), "\n") 171 | cm := New(wordsToTest, []int{2, 3, 4}) 172 | accuracy := cm.AccuracyMutatingWords() 173 | fmt.Printf("Accuracy with mutating letters in dictionary:\t%2.1f%%\n", accuracy) 174 | } 175 | 176 | func TestSaveLoad(t *testing.T) { 177 | bText, _ := ioutil.ReadFile("test/books.list") 178 | wordsToTest := strings.Split(strings.ToLower(string(bText)), "\n") 179 | type TestStruct struct { 180 | cm *ClosestMatch 181 | } 182 | tst := new(TestStruct) 183 | tst.cm = New(wordsToTest, []int{5}) 184 | err := tst.cm.Save("test.gob") 185 | if err != nil { 186 | t.Error(err) 187 | } 188 | 189 | tst2 := new(TestStruct) 190 | tst2.cm, err = Load("test.gob") 191 | if err != nil { 192 | t.Error(err) 193 | } 194 | answer2 := tst2.cm.Closest("war of the worlds by hg wells") 195 | answer1 := tst.cm.Closest("war of the worlds by hg wells") 196 | if answer1 != answer2 { 197 | t.Errorf("Differing answers: '%s' '%s'", answer1, answer2) 198 | } 199 | } 200 | -------------------------------------------------------------------------------- /cmclient/client.go: 
-------------------------------------------------------------------------------- 1 | package cmclient 2 | 3 | import ( 4 | "bytes" 5 | "encoding/json" 6 | "fmt" 7 | "net/http" 8 | ) 9 | 10 | // Connection holds the address of a cmserver instance 11 | type Connection struct { 12 | Address string 13 | } 14 | 15 | // Open checks that the cmserver at address is reachable and returns a Connection to it 16 | func Open(address string) (*Connection, error) { 17 | c := new(Connection) 18 | c.Address = address 19 | resp, err := http.Get(c.Address + "/uptime") 20 | if err != nil { 21 | return c, err 22 | } 23 | defer resp.Body.Close() 24 | return c, nil 25 | } 26 | 27 | func (c *Connection) Closest(searchString string) (match string, err error) { 28 | type QueryJSON struct { 29 | SearchString string `json:"s"` 30 | } 31 | 32 | payloadJSON := new(QueryJSON) 33 | payloadJSON.SearchString = searchString 34 | 35 | payloadBytes, err := json.Marshal(payloadJSON) 36 | if err != nil { 37 | return 38 | } 39 | body := bytes.NewReader(payloadBytes) 40 | 41 | req, err := http.NewRequest("POST", fmt.Sprintf("%s/match", c.Address), body) 42 | if err != nil { 43 | return 44 | } 45 | req.Header.Set("Content-Type", "application/json") 46 | 47 | resp, err := http.DefaultClient.Do(req) 48 | if err != nil { 49 | return 50 | } 51 | defer resp.Body.Close() 52 | 53 | type ResultJSON struct { 54 | Result string `json:"r"` 55 | } 56 | var r ResultJSON 57 | err = json.NewDecoder(resp.Body).Decode(&r) 58 | match = r.Result 59 | return 60 | } 61 | 62 | func (c *Connection) ClosestN(searchString string, n int) (matches []string, err error) { 63 | matches = []string{} 64 | type QueryJSON struct { 65 | SearchString string `json:"s"` 66 | N int `json:"n"` 67 | } 68 | 69 | payloadJSON := new(QueryJSON) 70 | payloadJSON.SearchString = searchString 71 | payloadJSON.N = n 72 | 73 | payloadBytes, err := json.Marshal(payloadJSON) 74 | if err != nil { 75 | return 76 | } 77 | body := bytes.NewReader(payloadBytes) 78 | 79 | req, err := http.NewRequest("POST", 
fmt.Sprintf("%s/match", c.Address), body) 80 | if err != nil { 81 | return 82 | } 83 | req.Header.Set("Content-Type", "application/json") 84 | 85 | resp, err := http.DefaultClient.Do(req) 86 | if err != nil { 87 | return 88 | } 89 | defer resp.Body.Close() 90 | 91 | type ResultJSON struct { 92 | Results []string `json:"r"` 93 | } 94 | var r ResultJSON 95 | err = json.NewDecoder(resp.Body).Decode(&r) 96 | matches = r.Results 97 | return 98 | } 99 | -------------------------------------------------------------------------------- /cmclient/client_test.go: -------------------------------------------------------------------------------- 1 | package cmclient 2 | 3 | import ( 4 | "fmt" 5 | "testing" 6 | ) 7 | 8 | var testingServer = "http://localhost:8051" 9 | 10 | func TestClosest(t *testing.T) { 11 | conn, _ := Open(testingServer) 12 | match, err := conn.Closest("The War of the Worlds by H.G. Wells") 13 | if err != nil { 14 | t.Error(err) 15 | } 16 | if match != "The Time Machine/The War of the Worlds by H.G. Wells" { 17 | t.Error(match) 18 | } 19 | } 20 | 21 | func TestClosestN(t *testing.T) { 22 | conn, _ := Open(testingServer) 23 | matches, err := conn.ClosestN("The War of the Worlds by H.G. 
Wells", 10) 24 | if err != nil { 25 | t.Error(err) 26 | } 27 | if len(matches) != 10 { 28 | t.Errorf("Got %d", len(matches)) 29 | } 30 | fmt.Println(matches) 31 | } 32 | -------------------------------------------------------------------------------- /cmserver/server.go: -------------------------------------------------------------------------------- 1 | package main 2 | 3 | import ( 4 | "fmt" 5 | "io/ioutil" 6 | "net/http" 7 | "os" 8 | "strings" 9 | "time" 10 | 11 | "strconv" 12 | 13 | "github.com/gin-gonic/gin" 14 | "github.com/jcelliott/lumber" 15 | "github.com/schollz/closestmatch" 16 | "gopkg.in/urfave/cli.v1" 17 | ) 18 | 19 | var version string 20 | var log *lumber.ConsoleLogger 21 | var cm *closestmatch.ClosestMatch 22 | 23 | func main() { 24 | 25 | app := cli.NewApp() 26 | app.Name = "cmserver" 27 | app.Usage = "fancy server for connecting to a closestmatch db" 28 | app.Version = version 29 | app.Compiled = time.Now() 30 | app.Action = func(c *cli.Context) error { 31 | listfile := c.GlobalString("list") 32 | verbose := c.GlobalBool("debug") 33 | port := c.GlobalString("port") 34 | 35 | if verbose { 36 | log = lumber.NewConsoleLogger(lumber.TRACE) 37 | } else { 38 | log = lumber.NewConsoleLogger(lumber.WARN) 39 | } 40 | 41 | log.Info("Loading closestmatch...") 42 | var errcm error 43 | cm, errcm = closestmatch.Load(listfile + ".cm") 44 | if errcm != nil { 45 | log.Warn(errcm.Error()) 46 | log.Info("...loading data file...") 47 | var intArray []int 48 | for _, intStr := range strings.Split(c.GlobalString("bags"), ",") { 49 | intInt, _ := strconv.Atoi(intStr) 50 | intArray = append(intArray, intInt) 51 | } 52 | keys, err := ioutil.ReadFile(listfile) 53 | if err != nil { 54 | log.Error(err.Error()) 55 | return err 56 | } 57 | log.Info("...computing cm...") 58 | cm = closestmatch.New(strings.Split(string(keys), "\n"), intArray) 59 | log.Info("...computed.") 60 | //log.Info("Saving...") 61 | //cm.Save(listfile + ".cm") 62 | //log.Info("...saving.") 63 | } 64 | 65 
| startTime := time.Now() 66 | 67 | gin.SetMode(gin.ReleaseMode) 68 | r := gin.Default() 69 | r.GET("/v1/api", func(c *gin.Context) { 70 | c.String(200, ` 71 | 72 | // Get how long the server has been running 73 | GET /uptime 74 | `) 75 | }) 76 | r.GET("/uptime", func(c *gin.Context) { 77 | c.JSON(200, gin.H{ 78 | "uptime": time.Since(startTime).String(), 79 | }) 80 | }) 81 | r.POST("/match", handleMatch) 82 | 83 | fmt.Printf("cmserver (v.%s) running on :%s\n", version, port) 84 | r.Run(":" + port) // listen and serve on 0.0.0.0:<port> (default 8051) 85 | return nil 86 | } 87 | app.Flags = []cli.Flag{ 88 | cli.StringFlag{ 89 | Name: "port, p", 90 | Value: "8051", 91 | Usage: "port to use to listen", 92 | }, 93 | cli.StringFlag{ 94 | Name: "list,l", 95 | Value: "", 96 | Usage: "list of phrases to load into closestmatch", 97 | }, 98 | cli.StringFlag{ 99 | Name: "bags,b", 100 | Value: "2,3", 101 | Usage: "comma separated bags", 102 | }, 103 | cli.BoolFlag{ 104 | Name: "debug,d", 105 | Usage: "turn on debug mode", 106 | }, 107 | } 108 | app.Run(os.Args) 109 | 110 | } 111 | 112 | // handleMatch serves POST /match. Test with: 113 | // http POST localhost:8051/match s='The War of the Worlds by HG Wells' 114 | func handleMatch(c *gin.Context) { 115 | type QueryJSON struct { 116 | SearchString string `json:"s"` 117 | N int `json:"n"` 118 | } 119 | var json QueryJSON 120 | if c.BindJSON(&json) != nil { 121 | log.Trace("Got %v", json) 122 | c.String(http.StatusBadRequest, "Must provide a search string in the JSON field \"s\"") 123 | return 124 | } 125 | log.Trace("Got %v", json) 126 | if json.N == 0 { 127 | c.JSON(http.StatusOK, gin.H{"r": cm.Closest(json.SearchString)}) 128 | } else { 129 | c.JSON(http.StatusOK, gin.H{"r": cm.ClosestN(json.SearchString, json.N)}) 130 | } 131 | } 132 | -------------------------------------------------------------------------------- /levenshtein/levenshtein.go: -------------------------------------------------------------------------------- 1 | package levenshtein 2 | 3 | import ( 4 | "math/rand" 5 | 
"strings" 6 | ) 7 | 8 | // LevenshteinDistance 9 | // from https://groups.google.com/forum/#!topic/golang-nuts/YyH1f_qCZVc 10 | // (no min, compute lengths once, pointers, 2 rows array) 11 | // fastest profiled 12 | func LevenshteinDistance(a, b *string) int { 13 | la := len(*a) 14 | lb := len(*b) 15 | d := make([]int, la+1) 16 | var lastdiag, olddiag, temp int 17 | 18 | for i := 1; i <= la; i++ { 19 | d[i] = i 20 | } 21 | for i := 1; i <= lb; i++ { 22 | d[0] = i 23 | lastdiag = i - 1 24 | for j := 1; j <= la; j++ { 25 | olddiag = d[j] 26 | min := d[j] + 1 27 | if (d[j-1] + 1) < min { 28 | min = d[j-1] + 1 29 | } 30 | if (*a)[j-1] == (*b)[i-1] { 31 | temp = 0 32 | } else { 33 | temp = 1 34 | } 35 | if (lastdiag + temp) < min { 36 | min = lastdiag + temp 37 | } 38 | d[j] = min 39 | lastdiag = olddiag 40 | } 41 | } 42 | return d[la] 43 | } 44 | 45 | type ClosestMatch struct { 46 | WordsToTest []string 47 | } 48 | 49 | func New(wordsToTest []string) *ClosestMatch { 50 | cm := new(ClosestMatch) 51 | cm.WordsToTest = wordsToTest 52 | return cm 53 | } 54 | 55 | func (cm *ClosestMatch) Closest(searchWord string) string { 56 | bestVal := 10000 57 | bestWord := "" 58 | for _, word := range cm.WordsToTest { 59 | newVal := LevenshteinDistance(&searchWord, &word) 60 | if newVal < bestVal { 61 | bestVal = newVal 62 | bestWord = word 63 | } 64 | } 65 | return bestWord 66 | } 67 | 68 | func (cm *ClosestMatch) Accuracy() float64 { 69 | rand.Seed(1) 70 | percentCorrect := 0.0 71 | numTrials := 0.0 72 | 73 | for wordTrials := 0; wordTrials < 100; wordTrials++ { 74 | 75 | var testString, originalTestString string 76 | testStringNum := rand.Intn(len(cm.WordsToTest)) 77 | i := 0 78 | for _, s := range cm.WordsToTest { 79 | i++ 80 | if i != testStringNum { 81 | continue 82 | } 83 | originalTestString = s 84 | break 85 | } 86 | 87 | // remove a random word 88 | for trial := 0; trial < 4; trial++ { 89 | words := strings.Split(originalTestString, " ") 90 | if len(words) < 3 { 91 | continue 
			}
			deleteWordI := rand.Intn(len(words))
			words = append(words[:deleteWordI], words[deleteWordI+1:]...)
			testString = strings.Join(words, " ")
			if cm.Closest(testString) == originalTestString {
				percentCorrect += 1.0
			}
			numTrials += 1.0
		}

		// remove a random word and reverse
		for trial := 0; trial < 4; trial++ {
			words := strings.Split(originalTestString, " ")
			if len(words) > 1 {
				deleteWordI := rand.Intn(len(words))
				words = append(words[:deleteWordI], words[deleteWordI+1:]...)
				for left, right := 0, len(words)-1; left < right; left, right = left+1, right-1 {
					words[left], words[right] = words[right], words[left]
				}
			} else {
				continue
			}
			testString = strings.Join(words, " ")
			if cm.Closest(testString) == originalTestString {
				percentCorrect += 1.0
			}
			numTrials += 1.0
		}

		// remove a random word and shuffle and replace random letter
		for trial := 0; trial < 4; trial++ {
			words := strings.Split(originalTestString, " ")
			if len(words) > 1 {
				deleteWordI := rand.Intn(len(words))
				words = append(words[:deleteWordI], words[deleteWordI+1:]...)
				for i := range words {
					j := rand.Intn(i + 1)
					words[i], words[j] = words[j], words[i]
				}
			}
			testString = strings.Join(words, " ")
			letters := "abcdefghijklmnopqrstuvwxyz"
			if len(testString) == 0 {
				continue
			}
			ii := rand.Intn(len(testString))
			testString = testString[:ii] + string(letters[rand.Intn(len(letters))]) + testString[ii+1:]
			ii = rand.Intn(len(testString))
			testString = testString[:ii] + string(letters[rand.Intn(len(letters))]) + testString[ii+1:]
			if cm.Closest(testString) == originalTestString {
				percentCorrect += 1.0
			}
			numTrials += 1.0
		}

		if cm.Closest(testString) == originalTestString {
			percentCorrect += 1.0
		}
		numTrials += 1.0

	}

	return 100.0 * percentCorrect / numTrials
}

func (cm *ClosestMatch) AccuracySimple() float64 {
	rand.Seed(1)
	percentCorrect := 0.0
	numTrials := 0.0

	for wordTrials := 0; wordTrials < 500; wordTrials++ {

		var testString, originalTestString string
		testStringNum := rand.Intn(len(cm.WordsToTest))

		originalTestString = cm.WordsToTest[testStringNum]

		testString = originalTestString

		// letters to replace with
		letters := "abcdefghijklmnopqrstuvwxyz"

		choice := rand.Intn(3)
		if choice == 0 {
			// replace random letter
			ii := rand.Intn(len(testString))
			testString = testString[:ii] + string(letters[rand.Intn(len(letters))]) + testString[ii+1:]
		} else if choice == 1 {
			// delete random letter
			ii := rand.Intn(len(testString))
			testString = testString[:ii] + testString[ii+1:]
		} else {
			// add random letter
			ii := rand.Intn(len(testString))
			testString = testString[:ii] + string(letters[rand.Intn(len(letters))]) + testString[ii:]
		}
		closest := cm.Closest(testString)
		if closest == originalTestString {
			percentCorrect += 1.0
		} else {
			//fmt.Printf("Original: %s, Mutilated: %s, Match: %s\n", originalTestString, testString, closest)
		}
		numTrials += 1.0
	}

	return 100.0 * percentCorrect / numTrials
}

// AccuracyMutatingWords runs some basic tests against the wordlist to
// see how accurate this Levenshtein-based matcher is against
// the target dataset when mutating whole words (removing, reordering)
func (cm *ClosestMatch) AccuracyMutatingWords() float64 {
	rand.Seed(1)
	percentCorrect := 0.0
	numTrials := 0.0

	for wordTrials := 0; wordTrials < 200; wordTrials++ {

		var testString, originalTestString string
		testStringNum := rand.Intn(len(cm.WordsToTest))
		originalTestString = cm.WordsToTest[testStringNum]
		testString = originalTestString

		var words []string
		choice := rand.Intn(3)
		if choice == 0 {
			// remove a random word
			words = strings.Split(originalTestString, " ")
			if len(words) < 3 {
				continue
			}
			deleteWordI := rand.Intn(len(words))
			words = append(words[:deleteWordI], words[deleteWordI+1:]...)
			testString = strings.Join(words, " ")
		} else if choice == 1 {
			// remove a random word and reverse
			words = strings.Split(originalTestString, " ")
			if len(words) > 1 {
				deleteWordI := rand.Intn(len(words))
				words = append(words[:deleteWordI], words[deleteWordI+1:]...)
				for left, right := 0, len(words)-1; left < right; left, right = left+1, right-1 {
					words[left], words[right] = words[right], words[left]
				}
			} else {
				continue
			}
			testString = strings.Join(words, " ")
		} else {
			// remove a random word and shuffle and replace 2 random letters
			words = strings.Split(originalTestString, " ")
			if len(words) > 1 {
				deleteWordI := rand.Intn(len(words))
				words = append(words[:deleteWordI], words[deleteWordI+1:]...)
				for i := range words {
					j := rand.Intn(i + 1)
					words[i], words[j] = words[j], words[i]
				}
			}
			testString = strings.Join(words, " ")
			letters := "abcdefghijklmnopqrstuvwxyz"
			if len(testString) == 0 {
				continue
			}
			ii := rand.Intn(len(testString))
			testString = testString[:ii] + string(letters[rand.Intn(len(letters))]) + testString[ii+1:]
			ii = rand.Intn(len(testString))
			testString = testString[:ii] + string(letters[rand.Intn(len(letters))]) + testString[ii+1:]
		}
		closest := cm.Closest(testString)
		if closest == originalTestString {
			percentCorrect += 1.0
		} else {
			//fmt.Printf("Original: %s, Mutilated: %s, Match: %s\n", originalTestString, testString, closest)
		}
		numTrials += 1.0
	}
	return 100.0 * percentCorrect / numTrials
}

// AccuracyMutatingLetters runs some basic tests against the wordlist to
// see how accurate this Levenshtein-based matcher is against
// the target dataset when mutating individual letters (adding, removing, changing)
func (cm *ClosestMatch) AccuracyMutatingLetters() float64 {
	rand.Seed(1)
	percentCorrect := 0.0
	numTrials := 0.0

	for wordTrials := 0; wordTrials < 200; wordTrials++ {

		var testString, originalTestString string
		testStringNum := rand.Intn(len(cm.WordsToTest) - 1)
		originalTestString = cm.WordsToTest[testStringNum]
		testString = originalTestString

		// letters to replace with
		letters := "abcdefghijklmnopqrstuvwxyz"

		choice := rand.Intn(3)
		if choice == 0 {
			// replace random letter
			ii := rand.Intn(len(testString))
			testString = testString[:ii] + string(letters[rand.Intn(len(letters))]) + testString[ii+1:]
		} else if choice == 1 {
			// delete random letter
			ii := rand.Intn(len(testString))
			testString = testString[:ii] + testString[ii+1:]
		} else {
			// add random letter
			ii := rand.Intn(len(testString))
			testString = testString[:ii] + string(letters[rand.Intn(len(letters))]) + testString[ii:]
		}
		closest := cm.Closest(testString)
		if closest == originalTestString {
			percentCorrect += 1.0
		} else {
			//fmt.Printf("Original: %s, Mutilated: %s, Match: %s\n", originalTestString, testString, closest)
		}
		numTrials += 1.0
	}

	return 100.0 * percentCorrect / numTrials
}

--------------------------------------------------------------------------------
/levenshtein/levenshtein_test.go:
--------------------------------------------------------------------------------
package levenshtein

import (
	"fmt"
	"io/ioutil"
	"strings"
	"testing"

	"github.com/schollz/closestmatch/test"
)

func BenchmarkNew(b *testing.B) {
	for i := 0; i < b.N; i++ {
		New(test.WordsToTest)
	}
}

func BenchmarkClosestOne(b *testing.B) {
	bText, _ := ioutil.ReadFile("../test/books.list")
	wordsToTest := strings.Split(strings.ToLower(string(bText)), "\n")
	cm := New(wordsToTest)
	searchWord := test.SearchWords[0]
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		cm.Closest(searchWord)
	}
}

func ExampleMatching() {
	cm := New(test.WordsToTest)
	for _, searchWord := range test.SearchWords {
		fmt.Printf("'%s' matched '%s'\n", searchWord, cm.Closest(searchWord))
	}
	// Output:
	// 'cervantes don quixote' matched 'emma by jane austen'
	// 'mysterious afur at styles by christie' matched 'the mysterious affair at styles by agatha christie'
	// 'hard times by charles dickens' matched 'hard times by charles dickens'
	// 'complete william shakespeare' matched 'the iliad by homer'
	// 'war by hg wells' matched 'beowulf'

}

func TestAccuracyBookWords(t *testing.T) {
	bText, _ := ioutil.ReadFile("../test/books.list")
	wordsToTest := strings.Split(strings.ToLower(string(bText)), "\n")
	cm := New(wordsToTest)
	accuracy := cm.AccuracyMutatingWords()
	fmt.Printf("Accuracy with mutating words in book list:\t%2.1f%%\n", accuracy)
}

func TestAccuracyBookletters(t *testing.T) {
	bText, _ := ioutil.ReadFile("../test/books.list")
	wordsToTest := strings.Split(strings.ToLower(string(bText)), "\n")
	cm := New(wordsToTest)
	accuracy := cm.AccuracyMutatingLetters()
	fmt.Printf("Accuracy with mutating letters in book list:\t%2.1f%%\n", accuracy)
}

func TestAccuracyDictionaryletters(t *testing.T) {
	bText, _ := ioutil.ReadFile("../test/popular.txt")
	wordsToTest := strings.Split(strings.ToLower(string(bText)), "\n")
	cm := New(wordsToTest)
	accuracy := cm.AccuracyMutatingWords()
	fmt.Printf("Accuracy with mutating letters in dictionary:\t%2.1f%%\n", accuracy)
}

--------------------------------------------------------------------------------
/test/catcher.txt:
--------------------------------------------------------------------------------
The Catcher in the Rye by J.D. Salinger Student Packet Grades 9-12 (Novel Units Guides) by Gloria Levine
The Catcher in the Rye by J.D. Salinger: A Study Guide by Ray Moore
A Reader's Companion to J.D. Salinger's the Catcher in the Rye by Peter Beidler
A Reader's Companion To J.D. Salinger's The Catcher In The Rye by Peter G. Beidler
Depression in J.D. Salinger's The Catcher in the Rye by Dedria Bryfonski
Critica; Insights: The Catcher in the Rye, by J.D. Salinger by Joseph Dewey
The Catcher in the Rye/Franny and Zooey/Nine Stories/Raise High the Roof Beam, Carpenters by J.D. Salinger
The Catcher in the Rye by J. D. Salinger Summary & Study Guide by BookRags
The Catcher in the Rye and J.D. Salinger by Jonathan Coupland
Monarch Notes: J. D. Salinger's The Catcher in the Rye by Laurie E. Rozakis
J. D. Salinger: The Catcher In The Rye by Brian Donnelly
A Reader's Companion to J. D. Salinger's The Catcher in the Rye by Peter Beidler
Jerome D. Salinger, The Catcher In The Rye by Hans-Otto Jahnke
Robert Cormier, I Am The Cheese, J. D. Salinger, The Catcher In The Rye by Peter Jone
The Catcher in the Rye by Jerome Salinger
J.D. Salinger: The Catcher In The Rye (Barron's Studies in American Literature) by Richard Lettis
J.D. Salinger: The Catcher in the Rye and Other Works by Raychel Haugrud Reiff
The Catcher in the Rye: A Reader's Guide to the J.D. Salinger Novel by Robert Crayola
The Catcher In The Rye, De J.D. Salinger by Claire Bernas-Martel
J.D. Salinger's The Catcher in the Rye by Harold Bloom (Bloom's Modern Critical Interpretations)
J.D. Salingers 'The Catcher in the Rye.' Materialien. (Lernmaterialien) by Herbert Rühl
Cliffs Notes on Salinger's the Catcher in the Rye by Robert B. Kaplan
Cliffs Notes on Salinger's The Catcher in the Rye by Stanley P. Baldwin
J. D. Salinger's the Catcher in the Rye: A Routledge Guide by Sarah Graham
The Catcher in the Rye - and Salinger by Jerome Smith
The Catcher in the Rye and Salinger by Jonathan Coupland
Catcher In The Rye, J.D. Salinger by Nigel Tookey
The Catcher in the Rye and JD Salinger by Andrew Hastings
The Catcher in the Rye Guide and Other Works of JD Salinge by Peter Baxter
The Catcher in the Rye by Joy Leavitt
New Essays on the Catcher in the Rye by Jack Salzman
Salinger's The Catcher in the Rye (Reader's Guides) by Sarah Graham
The Catcher in the Rye by Shmoop
The Catcher in the Rye - A - Z by Jecks Stapley
The Candidate in the Rye: A Parody of The Catcher in the Rye Starring Donald J. Trump by John Marquane
Readings on the Catcher in the Rye (Literary Companion Series) by Steven Engel
The Catcher In The Rye; Owlsgate 35s Study Guide by David Neilson
The Catcher in the Rye (A BookHacker Summary) by BookHacker
The Catcher in the Rye - Barron's Book Notes by Barron's Book Notes
The Catcher in the Rye (SparkNotes Literature Guide) by SparkNotes
The Catcher in the Rye (Study Guide) by Minute Help Guides
The Catcher in the Rye (York Notes) by Nigel Tookey
Masterwork Studies Series: The Catcher in the Rye (Paperback) by Sanford Pinsker
The Catcher in the Rye by J.D. Salinger
--------------------------------------------------------------------------------
/test/data.go:
--------------------------------------------------------------------------------
package test

import (
	"strings"
)

var books = `Pride and Prejudice by Jane Austen
Alice's Adventures in Wonderland by Lewis Carroll
The Importance of Being Earnest: A Trivial Comedy for Serious People by Oscar Wilde
A Tale of Two Cities by Charles Dickens
A Doll's House : a play by Henrik Ibsen
Frankenstein; Or, The Modern Prometheus by Mary Wollstonecraft Shelley
The Yellow Wallpaper by Charlotte Perkins Gilman
The Adventures of Tom Sawyer by Mark Twain
Metamorphosis by Franz Kafka
Adventures of Huckleberry Finn by Mark Twain
Light Science for Leisure Hours by Richard A. Proctor
Grimms' Fairy Tales by Jacob Grimm and Wilhelm Grimm
Jane Eyre: An Autobiography by Charlotte Brontë
Dracula by Bram Stoker
Moby Dick; Or, The Whale by Herman Melville
The Adventures of Sherlock Holmes by Arthur Conan Doyle
Il Principe. English by Niccolò Machiavelli
Emma by Jane Austen
Great Expectations by Charles Dickens
The Picture of Dorian Gray by Oscar Wilde
Beyond the Hills of Dream by W. Wilfred Campbell
The Hospital Murders by Means Davis and Augusta Tucker Townsend
Dirty Dustbins and Sloppy Streets by H. Percy Boulnois
Leviathan by Thomas Hobbes
The Count of Monte Cristo, Illustrated by Alexandre Dumas
Heart of Darkness by Joseph Conrad
Ulysses by James Joyce
War and Peace by graf Leo Tolstoy
Narrative of the Life of Frederick Douglass, an American Slave by Frederick Douglass
The Radio Boys Seek the Lost Atlantis by Gerald Breckenridge
The Bab Ballads by W. S. Gilbert
Wuthering Heights by Emily Brontë
The Awakening, and Selected Short Stories by Kate Chopin
The Romance of Lust: A Classic Victorian erotic novel by Anonymous
Beowulf
Les Misérables by Victor Hugo
Siddhartha by Hermann Hesse
The Kama Sutra of Vatsyayana by Vatsyayana
Treasure Island by Robert Louis Stevenson
Dubliners by James Joyce
Reminiscences of Western Travels by Shao Xiang Lin
The Souls of Black Folk by W. E. B. Du Bois
Leaves of Grass by Walt Whitman
A Christmas Carol in Prose; Being a Ghost Story of Christmas by Charles Dickens
Tractatus Logico-Philosophicus by Ludwig Wittgenstein
A Modest Proposal by Jonathan Swift
Essays of Michel de Montaigne — Complete by Michel de Montaigne
Prestuplenie i nakazanie. English by Fyodor Dostoyevsky
Practical Grammar and Composition by Thomas Wood
A Study in Scarlet by Arthur Conan Doyle
Sense and Sensibility by Jane Austen
Don Quixote by Miguel de Cervantes Saavedra
Peter Pan by J. M. Barrie
The Republic by Plato
The Life and Adventures of Robinson Crusoe by Daniel Defoe
The Strange Case of Dr. Jekyll and Mr. Hyde by Robert Louis Stevenson
Gulliver's Travels into Several Remote Nations of the World by Jonathan Swift
My Secret Life, Volumes I. to III. by Anonymous
Beyond Good and Evil by Friedrich Wilhelm Nietzsche
The Brothers Karamazov by Fyodor Dostoyevsky
The Time Machine by H. G. Wells
Also sprach Zarathustra. English by Friedrich Wilhelm Nietzsche
The Federalist Papers by Alexander Hamilton and John Jay and James Madison
Songs of Innocence, and Songs of Experience by William Blake
The Iliad by Homer
Hastings & Environs; A Sketch-Book by H. G. Hampton
The Hound of the Baskervilles by Arthur Conan Doyle
The Children of Odin: The Book of Northern Myths by Padraic Colum
Autobiography of Benjamin Franklin by Benjamin Franklin
The Divine Comedy by Dante, Illustrated by Dante Alighieri
Hedda Gabler by Henrik Ibsen
Hard Times by Charles Dickens
The Jungle Book by Rudyard Kipling
The Real Captain Kidd by Cornelius Neale Dalton
On Liberty by John Stuart Mill
The Complete Works of William Shakespeare by William Shakespeare
The Tragical History of Doctor Faustus by Christopher Marlowe
Anne of Green Gables by L. M. Montgomery
The Jungle by Upton Sinclair
The Tragedy of Romeo and Juliet by William Shakespeare
De l'amour by Charles Baudelaire and Félix-François Gautier
Ethan Frome by Edith Wharton
Oliver Twist by Charles Dickens
The Turn of the Screw by Henry James
The Wonderful Wizard of Oz by L. Frank Baum
The Legend of Sleepy Hollow by Washington Irving
The Ship of Coral by H. De Vere Stacpoole
Democracy and Education: An Introduction to the Philosophy of Education by John Dewey
Candide by Voltaire
Pygmalion by Bernard Shaw
Walden, and On The Duty Of Civil Disobedience by Henry David Thoreau
Three Men in a Boat by Jerome K. Jerome
A Portrait of the Artist as a Young Man by James Joyce
Manifest der Kommunistischen Partei. English by Friedrich Engels and Karl Marx
Through the Looking-Glass by Lewis Carroll
Le Morte d'Arthur: Volume 1 by Sir Thomas Malory
The Mysterious Affair at Styles by Agatha Christie
Korean—English Dictionary by Leon Kuperman
The War of the Worlds by H. G. Wells
A Concise Dictionary of Middle English from A.D. 1150 to 1580 by A. L. Mayhew and Walter W. Skeat
Armageddon in Retrospect by Kurt Vonnegut
Red Riding Hood by Sarah Blakley-Cartwright
The Kingdom of This World by Alejo Carpentier
Hitty, Her First Hundred Years by Rachel Field`

var WordsToTest []string
var SearchWords = []string{"cervantes don quixote", "mysterious afur at styles by christie", "hard times by charles dickens", "complete william shakespeare", "War by HG Wells"}

func init() {
	WordsToTest = strings.Split(strings.ToLower(books), "\n")
	for i := range SearchWords {
		SearchWords[i] = strings.ToLower(SearchWords[i])
	}
}

--------------------------------------------------------------------------------
/test/potter.txt:
--------------------------------------------------------------------------------
Harry Potter And The Half Blood Prince Deluxe Gift Book by BBC
Harry Potter and the Half Blood Prince: The Interactive Quiz Book (The Harry Potter Series.) by Julia Reed
Harry Potter And The Half Blood Prince: Poster Annual 2010 by BBC
Harry Potter And The Half Blood Prince: (Piano Solo) by Nicholas Hooper
Harry Potter and the Half-Blood Prince by Shmoop
Garri Potter i Princ Polukrovka / Harry Potter and the Half-Blood Prince [IN RUSSIAN] by Rouling Dzh.
Mark Reads Harry Potter and the Half-Blood Prince by Mark Oshiro (Mark Reads Harry Potter #6)
Harry Potter Films (Film Guide): Harry Potter and the Order of the Phoenix, List of Harry Potter Cast Members, Harry Potter and the Half-Blood Prince, by Books Group
Selections from Harry Potter and the Half-Blood Prince: Piano Solos by Songbook
Harry Potter and the Half-Blood Prince: Movie Poster Book by Scholastic
The Ultimate Unofficial Harry Potter® Trivia Book: Secrets, Mysteries And Fun Facts Including Half Blood Prince Book 6 by Daniel Lawrence
Unauthorized Half-Blood Prince Update: News and Speculation about Harry Potter Book Six by J. K. Rowling by W. Frederick Zimmerman
Harry Potter and the Sorcerer's Stone: Book 1 - Novel by J.K Rowling -- Summary & More! by Ez- Summary
Harry Potter and the Goblet of Fire by J. K. Rowling | Chapter Outlines by BookRags
Harry Potter and the Order of the Court: The J.K. Rowling Copyright Case and the Question of Fair Use by Robert S. Want
Harry Potter And The Order Of Phoenix: A Summary About This Novel Of J.K Rowling!! (Harry Potter And The Order Of Phoenix: A Detailed Summary-- Book 5, Box Set, Novel, Rowling) by The Summary Guy
Harry Potter and the Charming Prince by slashpervert (The Bound Prince #7)
Myths and Symbols in J.K. Rowling's Harry Potter and the Philosopher's Stone by Volker Geyer
Harry Potter and the Order of the Phoenix (Harry Potter, #5, Part 1) by J.K. Rowling
Harry Potter and the Order of the Phoenix by J. K. Rowling | Chapter Outlines by BookRags
Buchspicker: Übersetzungshilfe Zu "Harry Potter And The Deathly Hallows" (Harry Potter 7): Ausgewählte Vokabeln Für Jede Seite Des Romans Von J.K. Rowling by Thorsten Hinrichsen
Buchspicker: Übersetzungshilfe zu "Harry Potter and the philosopher's stone" und "Harry Potter and the chamber of secrets" (Harry Potter 1 + 2) ausgewählte Vokabeln für jede Seite der Romane von J. K. Rowling by Thorsten Hinrichsen
--------------------------------------------------------------------------------