├── .all_images.yml
├── .gitignore
├── CHANGES.md
├── COPYING
├── Gemfile
├── README.md
├── Rakefile
├── VERSION
├── amatch.gemspec
├── bin
    ├── agrep
    └── dupfind
├── ext
    ├── amatch_ext.c
    ├── common.h
    ├── extconf.rb
    ├── pair.c
    └── pair.h
├── images
    └── amatch_ext.png
├── install.rb
├── lib
    ├── amatch.rb
    └── amatch
    │   ├── .keep
    │   ├── polite.rb
    │   ├── rude.rb
    │   └── version.rb
└── tests
    ├── test_damerau_levenshtein.rb
    ├── test_hamming.rb
    ├── test_jaro.rb
    ├── test_jaro_winkler.rb
    ├── test_levenshtein.rb
    ├── test_longest_subsequence.rb
    ├── test_longest_substring.rb
    ├── test_pair_distance.rb
    └── test_sellers.rb

/.all_images.yml:
--------------------------------------------------------------------------------
1 | dockerfile: |-
2 |   RUN apk add --no-cache build-base git
3 |
4 | fail_fast: yes
5 |
6 | script: &script |-
7 |   echo -e "\e[1m"
8 |   ruby -v
9 |   echo -e "\e[0m"
10 |   rm -f Gemfile.lock
11 |   bundle install
12 |   bundle exec rake clobber test
13 |
14 | images:
15 |   ruby:3.4-alpine: *script
16 |   ruby:3.3-alpine: *script
17 |   ruby:3.2-alpine: *script
18 |   ruby:3.1-alpine: *script
19 |   ruby:3.0-alpine: *script
20 |
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | *.bundle
2 | *.o
3 | *.so
4 | .*.sw[pon]
5 | .AppleDouble
6 | .rbx
7 | Gemfile.lock
8 | Makefile
9 | pkg
10 |
--------------------------------------------------------------------------------
/CHANGES.md:
--------------------------------------------------------------------------------
1 | # Changes
2 |
3 | ## 2022-05-15 v0.4.1
4 |
5 | * **Moved CI testing from Travis to All Images**
6 |   + Updated configuration to use `all_images` instead of Travis for continuous integration testing.
7 |
8 | ## 2017-07-04 v0.4.0
9 |
10 | * Officially support DamerauLevenshtein matching algorithm.
11 | * Change license to Apache 2.0
12 |
13 | ## 2017-05-23 v0.3.1
14 |
15 | * Include PairDistance fix from dominikgrygiel, Thx.
16 |
17 | ## 2014-03-27 v0.3.0
18 |
19 | * Update some dependencies
20 |
21 | ## 2013-10-14 v0.2.12
22 |
23 | * Include test fix from Juanito Fatas. Thx!
24 |
25 | ## 2013-01-16 v0.2.11
26 |
27 | * Include some fixes from Jason Colburne.
28 |   Thx!
29 |
30 | ## 2012-02-06 v0.2.10
31 |
32 | * Use xfree instead of free to avoid (possible) problems.
33 |
34 | ## 2011-11-15 v0.2.9
35 |
36 | * Provide amatch/rude and amatch/polite for require (the latter doesn't
37 |   extend ::String on its own)
38 | * `pair_distance_similar` method now can take an optional regexp argument for
39 |   tokenizing.
40 |
41 | ## 2011-08-06 v0.2.8
42 |
43 | * Depend on tins library.
44 |
45 | ## 2011-08-06 v0.2.7
46 |
47 | * Fix some violations of ISO C90 standard.
48 |
49 | ## 2011-07-16 v0.2.6
50 |
51 | * Applied patch by Kevin J. Lynagh fixing memory
52 |   leak in Jaro match.
53 |
54 | ## 2009-09-25 v0.2.5
55 |
56 | * Added lib to gem's require_paths.
57 | * Using rake-compiler now.
58 |
59 | ## 2009-08-25 v0.2.4
60 |
61 | * Included Jaro and Jaro-Winkler metrics implementation of Kevin Ballard.
62 |   Thanks a lot.
63 | * Made the extension compile under Ruby 1.9.
64 |
65 | ## 2006-06-25 v0.2.3
66 |
67 | * Fixed agrep.rb to use the new API.
68 |
69 | ## 2005-10-11 v0.2.2
70 | * Fixed a typo in extconf.rb that prohibited compiling on
71 |   non-gcc compilers.
72 |
73 | ## 2005-09-12 v0.2.1
74 |
75 | * Bugfix: Wrong type for pattern length corrected. Thanks to David
76 |   Heinemeier Hansson for reporting it.
77 |
78 | ## 2005-06-01 v0.2.0
79 |
80 | * Major changes in API and implementation:
81 |   Now the Levenshtein edit distance, Sellers edit distance, the Hamming
82 |   distance, the longest common subsequence length, the longest common
83 |   substring length, and the pair distance metric can be computed.
84 | 85 | ## 2005-01-20 v0.1.4 86 | 87 | * Better argument handling in initialization method 88 | * Minor changes in Rakefile and README.en 89 | 90 | ## 2004-09-27 v0.1.3 91 | 92 | * Rakefile and gem support added. 93 | 94 | ## 2004-09-24 v0.1.2 95 | 96 | * Uses Test::Unit for regression tests now. 97 | 98 | ## 2002-04-21 v0.1.1 99 | 100 | * Minor changes: documentation, more test cases and exceptions. 101 | 102 | ## 2009-08-26 v0.1.0 103 | 104 | * Initial Version 105 | -------------------------------------------------------------------------------- /COPYING: -------------------------------------------------------------------------------- 1 | 2 | Apache License 3 | Version 2.0, January 2004 4 | http://www.apache.org/licenses/ 5 | 6 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 7 | 8 | 1. Definitions. 9 | 10 | "License" shall mean the terms and conditions for use, reproduction, 11 | and distribution as defined by Sections 1 through 9 of this document. 12 | 13 | "Licensor" shall mean the copyright owner or entity authorized by 14 | the copyright owner that is granting the License. 15 | 16 | "Legal Entity" shall mean the union of the acting entity and all 17 | other entities that control, are controlled by, or are under common 18 | control with that entity. For the purposes of this definition, 19 | "control" means (i) the power, direct or indirect, to cause the 20 | direction or management of such entity, whether by contract or 21 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 22 | outstanding shares, or (iii) beneficial ownership of such entity. 23 | 24 | "You" (or "Your") shall mean an individual or Legal Entity 25 | exercising permissions granted by this License. 26 | 27 | "Source" form shall mean the preferred form for making modifications, 28 | including but not limited to software source code, documentation 29 | source, and configuration files. 
30 | 31 | "Object" form shall mean any form resulting from mechanical 32 | transformation or translation of a Source form, including but 33 | not limited to compiled object code, generated documentation, 34 | and conversions to other media types. 35 | 36 | "Work" shall mean the work of authorship, whether in Source or 37 | Object form, made available under the License, as indicated by a 38 | copyright notice that is included in or attached to the work 39 | (an example is provided in the Appendix below). 40 | 41 | "Derivative Works" shall mean any work, whether in Source or Object 42 | form, that is based on (or derived from) the Work and for which the 43 | editorial revisions, annotations, elaborations, or other modifications 44 | represent, as a whole, an original work of authorship. For the purposes 45 | of this License, Derivative Works shall not include works that remain 46 | separable from, or merely link (or bind by name) to the interfaces of, 47 | the Work and Derivative Works thereof. 48 | 49 | "Contribution" shall mean any work of authorship, including 50 | the original version of the Work and any modifications or additions 51 | to that Work or Derivative Works thereof, that is intentionally 52 | submitted to Licensor for inclusion in the Work by the copyright owner 53 | or by an individual or Legal Entity authorized to submit on behalf of 54 | the copyright owner. For the purposes of this definition, "submitted" 55 | means any form of electronic, verbal, or written communication sent 56 | to the Licensor or its representatives, including but not limited to 57 | communication on electronic mailing lists, source code control systems, 58 | and issue tracking systems that are managed by, or on behalf of, the 59 | Licensor for the purpose of discussing and improving the Work, but 60 | excluding communication that is conspicuously marked or otherwise 61 | designated in writing by the copyright owner as "Not a Contribution." 
62 | 63 | "Contributor" shall mean Licensor and any individual or Legal Entity 64 | on behalf of whom a Contribution has been received by Licensor and 65 | subsequently incorporated within the Work. 66 | 67 | 2. Grant of Copyright License. Subject to the terms and conditions of 68 | this License, each Contributor hereby grants to You a perpetual, 69 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 70 | copyright license to reproduce, prepare Derivative Works of, 71 | publicly display, publicly perform, sublicense, and distribute the 72 | Work and such Derivative Works in Source or Object form. 73 | 74 | 3. Grant of Patent License. Subject to the terms and conditions of 75 | this License, each Contributor hereby grants to You a perpetual, 76 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 77 | (except as stated in this section) patent license to make, have made, 78 | use, offer to sell, sell, import, and otherwise transfer the Work, 79 | where such license applies only to those patent claims licensable 80 | by such Contributor that are necessarily infringed by their 81 | Contribution(s) alone or by combination of their Contribution(s) 82 | with the Work to which such Contribution(s) was submitted. If You 83 | institute patent litigation against any entity (including a 84 | cross-claim or counterclaim in a lawsuit) alleging that the Work 85 | or a Contribution incorporated within the Work constitutes direct 86 | or contributory patent infringement, then any patent licenses 87 | granted to You under this License for that Work shall terminate 88 | as of the date such litigation is filed. 89 | 90 | 4. Redistribution. 
You may reproduce and distribute copies of the 91 | Work or Derivative Works thereof in any medium, with or without 92 | modifications, and in Source or Object form, provided that You 93 | meet the following conditions: 94 | 95 | (a) You must give any other recipients of the Work or 96 | Derivative Works a copy of this License; and 97 | 98 | (b) You must cause any modified files to carry prominent notices 99 | stating that You changed the files; and 100 | 101 | (c) You must retain, in the Source form of any Derivative Works 102 | that You distribute, all copyright, patent, trademark, and 103 | attribution notices from the Source form of the Work, 104 | excluding those notices that do not pertain to any part of 105 | the Derivative Works; and 106 | 107 | (d) If the Work includes a "NOTICE" text file as part of its 108 | distribution, then any Derivative Works that You distribute must 109 | include a readable copy of the attribution notices contained 110 | within such NOTICE file, excluding those notices that do not 111 | pertain to any part of the Derivative Works, in at least one 112 | of the following places: within a NOTICE text file distributed 113 | as part of the Derivative Works; within the Source form or 114 | documentation, if provided along with the Derivative Works; or, 115 | within a display generated by the Derivative Works, if and 116 | wherever such third-party notices normally appear. The contents 117 | of the NOTICE file are for informational purposes only and 118 | do not modify the License. You may add Your own attribution 119 | notices within Derivative Works that You distribute, alongside 120 | or as an addendum to the NOTICE text from the Work, provided 121 | that such additional attribution notices cannot be construed 122 | as modifying the License. 
123 | 124 | You may add Your own copyright statement to Your modifications and 125 | may provide additional or different license terms and conditions 126 | for use, reproduction, or distribution of Your modifications, or 127 | for any such Derivative Works as a whole, provided Your use, 128 | reproduction, and distribution of the Work otherwise complies with 129 | the conditions stated in this License. 130 | 131 | 5. Submission of Contributions. Unless You explicitly state otherwise, 132 | any Contribution intentionally submitted for inclusion in the Work 133 | by You to the Licensor shall be under the terms and conditions of 134 | this License, without any additional terms or conditions. 135 | Notwithstanding the above, nothing herein shall supersede or modify 136 | the terms of any separate license agreement you may have executed 137 | with Licensor regarding such Contributions. 138 | 139 | 6. Trademarks. This License does not grant permission to use the trade 140 | names, trademarks, service marks, or product names of the Licensor, 141 | except as required for reasonable and customary use in describing the 142 | origin of the Work and reproducing the content of the NOTICE file. 143 | 144 | 7. Disclaimer of Warranty. Unless required by applicable law or 145 | agreed to in writing, Licensor provides the Work (and each 146 | Contributor provides its Contributions) on an "AS IS" BASIS, 147 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 148 | implied, including, without limitation, any warranties or conditions 149 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 150 | PARTICULAR PURPOSE. You are solely responsible for determining the 151 | appropriateness of using or redistributing the Work and assume any 152 | risks associated with Your exercise of permissions under this License. 153 | 154 | 8. Limitation of Liability. 
In no event and under no legal theory, 155 | whether in tort (including negligence), contract, or otherwise, 156 | unless required by applicable law (such as deliberate and grossly 157 | negligent acts) or agreed to in writing, shall any Contributor be 158 | liable to You for damages, including any direct, indirect, special, 159 | incidental, or consequential damages of any character arising as a 160 | result of this License or out of the use or inability to use the 161 | Work (including but not limited to damages for loss of goodwill, 162 | work stoppage, computer failure or malfunction, or any and all 163 | other commercial damages or losses), even if such Contributor 164 | has been advised of the possibility of such damages. 165 | 166 | 9. Accepting Warranty or Additional Liability. While redistributing 167 | the Work or Derivative Works thereof, You may choose to offer, 168 | and charge a fee for, acceptance of support, warranty, indemnity, 169 | or other liability obligations and/or rights consistent with this 170 | License. However, in accepting such obligations, You may act only 171 | on Your own behalf and on Your sole responsibility, not on behalf 172 | of any other Contributor, and only if You agree to indemnify, 173 | defend, and hold each Contributor harmless for any liability 174 | incurred by, or claims asserted against, such Contributor by reason 175 | of your accepting any such warranty or additional liability. 176 | 177 | END OF TERMS AND CONDITIONS 178 | 179 | APPENDIX: How to apply the Apache License to your work. 180 | 181 | To apply the Apache License to your work, attach the following 182 | boilerplate notice, with the fields enclosed by brackets "[]" 183 | replaced with your own identifying information. (Don't include 184 | the brackets!) The text should be enclosed in the appropriate 185 | comment syntax for the file format. 
We also recommend that a 186 | file or class name and description of purpose be included on the 187 | same "printed page" as the copyright notice for easier 188 | identification within third-party archives. 189 | 190 | Copyright [2017] [Florian Frank] 191 | 192 | Licensed under the Apache License, Version 2.0 (the "License"); 193 | you may not use this file except in compliance with the License. 194 | You may obtain a copy of the License at 195 | 196 | http://www.apache.org/licenses/LICENSE-2.0 197 | 198 | Unless required by applicable law or agreed to in writing, software 199 | distributed under the License is distributed on an "AS IS" BASIS, 200 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 201 | See the License for the specific language governing permissions and 202 | limitations under the License. 203 | 204 | -------------------------------------------------------------------------------- /Gemfile: -------------------------------------------------------------------------------- 1 | # vim: set filetype=ruby et sw=2 ts=2: 2 | 3 | source 'https://rubygems.org' 4 | 5 | gemspec 6 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # amatch - Approximate Matching Extension for Ruby 2 | 3 | ## Description 4 | 5 | This is a collection of classes that can be used for Approximate 6 | matching, searching, and comparing of Strings. They implement algorithms 7 | that compute the Levenshtein edit distance, Sellers edit distance, the 8 | Hamming distance, the longest common subsequence length, the longest common 9 | substring length, the pair distance metric, the Jaro-Winkler metric. 10 | 11 | ## Installation 12 | 13 | To install this extension as a gem type 14 | 15 | # gem install amatch 16 | 17 | into the shell. 
18 | 19 | ## Download 20 | 21 | The homepage of this library is located at 22 | 23 | * https://github.com/flori/amatch 24 | 25 | ## Examples 26 | 27 | require 'amatch' 28 | # => true 29 | include Amatch 30 | # => Object 31 | 32 | m = Sellers.new("pattern") 33 | # => # 34 | m.match("pattren") 35 | # => 2.0 36 | m.substitution = m.insertion = 3 37 | # => 3 38 | m.match("pattren") 39 | # => 4.0 40 | m.reset_weights 41 | # => # 42 | m.match(["pattren","parent"]) 43 | # => [2.0, 4.0] 44 | m.search("abcpattrendef") 45 | # => 2.0 46 | 47 | m = Levenshtein.new("pattern") 48 | # => # 49 | m.match("pattren") 50 | # => 2 51 | m.search("abcpattrendef") 52 | # => 2 53 | "pattern language".levenshtein_similar("language of patterns") 54 | # => 0.2 55 | 56 | m = Amatch::DamerauLevenshtein.new("pattern") 57 | # => # 58 | m.match("pattren") 59 | # => 1 60 | "pattern language".damerau_levenshtein_similar("language of patterns") 61 | # => 0.19999999999999996 62 | 63 | m = Hamming.new("pattern") 64 | # => # 65 | m.match("pattren") 66 | # => 2 67 | "pattern language".hamming_similar("language of patterns") 68 | # => 0.1 69 | 70 | m = PairDistance.new("pattern") 71 | # => # 72 | m.match("pattr en") 73 | # => 0.545454545454545 74 | m.match("pattr en", nil) 75 | # => 0.461538461538462 76 | m.match("pattr en", /t+/) 77 | # => 0.285714285714286 78 | "pattern language".pair_distance_similar("language of patterns") 79 | # => 0.928571428571429 80 | 81 | m = LongestSubsequence.new("pattern") 82 | # => # 83 | m.match("pattren") 84 | # => 6 85 | "pattern language".longest_subsequence_similar("language of patterns") 86 | # => 0.4 87 | 88 | m = LongestSubstring.new("pattern") 89 | # => # 90 | m.match("pattren") 91 | # => 4 92 | "pattern language".longest_substring_similar("language of patterns") 93 | # => 0.4 94 | 95 | m = Jaro.new("pattern") 96 | # => # 97 | m.match("paTTren") 98 | # => 0.952380952380952 99 | m.ignore_case = false 100 | m.match("paTTren") 101 | # => 0.742857142857143 102 | "pattern 
language".jaro_similar("language of patterns") 103 | # => 0.672222222222222 104 | 105 | m = JaroWinkler.new("pattern") 106 | # # 107 | m.match("paTTren") 108 | # => 0.971428571712403 109 | m.ignore_case = false 110 | m.match("paTTren") 111 | # => 0.79428571505206 112 | m.scaling_factor = 0.05 113 | m.match("pattren") 114 | # => 0.961904762046678 115 | "pattern language".jarowinkler_similar("language of patterns") 116 | # => 0.672222222222222 117 | 118 | ## Author 119 | 120 | Florian Frank mailto:flori@ping.de 121 | 122 | ## License 123 | 124 | Apache License, Version 2.0 – See the COPYING file in the source archive. 125 | -------------------------------------------------------------------------------- /Rakefile: -------------------------------------------------------------------------------- 1 | # vim: set filetype=ruby et sw=2 ts=2: 2 | 3 | require 'gem_hadar' 4 | 5 | GemHadar do 6 | name 'amatch' 7 | author 'Florian Frank' 8 | email 'flori@ping.de' 9 | homepage "http://github.com/flori/#{name}" 10 | summary 'Approximate String Matching library' 11 | description <1.0' 26 | dependency 'mize' 27 | development_dependency 'test-unit', '~>3.0' 28 | development_dependency 'all_images' 29 | required_ruby_version '>=2.4' 30 | licenses << 'Apache-2.0' 31 | end 32 | -------------------------------------------------------------------------------- /VERSION: -------------------------------------------------------------------------------- 1 | 0.4.1 2 | -------------------------------------------------------------------------------- /amatch.gemspec: -------------------------------------------------------------------------------- 1 | # -*- encoding: utf-8 -*- 2 | # stub: amatch 0.4.1 ruby libext 3 | # stub: ext/extconf.rb 4 | 5 | Gem::Specification.new do |s| 6 | s.name = "amatch".freeze 7 | s.version = "0.4.1".freeze 8 | 9 | s.required_rubygems_version = Gem::Requirement.new(">= 0".freeze) if s.respond_to? 
:required_rubygems_version= 10 | s.require_paths = ["lib".freeze, "ext".freeze] 11 | s.authors = ["Florian Frank".freeze] 12 | s.date = "2024-08-31" 13 | s.description = "Amatch is a library for approximate string matching and searching in strings.\nSeveral algorithms can be used to do this, and it's also possible to compute a\nsimilarity metric number between 0.0 and 1.0 for two given strings.\n".freeze 14 | s.email = "flori@ping.de".freeze 15 | s.executables = ["agrep".freeze, "dupfind".freeze] 16 | s.extensions = ["ext/extconf.rb".freeze] 17 | s.extra_rdoc_files = ["README.md".freeze, "lib/amatch.rb".freeze, "lib/amatch/polite.rb".freeze, "lib/amatch/rude.rb".freeze, "lib/amatch/version.rb".freeze, "ext/amatch_ext.c".freeze, "ext/pair.c".freeze] 18 | s.files = ["CHANGES.md".freeze, "COPYING".freeze, "Gemfile".freeze, "README.md".freeze, "Rakefile".freeze, "amatch.gemspec".freeze, "bin/agrep".freeze, "bin/dupfind".freeze, "ext/amatch_ext.c".freeze, "ext/common.h".freeze, "ext/extconf.rb".freeze, "ext/pair.c".freeze, "ext/pair.h".freeze, "images/amatch_ext.png".freeze, "install.rb".freeze, "lib/amatch.rb".freeze, "lib/amatch/.keep".freeze, "lib/amatch/polite.rb".freeze, "lib/amatch/rude.rb".freeze, "lib/amatch/version.rb".freeze, "tests/test_damerau_levenshtein.rb".freeze, "tests/test_hamming.rb".freeze, "tests/test_jaro.rb".freeze, "tests/test_jaro_winkler.rb".freeze, "tests/test_levenshtein.rb".freeze, "tests/test_longest_subsequence.rb".freeze, "tests/test_longest_substring.rb".freeze, "tests/test_pair_distance.rb".freeze, "tests/test_sellers.rb".freeze] 19 | s.homepage = "http://github.com/flori/amatch".freeze 20 | s.licenses = ["Apache-2.0".freeze] 21 | s.rdoc_options = ["--title".freeze, "Amatch - Approximate Matching".freeze, "--main".freeze, "README.md".freeze] 22 | s.required_ruby_version = Gem::Requirement.new(">= 2.4".freeze) 23 | s.rubygems_version = "3.5.16".freeze 24 | s.summary = "Approximate String Matching library".freeze 25 | s.test_files = 
["tests/test_damerau_levenshtein.rb".freeze, "tests/test_hamming.rb".freeze, "tests/test_jaro.rb".freeze, "tests/test_jaro_winkler.rb".freeze, "tests/test_levenshtein.rb".freeze, "tests/test_longest_subsequence.rb".freeze, "tests/test_longest_substring.rb".freeze, "tests/test_pair_distance.rb".freeze, "tests/test_sellers.rb".freeze] 26 | 27 | s.specification_version = 4 28 | 29 | s.add_development_dependency(%q.freeze, ["~> 1.17.0".freeze]) 30 | s.add_development_dependency(%q.freeze, ["~> 3.0".freeze]) 31 | s.add_development_dependency(%q.freeze, [">= 0".freeze]) 32 | s.add_runtime_dependency(%q.freeze, ["~> 1.0".freeze]) 33 | s.add_runtime_dependency(%q.freeze, [">= 0".freeze]) 34 | end 35 | -------------------------------------------------------------------------------- /bin/agrep: -------------------------------------------------------------------------------- 1 | #! /usr/bin/env ruby 2 | 3 | require 'amatch' 4 | require 'getoptlong' 5 | 6 | def usage(msg, options) 7 | puts msg, "Usage: #{File.basename($0)} [OPTIONS] PATTERN [FILE ...]", "" 8 | options.each do |o| 9 | puts " " + o[1] + ", " + o[0] + " " + 10 | (o[2] == GetoptLong::REQUIRED_ARGUMENT ? 'ARGUMENT' : '') 11 | end 12 | puts "\nReport bugs to ." 
13 | exit 0 14 | end 15 | 16 | class Amatch::Levenshtein 17 | def search_relative(strings) 18 | if Array === strings 19 | search(strings).map { |s| s.to_f / pattern.size } 20 | else 21 | search(strings).to_f / pattern.size 22 | end 23 | end 24 | end 25 | 26 | $algorithm = 'Levenshtein' 27 | $distance = 1 28 | $mode = :search 29 | begin 30 | parser = GetoptLong.new 31 | options = [ 32 | [ '--algorithm', '-a', GetoptLong::REQUIRED_ARGUMENT ], 33 | [ '--distance', '-d', GetoptLong::REQUIRED_ARGUMENT ], 34 | [ '--relative', '-r', GetoptLong::NO_ARGUMENT ], 35 | [ '--verbose', '-v', GetoptLong::NO_ARGUMENT ], 36 | [ '--help', '-h', GetoptLong::NO_ARGUMENT ], 37 | ] 38 | parser.set_options(*options) 39 | parser.each_option do |name, arg| 40 | name = name.sub(/^--/, '') 41 | case name 42 | when 'algorithm' 43 | $algorithm = arg 44 | when 'distance' 45 | $distance = arg.to_f 46 | when 'relative' 47 | $mode = :search_relative 48 | when 'verbose' 49 | $verbose = 1 50 | when 'help' 51 | usage('You\'ve asked for it!', options) 52 | end 53 | end 54 | rescue 55 | exit 1 56 | end 57 | pattern = ARGV.shift or usage('Pattern needed!', options) 58 | 59 | matcher = Amatch.const_get($algorithm).new(pattern) 60 | size = 0 61 | start = Time.new 62 | if ARGV.size > 0 then 63 | ARGV.each do |filename| 64 | File.stat(filename).file? or next 65 | size += File.size(filename) 66 | begin 67 | File.open(filename, 'r').each_line.each_slice(1000) do |lines| 68 | results = matcher.__send__($mode, lines) 69 | lines.zip(results) do |line, r| 70 | if r <= $distance 71 | puts "#{filename}:#{line}" 72 | end 73 | end 74 | end 75 | rescue 76 | STDERR.puts "Failure at #{filename}: #{$!} => Skipping!" 
77 | end 78 | end 79 | else 80 | STDIN.each_line.each_slice(1000) do |lines| 81 | size += lines.size 82 | results = matcher.__send__($mode, lines) 83 | lines.zip(results) do |line, r| 84 | if r <= $distance 85 | puts line 86 | end 87 | end 88 | end 89 | end 90 | time = Time.new - start 91 | $verbose and STDERR.printf "%.3f secs running, scanned %.3f KB/s.\n", 92 | time, size / time / 1024 93 | exit 0 94 | -------------------------------------------------------------------------------- /bin/dupfind: -------------------------------------------------------------------------------- 1 | #! /usr/bin/env ruby 2 | 3 | require 'tins/go' 4 | include Tins::GO 5 | require 'tins/minimize' 6 | class Array 7 | include Tins::Minimize 8 | end 9 | require 'amatch' 10 | begin 11 | require 'infobar' 12 | rescue LoadError 13 | warn "Please install gem infobar to run this executable!" 14 | exit 1 15 | end 16 | 17 | def usage 18 | puts <. 28 | EOT 29 | exit 0 30 | end 31 | 32 | class FindDuplicates 33 | def initialize(algo, p_lim, filename) 34 | @algo, @p_lim, @filename = algo, p_lim, filename 35 | end 36 | 37 | attr_reader :filename 38 | 39 | attr_reader :algo 40 | 41 | attr_reader :p_lim 42 | 43 | memoize method: 44 | def lines 45 | File.readlines(filename) 46 | end 47 | 48 | memoize method: 49 | def matrix 50 | result = lines.with_infobar(label: filename, output: STDERR).map do |l1| 51 | +infobar 52 | a = algo.new(l1) 53 | r = a.similar(lines) 54 | r.map! { |s| s >= p_lim ? ?1 : ?0 } 55 | r.join 56 | end 57 | infobar.finish 58 | infobar.newline 59 | result 60 | end 61 | 62 | def pbm(output: $>) 63 | output << <
) 72 | IO.popen("pnmtopng", 'w+') do |conv| 73 | pbm(output: conv) 74 | conv.close_write 75 | output.write(conv.read) 76 | end 77 | self 78 | end 79 | 80 | def create_image 81 | suffix = Regexp.quote(File.extname(filename)) 82 | f = filename.sub(/(#{suffix}|)\z/, '.png') 83 | File.open(f, 'wb') do |output| 84 | png(output: output) 85 | infobar.puts "Writing output to #{f.inspect}." 86 | end 87 | self 88 | end 89 | 90 | def similar_ranges(min_range: 3, skip_range: 0) 91 | set = 0 92 | ranges = { set => [] } 93 | m = matrix 94 | n = m.size 95 | skip_count = 0 96 | n.downto(1) do |h| 97 | (n - h + 1).upto(n - 1) do |k| 98 | i = k 99 | j = k - (n - h + 1) 100 | if m[i][j] == ?1 101 | skip_count = 0 102 | ranges[set] << [ i, j ] 103 | elsif !ranges[set].empty? && skip_count < skip_range 104 | skip_count += 1 105 | else 106 | skip_count = 0 107 | ranges[set].empty? or ranges[set += 1] = [] 108 | end 109 | end 110 | skip_count = 0 111 | ranges[set].empty? or ranges[set += 1] = [] 112 | end 113 | ranges.each { |_, r| 114 | r.flatten! 115 | r.sort! 116 | r.map! { |x| x + 1 } 117 | r.minimize! 118 | r.reject! { |s| s.size < min_range } 119 | }.reject! { |_, r| r.empty? } 120 | unions = [] 121 | while !ranges.empty? 122 | _, r = ranges.first 123 | equivalent = ranges.reject { |_, v| (v & r).empty? } 124 | unions << equivalent.values.flatten.uniq 125 | ranges.delete_if { |k, _| equivalent.keys.include?(k) } 126 | end 127 | unions.each do |r| 128 | r.map! do |x| 129 | "#{filename}:#{x.begin}-#{x.end}" 130 | end 131 | end 132 | unions 133 | end 134 | end 135 | 136 | opts = go 'a:p:R:r:ih' 137 | 138 | usage if opts[?h] 139 | algo = Amatch.const_get(opts[?a] || 'Levenshtein') 140 | p_lim = (opts[?p] || 0.95).to_f 141 | min_range = (opts[?r] || 3).to_i 142 | skip_range = opts[?R].to_i 143 | ARGV.empty? 
and usage 144 | 145 | filenames = ARGV.inject([]) { |s, f| s.concat(Dir[f]) } 146 | for filename in filenames 147 | finder = FindDuplicates.new(algo, p_lim, filename) 148 | opts[?i] and finder.create_image 149 | for s in finder.similar_ranges(min_range: min_range, skip_range: skip_range) 150 | infobar.reset 151 | puts s, ?\n 152 | end 153 | end 154 | -------------------------------------------------------------------------------- /ext/amatch_ext.c: -------------------------------------------------------------------------------- 1 | #include "ruby.h" 2 | #include "pair.h" 3 | #include 4 | #include "common.h" 5 | 6 | static VALUE rb_mAmatch, rb_mAmatchStringMethods, rb_cLevenshtein, 7 | rb_cDamerauLevenshtein, rb_cSellers, rb_cHamming, 8 | rb_cPairDistance, rb_cLongestSubsequence, rb_cLongestSubstring, 9 | rb_cJaro, rb_cJaroWinkler; 10 | 11 | static ID id_split, id_to_f; 12 | 13 | #define GET_STRUCT(klass) \ 14 | klass *amatch; \ 15 | Data_Get_Struct(self, klass, amatch); 16 | 17 | #define DEF_ALLOCATOR(type) \ 18 | static type *type##_allocate() \ 19 | { \ 20 | type *obj = ALLOC(type); \ 21 | MEMZERO(obj, type, 1); \ 22 | return obj; \ 23 | } 24 | 25 | #define DEF_CONSTRUCTOR(klass, type) \ 26 | static VALUE rb_##klass##_s_allocate(VALUE klass2) \ 27 | { \ 28 | type *amatch = type##_allocate(); \ 29 | return Data_Wrap_Struct(klass2, NULL, rb_##klass##_free, amatch); \ 30 | } \ 31 | VALUE rb_##klass##_new(VALUE klass2, VALUE pattern) \ 32 | { \ 33 | VALUE obj = rb_##klass##_s_allocate(klass2); \ 34 | rb_##klass##_initialize(obj, pattern); \ 35 | return obj; \ 36 | } 37 | 38 | #define DEF_RB_FREE(klass, type) \ 39 | static void rb_##klass##_free(type *amatch) \ 40 | { \ 41 | MEMZERO(amatch->pattern, char, amatch->pattern_len); \ 42 | xfree(amatch->pattern); \ 43 | MEMZERO(amatch, type, 1); \ 44 | xfree(amatch); \ 45 | } 46 | 47 | #define DEF_PATTERN_ACCESSOR(type) \ 48 | static void type##_pattern_set(type *amatch, VALUE pattern) \ 49 | { \ 50 | Check_Type(pattern, 
T_STRING); \ 51 | xfree(amatch->pattern); \ 52 | amatch->pattern_len = (int) RSTRING_LEN(pattern); \ 53 | amatch->pattern = ALLOC_N(char, amatch->pattern_len); \ 54 | MEMCPY(amatch->pattern, RSTRING_PTR(pattern), char, \ 55 | RSTRING_LEN(pattern)); \ 56 | } \ 57 | static VALUE rb_##type##_pattern(VALUE self) \ 58 | { \ 59 | GET_STRUCT(type) \ 60 | return rb_str_new(amatch->pattern, amatch->pattern_len); \ 61 | } \ 62 | static VALUE rb_##type##_pattern_set(VALUE self, VALUE pattern) \ 63 | { \ 64 | GET_STRUCT(type) \ 65 | type##_pattern_set(amatch, pattern); \ 66 | return Qnil; \ 67 | } 68 | 69 | #define DEF_ITERATE_STRINGS(type) \ 70 | static VALUE type##_iterate_strings(type *amatch, VALUE strings, \ 71 | VALUE (*match_function) (type *amatch, VALUE strings)) \ 72 | { \ 73 | if (TYPE(strings) == T_STRING) { \ 74 | return match_function(amatch, strings); \ 75 | } else { \ 76 | int i; \ 77 | VALUE result; \ 78 | Check_Type(strings, T_ARRAY); \ 79 | result = rb_ary_new2(RARRAY_LEN(strings)); \ 80 | for (i = 0; i < RARRAY_LEN(strings); i++) { \ 81 | VALUE string = rb_ary_entry(strings, i); \ 82 | if (TYPE(string) != T_STRING) { \ 83 | rb_raise(rb_eTypeError, \ 84 | "array has to contain only strings (%s given)", \ 85 | NIL_P(string) ? 
\ 86 | "NilClass" : \ 87 | rb_class2name(CLASS_OF(string))); \ 88 | } \ 89 | rb_ary_push(result, match_function(amatch, string)); \ 90 | } \ 91 | return result; \ 92 | } \ 93 | } 94 | 95 | #define DEF_RB_READER(type, function, name, converter) \ 96 | VALUE function(VALUE self) \ 97 | { \ 98 | GET_STRUCT(type) \ 99 | return converter(amatch->name); \ 100 | } 101 | 102 | #define DEF_RB_WRITER(type, function, name, vtype, caster, converter, check)\ 103 | VALUE function(VALUE self, VALUE value) \ 104 | { \ 105 | vtype value_ ## vtype; \ 106 | GET_STRUCT(type) \ 107 | caster(value); \ 108 | value_ ## vtype = converter(value); \ 109 | if (!(value_ ## vtype check)) \ 110 | rb_raise(rb_eTypeError, "check of value " #check " failed"); \ 111 | amatch->name = value_ ## vtype; \ 112 | return Qnil; \ 113 | } 114 | 115 | 116 | #define CAST2FLOAT(obj) \ 117 | if (TYPE(obj) != T_FLOAT && rb_respond_to(obj, id_to_f)) \ 118 | obj = rb_funcall(obj, id_to_f, 0, 0); \ 119 | else \ 120 | Check_Type(obj, T_FLOAT) 121 | #define FLOAT2C(obj) (RFLOAT_VALUE(obj)) 122 | 123 | #define CAST2BOOL(obj) \ 124 | if (obj == Qfalse || obj == Qnil) \ 125 | obj = Qfalse; \ 126 | else \ 127 | obj = Qtrue; 128 | #define BOOL2C(obj) (obj == Qtrue) 129 | #define C2BOOL(obj) (obj ? 
Qtrue : Qfalse) 130 | 131 | #define OPTIMIZE_TIME \ 132 | if (amatch->pattern_len < RSTRING_LEN(string)) { \ 133 | a_ptr = amatch->pattern; \ 134 | a_len = (int) amatch->pattern_len; \ 135 | b_ptr = RSTRING_PTR(string); \ 136 | b_len = (int) RSTRING_LEN(string); \ 137 | } else { \ 138 | a_ptr = RSTRING_PTR(string); \ 139 | a_len = (int) RSTRING_LEN(string); \ 140 | b_ptr = amatch->pattern; \ 141 | b_len = (int) amatch->pattern_len; \ 142 | } 143 | 144 | #define DONT_OPTIMIZE \ 145 | a_ptr = amatch->pattern; \ 146 | a_len = (int) amatch->pattern_len; \ 147 | b_ptr = RSTRING_PTR(string); \ 148 | b_len = (int) RSTRING_LEN(string); \ 149 | 150 | /* 151 | * C structures of the Amatch classes 152 | */ 153 | 154 | typedef struct GeneralStruct { 155 | char *pattern; 156 | int pattern_len; 157 | } General; 158 | 159 | DEF_ALLOCATOR(General) 160 | DEF_PATTERN_ACCESSOR(General) 161 | DEF_ITERATE_STRINGS(General) 162 | 163 | typedef struct SellersStruct { 164 | char *pattern; 165 | int pattern_len; 166 | double substitution; 167 | double deletion; 168 | double insertion; 169 | } Sellers; 170 | 171 | DEF_ALLOCATOR(Sellers) 172 | DEF_PATTERN_ACCESSOR(Sellers) 173 | DEF_ITERATE_STRINGS(Sellers) 174 | 175 | static void Sellers_reset_weights(Sellers *self) 176 | { 177 | self->substitution = 1.0; 178 | self->deletion = 1.0; 179 | self->insertion = 1.0; 180 | } 181 | 182 | typedef struct PairDistanceStruct { 183 | char *pattern; 184 | int pattern_len; 185 | PairArray *pattern_pair_array; 186 | } PairDistance; 187 | 188 | DEF_ALLOCATOR(PairDistance) 189 | DEF_PATTERN_ACCESSOR(PairDistance) 190 | 191 | typedef struct JaroStruct { 192 | char *pattern; 193 | int pattern_len; 194 | int ignore_case; 195 | } Jaro; 196 | 197 | DEF_ALLOCATOR(Jaro) 198 | DEF_PATTERN_ACCESSOR(Jaro) 199 | DEF_ITERATE_STRINGS(Jaro) 200 | 201 | typedef struct JaroWinklerStruct { 202 | char *pattern; 203 | int pattern_len; 204 | int ignore_case; 205 | double scaling_factor; 206 | } JaroWinkler; 207 | 208 | 
DEF_ALLOCATOR(JaroWinkler) 209 | DEF_PATTERN_ACCESSOR(JaroWinkler) 210 | DEF_ITERATE_STRINGS(JaroWinkler) 211 | 212 | /* 213 | * Levenshtein edit distances are computed here: 214 | */ 215 | 216 | #define COMPUTE_LEVENSHTEIN_DISTANCE \ 217 | c = 0; \ 218 | p = 0; \ 219 | for (i = 1; i <= a_len; i++) { \ 220 | c = i % 2; /* current row */ \ 221 | p = (i - 1) % 2; /* previous row */ \ 222 | v[c][0] = i; /* first column */ \ 223 | for (j = 1; j <= b_len; j++) { \ 224 | /* Bellman's principle of optimality: */ \ 225 | weight = v[p][j - 1] + (a_ptr[i - 1] == b_ptr[j - 1] ? 0 : 1); \ 226 | if (weight > v[p][j] + 1) { \ 227 | weight = v[p][j] + 1; \ 228 | } \ 229 | if (weight > v[c][j - 1] + 1) { \ 230 | weight = v[c][j - 1] + 1; \ 231 | } \ 232 | v[c][j] = weight; \ 233 | } \ 234 | } 235 | 236 | static VALUE Levenshtein_match(General *amatch, VALUE string) 237 | { 238 | VALUE result; 239 | char *a_ptr, *b_ptr; 240 | int a_len, b_len; 241 | int *v[2], weight; 242 | int i, j, c, p; 243 | 244 | Check_Type(string, T_STRING); 245 | DONT_OPTIMIZE 246 | 247 | v[0] = ALLOC_N(int, b_len + 1); 248 | v[1] = ALLOC_N(int, b_len + 1); 249 | for (i = 0; i <= b_len; i++) { 250 | v[0][i] = i; 251 | v[1][i] = i; 252 | } 253 | 254 | COMPUTE_LEVENSHTEIN_DISTANCE 255 | 256 | result = INT2FIX(v[c][b_len]); 257 | 258 | xfree(v[0]); 259 | xfree(v[1]); 260 | 261 | return result; 262 | } 263 | 264 | static VALUE Levenshtein_similar(General *amatch, VALUE string) 265 | { 266 | VALUE result; 267 | char *a_ptr, *b_ptr; 268 | int a_len, b_len; 269 | int *v[2], weight; 270 | int i, j, c, p; 271 | 272 | Check_Type(string, T_STRING); 273 | DONT_OPTIMIZE 274 | 275 | if (a_len == 0 && b_len == 0) return rb_float_new(1.0); 276 | if (a_len == 0 || b_len == 0) return rb_float_new(0.0); 277 | v[0] = ALLOC_N(int, b_len + 1); 278 | v[1] = ALLOC_N(int, b_len + 1); 279 | for (i = 0; i <= b_len; i++) { 280 | v[0][i] = i; 281 | v[1][i] = i; 282 | } 283 | 284 | COMPUTE_LEVENSHTEIN_DISTANCE 285 | 286 | if (b_len > 
a_len) { 287 | result = rb_float_new(1.0 - ((double) v[c][b_len]) / b_len); 288 | } else { 289 | result = rb_float_new(1.0 - ((double) v[c][b_len]) / a_len); 290 | } 291 | 292 | xfree(v[0]); 293 | xfree(v[1]); 294 | 295 | return result; 296 | } 297 | 298 | static VALUE Levenshtein_search(General *amatch, VALUE string) 299 | { 300 | VALUE result; 301 | char *a_ptr, *b_ptr; 302 | int a_len, b_len; 303 | int *v[2], weight, min; 304 | int i, j, c, p; 305 | 306 | Check_Type(string, T_STRING); 307 | DONT_OPTIMIZE 308 | 309 | v[0] = ALLOC_N(int, b_len + 1); 310 | v[1] = ALLOC_N(int, b_len + 1); 311 | MEMZERO(v[0], int, b_len + 1); 312 | MEMZERO(v[1], int, b_len + 1); 313 | 314 | COMPUTE_LEVENSHTEIN_DISTANCE 315 | 316 | for (i = 0, min = a_len; i <= b_len; i++) { 317 | if (v[c][i] < min) min = v[c][i]; 318 | } 319 | 320 | result = INT2FIX(min); 321 | 322 | xfree(v[0]); 323 | xfree(v[1]); 324 | 325 | return result; 326 | } 327 | 328 | /* 329 | * DamerauLevenshtein edit distances are computed here: 330 | */ 331 | 332 | #define COMPUTE_DAMERAU_LEVENSHTEIN_DISTANCE \ 333 | c = 0; \ 334 | p = 0; \ 335 | pp = 0; \ 336 | for (i = 1; i <= a_len; i++) { \ 337 | c = i % 3; /* current row */ \ 338 | p = (i - 1) % 3; /* previous row */ \ 339 | pp = (i - 2) % 3; /* previous previous row */ \ 340 | v[c][0] = i; /* first column */ \ 341 | for (j = 1; j <= b_len; j++) { \ 342 | /* Bellman's principle of optimality: */ \ 343 | weight = v[p][j - 1] + (a_ptr[i - 1] == b_ptr[j - 1] ? 0 : 1); \ 344 | if (weight > v[p][j] + 1) { \ 345 | weight = v[p][j] + 1; \ 346 | } \ 347 | if (weight > v[c][j - 1] + 1) { \ 348 | weight = v[c][j - 1] + 1; \ 349 | } \ 350 | if (i > 2 && j > 2 && a_ptr[i - 1] == b_ptr[j - 2] && a_ptr[i - 2] == b_ptr[j - 1]) {\ 351 | if (weight > v[pp][j - 2]) { \ 352 | weight = v[pp][j - 2] + (a_ptr[i - 1] == b_ptr[j - 1] ? 
0 : 1); \ 353 | } \ 354 | } \ 355 | v[c][j] = weight; \ 356 | } \ 357 | } 358 | 359 | static VALUE DamerauLevenshtein_match(General *amatch, VALUE string) 360 | { 361 | VALUE result; 362 | char *a_ptr, *b_ptr; 363 | int a_len, b_len; 364 | int *v[3], weight; 365 | int i, j, c, p, pp; 366 | 367 | Check_Type(string, T_STRING); 368 | DONT_OPTIMIZE 369 | 370 | v[0] = ALLOC_N(int, b_len + 1); 371 | v[1] = ALLOC_N(int, b_len + 1); 372 | v[2] = ALLOC_N(int, b_len + 1); 373 | for (i = 0; i <= b_len; i++) { 374 | v[0][i] = i; 375 | v[1][i] = i; 376 | v[2][i] = i; 377 | } 378 | 379 | COMPUTE_DAMERAU_LEVENSHTEIN_DISTANCE 380 | 381 | result = INT2FIX(v[c][b_len]); 382 | 383 | xfree(v[0]); 384 | xfree(v[1]); 385 | xfree(v[2]); 386 | 387 | return result; 388 | } 389 | 390 | static VALUE DamerauLevenshtein_similar(General *amatch, VALUE string) 391 | { 392 | VALUE result; 393 | char *a_ptr, *b_ptr; 394 | int a_len, b_len; 395 | int *v[3], weight; 396 | int i, j, c, p, pp; 397 | 398 | Check_Type(string, T_STRING); 399 | DONT_OPTIMIZE 400 | 401 | if (a_len == 0 && b_len == 0) return rb_float_new(1.0); 402 | if (a_len == 0 || b_len == 0) return rb_float_new(0.0); 403 | v[0] = ALLOC_N(int, b_len + 1); 404 | v[1] = ALLOC_N(int, b_len + 1); 405 | v[2] = ALLOC_N(int, b_len + 1); 406 | for (i = 0; i <= b_len; i++) { 407 | v[0][i] = i; 408 | v[1][i] = i; 409 | v[2][i] = i; 410 | } 411 | 412 | COMPUTE_DAMERAU_LEVENSHTEIN_DISTANCE 413 | 414 | if (b_len > a_len) { 415 | result = rb_float_new(1.0 - ((double) v[c][b_len]) / b_len); 416 | } else { 417 | result = rb_float_new(1.0 - ((double) v[c][b_len]) / a_len); 418 | } 419 | 420 | xfree(v[0]); 421 | xfree(v[1]); 422 | xfree(v[2]); 423 | 424 | return result; 425 | } 426 | 427 | static VALUE DamerauLevenshtein_search(General *amatch, VALUE string) 428 | { 429 | VALUE result; 430 | char *a_ptr, *b_ptr; 431 | int a_len, b_len; 432 | int *v[3], weight, min; 433 | int i, j, c, p, pp; 434 | 435 | Check_Type(string, T_STRING); 436 | DONT_OPTIMIZE 437 
| 438 | v[0] = ALLOC_N(int, b_len + 1); 439 | v[1] = ALLOC_N(int, b_len + 1); 440 | v[2] = ALLOC_N(int, b_len + 1); 441 | MEMZERO(v[0], int, b_len + 1); 442 | MEMZERO(v[1], int, b_len + 1); 443 | MEMZERO(v[2], int, b_len + 1); 444 | 445 | COMPUTE_DAMERAU_LEVENSHTEIN_DISTANCE 446 | 447 | for (i = 0, min = a_len; i <= b_len; i++) { 448 | if (v[c][i] < min) min = v[c][i]; 449 | } 450 | 451 | result = INT2FIX(min); 452 | 453 | xfree(v[0]); 454 | xfree(v[1]); 455 | xfree(v[2]); 456 | 457 | return result; 458 | } 459 | 460 | /* 461 | * Sellers edit distances are computed here: 462 | */ 463 | 464 | #define COMPUTE_SELLERS_DISTANCE \ 465 | c = 0; \ 466 | p = 0; \ 467 | for (i = 1; i <= a_len; i++) { \ 468 | c = i % 2; /* current row */ \ 469 | p = (i - 1) % 2; /* previous row */ \ 470 | v[c][0] = i * amatch->deletion; /* first column */ \ 471 | for (j = 1; j <= b_len; j++) { \ 472 | /* Bellman's principle of optimality: */ \ 473 | weight = v[p][j - 1] + \ 474 | (a_ptr[i - 1] == b_ptr[j - 1] ? 0 : amatch->substitution); \ 475 | if (weight > v[p][j] + amatch->insertion) { \ 476 | weight = v[p][j] + amatch->insertion; \ 477 | } \ 478 | if (weight > v[c][j - 1] + amatch->deletion) { \ 479 | weight = v[c][j - 1] + amatch->deletion; \ 480 | } \ 481 | v[c][j] = weight; \ 482 | } \ 483 | p = c; \ 484 | } 485 | 486 | static VALUE Sellers_match(Sellers *amatch, VALUE string) 487 | { 488 | VALUE result; 489 | char *a_ptr, *b_ptr; 490 | int a_len, b_len; 491 | double *v[2], weight; 492 | int i, j, c, p; 493 | 494 | Check_Type(string, T_STRING); 495 | DONT_OPTIMIZE 496 | 497 | v[0] = ALLOC_N(double, b_len + 1); 498 | v[1] = ALLOC_N(double, b_len + 1); 499 | for (i = 0; i <= b_len; i++) { 500 | v[0][i] = i * amatch->deletion; 501 | v[1][i] = i * amatch->deletion; 502 | } 503 | 504 | COMPUTE_SELLERS_DISTANCE 505 | 506 | result = rb_float_new(v[p][b_len]); 507 | xfree(v[0]); 508 | xfree(v[1]); 509 | return result; 510 | } 511 | 512 | static VALUE Sellers_similar(Sellers *amatch, VALUE 
string) 513 | { 514 | VALUE result; 515 | char *a_ptr, *b_ptr; 516 | int a_len, b_len; 517 | double *v[2], weight, max_weight; 518 | int i, j, c, p; 519 | 520 | if (amatch->insertion >= amatch->deletion) { 521 | if (amatch->substitution >= amatch->insertion) { 522 | max_weight = amatch->substitution; 523 | } else { 524 | max_weight = amatch->insertion; 525 | } 526 | } else { 527 | if (amatch->substitution >= amatch->deletion) { 528 | max_weight = amatch->substitution; 529 | } else { 530 | max_weight = amatch->deletion; 531 | } 532 | } 533 | 534 | Check_Type(string, T_STRING); 535 | DONT_OPTIMIZE 536 | 537 | if (a_len == 0 && b_len == 0) return rb_float_new(1.0); 538 | if (a_len == 0 || b_len == 0) return rb_float_new(0.0); 539 | v[0] = ALLOC_N(double, b_len + 1); 540 | v[1] = ALLOC_N(double, b_len + 1); 541 | for (i = 0; i <= b_len; i++) { 542 | v[0][i] = i * amatch->deletion; 543 | v[1][i] = i * amatch->deletion; 544 | } 545 | 546 | COMPUTE_SELLERS_DISTANCE 547 | 548 | if (b_len > a_len) { 549 | result = rb_float_new(1.0 - v[p][b_len] / (b_len * max_weight)); 550 | } else { 551 | result = rb_float_new(1.0 - v[p][b_len] / (a_len * max_weight)); 552 | } 553 | xfree(v[0]); 554 | xfree(v[1]); 555 | return result; 556 | } 557 | 558 | static VALUE Sellers_search(Sellers *amatch, VALUE string) 559 | { 560 | VALUE result; 561 | char *a_ptr, *b_ptr; 562 | int a_len, b_len; 563 | double *v[2], weight, min; 564 | int i, j, c, p; 565 | 566 | Check_Type(string, T_STRING); 567 | DONT_OPTIMIZE 568 | 569 | v[0] = ALLOC_N(double, b_len + 1); 570 | v[1] = ALLOC_N(double, b_len + 1); 571 | MEMZERO(v[0], double, b_len + 1); 572 | MEMZERO(v[1], double, b_len + 1); 573 | 574 | COMPUTE_SELLERS_DISTANCE 575 | 576 | for (i = 0, min = a_len; i <= b_len; i++) { 577 | if (v[p][i] < min) min = v[p][i]; 578 | } 579 | result = rb_float_new(min); 580 | xfree(v[0]); 581 | xfree(v[1]); 582 | 583 | return result; 584 | } 585 | 586 | /* 587 | * Pair distances are computed here: 588 | */ 589 | 590 | 
static VALUE PairDistance_match(PairDistance *amatch, VALUE string, VALUE regexp, int use_regexp) 591 | { 592 | double result; 593 | VALUE string_tokens, tokens; 594 | PairArray *pattern_pair_array, *pair_array; 595 | 596 | Check_Type(string, T_STRING); 597 | if (!NIL_P(regexp) || use_regexp) { 598 | tokens = rb_funcall( 599 | rb_str_new(amatch->pattern, amatch->pattern_len), 600 | id_split, 1, regexp 601 | ); 602 | string_tokens = rb_funcall(string, id_split, 1, regexp); 603 | } else { 604 | VALUE tmp = rb_str_new(amatch->pattern, amatch->pattern_len); 605 | tokens = rb_ary_new4(1, &tmp); 606 | string_tokens = rb_ary_new4(1, &string); 607 | } 608 | 609 | if (!amatch->pattern_pair_array) { 610 | pattern_pair_array = PairArray_new(tokens); 611 | amatch->pattern_pair_array = pattern_pair_array; 612 | } else { 613 | pattern_pair_array = amatch->pattern_pair_array; 614 | pair_array_reactivate(amatch->pattern_pair_array); 615 | } 616 | pair_array = PairArray_new(string_tokens); 617 | 618 | result = pair_array_match(pattern_pair_array, pair_array); 619 | pair_array_destroy(pair_array); 620 | return rb_float_new(result); 621 | } 622 | 623 | /* 624 | * Hamming distances are computed here: 625 | */ 626 | 627 | #define COMPUTE_HAMMING_DISTANCE \ 628 | for (i = 0, result = b_len - a_len; i < a_len; i++) { \ 629 | if (i >= b_len) { \ 630 | result += a_len - b_len; \ 631 | break; \ 632 | } \ 633 | if (b_ptr[i] != a_ptr[i]) result++; \ 634 | } 635 | 636 | static VALUE Hamming_match(General *amatch, VALUE string) 637 | { 638 | char *a_ptr, *b_ptr; 639 | int a_len, b_len; 640 | int i, result; 641 | 642 | Check_Type(string, T_STRING); 643 | OPTIMIZE_TIME 644 | COMPUTE_HAMMING_DISTANCE 645 | return INT2FIX(result); 646 | } 647 | 648 | static VALUE Hamming_similar(General *amatch, VALUE string) 649 | { 650 | char *a_ptr, *b_ptr; 651 | int a_len, b_len; 652 | int i, result; 653 | 654 | Check_Type(string, T_STRING); 655 | OPTIMIZE_TIME 656 | if (a_len == 0 && b_len == 0) return 
rb_float_new(1.0); 657 | if (a_len == 0 || b_len == 0) return rb_float_new(0.0); 658 | COMPUTE_HAMMING_DISTANCE 659 | return rb_float_new(1.0 - ((double) result) / b_len); 660 | } 661 | 662 | /* 663 | * Longest Common Subsequence computation 664 | */ 665 | 666 | #define COMPUTE_LONGEST_SUBSEQUENCE \ 667 | l[0] = ALLOC_N(int, b_len + 1); \ 668 | l[1] = ALLOC_N(int, b_len + 1); \ 669 | for (i = a_len, c = 0, p = 1; i >= 0; i--) { \ 670 | for (j = b_len; j >= 0; j--) { \ 671 | if (i == a_len || j == b_len) { \ 672 | l[c][j] = 0; \ 673 | } else if (a_ptr[i] == b_ptr[j]) { \ 674 | l[c][j] = 1 + l[p][j + 1]; \ 675 | } else { \ 676 | int x = l[p][j], y = l[c][j + 1]; \ 677 | if (x > y) l[c][j] = x; else l[c][j] = y; \ 678 | } \ 679 | } \ 680 | p = c; \ 681 | c = (c + 1) % 2; \ 682 | } \ 683 | result = l[p][0]; \ 684 | xfree(l[0]); \ 685 | xfree(l[1]); 686 | 687 | 688 | static VALUE LongestSubsequence_match(General *amatch, VALUE string) 689 | { 690 | char *a_ptr, *b_ptr; 691 | int a_len, b_len; 692 | int result, c, p, i, j, *l[2]; 693 | 694 | Check_Type(string, T_STRING); 695 | OPTIMIZE_TIME 696 | 697 | if (a_len == 0 || b_len == 0) return INT2FIX(0); 698 | COMPUTE_LONGEST_SUBSEQUENCE 699 | return INT2FIX(result); 700 | } 701 | 702 | static VALUE LongestSubsequence_similar(General *amatch, VALUE string) 703 | { 704 | char *a_ptr, *b_ptr; 705 | int a_len, b_len; 706 | int result, c, p, i, j, *l[2]; 707 | 708 | Check_Type(string, T_STRING); 709 | OPTIMIZE_TIME 710 | 711 | if (a_len == 0 && b_len == 0) return rb_float_new(1.0); 712 | if (a_len == 0 || b_len == 0) return rb_float_new(0.0); 713 | COMPUTE_LONGEST_SUBSEQUENCE 714 | return rb_float_new(((double) result) / b_len); 715 | } 716 | 717 | /* 718 | * Longest Common Substring computation 719 | */ 720 | 721 | #define COMPUTE_LONGEST_SUBSTRING \ 722 | l[0] = ALLOC_N(int, b_len); \ 723 | MEMZERO(l[0], int, b_len); \ 724 | l[1] = ALLOC_N(int, b_len); \ 725 | MEMZERO(l[1], int, b_len); \ 726 | result = 0; \ 727 | for (i = 0, 
c = 0, p = 1; i < a_len; i++) { \ 728 | for (j = 0; j < b_len; j++) { \ 729 | if (a_ptr[i] == b_ptr[j]) { \ 730 | l[c][j] = j == 0 ? 1 : 1 + l[p][j - 1]; \ 731 | if (l[c][j] > result) result = l[c][j]; \ 732 | } else { \ 733 | l[c][j] = 0; \ 734 | } \ 735 | } \ 736 | p = c; \ 737 | c = (c + 1) % 2; \ 738 | } \ 739 | xfree(l[0]); \ 740 | xfree(l[1]); 741 | 742 | static VALUE LongestSubstring_match(General *amatch, VALUE string) 743 | { 744 | char *a_ptr, *b_ptr; 745 | int a_len, b_len; 746 | int result, c, p, i, j, *l[2]; 747 | 748 | Check_Type(string, T_STRING); 749 | OPTIMIZE_TIME 750 | if (a_len == 0 || b_len == 0) return INT2FIX(0); 751 | COMPUTE_LONGEST_SUBSTRING 752 | return INT2FIX(result); 753 | } 754 | 755 | static VALUE LongestSubstring_similar(General *amatch, VALUE string) 756 | { 757 | char *a_ptr, *b_ptr; 758 | int a_len, b_len; 759 | int result, c, p, i, j, *l[2]; 760 | 761 | Check_Type(string, T_STRING); 762 | OPTIMIZE_TIME 763 | if (a_len == 0 && b_len == 0) return rb_float_new(1.0); 764 | if (a_len == 0 || b_len == 0) return rb_float_new(0.0); 765 | COMPUTE_LONGEST_SUBSTRING 766 | return rb_float_new(((double) result) / b_len); 767 | } 768 | 769 | /* 770 | * Jaro computation 771 | */ 772 | 773 | #define COMPUTE_JARO \ 774 | l[0] = ALLOC_N(int, a_len); \ 775 | MEMZERO(l[0], int, a_len); \ 776 | l[1] = ALLOC_N(int, b_len); \ 777 | MEMZERO(l[1], int, b_len); \ 778 | max_dist = ((a_len > b_len ? a_len : b_len) / 2) - 1; \ 779 | m = 0; \ 780 | for (i = 0; i < a_len; i++) { \ 781 | low = (i > max_dist ? i - max_dist : 0); \ 782 | high = (i + max_dist < b_len ? 
i + max_dist : b_len - 1); \ 783 | for (j = low; j <= high; j++) { \ 784 | if (!l[1][j] && a_ptr[i] == b_ptr[j]) { \ 785 | l[0][i] = 1; \ 786 | l[1][j] = 1; \ 787 | m++; \ 788 | break; \ 789 | } \ 790 | } \ 791 | } \ 792 | if (m == 0) { \ 793 | result = 0.0; \ 794 | } else { \ 795 | k = t = 0; \ 796 | for (i = 0; i < a_len; i++) { \ 797 | if (l[0][i]) { \ 798 | for (j = k; j < b_len; j++) { \ 799 | if (l[1][j]) { \ 800 | k = j + 1; \ 801 | break; \ 802 | } \ 803 | } \ 804 | if (a_ptr[i] != b_ptr[j]) { \ 805 | t++; \ 806 | } \ 807 | } \ 808 | } \ 809 | t = t / 2; \ 810 | result = (((double)m)/a_len + ((double)m)/b_len + ((double)(m-t))/m)/3.0; \ 811 | } \ 812 | xfree(l[0]); \ 813 | xfree(l[1]); 814 | 815 | 816 | #define LOWERCASE_STRINGS \ 817 | char *ying, *yang; \ 818 | ying = ALLOC_N(char, a_len); \ 819 | MEMCPY(ying, a_ptr, char, a_len); \ 820 | a_ptr = ying; \ 821 | yang = ALLOC_N(char, b_len); \ 822 | MEMCPY(yang, b_ptr, char, b_len); \ 823 | b_ptr = yang; \ 824 | for (i = 0; i < a_len; i++) { \ 825 | if (islower(a_ptr[i])) a_ptr[i] = toupper(a_ptr[i]); \ 826 | } \ 827 | for (i = 0; i < b_len; i++) { \ 828 | if (islower(b_ptr[i])) b_ptr[i] = toupper(b_ptr[i]); \ 829 | } 830 | 831 | static VALUE Jaro_match(Jaro *amatch, VALUE string) 832 | { 833 | char *a_ptr, *b_ptr; 834 | int a_len, b_len, max_dist, m, t, i, j, k, low, high; 835 | int *l[2]; 836 | double result; 837 | 838 | Check_Type(string, T_STRING); 839 | OPTIMIZE_TIME 840 | if (a_len == 0 && b_len == 0) return rb_float_new(1.0); 841 | if (a_len == 0 || b_len == 0) return rb_float_new(0.0); 842 | if (amatch->ignore_case) { 843 | LOWERCASE_STRINGS 844 | } 845 | COMPUTE_JARO 846 | if (amatch->ignore_case) { 847 | xfree(a_ptr); 848 | xfree(b_ptr); 849 | } 850 | return rb_float_new(result); 851 | } 852 | 853 | /* 854 | * Jaro-Winkler computation 855 | */ 856 | 857 | static VALUE JaroWinkler_match(JaroWinkler *amatch, VALUE string) 858 | { 859 | char *a_ptr, *b_ptr; 860 | int a_len, b_len, max_dist, m, t, i, 
j, k, low, high, n; 861 | int *l[2]; 862 | double result; 863 | 864 | Check_Type(string, T_STRING); 865 | OPTIMIZE_TIME 866 | if (a_len == 0 && b_len == 0) return rb_float_new(1.0); 867 | if (a_len == 0 || b_len == 0) return rb_float_new(0.0); 868 | if (amatch->ignore_case) { 869 | LOWERCASE_STRINGS 870 | } 871 | COMPUTE_JARO 872 | n = 0; 873 | for (i = 0; i < (a_len >= 4 ? 4 : a_len); i++) { 874 | if (a_ptr[i] == b_ptr[i]) { 875 | n++; 876 | } else { 877 | break; 878 | } 879 | } 880 | result = result + n*amatch->scaling_factor*(1-result); 881 | if (amatch->ignore_case) { 882 | xfree(a_ptr); 883 | xfree(b_ptr); 884 | } 885 | return rb_float_new(result); 886 | } 887 | 888 | /* 889 | * Ruby API 890 | */ 891 | 892 | /* 893 | * Document-class: Amatch::Levenshtein 894 | * 895 | * The Levenshtein edit distance is defined as the minimal costs involved to 896 | * transform one string into another by using three elementary operations: 897 | * deletion, insertion and substitution of a character. To transform "water" 898 | * into "wine", for instance, you have to substitute "a" -> "i": "witer", "t" 899 | * -> "n": "winer" and delete "r": "wine". The edit distance between "water" 900 | * and "wine" is 3, because you have to apply three operations. The edit 901 | * distance between "wine" and "wine" is 0 of course: no operation is 902 | * necessary for the transformation -- they're already the same string. It's 903 | * easy to see that more similar strings have smaller edit distances than 904 | * strings that differ a lot. 905 | */ 906 | 907 | DEF_RB_FREE(Levenshtein, General) 908 | 909 | /* 910 | * call-seq: new(pattern) 911 | * 912 | * Creates a new Amatch::Levenshtein instance from pattern. 
913 | */ 914 | static VALUE rb_Levenshtein_initialize(VALUE self, VALUE pattern) 915 | { 916 | GET_STRUCT(General) 917 | General_pattern_set(amatch, pattern); 918 | return self; 919 | } 920 | 921 | DEF_CONSTRUCTOR(Levenshtein, General) 922 | 923 | /* 924 | * call-seq: match(strings) -> results 925 | * 926 | * Uses this Amatch::Levenshtein instance to match Amatch::Levenshtein#pattern 927 | * against strings. It returns the number of operations, the Levenshtein 928 | * distance. strings has to be either a String or an Array of 929 | * Strings. The returned results is either an Integer or an Array of 930 | * Integers respectively. 931 | */ 932 | static VALUE rb_Levenshtein_match(VALUE self, VALUE strings) 933 | { 934 | GET_STRUCT(General) 935 | return General_iterate_strings(amatch, strings, Levenshtein_match); 936 | } 937 | 938 | /* 939 | * call-seq: similar(strings) -> results 940 | * 941 | * Uses this Amatch::Levenshtein instance to match Amatch::Levenshtein#pattern 942 | * against strings, and computes a Levenshtein similarity metric: a 943 | * number between 0.0 for very dissimilar strings and 1.0 for an exact match. 944 | * strings has to be either a String or an Array of Strings. The 945 | * returned results is either a Float or an Array of Floats 946 | * respectively. 947 | */ 948 | static VALUE rb_Levenshtein_similar(VALUE self, VALUE strings) 949 | { 950 | GET_STRUCT(General) 951 | return General_iterate_strings(amatch, strings, Levenshtein_similar); 952 | } 953 | 954 | /* 955 | * call-seq: levenshtein_similar(strings) -> results 956 | * 957 | * If called on a String, this string is used as an Amatch::Levenshtein#pattern 958 | * to match against strings. It returns a Levenshtein similarity 959 | * metric: a number between 0.0 for very dissimilar strings and 1.0 for an exact 960 | * match. strings has to be either a String or an Array of 961 | * Strings. The returned results is either a Float or an Array of 962 | * Floats respectively. 
963 | */ 964 | static VALUE rb_str_levenshtein_similar(VALUE self, VALUE strings) 965 | { 966 | VALUE amatch = rb_Levenshtein_new(rb_cLevenshtein, self); 967 | return rb_Levenshtein_similar(amatch, strings); 968 | } 969 | 970 | /* 971 | * call-seq: search(strings) -> results 972 | * 973 | * Searches Amatch::Levenshtein#pattern in strings and returns the 974 | * edit distance (the sum of character operations) as an Integer value, by greedily 975 | * trimming prefixes or postfixes of the match. strings has 976 | * to be either a String or an Array of Strings. The returned 977 | * results is either an Integer or an Array of Integers respectively. 978 | */ 979 | static VALUE rb_Levenshtein_search(VALUE self, VALUE strings) 980 | { 981 | GET_STRUCT(General) 982 | return General_iterate_strings(amatch, strings, Levenshtein_search); 983 | } 984 | 985 | /* 986 | * Document-class: Amatch::DamerauLevenshtein 987 | * XXX 988 | * The DamerauLevenshtein edit distance is defined as the minimal cost 989 | * of transforming one string into another by using four elementary 990 | * operations: deletion, insertion and substitution of a character, and 991 | * transposition of two adjacent characters. To transform "water" into "wine", 992 | * for instance, you have to substitute "a" -> "i": "witer", "t" -> "n": 993 | * "winer" and delete "r": "wine". The edit distance between "water" and "wine" 994 | * is 3, because you have to apply three operations. The edit distance between 995 | * "wine" and "wine" is 0 of course: no operation is necessary for the 996 | * transformation -- they're already the same string. It's easy to see that 997 | * more similar strings have smaller edit distances than strings that differ a lot. 998 | */ 999 | 1000 | DEF_RB_FREE(DamerauLevenshtein, General) 1001 | 1002 | /* 1003 | * call-seq: new(pattern) 1004 | * XXX 1005 | * Creates a new Amatch::DamerauLevenshtein instance from pattern. 
1006 | */ 1007 | static VALUE rb_DamerauLevenshtein_initialize(VALUE self, VALUE pattern) 1008 | { 1009 | GET_STRUCT(General) 1010 | General_pattern_set(amatch, pattern); 1011 | return self; 1012 | } 1013 | 1014 | DEF_CONSTRUCTOR(DamerauLevenshtein, General) 1015 | 1016 | /* 1017 | * call-seq: match(strings) -> results 1018 | * XXX 1019 | * Uses this Amatch::DamerauLevenshtein instance to match Amatch::DamerauLevenshtein#pattern 1020 | * against strings. It returns the number of operations, the DamerauLevenshtein 1021 | * distance. strings has to be either a String or an Array of 1022 | * Strings. The returned results is either an Integer or an Array of 1023 | * Integers respectively. 1024 | */ 1025 | static VALUE rb_DamerauLevenshtein_match(VALUE self, VALUE strings) 1026 | { 1027 | GET_STRUCT(General) 1028 | return General_iterate_strings(amatch, strings, DamerauLevenshtein_match); 1029 | } 1030 | 1031 | /* 1032 | * call-seq: similar(strings) -> results 1033 | * XXX 1034 | * Uses this Amatch::DamerauLevenshtein instance to match Amatch::DamerauLevenshtein#pattern 1035 | * against strings, and computes a DamerauLevenshtein similarity metric: a 1036 | * number between 0.0 for very dissimilar strings and 1.0 for an exact match. 1037 | * strings has to be either a String or an Array of Strings. The 1038 | * returned results is either a Float or an Array of Floats 1039 | * respectively. 1040 | */ 1041 | static VALUE rb_DamerauLevenshtein_similar(VALUE self, VALUE strings) 1042 | { 1043 | GET_STRUCT(General) 1044 | return General_iterate_strings(amatch, strings, DamerauLevenshtein_similar); 1045 | } 1046 | 1047 | /* 1048 | * call-seq: damerau_levenshtein_similar(strings) -> results 1049 | * XXX 1050 | * If called on a String, this string is used as an Amatch::DamerauLevenshtein#pattern 1051 | * to match against strings. It returns a DamerauLevenshtein similarity 1052 | * metric: a number between 0.0 for very dissimilar strings and 1.0 for an exact 1053 | * match. 
strings has to be either a String or an Array of 1054 | * Strings. The returned results is either a Float or an Array of 1055 | * Floats respectively. 1056 | */ 1057 | static VALUE rb_str_damerau_levenshtein_similar(VALUE self, VALUE strings) 1058 | { 1059 | VALUE amatch = rb_DamerauLevenshtein_new(rb_cDamerauLevenshtein, self); 1060 | return rb_DamerauLevenshtein_similar(amatch, strings); 1061 | } 1062 | 1063 | /* 1064 | * call-seq: search(strings) -> results 1065 | * XXX 1066 | * Searches Amatch::DamerauLevenshtein#pattern in strings and returns the 1067 | * edit distance (the sum of character operations) as an Integer value, by greedily 1068 | * trimming prefixes or postfixes of the match. strings has 1069 | * to be either a String or an Array of Strings. The returned 1070 | * results is either an Integer or an Array of Integers respectively. 1071 | */ 1072 | static VALUE rb_DamerauLevenshtein_search(VALUE self, VALUE strings) 1073 | { 1074 | GET_STRUCT(General) 1075 | return General_iterate_strings(amatch, strings, DamerauLevenshtein_search); 1076 | } 1077 | 1078 | /* 1079 | * Document-class: Amatch::Sellers 1080 | * 1081 | * The Sellers edit distance is very similar to the Levenshtein edit distance. 1082 | * The difference is that you can specify a different weight for each 1083 | * operation, to prefer some operations over others. This weighted extension 1084 | * of the Levenshtein edit distance is also known as the Needleman-Wunsch 1085 | * distance. 1086 | */ 1087 | 1088 | DEF_RB_FREE(Sellers, Sellers) 1089 | 1090 | /* 1091 | * Document-method: substitution 1092 | * 1093 | * call-seq: substitution -> weight 1094 | * 1095 | * Returns the weight of the substitution operation, that is used to compute 1096 | * the Sellers distance. 
1097 | */ 1098 | DEF_RB_READER(Sellers, rb_Sellers_substitution, substitution, 1099 | rb_float_new) 1100 | 1101 | /* 1102 | * Document-method: deletion 1103 | * 1104 | * call-seq: deletion -> weight 1105 | * 1106 | * Returns the weight of the deletion operation, that is used to compute 1107 | * the Sellers distance. 1108 | */ 1109 | DEF_RB_READER(Sellers, rb_Sellers_deletion, deletion, 1110 | rb_float_new) 1111 | 1112 | /* 1113 | * Document-method: insertion 1114 | * 1115 | * call-seq: insertion -> weight 1116 | * 1117 | * Returns the weight of the insertion operation, that is used to compute 1118 | * the Sellers distance. 1119 | */ 1120 | DEF_RB_READER(Sellers, rb_Sellers_insertion, insertion, 1121 | rb_float_new) 1122 | 1123 | /* 1124 | * Document-method: substitution= 1125 | * 1126 | * call-seq: substitution=(weight) 1127 | * 1128 | * Sets the weight of the substitution operation, that is used to compute 1129 | * the Sellers distance, to weight. The weight 1130 | * should be a Float value >= 0.0. 1131 | */ 1132 | DEF_RB_WRITER(Sellers, rb_Sellers_substitution_set, substitution, 1133 | double, CAST2FLOAT, FLOAT2C, >= 0) 1134 | 1135 | /* 1136 | * Document-method: deletion= 1137 | * 1138 | * call-seq: deletion=(weight) 1139 | * 1140 | * Sets the weight of the deletion operation, that is used to compute 1141 | * the Sellers distance, to weight. The weight 1142 | * should be a Float value >= 0.0. 1143 | */ 1144 | DEF_RB_WRITER(Sellers, rb_Sellers_deletion_set, deletion, 1145 | double, CAST2FLOAT, FLOAT2C, >= 0) 1146 | 1147 | /* 1148 | * Document-method: insertion= 1149 | * 1150 | * call-seq: insertion=(weight) 1151 | * 1152 | * Sets the weight of the insertion operation, that is used to compute 1153 | * the Sellers distance, to weight. The weight 1154 | * should be a Float value >= 0.0. 
1155 | */ 1156 | DEF_RB_WRITER(Sellers, rb_Sellers_insertion_set, insertion, 1157 | double, CAST2FLOAT, FLOAT2C, >= 0) 1158 | 1159 | /* 1160 | * Resets all weights (substitution, deletion, and insertion) to 1.0. 1161 | */ 1162 | static VALUE rb_Sellers_reset_weights(VALUE self) 1163 | { 1164 | GET_STRUCT(Sellers) 1165 | Sellers_reset_weights(amatch); 1166 | return self; 1167 | } 1168 | 1169 | /* 1170 | * call-seq: new(pattern) 1171 | * 1172 | * Creates a new Amatch::Sellers instance from pattern, 1173 | * with all weights initially set to 1.0. 1174 | */ 1175 | static VALUE rb_Sellers_initialize(VALUE self, VALUE pattern) 1176 | { 1177 | GET_STRUCT(Sellers) 1178 | Sellers_pattern_set(amatch, pattern); 1179 | Sellers_reset_weights(amatch); 1180 | return self; 1181 | } 1182 | 1183 | DEF_CONSTRUCTOR(Sellers, Sellers) 1184 | 1185 | /* 1186 | * Document-method: pattern 1187 | * 1188 | * call-seq: pattern -> pattern string 1189 | * 1190 | * Returns the current pattern string of this Amatch::Sellers instance. 1191 | */ 1192 | 1193 | /* 1194 | * Document-method: pattern= 1195 | * 1196 | * call-seq: pattern=(pattern) 1197 | * 1198 | * Sets the current pattern string of this Amatch::Sellers instance to 1199 | * pattern. 1200 | */ 1201 | 1202 | /* 1203 | * call-seq: match(strings) -> results 1204 | * 1205 | * Uses this Amatch::Sellers instance to match Sellers#pattern against 1206 | * strings, while taking into account the given weights. It 1207 | * returns the number of weighted character operations, the Sellers distance. 1208 | * strings has to be either a String or an Array of Strings. The 1209 | * returned results is either a Float or an Array of Floats 1210 | * respectively. 
1211 | */ 1212 | static VALUE rb_Sellers_match(VALUE self, VALUE strings) 1213 | { 1214 | GET_STRUCT(Sellers) 1215 | return Sellers_iterate_strings(amatch, strings, Sellers_match); 1216 | } 1217 | 1218 | /* 1219 | * call-seq: similar(strings) -> results 1220 | * 1221 | * Uses this Amatch::Sellers instance to match Amatch::Sellers#pattern 1222 | * against strings (taking into account the given weights), and 1223 | * compute a Sellers distance metric number between 0.0 for very unsimilar 1224 | * strings and 1.0 for an exact match. strings has to be either a 1225 | * String or an Array of Strings. The returned results is either 1226 | * a Float or an Array of Floats 1227 | * respectively. 1228 | */ 1229 | static VALUE rb_Sellers_similar(VALUE self, VALUE strings) 1230 | { 1231 | GET_STRUCT(Sellers) 1232 | return Sellers_iterate_strings(amatch, strings, Sellers_similar); 1233 | } 1234 | 1235 | /* 1236 | * call-seq: search(strings) -> results 1237 | * 1238 | * Searches Sellers#pattern in strings and returns the edit 1239 | * distance (the sum of weighted character operations) as a Float value, by 1240 | * greedily trimming prefixes or postfixes of the match. strings has 1241 | * to be either a String or an Array of Strings. The returned 1242 | * results is either a Float or an Array of Floats respectively. 1243 | */ 1244 | static VALUE rb_Sellers_search(VALUE self, VALUE strings) 1245 | { 1246 | GET_STRUCT(Sellers) 1247 | return Sellers_iterate_strings(amatch, strings, Sellers_search); 1248 | } 1249 | 1250 | /* 1251 | * Document-class: Amatch::PairDistance 1252 | * 1253 | * The pair distance between two strings is based on the number of adjacent 1254 | * character pairs that are contained in both strings. The similarity 1255 | * metric of two strings s1 and s2 is 1256 | * 2*|intersection(pairs(s1), pairs(s2))| / (|pairs(s1)| + |pairs(s2)|) 1257 | * If it is 1.0 the two strings are an exact match; the further it falls below 1.0, the 1258 | * more dissimilar they are. 
The advantage of considering adjacent characters is that it 1259 | * takes into account not only the characters, but also the character ordering 1260 | * in the original strings. 1261 | * 1262 | * This metric is well suited to finding similarities in natural languages. 1263 | * It is explained in more detail in Simon White's article "How to Strike a 1264 | * Match", located at this URL: 1265 | * http://www.catalysoft.com/articles/StrikeAMatch.html 1266 | * It is also very similar (a special case) to the method described under 1267 | * http://citeseer.lcs.mit.edu/gravano01using.html in "Using q-grams in a DBMS 1268 | * for Approximate String Processing." 1269 | */ 1270 | DEF_RB_FREE(PairDistance, PairDistance) 1271 | 1272 | /* 1273 | * call-seq: new(pattern) 1274 | * 1275 | * Creates a new Amatch::PairDistance instance from pattern. 1276 | */ 1277 | static VALUE rb_PairDistance_initialize(VALUE self, VALUE pattern) 1278 | { 1279 | GET_STRUCT(PairDistance) 1280 | PairDistance_pattern_set(amatch, pattern); 1281 | return self; 1282 | } 1283 | 1284 | DEF_CONSTRUCTOR(PairDistance, PairDistance) 1285 | 1286 | /* 1287 | * call-seq: match(strings, regexp = /\s+/) -> results 1288 | * 1289 | * Uses this Amatch::PairDistance instance to match PairDistance#pattern against 1290 | * strings. It returns the pair distance measure: a 1291 | * returned value of 1.0 is an exact match, partial matches are lower 1292 | * values, while 0.0 means no match at all. 1293 | * 1294 | * strings has to be either a String or an 1295 | * Array of Strings. The argument regexp is used to split the 1296 | * pattern and strings into tokens first. It defaults to /\s+/. If the 1297 | * splitting should be omitted, call the method with nil as regexp 1298 | * explicitly. 1299 | * 1300 | * The returned results is either a Float or an 1301 | * Array of Floats respectively. 
1302 | */ 1303 | static VALUE rb_PairDistance_match(int argc, VALUE *argv, VALUE self) 1304 | { 1305 | VALUE result, strings, regexp = Qnil; 1306 | int use_regexp; 1307 | GET_STRUCT(PairDistance) 1308 | 1309 | rb_scan_args(argc, argv, "11", &strings, &regexp); 1310 | use_regexp = NIL_P(regexp) && argc != 2; 1311 | if (TYPE(strings) == T_STRING) { 1312 | result = PairDistance_match(amatch, strings, regexp, use_regexp); 1313 | } else { 1314 | int i; 1315 | Check_Type(strings, T_ARRAY); 1316 | result = rb_ary_new2(RARRAY_LEN(strings)); 1317 | for (i = 0; i < RARRAY_LEN(strings); i++) { 1318 | VALUE string = rb_ary_entry(strings, i); 1319 | if (TYPE(string) != T_STRING) { 1320 | rb_raise(rb_eTypeError, 1321 | "array has to contain only strings (%s given)", 1322 | NIL_P(string) ? 1323 | "NilClass" : 1324 | rb_class2name(CLASS_OF(string))); 1325 | } 1326 | rb_ary_push(result, 1327 | PairDistance_match(amatch, string, regexp, use_regexp)); 1328 | } 1329 | } 1330 | pair_array_destroy(amatch->pattern_pair_array); 1331 | amatch->pattern_pair_array = NULL; 1332 | return result; 1333 | } 1334 | 1335 | /* 1336 | * call-seq: pair_distance_similar(strings, regexp = nil) -> results 1337 | * 1338 | * If called on a String, this string is used as an Amatch::PairDistance#pattern 1339 | * to match against strings using /\s+/ as the tokenizing regular 1340 | * expression. It returns a pair distance metric number between 0.0 for very 1341 | * unsimilar strings and 1.0 for an exact match. strings has to be 1342 | * either a String or an Array of Strings. 1343 | * 1344 | * The returned results is either a Float or an Array of Floats 1345 | * respectively. 
1346 | */ 1347 | static VALUE rb_str_pair_distance_similar(int argc, VALUE *argv, VALUE self) 1348 | { 1349 | VALUE amatch, string, regexp = Qnil; 1350 | rb_scan_args(argc, argv, "11", &string, &regexp); 1351 | amatch = rb_PairDistance_new(rb_cPairDistance, self); 1352 | if (NIL_P(regexp)) { 1353 | return rb_PairDistance_match(1, &string, amatch); 1354 | } else { 1355 | VALUE args[2]; /* room for two VALUEs; alloca(2) reserved only two bytes */ 1356 | args[0] = string; 1357 | args[1] = regexp; 1358 | return rb_PairDistance_match(2, args, amatch); 1359 | } 1360 | } 1361 | 1362 | /* 1363 | * Document-class: Amatch::Hamming 1364 | * 1365 | * This class computes the Hamming distance between two strings. 1366 | * 1367 | * The Hamming distance between two strings is the number of positions at 1368 | * which the characters differ. Thus a Hamming distance of 0 means an exact 1369 | * match, a Hamming distance of 1 means one character is different, and so on. 1370 | * If one string is longer than the other string, the missing characters are 1371 | * counted as different characters. 1372 | */ 1373 | 1374 | DEF_RB_FREE(Hamming, General) 1375 | 1376 | /* 1377 | * call-seq: new(pattern) 1378 | * 1379 | * Creates a new Amatch::Hamming instance from pattern. 1380 | */ 1381 | static VALUE rb_Hamming_initialize(VALUE self, VALUE pattern) 1382 | { 1383 | GET_STRUCT(General) 1384 | General_pattern_set(amatch, pattern); 1385 | return self; 1386 | } 1387 | 1388 | DEF_CONSTRUCTOR(Hamming, General) 1389 | 1390 | /* 1391 | * call-seq: match(strings) -> results 1392 | * 1393 | * Uses this Amatch::Hamming instance to match Amatch::Hamming#pattern against 1394 | * strings, that is, compute the Hamming distance between 1395 | * pattern and strings. strings has to 1396 | * be either a String or an Array of Strings. The returned results 1397 | * is either a Fixnum or an Array of Fixnums respectively. 
1398 | */ 1399 | static VALUE rb_Hamming_match(VALUE self, VALUE strings) 1400 | { 1401 | GET_STRUCT(General) 1402 | return General_iterate_strings(amatch, strings, Hamming_match); 1403 | } 1404 | 1405 | /* 1406 | * call-seq: similar(strings) -> results 1407 | * 1408 | * Uses this Amatch::Hamming instance to match Amatch::Hamming#pattern against 1409 | * strings, and compute a Hamming distance metric number between 1410 | * 0.0 for very unsimilar strings and 1.0 for an exact match. 1411 | * strings has to be either a String or an Array of Strings. The 1412 | * returned results is either a Float or an Array of Floats 1413 | * respectively. 1414 | */ 1415 | static VALUE rb_Hamming_similar(VALUE self, VALUE strings) 1416 | { 1417 | GET_STRUCT(General) 1418 | return General_iterate_strings(amatch, strings, Hamming_similar); 1419 | } 1420 | 1421 | /* 1422 | * call-seq: hamming_similar(strings) -> results 1423 | * 1424 | * If called on a String, this string is used as an Amatch::Hamming#pattern to 1425 | * match against strings. It returns a Hamming distance metric 1426 | * number between 0.0 for very unsimilar strings and 1.0 for an exact match. 1427 | * strings 1428 | * has to be either a String or an Array of Strings. The returned 1429 | * results is either a Float or an Array of Floats respectively. 1430 | */ 1431 | static VALUE rb_str_hamming_similar(VALUE self, VALUE strings) 1432 | { 1433 | VALUE amatch = rb_Hamming_new(rb_cHamming, self); 1434 | return rb_Hamming_similar(amatch, strings); 1435 | } 1436 | 1437 | 1438 | /* 1439 | * Document-class: Amatch::LongestSubsequence 1440 | * 1441 | * This class computes the length of the longest subsequence common to two 1442 | * strings. A subsequence doesn't have to be contiguous. The longer the common 1443 | * subsequence is, the more similar the two strings will be. 1444 | * 1445 | * The longest common subsequence between "test" and "test" is of length 4, 1446 | * because "test" itself is this subsequence. 
The longest common subsequence 1447 | * between "test" and "east" is "e", "s", "t" and the length of the 1448 | * sequence is 3. 1449 | */ 1450 | DEF_RB_FREE(LongestSubsequence, General) 1451 | 1452 | /* 1453 | * call-seq: new(pattern) 1454 | * 1455 | * Creates a new Amatch::LongestSubsequence instance from pattern. 1456 | */ 1457 | static VALUE rb_LongestSubsequence_initialize(VALUE self, VALUE pattern) 1458 | { 1459 | GET_STRUCT(General) 1460 | General_pattern_set(amatch, pattern); 1461 | return self; 1462 | } 1463 | 1464 | DEF_CONSTRUCTOR(LongestSubsequence, General) 1465 | 1466 | /* 1467 | * call-seq: match(strings) -> results 1468 | * 1469 | * Uses this Amatch::LongestSubsequence instance to match 1470 | * LongestSubsequence#pattern against strings, that is, compute the 1471 | * length of the longest common subsequence. strings has to be 1472 | * either a String or an Array of Strings. The returned results 1473 | * is either a Fixnum or an Array of Fixnums respectively. 1474 | */ 1475 | static VALUE rb_LongestSubsequence_match(VALUE self, VALUE strings) 1476 | { 1477 | GET_STRUCT(General) 1478 | return General_iterate_strings(amatch, strings, LongestSubsequence_match); 1479 | } 1480 | 1481 | /* 1482 | * call-seq: similar(strings) -> results 1483 | * 1484 | * Uses this Amatch::LongestSubsequence instance to match 1485 | * Amatch::LongestSubsequence#pattern against strings, and compute 1486 | * a longest subsequence distance metric number between 0.0 for very unsimilar 1487 | * strings and 1.0 for an exact match. strings has to be either a 1488 | * String or an Array of Strings. 
The returned results is either 1489 | * a Float or an Array of Floats respectively. 1490 | */ 1491 | static VALUE rb_LongestSubsequence_similar(VALUE self, VALUE strings) 1492 | { 1493 | GET_STRUCT(General) 1494 | return General_iterate_strings(amatch, strings, LongestSubsequence_similar); 1495 | } 1496 | 1497 | /* 1498 | * call-seq: longest_subsequence_similar(strings) -> results 1499 | * 1500 | * If called on a String, this string is used as an 1501 | * Amatch::LongestSubsequence#pattern to match against strings. It 1502 | * returns a longest subsequence distance metric number between 0.0 for very 1503 | * unsimilar strings and 1.0 for an exact match. strings has to be 1504 | * either a String or an Array of Strings. The returned results 1505 | * is either a Float or an Array of Floats respectively. 1506 | */ 1507 | static VALUE rb_str_longest_subsequence_similar(VALUE self, VALUE strings) 1508 | { 1509 | VALUE amatch = rb_LongestSubsequence_new(rb_cLongestSubsequence, self); 1510 | return rb_LongestSubsequence_similar(amatch, strings); 1511 | } 1512 | 1513 | /* 1514 | * Document-class: Amatch::LongestSubstring 1515 | * 1516 | * The longest common substring is the longest substring that is part of 1517 | * two strings. A substring is contiguous, while a subsequence need not 1518 | * be. The longer the common substring is, the more similar the two strings 1519 | * will be. 1520 | * 1521 | * The longest common substring between 'string' and 'string' is 'string' 1522 | * again, thus the longest common substring length is 6. The longest common 1523 | * substring between 'string' and 'storing' is 'ring', thus the longest common 1524 | * substring length is 4. 1525 | */ 1526 | 1527 | DEF_RB_FREE(LongestSubstring, General) 1528 | 1529 | /* 1530 | * call-seq: new(pattern) 1531 | * 1532 | * Creates a new Amatch::LongestSubstring instance from pattern. 
1533 | */ 1534 | static VALUE rb_LongestSubstring_initialize(VALUE self, VALUE pattern) 1535 | { 1536 | GET_STRUCT(General) 1537 | General_pattern_set(amatch, pattern); 1538 | return self; 1539 | } 1540 | 1541 | DEF_CONSTRUCTOR(LongestSubstring, General) 1542 | 1543 | /* 1544 | * call-seq: match(strings) -> results 1545 | * 1546 | * Uses this Amatch::LongestSubstring instance to match 1547 | * LongestSubstring#pattern against strings, that is, compute the 1548 | * length of the longest common substring. strings has to be 1549 | * either a String or an Array of Strings. The returned results 1550 | * is either a Fixnum or an Array of Fixnums respectively. 1551 | */ 1552 | static VALUE rb_LongestSubstring_match(VALUE self, VALUE strings) 1553 | { 1554 | GET_STRUCT(General) 1555 | return General_iterate_strings(amatch, strings, LongestSubstring_match); 1556 | } 1557 | 1558 | /* 1559 | * call-seq: similar(strings) -> results 1560 | * 1561 | * Uses this Amatch::LongestSubstring instance to match 1562 | * Amatch::LongestSubstring#pattern against strings, and compute a 1563 | * longest substring distance metric number between 0.0 for very unsimilar 1564 | * strings and 1.0 for an exact match. strings has to be either a 1565 | * String or an Array of Strings. The returned results is either 1566 | * a Float or an Array of Floats 1567 | * respectively. 1568 | */ 1569 | static VALUE rb_LongestSubstring_similar(VALUE self, VALUE strings) 1570 | { 1571 | GET_STRUCT(General) 1572 | return General_iterate_strings(amatch, strings, LongestSubstring_similar); 1573 | } 1574 | 1575 | /* 1576 | * call-seq: longest_substring_similar(strings) -> results 1577 | * 1578 | * If called on a String, this string is used as an 1579 | * Amatch::LongestSubstring#pattern to match against strings. It 1580 | * returns a longest substring distance metric number between 0.0 for very 1581 | * unsimilar strings and 1.0 for an exact match. strings has to be 1582 | * either a String or an Array of Strings. 
The returned results 1583 | * is either a Float or an Array of Floats respectively. 1584 | */ 1585 | static VALUE rb_str_longest_substring_similar(VALUE self, VALUE strings) 1586 | { 1587 | VALUE amatch = rb_LongestSubstring_new(rb_cLongestSubstring, self); 1588 | return rb_LongestSubstring_similar(amatch, strings); 1589 | } 1590 | 1591 | /* 1592 | * Document-class: Amatch::Jaro 1593 | * 1594 | * This class computes the Jaro metric for two strings. 1595 | * The Jaro metric computes the similarity between 0 (no match) 1596 | * and 1 (exact match) by looking for matching and transposed characters. 1597 | */ 1598 | DEF_RB_FREE(Jaro, Jaro) 1599 | 1600 | /* 1601 | * Document-method: ignore_case 1602 | * 1603 | * call-seq: ignore_case -> true/false 1604 | * 1605 | * Returns whether case is ignored when computing matching characters. 1606 | */ 1607 | DEF_RB_READER(Jaro, rb_Jaro_ignore_case, ignore_case, C2BOOL) 1608 | 1609 | /* 1610 | * Document-method: ignore_case= 1611 | * 1612 | * call-seq: ignore_case=(true/false) 1613 | * 1614 | * Sets whether case is ignored when computing matching characters. 1615 | */ 1616 | DEF_RB_WRITER(Jaro, rb_Jaro_ignore_case_set, ignore_case, 1617 | int, CAST2BOOL, BOOL2C, != Qundef) 1618 | 1619 | /* 1620 | * call-seq: new(pattern) 1621 | * 1622 | * Creates a new Amatch::Jaro instance from pattern. 1623 | */ 1624 | static VALUE rb_Jaro_initialize(VALUE self, VALUE pattern) 1625 | { 1626 | GET_STRUCT(Jaro) 1627 | Jaro_pattern_set(amatch, pattern); 1628 | amatch->ignore_case = 1; 1629 | return self; 1630 | } 1631 | 1632 | DEF_CONSTRUCTOR(Jaro, Jaro) 1633 | 1634 | /* 1635 | * call-seq: match(strings) -> results 1636 | * 1637 | * Uses this Amatch::Jaro instance to match 1638 | * Jaro#pattern against strings, that is compute the 1639 | * jaro metric with the strings. strings has to be 1640 | * either a String or an Array of Strings. The returned results 1641 | * is either a Float or an Array of Floats respectively. 
1642 | */ 1643 | static VALUE rb_Jaro_match(VALUE self, VALUE strings) 1644 | { 1645 | GET_STRUCT(Jaro) 1646 | return Jaro_iterate_strings(amatch, strings, Jaro_match); 1647 | } 1648 | 1649 | /* 1650 | * call-seq: jaro_similar(strings) -> results 1651 | * 1652 | * If called on a String, this string is used as a 1653 | * Amatch::Jaro#pattern to match against strings. It 1654 | * returns a Jaro metric number between 0.0 for very 1655 | * unsimilar strings and 1.0 for an exact match. strings has to be 1656 | * either a String or an Array of Strings. The returned results 1657 | * is either a Float or an Array of Floats respectively. 1658 | */ 1659 | static VALUE rb_str_jaro_similar(VALUE self, VALUE strings) 1660 | { 1661 | VALUE amatch = rb_Jaro_new(rb_cJaro, self); 1662 | return rb_Jaro_match(amatch, strings); 1663 | } 1664 | 1665 | /* 1666 | * Document-class: Amatch::JaroWinkler 1667 | * 1668 | * This class computes the Jaro-Winkler metric for two strings. 1669 | * The Jaro-Winkler metric computes the similarity between 0 (no match) 1670 | * and 1 (exact match) by looking for matching and transposed characters. 1671 | * 1672 | * It is a variant of the Jaro metric, with additional weighting towards 1673 | * common prefixes. 1674 | */ 1675 | DEF_RB_FREE(JaroWinkler, JaroWinkler) 1676 | 1677 | /* 1678 | * Document-method: ignore_case 1679 | * 1680 | * call-seq: ignore_case -> true/false 1681 | * 1682 | * Returns whether case is ignored when computing matching characters. 1683 | * Default is true. 1684 | */ 1685 | DEF_RB_READER(JaroWinkler, rb_JaroWinkler_ignore_case, ignore_case, C2BOOL) 1686 | 1687 | /* 1688 | * Document-method: scaling_factor 1689 | * 1690 | * call-seq: scaling_factor -> weight 1691 | * 1692 | * The scaling factor is how much weight to give common prefixes. 1693 | * Default is 0.1. 
1694 | */ 1695 | DEF_RB_READER(JaroWinkler, rb_JaroWinkler_scaling_factor, scaling_factor, rb_float_new) 1696 | 1697 | /* 1698 | * Document-method: ignore_case= 1699 | * 1700 | * call-seq: ignore_case=(true/false) 1701 | * 1702 | * Sets whether case is ignored when computing matching characters. 1703 | */ 1704 | DEF_RB_WRITER(JaroWinkler, rb_JaroWinkler_ignore_case_set, ignore_case, 1705 | int, CAST2BOOL, BOOL2C, != Qundef) 1706 | 1707 | /* 1708 | * Document-method: scaling_factor= 1709 | * 1710 | * call-seq: scaling_factor=(weight) 1711 | * 1712 | * Sets the weight to give common prefixes. 1713 | */ 1714 | DEF_RB_WRITER(JaroWinkler, rb_JaroWinkler_scaling_factor_set, scaling_factor, 1715 | double, CAST2FLOAT, FLOAT2C, >= 0) 1716 | 1717 | /* 1718 | * call-seq: new(pattern) 1719 | * 1720 | * Creates a new Amatch::JaroWinkler instance from pattern. 1721 | */ 1722 | static VALUE rb_JaroWinkler_initialize(VALUE self, VALUE pattern) 1723 | { 1724 | GET_STRUCT(JaroWinkler) 1725 | JaroWinkler_pattern_set(amatch, pattern); 1726 | amatch->ignore_case = 1; 1727 | amatch->scaling_factor = 0.1; 1728 | return self; 1729 | } 1730 | 1731 | DEF_CONSTRUCTOR(JaroWinkler, JaroWinkler) 1732 | 1733 | /* 1734 | * call-seq: match(strings) -> results 1735 | * 1736 | * Uses this Amatch::JaroWinkler instance to match 1737 | * JaroWinkler#pattern against strings, that is, compute the 1738 | * Jaro-Winkler metric with the strings. strings has to be 1739 | * either a String or an Array of Strings. The returned results 1740 | * is either a Float or an Array of Floats respectively. 1741 | */ 1742 | static VALUE rb_JaroWinkler_match(VALUE self, VALUE strings) 1743 | { 1744 | GET_STRUCT(JaroWinkler) 1745 | return JaroWinkler_iterate_strings(amatch, strings, JaroWinkler_match); 1746 | } 1747 | 1748 | /* 1749 | * call-seq: jarowinkler_similar(strings) -> results 1750 | * 1751 | * If called on a String, this string is used as an 1752 | * Amatch::JaroWinkler#pattern to match against strings. 
It 1753 | * returns a Jaro-Winkler metric number between 0.0 for very 1754 | * unsimilar strings and 1.0 for an exact match. strings has to be 1755 | * either a String or an Array of Strings. The returned results 1756 | * is either a Float or an Array of Floats respectively. 1757 | */ 1758 | static VALUE rb_str_jarowinkler_similar(VALUE self, VALUE strings) 1759 | { 1760 | VALUE amatch = rb_JaroWinkler_new(rb_cJaroWinkler, self); 1761 | return rb_JaroWinkler_match(amatch, strings); 1762 | } 1763 | 1764 | /* 1765 | * This is the namespace module that includes all other classes, modules, and 1766 | * constants. 1767 | */ 1768 | 1769 | void Init_amatch_ext() 1770 | { 1771 | rb_require("amatch/version"); 1772 | rb_mAmatch = rb_define_module("Amatch"); 1773 | /* This module can be mixed into ::String or its subclasses to mix in the similarity methods directly. */ 1774 | rb_mAmatchStringMethods = rb_define_module_under(rb_mAmatch, "StringMethods"); 1775 | 1776 | /* Levenshtein */ 1777 | rb_cLevenshtein = rb_define_class_under(rb_mAmatch, "Levenshtein", rb_cObject); 1778 | rb_define_alloc_func(rb_cLevenshtein, rb_Levenshtein_s_allocate); 1779 | rb_define_method(rb_cLevenshtein, "initialize", rb_Levenshtein_initialize, 1); 1780 | rb_define_method(rb_cLevenshtein, "pattern", rb_General_pattern, 0); 1781 | rb_define_method(rb_cLevenshtein, "pattern=", rb_General_pattern_set, 1); 1782 | rb_define_method(rb_cLevenshtein, "match", rb_Levenshtein_match, 1); 1783 | rb_define_method(rb_cLevenshtein, "search", rb_Levenshtein_search, 1); 1784 | rb_define_method(rb_cLevenshtein, "similar", rb_Levenshtein_similar, 1); 1785 | rb_define_method(rb_mAmatchStringMethods, "levenshtein_similar", rb_str_levenshtein_similar, 1); 1786 | 1787 | /* DamerauLevenshtein */ 1788 | rb_cDamerauLevenshtein = rb_define_class_under(rb_mAmatch, "DamerauLevenshtein", rb_cObject); 1789 | rb_define_alloc_func(rb_cDamerauLevenshtein, rb_DamerauLevenshtein_s_allocate); 1790 | rb_define_method(rb_cDamerauLevenshtein, 
"initialize", rb_DamerauLevenshtein_initialize, 1); 1791 | rb_define_method(rb_cDamerauLevenshtein, "pattern", rb_General_pattern, 0); 1792 | rb_define_method(rb_cDamerauLevenshtein, "pattern=", rb_General_pattern_set, 1); 1793 | rb_define_method(rb_cDamerauLevenshtein, "match", rb_DamerauLevenshtein_match, 1); 1794 | rb_define_method(rb_cDamerauLevenshtein, "search", rb_DamerauLevenshtein_search, 1); 1795 | rb_define_method(rb_cDamerauLevenshtein, "similar", rb_DamerauLevenshtein_similar, 1); 1796 | rb_define_method(rb_mAmatchStringMethods, "damerau_levenshtein_similar", rb_str_damerau_levenshtein_similar, 1); 1797 | 1798 | /* Sellers */ 1799 | rb_cSellers = rb_define_class_under(rb_mAmatch, "Sellers", rb_cObject); 1800 | rb_define_alloc_func(rb_cSellers, rb_Sellers_s_allocate); 1801 | rb_define_method(rb_cSellers, "initialize", rb_Sellers_initialize, 1); 1802 | rb_define_method(rb_cSellers, "pattern", rb_Sellers_pattern, 0); 1803 | rb_define_method(rb_cSellers, "pattern=", rb_Sellers_pattern_set, 1); 1804 | rb_define_method(rb_cSellers, "substitution", rb_Sellers_substitution, 0); 1805 | rb_define_method(rb_cSellers, "substitution=", rb_Sellers_substitution_set, 1); 1806 | rb_define_method(rb_cSellers, "deletion", rb_Sellers_deletion, 0); 1807 | rb_define_method(rb_cSellers, "deletion=", rb_Sellers_deletion_set, 1); 1808 | rb_define_method(rb_cSellers, "insertion", rb_Sellers_insertion, 0); 1809 | rb_define_method(rb_cSellers, "insertion=", rb_Sellers_insertion_set, 1); 1810 | rb_define_method(rb_cSellers, "reset_weights", rb_Sellers_reset_weights, 0); 1811 | rb_define_method(rb_cSellers, "match", rb_Sellers_match, 1); 1812 | rb_define_method(rb_cSellers, "search", rb_Sellers_search, 1); 1813 | rb_define_method(rb_cSellers, "similar", rb_Sellers_similar, 1); 1814 | 1815 | /* Hamming */ 1816 | rb_cHamming = rb_define_class_under(rb_mAmatch, "Hamming", rb_cObject); 1817 | rb_define_alloc_func(rb_cHamming, rb_Hamming_s_allocate); 1818 | rb_define_method(rb_cHamming, 
"initialize", rb_Hamming_initialize, 1); 1819 | rb_define_method(rb_cHamming, "pattern", rb_General_pattern, 0); 1820 | rb_define_method(rb_cHamming, "pattern=", rb_General_pattern_set, 1); 1821 | rb_define_method(rb_cHamming, "match", rb_Hamming_match, 1); 1822 | rb_define_method(rb_cHamming, "similar", rb_Hamming_similar, 1); 1823 | rb_define_method(rb_mAmatchStringMethods, "hamming_similar", rb_str_hamming_similar, 1); 1824 | 1825 | /* Pair Distance Metric / Dice Coefficient */ 1826 | rb_cPairDistance = rb_define_class_under(rb_mAmatch, "PairDistance", rb_cObject); 1827 | rb_define_alloc_func(rb_cPairDistance, rb_PairDistance_s_allocate); 1828 | rb_define_method(rb_cPairDistance, "initialize", rb_PairDistance_initialize, 1); 1829 | rb_define_method(rb_cPairDistance, "pattern", rb_PairDistance_pattern, 0); 1830 | rb_define_method(rb_cPairDistance, "pattern=", rb_PairDistance_pattern_set, 1); 1831 | rb_define_method(rb_cPairDistance, "match", rb_PairDistance_match, -1); 1832 | rb_define_alias(rb_cPairDistance, "similar", "match"); 1833 | rb_define_method(rb_mAmatchStringMethods, "pair_distance_similar", rb_str_pair_distance_similar, -1); 1834 | 1835 | /* Longest Common Subsequence */ 1836 | rb_cLongestSubsequence = rb_define_class_under(rb_mAmatch, "LongestSubsequence", rb_cObject); 1837 | rb_define_alloc_func(rb_cLongestSubsequence, rb_LongestSubsequence_s_allocate); 1838 | rb_define_method(rb_cLongestSubsequence, "initialize", rb_LongestSubsequence_initialize, 1); 1839 | rb_define_method(rb_cLongestSubsequence, "pattern", rb_General_pattern, 0); 1840 | rb_define_method(rb_cLongestSubsequence, "pattern=", rb_General_pattern_set, 1); 1841 | rb_define_method(rb_cLongestSubsequence, "match", rb_LongestSubsequence_match, 1); 1842 | rb_define_method(rb_cLongestSubsequence, "similar", rb_LongestSubsequence_similar, 1); 1843 | rb_define_method(rb_mAmatchStringMethods, "longest_subsequence_similar", rb_str_longest_subsequence_similar, 1); 1844 | 1845 | /* Longest Common 
Substring */ 1846 | rb_cLongestSubstring = rb_define_class_under(rb_mAmatch, "LongestSubstring", rb_cObject); 1847 | rb_define_alloc_func(rb_cLongestSubstring, rb_LongestSubstring_s_allocate); 1848 | rb_define_method(rb_cLongestSubstring, "initialize", rb_LongestSubstring_initialize, 1); 1849 | rb_define_method(rb_cLongestSubstring, "pattern", rb_General_pattern, 0); 1850 | rb_define_method(rb_cLongestSubstring, "pattern=", rb_General_pattern_set, 1); 1851 | rb_define_method(rb_cLongestSubstring, "match", rb_LongestSubstring_match, 1); 1852 | rb_define_method(rb_cLongestSubstring, "similar", rb_LongestSubstring_similar, 1); 1853 | rb_define_method(rb_mAmatchStringMethods, "longest_substring_similar", rb_str_longest_substring_similar, 1); 1854 | 1855 | /* Jaro */ 1856 | rb_cJaro = rb_define_class_under(rb_mAmatch, "Jaro", rb_cObject); 1857 | rb_define_alloc_func(rb_cJaro, rb_Jaro_s_allocate); 1858 | rb_define_method(rb_cJaro, "initialize", rb_Jaro_initialize, 1); 1859 | rb_define_method(rb_cJaro, "pattern", rb_Jaro_pattern, 0); 1860 | rb_define_method(rb_cJaro, "pattern=", rb_Jaro_pattern_set, 1); 1861 | rb_define_method(rb_cJaro, "ignore_case", rb_Jaro_ignore_case, 0); 1862 | rb_define_method(rb_cJaro, "ignore_case=", rb_Jaro_ignore_case_set, 1); 1863 | rb_define_method(rb_cJaro, "match", rb_Jaro_match, 1); 1864 | rb_define_alias(rb_cJaro, "similar", "match"); 1865 | rb_define_method(rb_mAmatchStringMethods, "jaro_similar", rb_str_jaro_similar, 1); 1866 | 1867 | /* Jaro-Winkler */ 1868 | rb_cJaroWinkler = rb_define_class_under(rb_mAmatch, "JaroWinkler", rb_cObject); 1869 | rb_define_alloc_func(rb_cJaroWinkler, rb_JaroWinkler_s_allocate); 1870 | rb_define_method(rb_cJaroWinkler, "initialize", rb_JaroWinkler_initialize, 1); 1871 | rb_define_method(rb_cJaroWinkler, "pattern", rb_JaroWinkler_pattern, 0); 1872 | rb_define_method(rb_cJaroWinkler, "pattern=", rb_JaroWinkler_pattern_set, 1); 1873 | rb_define_method(rb_cJaroWinkler, "ignore_case", 
rb_JaroWinkler_ignore_case, 0); 1874 | rb_define_method(rb_cJaroWinkler, "ignore_case=", rb_JaroWinkler_ignore_case_set, 1); 1875 | rb_define_method(rb_cJaroWinkler, "scaling_factor", rb_JaroWinkler_scaling_factor, 0); 1876 | rb_define_method(rb_cJaroWinkler, "scaling_factor=", rb_JaroWinkler_scaling_factor_set, 1); 1877 | rb_define_method(rb_cJaroWinkler, "match", rb_JaroWinkler_match, 1); 1878 | rb_define_alias(rb_cJaroWinkler, "similar", "match"); 1879 | rb_define_method(rb_mAmatchStringMethods, "jarowinkler_similar", rb_str_jarowinkler_similar, 1); 1880 | 1881 | id_split = rb_intern("split"); 1882 | id_to_f = rb_intern("to_f"); 1883 | } 1884 | -------------------------------------------------------------------------------- /ext/common.h: -------------------------------------------------------------------------------- 1 | #ifndef __COMMON_H__ 2 | # define __COMMON_H__ 3 | 4 | #ifndef RSTRING_PTR 5 | #define RSTRING_PTR(str) (RSTRING(str)->ptr) 6 | #endif 7 | 8 | #ifndef RSTRING_LEN 9 | #define RSTRING_LEN(str) (RSTRING(str)->len) 10 | #endif 11 | 12 | #ifndef RARRAY_PTR 13 | #define RARRAY_PTR(ary) (RARRAY(ary)->ptr) 14 | #endif 15 | 16 | #ifndef RARRAY_LEN 17 | #define RARRAY_LEN(ary) (RARRAY(ary)->len) 18 | #endif 19 | 20 | #ifndef RFLOAT_VALUE 21 | #define RFLOAT_VALUE(val) (RFLOAT(val)->value) 22 | #endif 23 | 24 | 25 | #endif 26 | -------------------------------------------------------------------------------- /ext/extconf.rb: -------------------------------------------------------------------------------- 1 | require 'mkmf' 2 | require 'rbconfig' 3 | 4 | if CONFIG['CC'] =~ /gcc/ 5 | $CFLAGS << ' -Wall' 6 | if ENV['DEBUG'] 7 | $CFLAGS << ' -O0 -ggdb' 8 | else 9 | $CFLAGS << ' -O3' 10 | end 11 | else 12 | $CFLAGS << ' -O3' 13 | end 14 | create_makefile 'amatch_ext' 15 | -------------------------------------------------------------------------------- /ext/pair.c: -------------------------------------------------------------------------------- 1 | #include 
"pair.h" 2 | 3 | #define DEBUG 0 4 | 5 | static int predict_length(VALUE tokens) 6 | { 7 | int i, l, result; 8 | for (i = 0, result = 0; i < RARRAY_LEN(tokens); i++) { 9 | VALUE t = rb_ary_entry(tokens, i); 10 | l = (int) RSTRING_LEN(t) - 1; 11 | if (l > 0) result += l; 12 | } 13 | return result; 14 | } 15 | 16 | PairArray *PairArray_new(VALUE tokens) 17 | { 18 | int i, j, k, len = predict_length(tokens); 19 | PairArray *pair_array = ALLOC(PairArray); 20 | Pair *pairs = ALLOC_N(Pair, len); 21 | MEMZERO(pairs, Pair, len); 22 | pair_array->pairs = pairs; 23 | pair_array->len = len; 24 | for (i = 0, k = 0; i < RARRAY_LEN(tokens); i++) { 25 | VALUE t = rb_ary_entry(tokens, i); 26 | char *string = RSTRING_PTR(t); 27 | for (j = 0; j < RSTRING_LEN(t) - 1; j++) { 28 | pairs[k].fst = string[j]; 29 | pairs[k].snd = string[j + 1]; 30 | pairs[k].status = PAIR_ACTIVE; 31 | k++; 32 | } 33 | } 34 | return pair_array; 35 | } 36 | 37 | void pair_array_reactivate(PairArray *self) 38 | { 39 | int i; 40 | for (i = 0; i < self->len; i++) { 41 | self->pairs[i].status = PAIR_ACTIVE; 42 | } 43 | } 44 | 45 | double pair_array_match(PairArray *self, PairArray *other) 46 | { 47 | int i, j, matches = 0; 48 | int sum = self->len + other->len; 49 | if (sum == 0) return 1.0; 50 | for (i = 0; i < self->len; i++) { 51 | for (j = 0; j < other->len; j++) { 52 | #if DEBUG 53 | pair_print(self->pairs[i]); 54 | putc(' ', stdout); 55 | pair_print(other->pairs[j]); 56 | printf(" -> %d\n", pair_equal(self->pairs[i], other->pairs[j])); 57 | #endif 58 | if (pair_equal(self->pairs[i], other->pairs[j])) { 59 | matches++; 60 | other->pairs[j].status = PAIR_INACTIVE; 61 | break; 62 | } 63 | } 64 | } 65 | return ((double) (2 * matches)) / sum; 66 | } 67 | 68 | void pair_print(Pair pair) 69 | { 70 | printf("%c%c (%d)", pair.fst, pair.snd, pair.status); 71 | } 72 | 73 | void pair_array_destroy(PairArray *pair_array) 74 | { 75 | if (pair_array->pairs) { 76 | xfree(pair_array->pairs); 77 | } 78 | xfree(pair_array); 
79 | } 80 | -------------------------------------------------------------------------------- /ext/pair.h: -------------------------------------------------------------------------------- 1 | #ifndef PAIR_H_INCLUDED 2 | #define PAIR_H_INCLUDED 3 | 4 | #include "ruby.h" 5 | #include "common.h" 6 | 7 | enum { PAIR_ACTIVE = 1, PAIR_INACTIVE = 2 }; 8 | 9 | typedef struct PairStruct { 10 | char fst; 11 | char snd; 12 | char status; 13 | char __align; 14 | } Pair; 15 | 16 | typedef struct PairArrayStruct { 17 | Pair *pairs; 18 | int len; 19 | } PairArray; 20 | 21 | PairArray *PairArray_new(VALUE tokens); 22 | #define pair_equal(a, b) \ 23 | ((a).fst == (b).fst && (a).snd == (b).snd && ((a).status & (b).status & PAIR_ACTIVE)) 24 | double pair_array_match(PairArray *self, PairArray *other); 25 | void pair_array_destroy(PairArray *pair_array); 26 | void pair_print(Pair pair); 27 | void pair_array_reactivate(PairArray *self); 28 | 29 | #endif 30 | -------------------------------------------------------------------------------- /images/amatch_ext.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/flori/amatch/39b54a846763bf817170bfd4f220671e62d5adf4/images/amatch_ext.png -------------------------------------------------------------------------------- /install.rb: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env ruby 2 | 3 | require 'rbconfig' 4 | include RbConfig 5 | require 'fileutils' 6 | include FileUtils::Verbose 7 | 8 | MAKE = ENV['MAKE'] || %w[gmake make].find { |c| system(c, '-v') } 9 | 10 | bindir = CONFIG['bindir'] 11 | archdir = CONFIG['sitearchdir'] 12 | libdir = CONFIG['sitelibdir'] 13 | dlext = CONFIG['DLEXT'] 14 | cd 'ext' do 15 | system 'ruby extconf.rb' or exit 1 16 | system "#{MAKE}" or exit 1 17 | mkdir_p archdir 18 | install "amatch.#{dlext}", archdir 19 | end 20 | cd 'bin' do 21 | filename = 'edit_json.rb' 22 | install('agrep', bindir) 23 
| end 24 | cd 'lib/amatch' do 25 | mkdir_p d = File.join(libdir, 'amatch') 26 | install 'version.rb', d 27 | end 28 | warn " *** Installed amatch extension." 29 | -------------------------------------------------------------------------------- /lib/amatch.rb: -------------------------------------------------------------------------------- 1 | module Amatch 2 | end 3 | require 'amatch/rude' 4 | -------------------------------------------------------------------------------- /lib/amatch/.keep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/flori/amatch/39b54a846763bf817170bfd4f220671e62d5adf4/lib/amatch/.keep -------------------------------------------------------------------------------- /lib/amatch/polite.rb: -------------------------------------------------------------------------------- 1 | module Amatch 2 | end 3 | require 'amatch_ext' 4 | module Amatch 5 | # alias 6 | DiceCoefficient = PairDistance 7 | end 8 | -------------------------------------------------------------------------------- /lib/amatch/rude.rb: -------------------------------------------------------------------------------- 1 | require 'amatch_ext' 2 | module Amatch 3 | DiceCoefficient = PairDistance 4 | end 5 | class ::String 6 | include ::Amatch::StringMethods 7 | end 8 | -------------------------------------------------------------------------------- /lib/amatch/version.rb: -------------------------------------------------------------------------------- 1 | module Amatch 2 | # Amatch version 3 | VERSION = '0.4.1' 4 | VERSION_ARRAY = VERSION.split('.').map(&:to_i) # :nodoc: 5 | VERSION_MAJOR = VERSION_ARRAY[0] # :nodoc: 6 | VERSION_MINOR = VERSION_ARRAY[1] # :nodoc: 7 | VERSION_BUILD = VERSION_ARRAY[2] # :nodoc: 8 | end 9 | -------------------------------------------------------------------------------- /tests/test_damerau_levenshtein.rb: -------------------------------------------------------------------------------- 1 | 
require 'test/unit' 2 | require 'amatch' 3 | 4 | class TestDamerauLevenshtein < Test::Unit::TestCase 5 | include Amatch 6 | 7 | def setup 8 | @d = 0.000001 9 | @empty = DamerauLevenshtein.new('') 10 | @simple = DamerauLevenshtein.new('test') 11 | @long = DamerauLevenshtein.new('A' * 160) 12 | end 13 | 14 | def test_version 15 | assert_kind_of String, Amatch::VERSION 16 | end 17 | 18 | def test_match 19 | assert_equal 4, @simple.match('') 20 | assert_equal 0, @simple.match('test') 21 | assert_equal 1, @simple.match('testa') 22 | assert_equal 1, @simple.match('atest') 23 | assert_equal 1, @simple.match('teast') 24 | assert_equal 1, @simple.match('est') 25 | assert_equal 1, @simple.match('tes') 26 | assert_equal 1, @simple.match('tst') 27 | assert_equal 1, @simple.match('best') 28 | assert_equal 1, @simple.match('tost') 29 | assert_equal 1, @simple.match('tesa') 30 | assert_equal 3, @simple.match('taex') 31 | assert_equal 6, @simple.match('aaatestbbb') 32 | assert_equal 1, @simple.match('tset') 33 | end 34 | 35 | def test_search 36 | assert_equal 4, @simple.search('') 37 | assert_equal 0, @empty.search('') 38 | assert_equal 0, @empty.search('test') 39 | assert_equal 0, @simple.search('aaatestbbb') 40 | assert_equal 3, @simple.search('aaataexbbb') 41 | assert_equal 4, @simple.search('aaaaaaaaa') 42 | end 43 | 44 | def test_array_result 45 | assert_equal [1, 0], @simple.match(["tets", "test"]) 46 | assert_equal [1, 0], @simple.search(["tetsaaa", "testaaa"]) 47 | assert_raises(TypeError) { @simple.match([:foo, "bar"]) } 48 | end 49 | 50 | def test_pattern_setting 51 | assert_raises(TypeError) { @simple.pattern = :something } 52 | assert_equal 0, @simple.match('test') 53 | @simple.pattern = '' 54 | assert_equal 4, @simple.match('test') 55 | @simple.pattern = 'test' 56 | assert_equal 0, @simple.match('test') 57 | end 58 | 59 | def test_similar 60 | assert_in_delta 1, @empty.similar(''), @d 61 | assert_in_delta 0, @empty.similar('not empty'), @d 62 | assert_in_delta 0.0, 
@simple.similar(''), @d 63 | assert_in_delta 1.0, @simple.similar('test'), @d 64 | assert_in_delta 0.8, @simple.similar('testa'), @d 65 | assert_in_delta 0.8, @simple.similar('atest'), @d 66 | assert_in_delta 0.8, @simple.similar('teast'), @d 67 | assert_in_delta 0.75, @simple.similar('est'), @d 68 | assert_in_delta 0.75, @simple.similar('tes'), @d 69 | assert_in_delta 0.75, @simple.similar('tst'), @d 70 | assert_in_delta 0.75, @simple.similar('best'), @d 71 | assert_in_delta 0.75, @simple.similar('tost'), @d 72 | assert_in_delta 0.75, @simple.similar('tesa'), @d 73 | assert_in_delta 0.25, @simple.similar('taex'), @d 74 | assert_in_delta 0.4, @simple.similar('aaatestbbb'), @d 75 | assert_in_delta 0.75, @simple.pattern.damerau_levenshtein_similar('est'), @d 76 | end 77 | 78 | def test_transpositions 79 | assert_in_delta 1.0, 'atestatest'.damerau_levenshtein_similar('atestatest'), @d 80 | assert_in_delta 0.9, 'atestatest'.damerau_levenshtein_similar('atetsatest'), @d 81 | assert_in_delta 0.8, 'atestatest'.damerau_levenshtein_similar('atetsatset'), @d 82 | end 83 | 84 | def test_long 85 | assert_in_delta 1.0, @long.similar(@long.pattern), @d 86 | end 87 | 88 | def test_long2 89 | a = "lost this fantasy, this fantasy, this fantasy, this fantasy, this fantasy, this fantasy\r\n\r\nGood love Neat work\r\n\r\nSuper job Fancy work\r\n\r\nPants job Cool work" 90 | b = "lost\r\n\r\nGood love Neat work\r\n\r\nSuper job Fancy work\r\n\r\nPants job Cool work" 91 | assert a.damerau_levenshtein_similar(b) 92 | end 93 | end 94 | -------------------------------------------------------------------------------- /tests/test_hamming.rb: -------------------------------------------------------------------------------- 1 | require 'test/unit' 2 | require 'amatch' 3 | 4 | class TestHamming < Test::Unit::TestCase 5 | include Amatch 6 | 7 | D = 0.000001 8 | 9 | def setup 10 | @small = Hamming.new('test') 11 | @empty = Hamming.new('') 12 | @long = Hamming.new('A' * 160) 13 | end 14 | 15 | def 
test_empty 16 | assert_in_delta 0, @empty.match(''), D 17 | assert_in_delta 9, @empty.match('not empty'), D 18 | assert_in_delta 1, @empty.similar(''), D 19 | assert_in_delta 0, @empty.similar('not empty'), D 20 | end 21 | 22 | def test_small_match 23 | assert_in_delta 4, @small.match(''), D 24 | assert_in_delta 0, @small.match('test'), D 25 | assert_in_delta 1, @small.match('testa'), D 26 | assert_in_delta 5, @small.match('atest'), D 27 | assert_in_delta 3, @small.match('teast'), D 28 | assert_in_delta 4, @small.match('est'), D 29 | assert_in_delta 1, @small.match('tes'), D 30 | assert_in_delta 3, @small.match('tst'), D 31 | assert_in_delta 1, @small.match('best'), D 32 | assert_in_delta 1, @small.match('tost'), D 33 | assert_in_delta 1, @small.match('tesa'), D 34 | assert_in_delta 3, @small.match('taex'), D 35 | assert_in_delta 9, @small.match('aaatestbbb'), D 36 | end 37 | 38 | def test_small_similar 39 | assert_in_delta 0.0, @small.similar(''), D 40 | assert_in_delta 1.0, @small.similar('test'), D 41 | assert_in_delta 0.8, @small.similar('testa'), D 42 | assert_in_delta 0.0, @small.similar('atest'), D 43 | assert_in_delta 0.4, @small.similar('teast'), D 44 | assert_in_delta 0, @small.similar('est'), D 45 | assert_in_delta 0.75, @small.similar('tes'), D 46 | assert_in_delta 0.25, @small.similar('tst'), D 47 | assert_in_delta 0.75, @small.similar('best'), D 48 | assert_in_delta 0.75, @small.similar('tost'), D 49 | assert_in_delta 0.75, @small.similar('tesa'), D 50 | assert_in_delta 0.25, @small.similar('taex'), D 51 | assert_in_delta 0.1, @small.similar('aaatestbbb'), D 52 | assert_in_delta 0.8, @small.pattern.hamming_similar('testa'), D 53 | end 54 | 55 | def test_long 56 | assert_in_delta 1.0, @long.similar(@long.pattern), D 57 | end 58 | end 59 | -------------------------------------------------------------------------------- /tests/test_jaro.rb: -------------------------------------------------------------------------------- 1 | require 'test/unit' 2 | 
require 'amatch' 3 | 4 | class TestJaro < Test::Unit::TestCase 5 | include Amatch 6 | 7 | D = 0.0005 8 | 9 | def setup 10 | @martha = Jaro.new('Martha') 11 | @dwayne = Jaro.new('dwayne') 12 | @dixon = Jaro.new('DIXON') 13 | @one = Jaro.new('one') 14 | end 15 | 16 | def test_case 17 | @martha.ignore_case = true 18 | assert_in_delta 0.944, @martha.match('MARHTA'), D 19 | @martha.ignore_case = false 20 | assert_in_delta 0.444, @martha.match('MARHTA'), D 21 | end 22 | 23 | def test_match 24 | assert_in_delta 0.944, @martha.match('MARHTA'), D 25 | assert_in_delta 0.822, @dwayne.match('DUANE'), D 26 | assert_in_delta 0.767, @dixon.match('DICKSONX'), D 27 | assert_in_delta 0.667, @one.match('orange'), D 28 | end 29 | end 30 | -------------------------------------------------------------------------------- /tests/test_jaro_winkler.rb: -------------------------------------------------------------------------------- 1 | require 'test/unit' 2 | require 'amatch' 3 | 4 | class TestJaroWinkler < Test::Unit::TestCase 5 | include Amatch 6 | 7 | D = 0.0005 8 | 9 | def setup 10 | @martha = JaroWinkler.new('Martha') 11 | @dwayne = JaroWinkler.new('dwayne') 12 | @dixon = JaroWinkler.new('DIXON') 13 | @one = JaroWinkler.new("one") 14 | end 15 | 16 | def test_case 17 | @martha.ignore_case = true 18 | assert_in_delta 0.961, @martha.match('MARHTA'), D 19 | @martha.ignore_case = false 20 | assert_in_delta 0.500, @martha.match('MARHTA'), D 21 | end 22 | 23 | def test_match 24 | assert_in_delta 0.961, @martha.match('MARHTA'), D 25 | assert_in_delta 0.840, @dwayne.match('DUANE'), D 26 | assert_in_delta 0.813, @dixon.match('DICKSONX'), D 27 | assert_in_delta 0, @one.match('two'), D 28 | assert_in_delta 0.700, @one.match('orange'), D 29 | end 30 | 31 | def test_scaling_factor 32 | assert_in_delta 0.1, @martha.scaling_factor, 0.0000001 33 | @martha.scaling_factor = 0.2 34 | assert_in_delta 0.978, @martha.match('MARHTA'), D 35 | @martha.scaling_factor = 0.5 # this is far too high 36 | 
assert_in_delta 1.028, @martha.match('MARHTA'), D 37 | end 38 | end 39 | -------------------------------------------------------------------------------- /tests/test_levenshtein.rb: -------------------------------------------------------------------------------- 1 | require 'test/unit' 2 | require 'amatch' 3 | 4 | class TestLevenshtein < Test::Unit::TestCase 5 | include Amatch 6 | 7 | def setup 8 | @d = 0.000001 9 | @empty = Levenshtein.new('') 10 | @simple = Levenshtein.new('test') 11 | @long = Levenshtein.new('A' * 160) 12 | end 13 | 14 | def test_version 15 | assert_kind_of String, Amatch::VERSION 16 | end 17 | 18 | def test_match 19 | assert_equal 4, @simple.match('') 20 | assert_equal 0, @simple.match('test') 21 | assert_equal 1, @simple.match('testa') 22 | assert_equal 1, @simple.match('atest') 23 | assert_equal 1, @simple.match('teast') 24 | assert_equal 1, @simple.match('est') 25 | assert_equal 1, @simple.match('tes') 26 | assert_equal 1, @simple.match('tst') 27 | assert_equal 1, @simple.match('best') 28 | assert_equal 1, @simple.match('tost') 29 | assert_equal 1, @simple.match('tesa') 30 | assert_equal 3, @simple.match('taex') 31 | assert_equal 6, @simple.match('aaatestbbb') 32 | end 33 | 34 | def test_search 35 | assert_equal 4, @simple.search('') 36 | assert_equal 0, @empty.search('') 37 | assert_equal 0, @empty.search('test') 38 | assert_equal 0, @simple.search('aaatestbbb') 39 | assert_equal 3, @simple.search('aaataexbbb') 40 | assert_equal 4, @simple.search('aaaaaaaaa') 41 | end 42 | 43 | def test_array_result 44 | assert_equal [2, 0], @simple.match(["tets", "test"]) 45 | assert_equal [1, 0], @simple.search(["tetsaaa", "testaaa"]) 46 | assert_raises(TypeError) { @simple.match([:foo, "bar"]) } 47 | end 48 | 49 | def test_pattern_setting 50 | assert_raises(TypeError) { @simple.pattern = :something } 51 | assert_equal 0, @simple.match('test') 52 | @simple.pattern = '' 53 | assert_equal 4, @simple.match('test') 54 | @simple.pattern = 'test' 55 | 
assert_equal 0, @simple.match('test') 56 | end 57 | 58 | def test_similar 59 | assert_in_delta 1, @empty.similar(''), @d 60 | assert_in_delta 0, @empty.similar('not empty'), @d 61 | assert_in_delta 0.0, @simple.similar(''), @d 62 | assert_in_delta 1.0, @simple.similar('test'), @d 63 | assert_in_delta 0.8, @simple.similar('testa'), @d 64 | assert_in_delta 0.8, @simple.similar('atest'), @d 65 | assert_in_delta 0.8, @simple.similar('teast'), @d 66 | assert_in_delta 0.75, @simple.similar('est'), @d 67 | assert_in_delta 0.75, @simple.similar('tes'), @d 68 | assert_in_delta 0.75, @simple.similar('tst'), @d 69 | assert_in_delta 0.75, @simple.similar('best'), @d 70 | assert_in_delta 0.75, @simple.similar('tost'), @d 71 | assert_in_delta 0.75, @simple.similar('tesa'), @d 72 | assert_in_delta 0.25, @simple.similar('taex'), @d 73 | assert_in_delta 0.4, @simple.similar('aaatestbbb'), @d 74 | assert_in_delta 0.75, @simple.pattern.levenshtein_similar('est'), @d 75 | end 76 | 77 | def test_long 78 | assert_in_delta 1.0, @long.similar(@long.pattern), @d 79 | end 80 | 81 | def test_long2 82 | a = "lost this fantasy, this fantasy, this fantasy, this fantasy, this fantasy, this fantasy\r\n\r\nGood love Neat work\r\n\r\nSuper job Fancy work\r\n\r\nPants job Cool work" 83 | b = "lost\r\n\r\nGood love Neat work\r\n\r\nSuper job Fancy work\r\n\r\nPants job Cool work" 84 | assert a.levenshtein_similar(b) 85 | end 86 | end 87 | -------------------------------------------------------------------------------- /tests/test_longest_subsequence.rb: -------------------------------------------------------------------------------- 1 | require 'test/unit' 2 | require 'amatch' 3 | 4 | class TestLongestSubsequence < Test::Unit::TestCase 5 | include Amatch 6 | 7 | D = 0.000001 8 | 9 | def setup 10 | @small = LongestSubsequence.new('test') 11 | @empty = LongestSubsequence.new('') 12 | @long = LongestSubsequence.new('A' * 160) 13 | end 14 | 15 | def test_empty_subsequence 16 | assert_equal 0, 
@empty.match('') 17 | assert_equal 0, @empty.match('a') 18 | assert_equal 0, @small.match('') 19 | assert_equal 0, @empty.match('not empty') 20 | end 21 | 22 | def test_small_subsequence 23 | assert_equal 4, @small.match('test') 24 | assert_equal 4, @small.match('testa') 25 | assert_equal 4, @small.match('atest') 26 | assert_equal 4, @small.match('teast') 27 | assert_equal 3, @small.match('est') 28 | assert_equal 3, @small.match('tes') 29 | assert_equal 3, @small.match('tst') 30 | assert_equal 3, @small.match('best') 31 | assert_equal 3, @small.match('tost') 32 | assert_equal 3, @small.match('tesa') 33 | assert_equal 2, @small.match('taex') 34 | assert_equal 1, @small.match('aaatbbb') 35 | assert_equal 1, @small.match('aaasbbb') 36 | assert_equal 4, @small.match('aaatestbbb') 37 | end 38 | 39 | def test_similar 40 | assert_in_delta 1, @empty.similar(''), D 41 | assert_in_delta 0, @empty.similar('not empty'), D 42 | assert_in_delta 0.0, @small.similar(''), D 43 | assert_in_delta 1.0, @small.similar('test'), D 44 | assert_in_delta 0.8, @small.similar('testa'), D 45 | assert_in_delta 0.8, @small.similar('atest'), D 46 | assert_in_delta 0.8, @small.similar('teast'), D 47 | assert_in_delta 0.75, @small.similar('est'), D 48 | assert_in_delta 0.75, @small.similar('tes'), D 49 | assert_in_delta 0.75, @small.similar('tst'), D 50 | assert_in_delta 0.75, @small.similar('best'), D 51 | assert_in_delta 0.75, @small.similar('tost'), D 52 | assert_in_delta 0.75, @small.similar('tesa'), D 53 | assert_in_delta 0.50, @small.similar('taex'), D 54 | assert_in_delta 0.4, @small.similar('aaatestbbb'), D 55 | assert_in_delta 0.75, @small.pattern.longest_subsequence_similar('est'), D 56 | end 57 | 58 | def test_long 59 | assert_in_delta 1.0, @long.similar(@long.pattern), D 60 | end 61 | end 62 | -------------------------------------------------------------------------------- /tests/test_longest_substring.rb: -------------------------------------------------------------------------------- 
1 | require 'test/unit' 2 | require 'amatch' 3 | 4 | class TestLongestSubstring < Test::Unit::TestCase 5 | include Amatch 6 | 7 | D = 0.000001 8 | 9 | def setup 10 | @small = LongestSubstring.new('test') 11 | @empty = LongestSubstring.new('') 12 | @long = LongestSubstring.new('A' * 160) 13 | end 14 | 15 | def test_empty_substring 16 | assert_in_delta 0, @empty.match(''), D 17 | assert_in_delta 0, @empty.match('a'), D 18 | assert_in_delta 0, @small.match(''), D 19 | assert_in_delta 0, @empty.match('not empty'), D 20 | end 21 | 22 | def test_small_substring 23 | assert_in_delta 4, @small.match('test'), D 24 | assert_in_delta 4, @small.match('testa'), D 25 | assert_in_delta 4, @small.match('atest'), D 26 | assert_in_delta 2, @small.match('teast'), D 27 | assert_in_delta 3, @small.match('est'), D 28 | assert_in_delta 3, @small.match('tes'), D 29 | assert_in_delta 2, @small.match('tst'), D 30 | assert_in_delta 3, @small.match('best'), D 31 | assert_in_delta 2, @small.match('tost'), D 32 | assert_in_delta 3, @small.match('tesa'), D 33 | assert_in_delta 1, @small.match('taex'), D 34 | assert_in_delta 1, @small.match('aaatbbb'), D 35 | assert_in_delta 1, @small.match('aaasbbb'), D 36 | assert_in_delta 4, @small.match('aaatestbbb'), D 37 | end 38 | 39 | def test_similar 40 | assert_in_delta 1, @empty.similar(''), D 41 | assert_in_delta 0, @empty.similar('not empty'), D 42 | assert_in_delta 0.0, @small.similar(''), D 43 | assert_in_delta 1.0, @small.similar('test'), D 44 | assert_in_delta 0.8, @small.similar('testa'), D 45 | assert_in_delta 0.8, @small.similar('atest'), D 46 | assert_in_delta 0.4, @small.similar('teast'), D 47 | assert_in_delta 0.75, @small.similar('est'), D 48 | assert_in_delta 0.75, @small.similar('tes'), D 49 | assert_in_delta 0.5, @small.similar('tst'), D 50 | assert_in_delta 0.75, @small.similar('best'), D 51 | assert_in_delta 0.5, @small.similar('tost'), D 52 | assert_in_delta 0.75, @small.similar('tesa'), D 53 | assert_in_delta 0.25, 
@small.similar('taex'), D 54 | assert_in_delta 0.4, @small.similar('aaatestbbb'), D 55 | assert_in_delta 0.75, @small.pattern.longest_substring_similar('est'), D 56 | end 57 | 58 | def test_long 59 | assert_in_delta 1.0, @long.similar(@long.pattern), D 60 | end 61 | end 62 | -------------------------------------------------------------------------------- /tests/test_pair_distance.rb: -------------------------------------------------------------------------------- 1 | require 'test/unit' 2 | require 'amatch' 3 | 4 | class TestPairDistance < Test::Unit::TestCase 5 | include Amatch 6 | 7 | D = 0.000001 8 | 9 | def setup 10 | @single = PairDistance.new('test') 11 | @empty = PairDistance.new('') 12 | @france = PairDistance.new('republic of france') 13 | @germany = PairDistance.new('federal republic of germany') 14 | @csv = PairDistance.new('foo,bar,baz') 15 | @long = PairDistance.new('A' * 160) 16 | end 17 | 18 | def test_alternative_constant 19 | assert_equal PairDistance, DiceCoefficient 20 | end 21 | 22 | def test_empty 23 | assert_in_delta 1, @empty.match(''), D 24 | assert_in_delta 0, @empty.match('not empty'), D 25 | assert_in_delta 1, @empty.similar(''), D 26 | assert_in_delta 0, @empty.similar('not empty'), D 27 | end 28 | 29 | def test_countries 30 | assert_in_delta 0.5555555, @france.match('france'), D 31 | assert_in_delta 0.1052631, @france.match('germany'), D 32 | assert_in_delta 0.4615384, @germany.match('germany'), D 33 | assert_in_delta 0.16, @germany.match('france'), D 34 | assert_in_delta 0.6829268, 35 | @germany.match('german democratic republic'), D 36 | assert_in_delta 0.72, 37 | @france.match('french republic'), D 38 | assert_in_delta 0.4375, 39 | @germany.match('french republic'), D 40 | assert_in_delta 0.5294117, 41 | @france.match('german democratic republic'), D 42 | end 43 | 44 | def test_single 45 | assert_in_delta 0, @single.match(''), D 46 | assert_in_delta 1, @single.match('test'), D 47 | assert_in_delta 0.8571428, @single.match('testa'), D 
48 | assert_in_delta 0.8571428, @single.match('atest'), D 49 | assert_in_delta 0.5714285, @single.match('teast'), D 50 | assert_in_delta 0.8, @single.match('est'), D 51 | assert_in_delta 0.8, @single.match('tes'), D 52 | assert_in_delta 0.4, @single.match('tst'), D 53 | assert_in_delta 0.6666666, @single.match('best'), D 54 | assert_in_delta 0.3333333, @single.match('tost'), D 55 | assert_in_delta 0.6666666, @single.match('tesa'), D 56 | assert_in_delta 0.0, @single.match('taex'), D 57 | assert_in_delta 0.5, @single.match('aaatestbbb'), D 58 | assert_in_delta 0.6, @single.match('aaa test bbb'), D 59 | assert_in_delta 0.6, @single.match('test aaa bbb'), D 60 | assert_in_delta 0.6, @single.match('bbb aaa test'), D 61 | assert_in_delta 0.8571428, @single.pattern.pair_distance_similar('atest'), D 62 | assert_in_delta 1.0, @france.pattern.pair_distance_similar('of france, republic', /[, ]+/), D 63 | assert_in_delta 0.9230769, @france.pattern.pair_distance_similar('of france, republik', /[, ]+/), D 64 | end 65 | 66 | def test_csv 67 | assert_in_delta 0, @csv.match('', /,/), D 68 | assert_in_delta 0.5, @csv.match('foo', /,/), D 69 | assert_in_delta 0.5, @csv.match('bar', /,/), D 70 | assert_in_delta 0.5, @csv.match('baz', /,/), D 71 | assert_in_delta 0.8, @csv.match('foo,bar', /,/), D 72 | assert_in_delta 0.8, @csv.match('bar,foo', /,/), D 73 | assert_in_delta 0.8, @csv.match('bar,baz', /,/), D 74 | assert_in_delta 0.8, @csv.match('baz,bar', /,/), D 75 | assert_in_delta 0.8, @csv.match('foo,baz', /,/), D 76 | assert_in_delta 0.8, @csv.match('baz,foo', /,/), D 77 | assert_in_delta 1, @csv.match('foo,bar,baz', /,/), D 78 | assert_in_delta 1, @csv.match('foo,baz,bar', /,/), D 79 | assert_in_delta 1, @csv.match('baz,foo,bar', /,/), D 80 | assert_in_delta 1, @csv.match('baz,bar,foo', /,/), D 81 | assert_in_delta 1, @csv.match('bar,foo,baz', /,/), D 82 | assert_in_delta 1, @csv.match('bar,baz,foo', /,/), D 83 | assert_in_delta 1, @csv.match('foo,bar,baz', nil), D 84 | 
assert_in_delta 0.9, @csv.match('foo,baz,bar', nil), D 85 | assert_in_delta 0.9, @csv.match('foo,baz,bar'), D 86 | assert_in_delta 0.9, @csv.similar('foo,baz,bar'), D 87 | end 88 | 89 | def test_long 90 | assert_in_delta 1.0, @long.similar(@long.pattern), D 91 | end 92 | end 93 | -------------------------------------------------------------------------------- /tests/test_sellers.rb: -------------------------------------------------------------------------------- 1 | require 'test/unit' 2 | require 'amatch' 3 | 4 | class TestSellers < Test::Unit::TestCase 5 | include Amatch 6 | 7 | def setup 8 | @d = 0.000001 9 | @empty = Sellers.new('') 10 | @simple = Sellers.new('test') 11 | @long = Sellers.new('A' * 160) 12 | end 13 | 14 | def test_weights 15 | assert_in_delta 1, @simple.substitution, @d 16 | assert_in_delta 1, @simple.insertion, @d 17 | assert_in_delta 1, @simple.deletion, @d 18 | @simple.insertion = 1 19 | @simple.substitution = @simple.deletion = 1000 20 | assert_in_delta 1, @simple.match('tst'), @d 21 | assert_in_delta 1, @simple.search('bbbtstccc'), @d 22 | @simple.deletion = 1 23 | @simple.substitution = @simple.insertion = 1000 24 | assert_in_delta 1, @simple.match('tedst'), @d 25 | assert_in_delta 1, @simple.search('bbbtedstccc'), @d 26 | @simple.substitution = 1 27 | @simple.deletion = @simple.insertion = 1000 28 | assert_in_delta 1, @simple.match('tast'), @d 29 | assert_in_delta 1, @simple.search('bbbtastccc'), @d 30 | @simple.insertion = 0.5 31 | @simple.substitution = @simple.deletion = 1000 32 | assert_in_delta 0.5, @simple.match('tst'), @d 33 | assert_in_delta 0.5, @simple.search('bbbtstccc'), @d 34 | @simple.deletion = 0.5 35 | @simple.substitution = @simple.insertion = 1000 36 | assert_in_delta 0.5, @simple.match('tedst'), @d 37 | assert_in_delta 0.5, @simple.search('bbbtedstccc'), @d 38 | @simple.substitution = 0.5 39 | @simple.deletion = @simple.insertion = 1000 40 | assert_in_delta 0.5, @simple.match('tast'), @d 41 | assert_in_delta 0.5, 
@simple.search('bbbtastccc'), @d 42 | @simple.reset_weights 43 | assert_in_delta 1, @simple.substitution, @d 44 | assert_in_delta 1, @simple.insertion, @d 45 | assert_in_delta 1, @simple.deletion, @d 46 | end 47 | 48 | def test_weight_exceptions 49 | assert_raises(TypeError) { @simple.substitution = :something } 50 | assert_raises(TypeError) { @simple.insertion = :something } 51 | assert_raises(TypeError) { @simple.deletion = :something } 52 | end 53 | 54 | def test_similar 55 | assert_in_delta 0.0, @simple.similar(''), @d 56 | assert_in_delta 1.0, @simple.similar('test'), @d 57 | assert_in_delta 0.8, @simple.similar('testa'), @d 58 | assert_in_delta 0.8, @simple.similar('atest'), @d 59 | assert_in_delta 0.8, @simple.similar('teast'), @d 60 | assert_in_delta 0.75, @simple.similar('est'), @d 61 | assert_in_delta 0.75, @simple.similar('tes'), @d 62 | assert_in_delta 0.75, @simple.similar('tst'), @d 63 | assert_in_delta 0.75, @simple.similar('best'), @d 64 | assert_in_delta 0.75, @simple.similar('tost'), @d 65 | assert_in_delta 0.75, @simple.similar('tesa'), @d 66 | assert_in_delta 0.25, @simple.similar('taex'), @d 67 | assert_in_delta 0.4, @simple.similar('aaatestbbb'), @d 68 | assert_in_delta 0.75, @simple.pattern.levenshtein_similar('est'), @d 69 | end 70 | 71 | def test_similar2 72 | assert_in_delta 1, @empty.similar(''), @d 73 | assert_in_delta 0, @empty.similar('not empty'), @d 74 | assert_in_delta 0.0, @simple.similar(''), @d 75 | assert_in_delta 1.0, @simple.similar('test'), @d 76 | assert_in_delta 0.8, @simple.similar('testa'), @d 77 | assert_in_delta 0.8, @simple.similar('atest'), @d 78 | assert_in_delta 0.8, @simple.similar('teast'), @d 79 | assert_in_delta 0.75, @simple.similar('est'), @d 80 | assert_in_delta 0.75, @simple.similar('tes'), @d 81 | assert_in_delta 0.75, @simple.similar('tst'), @d 82 | assert_in_delta 0.75, @simple.similar('best'), @d 83 | assert_in_delta 0.75, @simple.similar('tost'), @d 84 | assert_in_delta 0.75, @simple.similar('tesa'), @d 
85 | assert_in_delta 0.25, @simple.similar('taex'), @d 86 | assert_in_delta 0.4, @simple.similar('aaatestbbb'), @d 87 | @simple.insertion = 1 88 | @simple.substitution = @simple.deletion = 2 89 | assert_in_delta 0.875, @simple.similar('tst'), @d 90 | end 91 | 92 | def test_long 93 | assert_in_delta 1.0, @long.similar(@long.pattern), @d 94 | end 95 | end 96 | --------------------------------------------------------------------------------