├── .Rbuildignore ├── .github ├── ISSUE_TEMPLATE │ ├── config.yml │ └── issue_template.md └── workflows │ ├── issue.yml │ ├── stale-actions.yml │ └── tic.yml ├── .gitignore ├── DESCRIPTION ├── Dockerfile ├── LICENSE.md ├── NAMESPACE ├── NEWS.md ├── R ├── RcppExports.R ├── RcppModule.R ├── nmslib.R └── package.R ├── README.md ├── codecov.yml ├── inst ├── CITATION └── Non_Metric_Space_Library_(NMSLIB)_Manual.pdf ├── man ├── KernelKnnCV_nmslib.Rd ├── KernelKnn_nmslib.Rd ├── NMSlib.Rd ├── TO_scipy_sparse.Rd ├── import_internal.Rd ├── inner_kernel_function.Rd └── mat_2scipy_sparse.Rd ├── src ├── Makevars ├── Makevars.win ├── RcppExports.cpp ├── init.c └── utils.cpp ├── tests ├── testthat.R └── testthat │ ├── helper-init.R │ ├── helper-skip.R │ ├── setup.R │ └── test-nmslibR_pkg.R ├── tic.R └── vignettes └── the_nmslibR_package.Rmd /.Rbuildignore: -------------------------------------------------------------------------------- 1 | ^codecov\.yml$ 2 | ^.git$ 3 | ^.github$ 4 | ^\.ccache$ 5 | ^\.github$ 6 | ^tic\.R$ 7 | ^Dockerfile$ 8 | ^LICENSE.md$ 9 | -------------------------------------------------------------------------------- /.github/ISSUE_TEMPLATE/config.yml: -------------------------------------------------------------------------------- 1 | # For more info see: https://docs.github.com/en/github/building-a-strong-community/configuring-issue-templates-for-your-repository#configuring-the-template-chooser 2 | 3 | blank_issues_enabled: true 4 | -------------------------------------------------------------------------------- /.github/ISSUE_TEMPLATE/issue_template.md: -------------------------------------------------------------------------------- 1 | --- 2 | name: Bug report or feature request 3 | about: Describe a bug you've encountered or make a case for a new feature 4 | --- 5 | 6 | Please briefly describe your problem and what output you expect. 
If you have a question, you also have the option of (but I'm flexible if it's not too complicated) 7 | 8 | Please include a minimal reproducible example 9 | 10 | Please give a brief description of the problem 11 | 12 | Please add your Operating System (e.g., Windows10, Macintosh, Linux) and the R version that you use (e.g., 3.6.2) 13 | 14 | If my package uses Python (via 'reticulate') then please add also the Python version (e.g., Python 3.8) and the 'reticulate' version (e.g., 1.18.0) 15 | -------------------------------------------------------------------------------- /.github/workflows/issue.yml: -------------------------------------------------------------------------------- 1 | # For more info see: https://github.com/Renato66/auto-label 2 | # for the 'secrets.GITHUB_TOKEN' see: https://docs.github.com/en/actions/reference/authentication-in-a-workflow#about-the-github_token-secret 3 | 4 | name: Labeling new issue 5 | on: 6 | issues: 7 | types: ['opened'] 8 | jobs: 9 | build: 10 | runs-on: ubuntu-latest 11 | steps: 12 | - uses: Renato66/auto-label@v2 13 | with: 14 | repo-token: ${{ secrets.GITHUB_TOKEN }} 15 | ignore-comments: true 16 | labels-synonyms: '{"bug":["error","need fix","not working"],"enhancement":["upgrade"],"question":["help"]}' 17 | labels-not-allowed: '["good first issue"]' 18 | default-labels: '["help wanted"]' 19 | -------------------------------------------------------------------------------- /.github/workflows/stale-actions.yml: -------------------------------------------------------------------------------- 1 | # for the 'secrets.GITHUB_TOKEN' see: https://docs.github.com/en/actions/reference/authentication-in-a-workflow#about-the-github_token-secret 2 | 3 | name: "Mark or close stale issues and PRs" 4 | 5 | on: 6 | schedule: 7 | - cron: "00 * * * *" 8 | 9 | jobs: 10 | stale: 11 | runs-on: ubuntu-latest 12 | steps: 13 | - uses: actions/stale@v3 14 | with: 15 | repo-token: ${{ secrets.GITHUB_TOKEN }} 16 | days-before-stale: 12 17 | 
days-before-close: 7 18 | stale-issue-message: "This is Robo-lampros because the Human-lampros is lazy. This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 7 days if no further activity occurs. Feel free to re-open a closed issue and the Human-lampros will respond." 19 | stale-pr-message: "This is Robo-lampros because the Human-lampros is lazy. This PR has been automatically marked as stale because it has not had recent activity. It will be closed after 7 days if no further activity occurs." 20 | close-issue-message: "This issue was automatically closed because of being stale. Feel free to re-open a closed issue and the Human-lampros will respond." 21 | close-pr-message: "This PR was automatically closed because of being stale." 22 | stale-pr-label: "stale" 23 | stale-issue-label: "stale" 24 | exempt-issue-labels: "bug,enhancement,pinned,security,pending,work_in_progress" 25 | exempt-pr-labels: "bug,enhancement,pinned,security,pending,work_in_progress" 26 | -------------------------------------------------------------------------------- /.github/workflows/tic.yml: -------------------------------------------------------------------------------- 1 | ## tic GitHub Actions template: linux-macos-windows-deploy 2 | ## revision date: 2020-12-11 3 | on: 4 | workflow_dispatch: 5 | push: 6 | pull_request: 7 | # for now, CRON jobs only run on the default branch of the repo (i.e. 
usually on master) 8 | schedule: 9 | # * is a special character in YAML so you have to quote this string 10 | - cron: "0 4 * * *" 11 | 12 | name: tic 13 | 14 | jobs: 15 | all: 16 | runs-on: ${{ matrix.config.os }} 17 | 18 | name: ${{ matrix.config.os }} (${{ matrix.config.r }}) 19 | 20 | strategy: 21 | fail-fast: false 22 | matrix: 23 | config: 24 | # use a different tic template type if you do not want to build on all listed platforms 25 | - { os: windows-latest, r: "release" } 26 | - { os: ubuntu-latest, r: "devel" } 27 | - { os: ubuntu-latest, r: "release", pkgdown: "true", latex: "true" } 28 | 29 | env: 30 | # otherwise remotes::fun() errors cause the build to fail. Example: Unavailability of binaries 31 | R_REMOTES_NO_ERRORS_FROM_WARNINGS: true 32 | CRAN: ${{ matrix.config.cran }} 33 | # make sure to run `tic::use_ghactions_deploy()` to set up deployment 34 | TIC_DEPLOY_KEY: ${{ secrets.TIC_DEPLOY_KEY }} 35 | # prevent rgl issues because no X11 display is available 36 | RGL_USE_NULL: true 37 | # if you use bookdown or blogdown, replace "PKGDOWN" by the respective 38 | # capitalized term. This also might need to be done in tic.R 39 | BUILD_PKGDOWN: ${{ matrix.config.pkgdown }} 40 | # macOS >= 10.15.4 linking 41 | SDKROOT: /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk 42 | # use GITHUB_TOKEN from GitHub to workaround rate limits in {remotes} 43 | GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }} 44 | 45 | steps: 46 | - uses: actions/checkout@v3 47 | 48 | - uses: r-lib/actions/setup-r@v2 49 | with: 50 | r-version: ${{ matrix.config.r }} 51 | Ncpus: 4 52 | 53 | # LaTeX. 
Installation time: 54 | # Linux: ~ 1 min 55 | # macOS: ~ 1 min 30s 56 | # Windows: never finishes 57 | - uses: r-lib/actions/setup-tinytex@v2 58 | if: matrix.config.latex == 'true' 59 | 60 | - uses: r-lib/actions/setup-pandoc@v2 61 | 62 | # set date/week for use in cache creation 63 | # https://github.community/t5/GitHub-Actions/How-to-set-and-access-a-Workflow-variable/m-p/42970 64 | # - cache R packages daily 65 | - name: "[Cache] Prepare daily timestamp for cache" 66 | if: runner.os != 'Windows' 67 | id: date 68 | run: echo "::set-output name=date::$(date '+%d-%m')" 69 | 70 | - name: "[Cache] Cache R packages" 71 | if: runner.os != 'Windows' 72 | uses: pat-s/always-upload-cache@v2.1.3 73 | with: 74 | path: ${{ env.R_LIBS_USER }} 75 | key: ${{ runner.os }}-r-${{ matrix.config.r }}-${{steps.date.outputs.date}} 76 | restore-keys: ${{ runner.os }}-r-${{ matrix.config.r }}-${{steps.date.outputs.date}} 77 | 78 | # for some strange Windows reason this step and the next one need to be decoupled 79 | - name: "[Stage] Prepare" 80 | run: | 81 | Rscript -e "if (!requireNamespace('remotes')) install.packages('remotes', type = 'source')" 82 | Rscript -e "if (getRversion() < '3.2' && !requireNamespace('curl')) install.packages('curl', type = 'source')" 83 | 84 | - name: "[Stage] [Linux] Install curl and libgit2" 85 | if: runner.os == 'Linux' 86 | run: sudo apt install libcurl4-openssl-dev libgit2-dev 87 | 88 | - name: "[Stage] [macOS] Install libgit2" 89 | if: runner.os == 'macOS' 90 | run: brew install libgit2 91 | 92 | - name: "[Stage] [macOS] Install system libs for pkgdown" 93 | if: runner.os == 'macOS' && matrix.config.pkgdown != '' 94 | run: brew install harfbuzz fribidi 95 | 96 | - name: "[Stage] [Linux] Install system libs for pkgdown" 97 | if: runner.os == 'Linux' && matrix.config.pkgdown != '' 98 | run: sudo apt install libharfbuzz-dev libfribidi-dev 99 | 100 | - name: "[Stage] Install" 101 | if: matrix.config.os != 'macOS-latest' || matrix.config.r != 'devel' 102 | 
run: Rscript -e "remotes::install_github('ropensci/tic')" -e "print(tic::dsl_load())" -e "tic::prepare_all_stages()" -e "tic::before_install()" -e "tic::install()" 103 | 104 | # macOS devel needs its own stage because we need to work with an option to suppress the usage of binaries 105 | - name: "[Stage] Prepare & Install (macOS-devel)" 106 | if: matrix.config.os == 'macOS-latest' && matrix.config.r == 'devel' 107 | run: | 108 | echo -e 'options(Ncpus = 4, pkgType = "source", repos = structure(c(CRAN = "https://cloud.r-project.org/")))' > $HOME/.Rprofile 109 | Rscript -e "remotes::install_github('ropensci/tic')" -e "print(tic::dsl_load())" -e "tic::prepare_all_stages()" -e "tic::before_install()" -e "tic::install()" 110 | 111 | - name: "[Stage] Script" 112 | run: Rscript -e 'tic::script()' 113 | 114 | - name: "[Stage] After Success" 115 | if: matrix.config.os == 'macOS-latest' && matrix.config.r == 'release' 116 | run: Rscript -e "tic::after_success()" 117 | 118 | - name: "[Stage] Upload R CMD check artifacts" 119 | if: failure() 120 | uses: actions/upload-artifact@v2.2.1 121 | with: 122 | name: ${{ runner.os }}-r${{ matrix.config.r }}-results 123 | path: check 124 | - name: "[Stage] Before Deploy" 125 | run: | 126 | Rscript -e "tic::before_deploy()" 127 | 128 | - name: "[Stage] Deploy" 129 | run: Rscript -e "tic::deploy()" 130 | 131 | - name: "[Stage] After Deploy" 132 | run: Rscript -e "tic::after_deploy()" 133 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | docs/ 2 | .Rhistory 3 | -------------------------------------------------------------------------------- /DESCRIPTION: -------------------------------------------------------------------------------- 1 | Package: nmslibR 2 | Type: Package 3 | Title: Non Metric Space (Approximate) Library 4 | Version: 1.0.7 5 | Date: 2023-02-01 6 | Authors@R: c( person("Lampros", "Mouselimis", 
email = "mouselimislampros@gmail.com", role = c("aut", "cre"), comment = c(ORCID = "https://orcid.org/0000-0002-8024-1546")), person("B.", "Naidan", role = "cph", comment = "Author of the Non-Metric Space Library (NMSLIB)"), person("L.", "Boytsov", role = "cph", comment = "Author of the Non-Metric Space Library (NMSLIB)"), person("Yu.", "Malkov", role = "cph", comment = "Author of the Non-Metric Space Library (NMSLIB)"), person("B.", "Frederickson", role = "cph", comment = "Author of the Non-Metric Space Library (NMSLIB)"), person("D.", "Novak", role = "cph", comment = "Author of the Non-Metric Space Library (NMSLIB)") ) 7 | BugReports: https://github.com/mlampros/nmslibR/issues 8 | URL: https://github.com/mlampros/nmslibR 9 | Description: A Non-Metric Space Library ('NMSLIB' ) wrapper, which according to the authors "is an efficient cross-platform similarity search library and a toolkit for evaluation of similarity search methods. The goal of the 'NMSLIB' Library is to create an effective and comprehensive toolkit for searching in generic non-metric spaces. Being comprehensive is important, because no single method is likely to be sufficient in all cases. Also note that exact solutions are hardly efficient in high dimensions and/or non-metric spaces. Hence, the main focus is on approximate methods". The wrapper also includes Approximate Kernel k-Nearest-Neighbor functions based on the 'NMSLIB' 'Python' Library. 
10 | License: Apache License (>= 2.0) 11 | SystemRequirements: python3-dev: apt-get install -y python3-dev (deb), python3-pip: apt-get install -y python3-pip (deb), numpy: pip3 install numpy (deb), scipy: pip3 install scipy (deb), nmslib: pip3 install --no-binary :all: nmslib (deb) 12 | Encoding: UTF-8 13 | Depends: 14 | R(>= 3.2.3) 15 | Imports: 16 | Rcpp (>= 0.12.7), 17 | reticulate, 18 | R6, 19 | Matrix, 20 | KernelKnn, 21 | utils, 22 | lifecycle 23 | LinkingTo: Rcpp, RcppArmadillo (>= 0.8.0) 24 | Suggests: 25 | testthat, 26 | covr, 27 | knitr, 28 | rmarkdown 29 | VignetteBuilder: knitr 30 | RoxygenNote: 7.2.3 31 | Config/reticulate: 32 | list( 33 | packages = list( 34 | list(package = "nmslib", pip = TRUE), 35 | list(package = "scipy", pip = TRUE) 36 | ) 37 | ) 38 | -------------------------------------------------------------------------------- /Dockerfile: -------------------------------------------------------------------------------- 1 | FROM rocker/rstudio:devel 2 | 3 | LABEL maintainer='Lampros Mouselimis' 4 | 5 | RUN export DEBIAN_FRONTEND=noninteractive; apt-get -y update && \ 6 | apt-get install -y libssl-dev python pandoc pandoc-citeproc libicu-dev libcurl4-openssl-dev libpng-dev && \ 7 | apt-get install -y sudo && \ 8 | apt-get install -y python3-dev && \ 9 | apt-get install -y python3-pip && \ 10 | pip3 install numpy && \ 11 | pip3 install scipy && \ 12 | pip3 install --no-binary :all: nmslib && \ 13 | R -e "install.packages(c( 'Rcpp', 'reticulate', 'R6', 'Matrix', 'KernelKnn', 'utils', 'RcppArmadillo', 'testthat', 'covr', 'knitr', 'rmarkdown', 'lifecycle', 'remotes' ), repos = 'https://cloud.r-project.org/' )" && \ 14 | R -e "remotes::install_github('mlampros/nmslibR', upgrade = 'never', dependencies = FALSE, repos = 'https://cloud.r-project.org/')" && \ 15 | apt-get autoremove -y && \ 16 | apt-get clean 17 | 18 | 19 | ENV USER rstudio 20 | -------------------------------------------------------------------------------- /LICENSE.md: 
-------------------------------------------------------------------------------- 1 | Apache License 2 | ============== 3 | 4 | _Version 2.0, January 2004_ 5 | _<>_ 6 | 7 | ### Terms and Conditions for use, reproduction, and distribution 8 | 9 | #### 1. Definitions 10 | 11 | “License” shall mean the terms and conditions for use, reproduction, and 12 | distribution as defined by Sections 1 through 9 of this document. 13 | 14 | “Licensor” shall mean the copyright owner or entity authorized by the copyright 15 | owner that is granting the License. 16 | 17 | “Legal Entity” shall mean the union of the acting entity and all other entities 18 | that control, are controlled by, or are under common control with that entity. 19 | For the purposes of this definition, “control” means **(i)** the power, direct or 20 | indirect, to cause the direction or management of such entity, whether by 21 | contract or otherwise, or **(ii)** ownership of fifty percent (50%) or more of the 22 | outstanding shares, or **(iii)** beneficial ownership of such entity. 23 | 24 | “You” (or “Your”) shall mean an individual or Legal Entity exercising 25 | permissions granted by this License. 26 | 27 | “Source” form shall mean the preferred form for making modifications, including 28 | but not limited to software source code, documentation source, and configuration 29 | files. 30 | 31 | “Object” form shall mean any form resulting from mechanical transformation or 32 | translation of a Source form, including but not limited to compiled object code, 33 | generated documentation, and conversions to other media types. 34 | 35 | “Work” shall mean the work of authorship, whether in Source or Object form, made 36 | available under the License, as indicated by a copyright notice that is included 37 | in or attached to the work (an example is provided in the Appendix below). 
38 | 39 | “Derivative Works” shall mean any work, whether in Source or Object form, that 40 | is based on (or derived from) the Work and for which the editorial revisions, 41 | annotations, elaborations, or other modifications represent, as a whole, an 42 | original work of authorship. For the purposes of this License, Derivative Works 43 | shall not include works that remain separable from, or merely link (or bind by 44 | name) to the interfaces of, the Work and Derivative Works thereof. 45 | 46 | “Contribution” shall mean any work of authorship, including the original version 47 | of the Work and any modifications or additions to that Work or Derivative Works 48 | thereof, that is intentionally submitted to Licensor for inclusion in the Work 49 | by the copyright owner or by an individual or Legal Entity authorized to submit 50 | on behalf of the copyright owner. For the purposes of this definition, 51 | “submitted” means any form of electronic, verbal, or written communication sent 52 | to the Licensor or its representatives, including but not limited to 53 | communication on electronic mailing lists, source code control systems, and 54 | issue tracking systems that are managed by, or on behalf of, the Licensor for 55 | the purpose of discussing and improving the Work, but excluding communication 56 | that is conspicuously marked or otherwise designated in writing by the copyright 57 | owner as “Not a Contribution.” 58 | 59 | “Contributor” shall mean Licensor and any individual or Legal Entity on behalf 60 | of whom a Contribution has been received by Licensor and subsequently 61 | incorporated within the Work. 62 | 63 | #### 2. 
Grant of Copyright License 64 | 65 | Subject to the terms and conditions of this License, each Contributor hereby 66 | grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, 67 | irrevocable copyright license to reproduce, prepare Derivative Works of, 68 | publicly display, publicly perform, sublicense, and distribute the Work and such 69 | Derivative Works in Source or Object form. 70 | 71 | #### 3. Grant of Patent License 72 | 73 | Subject to the terms and conditions of this License, each Contributor hereby 74 | grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, 75 | irrevocable (except as stated in this section) patent license to make, have 76 | made, use, offer to sell, sell, import, and otherwise transfer the Work, where 77 | such license applies only to those patent claims licensable by such Contributor 78 | that are necessarily infringed by their Contribution(s) alone or by combination 79 | of their Contribution(s) with the Work to which such Contribution(s) was 80 | submitted. If You institute patent litigation against any entity (including a 81 | cross-claim or counterclaim in a lawsuit) alleging that the Work or a 82 | Contribution incorporated within the Work constitutes direct or contributory 83 | patent infringement, then any patent licenses granted to You under this License 84 | for that Work shall terminate as of the date such litigation is filed. 85 | 86 | #### 4. 
Redistribution 87 | 88 | You may reproduce and distribute copies of the Work or Derivative Works thereof 89 | in any medium, with or without modifications, and in Source or Object form, 90 | provided that You meet the following conditions: 91 | 92 | * **(a)** You must give any other recipients of the Work or Derivative Works a copy of 93 | this License; and 94 | * **(b)** You must cause any modified files to carry prominent notices stating that You 95 | changed the files; and 96 | * **(c)** You must retain, in the Source form of any Derivative Works that You distribute, 97 | all copyright, patent, trademark, and attribution notices from the Source form 98 | of the Work, excluding those notices that do not pertain to any part of the 99 | Derivative Works; and 100 | * **(d)** If the Work includes a “NOTICE” text file as part of its distribution, then any 101 | Derivative Works that You distribute must include a readable copy of the 102 | attribution notices contained within such NOTICE file, excluding those notices 103 | that do not pertain to any part of the Derivative Works, in at least one of the 104 | following places: within a NOTICE text file distributed as part of the 105 | Derivative Works; within the Source form or documentation, if provided along 106 | with the Derivative Works; or, within a display generated by the Derivative 107 | Works, if and wherever such third-party notices normally appear. The contents of 108 | the NOTICE file are for informational purposes only and do not modify the 109 | License. You may add Your own attribution notices within Derivative Works that 110 | You distribute, alongside or as an addendum to the NOTICE text from the Work, 111 | provided that such additional attribution notices cannot be construed as 112 | modifying the License. 
113 | 114 | You may add Your own copyright statement to Your modifications and may provide 115 | additional or different license terms and conditions for use, reproduction, or 116 | distribution of Your modifications, or for any such Derivative Works as a whole, 117 | provided Your use, reproduction, and distribution of the Work otherwise complies 118 | with the conditions stated in this License. 119 | 120 | #### 5. Submission of Contributions 121 | 122 | Unless You explicitly state otherwise, any Contribution intentionally submitted 123 | for inclusion in the Work by You to the Licensor shall be under the terms and 124 | conditions of this License, without any additional terms or conditions. 125 | Notwithstanding the above, nothing herein shall supersede or modify the terms of 126 | any separate license agreement you may have executed with Licensor regarding 127 | such Contributions. 128 | 129 | #### 6. Trademarks 130 | 131 | This License does not grant permission to use the trade names, trademarks, 132 | service marks, or product names of the Licensor, except as required for 133 | reasonable and customary use in describing the origin of the Work and 134 | reproducing the content of the NOTICE file. 135 | 136 | #### 7. Disclaimer of Warranty 137 | 138 | Unless required by applicable law or agreed to in writing, Licensor provides the 139 | Work (and each Contributor provides its Contributions) on an “AS IS” BASIS, 140 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, 141 | including, without limitation, any warranties or conditions of TITLE, 142 | NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are 143 | solely responsible for determining the appropriateness of using or 144 | redistributing the Work and assume any risks associated with Your exercise of 145 | permissions under this License. 146 | 147 | #### 8. 
Limitation of Liability 148 | 149 | In no event and under no legal theory, whether in tort (including negligence), 150 | contract, or otherwise, unless required by applicable law (such as deliberate 151 | and grossly negligent acts) or agreed to in writing, shall any Contributor be 152 | liable to You for damages, including any direct, indirect, special, incidental, 153 | or consequential damages of any character arising as a result of this License or 154 | out of the use or inability to use the Work (including but not limited to 155 | damages for loss of goodwill, work stoppage, computer failure or malfunction, or 156 | any and all other commercial damages or losses), even if such Contributor has 157 | been advised of the possibility of such damages. 158 | 159 | #### 9. Accepting Warranty or Additional Liability 160 | 161 | While redistributing the Work or Derivative Works thereof, You may choose to 162 | offer, and charge a fee for, acceptance of support, warranty, indemnity, or 163 | other liability obligations and/or rights consistent with this License. However, 164 | in accepting such obligations, You may act only on Your own behalf and on Your 165 | sole responsibility, not on behalf of any other Contributor, and only if You 166 | agree to indemnify, defend, and hold each Contributor harmless for any liability 167 | incurred by, or claims asserted against, such Contributor by reason of your 168 | accepting any such warranty or additional liability. 169 | 170 | _END OF TERMS AND CONDITIONS_ 171 | 172 | ### APPENDIX: How to apply the Apache License to your work 173 | 174 | To apply the Apache License to your work, attach the following boilerplate 175 | notice, with the fields enclosed by brackets `[]` replaced with your own 176 | identifying information. (Don't include the brackets!) The text should be 177 | enclosed in the appropriate comment syntax for the file format. 
We also 178 | recommend that a file or class name and description of purpose be included on 179 | the same “printed page” as the copyright notice for easier identification within 180 | third-party archives. 181 | 182 | Copyright [yyyy] [name of copyright owner] 183 | 184 | Licensed under the Apache License, Version 2.0 (the "License"); 185 | you may not use this file except in compliance with the License. 186 | You may obtain a copy of the License at 187 | 188 | http://www.apache.org/licenses/LICENSE-2.0 189 | 190 | Unless required by applicable law or agreed to in writing, software 191 | distributed under the License is distributed on an "AS IS" BASIS, 192 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 193 | See the License for the specific language governing permissions and 194 | limitations under the License. 195 | -------------------------------------------------------------------------------- /NAMESPACE: -------------------------------------------------------------------------------- 1 | # Generated by roxygen2: do not edit by hand 2 | 3 | export(KernelKnnCV_nmslib) 4 | export(KernelKnn_nmslib) 5 | export(NMSlib) 6 | export(TO_scipy_sparse) 7 | export(mat_2scipy_sparse) 8 | import(KernelKnn) 9 | import(reticulate) 10 | importFrom(Matrix,Matrix) 11 | importFrom(R6,R6Class) 12 | importFrom(Rcpp,evalCpp) 13 | importFrom(lifecycle,deprecate_warn) 14 | importFrom(lifecycle,is_present) 15 | importFrom(utils,getFromNamespace) 16 | importFrom(utils,setTxtProgressBar) 17 | importFrom(utils,txtProgressBar) 18 | useDynLib(nmslibR, .registration = TRUE) 19 | -------------------------------------------------------------------------------- /NEWS.md: -------------------------------------------------------------------------------- 1 | 2 | ## nmslibR 1.0.7 3 | 4 | * I've added the *include_query_data_row_index* parameter to the *Knn_Query()* method of the *NMSlib* R6 Class and at the same time I added a deprecation warning for this parameter, 
because this method currently excludes by default the first output index and value. By setting the *include_query_data_row_index* to TRUE the first output index and value will be returned. This change will take effect in version 1.1.0 and the *Knn_Query()* method will return the first output index and value by default. 5 | * I added the *"save_data"* parameter to the *"save_Index()"* method and the *"load_data"* parameter to the *"initialize()"* method of the *'NMSlib()'* R6 class. I updated the documentation and references sections as well 6 | * I've modified the *DESCRIPTION* and the *package.R* file by adding only comments related to a new configuration type in the reticulate R package (see: https://github.com/rstudio/reticulate/issues/883#issuecomment-775552812) 7 | * I updated the *Makevars* and the .cpp files from C++11 to C++17 because I received the following NOTE during checking of the package: *Specified C++11: please update to current default of C++17* 8 | * I updated the *.Rbuildignore* file to exclude the *LICENSE.md* file because it gives a NOTE during CRAN checking 9 | 10 | 11 | ## nmslibR 1.0.6 12 | 13 | * I've added a 'packageStartupMessage' informing the user in case of the error 'attempt to apply non-function' that he/she has to use the 'reticulate::py_config()' before loading the package (in a new R session) 14 | * I've updated the 'SystemRequirements' in the DESCRIPTION file 15 | 16 | 17 | ## nmslibR 1.0.5 18 | 19 | * I updated the *License* in the DESCRIPTION file which as of '07-05-2021' will be *Apache License Version 2.0*. 
Therefore I removed also the COPYRIGHTS file from the 'inst' directory 20 | * I removed *LazyData* from the DESCRIPTION file 21 | * I added the *CITATION* file in the 'inst' directory 22 | * I removed the 'zzz.R' file and the 'packageStartupMessage()' 23 | 24 | 25 | ## nmslibR 1.0.4 26 | 27 | * I adjusted the output indices of the *Knn_Query* method (*NMSlib* R6 class) to account for the difference in indexing between R and Python ( *reference* : https://github.com/mlampros/nmslibR/issues/5 ) 28 | * I removed the *dtype* 'DOUBLE' parameter from the *NMSlib* R6 class, *KernelKnn_nmslib* and *KernelKnnCV_nmslib* functions (*reference* : https://github.com/nmslib/nmslib/commit/4d2937d6259aebb456db141ee0f3c2c465a51a8e ) 29 | * I replaced almost all web-links of the Python *nmslib* Package because the initial repository was moved to https://github.com/nmslib/nmslib 30 | 31 | 32 | ## nmslibR 1.0.3 33 | 34 | I updated the README.md file and especially the installation instructions for all mentioned operating systems i.e. Linux, Macintosh, Windows (switch from python2 to python3 due to pybind11 issues). 35 | 36 | 37 | ## nmslibR 1.0.2 38 | 39 | * The *dgCMatrix_2scipy_sparse* function was renamed to *TO_scipy_sparse* and now accepts either a *dgCMatrix* or a *dgRMatrix* as input. The appropriate format for the nmslibR package in case of sparse matrices is the *dgRMatrix* format (*scipy.sparse.csr_matrix*) 40 | * I added an onload.R file to inform the users about the previous change [ related with the issue : https://github.com/mlampros/nmslibR/issues/1 ] 41 | * I removed the *utils.R* file which included internal functions of the *KernelKnn* package. Rather than including the file I now use the *getFromNamespace* function of the *utils* package. 42 | * Due to the previous changes I modified the Vignette and the tests too. 
43 | 44 | 45 | ## nmslibR 1.0.1 46 | 47 | * I commented the example(s) and test(s) related to the *dgCMatrix_2scipy_sparse* function [ *if (Sys.info()["sysname"] != 'Darwin')* ], because the *scipy-sparse* library on CRAN is not upgraded and the older version includes a bug (*TypeError : could not interpret data type*). This leads to an error on *Macintosh* Operating System ( *reference* : https://github.com/scipy/scipy/issues/5353 ) 48 | * I added links to the github repository (master repository, issues) 49 | 50 | 51 | ## nmslibR 1.0.0 52 | 53 | 54 | 55 | 56 | -------------------------------------------------------------------------------- /R/RcppExports.R: -------------------------------------------------------------------------------- 1 | # Generated by using Rcpp::compileAttributes() -> do not edit by hand 2 | # Generator token: 10BE3573-1514-4C36-9D1C-5A225CD40393 3 | 4 | nmslib_idx_dist <- function(input_list, k, threads = 1L) { 5 | .Call(`_nmslibR_nmslib_idx_dist`, input_list, k, threads) 6 | } 7 | 8 | y_idxs <- function(idxs, y, threads = 1L) { 9 | .Call(`_nmslibR_y_idxs`, idxs, y, threads) 10 | } 11 | 12 | check_NaN_Inf <- function(x) { 13 | .Call(`_nmslibR_check_NaN_Inf`, x) 14 | } 15 | 16 | -------------------------------------------------------------------------------- /R/RcppModule.R: -------------------------------------------------------------------------------- 1 | #' @useDynLib nmslibR, .registration = TRUE 2 | #' @importFrom Rcpp evalCpp 3 | #' @importFrom lifecycle deprecate_warn is_present 4 | NULL 5 | -------------------------------------------------------------------------------- /R/nmslib.R: -------------------------------------------------------------------------------- 1 | 2 | 3 | #' conversion of an R matrix to a scipy sparse matrix 4 | #' 5 | #' 6 | #' @param x a data matrix 7 | #' @param format a character string. 
Either \emph{"sparse_row_matrix"} or \emph{"sparse_column_matrix"} 8 | #' @details 9 | #' This function allows the user to convert an R matrix to a scipy sparse matrix. This is useful because, for sparse input, the \emph{nmslibR} package accepts only \emph{python} (scipy) sparse matrices. 10 | #' @export 11 | #' @references https://docs.scipy.org/doc/scipy/reference/sparse.html 12 | #' @examples 13 | #' 14 | #' try({ 15 | #' if (reticulate::py_available(initialize = FALSE)) { 16 | #' if (reticulate::py_module_available("scipy")) { 17 | #' 18 | #' library(nmslibR) 19 | #' 20 | #' set.seed(1) 21 | #' 22 | #' x = matrix(runif(1000), nrow = 100, ncol = 10) 23 | #' 24 | #' res = mat_2scipy_sparse(x) 25 | #' 26 | #' print(dim(x)) 27 | #' 28 | #' print(res$shape) 29 | #' } 30 | #' } 31 | #' }, silent=TRUE) 32 | 33 | 34 | mat_2scipy_sparse = function(x, format = 'sparse_row_matrix') { 35 | 36 | if (!inherits(x, "matrix")) stop("the 'x' parameter should be of type 'matrix'", call. = F) 37 | 38 | if (format == 'sparse_column_matrix') { 39 | return(SCP$sparse$csc_matrix(x)) 40 | } 41 | else if (format == 'sparse_row_matrix') { 42 | return(SCP$sparse$csr_matrix(x)) 43 | } 44 | else { 45 | stop("the function can take either a 'sparse_row_matrix' or a 'sparse_column_matrix' for the 'format' parameter as input", call. = F) 46 | } 47 | } 48 | 49 | 50 | 51 | #' conversion of an R sparse matrix to a scipy sparse matrix 52 | #' 53 | #' 54 | #' @param R_sparse_matrix an R sparse matrix. Acceptable input objects are either a \emph{dgCMatrix} or a \emph{dgRMatrix}. 55 | #' @details 56 | #' This function allows the user to convert either an R \emph{dgCMatrix} or a \emph{dgRMatrix} to a scipy sparse matrix (\emph{scipy.sparse.csc_matrix} or \emph{scipy.sparse.csr_matrix}). This is useful because the \emph{nmslibR} package accepts besides an R dense matrix also python sparse matrices as input.
57 | #' 58 | #' The \emph{dgCMatrix} class is a class of sparse numeric matrices in the compressed, sparse, \emph{column-oriented format}. The \emph{dgRMatrix} class is a class of sparse numeric matrices in the compressed, sparse, \emph{row-oriented format}. 59 | #' 60 | #' @export 61 | #' @import reticulate 62 | #' @importFrom Matrix Matrix 63 | #' @references https://stat.ethz.ch/R-manual/R-devel/library/Matrix/html/dgCMatrix-class.html, https://stat.ethz.ch/R-manual/R-devel/library/Matrix/html/dgRMatrix-class.html, https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csc_matrix.html#scipy.sparse.csc_matrix 64 | #' @examples 65 | #' 66 | #' try({ 67 | #' if (reticulate::py_available(initialize = FALSE)) { 68 | #' if (reticulate::py_module_available("scipy")) { 69 | #' 70 | #' if (Sys.info()["sysname"] != 'Darwin') { 71 | #' 72 | #' library(nmslibR) 73 | #' 74 | #' 75 | #' # 'dgCMatrix' sparse matrix 76 | #' #-------------------------- 77 | #' 78 | #' data = c(1, 0, 2, 0, 0, 3, 4, 5, 6) 79 | #' 80 | #' dgcM = Matrix::Matrix(data = data, nrow = 3, 81 | #' 82 | #' ncol = 3, byrow = TRUE, 83 | #' 84 | #' sparse = TRUE) 85 | #' 86 | #' print(dim(dgcM)) 87 | #' 88 | #' res = TO_scipy_sparse(dgcM) 89 | #' 90 | #' print(res$shape) 91 | #' 92 | #' 93 | #' # 'dgRMatrix' sparse matrix 94 | #' #-------------------------- 95 | #' 96 | #' dgrM = as(dgcM, "RsparseMatrix") 97 | #' 98 | #' print(dim(dgrM)) 99 | #' 100 | #' res_dgr = TO_scipy_sparse(dgrM) 101 | #' 102 | #' print(res_dgr$shape) 103 | #' } 104 | #' } 105 | #' } 106 | #' }, silent=TRUE) 107 | 108 | 109 | TO_scipy_sparse = function(R_sparse_matrix) { 110 | 111 | if (inherits(R_sparse_matrix, "dgCMatrix")) { 112 | py_obj = SCP$sparse$csc_matrix(reticulate::tuple(R_sparse_matrix@x, R_sparse_matrix@i, R_sparse_matrix@p), shape = reticulate::tuple(R_sparse_matrix@Dim[1], R_sparse_matrix@Dim[2])) 113 | } 114 | else if (inherits(R_sparse_matrix, "dgRMatrix")) { 115 | py_obj =
SCP$sparse$csr_matrix(reticulate::tuple(R_sparse_matrix@x, R_sparse_matrix@j, R_sparse_matrix@p), shape = reticulate::tuple(R_sparse_matrix@Dim[1], R_sparse_matrix@Dim[2])) 116 | } 117 | else { 118 | stop("the 'R_sparse_matrix' parameter should be either a 'dgCMatrix' or a 'dgRMatrix' sparse matrix", call. = F) 119 | } 120 | 121 | return(py_obj) 122 | } 123 | 124 | 125 | 126 | #' Non metric space library 127 | #' 128 | #' 129 | #' @param input_data the input data. See \emph{details} for more information 130 | #' @param query_data_row a vector to query for 131 | #' @param query_data the query_data parameter should be of the same type with the \emph{input_data} parameter. Queries to query for 132 | #' @param k an integer. The number of neighbours to return 133 | #' @param include_query_data_row_index a boolean. If TRUE then the index of the query data row will be returned as well. It currently defaults to FALSE which means the first matched index is excluded from the results (this parameter will be removed in version 1.1.0 and the output behavior of the function will be changed too - see the deprecation warning) 134 | #' @param Index_Params a list of (optional) parameters to use in indexing (when creating the index) 135 | #' @param Time_Params a list of parameters to use in querying. Setting \emph{Time_Params} to NULL will reset 136 | #' @param space a character string (optional). The metric space to create for this index. Page 31 of the manual (see \emph{references}) explains all available inputs 137 | #' @param space_params a list of (optional) parameters for configuring the space. See the \emph{references} manual for more details. 138 | #' @param method a character string specifying the index method to use 139 | #' @param data_type a character string. One of 'DENSE_UINT8_VECTOR', 'DENSE_VECTOR', 'OBJECT_AS_STRING' or 'SPARSE_VECTOR' 140 | #' @param dtype a character string. 
Either 'FLOAT' or 'INT' 141 | #' @param index_filepath a character string specifying the path to a file, where an existing index is saved 142 | #' @param load_data a boolean. If TRUE then besides the index also the saved data will be loaded. This parameter is used when the \emph{index_filepath} parameter is not NULL (see the web links in the \emph{references} section for more details). The user might also have to specify the \emph{skip_optimized_index} parameter of the \emph{Index_Params} in the "init" method 143 | #' @param save_data a boolean. If TRUE then besides the index also the data will be saved (see the web links in the \emph{references} section for more details) 144 | #' @param print_progress a boolean (either TRUE or FALSE). Whether or not to display progress bar 145 | #' @param num_threads an integer. The number of threads to use 146 | #' @param filename a character string specifying the path. The filename to save ( in case of the \emph{save_Index} method ) or the filename to load ( in case of the \emph{load_Index} method ) 147 | #' @export 148 | #' @details 149 | #' 150 | #' \emph{input_data} parameter : In case of numeric data the \emph{input_data} parameter should be either an R matrix object or a scipy sparse matrix. 
Additionally, the \emph{input_data} parameter can be a list including more than one matrices / sparse-matrices having the same number of columns ( this is ideal for instance if the user wants to include both a train and a test dataset in the created index ) 151 | #' 152 | #' the \emph{Knn_Query} function finds the approximate K nearest neighbours of a vector in the index 153 | #' 154 | #' the \emph{knn_Query_Batch} Performs multiple queries on the index, distributing the work over a thread pool 155 | #' 156 | #' the \emph{save_Index} function saves the index to disk 157 | #' 158 | #' If the \emph{index_filepath} parameter is not NULL then an existing index will be loaded 159 | #' 160 | #' \emph{Incrementally} updating an already saved (and loaded) index is \emph{not} possible (see: https://github.com/nmslib/nmslib/issues/73) 161 | #' 162 | #' @references 163 | #' 164 | #' https://github.com/nmslib/nmslib/blob/master/manual/latex/manual.pdf 165 | #' 166 | #' https://github.com/nmslib/nmslib/blob/master/python_bindings/notebooks/search_vector_dense_optim.ipynb 167 | #' 168 | #' https://github.com/nmslib/nmslib/blob/master/python_bindings/notebooks/search_vector_dense_nonoptim.ipynb 169 | #' 170 | #' https://github.com/nmslib/nmslib/issues/356 171 | #' 172 | #' https://github.com/nmslib/nmslib/blob/master/manual/methods.md 173 | #' 174 | #' https://github.com/nmslib/nmslib/blob/master/manual/spaces.md 175 | #' 176 | #' @docType class 177 | #' @importFrom R6 R6Class 178 | #' @import reticulate 179 | #' @section Methods: 180 | #' 181 | #' \describe{ 182 | #' \item{\code{NMSlib$new(input_data, Index_Params = NULL, Time_Params = NULL, space='l1', 183 | #' space_params = NULL, method = 'hnsw', data_type = 'DENSE_VECTOR', 184 | #' dtype = 'FLOAT', index_filepath = NULL, load_data = FALSE, 185 | #' print_progress = FALSE)}}{} 186 | #' 187 | #' \item{\code{--------------}}{} 188 | #' 189 | #' \item{\code{Knn_Query(query_data_row, k = 5)}}{} 190 | #' 191 | #' 
\item{\code{--------------}}{} 192 | #' 193 | #' \item{\code{knn_Query_Batch(query_data, k = 5, num_threads = 1)}}{} 194 | #' 195 | #' \item{\code{--------------}}{} 196 | #' 197 | #' \item{\code{save_Index(filename, save_data = FALSE)}}{} 198 | #' } 199 | #' 200 | #' @usage # init <- NMSlib$new(input_data, Index_Params = NULL, Time_Params = NULL, 201 | #' # space='l1', space_params = NULL, method = 'hnsw', 202 | #' # data_type = 'DENSE_VECTOR', dtype = 'FLOAT', 203 | #' # index_filepath = NULL, load_data = FALSE, 204 | #' # print_progress = FALSE) 205 | #' @examples 206 | #' 207 | #' try({ 208 | #' if (reticulate::py_available(initialize = FALSE)) { 209 | #' if (reticulate::py_module_available("nmslib")) { 210 | #' 211 | #' library(nmslibR) 212 | #' 213 | #' set.seed(1) 214 | #' x = matrix(runif(1000), nrow = 100, ncol = 10) 215 | #' 216 | #' init_nms = NMSlib$new(input_data = x) 217 | #' 218 | #' 219 | #' # returns a 1-dimensional vector (index, distance) 220 | #' #-------------------------------------------------- 221 | #' 222 | #' init_nms$Knn_Query(query_data_row = x[1, ], k = 5) 223 | #' 224 | #' 225 | #' # returns knn's for all data 226 | #' #--------------------------- 227 | #' 228 | #' all_dat = init_nms$knn_Query_Batch(x, k = 5, num_threads = 1) 229 | #' } 230 | #' } 231 | #' }, silent=TRUE) 232 | 233 | 234 | NMSlib <- R6::R6Class("NMSlib", 235 | 236 | lock_objects = FALSE, 237 | 238 | public = list( 239 | 240 | initialize = function(input_data, 241 | Index_Params = NULL, 242 | Time_Params = NULL, 243 | space = 'l1', 244 | space_params = NULL, 245 | method = 'hnsw', 246 | data_type = 'DENSE_VECTOR', 247 | dtype = 'FLOAT', 248 | index_filepath = NULL, 249 | load_data = FALSE, 250 | print_progress = FALSE) { 251 | 252 | if (inherits(input_data, "data.frame")) stop("The 'input_data' parameter is a data frame! You have to convert the data.frame to a matrix first!", call. 
= F) 253 | 254 | # eval-parse to convert string to a variable 255 | #------------------------------------------- 256 | 257 | DATA_TYPE = NMSLIB$DataType 258 | data_type = eval(parse(text = paste('DATA_TYPE$', data_type, sep = "", collapse = ""))) 259 | 260 | DTYPE = NMSLIB$DistType 261 | dtype = eval(parse(text = paste('DTYPE$', dtype, sep = "", collapse = ""))) 262 | 263 | 264 | # initialization of nmslib 265 | #------------------------- 266 | 267 | if (!is.null(space_params)) { 268 | space_params = reticulate::dict(space_params) 269 | } 270 | 271 | private$index = NMSLIB$init(space=space, space_params = space_params, method = method, data_type = data_type, dtype = dtype) 272 | 273 | 274 | # by default data points will be inserted in batches [not single points] to the index AND also account for the fact that 'input_data' can be a list object 275 | #------------------------------------------------------------------------------------ ---------------------------------------------------------------- 276 | 277 | if (inherits(input_data, "list")) { 278 | 279 | for (ITEM in 1:length(input_data)) { 280 | 281 | private$index$addDataPointBatch(input_data[[ITEM]]) # here it's important, in case of matrices, that the columns of each object are equal, otherwise it will throw an error 282 | } 283 | } 284 | else { 285 | private$index$addDataPointBatch(input_data) 286 | } 287 | 288 | 289 | # "createIndex" OR load from a file-path an already saved index 290 | #--------------------------------------------------------------- 291 | 292 | if (is.null(index_filepath)) { # if filepath is NULL create index, ... 293 | 294 | if (is.null(Index_Params)) { 295 | private$index$createIndex( print_progress = print_progress ) 296 | } 297 | else { 298 | private$index$createIndex( reticulate::dict(Index_Params), print_progress = print_progress ) 299 | } 300 | } 301 | else { # ... 
else, load existing index from filepath (loads the index from disk) 302 | private$index$loadIndex(index_filepath, load_data = load_data) 303 | } 304 | 305 | 306 | # 'setQueryTimeParams' function [ Sets parameters used in 'knnQuery' and 'knnQueryBatch' ] 307 | #------------------------------ 308 | 309 | if (is.null(Time_Params)) { 310 | private$index$setQueryTimeParams( Time_Params ) 311 | } 312 | else { 313 | private$index$setQueryTimeParams( reticulate::dict(Time_Params) ) 314 | } 315 | }, 316 | 317 | 318 | # 'knnQuery' function [ returns index and distance for a single row -- Finds the approximate (or exact when brute force is used) K nearest neighbours of a vector in the index ] 319 | #-------------------- 320 | 321 | Knn_Query = function(query_data_row, k = 5, include_query_data_row_index = FALSE) { 322 | 323 | if (lifecycle::is_present(include_query_data_row_index)) { 324 | 325 | lifecycle::deprecate_warn( 326 | when = "1.0.6", 327 | what = "Knn_Query(include_query_data_row_index)", 328 | details = "The 'include_query_data_row_index' parameter will be removed in version 1.1.0 and the output values and indices of the 'Knn_Query()' function will include also the (potential) value and index of the matched 'query_data_row' input! (currently is excluded by default)" 329 | ) 330 | } 331 | 332 | idx_dists_single_ROW = private$index$knnQuery(query_data_row, as.integer(k + 1)) # add 1 because I'll remove the first item ( see next line ) 333 | 334 | indices = idx_dists_single_ROW[[1]] 335 | indices = indices + 1 # account for the indexing differences betw. 
Python and R 336 | values = idx_dists_single_ROW[[2]] 337 | 338 | if (!include_query_data_row_index) { 339 | remove_index = 1 # remove the 1st index 340 | } 341 | else { 342 | remove_index = k + 1 # remove the last index 343 | } 344 | 345 | indices = indices[-remove_index] # remove either the first or the last index depending on the 'include_query_data_row_index' parameter, because the result also includes the distance of a row to itself [ !! this might not always be true if the row doesn't exist in the data (new external data row) or if the distances of all "k" are 0.0, meaning that the 1st index (lowest distance) might not be the input "query_data_row" but one of the other "k" nearest values (no match of the input with the lowest distance) ] 346 | values = values[-remove_index] 347 | 348 | return(list(indices, values)) 349 | }, 350 | 351 | 352 | # 'knnQueryBatch' function [ Performs multiple queries on the index, distributing the work over a thread pool ] 353 | #------------------------- 354 | 355 | knn_Query_Batch = function(query_data, k = 5, num_threads = 1) { 356 | 357 | if (inherits(query_data, "data.frame")) stop("the 'query_data' parameter is a data frame. For the function to run error free convert the data frame to a matrix", call. = F) 358 | 359 | tmp_lst = private$index$knnQueryBatch(query_data, as.integer(k + 1), as.integer(num_threads)) # add 1 to account for the indexing differences betw.
Python and R [ adjusted also in the Rcpp function ] 360 | idx_dists_ = nmslib_idx_dist(tmp_lst, k, num_threads) # Rcpp function [ parallelized ] 361 | 362 | return(idx_dists_) 363 | }, 364 | 365 | 366 | # 'saveIndex' function [ Saves the index to disk ] 367 | #--------------------- 368 | 369 | save_Index = function(filename, save_data = FALSE) { 370 | private$index$saveIndex(filename, save_data = save_data) 371 | invisible() 372 | } 373 | ), 374 | 375 | private = list( 376 | index = NULL 377 | ) 378 | ) 379 | 380 | 381 | 382 | #' import internal functions from the KernelKnn package 383 | #' 384 | #' @importFrom utils getFromNamespace 385 | #' @import KernelKnn 386 | #' @keywords internal 387 | 388 | import_internal = function(function_name) { 389 | utils::getFromNamespace(function_name, "KernelKnn") 390 | } 391 | 392 | 393 | 394 | #' inner function to compute kernels, extract weights and return predictions 395 | #' 396 | #' @keywords internal 397 | 398 | inner_kernel_function = function(y_matrix, dist_matrix, Levels, weights_function, h) { 399 | 400 | #------------------------------------ import internal functions from KernelKnn 401 | 402 | normalized = import_internal('normalized') 403 | func_tbl_dist = import_internal('func_tbl_dist') 404 | func_tbl = import_internal('func_tbl') 405 | FUNCTION_weights = import_internal('FUNCTION_weights') 406 | switch_secondary = import_internal('switch_secondary') 407 | switch.ops = import_internal('switch.ops') 408 | FUN_kernels = import_internal('FUN_kernels') 409 | func_categorical_preds = import_internal('func_categorical_preds') 410 | func_shuffle = import_internal('func_shuffle') 411 | class_folds = import_internal('class_folds') 412 | regr_folds = import_internal('regr_folds') 413 | 414 | #------------------------------------ 415 | 416 | if (is.null(Levels)) { # regression 417 | 418 | if (is.null(weights_function)) { 419 | out_ = rowMeans(y_matrix) 420 | } 421 | else if (is.function(weights_function)) { 422 | W_te = 
FUNCTION_weights(dist_matrix, weights_function) 423 | out_ = rowSums(y_matrix * W_te) 424 | } 425 | else if (is.character(weights_function) && nchar(weights_function) > 1) { 426 | W_te = FUN_kernels(weights_function, dist_matrix, h) 427 | out_ = rowSums(y_matrix * W_te) 428 | } 429 | else { 430 | stop('false input for the weights_function argument') 431 | } 432 | } 433 | else { # classification 434 | if (is.null(weights_function)) { 435 | out_ = func_tbl_dist(y_matrix, sort(Levels)) 436 | colnames(out_) = paste0('class_', sort(Levels)) 437 | } 438 | else if (is.function(weights_function)) { 439 | W_te = FUNCTION_weights(dist_matrix, weights_function) 440 | out_ = func_tbl(y_matrix, W_te, sort(Levels)) 441 | } 442 | else if (is.character(weights_function) && nchar(weights_function) > 1) { 443 | W_te = FUN_kernels(weights_function, dist_matrix, h) 444 | out_ = func_tbl(y_matrix, W_te, sort(Levels)) 445 | } 446 | else { 447 | stop('false input for the weights_function argument') 448 | } 449 | } 450 | 451 | return(out_) 452 | } 453 | 454 | 455 | 456 | 457 | #' Approximate Kernel k nearest neighbors using the nmslib library 458 | #' 459 | #' 460 | #' @param data either a matrix or a scipy sparse matrix 461 | #' @param TEST_data a test dataset (in case of a matrix the \emph{TEST_data} should have equal number of columns with the \emph{data}). It is assumed that the \emph{TEST_data} is an unlabeled dataset 462 | #' @param y a numeric vector specifying the response variable (in classification the labels must be numeric from 1:Inf). The length of \emph{y} must equal the rows of the \emph{data} parameter 463 | #' @param k an integer. The number of neighbours to return 464 | #' @param h the bandwidth (applicable if the weights_function is not NULL, defaults to 1.0) 465 | #' @param weights_function there are various ways of specifying the kernel function. See the details section. 466 | #' @param Levels a numeric vector. 
In case of classification the unique levels of the response variable are necessary 467 | #' @param Index_Params a list of (optional) parameters to use in indexing (when creating the index) 468 | #' @param Time_Params a list of parameters to use in querying. Setting \emph{Time_Params} to NULL will reset 469 | #' @param space a character string (optional). The metric space to create for this index. Page 31 of the manual (see \emph{references}) explains all available inputs 470 | #' @param space_params a list of (optional) parameters for configuring the space. See the \emph{references} manual for more details. 471 | #' @param method a character string specifying the index method to use 472 | #' @param data_type a character string. One of 'DENSE_UINT8_VECTOR', 'DENSE_VECTOR', 'OBJECT_AS_STRING' or 'SPARSE_VECTOR' 473 | #' @param dtype a character string. Either 'FLOAT' or 'INT' 474 | #' @param print_progress a boolean (either TRUE or FALSE). Whether or not to display progress bar 475 | #' @param num_threads an integer. The number of threads to use 476 | #' @param index_filepath a character string specifying the path to a file, where an existing index is saved 477 | #' @details 478 | #' There are three possible ways to specify the \emph{weights function}, 1st option : if the weights_function is NULL then a simple k-nearest-neighbor is performed. 2nd option : the weights_function is one of 'uniform', 'triangular', 'epanechnikov', 'biweight', 'triweight', 'tricube', 'gaussian', 'cosine', 'logistic', 'gaussianSimple', 'silverman', 'inverse', 'exponential'. The 2nd option can be extended by combining kernels from the existing ones (adding or multiplying). For instance, I can multiply the tricube with the gaussian kernel by giving 'tricube_gaussian_MULT' or I can add the previously mentioned kernels by giving 'tricube_gaussian_ADD'. 
3rd option : a user defined kernel function 479 | #' @export 480 | #' @examples 481 | #' 482 | #' try({ 483 | #' if (reticulate::py_available(initialize = FALSE)) { 484 | #' if (reticulate::py_module_available("nmslib")) { 485 | #' 486 | #' library(nmslibR) 487 | #' 488 | #' x = matrix(runif(1000), nrow = 100, ncol = 10) 489 | #' 490 | #' y = runif(100) 491 | #' 492 | #' out = KernelKnn_nmslib(data = x, y = y, k = 5) 493 | #' } 494 | #' } 495 | #' }, silent=TRUE) 496 | 497 | 498 | KernelKnn_nmslib = function(data, 499 | y, 500 | TEST_data = NULL, 501 | k = 5, 502 | h = 1.0, 503 | weights_function = NULL, 504 | Levels = NULL, 505 | Index_Params = NULL, 506 | Time_Params = NULL, 507 | space = 'l1', 508 | space_params = NULL, 509 | method = 'hnsw', 510 | data_type = 'DENSE_VECTOR', 511 | dtype = 'FLOAT', 512 | index_filepath = NULL, 513 | print_progress = FALSE, 514 | num_threads = 1) { 515 | 516 | if (inherits(data, "data.frame")) stop("the 'data' parameter is a data frame. For the function to run error free convert the data frame to a matrix", call. = F) 517 | 518 | if (!is.null(TEST_data)) { 519 | if (inherits(TEST_data, "data.frame")) stop("the 'TEST_data' parameter is a data frame. For the function to run error free convert the data frame to a matrix", call. = F) 520 | } 521 | 522 | init_nmslib = NMSlib$new(input_data = data, Index_Params = Index_Params, Time_Params = Time_Params, space = space, space_params = space_params, method = method, data_type = data_type, dtype = dtype, index_filepath = index_filepath, print_progress = print_progress) # arguments are passed by name because 'load_data' precedes 'print_progress' in NMSlib$new(), so a positional 'print_progress' would be matched to 'load_data' 523 | 524 | if (!is.null(TEST_data)) { 525 | knn_idx_dist = init_nmslib$knn_Query_Batch(TEST_data, k, num_threads) 526 | } 527 | else { 528 | knn_idx_dist = init_nmslib$knn_Query_Batch(data, k, num_threads) 529 | } 530 | 531 | out_y = y_idxs(knn_idx_dist$knn_idx, y, num_threads) 532 | 533 | if (!check_NaN_Inf(out_y)) { 534 | warning("the output includes missing values", call.
= F) # in first place just print a warning in case of missing values 535 | } 536 | 537 | out_ = inner_kernel_function(out_y, knn_idx_dist$knn_dist, Levels, weights_function, h) 538 | return(out_) 539 | } 540 | 541 | 542 | 543 | 544 | 545 | #' Approximate Kernel k nearest neighbors (cross-validated) using the nmslib library 546 | #' 547 | #' 548 | #' @param data a numeric matrix 549 | #' @param y a numeric vector specifying the response variable (in classification the labels must be numeric from 1:Inf). The length of \emph{y} must equal the rows of the \emph{data} parameter 550 | #' @param k an integer. The number of neighbours to return 551 | #' @param folds the number of cross validation folds (must be greater than 1) 552 | #' @param h the bandwidth (applicable if the weights_function is not NULL, defaults to 1.0) 553 | #' @param weights_function there are various ways of specifying the kernel function. See the details section. 554 | #' @param Levels a numeric vector. In case of classification the unique levels of the response variable are necessary 555 | #' @param Index_Params a list of (optional) parameters to use in indexing (when creating the index) 556 | #' @param Time_Params a list of parameters to use in querying. Setting \emph{Time_Params} to NULL will reset 557 | #' @param space a character string (optional). The metric space to create for this index. Page 31 of the manual (see \emph{references}) explains all available inputs 558 | #' @param space_params a list of (optional) parameters for configuring the space. See the \emph{references} manual for more details. 559 | #' @param method a character string specifying the index method to use 560 | #' @param data_type a character string. One of 'DENSE_UINT8_VECTOR', 'DENSE_VECTOR', 'OBJECT_AS_STRING' or 'SPARSE_VECTOR' 561 | #' @param dtype a character string. Either 'FLOAT' or 'INT' 562 | #' @param print_progress a boolean (either TRUE or FALSE). 
Whether or not to display progress bar 563 | #' @param num_threads an integer. The number of threads to use 564 | #' @param index_filepath a character string specifying the path to a file, where an existing index is saved 565 | #' @param seed_num a numeric value specifying the seed of the random number generator 566 | #' @details 567 | #' There are three possible ways to specify the \emph{weights function}, 1st option : if the weights_function is NULL then a simple k-nearest-neighbor is performed. 2nd option : the weights_function is one of 'uniform', 'triangular', 'epanechnikov', 'biweight', 'triweight', 'tricube', 'gaussian', 'cosine', 'logistic', 'gaussianSimple', 'silverman', 'inverse', 'exponential'. The 2nd option can be extended by combining kernels from the existing ones (adding or multiplying). For instance, I can multiply the tricube with the gaussian kernel by giving 'tricube_gaussian_MULT' or I can add the previously mentioned kernels by giving 'tricube_gaussian_ADD'. 3rd option : a user defined kernel function 568 | #' @export 569 | #' @importFrom utils txtProgressBar 570 | #' @importFrom utils setTxtProgressBar 571 | #' @examples 572 | #' 573 | #' \dontrun{ 574 | #' 575 | #' x = matrix(runif(1000), nrow = 100, ncol = 10) 576 | #' 577 | #' y = runif(100) 578 | #' 579 | #' out = KernelKnnCV_nmslib(x, y, k = 5, folds = 5) 580 | #' 581 | #' } 582 | 583 | 584 | KernelKnnCV_nmslib = function(data, 585 | y, 586 | k = 5, 587 | folds = 5, 588 | h = 1.0, 589 | weights_function = NULL, 590 | Levels = NULL, 591 | Index_Params = NULL, 592 | Time_Params = NULL, 593 | space = 'l1', 594 | space_params = NULL, 595 | method = 'hnsw', 596 | data_type = 'DENSE_VECTOR', 597 | dtype = 'FLOAT', 598 | index_filepath = NULL, 599 | print_progress = FALSE, 600 | num_threads = 1, 601 | seed_num = 1) { 602 | start = Sys.time() 603 | #-------------------------------------------- import internal functions from KernelKnn 604 | class_folds = import_internal('class_folds') 605 | 
regr_folds = import_internal('regr_folds') 606 | #-------------------------------------------- 607 | 608 | if (is.null(Levels)) { 609 | set.seed(seed_num) 610 | n_folds = regr_folds(folds, y) 611 | } 612 | else { 613 | set.seed(seed_num) 614 | n_folds = class_folds(folds, as.factor(y)) 615 | } 616 | 617 | if (!all(unlist(lapply(n_folds, length)) > 5)) stop('At least one fold has 5 or fewer observations. Consider decreasing the number of folds or increasing the size of the data.') 618 | tmp_fit = list() 619 | cat('\n') 620 | cat('cross-validation starts ..', '\n') 621 | pb <- txtProgressBar(min = 0, max = folds, style = 3); cat('\n') 622 | 623 | for (i in 1:folds) { 624 | tmp_fit[[i]] = KernelKnn_nmslib(data = data[unlist(n_folds[-i]), ], 625 | y = y[unlist(n_folds[-i])], 626 | TEST_data = data[unlist(n_folds[i]), ], 627 | k = k, 628 | h = h, 629 | weights_function = weights_function, 630 | Levels = Levels, 631 | Index_Params = Index_Params, 632 | Time_Params = Time_Params, 633 | space = space, 634 | space_params = space_params, 635 | method = method, 636 | data_type = data_type, 637 | dtype = dtype, 638 | index_filepath = index_filepath, 639 | print_progress = print_progress, 640 | num_threads = num_threads) 641 | setTxtProgressBar(pb, i) 642 | } 643 | 644 | close(pb); cat('\n') 645 | end = Sys.time() 646 | t = end - start 647 | cat('time to complete :', t, attributes(t)$units, '\n') 648 | cat('\n') 649 | return(list(preds = tmp_fit, folds = n_folds)) 650 | } 651 | 652 | 653 | -------------------------------------------------------------------------------- /R/package.R: -------------------------------------------------------------------------------- 1 | #--------------------------------------------------------------------------------- 2 | # An alternative way of configuration - not tested on CRAN yet - is the following: 3 | # https://github.com/rstudio/reticulate/issues/883#issuecomment-775552812 4 | # https://github.com/kevinushey/usespandas/blob/master/DESCRIPTION 5 | #
https://github.com/kevinushey/usespandas/blob/master/R/zzz.R 6 | #--------------------------------------------------------------------------------- 7 | 8 | 9 | NMSLIB <- NULL; SCP <- NULL; 10 | 11 | .onLoad <- function(libname, pkgname) { 12 | 13 | # reticulate::configure_environment(pkgname, force = TRUE) # this R programming line is related to the weblinks at the top of the file (see also the documentation) 14 | 15 | try({ 16 | if (reticulate::py_available(initialize = FALSE)) { 17 | 18 | try({ 19 | NMSLIB <<- reticulate::import("nmslib", delay_load = TRUE) 20 | }, silent=TRUE) 21 | 22 | try({ 23 | SCP <<- reticulate::import("scipy", delay_load = TRUE, convert = FALSE) 24 | }, silent=TRUE) 25 | } 26 | }, silent=TRUE) 27 | } 28 | 29 | 30 | .onAttach <- function(libname, pkgname) { 31 | packageStartupMessage("If the 'nmslibR' package gives the following error: 'attempt to apply non-function' then make sure to open a new R session and run 'reticulate::py_config()' before loading the package!") 32 | } 33 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | 2 | [![tic](https://github.com/mlampros/nmslibR/workflows/tic/badge.svg?branch=master)](https://github.com/mlampros/nmslibR/actions) 3 | [![codecov.io](https://codecov.io/github/mlampros/nmslibR/coverage.svg?branch=master)](https://codecov.io/github/mlampros/nmslibR?branch=master) 4 | [![CRAN_Status_Badge](http://www.r-pkg.org/badges/version/nmslibR)](http://cran.r-project.org/package=nmslibR) 5 | [![Downloads](http://cranlogs.r-pkg.org/badges/grand-total/nmslibR?color=blue)](http://www.r-pkg.org/pkg/nmslibR) 6 | Buy Me A Coffee 7 | [![Dependencies](https://tinyverse.netlify.com/badge/nmslibR)](https://cran.r-project.org/package=nmslibR) 8 | 9 | 10 | ## nmslibR (Non Metric Space Library in R) 11 |
12 | 13 | 14 | The **nmslibR** package is a wrapper of the [Non-Metric Space Library (NMSLIB)](https://github.com/nmslib/nmslib) *python* package. More details on the functionality of the *nmslibR* package can be found in the [blog-post](http://mlampros.github.io/2018/02/27/the_nmslibR_package/) and in the package Documentation. 15 | 16 |
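A minimal usage sketch, adapted from the package's own documentation examples. It assumes that Python and the *nmslib* module are already installed (see the System Requirements below) and that Python is initialized before *nmslibR* is loaded, as the package start-up message suggests. The variable names (`py_ok`, `res`, `all_dat`) are illustrative only, and the whole block is wrapped in `try()` so that it is skipped quietly when the setup is incomplete.

```R
res = NULL

# initialize Python first (py_module_available() does that), then load nmslibR
py_ok = requireNamespace("reticulate", quietly = TRUE) &&
  isTRUE(try(reticulate::py_module_available("nmslib"), silent = TRUE))

if (py_ok && requireNamespace("nmslibR", quietly = TRUE)) {
  try({
    library(nmslibR)

    set.seed(1)
    x = matrix(runif(1000), nrow = 100, ncol = 10)

    # default arguments: method = 'hnsw', space = 'l1'
    init_nms = NMSlib$new(input_data = x)

    # neighbours of a single row: a list of (indices, distances)
    res = init_nms$Knn_Query(query_data_row = x[1, ], k = 5)

    # neighbours of every row of 'x'
    all_dat = init_nms$knn_Query_Batch(x, k = 5, num_threads = 1)
  }, silent = TRUE)
}
```

If the requirements are missing, `res` simply stays `NULL`.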
17 | 18 | 19 | **Reference:** 20 | 21 | https://github.com/nmslib/nmslib 22 | 23 | https://github.com/nmslib/nmslib/blob/master/manual/latex/manual.pdf 24 | 25 | 26 |
27 | 28 | ### **System Requirements** 29 | 30 |
31 | 32 | * Python (>= 2.7) 33 | 34 | 35 |
36 | 37 | All modules should be installed in the default python configuration (the configuration that the R-session displays as default), otherwise errors will occur during the *nmslibR* package installation (**reticulate::py_discover_config()** might be useful here). 38 | 39 |
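As a rough check of the point above, the following sketch asks *reticulate* which Python it would pick and whether the required modules can be imported from it. The helper name `check_python_setup` is hypothetical (it is not part of *nmslibR*); it only relies on `reticulate::py_discover_config()` and `reticulate::py_module_available()`.

```R
# Hypothetical helper: report the Python that 'reticulate' discovers and
# whether the modules needed by nmslibR are importable from it.
check_python_setup = function(modules = c("numpy", "scipy", "nmslib")) {

  if (!requireNamespace("reticulate", quietly = TRUE)) return(NULL)

  # py_discover_config() can fail or return NULL if no Python is found
  cfg = tryCatch(reticulate::py_discover_config(), error = function(e) NULL)
  if (is.null(cfg)) return(NULL)

  # note: py_module_available() initializes Python as a side effect
  available = vapply(modules, function(m) {
    isTRUE(tryCatch(reticulate::py_module_available(m), error = function(e) FALSE))
  }, logical(1))

  list(python = cfg$python, version = cfg$version, modules = available)
}

setup = check_python_setup()
print(setup)
```

A `FALSE` entry in `setup$modules` would suggest that the module was installed for a different Python than the one the R session uses.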
40 | 41 | The installation notes for *Linux, Macintosh, Windows* are based on *Python 3*. 42 | 43 |
44 | 45 | #### **Debian/Ubuntu** 46 | 47 |
48 | 49 | Installation of the system requirements, 50 | 51 |
52 | 53 | ```R 54 | 55 | sudo apt-get install python3-pip 56 | 57 | sudo pip3 install --upgrade setuptools 58 | 59 | sudo pip3 install -U numpy 60 | 61 | sudo pip3 install --upgrade scipy 62 | 63 | sudo apt-get install libboost-all-dev libgsl0-dev libeigen3-dev 64 | 65 | sudo apt-get install cmake 66 | 67 | pip3 install --upgrade pybind11 68 | 69 | sudo pip3 install nmslib 70 | 71 | ``` 72 | 73 |
74 | 75 | #### **Fedora** 76 | 77 |
78 | 79 | Installation of the system requirements, 80 | 81 |
82 | 83 | ```R 84 | 85 | dnf install python3-pip 86 | 87 | sudo pip3 install --upgrade setuptools 88 | 89 | sudo pip3 install -U numpy 90 | 91 | sudo pip3 install --upgrade scipy 92 | 93 | yum install python3-devel 94 | 95 | yum install boost-devel 96 | 97 | yum install gsl-devel 98 | 99 | yum install eigen3-devel 100 | 101 | pip3 install --upgrade pybind11 102 | 103 | sudo pip3 install nmslib 104 | 105 | ``` 106 | 107 |
108 | 109 | #### **Macintosh OSX** 110 | 111 |
112 | 113 | Upgrade python to version 3 using, 114 | 115 | 116 | ```R 117 | 118 | brew upgrade python 119 | 120 | ``` 121 | 122 |
123 | 124 | Install the requirements, 125 | 126 |
127 | 128 | ```R 129 | 130 | sudo pip3 install --upgrade pip setuptools wheel 131 | 132 | sudo pip3 install -U numpy 133 | 134 | sudo pip3 install --upgrade scipy 135 | 136 | brew install boost 137 | 138 | brew install eigen 139 | 140 | brew install gsl 141 | 142 | brew install cmake 143 | 144 | brew link --overwrite cmake 145 | 146 | pip3 install --upgrade pybind11 147 | 148 | sudo pip3 install nmslib 149 | 150 | ``` 151 |
152 | 153 | 154 | After a successful installation of the requirements the user should open an R session and run the following *reticulate* command to point R to the relevant (brew-installed) python binary (otherwise the *nmslibR* package won't work properly), 155 | 156 |
157 | 158 | ```R 159 | 160 | reticulate::use_python('/usr/local/bin/python3') 161 | 162 | 163 | ``` 164 | 165 |
166 | 167 | and then, 168 | 169 |
170 | 171 | 172 | ```R 173 | 174 | reticulate::py_discover_config() 175 | 176 | 177 | ``` 178 | 179 |
180 | 181 | to verify that the R session uses the python version in which *nmslib* is installed. 182 | 183 |
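Beyond inspecting the configuration, the python modules themselves can be checked from R. This is a minimal sketch, assuming *reticulate* is installed; `check_python_modules` is a hypothetical helper name, and the module names `nmslib` and `scipy` are the ones the package imports when it is loaded.

```R
# Hypothetical helper (not part of nmslibR): TRUE/FALSE per python module,
# FALSE for every module if reticulate itself is missing.
check_python_modules <- function(modules = c("nmslib", "scipy")) {
  if (!requireNamespace("reticulate", quietly = TRUE)) {
    return(stats::setNames(rep(FALSE, length(modules)), modules))
  }
  vapply(modules, function(m) {
    isTRUE(tryCatch(reticulate::py_module_available(m),
                    error = function(e) FALSE))
  }, logical(1))
}

print(check_python_modules())
```

A `FALSE` entry suggests that the module was installed into a different python than the one the R session uses.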

184 | 185 | 186 | 187 | #### **Windows OS** (the instructions were tested with the version 1.0.0 of the R package, thus use with caution) 188 | 189 |
190 | 191 | First download [get-pip.py](https://bootstrap.pypa.io/get-pip.py) for Windows 192 | 193 |
194 | 195 | Update the Environment variables ( Control Panel >> System and Security >> System >> Advanced system settings >> Environment variables >> System variables >> Path >> Edit ) by adding ( for instance in case of python 3.6 ), 196 | 197 |
198 | 199 | ```R 200 | 201 | C:\Python36;C:\Python36\Scripts 202 | 203 | 204 | ``` 205 | 206 |
207 | 208 | Install the [Build Tools for Visual Studio](https://visualstudio.microsoft.com/downloads/#build-tools-for-visual-studio-2017) 209 | 210 |
211 | 212 | Open the Command prompt (console) and install / upgrade the system requirements, 213 | 214 |
215 | 216 | ```R 217 | 218 | pip3 install --upgrade pip setuptools wheel 219 | 220 | pip3 install -U numpy 221 | 222 | pip3 install --upgrade scipy 223 | 224 | ``` 225 | 226 |
227 | 228 | **Installation of cmake** 229 | 230 |
231 | 232 | First download cmake for Windows, [win64-x64 Installer](https://cmake.org/download/). 233 | Once the file is downloaded run the **.exe** file and during installation make sure to **add CMake to the system PATH for all users**. 234 | 235 |
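Whether the PATH changes took effect can be checked from a fresh R session. This is a minimal sketch using only base R; the executable names `python`, `pip3` and `cmake` are assumptions about how the tools are named on the machine.

```R
# Look up the tools on the PATH; an empty string means "not found".
tools <- c("python", "pip3", "cmake")
found <- Sys.which(tools)
print(found)

missing_tools <- names(found)[found == ""]
if (length(missing_tools) > 0) {
  message("Not found on the PATH: ", paste(missing_tools, collapse = ", "))
}
```

A tool reported as missing usually means the PATH entry was not added or the session was started before the change.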
236 | 237 | 238 | Then install the *nmslib* library, 239 | 240 |
241 | 242 | ```R 243 | 244 | pip3 install --upgrade pybind11 245 | 246 | pip3 install nmslib 247 | 248 | ``` 249 | 250 |
251 | 252 | 253 | 254 | ### **Installation of the nmslibR package** 255 | 256 |
257 | 258 | To install the package from CRAN use, 259 | 260 |
261 | 262 | ```R 263 | 264 | install.packages('nmslibR') 265 | 266 | 267 | ``` 268 |
269 | 270 | and to download the latest version from GitHub use the *install_github* function of the *remotes* package, 271 |

272 | 273 | ```R 274 | 275 | remotes::install_github(repo = 'mlampros/nmslibR') 276 | 277 | ``` 278 |
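After installation, a quick run can confirm that the R package and the python module work together. The sketch below mirrors the example in the package documentation (`NMSlib$new` and `Knn_Query`); `nmslibR_smoke_test` is a hypothetical helper name, and the check is skipped when *nmslibR*, *reticulate* or the python *nmslib* module is unavailable.

```R
# Hypothetical helper (not part of nmslibR): build a small index and run one
# query. Returns TRUE if the query ran, FALSE if the prerequisites are missing.
nmslibR_smoke_test <- function() {
  ready <- requireNamespace("nmslibR", quietly = TRUE) &&
    requireNamespace("reticulate", quietly = TRUE) &&
    isTRUE(tryCatch(reticulate::py_module_available("nmslib"),
                    error = function(e) FALSE))
  if (!ready) {
    message("nmslibR or the python 'nmslib' module is not available; skipping")
    return(FALSE)
  }
  set.seed(1)
  x <- matrix(runif(1000), nrow = 100, ncol = 10)
  init_nms <- nmslibR::NMSlib$new(input_data = x)
  res <- init_nms$Knn_Query(query_data_row = x[1, ], k = 5)
  str(res)  # indices and distances of the approximate neighbours
  TRUE
}

ok <- nmslibR_smoke_test()
```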
279 | Use the following link to report bugs/issues, 280 |

281 | 282 | [https://github.com/mlampros/nmslibR/issues](https://github.com/mlampros/nmslibR/issues) 283 | 284 |
285 | 286 | ### **Citation:** 287 | 288 | If you use the code of this repository in your paper or research please cite both **nmslibR** and the **original articles / software** [https://CRAN.R-project.org/package=nmslibR/citation.html](https://CRAN.R-project.org/package=nmslibR/citation.html): 289 | 290 |
291 | 292 | ```R 293 | @Manual{, 294 | title = {{nmslibR}: Non Metric Space (Approximate) Library in R}, 295 | author = {Lampros Mouselimis}, 296 | year = {2021}, 297 | note = {R package version 1.0.7}, 298 | url = {https://CRAN.R-project.org/package=nmslibR}, 299 | } 300 | ``` 301 | 302 |
303 | 304 | -------------------------------------------------------------------------------- /codecov.yml: -------------------------------------------------------------------------------- 1 | comment: false 2 | -------------------------------------------------------------------------------- /inst/CITATION: -------------------------------------------------------------------------------- 1 | citHeader("Please cite both the package and the original articles / software in your publications:") 2 | 3 | year <- sub("-.*", "", meta$Date) 4 | note <- sprintf("R package version %s", meta$Version) 5 | 6 | bibentry( 7 | bibtype = "Manual", 8 | title = "{nmslibR}: Non Metric Space (Approximate) Library", 9 | author = person("Lampros", "Mouselimis"), 10 | year = year, 11 | note = note, 12 | url = "https://CRAN.R-project.org/package=nmslibR" 13 | ) 14 | 15 | bibentry( 16 | bibtype = "Manual", 17 | title = "{nmslib}: Non-Metric Space Library (NMSLIB)", 18 | author = c(person("B", "Naidan"), person("L", "Boytsov"), person("Yu", "Malkov"), person("B", "Frederickson"), person("D", "Novak")), 19 | year = "2014", 20 | url = "https://github.com/nmslib/nmslib" 21 | ) 22 | 23 | bibentry( 24 | bibtype = "InProceedings", 25 | author = c(person("Leonid", "Boytsov"), person("Bilegsaikhan", "Naidan")), 26 | editor = c(person("Nieves", "Brisaboa"), person("Oscar", "Pedreira"), person("Pavel", "Zezula")), 27 | title = "Engineering Efficient and Effective Non-metric Space Library", 28 | booktitle = "Similarity Search and Applications - 6th International Conference, SISAP 2013, Spain, October 2-4, 2013, Proceedings", 29 | series = "Lecture Notes in Computer Science", 30 | volume = "8199", 31 | pages = "280--293", 32 | publisher = "Springer", 33 | year = "2013", 34 | url = "https://doi.org/10.1007/978-3-642-41062-8", 35 | doi = "10.1007/978-3-642-41062-8" 36 | ) 37 | 38 | bibentry( 39 | bibtype = "Article", 40 | author = c(person("Yury", "Malkov"), person("D", "Yashunin")), 41 | title = "Efficient 
and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs", 42 | journal = "CoRR", 43 | volume = "abs/1603.09320", 44 | year = "2016", 45 | url = "https://arxiv.org/abs/1603.09320" 46 | ) 47 | 48 | -------------------------------------------------------------------------------- /inst/Non_Metric_Space_Library_(NMSLIB)_Manual.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mlampros/nmslibR/6250dad4cdc7ba798cc2dc0d4fa9c5f0d40c16dc/inst/Non_Metric_Space_Library_(NMSLIB)_Manual.pdf -------------------------------------------------------------------------------- /man/KernelKnnCV_nmslib.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/nmslib.R 3 | \name{KernelKnnCV_nmslib} 4 | \alias{KernelKnnCV_nmslib} 5 | \title{Approximate Kernel k nearest neighbors (cross-validated) using the nmslib library} 6 | \usage{ 7 | KernelKnnCV_nmslib( 8 | data, 9 | y, 10 | k = 5, 11 | folds = 5, 12 | h = 1, 13 | weights_function = NULL, 14 | Levels = NULL, 15 | Index_Params = NULL, 16 | Time_Params = NULL, 17 | space = "l1", 18 | space_params = NULL, 19 | method = "hnsw", 20 | data_type = "DENSE_VECTOR", 21 | dtype = "FLOAT", 22 | index_filepath = NULL, 23 | print_progress = FALSE, 24 | num_threads = 1, 25 | seed_num = 1 26 | ) 27 | } 28 | \arguments{ 29 | \item{data}{a numeric matrix} 30 | 31 | \item{y}{a numeric vector specifying the response variable (in classification the labels must be numeric from 1:Inf). The length of \emph{y} must equal the rows of the \emph{data} parameter} 32 | 33 | \item{k}{an integer. 
The number of neighbours to return} 34 | 35 | \item{folds}{the number of cross validation folds (must be greater than 1)} 36 | 37 | \item{h}{the bandwidth (applicable if the weights_function is not NULL, defaults to 1.0)} 38 | 39 | \item{weights_function}{there are various ways of specifying the kernel function. See the details section.} 40 | 41 | \item{Levels}{a numeric vector. In case of classification the unique levels of the response variable are necessary} 42 | 43 | \item{Index_Params}{a list of (optional) parameters to use in indexing (when creating the index)} 44 | 45 | \item{Time_Params}{a list of parameters to use in querying. Setting \emph{Time_Params} to NULL will reset} 46 | 47 | \item{space}{a character string (optional). The metric space to create for this index. Page 31 of the manual (see \emph{references}) explains all available inputs} 48 | 49 | \item{space_params}{a list of (optional) parameters for configuring the space. See the \emph{references} manual for more details.} 50 | 51 | \item{method}{a character string specifying the index method to use} 52 | 53 | \item{data_type}{a character string. One of 'DENSE_UINT8_VECTOR', 'DENSE_VECTOR', 'OBJECT_AS_STRING' or 'SPARSE_VECTOR'} 54 | 55 | \item{dtype}{a character string. Either 'FLOAT' or 'INT'} 56 | 57 | \item{index_filepath}{a character string specifying the path to a file, where an existing index is saved} 58 | 59 | \item{print_progress}{a boolean (either TRUE or FALSE). Whether or not to display progress bar} 60 | 61 | \item{num_threads}{an integer. The number of threads to use} 62 | 63 | \item{seed_num}{a numeric value specifying the seed of the random number generator} 64 | } 65 | \description{ 66 | Approximate Kernel k nearest neighbors (cross-validated) using the nmslib library 67 | } 68 | \details{ 69 | There are three possible ways to specify the \emph{weights function}, 1st option : if the weights_function is NULL then a simple k-nearest-neighbor is performed. 
2nd option : the weights_function is one of 'uniform', 'triangular', 'epanechnikov', 'biweight', 'triweight', 'tricube', 'gaussian', 'cosine', 'logistic', 'gaussianSimple', 'silverman', 'inverse', 'exponential'. The 2nd option can be extended by combining kernels from the existing ones (adding or multiplying). For instance, I can multiply the tricube with the gaussian kernel by giving 'tricube_gaussian_MULT' or I can add the previously mentioned kernels by giving 'tricube_gaussian_ADD'. 3rd option : a user defined kernel function 70 | } 71 | \examples{ 72 | 73 | \dontrun{ 74 | 75 | x = matrix(runif(1000), nrow = 100, ncol = 10) 76 | 77 | y = runif(100) 78 | 79 | out = KernelKnnCV_nmslib(x, y, k = 5, folds = 5) 80 | 81 | } 82 | } 83 | -------------------------------------------------------------------------------- /man/KernelKnn_nmslib.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/nmslib.R 3 | \name{KernelKnn_nmslib} 4 | \alias{KernelKnn_nmslib} 5 | \title{Approximate Kernel k nearest neighbors using the nmslib library} 6 | \usage{ 7 | KernelKnn_nmslib( 8 | data, 9 | y, 10 | TEST_data = NULL, 11 | k = 5, 12 | h = 1, 13 | weights_function = NULL, 14 | Levels = NULL, 15 | Index_Params = NULL, 16 | Time_Params = NULL, 17 | space = "l1", 18 | space_params = NULL, 19 | method = "hnsw", 20 | data_type = "DENSE_VECTOR", 21 | dtype = "FLOAT", 22 | index_filepath = NULL, 23 | print_progress = FALSE, 24 | num_threads = 1 25 | ) 26 | } 27 | \arguments{ 28 | \item{data}{either a matrix or a scipy sparse matrix} 29 | 30 | \item{y}{a numeric vector specifying the response variable (in classification the labels must be numeric from 1:Inf). The length of \emph{y} must equal the rows of the \emph{data} parameter} 31 | 32 | \item{TEST_data}{a test dataset (in case of a matrix the \emph{TEST_data} should have equal number of columns with the \emph{data}). 
It is assumed that the \emph{TEST_data} is an unlabeled dataset} 33 | 34 | \item{k}{an integer. The number of neighbours to return} 35 | 36 | \item{h}{the bandwidth (applicable if the weights_function is not NULL, defaults to 1.0)} 37 | 38 | \item{weights_function}{there are various ways of specifying the kernel function. See the details section.} 39 | 40 | \item{Levels}{a numeric vector. In case of classification the unique levels of the response variable are necessary} 41 | 42 | \item{Index_Params}{a list of (optional) parameters to use in indexing (when creating the index)} 43 | 44 | \item{Time_Params}{a list of parameters to use in querying. Setting \emph{Time_Params} to NULL will reset} 45 | 46 | \item{space}{a character string (optional). The metric space to create for this index. Page 31 of the manual (see \emph{references}) explains all available inputs} 47 | 48 | \item{space_params}{a list of (optional) parameters for configuring the space. See the \emph{references} manual for more details.} 49 | 50 | \item{method}{a character string specifying the index method to use} 51 | 52 | \item{data_type}{a character string. One of 'DENSE_UINT8_VECTOR', 'DENSE_VECTOR', 'OBJECT_AS_STRING' or 'SPARSE_VECTOR'} 53 | 54 | \item{dtype}{a character string. Either 'FLOAT' or 'INT'} 55 | 56 | \item{index_filepath}{a character string specifying the path to a file, where an existing index is saved} 57 | 58 | \item{print_progress}{a boolean (either TRUE or FALSE). Whether or not to display progress bar} 59 | 60 | \item{num_threads}{an integer. The number of threads to use} 61 | } 62 | \description{ 63 | Approximate Kernel k nearest neighbors using the nmslib library 64 | } 65 | \details{ 66 | There are three possible ways to specify the \emph{weights function}, 1st option : if the weights_function is NULL then a simple k-nearest-neighbor is performed. 
2nd option : the weights_function is one of 'uniform', 'triangular', 'epanechnikov', 'biweight', 'triweight', 'tricube', 'gaussian', 'cosine', 'logistic', 'gaussianSimple', 'silverman', 'inverse', 'exponential'. The 2nd option can be extended by combining kernels from the existing ones (adding or multiplying). For instance, I can multiply the tricube with the gaussian kernel by giving 'tricube_gaussian_MULT' or I can add the previously mentioned kernels by giving 'tricube_gaussian_ADD'. 3rd option : a user defined kernel function 67 | } 68 | \examples{ 69 | 70 | try({ 71 | if (reticulate::py_available(initialize = FALSE)) { 72 | if (reticulate::py_module_available("nmslib")) { 73 | 74 | library(nmslibR) 75 | 76 | x = matrix(runif(1000), nrow = 100, ncol = 10) 77 | 78 | y = runif(100) 79 | 80 | out = KernelKnn_nmslib(data = x, y = y, k = 5) 81 | } 82 | } 83 | }, silent=TRUE) 84 | } 85 | -------------------------------------------------------------------------------- /man/NMSlib.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/nmslib.R 3 | \docType{class} 4 | \name{NMSlib} 5 | \alias{NMSlib} 6 | \title{Non metric space library} 7 | \usage{ 8 | # init <- NMSlib$new(input_data, Index_Params = NULL, Time_Params = NULL, 9 | # space='l1', space_params = NULL, method = 'hnsw', 10 | # data_type = 'DENSE_VECTOR', dtype = 'FLOAT', 11 | # index_filepath = NULL, load_data = FALSE, 12 | # print_progress = FALSE) 13 | } 14 | \description{ 15 | Non metric space library 16 | 17 | Non metric space library 18 | } 19 | \details{ 20 | \emph{input_data} parameter : In case of numeric data the \emph{input_data} parameter should be either an R matrix object or a scipy sparse matrix. 
Additionally, the \emph{input_data} parameter can be a list including more than one matrix / sparse-matrix having the same number of columns ( this is ideal, for instance, if the user wants to include both a train and a test dataset in the created index ) 21 | 22 | the \emph{Knn_Query} function finds the approximate K nearest neighbours of a vector in the index 23 | 24 | the \emph{knn_Query_Batch} function performs multiple queries on the index, distributing the work over a thread pool 25 | 26 | the \emph{save_Index} function saves the index to disk 27 | 28 | If the \emph{index_filepath} parameter is not NULL then an existing index will be loaded 29 | 30 | \emph{Incrementally} updating an already saved (and loaded) index is \emph{not} possible (see: https://github.com/nmslib/nmslib/issues/73) 31 | } 32 | \section{Methods}{ 33 | 34 | 35 | \describe{ 36 | \item{\code{NMSlib$new(input_data, Index_Params = NULL, Time_Params = NULL, space='l1', 37 | space_params = NULL, method = 'hnsw', data_type = 'DENSE_VECTOR', 38 | dtype = 'FLOAT', index_filepath = NULL, load_data = FALSE, 39 | print_progress = FALSE)}}{} 40 | 41 | \item{\code{--------------}}{} 42 | 43 | \item{\code{Knn_Query(query_data_row, k = 5)}}{} 44 | 45 | \item{\code{--------------}}{} 46 | 47 | \item{\code{knn_Query_Batch(query_data, k = 5, num_threads = 1)}}{} 48 | 49 | \item{\code{--------------}}{} 50 | 51 | \item{\code{save_Index(filename, save_data = FALSE)}}{} 52 | } 53 | } 54 | 55 | \examples{ 56 | 57 | try({ 58 | if (reticulate::py_available(initialize = FALSE)) { 59 | if (reticulate::py_module_available("nmslib")) { 60 | 61 | library(nmslibR) 62 | 63 | set.seed(1) 64 | x = matrix(runif(1000), nrow = 100, ncol = 10) 65 | 66 | init_nms = NMSlib$new(input_data = x) 67 | 68 | 69 | # returns a 1-dimensional vector (index, distance) 70 | #-------------------------------------------------- 71 | 72 | init_nms$Knn_Query(query_data_row = x[1, ], k = 5) 73 | 74 | 75 | # returns knn's for all data 76 | 
#--------------------------- 77 | 78 | all_dat = init_nms$knn_Query_Batch(x, k = 5, num_threads = 1) 79 | } 80 | } 81 | }, silent=TRUE) 82 | } 83 | \references{ 84 | https://github.com/nmslib/nmslib/blob/master/manual/latex/manual.pdf 85 | 86 | https://github.com/nmslib/nmslib/blob/master/python_bindings/notebooks/search_vector_dense_optim.ipynb 87 | 88 | https://github.com/nmslib/nmslib/blob/master/python_bindings/notebooks/search_vector_dense_nonoptim.ipynb 89 | 90 | https://github.com/nmslib/nmslib/issues/356 91 | 92 | https://github.com/nmslib/nmslib/blob/master/manual/methods.md 93 | 94 | https://github.com/nmslib/nmslib/blob/master/manual/spaces.md 95 | } 96 | \section{Methods}{ 97 | \subsection{Public methods}{ 98 | \itemize{ 99 | \item \href{#method-new}{\code{NMSlib$new()}} 100 | \item \href{#method-Knn_Query}{\code{NMSlib$Knn_Query()}} 101 | \item \href{#method-knn_Query_Batch}{\code{NMSlib$knn_Query_Batch()}} 102 | \item \href{#method-save_Index}{\code{NMSlib$save_Index()}} 103 | \item \href{#method-clone}{\code{NMSlib$clone()}} 104 | } 105 | } 106 | \if{html}{\out{
}} 107 | \if{html}{\out{}} 108 | \if{latex}{\out{\hypertarget{method-new}{}}} 109 | \subsection{Method \code{new()}}{ 110 | \subsection{Usage}{ 111 | \if{html}{\out{
}}\preformatted{NMSlib$new( 112 | input_data, 113 | Index_Params = NULL, 114 | Time_Params = NULL, 115 | space = "l1", 116 | space_params = NULL, 117 | method = "hnsw", 118 | data_type = "DENSE_VECTOR", 119 | dtype = "FLOAT", 120 | index_filepath = NULL, 121 | load_data = FALSE, 122 | print_progress = FALSE 123 | )}\if{html}{\out{
}} 124 | } 125 | 126 | \subsection{Arguments}{ 127 | \if{html}{\out{
}} 128 | \describe{ 129 | \item{\code{input_data}}{the input data. See \emph{details} for more information} 130 | 131 | \item{\code{Index_Params}}{a list of (optional) parameters to use in indexing (when creating the index)} 132 | 133 | \item{\code{Time_Params}}{a list of parameters to use in querying. Setting \emph{Time_Params} to NULL will reset} 134 | 135 | \item{\code{space}}{a character string (optional). The metric space to create for this index. Page 31 of the manual (see \emph{references}) explains all available inputs} 136 | 137 | \item{\code{space_params}}{a list of (optional) parameters for configuring the space. See the \emph{references} manual for more details.} 138 | 139 | \item{\code{method}}{a character string specifying the index method to use} 140 | 141 | \item{\code{data_type}}{a character string. One of 'DENSE_UINT8_VECTOR', 'DENSE_VECTOR', 'OBJECT_AS_STRING' or 'SPARSE_VECTOR'} 142 | 143 | \item{\code{dtype}}{a character string. Either 'FLOAT' or 'INT'} 144 | 145 | \item{\code{index_filepath}}{a character string specifying the path to a file, where an existing index is saved} 146 | 147 | \item{\code{load_data}}{a boolean. If TRUE then besides the index also the saved data will be loaded. This parameter is used when the \emph{index_filepath} parameter is not NULL (see the web links in the \emph{references} section for more details). The user might also have to specify the \emph{skip_optimized_index} parameter of the \emph{Index_Params} in the "init" method} 148 | 149 | \item{\code{print_progress}}{a boolean (either TRUE or FALSE). Whether or not to display progress bar} 150 | } 151 | \if{html}{\out{
}} 152 | } 153 | } 154 | \if{html}{\out{
}} 155 | \if{html}{\out{}} 156 | \if{latex}{\out{\hypertarget{method-Knn_Query}{}}} 157 | \subsection{Method \code{Knn_Query()}}{ 158 | \subsection{Usage}{ 159 | \if{html}{\out{
}}\preformatted{NMSlib$Knn_Query(query_data_row, k = 5, include_query_data_row_index = FALSE)}\if{html}{\out{
}} 160 | } 161 | 162 | \subsection{Arguments}{ 163 | \if{html}{\out{
}} 164 | \describe{ 165 | \item{\code{query_data_row}}{a vector to query for} 166 | 167 | \item{\code{k}}{an integer. The number of neighbours to return} 168 | 169 | \item{\code{include_query_data_row_index}}{a boolean. If TRUE then the index of the query data row will be returned as well. It currently defaults to FALSE which means the first matched index is excluded from the results (this parameter will be removed in version 1.1.0 and the output behavior of the function will be changed too - see the deprecation warning)} 170 | } 171 | \if{html}{\out{
}} 172 | } 173 | } 174 | \if{html}{\out{
}} 175 | \if{html}{\out{}} 176 | \if{latex}{\out{\hypertarget{method-knn_Query_Batch}{}}} 177 | \subsection{Method \code{knn_Query_Batch()}}{ 178 | \subsection{Usage}{ 179 | \if{html}{\out{
}}\preformatted{NMSlib$knn_Query_Batch(query_data, k = 5, num_threads = 1)}\if{html}{\out{
}} 180 | } 181 | 182 | \subsection{Arguments}{ 183 | \if{html}{\out{
}} 184 | \describe{ 185 | \item{\code{query_data}}{the query_data parameter should be of the same type with the \emph{input_data} parameter. Queries to query for} 186 | 187 | \item{\code{k}}{an integer. The number of neighbours to return} 188 | 189 | \item{\code{num_threads}}{an integer. The number of threads to use} 190 | } 191 | \if{html}{\out{
}} 192 | } 193 | } 194 | \if{html}{\out{
}} 195 | \if{html}{\out{}} 196 | \if{latex}{\out{\hypertarget{method-save_Index}{}}} 197 | \subsection{Method \code{save_Index()}}{ 198 | \subsection{Usage}{ 199 | \if{html}{\out{
}}\preformatted{NMSlib$save_Index(filename, save_data = FALSE)}\if{html}{\out{
}} 200 | } 201 | 202 | \subsection{Arguments}{ 203 | \if{html}{\out{
}} 204 | \describe{ 205 | \item{\code{filename}}{a character string specifying the path. The filename to save ( in case of the \emph{save_Index} method ) or the filename to load ( in case of the \emph{load_Index} method )} 206 | 207 | \item{\code{save_data}}{a boolean. If TRUE then besides the index also the data will be saved (see the web links in the \emph{references} section for more details)} 208 | } 209 | \if{html}{\out{
}} 210 | } 211 | } 212 | \if{html}{\out{
}} 213 | \if{html}{\out{}} 214 | \if{latex}{\out{\hypertarget{method-clone}{}}} 215 | \subsection{Method \code{clone()}}{ 216 | The objects of this class are cloneable with this method. 217 | \subsection{Usage}{ 218 | \if{html}{\out{
}}\preformatted{NMSlib$clone(deep = FALSE)}\if{html}{\out{
}} 219 | } 220 | 221 | \subsection{Arguments}{ 222 | \if{html}{\out{
}} 223 | \describe{ 224 | \item{\code{deep}}{Whether to make a deep clone.} 225 | } 226 | \if{html}{\out{
}} 227 | } 228 | } 229 | } 230 | -------------------------------------------------------------------------------- /man/TO_scipy_sparse.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/nmslib.R 3 | \name{TO_scipy_sparse} 4 | \alias{TO_scipy_sparse} 5 | \title{conversion of an R sparse matrix to a scipy sparse matrix} 6 | \usage{ 7 | TO_scipy_sparse(R_sparse_matrix) 8 | } 9 | \arguments{ 10 | \item{R_sparse_matrix}{an R sparse matrix. Acceptable input objects are either a \emph{dgCMatrix} or a \emph{dgRMatrix}.} 11 | } 12 | \description{ 13 | conversion of an R sparse matrix to a scipy sparse matrix 14 | } 15 | \details{ 16 | This function allows the user to convert either an R \emph{dgCMatrix} or a \emph{dgRMatrix} to a scipy sparse matrix (\emph{scipy.sparse.csc_matrix} or \emph{scipy.sparse.csr_matrix}). This is useful because the \emph{nmslibR} package accepts besides an R dense matrix also python sparse matrices as input. 17 | 18 | The \emph{dgCMatrix} class is a class of sparse numeric matrices in the compressed, sparse, \emph{column-oriented format}. The \emph{dgRMatrix} class is a class of sparse numeric matrices in the compressed, sparse, \emph{row-oriented format}. 
19 | } 20 | \examples{ 21 | 22 | try({ 23 | if (reticulate::py_available(initialize = FALSE)) { 24 | if (reticulate::py_module_available("scipy")) { 25 | 26 | if (Sys.info()["sysname"] != 'Darwin') { 27 | 28 | library(nmslibR) 29 | 30 | 31 | # 'dgCMatrix' sparse matrix 32 | #-------------------------- 33 | 34 | data = c(1, 0, 2, 0, 0, 3, 4, 5, 6) 35 | 36 | dgcM = Matrix::Matrix(data = data, nrow = 3, 37 | 38 | ncol = 3, byrow = TRUE, 39 | 40 | sparse = TRUE) 41 | 42 | print(dim(dgcM)) 43 | 44 | res = TO_scipy_sparse(dgcM) 45 | 46 | print(res$shape) 47 | 48 | 49 | # 'dgRMatrix' sparse matrix 50 | #-------------------------- 51 | 52 | dgrM = as(dgcM, "RsparseMatrix") 53 | 54 | print(dim(dgrM)) 55 | 56 | res_dgr = TO_scipy_sparse(dgrM) 57 | 58 | print(res_dgr$shape) 59 | } 60 | } 61 | } 62 | }, silent=TRUE) 63 | } 64 | \references{ 65 | https://stat.ethz.ch/R-manual/R-devel/library/Matrix/html/dgCMatrix-class.html, https://stat.ethz.ch/R-manual/R-devel/library/Matrix/html/dgRMatrix-class.html, https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csc_matrix.html#scipy.sparse.csc_matrix 66 | } 67 | -------------------------------------------------------------------------------- /man/import_internal.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/nmslib.R 3 | \name{import_internal} 4 | \alias{import_internal} 5 | \title{import internal functions from the KernelKnn package} 6 | \usage{ 7 | import_internal(function_name) 8 | } 9 | \description{ 10 | import internal functions from the KernelKnn package 11 | } 12 | \keyword{internal} 13 | -------------------------------------------------------------------------------- /man/inner_kernel_function.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/nmslib.R 3 | 
\name{inner_kernel_function} 4 | \alias{inner_kernel_function} 5 | \title{inner function to compute kernels, extract weights and return predictions} 6 | \usage{ 7 | inner_kernel_function(y_matrix, dist_matrix, Levels, weights_function, h) 8 | } 9 | \description{ 10 | inner function to compute kernels, extract weights and return predictions 11 | } 12 | \keyword{internal} 13 | -------------------------------------------------------------------------------- /man/mat_2scipy_sparse.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/nmslib.R 3 | \name{mat_2scipy_sparse} 4 | \alias{mat_2scipy_sparse} 5 | \title{conversion of an R matrix to a scipy sparse matrix} 6 | \usage{ 7 | mat_2scipy_sparse(x, format = "sparse_row_matrix") 8 | } 9 | \arguments{ 10 | \item{x}{a data matrix} 11 | 12 | \item{format}{a character string. Either \emph{"sparse_row_matrix"} or \emph{"sparse_column_matrix"}} 13 | } 14 | \description{ 15 | conversion of an R matrix to a scipy sparse matrix 16 | } 17 | \details{ 18 | This function allows the user to convert an R matrix to a scipy sparse matrix. This is useful because sparse input to the \emph{nmslibR} package must be a \emph{python} (scipy) sparse matrix, whereas R dense matrices are accepted as they are. 
19 | } 20 | \examples{ 21 | 22 | try({ 23 | if (reticulate::py_available(initialize = FALSE)) { 24 | if (reticulate::py_module_available("scipy")) { 25 | 26 | library(nmslibR) 27 | 28 | set.seed(1) 29 | 30 | x = matrix(runif(1000), nrow = 100, ncol = 10) 31 | 32 | res = mat_2scipy_sparse(x) 33 | 34 | print(dim(x)) 35 | 36 | print(res$shape) 37 | } 38 | } 39 | }, silent=TRUE) 40 | } 41 | \references{ 42 | https://docs.scipy.org/doc/scipy/reference/sparse.html 43 | } 44 | -------------------------------------------------------------------------------- /src/Makevars: -------------------------------------------------------------------------------- 1 | PKG_CXXFLAGS = $(SHLIB_OPENMP_CXXFLAGS) -DARMA_64BIT_WORD 2 | PKG_LIBS = $(LAPACK_LIBS) $(BLAS_LIBS) $(FLIBS) $(SHLIB_OPENMP_CXXFLAGS) 3 | CXX_STD = CXX17 4 | PKG_CPPFLAGS = -I../inst/include/ 5 | -------------------------------------------------------------------------------- /src/Makevars.win: -------------------------------------------------------------------------------- 1 | PKG_CXXFLAGS = $(SHLIB_OPENMP_CXXFLAGS) -DARMA_64BIT_WORD 2 | PKG_LIBS = $(LAPACK_LIBS) $(BLAS_LIBS) $(FLIBS) $(SHLIB_OPENMP_CXXFLAGS) -mthreads 3 | CXX_STD = CXX17 4 | PKG_CPPFLAGS = -I../inst/include/ 5 | -------------------------------------------------------------------------------- /src/RcppExports.cpp: -------------------------------------------------------------------------------- 1 | // Generated by using Rcpp::compileAttributes() -> do not edit by hand 2 | // Generator token: 10BE3573-1514-4C36-9D1C-5A225CD40393 3 | 4 | #include 5 | #include 6 | 7 | using namespace Rcpp; 8 | 9 | #ifdef RCPP_USE_GLOBAL_ROSTREAM 10 | Rcpp::Rostream& Rcpp::Rcout = Rcpp::Rcpp_cout_get(); 11 | Rcpp::Rostream& Rcpp::Rcerr = Rcpp::Rcpp_cerr_get(); 12 | #endif 13 | 14 | // nmslib_idx_dist 15 | Rcpp::List nmslib_idx_dist(std::vector > >& input_list, unsigned int k, int threads); 16 | RcppExport SEXP _nmslibR_nmslib_idx_dist(SEXP input_listSEXP, SEXP kSEXP, SEXP 
threadsSEXP) { 17 | BEGIN_RCPP 18 | Rcpp::RObject rcpp_result_gen; 19 | Rcpp::RNGScope rcpp_rngScope_gen; 20 | Rcpp::traits::input_parameter< std::vector > >& >::type input_list(input_listSEXP); 21 | Rcpp::traits::input_parameter< unsigned int >::type k(kSEXP); 22 | Rcpp::traits::input_parameter< int >::type threads(threadsSEXP); 23 | rcpp_result_gen = Rcpp::wrap(nmslib_idx_dist(input_list, k, threads)); 24 | return rcpp_result_gen; 25 | END_RCPP 26 | } 27 | // y_idxs 28 | arma::mat y_idxs(arma::mat& idxs, std::vector& y, int threads); 29 | RcppExport SEXP _nmslibR_y_idxs(SEXP idxsSEXP, SEXP ySEXP, SEXP threadsSEXP) { 30 | BEGIN_RCPP 31 | Rcpp::RObject rcpp_result_gen; 32 | Rcpp::RNGScope rcpp_rngScope_gen; 33 | Rcpp::traits::input_parameter< arma::mat& >::type idxs(idxsSEXP); 34 | Rcpp::traits::input_parameter< std::vector& >::type y(ySEXP); 35 | Rcpp::traits::input_parameter< int >::type threads(threadsSEXP); 36 | rcpp_result_gen = Rcpp::wrap(y_idxs(idxs, y, threads)); 37 | return rcpp_result_gen; 38 | END_RCPP 39 | } 40 | // check_NaN_Inf 41 | bool check_NaN_Inf(arma::mat x); 42 | RcppExport SEXP _nmslibR_check_NaN_Inf(SEXP xSEXP) { 43 | BEGIN_RCPP 44 | Rcpp::RObject rcpp_result_gen; 45 | Rcpp::RNGScope rcpp_rngScope_gen; 46 | Rcpp::traits::input_parameter< arma::mat >::type x(xSEXP); 47 | rcpp_result_gen = Rcpp::wrap(check_NaN_Inf(x)); 48 | return rcpp_result_gen; 49 | END_RCPP 50 | } 51 | -------------------------------------------------------------------------------- /src/init.c: -------------------------------------------------------------------------------- 1 | #include 2 | #include 3 | #include // for NULL 4 | #include 5 | 6 | /* FIXME: 7 | Check these declarations against the C/Fortran source code. 
8 | */ 9 | 10 | /* .Call calls */ 11 | extern SEXP _nmslibR_check_NaN_Inf(SEXP); 12 | extern SEXP _nmslibR_nmslib_idx_dist(SEXP, SEXP, SEXP); 13 | extern SEXP _nmslibR_y_idxs(SEXP, SEXP, SEXP); 14 | 15 | static const R_CallMethodDef CallEntries[] = { 16 | {"_nmslibR_check_NaN_Inf", (DL_FUNC) &_nmslibR_check_NaN_Inf, 1}, 17 | {"_nmslibR_nmslib_idx_dist", (DL_FUNC) &_nmslibR_nmslib_idx_dist, 3}, 18 | {"_nmslibR_y_idxs", (DL_FUNC) &_nmslibR_y_idxs, 3}, 19 | {NULL, NULL, 0} 20 | }; 21 | 22 | void R_init_nmslibR(DllInfo *dll) 23 | { 24 | R_registerRoutines(dll, NULL, CallEntries, NULL, NULL); 25 | R_useDynamicSymbols(dll, FALSE); 26 | } 27 | -------------------------------------------------------------------------------- /src/utils.cpp: -------------------------------------------------------------------------------- 1 | # include <RcppArmadillo.h> 2 | // [[Rcpp::depends("RcppArmadillo")]] 3 | // [[Rcpp::plugins(openmp)]] 4 | // [[Rcpp::plugins(cpp17)]] 5 | 6 | #ifdef _OPENMP 7 | #include <omp.h> 8 | #endif 9 | 10 | 11 | 12 | // return a named Rcpp list for the output list [ NA's if length of knn's not equal for all cases ] 13 | // 14 | 15 | // [[Rcpp::export]] 16 | Rcpp::List nmslib_idx_dist(std::vector<std::vector<std::vector<double> > >& input_list, unsigned int k, int threads = 1) { 17 | 18 | #ifdef _OPENMP 19 | omp_set_num_threads(threads); 20 | #endif 21 | 22 | unsigned int ROWS = input_list.size(); 23 | arma::mat indices(ROWS, k), distances(ROWS, k); 24 | indices.fill(arma::datum::nan); 25 | distances.fill(arma::datum::nan); 26 | unsigned int i, j; 27 | 28 | #ifdef _OPENMP 29 | #pragma omp parallel for schedule(static) shared(ROWS, input_list, indices, distances) private(i,j) 30 | #endif 31 | for (i = 0; i < ROWS; i++) { 32 | 33 | std::vector<std::vector<double> > inner_vec = input_list[i]; 34 | std::vector<double> inner_idx = inner_vec[0]; 35 | std::vector<double> inner_dist = inner_vec[1]; // it is possible that the length of a vector differs [ not equal to k -- in that case it takes the value of NA ] 36 | 37 | for (j = 1; j < inner_dist.size(); j++) { 
// indexing of inner vector begins from 1 38 | 39 | #ifdef _OPENMP 40 | #pragma omp atomic write 41 | #endif 42 | indices(i, j-1) = inner_idx[j] + 1; // when populating matrices the indices begin from 0 ALSO add 1 ( + 1) to account for the difference in indexing between C++ and R 43 | 44 | #ifdef _OPENMP 45 | #pragma omp atomic write 46 | #endif 47 | distances(i, j-1) = inner_dist[j]; 48 | } 49 | } 50 | 51 | return Rcpp::List::create(Rcpp::Named("knn_idx") = indices, Rcpp::Named("knn_dist") = distances); 52 | } 53 | 54 | 55 | 56 | // build matrix from response (y) and output-knn-indices [ account for the case where an index is NA ] 57 | // 58 | 59 | // [[Rcpp::export]] 60 | arma::mat y_idxs(arma::mat& idxs, std::vector<double>& y, int threads = 1) { 61 | 62 | #ifdef _OPENMP 63 | omp_set_num_threads(threads); 64 | #endif 65 | 66 | unsigned int NROWS = idxs.n_rows; 67 | unsigned int NCOLS = idxs.n_cols; 68 | arma::mat out(NROWS, NCOLS); 69 | unsigned int i,j; 70 | 71 | #ifdef _OPENMP 72 | #pragma omp parallel for schedule(static) shared(NROWS, idxs, NCOLS, out, y) private(i,j) 73 | #endif 74 | for (i = 0; i < NROWS; i++) { 75 | 76 | for (j = 0; j < NCOLS; j++) { 77 | 78 | if (idxs(i,j) != idxs(i,j)) { // if NA append nan-value 79 | 80 | #ifdef _OPENMP 81 | #pragma omp atomic write 82 | #endif 83 | out(i,j) = arma::datum::nan; 84 | } 85 | else { 86 | 87 | #ifdef _OPENMP 88 | #pragma omp atomic write 89 | #endif 90 | out(i,j) = y[idxs(i,j) - 1]; // account for the difference in indexing betw. 
R and C++ 91 | } 92 | } 93 | } 94 | 95 | return out; 96 | } 97 | 98 | 99 | // it returns TRUE if the matrix does not include NaN's or +/- Inf 100 | // it returns FALSE if at least one value is NaN or +/- Inf 101 | // 102 | 103 | // [[Rcpp::export]] 104 | bool check_NaN_Inf(arma::mat x) { 105 | return x.is_finite(); 106 | } 107 | 108 | -------------------------------------------------------------------------------- /tests/testthat.R: -------------------------------------------------------------------------------- 1 | library(testthat) 2 | library(nmslibR) 3 | 4 | test_check("nmslibR") 5 | -------------------------------------------------------------------------------- /tests/testthat/helper-init.R: -------------------------------------------------------------------------------- 1 | 2 | # prefer Python 3 if available [ see: https://github.com/rstudio/reticulate/blob/master/tests/testthat/helper-init.R ] 3 | if (!reticulate::py_available(initialize = FALSE) && 4 | is.na(Sys.getenv("RETICULATE_PYTHON", unset = NA))) 5 | { 6 | python <- Sys.which("python3") 7 | if (nzchar(python)) 8 | reticulate::use_python(python, required = TRUE) 9 | } 10 | -------------------------------------------------------------------------------- /tests/testthat/helper-skip.R: -------------------------------------------------------------------------------- 1 | 2 | 3 | #....................................... 4 | # skip a test if python is not available [ see: https://github.com/rstudio/reticulate/tree/master/tests/testthat ] 5 | #....................................... 6 | 7 | skip_test_if_no_python <- function() { 8 | if (!reticulate::py_available(initialize = FALSE)) 9 | testthat::skip("Python bindings not available for testing") 10 | } 11 | 12 | 13 | #................................................................ 14 | # helper function to skip tests if we don't have the 'foo' module 15 | #................................................................ 
16 | 17 | skip_test_if_no_module <- function(MODULE) { # MODULE is of type character string ( length(MODULE) >= 1 ) 18 | 19 | if (length(MODULE) == 1) { 20 | 21 | module_exists <- reticulate::py_module_available(MODULE)} 22 | 23 | else { 24 | 25 | module_exists <- sum(as.vector(sapply(MODULE, function(x) reticulate::py_module_available(x)))) == length(MODULE) 26 | } 27 | 28 | if (!module_exists) { 29 | 30 | testthat::skip(paste0(MODULE, " is not available for testthat-testing")) 31 | } 32 | } 33 | 34 | -------------------------------------------------------------------------------- /tests/testthat/setup.R: -------------------------------------------------------------------------------- 1 | 2 | # data 3 | #----- 4 | 5 | set.seed(1) 6 | x = matrix(runif(1000), nrow = 100, ncol = 10) 7 | 8 | x_lst = list(x, x) 9 | 10 | 11 | # response regression 12 | #-------------------- 13 | 14 | set.seed(3) 15 | y_reg = runif(100) 16 | 17 | 18 | # response "binary" classification 19 | #--------------------------------- 20 | 21 | set.seed(4) 22 | y_BINclass = sample(1:2, 100, replace = T) 23 | 24 | 25 | # response "multiclass" classification 26 | #------------------------------------- 27 | 28 | set.seed(5) 29 | y_MULTIclass = sample(1:3, 100, replace = T) 30 | 31 | 32 | # data for sparse matrices 33 | #------------------------- 34 | 35 | data(ionosphere, package = 'KernelKnn') 36 | 37 | X = as.matrix(ionosphere[, -c(1:2, ncol(ionosphere))]) 38 | -------------------------------------------------------------------------------- /tests/testthat/test-nmslibR_pkg.R: -------------------------------------------------------------------------------- 1 | 2 | 3 | context('tests for nmslibR pkg') 4 | 5 | 6 | # conversion of an R matrix to a scipy sparse matrix 7 | #--------------------------------------------------- 8 | 9 | testthat::test_that("the 'mat_2scipy_sparse' returns an error in case that the 'format' parameter is invalid", { 10 | 11 | skip_test_if_no_python() 12 | 
skip_test_if_no_module("scipy") 13 | 14 | testthat::expect_error( mat_2scipy_sparse(x, format = 'invalid') ) 15 | }) 16 | 17 | 18 | testthat::test_that("the 'mat_2scipy_sparse' returns a scipy sparse matrix", { 19 | 20 | skip_test_if_no_python() 21 | skip_test_if_no_module("scipy") 22 | 23 | res = mat_2scipy_sparse(x, format = 'sparse_row_matrix') 24 | cl_obj = class(res)[1] # class is python object 25 | same_dims = sum(unlist(reticulate::py_to_r(res$shape)) == dim(x)) == 2 # sparse matrix has same dimensions as input dense matrix 26 | 27 | testthat::expect_true( same_dims && cl_obj == "scipy.sparse.csr.csr_matrix" ) 28 | }) 29 | 30 | 31 | 32 | # conversion of an R sparse matrix to a scipy sparse matrix 33 | #----------------------------------------------------------- 34 | 35 | # run the following tests on all operating systems except for 'Macintosh' 36 | # [ otherwise it will raise an error due to the fact that the 'scipy-sparse' library ( applied on 'TO_scipy_sparse' function) 37 | # on CRAN is not upgraded and the older version includes a bug ('TypeError : could not interpret data type') ] 38 | # reference : https://github.com/scipy/scipy/issues/5353 39 | 40 | if (Sys.info()["sysname"] != 'Darwin') { 41 | 42 | testthat::test_that("the 'TO_scipy_sparse' function returns an error in case that the input object is not of type 'dgCMatrix' or 'dgRMatrix'", { 43 | 44 | skip_test_if_no_python() 45 | skip_test_if_no_module("scipy") 46 | 47 | mt = matrix(runif(20), nrow = 5, ncol = 4) 48 | 49 | testthat::expect_error( TO_scipy_sparse(mt) ) 50 | }) 51 | 52 | 53 | testthat::test_that("the 'TO_scipy_sparse' returns the correct output if the input is a 'dgCMatrix'", { 54 | 55 | skip_test_if_no_python() 56 | skip_test_if_no_module("scipy") 57 | 58 | data = c(1, 0, 2, 0, 0, 3, 4, 5, 6) 59 | 60 | dgcM = Matrix::Matrix(data = data, nrow = 3, 61 | ncol = 3, byrow = TRUE, 62 | sparse = TRUE) 63 | 64 | res = TO_scipy_sparse(dgcM) 65 | cl_obj = class(res)[1] # class is python object 
66 | validate_dims = sum(dim(dgcM) == unlist(reticulate::py_to_r(res$shape))) == 2 # sparse matrix has same dimensions as input R sparse matrix 67 | 68 | testthat::expect_true( validate_dims && cl_obj == "scipy.sparse.csc.csc_matrix" ) 69 | }) 70 | 71 | 72 | testthat::test_that("the 'TO_scipy_sparse' returns the correct output if the input is a 'dgRMatrix'", { 73 | 74 | skip_test_if_no_python() 75 | skip_test_if_no_module("scipy") 76 | 77 | data = c(1, 0, 2, 0, 0, 3, 4, 5, 6) 78 | 79 | dgcM = Matrix::Matrix(data = data, nrow = 3, 80 | ncol = 3, byrow = TRUE, 81 | sparse = TRUE) 82 | 83 | dgrM = as(dgcM, "RsparseMatrix") 84 | res = TO_scipy_sparse(dgrM) 85 | cl_obj = class(res)[1] # class is python object 86 | validate_dims = sum(dim(dgrM) == unlist(reticulate::py_to_r(res$shape))) == 2 # sparse matrix has same dimensions as input R sparse matrix 87 | 88 | testthat::expect_true( validate_dims && cl_obj == "scipy.sparse.csr.csr_matrix" ) 89 | }) 90 | } 91 | 92 | 93 | # tests for 'NMSlib' class 94 | #------------------------- 95 | 96 | 97 | testthat::test_that("the NMSlib class works with default settings", { 98 | 99 | skip_test_if_no_python() 100 | skip_test_if_no_module('nmslib') 101 | 102 | init_nms = NMSlib$new(input_data = x, Index_Params = NULL, Time_Params = NULL, space='l1', space_params = NULL, 103 | method = 'hnsw', data_type = 'DENSE_VECTOR', dtype = 'FLOAT', index_filepath = NULL, print_progress = FALSE) 104 | 105 | knns = 5 106 | tmp_res = init_nms$Knn_Query(x[1, ], k = knns) 107 | 108 | testthat::expect_true( inherits(tmp_res, 'list') && length(tmp_res) == 2 && all(unlist(lapply(tmp_res, length)) == knns) ) 109 | }) 110 | 111 | 112 | 113 | testthat::test_that("the NMSlib class works with default settings [ and 'input_data' is a list ]", { 114 | 115 | skip_test_if_no_python() 116 | skip_test_if_no_module('nmslib') 117 | 118 | init_nms = NMSlib$new(input_data = x_lst, Index_Params = NULL, Time_Params = NULL, space='l1', space_params = NULL, 119 | method = 
'hnsw', data_type = 'DENSE_VECTOR', dtype = 'FLOAT', index_filepath = NULL, print_progress = FALSE) 120 | 121 | knns = 5 122 | tmp_res = init_nms$Knn_Query(x[2, ], k = knns) 123 | 124 | testthat::expect_true( inherits(tmp_res, 'list') && length(tmp_res) == 2 && all(unlist(lapply(tmp_res, length)) == knns) ) 125 | }) 126 | 127 | 128 | testthat::test_that("the NMSlib class works with default settings [ and 'Time_Params' is a list of parameters ]", { 129 | 130 | skip_test_if_no_python() 131 | skip_test_if_no_module('nmslib') 132 | 133 | TIME_PARAMS = list(efSearch = 50) 134 | 135 | init_nms = NMSlib$new(input_data = x, Index_Params = NULL, Time_Params = TIME_PARAMS, space='l1', space_params = NULL, 136 | method = 'hnsw', data_type = 'DENSE_VECTOR', dtype = 'FLOAT', index_filepath = NULL, print_progress = FALSE) 137 | 138 | knns = 5 139 | tmp_res = init_nms$knn_Query_Batch(x, k = knns) 140 | 141 | testthat::expect_true( inherits(tmp_res, 'list') && length(tmp_res) == 2 && sum(unlist(lapply(tmp_res, function(x) inherits(x, 'matrix')))) == 2 && 142 | sum(unlist(lapply(tmp_res, function(x) ncol(x) == knns))) == 2) 143 | }) 144 | 145 | 146 | 147 | 148 | # tests for 'KernelKnn_nmslib' function 149 | #-------------------------------------- 150 | 151 | 152 | testthat::test_that("the KernelKnn_nmslib function works with default settings [ regression ]", { 153 | 154 | skip_test_if_no_python() 155 | skip_test_if_no_module('nmslib') 156 | 157 | tmp_knn = KernelKnn_nmslib(data = x, TEST_data = NULL, y = y_reg, k = 5, h = 1.0, weights_function = NULL, Levels = NULL, Index_Params = NULL, 158 | Time_Params = NULL, space='l1', space_params = NULL, method = 'hnsw', data_type = 'DENSE_VECTOR', 159 | dtype = 'FLOAT', index_filepath = NULL, print_progress = FALSE, num_threads = 1) 160 | 161 | testthat::expect_true( inherits(tmp_knn, 'numeric') && length(tmp_knn) == nrow(x) ) 162 | }) 163 | 164 | 165 | 166 | testthat::test_that("the KernelKnn_nmslib function works with default settings [ 
binary classification ]", { 167 | 168 | skip_test_if_no_python() 169 | skip_test_if_no_module('nmslib') 170 | 171 | tmp_knn = KernelKnn_nmslib(data = x, TEST_data = NULL, y = y_BINclass, k = 5, h = 1.0, weights_function = NULL, Levels = sort(unique(y_BINclass)), Index_Params = NULL, 172 | Time_Params = NULL, space='l1', space_params = NULL, method = 'hnsw', data_type = 'DENSE_VECTOR', 173 | dtype = 'FLOAT', index_filepath = NULL, print_progress = FALSE, num_threads = 1) 174 | 175 | testthat::expect_true( inherits(tmp_knn, 'matrix') && nrow(tmp_knn) == nrow(x) && ncol(tmp_knn) == length(unique(y_BINclass)) ) 176 | }) 177 | 178 | 179 | 180 | testthat::test_that("the KernelKnn_nmslib function works with default settings [ binary classification AND TEST_data is not NULL ]", { 181 | 182 | skip_test_if_no_python() 183 | skip_test_if_no_module('nmslib') 184 | 185 | set.seed(2) 186 | samp = sample(1:nrow(x), round(0.8 * nrow(x))) 187 | samp_ = setdiff(1:nrow(x), samp) 188 | 189 | tmp_knn = KernelKnn_nmslib(data = x[samp, ], TEST_data = x[samp_, ], y = y_BINclass[samp], k = 5, h = 1.0, weights_function = NULL, 190 | Levels = sort(unique(y_BINclass)), Index_Params = NULL, Time_Params = NULL, space='l1', space_params = NULL, 191 | method = 'hnsw', data_type = 'DENSE_VECTOR', dtype = 'FLOAT', index_filepath = NULL, print_progress = FALSE, 192 | num_threads = 1) 193 | 194 | testthat::expect_true( inherits(tmp_knn, 'matrix') && nrow(tmp_knn) == nrow(x[samp_, ]) && ncol(tmp_knn) == length(unique(y_BINclass)) ) 195 | }) 196 | 197 | 198 | 199 | testthat::test_that("the KernelKnn_nmslib function works with default settings [ multiclass classification ]", { 200 | 201 | skip_test_if_no_python() 202 | skip_test_if_no_module('nmslib') 203 | 204 | tmp_knn = KernelKnn_nmslib(data = x, TEST_data = NULL, y = y_MULTIclass, k = 5, h = 1.0, weights_function = 'uniform', Levels = sort(unique(y_MULTIclass)), Index_Params = NULL, 205 | Time_Params = NULL, space='l1', space_params = NULL, method = 
'hnsw', data_type = 'DENSE_VECTOR', 206 | dtype = 'FLOAT', index_filepath = NULL, print_progress = FALSE, num_threads = 1) 207 | 208 | testthat::expect_true( inherits(tmp_knn, 'matrix') && nrow(tmp_knn) == nrow(x) && ncol(tmp_knn) == length(unique(y_MULTIclass)) ) 209 | }) 210 | 211 | 212 | 213 | # tests for 'KernelKnnCV_nmslib' function 214 | #---------------------------------------- 215 | 216 | 217 | testthat::test_that("the KernelKnnCV_nmslib function works with default settings [ regression ]", { 218 | 219 | skip_test_if_no_python() 220 | skip_test_if_no_module('nmslib') 221 | 222 | FOLDS = 4 223 | 224 | tmp_knn = KernelKnnCV_nmslib(data = x, y = y_reg, k = 5, folds = FOLDS, h = 1.0, weights_function = NULL, Levels = NULL, Index_Params = NULL, 225 | Time_Params = NULL, space='l1', space_params = NULL, method = 'hnsw', data_type = 'DENSE_VECTOR', 226 | dtype = 'FLOAT', index_filepath = NULL, print_progress = FALSE, num_threads = 1, seed_num = 1) 227 | 228 | testthat::expect_true( inherits(tmp_knn, 'list') && all(names(tmp_knn) %in% c("preds", "folds")) && all(as.vector(unlist(lapply(tmp_knn, function(x) lapply(x, function(y) length(y))))) == nrow(x) / FOLDS) ) 229 | }) 230 | 231 | 232 | testthat::test_that("the KernelKnnCV_nmslib function works with default settings [ classification ]", { 233 | 234 | skip_test_if_no_python() 235 | skip_test_if_no_module('nmslib') 236 | 237 | FOLDS = 4 238 | 239 | tmp_knn = KernelKnnCV_nmslib(data = x, y = y_BINclass, k = 5, folds = FOLDS, h = 1.0, weights_function = NULL, Levels = sort(unique(y_BINclass)), Index_Params = NULL, 240 | Time_Params = NULL, space='l1', space_params = NULL, method = 'hnsw', data_type = 'DENSE_VECTOR', 241 | dtype = 'FLOAT', index_filepath = NULL, print_progress = FALSE, num_threads = 1, seed_num = 1) 242 | 243 | testthat::expect_true( inherits(tmp_knn, 'list') && all(names(tmp_knn) %in% c("preds", "folds")) && 244 | all(as.vector(unlist(lapply(tmp_knn$preds, function(y) nrow(y)))) == nrow(x) / FOLDS) 
&& 245 | all(as.vector(unlist(lapply(tmp_knn$folds, function(y) length(y)))) == nrow(x) / FOLDS)) 246 | }) 247 | 248 | 249 | 250 | 251 | # sparse datasets 252 | #---------------- 253 | 254 | 255 | testthat::test_that("the NMSlib class works with sparse data in case of 'knn_Query_Batch' [ specify as data_type a 'SPARSE_VECTOR' ]", { 256 | 257 | skip_test_if_no_python() 258 | skip_test_if_no_module(c('nmslib', 'scipy')) 259 | 260 | sparse_x = mat_2scipy_sparse(x, format = 'sparse_row_matrix') 261 | 262 | init_nms = NMSlib$new(input_data = sparse_x, Index_Params = NULL, Time_Params = NULL, space='l1', space_params = NULL, 263 | method = 'hnsw', data_type = 'SPARSE_VECTOR', dtype = 'FLOAT', index_filepath = NULL, print_progress = FALSE) 264 | 265 | knns = 5 266 | tmp_res = init_nms$knn_Query_Batch(sparse_x, k = knns) # it would be tricky to do the same with "Knn_Query" as it will require firstly a python object as input and secondly a sparse unit 267 | 268 | testthat::expect_true( inherits(tmp_res, 'list') && length(tmp_res) == 2 && all(unlist(lapply(tmp_res, ncol)) == knns) ) 269 | }) 270 | 271 | 272 | 273 | testthat::test_that("the KernelKnn_nmslib function works with sparse data in case of regression [ specify as data_type a 'SPARSE_VECTOR' ]", { 274 | 275 | skip_test_if_no_python() 276 | skip_test_if_no_module(c('nmslib', 'scipy')) 277 | 278 | sparse_x = mat_2scipy_sparse(x, format = 'sparse_row_matrix') 279 | 280 | tmp_knn = KernelKnn_nmslib(data = sparse_x, TEST_data = NULL, y = y_reg, k = 5, h = 1.0, weights_function = NULL, Levels = NULL, Index_Params = NULL, 281 | Time_Params = NULL, space='l1', space_params = NULL, method = 'hnsw', data_type = 'SPARSE_VECTOR', 282 | dtype = 'FLOAT', index_filepath = NULL, print_progress = FALSE, num_threads = 1) 283 | 284 | testthat::expect_true( inherits(tmp_knn, 'numeric') && length(tmp_knn) == unlist(reticulate::py_to_r(sparse_x$shape))[1] ) 285 | }) 286 | 287 | 288 | 
#================================================================================================================================================================================= 289 | 290 | # run the following tests on all operating systems except for 'Macintosh' 291 | # [ otherwise it will raise an error due to the fact that the 'scipy-sparse' library ( applied on 'TO_scipy_sparse' function) 292 | # on CRAN is not upgraded and the older version includes a bug ('TypeError : could not interpret data type') ] 293 | # reference : https://github.com/scipy/scipy/issues/5353 294 | 295 | 296 | if (Sys.info()["sysname"] != 'Darwin') { 297 | 298 | testthat::test_that("the KernelKnn and nmslibR packages return the same output in case of 'dense' or 'sparse' matrices (sequential search / brute force)", { 299 | 300 | skip_test_if_no_python() 301 | skip_test_if_no_module(c('nmslib', 'scipy')) 302 | 303 | mt_2_sprm = mat_2scipy_sparse(X, format = 'sparse_row_matrix') # first case : R-matrix to scipy-sparse-row-matrix 304 | mt_2_dgr = as(X, "dgRMatrix") 305 | dgr_2_scsp = TO_scipy_sparse(R_sparse_matrix = mt_2_dgr) # second case : R-sparse-matrix to scipy-sparse-row-matrix 306 | 307 | 308 | # test that both 'dense knn' (using an R object) and 'sparse knn' (using a python scipy sparse object) return the same output 309 | #---------------------------------------------------------------------------------------------------------------------------- 310 | 311 | dist_knn = KernelKnn::knn.index.dist(X, TEST_data = NULL, k = 5, method = "euclidean") # the corresponding distance for 'euclidean' in nmslibR is 'l2' or 'l2_sparse (in case of sparse matrices). Page 31 of manual. 
312 | 313 | 314 | # nmslibR with "dense" data and sequential search 315 | #------------------------------------------------ 316 | 317 | init_nms = NMSlib$new(input_data = X, space = "l2", method = 'seq_search', 318 | data_type = 'DENSE_VECTOR', dtype = 'FLOAT', print_progress = F) 319 | 320 | all_dat = init_nms$knn_Query_Batch(X, k = 5, num_threads = 1) 321 | 322 | 323 | 324 | # nmslibR with "sparse" data and sequential search 325 | #------------------------------------------------- 326 | 327 | init_nms_spr = NMSlib$new(input_data = dgr_2_scsp, space = "l2_sparse", method = 'seq_search', 328 | data_type = 'SPARSE_VECTOR', dtype = 'FLOAT', print_progress = F) 329 | 330 | all_dat_spr = init_nms_spr$knn_Query_Batch(dgr_2_scsp, k = 5, num_threads = 1) 331 | 332 | 333 | # all three outputs (dist_knn, all_dat, all_dat_spr) must return approximately equal results 334 | #------------------------------------------------------------------------------------------- 335 | 336 | # indices [ first 6 rows ] 337 | 338 | tmp1 = identical(dist_knn$train_knn_idx[1:6, ], all_dat$knn_idx[1:6, ]) 339 | tmp2 = identical(all_dat_spr$knn_idx[1:6, ], all_dat$knn_idx[1:6, ]) 340 | tmp3 = identical(dist_knn$train_knn_idx[1:6, ], all_dat_spr$knn_idx[1:6, ]) 341 | 342 | # distances [ last 6 rows ] -- approximately equal ( use of round() function ) 343 | 344 | tmp_row1 = identical(round(tail(dist_knn$train_knn_dist), 4), round(tail(all_dat$knn_dist), 4)) 345 | tmp_row2 = identical(round(tail(all_dat_spr$knn_idx), 4), round(tail(all_dat$knn_idx), 4)) 346 | tmp_row3 = identical(round(tail(dist_knn$train_knn_idx), 4), round(tail(all_dat_spr$knn_idx), 4)) 347 | 348 | testthat::expect_true( all(tmp1, tmp2, tmp3, tmp_row1, tmp_row2, tmp_row3) ) 349 | }) 350 | } 351 | 352 | 353 | #--------------------------------------------------------- 354 | # THE FOLLOWING TWO FUNCTIONS DO NOT WORK WITH SPARSE DATA [ probably it has to do with subsetting / indexing of sparse matrices (does not work as in dense 
matrices), especially if I split the data in two or more parts ] 355 | #--------------------------------------------------------- 356 | 357 | 358 | # testthat::test_that("the NMSlib class works with sparse data in case of 'Knn_Query' [ specify as data_type a 'SPARSE_VECTOR' ]", { 359 | # 360 | # skip_test_if_no_module(c('nmslib', 'scipy')) 361 | # 362 | # sparse_x = mat_2scipy_sparse(x, format = 'sparse_row_matrix') 363 | # 364 | # init_nms = NMSlib$new(input_data = sparse_x, Index_Params = NULL, Time_Params = NULL, space='l1_sparse', space_params = NULL, 365 | # 366 | # method = 'hnsw', data_type = 'SPARSE_VECTOR', dtype = 'FLOAT', index_filepath = NULL, print_progress = FALSE) 367 | # 368 | # knns = 5 369 | # 370 | # tmp_res = init_nms$Knn_Query( sparse_x$getrow(1), k = knns) # use 'getrow() to subset the sparse matrix [ DOES NOT WORK ] 371 | # 372 | # testthat::expect_true( inherits(tmp_res, 'list') && length(tmp_res) == 2 && all(unlist(lapply(tmp_res, ncol)) == knns) ) 373 | # }) 374 | # 375 | # 376 | # 377 | # 378 | # testthat::test_that("the KernelKnnCV_nmslib function works with sparse data in case of classification [ specify as data_type a 'SPARSE_VECTOR' ]", { 379 | # 380 | # skip_test_if_no_module(c('nmslib', 'scipy')) 381 | # 382 | # dgcM = Matrix::Matrix(data = sample(c(rep(0.0, 5), runif(2)), 1000, replace = T), nrow = 100, 383 | # 384 | # ncol = 10, byrow = TRUE, 385 | # 386 | # sparse = TRUE) 387 | # 388 | # FOLDS = 4 389 | # 390 | # tmp_knn = KernelKnnCV_nmslib(data = dgcM, y = y_BINclass, k = 5, folds = FOLDS, h = 1.0, weights_function = NULL, Levels = sort(unique(y_BINclass)), # splitting the dgcM internally and creating scipy-sparse sub-matrices returns an error when inputing to the function 391 | # 392 | # Index_Params = NULL, Time_Params = NULL, space='l1_sparse', space_params = NULL, method = 'hnsw', data_type = 'SPARSE_VECTOR', 393 | # 394 | # dtype = 'FLOAT', index_filepath = NULL, print_progress = FALSE, num_threads = 1, seed_num = 1) 395 | 
# 396 | # testthat::expect_true( inherits(tmp_knn, 'list') && names(tmp_knn) %in% c("preds", "folds") && 397 | # all(as.vector(unlist(lapply(tmp_knn$preds, function(y) nrow(y)))) == nrow(x) / FOLDS) && 398 | # all(as.vector(unlist(lapply(tmp_knn$folds, function(y) length(y)))) == nrow(x) / FOLDS)) 399 | # }) 400 | 401 | 402 | #================================================================================================================================================================================= 403 | -------------------------------------------------------------------------------- /tic.R: -------------------------------------------------------------------------------- 1 | # installs dependencies, runs R CMD check, runs covr::codecov() 2 | do_package_checks() 3 | 4 | if (ci_on_ghactions() && ci_has_env("BUILD_PKGDOWN")) { 5 | # creates pkgdown site and pushes to gh-pages branch 6 | # only for the runner with the "BUILD_PKGDOWN" env var set 7 | do_pkgdown() 8 | } 9 | -------------------------------------------------------------------------------- /vignettes/the_nmslibR_package.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Non Metric Space ( Approximate ) Library in R" 3 | author: "Lampros Mouselimis" 4 | date: "`r Sys.Date()`" 5 | output: rmarkdown::html_vignette 6 | vignette: > 7 | %\VignetteIndexEntry{Non Metric Space ( Approximate ) Library in R} 8 | %\VignetteEngine{knitr::rmarkdown} 9 | %\VignetteEncoding{UTF-8} 10 | --- 11 | 12 | 13 | 14 | The **nmslibR** package is a wrapper of [*NMSLIB*](https://github.com/nmslib/nmslib), which according to the authors "... is a similarity search library and a toolkit for evaluation of similarity search methods. The goal of the project is to create an effective and comprehensive toolkit for searching in generic non-metric spaces. Being comprehensive is important, because no single method is likely to be sufficient in all cases. 
Also note that exact solutions are hardly efficient in high dimensions and/or non-metric spaces. Hence, the main focus is on approximate methods". 15 | 16 | Before wrapping NMSLIB, I searched for some time for a nearest-neighbor library that works with high-dimensional data and scales to big datasets. I've already written a package for k-nearest-neighbor search ([KernelKnn](https://CRAN.R-project.org/package=KernelKnn)); however, it is based on brute force, so the computation time becomes considerable when the data consists of many rows. The *nmslibR* package, besides the main functionality of the NMSLIB python library, also includes an approximate kernel k-nearest-neighbor function which, as I will show below, is both fast and accurate. A comparison of NMSLIB with other popular approximate k-nearest-neighbor methods can be found [here](https://github.com/erikbern/ann-benchmarks). 17 | 18 |
19 | 20 | The NMSLIB library, 21 | 22 | * is a collection of search methods for generic spaces 23 | * has both metric and non-metric search algorithms 24 | * has both exact and approximate search algorithms 25 | * is an evaluation toolkit that simplifies experimentation and processing of results 26 | * is extensible (new spaces and methods can be added) 27 | * is designed to be efficient 28 | 29 |
30 | 31 | Details can be found in the [NMSLIB-manual](https://github.com/nmslib/nmslib/blob/master/manual/latex/manual.pdf). 32 | 33 | 34 |
35 | 36 | #### The nmslibR package 37 | 38 |
39 | 40 | The *nmslibR* package includes the following R6-class / functions, 41 | 42 |
43 | 44 | ##### **class** 45 | 46 | 47 |
48 | 49 | 50 | | NMSlib | 51 | | :------------------: | 52 | | Knn_Query() | 53 | | knn_Query_Batch() | 54 | | save_Index() | 55 | 56 | 57 | 58 |
59 | 60 | 61 | ##### **functions** 62 | 63 | 64 | **UPDATE 10-05-2018** : Beginning with version **1.0.2**, the **dgCMatrix_2scipy_sparse** function was renamed to **TO_scipy_sparse** and now accepts either a *dgCMatrix* or a *dgRMatrix* as input. The appropriate format for the nmslibR package in case of sparse matrices is the **dgRMatrix** format (*scipy.sparse.csr_matrix*). 65 | 66 | 67 |
68 | 69 | | KernelKnn_nmslib() | 70 | | :------------------------| 71 | 72 | | KernelKnnCV_nmslib() | 73 | | :------------------------| 74 | 75 | | TO_scipy_sparse() | 76 | | :-----------------| 77 | 78 | | mat_2scipy_sparse() | 79 | | :-------------------| 80 | 81 |
82 | 83 | 84 | The package documentation includes details and examples for the R6-class and functions. I'll start by explaining how to work with sparse matrices, because the input can also be a **python sparse matrix**. 85 | 86 |
87 | 88 | 89 | #### Sparse matrices as input 90 | 91 |
92 | 93 | The nmslibR package includes two functions (**mat_2scipy_sparse** and **TO_scipy_sparse**) which allow the user to convert from a *matrix* / *sparse matrix* (*dgCMatrix*, *dgRMatrix*) to a *scipy sparse matrix* (*scipy.sparse.csc_matrix*, *scipy.sparse.csr_matrix*), 94 | 95 |
96 | 97 | ```{r, eval = F, echo = T} 98 | 99 | library(nmslibR) 100 | 101 | # conversion from a matrix object to a scipy sparse matrix 102 | #---------------------------------------------------------- 103 | 104 | set.seed(1) 105 | 106 | x = matrix(runif(1000), nrow = 100, ncol = 10) 107 | 108 | x_sparse = mat_2scipy_sparse(x, format = "sparse_row_matrix") 109 | 110 | print(dim(x)) 111 | 112 | # [1] 100 10 113 | 114 | print(x_sparse$shape) 115 | 116 | # (100, 10) 117 | 118 | ``` 119 | 120 |
121 | 122 | 123 | ```{r, eval = F, echo = T} 124 | 125 | # conversion from a dgCMatrix / dgRMatrix object to a scipy sparse matrix 126 | #------------------------------------------------------------------------- 127 | 128 | data = c(1, 0, 2, 0, 0, 3, 4, 5, 6) 129 | 130 | 131 | # 'dgCMatrix' sparse matrix 132 | #-------------------------- 133 | 134 | dgcM = Matrix::Matrix(data = data, nrow = 3, 135 | 136 | ncol = 3, byrow = TRUE, 137 | 138 | sparse = TRUE) 139 | 140 | print(dim(dgcM)) 141 | 142 | # [1] 3 3 143 | 144 | x_sparse = TO_scipy_sparse(dgcM) 145 | 146 | print(x_sparse$shape) 147 | 148 | # (3, 3) 149 | 150 | 151 | # 'dgRMatrix' sparse matrix 152 | #-------------------------- 153 | 154 | dgrM = as(dgcM, "RsparseMatrix") 155 | 156 | class(dgrM) 157 | 158 | # [1] "dgRMatrix" 159 | # attr(,"package") 160 | # [1] "Matrix" 161 | 162 | print(dim(dgrM)) 163 | 164 | # [1] 3 3 165 | 166 | res_dgr = TO_scipy_sparse(dgrM) 167 | 168 | print(res_dgr$shape) 169 | 170 | # (3, 3) 171 | 172 | ``` 173 | 174 | 175 |
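To make the row-oriented (*dgRMatrix*) layout more concrete, the following is a minimal sketch that needs only the *Matrix* package (no Python). It inspects the three slots of a *dgRMatrix* (`p`, `j`, `x`), which hold the same kind of information that a compressed sparse row matrix keeps as row pointers, column indices and values. That `TO_scipy_sparse()` passes exactly these arrays to scipy is an assumption on my side and is not verified here; the variable names (`row_ptr`, `col_idx`, `values`, `dense`) are illustrative only.

```r
library(Matrix)

data = c(1, 0, 2, 0, 0, 3, 4, 5, 6)

dgcM = Matrix::Matrix(data = data, nrow = 3, ncol = 3, byrow = TRUE, sparse = TRUE)
dgrM = as(dgcM, "RsparseMatrix")

# compressed sparse row layout: the 'p' and 'j' slots are 0-based
row_ptr = dgrM@p      # row pointers, length nrow + 1
col_idx = dgrM@j      # column index of every non-zero value
values  = dgrM@x      # the non-zero values, stored row by row

# rebuild the dense matrix from the three arrays as a sanity check
dense = matrix(0, nrow = nrow(dgrM), ncol = ncol(dgrM))

for (i in seq_len(nrow(dgrM))) {
  if (row_ptr[i + 1] > row_ptr[i]) {                 # skip rows without non-zero values
    pos = (row_ptr[i] + 1):row_ptr[i + 1]            # shift to R's 1-based positions
    dense[i, col_idx[pos] + 1] = values[pos]
  }
}
```

If the reconstruction is correct, `dense` should be equal to `as.matrix(dgcM)`.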
176 | 177 | 178 | #### The NMSlib R6-class 179 | 180 | 181 |
182 | 183 | The parameter settings for the *NMSlib* R6-class can be found in the [Non-Metric Space Library (NMSLIB) Manual](https://github.com/nmslib/nmslib/blob/master/manual/latex/manual.pdf), which explains the NMSLIB Library in detail. In the following code chunk, I'll show the functionality of the included methods using a [data set from my Github repository](https://github.com/mlampros/DataSets) (it also appears in an [.ipynb notebook in the nmslib Github repository](https://github.com/nmslib/nmslib/blob/master/python_bindings/notebooks/search_sift_uint8.ipynb)). 184 | 185 |
186 | 187 | ```{r, eval = F, echo = T} 188 | 189 | 190 | library(nmslibR) 191 | 192 | 193 | # download the data from my Github repository into the working directory (tested on a Linux OS) 194 | #------------------------------------------------------------------- 195 | 196 | system("wget https://raw.githubusercontent.com/mlampros/DataSets/master/sift_10k.txt") 197 | 198 | 199 | # load the data in the R session (same directory that 'wget' wrote to) 200 | #------------------------------- 201 | 202 | sift_10k = read.table("sift_10k.txt", quote="\"", comment.char="") 203 | 204 | 205 | # index parameters 206 | #----------------- 207 | 208 | M = 15 209 | efC = 100 210 | num_threads = 5 211 | 212 | index_params = list('M'= M, 'indexThreadQty' = num_threads, 'efConstruction' = efC, 213 | 214 | 'post' = 0, 'skip_optimized_index' = 1 ) 215 | 216 | 217 | # query-time parameters 218 | #---------------------- 219 | 220 | efS = 100 221 | 222 | query_time_params = list('efSearch' = efS) 223 | 224 | 225 | # Number of neighbors 226 | #-------------------- 227 | 228 | K = 100 229 | 230 | 231 | # space to use 232 | #--------------- 233 | 234 | space_name = 'l2sqr_sift' 235 | 236 | 237 | # initialize NMSlib [ the data should be a matrix ] 238 | #-------------------------------------------------- 239 | 240 | init_nms = NMSlib$new(input_data = as.matrix(sift_10k), Index_Params = index_params, 241 | 242 | Time_Params = query_time_params, space = space_name, 243 | 244 | space_params = NULL, method = 'hnsw', 245 | 246 | data_type = 'DENSE_UINT8_VECTOR', dtype = 'INT', 247 | 248 | index_filepath = NULL, print_progress = FALSE) 249 | 250 | ``` 251 | 252 |
253 | 254 | ```{r, eval = F, echo = T} 255 | 256 | # query with a single row; returns a list of indices and distances 257 | #----------------------------------------------------------------- 258 | 259 | init_nms$Knn_Query(query_data_row = as.matrix(sift_10k[1, ]), k = 5) 260 | 261 | ``` 262 | 263 |
264 | 265 | ```{r, eval = F, echo = T} 266 | 267 | [[1]] 268 | [1] 2 6 4585 9256 140 # indices 269 | 270 | [[2]] 271 | [1] 18724 24320 68158 69067 70321 # distances 272 | 273 | ``` 274 | 275 |
276 | 277 | ```{r, eval = F, echo = T} 278 | 279 | # returns knn's for all data 280 | #--------------------------- 281 | 282 | all_dat = init_nms$knn_Query_Batch(as.matrix(sift_10k), k = 5, num_threads = 1) 283 | 284 | str(all_dat) 285 | 286 | ``` 287 | 288 |
289 | 290 | ```{r, eval = F, echo = T} 291 | 292 | # a list of indices and distances for all observations 293 | #------------------------------------------------------ 294 | 295 | List of 2 296 | $ knn_idx : num [1:10000, 1:5] 3 4 1 2 13 14 1 2 30 31 ... 297 | $ knn_dist: num [1:10000, 1:5] 18724 14995 18724 14995 21038 ... 298 | 299 | ``` 300 | 301 |
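To make the shape of this output concrete, here is a minimal base-R sketch of an exact (brute-force) nearest-neighbor search that returns the same two matrices, `knn_idx` and `knn_dist`, for a small random data set. It is not how the *hnsw* method works internally; the helper name `exact_knn`, the use of squared Euclidean distances, and the choice to exclude each row from its own neighbors are assumptions of this sketch.

```{r, eval = F, echo = T}

# exact k-nearest-neighbors using squared Euclidean distances (illustration only)
#-------------------------------------------------------------------------------

exact_knn = function(x, k) {

  d2 = unname(as.matrix(dist(x))^2)        # all pairwise squared distances

  diag(d2) = Inf                           # a row is not its own neighbor (choice of this sketch)

  idx = t(apply(d2, 1, function(r) order(r)[1:k]))     # indices of the k smallest distances

  dst = t(apply(d2, 1, function(r) sort(r)[1:k]))      # the k smallest distances

  list(knn_idx = idx, knn_dist = dst)
}

set.seed(1)

x_small = matrix(runif(200), nrow = 20, ncol = 10)

res = exact_knn(x_small, k = 5)

str(res)

```

Each row of `knn_idx` holds the (one-based) row numbers of the nearest observations, and the matching row of `knn_dist` holds their distances in increasing order. The cost grows quadratically with the number of rows, which is the motivation for approximate methods such as *hnsw*.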
302 | 303 | Details on the various methods and parameter settings can be found in the [manual of the NMSLIB Python library](https://github.com/nmslib/nmslib/blob/master/manual/latex/manual.pdf). 304 | 305 | 306 |
307 | 308 | #### KernelKnn using the nmslibR package 309 | 310 |
311 | 312 | 313 | In the [vignette of the KernelKnn package](https://CRAN.R-project.org/package=KernelKnn) (*Image classification of the MNIST and CIFAR-10 data using KernelKnn and HOG (histogram of oriented gradients)*) I experimented with the **MNIST dataset**, and a cross-validated kernel k-nearest-neighbors model gave **98.4 % accuracy** based on **HOG** (histogram of oriented gradients) features. However, it took almost **30 minutes** (depending on the system configuration) to complete using **6 threads**. I've implemented a similar function using NMSLIB (**KernelKnnCV_nmslib**), so in the next code chunks I'll use the *same parameter settings* and compare *computation time* and *accuracy*. 314 | 315 |
316 | 317 | First load the data, 318 | 319 |
320 | 321 | ```{r, eval = F, echo = T} 322 | 323 | # using system('wget..') on a linux OS 324 | 325 | system("wget https://raw.githubusercontent.com/mlampros/DataSets/master/mnist.zip") 326 | 327 | mnist <- read.table(unz("mnist.zip", "mnist.csv"), nrows = 70000, header = T, 328 | 329 | quote = "\"", sep = ",") 330 | 331 | ``` 332 | 333 |
334 | 335 | 336 | ```{r, eval = F, echo = T} 337 | 338 | X = mnist[, -ncol(mnist)] 339 | dim(X) 340 | 341 | ## [1] 70000 784 342 | 343 | # the 'KernelKnnCV_nmslib' function requires that the labels are numeric and start from 1 : Inf 344 | 345 | y = mnist[, ncol(mnist)] + 1 346 | table(y) 347 | 348 | ## y 349 | ## 1 2 3 4 5 6 7 8 9 10 350 | ## 6903 7877 6990 7141 6824 6313 6876 7293 6825 6958 351 | 352 | 353 | # evaluation metric 354 | 355 | acc = function (y_true, preds) { 356 | 357 | out = table(y_true, max.col(preds, ties.method = "random")) 358 | 359 | acc = sum(diag(out))/sum(out) 360 | 361 | acc 362 | } 363 | 364 | ``` 365 |
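As a quick check of the evaluation metric, the sketch below applies the same `acc` function to a hypothetical toy example with 4 observations and 3 classes (the toy values and the name `acc_direct` are made up for illustration). Each row of `preds` holds class probabilities and the predicted class is the column with the largest value.

```{r, eval = F, echo = T}

# same evaluation metric as above

acc = function (y_true, preds) {

  out = table(y_true, max.col(preds, ties.method = "random"))

  acc = sum(diag(out))/sum(out)

  acc
}

# toy labels and class probabilities (no ties, so the result is deterministic)

y_true = c(1, 2, 3, 1)

preds = rbind(c(0.8, 0.1, 0.1),     # predicted class 1 (correct)
              c(0.2, 0.7, 0.1),     # predicted class 2 (correct)
              c(0.1, 0.2, 0.7),     # predicted class 3 (correct)
              c(0.3, 0.6, 0.1))     # predicted class 2 (true class is 1)

acc(y_true, preds)

## [1] 0.75

# equivalent direct computation for this toy case

acc_direct = mean(max.col(preds, ties.method = "random") == y_true)

```

Note that the confusion-table version relies on every class being predicted at least once, so that the table is square and its diagonal lines up with the true labels. With 10 well-represented classes, as here, that is a reasonable assumption.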
366 | 367 | then compute the HOG features, 368 | 369 |
370 | 371 | ```{r, eval = F, echo = T} 372 | 373 | library(OpenImageR) 374 | 375 | hog = HOG_apply(X, cells = 6, orientations = 9, rows = 28, columns = 28, threads = 6) 376 | 377 | ## 378 | ## time to complete : 2.101281 secs 379 | 380 | dim(hog) 381 | 382 | ## [1] 70000 324 383 | 384 | ``` 385 |
386 | 387 | then compute the **approximate** kernel k-nearest-neighbors using the **cosine** distance, 388 | 389 |
390 | 391 | 392 | ```{r, eval = F, echo = T} 393 | 394 | # parameters for 'KernelKnnCV_nmslib' 395 | #------------------------------------ 396 | 397 | M = 30 398 | efC = 100 399 | num_threads = 6 400 | 401 | index_params = list('M'= M, 'indexThreadQty' = num_threads, 'efConstruction' = efC, 402 | 403 | 'post' = 0, 'skip_optimized_index' = 1 ) 404 | 405 | 406 | efS = 100 407 | 408 | query_time_params = list('efSearch' = efS) 409 | 410 | 411 | # approximate kernel knn 412 | #----------------------- 413 | 414 | fit_hog = KernelKnnCV_nmslib(hog, y, k = 20, folds = 4, h = 1, 415 | weights_function = 'biweight_tricube_MULT', 416 | Levels = sort(unique(y)), Index_Params = index_params, 417 | Time_Params = query_time_params, space = "cosinesimil", 418 | space_params = NULL, method = "hnsw", data_type = "DENSE_VECTOR", 419 | dtype = "FLOAT", index_filepath = NULL, print_progress = FALSE, 420 | num_threads = 6, seed_num = 1) 421 | 422 | 423 | # cross-validation starts .. 424 | 425 | # |=================================================================================| 100% 426 | 427 | # time to complete : 32.88805 secs 428 | 429 | 430 | str(fit_hog) 431 | 432 | 433 | ``` 434 | 435 |
436 | 437 | ```{r, eval = F, echo = T} 438 | 439 | List of 2 440 | $ preds:List of 4 441 | ..$ : num [1:17500, 1:10] 0 0 0 0 0 0 0 0 0 0 ... 442 | ..$ : num [1:17500, 1:10] 0 0 0 0 1 ... 443 | ..$ : num [1:17500, 1:10] 0 0 0 0 0 ... 444 | ..$ : num [1:17500, 1:10] 0 0 0 0 0 0 0 0 0 0 ... 445 | $ folds:List of 4 446 | ..$ fold_1: int [1:17500] 49808 21991 42918 7967 49782 28979 64440 49809 30522 36673 ... 447 | ..$ fold_2: int [1:17500] 51122 9469 58021 45228 2944 58052 65074 17709 2532 31262 ... 448 | ..$ fold_3: int [1:17500] 33205 40078 68177 32620 52721 18981 19417 53922 19102 67206 ... 449 | ..$ fold_4: int [1:17500] 28267 41652 28514 34525 68534 13294 48759 47521 69395 41408 ... 450 | 451 | ``` 452 | 453 |
454 | 455 | ```{r, eval = F, echo = T} 456 | 457 | acc_fit_hog = unlist(lapply(1:length(fit_hog$preds), 458 | 459 | function(x) acc(y[fit_hog$folds[[x]]], 460 | 461 | fit_hog$preds[[x]]))) 462 | acc_fit_hog 463 | 464 | ## [1] 0.9768000 0.9786857 0.9763429 0.9760000 465 | 466 | cat('mean accuracy for hog-features using cross-validation :', mean(acc_fit_hog), '\n') 467 | 468 | ## mean accuracy for hog-features using cross-validation : 0.9769571 469 | 470 | ``` 471 |
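For readers unfamiliar with the distance used above, the following base-R sketch computes a cosine distance between two vectors as one minus their cosine similarity. That the *cosinesimil* space uses this definition is my reading of the NMSLIB manual and should be treated as an assumption; the helper name `cosine_dist` is hypothetical.

```{r, eval = F, echo = T}

# cosine distance : 1 - (a . b) / (||a|| * ||b||)
#------------------------------------------------

cosine_dist = function(a, b) {

  1 - sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

a = c(1, 0, 1)
b = c(2, 0, 2)      # same direction as 'a', different length
d = c(0, 3, 0)      # orthogonal to 'a'

cosine_dist(a, b)   # approximately 0 : only the direction matters, not the length

cosine_dist(a, d)   # 1 : orthogonal vectors

```

Because the distance ignores vector length, two HOG descriptors that differ only by a constant scaling factor are treated as (nearly) identical.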
472 | 473 | It took approx. **33 seconds** to return with an accuracy of **97.7 %**. That is almost **47 times faster** than KernelKnn's corresponding (brute-force) function, with a **slightly lower accuracy** rate (the *braycurtis* distance metric might be better suited for this dataset). 474 | 475 | I also ran the corresponding brute-force algorithm of the NMSLIB Library by setting the *method* parameter to **seq_search**, 476 | 477 |
478 | 479 | ```{r, eval = F, echo = T} 480 | 481 | 482 | # brute force of NMSLIB [ here we set 'Index_Params' and 'Time_Params' to NULL ] 483 | #---------------------- 484 | 485 | fit_hog_seq = KernelKnnCV_nmslib(hog, y, k = 20, folds = 4, h = 1, 486 | weights_function = 'biweight_tricube_MULT', 487 | Levels = sort(unique(y)), Index_Params = NULL, 488 | Time_Params = NULL, space = "cosinesimil", 489 | space_params = NULL, method = "seq_search", 490 | data_type = "DENSE_VECTOR", dtype = "FLOAT", 491 | index_filepath = NULL, print_progress = FALSE, 492 | num_threads = 6, seed_num = 1) 493 | 494 | 495 | # cross-validation starts .. 496 | 497 | # |=================================================================================| 100% 498 | 499 | # time to complete : 4.506177 mins 500 | 501 | 502 | acc_fit_hog_seq = unlist(lapply(1:length(fit_hog_seq$preds), 503 | 504 | function(x) acc(y[fit_hog_seq$folds[[x]]], 505 | 506 | fit_hog_seq$preds[[x]]))) 507 | acc_fit_hog_seq 508 | 509 | ## [1] 0.9785143 0.9802286 0.9783429 0.9784571 510 | 511 | cat('mean accuracy for hog-features using cross-validation :', mean(acc_fit_hog_seq), '\n') 512 | 513 | ## mean accuracy for hog-features using cross-validation : 0.9788857 514 | 515 | 516 | ``` 517 | 518 |
519 | 520 | The brute-force algorithm of the NMSLIB Library is almost **6 times faster** than KernelKnn, giving an accuracy of approx. **97.9 %**. 521 | 522 | --------------------------------------------------------------------------------