'Python' Library.
10 | License: Apache License (>= 2.0)
11 | SystemRequirements: python3-dev: apt-get install -y python3-dev (deb), python3-pip: apt-get install -y python3-pip (deb), numpy: pip3 install numpy (deb), scipy: pip3 install scipy (deb), nmslib: pip3 install --no-binary :all: nmslib (deb)
12 | Encoding: UTF-8
13 | Depends:
14 | R(>= 3.2.3)
15 | Imports:
16 | Rcpp (>= 0.12.7),
17 | reticulate,
18 | R6,
19 | Matrix,
20 | KernelKnn,
21 | utils,
22 | lifecycle
23 | LinkingTo: Rcpp, RcppArmadillo (>= 0.8.0)
24 | Suggests:
25 | testthat,
26 | covr,
27 | knitr,
28 | rmarkdown
29 | VignetteBuilder: knitr
30 | RoxygenNote: 7.2.3
31 | Config/reticulate:
32 | list(
33 | packages = list(
34 | list(package = "nmslib", pip = TRUE),
35 | list(package = "scipy", pip = TRUE)
36 | )
37 | )
38 |
--------------------------------------------------------------------------------
/Dockerfile:
--------------------------------------------------------------------------------
1 | FROM rocker/rstudio:devel
2 |
3 | LABEL maintainer='Lampros Mouselimis'
4 |
5 | RUN export DEBIAN_FRONTEND=noninteractive; apt-get -y update && \
6 | apt-get install -y libssl-dev python pandoc pandoc-citeproc libicu-dev libcurl4-openssl-dev libpng-dev && \
7 | apt-get install -y sudo && \
8 | apt-get install -y python3-dev && \
9 | apt-get install -y python3-pip && \
10 | pip3 install numpy && \
11 | pip3 install scipy && \
12 | pip3 install --no-binary :all: nmslib && \
13 | R -e "install.packages(c( 'Rcpp', 'reticulate', 'R6', 'Matrix', 'KernelKnn', 'utils', 'RcppArmadillo', 'testthat', 'covr', 'knitr', 'rmarkdown', 'lifecycle', 'remotes' ), repos = 'https://cloud.r-project.org/' )" && \
14 | R -e "remotes::install_github('mlampros/nmslibR', upgrade = 'never', dependencies = FALSE, repos = 'https://cloud.r-project.org/')" && \
15 | apt-get autoremove -y && \
16 | apt-get clean
17 |
18 |
19 | ENV USER=rstudio
20 |
--------------------------------------------------------------------------------
/LICENSE.md:
--------------------------------------------------------------------------------
1 | Apache License
2 | ==============
3 |
4 | _Version 2.0, January 2004_
5 | _<http://www.apache.org/licenses/>_
6 |
7 | ### Terms and Conditions for use, reproduction, and distribution
8 |
9 | #### 1. Definitions
10 |
11 | “License” shall mean the terms and conditions for use, reproduction, and
12 | distribution as defined by Sections 1 through 9 of this document.
13 |
14 | “Licensor” shall mean the copyright owner or entity authorized by the copyright
15 | owner that is granting the License.
16 |
17 | “Legal Entity” shall mean the union of the acting entity and all other entities
18 | that control, are controlled by, or are under common control with that entity.
19 | For the purposes of this definition, “control” means **(i)** the power, direct or
20 | indirect, to cause the direction or management of such entity, whether by
21 | contract or otherwise, or **(ii)** ownership of fifty percent (50%) or more of the
22 | outstanding shares, or **(iii)** beneficial ownership of such entity.
23 |
24 | “You” (or “Your”) shall mean an individual or Legal Entity exercising
25 | permissions granted by this License.
26 |
27 | “Source” form shall mean the preferred form for making modifications, including
28 | but not limited to software source code, documentation source, and configuration
29 | files.
30 |
31 | “Object” form shall mean any form resulting from mechanical transformation or
32 | translation of a Source form, including but not limited to compiled object code,
33 | generated documentation, and conversions to other media types.
34 |
35 | “Work” shall mean the work of authorship, whether in Source or Object form, made
36 | available under the License, as indicated by a copyright notice that is included
37 | in or attached to the work (an example is provided in the Appendix below).
38 |
39 | “Derivative Works” shall mean any work, whether in Source or Object form, that
40 | is based on (or derived from) the Work and for which the editorial revisions,
41 | annotations, elaborations, or other modifications represent, as a whole, an
42 | original work of authorship. For the purposes of this License, Derivative Works
43 | shall not include works that remain separable from, or merely link (or bind by
44 | name) to the interfaces of, the Work and Derivative Works thereof.
45 |
46 | “Contribution” shall mean any work of authorship, including the original version
47 | of the Work and any modifications or additions to that Work or Derivative Works
48 | thereof, that is intentionally submitted to Licensor for inclusion in the Work
49 | by the copyright owner or by an individual or Legal Entity authorized to submit
50 | on behalf of the copyright owner. For the purposes of this definition,
51 | “submitted” means any form of electronic, verbal, or written communication sent
52 | to the Licensor or its representatives, including but not limited to
53 | communication on electronic mailing lists, source code control systems, and
54 | issue tracking systems that are managed by, or on behalf of, the Licensor for
55 | the purpose of discussing and improving the Work, but excluding communication
56 | that is conspicuously marked or otherwise designated in writing by the copyright
57 | owner as “Not a Contribution.”
58 |
59 | “Contributor” shall mean Licensor and any individual or Legal Entity on behalf
60 | of whom a Contribution has been received by Licensor and subsequently
61 | incorporated within the Work.
62 |
63 | #### 2. Grant of Copyright License
64 |
65 | Subject to the terms and conditions of this License, each Contributor hereby
66 | grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free,
67 | irrevocable copyright license to reproduce, prepare Derivative Works of,
68 | publicly display, publicly perform, sublicense, and distribute the Work and such
69 | Derivative Works in Source or Object form.
70 |
71 | #### 3. Grant of Patent License
72 |
73 | Subject to the terms and conditions of this License, each Contributor hereby
74 | grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free,
75 | irrevocable (except as stated in this section) patent license to make, have
76 | made, use, offer to sell, sell, import, and otherwise transfer the Work, where
77 | such license applies only to those patent claims licensable by such Contributor
78 | that are necessarily infringed by their Contribution(s) alone or by combination
79 | of their Contribution(s) with the Work to which such Contribution(s) was
80 | submitted. If You institute patent litigation against any entity (including a
81 | cross-claim or counterclaim in a lawsuit) alleging that the Work or a
82 | Contribution incorporated within the Work constitutes direct or contributory
83 | patent infringement, then any patent licenses granted to You under this License
84 | for that Work shall terminate as of the date such litigation is filed.
85 |
86 | #### 4. Redistribution
87 |
88 | You may reproduce and distribute copies of the Work or Derivative Works thereof
89 | in any medium, with or without modifications, and in Source or Object form,
90 | provided that You meet the following conditions:
91 |
92 | * **(a)** You must give any other recipients of the Work or Derivative Works a copy of
93 | this License; and
94 | * **(b)** You must cause any modified files to carry prominent notices stating that You
95 | changed the files; and
96 | * **(c)** You must retain, in the Source form of any Derivative Works that You distribute,
97 | all copyright, patent, trademark, and attribution notices from the Source form
98 | of the Work, excluding those notices that do not pertain to any part of the
99 | Derivative Works; and
100 | * **(d)** If the Work includes a “NOTICE” text file as part of its distribution, then any
101 | Derivative Works that You distribute must include a readable copy of the
102 | attribution notices contained within such NOTICE file, excluding those notices
103 | that do not pertain to any part of the Derivative Works, in at least one of the
104 | following places: within a NOTICE text file distributed as part of the
105 | Derivative Works; within the Source form or documentation, if provided along
106 | with the Derivative Works; or, within a display generated by the Derivative
107 | Works, if and wherever such third-party notices normally appear. The contents of
108 | the NOTICE file are for informational purposes only and do not modify the
109 | License. You may add Your own attribution notices within Derivative Works that
110 | You distribute, alongside or as an addendum to the NOTICE text from the Work,
111 | provided that such additional attribution notices cannot be construed as
112 | modifying the License.
113 |
114 | You may add Your own copyright statement to Your modifications and may provide
115 | additional or different license terms and conditions for use, reproduction, or
116 | distribution of Your modifications, or for any such Derivative Works as a whole,
117 | provided Your use, reproduction, and distribution of the Work otherwise complies
118 | with the conditions stated in this License.
119 |
120 | #### 5. Submission of Contributions
121 |
122 | Unless You explicitly state otherwise, any Contribution intentionally submitted
123 | for inclusion in the Work by You to the Licensor shall be under the terms and
124 | conditions of this License, without any additional terms or conditions.
125 | Notwithstanding the above, nothing herein shall supersede or modify the terms of
126 | any separate license agreement you may have executed with Licensor regarding
127 | such Contributions.
128 |
129 | #### 6. Trademarks
130 |
131 | This License does not grant permission to use the trade names, trademarks,
132 | service marks, or product names of the Licensor, except as required for
133 | reasonable and customary use in describing the origin of the Work and
134 | reproducing the content of the NOTICE file.
135 |
136 | #### 7. Disclaimer of Warranty
137 |
138 | Unless required by applicable law or agreed to in writing, Licensor provides the
139 | Work (and each Contributor provides its Contributions) on an “AS IS” BASIS,
140 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied,
141 | including, without limitation, any warranties or conditions of TITLE,
142 | NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are
143 | solely responsible for determining the appropriateness of using or
144 | redistributing the Work and assume any risks associated with Your exercise of
145 | permissions under this License.
146 |
147 | #### 8. Limitation of Liability
148 |
149 | In no event and under no legal theory, whether in tort (including negligence),
150 | contract, or otherwise, unless required by applicable law (such as deliberate
151 | and grossly negligent acts) or agreed to in writing, shall any Contributor be
152 | liable to You for damages, including any direct, indirect, special, incidental,
153 | or consequential damages of any character arising as a result of this License or
154 | out of the use or inability to use the Work (including but not limited to
155 | damages for loss of goodwill, work stoppage, computer failure or malfunction, or
156 | any and all other commercial damages or losses), even if such Contributor has
157 | been advised of the possibility of such damages.
158 |
159 | #### 9. Accepting Warranty or Additional Liability
160 |
161 | While redistributing the Work or Derivative Works thereof, You may choose to
162 | offer, and charge a fee for, acceptance of support, warranty, indemnity, or
163 | other liability obligations and/or rights consistent with this License. However,
164 | in accepting such obligations, You may act only on Your own behalf and on Your
165 | sole responsibility, not on behalf of any other Contributor, and only if You
166 | agree to indemnify, defend, and hold each Contributor harmless for any liability
167 | incurred by, or claims asserted against, such Contributor by reason of your
168 | accepting any such warranty or additional liability.
169 |
170 | _END OF TERMS AND CONDITIONS_
171 |
172 | ### APPENDIX: How to apply the Apache License to your work
173 |
174 | To apply the Apache License to your work, attach the following boilerplate
175 | notice, with the fields enclosed by brackets `[]` replaced with your own
176 | identifying information. (Don't include the brackets!) The text should be
177 | enclosed in the appropriate comment syntax for the file format. We also
178 | recommend that a file or class name and description of purpose be included on
179 | the same “printed page” as the copyright notice for easier identification within
180 | third-party archives.
181 |
182 | Copyright [yyyy] [name of copyright owner]
183 |
184 | Licensed under the Apache License, Version 2.0 (the "License");
185 | you may not use this file except in compliance with the License.
186 | You may obtain a copy of the License at
187 |
188 | http://www.apache.org/licenses/LICENSE-2.0
189 |
190 | Unless required by applicable law or agreed to in writing, software
191 | distributed under the License is distributed on an "AS IS" BASIS,
192 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
193 | See the License for the specific language governing permissions and
194 | limitations under the License.
195 |
--------------------------------------------------------------------------------
/NAMESPACE:
--------------------------------------------------------------------------------
1 | # Generated by roxygen2: do not edit by hand
2 |
3 | export(KernelKnnCV_nmslib)
4 | export(KernelKnn_nmslib)
5 | export(NMSlib)
6 | export(TO_scipy_sparse)
7 | export(mat_2scipy_sparse)
8 | import(KernelKnn)
9 | import(reticulate)
10 | importFrom(Matrix,Matrix)
11 | importFrom(R6,R6Class)
12 | importFrom(Rcpp,evalCpp)
13 | importFrom(lifecycle,deprecate_warn)
14 | importFrom(lifecycle,is_present)
15 | importFrom(utils,getFromNamespace)
16 | importFrom(utils,setTxtProgressBar)
17 | importFrom(utils,txtProgressBar)
18 | useDynLib(nmslibR, .registration = TRUE)
19 |
--------------------------------------------------------------------------------
/NEWS.md:
--------------------------------------------------------------------------------
1 |
2 | ## nmslibR 1.0.7
3 |
4 | * I've added the *include_query_data_row_index* parameter to the *Knn_Query()* method of the *NMSlib* R6 Class and at the same time I added a deprecation warning for this parameter, because this method currently excludes by default the first output index and value. By setting the *include_query_data_row_index* to TRUE the first output index and value will be returned. This change will take effect in version 1.1.0 and the *Knn_Query()* method will return the first output index and value by default.
5 | * I added the *"save_data"* parameter to the *"save_Index()"* method and the *"load_data"* parameter to the *"initialize()"* method of the *'NMSlib()'* R6 class. I updated the documentation and references sections as well
6 | * I've modified the *DESCRIPTION* and the *package.R* file by adding only comments related to a new configuration type in the reticulate R package (see: https://github.com/rstudio/reticulate/issues/883#issuecomment-775552812)
7 | * I updated the *Makevars* and the .cpp files from C++11 to C++17 because I received the following NOTE during checking of the package: *Specified C++11: please update to current default of C++17*
8 | * I updated the *.Rbuildignore* file to exclude the *LICENSE.md* file because it gives a NOTE during CRAN checking
9 |
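The default behaviour of *Knn_Query()* described in the first item (one extra neighbour is requested and the first match is dropped) can be illustrated in base R without a Python backend. This is only a sketch: `drop_first_match()` is a hypothetical helper, not part of nmslibR, the neighbour indices and distances are made up, and the way the `TRUE` case trims the result is an assumption.

```r
# Hypothetical helper (NOT part of nmslibR): mimics dropping the first
# returned neighbour unless include_query_data_row_index is TRUE.
drop_first_match <- function(idx, dist, k, include_query_data_row_index = FALSE) {
  stopifnot(length(idx) == length(dist), length(idx) >= k + 1)
  # assumption: when the query row is kept, return the first k items,
  # otherwise skip the first item and return the next k
  keep <- if (include_query_data_row_index) seq_len(k) else seq_len(k) + 1L
  list(idx = idx[keep], dist = dist[keep])
}

idx  <- c(1L, 7L, 3L, 9L)       # made-up neighbour indices (k + 1 = 4 items)
dist <- c(0.0, 0.2, 0.5, 0.9)   # made-up distances in ascending order

res_default <- drop_first_match(idx, dist, k = 3)
res_include <- drop_first_match(idx, dist, k = 3,
                                include_query_data_row_index = TRUE)
```

With the default, the query row itself (distance 0) is not part of the output; with `TRUE`, it comes first.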
10 |
11 | ## nmslibR 1.0.6
12 |
13 | * I've added a 'packageStartupMessage' informing the user in case of the error 'attempt to apply non-function' that he/she has to use the 'reticulate::py_config()' before loading the package (in a new R session)
14 | * I've updated the 'SystemRequirements' in the DESCRIPTION file
15 |
16 |
17 | ## nmslibR 1.0.5
18 |
19 | * I updated the *License* in the DESCRIPTION file which as of '07-05-2021' will be *Apache License Version 2.0*. Therefore I removed also the COPYRIGHTS file from the 'inst' directory
20 | * I removed *LazyData* from the DESCRIPTION file
21 | * I added the *CITATION* file in the 'inst' directory
22 | * I removed the 'zzz.R' file and the 'packageStartupMessage()'
23 |
24 |
25 | ## nmslibR 1.0.4
26 |
27 | * I adjusted the output indices of the *Knn_Query* method (*NMSlib* R6 class) to account for the difference in indexing between R and Python ( *reference* : https://github.com/mlampros/nmslibR/issues/5 )
28 | * I removed the *dtype* 'DOUBLE' parameter from the *NMSlib* R6 class, *KernelKnn_nmslib* and *KernelKnnCV_nmslib* functions (*reference* : https://github.com/nmslib/nmslib/commit/4d2937d6259aebb456db141ee0f3c2c465a51a8e )
29 | * I replaced almost all web-links of the Python *nmslib* Package because the initial repository was moved to https://github.com/nmslib/nmslib
30 |
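The index adjustment mentioned in the first item comes down to Python counting from 0 and R counting from 1. A minimal sketch in base R, assuming the Python side returns 0-based row indices; `to_r_index()` is a hypothetical name used only for illustration and is not a function of the package:

```r
# Hypothetical helper (NOT part of nmslibR): shift 0-based Python indices
# so that they can be used to subset an R matrix.
to_r_index <- function(py_idx) as.integer(py_idx) + 1L

x <- matrix(seq_len(15), nrow = 5, ncol = 3)   # toy data, 5 rows
py_idx <- c(0L, 4L, 2L)                        # made-up 0-based neighbour indices

r_idx <- to_r_index(py_idx)
neighbours <- x[r_idx, , drop = FALSE]         # rows 1, 5 and 3 of 'x'
```

Without the shift, index 0 would be silently dropped by R's subsetting and the remaining rows would be off by one.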
31 |
32 | ## nmslibR 1.0.3
33 |
34 | * I updated the README.md file, especially the installation instructions for all mentioned operating systems, i.e. Linux, Macintosh, Windows (switch from python2 to python3 due to pybind11 issues).
35 |
36 |
37 | ## nmslibR 1.0.2
38 |
39 | * The *dgCMatrix_2scipy_sparse* function was renamed to *TO_scipy_sparse* and now accepts either a *dgCMatrix* or a *dgRMatrix* as input. The appropriate format for the nmslibR package in case of sparse matrices is the *dgRMatrix* format (*scipy.sparse.csr_matrix*)
40 | * I added an onload.R file to inform the users about the previous change [ related with the issue : https://github.com/mlampros/nmslibR/issues/1 ]
41 | * I removed the *utils.R* file which included internal functions of the *KernelKnn* package. Rather than including the file I now use the *getFromNamespace* function of the *utils* package.
42 | * Due to the previous changes I modified the Vignette and the tests too.
43 |
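The difference between the two accepted sparse formats can be inspected directly in R. The sketch below assumes only that the *Matrix* package is installed; it extracts the three slots that *TO_scipy_sparse* passes on to *scipy.sparse.csc_matrix* or *scipy.sparse.csr_matrix*, without calling Python. The list names `data`, `indices` and `indptr` are chosen here for illustration.

```r
library(Matrix)

vals <- c(1, 0, 2, 0, 0, 3, 4, 5, 6)

# column-compressed sparse matrix ('dgCMatrix')
dgcM <- Matrix::Matrix(data = vals, nrow = 3, ncol = 3, byrow = TRUE, sparse = TRUE)

# row-compressed sparse matrix ('dgRMatrix'), the format suited to nmslibR
dgrM <- as(dgcM, "RsparseMatrix")

# 'dgCMatrix' stores 0-based row indices in the 'i' slot,
# 'dgRMatrix' stores 0-based column indices in the 'j' slot
csc_parts <- list(data = dgcM@x, indices = dgcM@i, indptr = dgcM@p)
csr_parts <- list(data = dgrM@x, indices = dgrM@j, indptr = dgrM@p)
```

Both matrices hold the same six non-zero values; only the order of storage (by column or by row) differs.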
44 |
45 | ## nmslibR 1.0.1
46 |
47 | * I commented the example(s) and test(s) related to the *dgCMatrix_2scipy_sparse* function [ *if (Sys.info()["sysname"] != 'Darwin')* ], because the *scipy-sparse* library on CRAN is not upgraded and the older version includes a bug (*TypeError : could not interpret data type*). This leads to an error on *Macintosh* Operating System ( *reference* : https://github.com/scipy/scipy/issues/5353 )
48 | * I added links to the github repository (master repository, issues)
49 |
50 |
51 | ## nmslibR 1.0.0
52 |
53 |
54 |
55 |
56 |
--------------------------------------------------------------------------------
/R/RcppExports.R:
--------------------------------------------------------------------------------
1 | # Generated by using Rcpp::compileAttributes() -> do not edit by hand
2 | # Generator token: 10BE3573-1514-4C36-9D1C-5A225CD40393
3 |
4 | nmslib_idx_dist <- function(input_list, k, threads = 1L) {
5 | .Call(`_nmslibR_nmslib_idx_dist`, input_list, k, threads)
6 | }
7 |
8 | y_idxs <- function(idxs, y, threads = 1L) {
9 | .Call(`_nmslibR_y_idxs`, idxs, y, threads)
10 | }
11 |
12 | check_NaN_Inf <- function(x) {
13 | .Call(`_nmslibR_check_NaN_Inf`, x)
14 | }
15 |
16 |
--------------------------------------------------------------------------------
/R/RcppModule.R:
--------------------------------------------------------------------------------
1 | #' @useDynLib nmslibR, .registration = TRUE
2 | #' @importFrom Rcpp evalCpp
3 | #' @importFrom lifecycle deprecate_warn is_present
4 | NULL
5 |
--------------------------------------------------------------------------------
/R/nmslib.R:
--------------------------------------------------------------------------------
1 |
2 |
3 | #' conversion of an R matrix to a scipy sparse matrix
4 | #'
5 | #'
6 | #' @param x a data matrix
7 | #' @param format a character string. Either \emph{"sparse_row_matrix"} or \emph{"sparse_column_matrix"}
8 | #' @details
9 | #' This function allows the user to convert an R matrix to a scipy sparse matrix. This is useful because the \emph{nmslibR} package accepts, besides an R dense matrix, also \emph{python} sparse matrices as input.
10 | #' @export
11 | #' @references https://docs.scipy.org/doc/scipy/reference/sparse.html
12 | #' @examples
13 | #'
14 | #' try({
15 | #' if (reticulate::py_available(initialize = FALSE)) {
16 | #' if (reticulate::py_module_available("scipy")) {
17 | #'
18 | #' library(nmslibR)
19 | #'
20 | #' set.seed(1)
21 | #'
22 | #' x = matrix(runif(1000), nrow = 100, ncol = 10)
23 | #'
24 | #' res = mat_2scipy_sparse(x)
25 | #'
26 | #' print(dim(x))
27 | #'
28 | #' print(res$shape)
29 | #' }
30 | #' }
31 | #' }, silent=TRUE)
32 |
33 |
34 | mat_2scipy_sparse = function(x, format = 'sparse_row_matrix') {
35 |
36 | if (!inherits(x, "matrix")) stop("the 'x' parameter should be of type 'matrix'", call. = F)
37 |
38 | if (format == 'sparse_column_matrix') {
39 | return(SCP$sparse$csc_matrix(x))
40 | }
41 | else if (format == 'sparse_row_matrix') {
42 | return(SCP$sparse$csr_matrix(x))
43 | }
44 | else {
45 | stop("the function can take either a 'sparse_row_matrix' or a 'sparse_column_matrix' for the 'format' parameter as input", call. = F)
46 | }
47 | }
48 |
49 |
50 |
51 | #' conversion of an R sparse matrix to a scipy sparse matrix
52 | #'
53 | #'
54 | #' @param R_sparse_matrix an R sparse matrix. Acceptable input objects are either a \emph{dgCMatrix} or a \emph{dgRMatrix}.
55 | #' @details
56 | #' This function allows the user to convert either an R \emph{dgCMatrix} or a \emph{dgRMatrix} to a scipy sparse matrix (\emph{scipy.sparse.csc_matrix} or \emph{scipy.sparse.csr_matrix}). This is useful because the \emph{nmslibR} package accepts besides an R dense matrix also python sparse matrices as input.
57 | #'
58 | #' The \emph{dgCMatrix} class is a class of sparse numeric matrices in the compressed, sparse, \emph{column-oriented format}. The \emph{dgRMatrix} class is a class of sparse numeric matrices in the compressed, sparse, \emph{row-oriented format}.
59 | #'
60 | #' @export
61 | #' @import reticulate
62 | #' @importFrom Matrix Matrix
63 | #' @references https://stat.ethz.ch/R-manual/R-devel/library/Matrix/html/dgCMatrix-class.html, https://stat.ethz.ch/R-manual/R-devel/library/Matrix/html/dgRMatrix-class.html, https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csc_matrix.html#scipy.sparse.csc_matrix
64 | #' @examples
65 | #'
66 | #' try({
67 | #' if (reticulate::py_available(initialize = FALSE)) {
68 | #' if (reticulate::py_module_available("scipy")) {
69 | #'
70 | #' if (Sys.info()["sysname"] != 'Darwin') {
71 | #'
72 | #' library(nmslibR)
73 | #'
74 | #'
75 | #' # 'dgCMatrix' sparse matrix
76 | #' #--------------------------
77 | #'
78 | #' data = c(1, 0, 2, 0, 0, 3, 4, 5, 6)
79 | #'
80 | #' dgcM = Matrix::Matrix(data = data, nrow = 3,
81 | #'
82 | #' ncol = 3, byrow = TRUE,
83 | #'
84 | #' sparse = TRUE)
85 | #'
86 | #' print(dim(dgcM))
87 | #'
88 | #' res = TO_scipy_sparse(dgcM)
89 | #'
90 | #' print(res$shape)
91 | #'
92 | #'
93 | #' # 'dgRMatrix' sparse matrix
94 | #' #--------------------------
95 | #'
96 | #' dgrM = as(dgcM, "RsparseMatrix")
97 | #'
98 | #' print(dim(dgrM))
99 | #'
100 | #' res_dgr = TO_scipy_sparse(dgrM)
101 | #'
102 | #' print(res_dgr$shape)
103 | #' }
104 | #' }
105 | #' }
106 | #' }, silent=TRUE)
107 |
108 |
109 | TO_scipy_sparse = function(R_sparse_matrix) {
110 |
111 | if (inherits(R_sparse_matrix, "dgCMatrix")) {
112 | py_obj = SCP$sparse$csc_matrix(reticulate::tuple(R_sparse_matrix@x, R_sparse_matrix@i, R_sparse_matrix@p), shape = reticulate::tuple(R_sparse_matrix@Dim[1], R_sparse_matrix@Dim[2]))
113 | }
114 | else if (inherits(R_sparse_matrix, "dgRMatrix")) {
115 | py_obj = SCP$sparse$csr_matrix(reticulate::tuple(R_sparse_matrix@x, R_sparse_matrix@j, R_sparse_matrix@p), shape = reticulate::tuple(R_sparse_matrix@Dim[1], R_sparse_matrix@Dim[2]))
116 | }
117 | else {
118 | stop("the 'R_sparse_matrix' parameter should be either a 'dgCMatrix' or a 'dgRMatrix' sparse matrix", call. = F)
119 | }
120 |
121 | return(py_obj)
122 | }
123 |
124 |
125 |
126 | #' Non metric space library
127 | #'
128 | #'
129 | #' @param input_data the input data. See \emph{details} for more information
130 | #' @param query_data_row a vector to query for
131 | #' @param query_data the queries to search for. The \emph{query_data} parameter should be of the same type as the \emph{input_data} parameter
132 | #' @param k an integer. The number of neighbours to return
133 | #' @param include_query_data_row_index a boolean. If TRUE then the index of the query data row will be returned as well. It currently defaults to FALSE which means the first matched index is excluded from the results (this parameter will be removed in version 1.1.0 and the output behavior of the function will be changed too - see the deprecation warning)
134 | #' @param Index_Params a list of (optional) parameters to use in indexing (when creating the index)
135 | #' @param Time_Params a list of parameters to use in querying. Setting \emph{Time_Params} to NULL will reset them
136 | #' @param space a character string (optional). The metric space to create for this index. Page 31 of the manual (see \emph{references}) explains all available inputs
137 | #' @param space_params a list of (optional) parameters for configuring the space. See the \emph{references} manual for more details.
138 | #' @param method a character string specifying the index method to use
139 | #' @param data_type a character string. One of 'DENSE_UINT8_VECTOR', 'DENSE_VECTOR', 'OBJECT_AS_STRING' or 'SPARSE_VECTOR'
140 | #' @param dtype a character string. Either 'FLOAT' or 'INT'
141 | #' @param index_filepath a character string specifying the path to a file, where an existing index is saved
142 | #' @param load_data a boolean. If TRUE then besides the index also the saved data will be loaded. This parameter is used when the \emph{index_filepath} parameter is not NULL (see the web links in the \emph{references} section for more details). The user might also have to specify the \emph{skip_optimized_index} parameter of the \emph{Index_Params} in the "init" method
143 | #' @param save_data a boolean. If TRUE then besides the index also the data will be saved (see the web links in the \emph{references} section for more details)
144 | #' @param print_progress a boolean (either TRUE or FALSE). Whether or not to display progress bar
145 | #' @param num_threads an integer. The number of threads to use
146 | #' @param filename a character string specifying the path. The filename to save ( in case of the \emph{save_Index} method ) or the filename to load ( in case of the \emph{load_Index} method )
147 | #' @export
148 | #' @details
149 | #'
150 | #' \emph{input_data} parameter : In case of numeric data the \emph{input_data} parameter should be either an R matrix object or a scipy sparse matrix. Additionally, the \emph{input_data} parameter can be a list including more than one matrix / sparse matrix, all having the same number of columns ( this is ideal for instance if the user wants to include both a train and a test dataset in the created index )
151 | #'
152 | #' the \emph{Knn_Query} function finds the approximate K nearest neighbours of a vector in the index
153 | #'
154 | #' the \emph{knn_Query_Batch} function performs multiple queries on the index, distributing the work over a thread pool
155 | #'
156 | #' the \emph{save_Index} function saves the index to disk
157 | #'
158 | #' If the \emph{index_filepath} parameter is not NULL then an existing index will be loaded
159 | #'
160 | #' \emph{Incrementally} updating an already saved (and loaded) index is \emph{not} possible (see: https://github.com/nmslib/nmslib/issues/73)
161 | #'
162 | #' @references
163 | #'
164 | #' https://github.com/nmslib/nmslib/blob/master/manual/latex/manual.pdf
165 | #'
166 | #' https://github.com/nmslib/nmslib/blob/master/python_bindings/notebooks/search_vector_dense_optim.ipynb
167 | #'
168 | #' https://github.com/nmslib/nmslib/blob/master/python_bindings/notebooks/search_vector_dense_nonoptim.ipynb
169 | #'
170 | #' https://github.com/nmslib/nmslib/issues/356
171 | #'
172 | #' https://github.com/nmslib/nmslib/blob/master/manual/methods.md
173 | #'
174 | #' https://github.com/nmslib/nmslib/blob/master/manual/spaces.md
175 | #'
176 | #' @docType class
177 | #' @importFrom R6 R6Class
178 | #' @import reticulate
179 | #' @section Methods:
180 | #'
181 | #' \describe{
182 | #' \item{\code{NMSlib$new(input_data, Index_Params = NULL, Time_Params = NULL, space='l1',
183 | #' space_params = NULL, method = 'hnsw', data_type = 'DENSE_VECTOR',
184 | #' dtype = 'FLOAT', index_filepath = NULL, load_data = FALSE,
185 | #' print_progress = FALSE)}}{}
186 | #'
187 | #' \item{\code{--------------}}{}
188 | #'
189 | #' \item{\code{Knn_Query(query_data_row, k = 5, include_query_data_row_index = FALSE)}}{}
190 | #'
191 | #' \item{\code{--------------}}{}
192 | #'
193 | #' \item{\code{knn_Query_Batch(query_data, k = 5, num_threads = 1)}}{}
194 | #'
195 | #' \item{\code{--------------}}{}
196 | #'
197 | #' \item{\code{save_Index(filename, save_data = FALSE)}}{}
198 | #' }
199 | #'
200 | #' @usage # init <- NMSlib$new(input_data, Index_Params = NULL, Time_Params = NULL,
201 | #' # space='l1', space_params = NULL, method = 'hnsw',
202 | #' # data_type = 'DENSE_VECTOR', dtype = 'FLOAT',
203 | #' # index_filepath = NULL, load_data = FALSE,
204 | #' # print_progress = FALSE)
205 | #' @examples
206 | #'
207 | #' try({
208 | #' if (reticulate::py_available(initialize = FALSE)) {
209 | #' if (reticulate::py_module_available("nmslib")) {
210 | #'
211 | #' library(nmslibR)
212 | #'
213 | #' set.seed(1)
214 | #' x = matrix(runif(1000), nrow = 100, ncol = 10)
215 | #'
216 | #' init_nms = NMSlib$new(input_data = x)
217 | #'
218 | #'
219 | #' # returns a 1-dimensional vector (index, distance)
220 | #' #--------------------------------------------------
221 | #'
222 | #' init_nms$Knn_Query(query_data_row = x[1, ], k = 5)
223 | #'
224 | #'
225 | #' # returns knn's for all data
226 | #' #---------------------------
227 | #'
228 | #' all_dat = init_nms$knn_Query_Batch(x, k = 5, num_threads = 1)
229 | #' }
230 | #' }
231 | #' }, silent=TRUE)
232 |
233 |
234 | NMSlib <- R6::R6Class("NMSlib",
235 |
236 | lock_objects = FALSE,
237 |
238 | public = list(
239 |
240 | initialize = function(input_data,
241 | Index_Params = NULL,
242 | Time_Params = NULL,
243 | space = 'l1',
244 | space_params = NULL,
245 | method = 'hnsw',
246 | data_type = 'DENSE_VECTOR',
247 | dtype = 'FLOAT',
248 | index_filepath = NULL,
249 | load_data = FALSE,
250 | print_progress = FALSE) {
251 |
252 | if (inherits(input_data, "data.frame")) stop("The 'input_data' parameter is a data frame! You have to convert the data.frame to a matrix first!", call. = F)
253 |
254 | # eval-parse to convert string to a variable
255 | #-------------------------------------------
256 |
257 | DATA_TYPE = NMSLIB$DataType
258 | data_type = eval(parse(text = paste('DATA_TYPE$', data_type, sep = "", collapse = "")))
259 |
260 | DTYPE = NMSLIB$DistType
261 | dtype = eval(parse(text = paste('DTYPE$', dtype, sep = "", collapse = "")))
262 |
263 |
264 | # initialization of nmslib
265 | #-------------------------
266 |
267 | if (!is.null(space_params)) {
268 | space_params = reticulate::dict(space_params)
269 | }
270 |
271 | private$index = NMSLIB$init(space=space, space_params = space_params, method = method, data_type = data_type, dtype = dtype)
272 |
273 |
274 | # by default data points will be inserted in batches [not single points] to the index AND also account for the fact that 'input_data' can be a list object
275 | #------------------------------------------------------------------------------------ ----------------------------------------------------------------
276 |
277 | if (inherits(input_data, "list")) {
278 |
279 | for (ITEM in seq_along(input_data)) {
280 |
281 | private$index$addDataPointBatch(input_data[[ITEM]]) # here it's important, in case of matrices, that the columns of each object are equal, otherwise it will throw an error
282 | }
283 | }
284 | else {
285 | private$index$addDataPointBatch(input_data)
286 | }
287 |
288 |
289 | # "createIndex" OR load from a file-path an already saved index
290 | #---------------------------------------------------------------
291 |
292 | if (is.null(index_filepath)) { # if filepath is NULL create index, ...
293 |
294 | if (is.null(Index_Params)) {
295 | private$index$createIndex( print_progress = print_progress )
296 | }
297 | else {
298 | private$index$createIndex( reticulate::dict(Index_Params), print_progress = print_progress )
299 | }
300 | }
301 | else { # ... else, load existing index from filepath (loads the index from disk)
302 | private$index$loadIndex(index_filepath, load_data = load_data)
303 | }
304 |
305 |
306 | # 'setQueryTimeParams' function [ Sets parameters used in 'knnQuery' and 'knnQueryBatch' ]
307 | #------------------------------
308 |
309 | if (is.null(Time_Params)) {
310 | private$index$setQueryTimeParams( Time_Params )
311 | }
312 | else {
313 | private$index$setQueryTimeParams( reticulate::dict(Time_Params) )
314 | }
315 | },
316 |
317 |
318 | # 'knnQuery' function [ returns index and distance for a single row -- Finds the approximate (or exact when brute force is used) K nearest neighbours of a vector in the index ]
319 | #--------------------
320 |
321 | Knn_Query = function(query_data_row, k = 5, include_query_data_row_index = FALSE) {
322 |
323 | if (lifecycle::is_present(include_query_data_row_index)) {
324 |
325 | lifecycle::deprecate_warn(
326 | when = "1.0.6",
327 | what = "Knn_Query(include_query_data_row_index)",
328 | details = "The 'include_query_data_row_index' parameter will be removed in version 1.1.0 and the output values and indices of the 'Knn_Query()' function will include also the (potential) value and index of the matched 'query_data_row' input! (currently is excluded by default)"
329 | )
330 | }
331 |
332 | idx_dists_single_ROW = private$index$knnQuery(query_data_row, as.integer(k + 1)) # add 1 because I'll remove the first item ( see next line )
333 |
334 | indices = idx_dists_single_ROW[[1]]
335 | indices = indices + 1 # account for the indexing differences betw. Python and R
336 | values = idx_dists_single_ROW[[2]]
337 |
338 | if (!include_query_data_row_index) {
339 | remove_index = 1 # remove the 1st index
340 | }
341 | else {
342 | remove_index = k + 1 # remove the last index
343 | }
344 |
345 | indices = indices[-remove_index] # remove either the first or the last index, depending on the 'include_query_data_row_index' parameter, because the result also includes the distance between a row and itself [ !! this might not always be true: if the row doesn't exist in the data (new external data row) or if the distances of all "k" neighbours are 0.0, then the 1st index (lowest distance) might not be the input "query_data_row" but one of the other "k" nearest values (no match of the input with the lowest distance) ]
346 | values = values[-remove_index]
347 |
348 | return(list(indices, values))
349 | },
350 |
351 |
352 | # 'knnQueryBatch' function [ Performs multiple queries on the index, distributing the work over a thread pool ]
353 | #-------------------------
354 |
355 | knn_Query_Batch = function(query_data, k = 5, num_threads = 1) {
356 |
357 | if (inherits(query_data, "data.frame")) stop("the 'query_data' parameter is a data frame. For the function to run error free convert the data frame to a matrix", call. = F)
358 |
359 | tmp_lst = private$index$knnQueryBatch(query_data, as.integer(k + 1), as.integer(num_threads)) # add 1 to account for the indexing differences betw. Python and R [ adjusted also in the Rcpp function ]
360 | idx_dists_ = nmslib_idx_dist(tmp_lst, k, num_threads) # Rcpp function [ parallelized ]
361 |
362 | return(idx_dists_)
363 | },
364 |
365 |
366 | # 'saveIndex' function [ Saves the index to disk ]
367 | #---------------------
368 |
369 | save_Index = function(filename, save_data = FALSE) {
370 | private$index$saveIndex(filename, save_data = save_data)
371 | invisible()
372 | }
373 | ),
374 |
375 | private = list(
376 | index = NULL
377 | )
378 | )
379 |
380 |
381 |
382 | #' import internal functions from the KernelKnn package
383 | #'
384 | #' @importFrom utils getFromNamespace
385 | #' @import KernelKnn
386 | #' @keywords internal
387 |
388 | import_internal = function(function_name) {
389 | utils::getFromNamespace(function_name, "KernelKnn")
390 | }
391 |
392 |
393 |
394 | #' inner function to compute kernels, extract weights and return predictions
395 | #'
396 | #' @keywords internal
397 |
398 | inner_kernel_function = function(y_matrix, dist_matrix, Levels, weights_function, h) {
399 |
400 | #------------------------------------ import internal functions from KernelKnn
401 |
402 | normalized = import_internal('normalized')
403 | func_tbl_dist = import_internal('func_tbl_dist')
404 | func_tbl = import_internal('func_tbl')
405 | FUNCTION_weights = import_internal('FUNCTION_weights')
406 | switch_secondary = import_internal('switch_secondary')
407 | switch.ops = import_internal('switch.ops')
408 | FUN_kernels = import_internal('FUN_kernels')
409 | func_categorical_preds = import_internal('func_categorical_preds')
410 | func_shuffle = import_internal('func_shuffle')
411 | class_folds = import_internal('class_folds')
412 | regr_folds = import_internal('regr_folds')
413 |
414 | #------------------------------------
415 |
416 | if (is.null(Levels)) { # regression
417 |
418 | if (is.null(weights_function)) {
419 | out_ = rowMeans(y_matrix)
420 | }
421 | else if (is.function(weights_function)) {
422 | W_te = FUNCTION_weights(dist_matrix, weights_function)
423 | out_ = rowSums(y_matrix * W_te)
424 | }
425 | else if (is.character(weights_function) && nchar(weights_function) > 1) {
426 | W_te = FUN_kernels(weights_function, dist_matrix, h)
427 | out_ = rowSums(y_matrix * W_te)
428 | }
429 | else {
430 | stop('false input for the weights_function argument')
431 | }
432 | }
433 | else { # classification
434 | if (is.null(weights_function)) {
435 | out_ = func_tbl_dist(y_matrix, sort(Levels))
436 | colnames(out_) = paste0('class_', sort(Levels))
437 | }
438 | else if (is.function(weights_function)) {
439 | W_te = FUNCTION_weights(dist_matrix, weights_function)
440 | out_ = func_tbl(y_matrix, W_te, sort(Levels))
441 | }
442 | else if (is.character(weights_function) && nchar(weights_function) > 1) {
443 | W_te = FUN_kernels(weights_function, dist_matrix, h)
444 | out_ = func_tbl(y_matrix, W_te, sort(Levels))
445 | }
446 | else {
447 | stop('false input for the weights_function argument')
448 | }
449 | }
450 |
451 | return(out_)
452 | }
453 |
454 |
455 |
456 |
457 | #' Approximate Kernel k nearest neighbors using the nmslib library
458 | #'
459 | #'
460 | #' @param data either a matrix or a scipy sparse matrix
461 | #' @param TEST_data a test dataset (in case of a matrix the \emph{TEST_data} should have equal number of columns with the \emph{data}). It is assumed that the \emph{TEST_data} is an unlabeled dataset
462 | #' @param y a numeric vector specifying the response variable (in classification the labels must be numeric from 1:Inf). The length of \emph{y} must equal the rows of the \emph{data} parameter
463 | #' @param k an integer. The number of neighbours to return
464 | #' @param h the bandwidth (applicable if the weights_function is not NULL, defaults to 1.0)
465 | #' @param weights_function there are various ways of specifying the kernel function. See the details section.
466 | #' @param Levels a numeric vector. In case of classification the unique levels of the response variable are necessary
467 | #' @param Index_Params a list of (optional) parameters to use in indexing (when creating the index)
468 | #' @param Time_Params a list of parameters to use in querying. Setting \emph{Time_Params} to NULL will reset the query-time parameters
469 | #' @param space a character string (optional). The metric space to create for this index. Page 31 of the manual (see \emph{references}) explains all available inputs
470 | #' @param space_params a list of (optional) parameters for configuring the space. See the \emph{references} manual for more details.
471 | #' @param method a character string specifying the index method to use
472 | #' @param data_type a character string. One of 'DENSE_UINT8_VECTOR', 'DENSE_VECTOR', 'OBJECT_AS_STRING' or 'SPARSE_VECTOR'
473 | #' @param dtype a character string. Either 'FLOAT' or 'INT'
474 | #' @param print_progress a boolean (either TRUE or FALSE). Whether or not to display progress bar
475 | #' @param num_threads an integer. The number of threads to use
476 | #' @param index_filepath a character string specifying the path to a file, where an existing index is saved
477 | #' @details
478 | #' There are three possible ways to specify the \emph{weights function}, 1st option : if the weights_function is NULL then a simple k-nearest-neighbor is performed. 2nd option : the weights_function is one of 'uniform', 'triangular', 'epanechnikov', 'biweight', 'triweight', 'tricube', 'gaussian', 'cosine', 'logistic', 'gaussianSimple', 'silverman', 'inverse', 'exponential'. The 2nd option can be extended by combining kernels from the existing ones (adding or multiplying). For instance, I can multiply the tricube with the gaussian kernel by giving 'tricube_gaussian_MULT' or I can add the previously mentioned kernels by giving 'tricube_gaussian_ADD'. 3rd option : a user defined kernel function
479 | #' @export
480 | #' @examples
481 | #'
482 | #' try({
483 | #' if (reticulate::py_available(initialize = FALSE)) {
484 | #' if (reticulate::py_module_available("nmslib")) {
485 | #'
486 | #' library(nmslibR)
487 | #'
488 | #' x = matrix(runif(1000), nrow = 100, ncol = 10)
489 | #'
490 | #' y = runif(100)
491 | #'
492 | #' out = KernelKnn_nmslib(data = x, y = y, k = 5)
493 | #' }
494 | #' }
495 | #' }, silent=TRUE)
496 |
497 |
498 | KernelKnn_nmslib = function(data,
499 | y,
500 | TEST_data = NULL,
501 | k = 5,
502 | h = 1.0,
503 | weights_function = NULL,
504 | Levels = NULL,
505 | Index_Params = NULL,
506 | Time_Params = NULL,
507 | space = 'l1',
508 | space_params = NULL,
509 | method = 'hnsw',
510 | data_type = 'DENSE_VECTOR',
511 | dtype = 'FLOAT',
512 | index_filepath = NULL,
513 | print_progress = FALSE,
514 | num_threads = 1) {
515 |
516 | if (inherits(data, "data.frame")) stop("the 'data' parameter is a data frame. For the function to run error free convert the data frame to a matrix", call. = F)
517 |
518 | if (!is.null(TEST_data)) {
519 | if (inherits(TEST_data, "data.frame")) stop("the 'TEST_data' parameter is a data frame. For the function to run error free convert the data frame to a matrix", call. = F)
520 | }
521 |
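522 | init_nmslib = NMSlib$new(input_data = data, Index_Params = Index_Params, Time_Params = Time_Params, space = space, space_params = space_params, method = method, data_type = data_type, dtype = dtype, index_filepath = index_filepath, print_progress = print_progress) # the arguments are named, otherwise 'print_progress' would be matched positionally to the 'load_data' parameter of 'initialize'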
523 |
524 | if (!is.null(TEST_data)) {
525 | knn_idx_dist = init_nmslib$knn_Query_Batch(TEST_data, k, num_threads)
526 | }
527 | else {
528 | knn_idx_dist = init_nmslib$knn_Query_Batch(data, k, num_threads)
529 | }
530 |
531 | out_y = y_idxs(knn_idx_dist$knn_idx, y, num_threads)
532 |
533 | if (!check_NaN_Inf(out_y)) {
534 | warning("the output includes missing values", call. = F) # in first place just print a warning in case of missing values
535 | }
536 |
537 | out_ = inner_kernel_function(out_y, knn_idx_dist$knn_dist, Levels, weights_function, h)
538 | return(out_)
539 | }
540 |
541 |
542 |
543 |
544 |
545 | #' Approximate Kernel k nearest neighbors (cross-validated) using the nmslib library
546 | #'
547 | #'
548 | #' @param data a numeric matrix
549 | #' @param y a numeric vector specifying the response variable (in classification the labels must be numeric from 1:Inf). The length of \emph{y} must equal the rows of the \emph{data} parameter
550 | #' @param k an integer. The number of neighbours to return
551 | #' @param folds the number of cross validation folds (must be greater than 1)
552 | #' @param h the bandwidth (applicable if the weights_function is not NULL, defaults to 1.0)
553 | #' @param weights_function there are various ways of specifying the kernel function. See the details section.
554 | #' @param Levels a numeric vector. In case of classification the unique levels of the response variable are necessary
555 | #' @param Index_Params a list of (optional) parameters to use in indexing (when creating the index)
556 | #' @param Time_Params a list of parameters to use in querying. Setting \emph{Time_Params} to NULL will reset the query-time parameters
557 | #' @param space a character string (optional). The metric space to create for this index. Page 31 of the manual (see \emph{references}) explains all available inputs
558 | #' @param space_params a list of (optional) parameters for configuring the space. See the \emph{references} manual for more details.
559 | #' @param method a character string specifying the index method to use
560 | #' @param data_type a character string. One of 'DENSE_UINT8_VECTOR', 'DENSE_VECTOR', 'OBJECT_AS_STRING' or 'SPARSE_VECTOR'
561 | #' @param dtype a character string. Either 'FLOAT' or 'INT'
562 | #' @param print_progress a boolean (either TRUE or FALSE). Whether or not to display progress bar
563 | #' @param num_threads an integer. The number of threads to use
564 | #' @param index_filepath a character string specifying the path to a file, where an existing index is saved
565 | #' @param seed_num a numeric value specifying the seed of the random number generator
566 | #' @details
567 | #' There are three possible ways to specify the \emph{weights function}, 1st option : if the weights_function is NULL then a simple k-nearest-neighbor is performed. 2nd option : the weights_function is one of 'uniform', 'triangular', 'epanechnikov', 'biweight', 'triweight', 'tricube', 'gaussian', 'cosine', 'logistic', 'gaussianSimple', 'silverman', 'inverse', 'exponential'. The 2nd option can be extended by combining kernels from the existing ones (adding or multiplying). For instance, I can multiply the tricube with the gaussian kernel by giving 'tricube_gaussian_MULT' or I can add the previously mentioned kernels by giving 'tricube_gaussian_ADD'. 3rd option : a user defined kernel function
568 | #' @export
569 | #' @importFrom utils txtProgressBar
570 | #' @importFrom utils setTxtProgressBar
571 | #' @examples
572 | #'
573 | #' \dontrun{
574 | #'
575 | #' x = matrix(runif(1000), nrow = 100, ncol = 10)
576 | #'
577 | #' y = runif(100)
578 | #'
579 | #' out = KernelKnnCV_nmslib(x, y, k = 5, folds = 5)
580 | #'
581 | #' }
582 |
583 |
584 | KernelKnnCV_nmslib = function(data,
585 | y,
586 | k = 5,
587 | folds = 5,
588 | h = 1.0,
589 | weights_function = NULL,
590 | Levels = NULL,
591 | Index_Params = NULL,
592 | Time_Params = NULL,
593 | space = 'l1',
594 | space_params = NULL,
595 | method = 'hnsw',
596 | data_type = 'DENSE_VECTOR',
597 | dtype = 'FLOAT',
598 | index_filepath = NULL,
599 | print_progress = FALSE,
600 | num_threads = 1,
601 | seed_num = 1) {
602 | start = Sys.time()
603 | #-------------------------------------------- import internal functions from KernelKnn
604 | class_folds = import_internal('class_folds')
605 | regr_folds = import_internal('regr_folds')
606 | #--------------------------------------------
607 |
608 | if (is.null(Levels)) {
609 | set.seed(seed_num)
610 | n_folds = regr_folds(folds, y)
611 | }
612 | else {
613 | set.seed(seed_num)
614 | n_folds = class_folds(folds, as.factor(y))
615 | }
616 |
617 | if (!all(unlist(lapply(n_folds, length)) > 5)) stop('Each fold must include more than 5 observations. Consider decreasing the number of folds or increasing the size of the data.')
618 | tmp_fit = list()
619 | cat('\n')
620 | cat('cross-validation starts ..', '\n')
621 | pb <- txtProgressBar(min = 0, max = folds, style = 3); cat('\n')
622 |
623 | for (i in 1:folds) {
624 | tmp_fit[[i]] = KernelKnn_nmslib(data = data[unlist(n_folds[-i]), ],
625 | y = y[unlist(n_folds[-i])],
626 | TEST_data = data[unlist(n_folds[i]), ],
627 | k = k,
628 | h = h,
629 | weights_function = weights_function,
630 | Levels = Levels,
631 | Index_Params = Index_Params,
632 | Time_Params = Time_Params,
633 | space = space,
634 | space_params = space_params,
635 | method = method,
636 | data_type = data_type,
637 | dtype = dtype,
638 | index_filepath = index_filepath,
639 | print_progress = print_progress,
640 | num_threads = num_threads)
641 | setTxtProgressBar(pb, i)
642 | }
643 |
644 | close(pb); cat('\n')
645 | end = Sys.time()
646 | t = end - start
647 | cat('time to complete :', t, attributes(t)$units, '\n')
648 | cat('\n')
649 | return(list(preds = tmp_fit, folds = n_folds))
650 | }
651 |
652 |
653 |
--------------------------------------------------------------------------------
/R/package.R:
--------------------------------------------------------------------------------
1 | #---------------------------------------------------------------------------------
2 | # An alternative way of configuration - not tested on CRAN yet - is the following:
3 | # https://github.com/rstudio/reticulate/issues/883#issuecomment-775552812
4 | # https://github.com/kevinushey/usespandas/blob/master/DESCRIPTION
5 | # https://github.com/kevinushey/usespandas/blob/master/R/zzz.R
6 | #---------------------------------------------------------------------------------
7 |
8 |
9 | NMSLIB <- NULL; SCP <- NULL;
10 |
11 | .onLoad <- function(libname, pkgname) {
12 |
13 | # reticulate::configure_environment(pkgname, force = TRUE) # this R programming line is related to the weblinks at the top of the file (see also the documentation)
14 |
15 | try({
16 | if (reticulate::py_available(initialize = FALSE)) {
17 |
18 | try({
19 | NMSLIB <<- reticulate::import("nmslib", delay_load = TRUE)
20 | }, silent=TRUE)
21 |
22 | try({
23 | SCP <<- reticulate::import("scipy", delay_load = TRUE, convert = FALSE)
24 | }, silent=TRUE)
25 | }
26 | }, silent=TRUE)
27 | }
28 |
29 |
30 | .onAttach <- function(libname, pkgname) {
31 | packageStartupMessage("If the 'nmslibR' package gives the following error: 'attempt to apply non-function' then make sure to open a new R session and run 'reticulate::py_config()' before loading the package!")
32 | }
33 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 |
8 |
9 |
10 | ## nmslibR (Non Metric Space Library in R)
11 |
12 |
13 |
14 | The **nmslibR** package is a wrapper of the [Non-Metric Space Library (NMSLIB)](https://github.com/nmslib/nmslib) *python* package. More details on the functionality of the *nmslibR* package can be found in the [blog-post](http://mlampros.github.io/2018/02/27/the_nmslibR_package/) and in the package Documentation.
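The wrapper described above can be exercised with a short sketch. It uses the `NMSlib` R6 class and its `Knn_Query()` and `knn_Query_Batch()` methods from `R/nmslib.R`; the random input matrix and the `nmslib_ready()` guard are illustrative assumptions and not part of the package.

```R
# Hypothetical guard (not part of nmslibR): TRUE only if the R package,
# a Python interpreter and the 'nmslib' Python module are all available.
nmslib_ready <- function() {
  isTRUE(tryCatch(
    requireNamespace("nmslibR", quietly = TRUE) &&
      requireNamespace("reticulate", quietly = TRUE) &&
      reticulate::py_available(initialize = TRUE) &&
      reticulate::py_module_available("nmslib"),
    error = function(e) FALSE
  ))
}

set.seed(1)
x <- matrix(runif(1000), nrow = 100, ncol = 10)   # must be a matrix, not a data.frame

if (nmslib_ready()) {
  # build the index with the default space ('l1') and method ('hnsw')
  init_nms <- nmslibR::NMSlib$new(input_data = x, space = "l1", method = "hnsw")

  # neighbours of a single row: a list of (1-based) indices and distances
  single <- init_nms$Knn_Query(query_data_row = x[1, ], k = 5)

  # neighbours of every row: a list with 'knn_idx' and 'knn_dist'
  batch <- init_nms$knn_Query_Batch(x, k = 5, num_threads = 1)
  str(batch)
}
```

If the guard returns `FALSE`, the system requirements listed further down are the first thing to check.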
15 |
16 |
17 |
18 |
19 | **Reference:**
20 |
21 | https://github.com/nmslib/nmslib
22 |
23 | https://github.com/nmslib/nmslib/blob/master/manual/latex/manual.pdf
24 |
25 |
26 |
27 |
28 | ### **System Requirements**
29 |
30 |
31 |
32 | * Python (>= 2.7)
33 |
34 |
35 |
36 |
37 | All modules should be installed in the default python configuration (the configuration that the R-session displays as default), otherwise errors will occur during the *nmslibR* package installation (**reticulate::py_discover_config()** might be useful here).
38 |
39 |
40 |
41 | The installation notes for *Linux, Macintosh, Windows* are based on *Python 3*.
42 |
43 |
44 |
45 | #### **Debian/Ubuntu**
46 |
47 |
48 |
49 | Installation of the system requirements,
50 |
51 |
52 |
53 | ```R
54 |
55 | sudo apt-get install python3-pip
56 |
57 | sudo pip3 install --upgrade setuptools
58 |
59 | sudo pip3 install -U numpy
60 |
61 | sudo pip3 install --upgrade scipy
62 |
63 | sudo apt-get install libboost-all-dev libgsl0-dev libeigen3-dev
64 |
65 | sudo apt-get install cmake
66 |
67 | pip3 install --upgrade pybind11
68 |
69 | sudo pip3 install nmslib
70 |
71 | ```
72 |
73 |
74 |
75 | #### **Fedora**
76 |
77 |
78 |
79 | Installation of the system requirements,
80 |
81 |
82 |
83 | ```R
84 |
85 | dnf install python3-pip
86 |
87 | sudo pip3 install --upgrade setuptools
88 |
89 | sudo pip3 install -U numpy
90 |
91 | sudo pip3 install --upgrade scipy
92 |
93 | yum install python3-devel
94 |
95 | yum install boost-devel
96 |
97 | yum install gsl-devel
98 |
99 | yum install eigen3-devel
100 |
101 | pip3 install --upgrade pybind11
102 |
103 | sudo pip3 install nmslib
104 |
105 | ```
106 |
107 |
108 |
109 | #### **Macintosh OSX**
110 |
111 |
112 |
113 | Upgrade python to version 3 using,
114 |
115 |
116 | ```R
117 |
118 | brew upgrade python
119 |
120 | ```
121 |
122 |
123 |
124 | Install the requirements,
125 |
126 |
127 |
128 | ```R
129 |
130 | sudo pip3 install --upgrade pip setuptools wheel
131 |
132 | sudo pip3 install -U numpy
133 |
134 | sudo pip3 install --upgrade scipy
135 |
136 | brew install boost
137 |
138 | brew install eigen
139 |
140 | brew install gsl
141 |
142 | brew install cmake
143 |
144 | brew link --overwrite cmake
145 |
146 | pip3 install --upgrade pybind11
147 |
148 | sudo pip3 install nmslib
149 |
150 | ```
151 |
152 |
153 |
154 | After a successful installation of the requirements the user should open an R session and give the following *reticulate* command to change to the relevant (brew-python) directory (otherwise the *nmslibR* package won't work properly),
155 |
156 |
157 |
158 | ```R
159 |
160 | reticulate::use_python('/usr/local/bin/python3')
161 |
162 |
163 | ```
164 |
165 |
166 |
167 | and then,
168 |
169 |
170 |
171 |
172 | ```R
173 |
174 | reticulate::py_discover_config()
175 |
176 |
177 | ```
178 |
179 |
180 |
181 | to validate that the R session uses the python version where *nmslib* is installed.
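Beyond inspecting the configuration by eye, the required modules can be checked programmatically. The following is a minimal sketch that assumes *reticulate* is installed; `check_python_modules()` is a hypothetical helper name, not a function of *nmslibR* or *reticulate*, and it relies only on `reticulate::py_module_available()`.

```R
# Hypothetical helper (not part of nmslibR): report which of the required
# Python modules can be imported from the interpreter that reticulate uses.
check_python_modules <- function(modules = c("numpy", "scipy", "nmslib")) {
  if (!requireNamespace("reticulate", quietly = TRUE)) {
    # without reticulate nothing can be checked, so report every module as missing
    return(stats::setNames(rep(FALSE, length(modules)), modules))
  }
  vapply(modules, function(m) {
    isTRUE(tryCatch(reticulate::py_module_available(m), error = function(e) FALSE))
  }, logical(1))
}

status <- check_python_modules()
print(status)

if (!all(status)) {
  message("Missing Python modules: ", paste(names(status)[!status], collapse = ", "))
}
```

On Macintosh OSX this should be run in a fresh R session after the `reticulate::use_python()` call, so that the check refers to the brew-python interpreter.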
182 |
183 |
184 |
185 |
186 |
187 | #### **Windows OS** (the instructions were tested with the version 1.0.0 of the R package, thus use with caution)
188 |
189 |
190 |
191 | First download [get-pip.py](https://bootstrap.pypa.io/get-pip.py) for Windows
192 |
193 |
194 |
195 | Update the Environment variables ( Control Panel >> System and Security >> System >> Advanced system settings >> Environment variables >> System variables >> Path >> Edit ) by adding ( for instance in case of python 3.6 ),
196 |
197 |
198 |
199 | ```R
200 |
201 | C:\Python36;C:\Python36\Scripts
202 |
203 |
204 | ```
205 |
206 |
207 |
208 | Install the [Build Tools for Visual Studio](https://visualstudio.microsoft.com/downloads/#build-tools-for-visual-studio-2017)
209 |
210 |
211 |
212 | Open the Command prompt (console) and install / upgrade the system requirements,
213 |
214 |
215 |
216 | ```R
217 |
218 | pip3 install --upgrade pip setuptools wheel
219 |
220 | pip3 install -U numpy
221 |
222 | pip3 install --upgrade scipy
223 |
224 | ```
225 |
226 |
227 |
228 | **Installation of cmake**
229 |
230 |
231 |
232 | First download cmake for Windows, [win64-x64 Installer](https://cmake.org/download/).
233 | Once the file is downloaded run the **.exe** file and during installation make sure to **add CMake to the system PATH for all users**.
234 |
235 |
236 |
237 |
238 | Then install the *nmslib* library,
239 |
240 |
241 |
242 | ```R
243 |
244 | pip3 install --upgrade pybind11
245 |
246 | pip3 install nmslib
247 |
248 | ```
249 |
250 |
251 |
252 |
253 |
254 | ### **Installation of the nmslibR package**
255 |
256 |
257 |
258 | To install the package from CRAN use,
259 |
260 |
261 |
262 | ```R
263 |
264 | install.packages('nmslibR')
265 |
266 |
267 | ```
268 |
269 |
270 | and to download the latest version from Github use the *install_github* function of the *remotes* package,
271 |
272 |
273 | ```R
274 |
275 | remotes::install_github(repo = 'mlampros/nmslibR')
276 |
277 | ```
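After installation, a quick smoke test can confirm that the R package and the Python backend work together. This is a sketch under the assumption that both *nmslibR* and the *nmslib* python module are available; the random data are illustrative, and the kernel name is one of the options listed in the documentation of `KernelKnn_nmslib()`.

```R
set.seed(1)
x <- matrix(runif(1000), nrow = 100, ncol = 10)   # predictors (matrix, not data.frame)
y <- runif(100)                                   # numeric response (regression)

# run only if the R package and the Python module can both be loaded
can_run <- requireNamespace("nmslibR", quietly = TRUE) &&
  isTRUE(tryCatch(reticulate::py_module_available("nmslib"),
                  error = function(e) FALSE))

if (can_run) {
  # unweighted approximate k-nearest-neighbours regression
  preds <- nmslibR::KernelKnn_nmslib(data = x, y = y, k = 5)
  print(head(preds))

  # kernel-weighted variant: tricube multiplied with the gaussian kernel
  preds_w <- nmslibR::KernelKnn_nmslib(data = x, y = y, k = 5, h = 1.0,
                                       weights_function = "tricube_gaussian_MULT")
  print(head(preds_w))
}
```

If `can_run` is `FALSE`, the python configuration is the most likely cause (see the system requirements above).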
278 |
279 | Use the following link to report bugs/issues,
280 |
281 |
282 | [https://github.com/mlampros/nmslibR/issues](https://github.com/mlampros/nmslibR/issues)
283 |
284 |
285 |
286 | ### **Citation:**
287 |
288 | If you use the code of this repository in your paper or research please cite both **nmslibR** and the **original articles / software** [https://CRAN.R-project.org/package=nmslibR/citation.html](https://CRAN.R-project.org/package=nmslibR/citation.html):
289 |
290 |
291 |
292 | ```R
293 | @Manual{,
294 | title = {{nmslibR}: Non Metric Space (Approximate) Library in R},
295 | author = {Lampros Mouselimis},
296 | year = {2021},
297 | note = {R package version 1.0.7},
298 | url = {https://CRAN.R-project.org/package=nmslibR},
299 | }
300 | ```
301 |
302 |
303 |
304 |
--------------------------------------------------------------------------------
/codecov.yml:
--------------------------------------------------------------------------------
1 | comment: false
2 |
--------------------------------------------------------------------------------
/inst/CITATION:
--------------------------------------------------------------------------------
1 | citHeader("Please cite both the package and the original articles / software in your publications:")
2 |
3 | year <- sub("-.*", "", meta$Date)
4 | note <- sprintf("R package version %s", meta$Version)
5 |
6 | bibentry(
7 | bibtype = "Manual",
8 | title = "{nmslibR}: Non Metric Space (Approximate) Library",
9 | author = person("Lampros", "Mouselimis"),
10 | year = year,
11 | note = note,
12 | url = "https://CRAN.R-project.org/package=nmslibR"
13 | )
14 |
15 | bibentry(
16 | bibtype = "Manual",
17 | title = "{nmslib}: Non-Metric Space Library (NMSLIB)",
18 | author = c(person("B", "Naidan"), person("L", "Boytsov"), person("Yu", "Malkov"), person("B", "Frederickson"), person("D", "Novak")),
19 | year = "2014",
20 | url = "https://github.com/nmslib/nmslib"
21 | )
22 |
23 | bibentry(
24 | bibtype = "InProceedings",
25 | author = c(person("Leonid", "Boytsov"), person("Bilegsaikhan", "Naidan")),
26 | editor = c(person("Nieves", "Brisaboa"), person("Oscar", "Pedreira"), person("Pavel", "Zezula")),
27 | title = "Engineering Efficient and Effective Non-metric Space Library",
28 | booktitle = "Similarity Search and Applications - 6th International Conference, SISAP 2013, Spain, October 2-4, 2013, Proceedings",
29 | series = "Lecture Notes in Computer Science",
30 | volume = "8199",
31 | pages = "280--293",
32 | publisher = "Springer",
33 | year = "2013",
34 | url = "https://doi.org/10.1007/978-3-642-41062-8",
35 | doi = "10.1007/978-3-642-41062-8"
36 | )
37 |
38 | bibentry(
39 | bibtype = "Article",
40 | author = c(person("Yury", "Malkov"), person("D", "Yashunin")),
41 | title = "Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs",
42 | journal = "CoRR",
43 | volume = "abs/1603.09320",
44 | year = "2016",
45 | url = "https://arxiv.org/abs/1603.09320"
46 | )
47 |
48 |
--------------------------------------------------------------------------------
/inst/Non_Metric_Space_Library_(NMSLIB)_Manual.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mlampros/nmslibR/6250dad4cdc7ba798cc2dc0d4fa9c5f0d40c16dc/inst/Non_Metric_Space_Library_(NMSLIB)_Manual.pdf
--------------------------------------------------------------------------------
/man/KernelKnnCV_nmslib.Rd:
--------------------------------------------------------------------------------
1 | % Generated by roxygen2: do not edit by hand
2 | % Please edit documentation in R/nmslib.R
3 | \name{KernelKnnCV_nmslib}
4 | \alias{KernelKnnCV_nmslib}
5 | \title{Approximate Kernel k nearest neighbors (cross-validated) using the nmslib library}
6 | \usage{
7 | KernelKnnCV_nmslib(
8 | data,
9 | y,
10 | k = 5,
11 | folds = 5,
12 | h = 1,
13 | weights_function = NULL,
14 | Levels = NULL,
15 | Index_Params = NULL,
16 | Time_Params = NULL,
17 | space = "l1",
18 | space_params = NULL,
19 | method = "hnsw",
20 | data_type = "DENSE_VECTOR",
21 | dtype = "FLOAT",
22 | index_filepath = NULL,
23 | print_progress = FALSE,
24 | num_threads = 1,
25 | seed_num = 1
26 | )
27 | }
28 | \arguments{
29 | \item{data}{a numeric matrix}
30 |
31 | \item{y}{a numeric vector specifying the response variable (in classification the labels must be numeric from 1:Inf). The length of \emph{y} must equal the rows of the \emph{data} parameter}
32 |
33 | \item{k}{an integer. The number of neighbours to return}
34 |
35 | \item{folds}{the number of cross validation folds (must be greater than 1)}
36 |
37 | \item{h}{the bandwidth (applicable if the weights_function is not NULL, defaults to 1.0)}
38 |
39 | \item{weights_function}{there are various ways of specifying the kernel function. See the details section.}
40 |
41 | \item{Levels}{a numeric vector. In case of classification the unique levels of the response variable are necessary}
42 |
43 | \item{Index_Params}{a list of (optional) parameters to use in indexing (when creating the index)}
44 |
45 | \item{Time_Params}{a list of parameters to use in querying. Setting \emph{Time_Params} to NULL will reset the query-time parameters}
46 |
47 | \item{space}{a character string (optional). The metric space to create for this index. Page 31 of the manual (see \emph{references}) explains all available inputs}
48 |
49 | \item{space_params}{a list of (optional) parameters for configuring the space. See the \emph{references} manual for more details.}
50 |
51 | \item{method}{a character string specifying the index method to use}
52 |
53 | \item{data_type}{a character string. One of 'DENSE_UINT8_VECTOR', 'DENSE_VECTOR', 'OBJECT_AS_STRING' or 'SPARSE_VECTOR'}
54 |
55 | \item{dtype}{a character string. Either 'FLOAT' or 'INT'}
56 |
57 | \item{index_filepath}{a character string specifying the path to a file, where an existing index is saved}
58 |
59 | \item{print_progress}{a boolean (either TRUE or FALSE). Whether or not to display progress bar}
60 |
61 | \item{num_threads}{an integer. The number of threads to use}
62 |
63 | \item{seed_num}{a numeric value specifying the seed of the random number generator}
64 | }
65 | \description{
66 | Approximate Kernel k nearest neighbors (cross-validated) using the nmslib library
67 | }
68 | \details{
69 | There are three possible ways to specify the \emph{weights function}, 1st option : if the weights_function is NULL then a simple k-nearest-neighbor is performed. 2nd option : the weights_function is one of 'uniform', 'triangular', 'epanechnikov', 'biweight', 'triweight', 'tricube', 'gaussian', 'cosine', 'logistic', 'gaussianSimple', 'silverman', 'inverse', 'exponential'. The 2nd option can be extended by combining kernels from the existing ones (adding or multiplying). For instance, I can multiply the tricube with the gaussian kernel by giving 'tricube_gaussian_MULT' or I can add the previously mentioned kernels by giving 'tricube_gaussian_ADD'. 3rd option : a user defined kernel function
70 | }
71 | \examples{
72 |
73 | \dontrun{
74 |
75 | x = matrix(runif(1000), nrow = 100, ncol = 10)
76 |
77 | y = runif(100)
78 |
79 | out = KernelKnnCV_nmslib(x, y, k = 5, folds = 5)
80 |
81 | }
82 | }
83 |
--------------------------------------------------------------------------------
/man/KernelKnn_nmslib.Rd:
--------------------------------------------------------------------------------
1 | % Generated by roxygen2: do not edit by hand
2 | % Please edit documentation in R/nmslib.R
3 | \name{KernelKnn_nmslib}
4 | \alias{KernelKnn_nmslib}
5 | \title{Approximate Kernel k nearest neighbors using the nmslib library}
6 | \usage{
7 | KernelKnn_nmslib(
8 | data,
9 | y,
10 | TEST_data = NULL,
11 | k = 5,
12 | h = 1,
13 | weights_function = NULL,
14 | Levels = NULL,
15 | Index_Params = NULL,
16 | Time_Params = NULL,
17 | space = "l1",
18 | space_params = NULL,
19 | method = "hnsw",
20 | data_type = "DENSE_VECTOR",
21 | dtype = "FLOAT",
22 | index_filepath = NULL,
23 | print_progress = FALSE,
24 | num_threads = 1
25 | )
26 | }
27 | \arguments{
28 | \item{data}{either a matrix or a scipy sparse matrix}
29 |
30 | \item{y}{a numeric vector specifying the response variable (in classification the labels must be numeric from 1:Inf). The length of \emph{y} must equal the rows of the \emph{data} parameter}
31 |
32 | \item{TEST_data}{a test dataset (in case of a matrix the \emph{TEST_data} should have the same number of columns as the \emph{data}). It is assumed that the \emph{TEST_data} is an unlabeled dataset}
33 |
34 | \item{k}{an integer. The number of neighbours to return}
35 |
36 | \item{h}{the bandwidth (applicable if the weights_function is not NULL, defaults to 1.0)}
37 |
38 | \item{weights_function}{there are various ways of specifying the kernel function. See the details section.}
39 |
40 | \item{Levels}{a numeric vector. In case of classification the unique levels of the response variable are necessary}
41 |
42 | \item{Index_Params}{a list of (optional) parameters to use in indexing (when creating the index)}
43 |
44 | \item{Time_Params}{a list of parameters to use in querying. Setting \emph{Time_Params} to NULL will reset the query-time parameters}
45 |
46 | \item{space}{a character string (optional). The metric space to create for this index. Page 31 of the manual (see \emph{references}) explains all available inputs}
47 |
48 | \item{space_params}{a list of (optional) parameters for configuring the space. See the \emph{references} manual for more details.}
49 |
50 | \item{method}{a character string specifying the index method to use}
51 |
52 | \item{data_type}{a character string. One of 'DENSE_UINT8_VECTOR', 'DENSE_VECTOR', 'OBJECT_AS_STRING' or 'SPARSE_VECTOR'}
53 |
54 | \item{dtype}{a character string. Either 'FLOAT' or 'INT'}
55 |
56 | \item{index_filepath}{a character string specifying the path to a file, where an existing index is saved}
57 |
58 | \item{print_progress}{a boolean (either TRUE or FALSE). Whether or not to display progress bar}
59 |
60 | \item{num_threads}{an integer. The number of threads to use}
61 | }
62 | \description{
63 | Approximate Kernel k nearest neighbors using the nmslib library
64 | }
65 | \details{
66 | There are three ways to specify the \emph{weights function}. 1st option : if the weights_function is NULL then a simple k-nearest-neighbor is performed. 2nd option : the weights_function is one of 'uniform', 'triangular', 'epanechnikov', 'biweight', 'triweight', 'tricube', 'gaussian', 'cosine', 'logistic', 'gaussianSimple', 'silverman', 'inverse', 'exponential'. The 2nd option can be extended by combining existing kernels (adding or multiplying them). For instance, 'tricube_gaussian_MULT' multiplies the tricube with the gaussian kernel and 'tricube_gaussian_ADD' adds the two kernels. 3rd option : a user defined kernel function
67 | }
68 | \examples{
69 |
70 | try({
71 | if (reticulate::py_available(initialize = FALSE)) {
72 | if (reticulate::py_module_available("nmslib")) {
73 |
74 | library(nmslibR)
75 |
76 | x = matrix(runif(1000), nrow = 100, ncol = 10)
77 |
78 | y = runif(100)
79 |
80 | out = KernelKnn_nmslib(data = x, y = y, k = 5)
81 | }
82 | }
83 | }, silent=TRUE)
84 | }
85 |
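The combined kernel names described in the details section ('tricube_gaussian_MULT', 'tricube_gaussian_ADD') can be illustrated with a minimal standalone C++ sketch. It uses the textbook tricube and gaussian kernel formulas; the helper names (`combined_kernel`, `weighted_prediction`) are hypothetical, and dividing the distances by the bandwidth `h` is an assumption, so this is not necessarily how KernelKnn normalises distances or weights internally.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Tricube kernel: (70/81) * (1 - |u|^3)^3 for |u| <= 1, zero elsewhere.
double tricube(double u) {
  const double a = std::fabs(u);
  if (a > 1.0) return 0.0;
  const double t = 1.0 - a * a * a;
  return (70.0 / 81.0) * t * t * t;
}

// Gaussian kernel: exp(-u^2 / 2) / sqrt(2 * pi).
double gaussian(double u) {
  const double kPi = 3.14159265358979323846;
  return std::exp(-0.5 * u * u) / std::sqrt(2.0 * kPi);
}

// Hypothetical helper: "MULT" multiplies the two kernels, "ADD" sums them.
double combined_kernel(double u, const std::string& mode) {
  if (mode == "MULT") return tricube(u) * gaussian(u);
  if (mode == "ADD") return tricube(u) + gaussian(u);
  throw std::invalid_argument("mode must be 'MULT' or 'ADD'");
}

// Kernel-weighted average of the neighbours' responses.
// Assumption: distances are scaled by the bandwidth h before the kernel is applied.
double weighted_prediction(const std::vector<double>& dist,
                           const std::vector<double>& y,
                           double h, const std::string& mode) {
  double num = 0.0, den = 0.0;
  for (std::size_t i = 0; i < dist.size(); ++i) {
    const double w = combined_kernel(dist[i] / h, mode);
    num += w * y[i];
    den += w;
  }
  // If every weight is zero there is nothing to average.
  return den > 0.0 ? num / den : std::nan("");
}
```

Neighbours whose scaled distance exceeds 1 get zero weight under 'MULT' (the tricube factor vanishes) but keep a small gaussian weight under 'ADD', which is the practical difference between the two ways of combining kernels.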
--------------------------------------------------------------------------------
/man/NMSlib.Rd:
--------------------------------------------------------------------------------
1 | % Generated by roxygen2: do not edit by hand
2 | % Please edit documentation in R/nmslib.R
3 | \docType{class}
4 | \name{NMSlib}
5 | \alias{NMSlib}
6 | \title{Non metric space library}
7 | \usage{
8 | # init <- NMSlib$new(input_data, Index_Params = NULL, Time_Params = NULL,
9 | # space='l1', space_params = NULL, method = 'hnsw',
10 | # data_type = 'DENSE_VECTOR', dtype = 'FLOAT',
11 | # index_filepath = NULL, load_data = FALSE,
12 | # print_progress = FALSE)
13 | }
14 | \description{
15 | Non metric space library
18 | }
19 | \details{
20 | \emph{input_data} parameter : In case of numeric data the \emph{input_data} parameter should be either an R matrix object or a scipy sparse matrix. Additionally, the \emph{input_data} parameter can be a list of two or more matrices / sparse matrices that have the same number of columns (this is ideal, for instance, if the user wants to include both a train and a test dataset in the created index)
21 |
22 | the \emph{Knn_Query} function finds the approximate K nearest neighbours of a vector in the index
23 |
24 | the \emph{knn_Query_Batch} function performs multiple queries on the index, distributing the work over a thread pool
25 |
26 | the \emph{save_Index} function saves the index to disk
27 |
28 | If the \emph{index_filepath} parameter is not NULL then an existing index will be loaded
29 |
30 | \emph{Incrementally} updating an already saved (and loaded) index is \emph{not} possible (see: https://github.com/nmslib/nmslib/issues/73)
31 | }
32 | \section{Methods}{
33 |
34 |
35 | \describe{
36 | \item{\code{NMSlib$new(input_data, Index_Params = NULL, Time_Params = NULL, space='l1',
37 | space_params = NULL, method = 'hnsw', data_type = 'DENSE_VECTOR',
38 | dtype = 'FLOAT', index_filepath = NULL, load_data = FALSE,
39 | print_progress = FALSE)}}{}
40 |
41 | \item{\code{--------------}}{}
42 |
43 | \item{\code{Knn_Query(query_data_row, k = 5)}}{}
44 |
45 | \item{\code{--------------}}{}
46 |
47 | \item{\code{knn_Query_Batch(query_data, k = 5, num_threads = 1)}}{}
48 |
49 | \item{\code{--------------}}{}
50 |
51 | \item{\code{save_Index(filename, save_data = FALSE)}}{}
52 | }
53 | }
54 |
55 | \examples{
56 |
57 | try({
58 | if (reticulate::py_available(initialize = FALSE)) {
59 | if (reticulate::py_module_available("nmslib")) {
60 |
61 | library(nmslibR)
62 |
63 | set.seed(1)
64 | x = matrix(runif(1000), nrow = 100, ncol = 10)
65 |
66 | init_nms = NMSlib$new(input_data = x)
67 |
68 |
69 | # returns a list of two 1-dimensional vectors (index, distance)
70 | #--------------------------------------------------------------
71 |
72 | init_nms$Knn_Query(query_data_row = x[1, ], k = 5)
73 |
74 |
75 | # returns knn's for all data
76 | #---------------------------
77 |
78 | all_dat = init_nms$knn_Query_Batch(x, k = 5, num_threads = 1)
79 | }
80 | }
81 | }, silent=TRUE)
82 | }
83 | \references{
84 | https://github.com/nmslib/nmslib/blob/master/manual/latex/manual.pdf
85 |
86 | https://github.com/nmslib/nmslib/blob/master/python_bindings/notebooks/search_vector_dense_optim.ipynb
87 |
88 | https://github.com/nmslib/nmslib/blob/master/python_bindings/notebooks/search_vector_dense_nonoptim.ipynb
89 |
90 | https://github.com/nmslib/nmslib/issues/356
91 |
92 | https://github.com/nmslib/nmslib/blob/master/manual/methods.md
93 |
94 | https://github.com/nmslib/nmslib/blob/master/manual/spaces.md
95 | }
96 | \section{Methods}{
97 | \subsection{Public methods}{
98 | \itemize{
99 | \item \href{#method-new}{\code{NMSlib$new()}}
100 | \item \href{#method-Knn_Query}{\code{NMSlib$Knn_Query()}}
101 | \item \href{#method-knn_Query_Batch}{\code{NMSlib$knn_Query_Batch()}}
102 | \item \href{#method-save_Index}{\code{NMSlib$save_Index()}}
103 | \item \href{#method-clone}{\code{NMSlib$clone()}}
104 | }
105 | }
106 | \if{html}{\out{<hr>}}
107 | \if{html}{\out{<a id="method-new"></a>}}
108 | \if{latex}{\out{\hypertarget{method-new}{}}}
109 | \subsection{Method \code{new()}}{
110 | \subsection{Usage}{
111 | \if{html}{\out{<div class="r">}}\preformatted{NMSlib$new(
112 | input_data,
113 | Index_Params = NULL,
114 | Time_Params = NULL,
115 | space = "l1",
116 | space_params = NULL,
117 | method = "hnsw",
118 | data_type = "DENSE_VECTOR",
119 | dtype = "FLOAT",
120 | index_filepath = NULL,
121 | load_data = FALSE,
122 | print_progress = FALSE
123 | )}\if{html}{\out{</div>}}
124 | }
125 |
126 | \subsection{Arguments}{
127 | \if{html}{\out{<div class="arguments">}}
128 | \describe{
129 | \item{\code{input_data}}{the input data. See \emph{details} for more information}
130 |
131 | \item{\code{Index_Params}}{a list of (optional) parameters to use in indexing (when creating the index)}
132 |
133 | \item{\code{Time_Params}}{a list of parameters to use in querying. Setting \emph{Time_Params} to NULL will reset the query-time parameters}
134 |
135 | \item{\code{space}}{a character string (optional). The metric space to create for this index. Page 31 of the manual (see \emph{references}) explains all available inputs}
136 |
137 | \item{\code{space_params}}{a list of (optional) parameters for configuring the space. See the \emph{references} manual for more details.}
138 |
139 | \item{\code{method}}{a character string specifying the index method to use}
140 |
141 | \item{\code{data_type}}{a character string. One of 'DENSE_UINT8_VECTOR', 'DENSE_VECTOR', 'OBJECT_AS_STRING' or 'SPARSE_VECTOR'}
142 |
143 | \item{\code{dtype}}{a character string. Either 'FLOAT' or 'INT'}
144 |
145 | \item{\code{index_filepath}}{a character string specifying the path to a file, where an existing index is saved}
146 |
147 | \item{\code{load_data}}{a boolean. If TRUE then besides the index also the saved data will be loaded. This parameter is used when the \emph{index_filepath} parameter is not NULL (see the web links in the \emph{references} section for more details). The user might also have to specify the \emph{skip_optimized_index} parameter of the \emph{Index_Params} in the "init" method}
148 |
149 | \item{\code{print_progress}}{a boolean (either TRUE or FALSE). Whether or not to display progress bar}
150 | }
151 | \if{html}{\out{</div>}}
152 | }
153 | }
154 | \if{html}{\out{<hr>}}
155 | \if{html}{\out{<a id="method-Knn_Query"></a>}}
156 | \if{latex}{\out{\hypertarget{method-Knn_Query}{}}}
157 | \subsection{Method \code{Knn_Query()}}{
158 | \subsection{Usage}{
159 | \if{html}{\out{<div class="r">}}\preformatted{NMSlib$Knn_Query(query_data_row, k = 5, include_query_data_row_index = FALSE)}\if{html}{\out{</div>}}
160 | }
161 |
162 | \subsection{Arguments}{
163 | \if{html}{\out{<div class="arguments">}}
164 | \describe{
165 | \item{\code{query_data_row}}{a vector to query for}
166 |
167 | \item{\code{k}}{an integer. The number of neighbours to return}
168 |
169 | \item{\code{include_query_data_row_index}}{a boolean. If TRUE then the index of the query data row will be returned as well. It currently defaults to FALSE which means the first matched index is excluded from the results (this parameter will be removed in version 1.1.0 and the output behavior of the function will be changed too - see the deprecation warning)}
170 | }
171 | \if{html}{\out{</div>}}
172 | }
173 | }
174 | \if{html}{\out{<hr>}}
175 | \if{html}{\out{<a id="method-knn_Query_Batch"></a>}}
176 | \if{latex}{\out{\hypertarget{method-knn_Query_Batch}{}}}
177 | \subsection{Method \code{knn_Query_Batch()}}{
178 | \subsection{Usage}{
179 | \if{html}{\out{<div class="r">}}\preformatted{NMSlib$knn_Query_Batch(query_data, k = 5, num_threads = 1)}\if{html}{\out{</div>}}
180 | }
181 |
182 | \subsection{Arguments}{
183 | \if{html}{\out{<div class="arguments">}}
184 | \describe{
185 | \item{\code{query_data}}{the queries to search for. The \emph{query_data} parameter should be of the same type as the \emph{input_data} parameter}
186 |
187 | \item{\code{k}}{an integer. The number of neighbours to return}
188 |
189 | \item{\code{num_threads}}{an integer. The number of threads to use}
190 | }
191 | \if{html}{\out{</div>}}
192 | }
193 | }
194 | \if{html}{\out{<hr>}}
195 | \if{html}{\out{<a id="method-save_Index"></a>}}
196 | \if{latex}{\out{\hypertarget{method-save_Index}{}}}
197 | \subsection{Method \code{save_Index()}}{
198 | \subsection{Usage}{
199 | \if{html}{\out{<div class="r">}}\preformatted{NMSlib$save_Index(filename, save_data = FALSE)}\if{html}{\out{</div>}}
200 | }
201 |
202 | \subsection{Arguments}{
203 | \if{html}{\out{<div class="arguments">}}
204 | \describe{
205 | \item{\code{filename}}{a character string specifying the path. The filename to save ( in case of the \emph{save_Index} method ) or the filename to load ( in case of the \emph{load_Index} method )}
206 |
207 | \item{\code{save_data}}{a boolean. If TRUE then besides the index also the data will be saved (see the web links in the \emph{references} section for more details)}
208 | }
209 | \if{html}{\out{</div>}}
210 | }
211 | }
212 | \if{html}{\out{<hr>}}
213 | \if{html}{\out{<a id="method-clone"></a>}}
214 | \if{latex}{\out{\hypertarget{method-clone}{}}}
215 | \subsection{Method \code{clone()}}{
216 | The objects of this class are cloneable with this method.
217 | \subsection{Usage}{
218 | \if{html}{\out{<div class="r">}}\preformatted{NMSlib$clone(deep = FALSE)}\if{html}{\out{</div>}}
219 | }
220 |
221 | \subsection{Arguments}{
222 | \if{html}{\out{<div class="arguments">}}
223 | \describe{
224 | \item{\code{deep}}{Whether to make a deep clone.}
225 | }
226 | \if{html}{\out{</div>}}
227 | }
228 | }
229 | }
230 |
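The `Knn_Query` method documented above returns approximate neighbours. As a point of comparison, a standalone C++ sketch of the exact (brute-force) search under the default 'l1' space is given here. It does not use nmslib at all; the names `l1_distance` and `exact_knn` are hypothetical, and indices are 0-based as in C++.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

using Point = std::vector<double>;
using Neighbour = std::pair<std::size_t, double>;  // (0-based row index, distance)

// Manhattan (L1) distance between two points of equal length.
double l1_distance(const Point& a, const Point& b) {
  double s = 0.0;
  for (std::size_t i = 0; i < a.size(); ++i) s += std::fabs(a[i] - b[i]);
  return s;
}

// Exact k nearest neighbours of `query` in `data`, closest first.
// An approximate index such as hnsw trades some of this accuracy for speed.
std::vector<Neighbour> exact_knn(const std::vector<Point>& data,
                                 const Point& query, std::size_t k) {
  std::vector<Neighbour> all;
  all.reserve(data.size());
  for (std::size_t i = 0; i < data.size(); ++i) {
    all.emplace_back(i, l1_distance(data[i], query));
  }
  k = std::min(k, all.size());
  // Only the first k positions need to be in sorted order.
  std::partial_sort(all.begin(), all.begin() + static_cast<std::ptrdiff_t>(k), all.end(),
                    [](const Neighbour& a, const Neighbour& b) { return a.second < b.second; });
  all.resize(k);
  return all;
}
```

Comparing the indices returned by an approximate query with the output of such an exact search is one common way to estimate recall for a given choice of index and query-time parameters.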
--------------------------------------------------------------------------------
/man/TO_scipy_sparse.Rd:
--------------------------------------------------------------------------------
1 | % Generated by roxygen2: do not edit by hand
2 | % Please edit documentation in R/nmslib.R
3 | \name{TO_scipy_sparse}
4 | \alias{TO_scipy_sparse}
5 | \title{conversion of an R sparse matrix to a scipy sparse matrix}
6 | \usage{
7 | TO_scipy_sparse(R_sparse_matrix)
8 | }
9 | \arguments{
10 | \item{R_sparse_matrix}{an R sparse matrix. Acceptable input objects are either a \emph{dgCMatrix} or a \emph{dgRMatrix}.}
11 | }
12 | \description{
13 | conversion of an R sparse matrix to a scipy sparse matrix
14 | }
15 | \details{
16 | This function allows the user to convert either an R \emph{dgCMatrix} or a \emph{dgRMatrix} to a scipy sparse matrix (\emph{scipy.sparse.csc_matrix} or \emph{scipy.sparse.csr_matrix}). This is useful because the \emph{nmslibR} package accepts python sparse matrices as input in addition to R dense matrices.
17 |
18 | The \emph{dgCMatrix} class is a class of sparse numeric matrices in the compressed, sparse, \emph{column-oriented format}. The \emph{dgRMatrix} class is a class of sparse numeric matrices in the compressed, sparse, \emph{row-oriented format}.
19 | }
20 | \examples{
21 |
22 | try({
23 | if (reticulate::py_available(initialize = FALSE)) {
24 | if (reticulate::py_module_available("scipy")) {
25 |
26 | if (Sys.info()["sysname"] != 'Darwin') {
27 |
28 | library(nmslibR)
29 |
30 |
31 | # 'dgCMatrix' sparse matrix
32 | #--------------------------
33 |
34 | data = c(1, 0, 2, 0, 0, 3, 4, 5, 6)
35 |
36 | dgcM = Matrix::Matrix(data = data, nrow = 3,
37 |
38 | ncol = 3, byrow = TRUE,
39 |
40 | sparse = TRUE)
41 |
42 | print(dim(dgcM))
43 |
44 | res = TO_scipy_sparse(dgcM)
45 |
46 | print(res$shape)
47 |
48 |
49 | # 'dgRMatrix' sparse matrix
50 | #--------------------------
51 |
52 | dgrM = as(dgcM, "RsparseMatrix")
53 |
54 | print(dim(dgrM))
55 |
56 | res_dgr = TO_scipy_sparse(dgrM)
57 |
58 | print(res_dgr$shape)
59 | }
60 | }
61 | }
62 | }, silent=TRUE)
63 | }
64 | \references{
65 | https://stat.ethz.ch/R-manual/R-devel/library/Matrix/html/dgCMatrix-class.html, https://stat.ethz.ch/R-manual/R-devel/library/Matrix/html/dgRMatrix-class.html, https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csc_matrix.html#scipy.sparse.csc_matrix
66 | }
67 |
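The difference between the column-oriented (dgCMatrix, scipy csc_matrix) and row-oriented (dgRMatrix, scipy csr_matrix) layouts can be shown with a small standalone C++ sketch. It compresses the same 3 x 3 matrix used in the example above both ways. The `Compressed` struct and the `to_csr` / `to_csc` helpers are hypothetical and only mirror the usual data / indices / indptr triplet; they are not part of nmslibR, Matrix or scipy.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Dense = std::vector<std::vector<double>>;  // row-major dense matrix

struct Compressed {
  std::vector<double> data;          // non-zero values in storage order
  std::vector<std::size_t> indices;  // column index (CSR) or row index (CSC) of each value
  std::vector<std::size_t> indptr;   // where each row (CSR) or column (CSC) starts in `data`
};

// Row-oriented compression: walk the matrix row by row.
Compressed to_csr(const Dense& m) {
  Compressed out;
  out.indptr.push_back(0);
  for (const auto& row : m) {
    for (std::size_t c = 0; c < row.size(); ++c) {
      if (row[c] != 0.0) {
        out.data.push_back(row[c]);
        out.indices.push_back(c);
      }
    }
    out.indptr.push_back(out.data.size());
  }
  return out;
}

// Column-oriented compression: walk the matrix column by column.
Compressed to_csc(const Dense& m) {
  Compressed out;
  out.indptr.push_back(0);
  const std::size_t ncol = m.empty() ? 0 : m[0].size();
  for (std::size_t c = 0; c < ncol; ++c) {
    for (std::size_t r = 0; r < m.size(); ++r) {
      if (m[r][c] != 0.0) {
        out.data.push_back(m[r][c]);
        out.indices.push_back(r);
      }
    }
    out.indptr.push_back(out.data.size());
  }
  return out;
}
```

For the matrix with rows (1, 0, 2), (0, 0, 3), (4, 5, 6) the row-oriented form stores the values in the order 1, 2, 3, 4, 5, 6, whereas the column-oriented form stores them as 1, 4, 5, 2, 3, 6.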
--------------------------------------------------------------------------------
/man/import_internal.Rd:
--------------------------------------------------------------------------------
1 | % Generated by roxygen2: do not edit by hand
2 | % Please edit documentation in R/nmslib.R
3 | \name{import_internal}
4 | \alias{import_internal}
5 | \title{import internal functions from the KernelKnn package}
6 | \usage{
7 | import_internal(function_name)
8 | }
9 | \description{
10 | import internal functions from the KernelKnn package
11 | }
12 | \keyword{internal}
13 |
--------------------------------------------------------------------------------
/man/inner_kernel_function.Rd:
--------------------------------------------------------------------------------
1 | % Generated by roxygen2: do not edit by hand
2 | % Please edit documentation in R/nmslib.R
3 | \name{inner_kernel_function}
4 | \alias{inner_kernel_function}
5 | \title{inner function to compute kernels, extract weights and return predictions}
6 | \usage{
7 | inner_kernel_function(y_matrix, dist_matrix, Levels, weights_function, h)
8 | }
9 | \description{
10 | inner function to compute kernels, extract weights and return predictions
11 | }
12 | \keyword{internal}
13 |
--------------------------------------------------------------------------------
/man/mat_2scipy_sparse.Rd:
--------------------------------------------------------------------------------
1 | % Generated by roxygen2: do not edit by hand
2 | % Please edit documentation in R/nmslib.R
3 | \name{mat_2scipy_sparse}
4 | \alias{mat_2scipy_sparse}
5 | \title{conversion of an R matrix to a scipy sparse matrix}
6 | \usage{
7 | mat_2scipy_sparse(x, format = "sparse_row_matrix")
8 | }
9 | \arguments{
10 | \item{x}{a data matrix}
11 |
12 | \item{format}{a character string. Either \emph{"sparse_row_matrix"} or \emph{"sparse_column_matrix"}}
13 | }
14 | \description{
15 | conversion of an R matrix to a scipy sparse matrix
16 | }
17 | \details{
18 | This function allows the user to convert an R matrix to a scipy sparse matrix. This is useful because the \emph{nmslibR} package accepts sparse input only in the form of \emph{python} (scipy) sparse matrices.
19 | }
20 | \examples{
21 |
22 | try({
23 | if (reticulate::py_available(initialize = FALSE)) {
24 | if (reticulate::py_module_available("scipy")) {
25 |
26 | library(nmslibR)
27 |
28 | set.seed(1)
29 |
30 | x = matrix(runif(1000), nrow = 100, ncol = 10)
31 |
32 | res = mat_2scipy_sparse(x)
33 |
34 | print(dim(x))
35 |
36 | print(res$shape)
37 | }
38 | }
39 | }, silent=TRUE)
40 | }
41 | \references{
42 | https://docs.scipy.org/doc/scipy/reference/sparse.html
43 | }
44 |
--------------------------------------------------------------------------------
/src/Makevars:
--------------------------------------------------------------------------------
1 | PKG_CXXFLAGS = $(SHLIB_OPENMP_CXXFLAGS) -DARMA_64BIT_WORD
2 | PKG_LIBS = $(LAPACK_LIBS) $(BLAS_LIBS) $(FLIBS) $(SHLIB_OPENMP_CXXFLAGS)
3 | CXX_STD = CXX17
4 | PKG_CPPFLAGS = -I../inst/include/
5 |
--------------------------------------------------------------------------------
/src/Makevars.win:
--------------------------------------------------------------------------------
1 | PKG_CXXFLAGS = $(SHLIB_OPENMP_CXXFLAGS) -DARMA_64BIT_WORD
2 | PKG_LIBS = $(LAPACK_LIBS) $(BLAS_LIBS) $(FLIBS) $(SHLIB_OPENMP_CXXFLAGS) -mthreads
3 | CXX_STD = CXX17
4 | PKG_CPPFLAGS = -I../inst/include/
5 |
--------------------------------------------------------------------------------
/src/RcppExports.cpp:
--------------------------------------------------------------------------------
1 | // Generated by using Rcpp::compileAttributes() -> do not edit by hand
2 | // Generator token: 10BE3573-1514-4C36-9D1C-5A225CD40393
3 |
4 | #include <RcppArmadillo.h>
5 | #include <Rcpp.h>
6 |
7 | using namespace Rcpp;
8 |
9 | #ifdef RCPP_USE_GLOBAL_ROSTREAM
10 | Rcpp::Rostream& Rcpp::Rcout = Rcpp::Rcpp_cout_get();
11 | Rcpp::Rostream& Rcpp::Rcerr = Rcpp::Rcpp_cerr_get();
12 | #endif
13 |
14 | // nmslib_idx_dist
15 | Rcpp::List nmslib_idx_dist(std::vector<std::vector<std::vector<double> > >& input_list, unsigned int k, int threads);
16 | RcppExport SEXP _nmslibR_nmslib_idx_dist(SEXP input_listSEXP, SEXP kSEXP, SEXP threadsSEXP) {
17 | BEGIN_RCPP
18 | Rcpp::RObject rcpp_result_gen;
19 | Rcpp::RNGScope rcpp_rngScope_gen;
20 | Rcpp::traits::input_parameter< std::vector<std::vector<std::vector<double> > >& >::type input_list(input_listSEXP);
21 | Rcpp::traits::input_parameter< unsigned int >::type k(kSEXP);
22 | Rcpp::traits::input_parameter< int >::type threads(threadsSEXP);
23 | rcpp_result_gen = Rcpp::wrap(nmslib_idx_dist(input_list, k, threads));
24 | return rcpp_result_gen;
25 | END_RCPP
26 | }
27 | // y_idxs
28 | arma::mat y_idxs(arma::mat& idxs, std::vector<double>& y, int threads);
29 | RcppExport SEXP _nmslibR_y_idxs(SEXP idxsSEXP, SEXP ySEXP, SEXP threadsSEXP) {
30 | BEGIN_RCPP
31 | Rcpp::RObject rcpp_result_gen;
32 | Rcpp::RNGScope rcpp_rngScope_gen;
33 | Rcpp::traits::input_parameter< arma::mat& >::type idxs(idxsSEXP);
34 | Rcpp::traits::input_parameter< std::vector<double>& >::type y(ySEXP);
35 | Rcpp::traits::input_parameter< int >::type threads(threadsSEXP);
36 | rcpp_result_gen = Rcpp::wrap(y_idxs(idxs, y, threads));
37 | return rcpp_result_gen;
38 | END_RCPP
39 | }
40 | // check_NaN_Inf
41 | bool check_NaN_Inf(arma::mat x);
42 | RcppExport SEXP _nmslibR_check_NaN_Inf(SEXP xSEXP) {
43 | BEGIN_RCPP
44 | Rcpp::RObject rcpp_result_gen;
45 | Rcpp::RNGScope rcpp_rngScope_gen;
46 | Rcpp::traits::input_parameter< arma::mat >::type x(xSEXP);
47 | rcpp_result_gen = Rcpp::wrap(check_NaN_Inf(x));
48 | return rcpp_result_gen;
49 | END_RCPP
50 | }
51 |
--------------------------------------------------------------------------------
/src/init.c:
--------------------------------------------------------------------------------
1 | #include <R.h>
2 | #include <Rinternals.h>
3 | #include <stdlib.h> // for NULL
4 | #include <R_ext/Rdynload.h>
5 |
6 | /* FIXME:
7 | Check these declarations against the C/Fortran source code.
8 | */
9 |
10 | /* .Call calls */
11 | extern SEXP _nmslibR_check_NaN_Inf(SEXP);
12 | extern SEXP _nmslibR_nmslib_idx_dist(SEXP, SEXP, SEXP);
13 | extern SEXP _nmslibR_y_idxs(SEXP, SEXP, SEXP);
14 |
15 | static const R_CallMethodDef CallEntries[] = {
16 | {"_nmslibR_check_NaN_Inf", (DL_FUNC) &_nmslibR_check_NaN_Inf, 1},
17 | {"_nmslibR_nmslib_idx_dist", (DL_FUNC) &_nmslibR_nmslib_idx_dist, 3},
18 | {"_nmslibR_y_idxs", (DL_FUNC) &_nmslibR_y_idxs, 3},
19 | {NULL, NULL, 0}
20 | };
21 |
22 | void R_init_nmslibR(DllInfo *dll)
23 | {
24 | R_registerRoutines(dll, NULL, CallEntries, NULL, NULL);
25 | R_useDynamicSymbols(dll, FALSE);
26 | }
27 |
--------------------------------------------------------------------------------
/src/utils.cpp:
--------------------------------------------------------------------------------
1 | # include <RcppArmadillo.h>
2 | // [[Rcpp::depends("RcppArmadillo")]]
3 | // [[Rcpp::plugins(openmp)]]
4 | // [[Rcpp::plugins(cpp17)]]
5 |
6 | #ifdef _OPENMP
7 | #include <omp.h>
8 | #endif
9 |
10 |
11 |
12 | // return a named Rcpp list for the output list [ NA's if length of knn's not equal for all cases ]
13 | //
14 |
15 | // [[Rcpp::export]]
16 | Rcpp::List nmslib_idx_dist(std::vector<std::vector<std::vector<double> > >& input_list, unsigned int k, int threads = 1) {
17 |
18 | #ifdef _OPENMP
19 | omp_set_num_threads(threads);
20 | #endif
21 |
22 | unsigned int ROWS = input_list.size();
23 | arma::mat indices(ROWS, k), distances(ROWS, k);
24 | indices.fill(arma::datum::nan);
25 | distances.fill(arma::datum::nan);
26 | unsigned int i, j;
27 |
28 | #ifdef _OPENMP
29 | #pragma omp parallel for schedule(static) shared(ROWS, input_list, indices, distances) private(i,j)
30 | #endif
31 | for (i = 0; i < ROWS; i++) {
32 |
33 | std::vector<std::vector<double> > inner_vec = input_list[i];
34 | std::vector<double> inner_idx = inner_vec[0];
35 | std::vector<double> inner_dist = inner_vec[1]; // it is possible that the length of a vector differs [ not equal to k -- in that case it takes the value of NA ]
36 |
37 | for (j = 1; j < inner_dist.size(); j++) { // indexing of inner vector begins from 1
38 |
39 | #ifdef _OPENMP
40 | #pragma omp atomic write
41 | #endif
42 | indices(i, j-1) = inner_idx[j] + 1; // when populating matrices the indices begin from 0 ALSO add 1 ( + 1) to account for the difference in indexing between C++ and R
43 |
44 | #ifdef _OPENMP
45 | #pragma omp atomic write
46 | #endif
47 | distances(i, j-1) = inner_dist[j];
48 | }
49 | }
50 |
51 | return Rcpp::List::create(Rcpp::Named("knn_idx") = indices, Rcpp::Named("knn_dist") = distances);
52 | }
53 |
54 |
55 |
56 | // build matrix from response (y) and output-knn-indices [ account for the case where an index is NA ]
57 | //
58 |
59 | // [[Rcpp::export]]
60 | arma::mat y_idxs(arma::mat& idxs, std::vector<double>& y, int threads = 1) {
61 |
62 | #ifdef _OPENMP
63 | omp_set_num_threads(threads);
64 | #endif
65 |
66 | unsigned int NROWS = idxs.n_rows;
67 | unsigned int NCOLS = idxs.n_cols;
68 | arma::mat out(NROWS, NCOLS);
69 | unsigned int i,j;
70 |
71 | #ifdef _OPENMP
72 | #pragma omp parallel for schedule(static) shared(NROWS, idxs, NCOLS, out, y) private(i,j)
73 | #endif
74 | for (i = 0; i < NROWS; i++) {
75 |
76 | for (j = 0; j < NCOLS; j++) {
77 |
78 | if (idxs(i,j) != idxs(i,j)) { // if NA append nan-value
79 |
80 | #ifdef _OPENMP
81 | #pragma omp atomic write
82 | #endif
83 | out(i,j) = arma::datum::nan;
84 | }
85 | else {
86 |
87 | #ifdef _OPENMP
88 | #pragma omp atomic write
89 | #endif
90 | out(i,j) = y[idxs(i,j) - 1]; // account for the difference in indexing betw. R and C++
91 | }
92 | }
93 | }
94 |
95 | return out;
96 | }
97 |
98 |
99 | // it returns TRUE if the matrix does not include NaN's or +/- Inf
100 | // it returns FALSE if at least one value is NaN or +/- Inf
101 | //
102 |
103 | // [[Rcpp::export]]
104 | bool check_NaN_Inf(arma::mat x) {
105 | return x.is_finite();
106 | }
107 |
108 |
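The two exported helpers above (flattening ragged k-NN output into NaN-padded matrices, then looking up the response for each neighbour index) can be tried outside of R with a standalone C++ sketch. It replaces `arma::mat` with nested `std::vector`s and drops OpenMP. The names `flatten_knn` and `responses_for` are hypothetical, the element type `double` is an assumption, and the extra bound on `k` in the inner loop is an added guard that the original loop does not have.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <utility>
#include <vector>

using Row = std::vector<double>;
using Matrix = std::vector<Row>;       // row-major stand-in for arma::mat
using QueryResult = std::vector<Row>;  // [0] = neighbour indices, [1] = distances

// Flatten ragged k-NN results into two (rows x k) matrices padded with NaN.
// Position 0 of every result is skipped and indices are shifted to 1-based (R style).
std::pair<Matrix, Matrix> flatten_knn(const std::vector<QueryResult>& results, std::size_t k) {
  const double nan = std::numeric_limits<double>::quiet_NaN();
  Matrix idx(results.size(), Row(k, nan));
  Matrix dist(results.size(), Row(k, nan));
  for (std::size_t i = 0; i < results.size(); ++i) {
    const Row& in_idx = results[i][0];
    const Row& in_dist = results[i][1];
    // "j - 1 < k" is an added guard against results longer than k + 1.
    for (std::size_t j = 1; j < in_dist.size() && j - 1 < k; ++j) {
      idx[i][j - 1] = in_idx[j] + 1.0;
      dist[i][j - 1] = in_dist[j];
    }
  }
  return {idx, dist};
}

// Replace every 1-based neighbour index by its response value; NaN stays NaN.
Matrix responses_for(const Matrix& idx, const Row& y) {
  Matrix out(idx.size());
  for (std::size_t i = 0; i < idx.size(); ++i) {
    for (double v : idx[i]) {
      out[i].push_back(std::isnan(v) ? v : y[static_cast<std::size_t>(v) - 1]);
    }
  }
  return out;
}
```

A query that returns fewer than k + 1 neighbours leaves the trailing cells of its row as NaN, which is what the `idxs(i,j) != idxs(i,j)` test in `y_idxs` detects.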
--------------------------------------------------------------------------------
/tests/testthat.R:
--------------------------------------------------------------------------------
1 | library(testthat)
2 | library(nmslibR)
3 |
4 | test_check("nmslibR")
5 |
--------------------------------------------------------------------------------
/tests/testthat/helper-init.R:
--------------------------------------------------------------------------------
1 |
2 | # prefer Python 3 if available [ see: https://github.com/rstudio/reticulate/blob/master/tests/testthat/helper-init.R ]
3 | if (!reticulate::py_available(initialize = FALSE) &&
4 | is.na(Sys.getenv("RETICULATE_PYTHON", unset = NA)))
5 | {
6 | python <- Sys.which("python3")
7 | if (nzchar(python))
8 | reticulate::use_python(python, required = TRUE)
9 | }
10 |
--------------------------------------------------------------------------------
/tests/testthat/helper-skip.R:
--------------------------------------------------------------------------------
1 |
2 |
3 | #.......................................
4 | # skip a test if python is not available [ see: https://github.com/rstudio/reticulate/tree/master/tests/testthat ]
5 | #.......................................
6 |
7 | skip_test_if_no_python <- function() {
8 | if (!reticulate::py_available(initialize = FALSE))
9 | testthat::skip("Python bindings not available for testing")
10 | }
11 |
12 |
13 | #................................................................
14 | # helper function to skip tests if we don't have the 'foo' module
15 | #................................................................
16 |
17 | skip_test_if_no_module <- function(MODULE) { # MODULE is of type character string ( length(MODULE) >= 1 )
18 |
19 | if (length(MODULE) == 1) {
20 |
21 | module_exists <- reticulate::py_module_available(MODULE)}
22 |
23 | else {
24 |
25 | module_exists <- sum(as.vector(sapply(MODULE, function(x) reticulate::py_module_available(x)))) == length(MODULE)
26 | }
27 |
28 | if (!module_exists) {
29 |
30 | testthat::skip(paste0(paste(MODULE, collapse = ", "), " is not available for testthat-testing"))
31 | }
32 | }
33 |
34 |
--------------------------------------------------------------------------------
/tests/testthat/setup.R:
--------------------------------------------------------------------------------
1 |
2 | # data
3 | #-----
4 |
5 | set.seed(1)
6 | x = matrix(runif(1000), nrow = 100, ncol = 10)
7 |
8 | x_lst = list(x, x)
9 |
10 |
11 | # response regression
12 | #--------------------
13 |
14 | set.seed(3)
15 | y_reg = runif(100)
16 |
17 |
18 | # response "binary" classification
19 | #---------------------------------
20 |
21 | set.seed(4)
22 | y_BINclass = sample(1:2, 100, replace = TRUE)
23 |
24 |
25 | # response "multiclass" classification
26 | #-------------------------------------
27 |
28 | set.seed(5)
29 | y_MULTIclass = sample(1:3, 100, replace = TRUE)
30 |
31 |
32 | # data for sparse matrices
33 | #-------------------------
34 |
35 | data(ionosphere, package = 'KernelKnn')
36 |
37 | X = as.matrix(ionosphere[, -c(1:2, ncol(ionosphere))])
38 |
--------------------------------------------------------------------------------
/tests/testthat/test-nmslibR_pkg.R:
--------------------------------------------------------------------------------
1 |
2 |
3 | context('tests for nmslibR pkg')
4 |
5 |
6 | # conversion of an R matrix to a scipy sparse matrix
7 | #---------------------------------------------------
8 |
9 | testthat::test_that("the 'mat_2scipy_sparse' returns an error in case that the 'format' parameter is invalid", {
10 |
11 | skip_test_if_no_python()
12 | skip_test_if_no_module("scipy")
13 |
14 | testthat::expect_error( mat_2scipy_sparse(x, format = 'invalid') )
15 | })
16 |
17 |
18 | testthat::test_that("the 'mat_2scipy_sparse' returns a scipy sparse matrix", {
19 |
20 | skip_test_if_no_python()
21 | skip_test_if_no_module("scipy")
22 |
23 | res = mat_2scipy_sparse(x, format = 'sparse_row_matrix')
24 | cl_obj = class(res)[1] # class is python object
25 | same_dims = sum(unlist(reticulate::py_to_r(res$shape)) == dim(x)) == 2 # sparse matrix has same dimensions as input dense matrix
26 |
27 | testthat::expect_true( same_dims && cl_obj == "scipy.sparse.csr.csr_matrix" )
28 | })
29 |
30 |
31 |
32 | # conversion of an R sparse matrix to a scipy sparse matrix
33 | #-----------------------------------------------------------
34 |
35 | # run the following tests on all operating systems except for 'Macintosh'
36 | # [ otherwise it will raise an error due to the fact that the 'scipy-sparse' library ( applied on 'TO_scipy_sparse' function)
37 | # on CRAN is not upgraded and the older version includes a bug ('TypeError : could not interpret data type') ]
38 | # reference : https://github.com/scipy/scipy/issues/5353
39 |
40 | if (Sys.info()["sysname"] != 'Darwin') {
41 |
42 | testthat::test_that("the 'TO_scipy_sparse' function returns an error in case that the input object is not of type 'dgCMatrix' or 'dgRMatrix'", {
43 |
44 | skip_test_if_no_python()
45 | skip_test_if_no_module("scipy")
46 |
47 | mt = matrix(runif(20), nrow = 5, ncol = 4)
48 |
49 | testthat::expect_error( TO_scipy_sparse(mt) )
50 | })
51 |
52 |
53 | testthat::test_that("the 'TO_scipy_sparse' returns the correct output if the input is a 'dgCMatrix'", {
54 |
55 | skip_test_if_no_python()
56 | skip_test_if_no_module("scipy")
57 |
58 | data = c(1, 0, 2, 0, 0, 3, 4, 5, 6)
59 |
60 | dgcM = Matrix::Matrix(data = data, nrow = 3,
61 | ncol = 3, byrow = TRUE,
62 | sparse = TRUE)
63 |
64 | res = TO_scipy_sparse(dgcM)
65 | cl_obj = class(res)[1] # class is python object
66 | validate_dims = sum(dim(dgcM) == unlist(reticulate::py_to_r(res$shape))) == 2 # sparse matrix has same dimensions as input R sparse matrix
67 |
68 | testthat::expect_true( validate_dims && cl_obj == "scipy.sparse.csc.csc_matrix" )
69 | })
70 |
71 |
72 | testthat::test_that("the 'TO_scipy_sparse' returns the correct output if the input is a 'dgRMatrix'", {
73 |
74 | skip_test_if_no_python()
75 | skip_test_if_no_module("scipy")
76 |
77 | data = c(1, 0, 2, 0, 0, 3, 4, 5, 6)
78 |
79 | dgcM = Matrix::Matrix(data = data, nrow = 3,
80 | ncol = 3, byrow = TRUE,
81 | sparse = TRUE)
82 |
83 | dgrM = as(dgcM, "RsparseMatrix")
84 | res = TO_scipy_sparse(dgrM)
85 | cl_obj = class(res)[1] # class is python object
86 | validate_dims = sum(dim(dgrM) == unlist(reticulate::py_to_r(res$shape))) == 2 # sparse matrix has same dimensions as input R sparse matrix
87 |
88 | testthat::expect_true( validate_dims && cl_obj == "scipy.sparse.csr.csr_matrix" )
89 | })
90 | }
91 |
92 |
93 | # tests for 'NMSlib' class
94 | #-------------------------
95 |
96 |
97 | testthat::test_that("the NMSlib class works with default settings", {
98 |
99 | skip_test_if_no_python()
100 | skip_test_if_no_module('nmslib')
101 |
102 | init_nms = NMSlib$new(input_data = x, Index_Params = NULL, Time_Params = NULL, space='l1', space_params = NULL,
103 | method = 'hnsw', data_type = 'DENSE_VECTOR', dtype = 'FLOAT', index_filepath = NULL, print_progress = FALSE)
104 |
105 | knns = 5
106 | tmp_res = init_nms$Knn_Query(x[1, ], k = knns)
107 |
108 | testthat::expect_true( inherits(tmp_res, 'list') && length(tmp_res) == 2 && all(unlist(lapply(tmp_res, length)) == knns) )
109 | })
110 |
111 |
112 |
113 | testthat::test_that("the NMSlib class works with default settings [ and 'input_data' is a list ]", {
114 |
115 | skip_test_if_no_python()
116 | skip_test_if_no_module('nmslib')
117 |
118 | init_nms = NMSlib$new(input_data = x_lst, Index_Params = NULL, Time_Params = NULL, space='l1', space_params = NULL,
119 | method = 'hnsw', data_type = 'DENSE_VECTOR', dtype = 'FLOAT', index_filepath = NULL, print_progress = FALSE)
120 |
121 | knns = 5
122 | tmp_res = init_nms$Knn_Query(x[2, ], k = knns)
123 |
124 | testthat::expect_true( inherits(tmp_res, 'list') && length(tmp_res) == 2 && all(unlist(lapply(tmp_res, length)) == knns) )
125 | })
126 |
127 |
128 | testthat::test_that("the NMSlib class works with default settings [ and 'Time_Params' is a list of parameters ]", {
129 |
130 | skip_test_if_no_python()
131 | skip_test_if_no_module('nmslib')
132 |
133 | TIME_PARAMS = list(efSearch = 50)
134 |
135 | init_nms = NMSlib$new(input_data = x, Index_Params = NULL, Time_Params = TIME_PARAMS, space='l1', space_params = NULL,
136 | method = 'hnsw', data_type = 'DENSE_VECTOR', dtype = 'FLOAT', index_filepath = NULL, print_progress = FALSE)
137 |
138 | knns = 5
139 | tmp_res = init_nms$knn_Query_Batch(x, k = knns)
140 |
141 | testthat::expect_true( inherits(tmp_res, 'list') && length(tmp_res) == 2 && sum(unlist(lapply(tmp_res, function(x) inherits(x, 'matrix')))) == 2 &&
142 | sum(unlist(lapply(tmp_res, function(x) ncol(x) == knns))) == 2)
143 | })
144 |
145 |
146 |
147 |
148 | # tests for 'KernelKnn_nmslib' function
149 | #--------------------------------------
150 |
151 |
152 | testthat::test_that("the KernelKnn_nmslib function works with default settings [ regression ]", {
153 |
154 | skip_test_if_no_python()
155 | skip_test_if_no_module('nmslib')
156 |
157 | tmp_knn = KernelKnn_nmslib(data = x, TEST_data = NULL, y = y_reg, k = 5, h = 1.0, weights_function = NULL, Levels = NULL, Index_Params = NULL,
158 | Time_Params = NULL, space='l1', space_params = NULL, method = 'hnsw', data_type = 'DENSE_VECTOR',
159 | dtype = 'FLOAT', index_filepath = NULL, print_progress = FALSE, num_threads = 1)
160 |
161 | testthat::expect_true( inherits(tmp_knn, 'numeric') && length(tmp_knn) == nrow(x) )
162 | })
163 |
164 |
165 |
166 | testthat::test_that("the KernelKnn_nmslib function works with default settings [ binary classification ]", {
167 |
168 | skip_test_if_no_python()
169 | skip_test_if_no_module('nmslib')
170 |
171 | tmp_knn = KernelKnn_nmslib(data = x, TEST_data = NULL, y = y_BINclass, k = 5, h = 1.0, weights_function = NULL, Levels = sort(unique(y_BINclass)), Index_Params = NULL,
172 | Time_Params = NULL, space='l1', space_params = NULL, method = 'hnsw', data_type = 'DENSE_VECTOR',
173 | dtype = 'FLOAT', index_filepath = NULL, print_progress = FALSE, num_threads = 1)
174 |
175 | testthat::expect_true( inherits(tmp_knn, 'matrix') && nrow(tmp_knn) == nrow(x) && ncol(tmp_knn) == length(unique(y_BINclass)) )
176 | })
177 |
178 |
179 |
180 | testthat::test_that("the KernelKnn_nmslib function works with default settings [ binary classification AND TEST_data is not NULL ]", {
181 |
182 | skip_test_if_no_python()
183 | skip_test_if_no_module('nmslib')
184 |
185 | set.seed(2)
186 | samp = sample(1:nrow(x), round(0.8 * nrow(x)))
187 | samp_ = setdiff(1:nrow(x), samp)
188 |
189 | tmp_knn = KernelKnn_nmslib(data = x[samp, ], TEST_data = x[samp_, ], y = y_BINclass[samp], k = 5, h = 1.0, weights_function = NULL,
190 | Levels = sort(unique(y_BINclass)), Index_Params = NULL, Time_Params = NULL, space='l1', space_params = NULL,
191 | method = 'hnsw', data_type = 'DENSE_VECTOR', dtype = 'FLOAT', index_filepath = NULL, print_progress = FALSE,
192 | num_threads = 1)
193 |
194 | testthat::expect_true( inherits(tmp_knn, 'matrix') && nrow(tmp_knn) == nrow(x[samp_, ]) && ncol(tmp_knn) == length(unique(y_BINclass)) )
195 | })
196 |
197 |
198 |
199 | testthat::test_that("the KernelKnn_nmslib function works with default settings [ multiclass classification ]", {
200 |
201 | skip_test_if_no_python()
202 | skip_test_if_no_module('nmslib')
203 |
204 | tmp_knn = KernelKnn_nmslib(data = x, TEST_data = NULL, y = y_MULTIclass, k = 5, h = 1.0, weights_function = 'uniform', Levels = sort(unique(y_MULTIclass)), Index_Params = NULL,
205 | Time_Params = NULL, space='l1', space_params = NULL, method = 'hnsw', data_type = 'DENSE_VECTOR',
206 | dtype = 'FLOAT', index_filepath = NULL, print_progress = FALSE, num_threads = 1)
207 |
208 | testthat::expect_true( inherits(tmp_knn, 'matrix') && nrow(tmp_knn) == nrow(x) && ncol(tmp_knn) == length(unique(y_MULTIclass)) )
209 | })
210 |
211 |
212 |
213 | # tests for 'KernelKnnCV_nmslib' function
214 | #----------------------------------------
215 |
216 |
217 | testthat::test_that("the KernelKnnCV_nmslib function works with default settings [ regression ]", {
218 |
219 | skip_test_if_no_python()
220 | skip_test_if_no_module('nmslib')
221 |
222 | FOLDS = 4
223 |
224 | tmp_knn = KernelKnnCV_nmslib(data = x, y = y_reg, k = 5, folds = FOLDS, h = 1.0, weights_function = NULL, Levels = NULL, Index_Params = NULL,
225 | Time_Params = NULL, space='l1', space_params = NULL, method = 'hnsw', data_type = 'DENSE_VECTOR',
226 | dtype = 'FLOAT', index_filepath = NULL, print_progress = FALSE, num_threads = 1, seed_num = 1)
227 |
228 | testthat::expect_true( inherits(tmp_knn, 'list') && all(names(tmp_knn) %in% c("preds", "folds")) && all(as.vector(unlist(lapply(tmp_knn, function(x) lapply(x, function(y) length(y))))) == nrow(x) / FOLDS) )
229 | })
230 |
231 |
232 | testthat::test_that("the KernelKnnCV_nmslib function works with default settings [ classification ]", {
233 |
234 | skip_test_if_no_python()
235 | skip_test_if_no_module('nmslib')
236 |
237 | FOLDS = 4
238 |
239 | tmp_knn = KernelKnnCV_nmslib(data = x, y = y_BINclass, k = 5, folds = FOLDS, h = 1.0, weights_function = NULL, Levels = sort(unique(y_BINclass)), Index_Params = NULL,
240 | Time_Params = NULL, space='l1', space_params = NULL, method = 'hnsw', data_type = 'DENSE_VECTOR',
241 | dtype = 'FLOAT', index_filepath = NULL, print_progress = FALSE, num_threads = 1, seed_num = 1)
242 |
243 | testthat::expect_true( inherits(tmp_knn, 'list') && all(names(tmp_knn) %in% c("preds", "folds")) &&
244 | all(as.vector(unlist(lapply(tmp_knn$preds, function(y) nrow(y)))) == nrow(x) / FOLDS) &&
245 | all(as.vector(unlist(lapply(tmp_knn$folds, function(y) length(y)))) == nrow(x) / FOLDS))
246 | })
247 |
248 |
249 |
250 |
251 | # sparse datasets
252 | #----------------
253 |
254 |
255 | testthat::test_that("the NMSlib class works with sparse data in case of 'knn_Query_Batch' [ specify as data_type a 'SPARSE_VECTOR' ]", {
256 |
257 | skip_test_if_no_python()
258 | skip_test_if_no_module(c('nmslib', 'scipy'))
259 |
260 | sparse_x = mat_2scipy_sparse(x, format = 'sparse_row_matrix')
261 |
262 | init_nms = NMSlib$new(input_data = sparse_x, Index_Params = NULL, Time_Params = NULL, space='l1', space_params = NULL,
263 | method = 'hnsw', data_type = 'SPARSE_VECTOR', dtype = 'FLOAT', index_filepath = NULL, print_progress = FALSE)
264 |
265 | knns = 5
266 | tmp_res = init_nms$knn_Query_Batch(sparse_x, k = knns) # it would be tricky to do the same with "Knn_Query" as it will require firstly a python object as input and secondly a sparse unit
267 |
268 | testthat::expect_true( inherits(tmp_res, 'list') && length(tmp_res) == 2 && all(unlist(lapply(tmp_res, ncol)) == knns) )
269 | })
270 |
271 |
272 |
273 | testthat::test_that("the KernelKnn_nmslib function works with sparse data in case of regression [ specify as data_type a 'SPARSE_VECTOR' ]", {
274 |
275 | skip_test_if_no_python()
276 | skip_test_if_no_module(c('nmslib', 'scipy'))
277 |
278 | sparse_x = mat_2scipy_sparse(x, format = 'sparse_row_matrix')
279 |
280 | tmp_knn = KernelKnn_nmslib(data = sparse_x, TEST_data = NULL, y = y_reg, k = 5, h = 1.0, weights_function = NULL, Levels = NULL, Index_Params = NULL,
281 | Time_Params = NULL, space='l1', space_params = NULL, method = 'hnsw', data_type = 'SPARSE_VECTOR',
282 | dtype = 'FLOAT', index_filepath = NULL, print_progress = FALSE, num_threads = 1)
283 |
284 | testthat::expect_true( inherits(tmp_knn, 'numeric') && length(tmp_knn) == unlist(reticulate::py_to_r(sparse_x$shape))[1] )
285 | })
286 |
287 |
288 | #=================================================================================================================================================================================
289 |
290 | # run the following tests on all operating systems except for 'Macintosh'
291 | # [ otherwise it will raise an error due to the fact that the 'scipy-sparse' library ( applied on 'TO_scipy_sparse' function)
292 | # on CRAN is not upgraded and the older version includes a bug ('TypeError : could not interpret data type') ]
293 | # reference : https://github.com/scipy/scipy/issues/5353
294 |
295 |
296 | if (Sys.info()["sysname"] != 'Darwin') {
297 |
298 | testthat::test_that("the KernelKnn and nmslibR packages return the same output in case of 'dense' or 'sparse' matrices (sequential search / brute force)", {
299 |
300 | skip_test_if_no_python()
301 | skip_test_if_no_module(c('nmslib', 'scipy'))
302 |
303 | mt_2_sprm = mat_2scipy_sparse(X, format = 'sparse_row_matrix') # first case : R-matrix to scipy-sparse-row-matrix
304 | mt_2_dgr = as(X, "dgRMatrix")
305 | dgr_2_scsp = TO_scipy_sparse(R_sparse_matrix = mt_2_dgr) # second case : R-sparse-matrix to scipy-sparse-row-matrix
306 |
307 |
308 | # test that both 'dense knn' (using an R object) and 'sparse knn' (using a python scipy sparse object) return the same output
309 | #----------------------------------------------------------------------------------------------------------------------------
310 |
311 | dist_knn = KernelKnn::knn.index.dist(X, TEST_data = NULL, k = 5, method = "euclidean") # the corresponding distance for 'euclidean' in nmslibR is 'l2' or 'l2_sparse' (in case of sparse matrices). Page 31 of manual.
312 |
313 |
314 | # nmslibR with "dense" data and sequential search
315 | #------------------------------------------------
316 |
317 | init_nms = NMSlib$new(input_data = X, space = "l2", method = 'seq_search',
318 | data_type = 'DENSE_VECTOR', dtype = 'FLOAT', print_progress = F)
319 |
320 | all_dat = init_nms$knn_Query_Batch(X, k = 5, num_threads = 1)
321 |
322 |
323 |
324 | # nmslibR with "sparse" data and sequential search
325 | #-------------------------------------------------
326 |
327 | init_nms_spr = NMSlib$new(input_data = dgr_2_scsp, space = "l2_sparse", method = 'seq_search',
328 | data_type = 'SPARSE_VECTOR', dtype = 'FLOAT', print_progress = F)
329 |
330 | all_dat_spr = init_nms_spr$knn_Query_Batch(dgr_2_scsp, k = 5, num_threads = 1)
331 |
332 |
333 | # all three outputs (dist_knn, all_dat, all_dat_spr) must return approximately equal results
334 | #-------------------------------------------------------------------------------------------
335 |
336 | # indices [ first 6 rows ]
337 |
338 | tmp1 = identical(dist_knn$train_knn_idx[1:6, ], all_dat$knn_idx[1:6, ])
339 | tmp2 = identical(all_dat_spr$knn_idx[1:6, ], all_dat$knn_idx[1:6, ])
340 | tmp3 = identical(dist_knn$train_knn_idx[1:6, ], all_dat_spr$knn_idx[1:6, ])
341 |
342 | # distances [ last 6 rows ] -- approximately equal ( use of round() function )
343 |
344 | tmp_row1 = identical(round(tail(dist_knn$train_knn_dist), 4), round(tail(all_dat$knn_dist), 4))
345 | tmp_row2 = identical(round(tail(all_dat_spr$knn_dist), 4), round(tail(all_dat$knn_dist), 4))
346 | tmp_row3 = identical(round(tail(dist_knn$train_knn_dist), 4), round(tail(all_dat_spr$knn_dist), 4))
347 |
348 | testthat::expect_true( all(tmp1, tmp2, tmp3, tmp_row1, tmp_row2, tmp_row3) )
349 | })
350 | }
351 |
352 |
353 | #---------------------------------------------------------
354 | # THE FOLLOWING TWO FUNCTIONS DO NOT WORK WITH SPARSE DATA [ probably it has to do with subsetting / indexing of sparse matrices (does not work as in dense matrices), especially if I split the data in two or more parts ]
355 | #---------------------------------------------------------
356 |
357 |
358 | # testthat::test_that("the NMSlib class works with sparse data in case of 'Knn_Query' [ specify as data_type a 'SPARSE_VECTOR' ]", {
359 | #
360 | # skip_test_if_no_module(c('nmslib', 'scipy'))
361 | #
362 | # sparse_x = mat_2scipy_sparse(x, format = 'sparse_row_matrix')
363 | #
364 | # init_nms = NMSlib$new(input_data = sparse_x, Index_Params = NULL, Time_Params = NULL, space='l1_sparse', space_params = NULL,
365 | #
366 | # method = 'hnsw', data_type = 'SPARSE_VECTOR', dtype = 'FLOAT', index_filepath = NULL, print_progress = FALSE)
367 | #
368 | # knns = 5
369 | #
370 | # tmp_res = init_nms$Knn_Query( sparse_x$getrow(1), k = knns) # use 'getrow() to subset the sparse matrix [ DOES NOT WORK ]
371 | #
372 | # testthat::expect_true( inherits(tmp_res, 'list') && length(tmp_res) == 2 && all(unlist(lapply(tmp_res, ncol)) == knns) )
373 | # })
374 | #
375 | #
376 | #
377 | #
378 | # testthat::test_that("the KernelKnnCV_nmslib function works with sparse data in case of classification [ specify as data_type a 'SPARSE_VECTOR' ]", {
379 | #
380 | # skip_test_if_no_module(c('nmslib', 'scipy'))
381 | #
382 | # dgcM = Matrix::Matrix(data = sample(c(rep(0.0, 5), runif(2)), 1000, replace = T), nrow = 100,
383 | #
384 | # ncol = 10, byrow = TRUE,
385 | #
386 | # sparse = TRUE)
387 | #
388 | # FOLDS = 4
389 | #
390 | # tmp_knn = KernelKnnCV_nmslib(data = dgcM, y = y_BINclass, k = 5, folds = FOLDS, h = 1.0, weights_function = NULL, Levels = sort(unique(y_BINclass)), # splitting the dgcM internally and creating scipy-sparse sub-matrices returns an error when inputing to the function
391 | #
392 | # Index_Params = NULL, Time_Params = NULL, space='l1_sparse', space_params = NULL, method = 'hnsw', data_type = 'SPARSE_VECTOR',
393 | #
394 | # dtype = 'FLOAT', index_filepath = NULL, print_progress = FALSE, num_threads = 1, seed_num = 1)
395 | #
396 | # testthat::expect_true( inherits(tmp_knn, 'list') && all(names(tmp_knn) %in% c("preds", "folds")) &&
397 | # all(as.vector(unlist(lapply(tmp_knn$preds, function(y) nrow(y)))) == nrow(x) / FOLDS) &&
398 | # all(as.vector(unlist(lapply(tmp_knn$folds, function(y) length(y)))) == nrow(x) / FOLDS))
399 | # })
400 |
401 |
402 | #=================================================================================================================================================================================
403 |
--------------------------------------------------------------------------------
/tic.R:
--------------------------------------------------------------------------------
1 | # installs dependencies, runs R CMD check, runs covr::codecov()
2 | do_package_checks()
3 |
4 | if (ci_on_ghactions() && ci_has_env("BUILD_PKGDOWN")) {
5 | # creates pkgdown site and pushes to gh-pages branch
6 | # only for the runner with the "BUILD_PKGDOWN" env var set
7 | do_pkgdown()
8 | }
9 |
--------------------------------------------------------------------------------
/vignettes/the_nmslibR_package.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Non Metric Space ( Approximate ) Library in R"
3 | author: "Lampros Mouselimis"
4 | date: "`r Sys.Date()`"
5 | output: rmarkdown::html_vignette
6 | vignette: >
7 | %\VignetteIndexEntry{Non Metric Space ( Approximate ) Library in R}
8 | %\VignetteEngine{knitr::rmarkdown}
9 | %\VignetteEncoding{UTF-8}
10 | ---
11 |
12 |
13 |
14 | The **nmslibR** package is a wrapper of [*NMSLIB*](https://github.com/nmslib/nmslib), which according to the authors "... is a similarity search library and a toolkit for evaluation of similarity search methods. The goal of the project is to create an effective and comprehensive toolkit for searching in generic non-metric spaces. Being comprehensive is important, because no single method is likely to be sufficient in all cases. Also note that exact solutions are hardly efficient in high dimensions and/or non-metric spaces. Hence, the main focus is on approximate methods".
15 |
16 | I searched for some time (before wrapping NMSLIB) for a nearest-neighbor library that works with high-dimensional data and scales to big datasets. I had already written a package for k-nearest-neighbor search ([KernelKnn](https://CRAN.R-project.org/package=KernelKnn)); however, it is based on brute force and becomes slow when the data consists of many rows. The *nmslibR* package, besides the main functionality of the NMSLIB python library, also includes an approximate kernel k-nearest-neighbors function, which, as shown later in this vignette, is both fast and accurate. A comparison of NMSLIB with other popular approximate k-nearest-neighbor methods can be found [here](https://github.com/erikbern/ann-benchmarks).
17 |
18 |
19 |
20 | The NMSLIB Library,
21 |
22 | * is a collection of search methods for generic spaces
23 | * has both metric and non-metric search algorithms
24 | * has both exact and approximate search algorithms
25 | * is an evaluation toolkit that simplifies experimentation and processing of results
26 | * is extensible (new spaces and methods can be added)
27 | * is designed to be efficient
28 |
29 |
30 |
31 | Details can be found in the [NMSLIB-manual](https://github.com/nmslib/nmslib/blob/master/manual/latex/manual.pdf).
32 |
33 |
34 |
35 |
36 | #### The nmslibR package
37 |
38 |
39 |
40 | The *nmslibR* package includes the following R6-class / functions,
41 |
42 |
43 |
44 | ##### **class**
45 |
46 |
47 |
48 |
49 |
50 | | NMSlib |
51 | | :------------------: |
52 | | Knn_Query() |
53 | | knn_Query_Batch() |
54 | | save_Index() |
55 |
56 |
57 |
58 |
59 |
60 |
61 | ##### **functions**
62 |
63 |
64 | **UPDATE 10-05-2018** : Beginning with version **1.0.2**, the **dgCMatrix_2scipy_sparse** function was renamed to **TO_scipy_sparse** and now accepts either a *dgCMatrix* or a *dgRMatrix* as input. For sparse matrices, the appropriate format for the nmslibR package is the **dgRMatrix** format (*scipy.sparse.csr_matrix*).
65 |
66 |
67 |
68 |
69 | | KernelKnn_nmslib() |
70 | | :------------------------|
71 |
72 | | KernelKnnCV_nmslib() |
73 | | :------------------------|
74 |
75 | | TO_scipy_sparse() |
76 | | :-----------------|
77 |
78 | | mat_2scipy_sparse() |
79 | | :-------------------|
80 |
81 |
82 |
83 |
84 | The package documentation includes details and examples for the R6-class and functions. I'll start by explaining how to work with sparse matrices, since the input can also be a **python sparse matrix**.
85 |
86 |
87 |
88 |
89 | #### Sparse matrices as input
90 |
91 |
92 |
93 | The nmslibR package includes two functions (**mat_2scipy_sparse** and **TO_scipy_sparse**) which allow the user to convert from a *matrix* / *sparse matrix* (*dgCMatrix*, *dgRMatrix*) to a *scipy sparse matrix* (*scipy.sparse.csc_matrix*, *scipy.sparse.csr_matrix*),
94 |
95 |
96 |
97 | ```{r, eval = F, echo = T}
98 |
99 | library(nmslibR)
100 |
101 | # conversion from a matrix object to a scipy sparse matrix
102 | #----------------------------------------------------------
103 |
104 | set.seed(1)
105 |
106 | x = matrix(runif(1000), nrow = 100, ncol = 10)
107 |
108 | x_sparse = mat_2scipy_sparse(x, format = "sparse_row_matrix")
109 |
110 | print(dim(x))
111 |
112 | [1] 100 10
113 |
114 | print(x_sparse$shape)
115 |
116 | (100, 10)
117 |
118 | ```
119 |
120 |
121 |
122 |
123 | ```{r, eval = F, echo = T}
124 |
125 | # conversion from a dgCMatrix object to a scipy sparse matrix
126 | #-------------------------------------------------------------
127 |
128 | data = c(1, 0, 2, 0, 0, 3, 4, 5, 6)
129 |
130 |
131 | # 'dgCMatrix' sparse matrix
132 | #--------------------------
133 |
134 | dgcM = Matrix::Matrix(data = data, nrow = 3,
135 |
136 | ncol = 3, byrow = TRUE,
137 |
138 | sparse = TRUE)
139 |
140 | print(dim(dgcM))
141 |
142 | [1] 3 3
143 |
144 | x_sparse = TO_scipy_sparse(dgcM)
145 |
146 | print(x_sparse$shape)
147 |
148 | (3, 3)
149 |
150 |
151 | # 'dgRMatrix' sparse matrix
152 | #--------------------------
153 |
154 | dgrM = as(dgcM, "RsparseMatrix")
155 |
156 | class(dgrM)
157 |
158 | # [1] "dgRMatrix"
159 | # attr(,"package")
160 | # [1] "Matrix"
161 |
162 | print(dim(dgrM))
163 |
164 | [1] 3 3
165 |
166 | res_dgr = TO_scipy_sparse(dgrM)
167 |
168 | print(res_dgr$shape)
169 |
170 | (3, 3)
171 |
172 | ```
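
Because the row-oriented **dgRMatrix** is the format that maps to *scipy.sparse.csr_matrix*, it can be useful to build it on the R side before calling **TO_scipy_sparse**. The following is a minimal sketch that uses only the *Matrix* package, so it runs without python; the object names (`dense_mt`, `dgc_mt`, `dgr_mt`) are illustrative and not part of nmslibR.

```r
library(Matrix)

set.seed(1)

# a small dense matrix with many zeros (illustrative data)
dense_mt <- matrix(sample(c(0, 0, 0, 1, 2), 30, replace = TRUE), nrow = 6, ncol = 5)

# dense matrix -> column-oriented sparse matrix -> row-oriented sparse matrix
dgc_mt <- Matrix::Matrix(dense_mt, sparse = TRUE)
dgr_mt <- as(dgc_mt, "RsparseMatrix")

class(dgr_mt)[1]
dim(dgr_mt)

# 'dgr_mt' could then be passed to TO_scipy_sparse() (this step requires python and scipy)
```

The conversion only changes the storage layout; the values and dimensions stay the same as in the dense input.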
173 |
174 |
175 |
176 |
177 |
178 | #### The NMSlib R6-class
179 |
180 |
181 |
182 |
183 | The parameter settings for the *NMSlib* R6-class can be found in the [Non-Metric Space Library (NMSLIB) Manual](https://github.com/nmslib/nmslib/blob/master/manual/latex/manual.pdf), which explains the NMSLIB Library in detail. In the following code chunk, I'll show the functionality of the methods included using a [data set from my Github repository](https://github.com/mlampros/DataSets) (it appears as [.ipynb notebook in the nmslib Github repository](https://github.com/nmslib/nmslib/blob/master/python_bindings/notebooks/search_sift_uint8.ipynb))
184 |
185 |
186 |
187 | ```{r, eval = F, echo = T}
188 |
189 |
190 | library(nmslibR)
191 |
192 |
193 | # download the data from my Github repository (tested on a Linux OS)
194 | #-------------------------------------------------------------------
195 |
196 | system("wget https://raw.githubusercontent.com/mlampros/DataSets/master/sift_10k.txt")
197 |
198 |
199 | # load the data in the R session
200 | #-------------------------------
201 |
202 | sift_10k = read.table("sift_10k.txt", quote="\"", comment.char="")   # 'wget' saves the file in the current working directory
203 |
204 |
205 | # index parameters
206 | #-----------------
207 |
208 | M = 15
209 | efC = 100
210 | num_threads = 5
211 |
212 | index_params = list('M'= M, 'indexThreadQty' = num_threads, 'efConstruction' = efC,
213 |
214 | 'post' = 0, 'skip_optimized_index' = 1 )
215 |
216 |
217 | # query-time parameters
218 | #----------------------
219 |
220 | efS = 100
221 |
222 | query_time_params = list('efSearch' = efS)
223 |
224 |
225 | # Number of neighbors
226 | #--------------------
227 |
228 | K = 100
229 |
230 |
231 | # space to use
232 | #---------------
233 |
234 | space_name = 'l2sqr_sift'
235 |
236 |
237 | # initialize NMSlib [ the data should be a matrix ]
238 | #--------------------------------------------------
239 |
240 | init_nms = NMSlib$new(input_data = as.matrix(sift_10k), Index_Params = index_params,
241 |
242 | Time_Params = query_time_params, space = space_name,
243 |
244 | space_params = NULL, method = 'hnsw',
245 |
246 | data_type = 'DENSE_UINT8_VECTOR', dtype = 'INT',
247 |
248 | index_filepath = NULL, print_progress = FALSE)
249 |
250 | ```
251 |
252 |
253 |
254 | ```{r, eval = F, echo = T}
255 |
256 | # returns a list with the indices and distances of the k nearest neighbors
257 | #-------------------------------------------------------------------------
258 |
259 | init_nms$Knn_Query(query_data_row = as.matrix(sift_10k[1, ]), k = 5)
260 |
261 | ```
262 |
263 |
264 |
265 | ```{r, eval = F, echo = T}
266 |
267 | [[1]]
268 | [1] 2 6 4585 9256 140 # indices
269 |
270 | [[2]]
271 | [1] 18724 24320 68158 69067 70321 # distances
272 |
273 | ```
274 |
275 |
276 |
277 | ```{r, eval = F, echo = T}
278 |
279 | # returns knn's for all data
280 | #---------------------------
281 |
282 | all_dat = init_nms$knn_Query_Batch(as.matrix(sift_10k), k = 5, num_threads = 1)
283 |
284 | str(all_dat)
285 |
286 | ```
287 |
288 |
289 |
290 | ```{r, eval = F, echo = T}
291 |
292 | # a list of indices and distances for all observations
293 | #------------------------------------------------------
294 |
295 | List of 2
296 | $ knn_idx : num [1:10000, 1:5] 3 4 1 2 13 14 1 2 30 31 ...
297 | $ knn_dist: num [1:10000, 1:5] 18724 14995 18724 14995 21038 ...
298 |
299 | ```
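
To make the shape of these results easier to interpret, here is a small base-R sketch of an exact (brute-force) nearest-neighbor query on toy data. It is only an illustration of what "indices and distances of the k nearest neighbors" means, not how NMSLIB computes them; the data and the names `toy`, `d` and `res` are assumptions made for this example.

```r
set.seed(1)

# toy data: 20 observations with 10 features (illustrative only)
toy <- matrix(runif(200), nrow = 20, ncol = 10)
k <- 5

# squared Euclidean distance from every row to the first row
d <- rowSums(sweep(toy, 2, toy[1, ])^2)

# indices of the k smallest distances and the distances themselves
idx <- order(d)[1:k]
res <- list(knn_idx = idx, knn_dist = d[idx])

str(res)
```

In this sketch the query row is part of the data, so it appears as its own nearest neighbor with distance 0.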
300 |
301 |
302 |
303 | Details on the various methods and parameter settings can be found in the [manual of the NMSLIB python Library](https://github.com/nmslib/nmslib/blob/master/manual/latex/manual.pdf).
304 |
305 |
306 |
307 |
308 | #### KernelKnn using the nmslibR package
309 |
310 |
311 |
312 |
313 | In the [KernelKnn vignette](https://CRAN.R-project.org/package=KernelKnn) (*Image classification of the MNIST and CIFAR-10 data using KernelKnn and HOG (histogram of oriented gradients)*) I experimented with the **mnist dataset**, and a cross-validated kernel k-nearest-neighbors model gave **98.4 % accuracy** based on **HOG** (histogram of oriented gradients) features. However, it took almost **30 minutes** (depending on the system configuration) to complete using **6 threads**. I've implemented a similar function using NMSLIB (**KernelKnnCV_nmslib**), so in the next code chunks I'll use the *same parameter settings* and compare *computation time* and *accuracy*.
314 |
315 |
316 |
317 | First load the data,
318 |
319 |
320 |
321 | ```{r, eval = F, echo = T}
322 |
323 | # using system('wget..') on a linux OS
324 |
325 | system("wget https://raw.githubusercontent.com/mlampros/DataSets/master/mnist.zip")
326 |
327 | mnist <- read.table(unz("mnist.zip", "mnist.csv"), nrows = 70000, header = T,
328 |
329 | quote = "\"", sep = ",")
330 |
331 | ```
332 |
333 |
334 |
335 |
336 | ```{r, eval = F, echo = T}
337 |
338 | X = mnist[, -ncol(mnist)]
339 | dim(X)
340 |
341 | ## [1] 70000 784
342 |
343 | # the 'KernelKnnCV_nmslib' function requires that the labels are numeric and start from 1 : Inf
344 |
345 | y = mnist[, ncol(mnist)] + 1
346 | table(y)
347 |
348 | ## y
349 | ## 1 2 3 4 5 6 7 8 9 10
350 | ## 6903 7877 6990 7141 6824 6313 6876 7293 6825 6958
351 |
352 |
353 | # evaluation metric
354 |
355 | acc = function (y_true, preds) {
356 |
357 | out = table(y_true, max.col(preds, ties.method = "random"))
358 |
359 | acc = sum(diag(out))/sum(out)
360 |
361 | acc
362 | }
363 |
364 | ```
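
As a quick sanity check of the evaluation metric, the same function can be applied to a tiny hand-made example. The toy labels and probabilities below are assumptions made for illustration (no ties, and every class is predicted at least once, which the `diag()`-based computation relies on).

```r
# same evaluation metric as above
acc <- function(y_true, preds) {
  out <- table(y_true, max.col(preds, ties.method = "random"))
  sum(diag(out)) / sum(out)
}

# four observations, two classes (labels start from 1)
y_toy <- c(1, 2, 1, 2)

# one row of class probabilities per observation
preds_toy <- matrix(c(0.9, 0.1,
                      0.2, 0.8,
                      0.3, 0.7,
                      0.4, 0.6), ncol = 2, byrow = TRUE)

acc_toy <- acc(y_toy, preds_toy)
acc_toy

## [1] 0.75
```

Three of the four observations get the correct label (the third is predicted as class 2 instead of class 1), hence the accuracy of 0.75.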
365 |
366 |
367 | then compute the HOG features,
368 |
369 |
370 |
371 | ```{r, eval = F, echo = T}
372 |
373 | library(OpenImageR)
374 |
375 | hog = HOG_apply(X, cells = 6, orientations = 9, rows = 28, columns = 28, threads = 6)
376 |
377 | ##
378 | ## time to complete : 2.101281 secs
379 |
380 | dim(hog)
381 |
382 | ## [1] 70000 324
383 |
384 | ```
385 |
386 |
387 | then compute the **approximate** kernel k-nearest-neighbors using the **cosine** distance,
388 |
389 |
390 |
391 |
392 | ```{r, eval = F, echo = T}
393 |
394 | # parameters for 'KernelKnnCV_nmslib'
395 | #------------------------------------
396 |
397 | M = 30
398 | efC = 100
399 | num_threads = 6
400 |
401 | index_params = list('M'= M, 'indexThreadQty' = num_threads, 'efConstruction' = efC,
402 |
403 | 'post' = 0, 'skip_optimized_index' = 1 )
404 |
405 |
406 | efS = 100
407 |
408 | query_time_params = list('efSearch' = efS)
409 |
410 |
411 | # approximate kernel knn
412 | #-----------------------
413 |
414 | fit_hog = KernelKnnCV_nmslib(hog, y, k = 20, folds = 4, h = 1,
415 | weights_function = 'biweight_tricube_MULT',
416 | Levels = sort(unique(y)), Index_Params = index_params,
417 | Time_Params = query_time_params, space = "cosinesimil",
418 | space_params = NULL, method = "hnsw", data_type = "DENSE_VECTOR",
419 | dtype = "FLOAT", index_filepath = NULL, print_progress = FALSE,
420 | num_threads = 6, seed_num = 1)
421 |
422 |
423 | # cross-validation starts ..
424 |
425 | # |=================================================================================| 100%
426 |
427 | # time to complete : 32.88805 secs
428 |
429 |
430 | str(fit_hog)
431 |
432 |
433 | ```
434 |
435 |
436 |
437 | ```{r, eval = F, echo = T}
438 |
439 | List of 2
440 | $ preds:List of 4
441 | ..$ : num [1:17500, 1:10] 0 0 0 0 0 0 0 0 0 0 ...
442 | ..$ : num [1:17500, 1:10] 0 0 0 0 1 ...
443 | ..$ : num [1:17500, 1:10] 0 0 0 0 0 ...
444 | ..$ : num [1:17500, 1:10] 0 0 0 0 0 0 0 0 0 0 ...
445 | $ folds:List of 4
446 | ..$ fold_1: int [1:17500] 49808 21991 42918 7967 49782 28979 64440 49809 30522 36673 ...
447 | ..$ fold_2: int [1:17500] 51122 9469 58021 45228 2944 58052 65074 17709 2532 31262 ...
448 | ..$ fold_3: int [1:17500] 33205 40078 68177 32620 52721 18981 19417 53922 19102 67206 ...
449 | ..$ fold_4: int [1:17500] 28267 41652 28514 34525 68534 13294 48759 47521 69395 41408 ...
450 |
451 | ```
452 |
453 |
454 |
455 | ```{r, eval = F, echo = T}
456 |
457 | acc_fit_hog = unlist(lapply(1:length(fit_hog$preds),
458 |
459 | function(x) acc(y[fit_hog$folds[[x]]],
460 |
461 | fit_hog$preds[[x]])))
462 | acc_fit_hog
463 |
464 | ## [1] 0.9768000 0.9786857 0.9763429 0.9760000
465 |
466 | cat('mean accuracy for hog-features using cross-validation :', mean(acc_fit_hog), '\n')
467 |
468 | ## mean accuracy for hog-features using cross-validation : 0.9769571
469 |
470 | ```
471 |
472 |
473 | It took approx. **33 seconds** to return with an accuracy of **97.7 %**. This is almost **47 times faster** than KernelKnn's corresponding function (brute force), with a **slightly lower accuracy** rate (the *braycurtis* distance metric might be better suited for this dataset).
474 |
475 | I also ran the corresponding brute-force algorithm of the NMSLIB Library by setting the *method* parameter to **seq_search**,
476 |
477 |
478 |
479 | ```{r, eval = F, echo = T}
480 |
481 |
482 | # brute force of NMSLIB [ here we set 'Index_Params' and 'Time_Params' to NULL ]
483 | #----------------------
484 |
485 | fit_hog_seq = KernelKnnCV_nmslib(hog, y, k = 20, folds = 4, h = 1,
486 | weights_function = 'biweight_tricube_MULT',
487 | Levels = sort(unique(y)), Index_Params = NULL,
488 | Time_Params = NULL, space = "cosinesimil",
489 | space_params = NULL, method = "seq_search",
490 | data_type = "DENSE_VECTOR", dtype = "FLOAT",
491 | index_filepath = NULL, print_progress = FALSE,
492 | num_threads = 6, seed_num = 1)
493 |
494 |
495 | # cross-validation starts ..
496 |
497 | # |=================================================================================| 100%
498 |
499 | # time to complete : 4.506177 mins
500 |
501 |
502 | acc_fit_hog_seq = unlist(lapply(1:length(fit_hog_seq$preds),
503 |
504 | function(x) acc(y[fit_hog_seq$folds[[x]]],
505 |
506 | fit_hog_seq$preds[[x]])))
507 | acc_fit_hog_seq
508 |
509 | ## [1] 0.9785143 0.9802286 0.9783429 0.9784571
510 |
511 | cat('mean accuracy for hog-features using cross-validation :', mean(acc_fit_hog_seq), '\n')
512 |
513 | ## mean accuracy for hog-features using cross-validation : 0.9788857
514 |
515 |
516 | ```
517 |
518 |
519 |
520 | The brute-force algorithm of the NMSLIB Library is almost **6 times faster** than KernelKnn, with an accuracy of approx. **97.9 %**.
521 |
522 |
--------------------------------------------------------------------------------