├── .github
│   ├── CODE_OF_CONDUCT.md
│   └── CONTRIBUTING.md
├── .gitignore
├── LICENSE
├── README.md
├── alt-graph-index
│   ├── Makefile
│   ├── README.md
│   ├── altid.swig
│   ├── altid_impl.cpp
│   ├── altid_impl.h
│   ├── graph_dynamic_bench_invlists.py
│   └── test_altid.py
├── custom_invlist_cpp
│   ├── Makefile
│   ├── README.md
│   ├── __pycache__
│   │   └── test_compressed_ivfs.cpython-310-pytest-7.4.0.pyc
│   ├── bench_invlists.py
│   ├── codec.cpp
│   ├── codec.h
│   ├── custom_invlists.swig
│   ├── custom_invlists_impl.cpp
│   ├── custom_invlists_impl.h
│   ├── search_ivf_qinco.py
│   ├── test_codec.cpp
│   └── test_compressed_ivfs.py
├── elias_fano.hpp
├── fenwick_tree_cpp
│   ├── Makefile
│   ├── bin
│   │   └── test_fenwick_tree
│   ├── src
│   │   ├── fenwick_tree.cpp
│   │   ├── fenwick_tree.h
│   │   └── fenwick_tree.i
│   └── tests
│       ├── __pycache__
│       │   └── test_FenwickTree.cpython-310-pytest-7.4.0.pyc
│       ├── test_FenwickTree.py
│       └── test_fenwick_tree.cpp
├── graph_static_bench_invlists.py
├── install-dependencies.sh
├── qinco_datasets.py
└── zuckerli-baseline
    ├── README.md
    └── generate_graph_edgelists.py

--------------------------------------------------------------------------------
/.github/CODE_OF_CONDUCT.md:
--------------------------------------------------------------------------------
1 | # Code of Conduct
2 | 
3 | ## Our Pledge
4 | 
5 | In the interest of fostering an open and welcoming environment, we as
6 | contributors and maintainers pledge to make participation in our project and
7 | our community a harassment-free experience for everyone, regardless of age, body
8 | size, disability, ethnicity, sex characteristics, gender identity and expression,
9 | level of experience, education, socio-economic status, nationality, personal
10 | appearance, race, religion, or sexual identity and orientation.
11 | 
12 | ## Our Standards
13 | 
14 | Examples of behavior that contributes to creating a positive environment
15 | include:
16 | 
17 | * Using welcoming and inclusive language
18 | * Being respectful of differing viewpoints and experiences
19 | * Gracefully accepting constructive criticism
20 | * Focusing on what is best for the community
21 | * Showing empathy towards other community members
22 | 
23 | Examples of unacceptable behavior by participants include:
24 | 
25 | * The use of sexualized language or imagery and unwelcome sexual attention or
26 |   advances
27 | * Trolling, insulting/derogatory comments, and personal or political attacks
28 | * Public or private harassment
29 | * Publishing others' private information, such as a physical or electronic
30 |   address, without explicit permission
31 | * Other conduct which could reasonably be considered inappropriate in a
32 |   professional setting
33 | 
34 | ## Our Responsibilities
35 | 
36 | Project maintainers are responsible for clarifying the standards of acceptable
37 | behavior and are expected to take appropriate and fair corrective action in
38 | response to any instances of unacceptable behavior.
39 | 
40 | Project maintainers have the right and responsibility to remove, edit, or
41 | reject comments, commits, code, wiki edits, issues, and other contributions
42 | that are not aligned to this Code of Conduct, or to ban temporarily or
43 | permanently any contributor for other behaviors that they deem inappropriate,
44 | threatening, offensive, or harmful.
45 | 
46 | ## Scope
47 | 
48 | This Code of Conduct applies within all project spaces, and it also applies when
49 | an individual is representing the project or its community in public spaces.
50 | Examples of representing a project or community include using an official 51 | project e-mail address, posting via an official social media account, or acting 52 | as an appointed representative at an online or offline event. Representation of 53 | a project may be further defined and clarified by project maintainers. 54 | 55 | This Code of Conduct also applies outside the project spaces when there is a 56 | reasonable belief that an individual's behavior may have a negative impact on 57 | the project or its community. 58 | 59 | ## Enforcement 60 | 61 | Instances of abusive, harassing, or otherwise unacceptable behavior may be 62 | reported by contacting the project team at . All 63 | complaints will be reviewed and investigated and will result in a response that 64 | is deemed necessary and appropriate to the circumstances. The project team is 65 | obligated to maintain confidentiality with regard to the reporter of an incident. 66 | Further details of specific enforcement policies may be posted separately. 67 | 68 | Project maintainers who do not follow or enforce the Code of Conduct in good 69 | faith may face temporary or permanent repercussions as determined by other 70 | members of the project's leadership. 71 | 72 | ## Attribution 73 | 74 | This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4, 75 | available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html 76 | 77 | [homepage]: https://www.contributor-covenant.org 78 | 79 | For answers to common questions about this code of conduct, see 80 | https://www.contributor-covenant.org/faq 81 | 82 | -------------------------------------------------------------------------------- /.github/CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing to NeuralCompression 2 | 3 | We want to make contributing to this project as easy and transparent as 4 | possible. 5 | 6 | ## Pull Requests 7 | 8 | We actively welcome your pull requests. 9 | 10 | 1. Fork the repo and create your branch from `main`. 11 | 2. If you've added code that should be tested, add tests. 12 | 3. If you've changed APIs, update the documentation. 13 | 4. Ensure the test suite passes. 14 | 5. Make sure your code lints. 15 | 6. If you haven't already, complete the Contributor License Agreement ("CLA"). 16 | 17 | ## Contributor License Agreement ("CLA") 18 | 19 | In order to accept your pull request, we need you to submit a CLA. You only need 20 | to do this once to work on any of Facebook's open source projects. 21 | 22 | Complete your CLA here: 23 | 24 | ## Issues 25 | 26 | We use GitHub issues to track public bugs. Please ensure your description is 27 | clear and has sufficient instructions to be able to reproduce the issue. 28 | 29 | Facebook has a [bounty program](https://www.facebook.com/whitehat/) for the safe 30 | disclosure of security bugs. In those cases, please go through the process 31 | outlined on that page and do not file a public issue. 32 | 33 | ## License 34 | 35 | By contributing to vector_db_id_compression, you agree that your contributions will be 36 | licensed under the LICENSE file in the root directory of this source tree. 
37 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | alt-graph-index/results-online-graphs/ 2 | custom_invlist_cpp/results-online-ivf/ 3 | results-offline-graphs-rec/ 4 | Random-Edge-Coding/ 5 | succinct/ 6 | 7 | **/*.o 8 | **/*.so 9 | **/*_wrap.cxx 10 | alt-graph-index/altid.py 11 | custom_invlist_cpp/custom_invlists.py 12 | custom_invlist_cpp/test_codec 13 | */core* 14 | */.gdb_history 15 | *.out 16 | *.err 17 | *.csv 18 | *.pickle 19 | *.el 20 | *.el.bin 21 | *.el.bin.comp 22 | alt-graph-index/altid.py 23 | alt-graph-index/altid.cxx -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | 2 | Attribution-NonCommercial 4.0 International 3 | 4 | ======================================================================= 5 | 6 | Creative Commons Corporation ("Creative Commons") is not a law firm and 7 | does not provide legal services or legal advice. Distribution of 8 | Creative Commons public licenses does not create a lawyer-client or 9 | other relationship. Creative Commons makes its licenses and related 10 | information available on an "as-is" basis. Creative Commons gives no 11 | warranties regarding its licenses, any material licensed under their 12 | terms and conditions, or any related information. Creative Commons 13 | disclaims all liability for damages resulting from their use to the 14 | fullest extent possible. 15 | 16 | Using Creative Commons Public Licenses 17 | 18 | Creative Commons public licenses provide a standard set of terms and 19 | conditions that creators and other rights holders may use to share 20 | original works of authorship and other material subject to copyright 21 | and certain other rights specified in the public license below. The 22 | following considerations are for informational purposes only, are not 23 | exhaustive, and do not form part of our licenses. 24 | 25 | Considerations for licensors: Our public licenses are 26 | intended for use by those authorized to give the public 27 | permission to use material in ways otherwise restricted by 28 | copyright and certain other rights. Our licenses are 29 | irrevocable. Licensors should read and understand the terms 30 | and conditions of the license they choose before applying it. 31 | Licensors should also secure all rights necessary before 32 | applying our licenses so that the public can reuse the 33 | material as expected. Licensors should clearly mark any 34 | material not subject to the license. This includes other CC- 35 | licensed material, or material used under an exception or 36 | limitation to copyright. More considerations for licensors: 37 | wiki.creativecommons.org/Considerations_for_licensors 38 | 39 | Considerations for the public: By using one of our public 40 | licenses, a licensor grants the public permission to use the 41 | licensed material under specified terms and conditions. If 42 | the licensor's permission is not necessary for any reason--for 43 | example, because of any applicable exception or limitation to 44 | copyright--then that use is not regulated by the license. Our 45 | licenses grant only permissions under copyright and certain 46 | other rights that a licensor has authority to grant. 
Use of 47 | the licensed material may still be restricted for other 48 | reasons, including because others have copyright or other 49 | rights in the material. A licensor may make special requests, 50 | such as asking that all changes be marked or described. 51 | Although not required by our licenses, you are encouraged to 52 | respect those requests where reasonable. More_considerations 53 | for the public: 54 | wiki.creativecommons.org/Considerations_for_licensees 55 | 56 | ======================================================================= 57 | 58 | Creative Commons Attribution-NonCommercial 4.0 International Public 59 | License 60 | 61 | By exercising the Licensed Rights (defined below), You accept and agree 62 | to be bound by the terms and conditions of this Creative Commons 63 | Attribution-NonCommercial 4.0 International Public License ("Public 64 | License"). To the extent this Public License may be interpreted as a 65 | contract, You are granted the Licensed Rights in consideration of Your 66 | acceptance of these terms and conditions, and the Licensor grants You 67 | such rights in consideration of benefits the Licensor receives from 68 | making the Licensed Material available under these terms and 69 | conditions. 70 | 71 | Section 1 -- Definitions. 72 | 73 | a. Adapted Material means material subject to Copyright and Similar 74 | Rights that is derived from or based upon the Licensed Material 75 | and in which the Licensed Material is translated, altered, 76 | arranged, transformed, or otherwise modified in a manner requiring 77 | permission under the Copyright and Similar Rights held by the 78 | Licensor. For purposes of this Public License, where the Licensed 79 | Material is a musical work, performance, or sound recording, 80 | Adapted Material is always produced where the Licensed Material is 81 | synched in timed relation with a moving image. 82 | 83 | b. Adapter's License means the license You apply to Your Copyright 84 | and Similar Rights in Your contributions to Adapted Material in 85 | accordance with the terms and conditions of this Public License. 86 | 87 | c. Copyright and Similar Rights means copyright and/or similar rights 88 | closely related to copyright including, without limitation, 89 | performance, broadcast, sound recording, and Sui Generis Database 90 | Rights, without regard to how the rights are labeled or 91 | categorized. For purposes of this Public License, the rights 92 | specified in Section 2(b)(1)-(2) are not Copyright and Similar 93 | Rights. 94 | d. Effective Technological Measures means those measures that, in the 95 | absence of proper authority, may not be circumvented under laws 96 | fulfilling obligations under Article 11 of the WIPO Copyright 97 | Treaty adopted on December 20, 1996, and/or similar international 98 | agreements. 99 | 100 | e. Exceptions and Limitations means fair use, fair dealing, and/or 101 | any other exception or limitation to Copyright and Similar Rights 102 | that applies to Your use of the Licensed Material. 103 | 104 | f. Licensed Material means the artistic or literary work, database, 105 | or other material to which the Licensor applied this Public 106 | License. 107 | 108 | g. Licensed Rights means the rights granted to You subject to the 109 | terms and conditions of this Public License, which are limited to 110 | all Copyright and Similar Rights that apply to Your use of the 111 | Licensed Material and that the Licensor has authority to license. 112 | 113 | h. 
Licensor means the individual(s) or entity(ies) granting rights 114 | under this Public License. 115 | 116 | i. NonCommercial means not primarily intended for or directed towards 117 | commercial advantage or monetary compensation. For purposes of 118 | this Public License, the exchange of the Licensed Material for 119 | other material subject to Copyright and Similar Rights by digital 120 | file-sharing or similar means is NonCommercial provided there is 121 | no payment of monetary compensation in connection with the 122 | exchange. 123 | 124 | j. Share means to provide material to the public by any means or 125 | process that requires permission under the Licensed Rights, such 126 | as reproduction, public display, public performance, distribution, 127 | dissemination, communication, or importation, and to make material 128 | available to the public including in ways that members of the 129 | public may access the material from a place and at a time 130 | individually chosen by them. 131 | 132 | k. Sui Generis Database Rights means rights other than copyright 133 | resulting from Directive 96/9/EC of the European Parliament and of 134 | the Council of 11 March 1996 on the legal protection of databases, 135 | as amended and/or succeeded, as well as other essentially 136 | equivalent rights anywhere in the world. 137 | 138 | l. You means the individual or entity exercising the Licensed Rights 139 | under this Public License. Your has a corresponding meaning. 140 | 141 | Section 2 -- Scope. 142 | 143 | a. License grant. 144 | 145 | 1. Subject to the terms and conditions of this Public License, 146 | the Licensor hereby grants You a worldwide, royalty-free, 147 | non-sublicensable, non-exclusive, irrevocable license to 148 | exercise the Licensed Rights in the Licensed Material to: 149 | 150 | a. reproduce and Share the Licensed Material, in whole or 151 | in part, for NonCommercial purposes only; and 152 | 153 | b. produce, reproduce, and Share Adapted Material for 154 | NonCommercial purposes only. 155 | 156 | 2. Exceptions and Limitations. For the avoidance of doubt, where 157 | Exceptions and Limitations apply to Your use, this Public 158 | License does not apply, and You do not need to comply with 159 | its terms and conditions. 160 | 161 | 3. Term. The term of this Public License is specified in Section 162 | 6(a). 163 | 164 | 4. Media and formats; technical modifications allowed. The 165 | Licensor authorizes You to exercise the Licensed Rights in 166 | all media and formats whether now known or hereafter created, 167 | and to make technical modifications necessary to do so. The 168 | Licensor waives and/or agrees not to assert any right or 169 | authority to forbid You from making technical modifications 170 | necessary to exercise the Licensed Rights, including 171 | technical modifications necessary to circumvent Effective 172 | Technological Measures. For purposes of this Public License, 173 | simply making modifications authorized by this Section 2(a) 174 | (4) never produces Adapted Material. 175 | 176 | 5. Downstream recipients. 177 | 178 | a. Offer from the Licensor -- Licensed Material. Every 179 | recipient of the Licensed Material automatically 180 | receives an offer from the Licensor to exercise the 181 | Licensed Rights under the terms and conditions of this 182 | Public License. 183 | 184 | b. No downstream restrictions. 
You may not offer or impose 185 | any additional or different terms or conditions on, or 186 | apply any Effective Technological Measures to, the 187 | Licensed Material if doing so restricts exercise of the 188 | Licensed Rights by any recipient of the Licensed 189 | Material. 190 | 191 | 6. No endorsement. Nothing in this Public License constitutes or 192 | may be construed as permission to assert or imply that You 193 | are, or that Your use of the Licensed Material is, connected 194 | with, or sponsored, endorsed, or granted official status by, 195 | the Licensor or others designated to receive attribution as 196 | provided in Section 3(a)(1)(A)(i). 197 | 198 | b. Other rights. 199 | 200 | 1. Moral rights, such as the right of integrity, are not 201 | licensed under this Public License, nor are publicity, 202 | privacy, and/or other similar personality rights; however, to 203 | the extent possible, the Licensor waives and/or agrees not to 204 | assert any such rights held by the Licensor to the limited 205 | extent necessary to allow You to exercise the Licensed 206 | Rights, but not otherwise. 207 | 208 | 2. Patent and trademark rights are not licensed under this 209 | Public License. 210 | 211 | 3. To the extent possible, the Licensor waives any right to 212 | collect royalties from You for the exercise of the Licensed 213 | Rights, whether directly or through a collecting society 214 | under any voluntary or waivable statutory or compulsory 215 | licensing scheme. In all other cases the Licensor expressly 216 | reserves any right to collect such royalties, including when 217 | the Licensed Material is used other than for NonCommercial 218 | purposes. 219 | 220 | Section 3 -- License Conditions. 221 | 222 | Your exercise of the Licensed Rights is expressly made subject to the 223 | following conditions. 224 | 225 | a. Attribution. 226 | 227 | 1. If You Share the Licensed Material (including in modified 228 | form), You must: 229 | 230 | a. retain the following if it is supplied by the Licensor 231 | with the Licensed Material: 232 | 233 | i. identification of the creator(s) of the Licensed 234 | Material and any others designated to receive 235 | attribution, in any reasonable manner requested by 236 | the Licensor (including by pseudonym if 237 | designated); 238 | 239 | ii. a copyright notice; 240 | 241 | iii. a notice that refers to this Public License; 242 | 243 | iv. a notice that refers to the disclaimer of 244 | warranties; 245 | 246 | v. a URI or hyperlink to the Licensed Material to the 247 | extent reasonably practicable; 248 | 249 | b. indicate if You modified the Licensed Material and 250 | retain an indication of any previous modifications; and 251 | 252 | c. indicate the Licensed Material is licensed under this 253 | Public License, and include the text of, or the URI or 254 | hyperlink to, this Public License. 255 | 256 | 2. You may satisfy the conditions in Section 3(a)(1) in any 257 | reasonable manner based on the medium, means, and context in 258 | which You Share the Licensed Material. For example, it may be 259 | reasonable to satisfy the conditions by providing a URI or 260 | hyperlink to a resource that includes the required 261 | information. 262 | 263 | 3. If requested by the Licensor, You must remove any of the 264 | information required by Section 3(a)(1)(A) to the extent 265 | reasonably practicable. 266 | 267 | 4. 
If You Share Adapted Material You produce, the Adapter's 268 | License You apply must not prevent recipients of the Adapted 269 | Material from complying with this Public License. 270 | 271 | Section 4 -- Sui Generis Database Rights. 272 | 273 | Where the Licensed Rights include Sui Generis Database Rights that 274 | apply to Your use of the Licensed Material: 275 | 276 | a. for the avoidance of doubt, Section 2(a)(1) grants You the right 277 | to extract, reuse, reproduce, and Share all or a substantial 278 | portion of the contents of the database for NonCommercial purposes 279 | only; 280 | 281 | b. if You include all or a substantial portion of the database 282 | contents in a database in which You have Sui Generis Database 283 | Rights, then the database in which You have Sui Generis Database 284 | Rights (but not its individual contents) is Adapted Material; and 285 | 286 | c. You must comply with the conditions in Section 3(a) if You Share 287 | all or a substantial portion of the contents of the database. 288 | 289 | For the avoidance of doubt, this Section 4 supplements and does not 290 | replace Your obligations under this Public License where the Licensed 291 | Rights include other Copyright and Similar Rights. 292 | 293 | Section 5 -- Disclaimer of Warranties and Limitation of Liability. 294 | 295 | a. UNLESS OTHERWISE SEPARATELY UNDERTAKEN BY THE LICENSOR, TO THE 296 | EXTENT POSSIBLE, THE LICENSOR OFFERS THE LICENSED MATERIAL AS-IS 297 | AND AS-AVAILABLE, AND MAKES NO REPRESENTATIONS OR WARRANTIES OF 298 | ANY KIND CONCERNING THE LICENSED MATERIAL, WHETHER EXPRESS, 299 | IMPLIED, STATUTORY, OR OTHER. THIS INCLUDES, WITHOUT LIMITATION, 300 | WARRANTIES OF TITLE, MERCHANTABILITY, FITNESS FOR A PARTICULAR 301 | PURPOSE, NON-INFRINGEMENT, ABSENCE OF LATENT OR OTHER DEFECTS, 302 | ACCURACY, OR THE PRESENCE OR ABSENCE OF ERRORS, WHETHER OR NOT 303 | KNOWN OR DISCOVERABLE. WHERE DISCLAIMERS OF WARRANTIES ARE NOT 304 | ALLOWED IN FULL OR IN PART, THIS DISCLAIMER MAY NOT APPLY TO YOU. 305 | 306 | b. TO THE EXTENT POSSIBLE, IN NO EVENT WILL THE LICENSOR BE LIABLE 307 | TO YOU ON ANY LEGAL THEORY (INCLUDING, WITHOUT LIMITATION, 308 | NEGLIGENCE) OR OTHERWISE FOR ANY DIRECT, SPECIAL, INDIRECT, 309 | INCIDENTAL, CONSEQUENTIAL, PUNITIVE, EXEMPLARY, OR OTHER LOSSES, 310 | COSTS, EXPENSES, OR DAMAGES ARISING OUT OF THIS PUBLIC LICENSE OR 311 | USE OF THE LICENSED MATERIAL, EVEN IF THE LICENSOR HAS BEEN 312 | ADVISED OF THE POSSIBILITY OF SUCH LOSSES, COSTS, EXPENSES, OR 313 | DAMAGES. WHERE A LIMITATION OF LIABILITY IS NOT ALLOWED IN FULL OR 314 | IN PART, THIS LIMITATION MAY NOT APPLY TO YOU. 315 | 316 | c. The disclaimer of warranties and limitation of liability provided 317 | above shall be interpreted in a manner that, to the extent 318 | possible, most closely approximates an absolute disclaimer and 319 | waiver of all liability. 320 | 321 | Section 6 -- Term and Termination. 322 | 323 | a. This Public License applies for the term of the Copyright and 324 | Similar Rights licensed here. However, if You fail to comply with 325 | this Public License, then Your rights under this Public License 326 | terminate automatically. 327 | 328 | b. Where Your right to use the Licensed Material has terminated under 329 | Section 6(a), it reinstates: 330 | 331 | 1. automatically as of the date the violation is cured, provided 332 | it is cured within 30 days of Your discovery of the 333 | violation; or 334 | 335 | 2. upon express reinstatement by the Licensor. 
336 | 337 | For the avoidance of doubt, this Section 6(b) does not affect any 338 | right the Licensor may have to seek remedies for Your violations 339 | of this Public License. 340 | 341 | c. For the avoidance of doubt, the Licensor may also offer the 342 | Licensed Material under separate terms or conditions or stop 343 | distributing the Licensed Material at any time; however, doing so 344 | will not terminate this Public License. 345 | 346 | d. Sections 1, 5, 6, 7, and 8 survive termination of this Public 347 | License. 348 | 349 | Section 7 -- Other Terms and Conditions. 350 | 351 | a. The Licensor shall not be bound by any additional or different 352 | terms or conditions communicated by You unless expressly agreed. 353 | 354 | b. Any arrangements, understandings, or agreements regarding the 355 | Licensed Material not stated herein are separate from and 356 | independent of the terms and conditions of this Public License. 357 | 358 | Section 8 -- Interpretation. 359 | 360 | a. For the avoidance of doubt, this Public License does not, and 361 | shall not be interpreted to, reduce, limit, restrict, or impose 362 | conditions on any use of the Licensed Material that could lawfully 363 | be made without permission under this Public License. 364 | 365 | b. To the extent possible, if any provision of this Public License is 366 | deemed unenforceable, it shall be automatically reformed to the 367 | minimum extent necessary to make it enforceable. If the provision 368 | cannot be reformed, it shall be severed from this Public License 369 | without affecting the enforceability of the remaining terms and 370 | conditions. 371 | 372 | c. No term or condition of this Public License will be waived and no 373 | failure to comply consented to unless expressly agreed to by the 374 | Licensor. 375 | 376 | d. Nothing in this Public License constitutes or may be interpreted 377 | as a limitation upon, or waiver of, any privileges and immunities 378 | that apply to the Licensor or You, including from the legal 379 | processes of any jurisdiction or authority. 380 | 381 | ======================================================================= 382 | 383 | Creative Commons is not a party to its public 384 | licenses. Notwithstanding, Creative Commons may elect to apply one of 385 | its public licenses to material it publishes and in those instances 386 | will be considered the “Licensor.” The text of the Creative Commons 387 | public licenses is dedicated to the public domain under the CC0 Public 388 | Domain Dedication. Except for the limited purpose of indicating that 389 | material is shared under a Creative Commons public license or as 390 | otherwise permitted by the Creative Commons policies published at 391 | creativecommons.org/policies, Creative Commons does not authorize the 392 | use of the trademark "Creative Commons" or any other trademark or logo 393 | of Creative Commons without its prior written consent including, 394 | without limitation, in connection with any unauthorized modifications 395 | to any of its public licenses or any other arrangements, 396 | understandings, or agreements concerning use of licensed material. For 397 | the avoidance of doubt, this paragraph does not form part of the 398 | public licenses. 399 | 400 | Creative Commons may be contacted at creativecommons.org. 
401 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # ID compression for Vector databases
2 | 
3 | This is the implementation of the paper [Lossless Compression of Vector IDs for Approximate Nearest Neighbor Search](http://arxiv.org/abs/2501.10479) by Daniel Severo, Giuseppe Ottaviano, Matthew Muckley, Karen Ullrich, and Matthijs Douze.
4 | 
5 | The package is implemented in Python and partly in C++.
6 | The main package depends on the Elias-Fano implementation from the [Succinct library](https://github.com/ot/succinct/blob/master/elias_fano.hpp) and the wavelet tree from [SDSL](https://github.com/simongog/sdsl-lite).
7 | 
8 | ## TABLE OF CONTENTS
9 | 
10 | - [1) Compiling and Installing Dependencies](#1-compiling-and-installing-dependencies)
11 | - [2) Reproducing results in the paper](#2-reproducing-results-in-the-paper)
12 |   - [2.1) Tables 1 and 2 (online setting)](#21-tables-1-and-2-online-setting)
13 |     - [Graph indices](#graph-indices)
14 |     - [IVF Indices](#ivf-indices)
15 |   - [2.2) Table 3 (offline setting)](#22-table-3-offline-setting)
16 |     - [Random Edge Coding (REC)](#random-edge-coding-rec)
17 |     - [Zuckerli Baseline](#zuckerli-baseline)
18 |   - [2.3) Table 4 (Large-scale experiment with QINCo)](#23-table-4-large-scale-experiment-with-qinco)
19 | - [3) Citation](#3-citation)
20 | - [4) License](#4-license)
21 | 
22 | ## 1) Compiling and Installing Dependencies
23 | 
24 | Most of the code is written as a plugin to the [Faiss](https://github.com/facebookresearch/faiss) vector search library.
25 | Therefore Faiss should be installed and available from Python.
26 | We assume that Faiss is installed via Conda and that the Faiss headers are available in the `$CONDA_PREFIX/include` directory.
27 | We also assume that [swig](https://swig.org/) is installed (it is available in conda).
28 | 
29 | The compilation is driven by makefiles (that are written for Linux).
30 | Make should be run in the two subdirectories [alt-graph-index](./alt-graph-index) and [custom_invlist_cpp](./custom_invlist_cpp); see the compilation instructions there.
31 | 
32 | 
33 | To complete the setup, do the following.
34 | - Create a conda environment with python 3.10. We suggest using
35 | ```sh
36 | conda create -n vector_db_id_compression python=3.10
37 | ```
38 | - Activate your environment (e.g., `conda activate vector_db_id_compression`)
39 | - Install [Faiss](https://github.com/facebookresearch/faiss).
40 | - Install external dependencies (including Python dependencies) by running `./install-dependencies.sh`
41 | 
42 | ## 2) Reproducing results in the paper
43 | 
44 | ### 2.1) Tables 1 and 2 (online setting)
45 | 
46 | #### Graph indices
47 | To reproduce the graph-index results of Table 1, first make sure you have installed `succinct` as specified in [succinct/README.md](https://github.com/ot/succinct/blob/669eebbdcaa0562028a22cb7c877e512e4f1210b/README.md).
48 | 
49 | Then, install the compressed graph indices by running `make` in the `alt-graph-index` directory.
50 | 
51 | From `alt-graph-index/`, run `graph_dynamic_bench_invlists.py`.
52 | This script takes 3 arguments:
53 | 
54 | - Dataset index, specifying which dataset to benchmark. See the `AVAILABLE_DATASETS` variable.
55 | - Max degree parameter of NSG.
56 | - Path to the `fb_ssnpp/` directory (only used if dataset index is set to 3).
57 | 
58 | To reproduce all graph results in Table 1, run the following code.
59 | 
60 | ```sh
61 | cd alt-graph-index
62 | fb_ssnpp_dir=...
63 | for dataset_idx in 0 1 2 3; do
64 |     for max_degree in 16 32 64 128 256; do
65 |         python graph_dynamic_bench_invlists.py $dataset_idx $max_degree $fb_ssnpp_dir
66 |     done
67 | done
68 | ```
69 | 
70 | A CSV file with results will be saved to `alt-graph-index/results-online-graphs` for each run.
71 | 
72 | ---
73 | 
74 | #### IVF Indices
75 | 
76 | To reproduce the IVF index results of Table 1, first install the compressed IVF indices by running `make` in the `custom_invlist_cpp` directory.
77 | 
78 | From the `custom_invlist_cpp` directory, run `bench_invlists.py`.
79 | This script takes 3 arguments:
80 | 
81 | - Dataset index. See the `AVAILABLE_DATASETS` variable.
82 | - Index string in the style of the [FAISS index factory](https://github.com/facebookresearch/faiss/wiki/The-index-factory).
83 | - Path to the `fb_ssnpp/` directory (only used if dataset index is set to 3).
84 | 
85 | To reproduce all IVF results in Table 1, run the following code.
86 | ```sh
87 | cd custom_invlist_cpp
88 | fb_ssnpp_dir=...
89 | for dataset_idx in 0 1 2 3; do
90 |     for code in Flat PQ4 PQ16 PQ32 PQ8x10; do
91 |         for num_clusters in 256 512 1024 2048; do
92 |             index_str=IVF$num_clusters,$code
93 |             python bench_invlists.py $dataset_idx $index_str $fb_ssnpp_dir
94 |         done
95 |     done
96 | done
97 | ```
98 | 
99 | A CSV file with results will be saved to `custom_invlist_cpp/results-online-ivf` for each run.
100 | 
101 | ### 2.2) Table 3 (offline setting)
102 | 
103 | #### Random Edge Coding (REC)
104 | For these experiments you will need to clone [dsevero/Random-Edge-Coding](https://github.com/dsevero/Random-Edge-Coding) into the root directory. Follow the instructions in [Random-Edge-Coding/README.md](https://github.com/dsevero/Random-Edge-Coding?tab=readme-ov-file#how-to-use-random-edge-coding) to install REC.
105 | 
106 | Note: if you cloned `facebookresearch/vector_db_id_compression` (i.e., this repo) with `git clone --recursive`, REC will have been cloned automatically as well.
107 | 
108 | #### Zuckerli Baseline
109 | See [zuckerli-baseline/README.md](zuckerli-baseline/README.md) for the baseline.
110 | 
111 | ### 2.3) Table 4: Large-scale experiment with QINCo
112 | 
113 | The [QINCo](https://github.com/facebookresearch/Qinco/tree/main/qinco_v1) package should be available as a subdirectory of `custom_invlist_cpp`:
114 | ```sh
115 | cd custom_invlist_cpp
116 | git clone https://github.com/facebookresearch/Qinco.git
117 | mv Qinco/qinco_v1 . # the code of the original qinco
118 | rm -rf Qinco
119 | mv qinco_v1 Qinco
120 | ```
121 | 
122 | A small-scale validation experiment runs on 10M vectors, using a pre-built QINCo index.
123 | 
124 | <details><summary>commands</summary>
125 | 
126 | ```sh
127 | tmpdir=/scratch/matthijs/ # some temporary directory
128 | 
129 | # data from https://github.com/facebookresearch/Qinco/blob/main/docs/IVF_search.md
130 | (cd $tmpdir; wget https://dl.fbaipublicfiles.com/QINCo/models/bigann_IVF65k_16x8_L2.pt )
131 | (cd $tmpdir ; wget https://dl.fbaipublicfiles.com/QINCo/ivf/bigann10M_IVF65k_16x8_L2.faissindex )
132 | 
133 | # run baseline without id compression
134 | # parameters are one of the optimal op points from
135 | # https://gist.github.com/mdouze/e4b7c9dbf6a52e0f7cf100ce0096aaa8
136 | # cno=21
137 | 
138 | # baseline
139 | 
140 | python search_ivf_qinco.py --db bigann10M \
141 |     --model $tmpdir/bigann_IVF65k_16x8_L2.pt \
142 |     --index $tmpdir/bigann10M_IVF65k_16x8_L2.faissindex \
143 |     --todo search --nthread 32 --nprobe 64 --nshort 100
144 | 
145 | # with ROC
146 | 
147 | python search_ivf_qinco.py --db bigann10M \
148 |     --model $tmpdir/bigann_IVF65k_16x8_L2.pt \
149 |     --index $tmpdir/bigann10M_IVF65k_16x8_L2.faissindex \
150 |     --todo search --nthread 32 --nprobe 64 --nshort 100 \
151 |     --id_compression roc --defer_id_decoding
152 | 
153 | ```
154 | 
155 | </details>
156 | 
157 | This should produce an output similar to [this log](https://gist.github.com/mdouze/b28e1172f612764dc2cf5133b5614f7d).
158 | 
159 | Note that `search_ivf_qinco.py` is a slightly adapted version of QINCo's [`search_ivf.py`](https://github.com/facebookresearch/Qinco/blob/main/qinco_v1/search_ivf.py) to integrate id compression.
160 | 
161 | #### Full-scale experiment
162 | 
163 | Similar to the above (but slower!):
164 | 
165 | <details><summary>commands</summary>
166 | 
167 | ```sh
168 | 
169 | datadir=/checkpoint/matthijs/id_compression/qinco/
170 | 
171 | (cd $datadir && wget https://dl.fbaipublicfiles.com/QINCo/ivf/bigann1B_IVF1M_8x8_L2.faissindex)
172 | (cd $datadir && wget https://dl.fbaipublicfiles.com/QINCo/models/bigann_IVF1M_8x8_L2.pt)
173 | 
174 | # cno=533 from https://gist.github.com/mdouze/0187c2ca3f96f806e41567af13f80442
175 | # fastest run with R@1 > 0.3
176 | params="--nprobe 128 --quantizer_efSearch 64 --nshort 200"
177 | 
178 | for comp in none packed-bits elias-fano roc wavelet-tree wavelet-tree-1; do
179 | 
180 | python -u search_ivf_qinco.py \
181 |     --todo search \
182 |     --db bigann1B \
183 |     --model $datadir/bigann_IVF1M_8x8_L2.pt \
184 |     --index $datadir/bigann1B_IVF1M_8x8_L2.faissindex \
185 |     --nthread 32 \
186 |     --id_compression $comp --defer_id_decoding --redo_search 10 \
187 |     $params
188 | 
189 | done
190 | 
191 | ```
192 | 
193 | </details>
194 | 
195 | This outputs [these logs](https://gist.github.com/mdouze/93491e398da661843f215b17525eda59).
196 | The 10 runs are averaged to produce Table 4 using
197 | [parse_qinco_res.py](https://gist.github.com/mdouze/8fe85335197049db4d728ae0b427036f).
198 | 
199 | ## 3) Citation
200 | 
201 | If you use this package in a research work, please cite:
202 | 
203 | ```
204 | @misc{severo2025idcompression,
205 |       title={Lossless Compression of Vector IDs for Approximate Nearest Neighbor Search},
206 |       author={Daniel Severo and Giuseppe Ottaviano and Matthew Muckley and Karen Ullrich and Matthijs Douze},
207 |       year={2025},
208 |       eprint={2501.10479},
209 |       archivePrefix={arXiv},
210 |       primaryClass={cs.LG}
211 | }
212 | ```
213 | 
214 | ## 4) License
215 | 
216 | This package is provided under a [CC-by-NC license](https://creativecommons.org/licenses/by-nc/4.0/deed.en).
217 | 
--------------------------------------------------------------------------------
/alt-graph-index/Makefile:
--------------------------------------------------------------------------------
1 | # Copyright (c) Meta Platforms, Inc. and affiliates.
2 | # All rights reserved.
3 | #
4 | # This source code is licensed under the license found in the
5 | # LICENSE file in the root directory of this source tree.
6 | 
7 | PYTHON_INCLUDE := $(shell python -c "import distutils.sysconfig ; print(distutils.sysconfig.get_python_inc())")
8 | CONDA_INCLUDE := ${CONDA_PREFIX}/include
9 | AVX2_LIB := ${CONDA_PREFIX}/lib/libfaiss_avx2.so
10 | 
11 | all: test
12 | 
13 | altid_wrap.cxx altid.py: altid.swig altid_impl.h
14 | 	swig -c++ -python -I${CONDA_INCLUDE} $<
15 | 
16 | 
17 | _altid.so: altid_wrap.cxx altid_impl.cpp ../custom_invlist_cpp/codec.cpp
18 | 	g++ -fPIC -O3 -std=c++17 \
19 | 		$^ \
20 | 		-shared \
21 | 		-o $@ \
22 | 		-I${PYTHON_INCLUDE} \
23 | 		-I${CONDA_INCLUDE} \
24 | 		${AVX2_LIB}
25 | 
26 | test: _altid.so
27 | 	python -c "import faiss, altid" && python -m unittest -v test_altid.py
28 | 
29 | 
30 | clean:
31 | 	rm -f altid_wrap.cxx altid.py _altid.so
32 | 
33 | .PHONY: clean test
--------------------------------------------------------------------------------
/alt-graph-index/README.md:
--------------------------------------------------------------------------------
1 | # What is this?
2 | 
3 | This directory contains `altid`, an additional module on top of the Faiss indexes that is useful for ID compression of NSG indexes.
4 | 
5 | `altid` contains the following components for alternative id storage:
6 | 
7 | - a function that traces the nodes visited during NSG search
8 | - three classes that contain the graph structure (alternatives to the regular uncompressed one)
9 | 
10 | ## How to compile
11 | 
12 | Note that this module needs a Faiss version compiled after 2024-10-22 to build (because replacing the NSG graph was not supported before).
13 | To install this Faiss version on conda, do:
14 | 
15 | ```
16 | conda install -c pytorch/label/nightly -c nvidia faiss-gpu=1.9.0 pytorch=*=*cuda*
17 | ```
18 | 
19 | This also installs pytorch in case you'd need it...
20 | To compile, just run `make`.
21 | This automatically runs the test below.
22 | 
23 | ## How to test
24 | 
25 | Run
26 | ```
27 | python -m unittest test_altid.py
28 | ```
29 | See also that file for examples of how to use the module.
30 | 
31 | ## How to modify
32 | 
33 | The C++ implementation is in `altid_impl.{h,cpp}`.
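
## How to use

A minimal usage sketch (it mirrors `test_altid.py`, which remains the authoritative reference):

```python
import altid  # adds replace_final_graph / search_and_trace to the Faiss classes
import faiss
from faiss.contrib.datasets import SyntheticDataset

ds = SyntheticDataset(32, 0, 1000, 10)  # (d, nt, nb, nq)
index = faiss.index_factory(ds.d, "NSG32,Flat")
index.add(ds.get_database())

# swap the uncompressed NSG graph for a compressed alternative
graph = index.nsg.get_final_graph()
index.nsg.replace_final_graph(altid.EliasFanoNSGGraph(graph))
D, I = index.search(ds.get_queries(), 10)  # same results, compressed ids

# collect the ids of the nodes visited during the search
D, I, visited = index.search_and_trace(ds.get_queries(), 10)
```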
34 | 
--------------------------------------------------------------------------------
/alt-graph-index/altid.swig:
--------------------------------------------------------------------------------
1 | // Copyright (c) Meta Platforms, Inc. and affiliates.
2 | // All rights reserved.
3 | //
4 | // This source code is licensed under the license found in the
5 | // LICENSE file in the root directory of this source tree.
6 | 
7 | %module altid
8 | 
9 | 
10 | // to get uint32_t and friends
11 | %include <stdint.i>
12 | 
13 | // This means: assume what's declared in these .h files is provided
14 | // by the Faiss module.
15 | 
16 | #define FAISS_API
17 | 
18 | %import(module="faiss") "faiss/MetricType.h"
19 | %import(module="faiss") "faiss/Index.h"
20 | %import(module="faiss") "faiss/impl/NSG.h"
21 | %import(module="faiss") "faiss/IndexNSG.h"
22 | 
23 | %template(FinalNSGGraph) faiss::nsg::Graph< int32_t >;
24 | 
25 | %{
26 | #include <faiss/impl/FaissException.h>  // assumed include: needed by the %exception handler below
27 | 
28 | #include "altid_impl.h"
29 | 
30 | %}
31 | 
32 | 
33 | // This is important to release GIL and do Faiss exception handling
34 | %exception {
35 |     Py_BEGIN_ALLOW_THREADS
36 |     try {
37 |         $action
38 |     } catch(faiss::FaissException & e) {
39 |         PyEval_RestoreThread(_save);
40 | 
41 |         if (PyErr_Occurred()) {
42 |             // some previous code already set the error type.
43 |         } else {
44 |             PyErr_SetString(PyExc_RuntimeError, e.what());
45 |         }
46 |         SWIG_fail;
47 |     } catch(std::bad_alloc & ba) {
48 |         PyEval_RestoreThread(_save);
49 |         PyErr_SetString(PyExc_MemoryError, "std::bad_alloc");
50 |         SWIG_fail;
51 |     }
52 |     Py_END_ALLOW_THREADS
53 | }
54 | 
55 | 
56 | %include "altid_impl.h"
57 | 
58 | 
59 | %inline %{
60 | 
61 | void NSG_replace_final_graph(faiss::NSG & nsg, faiss::nsg::Graph<int32_t> *graph) {
62 |     nsg.final_graph.reset(graph);
63 | }
64 | 
65 | 
66 | // make an untyped version of the above because there is a big mess-up with the SWIG types
67 | void search_NSG_and_trace_untyped(
68 |         const faiss::IndexNSG & index,
69 |         faiss::idx_t n,
70 |         const float *x,
71 |         int k,
72 |         void *labels,
73 |         float *distances,
74 |         void * visited_nodes)
75 | {
76 |     search_NSG_and_trace(index, n, x, k, (faiss::idx_t*)labels, distances, *(std::vector<int64_t> *)visited_nodes);
77 | }
78 | 
79 | 
80 | %}
81 | 
82 | 
83 | %pythoncode %{
84 | 
85 | import faiss
86 | import numpy as np
87 | 
88 | def replace_final_graph(self, graph):
89 |     _altid.NSG_replace_final_graph(self, graph)
90 |     graph.this.disown()
91 | 
92 | faiss.NSG.replace_final_graph = replace_final_graph
93 | 
94 | def search_and_trace(self, x, k):
95 |     n, d = x.shape
96 |     I = np.empty((n, k), dtype='int64')
97 |     D = np.empty((n, k), dtype='float32')
98 |     visited_nodes = faiss.Int64Vector()
99 |     search_NSG_and_trace_untyped(
100 |         self, n, faiss.swig_ptr(x), k,
101 |         faiss.swig_ptr(I), faiss.swig_ptr(D),
102 |         visited_nodes)
103 |     return D, I, faiss.vector_to_array(visited_nodes)
104 | 
105 | faiss.IndexNSG.search_and_trace = search_and_trace
106 | 
107 | 
108 | %}
109 | 
110 | 
111 | 
112 | 
--------------------------------------------------------------------------------
/alt-graph-index/altid_impl.cpp:
--------------------------------------------------------------------------------
1 | // Copyright (c) Meta Platforms, Inc. and affiliates.
2 | // All rights reserved.
3 | //
4 | // This source code is licensed under the license found in the
5 | // LICENSE file in the root directory of this source tree.
6 | 
7 | #include "altid_impl.h"
8 | 
9 | #include <faiss/utils/hamming.h>            // assumed include: BitstringWriter / BitstringReader
10 | #include <faiss/impl/AuxIndexStructures.h>  // assumed include: VisitedTable
11 | #include <random>                           // assumed include: std::mt19937, std::shuffle
12 | 
13 | /********************************************** NSG extensions */
14 | 
15 | 
16 | using namespace faiss;
17 | 
18 | using FinalNSGGraph = nsg::Graph<int32_t>;
19 | 
20 | CompactBitNSGGraph::CompactBitNSGGraph(const FinalNSGGraph& graph)
21 |         : FinalNSGGraph(graph.data, graph.N, graph.K) {
22 |     bits = 0;
23 |     while((1 << bits) < N + 1) bits++;  // smallest width that can also encode the value N
24 |     stride = (K * bits + 7) / 8;
25 |     compressed_data.resize(N * stride);
26 |     for (size_t i = 0; i < N; i++) {
27 |         BitstringWriter writer(compressed_data.data() + i * stride, stride);
28 |         for (size_t j = 0; j < K; j++) {
29 |             int32_t v = graph.data[i * K + j];
30 |             if (v == -1) {
31 |                 writer.write(N, bits);  // N serves as the end-of-list marker
32 |                 break;
33 |             } else {
34 |                 writer.write(v, bits);
35 |             }
36 |         }
37 |     }
38 |     data = nullptr;
39 | }
40 | 
41 | size_t CompactBitNSGGraph::get_neighbors(int i, int32_t* neighbors) const {
42 |     BitstringReader reader(compressed_data.data() + i * stride, stride);
43 |     for (int j = 0; j < K; j++) {
44 |         int32_t v = reader.read(bits);
45 |         if (v == N) {
46 |             return j;
47 |         }
48 |         neighbors[j] = v;
49 |     }
50 |     return K;
51 | }
52 | 
53 | EliasFanoNSGGraph::EliasFanoNSGGraph(const FinalNSGGraph& graph)
54 |         : FinalNSGGraph(graph.data, graph.N, graph.K) {
55 |     std::vector<uint32_t> num_outgoing_edges(N, 0);
56 |     overhead_in_bytes += N*std::ceil(std::log2(N)) / 8.0; // size of each set (i.e., friendlist)
57 |     overhead_in_bytes += N*std::ceil(std::log2(N)) / 8.0; // max_id value
58 |     for (size_t i = 0; i < N; i++) {
59 | 
60 |         // Compute number of outgoing edges
61 |         for (size_t j = 0; j < K; j++) {
62 |             int32_t v = graph.data[i * K + j];
63 |             if (v == -1) {
64 |                 break;
65 |             } else {
66 |                 num_outgoing_edges[i]++;
67 |             }
68 |         }
69 | 
70 |         // Compress ids with Elias-Fano
71 |         uint32_t ls = num_outgoing_edges[i];
72 |         int32_t* start_friendlist = graph.data + i * K;
73 |         int32_t* end_friendlist = start_friendlist + ls;
74 | 
75 |         int32_t max_id = *std::max_element(start_friendlist, end_friendlist);
76 |         std::sort(start_friendlist, end_friendlist);
77 |         succinct::elias_fano::elias_fano_builder ef_builder(max_id, ls);
78 |         for (size_t j=0; j < ls; j++)
79 |         {
80 |             int32_t v = graph.data[i * K + j];
81 |             if (v < 0) break;
82 |             ef_builder.push_back((uint64_t)v);
83 |         }
84 |         succinct::elias_fano* ef_ptr = new succinct::elias_fano(&ef_builder);
85 |         ef_bitstreams.push_back(ef_ptr);
86 |         compressed_ids_size_in_bytes += ef_ptr->m_low_bits.size() + ef_ptr->m_high_bits.size();
87 |     }
88 |     compressed_ids_size_in_bytes /= 8;  // the sizes accumulated above are in bits
89 |     data = nullptr;
90 | }
91 | 
92 | size_t EliasFanoNSGGraph::get_neighbors(int i, int32_t* neighbors) const {
93 |     const auto& ef = ef_bitstreams[i];
94 |     succinct::elias_fano::select_enumerator it(*ef, 0);
95 | 
96 |     for (size_t j=0; j < ef->num_elements; j++) {
97 |         neighbors[j] = it.next();
98 |     }
99 | 
100 |     return ef->num_elements;
101 | }
102 | 
103 | ROCNSGGraph::ROCNSGGraph(const FinalNSGGraph& graph)
104 |         : FinalNSGGraph(graph.data, graph.N, graph.K) {
105 |     num_outgoing_edges.resize(N);
106 |     overhead_in_bytes += N*std::ceil(std::log2(N)) / 8.0;
107 |     ans_states.resize(graph.N);
108 |     for (size_t list_no = 0; list_no < graph.N; list_no++) {
109 |         // Compute number of outgoing edges
110 |         for (size_t j = 0; j < K; j++) {
111 |             int32_t v = graph.data[list_no * K + j];
112 |             if (v == -1) {
113 |                 break;
114 |             } else {
115 |                 num_outgoing_edges[list_no]++;
116 |             }
117 |         }
118 | 
119 |         // Compress with ROC
120 |         uint32_t ls = num_outgoing_edges[list_no];
121 |         int32_t* start_friendlist = graph.data + list_no * K;
122 |         int32_t* end_friendlist = start_friendlist + ls;
123 | 
124 |         int32_t max_id = *std::max_element(start_friendlist, end_friendlist);
125 |         id_symbol_precision.push_back((uint64_t)std::ceil(std::log2(max_id)));
126 | 
127 |         // Populate fenwick tree in random order to increase balancedness
128 |         std::random_device rd;
129 |         std::mt19937 g(rd());
130 |         std::shuffle(start_friendlist, end_friendlist, g);
131 | 
132 |         FenwickTree ftree;
133 |         for (int i=0; i < ls; i++) {
134 |             ROCSymbolType symbol = start_friendlist[i];
135 |             ftree.insert_then_forward_lookup(symbol);
136 |         }
137 | 
138 |         for (size_t i = 0; i < ls; i++) {
139 |             // Sample, without replacement, an element from the FenwickTree.
140 |             uint32_t nmax = ls - i;
141 |             size_t index = pop_with_finer_precision(ans_states[list_no], nmax);
142 |             Range range = ftree.reverse_lookup_then_remove(index);
143 | 
144 |             // Encode id.
145 |             ROCSymbolType id = range.ftree->symbol;
146 |             codec_push(ans_states[list_no], id, id_symbol_precision[list_no]);
147 |         }
148 |         compressed_ids_size_in_bytes += ans_states[list_no].size();
149 |     }
150 |     data = nullptr;
151 | }
152 | 
153 | size_t ROCNSGGraph::get_neighbors(int node, int32_t* neighbors) const {
154 |     FenwickTree ftree;
155 |     ANSState state(ans_states[node]);
156 |     uint32_t n = num_outgoing_edges[node];
157 |     for (size_t i = 0; i < n; i++) {
158 |         int32_t symbol = (uint64_t)codec_pop(state, id_symbol_precision[node]);
159 |         auto range = ftree.insert_then_forward_lookup(symbol);
160 |         uint32_t nmax = i + 1;
161 |         push_with_finer_precision(state, range.start, nmax);
162 |         neighbors[n - i - 1] = range.ftree->symbol;
163 |     }
164 |     return K;
165 | }
166 | 
167 | 
168 | namespace {
169 | 
170 | struct TracingDistanceComputer: DistanceComputer {
171 |     std::vector<int64_t> visited;
172 |     DistanceComputer *basedis;
173 | 
174 |     TracingDistanceComputer(DistanceComputer *basedis):
175 |         basedis(basedis) {}
176 | 
177 |     void set_query(const float* x) override {
178 |         basedis->set_query(x);
179 |     }
180 | 
181 |     /// compute distance of vector i to current query
182 |     float operator()(idx_t i) override {
183 |         visited.push_back(i);
184 |         return (*basedis)(i);
185 |     }
186 | 
187 |     /// compute distance between two stored vectors
188 |     float symmetric_dis(idx_t i, idx_t j) override {
189 |         visited.push_back(i);
190 |         visited.push_back(j);
191 |         return basedis->symmetric_dis(i, j);
192 |     }
193 | 
194 |     virtual ~TracingDistanceComputer() {
195 |         delete basedis;
196 |     }
197 | 
198 | };
199 | 
200 | } // anonymous namespace
201 | 
202 | 
203 | void search_NSG_and_trace(
204 |         const IndexNSG & index,
205 |         idx_t n,
206 |         const float *x,
207 |         int k,
208 |         faiss::idx_t *labels,
209 |         float *distances,
210 |         std::vector<int64_t> & visited_nodes) {
211 | 
212 |     int L = std::max(index.nsg.search_L, (int)k); // in case of search L = -1
213 | 
214 |     VisitedTable vt(index.ntotal);
215 | 
216 |     std::unique_ptr<TracingDistanceComputer> dis(
217 |             new TracingDistanceComputer(
218 |                     nsg::storage_distance_computer(index.storage)));
219 | 
220 |     for (idx_t i = 0; i < n; i++) {
221 |         idx_t* idxi = labels + i * k;
222 |         float* simi = distances + i * k;
223 |         dis->set_query(x + i * index.d);
224 | 
225 |         index.nsg.search(*dis, k, idxi, simi, vt);
226 | 
227 |         vt.advance();
228 |     }
229 | 
230 |     std::swap(visited_nodes, dis->visited);
231 | }
--------------------------------------------------------------------------------
/alt-graph-index/altid_impl.h:
--------------------------------------------------------------------------------
1 | // Copyright (c) Meta Platforms, Inc. and affiliates.
2 | // All rights reserved.
3 | //
4 | // This source code is licensed under the license found in the
5 | // LICENSE file in the root directory of this source tree.
6 | 
7 | #pragma once
8 | 
9 | #include <vector>            // assumed include
10 | #include <faiss/IndexNSG.h>  // assumed include: IndexNSG, nsg::Graph
11 | #include "../succinct/elias_fano.hpp"
12 | #include "../fenwick_tree_cpp/src/fenwick_tree.h"
13 | #include "../custom_invlist_cpp/codec.h"
14 | 
15 | 
16 | 
17 | /* perform a search in the NSG graph and collect the ids of the nodes that are compared */
18 | void search_NSG_and_trace(
19 |         const faiss::IndexNSG & index,
20 |         faiss::idx_t n,
21 |         const float *x,
22 |         int k,
23 |         faiss::idx_t *labels,
24 |         float *distances,
25 |         std::vector<int64_t> & visited_nodes);
26 | 
27 | 
28 | /** NSG graph where edges are encoded into the minimum number of bits */
29 | struct CompactBitNSGGraph : faiss::nsg::Graph<int32_t> {
30 |     int bits;      ///< number of bits per edge
31 |     size_t stride; ///< number of bytes per node
32 | 
33 |     ///< array of size N * stride
34 |     std::vector<uint8_t> compressed_data;
35 | 
36 |     CompactBitNSGGraph(const faiss::nsg::Graph<int32_t>& graph);
37 | 
38 |     size_t get_neighbors(int i, int32_t* neighbors) const override;
39 | };
40 | 
41 | /** NSG graph where edges are encoded using Elias-Fano */
42 | struct EliasFanoNSGGraph : faiss::nsg::Graph<int32_t> {
43 |     std::vector<succinct::elias_fano*> ef_bitstreams;
44 | 
45 |     size_t compressed_ids_size_in_bytes = 0;
46 |     size_t overhead_in_bytes = 0;
47 |     EliasFanoNSGGraph(const faiss::nsg::Graph<int32_t>& graph);
48 | 
49 |     size_t get_neighbors(int i, int32_t* neighbors) const override;
50 | };
51 | 
52 | 
53 | using ROCSymbolType = int32_t;
54 | 
55 | /** NSG graph where edges are encoded using ROC */
56 | struct ROCNSGGraph : faiss::nsg::Graph<int32_t> {
57 |     std::vector<std::vector<uint8_t>> codes_all;  // element type assumed
58 |     std::vector<ANSState> ans_states;
59 |     std::vector<uint64_t> id_symbol_precision;
60 |     size_t compressed_ids_size_in_bytes = 0;
61 |     std::vector<uint32_t> num_outgoing_edges;
62 |     size_t overhead_in_bytes = 0;
63 | 
64 |     ROCNSGGraph(const faiss::nsg::Graph<int32_t>& graph);
65 | 
66 |     size_t get_neighbors(int i, int32_t* neighbors) const override;
67 | };
--------------------------------------------------------------------------------
/alt-graph-index/graph_dynamic_bench_invlists.py:
--------------------------------------------------------------------------------
1 | # Copyright (c) Meta Platforms, Inc. and affiliates.
2 | # All rights reserved.
3 | #
4 | # This source code is licensed under the license found in the
5 | # LICENSE file in the root directory of this source tree.
6 | 
7 | import datetime
8 | import sys
9 | import time
10 | from pathlib import Path
11 | 
12 | import altid
13 | import faiss
14 | import numpy as np
15 | import pandas as pd
16 | from faiss.contrib.datasets import DatasetDeep1B, DatasetSIFT1M, SyntheticDataset
17 | from faiss.contrib.inspect_tools import get_NSG_neighbors
18 | 
19 | sys.path.append("..")  # qinco_datasets.py lives in the repo root
20 | from qinco_datasets import DatasetFB_ssnpp
21 | 
22 | AVAILABLE_COMPRESSED_IVFS = {
23 |     "elias-fano": altid.EliasFanoNSGGraph,
24 |     "roc": altid.ROCNSGGraph,
25 |     "compact": altid.CompactBitNSGGraph,
26 |     "ref": None,
27 | }
28 | 
29 | 
30 | def get_ids_size(dataset, graph_comp, comp_method, num_edges):
31 |     if comp_method is None:
32 |         return 8 * num_edges
33 |     elif comp_method == "compact":
34 |         return np.log2(dataset.nb) / 8 * num_edges
35 |     else:
36 |         return graph_comp.compressed_ids_size_in_bytes
37 | 
38 | 
39 | def get_overhead_size(dataset, graph_comp, comp_method, num_edges):
40 |     if comp_method in ["roc", "elias-fano"]:
41 |         return graph_comp.overhead_in_bytes
42 |     elif comp_method in [None, "ref"]:  # None stands for the uncompressed reference run
43 |         return 0
44 |     elif comp_method == "compact":
45 |         return 0
46 |     else:
47 |         return None
48 | 
49 | 
50 | if __name__ == "__main__":
51 |     dataset_idx = int(sys.argv[1])
52 |     max_degree = int(sys.argv[2])
53 |     fb_ssnpp_dir = None
54 | 
55 |     if dataset_idx == 3:
56 |         assert (
57 |             len(sys.argv) == 4
58 |         ), "Path to fb_ssnpp/ directory is needed for DatasetFB_ssnpp (index 3)"
59 |         fb_ssnpp_dir = sys.argv[3]
60 | 
61 |     AVAILABLE_DATASETS = [
62 |         (SyntheticDataset, dict(d=32, nt=10_000, nq=1, nb=1_000)),
63 |         (DatasetSIFT1M, {}),
64 |         (DatasetDeep1B, dict(nb=10**6)),
65 |         (DatasetFB_ssnpp, dict(basedir=fb_ssnpp_dir)),
66 |     ]
67 | 
68 |     dataset_cls, dataset_kwargs = AVAILABLE_DATASETS[dataset_idx]
69 |     dataset = dataset_cls(**dataset_kwargs)
70 |     index_str = f"NSG{max_degree},Flat"
71 | 
72 |     now = datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S-%f")
73 |     csv_path = Path(
74 |         f"results-online-graphs/graph-dynamic-results-{now}-{index_str.replace(',', '_')}-{dataset_cls.__name__}.csv"
75 |     )
76 |     csv_path.parent.mkdir(parents=True, exist_ok=True)
77 | 
78 |     compression_methods = ["elias-fano", "roc", "compact"]
79 | 
80 |     search_time_params = dict(k=[20], nq=[None], nprobe=[16])
81 |     num_runs = 100
82 | 
83 |     results = []
84 |     i = 0
85 | 
86 |     print(f"Indexing {type(dataset).__name__} / {index_str}", flush=True)
87 |     index = faiss.index_factory(dataset.d, index_str)
88 |     index.verbose = True
89 |     database = dataset.get_database()
90 |     index.add(database)
91 |     num_edges = (get_NSG_neighbors(index.nsg) != -1).sum()
92 | 
93 |     # precompute graphs for compression methods
94 |     print("Compressing database ...")
95 |     graph = index.nsg.get_final_graph()
96 |     invlists_comp = {
97 |         comp_method: AVAILABLE_COMPRESSED_IVFS[comp_method](graph)
98 |         for comp_method in compression_methods
99 |     }
100 | 
101 |     # first run will be with no compression
102 |     print("Running search ...")
103 |     for comp_method in [None, *compression_methods]:
104 |         if comp_method is not None:
105 |             graph_comp = invlists_comp[comp_method]
106 |             index.nsg.replace_final_graph(graph_comp)
107 |         else:
108 |             graph_comp = graph
109 | 
110 |         for k in search_time_params["k"]:
111 |             for nq in search_time_params["nq"]:
112 |                 for nprobe in search_time_params["nprobe"]:
113 |                     index.nprobe = nprobe
114 |                     queries = dataset.get_queries()[:nq]
115 | 
116 |                     for run_id in range(num_runs):
117 |                         t0 = time.time()
118 |                         _, Iref = index.search(queries, k)
119 |                         t1 = time.time()
120 |                         dt_search = t1 - t0
121 | 
122 |                         results.append(
123 |                             {
123 | "dt_search": dt_search, 124 | "nprobe": nprobe, 125 | "run_id": run_id, 126 | "index_str": index_str, 127 | "k": k, 128 | "nq": queries.shape[0], 129 | "comp_method": comp_method or "ref", 130 | "dataset": type(dataset).__name__, 131 | "ids_size": get_ids_size( 132 | dataset, graph_comp, comp_method, num_edges 133 | ), 134 | "overhead_size": get_overhead_size( 135 | dataset, graph_comp, comp_method, num_edges 136 | ), 137 | "nb": dataset.nb, 138 | "nt": dataset.nt, 139 | "num_edges": num_edges, 140 | } 141 | ) 142 | print(results[-1], flush=True) 143 | i += 1 144 | 145 | df = pd.DataFrame(results) 146 | df.to_csv(csv_path, index=False) 147 | print(f"Saved to {csv_path} with {i} entries") 148 | print(df) 149 | -------------------------------------------------------------------------------- /alt-graph-index/test_altid.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) Meta Platforms, Inc. and affiliates. 2 | # All rights reserved. 3 | # 4 | # This source code is licensed under the license found in the 5 | # LICENSE file in the root directory of this source tree. 6 | 7 | import unittest 8 | 9 | # importing this module adds various methods to the IndexIVF and IndexNSG objects... 10 | import altid 11 | import faiss 12 | import numpy as np 13 | from faiss.contrib.datasets import SyntheticDataset 14 | from faiss.contrib.inspect_tools import get_NSG_neighbors 15 | 16 | 17 | class TestCompressedNSG(unittest.TestCase): 18 | 19 | def test_compact_bit(self): 20 | self.do_test(altid.CompactBitNSGGraph) 21 | 22 | def test_elias_fano(self): 23 | self.do_test(altid.EliasFanoNSGGraph) 24 | 25 | def test_roc_graph(self): 26 | self.do_test(altid.ROCNSGGraph) 27 | 28 | def do_test(self, graph_class): 29 | 30 | ds = SyntheticDataset(32, 0, 1000, 10) 31 | 32 | index = faiss.index_factory(ds.d, "NSG32,Flat") 33 | index.add(ds.get_database()) 34 | Dref, Iref = index.search(ds.get_queries(), 10) 35 | 36 | gr = index.nsg.get_final_graph() 37 | 38 | compactbit_graph = graph_class(gr) 39 | index.nsg.replace_final_graph(compactbit_graph) 40 | 41 | D, I = index.search(ds.get_queries(), 10) 42 | 43 | np.testing.assert_array_equal(I, Iref) 44 | np.testing.assert_array_equal(D, Dref) 45 | 46 | 47 | class TestSearchTraced(unittest.TestCase): 48 | 49 | def test_traced(self): 50 | ds = SyntheticDataset(32, 0, 10000, 10) 51 | 52 | index_nsg = faiss.index_factory(ds.d, "NSG32,Flat") 53 | index_nsg.add(ds.get_database()) 54 | q = ds.get_queries()[:1] # just one query 55 | Dref, Iref = index_nsg.search(q, 10) 56 | 57 | D, I, trace = index_nsg.search_and_trace(q, 10) 58 | np.testing.assert_array_equal(I, Iref) 59 | np.testing.assert_array_equal(D, Dref) 60 | 61 | # at least, all result vectors should be in the trace 62 | assert set(I.ravel().tolist()) <= set(trace) 63 | -------------------------------------------------------------------------------- /custom_invlist_cpp/Makefile: -------------------------------------------------------------------------------- 1 | # Copyright (c) Meta Platforms, Inc. and affiliates. 2 | # All rights reserved. 3 | # 4 | # This source code is licensed under the license found in the 5 | # LICENSE file in the root directory of this source tree. 
6 |
7 | PYTHON_INCLUDE := $(shell python -c "import distutils.sysconfig ; print(distutils.sysconfig.get_python_inc())")
8 | CONDA_INCLUDE := ${CONDA_PREFIX}/include
9 | AVX2_LIB := ${CONDA_PREFIX}/lib/libfaiss_avx2.so
10 |
11 | SDSL_PATH=/private/home/matthijs/src/test/sdsl-lite/installed/
12 | SDSL_INC=-I${SDSL_PATH}/include
13 | SDSL_LIB=-L${SDSL_PATH}/lib -lsdsl -ldivsufsort -ldivsufsort64
14 |
15 |
16 |
17 | COPT= -Wall -fPIC -g -O3 -std=c++17 -fopenmp -mavx2
18 | # for debugging
19 | # COPT= -Wall -fPIC -g -std=c++17 -fopenmp
20 |
21 | all: test
22 |
23 | custom_invlists_wrap.cxx custom_invlists.py: custom_invlists.swig custom_invlists_impl.h
24 | 	swig -c++ -python -I${CONDA_INCLUDE} custom_invlists.swig
25 |
26 | .cpp.o:
27 | 	g++ ${COPT} -c $< -I${CONDA_INCLUDE} ${SDSL_INC}
28 |
29 | codec.o: codec.cpp codec.h
30 |
31 | custom_invlists_impl.o: custom_invlists_impl.cpp custom_invlists_impl.h
32 |
33 | _custom_invlists.so: custom_invlists_wrap.cxx custom_invlists_impl.o codec.o
34 | 	g++ ${COPT} \
35 | 	$^ \
36 | 	-shared \
37 | 	-o $@ \
38 | 	-I${PYTHON_INCLUDE} -I${CONDA_INCLUDE} ${SDSL_INC} \
39 | 	${AVX2_LIB} ${SDSL_LIB}
40 |
41 | test: _custom_invlists.so custom_invlists_wrap.cxx custom_invlists.py
42 | 	python -c "import faiss, custom_invlists" && \
43 | 	g++ codec.cpp test_codec.cpp -O4 -o test_codec && \
44 | 	./test_codec && \
45 | 	python -m unittest -v test_compressed_ivfs.py
46 |
47 | clean:
48 | 	rm -f custom_invlists_wrap.cxx _custom_invlists.so custom_invlists_impl.o codec.o
49 |
50 | .PHONY: clean test
51 | -------------------------------------------------------------------------------- /custom_invlist_cpp/README.md: --------------------------------------------------------------------------------
1 | # What is this?
2 |
3 | This directory contains the module `custom_invlists`, which integrates with Faiss to compress the ids in IVF indexes.
4 |
5 | `custom_invlists` contains:
6 |
7 | - `search_IVF_defer_id_decoding`: an IVF search function that does not collect the result IDs immediately, but instead collects (invlist, offset) pairs and converts them to IDs once the search has finished -- this avoids decompressing invlists unnecessarily
8 |
9 | - four alternative `InvertedLists` objects that compress the IDs of the lists. The index itself still handles the vector compression.
10 |
11 | ## How to compile
12 |
13 | The module depends on Faiss, SDSL and succinct.
14 |
15 | See [here](../alt-graph-index) for how to install Faiss and SWIG.
16 |
17 | The [SDSL library](https://github.com/simongog/sdsl-lite) needs to be compiled and installed.
18 | To install, go to some directory and run
19 | ```
20 | git clone https://github.com/simongog/sdsl-lite.git
21 | cd sdsl-lite/
22 | bash install.sh $PWD/installed
23 | ```
24 | then adjust the path `SDSL_PATH` in the Makefile.
25 |
26 | The succinct library is a header-only library provided as a git submodule (make sure to check out submodules).
27 | succinct does not need to be configured or compiled; it is enough to skip its CMake configuration and write the configuration header by hand:
28 | ```
29 | cat > ../succinct/succinct_config.hpp << EOF
30 | #pragma once
31 | #define SUCCINCT_USE_LIBCXX 1
32 | #define SUCCINCT_USE_INTRINSICS 1
33 | #define SUCCINCT_USE_POPCNT 1
34 | EOF
35 | ```
36 | With these three dependencies in place, just compile `custom_invlists` with `make`.
37 | This will also run a few sanity checks.
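
## Usage sketch

A minimal example of how the pieces fit together once the module is built. This is an illustrative sketch (mirroring what `bench_invlists.py` and `search_ivf_qinco.py` do), with a synthetic dataset and arbitrary parameters:

```python
import numpy as np
import faiss
import custom_invlists  # importing also attaches search_defer_id_decoding to faiss.IndexIVF

d = 32
xb = np.random.rand(10_000, d).astype("float32")
xq = np.random.rand(5, d).astype("float32")

index = faiss.index_factory(d, "IVF64,Flat")
index.train(xb)
index.add(xb)
index.parallel_mode = 3  # required by the deferred-decoding search

# wrap the existing invlists with a compressed-id version (Elias-Fano here)
il = custom_invlists.CompressedIDInvertedListsEliasFano(index.invlists)
index.own_invlists = False  # as in search_ivf_qinco.py: keep the original invlists alive
index.replace_invlists(il, False)  # False: the index does not take ownership of il

D, I = index.search_defer_id_decoding(xq, k=10, decode_1by1=True)
print("compressed ids:", il.compressed_ids_size_in_bytes, "bytes")
```

`decode_1by1=True` uses `get_single_id` for each result, which is efficient for the structures with random access (packed bits, Elias-Fano, wavelet tree); the ROC/Fenwick-tree lists must decompress a whole list and so run with `decode_1by1=False`.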
38 | 39 | -------------------------------------------------------------------------------- /custom_invlist_cpp/__pycache__/test_compressed_ivfs.cpython-310-pytest-7.4.0.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/facebookresearch/vector_db_id_compression/46eb4860cb181bace769607e6dc3d82369a3acb5/custom_invlist_cpp/__pycache__/test_compressed_ivfs.cpython-310-pytest-7.4.0.pyc -------------------------------------------------------------------------------- /custom_invlist_cpp/bench_invlists.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) Meta Platforms, Inc. and affiliates. 2 | # All rights reserved. 3 | # 4 | # This source code is licensed under the license found in the 5 | # LICENSE file in the root directory of this source tree. 6 | 7 | import datetime 8 | import sys 9 | import time 10 | from pathlib import Path 11 | 12 | import custom_invlists 13 | import faiss 14 | import pandas as pd 15 | from faiss.contrib.datasets import DatasetDeep1B, DatasetSIFT1M, SyntheticDataset 16 | 17 | from ..qinco_datasets import DatasetFB_ssnpp 18 | 19 | AVAILABLE_COMPRESSED_IVFS = { 20 | "elias-fano": custom_invlists.CompressedIDInvertedListsEliasFano, 21 | "roc": custom_invlists.CompressedIDInvertedListsFenwickTree, 22 | "packed-bits": custom_invlists.CompressedIDInvertedListsPackedBits, 23 | "wavelet-tree": custom_invlists.CompressedIDInvertedListsWaveletTree, 24 | "ref": None, # must be last 25 | } 26 | 27 | 28 | def get_ids_size(dataset, invlist, comp_method): 29 | if comp_method is None: 30 | return 8 * dataset.nb 31 | else: 32 | return invlist.compressed_ids_size_in_bytes 33 | 34 | 35 | def get_overhead_size(dataset, invlist, comp_method): 36 | if comp_method is None: 37 | return 0 38 | elif comp_method in ["roc", "elias-fano"]: 39 | return invlist.overhead_in_bytes 40 | 41 | 42 | if __name__ == "__main__": 43 | 44 | dataset_idx = int(sys.argv[1]) 45 | index_str = str(sys.argv[2]) 46 | fb_ssnpp_dir = None 47 | 48 | now = datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S-%f") 49 | 50 | AVAILABLE_DATASETS = [ 51 | (SyntheticDataset, dict(d=32, nt=10_000, nq=1, nb=1_000)), 52 | (DatasetSIFT1M, {}), 53 | (DatasetDeep1B, dict(nb=10**6)), 54 | (DatasetFB_ssnpp, dict(basedir=fb_ssnpp_dir)), 55 | ] 56 | 57 | compression_methods = list(AVAILABLE_COMPRESSED_IVFS.keys())[:-1] 58 | 59 | search_time_params = dict( 60 | k=[20], 61 | nq=[None], 62 | nprobe=[1, 4, 16], 63 | ) 64 | num_runs = 100 65 | 66 | results = [] 67 | i = 0 68 | 69 | dataset_cls, dataset_kwargs = AVAILABLE_DATASETS[dataset_idx] 70 | dataset = dataset_cls(**dataset_kwargs) 71 | 72 | csv_path = Path( 73 | f"results-online-ivf/ivf-results-{now}-{index_str}-{dataset_cls.__name__}.csv".replace( 74 | ",", "_" 75 | ).replace( 76 | " ", "_" 77 | ) 78 | ) 79 | csv_path.parent.mkdir(parents=True, exist_ok=True) 80 | 81 | index = faiss.index_factory(dataset.d, index_str) 82 | index.train(dataset.get_train()) 83 | database = dataset.get_database() 84 | index.add(database) 85 | 86 | # for deferred decoding 87 | index.parallel_mode = 3 88 | 89 | # precompute invlists for compression methods 90 | invlists_comp = { 91 | comp_method: AVAILABLE_COMPRESSED_IVFS[comp_method](index.invlists) 92 | for comp_method in compression_methods 93 | } 94 | 95 | # first run will be with no compression 96 | for comp_method in [None, *compression_methods]: 97 | if comp_method is None: 98 | invlist = index.invlists 99 | else: 100 | invlist = 
invlists_comp[comp_method] 101 | index.replace_invlists(invlist, False) 102 | 103 | decode_1by1 = comp_method in ("wavelet-tree", "packed-bits", None) 104 | 105 | for k in search_time_params["k"]: 106 | for nq in search_time_params["nq"]: 107 | for nprobe in search_time_params["nprobe"]: 108 | index.nprobe = nprobe 109 | queries = dataset.get_queries()[:nq] 110 | print(queries.shape, flush=True) 111 | 112 | for run_id in range(num_runs): 113 | t0 = time.time() 114 | index.search_defer_id_decoding( 115 | queries, k=k, decode_1by1=decode_1by1 116 | ) 117 | t1 = time.time() 118 | dt_search = t1 - t0 119 | 120 | results.append( 121 | { 122 | "dt_search": dt_search, 123 | "nprobe": nprobe, 124 | "run_id": run_id, 125 | "index_str": index_str, 126 | "k": k, 127 | "nq": queries.shape[0], 128 | "comp_method": comp_method or "ref", 129 | "dataset": type(dataset).__name__, 130 | "ids_size": get_ids_size(dataset, invlist, comp_method), 131 | "overhead_size": get_overhead_size( 132 | dataset, invlist, comp_method 133 | ), 134 | "nb": dataset.nb, 135 | "nt": dataset.nt, 136 | } 137 | ) 138 | i += 1 139 | print(results[-1], flush=True) 140 | 141 | df = pd.DataFrame(results) 142 | df.to_csv(csv_path, index=False) 143 | print( 144 | f"Saved to {csv_path} with {i} entries", 145 | flush=True, 146 | ) 147 | print(df, flush=True) 148 | -------------------------------------------------------------------------------- /custom_invlist_cpp/codec.cpp: -------------------------------------------------------------------------------- 1 | // Copyright (c) Meta Platforms, Inc. and affiliates. 2 | // All rights reserved. 3 | // 4 | // This source code is licensed under the license found in the 5 | // LICENSE file in the root directory of this source tree. 6 | 7 | #include "codec.h" 8 | 9 | #include 10 | #include 11 | #include 12 | #include 13 | #include 14 | #include 15 | #include 16 | #include "../fenwick_tree_cpp/src/fenwick_tree.h" 17 | using namespace std; 18 | 19 | constexpr uint64_t rans_l = (uint64_t)1 << 31; 20 | 21 | uint64_t pop_with_finer_precision(ANSState &ans_state, uint64_t nmax) 22 | { 23 | uint64_t head_0 = ans_state.get_head(); 24 | 25 | if (head_0 >= nmax * ((rans_l / nmax) << 32)) 26 | { 27 | ans_state.extend_stack(head_0 & 0xffffffff); 28 | head_0 >>= 32; 29 | } 30 | 31 | uint64_t cfs = head_0 % nmax; 32 | uint64_t head = head_0 / nmax; 33 | 34 | if (head_0 < rans_l) 35 | { 36 | uint64_t new_head = ans_state.stack_slice(); 37 | head = new_head | (head << 32); 38 | } 39 | 40 | ans_state.set_head(head); 41 | return cfs; 42 | } 43 | 44 | void push_with_finer_precision(ANSState &ans_state, size_t symbol, uint64_t nmax) 45 | { 46 | uint64_t head_0 = ans_state.get_head(); 47 | 48 | if (head_0 >= ((rans_l / nmax) << 32)) 49 | { 50 | ans_state.extend_stack(head_0 & 0xffffffff); 51 | head_0 >>= 32; 52 | } 53 | 54 | uint64_t head = head_0 * nmax + symbol; 55 | 56 | if (head < rans_l) 57 | { 58 | uint64_t new_head = ans_state.stack_slice(); 59 | head = new_head | (head << 32); 60 | } 61 | 62 | ans_state.set_head(head); 63 | } 64 | 65 | void vrans_push(ANSState &state, uint64_t start, int precision) 66 | { 67 | uint64_t head = state.get_head(); 68 | // not enough room in the head, push 32 bits to the stack 69 | if (head >= (rans_l >> precision) << 32) 70 | { 71 | state.extend_stack(head & 0xffffffff); 72 | head >>= 32; 73 | } 74 | uint64_t head2 = (head << precision) + start; 75 | state.set_head(head2); 76 | } 77 | 78 | uint64_t vrans_pop(ANSState &state, int precision) 79 | { 80 | uint64_t head_0 = 
state.get_head(); 81 | uint64_t cfs = head_0 & (((uint64_t)1 << precision) - 1); 82 | uint64_t head = head_0 >> precision; 83 | if (head < rans_l) 84 | { 85 | uint64_t new_head = state.stack_slice(); 86 | head = (head << 32) | new_head; 87 | } 88 | state.set_head(head); 89 | return cfs; 90 | } 91 | 92 | void codec_push(ANSState &state, uint64_t symbol, int precision) 93 | { 94 | // encode by 16-bit slices 95 | for (int lower = 0; lower < 64; lower += 16) 96 | { 97 | uint64_t s = (symbol >> lower) & 0xffff; 98 | int p = precision - lower; 99 | if (p < 0) 100 | p = 0; 101 | if (p > 16) 102 | p = 16; 103 | vrans_push(state, s, p); 104 | } 105 | } 106 | 107 | uint64_t codec_pop(ANSState &state, int precision) 108 | { 109 | uint64_t symbol = 0; 110 | for (int lower = 48; lower >= 0; lower -= 16) 111 | { 112 | int p = precision - lower; 113 | if (p < 0) 114 | p = 0; 115 | if (p > 16) 116 | p = 16; 117 | uint64_t s = vrans_pop(state, p); 118 | symbol = (symbol << 16) | s; 119 | } 120 | return symbol; 121 | } 122 | 123 | void compress(size_t n, const uint64_t *data, ANSState &state, int precision) 124 | { 125 | FenwickTree ftree; 126 | for (size_t i = 0; i < n; i++) 127 | { 128 | ftree.insert_then_forward_lookup(data[i]); 129 | } 130 | 131 | for (size_t i = 0; i < n; i++) 132 | { 133 | uint32_t nmax = n - i; 134 | size_t index = pop_with_finer_precision(state, nmax); 135 | auto range = ftree.reverse_lookup_then_remove(index); 136 | codec_push(state, range.ftree->symbol, precision); 137 | } 138 | } 139 | 140 | void decompress(ANSState &state, size_t n, uint64_t *data, int precision) 141 | { 142 | FenwickTree ftree; 143 | 144 | for (size_t i = 0; i < n; i++) 145 | { 146 | uint64_t symbol = codec_pop(state, precision); 147 | auto range = ftree.insert_then_forward_lookup(symbol); 148 | uint32_t nmax = i + 1; 149 | push_with_finer_precision(state, range.start, nmax); 150 | data[n - i - 1] = range.ftree->symbol; 151 | } 152 | } -------------------------------------------------------------------------------- /custom_invlist_cpp/codec.h: -------------------------------------------------------------------------------- 1 | // Copyright (c) Meta Platforms, Inc. and affiliates. 2 | // All rights reserved. 3 | // 4 | // This source code is licensed under the license found in the 5 | // LICENSE file in the root directory of this source tree. 
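// Round-trip usage sketch (illustrative; see test_codec.cpp for the real test).
// The codec stores a set of distinct ids; decompress() returns the same values,
// but not necessarily in the insertion order (random-order coding deliberately
// does not pay for the order):
//
//   ANSState state;
//   std::vector<uint64_t> ids = {1235, 4902, 1778, 3666};
//   compress(ids.size(), ids.data(), state, /*precision=*/13);
//   std::vector<uint64_t> out(ids.size());
//   decompress(state, out.size(), out.data(), /*precision=*/13);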
6 |
7 | #include <cstdint>
8 | #include <random>
9 | #include <vector>
10 |
11 | #pragma once
12 |
13 | struct ANSState {
14 |     uint64_t head = uint64_t(1) << 31;
15 |     std::vector<uint32_t> stack;
16 |     std::mt19937 mt;
17 |
18 |     ANSState(): mt(1234) {}
19 |
20 |     void extend_stack(uint32_t x) {
21 |         stack.push_back(x);
22 |     }
23 |
24 |     void set_head(uint64_t x) {
25 |         head = x;
26 |     }
27 |
28 |     uint64_t get_head() const {
29 |         return head;
30 |     }
31 |
32 |     uint32_t stack_slice() {
33 |         if (!stack.empty()) {
34 |             uint32_t ret = stack.back();
35 |             stack.pop_back();
36 |             return ret;
37 |         } else {
38 |             return mt();
39 |         }
40 |     }
41 |
42 |     size_t size() {
43 |         return 8 + stack.size() * sizeof(uint32_t);
44 |     }
45 | };
46 |
47 | void compress(size_t n, const uint64_t *data, ANSState & state, int precision);
48 | void decompress(ANSState & state, size_t n, uint64_t *data, int precision);
49 | uint64_t pop_with_finer_precision(ANSState &ans_state, uint64_t nmax);
50 | void codec_push(ANSState &state, uint64_t symbol, int precision);
51 | void push_with_finer_precision(ANSState &ans_state, size_t symbol, uint64_t nmax);
52 | uint64_t codec_pop(ANSState &state, int precision); -------------------------------------------------------------------------------- /custom_invlist_cpp/custom_invlists.swig: --------------------------------------------------------------------------------
1 | // Copyright (c) Meta Platforms, Inc. and affiliates.
2 | // All rights reserved.
3 | //
4 | // This source code is licensed under the license found in the
5 | // LICENSE file in the root directory of this source tree.
6 |
7 | %module custom_invlists;
8 |
9 | // to get uint32_t and friends
10 | %include <stdint.i>
11 |
12 | // Put C++ includes here
13 | %{
14 |
15 |
16 | #include
17 | #include
18 | #include "codec.h"
19 | #include "custom_invlists_impl.h"
20 | #include
21 | #include
22 | #include
23 | #include
24 | #include
25 |
26 | %}
27 |
28 | #define FAISS_API
29 |
30 | %import(module="faiss") "faiss/MetricType.h"
31 | %import(module="faiss") "faiss/invlists/InvertedLists.h"
32 | %import(module="faiss") "faiss/Index.h"
33 | %import(module="faiss") "faiss/IndexIVF.h"
34 |
35 |
36 |
37 | // This is important to release GIL and do Faiss exception handling
38 | %exception {
39 |     Py_BEGIN_ALLOW_THREADS
40 |     try {
41 |         $action
42 |     } catch(faiss::FaissException & e) {
43 |         PyEval_RestoreThread(_save);
44 |
45 |         if (PyErr_Occurred()) {
46 |             // some previous code already set the error type.
47 | } else { 48 | PyErr_SetString(PyExc_RuntimeError, e.what()); 49 | } 50 | SWIG_fail; 51 | } catch(std::bad_alloc & ba) { 52 | PyEval_RestoreThread(_save); 53 | PyErr_SetString(PyExc_MemoryError, "std::bad_alloc"); 54 | SWIG_fail; 55 | } 56 | Py_END_ALLOW_THREADS 57 | } 58 | 59 | %ignore CompressedIDInvertedListsWaveletTree::wt; 60 | 61 | %include "custom_invlists_impl.h" 62 | 63 | %inline %{ 64 | 65 | // untyped is a work-around for a type mismatch, should be fixed when 66 | // https://www.internalfb.com/diff/D63991471 lands 67 | void search_IVF_defer_id_decoding_untyped( 68 | const faiss::IndexIVF & index, 69 | faiss::idx_t n, 70 | const float *x, 71 | int k, 72 | float *distances, 73 | void *labels, 74 | bool decode_1by1, 75 | void *codes, 76 | bool include_listno) { 77 | 78 | search_IVF_defer_id_decoding( 79 | index, n, x, k, distances, (faiss::idx_t*)labels, decode_1by1, 80 | (uint8_t*)codes, include_listno); 81 | } 82 | 83 | 84 | %} 85 | 86 | %pythoncode %{ 87 | 88 | import numpy as np 89 | import faiss 90 | 91 | 92 | def search_IVF_defer_id_decoding(self, x, k, decode_1by1=False, return_codes=0): 93 | """ 94 | decode_1by1 = use get_single_id instead of get_ids (if random access is efficient) 95 | return_codes = 0: no, 1: yes, without listnos, 2: yes, with listnos 96 | """ 97 | 98 | x = np.ascontiguousarray(x, dtype="float32") 99 | n, d = x.shape 100 | assert d == self.d 101 | D = np.zeros((n, k), dtype="float32") 102 | I = np.zeros((n, k), dtype="int64") 103 | 104 | if return_codes: 105 | code_size_1 = self.code_size 106 | if return_codes == 2: 107 | code_size_1 += self.coarse_code_size() 108 | codes = np.zeros((n, k, code_size_1), dtype="uint8") 109 | codes_ptr = faiss.swig_ptr(codes) 110 | else: 111 | codes_ptr = None 112 | 113 | search_IVF_defer_id_decoding_untyped( 114 | self, n, faiss.swig_ptr(x), k, 115 | faiss.swig_ptr(D), faiss.swig_ptr(I), 116 | decode_1by1, codes_ptr, return_codes == 2 117 | ) 118 | 119 | if return_codes: 120 | return D, I, codes 121 | else: 122 | return D, I 123 | 124 | faiss.IndexIVF.search_defer_id_decoding = search_IVF_defer_id_decoding 125 | 126 | 127 | %} 128 | -------------------------------------------------------------------------------- /custom_invlist_cpp/custom_invlists_impl.cpp: -------------------------------------------------------------------------------- 1 | // Copyright (c) Meta Platforms, Inc. and affiliates. 2 | // All rights reserved. 3 | // 4 | // This source code is licensed under the license found in the 5 | // LICENSE file in the root directory of this source tree. 
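// Overview of the id-compression schemes implemented below:
//  - PackedBits:   every id stored on ceil(log2(ntotal + 1)) bits
//  - FenwickTree:  ANS-based random-order coding (ROC); the within-list order
//                  of the (id, code) pairs is itself used to carry information
//  - EliasFano:    ids sorted per list and stored as a monotone sequence
//  - WaveletTree:  a single sdsl wavelet tree over the id -> list_no mapping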
6 | 7 | #include "custom_invlists_impl.h" 8 | 9 | #include 10 | #include 11 | 12 | #include "../fenwick_tree_cpp/src/fenwick_tree.h" 13 | 14 | 15 | /********************************************************************************* 16 | * InvertedListsArrayCodes 17 | *********************************************************************************/ 18 | 19 | InvertedListsArrayCodes::InvertedListsArrayCodes(const faiss::InvertedLists & il): 20 | ReadOnlyInvertedLists(il.nlist, il.code_size) {} 21 | 22 | 23 | size_t InvertedListsArrayCodes::list_size(size_t list_no) const { 24 | return codes_all[list_no].size() / code_size; 25 | } 26 | 27 | const uint8_t* InvertedListsArrayCodes::get_codes(size_t list_no) const { 28 | return codes_all[list_no].data(); 29 | } 30 | 31 | /********************************************************************************* 32 | * Packed bits 33 | *********************************************************************************/ 34 | 35 | uint64_t BitstringReader_get_bits(const faiss::BitstringReader &bs, size_t i, int nbit) { 36 | assert(bs.code_size * 8 >= nbit + i); 37 | // nb of available bits in i / 8 38 | int na = 8 - (i & 7); 39 | // get available bits in current byte 40 | uint64_t res = bs.code[i >> 3] >> (i & 7); 41 | if (nbit <= na) { 42 | res &= (1 << nbit) - 1; 43 | return res; 44 | } else { 45 | int ofs = na; 46 | size_t j = (i >> 3) + 1; 47 | nbit -= na; 48 | while (nbit > 8) { 49 | res |= ((uint64_t)bs.code[j++]) << ofs; 50 | ofs += 8; 51 | nbit -= 8; // TODO remove nbit 52 | } 53 | uint64_t last_byte = bs.code[j]; 54 | last_byte &= (1 << nbit) - 1; 55 | res |= last_byte << ofs; 56 | return res; 57 | } 58 | } 59 | 60 | using idx_t = faiss::idx_t; 61 | 62 | 63 | 64 | CompressedIDInvertedListsPackedBits::CompressedIDInvertedListsPackedBits(const faiss::InvertedLists & il): 65 | InvertedListsArrayCodes(il) 66 | { 67 | 68 | bits = 0; 69 | size_t ntotal = il.compute_ntotal(); 70 | while((1 << bits) < ntotal + 1) bits++; 71 | codes_all.resize(nlist); 72 | ids_all.resize(nlist); 73 | compressed_ids_size_in_bytes = 0; 74 | for (size_t list_no = 0; list_no < nlist; list_no++) { 75 | size_t ls = il.list_size(list_no); // number of elements in voronoi cell 76 | ScopedIds ids_in(&il, list_no); 77 | ScopedCodes codes(&il, list_no); 78 | 79 | const uint64_t* ids_data = (const uint64_t*)ids_in.get(); 80 | const uint8_t* codes_data = (const uint8_t*)codes.get(); 81 | 82 | ids_all[list_no].resize((ls * bits + 7) / 8); 83 | faiss::BitstringWriter bs(ids_all[list_no].data(), ids_all[list_no].size()); 84 | compressed_ids_size_in_bytes += ids_all[list_no].size(); 85 | 86 | for(size_t i = 0; i < ls; i++) { 87 | FAISS_THROW_IF_NOT(ids_in[i] >= 0 && ids_in[i] < ntotal); 88 | bs.write(ids_in[i], bits); 89 | } 90 | 91 | codes_all[list_no].resize(code_size * ls); 92 | memcpy(codes_all[list_no].data(), codes.get(), codes_all[list_no].size()); 93 | } 94 | } 95 | 96 | const faiss::idx_t* CompressedIDInvertedListsPackedBits::get_ids(size_t list_no) const { 97 | size_t ls = list_size(list_no); 98 | idx_t *the_ids = new idx_t[ls]; 99 | faiss::BitstringReader bs(ids_all[list_no].data(), ids_all[list_no].size()); 100 | 101 | for(size_t i = 0; i < ls; i++) { 102 | the_ids[i] = bs.read(bits); 103 | } 104 | return the_ids; 105 | } 106 | 107 | 108 | faiss::idx_t CompressedIDInvertedListsPackedBits::get_single_id(size_t list_no, size_t offset) const { 109 | size_t ls = list_size(list_no); 110 | faiss::BitstringReader bs(ids_all[list_no].data(), ids_all[list_no].size()); 111 | return 
BitstringReader_get_bits(bs, offset * bits, bits); 112 | 113 | } 114 | 115 | 116 | void CompressedIDInvertedListsPackedBits::release_ids(size_t list_no, const idx_t* ids_in) const { 117 | delete [] ids_in; 118 | } 119 | 120 | 121 | namespace { 122 | /* 123 | void get_sorted_invlist( 124 | const faiss::InvertedLists & il, size_t list_no, 125 | ) 126 | */ 127 | } // anonymous namespace 128 | 129 | /********************************************************************************* 130 | * Fenwick Tree 131 | *********************************************************************************/ 132 | 133 | CompressedIDInvertedListsFenwickTree::CompressedIDInvertedListsFenwickTree(const faiss::InvertedLists & il): 134 | InvertedListsArrayCodes(il) 135 | { 136 | 137 | // First element is the id value, second element is a pointer to a code. 138 | // Note, however, that only the ids are compressed into the ANS state. 139 | // The FenwickTree is over this tuple datatype only to facilitate reordering of the codes at compression time. 140 | using SymbolType = std::tuple; 141 | 142 | ans_states.resize(nlist); // nlist = number of voronoi cells 143 | codes_all.resize(nlist); 144 | id_symbol_precision.resize(nlist); 145 | 146 | 147 | #pragma omp parallel for 148 | for (size_t list_no = 0; list_no < nlist; list_no++) { 149 | size_t ls = il.list_size(list_no); // number of elements in voronoi cell 150 | std::vector codes_reordered; 151 | 152 | if (ls == 0) { 153 | continue; 154 | } 155 | 156 | ScopedIds ids_in(&il, list_no); 157 | ScopedCodes codes(&il, list_no); 158 | 159 | const uint64_t* ids_data = (const uint64_t*)ids_in.get(); 160 | const uint8_t* codes_data = (const uint8_t*)codes.get(); 161 | FenwickTree ftree; 162 | 163 | int max_id = *std::max_element(ids_data, ids_data + ls); 164 | id_symbol_precision[list_no] = (uint64_t)std::ceil(std::log2(max_id)); 165 | 166 | // Populate fenwick tree in random order to increase balancedness 167 | std::vector indices(ls); 168 | std::iota(indices.begin(), indices.end(), 0); 169 | std::random_device rd; 170 | std::mt19937 g(rd()); 171 | std::shuffle(indices.begin(), indices.end(), g); 172 | 173 | for (int i : indices) { 174 | SymbolType symbol = std::make_tuple(ids_data[i], codes_data + i * code_size); 175 | ftree.insert_then_forward_lookup(symbol); 176 | } 177 | 178 | for (size_t i = 0; i < ls; i++) { 179 | // Sample, without replacement, an element from the FenwickTree. 180 | uint32_t nmax = ls - i; 181 | size_t index = pop_with_finer_precision(ans_states[list_no], nmax); 182 | Range range = ftree.reverse_lookup_then_remove(index); 183 | 184 | // Encode id. 185 | uint64_t id = std::get<0>(range.ftree->symbol); 186 | codec_push(ans_states[list_no], id, id_symbol_precision[list_no]); 187 | 188 | // Encode codes (no compression atm). 
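            // The order in which (id, code) pairs leave the Fenwick tree is
            // dictated by the index popped from the ANS state above, so the
            // stored permutation itself carries information: this is the
            // random-order-coding (ROC) trick that saves roughly log2(ls!)
            // bits per list compared to encoding the ids in a fixed order.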
189 |             const uint8_t* p = std::get<1>(range.ftree->symbol);
190 |             codes_reordered.insert(codes_reordered.end(), p, p + code_size);
191 |
192 |         }
193 |         codes_all[list_no] = codes_reordered;
194 |     }
195 |
196 |     compressed_ids_size_in_bytes = 0;
197 |     codes_size_in_bytes = 0;
198 |     for (size_t list_no = 0; list_no < il.nlist; list_no++) {
199 |         if (list_size(list_no) == 0) {
200 |             continue; // let's pretend no memory is used here
201 |         }
202 |         compressed_ids_size_in_bytes += ans_states[list_no].size();
203 |         // count only this list's codes (iterating over all of codes_all here
204 |         // would re-count every list once per non-empty list)
205 |         codes_size_in_bytes += codes_all[list_no].size() * sizeof(uint8_t);
206 |     }
207 |
208 | }
209 |
210 | const faiss::idx_t* CompressedIDInvertedListsFenwickTree::get_ids(size_t list_no) const {
211 |     size_t ls = list_size(list_no);
212 |     if (ls == 0) {
213 |         return nullptr;
214 |     }
215 |     idx_t *the_ids = new idx_t[ls];
216 |     ANSState copy(ans_states[list_no]);
217 |     decompress(copy, ls, (uint64_t*)the_ids, id_symbol_precision[list_no]);
218 |     return the_ids;
219 | }
220 |
221 | void CompressedIDInvertedListsFenwickTree::release_ids(size_t list_no, const idx_t* ids_in) const {
222 |     delete [] ids_in;
223 | }
224 |
225 | /*********************************************************************************
226 |  * Elias-Fano
227 |  *********************************************************************************/
228 |
229 | CompressedIDInvertedListsEliasFano::CompressedIDInvertedListsEliasFano(const faiss::InvertedLists & il):
230 |     InvertedListsArrayCodes(il)
231 | {
232 |     codes_all.resize(nlist);
233 |     ef_bitstreams.resize(nlist);
234 | #pragma omp parallel for
235 |     for (size_t list_no = 0; list_no < nlist; list_no++) {
236 |         size_t ls = il.list_size(list_no); // number of elements in voronoi cell
237 |
238 |         if (ls == 0) {
239 |             continue;
240 |         }
241 |
242 |         std::vector<uint8_t> codes_reordered;
243 |         std::vector<uint64_t> ids_reordered;
244 |
245 |         ScopedIds ids_in(&il, list_no);
246 |         ScopedCodes codes(&il, list_no);
247 |
248 |         const uint64_t* ids_data = (uint64_t*)ids_in.get();
249 |         const uint8_t* codes_data = (uint8_t*)codes.get();
250 |
251 |         // Sort codes and ids, together
252 |         auto pairs = canonicalize_order_inplace(ids_data, codes_data, ls);
253 |         for (size_t i = 0; i < ls; ++i) {
254 |             ids_reordered.push_back(pairs[i].first);
255 |             const uint8_t* p = pairs[i].second;
256 |             codes_reordered.insert(codes_reordered.end(), p, p + code_size);
257 |         }
258 |
259 |         codes_all[list_no] = codes_reordered;
260 |
261 |         // Compress ids with Elias-Fano
262 |         uint64_t max_id = *std::max_element(ids_reordered.begin(), ids_reordered.end());
263 |         succinct::elias_fano::elias_fano_builder ef_builder(max_id, ls);
264 |         for (size_t i=0; i < ls; i++)
265 |         {
266 |             ef_builder.push_back(ids_reordered[i]);
267 |         }
268 |         succinct::elias_fano* ef_ptr = new succinct::elias_fano(&ef_builder);
269 |         ef_bitstreams[list_no] = ef_ptr;
270 |     }
271 |
272 |     compressed_ids_size_in_bytes = 0;
273 |     codes_size_in_bytes = 0;
274 |     for (size_t list_no = 0; list_no < il.nlist; list_no++) {
275 |         succinct::elias_fano* ef_ptr = ef_bitstreams[list_no];
276 |         if (!ef_ptr) continue;
277 |         compressed_ids_size_in_bytes += ef_ptr->m_low_bits.size() + ef_ptr->m_high_bits.size();
278 |         // as above, count only this list's codes
279 |         codes_size_in_bytes += codes_all[list_no].size() * sizeof(uint8_t);
280 |
281 |     }
282 |     compressed_ids_size_in_bytes /= 8; // the bit-vector sizes are in bits
283 |
284 | }
285 |
286 | CompressedIDInvertedListsEliasFano::~CompressedIDInvertedListsEliasFano() {
287 |
for (auto ef_ptr : ef_bitstreams) { 288 | delete ef_ptr; 289 | } 290 | } 291 | 292 | const faiss::idx_t* CompressedIDInvertedListsEliasFano::get_ids(size_t list_no) const { 293 | size_t ls = list_size(list_no); 294 | if (ls == 0) { 295 | return nullptr; 296 | } 297 | idx_t *the_ids = new idx_t[ls]; 298 | const auto& ef = ef_bitstreams[list_no]; 299 | 300 | /* 301 | for (size_t i=0; i < ls; i++) { 302 | the_ids[i] = ef->select(i); 303 | }*/ 304 | 305 | succinct::elias_fano::select_enumerator it(*ef, 0); 306 | for (size_t i=0; i < ls; i++) { 307 | the_ids[i] = it.next(); 308 | } 309 | 310 | return the_ids; 311 | } 312 | 313 | 314 | faiss::idx_t CompressedIDInvertedListsEliasFano::get_single_id(size_t list_no, size_t offset) const { 315 | const auto& ef = ef_bitstreams[list_no]; 316 | 317 | return ef->select(offset); 318 | } 319 | 320 | void CompressedIDInvertedListsEliasFano::release_ids(size_t list_no, const idx_t* ids_in) const { 321 | delete [] ids_in; 322 | } 323 | 324 | std::vector> 325 | CompressedIDInvertedListsEliasFano::canonicalize_order_inplace( 326 | const uint64_t* ids_data, 327 | const uint8_t* codes_data, 328 | size_t ls 329 | ) { 330 | std::vector> pairs; 331 | for (size_t i = 0; i < ls; ++i) { 332 | pairs.push_back(std::make_pair(ids_data[i], codes_data + i * code_size)); 333 | } 334 | 335 | // Sort the pairs based on the id 336 | std::sort(pairs.begin(), pairs.end()); 337 | 338 | return pairs; 339 | } 340 | 341 | /********************************************************************************* 342 | * Wavelet Tree 343 | *********************************************************************************/ 344 | 345 | 346 | CompressedIDInvertedListsWaveletTree::CompressedIDInvertedListsWaveletTree(const faiss::InvertedLists & il, int wt_type): 347 | InvertedListsArrayCodes(il), wt_type(wt_type) { 348 | assert(wt_type == 0 || wt_type == 1); 349 | size_t ntotal = il.compute_ntotal(); 350 | sdsl::int_vector<> list_nos(ntotal); 351 | codes_all.resize(nlist); 352 | for (size_t list_no = 0; list_no < nlist; list_no++) { 353 | size_t ls = il.list_size(list_no); // number of elements in voronoi cell 354 | ScopedIds ids_in(&il, list_no); 355 | ScopedCodes codes(&il, list_no); 356 | const idx_t* ids_data = ids_in.get(); 357 | idx_t prev_id = -1; 358 | for (idx_t i = 0; i < ls; i++) { 359 | assert(ids_data[i] > prev_id); // assume ordered 360 | assert(ids_data[i] < ntotal); 361 | list_nos[ids_data[i]] = list_no; 362 | prev_id = ids_data[i]; 363 | } 364 | codes_all[list_no].resize(code_size * ls); 365 | memcpy(codes_all[list_no].data(), codes.get(), codes_all[list_no].size()); 366 | } 367 | if (wt_type == 0) { 368 | sdsl::construct_im(wt, list_nos); 369 | compressed_ids_size_in_bytes = sdsl::size_in_bytes(wt); 370 | } else { 371 | sdsl::construct_im(wt_compressed, list_nos); 372 | compressed_ids_size_in_bytes = sdsl::size_in_bytes(wt_compressed); 373 | } 374 | 375 | } 376 | 377 | idx_t CompressedIDInvertedListsWaveletTree::get_single_id(size_t list_no, size_t offset) const { 378 | return wt_type == 0 ? wt.select(offset + 1, list_no) : wt_compressed.select(offset + 1, list_no); 379 | } 380 | 381 | const idx_t* CompressedIDInvertedListsWaveletTree::get_ids(size_t list_no) const { 382 | size_t ls = list_size(list_no); 383 | 384 | idx_t *the_ids = new idx_t[ls]; 385 | 386 | // there is probably an iterator for this as well... 
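    // each iteration below is an independent wavelet-tree select,
    // costing O(log nlist) per id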
387 | for (size_t i=0; i < ls; i++) { 388 | the_ids[i] = get_single_id(list_no, i); 389 | } 390 | 391 | return the_ids; 392 | } 393 | 394 | 395 | void CompressedIDInvertedListsWaveletTree::release_ids(size_t list_no, const idx_t* ids_in) const { 396 | delete [] ids_in; 397 | } 398 | 399 | /********************************************************************************* 400 | * Deferred decoding 401 | *********************************************************************************/ 402 | 403 | 404 | using idx_t = int64_t; 405 | using namespace faiss; 406 | 407 | void search_IVF_defer_id_decoding( 408 | const IndexIVF & index, 409 | idx_t n, 410 | const float *x, 411 | int k, 412 | float *distances, 413 | idx_t *labels, 414 | bool decode_1by1, 415 | uint8_t* codes, 416 | bool include_listno) { 417 | std::unique_ptr Dq(new float[n * index.nprobe]); 418 | std::unique_ptr Iq(new idx_t[n * index.nprobe]); 419 | 420 | FAISS_THROW_IF_NOT_MSG( 421 | index.parallel_mode == 3, 422 | "set the parallel mode to 3 otherwise search will be single-threaded"); 423 | 424 | index.quantizer->search( 425 | n, x, index.nprobe, Dq.get(), Iq.get()); 426 | 427 | index.search_preassigned( 428 | n, x, k, Iq.get(), Dq.get(), distances, labels, true); 429 | 430 | const InvertedLists *invlists = index.invlists; 431 | 432 | 433 | if (codes) { 434 | // return the codes tables 435 | size_t code_size = index.code_size; 436 | size_t code_size_1 = code_size; 437 | if (include_listno) { 438 | code_size_1 += index.coarse_code_size(); 439 | } 440 | 441 | #pragma omp parallel for if (n * k > 1000) 442 | for (idx_t ij = 0; ij < n * k; ij++) { 443 | idx_t key = labels[ij]; 444 | uint8_t* code1 = codes + ij * code_size_1; 445 | 446 | if (key < 0) { 447 | // Fill with 0xff 448 | memset(code1, -1, code_size_1); 449 | } else { 450 | int list_no = lo_listno(key); 451 | int offset = lo_offset(key); 452 | const uint8_t* cc = invlists->get_single_code(list_no, offset); 453 | 454 | if (include_listno) { 455 | index.encode_listno(list_no, code1); 456 | code1 += code_size_1 - code_size; 457 | } 458 | memcpy(code1, cc, code_size); 459 | } 460 | } 461 | 462 | } 463 | 464 | 465 | // perform the id translation 466 | if (decode_1by1) { 467 | #pragma omp parallel for 468 | for (size_t i = 0; i < n * k; i++) { 469 | idx_t & l = labels[i]; 470 | if (l >= 0) { 471 | l = invlists->get_single_id(lo_listno(l), lo_offset(l)); 472 | } 473 | } 474 | return; 475 | } 476 | 477 | // collect queries with the same listno 478 | std::unordered_map counts; 479 | size_t nv = 0; 480 | // maybe we could parallelize this 481 | for (size_t i = 0; i < n * k; i++) { 482 | if (labels[i] >= 0) { 483 | counts[lo_listno(labels[i])] += 1; 484 | nv++; 485 | } 486 | } 487 | std::unordered_map offsets; 488 | std::vector lists; // useful only for openmp loops 489 | size_t ofs = 0; 490 | for (auto it : counts) { 491 | offsets[it.first] = ofs; 492 | lists.push_back(it.first); 493 | ofs += it.second; 494 | } 495 | // maybe we could parallelize this 496 | std::unique_ptr resno(new idx_t[nv]); 497 | for (size_t i = 0; i < n * k; i++) { 498 | if (labels[i] >= 0) { 499 | size_t &ofs = offsets[lo_listno(labels[i])]; 500 | resno[ofs++] = i; 501 | } 502 | } 503 | 504 | // perform translation per list 505 | // for (auto it : offsets) { 506 | // idx_t list_no = it.first; 507 | // size_t end = it.second; 508 | #pragma omp parallel for 509 | for (size_t i = 0; i < lists.size(); i++) { 510 | idx_t list_no = lists[i]; 511 | size_t end = offsets[list_no]; 512 | size_t begin = end - 
counts[list_no]; 513 | InvertedLists::ScopedIds sids(invlists, list_no); 514 | const idx_t *ids = sids.get(); 515 | for (size_t j = begin; j < end; j++) { 516 | idx_t r = resno[j]; 517 | idx_t l = labels[r]; 518 | assert(lo_listno(l) == list_no); 519 | assert(lo_offset(l) < invlists->list_size(list_no)); 520 | /* 521 | printf("r=%ld, l=%ld, list_no=%ld, lo_offset=%ld id=%ld\n", 522 | r, l, it.first, lo_offset(l), ids[lo_offset(l)]); */ 523 | labels[r] = ids[lo_offset(l)]; 524 | } 525 | } 526 | } -------------------------------------------------------------------------------- /custom_invlist_cpp/custom_invlists_impl.h: -------------------------------------------------------------------------------- 1 | // Copyright (c) Meta Platforms, Inc. and affiliates. 2 | // All rights reserved. 3 | // 4 | // This source code is licensed under the license found in the 5 | // LICENSE file in the root directory of this source tree. 6 | 7 | #include 8 | #include 9 | 10 | #include "codec.h" 11 | 12 | // need to access m_high_bits and m_low_bits 13 | #define protected public 14 | #include "../succinct/elias_fano.hpp" 15 | 16 | #include 17 | #include 18 | 19 | uint64_t BitstringReader_get_bits(const faiss::BitstringReader &bs, size_t i, int nbit); 20 | 21 | /// all the invlist impelmentations just store the codes in an array, so put this in common 22 | struct InvertedListsArrayCodes: faiss::ReadOnlyInvertedLists { 23 | using idx_t = faiss::idx_t; 24 | 25 | std::vector> codes_all; 26 | 27 | size_t list_size(size_t list_no) const override; 28 | 29 | const uint8_t* get_codes(size_t list_no) const override; 30 | 31 | InvertedListsArrayCodes(const faiss::InvertedLists & il); 32 | 33 | }; 34 | 35 | 36 | /// Packed bits: just allocate the necessary bits to store ids from 0 to notal-1 37 | struct CompressedIDInvertedListsPackedBits: InvertedListsArrayCodes { 38 | using idx_t = faiss::idx_t; 39 | 40 | /// bits per identifier 41 | int bits = 0; 42 | std::vector> ids_all; 43 | 44 | size_t compressed_ids_size_in_bytes = 0; 45 | 46 | CompressedIDInvertedListsPackedBits(const faiss::InvertedLists & il); 47 | 48 | const idx_t* get_ids(size_t list_no) const override; 49 | 50 | idx_t get_single_id(size_t list_no, size_t offset) const override; 51 | 52 | void release_ids(size_t list_no, const idx_t* ids_in) const override; 53 | }; 54 | 55 | 56 | struct CompressedIDInvertedListsFenwickTree: InvertedListsArrayCodes { 57 | using idx_t = faiss::idx_t; 58 | 59 | std::vector ans_states; 60 | size_t compressed_ids_size_in_bytes = 0; 61 | size_t codes_size_in_bytes = 0; 62 | std::vector id_symbol_precision; 63 | size_t overhead_in_bytes = 0; 64 | 65 | CompressedIDInvertedListsFenwickTree(const faiss::InvertedLists & il); 66 | 67 | const idx_t* get_ids(size_t list_no) const override; 68 | 69 | void release_ids(size_t list_no, const idx_t* ids_in) const override; 70 | }; 71 | 72 | struct CompressedIDInvertedListsEliasFano: InvertedListsArrayCodes { 73 | using idx_t = faiss::idx_t; 74 | size_t overhead_in_bytes = 0; 75 | 76 | std::vector ef_bitstreams; 77 | 78 | size_t compressed_ids_size_in_bytes = 0; 79 | size_t codes_size_in_bytes = 0; 80 | 81 | CompressedIDInvertedListsEliasFano(const faiss::InvertedLists & il); 82 | 83 | ~CompressedIDInvertedListsEliasFano(); 84 | 85 | const idx_t* get_ids(size_t list_no) const override; 86 | 87 | idx_t get_single_id(size_t list_no, size_t offset) const override; 88 | 89 | void release_ids(size_t list_no, const idx_t* ids_in) const override; 90 | 91 | std::vector> 92 | canonicalize_order_inplace( 93 
| const uint64_t* ids_data, 94 | const uint8_t* codes_data, 95 | size_t ls 96 | ); 97 | 98 | }; 99 | 100 | struct CompressedIDInvertedListsWaveletTree: InvertedListsArrayCodes { 101 | using idx_t = faiss::idx_t; 102 | 103 | // the tree structure 104 | sdsl::wt_int wt; 105 | sdsl::wt_int> wt_compressed; 106 | 107 | size_t compressed_ids_size_in_bytes = 0; 108 | 109 | int wt_type = 0; 110 | 111 | // vector_type == 0: use wt 112 | // vector_type == 1: use wt_compressed 113 | CompressedIDInvertedListsWaveletTree(const faiss::InvertedLists & il, int wt_type=0); 114 | 115 | // ~CompressedIDInvertedListsWaveletTree(); 116 | 117 | const idx_t* get_ids(size_t list_no) const override; 118 | 119 | idx_t get_single_id(size_t list_no, size_t offset) const override; 120 | 121 | void release_ids(size_t list_no, const idx_t* ids_in) const override; 122 | 123 | 124 | }; 125 | 126 | /* Search an IVF. 127 | * Do not collect the result IDs immediately, but instead collect (invlist, offset) pairs 128 | * and do the conversion to IDs when the search finished -- avoids decompressing invlists 129 | * without necessity */ 130 | void search_IVF_defer_id_decoding( 131 | const faiss::IndexIVF & index, 132 | faiss::idx_t n, 133 | const float *x, 134 | int k, 135 | float *distances, 136 | faiss::idx_t *labels, 137 | bool decode_1by1 = false, 138 | uint8_t* codes = nullptr, 139 | bool include_listno = false); 140 | 141 | -------------------------------------------------------------------------------- /custom_invlist_cpp/search_ivf_qinco.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) Meta Platforms, Inc. and affiliates. 2 | # All rights reserved. 3 | # 4 | # This source code is licensed under the license found in the 5 | # LICENSE file in the root directory of this source tree. 
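# Example invocation (illustrative only; model / index paths are placeholders):
#
#   python search_ivf_qinco.py --db bigann1M --model qinco_model.pt \
#       --todo search --index populated.faissindex \
#       --id_compression elias-fano --defer_id_decoding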
6 | 7 | import argparse 8 | import json 9 | import os 10 | import time 11 | import sys 12 | 13 | import faiss 14 | import numpy as np 15 | import torch 16 | from faiss.contrib.evaluation import OperatingPointsWithRanges 17 | 18 | import custom_invlists 19 | 20 | import Qinco.datasets as datasets 21 | from Qinco.codec_qinco import encode 22 | from Qinco.utils import ( 23 | add_to_ivfaq_index, 24 | compute_fixed_codebooks, 25 | mean_squared_error, 26 | reconstruct_from_fixed_codebooks, 27 | reconstruct_from_fixed_codebooks_parallel, 28 | ) 29 | 30 | # clone Qinco as a subdirectory (or link) 31 | sys.path.insert(0, "Qinco") 32 | 33 | #################################################################### 34 | # Centroids training phase 35 | #################################################################### 36 | 37 | 38 | def train_ivf_centroids(args, ds): 39 | print(f"load training vectors") 40 | xt = ds.get_train(args.nt) 41 | print(f" xt shape {xt.shape=:}") 42 | d = ds.d 43 | km = faiss.Kmeans( 44 | d, args.n_centroids, niter=args.kmeans_iter, verbose=True, gpu=True 45 | ) 46 | km.train(xt) 47 | return km.centroids 48 | 49 | 50 | #################################################################### 51 | # Codes training phase 52 | #################################################################### 53 | 54 | 55 | def run_train(args, model, ds, res): 56 | print(f"load training vectors") 57 | xt = ds.get_train(args.nt) 58 | print(f" xt shape {xt.shape=:}") 59 | d = ds.d 60 | M = model.M 61 | if args.xt_codes: 62 | print("loading pretrained codes from", args.xt_codes) 63 | xt_codes = np.load(args.xt_codes) 64 | print(f" shape", xt_codes.shape) 65 | assert xt_codes.shape == (len(xt), M + 1) 66 | else: 67 | print("encode trainset") 68 | t0 = time.time() 69 | xt_codes = encode(model, xt, bs=args.bs, is_float16=args.float16) 70 | res.t_xt_encode = time.time() - t0 71 | print(f" done in {res.t_xt_encode:.3f} s") 72 | 73 | print("get IVF centroids from model") 74 | ivf_codebook = model.codebook0.weight.cpu().numpy() * model.db_scale 75 | print(f"{ivf_codebook.shape=:} {ivf_codebook.max()=:g}") 76 | print("train fixed codebook on residuals") 77 | t0 = time.time() 78 | print(" compute residuals") 79 | xt_residuals = xt - ivf_codebook[xt_codes[:, 0]] 80 | print(" train") 81 | codebooks = compute_fixed_codebooks(xt_residuals, xt_codes[:, 1:]) 82 | res.t_train_codebook = time.time() - t0 83 | print(f"train done in {res.t_train_codebook:.2f} s") 84 | xt_fixed_recons = reconstruct_from_fixed_codebooks(xt_codes[:, 1:], codebooks) 85 | MSE = mean_squared_error(xt_fixed_recons, xt_residuals) 86 | res.MSE_train_residuals = MSE 87 | print(f"MSE on training set {MSE:g}") 88 | print("Train norms") 89 | # not really useful for _Nfloat encoding and even less for ST_decompress 90 | norms = ((xt_fixed_recons - xt_residuals) ** 2).sum(1) 91 | 92 | print("construct the index", args.index_key) 93 | index = faiss.index_factory(d, args.index_key) 94 | quantizer = faiss.downcast_index(index.quantizer) 95 | 96 | if args.quantizer_efConstruction > 0: 97 | print("set quantizer efConstruction to", args.quantizer_efConstruction) 98 | quantizer.hnsw.efConstruction = args.quantizer_efConstruction 99 | 100 | print("setting IVF centroids and RQ codebooks") 101 | t0 = time.time() 102 | print(" IVF centroids") 103 | assert ivf_codebook.shape[0] == index.nlist 104 | quantizer.add(ivf_codebook) 105 | print(" set codebook") 106 | assert codebooks.shape[0] == index.rq.M 107 | assert codebooks.shape[2] == index.rq.d 108 | rq_Ks = list(2 ** 
faiss.vector_to_array(index.rq.nbits)) 109 | assert rq_Ks == [codebooks.shape[1]] * index.rq.M 110 | faiss.copy_array_to_vector(codebooks.ravel(), index.rq.codebooks) 111 | print(" train norms") 112 | index.rq.train_norm(len(norms), faiss.swig_ptr(norms)) 113 | res.t_add_codebook = time.time() - t0 114 | index.rq.is_trained = True 115 | index.is_trained = True 116 | print(f"index ready in {res.t_add_codebook:.2f} s") 117 | 118 | return index 119 | 120 | 121 | #################################################################### 122 | # Adding phase 123 | #################################################################### 124 | 125 | 126 | def run_add(args, model, ds, index, res): 127 | quantizer = faiss.downcast_index(index.quantizer) 128 | codebooks = faiss.vector_to_array(index.rq.codebooks) 129 | ivf_codebook = quantizer.reconstruct_n() 130 | k = 1 << index.rq.nbits.at(0) 131 | M = index.rq.M 132 | d = ds.d 133 | codebooks = codebooks.reshape(M, k, d) 134 | 135 | if len(args.quantizer_efSearch) > 0: 136 | ef = args.quantizer_efSearch[0] 137 | print("set quantizer efSearch to", ef) 138 | quantizer.hnsw.efSearch = ef 139 | 140 | t0 = time.time() 141 | 142 | if args.xb_codes: 143 | print("adding from precomputed codes") 144 | 145 | def yield_codes(): 146 | for fname in args.xb_codes: 147 | print(f" [{time.time() - t0:.2f} s] load", fname) 148 | xb_codes = np.load(fname) 149 | yield xb_codes 150 | 151 | else: 152 | print("computing codes and adding") 153 | 154 | def yield_codes(): 155 | for xb in ds.database_iterator(bs=args.add_bs): 156 | print(f" [{time.time() - t0:.2f} s] encode batch", xb.shape) 157 | xb_codes = encode(model, xb, bs=args.bs, is_float16=args.float16) 158 | yield xb_codes 159 | 160 | i0 = 0 161 | for xb_codes in yield_codes(): 162 | print(f" codes shape", xb_codes.shape) 163 | assert xb_codes.shape[1] == M + 1 164 | i1 = i0 + xb_codes.shape[0] 165 | xb_fixed_recons = reconstruct_from_fixed_codebooks_parallel( 166 | xb_codes[:, 1:], codebooks, nt=args.nthreads 167 | ) 168 | # xb_residuals = xb - ivf_codebook[xb_codes[:, 0]] 169 | # MSE = mean_squared_error(xb_fixed_recons, xb_residuals) 170 | xb_norms = (xb_fixed_recons**2).sum(1) 171 | print(f" add {i0}:{i1}") 172 | add_to_ivfaq_index(index, xb_codes[:, 1:], xb_codes[:, 0], xb_norms, i_base=i0) 173 | i0 = i1 174 | 175 | assert index.ntotal == ds.nb 176 | res.t_add = time.time() - t0 177 | print(f"add done in {res.t_add:.3f} s") 178 | 179 | 180 | #################################################################### 181 | # Searching 182 | #################################################################### 183 | 184 | 185 | def run_search(args, ds, index, res): 186 | quantizer = faiss.downcast_index(index.quantizer) 187 | assert index.ntotal == ds.nb 188 | 189 | print("loading CPU version of the model") 190 | model_cpu = torch.load(args.model, map_location="cpu") 191 | 192 | db_scale = model_cpu.db_scale 193 | print(f" {db_scale=:g}") 194 | 195 | print("preparing index") 196 | index.parallel_mode 197 | index.parallel_mode = 3 198 | print("loading queries") 199 | xq = ds.get_queries() 200 | 201 | gt = ds.get_groundtruth() 202 | 203 | def compute_recalls(I): 204 | recalls = {} 205 | for rank in 1, 10, 100: 206 | recall = (I[:, :rank] == gt[:, :1]).sum() / gt.shape[0] 207 | recalls[rank] = float(recall) 208 | return recalls 209 | 210 | print(f" {xq.shape=:} {gt.shape=:}") 211 | 212 | res.search_results = ivf_real_res = [] 213 | cc = index.coarse_code_size() 214 | cc1 = index.sa_code_size() 215 | nq, d = xq.shape 216 | M = 
model_cpu.M 217 | listno_mask = index.nlist - 1 218 | print("start experiments") 219 | 220 | op = OperatingPointsWithRanges() 221 | op.add_range("nprobe", args.nprobe) 222 | if len(args.quantizer_efSearch) > 0: 223 | op.add_range("quantizer_efSearch", args.quantizer_efSearch) 224 | op.add_range("nshort", args.nshort) 225 | 226 | experiments = op.sample_experiments(args.n_autotune, rs=np.random.RandomState(123)) 227 | print(f"Total nb experiments {op.num_experiments()}, running {len(experiments)}") 228 | 229 | if args.redo_search > 1: 230 | print(f"redoing {args.redo_search} times") 231 | experiments = experiments * args.redo_search 232 | 233 | for cno in experiments: 234 | key = op.cno_to_key(cno) 235 | parameters = op.get_parameters(key) 236 | print(f"{cno=:4d} {str(parameters):50}", end=": ", flush=True) 237 | 238 | if args.n_autotune == 0: 239 | pass # don't optimize 240 | else: 241 | (max_perf, min_time) = op.predict_bounds(key) 242 | if not op.is_pareto_optimal(max_perf, min_time): 243 | print( 244 | f"SKIP, {max_perf=:.3f} {min_time=:.3f}", 245 | ) 246 | continue 247 | 248 | index.nprobe = parameters["nprobe"] 249 | nshort = parameters["nshort"] 250 | if "quantizer_efSearch" in parameters: 251 | quantizer.hnsw.efSearch = parameters["quantizer_efSearch"] 252 | 253 | t0 = time.time() 254 | if args.defer_id_decoding: 255 | decode_1by1 = args.id_decoding_1by1 256 | D, I, codes = index.search_defer_id_decoding( 257 | xq, nshort, decode_1by1=decode_1by1, return_codes=2) 258 | else: 259 | D, I, codes = index.search_and_return_codes(xq, nshort, include_listnos=True) 260 | t1 = time.time() 261 | 262 | # decode 263 | codes2 = codes.reshape(nshort * nq, cc1) 264 | 265 | codes_int32 = np.zeros((nshort * nq, M + 1), dtype="int32") 266 | if cc == 2: 267 | codes_int32[:, 0] = codes2[:, 0] | (codes2[:, 1].astype(np.int32) << 8) 268 | elif cc == 3: 269 | codes_int32[:, 0] = ( 270 | codes2[:, 0] 271 | | (codes2[:, 1].astype(np.int32) << 8) 272 | | (codes2[:, 2].astype(np.int32) << 16) 273 | ) 274 | else: 275 | raise NotImplementedError 276 | 277 | # to avoid decode errors on -1 (missing shortlist result) 278 | # will be caught later because the id is -1 as well. 
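# (masking with index.nlist - 1 assumes nlist is a power of two,
# e.g. the default 65536 centroids)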
279 | codes_int32[:, 0] &= listno_mask 280 | 281 | codes_int32[:, 1:] = codes2[:, cc : M + cc] 282 | with torch.no_grad(): 283 | shortlist = [] 284 | for i in range(0, len(codes_int32), args.bs): 285 | code_batch = torch.from_numpy(codes_int32[i : i + args.bs]) 286 | x_batch = model_cpu.decode(code_batch) 287 | shortlist.append(x_batch.numpy() * db_scale) 288 | 289 | t2 = time.time() 290 | shortlist = np.vstack(shortlist) 291 | shortlist = shortlist.reshape(nq, nshort, d) 292 | D_refined = ((xq.reshape(nq, 1, d) - shortlist) ** 2).sum(2) 293 | 294 | idx = np.argsort(D_refined, axis=1) 295 | I_refined = np.take_along_axis(I, idx[:, :100], axis=1) 296 | t3 = time.time() 297 | 298 | recalls_orig = compute_recalls(I) 299 | recalls = compute_recalls(I_refined) 300 | 301 | print(f"times {t1-t0:.3f}s + {t2-t1:.3f}s + {t3-t2:.3f}s " f"recalls {recalls}") 302 | 303 | op.add_operating_point(key, recalls[1], t3 - t0) 304 | 305 | ivf_real_res.append( 306 | dict( 307 | parameters=parameters, 308 | cno=cno, 309 | t_search=t1 - t0, 310 | t_decode=t2 - t1, 311 | t_dis=t3 - t2, 312 | recalls=recalls, 313 | recalls_orig=recalls_orig, 314 | ) 315 | ) 316 | 317 | 318 | #################################################################### 319 | # Driver 320 | #################################################################### 321 | 322 | 323 | def main(): 324 | parser = argparse.ArgumentParser() 325 | 326 | def aa(*args, **kwargs): 327 | group.add_argument(*args, **kwargs) 328 | 329 | ivf_models_dir = "/checkpoint/ihuijben/ImplicitCodebookPredIVF/231026_fixH256/" 330 | 331 | group = parser.add_argument_group("what and how to compute") 332 | aa( 333 | "--todo", 334 | default=["train", "add", "search"], 335 | choices=["train_centroids", "train", "add", "search"], 336 | nargs="+", 337 | help="what to do", 338 | ) 339 | aa("--bs", default=4096, type=int, help="batch size") 340 | aa("--add_bs", default=1000_000, type=int, help="batch size at add time") 341 | aa("--nthreads", default=32, type=int, help="number of OMP threads") 342 | 343 | group = parser.add_argument_group("QINCo model") 344 | aa("--model", default="", help="Model to load") 345 | aa("--device", default="cuda:0", help="pytorch device") 346 | aa("--float16", default=False, action="store_true", help="convert model to float16") 347 | 348 | group = parser.add_argument_group("database") 349 | aa( 350 | "--db", 351 | default="bigann1M", 352 | choices=datasets.available_names, 353 | help="Dataset to handle", 354 | ) 355 | aa("--nt", default=1000_000, type=int, help="nb training vectors to use") 356 | aa("--xt_codes", default="", help="npy file with pre-encoded training vectors") 357 | aa( 358 | "--xb_codes", 359 | default=[], 360 | nargs="*", 361 | help="npy file with pre-encoded database vectors", 362 | ) 363 | 364 | group = parser.add_argument_group("IVF centroids training") 365 | aa("--n_centroids", default=65536, type=int, help="number of centroids to train") 366 | aa("--kmeans_iter", default=10, type=int, help="number of k-means iterations") 367 | aa("--IVF_centroids", default="", help="where to store the IVF centroids") 368 | 369 | group = parser.add_argument_group("build index") 370 | aa("--index_key", default="IVF65536,RQ8x8_Nfloat", help="Faiss index key") 371 | aa("--trained_index", default="", help="load / store trained index") 372 | aa("--index", default="", help="load / store full index") 373 | aa("--quantizer_efConstruction", default=-1, type=int) 374 | aa( 375 | "--quantizer_efSearch", 376 | default=[], 377 | nargs="+", 378 | type=int, 379 | 
help="efSearch for the quantizer (used at add and search time)", 380 | ) 381 | 382 | group = parser.add_argument_group("search parameters to try") 383 | aa( 384 | "--id_compression", 385 | default="none", 386 | choices="none packed-bits elias-fano roc wavelet-tree wavelet-tree-1".split(), 387 | help="How to compress the ids", 388 | ) 389 | aa("--defer_id_decoding", default=False, action="store_true") 390 | # aa("--id_decoding_1by1", default=False, action="store_true") 391 | aa("--redo_search", default=1, type=int, help="numer of times to redo the search (to stabilize timings)") 392 | 393 | aa( 394 | "--nprobe", 395 | default=[1, 4, 16, 64, 256, 1024], 396 | type=int, 397 | nargs="+", 398 | help="nprobe settings to try", 399 | ) 400 | aa( 401 | "--nshort", 402 | default=[10, 20, 50, 100, 200, 500, 1000], 403 | type=int, 404 | nargs="+", 405 | help="shortlist sizes to try", 406 | ) 407 | aa( 408 | "--n_autotune", 409 | default=0, 410 | type=int, 411 | help="number of autotune experiments (0=exhaustive exploration)", 412 | ) 413 | 414 | args = parser.parse_args() 415 | 416 | # set automatically 417 | args.id_decoding_1by1 = args.id_compression != "roc" 418 | 419 | print("args:", args) 420 | os.system( 421 | 'echo -n "nb processors "; ' 422 | "cat /proc/cpuinfo | grep ^processor | wc -l; " 423 | 'cat /proc/cpuinfo | grep ^"model name" | tail -1' 424 | ) 425 | os.system("nvidia-smi") 426 | 427 | # object to collect various stats 428 | res = argparse.Namespace() 429 | res.args = args.__dict__ 430 | res.cpu_model = [l for l in open("/proc/cpuinfo", "r") if "model name" in l][0] 431 | 432 | res.cuda_devices = [ 433 | torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count()) 434 | ] 435 | 436 | if args.nthreads != -1: 437 | print(f"set nb threads to {args.nthreads}") 438 | faiss.omp_set_num_threads(args.nthreads) 439 | 440 | ds = datasets.dataset_from_name(args.db) 441 | print(f"Prepared dataset {ds}") 442 | 443 | if "train_centroids" in args.todo: 444 | print("======== k-means clustering to compute IVF centroids") 445 | ivf_centroids = train_ivf_centroids(args, ds) 446 | 447 | if args.IVF_centroids: 448 | print("storing centroids in", args.IVF_centroids) 449 | np.save(args.IVF_centroids, ivf_centroids) 450 | 451 | # it does not make much sense to go further as the centroids 452 | # are an input of model training 453 | if args.todo == ["train_centroids"]: 454 | return 455 | 456 | print("loading model", args.model) 457 | model = torch.load(args.model) 458 | print(" database normalization factor", model.db_scale) 459 | model.eval() 460 | model.to(args.device) 461 | if args.float16: 462 | model.half() 463 | 464 | d = model.d 465 | k = model.K 466 | M = model.M 467 | 468 | index = None 469 | if "train" in args.todo: 470 | print("====================== training") 471 | 472 | index = run_train(args, model, ds, res) 473 | 474 | if args.trained_index: 475 | print("storing trained index in", args.trained_index) 476 | faiss.write_index(index, args.trained_index) 477 | 478 | if "add" in args.todo: 479 | print("====================== adding") 480 | 481 | if index is None and args.trained_index: 482 | print("loading pretrained index", args.trained_index) 483 | index = faiss.read_index(args.trained_index) 484 | elif index is None: 485 | raise RuntimeError("no pretrained index provided") 486 | 487 | run_add(args, model, ds, index, res) 488 | 489 | if args.index: 490 | print("storing index in", args.index) 491 | faiss.write_index(index, args.index) 492 | 493 | if "search" in args.todo: 494 | 
print("====================== searching") 495 | 496 | if index is None and args.index: 497 | print("loading pretrained index", args.index) 498 | index = faiss.read_index(args.index) 499 | elif index is None: 500 | raise RuntimeError("no index provided") 501 | 502 | if args.id_compression != "none": 503 | print("compressing ids with", args.id_compression) 504 | t0 = time.time() 505 | if args.id_compression == "packed-bits": 506 | il = custom_invlists.CompressedIDInvertedListsPackedBits(index.invlists) 507 | elif args.id_compression == "elias-fano": 508 | il = custom_invlists.CompressedIDInvertedListsEliasFano(index.invlists) 509 | elif args.id_compression == "roc": 510 | il = custom_invlists.CompressedIDInvertedListsFenwickTree(index.invlists) 511 | elif args.id_compression == "wavelet-tree": 512 | il = custom_invlists.CompressedIDInvertedListsWaveletTree(index.invlists) 513 | elif args.id_compression == "wavelet-tree-1": 514 | il = custom_invlists.CompressedIDInvertedListsWaveletTree(index.invlists, 1) 515 | else: 516 | assert False 517 | t1 = time.time() 518 | print(f"compressed ids size: {il.compressed_ids_size_in_bytes} bytes, compressed in {t1-t0:.3f} s") 519 | res.compressed_ids_size_in_bytes = il.compressed_ids_size_in_bytes 520 | res.id_compression_time = t1 - t0 521 | index.own_invlists 522 | index.own_invlists = False 523 | index.replace_invlists(il, False) 524 | 525 | run_search(args, ds, index, res) 526 | 527 | print("JSON results:", json.dumps(res.__dict__)) 528 | 529 | 530 | if __name__ == "__main__": 531 | main() 532 | -------------------------------------------------------------------------------- /custom_invlist_cpp/test_codec.cpp: -------------------------------------------------------------------------------- 1 | // Copyright (c) Meta Platforms, Inc. and affiliates. 2 | // All rights reserved. 3 | // 4 | // This source code is licensed under the license found in the 5 | // LICENSE file in the root directory of this source tree. 
6 | 
7 | /**
8 | 
9 | g++ codec.cpp test_codec.cpp -g -std=c++17 && ./a.out
10 | */
11 | 
12 | #include "codec.h"
13 | 
14 | #include <sys/time.h>
15 | #include <cstdio>
16 | #include <random>
17 | #include <set>
18 | 
19 | double getmillisecs() {
20 |     struct timeval tv;
21 |     gettimeofday(&tv, nullptr);
22 |     return tv.tv_sec * 1e3 + tv.tv_usec * 1e-3;
23 | }
24 | 
25 | int main_xx() {
26 |     uint64_t tab[] = {12351235, 49024902, 17781778, 36663666};
27 |     // uint64_t tab[] = {1235, 4902, 1778, 3666};
28 |     int n = 4;
29 |     int nbits = 26;
30 |     ANSState state;
31 | 
32 |     compress(n, tab, state, nbits);
33 | 
34 |     printf("head=%ld\nstack=[", state.head);
35 |     for(int i = 0; i < state.stack.size(); i++) {
36 |         printf("%u ", state.stack[i]);
37 |     }
38 |     printf("]\n");
39 | 
40 |     std::vector<uint64_t> tab2(n);
41 |     decompress(state, n, tab2.data(), nbits);
42 | 
43 |     printf("tab2=[");
44 |     // for(int i = 0; i < n; i++) {
45 |     //     printf("%u ", tab2[i]);
46 |     // }
47 |     printf("]\n");
48 | 
49 | 
50 |     return 0;
51 | }
52 | 
53 | 
54 | int main() {
55 |     /*
56 |     int n = 200;
57 |     int nbits = 9;
58 |     */
59 | 
60 |     int n = 65000;
61 |     int nbits = 20;
62 | 
63 |     for (int seed = 0; seed < 10; seed++) {
64 | 
65 |         std::vector<uint64_t> data(n);
66 |         std::mt19937 mt(seed);
67 | 
68 |         std::set<uint64_t> seen;
69 | 
70 |         assert(nbits < 32);
71 |         assert(n < (1 << nbits));
72 |         for(int i = 0; i < n; i++) {
73 |             uint64_t x;
74 |             for(;;) {
75 |                 x = mt() & ((1 << nbits) - 1);
76 |                 if (seen.count(x) == 0) {
77 |                     break;
78 |                 }
79 |             }
80 |             seen.insert(x);
81 |             data[i] = x;
82 |         }
83 | 
84 |         double t0 = getmillisecs();
85 | 
86 |         ANSState state;
87 |         compress(n, data.data(), state, nbits);
88 | 
89 |         double t1 = getmillisecs();
90 | 
91 |         size_t size = 8 + 4 * state.stack.size();
92 |         std::vector<uint64_t> tab2(n);
93 |         decompress(state, n, tab2.data(), nbits);
94 | 
95 |         double t2 = getmillisecs();
96 | 
97 |         printf("n=%d nbits=%d seed=%d encode %.3f ms decode %.3f ms size=%ld bytes (%.3f bit / id)\n",
98 |             n, nbits, seed, t1 - t0, t2 - t1,
99 |             size, size * 8 / float(n));
100 | 
101 |         std::set<uint64_t> seen2(tab2.begin(), tab2.end());
102 |         assert(seen == seen2);
103 |     }
104 | 
105 |     return 0;
106 | }
-------------------------------------------------------------------------------- /custom_invlist_cpp/test_compressed_ivfs.py: --------------------------------------------------------------------------------
1 | # Copyright (c) Meta Platforms, Inc. and affiliates.
2 | # All rights reserved.
3 | #
4 | # This source code is licensed under the license found in the
5 | # LICENSE file in the root directory of this source tree.
6 | 
7 | import logging
8 | import unittest
9 | from ctypes import c_uint64
10 | from pathlib import Path
11 | 
12 | import custom_invlists  # type: ignore
13 | import faiss  # type: ignore
14 | import numpy as np
15 | from faiss.contrib.datasets import SyntheticDataset  # type: ignore
16 | from faiss.contrib.inspect_tools import get_invlist  # type: ignore
17 | 
18 | LOGGER = logging.getLogger(Path(__file__).name)
19 | logging.basicConfig(level=logging.INFO)
20 | 
21 | 
22 | def make_compressed_wt(il):
23 |     return custom_invlists.CompressedIDInvertedListsWaveletTree(il, 1)
24 | 
25 | 
26 | class TestCompressedIDInvertedLists(unittest.TestCase):
27 | 
28 |     def test_packed_bits(self):
29 |         self.do_test(custom_invlists.CompressedIDInvertedListsPackedBits)
30 | 
31 |     def test_fenwick_tree(self):
32 |         self.do_test(custom_invlists.CompressedIDInvertedListsFenwickTree)
33 | 
34 |     def test_elias_fano(self):
35 |         self.do_test(custom_invlists.CompressedIDInvertedListsEliasFano)
36 | 
37 |     def test_wavelet_tree(self):
38 |         self.do_test(custom_invlists.CompressedIDInvertedListsWaveletTree)
39 | 
40 |     def test_wavelet_tree_compressed(self):
41 |         self.do_test(make_compressed_wt)
42 | 
43 |     def do_test(self, CompressedIVF):
44 |         ds = SyntheticDataset(d=4, nt=1_000, nb=100, nq=1)
45 |         database = ds.get_database()
46 |         queries = ds.get_queries()
47 |         k = 5
48 |         index_string = "IVF8,Flat"
49 | 
50 |         index = faiss.index_factory(ds.d, index_string)
51 |         index.train(ds.get_train())
52 |         index.add(database)
53 | 
54 |         LOGGER.info(f"TESTING {CompressedIVF}")
55 | 
56 |         index_comp = faiss.index_factory(ds.d, index_string)
57 |         index_comp.train(ds.get_train())
58 |         index_comp.add(database)
59 | 
60 |         # print(get_invlist(index.invlists, 0)[0])
61 | 
62 |         for c in range(index.nlist):
63 |             ids_comp = get_invlist(index_comp.invlists, c)[0]
64 |             ids_ref = get_invlist(index.invlists, c)[0]
65 |             # print(c, ids_comp, ids_ref)
66 |             assert np.all(np.sort(ids_comp) == ids_ref)
67 |         LOGGER.info(
68 |             "Clusters in index and index_comp contain the same elements, before compression"
69 |         )
70 | 
71 |         invlists_comp = CompressedIVF(index_comp.invlists)
72 |         index_comp.replace_invlists(invlists_comp, False)
73 | 
74 |         for c in range(index.nlist):
75 |             n = invlists_comp.list_size(c)
76 |             p = int(invlists_comp.get_ids(c))
77 |             ids_comp = np.ctypeslib.as_array((c_uint64 * n).from_address(p))
78 |             ids_ref = get_invlist(index.invlists, c)[0]
79 |             assert np.all(np.sort(ids_comp) == ids_ref)
80 |         LOGGER.info(
81 |             "Clusters in index and index_comp contain the same elements, after compression"
82 |         )
83 | 
84 |         _, Iref = index.search(queries, k)
85 |         _, Icomp = index_comp.search(queries, k)
86 |         np.testing.assert_array_equal(Iref, Icomp)
87 |         LOGGER.info(
88 |             "Search results are the same for compressed and uncompressed IVFs."
89 |         )
90 |         LOGGER.info(f"All tests passed for {CompressedIVF}!")
91 | 
92 | 
93 | class TestDeferredIVFDecoding(unittest.TestCase):
94 | 
95 |     def test_ivf(self):
96 | 
97 |         ds = SyntheticDataset(32, 10000, 10000, 10)
98 | 
99 |         index = faiss.index_factory(ds.d, "IVF32,PQ4np")
100 |         index.train(ds.get_train())
101 |         index.add(ds.get_database())
102 |         index.nprobe = 4
103 |         Dref, Iref = index.search(ds.get_queries(), 10)
104 | 
105 |         index.parallel_mode = 3
106 | 
107 |         D, I = index.search_defer_id_decoding(ds.get_queries(), 10)
108 | 
109 |         np.testing.assert_array_equal(I, Iref)
110 |         np.testing.assert_array_equal(D, Dref)
111 | 
112 |         # test return codes
113 |         D, I, codes = index.search_defer_id_decoding(
114 |             ds.get_queries(), 10, return_codes=2
115 |         )
116 | 
117 |         assert codes.shape == (ds.nq, 10, 5)
118 |         for q in range(ds.nq):
119 |             for ki in range(10):
120 |                 if I[q, ki] < 0:
121 |                     continue
122 |                 list_no = int(codes[q, ki, 0])
123 |                 code = codes[q, ki, 1:]
124 |                 il_ids, il_codes = get_invlist(index.invlists, list_no)
125 |                 offset = np.where(il_ids == I[q, ki])[0][0]
126 |                 assert np.all(code == il_codes[offset])
127 | 
128 |     def test_1by1_wavelet_tree(self):
129 |         self.do_1by1_test(custom_invlists.CompressedIDInvertedListsWaveletTree)
130 | 
131 |     def test_1by1_wavelet_tree_compressed(self):
132 |         self.do_1by1_test(make_compressed_wt)
133 | 
134 |     def test_1by1_packed_bits(self):
135 |         self.do_1by1_test(custom_invlists.CompressedIDInvertedListsPackedBits)
136 | 
137 |     def test_1by1_elias_fano(self):
138 |         self.do_1by1_test(custom_invlists.CompressedIDInvertedListsEliasFano)
139 | 
140 |     def do_1by1_test(self, CompressedIDInvertedLists):
141 |         ds = SyntheticDataset(32, 10000, 10000, 10)
142 | 
143 |         index = faiss.index_factory(ds.d, "IVF32,PQ4np")
144 |         index.train(ds.get_train())
145 |         index.add(ds.get_database())
146 |         index.nprobe = 4
147 |         Dref, Iref = index.search(ds.get_queries(), 10)
148 | 
149 |         invlists2 = CompressedIDInvertedLists(index.invlists)
150 |         index.replace_invlists(invlists2, False)
151 |         index.parallel_mode = 3
152 | 
153 |         D, I = index.search_defer_id_decoding(ds.get_queries(), 10, decode_1by1=True)
154 | 
155 |         np.testing.assert_array_equal(I, Iref)
156 |         np.testing.assert_array_equal(D, Dref)
157 | 
158 | 
159 | if __name__ == "__main__":
160 |     unittest.main()
161 | 
-------------------------------------------------------------------------------- /elias_fano.hpp: --------------------------------------------------------------------------------
1 | // Copyright (c) Meta Platforms, Inc. and affiliates.
2 | // All rights reserved.
3 | //
4 | // This source code is licensed under the license found in the
5 | // LICENSE file in the root directory of this source tree.
6 | 
7 | // modified from https://github.com/ot/succinct/blob/master/elias_fano.hpp
8 | 
9 | #pragma once
10 | 
11 | #include "bit_vector.hpp"
12 | #include "darray.hpp"
13 | 
14 | namespace succinct {
15 | 
16 | class elias_fano {
17 | public:
18 |     elias_fano()
19 |         : m_size(0)
20 |     {}
21 | 
22 |     struct elias_fano_builder {
23 |         elias_fano_builder(uint64_t n, uint64_t m)
24 |             : m_n(n)
25 |             , m_m(m)
26 |             , m_pos(0)
27 |             , m_last(0)
28 |             , m_l(uint8_t((m && n / m) ? broadword::msb(n / m) : 0))
29 |             , m_high_bits((m + 1) + (n >> m_l) + 1)
30 |         {
31 |             assert(m_l < 64); // for the correctness of low_mask
32 |             m_low_bits.reserve(m * m_l);
33 |         }
34 | 
35 |         inline void push_back(uint64_t i) {
36 |             assert(i >= m_last && i <= m_n);
37 |             m_last = i;
38 |             uint64_t low_mask = (1ULL << m_l) - 1;
39 | 
40 |             if (m_l) {
41 |                 m_low_bits.append_bits(i & low_mask, m_l);
42 |             }
43 |             m_high_bits.set((i >> m_l) + m_pos, 1);
44 |             ++m_pos;
45 |             assert(m_pos <= m_m); (void)m_m;
46 |         }
47 | 
48 |         friend class elias_fano;
49 |     public:
50 |         uint64_t m_n;
51 |         uint64_t m_m;
52 |         uint64_t m_pos;
53 |         uint64_t m_last;
54 |         uint8_t m_l;
55 |         bit_vector_builder m_high_bits;
56 |         bit_vector_builder m_low_bits;
57 |     };
58 | 
59 | 
60 |     elias_fano(bit_vector_builder* bvb, bool with_rank_index = true)
61 |     {
62 |         bit_vector_builder::bits_type& bits = bvb->move_bits();
63 |         uint64_t n = bvb->size();
64 | 
65 |         uint64_t m = 0;
66 |         for (size_t i = 0; i < bits.size(); ++i) {
67 |             m += broadword::popcount(bits[i]);
68 |         }
69 | 
70 |         bit_vector bv(bvb);
71 |         elias_fano_builder builder(n, m);
72 | 
73 |         uint64_t i = 0;
74 |         for (uint64_t pos = 0; pos < m; ++pos) {
75 |             i = bv.successor1(i);
76 |             builder.push_back(i);
77 |             ++i;
78 |         }
79 | 
80 |         build(builder, with_rank_index);
81 |     }
82 | 
83 |     elias_fano(elias_fano_builder* builder, bool with_rank_index = true)
84 |     {
85 |         num_elements = builder->m_m;
86 |         build(*builder, with_rank_index);
87 |     }
88 | 
89 |     template <typename Visitor>
90 |     void map(Visitor& visit) {
91 |         visit
92 |             (m_size, "m_size")
93 |             (m_high_bits, "m_high_bits")
94 |             (m_high_bits_d1, "m_high_bits_d1")
95 |             (m_high_bits_d0, "m_high_bits_d0")
96 |             (m_low_bits, "m_low_bits")
97 |             (m_l, "m_l")
98 |             ;
99 |     }
100 | 
101 |     void swap(elias_fano& other) {
102 |         std::swap(other.m_size, m_size);
103 |         other.m_high_bits.swap(m_high_bits);
104 |         other.m_high_bits_d1.swap(m_high_bits_d1);
105 |         other.m_high_bits_d0.swap(m_high_bits_d0);
106 |         other.m_low_bits.swap(m_low_bits);
107 |         std::swap(other.m_l, m_l);
108 |     }
109 | 
110 |     inline uint64_t size() const {
111 |         return m_size;
112 |     }
113 | 
114 |     inline uint64_t num_ones() const {
115 |         return m_high_bits_d1.num_positions();
116 |     }
117 | 
118 |     inline bool operator[](uint64_t pos) const {
119 |         assert(pos < size());
120 |         assert(m_high_bits_d0.num_positions()); // needs rank index
121 |         uint64_t h_rank = pos >> m_l;
122 |         uint64_t h_pos = m_high_bits_d0.select(m_high_bits, h_rank);
123 |         uint64_t rank = h_pos - h_rank;
124 |         uint64_t l_pos = pos & ((1ULL << m_l) - 1);
125 | 
126 |         while (h_pos > 0
127 |                && m_high_bits[h_pos - 1]) {
128 |             --rank;
129 |             --h_pos;
130 |             uint64_t cur_low_bits = m_low_bits.get_bits(rank * m_l, m_l);
131 |             if (cur_low_bits == l_pos) {
132 |                 return true;
133 |             } else if (cur_low_bits < l_pos) {
134 |                 return false;
135 |             }
136 |         }
137 | 
138 |         return false;
139 |     }
140 | 
141 |     inline uint64_t select(uint64_t n) const {
142 |         return
143 |             ((m_high_bits_d1.select(m_high_bits, n) - n) << m_l)
144 |             | m_low_bits.get_bits(n * m_l, m_l);
145 |     }
146 | 
147 |     inline uint64_t rank(uint64_t pos) const {
148 |         assert(pos <= m_size);
149 |         assert(m_high_bits_d0.num_positions()); // needs rank index
150 |         if (pos == size()) {
151 |             return num_ones();
152 |         }
153 | 
154 |         uint64_t h_rank = pos >> m_l;
155 |         uint64_t h_pos = m_high_bits_d0.select(m_high_bits, h_rank);
156 |         uint64_t rank = h_pos - h_rank;
157 |         uint64_t l_pos = pos & ((1ULL << m_l) - 1);
158 | 
159 |         while (h_pos > 0
160 |                && m_high_bits[h_pos - 1]
161 |                && m_low_bits.get_bits((rank - 1) * m_l, m_l) >= l_pos) {
162 |             --rank;
163 |             --h_pos;
164 |         }
165 | 
166 |         return rank;
167 |     }
168 | 
169 |     inline uint64_t predecessor1(uint64_t pos) const {
170 |         return select(rank(pos + 1) - 1);
171 |     }
172 | 
173 |     inline uint64_t successor1(uint64_t pos) const {
174 |         return select(rank(pos));
175 |     }
176 | 
177 | 
178 |     // Equivalent to select(n) - select(n - 1) (and select(0) for n = 0)
179 |     // Involves a linear search for predecessor in high bits.
180 |     // Efficient only if there are no large gaps in high bits
181 |     // XXX(ot): could make this adaptive
182 |     inline uint64_t delta(uint64_t n) const {
183 |         uint64_t high_val = m_high_bits_d1.select(m_high_bits, n);
184 |         uint64_t low_val = m_low_bits.get_bits(n * m_l, m_l);
185 |         if (n) {
186 |             return
187 |                 // need a + here instead of an | for carry
188 |                 ((high_val - m_high_bits.predecessor1(high_val - 1) - 1) << m_l)
189 |                 + low_val - m_low_bits.get_bits((n - 1) * m_l, m_l);
190 |         } else {
191 |             return
192 |                 ((high_val - n) << m_l)
193 |                 | low_val;
194 |         }
195 |     }
196 | 
197 | 
198 |     // same as delta()
199 |     inline std::pair<uint64_t, uint64_t> select_range(uint64_t n) const
200 |     {
201 |         assert(n + 1 < num_ones());
202 |         uint64_t high_val_b = m_high_bits_d1.select(m_high_bits, n);
203 |         uint64_t low_val_b = m_low_bits.get_bits(n * m_l, m_l);
204 |         uint64_t high_val_e = m_high_bits.successor1(high_val_b + 1);
205 |         uint64_t low_val_e = m_low_bits.get_bits((n + 1) * m_l, m_l);
206 |         return std::make_pair(((high_val_b - n) << m_l) | low_val_b,
207 |                               ((high_val_e - n - 1) << m_l) | low_val_e);
208 |     }
209 | 
210 |     struct select_enumerator {
211 | 
212 |         select_enumerator(elias_fano const& ef, uint64_t i)
213 |             : m_ef(&ef)
214 |             , m_i(i)
215 |             , m_l(ef.m_l)
216 |         {
217 |             m_low_mask = (uint64_t(1) << m_l) - 1;
218 |             m_low_buf = 0;
219 |             if (m_l) {
220 |                 m_chunks_in_word = 64 / m_l;
221 |                 m_chunks_avail = 0;
222 |             } else {
223 |                 m_chunks_in_word = 0;
224 |                 m_chunks_avail = m_ef->num_ones();
225 |             }
226 | 
227 |             if (!m_ef->num_ones()) return;
228 |             uint64_t pos = m_ef->m_high_bits_d1.select(m_ef->m_high_bits, m_i);
229 |             m_high_enum = bit_vector::unary_enumerator(m_ef->m_high_bits, pos);
230 |             assert(m_l < 64);
231 |         }
232 | 
233 |         uint64_t next() {
234 |             if (!m_chunks_avail--) {
235 |                 m_low_buf = m_ef->m_low_bits.get_word(m_i * m_l);
236 |                 m_chunks_avail = m_chunks_in_word - 1;
237 |             }
238 | 
239 |             uint64_t high = m_high_enum.next();
240 |             assert(high == m_ef->m_high_bits_d1.select(m_ef->m_high_bits, m_i));
241 |             uint64_t low = m_low_buf & m_low_mask;
242 |             uint64_t ret =
243 |                 ((high - m_i) << m_l)
244 |                 | low;
245 |             m_i += 1;
246 |             m_low_buf >>= m_l;
247 | 
248 |             return ret;
249 |         }
250 | 
251 |     public:
252 | 
253 |         elias_fano const* m_ef;
254 |         uint64_t m_i;
255 |         uint64_t m_l;
256 |         bit_vector::unary_enumerator m_high_enum;
257 |         uint64_t m_low_buf;
258 |         uint64_t m_low_mask;
259 |         uint64_t m_chunks_in_word;
260 |         uint64_t m_chunks_avail;
261 |     };
262 | 
263 | public:
264 |     void build(elias_fano_builder& builder, bool with_rank_index) {
265 |         m_size = builder.m_n;
266 |         m_l = builder.m_l;
267 |         bit_vector(&builder.m_high_bits).swap(m_high_bits);
268 |         darray1(m_high_bits).swap(m_high_bits_d1);
269 |         if (with_rank_index) {
270 |             darray0(m_high_bits).swap(m_high_bits_d0);
271 |         }
272 |         bit_vector(&builder.m_low_bits).swap(m_low_bits);
273 |     }
274 | 
275 |     uint64_t m_size;
276 |     uint64_t num_elements;
277 |     bit_vector m_high_bits;
278 |     darray1 m_high_bits_d1;
279 |     darray0 m_high_bits_d0;
280 |     bit_vector m_low_bits;
281 |     uint8_t m_l;
282 | };
283 | 
284 | }
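
For orientation, here is a small pure-Python sketch of the split that `elias_fano_builder` performs above. It is intuition only, not part of the library: `n`, `m` and `l` mirror the builder's `m_n`, `m_m` and `m_l` members, and the id values are illustrative.

```python
def elias_fano_split(ids, n):
    """Split sorted ids from a universe of size n into low/high halves,
    mirroring elias_fano_builder: l ~ msb(n / m), cf. the m_l initializer."""
    m = len(ids)
    l = max((n // m).bit_length() - 1, 0) if m else 0
    # fixed-width low halves, cf. m_low_bits.append_bits(i & low_mask, m_l)
    low = [i & ((1 << l) - 1) for i in ids]
    # one set bit per id in a unary-coded high-bits vector,
    # cf. m_high_bits.set((i >> m_l) + m_pos, 1) in push_back()
    high_positions = [(i >> l) + pos for pos, i in enumerate(ids)]
    return l, low, high_positions

l, low, high = elias_fano_split([3, 17, 42, 99], n=100)
# l = 4: each id keeps 4 explicit low bits; the high parts are unary-coded,
# so the whole set costs roughly m * (2 + log2(n / m)) bits.
```
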
-------------------------------------------------------------------------------- /fenwick_tree_cpp/Makefile: --------------------------------------------------------------------------------
1 | # Copyright (c) Meta Platforms, Inc. and affiliates.
2 | # All rights reserved.
3 | #
4 | # This source code is licensed under the license found in the
5 | # LICENSE file in the root directory of this source tree.
6 | 
7 | PYTHON_INCLUDE := $(shell python -c "import distutils.sysconfig ; print(distutils.sysconfig.get_python_inc())")
8 | CONDA_INCLUDE := ${CONDA_PREFIX}/include
9 | 
10 | all: test
11 | 
12 | lib/fenwick_tree_wrap.cxx: dirs
13 | 	swig -c++ -python -outdir lib -o lib/fenwick_tree_wrap.cxx src/fenwick_tree.i
14 | 
15 | lib/fenwick_tree.o: src/fenwick_tree.cpp dirs
16 | 	g++ -std=c++17 -O2 -fPIC -c src/fenwick_tree.cpp -o lib/fenwick_tree.o
17 | 
18 | lib/fenwick_tree_wrap.o: lib/fenwick_tree_wrap.cxx
19 | 	g++ -std=c++17 -O2 -fPIC -c lib/fenwick_tree_wrap.cxx -I${PYTHON_INCLUDE} -o lib/fenwick_tree_wrap.o
20 | 
21 | lib/_fenwick_tree.so: lib/fenwick_tree.o lib/fenwick_tree_wrap.o
22 | 	g++ -std=c++17 -shared lib/fenwick_tree.o lib/fenwick_tree_wrap.o -o lib/_fenwick_tree.so
23 | 
24 | dirs:
25 | 	mkdir -p src bin lib
26 | 
27 | bin/test_fenwick_tree: tests/test_fenwick_tree.cpp dirs
28 | 	g++ tests/test_fenwick_tree.cpp -o bin/test_fenwick_tree
29 | 
30 | test: bin/test_fenwick_tree lib/_fenwick_tree.so
31 | 	./bin/test_fenwick_tree
32 | 	python -B -m pytest tests/test_FenwickTree.py -v
33 | 
34 | clean:
35 | 	rm -rf bin/* lib/* tests/__pycache__
36 | 
37 | .PHONY: all clean dirs test
-------------------------------------------------------------------------------- /fenwick_tree_cpp/bin/test_fenwick_tree: --------------------------------------------------------------------------------
https://raw.githubusercontent.com/facebookresearch/vector_db_id_compression/46eb4860cb181bace769607e6dc3d82369a3acb5/fenwick_tree_cpp/bin/test_fenwick_tree
-------------------------------------------------------------------------------- /fenwick_tree_cpp/src/fenwick_tree.cpp: --------------------------------------------------------------------------------
1 | // Copyright (c) Meta Platforms, Inc. and affiliates.
2 | // All rights reserved.
3 | //
4 | // This source code is licensed under the license found in the
5 | // LICENSE file in the root directory of this source tree.
6 | 
7 | #include "fenwick_tree.h"
-------------------------------------------------------------------------------- /fenwick_tree_cpp/src/fenwick_tree.h: --------------------------------------------------------------------------------
1 | // Copyright (c) Meta Platforms, Inc. and affiliates.
2 | // All rights reserved.
3 | //
4 | // This source code is licensed under the license found in the
5 | // LICENSE file in the root directory of this source tree.
6 | 
7 | #ifndef FENWICK_TREE_H
8 | #define FENWICK_TREE_H
9 | #include <cstdint>
10 | #include <iostream>
11 | #include <stack>
12 | #include <string>
13 | #include <utility>
14 | #include <vector>
15 | using namespace std;
16 | 
17 | template <typename T>
18 | struct Range;
19 | 
20 | template <typename T>
21 | struct FenwickTree
22 | {
23 |     T symbol;
24 |     int size;
25 |     FenwickTree *left;
26 |     FenwickTree *right;
27 | 
28 |     FenwickTree(T symbol) : symbol(symbol), size(1), left(NULL), right(NULL) {}
29 |     FenwickTree() : symbol(T()), size(0), left(NULL), right(NULL) {}
30 |     ~FenwickTree()
31 |     {
32 |         if (left != NULL)
33 |         {
34 |             delete left;
35 |         }
36 |         if (right != NULL)
37 |         {
38 |             delete right;
39 |         }
40 |     }
41 | 
42 |     Range<T> insert_then_forward_lookup(T symbol)
43 |     {
44 |         FenwickTree *current = this;
45 |         int size_not_right;
46 |         int freq;
47 |         int start;
48 |         int start_offset = 0;
49 | 
50 |         if (this->size == 0)
51 |         {
52 |             this->size += 1;
53 |             this->symbol = symbol;
54 |             return Range<T>(this, 0, 1);
55 |         }
56 | 
57 |         while (true)
58 |         {
59 |             size_not_right = current->size - (current->right != NULL ? current->right->size : 0);
60 |             freq = size_not_right - (current->left != NULL ? current->left->size : 0);
61 |             start = size_not_right - freq;
62 |             current->size += 1;
63 | 
64 |             if (symbol < current->symbol)
65 |             {
66 |                 if (current->left == NULL)
67 |                 {
68 |                     current->left = new FenwickTree(symbol);
69 |                     return Range<T>(current->left, start_offset, 1);
70 |                 }
71 |                 else
72 |                 {
73 |                     current = current->left;
74 |                 }
75 |             }
76 |             else if (symbol > current->symbol)
77 |             {
78 |                 start_offset += size_not_right;
79 |                 if (current->right == NULL)
80 |                 {
81 |                     current->right = new FenwickTree(symbol);
82 |                     return Range<T>(current->right, start_offset, 1);
83 |                 }
84 |                 else
85 |                 {
86 |                     current = current->right;
87 |                 }
88 |             }
89 |             else
90 |             {
91 |                 return Range<T>(current, start + start_offset, freq + 1);
92 |             }
93 |         }
94 |     }
95 | 
96 |     Range<T> reverse_lookup_then_remove(int index)
97 |     {
98 |         FenwickTree *current = this;
99 |         FenwickTree *parent = NULL;
100 |         int size_not_right;
101 |         int freq;
102 |         int start;
103 |         bool went_left = false;
104 |         int start_offset = 0;
105 | 
106 |         while (true)
107 |         {
108 |             size_not_right = current->size - (current->right != NULL ? current->right->size : 0);
109 |             freq = size_not_right - (current->left != NULL ? current->left->size : 0);
110 |             start = size_not_right - freq;
111 | 
112 |             current->size -= 1;
113 |             if (index < start)
114 |             {
115 |                 went_left = true;
116 |                 parent = current;
117 |                 current = current->left;
118 |             }
119 |             else if (index >= start + freq)
120 |             {
121 |                 went_left = false;
122 |                 parent = current;
123 |                 current = current->right;
124 |                 index -= size_not_right;
125 |                 start_offset += size_not_right;
126 |             }
127 |             else
128 |             {
129 |                 if (current->size == 0 && went_left && parent != NULL)
130 |                 {
131 |                     parent->left = NULL;
132 |                 }
133 |                 else if (current->size == 0 && parent != NULL)
134 |                 {
135 |                     parent->right = NULL;
136 |                 }
137 |                 return Range<T>(current, start + start_offset, freq);
138 |             }
139 |         }
140 |     }
141 | 
142 |     vector<T> inorder_traversal()
143 |     {
144 |         vector<T> elements;
145 |         stack<FenwickTree *> stack;
146 |         FenwickTree *current = this;
147 | 
148 |         while (current != NULL || stack.empty() == false)
149 |         {
150 |             while (current != NULL)
151 |             {
152 |                 stack.push(current);
153 |                 current = current->left;
154 |             }
155 |             current = stack.top();
156 |             stack.pop();
157 |             int size_not_right = current->size - (current->right != NULL ? current->right->size : 0);
158 |             int freq = size_not_right - (current->left != NULL ? current->left->size : 0);
159 |             for (int i = 0; i < freq; i++)
160 |             {
161 |                 elements.push_back(current->symbol);
162 |             }
163 |             current = current->right;
164 |         }
165 |         return elements;
166 |     }
167 | };
168 | 
169 | template <typename T>
170 | struct Range
171 | {
172 |     FenwickTree<T> *ftree;
173 |     int start;
174 |     int freq;
175 | 
176 |     Range(FenwickTree<T> *ftree, int start, int freq) : ftree(ftree), start(start), freq(freq) {}
177 | };
178 | 
179 | #endif // FENWICK_TREE_H
-------------------------------------------------------------------------------- /fenwick_tree_cpp/src/fenwick_tree.i: --------------------------------------------------------------------------------
1 | // Copyright (c) Meta Platforms, Inc. and affiliates.
2 | // All rights reserved.
3 | //
4 | // This source code is licensed under the license found in the
5 | // LICENSE file in the root directory of this source tree.
6 | 
7 | %module fenwick_tree
8 | 
9 | %{
10 | #include <cstdint>
11 | #include <iostream>
12 | #include <stack>
13 | #include <string>
14 | #include <utility>
15 | #include <vector>
16 | #include "../src/fenwick_tree.h"
17 | %}
18 | 
19 | #include <cstdint>
20 | #include <iostream>
21 | #include <stack>
22 | #include <string>
23 | #include <utility>
24 | #include <vector>
25 | #include "../src/fenwick_tree.h"
26 | using namespace std;
27 | 
28 | template <typename T>
29 | struct Range;
30 | 
31 | template <typename T>
32 | struct FenwickTree
33 | {
34 |     T symbol;
35 |     int size;
36 |     FenwickTree *left;
37 |     FenwickTree *right;
38 |     FenwickTree(T symbol) : symbol(symbol), size(1), left(NULL), right(NULL) {}
39 |     FenwickTree() : symbol(T()), size(0), left(NULL), right(NULL) {}
40 |     Range<T> insert_then_forward_lookup(T symbol);
41 |     Range<T> reverse_lookup_then_remove(int index);
42 |     vector<T> inorder_traversal();
43 | };
44 | 
45 | template <typename T>
46 | struct Range
47 | {
48 |     FenwickTree<T> *ftree;
49 |     int start;
50 |     int freq;
51 |     Range(FenwickTree<T> *ftree, int start, int freq) : ftree(ftree), start(start), freq(freq) {}
52 | };
53 | 
54 | // instantiation for uint64_t symbols (assumed; matches the ids used by the Python tests)
55 | %template(FenwickTree) FenwickTree<uint64_t>;
%template(Range) Range<uint64_t>;
-------------------------------------------------------------------------------- /fenwick_tree_cpp/tests/__pycache__/test_FenwickTree.cpython-310-pytest-7.4.0.pyc: --------------------------------------------------------------------------------
https://raw.githubusercontent.com/facebookresearch/vector_db_id_compression/46eb4860cb181bace769607e6dc3d82369a3acb5/fenwick_tree_cpp/tests/__pycache__/test_FenwickTree.cpython-310-pytest-7.4.0.pyc
-------------------------------------------------------------------------------- /fenwick_tree_cpp/tests/test_FenwickTree.py: --------------------------------------------------------------------------------
1 | # Copyright (c) Meta Platforms, Inc. and affiliates.
2 | # All rights reserved.
3 | #
4 | # This source code is licensed under the license found in the
5 | # LICENSE file in the root directory of this source tree.
6 | 
7 | from lib.fenwick_tree import FenwickTree, Range
8 | 
9 | 
10 | def test_Range():
11 |     ftree = FenwickTree(0)
12 |     r = Range(ftree, 10, 10)
13 |     assert r.start == r.freq == 10
14 | 
15 | 
16 | def test_FenwickTree():
17 |     ftree = FenwickTree()
18 |     assert ftree.size == 0
19 |     r = ftree.insert_then_forward_lookup(0)
20 |     assert r.start == 0
21 |     assert r.freq == 1
22 | 
23 |     r = ftree.insert_then_forward_lookup(1)
24 |     assert r.start == 1
25 |     assert r.freq == 1
26 | 
27 |     r = ftree.insert_then_forward_lookup(0)
28 |     assert r.start == 0
29 |     assert r.freq == 2
30 | 
31 |     r = ftree.insert_then_forward_lookup(5)
32 |     assert r.start == 3
33 |     assert r.freq == 1
34 | 
-------------------------------------------------------------------------------- /fenwick_tree_cpp/tests/test_fenwick_tree.cpp: --------------------------------------------------------------------------------
1 | // Copyright (c) Meta Platforms, Inc. and affiliates.
2 | // All rights reserved.
3 | //
4 | // This source code is licensed under the license found in the
5 | // LICENSE file in the root directory of this source tree.
6 | 
7 | #include <cassert>
8 | #include <cstdint>
9 | #include <iostream>
10 | #include <stack>
11 | #include <string>
12 | #include <vector>
13 | #include "../src/fenwick_tree.h"
14 | using namespace std;
15 | 
16 | void test_FenwickTree_1()
17 | {
18 |     FenwickTree<char> ftree;
19 | 
20 |     // INSERT
21 |     // auto range = insert_then_forward_lookup(&ftree, 'b');
22 |     auto range = ftree.insert_then_forward_lookup('b');
23 |     assert(range.ftree->symbol == 'b');
24 |     assert(range.start == 0);
25 |     assert(range.freq == 1);
26 |     auto expected = vector<char>{'b'};
27 |     assert(ftree.inorder_traversal() == expected);
28 | 
29 |     range = ftree.insert_then_forward_lookup('a');
30 |     assert(range.ftree->symbol == 'a');
31 |     assert(range.start == 0);
32 |     assert(range.freq == 1);
33 |     expected = vector<char>{'a', 'b'};
34 |     assert(ftree.inorder_traversal() == expected);
35 | 
36 |     range = ftree.insert_then_forward_lookup('b');
37 |     assert(range.ftree->symbol == 'b');
38 |     assert(range.start == 1);
39 |     assert(range.freq == 2);
40 |     expected = vector<char>{'a', 'b', 'b'};
41 |     assert(ftree.inorder_traversal() == expected);
42 | 
43 |     range = ftree.insert_then_forward_lookup('d');
44 |     assert(range.ftree->symbol == 'd');
45 |     assert(range.start == 3);
46 |     assert(range.freq == 1);
47 |     expected = vector<char>{'a', 'b', 'b', 'd'};
48 |     assert(ftree.inorder_traversal() == expected);
49 | 
50 |     range = ftree.insert_then_forward_lookup('c');
51 |     assert(range.ftree->symbol == 'c');
52 |     assert(range.start == 3);
53 |     assert(range.freq == 1);
54 |     expected = vector<char>{'a', 'b', 'b', 'c', 'd'};
55 |     assert(ftree.inorder_traversal() == expected);
56 | 
57 |     range = ftree.insert_then_forward_lookup('e');
58 |     assert(range.ftree->symbol == 'e');
59 |     assert(range.start == 5);
60 |     assert(range.freq == 1);
61 |     expected = vector<char>{'a', 'b', 'b', 'c', 'd', 'e'};
62 |     assert(ftree.inorder_traversal() == expected);
63 | 
64 |     range = ftree.insert_then_forward_lookup('c');
65 |     assert(range.ftree->symbol == 'c');
66 |     assert(range.start == 3);
67 |     assert(range.freq == 2);
68 |     expected = vector<char>{'a', 'b', 'b', 'c', 'c', 'd', 'e'};
69 |     assert(ftree.inorder_traversal() == expected);
70 | 
71 |     range = ftree.insert_then_forward_lookup('c');
72 |     assert(range.ftree->symbol == 'c');
73 |     assert(range.start == 3);
74 |     assert(range.freq == 3);
75 |     expected = vector<char>{'a', 'b', 'b', 'c', 'c', 'c', 'd', 'e'};
76 |     assert(ftree.inorder_traversal() == expected);
77 | 
78 |     // REMOVE
79 |     range = ftree.reverse_lookup_then_remove(6);
80 |     assert(range.ftree->symbol == 'd');
81 |     assert(range.start == 6);
82 |     assert(range.freq == 1);
83 |     expected = vector<char>{'a', 'b', 'b', 'c', 'c', 'c', 'e'};
84 |     assert(ftree.inorder_traversal() == expected);
85 | 
86 |     range = ftree.reverse_lookup_then_remove(1);
87 |     assert(range.ftree->symbol == 'b');
88 |     assert(range.start == 1);
89 |     assert(range.freq == 2);
90 |     expected = vector<char>{'a', 'b', 'c', 'c', 'c', 'e'};
91 |     assert(ftree.inorder_traversal() == expected);
92 | 
93 |     range = ftree.reverse_lookup_then_remove(3);
94 |     assert(range.ftree->symbol == 'c');
95 |     assert(range.start == 2);
96 |     assert(range.freq == 3);
97 |     expected = vector<char>{'a', 'b', 'c', 'c', 'e'};
98 |     assert(ftree.inorder_traversal() == expected);
99 | 
100 |     range = ftree.reverse_lookup_then_remove(4);
101 |     assert(range.ftree->symbol == 'e');
102 |     assert(range.start == 4);
103 |     assert(range.freq == 1);
104 |     expected = vector<char>{'a', 'b', 'c', 'c'};
105 |     assert(ftree.inorder_traversal() == expected);
106 | 
107 |     range = ftree.reverse_lookup_then_remove(0);
108 |     assert(range.ftree->symbol == 'a');
109 |     assert(range.start == 0);
110 |     assert(range.freq == 1);
111 |     expected = vector<char>{'b', 'c', 'c'};
112 |     assert(ftree.inorder_traversal() == expected);
113 | 
114 |     range = ftree.reverse_lookup_then_remove(1);
115 |     assert(range.ftree->symbol == 'c');
116 |     assert(range.start == 1);
117 |     assert(range.freq == 2);
118 |     expected = vector<char>{'b', 'c'};
119 |     assert(ftree.inorder_traversal() == expected);
120 | 
121 |     range = ftree.reverse_lookup_then_remove(0);
122 |     assert(range.ftree->symbol == 'b');
123 |     assert(range.start == 0);
124 |     assert(range.freq == 1);
125 |     expected = vector<char>{'c'};
126 |     assert(ftree.inorder_traversal() == expected);
127 | 
128 |     range = ftree.reverse_lookup_then_remove(0);
129 |     assert(range.ftree->symbol == 'c');
130 |     assert(range.start == 0);
131 |     assert(range.freq == 1);
132 |     expected = vector<char>{};
133 |     assert(ftree.inorder_traversal() == expected);
134 | }
135 | 
136 | void test_FenwickTree_2()
137 | {
138 |     FenwickTree<uint64_t> ftree;
139 | 
140 |     // INSERT
141 |     auto range = ftree.insert_then_forward_lookup((uint64_t)83);
142 |     assert(range.ftree->symbol == 83);
143 |     assert(range.start == 0);
144 |     assert(range.freq == 1);
145 |     auto expected = vector<uint64_t>{83};
146 |     assert(ftree.inorder_traversal() == expected);
147 | 
148 |     range = ftree.insert_then_forward_lookup((uint64_t)77);
149 |     assert(range.ftree->symbol == 77);
150 |     assert(range.start == 0);
151 |     assert(range.freq == 1);
152 |     expected = vector<uint64_t>{77, 83};
153 |     assert(ftree.inorder_traversal() == expected);
154 | 
155 |     range = ftree.insert_then_forward_lookup((uint64_t)15);
156 |     assert(range.ftree->symbol == 15);
157 |     assert(range.start == 0);
158 |     assert(range.freq == 1);
159 |     expected = vector<uint64_t>{15, 77, 83};
160 |     assert(ftree.inorder_traversal() == expected);
161 | 
162 |     range = ftree.insert_then_forward_lookup((uint64_t)86);
163 |     assert(range.ftree->symbol == 86);
164 |     assert(range.start == 3);
165 |     assert(range.freq == 1);
166 |     expected = vector<uint64_t>{15, 77, 83, 86};
167 |     assert(ftree.inorder_traversal() == expected);
168 | 
169 |     range = ftree.insert_then_forward_lookup((uint64_t)93);
170 |     assert(range.ftree->symbol == 93);
171 |     assert(range.start == 4);
172 |     assert(range.freq == 1);
173 |     expected = vector<uint64_t>{15, 77, 83, 86, 93};
174 |     assert(ftree.inorder_traversal() == expected);
175 | 
176 |     // REMOVE
177 |     range = ftree.reverse_lookup_then_remove(3);
178 |     assert(range.ftree->symbol == 86);
179 |     assert(range.start == 3);
180 |     assert(range.freq == 1);
181 |     expected = vector<uint64_t>{15, 77, 83, 93};
182 |     assert(ftree.inorder_traversal() == expected);
183 | }
184 | 
185 | int main()
186 | {
187 |     cout << "Running tests ..." << endl;
188 |     test_FenwickTree_1();
189 |     test_FenwickTree_2();
190 |     cout << "All tests passed!" << endl;
191 |     return 0;
192 | }
-------------------------------------------------------------------------------- /graph_static_bench_invlists.py: --------------------------------------------------------------------------------
1 | # Copyright (c) Meta Platforms, Inc. and affiliates.
2 | # All rights reserved.
3 | #
4 | # This source code is licensed under the license found in the
5 | # LICENSE file in the root directory of this source tree.
6 | 
7 | import datetime
8 | import sys
9 | from pathlib import Path
10 | 
11 | import faiss
12 | import numpy as np
13 | import pandas as pd
14 | from faiss.contrib.datasets import DatasetDeep1B, DatasetSIFT1M, SyntheticDataset
15 | from faiss.contrib.inspect_tools import get_NSG_neighbors
16 | from rec.definitions import Graph
17 | from rec.models import PolyasUrnModel
18 | 
19 | from qinco_datasets import DatasetFB_ssnpp
20 | 
21 | 
22 | def friend_to_edgelist_repr(graph_friends):
23 |     return np.array(
24 |         [[v, w] for v, friends in enumerate(graph_friends) for w in friends if w != -1]
25 |     )
26 | 
27 | 
28 | def vector_to_array(v):
29 |     """make a vector visible as a numpy array (without copying data)"""
30 |     return faiss.rev_swig_ptr(v.data(), v.size())
31 | 
32 | 
33 | def get_hnsw_links(hnsw, vno):
34 |     """get link structure for vertex vno"""
35 | 
36 |     # make arrays visible from Python
37 |     levels = vector_to_array(hnsw.levels)
38 |     cum_nneighbor_per_level = vector_to_array(hnsw.cum_nneighbor_per_level)
39 |     offsets = vector_to_array(hnsw.offsets)
40 |     neighbors = vector_to_array(hnsw.neighbors)
41 | 
42 |     # all neighbors of vno
43 |     neigh_vno = neighbors[offsets[vno] : offsets[vno + 1]]
44 | 
45 |     # break down per level
46 |     nlevel = levels[vno]
47 |     return [
48 |         neigh_vno[cum_nneighbor_per_level[l] : cum_nneighbor_per_level[l + 1]]
49 |         for l in range(nlevel)
50 |     ]
51 | 
52 | 
53 | if __name__ == "__main__":
54 |     dataset_idx = int(sys.argv[1])
55 |     max_degree = int(sys.argv[2])
56 |     fb_ssnpp_dir = None
57 | 
58 |     if dataset_idx == 3:
59 |         assert (
60 |             len(sys.argv) == 4
61 |         ), "Path to fb_ssnpp/ directory is needed for DatasetFB_ssnpp (index 3)"
62 |         fb_ssnpp_dir = sys.argv[3]
63 | 
64 |     AVAILABLE_DATASETS = [
65 |         (SyntheticDataset, dict(d=32, nt=10_000, nq=1, nb=1_000)),
66 |         (DatasetSIFT1M, {}),
67 |         (DatasetDeep1B, dict(nb=10**6)),
68 |         (DatasetFB_ssnpp, dict(basedir=fb_ssnpp_dir)),
69 |     ]
70 | 
71 |     dataset_cls, dataset_kwargs = AVAILABLE_DATASETS[dataset_idx]
72 |     dataset = dataset_cls(**dataset_kwargs)
73 | 
74 |     index_strs = [f"NSG{max_degree},Flat", f"HNSW{max_degree},Flat"]
75 |     now = datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S-%f")
76 |     csv_path = Path(f"results-offline-graphs-rec/{now}-{dataset_cls.__name__}.csv")
77 |     csv_path.parent.mkdir(parents=True, exist_ok=True)
78 | 
79 |     results = []
80 | 
81 |     for index_str in index_strs:
82 |         print(f"Indexing {dataset_cls.__name__} / {index_str}", flush=True)
83 |         index = faiss.index_factory(dataset.d, index_str)
84 |         index.verbose = True
85 |         database = dataset.get_database()
86 |         index.add(database)
87 | 
88 |         if "NSG" in index_str:
89 |             graph_friends = get_NSG_neighbors(index.nsg)
90 |         elif "HNSW" in index_str:
91 |             graph_friends = [
92 |                 get_hnsw_links(index.hnsw, v)[0] for v in range(dataset.nb)
93 |             ]
94 | 
95 |         graph_edgelist = friend_to_edgelist_repr(graph_friends)
96 |         graph = Graph(
97 |             edge_array=graph_edgelist,
98 |             num_nodes=len(graph_friends),
99 |             num_edges=len(graph_edgelist),
100 |         )
101 | 
102 |         model_pu = PolyasUrnModel(
103 |             graph.num_nodes,
104 |             graph.num_edges,
105 |             undirected=False,
106 |         )
107 | 
108 |         # Compute results directly as BPE. See REC paper for details.
109 |         _, graph_bpe = model_pu.compute_bpe(graph)
110 | 
111 |         results.append(
112 |             {
113 |                 "index_str": index_str,
114 |                 "comp_method": "rec",
115 |                 "dataset": type(dataset).__name__,
116 |                 "nb": dataset.nb,
117 |                 "nt": dataset.nt,
118 |                 "bpe": graph_bpe,
119 |                 "num_edges": graph.num_edges,
120 |             }
121 |         )
122 |         print(results[-1])
123 | 
124 |     df = pd.DataFrame(results)
125 |     df.to_csv(csv_path, index=False)
126 |     print(df)
127 | 
-------------------------------------------------------------------------------- /install-dependencies.sh: --------------------------------------------------------------------------------
1 | # Copyright (c) Meta Platforms, Inc. and affiliates.
2 | # All rights reserved.
3 | #
4 | # This source code is licensed under the license found in the
5 | # LICENSE file in the root directory of this source tree.
6 | 
7 | # clone repos if needed
8 | git clone git@github.com:ot/succinct.git
9 | git clone git@github.com:dsevero/Random-Edge-Coding.git
10 | pip install -e Random-Edge-Coding
11 | 
12 | 
13 | # copy elias fano mod to succinct
14 | cp elias_fano.hpp succinct/
-------------------------------------------------------------------------------- /qinco_datasets.py: --------------------------------------------------------------------------------
1 | # TAKEN FROM https://github.com/facebookresearch/Qinco/blob/main/datasets.py
2 | # Copyright (c) Meta Platforms, Inc. and affiliates.
3 | # All rights reserved.
4 | #
5 | # This source code is licensed under the license found in the
6 | # LICENSE file in the root directory of this source tree.
7 | 
8 | import numpy as np
9 | from faiss.contrib.datasets import Dataset
10 | from faiss.contrib.datasets import dataset_from_name as dataset_from_name_faiss
11 | 
12 | """
13 | This file contains dataset classes of datasets that are not present in FAISS,
14 | but they inherit from the FAISS Dataset class to have a similar interface.
15 | """
16 | 
17 | 
18 | class DatasetFB_ssnpp(Dataset):
19 |     """
20 |     A wrapper for the FB_ssnpp dataset such that it inherits from the FAISS Dataset class.
21 | """ 22 | 23 | def __init__(self, basedir: str, nb_M=1): 24 | Dataset.__init__(self) 25 | assert nb_M == 1 26 | # 1e8 training vectors are available, but to prevent loading of too many, 27 | # maximize loading here to 1e7 28 | self.d, self.nt, self.nb, self.nq = 256, int(1e7), int(nb_M * 10**6), 10000 29 | self.basedir = basedir 30 | 31 | def get_queries(self): 32 | return np.load(self.basedir + "queries.npy") 33 | 34 | def get_train(self, maxtrain=None): 35 | if maxtrain is None: 36 | maxtrain = self.nt 37 | xt = np.load(self.basedir + "training_set10010k.npy", mmap_mode="r") 38 | if maxtrain <= 10010000: 39 | return np.array(xt[:maxtrain]) 40 | else: 41 | raise NotImplementedError 42 | 43 | def get_database(self): 44 | return np.load(self.basedir + "database1M.npy") 45 | 46 | def get_groundtruth(self, k=None): 47 | gt = np.load(self.basedir + "ground_truth1M.npy") 48 | if k is not None: 49 | assert k <= 100 50 | gt = gt[:, :k] 51 | return gt 52 | -------------------------------------------------------------------------------- /zuckerli-baseline/README.md: -------------------------------------------------------------------------------- 1 | # Zuckerli baseline 2 | 3 | ## 1) Generate `.el` file for each dataset 4 | Run `generate_graph_edelists.py` to generate an `.el` file for some dataset and index string. 5 | This script takes 3 arguments: 6 | - Dataset index, specifying which dataset to generate the `.el` for. See the `AVAILABLE_DATASETS` variable. 7 | - Max degree parameter of NSG and HNSW. 8 | - Path to the `fb_ssnpp/` directory (only used if dataset index is set to `3`). 9 | 10 | To generate `.el` files for all datasets in the paper, run the following 11 | 12 | ```bash 13 | fb_ssnpp_dir=... 14 | for dataset_idx in 0 1 2 3; do 15 | for max_degree in 16 32 64 128 256; do 16 | ./generate_graph_edgelists_sbatch.sh $dataset_idx $max_degree $fb_ssnpp_dir 17 | done 18 | done 19 | ``` 20 | 21 | `.el` files will be saved in `graphs/`. 22 | 23 | ## 2) Install veluca93/graph_utils and generate Zuckerli graph binaries 24 | 25 | Compile the `gutil` following instructions in https://github.com/veluca93/graph_utils/tree/a9521a943f67466e0e1badaf10876e82c2fbef2a 26 | 27 | Create a `.bin` file for each graph 28 | 29 | ```sh 30 | # Convert graphs to zuckerli binary, skip if exists 31 | for x in graphs/*.el; do 32 | if [[ ! -f "$x.bin" ]] 33 | then 34 | echo "converting $x" 35 | ./graph_utils/bin/gutil convert -i "$x" -F bin -o "$x.bin" 36 | fi 37 | done 38 | ``` 39 | 40 | ## 3) Compile Zuckerli and compress 41 | 42 | Compile Zuckerli from https://github.com/google/zuckerli/tree/874ac40705d1e67d2ee177865af4a41b5bc2b250 43 | 44 | Assuming you cloned `zuckerli` in the local directory, run the following to compress each graph. 45 | ```sh 46 | for x in graphs/*.el.bin; do 47 | zuckerli/build/encoder --input_path "$x" --output_path "$x.comp" 48 | done 49 | ``` 50 | 51 | This will generate logs similar to https://gist.github.com/dsevero/ac356c1c1cdf4aac17eee34387a5a4b2 -------------------------------------------------------------------------------- /zuckerli-baseline/generate_graph_edgelists.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) Meta Platforms, Inc. and affiliates. 2 | # All rights reserved. 3 | # 4 | # This source code is licensed under the license found in the 5 | # LICENSE file in the root directory of this source tree. 
6 | 
7 | import sys
8 | from pathlib import Path
9 | 
10 | import faiss
11 | from faiss.contrib.datasets import DatasetDeep1B, DatasetSIFT1M, SyntheticDataset
12 | from faiss.contrib.inspect_tools import get_NSG_neighbors
13 | 
14 | from qinco_datasets import DatasetFB_ssnpp  # requires the repo root on PYTHONPATH
15 | 
16 | 
17 | def friend_to_edgelist_repr(graph_friends):
18 |     return list(
19 |         sorted(
20 |             [v, w]
21 |             for v, friends in enumerate(graph_friends)
22 |             for w in friends
23 |             if w != -1
24 |         )
25 |     )
26 | 
27 | 
28 | def vector_to_array(v):
29 |     """make a vector visible as a numpy array (without copying data)"""
30 |     return faiss.rev_swig_ptr(v.data(), v.size())
31 | 
32 | 
33 | def get_hnsw_links(hnsw, vno):
34 |     """get link structure for vertex vno"""
35 | 
36 |     # make arrays visible from Python
37 |     levels = vector_to_array(hnsw.levels)
38 |     cum_nneighbor_per_level = vector_to_array(hnsw.cum_nneighbor_per_level)
39 |     offsets = vector_to_array(hnsw.offsets)
40 |     neighbors = vector_to_array(hnsw.neighbors)
41 | 
42 |     # all neighbors of vno
43 |     neigh_vno = neighbors[offsets[vno] : offsets[vno + 1]]
44 | 
45 |     # break down per level
46 |     nlevel = levels[vno]
47 |     return [
48 |         neigh_vno[cum_nneighbor_per_level[l] : cum_nneighbor_per_level[l + 1]]
49 |         for l in range(nlevel)
50 |     ]
51 | 
52 | 
53 | if __name__ == "__main__":
54 |     dataset_idx = int(sys.argv[1])
55 |     max_degree = int(sys.argv[2])
56 |     fb_ssnpp_dir = None
57 | 
58 |     if dataset_idx == 3:
59 |         assert (
60 |             len(sys.argv) == 4
61 |         ), "Path to fb_ssnpp/ directory is needed for DatasetFB_ssnpp (index 3)"
62 |         fb_ssnpp_dir = sys.argv[3]
63 | 
64 |     AVAILABLE_DATASETS = [
65 |         (SyntheticDataset, dict(d=32, nt=10_000, nq=1, nb=1_000)),
66 |         (DatasetSIFT1M, {}),
67 |         (DatasetDeep1B, dict(nb=10**6)),
68 |         (DatasetFB_ssnpp, dict(basedir=fb_ssnpp_dir)),
69 |     ]
70 | 
71 |     dataset_cls, dataset_kwargs = AVAILABLE_DATASETS[dataset_idx]
72 |     dataset = dataset_cls(**dataset_kwargs)
73 | 
74 |     index_strs = [f"NSG{max_degree},Flat", f"HNSW{max_degree},Flat"]
75 | 
76 |     for index_str in index_strs:
77 |         dataset_name = type(dataset).__name__
78 |         print(f"Indexing {dataset_name} / {index_str}", flush=True)
79 |         index = faiss.index_factory(dataset.d, index_str)
80 |         index.verbose = True
81 |         database = dataset.get_database()
82 |         index.add(database)
83 | 
84 |         if "NSG" in index_str:
85 |             graph_friends = get_NSG_neighbors(index.nsg)
86 |         elif "HNSW" in index_str:
87 |             graph_friends = [
88 |                 get_hnsw_links(index.hnsw, v)[0] for v in range(dataset.nb)
89 |             ]
90 | 
91 |         graph_edgelist = friend_to_edgelist_repr(graph_friends)
92 |         graph_edgelist_str = "\n".join(map(lambda e: f"{e[0]} {e[1]}", graph_edgelist))
93 |         el_path = Path(f"graphs/{dataset_name}-{index_str}.el")
94 |         el_path.parent.mkdir(parents=True, exist_ok=True)
95 |         with open(el_path, "w") as f:
96 |             f.write(graph_edgelist_str)
97 | 
--------------------------------------------------------------------------------
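
For reference, a hypothetical sanity check of a generated edge list: the file name below assumes dataset index 1 (SIFT1M) and max degree 32, so adjust it to your run. Each line of an `.el` file is a `source target` pair, sorted by source node.

```python
import numpy as np

# load the edge list written by generate_graph_edgelists.py
edges = np.loadtxt("graphs/DatasetSIFT1M-NSG32,Flat.el", dtype=np.int64)
print(edges.shape)  # (num_edges, 2)
print(edges[:3])    # first few "v w" pairs, sorted by source node
```
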