├── .github
│   └── workflows
│       └── black.yml
├── .gitignore
├── CODE_OF_CONDUCT.md
├── CONTRIBUTING.md
├── LICENSE
├── README.md
├── scripts
│   ├── download_index.py
│   └── sphere_client_demo_hf.py
├── setup.py
└── sphere_logo.png

/.github/workflows/black.yml:
--------------------------------------------------------------------------------
1 | name: Code quality
2 |
3 | on:
4 |   push:
5 |     branches:
6 |       - main
7 |   pull_request:
8 |     branches:
9 |       - main
10 | jobs:
11 |   lint:
12 |     runs-on: ubuntu-latest
13 |     steps:
14 |       - uses: actions/checkout@v2
15 |       - uses: psf/black@stable
16 |         with:
17 |           options: "--check --diff --verbose --line-length 120 --extend-exclude (users|web)"
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | # Byte-compiled / optimized / DLL files
2 | __pycache__/
3 | *.py[cod]
4 |
5 | # Developer tools
6 | .vscode/
7 |
8 | # Installation files
9 | *.egg-info/
10 | eggs/
11 | .eggs/
12 | *.egg
13 | *.mmap
14 | src/
15 |
16 |
17 | checkpoints/
18 | configs/
19 | data/
20 | faiss_index/
21 | logs/
22 | output/
23 | *.log
24 | *.tar.gz
--------------------------------------------------------------------------------
/CODE_OF_CONDUCT.md:
--------------------------------------------------------------------------------
1 | # Code of Conduct
2 |
3 | ## Our Pledge
4 |
5 | In the interest of fostering an open and welcoming environment, we as
6 | contributors and maintainers pledge to make participation in our project and
7 | our community a harassment-free experience for everyone, regardless of age, body
8 | size, disability, ethnicity, sex characteristics, gender identity and expression,
9 | level of experience, education, socio-economic status, nationality, personal
10 | appearance, race, religion, or sexual identity and orientation.
11 | 12 | ## Our Standards 13 | 14 | Examples of behavior that contributes to creating a positive environment 15 | include: 16 | 17 | * Using welcoming and inclusive language 18 | * Being respectful of differing viewpoints and experiences 19 | * Gracefully accepting constructive criticism 20 | * Focusing on what is best for the community 21 | * Showing empathy towards other community members 22 | 23 | Examples of unacceptable behavior by participants include: 24 | 25 | * The use of sexualized language or imagery and unwelcome sexual attention or 26 | advances 27 | * Trolling, insulting/derogatory comments, and personal or political attacks 28 | * Public or private harassment 29 | * Publishing others' private information, such as a physical or electronic 30 | address, without explicit permission 31 | * Other conduct which could reasonably be considered inappropriate in a 32 | professional setting 33 | 34 | ## Our Responsibilities 35 | 36 | Project maintainers are responsible for clarifying the standards of acceptable 37 | behavior and are expected to take appropriate and fair corrective action in 38 | response to any instances of unacceptable behavior. 39 | 40 | Project maintainers have the right and responsibility to remove, edit, or 41 | reject comments, commits, code, wiki edits, issues, and other contributions 42 | that are not aligned to this Code of Conduct, or to ban temporarily or 43 | permanently any contributor for other behaviors that they deem inappropriate, 44 | threatening, offensive, or harmful. 45 | 46 | ## Scope 47 | 48 | This Code of Conduct applies within all project spaces, and it also applies when 49 | an individual is representing the project or its community in public spaces. 50 | Examples of representing a project or community include using an official 51 | project e-mail address, posting via an official social media account, or acting 52 | as an appointed representative at an online or offline event. 
Representation of 53 | a project may be further defined and clarified by project maintainers. 54 | 55 | This Code of Conduct also applies outside the project spaces when there is a 56 | reasonable belief that an individual's behavior may have a negative impact on 57 | the project or its community. 58 | 59 | ## Enforcement 60 | 61 | Instances of abusive, harassing, or otherwise unacceptable behavior may be 62 | reported by contacting the project team at . All 63 | complaints will be reviewed and investigated and will result in a response that 64 | is deemed necessary and appropriate to the circumstances. The project team is 65 | obligated to maintain confidentiality with regard to the reporter of an incident. 66 | Further details of specific enforcement policies may be posted separately. 67 | 68 | Project maintainers who do not follow or enforce the Code of Conduct in good 69 | faith may face temporary or permanent repercussions as determined by other 70 | members of the project's leadership. 71 | 72 | ## Attribution 73 | 74 | This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4, 75 | available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html 76 | 77 | [homepage]: https://www.contributor-covenant.org 78 | 79 | For answers to common questions about this code of conduct, see 80 | https://www.contributor-covenant.org/faq 81 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing to Sphere 2 | We want to make contributing to this project as easy and transparent as 3 | possible. 4 | 5 | ## Pull Requests 6 | We actively welcome your pull requests. 7 | 8 | 1. Fork the repo and create your branch from `main`. 9 | 2. If you've added code that should be tested, add tests. 10 | 3. If you've changed APIs, update the documentation. 11 | 4. Ensure the test suite passes. 12 | 5. 
Make sure your code lints. 13 | 6. If you haven't already, complete the Contributor License Agreement ("CLA"). 14 | 15 | ## Contributor License Agreement ("CLA") 16 | In order to accept your pull request, we need you to submit a CLA. You only need 17 | to do this once to work on any of Facebook's open source projects. 18 | 19 | Complete your CLA here: 20 | 21 | ## Issues 22 | We use GitHub issues to track public bugs. Please ensure your description is 23 | clear and has sufficient instructions to be able to reproduce the issue. 24 | 25 | Facebook has a [bounty program](https://www.facebook.com/whitehat/) for the safe 26 | disclosure of security bugs. In those cases, please go through the process 27 | outlined on that page and do not file a public issue. 28 | 29 | ## License 30 | By contributing to Sphere, you agree that your contributions will be licensed 31 | under the LICENSE file in the root directory of this source tree. -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | 2 | Attribution-NonCommercial 4.0 International 3 | 4 | ======================================================================= 5 | 6 | Creative Commons Corporation ("Creative Commons") is not a law firm and 7 | does not provide legal services or legal advice. Distribution of 8 | Creative Commons public licenses does not create a lawyer-client or 9 | other relationship. Creative Commons makes its licenses and related 10 | information available on an "as-is" basis. Creative Commons gives no 11 | warranties regarding its licenses, any material licensed under their 12 | terms and conditions, or any related information. Creative Commons 13 | disclaims all liability for damages resulting from their use to the 14 | fullest extent possible. 
15 | 16 | Using Creative Commons Public Licenses 17 | 18 | Creative Commons public licenses provide a standard set of terms and 19 | conditions that creators and other rights holders may use to share 20 | original works of authorship and other material subject to copyright 21 | and certain other rights specified in the public license below. The 22 | following considerations are for informational purposes only, are not 23 | exhaustive, and do not form part of our licenses. 24 | 25 | Considerations for licensors: Our public licenses are 26 | intended for use by those authorized to give the public 27 | permission to use material in ways otherwise restricted by 28 | copyright and certain other rights. Our licenses are 29 | irrevocable. Licensors should read and understand the terms 30 | and conditions of the license they choose before applying it. 31 | Licensors should also secure all rights necessary before 32 | applying our licenses so that the public can reuse the 33 | material as expected. Licensors should clearly mark any 34 | material not subject to the license. This includes other CC- 35 | licensed material, or material used under an exception or 36 | limitation to copyright. More considerations for licensors: 37 | wiki.creativecommons.org/Considerations_for_licensors 38 | 39 | Considerations for the public: By using one of our public 40 | licenses, a licensor grants the public permission to use the 41 | licensed material under specified terms and conditions. If 42 | the licensor's permission is not necessary for any reason--for 43 | example, because of any applicable exception or limitation to 44 | copyright--then that use is not regulated by the license. Our 45 | licenses grant only permissions under copyright and certain 46 | other rights that a licensor has authority to grant. Use of 47 | the licensed material may still be restricted for other 48 | reasons, including because others have copyright or other 49 | rights in the material. 
A licensor may make special requests, 50 | such as asking that all changes be marked or described. 51 | Although not required by our licenses, you are encouraged to 52 | respect those requests where reasonable. More_considerations 53 | for the public: 54 | wiki.creativecommons.org/Considerations_for_licensees 55 | 56 | ======================================================================= 57 | 58 | Creative Commons Attribution-NonCommercial 4.0 International Public 59 | License 60 | 61 | By exercising the Licensed Rights (defined below), You accept and agree 62 | to be bound by the terms and conditions of this Creative Commons 63 | Attribution-NonCommercial 4.0 International Public License ("Public 64 | License"). To the extent this Public License may be interpreted as a 65 | contract, You are granted the Licensed Rights in consideration of Your 66 | acceptance of these terms and conditions, and the Licensor grants You 67 | such rights in consideration of benefits the Licensor receives from 68 | making the Licensed Material available under these terms and 69 | conditions. 70 | 71 | Section 1 -- Definitions. 72 | 73 | a. Adapted Material means material subject to Copyright and Similar 74 | Rights that is derived from or based upon the Licensed Material 75 | and in which the Licensed Material is translated, altered, 76 | arranged, transformed, or otherwise modified in a manner requiring 77 | permission under the Copyright and Similar Rights held by the 78 | Licensor. For purposes of this Public License, where the Licensed 79 | Material is a musical work, performance, or sound recording, 80 | Adapted Material is always produced where the Licensed Material is 81 | synched in timed relation with a moving image. 82 | 83 | b. Adapter's License means the license You apply to Your Copyright 84 | and Similar Rights in Your contributions to Adapted Material in 85 | accordance with the terms and conditions of this Public License. 86 | 87 | c. 
Copyright and Similar Rights means copyright and/or similar rights 88 | closely related to copyright including, without limitation, 89 | performance, broadcast, sound recording, and Sui Generis Database 90 | Rights, without regard to how the rights are labeled or 91 | categorized. For purposes of this Public License, the rights 92 | specified in Section 2(b)(1)-(2) are not Copyright and Similar 93 | Rights. 94 | d. Effective Technological Measures means those measures that, in the 95 | absence of proper authority, may not be circumvented under laws 96 | fulfilling obligations under Article 11 of the WIPO Copyright 97 | Treaty adopted on December 20, 1996, and/or similar international 98 | agreements. 99 | 100 | e. Exceptions and Limitations means fair use, fair dealing, and/or 101 | any other exception or limitation to Copyright and Similar Rights 102 | that applies to Your use of the Licensed Material. 103 | 104 | f. Licensed Material means the artistic or literary work, database, 105 | or other material to which the Licensor applied this Public 106 | License. 107 | 108 | g. Licensed Rights means the rights granted to You subject to the 109 | terms and conditions of this Public License, which are limited to 110 | all Copyright and Similar Rights that apply to Your use of the 111 | Licensed Material and that the Licensor has authority to license. 112 | 113 | h. Licensor means the individual(s) or entity(ies) granting rights 114 | under this Public License. 115 | 116 | i. NonCommercial means not primarily intended for or directed towards 117 | commercial advantage or monetary compensation. For purposes of 118 | this Public License, the exchange of the Licensed Material for 119 | other material subject to Copyright and Similar Rights by digital 120 | file-sharing or similar means is NonCommercial provided there is 121 | no payment of monetary compensation in connection with the 122 | exchange. 123 | 124 | j. 
Share means to provide material to the public by any means or 125 | process that requires permission under the Licensed Rights, such 126 | as reproduction, public display, public performance, distribution, 127 | dissemination, communication, or importation, and to make material 128 | available to the public including in ways that members of the 129 | public may access the material from a place and at a time 130 | individually chosen by them. 131 | 132 | k. Sui Generis Database Rights means rights other than copyright 133 | resulting from Directive 96/9/EC of the European Parliament and of 134 | the Council of 11 March 1996 on the legal protection of databases, 135 | as amended and/or succeeded, as well as other essentially 136 | equivalent rights anywhere in the world. 137 | 138 | l. You means the individual or entity exercising the Licensed Rights 139 | under this Public License. Your has a corresponding meaning. 140 | 141 | Section 2 -- Scope. 142 | 143 | a. License grant. 144 | 145 | 1. Subject to the terms and conditions of this Public License, 146 | the Licensor hereby grants You a worldwide, royalty-free, 147 | non-sublicensable, non-exclusive, irrevocable license to 148 | exercise the Licensed Rights in the Licensed Material to: 149 | 150 | a. reproduce and Share the Licensed Material, in whole or 151 | in part, for NonCommercial purposes only; and 152 | 153 | b. produce, reproduce, and Share Adapted Material for 154 | NonCommercial purposes only. 155 | 156 | 2. Exceptions and Limitations. For the avoidance of doubt, where 157 | Exceptions and Limitations apply to Your use, this Public 158 | License does not apply, and You do not need to comply with 159 | its terms and conditions. 160 | 161 | 3. Term. The term of this Public License is specified in Section 162 | 6(a). 163 | 164 | 4. Media and formats; technical modifications allowed. 
The 165 | Licensor authorizes You to exercise the Licensed Rights in 166 | all media and formats whether now known or hereafter created, 167 | and to make technical modifications necessary to do so. The 168 | Licensor waives and/or agrees not to assert any right or 169 | authority to forbid You from making technical modifications 170 | necessary to exercise the Licensed Rights, including 171 | technical modifications necessary to circumvent Effective 172 | Technological Measures. For purposes of this Public License, 173 | simply making modifications authorized by this Section 2(a) 174 | (4) never produces Adapted Material. 175 | 176 | 5. Downstream recipients. 177 | 178 | a. Offer from the Licensor -- Licensed Material. Every 179 | recipient of the Licensed Material automatically 180 | receives an offer from the Licensor to exercise the 181 | Licensed Rights under the terms and conditions of this 182 | Public License. 183 | 184 | b. No downstream restrictions. You may not offer or impose 185 | any additional or different terms or conditions on, or 186 | apply any Effective Technological Measures to, the 187 | Licensed Material if doing so restricts exercise of the 188 | Licensed Rights by any recipient of the Licensed 189 | Material. 190 | 191 | 6. No endorsement. Nothing in this Public License constitutes or 192 | may be construed as permission to assert or imply that You 193 | are, or that Your use of the Licensed Material is, connected 194 | with, or sponsored, endorsed, or granted official status by, 195 | the Licensor or others designated to receive attribution as 196 | provided in Section 3(a)(1)(A)(i). 197 | 198 | b. Other rights. 199 | 200 | 1. 
Moral rights, such as the right of integrity, are not 201 | licensed under this Public License, nor are publicity, 202 | privacy, and/or other similar personality rights; however, to 203 | the extent possible, the Licensor waives and/or agrees not to 204 | assert any such rights held by the Licensor to the limited 205 | extent necessary to allow You to exercise the Licensed 206 | Rights, but not otherwise. 207 | 208 | 2. Patent and trademark rights are not licensed under this 209 | Public License. 210 | 211 | 3. To the extent possible, the Licensor waives any right to 212 | collect royalties from You for the exercise of the Licensed 213 | Rights, whether directly or through a collecting society 214 | under any voluntary or waivable statutory or compulsory 215 | licensing scheme. In all other cases the Licensor expressly 216 | reserves any right to collect such royalties, including when 217 | the Licensed Material is used other than for NonCommercial 218 | purposes. 219 | 220 | Section 3 -- License Conditions. 221 | 222 | Your exercise of the Licensed Rights is expressly made subject to the 223 | following conditions. 224 | 225 | a. Attribution. 226 | 227 | 1. If You Share the Licensed Material (including in modified 228 | form), You must: 229 | 230 | a. retain the following if it is supplied by the Licensor 231 | with the Licensed Material: 232 | 233 | i. identification of the creator(s) of the Licensed 234 | Material and any others designated to receive 235 | attribution, in any reasonable manner requested by 236 | the Licensor (including by pseudonym if 237 | designated); 238 | 239 | ii. a copyright notice; 240 | 241 | iii. a notice that refers to this Public License; 242 | 243 | iv. a notice that refers to the disclaimer of 244 | warranties; 245 | 246 | v. a URI or hyperlink to the Licensed Material to the 247 | extent reasonably practicable; 248 | 249 | b. 
indicate if You modified the Licensed Material and 250 | retain an indication of any previous modifications; and 251 | 252 | c. indicate the Licensed Material is licensed under this 253 | Public License, and include the text of, or the URI or 254 | hyperlink to, this Public License. 255 | 256 | 2. You may satisfy the conditions in Section 3(a)(1) in any 257 | reasonable manner based on the medium, means, and context in 258 | which You Share the Licensed Material. For example, it may be 259 | reasonable to satisfy the conditions by providing a URI or 260 | hyperlink to a resource that includes the required 261 | information. 262 | 263 | 3. If requested by the Licensor, You must remove any of the 264 | information required by Section 3(a)(1)(A) to the extent 265 | reasonably practicable. 266 | 267 | 4. If You Share Adapted Material You produce, the Adapter's 268 | License You apply must not prevent recipients of the Adapted 269 | Material from complying with this Public License. 270 | 271 | Section 4 -- Sui Generis Database Rights. 272 | 273 | Where the Licensed Rights include Sui Generis Database Rights that 274 | apply to Your use of the Licensed Material: 275 | 276 | a. for the avoidance of doubt, Section 2(a)(1) grants You the right 277 | to extract, reuse, reproduce, and Share all or a substantial 278 | portion of the contents of the database for NonCommercial purposes 279 | only; 280 | 281 | b. if You include all or a substantial portion of the database 282 | contents in a database in which You have Sui Generis Database 283 | Rights, then the database in which You have Sui Generis Database 284 | Rights (but not its individual contents) is Adapted Material; and 285 | 286 | c. You must comply with the conditions in Section 3(a) if You Share 287 | all or a substantial portion of the contents of the database. 
288 | 289 | For the avoidance of doubt, this Section 4 supplements and does not 290 | replace Your obligations under this Public License where the Licensed 291 | Rights include other Copyright and Similar Rights. 292 | 293 | Section 5 -- Disclaimer of Warranties and Limitation of Liability. 294 | 295 | a. UNLESS OTHERWISE SEPARATELY UNDERTAKEN BY THE LICENSOR, TO THE 296 | EXTENT POSSIBLE, THE LICENSOR OFFERS THE LICENSED MATERIAL AS-IS 297 | AND AS-AVAILABLE, AND MAKES NO REPRESENTATIONS OR WARRANTIES OF 298 | ANY KIND CONCERNING THE LICENSED MATERIAL, WHETHER EXPRESS, 299 | IMPLIED, STATUTORY, OR OTHER. THIS INCLUDES, WITHOUT LIMITATION, 300 | WARRANTIES OF TITLE, MERCHANTABILITY, FITNESS FOR A PARTICULAR 301 | PURPOSE, NON-INFRINGEMENT, ABSENCE OF LATENT OR OTHER DEFECTS, 302 | ACCURACY, OR THE PRESENCE OR ABSENCE OF ERRORS, WHETHER OR NOT 303 | KNOWN OR DISCOVERABLE. WHERE DISCLAIMERS OF WARRANTIES ARE NOT 304 | ALLOWED IN FULL OR IN PART, THIS DISCLAIMER MAY NOT APPLY TO YOU. 305 | 306 | b. TO THE EXTENT POSSIBLE, IN NO EVENT WILL THE LICENSOR BE LIABLE 307 | TO YOU ON ANY LEGAL THEORY (INCLUDING, WITHOUT LIMITATION, 308 | NEGLIGENCE) OR OTHERWISE FOR ANY DIRECT, SPECIAL, INDIRECT, 309 | INCIDENTAL, CONSEQUENTIAL, PUNITIVE, EXEMPLARY, OR OTHER LOSSES, 310 | COSTS, EXPENSES, OR DAMAGES ARISING OUT OF THIS PUBLIC LICENSE OR 311 | USE OF THE LICENSED MATERIAL, EVEN IF THE LICENSOR HAS BEEN 312 | ADVISED OF THE POSSIBILITY OF SUCH LOSSES, COSTS, EXPENSES, OR 313 | DAMAGES. WHERE A LIMITATION OF LIABILITY IS NOT ALLOWED IN FULL OR 314 | IN PART, THIS LIMITATION MAY NOT APPLY TO YOU. 315 | 316 | c. The disclaimer of warranties and limitation of liability provided 317 | above shall be interpreted in a manner that, to the extent 318 | possible, most closely approximates an absolute disclaimer and 319 | waiver of all liability. 320 | 321 | Section 6 -- Term and Termination. 322 | 323 | a. 
This Public License applies for the term of the Copyright and 324 | Similar Rights licensed here. However, if You fail to comply with 325 | this Public License, then Your rights under this Public License 326 | terminate automatically. 327 | 328 | b. Where Your right to use the Licensed Material has terminated under 329 | Section 6(a), it reinstates: 330 | 331 | 1. automatically as of the date the violation is cured, provided 332 | it is cured within 30 days of Your discovery of the 333 | violation; or 334 | 335 | 2. upon express reinstatement by the Licensor. 336 | 337 | For the avoidance of doubt, this Section 6(b) does not affect any 338 | right the Licensor may have to seek remedies for Your violations 339 | of this Public License. 340 | 341 | c. For the avoidance of doubt, the Licensor may also offer the 342 | Licensed Material under separate terms or conditions or stop 343 | distributing the Licensed Material at any time; however, doing so 344 | will not terminate this Public License. 345 | 346 | d. Sections 1, 5, 6, 7, and 8 survive termination of this Public 347 | License. 348 | 349 | Section 7 -- Other Terms and Conditions. 350 | 351 | a. The Licensor shall not be bound by any additional or different 352 | terms or conditions communicated by You unless expressly agreed. 353 | 354 | b. Any arrangements, understandings, or agreements regarding the 355 | Licensed Material not stated herein are separate from and 356 | independent of the terms and conditions of this Public License. 357 | 358 | Section 8 -- Interpretation. 359 | 360 | a. For the avoidance of doubt, this Public License does not, and 361 | shall not be interpreted to, reduce, limit, restrict, or impose 362 | conditions on any use of the Licensed Material that could lawfully 363 | be made without permission under this Public License. 364 | 365 | b. 
To the extent possible, if any provision of this Public License is 366 | deemed unenforceable, it shall be automatically reformed to the 367 | minimum extent necessary to make it enforceable. If the provision 368 | cannot be reformed, it shall be severed from this Public License 369 | without affecting the enforceability of the remaining terms and 370 | conditions. 371 | 372 | c. No term or condition of this Public License will be waived and no 373 | failure to comply consented to unless expressly agreed to by the 374 | Licensor. 375 | 376 | d. Nothing in this Public License constitutes or may be interpreted 377 | as a limitation upon, or waiver of, any privileges and immunities 378 | that apply to the Licensor or You, including from the legal 379 | processes of any jurisdiction or authority. 380 | 381 | ======================================================================= 382 | 383 | Creative Commons is not a party to its public 384 | licenses. Notwithstanding, Creative Commons may elect to apply one of 385 | its public licenses to material it publishes and in those instances 386 | will be considered the “Licensor.” The text of the Creative Commons 387 | public licenses is dedicated to the public domain under the CC0 Public 388 | Domain Dedication. Except for the limited purpose of indicating that 389 | material is shared under a Creative Commons public license or as 390 | otherwise permitted by the Creative Commons policies published at 391 | creativecommons.org/policies, Creative Commons does not authorize the 392 | use of the trademark "Creative Commons" or any other trademark or logo 393 | of Creative Commons without its prior written consent including, 394 | without limitation, in connection with any unauthorized modifications 395 | to any of its public licenses or any other arrangements, 396 | understandings, or agreements concerning use of licensed material. For 397 | the avoidance of doubt, this paragraph does not form part of the 398 | public licenses. 
399 |
400 | Creative Commons may be contacted at creativecommons.org.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Sphere
2 |
3 |
4 | # About
5 | In our paper [*The Web Is Your Oyster - Knowledge-Intensive NLP against a Very Large Web Corpus*](https://arxiv.org/abs/2112.09924) we propose to use a web corpus as a universal, uncurated and unstructured knowledge source for multiple KI-NLP tasks at once.
6 |
7 | We leverage an open web corpus coupled with strong retrieval baselines instead of a black-box, commercial search engine - an approach which facilitates transparent and reproducible research and opens up a path for future studies comparing search engines optimised for humans with retrieval solutions designed for neural networks.
8 | We use a subset of [CCNet](https://github.com/facebookresearch/cc_net) covering 134M documents split into 906M passages as the web corpus which we call **Sphere**.
9 |
10 | In this repository we open source indices of Sphere both for the sparse retrieval baseline, compatible with [Pyserini](https://github.com/castorini/pyserini), and our best dense model compatible with [distributed-faiss](https://github.com/facebookresearch/distributed-faiss). We also provide instructions on how to evaluate the retrieval performance for both standard and newly introduced retrieval metrics, using the [KILT](https://github.com/facebookresearch/KILT) API.
11 |
12 |
13 | ## Reference
14 | If you use the content of this repository in your research, please cite the following:
15 | ```
16 | @article{DBLP:journals/corr/abs-2112-09924,
17 |   author    = {Aleksandra Piktus and Fabio Petroni
18 |                and Vladimir Karpukhin and Dmytro Okhonko
19 |                and Samuel Broscheit and Gautier Izacard
20 |                and Patrick Lewis and Barlas Oguz
21 |                and Edouard Grave and Wen{-}tau Yih
22 |                and Sebastian Riedel},
23 |   title     = {The Web Is Your Oyster - Knowledge-Intensive {NLP} against a Very
24 |                Large Web Corpus},
25 |   journal   = {CoRR},
26 |   volume    = {abs/2112.09924},
27 |   year      = {2021},
28 |   url       = {https://arxiv.org/abs/2112.09924},
29 |   eprinttype = {arXiv},
30 |   eprint    = {2112.09924},
31 |   timestamp = {Tue, 04 Jan 2022 15:59:27 +0100},
32 |   biburl    = {https://dblp.org/rec/journals/corr/abs-2112-09924.bib},
33 |   bibsource = {dblp computer science bibliography, https://dblp.org}
34 | }
35 | ```
36 |
37 | ## Installation
38 | ```
39 | git clone git@github.com:facebookresearch/Sphere.git
40 | cd Sphere
41 | conda create -n sphere -y python=3.7 && conda activate sphere
42 | pip install -e .
43 | ```
44 |
45 | ## Index download
46 | We open source pre-built Sphere indices:
47 | - a Pyserini-compatible sparse BM25 index: [sphere_sparse_index.tar.gz](https://dl.fbaipublicfiles.com/sphere/sphere_sparse_index.tar.gz) - 775.6 GiB
48 | - a distributed-faiss-compatible dense DPR index: [sphere_dense_index.tar.gz](https://dl.fbaipublicfiles.com/sphere/sphere_dense_index.tar.gz) - 1.2 TiB
49 |
50 | You can download and unpack respective index files directly e.g.
via the browser or `wget`:
51 | ```
52 | mkdir -p faiss_index
53 |
54 | wget -P faiss_index https://dl.fbaipublicfiles.com/sphere/sphere_sparse_index.tar.gz
55 | tar -xzvf faiss_index/sphere_sparse_index.tar.gz -C faiss_index
56 |
57 | wget -P faiss_index https://dl.fbaipublicfiles.com/sphere/sphere_dense_index.tar.gz
58 | tar -xzvf faiss_index/sphere_dense_index.tar.gz -C faiss_index
59 | ```
60 |
61 | # Evaluation with [KILT](https://github.com/facebookresearch/KILT)
62 | We implement the retrieval metrics introduced in the paper:
63 | - the `answer-in-context@k`,
64 | - the `answer+entity-in-context@k`,
65 | - as well as the `entity-in-input` ablation metric
66 |
67 | within the KILT repository. Follow the instructions below to perform and evaluate retrieval on KILT tasks for both sparse and dense Sphere indices.
68 |
69 | ## KILT dependencies
70 | ```bash
71 | pip install -e git+https://github.com/facebookresearch/KILT#egg=KILT
72 | ```
73 |
74 | Download KILT data. Check out instructions in the [KILT](https://github.com/facebookresearch/KILT#download-the-data) repo for more details.
75 | ```bash
76 | mkdir -p data
77 | python src/kilt/scripts/download_all_kilt_data.py
78 | python src/kilt/scripts/get_triviaqa_input.py
79 | ```
80 |
81 | ## Dense index
82 | ### Install dependencies
83 | ```bash
84 | pip install -e git+https://github.com/facebookresearch/distributed-faiss#egg=distributed-faiss
85 | pip install -e git+https://github.com/facebookresearch/DPR@multi_task_training#egg=DPR
86 | pip install spacy==2.1.8
87 | python -m spacy download en
88 | ```
89 |
90 | ### Launch `distributed-faiss` server
91 | More details [here](https://github.com/facebookresearch/distributed-faiss#launching-servers-with-submitit-on-slurm-managed-clusters).
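The launch command in this section fixes the resource request through its flags. As a rough sketch of what those values imply, the snippet below derives the node count and the wall-clock limit. It assumes the launcher spreads servers evenly across nodes, which this README does not state; `launch_footprint` is a hypothetical helper, not part of `distributed-faiss`.

```python
# Flag values copied from the server_launcher.py command in this section.
NUM_SERVERS = 32          # --num-servers
SERVERS_PER_NODE = 4      # --num-servers-per-node
TIMEOUT_MIN = 4320        # --timeout-min


def launch_footprint(num_servers: int, servers_per_node: int, timeout_min: int) -> dict:
    """Derive node count and wall-clock limit from the launcher flags.

    Assumption: servers are packed evenly onto nodes, so a partially
    filled node still counts as one allocated node.
    """
    nodes = -(-num_servers // servers_per_node)  # ceiling division
    return {"nodes": nodes, "timeout_hours": timeout_min / 60}


print(launch_footprint(NUM_SERVERS, SERVERS_PER_NODE, TIMEOUT_MIN))
# → {'nodes': 8, 'timeout_hours': 72.0}
```

Under that assumption the command asks for 8 nodes for up to 72 hours, which is worth checking against the limits of the chosen SLURM partition.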
92 | ```bash
93 | python src/distributed-faiss/scripts/server_launcher.py \
94 |     --log-dir logs \
95 |     --discovery-config faiss_index/disovery_config.txt \
96 |     --num-servers 32 \
97 |     --num-servers-per-node 4 \
98 |     --timeout-min 4320 \
99 |     --save-dir faiss_index/ \
100 |     --mem-gb 500 \
101 |     --base-port 13034 \
102 |     --partition dev &
103 | ```
104 | ### Download assets
105 | - The DPR_web model: [dpr_web_biencoder.cp](http://dl.fbaipublicfiles.com/sphere/dpr_web_biencoder.cp)
106 | - The configuration file: [dpr_web_sphere.yaml](https://dl.fbaipublicfiles.com/sphere/dpr_web_sphere.yaml)
107 | ```bash
108 | mkdir -p checkpoints
109 | wget -P checkpoints http://dl.fbaipublicfiles.com/sphere/dpr_web_biencoder.cp
110 |
111 | mkdir -p configs
112 | wget -P configs https://dl.fbaipublicfiles.com/sphere/dpr_web_sphere.yaml
113 | ```
114 |
115 | Subsequently update the following fields in the `dpr_web_sphere.yaml` configuration file:
116 | ```bash
117 | n_docs: 100 # the number of documents to retrieve per query
118 | model_file: checkpoints/dpr_web_biencoder.cp # path to the downloaded model file
119 | rpc_retriever_cfg_file: faiss_index/disovery_config.txt # path to the discovery config file used when launching the distributed-faiss server
120 | rpc_index_id: dense # the name of the folder containing dense index partitions
121 | ```
122 |
123 | ### Execute retrieval
124 | In order to perform retrieval from the dense index, you first need to launch the distributed-faiss server as described above. You can control which KILT datasets you perform retrieval for by modifying the respective config files, e.g. `src/kilt/kilt/configs/dev_data.json`.
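Before running the retrieval command, it can help to confirm that the files referenced in `dpr_web_sphere.yaml` are actually in place. The following is a minimal sketch, not part of Sphere or KILT: the `missing_paths` helper is hypothetical, the field names and values are copied from the configuration fields listed above, and the config is written as a plain Python dict rather than parsed from the YAML file.

```python
from pathlib import Path

# Path-valued fields from dpr_web_sphere.yaml, as listed above.
PATH_FIELDS = ("model_file", "rpc_retriever_cfg_file")

# Values copied from the README; adjust them to match your own config.
cfg = {
    "n_docs": 100,
    "model_file": "checkpoints/dpr_web_biencoder.cp",
    "rpc_retriever_cfg_file": "faiss_index/disovery_config.txt",
    "rpc_index_id": "dense",
}


def missing_paths(cfg: dict, root: Path = Path(".")) -> list:
    """Return the names of path-valued fields whose files do not exist under root."""
    return [
        name
        for name in PATH_FIELDS
        if not (root / str(cfg.get(name, ""))).is_file()
    ]


problems = missing_paths(cfg)
if problems:
    print("Missing files for:", ", ".join(problems))
else:
    print("All referenced files are present.")
```

An empty result only means the files exist on disk; it does not check that the distributed-faiss servers listed in the discovery config are reachable.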
125 | ```bash 126 | python src/kilt/scripts/execute_retrieval.py \ 127 | --model_name dpr_distr \ 128 | --model_configuration configs/dpr_web_sphere.yaml \ 129 | --test_config src/kilt/kilt/configs/dev_data.json \ 130 | --output_folder output/dense/ 131 | ``` 132 | ## Sparse index 133 | ### Install dependencies 134 | Our sparse index relies on Pyserini, and therefore requires [an install of Java 11](https://github.com/castorini/pyserini#installation) to be available on the machine. 135 | ```bash 136 | pip install jnius 137 | pip install pyserini==0.9.4.0 138 | ``` 139 | 140 | Next, download the following file: 141 | - The configuration file: [bm25_sphere.json](https://dl.fbaipublicfiles.com/sphere/bm25_sphere.json) 142 | ```bash 143 | mkdir -p configs 144 | wget -P configs https://dl.fbaipublicfiles.com/sphere/bm25_sphere.json 145 | ``` 146 | 147 | Then update the following fields in the `bm25_sphere.json` configuration file: 148 | ```bash 149 | "k": 100, # the number of documents to retrieve per query 150 | "index": "faiss_index/sparse", # path to the unpacked sparse BM25 index 151 | ``` 152 | 153 | ### Execute retrieval 154 | ```bash 155 | python src/kilt/scripts/execute_retrieval.py \ 156 | --model_name bm25 \ 157 | --model_configuration configs/bm25_sphere.json \ 158 | --test_config src/kilt/kilt/configs/dev_data.json \ 159 | --output_folder output/sparse/ 160 | ``` 161 | 162 | ## Retrieval evaluation 163 | ```bash 164 | # Arguments: the retrieval results (the output of execute_retrieval.py, with $index set to dense or sparse), then the gold KILT file (available for download in the KILT repo) 165 | python src/kilt/kilt/eval_retrieval.py \ 166 | output/$index/$dataset-dev-kilt.jsonl \ 167 | data/$dataset-dev-kilt.jsonl \ 168 | --ks="1,20,100" 169 | ``` 170 | 171 | # Standalone dense index usage 172 | Install and launch `distributed-faiss`. More details on the `distributed-faiss` server [here](https://github.com/facebookresearch/distributed-faiss#launching-servers-with-submitit-on-slurm-managed-clusters). 
173 | 174 | ```bash 175 | pip install -e git+https://github.com/facebookresearch/distributed-faiss#egg=distributed-faiss 176 | ``` 177 | 178 | ```bash 179 | python src/distributed-faiss/scripts/server_launcher.py \ 180 | --log-dir logs/ \ 181 | --discovery-config faiss_index/disovery_config.txt \ 182 | --num-servers 32 \ 183 | --num-servers-per-node 4 \ 184 | --timeout-min 4320 \ 185 | --save-dir faiss_index/ \ 186 | --mem-gb 500 \ 187 | --base-port 13034 \ 188 | --partition dev & 189 | ``` 190 | 191 | ## Standalone client example 192 | For a minimal working example of querying the Sphere dense index, we suggest interacting with the DPR model via the `transformers` API. To that end, please install the dependencies: 193 | ```bash 194 | pip install transformers==4.17.0 195 | ``` 196 | Using the DPR checkpoint with the `transformers` API requires reformatting the original checkpoint. You can download and unpack the `transformers`-compatible DPR_web query encoder here: 197 | - [dpr_web_query_encoder_hf.tar.gz](https://dl.fbaipublicfiles.com/sphere/dpr_web_query_encoder_hf.tar.gz) 198 | 199 | ```bash 200 | mkdir -p checkpoints 201 | wget -P checkpoints https://dl.fbaipublicfiles.com/sphere/dpr_web_query_encoder_hf.tar.gz 202 | tar -xzvf checkpoints/dpr_web_query_encoder_hf.tar.gz -C checkpoints/ 203 | ``` 204 | Alternatively, you can convert the [`dpr_web_biencoder.cp`](http://dl.fbaipublicfiles.com/sphere/dpr_web_biencoder.cp) model yourself using the [available scripts](https://github.com/huggingface/transformers/blob/main/src/transformers/models/dpr/convert_dpr_original_checkpoint_to_pytorch.py). 205 | 206 | 207 | Then you can run the interactive demo: 208 | ```bash 209 | python scripts/sphere_client_demo_hf.py \ 210 | --encoder checkpoints/dpr_web_query_encoder_hf \ 211 | --discovery-config faiss_index/disovery_config.txt \ 212 | --index-id dense 213 | ``` 214 | 215 | # License 216 | `Sphere` is released under the CC-BY-NC 4.0 license. See the `LICENSE` file for details. 
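The standalone client in `scripts/sphere_client_demo_hf.py` appends a zero-valued extra dimension to each query vector because the dense index is built for L2 distance while DPR scores by dot product. The sketch below illustrates why this works on toy NumPy data. It assumes the document side was augmented following the recipe in the FAISS wiki (an extra coordinate `sqrt(phi - ||x||^2)`, with `phi` the largest squared norm); the helper names `augment_database` and `augment_queries` are hypothetical and are not part of Sphere or `distributed-faiss`.

```python
import numpy as np


def augment_database(vectors):
    """Append sqrt(phi - ||x||^2) to every document vector (assumed FAISS-wiki recipe)."""
    sq_norms = (vectors ** 2).sum(axis=1)
    phi = sq_norms.max()
    # Clip at zero to guard against tiny negative values from rounding.
    aux = np.sqrt(np.maximum(phi - sq_norms, 0.0))
    return np.hstack((vectors, aux.reshape(-1, 1)))


def augment_queries(queries):
    """Append a zero coordinate to every query, as the demo client does."""
    aux = np.zeros(len(queries), dtype=queries.dtype)
    return np.hstack((queries, aux.reshape(-1, 1)))


# Toy data standing in for document and query embeddings.
rng = np.random.default_rng(0)
docs = rng.standard_normal((50, 8))
queries = rng.standard_normal((3, 8))

docs_aug = augment_database(docs)
queries_aug = augment_queries(queries)

# Squared L2 distances in the augmented space and dot products in the original space.
l2 = ((queries_aug[:, None, :] - docs_aug[None, :, :]) ** 2).sum(axis=2)
ip = queries @ docs.T

# Nearest neighbour under L2 and best match under dot product, per query.
print(l2.argmin(axis=1), ip.argmax(axis=1))
```

In the augmented space the squared distance equals `||q||^2 + phi - 2 * q.x`, so for a fixed query the smallest L2 distance corresponds to the largest dot product.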
217 | -------------------------------------------------------------------------------- /scripts/download_index.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) Meta Platforms, Inc. and affiliates. 2 | # All rights reserved. 3 | 4 | # This source code is licensed under the license found in the 5 | # LICENSE file in the root directory of this source tree. 6 | 7 | import argparse 8 | import os.path 9 | import requests 10 | 11 | from tqdm import tqdm 12 | from pathlib import Path 13 | import tarfile 14 | 15 | 16 | SPHERE_URL = "http://dl.fbaipublicfiles.com/sphere" 17 | 18 | # dense index constants 19 | SPHERE_DENSE_PARTITIONS = 32 20 | PARTITIONS_FILES = ["buffer.pkl", "cfg.json", "meta.pkl", "index.faiss"] 21 | 22 | # sparse index constants 23 | SPARSE_FILENAME = "sphere_sparse_index.tar.gz" 24 | 25 | 26 | def download_file(url, file_path, overwrite): 27 | file_name = url.split("/")[-1] 28 | r = requests.get(url, stream=True) 29 | 30 | # Total size in bytes. 
31 | total_size = int(r.headers.get("content-length", 0)) 32 | 33 | if not overwrite and os.path.isfile(file_path): 34 | current_size = os.path.getsize(file_path) 35 | if total_size == current_size: 36 | print(" - Skipping " + file_name + " - already exists.") 37 | return 38 | 39 | block_size = 1024 # 1 Kibibyte 40 | t = tqdm( 41 | total=total_size, 42 | unit="iB", 43 | unit_scale=True, 44 | desc=" - Downloading " + file_name + ": ", 45 | ) 46 | with open(file_path, "wb") as f: 47 | for data in r.iter_content(block_size): 48 | t.update(len(data)) 49 | f.write(data) 50 | t.close() 51 | 52 | 53 | def download_sparse(dest_dir, overwrite): 54 | Path(dest_dir + "/sparse").mkdir(parents=True, exist_ok=True) 55 | print("Downloading compressed sparse index:") 56 | download_file( 57 | SPHERE_URL + "/" + SPARSE_FILENAME, 58 | dest_dir + "/sparse/" + SPARSE_FILENAME, 59 | overwrite, 60 | ) 61 | 62 | print("Extracting sparse index:") 63 | my_tar = tarfile.open(dest_dir + "/sparse/" + SPARSE_FILENAME) 64 | my_tar.extractall(dest_dir + "/sparse/") # specify which folder to extract to 65 | my_tar.close() 66 | 67 | print("Removing compressed sparse index:") 68 | os.remove(dest_dir + "/sparse/" + SPARSE_FILENAME) 69 | 70 | 71 | def download_dense(dest_dir, overwrite, partitions): 72 | Path(dest_dir + "/dense").mkdir(parents=True, exist_ok=True) 73 | for i in range(partitions): 74 | print("Downloading files for node {} out of {}:".format(i, partitions)) 75 | dense_suffix = "/dense/" + str(i) + "/" 76 | partition_dir = dest_dir + dense_suffix 77 | Path(partition_dir).mkdir(exist_ok=True) 78 | for file_name in PARTITIONS_FILES: 79 | download_file( 80 | SPHERE_URL + dense_suffix + file_name, 81 | partition_dir + file_name, 82 | overwrite, 83 | ) 84 | 85 | 86 | if __name__ == "__main__": 87 | parser = argparse.ArgumentParser() 88 | parser.add_argument( 89 | "--dest_dir", 90 | required=True, 91 | type=str, 92 | help="The path to a directory where index files should be stored", 93 | ) 94 
| parser.add_argument( 95 | "--index_type", 96 | required=True, 97 | choices=["dense", "sparse"], 98 | type=str, 99 | help="The type of index to download, choose dense or sparse.", 100 | ) 101 | parser.add_argument( 102 | "--overwrite", 103 | action="store_true", 104 | help="If set, existing files will be overwritten; otherwise files that already exist are skipped.", 105 | ) 106 | parser.add_argument( 107 | "--partitions", 108 | type=int, 109 | default=SPHERE_DENSE_PARTITIONS, 110 | help="The number of partitions the dense index is split into.", 111 | ) 112 | args = parser.parse_args() 113 | 114 | if args.index_type == "dense": 115 | download_dense(args.dest_dir, args.overwrite, args.partitions) 116 | elif args.index_type == "sparse": 117 | download_sparse(args.dest_dir, args.overwrite) 118 | else: 119 | raise ValueError 120 | -------------------------------------------------------------------------------- /scripts/sphere_client_demo_hf.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) Meta Platforms, Inc. and affiliates. 2 | # All rights reserved. 3 | 4 | # This source code is licensed under the license found in the 5 | # LICENSE file in the root directory of this source tree. 
6 | 7 | import argparse 8 | import numpy as np 9 | import torch 10 | import zlib 11 | 12 | from transformers import DPRQuestionEncoderTokenizer, DPRQuestionEncoder 13 | 14 | from distributed_faiss.client import IndexClient 15 | 16 | 17 | class RetrievalClient: 18 | def __init__(self, tokenizer, encoder, discovery_config, index_id): 19 | self.device = "cuda" if torch.cuda.is_available() else "cpu" 20 | self.tokenizer = DPRQuestionEncoderTokenizer.from_pretrained(tokenizer) 21 | self.encoder = DPRQuestionEncoder.from_pretrained(encoder).to(self.device) 22 | 23 | # connecting to the distributed-faiss server 24 | self.index_client = IndexClient(discovery_config) 25 | self.index_id = index_id 26 | print("Loading remote index") 27 | self.index_client.load_index(self.index_id, force_reload=False) 28 | print("Done loading index") 29 | 30 | def encode_query(self, questions): 31 | inputs = self.tokenizer.batch_encode_plus( 32 | questions, 33 | return_tensors="pt", 34 | padding="max_length", 35 | max_length=256, 36 | truncation=True, 37 | add_special_tokens=True, 38 | )["input_ids"].to(self.device) 39 | # See https://github.com/facebookresearch/DPR/blob/multi_task_training/dpr/models/hf_models.py#L307 40 | # to justify the line below 41 | inputs[:, -1] = self.tokenizer.sep_token_id 42 | with torch.no_grad(): 43 | question_tensors = self.encoder(inputs)[0] 44 | return question_tensors 45 | 46 | def get_top_docs( 47 | self, 48 | query_vectors: np.ndarray, 49 | top_docs: int = 100, 50 | use_l2_conversion: bool = True, 51 | ): 52 | results = [] 53 | # Sphere index is built for l2 distance - extra dim required for compatibility with dot product, more details 54 | # in https://github.com/facebookresearch/faiss/wiki/MetricType-and-distances#how-can-i-do-max-inner-product-search-on-indexes-that-support-only-l2 55 | if use_l2_conversion: 56 | aux_dim = np.zeros(len(query_vectors), dtype="float32") 57 | query_vectors = np.hstack((query_vectors, aux_dim.reshape(-1, 1))) 58 | scores, 
metas = self.index_client.search_with_filter(query_vectors, top_docs, self.index_id, 3, True) 59 | results.extend([(metas[q], scores[q]) for q in range(len(scores))]) 60 | return results 61 | 62 | 63 | def main(args): 64 | retrieval_client = RetrievalClient(args.tokenizer, args.encoder, args.discovery_config, args.index_id) 65 | 66 | while True: 67 | print("Type the query...") 68 | query = input() 69 | questions_tensor = retrieval_client.encode_query([query]) 70 | docs_and_scores = retrieval_client.get_top_docs(questions_tensor.cpu().numpy(), top_docs=args.k) 71 | for doc_ids, doc_scores in docs_and_scores: 72 | for i, t in enumerate(doc_ids): 73 | print("Doc id: ", t[0]) 74 | print("Title:", zlib.decompress(t[2]).decode()) 75 | print("Text:", zlib.decompress(t[1]).decode()) 76 | print("Retrieval score:", float(doc_scores[i])) 77 | print() 78 | 79 | 80 | def parse_args(): 81 | parser = argparse.ArgumentParser() 82 | parser.add_argument("--encoder", type=str, required=True) 83 | parser.add_argument("--discovery-config", type=str, required=True) 84 | parser.add_argument("--tokenizer", type=str, default="facebook/dpr-question_encoder-single-nq-base") 85 | parser.add_argument("--index-id", type=str, default="dense") 86 | parser.add_argument("-k", type=int, default=5, help="number of docs to retrieve") 87 | 88 | return parser.parse_args() 89 | 90 | 91 | if __name__ == "__main__": 92 | args = parse_args() 93 | main(args) 94 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) Meta Platforms, Inc. and affiliates. 2 | # All rights reserved. 3 | 4 | # This source code is licensed under the license found in the 5 | # LICENSE file in the root directory of this source tree. 
6 | 7 | from setuptools import setup 8 | 9 | with open("README.md") as f: 10 | readme = f.read() 11 | 12 | setup( 13 | name="sphere", 14 | version="0.0.1", 15 | description="The Sphere library", 16 | url="", # TODO 17 | classifiers=[ 18 | "Intended Audience :: Science/Research", 19 | "License :: CC-BY-NC", 20 | "Programming Language :: Python :: 3.7", 21 | "Topic :: Scientific/Engineering :: Artificial Intelligence", 22 | ], 23 | long_description=readme, 24 | long_description_content_type="text/markdown", 25 | setup_requires=["setuptools>=18.0"], 26 | install_requires=[ 27 | "black", 28 | "requests", 29 | "tqdm", 30 | ], 31 | ) 32 | -------------------------------------------------------------------------------- /sphere_logo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/facebookresearch/Sphere/79c84739cacb20518f1f9d9b85077c7b2587238d/sphere_logo.png --------------------------------------------------------------------------------