├── .github ├── FUNDING.yml ├── ISSUE_TEMPLATE │ ├── bug_report.md │ └── feature_request.md ├── PULL_REQUEST_TEMPLATE │ └── pull_request_template.md ├── labeler.yml └── workflows │ ├── greetings.yml │ ├── labeler.yml │ ├── sync.yml │ └── urlchecker.yml ├── CODE_OF_CONDUCT.md ├── CONTRIBUTING.md ├── LICENSE.md ├── README.md ├── README_French.md ├── README_Hindi.md ├── README_Portuguesse.md ├── README_Spanish.md ├── _config.yml └── logo.svg /.github/FUNDING.yml: -------------------------------------------------------------------------------- 1 | # These are supported funding model platforms 2 | 3 | github: [ agrover112 ] 4 | patreon: # Replace with a single Patreon username 5 | open_collective: # Replace with a single Open Collective username 6 | ko_fi: # Replace with a single Ko-fi username 7 | tidelift: # Replace with a single Tidelift platform-name/package-name e.g., npm/babel 8 | community_bridge: # Replace with a single Community Bridge project-name e.g., cloud-foundry 9 | liberapay: # Replace with a single Liberapay username 10 | issuehunt: # Replace with a single IssueHunt username 11 | otechie: # Replace with a single Otechie username 12 | lfx_crowdfunding: # Replace with a single LFX Crowdfunding project-name e.g., cloud-foundry 13 | custom: [ 'paypal.me/agrover112' ] 14 | -------------------------------------------------------------------------------- /.github/ISSUE_TEMPLATE/bug_report.md: -------------------------------------------------------------------------------- 1 | --- 2 | name: Bug report 3 | about: Create a report to help us improve 4 | title: "[BUG]" 5 | labels: bug 6 | assignees: '' 7 | 8 | --- 9 | 10 | Your issue may already be reported! 11 | Please search on the [issue tracker](https://github.com/Agrover112/awesome-semantic-search/issues) before creating one. 12 | 13 | 14 | **Describe the bug** 15 | A clear and concise description of what the bug is. 16 | 17 | **To Reproduce** 18 | Steps to reproduce the behavior: 19 | 1. Go to '...' 20 | 2. 
Click on '....' 21 | 3. Scroll down to '....' 22 | 4. See error 23 | 24 | **Expected behavior** 25 | A clear and concise description of what you expected to happen. 26 | 27 | **Screenshots** 28 | If applicable, add screenshots to help explain your problem. 29 | 30 | **Additional context** 31 | Add any other context about the problem here. 32 | -------------------------------------------------------------------------------- /.github/ISSUE_TEMPLATE/feature_request.md: -------------------------------------------------------------------------------- 1 | --- 2 | name: Feature request 3 | about: Suggest an idea for this project 4 | title: '' 5 | labels: '' 6 | assignees: '' 7 | 8 | --- 9 | 10 | **Is your feature request related to a problem? Please describe.** 11 | A clear and concise description of what the problem is. Ex. I'm always frustrated when [...] 12 | 13 | **Describe the solution you'd like** 14 | A clear and concise description of what you want to happen. 15 | 16 | **Describe alternatives you've considered** 17 | A clear and concise description of any alternative solutions or features you've considered. 18 | 19 | **Additional context** 20 | Add any other context or screenshots about the feature request here. 21 | -------------------------------------------------------------------------------- /.github/PULL_REQUEST_TEMPLATE/pull_request_template.md: -------------------------------------------------------------------------------- 1 | A similar PR may already be submitted! 2 | Please search among the [Pull request](https://github.com/Agrover112/awesome-semantic-search/pulls) before creating one. 3 | 4 | Thanks for submitting a pull request! Please provide enough information so that others can review your pull request: 5 | 6 | For more information, see the `CONTRIBUTING` guide. 
7 | 8 | 9 | **Summary** 10 | 11 | 12 | 13 | This PR fixes/implements the following **bugs/features** 14 | 15 | * [ ] Bug 1 16 | * [ ] Bug 2 17 | * [ ] Feature 1 18 | * [ ] Feature 2 19 | * [ ] Breaking changes 20 | 21 | 22 | 23 | Explain the **motivation** for making this change. What existing problem does the pull request solve? 24 | 25 | 26 | 27 | **Test plan (required)** 28 | 29 | Demonstrate the code is solid. Example: The exact commands you ran and their output, screenshots / videos if the pull request changes UI. 30 | 31 | 32 | 33 | **Code formatting** 34 | 35 | 36 | 37 | **Closing issues** 38 | 39 | 40 | Fixes # 41 | -------------------------------------------------------------------------------- /.github/labeler.yml: -------------------------------------------------------------------------------- 1 | # Add labels based on what README file language is being contributed to 2 | English: README.md 3 | Hindi: README_Hindi.md 4 | Spanish: README_Spanish.md 5 | -------------------------------------------------------------------------------- /.github/workflows/greetings.yml: -------------------------------------------------------------------------------- 1 | name: Greetings 2 | 3 | on: [pull_request, issues] 4 | 5 | jobs: 6 | greeting: 7 | runs-on: ubuntu-latest 8 | permissions: 9 | issues: write 10 | pull-requests: write 11 | steps: 12 | - uses: actions/first-interaction@v1 13 | with: 14 | repo-token: ${{ secrets.GITHUB_TOKEN }} 15 | issue-message: 'Message that will be displayed on users first issue' 16 | pr-message: 'Message that will be displayed on users first pull request' 17 | -------------------------------------------------------------------------------- /.github/workflows/labeler.yml: -------------------------------------------------------------------------------- 1 | # This workflow will triage pull requests and apply a label based on the 2 | # paths that are modified in the pull request. 
3 | # 4 | # To use this workflow, you will need to set up a .github/labeler.yml 5 | # file with configuration. For more information, see: 6 | # https://github.com/actions/labeler 7 | 8 | name: Labeler 9 | on: [pull_request_target] 10 | 11 | jobs: 12 | label: 13 | runs-on: ubuntu-latest 14 | 15 | steps: 16 | - uses: actions/labeler@v3 17 | with: 18 | repo-token: "${{ secrets.GITHUB_TOKEN }}" 19 | -------------------------------------------------------------------------------- /.github/workflows/sync.yml: -------------------------------------------------------------------------------- 1 | name: Sync Fork 2 | 3 | on: 4 | schedule: 5 | - cron: '*/30 * * * *' # every 30 minutes 6 | workflow_dispatch: # on button click 7 | 8 | jobs: 9 | sync: 10 | 11 | runs-on: ubuntu-latest 12 | 13 | steps: 14 | - uses: tgymnich/fork-sync@v1.4 15 | with: 16 | token: ${{ secrets.PERSONAL_TOKEN }} 17 | owner: llvm 18 | base: master 19 | head: master 20 | -------------------------------------------------------------------------------- /.github/workflows/urlchecker.yml: -------------------------------------------------------------------------------- 1 | 2 | name: Check URLs 3 | 4 | on: [push, pull_request] 5 | 6 | jobs: 7 | urlcheck: 8 | runs-on: ubuntu-latest 9 | 10 | steps: 11 | - uses: actions/checkout@v2 12 | - name: URLs-checker 13 | uses: urlstechie/urlchecker-action@0.0.27 14 | with: 15 | file_types: .md 16 | print_all: false 17 | timeout: 10 18 | retry_count: 3 19 | exclude_patterns: https://github.com/Agrover112/awesome-semantic-search/issues 20 | force_pass: true 21 | -------------------------------------------------------------------------------- /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | # Contributor Covenant Code of Conduct. 
2 | 3 | ## Our Pledge 4 | 5 | We as members, contributors, and leaders pledge to make participation in our 6 | community a harassment-free experience for everyone, regardless of age, body 7 | size, visible or invisible disability, ethnicity, sex characteristics, gender 8 | identity and expression, level of experience, education, socio-economic status, 9 | nationality, personal appearance, race, religion, or sexual identity 10 | and orientation. 11 | 12 | We pledge to act and interact in ways that contribute to an open, welcoming, 13 | diverse, inclusive, and healthy community. 14 | 15 | ## Our Standards 16 | 17 | Examples of behavior that contributes to a positive environment for our 18 | community include: 19 | 20 | * Demonstrating empathy and kindness toward other people 21 | * Being respectful of differing opinions, viewpoints, and experiences 22 | * Giving and gracefully accepting constructive feedback 23 | * Accepting responsibility and apologizing to those affected by our mistakes, 24 | and learning from the experience 25 | * Focusing on what is best not just for us as individuals, but for the 26 | overall community 27 | 28 | Examples of unacceptable behavior include: 29 | 30 | * The use of sexualized language or imagery, and sexual attention or 31 | advances of any kind 32 | * Trolling, insulting or derogatory comments, and personal or political attacks 33 | * Public or private harassment 34 | * Publishing others' private information, such as a physical or email 35 | address, without their explicit permission 36 | * Other conduct which could reasonably be considered inappropriate in a 37 | professional setting 38 | 39 | ## Enforcement Responsibilities 40 | 41 | Community leaders are responsible for clarifying and enforcing our standards of 42 | acceptable behavior and will take appropriate and fair corrective action in 43 | response to any behavior that they deem inappropriate, threatening, offensive, 44 | or harmful. 
45 | 46 | Community leaders have the right and responsibility to remove, edit, or reject 47 | comments, commits, code, wiki edits, issues, and other contributions that are 48 | not aligned to this Code of Conduct, and will communicate reasons for moderation 49 | decisions when appropriate. 50 | 51 | ## Scope 52 | 53 | This Code of Conduct applies within all community spaces, and also applies when 54 | an individual is officially representing the community in public spaces. 55 | Examples of representing our community include using an official e-mail address, 56 | posting via an official social media account, or acting as an appointed 57 | representative at an online or offline event. 58 | 59 | ## Enforcement 60 | 61 | Instances of abusive, harassing, or otherwise unacceptable behavior may be 62 | reported to the community leaders responsible for enforcement at 63 | . 64 | All complaints will be reviewed and investigated promptly and fairly. 65 | 66 | All community leaders are obligated to respect the privacy and security of the 67 | reporter of any incident. 68 | 69 | ## Enforcement Guidelines 70 | 71 | Community leaders will follow these Community Impact Guidelines in determining 72 | the consequences for any action they deem in violation of this Code of Conduct: 73 | 74 | ### 1. Correction 75 | 76 | **Community Impact**: Use of inappropriate language or other behavior deemed 77 | unprofessional or unwelcome in the community. 78 | 79 | **Consequence**: A private, written warning from community leaders, providing 80 | clarity around the nature of the violation and an explanation of why the 81 | behavior was inappropriate. A public apology may be requested. 82 | 83 | ### 2. Warning 84 | 85 | **Community Impact**: A violation through a single incident or series 86 | of actions. 87 | 88 | **Consequence**: A warning with consequences for continued behavior. 
No 89 | interaction with the people involved, including unsolicited interaction with 90 | those enforcing the Code of Conduct, for a specified period of time. This 91 | includes avoiding interactions in community spaces as well as external channels 92 | like social media. Violating these terms may lead to a temporary or 93 | permanent ban. 94 | 95 | ### 3. Temporary Ban 96 | 97 | **Community Impact**: A serious violation of community standards, including 98 | sustained inappropriate behavior. 99 | 100 | **Consequence**: A temporary ban from any sort of interaction or public 101 | communication with the community for a specified period of time. No public or 102 | private interaction with the people involved, including unsolicited interaction 103 | with those enforcing the Code of Conduct, is allowed during this period. 104 | Violating these terms may lead to a permanent ban. 105 | 106 | ### 4. Permanent Ban 107 | 108 | **Community Impact**: Demonstrating a pattern of violation of community 109 | standards, including sustained inappropriate behavior, harassment of an 110 | individual, or aggression toward or disparagement of classes of individuals. 111 | 112 | **Consequence**: A permanent ban from any sort of public interaction within 113 | the community. 114 | 115 | ## Attribution 116 | 117 | This Code of Conduct is adapted from the [Contributor Covenant][homepage], 118 | version 2.0, available at 119 | https://www.contributor-covenant.org/version/2/0/code_of_conduct.html. 120 | 121 | Community Impact Guidelines were inspired by [Mozilla's code of conduct 122 | enforcement ladder](https://github.com/mozilla/diversity). 123 | 124 | [homepage]: https://www.contributor-covenant.org 125 | 126 | For answers to common questions about this code of conduct, see the FAQ at 127 | https://www.contributor-covenant.org/faq. Translations are available at 128 | https://www.contributor-covenant.org/translations. 
129 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # CONTRIBUTING GUIDELINES 2 | Please take a moment to review this document in order to make the contribution process easy and effective for everyone involved. 3 | 4 | Following these guidelines helps to communicate that you respect the time of the developers managing and developing this open source project. In return, they should reciprocate that respect in addressing your issue or assessing patches and features. 5 | ## Some contributing rules you should follow 6 | 7 | - Be critical: is the proposed library, paper, or conference really awesome? If so, add it at the end of the relevant section. Bear in mind that in many cases one resource may fit multiple categories; choose exactly one. 8 | - Make proper use of [discussion](https://github.com/Agrover112/awesome-semantic-search/discussions), using appropriate language. 9 | - Check if the resource you are adding already exists in the [list](https://github.com/Agrover112/awesome-semantic-search#papers) 10 | - Check for broken or relocated links. 11 | - If this is your first contribution, you might also want to take up issues with the good first issue or the help wanted label. 12 | 13 | - Discuss the changes you wish to make by creating an [issue](https://github.com/Agrover112/awesome-semantic-search/issues/new) or commenting on an [existing issue](https://github.com/Agrover112/awesome-semantic-search/issues). 14 | - Descriptions should start with a capital letter and end with proper punctuation. 15 | - Once the maintainer has assigned you the issue, you can go ahead and fork the repo, clone it, and make changes to fix the issue.
16 | - Please follow [**conventional commits**](https://www.conventionalcommits.org/en/v1.0.0-beta.2/) 17 | 18 | ## Making your Pull Request 19 | 20 | - Good pull requests - patches, improvements, new features - are a fantastic help. They should remain focused in scope and avoid containing unrelated commits. 21 | 22 | - You can create a pull request referencing the number of the issue you fixed. 23 | 24 | - Once you have done this, a maintainer will review your pull request. If it satisfies the requirements of the corresponding issue, it will be merged. 25 | 26 | Kudos to you :balloon: 27 | 28 | --- 29 | 30 | Thank you for contributing to [awesome-semantic-search](https://github.com/Agrover112/awesome-semantic-search). 31 | -------------------------------------------------------------------------------- /LICENSE.md: -------------------------------------------------------------------------------- 1 | Creative Commons Legal Code 2 | 3 | CC0 1.0 Universal 4 | 5 | CREATIVE COMMONS CORPORATION IS NOT A LAW FIRM AND DOES NOT PROVIDE 6 | LEGAL SERVICES. DISTRIBUTION OF THIS DOCUMENT DOES NOT CREATE AN 7 | ATTORNEY-CLIENT RELATIONSHIP. CREATIVE COMMONS PROVIDES THIS 8 | INFORMATION ON AN "AS-IS" BASIS. CREATIVE COMMONS MAKES NO WARRANTIES 9 | REGARDING THE USE OF THIS DOCUMENT OR THE INFORMATION OR WORKS 10 | PROVIDED HEREUNDER, AND DISCLAIMS LIABILITY FOR DAMAGES RESULTING FROM 11 | THE USE OF THIS DOCUMENT OR THE INFORMATION OR WORKS PROVIDED 12 | HEREUNDER. 13 | 14 | Statement of Purpose 15 | 16 | The laws of most jurisdictions throughout the world automatically confer 17 | exclusive Copyright and Related Rights (defined below) upon the creator 18 | and subsequent owner(s) (each and all, an "owner") of an original work of 19 | authorship and/or a database (each, a "Work").
20 | 21 | Certain owners wish to permanently relinquish those rights to a Work for 22 | the purpose of contributing to a commons of creative, cultural and 23 | scientific works ("Commons") that the public can reliably and without fear 24 | of later claims of infringement build upon, modify, incorporate in other 25 | works, reuse and redistribute as freely as possible in any form whatsoever 26 | and for any purposes, including without limitation commercial purposes. 27 | These owners may contribute to the Commons to promote the ideal of a free 28 | culture and the further production of creative, cultural and scientific 29 | works, or to gain reputation or greater distribution for their Work in 30 | part through the use and efforts of others. 31 | 32 | For these and/or other purposes and motivations, and without any 33 | expectation of additional consideration or compensation, the person 34 | associating CC0 with a Work (the "Affirmer"), to the extent that he or she 35 | is an owner of Copyright and Related Rights in the Work, voluntarily 36 | elects to apply CC0 to the Work and publicly distribute the Work under its 37 | terms, with knowledge of his or her Copyright and Related Rights in the 38 | Work and the meaning and intended legal effect of CC0 on those rights. 39 | 40 | 1. Copyright and Related Rights. A Work made available under CC0 may be 41 | protected by copyright and related or neighboring rights ("Copyright and 42 | Related Rights"). Copyright and Related Rights include, but are not 43 | limited to, the following: 44 | 45 | i. the right to reproduce, adapt, distribute, perform, display, 46 | communicate, and translate a Work; 47 | ii. moral rights retained by the original author(s) and/or performer(s); 48 | iii. publicity and privacy rights pertaining to a person's image or 49 | likeness depicted in a Work; 50 | iv. rights protecting against unfair competition in regards to a Work, 51 | subject to the limitations in paragraph 4(a), below; 52 | v. 
rights protecting the extraction, dissemination, use and reuse of data 53 | in a Work; 54 | vi. database rights (such as those arising under Directive 96/9/EC of the 55 | European Parliament and of the Council of 11 March 1996 on the legal 56 | protection of databases, and under any national implementation 57 | thereof, including any amended or successor version of such 58 | directive); and 59 | vii. other similar, equivalent or corresponding rights throughout the 60 | world based on applicable law or treaty, and any national 61 | implementations thereof. 62 | 63 | 2. Waiver. To the greatest extent permitted by, but not in contravention 64 | of, applicable law, Affirmer hereby overtly, fully, permanently, 65 | irrevocably and unconditionally waives, abandons, and surrenders all of 66 | Affirmer's Copyright and Related Rights and associated claims and causes 67 | of action, whether now known or unknown (including existing as well as 68 | future claims and causes of action), in the Work (i) in all territories 69 | worldwide, (ii) for the maximum duration provided by applicable law or 70 | treaty (including future time extensions), (iii) in any current or future 71 | medium and for any number of copies, and (iv) for any purpose whatsoever, 72 | including without limitation commercial, advertising or promotional 73 | purposes (the "Waiver"). Affirmer makes the Waiver for the benefit of each 74 | member of the public at large and to the detriment of Affirmer's heirs and 75 | successors, fully intending that such Waiver shall not be subject to 76 | revocation, rescission, cancellation, termination, or any other legal or 77 | equitable action to disrupt the quiet enjoyment of the Work by the public 78 | as contemplated by Affirmer's express Statement of Purpose. 79 | 80 | 3. Public License Fallback. 
Should any part of the Waiver for any reason 81 | be judged legally invalid or ineffective under applicable law, then the 82 | Waiver shall be preserved to the maximum extent permitted taking into 83 | account Affirmer's express Statement of Purpose. In addition, to the 84 | extent the Waiver is so judged Affirmer hereby grants to each affected 85 | person a royalty-free, non transferable, non sublicensable, non exclusive, 86 | irrevocable and unconditional license to exercise Affirmer's Copyright and 87 | Related Rights in the Work (i) in all territories worldwide, (ii) for the 88 | maximum duration provided by applicable law or treaty (including future 89 | time extensions), (iii) in any current or future medium and for any number 90 | of copies, and (iv) for any purpose whatsoever, including without 91 | limitation commercial, advertising or promotional purposes (the 92 | "License"). The License shall be deemed effective as of the date CC0 was 93 | applied by Affirmer to the Work. Should any part of the License for any 94 | reason be judged legally invalid or ineffective under applicable law, such 95 | partial invalidity or ineffectiveness shall not invalidate the remainder 96 | of the License, and in such case Affirmer hereby affirms that he or she 97 | will not (i) exercise any of his or her remaining Copyright and Related 98 | Rights in the Work or (ii) assert any associated claims and causes of 99 | action with respect to the Work, in either case contrary to Affirmer's 100 | express Statement of Purpose. 101 | 102 | 4. Limitations and Disclaimers. 103 | 104 | a. No trademark or patent rights held by Affirmer are waived, abandoned, 105 | surrendered, licensed or otherwise affected by this document. 106 | b. 
Affirmer offers the Work as-is and makes no representations or 107 | warranties of any kind concerning the Work, express, implied, 108 | statutory or otherwise, including without limitation warranties of 109 | title, merchantability, fitness for a particular purpose, non 110 | infringement, or the absence of latent or other defects, accuracy, or 111 | the present or absence of errors, whether or not discoverable, all to 112 | the greatest extent permissible under applicable law. 113 | c. Affirmer disclaims responsibility for clearing rights of other persons 114 | that may apply to the Work or any use thereof, including without 115 | limitation any person's Copyright and Related Rights in the Work. 116 | Further, Affirmer disclaims responsibility for obtaining any necessary 117 | consents, permissions or other rights required for any use of the 118 | Work. 119 | d. Affirmer understands and acknowledges that Creative Commons is not a 120 | party to this document and has no duty or obligation with respect to 121 | this CC0 or use of the Work. 122 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Awesome Semantic-Search [![Awesome](https://awesome.re/badge.svg)](https://awesome.re) [![Conventional Commits](https://img.shields.io/badge/Conventional%20Commits-1.0.0-yellow.svg)](https://conventionalcommits.org) 2 | 3 | 4 | 5 | 6 | 7 | 8 | Logo made by [@createdbytango](https://instagram.com/createdbytango). 9 | 10 | **Looking for More Paper Additions. 11 | PS: Raise a PR** 12 | 13 | This repository aims to serve as a meta-repository for [Semantic Search](https://en.wikipedia.org/wiki/Semantic_search) and [Semantic Similarity](http://nlpprogress.com/english/semantic_textual_similarity.html) related tasks. 14 | 15 | Semantic Search isn't limited to text!
It can be done with images, speech, etc. There are numerous use cases and applications of semantic search. 16 | 17 | Feel free to raise a PR on this repo! 18 | 19 | ## Contents 20 | 21 | - [Papers](#papers) 22 | - [2014](#2014) 23 | - [2015](#2015) 24 | - [2016](#2016) 25 | - [2017](#2017) 26 | - [2018](#2018) 27 | - [2019](#2019) 28 | - [2020](#2020) 29 | - [2021](#2021) 30 | - [2022](#2022) 31 | - [2023](#2023) 32 | - [Articles](#articles) 33 | - [Libraries and Tools](#libraries-and-tools) 34 | - [Datasets](#datasets) 35 | - [Milestones](#milestones) 36 | 37 | ## Papers 38 | 39 | ### 2010 40 | - [Priority Range Trees](https://arxiv.org/abs/1009.3527) 41 | - [Information Retrieval and the semantic web](https://ieeexplore.ieee.org/document/5607549) 📄 42 | 43 | ### 2014 44 | - [A Latent Semantic Model with Convolutional-Pooling 45 | Structure for Information Retrieval](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/cikm2014_cdssm_final.pdf) 📄 46 | 47 | ### 2015 48 | - [Skip-Thought Vectors](https://arxiv.org/pdf/1506.06726.pdf) 📄 49 | - [Practical and Optimal LSH for Angular Distance](https://proceedings.neurips.cc/paper/2015/hash/2823f4797102ce1a1aec05359cc16dd9-Abstract.html) 50 | 51 | ### 2016 52 | - [Bag of Tricks for Efficient Text Classification](https://arxiv.org/abs/1607.01759) 📄 53 | - [Enriching Word Vectors with Subword Information](https://arxiv.org/abs/1607.04606) 📄 54 | - [Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs](https://arxiv.org/abs/1603.09320) 55 | - [On Approximately Searching for Similar Word Embeddings](https://www.aclweb.org/anthology/P16-1214.pdf) 56 | - [Learning Distributed Representations of Sentences from Unlabelled Data](https://arxiv.org/abs/1602.03483)📄 57 | - [Approximate Nearest Neighbor Search on High Dimensional Data --- Experiments, Analyses, and Improvement](https://arxiv.org/abs/1610.02455) 58 | 59 | ### 2017 60 | - [Supervised Learning
of Universal Sentence Representations from Natural Language Inference Data](https://research.fb.com/wp-content/uploads/2017/09/emnlp2017.pdf) 📄 61 | - [Semantic Textual Similarity For Hindi](https://www.semanticscholar.org/paper/Semantic-Textual-Similarity-For-Hindi-Mujadia-Mamidi/372f615ce36d7543512b8e40d6de51d17f316e0b)📄 62 | - [Efficient Natural Language Response Suggestion for Smart Reply](https://arxiv.org/abs/1705.00652)📃 63 | 64 | ### 2018 65 | - [Universal Sentence Encoder](https://arxiv.org/pdf/1803.11175.pdf) 📄 66 | - [Learning Semantic Textual Similarity from Conversations](https://arxiv.org/pdf/1804.07754.pdf) 📄 67 | - [Google AI Blog: Advances in Semantic Textual Similarity](https://ai.googleblog.com/2018/05/advances-in-semantic-textual-similarity.html) 📄 68 | - [Speech2Vec: A Sequence-to-Sequence Framework for Learning Word Embeddings from Speech](https://arxiv.org/abs/1803.08976)🔊 69 | - [Optimization of Indexing Based on k-Nearest Neighbor Graph for Proximity Search in High-dimensional Data](https://arxiv.org/abs/1810.07355) 🔊 70 | - [Fast Approximate Nearest Neighbor Search With The 71 | Navigating Spreading-out Graph](http://www.vldb.org/pvldb/vol12/p461-fu.pdf) 72 | - [The Case for Learned Index Structures](https://dl.acm.org/doi/10.1145/3183713.3196909) 73 | 74 | ### 2019 75 | - [LASER: Language Agnostic Sentence Representations](https://engineering.fb.com/2019/01/22/ai-research/laser-multilingual-sentence-embeddings/) 📄 76 | - [Document Expansion by Query Prediction](https://arxiv.org/abs/1904.08375) 📄 77 | - [Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks](https://arxiv.org/pdf/1908.10084.pdf) 📄 78 | - [Multi-Stage Document Ranking with BERT](https://arxiv.org/abs/1910.14424) 📄 79 | - [Latent Retrieval for Weakly Supervised Open Domain Question Answering](https://arxiv.org/abs/1906.00300) 80 | - [End-to-End Open-Domain Question Answering with BERTserini](https://www.aclweb.org/anthology/N19-4013/) 81 | - [BioBERT: a
pre-trained biomedical language representation model for biomedical text mining](https://arxiv.org/abs/1901.08746)📄 82 | - [Analyzing and Improving Representations with the Soft Nearest Neighbor Loss](https://arxiv.org/pdf/1902.01889.pdf)📷 83 | - [DiskANN: Fast Accurate Billion-point Nearest 84 | Neighbor Search on a Single Node](https://proceedings.neurips.cc/paper/2019/file/09853c7fb1d3f8ee67a61b6bf4a7f8e6-Paper.pdf) 85 | 86 | ### 2020 87 | - [Rapidly Deploying a Neural Search Engine for the COVID-19 Open Research Dataset: Preliminary Thoughts and Lessons Learned](https://arxiv.org/abs/2004.05125) 📄 88 | - [PASSAGE RE-RANKING WITH BERT](https://arxiv.org/pdf/1901.04085.pdf) 📄 89 | - [CO-Search: COVID-19 Information Retrieval with Semantic Search, Question Answering, and Abstractive Summarization](https://arxiv.org/pdf/2006.09595.pdf) 📄 90 | - [LaBSE: Language-agnostic BERT Sentence Embedding](https://arxiv.org/abs/2007.01852) 📄 91 | - [Covidex: Neural Ranking Models and Keyword Search Infrastructure for the COVID-19 Open Research Dataset](https://arxiv.org/abs/2007.07846) 📄 92 | - [DeText: A deep NLP framework for intelligent text understanding](https://engineering.linkedin.com/blog/2020/open-sourcing-detext) 📄 93 | - [Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation](https://arxiv.org/pdf/2004.09813.pdf) 📄 94 | - [Pretrained Transformers for Text Ranking: BERT and Beyond](https://arxiv.org/abs/2010.06467) 📄 95 | - [REALM: Retrieval-Augmented Language Model Pre-Training](https://arxiv.org/abs/2002.08909) 96 | - [ELECTRA: PRE-TRAINING TEXT ENCODERS AS DISCRIMINATORS RATHER THAN GENERATORS](https://openreview.net/pdf?id=r1xMH1BtvB)📄 97 | - [Improving Deep Learning For Airbnb Search](https://arxiv.org/pdf/2002.05515) 98 | - [Managing Diversity in Airbnb Search](https://arxiv.org/abs/2004.02621)📄 99 | - [Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval](https://arxiv.org/abs/2007.00808v1)📄 100 | -
[Unsupervised Image Style Embeddings for Retrieval and Recognition Tasks](https://openaccess.thecvf.com/content_WACV_2020/papers/Gairola_Unsupervised_Image_Style_Embeddings_for_Retrieval_and_Recognition_Tasks_WACV_2020_paper.pdf)📷 101 | - [DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations](https://arxiv.org/abs/2006.03659)📄 102 | 103 | ### 2021 104 | - [Hybrid approach for semantic similarity calculation between Tamil words](https://www.researchgate.net/publication/350112163_Hybrid_approach_for_semantic_similarity_calculation_between_Tamil_words) 📄 105 | - [Augmented SBERT](https://arxiv.org/pdf/2010.08240.pdf) 📄 106 | - [BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models](https://arxiv.org/abs/2104.08663) 📄 107 | - [Compatibility-aware Heterogeneous Visual Search](https://arxiv.org/abs/2105.06047) 📷 108 | - [Learning Personal Style from Few Examples](https://chuanenlin.com/personalstyle)📷 109 | - [TSDAE: Using Transformer-based Sequential Denoising Auto-Encoder for Unsupervised Sentence Embedding Learning](https://arxiv.org/abs/2104.06979)📄 110 | - [A Survey of Transformers](https://arxiv.org/abs/2106.04554)📄📷 111 | - [SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking](https://dl.acm.org/doi/10.1145/3404835.3463098)📄 112 | - [High Quality Related Search Query Suggestions using Deep Reinforcement Learning](https://arxiv.org/abs/2108.04452v1) 113 | - [Embedding-based Product Retrieval in Taobao Search](https://arxiv.org/pdf/2106.09297.pdf)📄📷 114 | - [TPRM: A Topic-based Personalized Ranking Model for Web Search](https://arxiv.org/abs/2108.06014)📄 115 | - [mMARCO: A Multilingual Version of MS MARCO Passage Ranking Dataset](https://arxiv.org/abs/2108.13897)📄 116 | - [Database Reasoning Over Text](https://aclanthology.org/2021.acl-long.241.pdf)📄 117 | - [How Does Adversarial Fine-Tuning Benefit BERT?](https://arxiv.org/abs/2108.13602)📄 118 | - [Train Short, Test Long: Attention with Linear
Biases Enables Input Length Extrapolation](https://arxiv.org/abs/2108.12409)📄 119 | - [Primer: Searching for Efficient Transformers for Language Modeling](https://arxiv.org/abs/2109.08668)📄 120 | - [How Familiar Does That Sound? Cross-Lingual Representational 121 | Similarity Analysis of Acoustic Word Embeddings](https://arxiv.org/pdf/2109.10179.pdf)🔊 122 | - [SimCSE: Simple Contrastive Learning of Sentence Embeddings](https://arxiv.org/abs/2104.08821)📄 123 | - [Compositional Attention: Disentangling Search and Retrieval](https://arxiv.org/abs/2110.09419)📄📷 124 | - [SPANN: Highly-efficient Billion-scale Approximate Nearest Neighbor Search](https://arxiv.org/abs/2111.08566) 125 | - [GPL: Generative Pseudo Labeling for Unsupervised Domain Adaptation of Dense Retrieval](https://arxiv.org/abs/2112.07577) 📄 126 | - [Generative Search Engines: Initial Experiments](https://computationalcreativity.net/iccc21/wp-content/uploads/2021/09/ICCC_2021_paper_50.pdf) 📷 127 | - [Rethinking Search: Making Domain Experts out of Dilettantes](https://dl.acm.org/doi/10.1145/3476415.3476428) 128 | - [WhiteningBERT: An Easy Unsupervised Sentence Embedding Approach](https://arxiv.org/abs/2104.01767) 129 | 130 | ### 2022 131 | - [Text and Code Embeddings by Contrastive Pre-Training](https://arxiv.org/abs/2201.10005)📄 132 | - [RELIC: Retrieving Evidence for Literary Claims](https://arxiv.org/abs/2203.10053)📄 133 | - [Trans-Encoder: Unsupervised sentence-pair modelling through self- and mutual-distillations](https://arxiv.org/abs/2109.13059)📄 134 | - [SAMU-XLSR: Semantically-Aligned Multimodal Utterance-level Cross-Lingual Speech Representation](https://arxiv.org/abs/2205.08180)🔊 135 | - [An Analysis of Fusion Functions for Hybrid Retrieval](https://arxiv.org/abs/2210.11934)📄 136 | - [Out-of-distribution Detection with Deep Nearest Neighbors](https://arxiv.org/abs/2204.06507) 137 | - [ESB: A Benchmark For Multi-Domain End-to-End Speech Recognition](https://arxiv.org/abs/2210.13352)🔊 138 | - 
[Analyzing Acoustic Word Embeddings From Pre-Trained Self-Supervised Speech Models](https://arxiv.org/pdf/2210.16043.pdf)🔊 139 | - [Rethinking with Retrieval: Faithful Large Language Model Inference](https://arxiv.org/abs/2301.00303)📄 140 | - [Precise Zero-Shot Dense Retrieval without Relevance Labels](https://arxiv.org/pdf/2212.10496.pdf)📄 141 | - [Transformer Memory as a Differentiable Search Index](https://arxiv.org/abs/2202.06991)📄 142 | 143 | ### 2023 144 | - [FINGER: Fast Inference for Graph-based Approximate Nearest Neighbor Search](https://dl.acm.org/doi/10.1145/3543507.3583318)📄 145 | - [“Low-Resource” Text Classification: A Parameter-Free Classification Method with Compressors](https://aclanthology.org/2023.findings-acl.426/)📄 146 | - [SparseEmbed: Learning Sparse Lexical Representations with Contextual Embeddings for Retrieval](https://dl.acm.org/doi/pdf/10.1145/3539618.3592065) 📄 147 | 148 | ## Articles 149 | - [Tackling Semantic Search](https://adityamalte.substack.com/p/tackle-semantic-search/) 150 | - [Semantic search in Azure Cognitive Search](https://docs.microsoft.com/en-us/azure/search/semantic-search-overview) 151 | - [How we used semantic search to make our search 10x smarter](https://zilliz.com/blog/How-we-used-semantic-search-to-make-our-search-10-x-smarter/) 152 | - [Stanford AI Blog : Building Scalable, Explainable, and Adaptive NLP Models with Retrieval](https://ai.stanford.edu/blog/retrieval-based-NLP/) 153 | - [Building a semantic search engine with dual space word embeddings](https://m.mage.ai/building-a-semantic-search-engine-with-dual-space-word-embeddings-f5a596eb6d90) 154 | - [Billion-scale semantic similarity search with FAISS+SBERT](https://towardsdatascience.com/billion-scale-semantic-similarity-search-with-faiss-sbert-c845614962e2) 155 | - [Some observations about similarity search thresholds](https://greglandrum.github.io/rdkit-blog/similarity/reference/2021/05/26/similarity-threshold-observations1.html) 156 | - [Near 
Duplicate Image Search using Locality Sensitive Hashing](https://keras.io/examples/vision/near_dup_search/) 157 | - [Free Course on Vector Similarity Search and Faiss](https://link.medium.com/HtFoFKlKvkb) 158 | - [Comprehensive Guide To Approximate Nearest Neighbors Algorithms](https://link.medium.com/V62Z8drvEkb) 159 | - [Introducing the hybrid index to enable keyword-aware semantic search](https://www.pinecone.io/learn/hybrid-search/?utm_medium=email&_hsmi=0&_hsenc=p2ANqtz--zLu9hiyh-y_XTa7FCEpi8JESJKmif5dhpYtAxTWka8PIttaTOGE21LMZlg9EOZyPYpCm6GDvYy57tlGRwH6TjgLCsJg&utm_content=231741722&utm_source=hs_email) 160 | - [Argilla Semantic Search](https://docs.argilla.io/en/latest/guides/features/semantic-search.html) 161 | - [Co:here's Multilingual Text Understanding Model](https://txt.cohere.ai/multilingual/) 162 | - [Simplify Search with Multilingual Embedding Models](https://blog.vespa.ai/simplify-search-with-multilingual-embeddings/) 163 | 164 | ## Libraries and Tools 165 | - [fastText](https://fasttext.cc/) 166 | - [Universal Sentence Encoder](https://tfhub.dev/google/universal-sentence-encoder/4) 167 | - [SBERT](https://www.sbert.net/) 168 | - [ELECTRA](https://github.com/google-research/electra) 169 | - [LaBSE](https://tfhub.dev/google/LaBSE/2) 170 | - [LASER](https://github.com/facebookresearch/LASER) 171 | - [Relevance AI - Vector Platform From Experimentation To Deployment](https://relevance.ai) 172 | - [Haystack](https://github.com/deepset-ai/haystack/) 173 | - [Jina.AI](https://jina.ai/) 174 | - [pinecone](https://www.pinecone.io/) 175 | - [SentEval Toolkit](https://github.com/facebookresearch/SentEval?utm_source=catalyzex.com) 176 | - [ranx](https://github.com/AmenRa/ranx) 177 | - [BEIR: Benchmarking IR](https://github.com/UKPLab/beir) 178 | - [RELiC: Retrieving Evidence for Literary Claims Dataset](https://relic.cs.umass.edu/) 179 | - [matchzoo-py](https://github.com/NTMC-Community/MatchZoo-py) 180 | - 
[deep_text_matching](https://github.com/wangle1218/deep_text_matching) 181 | - [Which Frame?](http://whichframe.com/) 182 | - [lexica.art](https://lexica.art/) 183 | - [emoji semantic search](https://github.com/lilianweng/emoji-semantic-search) 184 | - [PySerini](https://github.com/castorini/pyserini) 185 | - [BERTSerini](https://github.com/rsvp-ai/bertserini) 186 | - [BERTSimilarity](https://github.com/Brokenwind/BertSimilarity) 187 | - [milvus](https://www.milvus.io/) 188 | - [NeuroNLP++](https://plusplus.neuronlp.fruitflybrain.org/) 189 | - [weaviate](https://github.com/semi-technologies/weaviate) 190 | - [semantic-search-through-wikipedia-with-weaviate](https://github.com/semi-technologies/semantic-search-through-wikipedia-with-weaviate) 191 | - [natural-language-youtube-search](https://github.com/haltakov/natural-language-youtube-search) 192 | - [same.energy](https://www.same.energy/about) 193 | - [ann benchmarks](http://ann-benchmarks.com/) 194 | - [scaNN](https://github.com/google-research/google-research/tree/master/scann) 195 | - [REALM](https://github.com/google-research/language/tree/master/language/realm) 196 | - [annoy](https://github.com/spotify/annoy) 197 | - [pynndescent](https://github.com/lmcinnes/pynndescent) 198 | - [nsg](https://github.com/ZJULearning/nsg) 199 | - [FALCONN](https://github.com/FALCONN-LIB/FALCONN) 200 | - [redis HNSW](https://github.com/zhao-lang/redis_hnsw) 201 | - [autofaiss](https://github.com/criteo/autofaiss) 202 | - [DPR](https://github.com/facebookresearch/DPR) 203 | - [rank_BM25](https://github.com/dorianbrown/rank_bm25) 204 | - [FlashRank](https://github.com/PrithivirajDamodaran/FlashRank) 205 | - [nearPy](http://pixelogik.github.io/NearPy/) 206 | - [vearch](https://github.com/vearch/vearch) 207 | - [vespa](https://github.com/vespa-engine/vespa) 208 | - [PyNNDescent](https://github.com/lmcinnes/pynndescent) 209 | - [pgANN](https://github.com/netrasys/pgANN) 210 | - [Tensorflow 
Similarity](https://github.com/tensorflow/similarity) 211 | - [opensemanticsearch.org](https://www.opensemanticsearch.org/) 212 | - [GPT3 Semantic Search](https://gpt3demo.com/category/semantic-search) 213 | - [searchy](https://github.com/lubianat/searchy) 214 | - [txtai](https://github.com/neuml/txtai) 215 | - [HyperTag](https://github.com/Ravn-Tech/HyperTag) 216 | - [vectorai](https://github.com/vector-ai/vectorai) 217 | - [embeddinghub](https://github.com/featureform/embeddinghub) 218 | - [AquilaDb](https://github.com/Aquila-Network/AquilaDB) 219 | - [STripNet](https://github.com/stephenleo/stripnet) 220 | 221 | ## Datasets 222 | - [Semantic Text Similarity Dataset Hub](https://github.com/brmson/dataset-sts) 223 | - [Facebook AI Image Similarity Challenge](https://www.drivendata.org/competitions/79/competition-image-similarity-1-dev/?fbclid=IwAR31vRV0EdxRdrxtPy12neZtBJQ0H9qdLHm8Wl2DjHY09PtQdn1nEEIJVUo) 224 | - [WIT : Wikipedia-based Image Text Dataset](https://github.com/google-research-datasets/wit) 225 | - [BEIR](https://github.com/beir-cellar/beir) 226 | - MTEB 227 | 228 | ## Milestones 229 | 230 | Have a look at the [project board](https://github.com/Agrover112/awesome-semantic-search/projects/1) for the task list to contribute to any of the open issues. 231 | -------------------------------------------------------------------------------- /README_French.md: -------------------------------------------------------------------------------- 1 | # Impressionnant Recherche-Sémantique [![Impressionnant](https://awesome.re/badge.svg)](https://awesome.re) [![Commits Conventionnels](https://img.shields.io/badge/Commits%20Conventionnels-1.0.0-jaune.svg)](https://conventionalcommits.org) 2 | 3 | 4 | 5 | Logo réalisé par [@createdbytango](https://instagram.com/createdbytango). 6 | 7 | **À la recherche d'ajouts de papiers supplémentaires. 
8 | PS : Soumettez une Pull Request** 9 | 10 | Le référentiel suivant vise à servir de méta-référentiel pour les tâches liées à la [recherche sémantique](https://en.wikipedia.org/wiki/Semantic_search) et à la [similarité sémantique](http://nlpprogress.com/english/semantic_textual_similarity.html). 11 | 12 | La recherche sémantique n'est pas limitée au texte ! Elle peut être réalisée avec des images, de la parole, etc. Il existe de nombreux cas d'utilisation et applications différents de la recherche sémantique. 13 | 14 | N'hésitez pas à soumettre une [Pull Request](https://github.com/Agrover112/awesome-semantic-search/projects/1) sur ce référentiel ! 15 | 16 | ## Contenu 17 | 18 | - [Papiers](#papiers) 19 | - [2014](#2014) 20 | - [2015](#2015) 21 | - [2016](#2016) 22 | - [2017](#2017) 23 | - [2018](#2018) 24 | - [2019](#2019) 25 | - [2020](#2020) 26 | - [2021](#2021) 27 | - [2022](#2022) 28 | - [2023](#2023) 29 | - [Articles](#articles) 30 | - [Bibliothèques et Outils](#bibliothèques-et-outils) 31 | - [Ensembles de données](#ensembles-de-données) 32 | - [Étapes Importantes](#étapes-importantes) 33 | 34 | ## Papiers 35 | 36 | ### 2010 37 | - [Priority Range Trees](https://arxiv.org/abs/1009.3527) 38 | 39 | ### 2014 40 | - [Un Modèle Sémantique Latent avec une Structure de Convolutions-Pooling pour la Récupération d'Informations](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/cikm2014_cdssm_final.pdf) 📄 41 | 42 | ### 2015 43 | - [Vecteurs Skip-Thought](https://arxiv.org/pdf/1506.06726.pdf) 📄 44 | - [LSH Pratique et Optimal pour la Distance Angulaire](https://proceedings.neurips.cc/paper/2015/hash/2823f4797102ce1a1aec05359cc16dd9-Abstract.html) 45 | 46 | ### 2016 47 | - [Sac de Trucs pour la Classification Efficace du Texte](https://arxiv.org/abs/1607.01759) 📄 48 | - [Enrichissement des Vecteurs de Mots avec des Informations Subword](https://arxiv.org/abs/1607.04606) 📄 49 | - [Recherche de Voisin le Plus Proche Approximatif Efficace et Robuste en Utilisant des Graphes 
Mondiaux Navigables Hiérarchiques](https://arxiv.org/abs/1603.09320) 50 | - [Recherche Approximative du Voisin le Plus Proche pour les Vecteurs de Mots Similaires - Expériences, Analyses et Amélioration](https://www.aclweb.org/anthology/P16-1214.pdf) 51 | - [Apprentissage de Représentations Distribuées de Phrases à Partir de Données Non Étiquetées](https://arxiv.org/abs/1602.03483) 📄 52 | - [Recherche Approximative du Voisin le Plus Proche sur des Données de Grande Dimension --- Expériences, Analyses et Amélioration](https://arxiv.org/abs/1610.02455) 53 | 54 | ### 2017 55 | - [Apprentissage Supervisé de Représentations Universelles de Phrases à Partir de Données d'Inférence en Langage Naturel](https://research.fb.com/wp-content/uploads/2017/09/emnlp2017.pdf) 📄 56 | - [Similarité Textuelle Sémantique pour le Hindi](https://www.semanticscholar.org/paper/Semantic-Textual-Similarity-For-Hindi-Mujadia-Mamidi/372f615ce36d7543512b8e40d6de51d17f316e0b) 📄 57 | - [Suggestion Efficace de Réponses en Langage Naturel pour Smart Reply](https://arxiv.org/abs/1705.00652) 📃 58 | 59 | ### 2018 60 | - [Encodeur Universel de Phrases](https://arxiv.org/pdf/1803.11175.pdf) 📄 61 | - [Apprentissage de la Similarité Textuelle Sémantique à Partir de Conversations](https://arxiv.org/pdf/1804.07754.pdf) 📄 62 | - [Blog Google AI : Avancées dans la Similarité Textuelle Sémantique](https://ai.googleblog.com/2018/05/advances-in-semantic-textual-similarity.html) 📄 63 | - [Speech2Vec : Un Cadre Séquence à Séquence pour Apprendre des Plongements de Mots à Partir de la Parole](https://arxiv.org/abs/1803.08976) 🔊 64 | - [Optimisation de l'Indexation Basée sur le Graphique du Voisin le Plus Proche k pour la Recherche de Proximité dans des Données de Grande Dimension](https://arxiv.org/abs/1810.07355) 🔊 65 | - [Recherche Efficace du Voisin le Plus Proche Approximatif avec le Graphique de Dissémination](http://www.vldb.org/pvldb/vol12/p461-fu.pdf) 66 | - [Plaidoyer pour des Structures d'Indexation 
Apprises](https://dl.acm.org/doi/10.1145/3183713.3196909) 67 | 68 | ### 2019 69 | - [LASER : Représentations de phrases indépendantes du langage](https://engineering.fb.com/2019/01/22/ai-research/laser-multilingual-sentence-embeddings/) 📄 70 | - [Expansion de document par prédiction de requête](https://arxiv.org/abs/1904.08375) 📄 71 | - [Sentence-BERT : Intégration de phrases à l'aide de réseaux Siamese BERT](https://arxiv.org/pdf/1908.10084.pdf) 📄 72 | - [Classement de documents à plusieurs étapes avec BERT](https://arxiv.org/abs/1910.14424) 📄 73 | - [Récupération latente pour le questionnement faiblement supervisé en domaine ouvert](https://arxiv.org/abs/1906.00300) 74 | - [Question-réponse de bout en bout avec BERTserini](https://www.aclweb.org/anthology/N19-4013/) 75 | - [BioBERT : un modèle de représentation linguistique biomédicale pré-entraîné pour l'extraction de texte biomédical](https://arxiv.org/abs/1901.08746)📄 76 | - [Analyse et amélioration des représentations avec la perte douce du voisin le plus proche](https://arxiv.org/pdf/1902.01889.pdf)📷 77 | - [DiskANN : Recherche rapide et précise du voisin le plus proche pour un milliard de points sur un seul nœud](https://proceedings.neurips.cc/paper/2019/file/09853c7fb1d3f8ee67a61b6bf4a7f8e6-Paper.pdf) 78 | 79 | ### 2020 80 | - [Déploiement rapide d'un moteur de recherche neuronal pour le COVID-19 Open Research Dataset : Réflexions préliminaires et leçons apprises](https://arxiv.org/abs/2004.05125) 📄 81 | - [RE-CLASSEMENT DE PASSAGE AVEC BERT](https://arxiv.org/pdf/1901.04085.pdf) 📄 82 | - [CO-Search : Recherche d'informations sur le COVID-19 avec recherche sémantique, question-réponse et résumé abstrait](https://arxiv.org/pdf/2006.09595.pdf) 📄 83 | - [LaBSE : Intégration de phrases sans langage](https://arxiv.org/abs/2007.01852) 📄 84 | - [Covidex : Modèles de classement neuronal et infrastructure de recherche par mot-clé pour le COVID-19 Open Research Dataset](https://arxiv.org/abs/2007.07846) 📄 85 | - 
[DeText : Un cadre d'IA profonde pour la compréhension intelligente du texte](https://engineering.linkedin.com/blog/2020/open-sourcing-detext) 📄 86 | - [Rendre les plongements de phrases monolingues multilingues en utilisant la distillation des connaissances](https://arxiv.org/pdf/2004.09813.pdf) 📄 87 | - [Transformateurs pré-entraînés pour le classement de texte : BERT et au-delà](https://arxiv.org/abs/2010.06467) 📄 88 | - [REALM : Pré-entraînement d'un modèle linguistique augmenté par récupération](https://arxiv.org/abs/2002.08909) 89 | - [ELECTRA : PRÉ-ENTRAÎNEMENT DES ENCODEURS DE TEXTE EN TANT QUE DISCRIMINATEURS PLUTÔT QUE DES GÉNÉRATEURS](https://openreview.net/pdf?id=r1xMH1BtvB)📄 90 | - [Amélioration de l'apprentissage profond pour la recherche Airbnb](https://arxiv.org/pdf/2002.05515) 91 | - [Gestion de la diversité dans la recherche Airbnb](https://arxiv.org/abs/2004.02621)📄 92 | - [Apprentissage négatif de contraste approximatif du voisin le plus proche pour la recherche dense de texte](https://arxiv.org/abs/2007.00808v1)📄 93 | - [Plongements d'images sans supervision pour les tâches de recherche et de reconnaissance](https://openaccess.thecvf.com/content_WACV_2020/papers/Gairola_Unsupervised_Image_Style_Embeddings_for_Retrieval_and_Recognition_Tasks_WACV_2020_paper.pdf)📷 94 | - [DeCLUTR : Apprentissage en profondeur contrastif pour les représentations textuelles non supervisées](https://arxiv.org/abs/2006.03659)📄 95 | 96 | 97 | ### 2021 98 | - [Approche hybride pour le calcul de similarité sémantique entre les mots tamouls](https://www.researchgate.net/publication/350112163_Hybrid_approach_for_semantic_similarity_calculation_between_Tamil_words) 📄 99 | - [SBERT augmenté](https://arxiv.org/pdf/2010.08240.pdf) 📄 100 | - [BEIR : un banc d'essai hétérogène pour l'évaluation sans tir préalable des modèles de recherche d'informations](https://arxiv.org/abs/2104.08663) 📄 101 | - [Recherche visuelle hétérogène compatible](https://arxiv.org/abs/2105.06047) 📷 102 
| - [Apprentissage du style personnel à partir de quelques exemples](https://chuanenlin.com/personalstyle)📷 103 | - [TSDAE : Utilisation d'un auto-encodeur de débruitage séquentiel basé sur un transformateur pour l'apprentissage non supervisé de l'intégration de phrases](https://arxiv.org/abs/2104.06979)📄 104 | - [Une enquête sur les transformateurs](https://arxiv.org/abs/2106.04554)📄📷 105 | - [SPLADE : Modèle lexical et d'expansion parcimonieux pour le classement de la première étape](https://dl.acm.org/doi/10.1145/3404835.3463098)📄 106 | - [Suggestions de requêtes de recherche liées de haute qualité à l'aide de l'apprentissage en profondeur par renforcement](https://arxiv.org/abs/2108.04452v1) 107 | - [Récupération de produits basée sur l'intégration dans la recherche Taobao](https://arxiv.org/pdf/2106.09297.pdf)📄📷 108 | - [TPRM : Un modèle de classement personnalisé basé sur les sujets pour la recherche Web](https://arxiv.org/abs/2108.06014)📄 109 | - [mMARCO : Une version multilingue de l'ensemble de données de classement de passages MS MARCO](https://arxiv.org/abs/2108.13897)📄 110 | - [Raisonnement sur la base de données à partir du texte](https://aclanthology.org/2021.acl-long.241.pdf)📄 111 | - [En quoi l'affinage adversarial profite-t-il à BERT ?](https://arxiv.org/abs/2108.13602)📄 112 | - [Entraînement court, test long : l'attention avec des biais linéaires permet l'extrapolation de la longueur d'entrée](https://arxiv.org/abs/2108.12409)📄 113 | - [Primer : Recherche d'architectures de transformateurs efficaces pour la modélisation linguistique](https://arxiv.org/abs/2109.08668)📄 114 | - [À quel point cela semble-t-il familier ? 
Analyse de similarité représentationnelle interlingue des plongements acoustiques de mots](https://arxiv.org/pdf/2109.10179.pdf)🔊 115 | - [SimCSE : Apprentissage contrastif simple des plongements de phrases](https://arxiv.org/abs/2104.08821#)📄 116 | - [Attention compositionnelle : Désentrelacement de la recherche et de la récupération](https://arxiv.org/abs/2110.09419)📄📷 117 | - [SPANN : Recherche de voisin le plus proche efficace à l'échelle du milliard](https://arxiv.org/abs/2111.08566) 118 | - [GPL : Étiquetage pseudo-génératif pour l'adaptation de domaine non supervisée de la récupération dense](https://arxiv.org/abs/2112.07577) 📄 119 | - [Moteurs de recherche génératifs : expériences initiales](https://computationalcreativity.net/iccc21/wp-content/uploads/2021/09/ICCC_2021_paper_50.pdf) 📷 120 | - [Repenser la recherche : faire des experts de domaine à partir de dilettantes](https://dl.acm.org/doi/10.1145/3476415.3476428) 121 | - [WhiteningBERT : Une approche facile d'intégration de phrases non supervisée](https://arxiv.org/abs/2104.01767) 122 | 123 | ### 2022 124 | - [Intégration de textes et de codes par pré-entraînement contrastif](https://arxiv.org/abs/2201.10005)📄 125 | - [RELIC : Récupération de preuves pour les revendications littéraires](https://arxiv.org/abs/2203.10053)📄 126 | - [Trans-Encoder : Modélisation non supervisée de paires de phrases par auto-distillations mutuelles et mutuelles](https://arxiv.org/abs/2109.13059)📄 127 | - [SAMU-XLSR : Représentation multimodale de l'énoncé interlingue alignée sémantiquement](https://arxiv.org/abs/2205.08180)🔊 128 | - [Analyse des fonctions de fusion pour la recherche hybride](https://arxiv.org/abs/2210.11934)📄 129 | - [Détection hors distribution avec des voisins les plus proches profonds](https://arxiv.org/abs/2204.06507) 130 | - [ESB : Un banc d'essai pour la reconnaissance de la parole de bout en bout multi-domaines](https://arxiv.org/abs/2210.13352)🔊 131 | - [Analyse des plongements acoustiques de mots à 
partir de modèles de parole auto-supervisés pré-entraînés](https://arxiv.org/pdf/2210.16043.pdf)🔊 132 | - [Repenser avec la récupération : Inférence fidèle de grands modèles linguistiques](https://arxiv.org/abs/2301.00303)📄 133 | - [Récupération dense précise sans étiquettes de pertinence](https://arxiv.org/pdf/2212.10496.pdf)📄 134 | - [Mémoire du transformateur en tant qu'index de recherche différenciable](https://arxiv.org/abs/2202.06991)📄 135 | 136 | ### 2023 137 | - [FINGER : Inférence rapide pour la recherche du voisin le plus proche approximatif basée sur un graphe](https://dl.acm.org/doi/10.1145/3543507.3583318)📄 138 | - [Classification de texte "faible ressource" : une méthode de classification sans paramètre avec des compresseurs](https://aclanthology.org/2023.findings-acl.426/)📄 139 | - [SparseEmbed : Apprentissage de représentations lexicales clairsemées avec des plongements contextuels pour la récupération](https://dl.acm.org/doi/pdf/10.1145/3539618.3592065) 📄 140 | 141 | ## Articles 142 | - [Aborder la recherche sémantique](https://adityamalte.substack.com/p/tackle-semantic-search/) 143 | - [Recherche sémantique dans Azure Cognitive Search](https://docs.microsoft.com/en-us/azure/search/semantic-search-overview) 144 | - [Comment nous avons utilisé la recherche sémantique pour rendre notre recherche 10 fois plus intelligente](https://zilliz.com/blog/How-we-used-semantic-search-to-make-our-search-10-x-smarter/) 145 | - [Stanford AI Blog : Construction de modèles NLP évolutifs, explicables et adaptatifs avec la récupération](https://ai.stanford.edu/blog/retrieval-based-NLP/) 146 | - [Construction d'un moteur de recherche sémantique avec des plongements de mots à double espace](https://m.mage.ai/building-a-semantic-search-engine-with-dual-space-word-embeddings-f5a596eb6d90) 147 | - [Recherche de similarité sémantique à l'échelle du milliard avec 
FAISS+SBERT](https://towardsdatascience.com/billion-scale-semantic-similarity-search-with-faiss-sbert-c845614962e2) 148 | - [Quelques observations sur les seuils de recherche de similarité](https://greglandrum.github.io/rdkit-blog/similarity/reference/2021/05/26/similarity-threshold-observations1.html) 149 | - [Recherche d'images quasi identiques avec Locality Sensitive Hashing](https://keras.io/examples/vision/near_dup_search/) 150 | - [Cours gratuit sur la recherche de similarité vectorielle et Faiss](https://link.medium.com/HtFoFKlKvkb) 151 | - [Guide complet des algorithmes de recherche des voisins les plus proches approximatifs](https://link.medium.com/V62Z8drvEkb) 152 | - [Introduction de l'index hybride pour permettre la recherche sémantique consciente des mots-clés](https://www.pinecone.io/learn/hybrid-search/?utm_medium=email&_hsmi=0&_hsenc=p2ANqtz--zLu9hiyh-y_XTa7FCEpi8JESJKmif5dhpYtAxTWka8PIttaTOGE21LMZlg9EOZyPYpCm6GDvYy57tlGRwH6TjgLCsJg&utm_content=231741722&utm_source=hs_email) 153 | - [Recherche sémantique Argilla](https://docs.argilla.io/en/latest/guides/features/semantic-search.html) 154 | - [Modèle de compréhension textuelle multilingue de Co:here](https://txt.cohere.ai/multilingual/) 155 | - [Simplifiez la recherche avec des modèles d'embedding multilingues](https://blog.vespa.ai/simplify-search-with-multilingual-embeddings/) 156 | 157 | ## Bibliothèques et Outils 158 | - [fastText](https://fasttext.cc/) 159 | - [Universal Sentence Encoder](https://tfhub.dev/google/universal-sentence-encoder/4) 160 | - [SBERT](https://www.sbert.net/) 161 | - [ELECTRA](https://github.com/google-research/electra) 162 | - [LaBSE](https://tfhub.dev/google/LaBSE/2) 163 | - [LASER](https://github.com/facebookresearch/LASER) 164 | - [Relevance AI - Plateforme vectorielle de l'expérimentation au déploiement](https://relevance.ai) 165 | - [Haystack](https://github.com/deepset-ai/haystack/) 166 | - [Jina.AI](https://jina.ai/) 167 | - [Pinecone](https://www.pinecone.io/) 168 
| - [SentEval Toolkit](https://github.com/facebookresearch/SentEval?utm_source=catalyzex.com) 169 | - [ranx](https://github.com/AmenRa/ranx) 170 | - [BEIR :Evaluation des IR](https://github.com/UKPLab/beir) 171 | - [RELiC: Jeu de données de récupération d'éléments pour les revendications littéraires](https://relic.cs.umass.edu/) 172 | - [matchzoo-py](https://github.com/NTMC-Community/MatchZoo-py) 173 | - [deep_text_matching](https://github.com/wangle1218/deep_text_matching) 174 | - [Quel cadre ?](http://whichframe.com/) 175 | - [lexica.art](https://lexica.art/) 176 | - [Recherche sémantique emoji](https://github.com/lilianweng/emoji-semantic-search) 177 | - [PySerini](https://github.com/castorini/pyserini) 178 | - [BERTSerini](https://github.com/rsvp-ai/bertserini) 179 | - [BERTSimilarity](https://github.com/Brokenwind/BertSimilarity) 180 | - [milvus](https://www.milvus.io/) 181 | - [NeuroNLP++](https://plusplus.neuronlp.fruitflybrain.org/) 182 | - [weaviate](https://github.com/semi-technologies/weaviate) 183 | - [Recherche sémantique à travers Wikipedia avec Weaviate](https://github.com/semi-technologies/semantic-search-through-wikipedia-with-weaviate) 184 | - [Recherche naturelle sur YouTube](https://github.com/haltakov/natural-language-youtube-search) 185 | - [same.energy](https://www.same.energy/about) 186 | - [Benchmarks ANN](http://ann-benchmarks.com/) 187 | - [scaNN](https://github.com/google-research/google-research/tree/master/scann) 188 | - [REALM](https://github.com/google-research/language/tree/master/language/realm) 189 | - [annoy](https://github.com/spotify/annoy) 190 | - [pynndescent](https://github.com/lmcinnes/pynndescent) 191 | - [nsg](https://github.com/ZJULearning/nsg) 192 | - [FALCONN](https://github.com/FALCONN-LIB/FALCONN) 193 | - [redis HNSW](https://github.com/zhao-lang/redis_hnsw) 194 | - [autofaiss](https://github.com/criteo/autofaiss) 195 | - [DPR](https://github.com/facebookresearch/DPR) 196 | - 
[rank_BM25](https://github.com/dorianbrown/rank_bm25) 197 | - [nearPy](http://pixelogik.github.io/NearPy/) 198 | - [vearch](https://github.com/vearch/vearch) 199 | - [vespa](https://github.com/vespa-engine/vespa) 200 | - [PyNNDescent](https://github.com/lmcinnes/pynndescent) 201 | - [pgANN](https://github.com/netrasys/pgANN) 202 | - [Tensorflow Similarity](https://github.com/tensorflow/similarity) 203 | - [opensemanticsearch.org](https://www.opensemanticsearch.org/) 204 | - [GPT3 Semantic Search](https://gpt3demo.com/category/semantic-search) 205 | - [searchy](https://github.com/lubianat/searchy) 206 | - [txtai](https://github.com/neuml/txtai) 207 | - [HyperTag](https://github.com/Ravn-Tech/HyperTag) 208 | - [vectorai](https://github.com/vector-ai/vectorai) 209 | - [embeddinghub](https://github.com/featureform/embeddinghub) 210 | - [AquilaDb](https://github.com/Aquila-Network/AquilaDB) 211 | - [STripNet](https://github.com/stephenleo/stripnet) 212 | 213 | ## Ensembles-de-données 214 | - [Semantic Text Similarity Dataset Hub](https://github.com/brmson/dataset-sts) 215 | - [Facebook AI Image Similarity Challenge](https://www.drivendata.org/competitions/79/competition-image-similarity-1-dev/?fbclid=IwAR31vRV0EdxRdrxtPy12neZtBJQ0H9qdLHm8Wl2DjHY09PtQdn1nEEIJVUo) 216 | - [WIT : Wikipedia-based Image Text Dataset](https://github.com/google-research-datasets/wit) 217 | - [BEIR](https://github.com/beir-cellar/beir) 218 | - MTEB 219 | 220 | ## Étapes Importantes 221 | 222 | Consultez le [tableau du projet](https://github.com/Agrover112/awesome-semantic-search/projects/1) pour la liste des tâches afin de contribuer à l'une des issues ouvertes. 
223 | 224 | -------------------------------------------------------------------------------- /README_Hindi.md: -------------------------------------------------------------------------------- 1 | Awesome Semantic-Search [![Awesome](https://awesome.re/badge.svg)](https://awesome.re) 2 | ====================================================================================== 3 | 4 | 5 | 6 | logo इनके द्वारा निर्मित [@createdbytango](https://instagram.com/createdbytango). 7 | 8 | निम्नलिखित रिपॉजिटरी का उद्देश्य [सिमेंटिक 9 | सर्च](https://en.wikipedia.org/wiki/Semantic_search) और [सिमेंटिक 10 | समानता](http://nlpprogress.com/english/semantic_textual_similarity.html) 11 | से संबंधित कार्यों के लिए मेटा-रिपॉजिटरी की सेवा करना है। 12 | 13 | सिमेंटिक सर्च टेक्स्ट तक ही सीमित नहीं है! यह छवियों, भाषण, आदि के साथ 14 | किया जा सकता है। इसलिए अर्थपूर्ण खोज के कई अलग-अलग उपयोग-मामले और 15 | अनुप्रयोग हैं। 16 | 17 | Contributions / Milestones 18 | -------------------------- 19 | 20 | [कार्य 21 | सूची](https://github.com/Agrover112/awesome-semantic-search/projects/1) 22 | के लिए प्रोजेक्ट बोर्ड पर एक नज़र डालें 23 | 24 | विषय-सूची 25 | --------- 26 | 27 | - [दस्तावेज़](#दस्तावेज़) 28 | - [2014](#2014) 29 | - [2015](#2015) 30 | - [2016](#2016) 31 | - [2017](#2017) 32 | - [2018](#2018) 33 | - [2019](#2019) 34 | - [2020](#2020) 35 | - [2021](#2021) 36 | 37 | - [लेख](#लेख) 38 | - [Libraries तथा Tools](#libraries-तथा-tools) 39 | - [डेटासेट](#डेटासेट) 40 | - [माइलस्टोन्स](#माइलस्टोन्स) 41 | 42 | 43 | दस्तावेज़ 44 | --------- 45 | 46 | ### 2010 47 | 48 | - [प्राथमिकता रेंज पेड़ ](https://arxiv.org/abs/1009.3527) 49 | 📄 50 | 51 | ### 2014 52 | 53 | - [सूचना पुनर्प्राप्ति के लिए कनवल्शनल-पूलिंग स्ट्रक्चर के साथ एक 54 | अव्यक्त सिमेंटिक 55 | मॉडल](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/cikm2014_cdssm_final.pdf) 56 | 📄 57 | 58 | ### 2015 59 | 60 | - [स्किप-थॉट वैक्टर](https://arxiv.org/pdf/1506.06726.pdf) 📄 61 | - [कोणीय दूरी के लिए व्यावहारिक और इष्टतम 
एलएसएच](https://proceedings.neurips.cc/paper/2015/hash/2823f4797102ce1a1aec05359cc16dd9-Abstract.html) 📄 62 | 63 | ### 2016 64 | 65 | - [कुशल पाठ वर्गीकरण के लिए ट्रिक्स का 66 | बैग](https://arxiv.org/abs/1607.01759) 📄 67 | - [सबवर्ड जानकारी के साथ वर्ड वैक्टर को समृद्ध 68 | करना](https://arxiv.org/abs/1607.04606) 📄 69 | - [पदानुक्रमित नेविगेट करने योग्य लघु विश्व ग्राफ़ का उपयोग करके कुशल 70 | और मजबूत अनुमानित निकटतम पड़ोसी 71 | खोज](https://arxiv.org/abs/1603.09320) 72 | - [लगभग समान शब्द एंबेडिंग की खोज 73 | पर](https://www.aclweb.org/anthology/P16-1214.pdf) 74 | - [बिना लेबल वाले डेटा से वाक्यों के वितरित अभ्यावेदन सीखना](https://arxiv.org/abs/1602.03483) 📄 75 | - [उच्च आयामी डेटा पर अनुमानित निकटतम पड़ोसी खोज --- प्रयोग, विश्लेषण और सुधार](https://arxiv.org/abs/1610.02455) 76 | 77 | ### 2017 78 | 79 | - [प्राकृतिक भाषा अनुमान डेटा से सार्वभौमिक वाक्य अभ्यावेदन की 80 | पर्यवेक्षित 81 | शिक्षा](https://research.fb.com/wp-content/uploads/2017/09/emnlp2017.pdf) 82 | 📄 83 | 84 | ### 2018 85 | 86 | - [यूनिवर्सल सेंटेंस एनकोडर](https://arxiv.org/pdf/1803.11175.pdf) 📄 87 | - [बातचीत से सिमेंटिक टेक्स्टुअल समानता 88 | सीखना](https://arxiv.org/pdf/1804.07754.pdf) 📄 89 | - [Google AI ब्लॉग: सिमेंटिक टेक्स्टुअल समानता में 90 | प्रगति](https://ai.googleblog.com/2018/05/advances-in-semantic-textual-similarity.html) 91 | 📄 92 | - [उच्च-आयामी डेटा में निकटता खोज के लिए k-निकटतम पड़ोसी ग्राफ़ के 93 | आधार पर अनुक्रमण का अनुकूलन](https://arxiv.org/abs/1810.07355) 94 | - [नेविगेटिंग स्प्रेडिंग-आउट ग्राफ के साथ तेजी से अनुमानित निकटतम पड़ोसी खोज](http://www.vldb.org/pvldb/vol12/p461-fu.pdf) 95 | - [सीखा सूचकांक संरचनाओं के लिए मामला](https://dl.acm.org/doi/10.1145/3183713.3196909) 96 | 97 | ### 2019 98 | 99 | - [लेजर: भाषा अज्ञेय वाक्य 100 | प्रतिनिधित्व](https://engineering.fb.com/2019/01/22/ai-research/laser-multilingual-sentence-embeddings/) 101 | 📄 102 | - [प्रश्न भविष्यवाणी द्वारा दस्तावेज़ 103 | विस्तार](https://arxiv.org/abs/1904.08375) 📄 104 | - [सेंटेंस-बर्ट: स्याम देश 
के बर्ट-नेटवर्क का इस्तेमाल करते हुए वाक्य 105 | एम्बेडिंग](https://arxiv.org/pdf/1908.10084.pdf) 📄 106 | - [बर्ट के साथ बहु-स्तरीय दस्तावेज़ 107 | रैंकिंग](https://arxiv.org/abs/1910.14424) 📄 108 | - [कमजोर पर्यवेक्षित खुले डोमेन प्रश्न उत्तर के लिए गुप्त पुनर्प्राप्ति](https://arxiv.org/abs/1906.00300) 109 | - [BERTserini के साथ एंड-टू-एंड ओपन-डोमेन प्रश्न उत्तर](https://www.aclweb.org/anthology/N19-4013/) 110 | - [बायोबर्ट: बायोमेडिकल टेक्स्ट माइनिंग के लिए एक पूर्व-प्रशिक्षित बायोमेडिकल भाषा प्रतिनिधित्व मॉडल](https://arxiv.org/abs/1901.08746)📄 111 | - [नरम निकटतम पड़ोसी नुकसान के साथ प्रतिनिधित्व का विश्लेषण और सुधार](https://arxiv.org/pdf/1902.01889.pdf):camera_flash: 112 | - [DiskANN: एक ही नोड पर तेजी से सटीक अरब-बिंदु निकटतम पड़ोसी खोजें](https://proceedings.neurips.cc/paper/2019/file/09853c7fb1d3f8ee67a61b6bf4a7f8e6-Paper.pdf) 113 | 114 | ### 2020 115 | 116 | - [COVID-19 ओपन रिसर्च डेटासेट के लिए एक तंत्रिका खोज इंजन को तेजी से 117 | तैनात करना: प्रारंभिक विचार और सीखे गए 118 | सबक](https://arxiv.org/abs/2004.05125) 📄 119 | - [बर्ट के साथ पैसेज री-रैंकिंग](https://arxiv.org/pdf/1901.04085.pdf) 120 | 📄 121 | - [सह-खोज: अर्थपूर्ण खोज के साथ COVID-19 सूचना पुनर्प्राप्ति, प्रश्न 122 | उत्तर, और सार संक्षेप](https://arxiv.org/pdf/2006.09595.pdf) 📄 123 | - [LaBSE:Language-agnostic BERT Sentence 124 | Embedding](https://arxiv.org/abs/2007.01852) 📄 125 | - [Covidex: COVID-19 ओपन रिसर्च डेटासेट के लिए न्यूरल रैंकिंग मॉडल और 126 | कीवर्ड सर्च इंफ्रास्ट्रक्चर](https://arxiv.org/abs/2007.07846) 📄 127 | - [DeTect: बुद्धिमान पाठ समझ के लिए एक गहन एनएलपी 128 | ढांचा](https://engineering.linkedin.com/blog/2020/open-sourcing-detext) 129 | 📄 130 | - [ज्ञान आसवन का उपयोग करके एकभाषी वाक्य एम्बेडिंग बहुभाषी 131 | बनाना](https://arxiv.org/pdf/2004.09813.pdf) 📄 132 | - [टेक्स्ट रैंकिंग के लिए पूर्व प्रशिक्षित ट्रांसफॉर्मर: बीईआरटी और 133 | परे](https://arxiv.org/abs/2010.06467) 📄 134 | - [REALM: पुनर्प्राप्ति-संवर्धित भाषा मॉडल 
पूर्व-प्रशिक्षण](https://arxiv.org/abs/2002.08909) 135 | - [इलेक्ट्रा: प्री-ट्रेनिंग टेक्स्ट एनकोडर जेनरेटर के बजाय डिस्क्रिमिनेटर के रूप में होते हैं](https://openreview.net/pdf?id=r1xMH1BtvB)📄 136 | - [एयरबीएनबी खोज के लिए डीप लर्निंग में सुधार](https://arxiv.org/pdf/2002.05515) 137 | - [Airbnb खोज में विविधता का प्रबंधन](https://arxiv.org/abs/2004.02621)📄 138 | - [सघन पाठ पुनर्प्राप्ति के लिए लगभग निकटतम पड़ोसी नकारात्मक विपरीत शिक्षा](https://arxiv.org/abs/2007.00808v1)📄 139 | 140 | ### 2021 141 | 142 | - [तमिल शब्दों के बीच अर्थ समानता गणना के लिए हाइब्रिड दृष्टिकोण](https://www.researchgate.net/publication/350112163_Hybrid_approach_for_semantic_similarity_calculation_between_Tamil_words):page_facing_up: 143 | - [संवर्धित SBERT](https://arxiv.org/pdf/2010.08240.pdf) 📄 144 | - [BEIR: सूचना पुनर्प्राप्ति मॉडल के शून्य-शॉट मूल्यांकन के लिए एक 145 | विषम बेंचमार्क](https://arxiv.org/abs/2104.08663) 📄 146 | - [संगतता-जागरूक विषम दृश्य खोज](https://arxiv.org/abs/2105.06047) 📷 147 | - [कुछ उदाहरणों से व्यक्तिगत शैली सीखना](https://chuanenlin.com/personalstyle/)📷 148 | - [TSDAE: अनसुपरवाइज्ड सेंटेंस एंबेडिंग लर्निंग के लिए ट्रांसफॉर्मर-आधारित अनुक्रमिक डीनोइज़िंग ऑटो-एनकोडर का उपयोग करना](https://arxiv.org/abs/2104.06979)📄 149 | - [ट्रांसफॉर्मर का एक सर्वेक्षण](https://arxiv.org/abs/2106.04554)📄📷 150 | - [डीप सुदृढीकरण लर्निंग का उपयोग करके उच्च गुणवत्ता से संबंधित खोज क्वेरी सुझाव](https://arxiv.org/abs/2108.04452v1) 151 | - [Taobao खोज में एम्बेडिंग-आधारित उत्पाद पुनर्प्राप्ति](https://arxiv.org/pdf/2106.09297.pdf)📄📷 152 | - [टीपीआरएम: वेब खोज के लिए एक विषय-आधारित निजीकृत रैंकिंग मॉडल](https://arxiv.org/abs/2108.06014)📄 153 | - [mMARCO: एमएस मार्को पैसेज रैंकिंग डेटासेट का एक बहुभाषी संस्करण](https://arxiv.org/abs/2108.13897)📄 154 | - [टेक्स्ट पर डेटाबेस रीजनिंग](https://aclanthology.org/2021.acl-long.241.pdf) 155 | - [एडवरसैरियल फाइन-ट्यूनिंग BERT को कैसे लाभ पहुंचाता है?](https://arxiv.org/abs/2108.13602):page_facing_up: 156 | - [ट्रेन शॉर्ट, टेस्ट लांग: रैखिक 
पूर्वाग्रहों के साथ ध्यान इनपुट लेंथ एक्सपेरिमेंटेशन को सक्षम बनाता है](https://arxiv.org/abs/2108.12409):page_facing_up: 157 | - [प्राइमर: भाषा मॉडलिंग के लिए कुशल ट्रांसफॉर्मर की खोज](https://arxiv.org/abs/2109.08668)📄 158 | - [वह ध्वनि कितनी परिचित है? ध्वनिक शब्द एम्बेडिंग का क्रॉस-लिंगुअल रिप्रेजेंटेशनल समानता विश्लेषण](https://arxiv.org/pdf/2109.10179.pdf):loud_sound: 159 | - [SimCSE: वाक्य एम्बेडिंग की सरल विरोधाभासी शिक्षा](https://arxiv.org/abs/2104.08821#):page_facing_up: 160 | - [रचनात्मक ध्यान:खोज और पुनर्प्राप्ति को अलग करना](https://arxiv.org/abs/2110.09419)📄📷 161 | - [स्पैन: अत्यधिक कुशल अरब पैमाने पर लगभग निकटतम पड़ोसी खोज](https://arxiv.org/abs/2111.08566) 162 | 163 | लेख 164 | ------- 165 | - [अर्थपूर्ण खोज से निपटना](https://adityamalte.substack.com/p/tackle-semantic-search/) 166 | - [Azure Congnitive Search में सिमेंटिक सर्च](https://docs.microsoft.com/en-us/azure/search/semantic-search-overview) 167 | - [हमने अपनी खोज को 10x स्मार्ट बनाने के लिए सिमेंटिक खोज का उपयोग कैसे किया](https://zilliz.com/blog/How-we-used-semantic-search-to-make-our-search-10-x-smarter/) 168 | - [दोहरे स्थान वाले शब्द एम्बेडिंग के साथ सिमेंटिक सर्च इंजन का निर्माण](https://m.mage.ai/building-a-semantic-search-engine-with-dual-space-word-embeddings-f5a596eb6d90) 169 | - [FAISS+SBERT के साथ अरब-पैमाने की सिमेंटिक समानता खोज](https://towardsdatascience.com/billion-scale-semantic-similarity-search-with-faiss-sbert-c845614962e2) 170 | - [समानता खोज थ्रेसहोल्ड के बारे में कुछ टिप्पणियां](https://greglandrum.github.io/rdkit-blog/similarity/reference/2021/05/26/similarity-threshold-observations1.html) 171 | - [स्थानीयता संवेदनशील हैशिंग का उपयोग करके डुप्लिकेट छवि खोज के पास](https://keras.io/examples/vision/near_dup_search/) 172 | - [वेक्टर समानता खोज और फैस पर नि: शुल्क पाठ्यक्रम](https://link.medium.com/HtFoFKlKvkb) 173 | - [निकटतम पड़ोसियों के एल्गोरिदम के लिए व्यापक गाइड](https://link.medium.com/V62Z8drvEkb) 174 | 175 | Libraries तथा Tools 176 | ------------------- 177 | 
178 | - [fastText](https://fasttext.cc/) 179 | - [Universal Sentence 180 | Encoder](https://tfhub.dev/google/universal-sentence-encoder/4) 181 | - [SBERT](https://www.sbert.net/) 182 | - [LaBSE](https://tfhub.dev/google/LaBSE/2) 183 | - [LASER](https://github.com/facebookresearch/LASER) 184 | - [Haystack](https://github.com/deepset-ai/haystack/) 185 | - [Jina.AI](https://jina.ai/) 186 | - [SentEval 187 | Toolkit](https://github.com/facebookresearch/SentEval?utm_source=catalyzex.com) 188 | - [BEIR :Benchmarking IR](https://github.com/UKPLab/beir) 189 | - [Which Frame?](http://whichframe.com/) 190 | - [PySerini](https://github.com/castorini/pyserini) 191 | - [milvus](https://www.milvus.io/) 192 | - [weaviate](https://github.com/semi-technologies/weaviate) 193 | - [natural-language-youtube-search](https://github.com/haltakov/natural-language-youtube-search) 194 | - [same.energy](https://www.same.energy/about) 195 | - [scaNN](https://github.com/google-research/google-research/tree/master/scann) 196 | - [annoy](https://github.com/spotify/annoy) 197 | - [faiss](https://github.com/facebookresearch/faiss) 198 | - [DPR](https://github.com/facebookresearch/DPR) 199 | - [rank\_BM25](https://github.com/dorianbrown/rank_bm25) 200 | - [nearPy](http://pixelogik.github.io/NearPy/) 201 | - [vearch](https://github.com/vearch/vearch) 202 | - [PyNNDescent](https://github.com/lmcinnes/pynndescent) 203 | - [pgANN](https://github.com/netrasys/pgANN) 204 | - [Tensorflow Similarity](https://github.com/tensorflow/similarity) 205 | - [opensemanticsearch.org](https://www.opensemanticsearch.org/) 206 | - [GPT3 Semantic Search](https://gpt3demo.com/category/semantic-search) 207 | - [searchy](https://github.com/lubianat/searchy) 208 | - [txtai](https://github.com/neuml/txtai) 209 | - [HyperTag](https://github.com/Ravn-Tech/HyperTag) 210 | - [vectorai](https://github.com/vector-ai/vectorai) 211 | - [embeddinghub](https://github.com/featureform/embeddinghub) 212 | - 
[AquilaDb](https://github.com/Aquila-Network/AquilaDB) 213 | 214 | डेटासेट 215 | ------- 216 | 217 | - [सिमेंटिक टेक्स्ट समानता डेटासेट 218 | हब](https://github.com/brmson/dataset-sts) 219 | - [फेसबुक एआई छवि समानता चुनौती](https://www.drivendata.org/competitions/79/competition-image-similarity-1-dev/?fbclid=IwAR31vRV0EdxRdrxtPy12neZtBJQ0H9qdLHm8Wl2DjHY09PtQdn1nEEIJVUo) 220 | - [WIT: विकिपीडिया-आधारित छवि पाठ डेटासेट](https://github.com/google-research-datasets/wit) 221 | 222 | 223 | माइलस्टोन्स 224 | ------- 225 | 226 | - कार्य सूची के लिए [परियोजना बोर्ड](https://github.com/Agrover112/awesome-semantic-search/projects/1) पर एक नज़र डालें ताकि किसी भी खुले मुद्दे में योगदान किया जा सके। 227 | -------------------------------------------------------------------------------- /README_Portuguesse.md: -------------------------------------------------------------------------------- 1 | # Busca Semântica Incrível [![Incrível](https://awesome.re/badge.svg)](https://awesome.re) [![Commits Convencionais](https://img.shields.io/badge/Commits%20Convencionais-1.0.0-yellow.svg)](https://conventionalcommits.org) 2 | 3 | 4 | 5 | Logo feito por [@createdbytango](https://instagram.com/createdbytango). 6 | 7 | **À procura de mais adições de artigos. 8 | PS: Abra um PR (Pull Request)** 9 | 10 | Este repositório visa servir como um meta-repositório para tarefas relacionadas com [Busca Semântica](https://pt.wikipedia.org/wiki/Busca_semântica) e [Similaridade Semântica](http://nlpprogress.com/english/semantic_textual_similarity.html). 11 | 12 | A busca semântica não se limita a texto! Pode ser feita com imagens, voz, etc. Existem inúmeros casos de uso e diferentes aplicações de busca semântica. 13 | 14 | Sinta-se à vontade para abrir um PR neste repositório!
15 | 16 | ## Conteúdo 17 | 18 | - [Artigos](#artigos) 19 | - [2014](#2014) 20 | - [2015](#2015) 21 | - [2016](#2016) 22 | - [2017](#2017) 23 | - [2018](#2018) 24 | - [2019](#2019) 25 | - [2020](#2020) 26 | - [2021](#2021) 27 | - [2022](#2022) 28 | - [2023](#2023) 29 | - [Artigos](#artigos) 30 | - [Bibliotecas e Ferramentas](#bibliotecas-e-ferramentas) 31 | - [Conjuntos de Dados](#conjuntos-de-dados) 32 | - [Marcos](#marcos) 33 | 34 | ## Artigos 35 | 36 | ### 2010 37 | 38 | - [Priority Range Trees](https://arxiv.org/abs/1009.3527) 39 | 40 | ### 2014 41 | 42 | - [Um Modelo Semântico Latente com Estrutura de Convolutional-Pooling para Recuperação de Informação](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/cikm2014_cdssm_final.pdf) 📄 43 | 44 | ### 2015 45 | 46 | - [Vetores de Skip-Thought](https://arxiv.org/pdf/1506.06726.pdf) 📄 47 | - [LSH Prático e Ótimo para Distância Angular](https://proceedings.neurips.cc/paper/2015/hash/2823f4797102ce1a1aec05359cc16dd9-Abstract.html) 48 | 49 | ### 2016 50 | 51 | - [Saco de truques para classificação eficiente de texto](https://arxiv.org/abs/1607.01759) 📄 52 | - [Enriquecendo vetores de palavras com informações de subpalavras](https://arxiv.org/abs/1607.04606) 📄 53 | - [Pesquisa aproximada de vizinho mais próximo eficiente e robusta usando gráficos hierárquicos navegáveis ​​de pequenos mundos](https://arxiv.org/abs/1603.09320) 54 | - [Sobre a pesquisa aproximada de incorporações de palavras semelhantes](https://www.aclweb.org/anthology/P16-1214.pdf) 55 | - [Aprendendo representações distribuídas de sentenças a partir de dados não rotulados](https://arxiv.org/abs/1602.03483)📄 56 | - [Pesquisa aproximada do vizinho mais próximo em dados de alta dimensão --- Experimentos, análises e melhorias](https://arxiv.org/abs/1610.02455) 57 | 58 | ### 2017 59 | 60 | - [Aprendizagem supervisionada de representações de frases universais a partir de dados de inferência de linguagem 
natural](https://research.fb.com/wp-content/uploads/2017/09/emnlp2017.pdf) 📄 61 | - [Similaridade textual semântica para hindi](https://www.semanticscholar.org/paper/Semantic-Textual-Similarity-For-Hindi-Mujadia-Mamidi/372f615ce36d7543512b8e40d6de51d17f316e0b)📄 62 | - [Sugestão eficiente de resposta em linguagem natural para resposta inteligente](https://arxiv.org/abs/1705.00652)📃 63 | 64 | ### 2018 65 | 66 | - [Codificador universal de frases](https://arxiv.org/pdf/1803.11175.pdf) 📄 67 | - [Aprendendo similaridade textual semântica a partir de conversas](https://arxiv.org/pdf/1804.07754.pdf) 📄 68 | - [Blog de IA do Google: avanços na similaridade textual semântica](https://ai.googleblog.com/2018/05/advances-in-semantic-textual-similarity.html) 📄 69 | - [Speech2Vec: uma estrutura de sequência a sequência para aprender incorporações de palavras a partir da fala](https://arxiv.org/abs/1803.08976)🔊 70 | - [Otimização da indexação com base no grafo de k-vizinhos mais próximos para pesquisa de proximidade em dados de alta dimensão](https://arxiv.org/abs/1810.07355) 🔊 71 | - [Pesquisa rápida aproximada do vizinho mais próximo com o 72 | Navigating Spreading-out Graph](http://www.vldb.org/pvldb/vol12/p461-fu.pdf) 73 | - [O caso das estruturas de índice aprendidas](https://dl.acm.org/doi/10.1145/3183713.3196909) 74 | 75 | ### 2019 76 | 77 | - [LASER: representações de frases agnósticas de linguagem](https://engineering.fb.com/2019/01/22/ai-research/laser-multilingual-sentence-embeddings/) 📄 78 | - [Expansão de documentos por previsão de consulta](https://arxiv.org/abs/1904.08375) 📄 79 | - [Sentence-BERT: Embeddings de frases usando redes BERT siamesas](https://arxiv.org/pdf/1908.10084.pdf) 📄 80 | - [Classificação de documentos em vários estágios com BERT](https://arxiv.org/abs/1910.14424) 📄 81 | - [Recuperação latente para resposta a perguntas de domínio aberto com supervisão fraca](https://arxiv.org/abs/1906.00300) 82 | - [Resposta completa de perguntas de domínio aberto com
BERTserini](https://www.aclweb.org/anthology/N19-4013/) 83 | - [BioBERT: um modelo de representação de linguagem biomédica pré-treinado para mineração de texto biomédico](https://arxiv.org/abs/1901.08746)📄 84 | - [Analisando e melhorando representações com a perda suave do vizinho mais próximo](https://arxiv.org/pdf/1902.01889.pdf)📷 85 | - [DiskANN: pesquisa rápida e precisa de vizinho mais próximo 86 | em bilhões de pontos em um único nó](https://proceedings.neurips.cc/paper/2019/file/09853c7fb1d3f8ee67a61b6bf4a7f8e6-Paper.pdf) 87 | 88 | ### 2020 89 | 90 | - [Implantando rapidamente um mecanismo de pesquisa neural para o conjunto de dados de pesquisa aberta COVID-19: reflexões preliminares e lições aprendidas](https://arxiv.org/abs/2004.05125) 📄 91 | - [Reclassificação de passagens com BERT](https://arxiv.org/pdf/1901.04085.pdf) 📄 92 | - [CO-Search: recuperação de informações sobre COVID-19 com pesquisa semântica, resposta a perguntas e resumo abstrativo](https://arxiv.org/pdf/2006.09595.pdf) 📄 93 | - [LaBSE: Incorporação de frase BERT independente de idioma](https://arxiv.org/abs/2007.01852) 📄 94 | - [Covidex: Modelos de classificação neural e infraestrutura de pesquisa de palavras-chave para o conjunto de dados de pesquisa aberta COVID-19](https://arxiv.org/abs/2007.07846) 📄 95 | - [DeText: uma estrutura profunda de PNL para compreensão inteligente de texto](https://engineering.linkedin.com/blog/2020/open-sourcing-detext) 📄 96 | - [Tornando multilíngues as incorporações de frases monolíngues usando destilação de conhecimento](https://arxiv.org/pdf/2004.09813.pdf) 📄 97 | - [Transformadores pré-treinados para classificação de texto: BERT e além](https://arxiv.org/abs/2010.06467) 📄 98 | - [REALM: Pré-treinamento de modelo de linguagem aumentado por recuperação](https://arxiv.org/abs/2002.08909) 99 | - [ELECTRA: pré-treinando codificadores de texto como discriminadores em vez de geradores](https://openreview.net/pdf?id=r1xMH1BtvB)📄 100 | - [Melhorando o aprendizado profundo
para pesquisa no Airbnb](https://arxiv.org/pdf/2002.05515) 101 | - [Gerenciando a Diversidade na Pesquisa Airbnb](https://arxiv.org/abs/2004.02621)📄 102 | - [Aprendizagem contrastiva negativa aproximada do vizinho mais próximo para recuperação de texto denso](https://arxiv.org/abs/2007.00808v1)📄 103 | - [Incorporações de estilo de imagem não supervisionado para tarefas de recuperação e reconhecimento](https://openaccess.thecvf.com/content_WACV_2020/papers/Gairola_Unsupervised_Image_Style_Embeddings_for_Retrieval_and_Recognition_Tasks_WACV_2020_paper.pdf)📷 104 | - [DeCLUTR: Aprendizagem Contrastiva Profunda para Representações Textuais Não Supervisionadas](https://arxiv.org/abs/2006.03659)📄 105 | 106 | ### 2021 107 | 108 | - [Abordagem híbrida para cálculo de similaridade semântica entre palavras Tamil](https://www.researchgate.net/publication/350112163_Hybrid_approach_for_semantic_similarity_calculation_between_Tamil_words) 📄 109 | - [SBERT aumentado](https://arxiv.org/pdf/2010.08240.pdf) 📄 110 | - [BEIR: um benchmark heterogêneo para avaliação zero-shot de modelos de recuperação de informações](https://arxiv.org/abs/2104.08663) 📄 111 | - [Pesquisa visual heterogênea com reconhecimento de compatibilidade](https://arxiv.org/abs/2105.06047) 📷 112 | - [Aprendendo estilo pessoal com alguns exemplos](https://chuanenlin.com/personalstyle)📷 113 | - [TSDAE: Usando codificador automático de eliminação de ruído sequencial baseado em transformador para aprendizagem não supervisionada de incorporação de frases](https://arxiv.org/abs/2104.06979)📄 114 | - [Uma Pesquisa de Transformadores](https://arxiv.org/abs/2106.04554)📄📷 115 | - [SPLADE: modelo lexical esparso e de expansão para classificação de primeiro estágio](https://dl.acm.org/doi/10.1145/3404835.3463098)📄 116 | - [Sugestões de consulta de pesquisa relacionada de alta qualidade usando Deep Reinforcement Learning](https://arxiv.org/abs/2108.04452v1) 117 | - [Recuperação de produto baseada em incorporação na pesquisa 
Taobao](https://arxiv.org/pdf/2106.09297.pdf)📄📷 118 | - [TPRM: um modelo de classificação personalizado baseado em tópicos para pesquisa na Web](https://arxiv.org/abs/2108.06014)📄 119 | - [mMARCO: uma versão multilíngue do conjunto de dados de classificação de passagens MS MARCO](https://arxiv.org/abs/2108.13897)📄 120 | - [Raciocínio de banco de dados sobre texto](https://aclanthology.org/2021.acl-long.241.pdf)📄 121 | - [Como o ajuste fino adversário beneficia o BERT?](https://arxiv.org/abs/2108.13602)📄 122 | - [Treinar curto, testar longo: atenção com vieses lineares permite extrapolação de comprimento de entrada](https://arxiv.org/abs/2108.12409)📄 123 | - [Primer: Procurando Transformadores Eficientes para Modelagem de Linguagem](https://arxiv.org/abs/2109.08668)📄 124 | - [Quão familiar isso soa? Análise de similaridade representacional 125 | multilíngue de incorporações acústicas de palavras](https://arxiv.org/pdf/2109.10179.pdf)🔊 126 | - [SimCSE: Aprendizagem contrastiva simples de incorporações de frases](https://arxiv.org/abs/2104.08821#)📄 127 | - [Atenção Composicional: Desembaraçando Pesquisa e Recuperação](https://arxiv.org/abs/2110.09419)📄📷 128 | - [SPANN: pesquisa aproximada de vizinho mais próximo em escala de bilhões altamente eficiente](https://arxiv.org/abs/2111.08566) 129 | - [GPL: Pseudo-rotulagem generativa para adaptação de domínio não supervisionada de recuperação densa](https://arxiv.org/abs/2112.07577) 📄 130 | - [Mecanismos de pesquisa generativos: experimentos iniciais](https://computationalcreativity.net/iccc21/wp-content/uploads/2021/09/ICCC_2021_paper_50.pdf) 📷 131 | - [Repensando a pesquisa: transformando diletantes em especialistas em domínio](https://dl.acm.org/doi/10.1145/3476415.3476428) -[WhiteningBERT: uma abordagem fácil de incorporação de frases não supervisionadas](https://arxiv.org/abs/2104.01767) 132 | 133 | ### 2022 134 | 135 | - [Incorporações de texto e código por pré-treinamento
contrastivo](https://arxiv.org/abs/2201.10005)📄 136 | - [RELIC: Recuperando evidências para reivindicações literárias](https://arxiv.org/abs/2203.10053)📄 137 | - [Trans-Encoder: modelagem não supervisionada de pares de frases por meio de destilações próprias e mútuas](https://arxiv.org/abs/2109.13059)📄 138 | - [SAMU-XLSR: Representação de fala interlingual em nível de expressão multimodal semanticamente alinhada](https://arxiv.org/abs/2205.08180)🔊 139 | - [Uma análise de funções de fusão para recuperação híbrida](https://arxiv.org/abs/2210.11934)📄 140 | - [Detecção fora de distribuição com vizinhos mais próximos](https://arxiv.org/abs/2204.06507) 141 | - [ESB: uma referência para reconhecimento de fala ponta a ponta em vários domínios](https://arxiv.org/abs/2210.13352)🔊 142 | - [Analisando incorporações de palavras acústicas a partir de modelos de fala auto-supervisionados pré-treinados](https://arxiv.org/pdf/2210.16043.pdf)🔊 143 | - [Repensando com recuperação: inferência fiel de modelos de linguagem grandes](https://arxiv.org/abs/2301.00303)📄 144 | - [Recuperação densa zero-shot precisa sem rótulos de relevância](https://arxiv.org/pdf/2212.10496.pdf)📄 145 | - [Memória do transformador como índice de pesquisa diferenciável](https://arxiv.org/abs/2202.06991)📄 146 | 147 | ### 2023 148 | 149 | - [FINGER: Inferência rápida para pesquisa aproximada de vizinho mais próximo baseada em grafos](https://dl.acm.org/doi/10.1145/3543507.3583318)📄 150 | - [Classificação de texto de “baixos recursos”: um método de classificação sem parâmetros com compressores](https://aclanthology.org/2023.findings-acl.426/)📄 151 | - [SparseEmbed: aprendendo representações lexicais esparsas com incorporações contextuais para recuperação](https://dl.acm.org/doi/pdf/10.1145/3539618.3592065) 📄 152 | 153 | ## Artigos 154 | 155 | - [Abordando a pesquisa semântica](https://adityamalte.substack.com/p/tackle-semantic-search/) 156 | - [Pesquisa semântica no Azure Cognitive
Search](https://docs.microsoft.com/en-us/azure/search/semantic-search-overview) 157 | - [Como usamos a pesquisa semântica para tornar nossa pesquisa 10 vezes mais inteligente](https://zilliz.com/blog/How-we-used-semantic-search-to-make-our-search-10-x-smarter/) 158 | - [Stanford AI Blog: Construindo modelos de PNL escaláveis, explicáveis e adaptativos com recuperação](https://ai.stanford.edu/blog/retrieval-based-NLP/) 159 | - [Construindo um mecanismo de pesquisa semântico com embeddings de palavras de espaço duplo](https://m.mage.ai/building-a-semantic-search-engine-with-dual-space-word-embeddings-f5a596eb6d90) 160 | - [Pesquisa de similaridade semântica em escala de bilhões com FAISS+SBERT](https://towardsdatascience.com/billion-scale-semantic-similarity-search-with-faiss-sbert-c845614962e2) 161 | - [Algumas observações sobre limites de pesquisa de similaridade](https://greglandrum.github.io/rdkit-blog/similarity/reference/2021/05/26/similarity-threshold-observations1.html) 162 | - [Pesquisa de imagens quase duplicadas usando hash sensível à localidade](https://keras.io/examples/vision/near_dup_search/) 163 | - [Curso gratuito sobre pesquisa de similaridade vetorial e Faiss](https://link.medium.com/HtFoFKlKvkb) 164 | - [Guia abrangente para algoritmos aproximados de vizinhos mais próximos](https://link.medium.com/V62Z8drvEkb) 165 | - [Apresentando o índice híbrido para permitir a pesquisa semântica com reconhecimento de palavras-chave](https://www.pinecone.io/learn/hybrid-search/?utm_medium=email&_hsmi=0&_hsenc=p2ANqtz--zLu9hiyh-y_XTa7FCEpi8JESJKmif5dhpYtAxTWka8PIttaTOGE21LMZlg9EOZyPYpCm6GDvYy57tlGRwH6TjgLCsJg&utm_content=231741722&utm_source=hs_email) 166 | - [Pesquisa Semântica Argilla](https://docs.argilla.io/en/latest/guides/features/semantic-search.html) 167 | - [Cohere: modelo de compreensão de texto multilíngue](https://txt.cohere.ai/multilingual/) 168 | - [Simplifique a pesquisa com modelos de incorporação
multilíngue](https://blog.vespa.ai/simplify-search-with-multilingual-embeddings/) 169 | 170 | ## Bibliotecas e ferramentas 171 | 172 | - [fastText](https://fasttext.cc/) 173 | - [Universal Sentence Encoder](https://tfhub.dev/google/universal-sentence-encoder/4) 174 | - [SBERT](https://www.sbert.net/) 175 | - [ELECTRA](https://github.com/google-research/electra) 176 | - [LaBSE](https://tfhub.dev/google/LaBSE/2) 177 | - [LASER](https://github.com/facebookresearch/LASER) 178 | - [Relevance AI - Plataforma vetorial da experimentação à implantação](https://relevance.ai) 179 | - [Haystack](https://github.com/deepset-ai/haystack/) 180 | - [Jina.AI](https://jina.ai/) 181 | - [Pinecone](https://www.pinecone.io/) 182 | - [Kit de ferramentas SentEval](https://github.com/facebookresearch/SentEval?utm_source=catalyzex.com) 183 | - [ranx](https://github.com/AmenRa/ranx) 184 | - [BEIR: Benchmarking de RI](https://github.com/UKPLab/beir) 185 | - [RELiC: recuperando evidências para conjunto de dados de reivindicações literárias](https://relic.cs.umass.edu/) 186 | - [matchzoo-py](https://github.com/NTMC-Community/MatchZoo-py) 187 | - [deep_text_matching](https://github.com/wangle1218/deep_text_matching) 188 | - [Which Frame?](http://whichframe.com/) 189 | - [lexica.art](https://lexica.art/) 190 | - [emoji-semantic-search](https://github.com/lilianweng/emoji-semantic-search) 191 | - [PySerini](https://github.com/castorini/pyserini) 192 | - [BERTSerini](https://github.com/rsvp-ai/bertserini) 193 | - [BERTSimilarity](https://github.com/Brokenwind/BertSimilarity) 194 | - [milvus](https://www.milvus.io/) 195 | - [NeuroNLP++](https://plusplus.neuronlp.fruitflybrain.org/) 196 | - [weaviate](https://github.com/semi-technologies/weaviate) 197 | - [semantic-search-through-wikipedia-with-weaviate](https://github.com/semi-technologies/semantic-search-through-wikipedia-with-weaviate) 198 | - [pesquisa em linguagem natural do
YouTube](https://github.com/haltakov/natural-language-youtube-search) 199 | - [same.energy](https://www.same.energy/about) 200 | - [ann benchmarks](http://ann-benchmarks.com/) 201 | - [scaNN](https://github.com/google-research/google-research/tree/master/scann) 202 | - [REALM](https://github.com/google-research/language/tree/master/language/realm) 203 | - [annoy](https://github.com/spotify/annoy) 204 | - [pynndescent](https://github.com/lmcinnes/pynndescent) 205 | - [nsg](https://github.com/ZJULearning/nsg) 206 | - [FALCONN](https://github.com/FALCONN-LIB/FALCONN) 207 | - [redis HNSW](https://github.com/zhao-lang/redis_hnsw) 208 | - [autofaiss](https://github.com/criteo/autofaiss) 209 | - [DPR](https://github.com/facebookresearch/DPR) 210 | - [rank_BM25](https://github.com/dorianbrown/rank_bm25) 211 | - [nearPy](http://pixelogik.github.io/NearPy/) 212 | - [vearch](https://github.com/vearch/vearch) 213 | - [vespa](https://github.com/vespa-engine/vespa) 214 | - [PyNNDescent](https://github.com/lmcinnes/pynndescent) 215 | - [pgANN](https://github.com/netrasys/pgANN) 216 | - [Tensorflow Similarity](https://github.com/tensorflow/similarity) 217 | - [opensemanticsearch.org](https://www.opensemanticsearch.org/) 218 | - [Pesquisa Semântica GPT3](https://gpt3demo.com/category/semantic-search) 219 | - [searchy](https://github.com/lubianat/searchy) 220 | - [txtai](https://github.com/neuml/txtai) 221 | - [HyperTag](https://github.com/Ravn-Tech/HyperTag) 222 | - [vectorai](https://github.com/vector-ai/vectorai) 223 | - [embeddinghub](https://github.com/featureform/embeddinghub) 224 | - [AquilaDb](https://github.com/Aquila-Network/AquilaDB) 225 | - [STripNet](https://github.com/stephenleo/stripnet) 226 | 227 | ## Conjuntos de dados 228 | 229 | - [Hub de conjunto de dados de similaridade de texto semântico](https://github.com/brmson/dataset-sts) 230 | - [Desafio de similaridade de imagens de IA do
Facebook](https://www.drivendata.org/competitions/79/competition-image-similarity-1-dev/?fbclid=IwAR31vRV0EdxRdrxtPy12neZtBJQ0H9qdLHm8Wl2DjHY09PtQdn1nEEIJVUo) 231 | - [WIT: conjunto de dados de texto de imagem baseado na Wikipédia](https://github.com/google-research-datasets/wit) 232 | - [BEIR](https://github.com/beir-cellar/beir) 233 | - MTEB 234 | 235 | ## Marcos 236 | 237 | Dê uma olhada no [quadro do projeto](https://github.com/Agrover112/awesome-semantic-search/projects/1) para ver a lista de tarefas e contribuir com qualquer uma das questões em aberto. 238 | -------------------------------------------------------------------------------- /README_Spanish.md: -------------------------------------------------------------------------------- 1 | # Awesome Semantic-Search [![Awesome](https://awesome.re/badge.svg)](https://awesome.re) [![Conventional Commits](https://img.shields.io/badge/Conventional%20Commits-1.0.0-yellow.svg)](https://conventionalcommits.org) 2 | 3 | 4 | 5 | 6 | 7 | 8 | Logo hecho por [@createdbytango](https://instagram.com/createdbytango). 9 | 10 | Este repositorio intenta ser un meta-repositorio para los temas relacionados con la [Búsqueda Semántica](https://en.wikipedia.org/wiki/Semantic_search) y la [Similaridad Semántica](http://nlpprogress.com/english/semantic_textual_similarity.html). 11 | 12 | ¡La búsqueda semántica no está limitada solamente a texto! Puede hacerse con imágenes, voz, etcétera. Es por eso que hay muchos casos en los que la búsqueda semántica se puede aplicar.
13 | 14 | ## Índice 15 | 16 | - [Papers](#papers) 17 | - [2014](#2014) 18 | - [2015](#2015) 19 | - [2016](#2016) 20 | - [2017](#2017) 21 | - [2018](#2018) 22 | - [2019](#2019) 23 | - [2020](#2020) 24 | - [2021](#2021) 25 | - [Artículos](#artículos) 26 | - [Librerías y Herramientas](#librerías-y-herramientas) 27 | - [Conjuntos de datos](#conjuntos-de-datos) 28 | - [Hitos](#hitos) 29 | 30 | ## Papers 31 | ### 2014 32 | - [Un modelo semántico latente con estructura de pooling convolucional para la recuperación de información](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/cikm2014_cdssm_final.pdf) 📄 33 | 34 | ### 2015 35 | - [Vectores Skip-Thought](https://arxiv.org/pdf/1506.06726.pdf) 📄 36 | 37 | ### 2016 38 | - [Bolsa de trucos para la clasificación eficiente de textos](https://arxiv.org/abs/1607.01759) 📄 39 | - [Enriqueciendo vectores de palabras con información de subpalabras](https://arxiv.org/abs/1607.04606) 📄 40 | - [Búsqueda aproximada del vecino más cercano, eficiente y robusta, usando grafos Jerárquicos Navegables de Mundos Pequeños](https://arxiv.org/abs/1603.09320) 41 | - [Sobre la búsqueda aproximada de Embeddings de Palabras Similares](https://www.aclweb.org/anthology/P16-1214.pdf) 42 | - [Aprendiendo Representaciones Distribuidas de Oraciones a partir de Datos Sin Etiquetar](https://arxiv.org/abs/1602.03483)📄 43 | 44 | ### 2017 45 | - [Aprendizaje supervisado de Representaciones Universales de Oraciones a partir de Datos de Inferencia de Lenguaje Natural](https://research.fb.com/wp-content/uploads/2017/09/emnlp2017.pdf) 📄 46 | 47 | ### 2018 48 | - [Codificador de Oraciones Universal](https://arxiv.org/pdf/1803.11175.pdf) 📄 49 | - [Aprendiendo la Similaridad Semántica Textual a partir de conversaciones](https://arxiv.org/pdf/1804.07754.pdf) 📄 50 | - [Blog de IA de Google: Avances en la Similaridad Textual Semántica](https://ai.googleblog.com/2018/05/advances-in-semantic-textual-similarity.html) 📄 51 | -
[Optimización de la Indexación Basada en el Grafo de k Vecinos más Cercanos para la Búsqueda por Proximidad en Datos de Alta Dimensión](https://arxiv.org/abs/1810.07355) 52 | 53 | ### 2019 54 | - [LASER: Representaciones de Oraciones Agnósticas del Lenguaje](https://engineering.fb.com/2019/01/22/ai-research/laser-multilingual-sentence-embeddings/) 📄 55 | - [Expansión de Documentos por Predicción de Consultas](https://arxiv.org/abs/1904.08375) 📄 56 | - [Sentence-BERT: Embeddings de Oraciones usando Redes Siamesas BERT](https://arxiv.org/pdf/1908.10084.pdf) 📄 57 | - [Ranking de Documentos Multi Fase con BERT](https://arxiv.org/abs/1910.14424) 📄 58 | - [Recuperación Latente para Respuestas a preguntas de dominio abierto Débilmente Supervisadas](https://arxiv.org/abs/1906.00300) 59 | - [Respuesta a Preguntas de Dominio Abierto de Extremo a Extremo con BERTserini](https://www.aclweb.org/anthology/N19-4013/) 60 | 61 | ### 2020 62 | - [Desplegando Rápidamente un Motor de Búsqueda Neuronal para el Conjunto de Datos de Investigación Abierto de COVID-19: Pensamientos Preliminares y Lecciones Aprendidas](https://arxiv.org/abs/2004.05125) 📄 63 | - [Reclasificación de pasajes con BERT](https://arxiv.org/pdf/1901.04085.pdf) 📄 64 | - [CO-Search: Recuperación de información de COVID-19 con Búsqueda Semántica, Respuesta a Preguntas y Resumen Abstractivo](https://arxiv.org/pdf/2006.09595.pdf) 📄 65 | - [LaBSE: Embedding de Oraciones BERT Agnóstico del Lenguaje](https://arxiv.org/abs/2007.01852) 📄 66 | - [Covidex: Modelos de Ranking Neuronal e Infraestructura de Búsqueda de Palabras Clave para el Conjunto de Datos de Investigación Abierto de COVID-19](https://arxiv.org/abs/2007.07846) 📄 67 | - [DeText: Un framework profundo de NLP para la comprensión inteligente de textos](https://engineering.linkedin.com/blog/2020/open-sourcing-detext) 📄 68 | - [Haciendo Multilingües los Embeddings de Oraciones Monolingües usando Destilación de
Conocimiento](https://arxiv.org/pdf/2004.09813.pdf) 📄 69 | - [Transformadores Preentrenados para Ranking de textos: BERT y más allá](https://arxiv.org/abs/2010.06467) 📄 70 | - [LMPRA: Language de Modelo Preentrenado para Recuperacion Aumentada](https://arxiv.org/abs/2002.08909) 71 | - [ELECTRA: PREENTRENANDO CODIFICADORES DE TEXTOS COMO DISCRIMINADORES EN VEZ DE COMO GENERADORES](https://openreview.net/pdf?id=r1xMH1BtvB)📄 72 | ### 2021 73 | - [SBERT Aumentado](https://arxiv.org/pdf/2010.08240.pdf) 📄 74 | - [BEIR: Un Punto de Referencia Homogéneo para Evaluaciones Zero-shot de Modelos de Recuperación de Información](https://arxiv.org/abs/2104.08663) 📄 75 | - [Búsquedas Visuales Conscientes de Compatibilidad Heterogénea](https://arxiv.org/abs/2105.06047) 📷 76 | - [Aprendiendo el Estilo Personal a partir de Pocos Ejempos](https://chuanenlin.com/personalstyle)📷 77 | - [TSDAE: Usando Codificadores Automáticos de Eliminación del ruido Basados en Transformaciones para Aprendizaje Sin Supervisión de Embedding de Oraciones](https://arxiv.org/abs/2104.06979)📄 78 | - [Una Encuesta sobre Transformaciones](https://arxiv.org/abs/2106.04554)📄📷 79 | 80 | ## Artículos 81 | 82 | - [Abordando la Búsqueda Semántica](https://adityamalte.substack.com/p/tackle-semantic-search/) 83 | - [Búsqueda Semántica en Azure Congnitive Search](https://docs.microsoft.com/en-us/azure/search/semantic-search-overview) 84 | - [Como usamos búsqueda semántica pra hacer nuestras búsquedas 10x más rápidas](https://zilliz.com/blog/How-we-used-semantic-search-to-make-our-search-10-x-smarter) 85 | - [Construyendo un sistema de búsqueda semántico con doble embedding de palabras](https://m.mage.ai/building-a-semantic-search-engine-with-dual-space-word-embeddings-f5a596eb6d90) 86 | - [Búsqueda de similaridad semántica de escala un Millón FAISS+SBERT](https://towardsdatascience.com/billion-scale-semantic-similarity-search-with-faiss-sbert-c845614962e2) 87 | - [Algunas observaciones sobre umbrales de búsqueda de 
similiridades](https://greglandrum.github.io/rdkit-blog/similarity/reference/2021/05/26/similarity-threshold-observations1.html) 88 | ## Librerías y Herramientas 89 | - [fastText](https://fasttext.cc/) 90 | - [Universal Sentence Encoder](https://tfhub.dev/google/universal-sentence-encoder/4) 91 | - [SBERT](https://www.sbert.net/) 92 | - [ELECTRA](https://github.com/google-research/electra) 93 | - [LaBSE](https://tfhub.dev/google/LaBSE/2) 94 | - [LASER](https://github.com/facebookresearch/LASER) 95 | - [Haystack](https://github.com/deepset-ai/haystack/) 96 | - [Jina.AI](https://jina.ai/) 97 | - [SentEval Toolkit](https://github.com/facebookresearch/SentEval?utm_source=catalyzex.com) 98 | - [BEIR :Benchmarking IR](https://github.com/UKPLab/beir) 99 | - [matchzoo-py](https://github.com/NTMC-Community/MatchZoo-py) 100 | - [Which Frame?](http://whichframe.com/) 101 | - [PySerini](https://github.com/castorini/pyserini) 102 | - [BERTSerini](https://github.com/rsvp-ai/bertserini) 103 | - [BERTSimilarity](https://github.com/Brokenwind/BertSimilarity) 104 | - [milvus](https://www.milvus.io/) 105 | - [weaviate](https://github.com/semi-technologies/weaviate) 106 | - [natural-language-youtube-search](https://github.com/haltakov/natural-language-youtube-search) 107 | - [same.energy](https://www.same.energy/about) 108 | - [scaNN](https://github.com/google-research/google-research/tree/master/scann) 109 | - [REALM](https://github.com/google-research/language/tree/master/language/realm) 110 | - [annoy](https://github.com/spotify/annoy) 111 | - [faiss](https://github.com/facebookresearch/faiss) 112 | - [DPR](https://github.com/facebookresearch/DPR) 113 | - [rank_BM25](https://github.com/dorianbrown/rank_bm25) 114 | - [nearPy](http://pixelogik.github.io/NearPy/) 115 | - [vearch](https://github.com/vearch/vearch) 116 | - [PyNNDescent](https://github.com/lmcinnes/pynndescent) 117 | - [pgANN](https://github.com/netrasys/pgANN) 118 | - 
[opensemanticsearch.org](https://www.opensemanticsearch.org/) 119 | - [GPT3 Semantic Search](https://gpt3demo.com/category/semantic-search) 120 | - [searchy](https://github.com/lubianat/searchy) 121 | ## Conjuntos de Datos 122 | - [Conjunto de Datos de Textos de Similaridad Semántica](https://github.com/brmson/dataset-sts) 123 | 124 | ## Hitos 125 | 126 | Mira el [projecto](https://github.com/Agrover112/awesome-semantic-search/projects/1) para ver la lista de tareas a contribuir para cualquiera de los Issues abiertos. 127 | -------------------------------------------------------------------------------- /_config.yml: -------------------------------------------------------------------------------- 1 | theme: jekyll-theme-minimal -------------------------------------------------------------------------------- /logo.svg: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 | 50 | 51 | 52 | 53 | 54 | 55 | 56 | 57 | 58 | 59 | 60 | 61 | 62 | 63 | 64 | 65 | 66 | 67 | 68 | 69 | 70 | 71 | 72 | 73 | 74 | 75 | --------------------------------------------------------------------------------