└── README.md /README.md: -------------------------------------------------------------------------------- 1 | # awesome-semantic-search 2 | 3 | In [Semantic search with embeddings](https://rom1504.medium.com/semantic-search-with-embeddings-index-anything-8fb18556443c), I described how to build semantic search systems (also called neural search). These systems are being used more and more with indexing techniques improving and representation learning getting better every year with new deep learning papers. The medium post explain how to build them, and this list is meant to reference all interesting resources on the topic to allow anyone to quickly start building systems. 4 | 5 | ![image](https://user-images.githubusercontent.com/2346494/118412784-38db9480-b69c-11eb-9cf7-d159da16434a.png) 6 | 7 | 8 | * **Tutorials** explain in depth how to build semantic search systems 9 | * [Semantic search with embeddings](https://rom1504.medium.com/semantic-search-with-embeddings-index-anything-8fb18556443c#ef3f) end to end explanation on how to build semantic search pipelines 10 | * [google cloud embedding similarity system](https://cloud.google.com/solutions/machine-learning/building-real-time-embeddings-similarity-matching-system) Use google cloud to build an embedding similarity system 11 | * [cvpr 2020 tutorial on image retrieval](https://matsui528.github.io/cvpr2020_tutorial_retrieval/) end to end in depth tutorial focusing on image 12 | * **Good datasets** to build semantic search systems 13 | * [Tensorflow datasets](https://www.tensorflow.org/datasets/catalog/overview) building search systems only requires image or text, many tf datasets are interesting in that regard 14 | * [Torchvision datasets](https://pytorch.org/vision/stable/datasets.html) datasets provided for vision are also interesting for this 15 | * **Pretrained encoders** make it possible to quickly build a new system without training 16 | * Vision+Language 17 | * [Clip](https://github.com/openai/CLIP) encode image and text in a same space 18 | * Image 19 | * [Efficientnet b0](https://github.com/qubvel/efficientnet) is a simple way to encode images 20 | * [Dino](https://github.com/facebookresearch/dino) is an encoder trained using self supervision which reaches high knn classification performance 21 | * [Face embeddings](https://github.com/ageitgey/face_recognition) compute face embeddings 22 | * Text 23 | * [Labse](https://tfhub.dev/google/LaBSE/2) a bert text encoder trained for similarity that put sentences from 109 in the same space 24 | * Misc 25 | * [Jina examples](https://github.com/jina-ai/examples) provide example on how to use pretrained encoders to build search systems 26 | * [Vectorhub](https://github.com/vector-ai/vectorhub) image, text, audio encoders 27 | * **Similarity learning** allows you to build new similarity encoders 28 | * [Fine tuning classification with keras](https://keras.io/guides/transfer_learning/) enables adapting an existing image encoder to a custom dataset 29 | * [Fine tuning classification with hugging face](https://huggingface.co/transformers/training.html) makes it possible to adapt existing text encoders 30 | * [Lightly](https://github.com/lightly-ai/lightly) is a simple way to train image encoders with self supervision 31 | * [Pytorch big graph](https://github.com/facebookresearch/PyTorch-BigGraph) library to encode a graph as node and link embeddings 32 | * [RSVD](https://github.com/criteo/Spark-RSVD) a spark library to compute large scale svd with spark 33 | * [Groknet](https://ai.facebook.com/blog/powered-by-ai-advancing-product-understanding-and-building-new-shopping-experiences/) Using image and categories and many datasets to fine tune product embeddings with many losses 34 | * **Indexing and approximate knn**: indexing make it possible to create small indices encoding million of embeddings that can be used to query the data in milli seconds 35 | * [Faiss](https://github.com/facebookresearch/faiss) Many aknn algorithms (ivf, hnsw, flat, gpu, …) in c++ with a python interface 36 | * [Autofaiss](https://github.com/criteo/autofaiss) to use faiss easily 37 | * [Nmslib](https://github.com/nmslib/nmslib) fast implementation of hnsw 38 | * [Annoy](https://github.com/spotify/annoy) a aknn algorithm by spotify 39 | * [Scann](https://github.com/google-research/google-research/tree/master/scann) a aknn algorithm faster than hnsw by google 40 | * [Catalyzer](https://arxiv.org/pdf/1806.03198.pdf) training the quantizer with backpropagation 41 | * [hora](https://github.com/hora-search/hora) approximate knn implemented in rust 42 | * **Search pipelines** allow fast serving and customization of how the indices are queries 43 | * [Milvus](https://github.com/milvus-io/milvus) end to end similarity engine, on top of faiss and hnswlib 44 | * [Jina](https://github.com/jina-ai/jina) flexible end to end similarity engine 45 | * [Haystack](https://github.com/deepset-ai/haystack) question answering on text pipeline 46 | * **Companies**: many companies are being built around semantic search systems 47 | * [Jina](https://jina.ai/) is building flexible pipeline to encode and search with embeddings 48 | * [Weaviate](https://github.com/semi-technologies/weaviate) is building a cloud-native vector search engine 49 | * [Pinecone](https://techcrunch.com/2021/01/27/pinecone-lands-10m-seed-for-purpose-built-machine-learning-database/?guccounter=1) a startup building databases indexing embeddings 50 | * [Vector ai](https://hub.getvectorai.com/) is building an encoder hub 51 | * [Milvus](https://milvus.io/) builds an end to end open source semantic search system 52 | * [FeatureForm's embeddinghub](https://github.com/featureform/embeddinghub) combining DB and KNN 53 | * [vespa](https://blog.vespa.ai/) knn-based managed retrieval engine 54 | * Many other companies are using these systems and releasing open tools on the way, and it would be too long a list to put them here (for example facebook with faiss and self supervision, google with scann and thousand of papers, microsoft with sptag, spotify with annoy, criteo with rsvd, deepr, autofaiss, …) 55 | --------------------------------------------------------------------------------