├── research ├── books-history.md ├── books-nlp.md ├── books-information-science.md ├── books-library.md ├── crawling-research.md ├── trust-research.md ├── books-knowledge-management.md ├── decentralization-research.md ├── federation-research.md ├── recommendations-research.md ├── semantic-research.md ├── personalization-research.md ├── fairness-research.md ├── research-main.md ├── ranking-research.md ├── uncategorized-research.md ├── annotated-collaborative-research.md └── books-research.md ├── specific-engines ├── opensearch │ ├── opensearch-python.md │ ├── opensearch-vector.md │ └── opensearch-python-client.md ├── pg_search.md ├── vector.md ├── aws-opensearch │ ├── aws-opensearch-misc.md │ ├── aws-opensearch-dql.md │ ├── aws-opensearch-serverless.md │ ├── aws-opensearch-vector.md │ ├── aws-opensearch-code.md │ └── aws-opensearch-main.md ├── solr │ ├── solr-resources-other.md │ ├── solr-extend.md │ ├── basic-admin-ui-tutorial.md │ ├── solr-resources-used-by.md │ ├── solr-notes.md │ ├── basic-indexing-own-data-tutorial.md │ ├── solr-resources-code.md │ ├── solr-resources-ui.md │ ├── solr-development.md │ ├── solr-resources-interesting-old.md │ ├── basic-solrcloud-tutorial.md │ ├── solr-resources-app-framework-integrations.md │ ├── solr-resources-utlities.md │ ├── solr-terminology.md │ └── basic-tutorial.md ├── elasticsearch │ ├── elasticsearch-build-ui.md │ ├── elasticsearch-ui.md │ ├── elasticsearch-ingestion.md │ └── elasticsearch-clients.md ├── apache-lucene.md ├── elasticsearch.md ├── yacy.md ├── apache-solr.md └── opensearch.md ├── SearchFrontEnd.md ├── ToCategorize.md ├── front-end ├── ui-component-libraries-for-search.md └── ui-components-of-search.md ├── LICENSE ├── Glossary.md ├── features └── faceting.md ├── common-crawl ├── common-crawl-resources.md ├── basic-info-common-crawl.md └── basic-manually-accessing-common-crawl.md ├── vector-search └── vector-basics.md ├── BuildingSearchEngines.md ├── web-archiving └── archiving-introduction.md ├── 
collaborative └── README.md ├── WebCrawlers.md ├── OpenSourceSearchEngines.md └── CommonCrawl.md /research/books-history.md: -------------------------------------------------------------------------------- 1 | - Simon Winchester. Knowing What We Know: The Transmission of Knowledge: From Ancient Wisdom to Modern Magic. Harper, 4/2023. 431 pp.* -------------------------------------------------------------------------------- /specific-engines/opensearch/opensearch-python.md: -------------------------------------------------------------------------------- 1 | - Nabila Abraham. [Semantic Search with OpenSearch and Cohere: A Comprehensive Demo](https://cohere.com/blog/semantic-search-open-search-demo). cohere, 6/2023. -------------------------------------------------------------------------------- /specific-engines/pg_search.md: -------------------------------------------------------------------------------- 1 | - Ming Ying. [pg_search: Elastic-Quality Full Text Search Inside Postgres](https://blog.paradedb.com/pages/introducing_search). ParadeDB, 10/2023. 2 | - Introduces pg_search with a high-level overview. -------------------------------------------------------------------------------- /specific-engines/vector.md: -------------------------------------------------------------------------------- 1 | - [Milvus](https://milvus.io/) - specifically built for similarity search 2 | - [Qdrant](https://qdrant.io/) - open source 3 | - [GitHub](https://github.com/qdrant/qdrant) - Stars: 10.1k - Updated: 5/2023 - Checked: 5/2023. -------------------------------------------------------------------------------- /specific-engines/aws-opensearch/aws-opensearch-misc.md: -------------------------------------------------------------------------------- 1 | - Count API - Get the count of documents without getting the actual documents. 2 | - [ev2900's OpenSearch Audit Logs Repo](https://github.com/ev2900/OpenSearch_Audit_Logs). Updated: 3/2024. Checked: 3/2024. 
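The Count API noted above can be sketched without any client library. This is a minimal illustration, assuming the standard `GET /<index>/_count` endpoint with an optional `query` body; the host `https://localhost:9200`, the index name `books`, and the `title` field are hypothetical placeholders, and the helper names are made up for this example. The code only builds the request and parses a response, so nothing is sent over the network.

```python
import json
from urllib.parse import quote


def build_count_request(base_url, index, query=None):
    """Return (method, url, body) for a Count API call (hypothetical helper).

    Nothing is sent; pass the result to whatever HTTP client is already in use.
    """
    url = f"{base_url.rstrip('/')}/{quote(index, safe='')}/_count"
    # With no query, the Count API counts every document in the index.
    body = json.dumps({"query": query}) if query is not None else None
    return ("GET", url, body)


def parse_count(response_text):
    """Extract the document count from a Count API JSON response."""
    return json.loads(response_text)["count"]


# Hypothetical endpoint, index, and field names, for illustration only.
method, url, body = build_count_request(
    "https://localhost:9200", "books", {"match": {"title": "search"}}
)
print(method, url)
print(body)
```

The response carries a `count` field alongside shard statistics, so `parse_count` is all that is needed to read the result; no document bodies are returned.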
-------------------------------------------------------------------------------- /research/books-nlp.md: -------------------------------------------------------------------------------- 1 | - Paul Azunre. Transfer Learning for Natural Language Processing. Manning, 8/2021. 2 | - Hannes Hapke, Cole Howard, Hobson Lane. Natural Language Processing in Action: Understanding, Analyzing, and Generating Text in Python. Manning, 3/2019. 1114 pp. -------------------------------------------------------------------------------- /research/books-information-science.md: -------------------------------------------------------------------------------- 1 | - David Bawden, Lyn Robinson. Introduction to Information Science, 2nd edition. Facet Publishing, 2/2022. 536 pp. 2 | - G. Edward Evans, Stacey Greenwell. Management Basics for Information Professionals, 4th edition. ALA Neal-Schuman, 1/2020. 352 pp. -------------------------------------------------------------------------------- /specific-engines/solr/solr-resources-other.md: -------------------------------------------------------------------------------- 1 | - [Dual Indexing Neo4j and Solr for a Unified Platform](https://neo4j.com/blog/dual-indexing-neo4j-and-solr-for-a-unified-platform/). neo4j, 3/2021. 2 | - Written by Nathan Maynes, who worked on implementing such a solution at Thomson Reuters. -------------------------------------------------------------------------------- /SearchFrontEnd.md: -------------------------------------------------------------------------------- 1 | # Search on the Front-end 2 | 3 | ## Introduction 4 | This document covers resources available for building UIs for search engines. 
5 | 6 | 7 | ## Contents 8 | - [Pre-Built Search Component Libraries](./front-end/ui-component-libraries-for-search.md) 9 | - [Understanding the UI Components of Search](./front-end/ui-components-of-search.md) -------------------------------------------------------------------------------- /specific-engines/solr/solr-extend.md: -------------------------------------------------------------------------------- 1 | - [Solr Extension Directory](https://solr.cool/) - A directory of extensions available for Solr. - Checked: 2/2025. 2 | - [solr-compound-word-filter](https://github.com/redlink-gmbh/solr-compound-word-filter) - Stars: 3 - Updated: 5/2024 - Checked: 5/2024 3 | - Redlink version of `solr.HyphenationCompoundWordTokenFilterFactory` with fix for LUCENE-8183 and support for epenthesis parameter. -------------------------------------------------------------------------------- /specific-engines/aws-opensearch/aws-opensearch-dql.md: -------------------------------------------------------------------------------- 1 | # Dashboards Query Language (DQL) 2 | 3 | Used within OpenSearch Dashboards. Consists of four primary query types: 4 | - Terms Query - Matches a specific term. 5 | - Boolean Query - Combines multiple queries using AND, OR, and NOT. 6 | - Date and Range Query - Matches documents within a specific date or range. 7 | - Nested Field Query - Allows for retrieving specific portions of a document that contains nested fields. -------------------------------------------------------------------------------- /specific-engines/opensearch/opensearch-vector.md: -------------------------------------------------------------------------------- 1 | - Valentin Crettaz. [How to Set Up Vector Search in OpenSearch](https://opster.com/guides/opensearch/opensearch-machine-learning/how-to-set-up-vector-search-in-opensearch/). Opster, 10/2023. 2 | - Dylan Castillo. 
[Semantic Search with OpenSearch, Cohere, and FastAPI](https://dylancastillo.co/posts/semantic-search-with-opensearch-cohere-and-fastapi.html). 4/2023, updated: 7/2024. 3 | - Seems useful but written for OpenSearch 2.6 so doesn't take advantage of some of the latest vector features in OpenSearch. -------------------------------------------------------------------------------- /specific-engines/elasticsearch/elasticsearch-build-ui.md: -------------------------------------------------------------------------------- 1 | - [Elastic UI Framework](https://github.com/elastic/eui) - Stars: 6k - Updated: 5/2024 - Checked: 5/2024. 2 | - "...a collection of React UI components for quickly building user interfaces at Elastic." 3 | - [Elastic Search UI](https://github.com/elastic/search-ui) - Stars: 1.9k - Updated: 5/2024 - Checked: 5/2024. 4 | - "A React-based library for building search user interfaces." 5 | - "A JavaScript library for the fast development of modern, engaging search experiences with Elastic." 6 | - Note: This is not React specific and can be used with Elasticsearch "or any other search API." -------------------------------------------------------------------------------- /specific-engines/aws-opensearch/aws-opensearch-serverless.md: -------------------------------------------------------------------------------- 1 | # AWS OpenSearch Serverless 2 | 3 | We won't be discussing the Serverless option much at this time but will highlight here a few differences between this offering and the managed offering which is covered more extensively. 
4 | 5 | ## Resources 6 | - [Amazon OpenSearch Serverless Developer Guide](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/serverless.html) 7 | - [Working with Vector Search Collections](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/serverless-vector-search.html) 8 | - [Build a search application with Amazon OpenSearch Serverless](https://aws.amazon.com/blogs/big-data/build-a-search-application-with-amazon-opensearch-serverless/). AWS Big Data Blog, 1/2023. 9 | -------------------------------------------------------------------------------- /ToCategorize.md: -------------------------------------------------------------------------------- 1 | Wikimedia Foundation The Anatomy of Search Blog Series (Trey Jones) 2 | - [A Token of My Affection](https://wikimediafoundation.org/news/2018/08/07/anatomy-search-token-affection/). 8/2018. 3 | - On tokenization. 4 | - [Variation Under Nature](https://wikimediafoundation.org/news/2018/09/13/anatomy-search-variation-under-nature/). 9/2018. 5 | - On normalization. 6 | - [The Root of the Problem](https://wikimediafoundation.org/news/2018/11/28/anatomy-search-the-root-of-the-problem/). 11/2018. 7 | - On stemming. 8 | - [A Place for My Stuff](https://wikimediafoundation.org/news/2019/03/12/the-anatomy-of-search-a-place-for-my-stuff/). 3/2019. 9 | - On the inverted index. 10 | - [In Search Of...](https://wikimediafoundation.org/news/2019/09/05/the-anatomy-of-search-in-search-of/). 9/2019. 11 | - On querying/searching. -------------------------------------------------------------------------------- /research/books-library.md: -------------------------------------------------------------------------------- 1 | - Peter Botticelli, Martha R. Mahard, Michele V. Cloonan. Libraries, Archives, and Museums Today: Insights from the Field. Rowman & Littlefield Publishers, 2/2019. 193 pp.* 2 | - Barbara B. Moran, Claudia J. Morner. Library and Information Center Management, 9th edition. 
Libraries Unlimited, 11/2017. 1006 pp.* 3 | - Lisa K. Hussey, Diane L. Velasquez. Library Management 101: A Practical Guide, 2nd edition. ALA Editions, 4/2019. 312 pp. 4 | - Bridgit McCafferty. Library Management: A Practical Guide for Librarians. Rowman & Littlefield Publishers, 5/2021. 169 pp. 5 | - Claire B. Joseph, Priscilla L. Stephenson, eds. Managing Health Sciences Libraries in a Time of Change. Rowman & Littlefield, 1/2024. 123 pp.* 6 | - Stacey Marien, ed. Library Technical Services: Adapting to a Changing Environment. Purdue University Press, 8/2020. 494 pp. -------------------------------------------------------------------------------- /research/crawling-research.md: -------------------------------------------------------------------------------- 1 | # Crawling Research on Search Engines and Information Retrieval 2 | 3 | - Ajay Sudhir Bale, Naveen Ghorpade, S Kamalesh, et al. [Web Scraping Approaches and their Performance on Modern Websites](https://www.researchgate.net/publication/363669276_Web_Scraping_Approaches_and_their_Performance_on_Modern_Websites). 9/2022. 4 | - Jesse Sayles, Ryan P. Furey, Marilyn R. ten Brink. [How deep to dig: effects of web-scraping search depth on hyperlink network analysis of environmental stewardship organizations](https://www.researchgate.net/publication/361121252_How_deep_to_dig_effects_of_web-scraping_search_depth_on_hyperlink_network_analysis_of_environmental_stewardship_organizations). 6/2022. 5 | - Dvijesh Bhatt, Daiwat Amit Vyas, Sharnil Pandya. [Focused Web Crawler](https://www.researchgate.net/publication/344579272_Focused_Web_Crawler). 10/2020. 
6 | -------------------------------------------------------------------------------- /specific-engines/solr/basic-admin-ui-tutorial.md: -------------------------------------------------------------------------------- 1 | # Basic Solr Admin UI Tutorial 2 | 3 | ## Overview 4 | - Dashboard: http://hostname:8983/solr/ 5 | - Navigation: 6 | - Logging Screen 7 | - Collections / Core Admin 8 | - Java Properties Screen 9 | - Dropdown: Collection Selector 10 | - Only available when running SolrCloud, relevant options available under Core Selector 11 | - Dropdown: Core Selector 12 | - Selecting a Core or Collection shows additional specific menu items 13 | - Login Screen 14 | - Getting Assistance 15 | - Documentation 16 | - Issue Tracker (JIRA) 17 | - IRC 18 | - Community Forum (mailing lists) 19 | - Solr Query Syntax 20 | - Security UI 21 | - Schema Designer 22 | - Only available when using SolrCloud 23 | 24 | 25 | 26 | ## Bibliography / Resources 27 | - https://solr.apache.org/guide/solr/latest/getting-started/solr-admin-ui.html -------------------------------------------------------------------------------- /research/trust-research.md: -------------------------------------------------------------------------------- 1 | # Trust / Trustworthiness 2 | 3 | - Maarten de Rijke. [Beyond-Accuracy Goals, Again](https://dl.acm.org/doi/10.1145/3539597.3572332). 2/2023. 4 | - Markus Schedl, Emilia Gómez, Elisabeth Lex. [Trustworthy Algorithmic Ranking Systems.](https://dl.acm.org/doi/10.1145/3539597.3572723). 2/2023.* 5 | - Thomas Wadlow. [Who Must You Trust?: You must have some trust if you want to get anything done](https://dl.acm.org/doi/10.1145/2620660.2630691). 5/2014. 6 | - Dirk Lewandowski. [Credibility in Web Search Engines](https://www.researchgate.net/publication/230609381_Credibility_in_Web_Search_Engines). 8/2012.* 7 | - Yusuke Yamamoto, Katsumi Tanaka. 
[Enhancing Credibility Judgment of Web Search Results](https://www.researchgate.net/publication/221518035_Enhancing_Credibility_Judgment_of_Web_Search_Results). 5/2011. 8 | - Ken Thompson. [Reflections on trusting trust](https://dl.acm.org/doi/10.1145/358198.358210). 8/1984. -------------------------------------------------------------------------------- /specific-engines/elasticsearch/elasticsearch-ui.md: -------------------------------------------------------------------------------- 1 | # Query Builders 2 | - [mirage](https://github.com/appbaseio/mirage) - Stars: 2.2k - Updated: 2019 - Checked: 5/2024. 3 | - Created by Appbase / ReactiveSearch. 4 | - "Mirage is a modern, open-source web based query explorer for Elasticsearch...It offers a blocks based GUI for composing Elasticsearch queries and comes with an on-the-fly transformer to show the corresponding JSON query API of Elasticsearch." 5 | 6 | # Cluster Management 7 | - [elasticvue](https://github.com/cars10/elasticvue) - Stars: 1.6k - Updated: 4/2024 - Checked: 5/2024 8 | - "Elasticsearch gui for the browser" 9 | - Available as both a desktop app and a browser extension. 10 | - [elasticsearch-comrade](https://github.com/moshe/elasticsearch-comrade) - Stars: 271 - Updated: 2022 - Checked: 5/2024 11 | - "Elasticsearch admin panel built for ops and monitoring...highly inspired by Cerebro." -------------------------------------------------------------------------------- /research/books-knowledge-management.md: -------------------------------------------------------------------------------- 1 | - Manlio Del Giudice, Veronica Scuotto, Armando Papa. Knowledge Management and AI in Society 5.0. Routledge, 3/2023. 91 pp. 2 | - Donald Hislop, Rachelle Bosua, Remko Helms. Knowledge Management in Organizations: A Critical Introduction, 4th edition. Oxford University Press, 4/2018.* 3 | - Kimiz Dalkir. Knowledge Management in Theory and Practice. Routledge, 9/2013. 4 | - Jennifer A. Bartlett. 
Knowledge Management: A Practical Guide for Librarians. Rowman & Littlefield Publishers, 5/2021. 151 pp.* 5 | - Irma Becerra-Fernandez, Rajiv Sabherwal, Richard Kumi. Knowledge Management: Systems and Processes in the AI Era. Routledge, 2/2024. 388 pp.* 6 | - Klaus North, Gita Kumta. Knowledge Management: Value Creation Through Organizational Learning. Springer, 4/2018. 642 pp. 7 | - Olivier Serrat. Knowledge Solutions: Tools, Methods, and Approaches to Drive Organizational Performance. Springer, 5/2017. 1503 pp. 8 | - Available under CC license. 9 | - Patrick Lambe. Organising Knowledge: Taxonomies, Knowledge and Organisational Effectiveness. Chandos Publishing, 3/2007. 300 pp. -------------------------------------------------------------------------------- /front-end/ui-component-libraries-for-search.md: -------------------------------------------------------------------------------- 1 | ## Pre-Built Search Component Libraries 2 | 3 | ### Reactive Search 4 | - https://www.reactivesearch.io/product/search-ui 5 | - GitHub: https://github.com/appbaseio/reactivesearch 6 | - Stars: 4.9k, Updated: 1/2024, Checked: 2/2024. 7 | - Supports Elasticsearch, OpenSearch, Solr, MongoDB. 8 | - Available for React, Vue, Vanilla JS. 9 | - Note that this requires one to host an open source backend ReactiveSearch provides 10 | or to use ReactiveSearch's cloud service. 11 | 12 | ### Instantsearch 13 | - https://www.algolia.com/doc/guides/building-search-ui/what-is-instantsearch/js/ 14 | - GitHub: https://github.com/algolia/instantsearch 15 | - Stars: 3.5k, Updated: 3/2024, Checked: 3/2024. 16 | - Although it is open source, by default it only supports Algolia. See Searchkit below for a wrapper 17 | around instantsearch that supports Elasticsearch and OpenSearch. 18 | 19 | ### Searchkit 20 | - https://www.searchkit.co/ 21 | - GitHub: https://github.com/searchkit/searchkit 22 | - Stars: 4.7k, Updated: 2/2024, Checked: 2/2024. 23 | - Supports Elasticsearch and OpenSearch. 
24 | - Built on top of Algolia's instantsearch. -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2023 Dave Mackey 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /research/decentralization-research.md: -------------------------------------------------------------------------------- 1 | # Decentralization in Search Engines and Information Retrieval 2 | 3 | - Mario Kubek, Herwig Unger. [WebEngine Version 1.0: Building a Decentralised Web Search Engine](https://www.researchgate.net/publication/357504886_WebEngine_Version_10_Building_a_Decentralised_Web_Search_Engine). 1/2022. 
4 | - See also Kubek and Unger's [The WebEngine - A Fully Integrated, Decentralised Web Search Engine](https://www.researchgate.net/publication/329183704_The_WebEngine_-_A_Fully_Integrated_Decentralised_Web_Search_Engine) from 11/2018. 5 | - Thanassis Tiropanis, Alexandra Poulovassilis, Adriane Chapman, George Roussos. [Search in a Redecentralised Web](https://www.researchgate.net/publication/357026616_Search_in_a_Redecentralised_Web). 12/2021.* 6 | - Hongsheng Xu, Ganglong Fan, Ke Li. [Construction of Search Engine System Based on Multithread Distributed Web Crawler](https://www.researchgate.net/publication/333206095_Construction_of_Search_Engine_System_Based_on_Multithread_Distributed_Web_Crawler). 5/2019. 7 | - Reaz Ahmed, Md. Faizul Bari, Rakibul Haque, R. Boutaba, Bertrand Mathieu. [DEWS: A decentralized engine for Web search](https://www.researchgate.net/publication/282931363_DEWS_A_decentralized_engine_for_Web_search). 1/2015. -------------------------------------------------------------------------------- /Glossary.md: -------------------------------------------------------------------------------- 1 | - Back-pressure 2 | - Cluster 3 | - Documents 4 | - Fields 5 | - Graph 6 | - Indexes 7 | - Indexing 8 | - Information Retrieval (IR) 9 | - k-NN 10 | - Knowledge Graph (KG) 11 | - Lucene 12 | - Mapping 13 | - [Named Entity Recognition (NER)](https://en.wikipedia.org/wiki/Named-entity_recognition) - "(also known as (named) entity identification, entity chunking, and entity extraction) is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc." 
14 | - Natural Language Processing (NLP) 15 | - Neural Networks 16 | - Nodes 17 | - Pretrained Language Models (PLM) 18 | - Primary Shard 19 | - Replica Shard 20 | - Semantics 21 | - [Semantic Search](https://en.wikipedia.org/wiki/Semantic_search) - "Semantic search denotes search with meaning, as distinguished from lexical search where the search engine looks for literal matches of the query words or variants of them, without understanding the overall meaning of the query." 22 | - [Semantic Triple](https://en.wikipedia.org/wiki/Semantic_triple) 23 | - [Semantic Web](https://en.wikipedia.org/wiki/Semantic_Web) 24 | - Shards -------------------------------------------------------------------------------- /features/faceting.md: -------------------------------------------------------------------------------- 1 | - [Wikipedia: Faceted Search](https://en.wikipedia.org/wiki/Faceted_search) 2 | - [Elastic Docs / App Search / Guides / Facets Guide](https://www.elastic.co/guide/en/app-search/current/facets-guide.html) - Fairly short and general introduction to facets. 
3 | - [Elastic Docs / App Search / API Reference / Search API Facets](https://www.elastic.co/guide/en/app-search/current/facets.html) 4 | - [Elastic Search Labs: Faceted Search](https://www.elastic.co/search-labs/tutorials/search-tutorial/full-text-search/facets) 5 | - [Elastic Docs / App Search / Guides / Hierarchical Facets Guide](https://www.elastic.co/guide/en/app-search/current/hierarchical-facets-guide.html) 6 | - [OpenSearch Python Client: API Reference / helpers / faceted_search](https://opensearch-project.github.io/opensearch-py/api-ref/helpers/faceted_search.html) 7 | - [WPSOLR: What Are Search Facets, How To Use Them, And What Are Their Limitations?](https://www.wpsolr.com/what-are-search-facets-how-to-use-them-and-what-are-their-limitations/) 8 | - [Vinted Engineering: Faceted Search Using Elasticsearch](https://vinted.engineering/2023/03/21/faceted-search-using-elasticsearch/) 9 | - Interesting article on how Vinted implemented faceting with Elasticsearch and why they made specific decisions. 10 | - [StackOverflow: Custom Facets Using Elastic Search](https://stackoverflow.com/questions/60688229/custom-facets-using-elastic-search) -------------------------------------------------------------------------------- /specific-engines/elasticsearch/elasticsearch-ingestion.md: -------------------------------------------------------------------------------- 1 | # Django 2 | - [django-elasticsearch-dsl](https://github.com/django-es/django-elasticsearch-dsl) - Stars: 1k - Updated: 9/2023 - Checked: 5/2024 3 | - "Django Elasticsearch DSL is a package that allows indexing of django models in elasticsearch. It is built as a thin wrapper around elasticsearch-dsl-py so you can use all the features developed by the elasticsearch-dsl-py team." 
4 | - [elasticsearch-django](https://github.com/yunojuno/elasticsearch-django) - Stars: 74 - Updated: 11/2023 - Checked: 5/2024 5 | - "This is a lightweight Django app for people who are using Elasticsearch with Django, and want to manage their indexes." 6 | 7 | # Gmail 8 | - [elasticsearch-gmail](https://github.com/oliver006/elasticsearch-gmail) - Stars: 2k - Updated: 8/2023 - Checked: 5/2024 9 | 10 | # Multiple 11 | - [elasticsearch_loader](https://github.com/moshe/elasticsearch_loader) - Stars: 395 - Updated: 2022 - Checked: 5/2024 12 | - Batch loading data files (json, parquet, csv, tsv). 13 | 14 | # Mongoose (Mongo) 15 | - [mongoosastic](https://github.com/mongoosastic/mongoosastic) - Stars: 1.1k - Updated: 2022 - Checked: 5/2024. 16 | 17 | # Fluentd 18 | - [fluent-plugin-elasticsearch](https://github.com/uken/fluent-plugin-elasticsearch) - Stars: 885 - Updated: 1/2024 - Checked: 5/2024 19 | - "Send your logs to Elasticsearch (and search them with Kibana maybe?)" -------------------------------------------------------------------------------- /common-crawl/common-crawl-resources.md: -------------------------------------------------------------------------------- 1 | ## Accessing the CommonCrawl Index - General 2 | - [Searching 100 Billion Webpages Pages With Capture Index](https://skeptric.com/searching-100b-pages-cdx/). skeptric, 6/2020. 3 | - Shows how to do so with CDX Toolkit, the Capture Index API directly, and using comcrawl. 4 | - [Common Crawl on Laptop - Extracting Subset of Data](https://avilpage.com/2022/11/common-crawl-laptop-extract-subset.html). Avil Page, 11/2022. 5 | 6 | ## Understanding CommonCrawl WARC Files 7 | - [Extracting Text, Metadata and Data from Common Crawl](https://skeptric.com/text-meta-data-commoncrawl/). skeptric, 6/2020. 8 | - Covers WARC, WET, and WAT. 9 | 10 | ## Understanding CommonCrawl Parquet Files 11 | - [Read Common Crawl Parquet Metadata with Python](https://skeptric.com/reading-parquet-metadata/). skeptric, 4/2022. 
12 | - Demonstrates PyArrow, fastparquet, manual methods. 13 | 14 | ## AWS Athena 15 | - [Common Crawl Index Athena](https://skeptric.com/common-crawl-index-athena/). skeptric, 6/2020. 16 | 17 | ## CDX Toolkit 18 | - [Extracting Job Ads from Common Crawl](https://skeptric.com/common-crawl-job-ads/). skeptric, 6/2020. 19 | 20 | ## SQL 21 | - [Accessing WARC files via SQL](https://digital.library.unt.edu/ark:/67531/metadc1608961/). UNT Digital Library, 2019. 22 | 23 | ## For Security Purposes 24 | - [All Around the World: The Common Crawl Dataset](https://labs.watchtowr.com/all-around-the-world-the-common-crawl-dataset/). watchtowr, 10/2022. -------------------------------------------------------------------------------- /research/federation-research.md: -------------------------------------------------------------------------------- 1 | # Federation in Search Engines and Information Retrieval 2 | 3 | - Shuchang Liu, Yingqiang Ge, Shuyuan Xu, Yongfeng Zhang, Amelie Marian. [Fairness-aware Federated Matrix Factorization](https://dl.acm.org/doi/10.1145/3523227.3546771). 9/2022. 4 | - Qi Zhang, Tiancheng Wu, Peichen Zhou, Shan Zhou, Yuan Yang, Xiulang Jin. [Felicitas: Federated Learning in Distributed Cross Device Collaborative Frameworks](https://dl.acm.org/doi/10.1145/3534678.3539039). 8/2022. 5 | - Seok-Ju Hahn, Minwoo Jeong, Junghye Lee. [Connecting Low-Loss Subspace for Personalized Federated Learning](https://dl.acm.org/doi/10.1145/3534678.3539254). 8/2022. 6 | - Sen Cui, Jian Liang, Weishen Pan, Kun Chen, Changshui Zhang, Fei Wang. [Collaboration Equilibrium in Federated Learning](https://dl.acm.org/doi/10.1145/3534678.3539237). 8/2022. 7 | - Yaliang Li, Bolin Ding, Jingren Zhou. [A Practical Introduction to Federated Learning](https://dl.acm.org/doi/10.1145/3534678.3542631). 8/2022. 8 | - Minas Pergantis, Iraklis Varlamis, Andreas Giannakoulopoulos. 
[User Evaluation and Metrics Analysis of a Prototype Web-Based Federated Search Engine for Art and Cultural Heritage](https://www.researchgate.net/publication/361097561_User_Evaluation_and_Metrics_Analysis_of_a_Prototype_Web-Based_Federated_Search_Engine_for_Art_and_Cultural_Heritage). 6/2022. 9 | - Dong Nguyen, Thomas Demeester, Dolf Trieschnigg, Djoerd Hiemstra. [Resource Selection for Federated Search on the Web](https://www.researchgate.net/publication/308152481_Resource_Selection_for_Federated_Search_on_the_Web). 9/2016.* 10 | -------------------------------------------------------------------------------- /specific-engines/solr/solr-resources-used-by.md: -------------------------------------------------------------------------------- 1 | ### Projects Using Solr 2 | - [Fulcrum (Heliotrope)](https://github.com/mlibrary/heliotrope) - Stars: 41 - Updated: 5/2023 - Checked: 5/2023 3 | - "a Samvera-based digital publishing platform built by the University of Michigan Library" 4 | - [FXDesktopSearch](https://github.com/mirkosertic/FXDesktopSearch) - Stars: 164 - Updated: 5/2023 - Checked: 5/2023 5 | - "a JavaFX based desktop search application" 6 | - [Hydroshare](https://github.com/hydroshare/hydroshare) - Stars: 166 - Updated: 5/2023 - Checked: 5/2023 7 | - "collaborative website for better access to data and models in the hydrologic sciences." 
8 | - [Islandora](https://www.islandora.ca/) 9 | - "Open-source Digital Asset Management" 10 | - [MontySolr](https://github.com/adsabs/montysolr) - Stars: 50 - Updated: 12/2022 - Checked: 5/2023 11 | - "the search engine behind Astrophysics Data System (ADS 2.0)" 12 | - [National Information Exchange Model (NIEM) Movement](https://github.com/NIEM/movement-solr) - Stars: 3 - Updated: 2/2023 - Checked: 5/2023 13 | - [Nelmio](https://github.com/nelmio/NelmioSolariumBundle) - Stars: 152 - Updated: 4/2023 - Checked: 5/2023 14 | - [Samvera](https://github.com/samvera/hyrax) - Stars: 166 - Updated: 5/2023 - Checked: 5/2023 15 | - "provides a foundation for creating many different digital repository applications." 16 | - [netarchivesuite's SolrWayback](https://github.com/netarchivesuite/solrwayback) - Stars: 77 - Updated: 5/2023 - Checked: 5/2023 17 | - "A search interface and wayback machine for the UKWA Solr based warc-indexer framework" 18 | - [OpenSextant's Xponents](https://github.com/OpenSextant/Xponents) - Stars: 42 - Updated: 5/2023 - Checked: 5/2023 -------------------------------------------------------------------------------- /common-crawl/basic-info-common-crawl.md: -------------------------------------------------------------------------------- 1 | # Basic Information About Common Crawl 2 | 3 | ## Introduction 4 | Common Crawl is a non-profit organization (founded by Gil Elbaz) which for years has been crawling the web in a similar manner to major search engines such as Google and Bing. The data is then made available for [free and under a non-restrictive license](https://commoncrawl.org/terms-of-use/) for usage by anyone who desires to use it. 5 | 6 | Common Crawl has been around for eons in internet years (since 2008) and has a number of well-known individuals on its Board of Directors and Advisory Board. 7 | 8 | Amazingly, CC is currently maintained primarily by a single individual - Sebastian Nagel. 9 | 10 | ## That Said... 
11 | It does appear that CC has probably seen a decrease in funding over the years. The variety of voices on the blog has decreased markedly since ~2015, roughly when Lisa Green left her position as Director of Common Crawl (although she remains on the Board of Advisors), and more recently crawls have become a bi-monthly event (formerly monthly). My sincere appreciation to Nagel who continues to carry the torch of an important project! I hope that funding and contributors will increase in the near future! 12 | 13 | ## Crawler 14 | The web crawler (CCBot) is built on the famous open source Apache Nutch engine and utilizes Apache Hadoop. 15 | 16 | ## Data 17 | The data is stored on Amazon Web Services' (AWS) S3 storage service. It can be accessed over HTTP (slow), using other AWS systems (e.g. EC2, Athena), and using the AWS CLI/SDK. 18 | 19 | ## Bibliography 20 | - [Common Crawl](https://commoncrawl.org/) 21 | - [About](https://commoncrawl.org/about/) 22 | - [Our Team](https://commoncrawl.org/about/team/) 23 | -------------------------------------------------------------------------------- /specific-engines/solr/solr-notes.md: -------------------------------------------------------------------------------- 1 | ## Goal 2 | My memory isn't amazing so I tend to make concise notes that help me remember technologies. Much of this document will be a summarization of documentation for Apache Solr. On occasion it may also include information not in the source materials (or at least not in the same place); this is done to fill in gaps in my knowledge. 3 | 4 | ## Loading Data into Solr 5 | - Can ingest data from many sources (XML, CSV, Microsoft Word, PDF, etc.) 6 | 7 | ### Common Ways to Load Data 8 | - Using Solr Cell and Apache Tika, the latter is for ingesting binary files. 9 | - Uploading XML files using HTTP requests. 10 | - Creating a custom application that utilizes the Java Client API.
11 | - Note: By default the `-e` option when starting Solr sets the `example` directory as base directory for the Solr instance. 12 | 13 | ## Searching with Solr 14 | - A query is made to Solr and processed by a *request handler*, which calls a *query parser* "which interprets the terms and parameters of a query." 15 | 16 | ### Query Parsers 17 | - "Different query parsers support different syntax." 18 | - Default: Standard Query Parser (aka "lucene" query parser). 19 | - Other Included Query Parsers: 20 | - DisMax Query Parser 21 | - Extended DisMax (eDisMax) Query Parser 22 | - The Standard Query Parser is precise (but throws syntax errors when queries are incorrect) while the DisMax/eDisMax Query Parsers act similarly to web search engines. 23 | - eDisMax extends and improves on DisMax's functionality. 24 | 25 | ### Query Parser Input Types 26 | - Queries can be made using: 27 | - strings (e.g. words, terms) 28 | - "parameters for fine-tuning the query" 29 | - "parameters for controlling the presentation of the query response" 30 | 31 | ## Bibliography / Resources 32 | - See the main [Apache Solr](../apache-solr.md) document for a list of resources. -------------------------------------------------------------------------------- /research/recommendations-research.md: -------------------------------------------------------------------------------- 1 | # Recommendations / Recommender Systems 2 | 3 | ## 2023 4 | - Zhe Fu, Xi Niu, Li Yu. [Wisdom of Crowds and Fine-Grained Learning for Serendipity Recommendations](https://dl.acm.org/doi/10.1145/3539618.3591787). SIGIR '23, 2023. 5 | - Yuanhao Liu, Qi Cao, Huawei Shen, Yunfan Wu, Shuchang Tao, Xueqi Cheng. [Popularity Debiasing from Exposure to Interaction in Collaborative Filtering](https://dl.acm.org/doi/10.1145/3539618.3591947). SIGIR '23, 7/2023. 6 | - Ziwei Fan, Ke Xu, Zhang Dong, Hao Peng, Jiawei Zhang, Philip S. Yu.
[Graph Collaborative Signals Denoising and Augmentation for Recommendation](https://dl.acm.org/doi/10.1145/3539618.3591994). SIGIR '23, 7/2023. 7 | - Jiazheng Jing, Yinan Zhang, Xin Zhou, Zhiqi Shen. [Capturing Popularity Trends: A Simplistic Non-Personalized Approach for Enhanced Item Recommendation](https://dl.acm.org/doi/10.1145/3583780.3614801). CIKM '23, 10/2023. 8 | - Zhuang Liu, Haoxuan Li, Guanming Chen, Yuanxin Ouyang, Wenge Rong, Zhang Xiong. [PopDCL: Popularity-aware Debiased Contrastive Loss for Collaborative Filtering](https://dl.acm.org/doi/10.1145/3583780.3615009). CIKM '23, 10/2023.* 9 | - Sichun Luo, Chen Ma, Yuanzhang Xiao, Linqi Song. [Improving Long-Tail Item Recommendation with Graph Augmentation](https://dl.acm.org/doi/10.1145/3583780.3614929). CIKM '23, 10/2023.* 10 | 11 | ## 2020 12 | - Lijo Abraham. [Building a Recommendation System with Spark ML and Elasticsearch](https://towardsdatascience.com/building-a-recommendation-system-with-spark-ml-and-elasticsearch-abbd0fb59454). towardsdatascience, 9/2020. 13 | 14 | ## 2018 15 | - Dawen Liang, Rahul G. Krishnan, Matthew D. Hoffman, Tony Jebara. [Variational Autoencoders for Collaborative Filtering](https://dl.acm.org/doi/10.1145/3178876.3186150). WWW '18, 4/2018. 16 | 17 | ## 2015 18 | - Carlos A. Gomez-Uribe, Neil Hunt. [The Netflix Recommender System: Algorithms, Business Value, and Innovation](https://dl.acm.org/doi/10.1145/2843948). 12/2015. -------------------------------------------------------------------------------- /research/semantic-research.md: -------------------------------------------------------------------------------- 1 | # Semantic Research in Search Engines and Information Retrieval 2 | 3 | - Anand Kumar, B.P. Singh. [Semantic Web: Past, Present and Future](https://www.researchgate.net/publication/369038964_Semantic_Web_Past_Present_and_Future). 3/2023. 4 | - Hiteshwar Kumar Azaad, Akshay Deepak, Amisha Azad. 
[LOD search engine: A semantic search over linked data](https://www.researchgate.net/publication/356389086_LOD_search_engine_A_semantic_search_over_linked_data). 8/2022. 5 | - Uzoma Peter Ozioma, Amanze Bethran Chibuike, Agbakwuru Alphonsus Onyekachi, Agbasonu V.C. [Development of a Visual Semantic Web Ontology Based Learning Management System](https://www.researchgate.net/publication/361056506_DEVELOPMENT_OF_A_VISUAL_SEMANTIC_WEB_ONTOLOGY_BASED_LEARNING_MANAGEMENT_SYSTEM). 2/2022. 6 | - Anita Kumari, Jawahar Thalur. [Semantic Web Search Engines: A Comparative Study](https://www.researchgate.net/publication/330602787_Semantic_Web_Search_Engines_A_Comparative_Survey). 1/2019. 7 | - Amit Upadhyay, Amit Paul, Pijush Kanti Dutta Pramanik. [Semantic Web Crawler for More Relevant Search Using Ontology](https://www.researchgate.net/publication/282121492_Semantic_Web_Crawler_for_More_Relevant_Search_Using_Ontology). 12/2014. 8 | - G Sudeepthi, G Anuradha, M Surenda, Prasad Babu. [A Survey on Semantic Web Search Engine](https://www.researchgate.net/publication/268436376_A_Survey_on_Semantic_Web_Search_Engine). 3/2012. 9 | - S. Latha Shanmuga Vadivu, M. Rajaram, S.N. Sivanandam. [A Survey on semantic web mining based web search engines](https://www.researchgate.net/publication/283249382_A_Survey_on_semantic_web_mining_based_web_search_engines). 10/2011. 10 | - Bettina Fazzinga, Giorgio Gianforme, Georg Gottlob, Thomas Lukasiewicz. [Semantic Web search based on ontological conjunctive queries](https://www.researchgate.net/publication/220291161_Semantic_Web_search_based_on_ontological_conjunctive_queries). 12/2011. 11 | - Bettina Fazzinga, Thomas Lukasiewicz. [Semantic search on the Web](https://www.researchgate.net/publication/220575552_Semantic_search_on_the_Web). 1/2010. 
-------------------------------------------------------------------------------- /specific-engines/solr/basic-indexing-own-data-tutorial.md: -------------------------------------------------------------------------------- 1 | # Indexing One's Own Data Tutorial 2 | 3 | The content here is pulled from the official Solr Tutorials, [Exercise 3: Index Your Own Data](https://solr.apache.org/guide/solr/latest/getting-started/tutorial-diy.html). 4 | 5 | These are essentially my notes on the course materials. 6 | 7 | ## Preparations 8 | - What sort of data will I index? 9 | - What do I need to prepare Solr for this data? 10 | - e.g., create fields, set up copy fields, determine analysis rules, etc. 11 | - What kind of search options should the users have? 12 | - How much testing is needed to ensure it is working correctly? 13 | 14 | ## Create Your Own Collection 15 | - Make sure that Solr is running, e.g. `./bin/solr start` 16 | - `./bin/solr create -c yourCollectionName -s 2 -rf 2` 17 | 18 | ## Indexing Ideas 19 | 20 | ### Local Files with bin/post 21 | - If files are local, can handle JSON, XML, CSV, HTML, PDF, MS Office, plain text, and other formats. 22 | - `./bin/post -c yourCollectionName ~/pathToYourData` 23 | 24 | ### SolrJ or Other Client APIs 25 | 26 | ### Documents Screen 27 | - http://localhost:8983/solr/#/yourCollectionName/documents 28 | - Paste in document or use Document Builder from Document Type to create a document field by field. 29 | 30 | ## Updating Data 31 | - Documents with an identical `uniqueKey` value in the field `id` are updated rather than added in future indexing operations. 32 | 33 | 34 | ## Deleting Data 35 | - "You can delete data by POSTing a delete command to the update URL and specifying the value of the document’s unique key field, or a query that matches multiple documents (be careful with that one!). We can use bin/post to delete documents also if we structure the request properly." 
36 | - Specific Document: `bin/post -c yourCollectionName -d "someUniqueId"` 37 | - All Documents: `bin/post -c yourCollectionName -d "*:*"` 38 | 39 | 40 | ## Bibliography / Resources 41 | - https://solr.apache.org/guide/solr/latest/getting-started/tutorial-diy.html -------------------------------------------------------------------------------- /research/personalization-research.md: -------------------------------------------------------------------------------- 1 | # Personalization in Search Engines and Information Retrieval 2 | 3 | - Ulrich Matter, Roland Hodler, Johannes Ladwig. [Personalization of Web Search During the 2020 US Elections](https://www.researchgate.net/publication/363920168_Personalization_of_Web_Search_During_the_2020_US_Elections). 9/2022. 4 | - Sunny Sharma, Vijay Rana. [A Systematic Literature Review of Web Search Personalization](https://www.researchgate.net/publication/339466998_A_Systematic_Literature_Review_Of_Web_Search_Personalization). 2/2020.* 5 | - Wiem Chebil, Mohammad Wedyan, Haiyan Lu, Omar Elshaweesh. [Context-Aware Personalized Web Search Using Navigation History](https://www.researchgate.net/publication/339435401_Context-Aware_Personalized_Web_Search_Using_Navigation_History). 2/2020. 6 | - Eugene Agichtein, Eric Brill, Susan Dumais. [Improving Web Search Ranking by Incorporating User Behavior Information](https://www.researchgate.net/publication/330459175_Improving_Web_Search_Ranking_by_Incorporating_User_Behavior_Information). 1/2019. 7 | - S. Salehi, J.T. Du, H. Ashman. [Use of web search engines and personalisation in information searching for educational purposes](https://www.researchgate.net/publication/326197645_Use_of_web_search_engines_and_personalisation_in_information_searching_for_educational_purposes). 6/2018. 8 | - Eric Utrera, Alfredo Simón-Cuevas, José A. Olivas. 
[Analysis of trends in the customization of results in web search engines](https://www.researchgate.net/publication/325157476_Analysis_of_trends_in_the_customization_of_results_in_web_search_engines). 4/2018. 9 | - M. Omair Shafiq, Reda Alhajj, John G. Rokne. [On personalizing Web search using social network analysis](https://www.researchgate.net/publication/275219347_On_personalizing_Web_search_using_social_network_analysis). 9/2015. 10 | - Aniko Hannak, Piotr Sapiezynski, Arash Molavi Kakhki, Balachander Krishnamurthy, David Lazer, Alan Mislove, Christo Wilson. [Measuring personalization of web search](https://www.researchgate.net/publication/262424460_Measuring_personalization_of_Web_search). 5/2013. -------------------------------------------------------------------------------- /specific-engines/aws-opensearch/aws-opensearch-vector.md: -------------------------------------------------------------------------------- 1 | # AWS OpenSearch Vector Functionality 2 | 3 | # High-Level 4 | - Jon Handler, Dylan Tong, Jianwei Li, Vamshi Vijay Nakkirtha. [Amazon OpenSearch Service's Vector Database Capabilities Explained.](https://aws.amazon.com/blogs/big-data/amazon-opensearch-services-vector-database-capabilities-explained/) AWS Big Data Blog, 6/2023. 5 | 6 | # Workshop 7 | - [Semantic and Vector Search with Amazon OpenSearch Service](https://catalog.workshops.aws/semantic-search/) 8 | - Covers Search Basics, Text Search, Semantic Search, Fullstack Semantic Search, Fine Tuning Semantic Search, Neural Search, Retrieval Augmented Generation, and Conversational Search. 9 | - I went through this workshop but ran into issues in Module 4. Even before this point the workshop instructions and the notebook were diverging in what they were saying. At this step I started getting errors. I think the fix for this particular issue is pretty straightforward but I'm not going to spend more time with this workshop; hopefully Amazon will fix it at some point.
10 | 11 | # Tutorials 12 | - Matt Barlow. [How to Augment ChatGPT with AWS OpenSearch](https://stratusgrid.com/blog/augmenting-chatgpt-with-amazon-opensearch?locale=en). StratusGrid, 3/2024. 13 | - Uses LlamaIndex. 14 | - Arun Shankar. [Augmenting Large Language Models with Verified Information Sources: Leveraging AWS SageMaker and OpenSearch for Knowledge-Driven Question Answering](https://medium.com/@shankar.arunp/augmenting-large-language-models-with-verified-information-sources-leveraging-aws-sagemaker-and-f6be17fb10a8). Medium, 4/2023. 15 | - James Matson. [So, You Want to Store Your GPT/LLM Data? AWS OpenSearch to the Rescue](https://betterprogramming.pub/%EF%B8%8Fso-you-want-to-store-your-llm-data-aws-opensearch-to-the-rescue-f704a0f70558). Better Programming, 5/2023. 16 | 17 | # Documentation 18 | - [OpenSearch Service Flow Framework Templates](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/ml-workflow-framework.html) 19 | - These flow templates can be used to automate ML connector workflows.
-------------------------------------------------------------------------------- /specific-engines/solr/solr-resources-code.md: -------------------------------------------------------------------------------- 1 | # Language Integrations 2 | 3 | ## .NET 4 | - [SolrNet](https://github.com/SolrNet/SolrNet) - Stars: 913 - Updated: 5/2023 - Checked: 5/2023 5 | - [solr-net-linq](https://github.com/IharYakimush/solr-net-linq) - Stars: 4 - Updated: 4/2023 - Checked: 5/2023 6 | 7 | ## Elixir 8 | - [Elsol](https://github.com/findmypast/elsol) - Stars: 8 - Updated: 3/2023 - Checked: 5/2023 9 | 10 | ## Go 11 | - [Solr Go](https://github.com/stevenferrer/solr-go) - Stars: 36 - Updated: 3/2023 - Checked: 5/2023 12 | - [go-solr](https://github.com/vanng822/go-solr) - Stars: 67 - Updated: 9/2022 - Checked: 5/2023 13 | 14 | ## JS 15 | - [solr-client for Node.js](https://github.com/lbdremy/solr-node-client) - Stars: 456 - Updated: 3/2022 - Checked: 5/2023 16 | 17 | ## JVM (includes Java, Clojure, Scala, etc.) 18 | - [flux](https://github.com/mwmitchell/flux) - Stars: 35 - Updated: 7/2022 - Checked: 5/2023 19 | - "A Clojure based Solr client." 
20 | - [solrs](https://github.com/inoio/solrs) - Stars: 103 - Updated: 4/2023 - Checked: 5/2023 21 | - "An async, non-blocking solr client for java/scala, providing a query interface like SolrJ" 22 | - [solr-scala-client](https://github.com/takezoe/solr-scala-client) - Stars: 91 - Updated: 9/2022 - Checked: 5/2023 23 | 24 | ## PHP 25 | - [Solarium](https://github.com/solariumphp/solarium) - Stars: 911 - Updated: 5/2023 - Checked: 5/2023 26 | - [SolrQueryComponent](https://github.com/InterNations/SolrQueryComponent) - Stars: 64 - Updated: 8/2022 - Checked: 5/2023 27 | - [pecl-search_engine-solr](https://github.com/php/pecl-search_engine-solr) - Stars: 56 - Updated: 5/2023 - Checked: 5/2023 28 | 29 | ## Python 30 | - [Pysolr](https://github.com/django-haystack/pysolr) - Stars: 639 - Updated: 5/2023 - Checked: 5/2023 31 | - [solrcloudpy](https://github.com/solrcloudpy/solrcloudpy) - Stars: 37 - Updated: 2/2021 - Checked: 5/2023 32 | - [solrq](https://github.com/swistakm/solrq) - Stars: 22 - Updated: 11/2022 - Checked: 5/2023 33 | 34 | ## Ruby 35 | - [rsolr](https://github.com/rsolr/rsolr) - Stars: 416 - Updated: 2/2022 - Checked: 5/2023 36 | - [solrb](https://github.com/machinio/solrb) - Stars: 38 - Updated: 9/2022 - Checked: 5/2023 37 | - [LSolr](https://github.com/supercaracal/lsolr) - Stars: 10 - Updated: 12/2022 - Checked: 5/2023 38 | - [Sunspot](https://github.com/sunspot/sunspot) - Stars: 3k - Updated: 3/2023 - Checked: 5/2023 -------------------------------------------------------------------------------- /specific-engines/aws-opensearch/aws-opensearch-code.md: -------------------------------------------------------------------------------- 1 | # AWS OpenSearch - Writing Code 2 | 3 | ## Introduction 4 | 5 | One of the stranger aspects of AWS' documentation on OpenSearch is its relative lack of code samples.
One would expect to find something comprehensive in the [Amazon OpenSearch Service Developer Guide's Sample Code](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/samplecode.html) section, but the section is extremely brief and only tangentially mentions Elasticsearch (not even OpenSearch) language clients. If you dig through the [Using the AWS SDKs](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/configuration-samples.html) subsection there is some reference - but you have to dig for it (at least imho). 6 | 7 | There is also a tutorial on [Creating a search application](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/search-example.html) but this is quite limited in scope and depends on a specific architecture (Amazon API Gateway <--> AWS Lambda <--> Amazon OpenSearch Service). 8 | 9 | Ironically, nested under OpenSearch Serverless one can find some additional code (though beware that the serverless offering and managed offering have significant differences) in [Using the AWS SDKs to interact with Amazon OpenSearch Serverless](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/serverless-sdk.html) and [Ingesting data into collections](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/serverless-clients.html). 10 | 11 | This section attempts to provide a gentler and more rounded discussion of writing code for AWS OpenSearch Service. 12 | 13 | ## Clients 14 | 15 | One can perform operational tasks on an OpenSearch cluster using the AWS SDK in various languages but this does not provide an interface for querying the cluster.
16 | - [Java AWS SDK: OpenSearch](https://sdk.amazonaws.com/java/api/latest/software/amazon/awssdk/services/opensearch/package-summary.html) 17 | - [JavaScript AWS SDK: OpenSearch](https://docs.aws.amazon.com/AWSJavaScriptSDK/v3/latest/client/opensearch/) 18 | - [Python AWS Boto3: OpenSearch](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/opensearch.html) 19 | 20 | Instead one needs to use the [OpenSearch language-specific clients](https://opensearch.org/docs/latest/clients/) for this task. Currently these are available for Python, Java, JavaScript, Go, Ruby, PHP, .NET, and Rust. 21 | 22 | -------------------------------------------------------------------------------- /specific-engines/solr/solr-resources-ui.md: -------------------------------------------------------------------------------- 1 | # Discovery and/or UI 2 | 3 | ## Blacklight 4 | - [Blacklight](https://projectblacklight.org/) - Stars: 731 - Updated: 4/2023 - Checked: 5/2023 5 | - A Ruby on Rails open source frontend for querying and discovery of results from Solr. 6 | - [Boston Public Library](https://github.com/boston-library/commonwealth-vlr-engine) 7 | - [Penn State University](https://github.com/psu-libraries/psulib_blacklight) 8 | - [Stanford University](https://github.com/sul-dlss/exhibits) 9 | - [Temple University](https://github.com/tulibraries/funcake-solr) 10 | - [warclight](https://github.com/archivesunleashed/warclight) 11 | 12 | ## Everything Else 13 | - [AJAX Solr](https://github.com/evolvingweb/ajax-solr) - Stars: 654 - Updated: 10/2021 - Checked: 4/2023 - JS library for building UIs for Solr. 14 | - [SolrDora](https://github.com/hectorcorrea/solrdora) - Stars: 6 - Updated: 1/2023 - Checked: 5/2023* 15 | - Provides a UI for browsing a Solr collection. 16 | - [YASA (Yet Another Solr Admin)](https://github.com/yasa-org/yasa) - Stars: 47 - Updated: 4/2023 - Checked: 5/2023 17 | - A web-based UI for administering Solr written with Vue in TypeScript.
18 | - [RecordManager](https://github.com/NatLibFi/RecordManager) - Stars: 44 - Updated: 5/2023 - Checked: 5/2023 19 | - [Goobi](https://goobi.io/) 20 | - "an open source software suite for the control and presentation of digitization projects." 21 | - [DS-Discover](https://github.com/kb-dk/ds-discover) - Stars: 0 - Updated: 4/2023 - Checked: 5/2023 22 | - "Gateway for Solr text search, image similarity, sound location and other discovery technologies....Developed and maintained by the Royal Danish Library." 23 | - [Search Management UI](https://github.com/querqy/smui) - Stars: 49 - Updated: 5/2023 - Checked: 5/2023 24 | - [MOPSY Search](https://github.com/Der-Henning/mopsy-react) - Stars: 0 - Updated: 1/2022 - Checked: 5/2023 25 | - Hasn't been updated in a while, but has a live demo. 26 | - [solrkit](https://github.com/garysieling/solrkit) - Stars: 11 - Updated: 4/2018 - Checked: 5/2023 27 | 28 | ## Learning 29 | - [Multi-Select Facet Example](https://github.com/stevenferrer/multi-select-facet) - Stars: 32 - Updated: 3/2023 - Checked: 5/2023 30 | - "An example of multi-select facet using Solr, Vue and Go." 
31 | - [Solr-JavaScript-Search-Client](https://github.com/BLE-LTER/Solr-JavaScript-Search-Client) - Stars: 7 - Updated: 11/2022 - Checked: 5/2023 -------------------------------------------------------------------------------- /specific-engines/solr/solr-development.md: -------------------------------------------------------------------------------- 1 | # Solr Development 2 | 3 | ## Start With 4 | - Documentation: 5 | - dev-docs/README.adoc 6 | - dev-docs/solr-source-code.adoc 7 | - dev-docs/how-to-contribute.adoc 8 | - dev-docs/git.adoc 9 | - dev-docs/FAQ.adoc 10 | - dev-docs/running-in-docker.adoc 11 | 12 | 13 | ## More Advanced 14 | - Documentation: 15 | - dev-docs/lucene-upgrade.md 16 | - dev-docs/working-between-major-versions.adoc 17 | - dev-docs/cloud-script.adoc 18 | - dev-docs/dependency-upgrades.adoc 19 | - dev-docs/overseer/overseer.adoc 20 | - dev-docs/shard-split/shard-split.adoc 21 | 22 | ## Core Dependencies 23 | - [Java 11 Java Development Kit (JDK)](https://adoptium.net/) 24 | - Documentation: dev-docs/solr-source-code.adoc 25 | 26 | ## Bundled Dependencies 27 | - General Documentation - help/dependencies.txt 28 | - [Antora](https://antora.org/) - static site generator that generates Solr Ref Guide HTML. 29 | - Documentation: dev-docs/ref-guide/antora.adoc 30 | - [AsciiDoc](https://asciidoc.org/) - markup language used by the Solr project. 31 | - Documentation: dev-docs/ref-guide/asciidoc-syntax.adoc 32 | - [Gradle](https://gradle.org/) - build system. 
33 | - Documentation: dev-docs/solr-source-code.adoc 34 | - Documentation: dev-docs/jvms.adoc 35 | 36 | ## Development 37 | - Plugins, Modules, and Packages Overview: dev-docs/plugins-modules-packages.adoc 38 | - IDEs: dev-docs/IDEs.adoc 39 | - Dev Tools: 40 | - dev-tools/README.txt 41 | - dev-tools/scripts/README.md 42 | - Benchmarking: solr/benchmark/README.md 43 | - Docker: 44 | - solr/docker/README.md 45 | - solr/docker/gradle-help.txt 46 | - Examples: 47 | - solr/examples/README.md 48 | - solr/examples/films/README.md 49 | - solr/examples/films/vectors/README.md 50 | - Modules: Each module under solr/modules contains a README.md file. 51 | - Server: solr/server/README.md 52 | - Solr Reference Guide: solr/solr-ref-guide/README.adoc 53 | 54 | ## Discussions 55 | - UI: 56 | - [Shifting Execution Strategy to a New UI Plugin](https://lists.apache.org/thread/f3r6ymgpggrv38hyozmf2n9cgox5ck7k) 57 | - CLI: 58 | - [Improving the Solr CLI](https://lists.apache.org/thread/39fglyc5rwwsnso9bldhowxcr80jddwg) 59 | - SolrCell (Tika): 60 | - [Future of SolrCell in Solr](https://lists.apache.org/thread/lbm6wb88gd1cfktgs6sfvw5xf73o8trd) 61 | - Documentation: 62 | - [Solr documentation questions](https://lists.apache.org/thread/wlbvg71b6f608ddpho9jbxtl0vf04jds) -------------------------------------------------------------------------------- /specific-engines/apache-lucene.md: -------------------------------------------------------------------------------- 1 | # Apache Lucene 2 | 3 | ## Introduction 4 | 5 | Lucene is the best known open source search engine library. It forms the core of popular software like Elasticsearch, OpenSearch, Apache Solr, and Neo4j. 6 | 7 | ## Implementations 8 | - [Lucene.NET](https://lucenenet.apache.org/index.html) 9 | - [Examine](https://github.com/Shazwazza/Examine) 10 | - A search engine implementation on top of Lucene.NET, i.e. somewhat similar to Solr.
11 | - [PyLucene](https://lucene.apache.org/pylucene/) 12 | - A wrapper around the Java library, not an actual port. 13 | - [lupyne](https://github.com/coady/lupyne) 14 | - A search engine built using PyLucene. 15 | 16 | ## Articles 17 | - [Lucene: The Good Parts](https://www.parse.ly/lucene/) 18 | - An opinionated and interesting article that starts high-level and then dives into some technical details. 19 | - [IONOS Apache Lucene Tutorial](https://www.ionos.com/digitalguide/server/configuration/apache-lucene/) 20 | - Beginner tutorial. 21 | - [Baeldung Introduction to Apache Lucene](https://www.baeldung.com/lucene) 22 | - Another beginner tutorial. 23 | - [Data Warrior's Building a search engine (Lucene tutorial)](https://datawarrior.medium.com/building-a-search-engine-lucene-tutorial-a515e3bfb44b) 24 | - [Han Bo Sun's Lucene Full-Text Search - A Very Basic Tutorial](https://www.codeproject.com/Articles/5246976/Lucene-Full-Text-Search-A-Very-Basic-Tutorial) 25 | - [LuceneTutorial.com](https://lucenetutorial.com/) 26 | - [Ishan Upamanyu's Apache Lucene Tutorial](https://ishanupamanyu.com/blog/apache-lucene-tutorial/) 27 | - [A Simple Tutorial of Lucene's Indexing and Search Systems](https://github.com/jiepujiang/LuceneTutorial) 28 | 29 | ## Code 30 | - [shaikhu/lucene-in-action](https://github.com/shaikhu/lucene_in_action) 31 | - Provides updated code examples for the book Lucene in Action. 32 | - [Michael Froh's Lucene University](https://github.com/msfroh/lucene-university) 33 | 34 | ## Tooling 35 | - [clue](https://github.com/javasoze/clue) - CLI for interacting with Lucene. 36 | - [Luqum](https://github.com/jurismarches/luqum) - ""luqum" (as in LUcene QUery Manipolator) is a tool to parse queries written in the Lucene Query DSL and build an abstract syntax tree to inspect, analyze or otherwise manipulate search queries." 37 | 38 | ## Other 39 | - [Yelp's nrtsearch](https://github.com/Yelp/nrtsearch) - gRPC server on top of Lucene.
40 | - [Blacklab](https://github.com/INL/BlackLab) - "It allows fast, complex searches with accurate hit highlighting on large, tagged and annotated, bodies of text." -------------------------------------------------------------------------------- /specific-engines/solr/solr-resources-interesting-old.md: -------------------------------------------------------------------------------- 1 | # Interesting But Old 2 | - [Elsevier Labs' Solr Dictionary Annotator](https://github.com/elsevierlabs-os/soda) - 2/2020. 3 | - [OpenSextant's Solr Text Tagger](https://github.com/OpenSextant/SolrTextTagger) - 7/2020. 4 | - [O'Reilly Media's Solr Plugin](https://github.com/oreillymedia/ifpress-solr-plugin) - 8/2020. 5 | - [Vector Scoring Plugin for Solr](https://github.com/saaay71/solr-vector-scoring) - 9/2019. 6 | - [Solr Recommender](https://github.com/pferrel/solr-recommender) - 6/2016. 7 | - [solr-movielens-recommender](https://github.com/o19s/solr-movielens-recommender) - 10/2016. 8 | - [solr-resource-recommender](https://github.com/lacic/solr-resource-recommender) - 12/2014. 9 | - [Selective Search](https://github.com/rajanim/selective-search) - 9/2019. 10 | - [solr-mlt](https://github.com/dfdeshom/custom-mlt) - 10/2019. 11 | - [UPenn's solrplugins](https://github.com/upenn-libraries/solrplugins) - 7/2017. 12 | - [solrgraph](https://github.com/kwatters/solrgraph) - 8/2021. 13 | - [carrot2's solr-integration-strategies](https://github.com/carrot2/solr-integration-strategies) - 1/2014. 14 | - [solr-fusion](https://github.com/outermedia/solr-fusion) - 3/2017. 15 | - [solr-quantities-detection-qparsers](https://github.com/SeaseLtd/solr-quantities-detection-qparsers) - 10/2019. 16 | - [kafka-solr-sink-connector](https://github.com/bkatwal/kafka-solr-sink-connector) - 11/2020. 17 | - [scrapy-solr](https://github.com/scalingexcellence/scrapy-solr) - 4/2016. 18 | - [SKOS Support for Apache Lucene and Solr](https://github.com/behas/lucene-skos) - 2/2016. 
19 | - [Solr Mongo Importer](https://github.com/james75/SolrMongoImporter) - 11/2018. 20 | - [ftw.crawler](https://github.com/4teamwork/ftw.crawler) - 11/2017. 21 | - [solr-sql](https://github.com/bluejoe2008/solr-sql) - 4/2020. 22 | - [rdf-graph-search-with-solr-custom-streaming-expression](https://github.com/spoddutur/rdf-graph-search-with-solr-custom-streaming-expression) - 2/2018. 23 | - [nutch-solr-integration](https://github.com/basraven/nutch-solr-integration) - 10/2018. 24 | - [solrj-nested-docs](https://github.com/lucidworks/solrj-nested-docs) - 7/2014. 25 | 26 | ## UI 27 | - [Solrstrap](https://github.com/fergiemcdowall/solrstrap) - Stars: 86 - Updated: 4/2017 - Checked: 5/2023 28 | - [ngSolr](https://github.com/elmarquez/ngSolr) - Stars: 7 - Updated: 4/2016 - Checked: 5/2023 29 | - [HN-Search](https://github.com/agustingrigoriu/HN-Search) - Stars: 0 - Updated: 12/2019 - Checked: 5/2023 30 | - [Banana](https://github.com/lucidworks/banana/tree/release) - Stars: 672 - Updated: 6/2020 - Checked: 5/2023 -------------------------------------------------------------------------------- /specific-engines/solr/basic-solrcloud-tutorial.md: -------------------------------------------------------------------------------- 1 | # SolrCloud Tutorial 2 | 3 | - "SolrCloud is designed to provide a highly available, fault tolerant environment for distributing your indexed content and query requests across multiple servers." 4 | - "It’s a system in which data is organized into multiple pieces, or shards, that can be hosted on multiple machines, with replicas providing redundancy for both scalability and fault tolerance, and a ZooKeeper server that helps manage the overall structure so that both indexing and search requests can be routed properly. 5 | " 6 | 7 | ## Interactive Startup 8 | - `bin/solr -e cloud` 9 | - How many nodes? 2 10 | - node1 port? 8983 11 | - node2 port? 7574 12 | - Solr starts the nodes and displays the command it used to start each node. 
13 | - The first node starts a ZooKeeper server on port 9983. 14 | - name of collection? gettingstarted 15 | - number of shards? 2 16 | - replicas per shard? 2 17 | - configuration? _default 18 | 19 | ## Check Status 20 | - `bin/solr status` 21 | 22 | ## Log Files 23 | - You can find log files in `example/cloud/node1/logs` and `example/cloud/node2/logs`. 24 | 25 | ## See Collection Deployed Across Cluster 26 | - http://localhost:8983/solr/#/~cloud 27 | 28 | ## Basic Diagnostics 29 | - `bin/solr healthcheck -c gettingstarted` 30 | - "The healthcheck command gathers basic information about each replica in a collection, such as number of docs, current status (active, down, etc.), and address (where the replica lives in the cluster)." 31 | 32 | ## Stopping SolrCloud 33 | - `bin/solr stop -all` 34 | 35 | ## Starting with -noprompt 36 | - Use the defaults instead of interactive 37 | - `bin/solr -e cloud -noprompt` 38 | 39 | ## Restarting Nodes 40 | - `bin/solr restart -c -p 8983 -s example/cloud/node1/solr` 41 | - `bin/solr restart -c -p 7574 -z localhost:9983 -s example/cloud/node2/solr` 42 | - Note that in the first case we don't need to specify the ZooKeeper address because it is the same as the node but for the second case we need to tell the command where ZooKeeper is located. 
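After restarting nodes it can be handy to confirm that every replica came back. The following is a minimal sketch, assuming the JSON layout returned by the Collections API `CLUSTERSTATUS` action (`/solr/admin/collections?action=CLUSTERSTATUS`). The function names and the sample payload are made up for illustration, and the payload is trimmed to the fields the code reads.

```python
def replica_states(cluster_status, collection):
    """Return {(shard, replica): state} for one collection.

    `cluster_status` is the parsed JSON body of a CLUSTERSTATUS response.
    """
    shards = cluster_status["cluster"]["collections"][collection]["shards"]
    states = {}
    for shard_name, shard in shards.items():
        for replica_name, replica in shard["replicas"].items():
            states[(shard_name, replica_name)] = replica["state"]
    return states


def down_replicas(cluster_status, collection):
    """List the (shard, replica) pairs whose state is not "active"."""
    states = replica_states(cluster_status, collection)
    return sorted(key for key, state in states.items() if state != "active")


# Hypothetical, hand-written sample (not captured from a real cluster).
SAMPLE = {
    "cluster": {
        "collections": {
            "gettingstarted": {
                "shards": {
                    "shard1": {
                        "replicas": {
                            "core_node1": {"state": "active"},
                            "core_node2": {"state": "active"},
                        }
                    },
                    "shard2": {
                        "replicas": {
                            "core_node3": {"state": "active"},
                            "core_node4": {"state": "down"},
                        }
                    },
                }
            }
        }
    }
}

if __name__ == "__main__":
    print(down_replicas(SAMPLE, "gettingstarted"))
```

In practice the dictionary would come from fetching the URL above with any HTTP client and parsing the JSON; the sketch only covers the parsing step.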
43 | 44 | ## Adding a Node to a Cluster 45 | - `mkdir solrHomeForNewNode` 46 | - `cp AnExistingSolrHomePath solrHomeForNewNodePath` 47 | - `bin/solr start -cloud -s solrHomeForNewNode/solr -p portNumber -z zooKeeperHostAndPort` 48 | - Example: 49 | - `mkdir -p example/cloud/node3/solr` 50 | - `cp example/cloud/node1/solr/solr.xml example/cloud/node3/solr/solr.xml` 51 | - `bin/solr start -cloud -s example/cloud/node3/solr -p 8987 -z localhost:9983` 52 | 53 | ## Bibliography / Resources 54 | - https://solr.apache.org/guide/solr/latest/getting-started/tutorial-solrcloud.html -------------------------------------------------------------------------------- /vector-search/vector-basics.md: -------------------------------------------------------------------------------- 1 | # Vector Search 2 | - Doug Turnbull. [Vector Search for the Uninitiated](https://softwaredoug.com/blog/2023/02/13/why-vector-search). 2/2023. 3 | - Provides a brief overview of how traditional search works and how vector search differs as well as the relative strengths and weaknesses of each. 4 | - Greg Kogan. [Introduction to Vector Search for Developers](https://www.pinecone.io/learn/vector-search-basics/). 5 | - A high-level overview of vector search with a slight emphasis on Pinecone's product. Touches on traditional search, vector embeddings, semantic similarity, etc. 6 | - Doug Turnbull. [Vector Search: The Hard Way](https://softwaredoug.com/blog/2023/09/05/vector-search-the-hard-way). 9/2023. 7 | - 75 slides on the challenges of Vector Search. 8 | - Dmitry Kan. [Keynote: Where Vector Search is taking us](https://haystackconf.com/eu2022/2022/09/27/keynote.html). Haystack Conference, 9/2022. 9 | - Slide deck and video presentation on the state of Vector Search and its future. 10 | - Ethan Steininger. [Vector Search](https://github.com/esteininger/vector-search). 6/2023. 11 | - A GitHub repo with a collection of articles and links relating to Vector Search. 12 | - Panda Smith.
[Build a search engine, not a vector DB](https://blog.elicit.com/search-vs-vector-db/). Elicit, 12/2023. 13 | - Guidance on using a solid search engine as the foundation for vector search. 14 | 15 | # Vector Search Challenges 16 | - James Briggs. [The Missing WHERE Clause in Vector Search](https://www.pinecone.io/learn/vector-search-filtering/). Pinecone. 17 | - Discusses the difficult challenge of filtering results in vector search, and explains the pre/post-filtering techniques and Pinecone's single-stage filtering. 18 | - Gibbs Cullen has an additional post on Pinecone's implementation titled, [Introducing the hybrid index to enable keyword-aware semantic search](https://www.pinecone.io/blog/hybrid-search/). Pinecone, 10/2022. 19 | 20 | # Vector Embeddings 21 | - Roie Schwaber-Cohen. [Vector Embeddings for Developers: The Basics](https://www.pinecone.io/learn/vector-embeddings-for-developers/). Pinecone. 22 | - Solid article for beginners looking for a high-level overview. Touches on vectors, vector embeddings, embedding models, word2vec, and semantic similarity. 23 | 24 | # Neo4j Vector Search 25 | - [Documentation: Vector Search Indexes](https://neo4j.com/docs/cypher-manual/current/indexes/semantic-indexes/vector-indexes/) 26 | - Utilizes Apache Lucene, which uses HNSW graphs and k-ANN for querying.
27 | 28 | # Vector Companies / Databases 29 | - [Vector DB Comparison](https://superlinked.com/vector-db-comparison/) 30 | - [Milvus](https://milvus.io/) 31 | - [pgvector](https://github.com/pgvector/pgvector) 32 | - [Pinecone](https://www.pinecone.io/) 33 | - [Qdrant](https://qdrant.tech/) 34 | - [Vespa](https://vespa.ai/) 35 | - [Weaviate](https://weaviate.io/) -------------------------------------------------------------------------------- /specific-engines/solr/solr-resources-app-framework-integrations.md: -------------------------------------------------------------------------------- 1 | ### Application/Framework Integrations 2 | - [Dokku Solr](https://github.com/dokku/dokku-solr) - Stars: 13 - Updated: 5/2023 - Checked: 5/2023 3 | - [Drupal Search API Solr](https://github.com/mkalkbrenner/search_api_solr) - Stars: 6 - Updated: 5/2023 - Checked: 5/2023 4 | - [eZ Platform](https://github.com/ezsystems/ezplatform-solr-search-engine) - Stars: 45 - Updated: 5/2023 - Checked: 5/2023 5 | - [Feathersjs Solr Client (feathers-solr)](https://github.com/sajov/feathers-solr) - Stars: 29 - Updated: 3/2023 - Checked: 5/2023 6 | - [Flask Backend Solr Service](https://github.com/NeuroBridge/Backend_service) - Stars: 0 - Updated: 5/2023 - Checked: 5/2023 7 | - [@florajs/datasource-solr](https://github.com/florajs/datasource-solr) - Stars: 6 - Updated: 5/2023 - Checked: 5/2023 8 | - [Ibexa DXP Solr Integration](https://github.com/ibexa/solr) - Stars: 2 - Updated: 5/2023 - Checked: 5/2023 9 | - [Kafka Connect Solr](https://github.com/jcustenborder/kafka-connect-solr) - Stars: 37 - Updated: 9/2021 - Checked: 5/2023 10 | - [Solr Lando Plugin](https://github.com/lando/solr) - Stars: 0 - Updated: 5/2023 - Checked: 5/2023 11 | - [MusicBrainz Solr](https://github.com/metabrainz/mb-solr) - Stars: 3 - Updated: 1/2023 - Checked: 5/2023 12 | - [Omeka-S-module-SearchSolr](https://github.com/Daniel-KM/Omeka-S-module-SearchSolr) - Stars: 2 - Updated: 5/2023 - Checked: 5/2023 13 | -
[collective.solr for Plone CMS](https://github.com/collective/collective.solr) - Stars: 19 - Updated: 5/2023 - Checked: 5/2023 14 | - [Solr Search for NodeBB](https://github.com/julianlam/nodebb-plugin-solr) - Stars: 22 - Updated: 1/2023 - Checked: 5/2023 15 | - [Spring Data Solr](https://github.com/spring-projects/spring-data-solr) - Stars: 386 - Updated: 5/2023 - Checked: 5/2023 16 | - [Modern SilverStripe Solr Search](https://github.com/Firesphere/silverstripe-solr-search) - Stars: 10 - Updated: 5/2023 - Checked: 5/2023 17 | - [TYPO3-Find](https://github.com/subugoe/typo3-find) - Stars: 17 - Updated: 9/2022 - Checked: 5/2023 18 | - For providing a UI using TYPO3 to a given Solr instance. 19 | - [TYPO3-Solr](https://github.com/TYPO3-Solr/ext-solr) - Stars: 126 - Updated: 5/2023 - Checked: 5/2023 20 | - For searching the contents of a TYPO3 CMS instance. 21 | - [Solr Engine for Laravel Scout](https://github.com/pxslip/laravel-scout-solr) - Stars: 17 - Updated: 5/2023 - Checked: 5/2023 22 | - [Apache Solr Dialect for SQLAlchemy and Apache Superset](https://github.com/aadel/sqlalchemy-solr) - Stars: 8 - Updated: 8/2022 - Checked: 5/2023 23 | - [Sitecore SmartSolrSchema](https://github.com/dataweaversio/SmartSolrSchema) - Stars: 17 - Updated: 12/2022 - Checked: 5/2023 24 | - Populates "not only standard Sitecore dynamic fields but also reads any custom languages that are set up" 25 | - [solr-power (for WordPress)](https://github.com/pantheon-systems/solr-power) - Stars: 122 - Updated: 4/2023 - Checked: 5/2023 -------------------------------------------------------------------------------- /specific-engines/elasticsearch.md: -------------------------------------------------------------------------------- 1 | # Extending Elasticsearch 2 | - [elastiknn](https://github.com/alexklibisz/elastiknn) - Stars: 324 - Updated: 5/2023 - Checked: 5/2023 3 | - Nearest Neighbor plugin.
4 | - [elasticsearch-sql](https://github.com/iamazy/elasticsearch-sql) - Stars: 321 - Updated: 11/2022 - Checked: 5/2023 5 | - [elasticsearch-carrot2](https://github.com/carrot2/elasticsearch-carrot2) - Stars: 289 - Updated: 1/2023 - Checked: 11/2023 6 | - On-the-fly clustering capabilities. 7 | - [search-extra](https://github.com/wikimedia/search-extra) - Stars: 48 - Updated: 11/2023 - Checked: 11/2023 8 | - A number of queries, filters, etc. from MediaWiki. 9 | - [search-highlighter](https://github.com/wikimedia/search-highlighter) 10 | - Stars: 96 - Updated: 11/2022 - Checked: 11/2023 11 | - MediaWiki highlighter meant for easy experimentation with weights and groupings for hits. 12 | - [zentity](https://github.com/zentity-io/zentity) - Stars: 142 - Updated: 2/2022 - Checked: 11/2023 13 | - A simple, fast, generic, transitive, multi-source, accommodating, logical entity resolution plugin. 14 | 15 | # Deployment 16 | - [elasticsearch-cloud-deploy](https://github.com/BigDataBoutique/elasticsearch-cloud-deploy) - Stars: 329 - Updated: 11/2022 - Checked: 5/2023 17 | 18 | # Development Integrations 19 | - [Nest.js Elasticsearch](https://github.com/nestjs/elasticsearch) - Stars: 331 - Updated: 5/2023 - Checked: 5/2023 20 | 21 | # Learn 22 | - [Elasticsearch Cheatsheet for developers](https://github.com/jolicode/elasticsearch-cheatsheet) - Stars: 336 - Updated: 5/2022 - Checked: 5/2023 23 | - [The Elasticsearch Handbook](https://elasticsearchbook.com/) - eBook, $29.
24 | 25 | 26 | # Applications Using 27 | - [DataCap](https://github.com/EdurtIO/datacap) 28 | - [DataShare](https://github.com/ICIJ/datashare) 29 | - [Diskover Community Edition](https://github.com/lacic/solr-resource-recommender) 30 | - [Magda](https://github.com/magda-io/magda) 31 | - [Tigris](https://github.com/tigrisdata/tigris) 32 | - [Zenodo](https://github.com/zenodo/zenodo) 33 | 34 | # Resources 35 | - [awesome-elasticsearch](https://github.com/dzharii/awesome-elasticsearch) - Stars: 4.7k - Updated: 9/2022 - Checked: 10/2023 36 | - [codingexplained/complete-guide-to-elasticsearch](https://github.com/codingexplained/complete-guide-to-elasticsearch) - Stars: 1.6k - Updated: 1/2024 - Checked: 5/2024 37 | - "Contains all of the queries used within the Complete Guide to Elasticsearch course." 38 | 39 | ## Books 40 | - Anurag Srivastava. Elasticsearch 8 for Developers: A beginner's guide to indexing, analyzing, searching, and aggregating data, 2nd edition. BPB Publications, 10/2023. 392 pp. 41 | - Rafał Kuć, Marek Rogoziński. Elasticsearch Server, 3rd edition. Packt Publishing, 2/2016. 556 pp. 42 | - Steve Hoberman, Rafid Reaz. Elasticsearch Data Modeling and Schema Design. Technics Publications, 8/2023. 196 pp.
43 | 44 | # Companies Working with Elasticsearch 45 | - [Sematext](https://sematext.com/) 46 | 47 | # Terminology 48 | - Data Streams 49 | - Index Aliases 50 | - Index Rollups 51 | - Index State Management 52 | - Index Templates 53 | - Index Transforms 54 | - Ingest Pipeline 55 | - Ingest Processors 56 | - Processors 57 | - Search Analyzer 58 | - Tasks -------------------------------------------------------------------------------- /BuildingSearchEngines.md: -------------------------------------------------------------------------------- 1 | # Building Search Engines 2 | 3 | ## Table of Contents 4 | - [Open Source Search Engines](OpenSourceSearchEngines.md) 5 | - [Open Source Web Crawlers](WebCrawlers.md) 6 | - Building Search Engines General Resources 7 | - [Common Crawl](CommonCrawl.md) 8 | - Wikimedia Search 9 | - Sites Covering Search Related Topics 10 | - Tutorials 11 | - [Books on Search and Information Retrieval](/research/books-research.md) 12 | - [Research on Search and Information Retrieval](/research/research-main.md) 13 | - [Vector Search](/vector-search/vector-basics.md) 14 | 15 | 16 | ## General Resources 17 | - [The Open Guide to Search Engineering](https://github.com/open-guides/og-search-engineering) 18 | - [Web Archiving Introduction](/web-archiving/archiving-introduction.md) 19 | 20 | ## Wikimedia Search 21 | - https://www.mediawiki.org/wiki/Wikimedia_Search_Platform 22 | - https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes 23 | 24 | ## Sites Covering Search Related Topics 25 | A number of the sites listed below are commercial search providers. While there may be useful resources throughout each site, the blog is often a good place to start. 26 | - [Algolia](https://algolia.com/) - A popular SaaS search engine. 27 | - [Bonsai](https://bonsai.io/) - Managed Elasticsearch, OpenSearch, and SolrCloud platform.
28 | 29 | ## Community 30 | - [Relevance & Matching Tech](https://www.opensourceconnections.com/slack) - A long-lived and popular Slack community for the IR/search community run by [OpenSource Connections](https://www.opensourceconnections.com/). 31 | - [IR-Relevant](https://ir-relevant.net/) - A new (2023) community forum for those interested in Information Retrieval. Created and run by [Sease](https://sease.io/), a well-known information retrieval and search provider. 32 | 33 | ## Tutorials 34 | 35 | ### For Beginners 36 | - [Build Your Own Search Engine and Web Crawler in 5 Minutes with Node.js, MySQL, and Elasticsearch](https://coderdose.com/build-your-own-search-engine-and-web-crawler-in-5-minutes-with-node-js-mysql-and-elasticsearch/). Coderdose, 3/2023. 37 | - [Web Search Engine: Design and implementation of a Web Search Engine Using Text Mining Techniques](https://www.codeproject.com/Articles/5319612/Web-Search-Engine). Code Project, 12/2021-3/2023. 38 | - Uses Python and a SQL back-end. 39 | 40 | ## ACM Conferences Related to Search 41 | - [WSDM: Web Search and Data Mining](https://dl.acm.org/conference/wsdm) 42 | - [IR: Research and Development in Information Retrieval](https://dl.acm.org/conference/ir) 43 | - [KDD: Knowledge Discovery and Data Mining](https://dl.acm.org/conference/kdd) 44 | - [CHI: Conference on Human Factors in Computing Systems](https://dl.acm.org/conference/chi) - Not focused on search but covers a lot of topics that are/should be of interest to those working with search. 45 | - [CIKM: Conference on Information and Knowledge Management](https://dl.acm.org/conference/cikm) - Largely focused on AI/ML.
46 | - [RECSYS: ACM Conference on Recommender Systems](https://dl.acm.org/conference/recsys) 47 | - [IDEAS: International Database Engineering & Applications Symposium](https://dl.acm.org/conference/ideas) 48 | - [WWW: International World Wide Web Conference](https://dl.acm.org/conference/www) 49 | - [MOD: International Conference on Management of Data](https://dl.acm.org/conference/mod) 50 | -------------------------------------------------------------------------------- /specific-engines/yacy.md: -------------------------------------------------------------------------------- 1 | # YaCy 2 | - Official Site: https://yacy.net/ 3 | - [GitHub Repo](https://github.com/yacy/yacy_search_server) 4 | - [Forums](https://community.searchlab.eu/) 5 | - [Subreddit](https://www.reddit.com/r/YaCy/) 6 | - [HackerNews](https://news.ycombinator.com/item?id=32597309) 7 | - [Wikipedia](https://en.wikipedia.org/wiki/YaCy) 8 | - An open source, distributed, P2P search engine built in Java with a focus on user privacy and decentralization. It's been around for a long time and continues to be actively developed. Includes a crawler. 9 | 10 | ## Other Projects 11 | - [Susper](https://github.com/fossasia/susper.com) - Stars: 1.7k - Updated: 3/2022 - Checked: 3/2023 - Built on top of YaCy and Apache Solr. 12 | 13 | ## Implementations 14 | - [Susper](https://susper.com) 15 | - [Land](https://www.land.nrw) 16 | 17 | ## YaCy Grid 18 | - YaCy is a P2P search engine while YaCy Grid is a distributed search engine but not P2P. Read more at: 19 | - Michael Christen. [The Story of YaCy Grid](https://community.searchlab.eu/t/the-story-of-yacy-grid/48). 6/2019. 20 | - Covers the origins of YaCy Grid and its basic architecture. 21 | 22 | ## YaCy Searchlab 23 | - https://searchlab.eu/ 24 | - [GitHub Repo](https://github.com/yacy/searchlab) 25 | - Provides a UI on top of YaCy Grid. 26 | - Michael Christen. [The Searchlab Project](https://community.searchlab.eu/t/the-searchlab-project/867).
10/2021-10/2022. 27 | - Covers the launch of Searchlab, an implementation of YaCy Grid with corresponding open source projects. Note that you should read through the thread as the initial post has not been updated. 28 | 29 | ## Articles 30 | - Keyhan Vakil. [Personal internet search engine with YaCy](https://www.kvakil.me/posts/2022-07-03-yacy-private-search-engine.html). 7/2022. 31 | - Covers using YaCy as a personal, private search engine. Not web-scale, but a good introduction to YaCy. 32 | - Arunmozhi. [Personal Bookmarking using YACY & yacy-it](https://arunmozhi.in/2022/06/27/personal-bookmarking-using-yacy-yacy-it/). 6/2022. 33 | - A good followup article on Vakil's. Arunmozhi created a Firefox extension to reduce the friction of adding sites to one's personal search engine. 34 | - Richard Osgood. [YaCy Personal Search Engine](https://www.richardosgood.com/posts/yacy-personal-search-engine/). 2/2023. 35 | - A followup on Vakil's article with some additional good ideas. 36 | - LinuxReviews 37 | - LinuxReviews has a fairly negative opinion of YaCy overall but sees a glimmer of opportunity for the engine. Unfortunately, the articles have not been updated since 2019/2021 respectively leaving some gap between what is described and what may now be YaCy's status. That said there is a decent amount of helpful technical info. to be found in the articles. 38 | - [YaCy](https://linuxreviews.org/YaCy). 9/2019. 39 | - [The YaCy Search Server Is Sort-Of Being Actively Developed Again After Half a Decade Of Inactivity](https://linuxreviews.org/The_YaCy_Search_Server_Is_Sort-Of_Being_Actively_Developed_Again_After_Half_A_Decade_Of_Inactivity). 4/2021. 40 | - Michael Herrmann, Kai-Chun Ning, Claudia Díaz, B. Praneel. [Description of the YaCy Distributed Web Search Engine](https://www.semanticscholar.org/paper/Description-of-the-YaCy-Distributed-Web-Search-Herrmann-Ning/8d0c816ab14ca3748a1887d7f2ef088d630f831d). 2014. 
41 | - An academic article providing a technical description of YaCy. 42 | - Jeremy Rand. [Relevance and Privacy Improvements to the YaCy Decentralized Web Search Engine](https://shareok.org/handle/11244/299892). University of Oklahoma, 2018. -------------------------------------------------------------------------------- /specific-engines/elasticsearch/elasticsearch-clients.md: -------------------------------------------------------------------------------- 1 | # Ruby 2 | - [elasticsearch-ruby](https://github.com/elastic/elasticsearch-ruby) - Stars: 2k - Updated: 4/2024 - Checked: 5/2024 3 | - [elasticsearch-rails](https://github.com/elastic/elasticsearch-rails) - Stars: 3.1k - Updated: 4/2024 - Checked: 5/2024. 4 | - [chewy](https://github.com/toptal/chewy) - Stars: 1.9k - Updated: 5/2024 - Checked: 5/2024. 5 | - "High-level Elasticsearch Ruby framework based on the official elasticsearchruby client." 6 | 7 | # Go 8 | - [go-elasticsearch](https://github.com/elastic/go-elasticsearch) - Stars: 5.5k - Updated: 4/2024 - Checked: 5/2024 9 | - Official. 10 | - [elasticsql](https://github.com/cch123/elasticsql) - Stars: 1.1k - Updated: 8/2023 - Checked: 5/2024 11 | - "Convert sql to elasticsearch DSL" 12 | - [vulcanizer](https://github.com/github/vulcanizer) - Stars: 663 - Updated: 4/2024 - Checked: 5/2024 13 | - By GitHub 14 | - "...a golang library for interacting with an Elasticsearch cluster...to provide a high level API to help with common tasks...operating an Elasticsearch cluster such as querying health status..., migrating data off of nodes, updating cluster settings, etc." 15 | 16 | # Rust 17 | - [elasticsearch-rs](https://github.com/elastic/elasticsearch-rs) - Stars: 686 - Updated: 12/2023 - Checked: 5/2024. 18 | - Official. 19 | 20 | # .NET 21 | - [elasticsearch-net](https://github.com/elastic/elasticsearch-net) - Stars: 3.5k - Updated: 4/2024 - Checked: 5/2024. 22 | - Official. 
23 | - [ElasticLINQ](https://github.com/ElasticLINQ/ElasticLINQ) - Stars: 384 - Updated: 2023 - Checked: 5/2024. 24 | 25 | # PHP 26 | - [elastic-php](https://github.com/elastic/elasticsearch-php) - Stars: 5.2k - Updated: 3/2024 - Checked: 5/2024 27 | - Official. 28 | - [Elastica](https://github.com/ruflin/Elastica) - Stars: 2.3k - Updated: 5/2024 - Checked: 5/2024. 29 | - Updated by CirrusSearch which is used by MediaWiki/Wikipedia. 30 | 31 | ## Laravel 32 | - [laravel-scout-elasticsearch](https://github.com/matchish/laravel-scout-elasticsearch) - Stars: 684 - Updated: 2/2024 - Checked: 5/2024 33 | 34 | ## Symfony 35 | - [FOSElasticaBundle](https://github.com/FriendsOfSymfony/FOSElasticaBundle) - Stars: 1.2k - Updated: 4/2024 - Checked: 5/2024 36 | - Uses Elastica to integrate Elasticsearch with Symfony. 37 | 38 | ## Yii 39 | - [yii2-elasticsearch](https://github.com/yiisoft/yii2-elasticsearch) - Stars: 430 - Updated: 9/2023 - Checked: 5/2024 40 | 41 | # Python 42 | - [elasticsearch-py](https://github.com/elastic/elasticsearch-py) - Stars: 4.1k - Updated: 5/2024 - Checked: 5/2024 43 | - Official. 44 | - [elasticsearch-dsl-py](https://github.com/elastic/elasticsearch-dsl-py) - Stars: 3.8k - Updated: 5/2024 - Checked: 5/2024 45 | - "High level Python client for Elasticsearch" 46 | 47 | ## Django 48 | - [django-elasticsearch-dsl-drf](https://github.com/barseghyanartur/django-elasticsearch-dsl-drf) - Stars: 364 - Updated: 2022 - Checked: 5/2024 49 | 50 | # JavaScript 51 | - [elasticsearch-js](https://github.com/elastic/elasticsearch-js) - Stars: 5.2k - Updated: 5/2024 - Checked: 5/2024 52 | - Official. 
53 | - [elastic-builder](https://github.com/sudo-suhas/elastic-builder) - Stars: 503 - Updated: 5/2024 - Checked: 5/2024 54 | - "A Node.js implementation of the elasticsearch Query DSL" 55 | 56 | # Elixir 57 | - [elasticsearch-elixir](https://github.com/danielberkompas/elasticsearch-elixir) - Stars: 416 - Updated: 9/2023 - Checked: 5/2024 58 | 59 | # Haskell 60 | - [bloodhound](https://github.com/bitemyapp/bloodhound) - Stars: 420 - Updated: 2/2024 - Checked: 5/2024. 61 | - "Haskell Elasticsearch client and query DSL" 62 | 63 | # Scala 64 | - [elastic4s](https://github.com/Philippus/elastic4s) - Stars: 1.6k - Updated: 5/2024 - Checked: 5/2024 -------------------------------------------------------------------------------- /web-archiving/archiving-introduction.md: -------------------------------------------------------------------------------- 1 | # Web Archiving Introduction 2 | 3 | ## Introduction 4 | Search engines depend on indexes which are built by crawling and caching the content of the web. While for search engines this caching can be temporary (before the relevant data is extracted into an index) it overlaps significantly with web archiving and using open crawl data often involves working with web archiving formats and tooling. 5 | 6 | In this document we'll be particularly interested in discussing the file formats utilized in modern web archiving. 7 | 8 | ## Origins 9 | The [ARC](https://archive.org/web/researcher/ArcFileFormat.php) format was created by the [Internet Archive](https://archive.org/) for use with its Wayback Machine. Its successor is WARC, released in a "finalized" form in 2009; this format (and subsequent revisions) continues to be the mainstay of web archiving. 10 | 11 | ## WARC Format 12 | - [Official Standard Specifications for WARC 1.1](https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/) 13 | - Karl-Rainer Blumenthal.
[The stack: An introduction to the WARC file](https://ait.blog.archive.org/post/the-stack-warc-file/). Archive.org, 4/2021. 14 | - A great introductory article to WARC including its history, purpose, and implementation. 15 | - [Wikipedia on Web ARChive](https://en.wikipedia.org/wiki/Web_ARChive) 16 | 17 | A WARC file contains WARC records which are composed of eight pieces, six of which are actually utilized currently: 18 | - warcinfo - information about the request, "good provenance information" as Blumenthal puts it. 19 | - request - The HTTP request made by the archiving tool to the website that results in the response received. 20 | - response - The response received from the website (including file contents). 21 | - revisit - A record that has been previously archived and hasn't changed in a subsequent visit. 22 | - resource - May include screenshots, videos of the page. 23 | - conversion - Conversion of older data into a current format (e.g. as an image standard is deprecated this might contain a replacement image in a new format). 24 | - continuation - Allows one to reference another WARC record that contains the remainder of the record. 25 | - metadata - Various metadata depending on the archiving source. 26 | 27 | They can be opened with a text editor (although many WARC files are quite large and may require a special editor with large file support). 28 | 29 | ## WAT Format 30 | Includes data extracted from the WARC format using JSON. Includes metadata, request, and response as well as the links extracted from the page. 31 | 32 | ## WET Format 33 | Includes data extracted from a WARC in plaintext. 34 | 35 | ## Bibliography/Resources 36 | - Archive.org 37 | - Note that the Internet Archive maintains a [general blog](https://blog.archive.org/) but for those interested in more technical aspects of the Archive, see the [Archive-It blog](https://ait.blog.archive.org/), which also covers general Archive-It news along with some technical posts.
38 | - [A New Wayback: Improving Web Archive Replay](https://ait.blog.archive.org/post/archive-it-wayback-release/). 9/2021. 39 | - Karl-Rainer Blumenthal. [The stack: A guide to A/V web archiving with youtube-dl](https://ait.blog.archive.org/post/the-stack-youtube-dl-guide/). 1/2021. 40 | - Karl-Rainer Blumenthal. [The stack: High fidelity web collecting at scale with Brozzler](https://ait.blog.archive.org/post/the-stack-brozzler/). 11/2020. 41 | - Molly Bragg, Kristine Hanna, et al. [Web Archiving Lifecycle Model](https://ait.blog.archive.org/learn-more/publications/web-archiving-life-cycle-model/). 3/2013. 42 | - CommonCrawl.org 43 | - Stephen Merity. [Navigating the WARC file format](https://commoncrawl.org/2014/04/navigating-the-warc-file-format/). 4/2014. 44 | - Brief introduction to the WARC format, but perhaps more importantly (as the info seems less readily available around the web), discusses the WET and WAT formats. -------------------------------------------------------------------------------- /collaborative/README.md: -------------------------------------------------------------------------------- 1 | # Collaborative Search Engines Till Now 2 | 3 | ## Introduction 4 | 5 | I am particularly interested in human-augmented search engines and am [working towards building one](https://github.com/davidshq/next-search). I've decided to collect information regarding collaborative search engines here. 6 | 7 | ## Engines 8 | - [ApexKB / Jumper](https://en.wikipedia.org/wiki/ApexKB) 9 | - Last Release: 11/2010 10 | - [Seeks](https://en.wikipedia.org/wiki/Seeks) 11 | - Last Release: 4/2012 12 | - [Searx](https://en.wikipedia.org/wiki/Searx) - While inspired by Seeks it does not provide the collaborative search aspect.
13 | 14 | ## Wikipedia Articles 15 | - [Collaborative search engine](https://en.wikipedia.org/wiki/Collaborative_search_engine) 16 | - "let users combine their efforts in information retrieval (IR) activities, share information resources collaboratively using knowledge tags, and allow experts to guide less experienced people through their searches. Collaboration partners do so by providing query terms, collective tagging, adding comments or opinions, rating search results, and links clicked of former (successful) IR activities to users having the same or a related information need." 17 | - Implicit Collaboration: Collaborative filtering, recommendation systems. 18 | - I-Spy 19 | - Jumper 2.0 20 | - Seeks 21 | - Community Search Assistance 22 | - Burghardt et al. CSE 23 | - Longo et al. 24 | - Explicit Collaboration 25 | - SearchTogether (2007) 26 | - PlayByPlay 27 | - MUSE 28 | - MUST 29 | - Cerciamo 30 | - Papagelis et al. 31 | - CoSense 32 | - CoSearch 33 | - GroupWeb 34 | - ClassSearch 35 | - [Collaborative filtering](https://en.wikipedia.org/wiki/Collaborative_filtering) 36 | - [Robust collaborative filtering](https://en.wikipedia.org/wiki/Robust_collaborative_filtering) 37 | - Social Bookmarking 38 | - StumbleUpon 39 | - Digg 40 | - [Firefly](https://en.wikipedia.org/wiki/Firefly_(website)) 41 | - [Recommender system](https://en.wikipedia.org/wiki/Recommender_system) 42 | - [Information filtering system](https://en.wikipedia.org/wiki/Information_filtering_system) 43 | - [Collaborative intelligence](https://en.wikipedia.org/wiki/Collaborative_intelligence) 44 | - [Crowdsourcing](https://en.wikipedia.org/wiki/Crowdsourcing) 45 | - Reddit 46 | - [Citizen science](https://en.wikipedia.org/wiki/Citizen_science) 47 | - [Enterprise bookmarking](https://en.wikipedia.org/wiki/Enterprise_bookmarking) 48 | - IBM Dogear (Lotus Connections) 49 | - Cogenz 50 | - Jumper 2.0 51 | - del.icio.us 52 | - [Social bookmarking](https://en.wikipedia.org/wiki/Social_bookmarking) 53 | 
- 1996 - itList 54 | - 1997 - NASA WebTagger 55 | - Backflip 56 | - Blink 57 | - Clip2 58 | - ClickMarks 59 | - HotLinks 60 | - 2003 - Delicious 61 | - Furl 62 | - Simpy 63 | - Spurl.net 64 | - unalog 65 | - CiteULike 66 | - Connotea 67 | - StumbleUpon 68 | - 2006 - Ma.gnolia (Gnolia) 69 | - Blue Dot (Faves) 70 | - Mister Wong 71 | - Diigo 72 | - Connectbeam 73 | - 2007 - IBM Lotus Connections 74 | - 2009 - Pinboard 75 | - Digg 76 | - Reddit 77 | - Newsvine 78 | - [Tag](https://en.wikipedia.org/wiki/Tag_(metadata)) 79 | - [Comparison of enterprise bookmarking platforms](https://en.wikipedia.org/wiki/Comparison_of_enterprise_bookmarking_platforms) 80 | - [Knowledge management](https://en.wikipedia.org/wiki/Knowledge_management) 81 | - [Web directory](https://en.wikipedia.org/wiki/Web_directory) 82 | - [Folksonomy](https://en.wikipedia.org/wiki/Folksonomy) - aka Collaborative tagging 83 | - [List of social bookmarking sites](https://en.wikipedia.org/wiki/List_of_social_bookmarking_websites) 84 | - [List of social software](https://en.wikipedia.org/wiki/List_of_social_software) 85 | - [Elium (Knowledge Plaza)](https://en.wikipedia.org/wiki/Elium) 86 | - [Models of collaborative tagging](https://en.wikipedia.org/wiki/Models_of_collaborative_tagging) -------------------------------------------------------------------------------- /front-end/ui-components-of-search.md: -------------------------------------------------------------------------------- 1 | # Understanding the UI Components of Search 2 | 3 | In this section we look at the various UI components commonly utilized in building a search engine. We will initially base our discussion on the components provided by [Pre-Built Search Component Libraries](./ui-component-libraries-for-search.md). 4 | 5 | ## Core Components 6 | 7 | ### Search Box 8 | This is the main input field where a user enters their queries.
9 | 10 | ### Search Results 11 | This is the area where search results are displayed. It typically includes a list of items with titles, descriptions, and other relevant information. 12 | 13 | #### Result Item (Hit) 14 | This is an individual search result displayed within the search results. It usually includes a title, description, and other relevant information. 15 | 16 | ### Pagination 17 | Pagination allows users to navigate through multiple pages of search results. 18 | 19 | ### Per Page 20 | This is a dropdown menu that allows users to select the number of search results displayed per page. 21 | 22 | ### Facets (Refinements, Filters) 23 | Facets are used to refine search results. They are typically displayed as a list of checkboxes or radio buttons that allow users to filter search results by various criteria such as category, price, date, etc. 24 | 25 | #### Types of Facet 26 | - Radio Buttons 27 | - Checkboxes 28 | - Hierarchical 29 | - Range (slider) 30 | - Rating 31 | - Toggle 32 | 33 | #### Selected Facets (Active Filters) 34 | This is a list of the currently selected facets. It is usually displayed above the search results and allows users to easily remove filters. 35 | 36 | ### Sort By 37 | Generally this is a dropdown menu that allows users to sort search results by various criteria such as relevance, date, popularity, etc. but sometimes it may be radio buttons or similar. 
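To make the relationship between the core components concrete, here is a minimal sketch of how selected facets, a sort order, and pagination might combine on the data side. It is an illustration only and not tied to any of the UI libraries discussed here; the `Hit` fields, the `search_page` function, and the sample data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One result item; the fields are made up for this example."""
    title: str
    category: str
    price: float


def search_page(hits, selected_facets, sort_by, page, per_page):
    """Apply selected facets, sort the survivors, and return one page.

    selected_facets maps a field name to the set of accepted values.
    Returns (hits_on_page, total_pages); pages are 1-based.
    """
    # Facets: keep hits whose value for every selected field is accepted.
    filtered = [
        h for h in hits
        if all(getattr(h, field) in values for field, values in selected_facets.items())
    ]
    # Sort By: order by the chosen field (ascending).
    ordered = sorted(filtered, key=lambda h: getattr(h, sort_by))
    # Pagination / Per Page: slice out the requested page.
    start = (page - 1) * per_page
    total_pages = -(-len(ordered) // per_page)  # ceiling division
    return ordered[start:start + per_page], total_pages


hits = [
    Hit("Lucene in Action", "books", 40.0),
    Hit("Solr Mug", "merch", 12.0),
    Hit("Relevant Search", "books", 35.0),
    Hit("Elasticsearch Guide", "books", 30.0),
]
page_hits, total_pages = search_page(
    hits, {"category": {"books"}}, sort_by="price", page=1, per_page=2
)
print([h.title for h in page_hits], total_pages)
```

In a real deployment the search engine usually performs the filtering, sorting, and paging itself; the UI components only collect these parameters and render the response.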
38 | 39 | ## Optional Components 40 | - Breadcrumbs 41 | - Geographic 42 | - History 43 | - Menus 44 | - Multi-Select Items 45 | - Stats 46 | - Related Items 47 | - Tag Cloud 48 | - Date Picker 49 | - Tree View 50 | - Tag Filters 51 | 52 | ## Functionality 53 | - Autocomplete (Typeahead) 54 | - Conditional Facets 55 | - Grouping Results 56 | - Highlight 57 | - Search-as-you-type 58 | - Snippet 59 | - Suggestions 60 | - Voice Search 61 | 62 | ## UI Elements 63 | - Input Box 64 | - Button 65 | - List 66 | - Checkbox 67 | - Range Slider 68 | - Radio Button 69 | - Dropdown 70 | 71 | ## Resources 72 | - [InstantSearch.js UI Library Widgets](https://www.algolia.com/doc/api-reference/widgets/js/). 73 | - [ReactiveSearch Components](https://docs.reactivesearch.io/docs/reactivesearch/react/overview/components/). 74 | - They have sub-pages for each type of component and this includes a screenshot of the visual appearance of the component, which can be quite helpful. 75 | - [Elastic Search UI Components](https://docs.elastic.co/search-ui/api/react/components/search-box) 76 | - David Ubersky's article [Search UI Patterns: Elements](https://ddsky.medium.com/search-ui-patterns-elements-80ea9d241f97) 77 | 78 | ### Generic Component Libraries 79 | Just a few sample UI component libraries that may be worth consideration. Not meant to be representative or exhaustive. 80 | 81 | #### General 82 | - [Bootstrap](https://getbootstrap.com/) 83 | - [Semantic UI](https://semantic-ui.com/) 84 | - Has some components specifically dedicated to search. 85 | - [Shoelace](https://shoelace.style/) 86 | 87 | #### Angular 88 | - [PrimeNG](https://primeng.org/) 89 | 90 | #### React 91 | - [Material UI (MUI)](https://mui.com/material-ui/all-components/) 92 | - [MP React Components](https://materialsproject.github.io/mp-react-components/?path=/story/introduction-mp-react-components--page) 93 | - Focused on material sciences.
94 | - [Chakra](https://chakra-ui.com/docs/components) 95 | - [Radix](https://www.radix-ui.com/) 96 | - Headless UI. 97 | - [React Bootstrap](https://react-bootstrap.github.io/) 98 | 99 | #### Vue 100 | - [PrimeVue](https://primevue.org/) 101 | - Has DataTable and DataView. -------------------------------------------------------------------------------- /research/fairness-research.md: -------------------------------------------------------------------------------- 1 | # Diversity, Fairness, and Bias in Search Engines and Information Retrieval 2 | 3 | ## 2023 4 | - Ya-Lin Zhang, Yi-Xuan Sun, Fangfang Fan, Menng Li, Yeyu Zhao, Wei Wang, Longfei Li, Jun Zhou, Jinghua Feng. [A Framework for Detecting Frauds from Extremely Few Labels](https://dl.acm.org/doi/10.1145/3539597.3573022). 2/2023. 5 | - Gang Chen, Jiawei Chen, Fuli Feng, Sheng Zhou, Xiangnan He. [Unbiased Knowledge Distillation for Recommendation](https://dl.acm.org/doi/10.1145/3539597.3570477). 2/2023. 6 | - Xiaoying Zhang, Hongning Wang, Hang Li. [Disentangled Representation for Diversified Recommendations](https://dl.acm.org/doi/10.1145/3539597.3570389). 2/2023. 7 | - Sophie Scharf, Monika Wiegelmann, Arnt Bröder. [Information search in everyday decisions: The generalizability of the attraction search effect](https://www.researchgate.net/publication/366827841_Information_search_in_everyday_decisions_The_generalizability_of_the_attraction_search_effect). 1/2023. 8 | - Zheng Hu, Satoshi Nakagawa, Liang Luo, Yu Gu, Fuji Ren. [Celebrity-aware Graph Contrastive Learning Framework for Social Recommendation](https://dl.acm.org/doi/10.1145/3583780.3614806). CIKM '23. 10/2023.* 9 | 10 | ## 2022 11 | - Amifa Raj. [Fair Ranking Metrics](https://dl.acm.org/doi/10.1145/3523227.3547430). 9/2022. 12 | - Yuta Saito, Thorsten Joachims. [Fair Ranking as Fair Division: Impact-Based Individual Fairness in Ranking](https://dl.acm.org/doi/10.1145/3534678.3539353). 
8/2022.* 13 | - Ji Liu, Zenan Li, Yuan Yao, Feng Xu, Xiaoxing Ma, Miao Xu, Hanghang Tong. [Fair Representation Learning: An Alternative to Mutual Information](https://dl.acm.org/doi/10.1145/3534678.3539302). 8/2022. 14 | - Mouxiang Chen, Chenghao Liu, Zemin Liu, Jianling Sun. [Scalar is Not Enough: Vectorization-based Unbiased Learning to Rank](https://dl.acm.org/doi/10.1145/3534678.3539468). 8/2022. 15 | - Zhaolin Gao, Tianshu Shen, Zheda Mai, Mohamed Reda Bouadjenek, Isaac Waller, Ashton Anderson. [Mitigating the Filter Bubble While Maintaining Relevance: Targeted Diversification with VAE-based Recommender Systems](https://dl.acm.org/doi/10.1145/3477495.3531890). 7/2022. 16 | - Wenjie Wang, Fuli Feng, Liqiang Nie, Tat-Seng Chua. [User-controllable Recommendation Against Filter Bubbles](https://dl.acm.org/doi/10.1145/3477495.3532075). 7/2022.* 17 | - Yuan Wang, Zhiqiang Tao, Yi Fang. [A Meta-learning Approach to Fair Ranking](https://dl.acm.org/doi/10.1145/3477495.3531892). 7/2022.* 18 | - Mohammadmehdi Naghiaei, Hossein A. Rahmani, Yashar Deldjoo. [CPFair: Personalized Consumer and Producer Fairness Re-ranking for Recommendation Systems](https://dl.acm.org/doi/10.1145/3477495.3531959). 7/2022. 19 | - Anja Klasnja, Negar Arabzadeh, Mahbod Mehrvarz, Ebrahim Bagheri. [On the Characteristics of Ranking-based Gender Bias Measures](https://dl.acm.org/doi/10.1145/3501247.3531540). 6/2022.* 20 | - Qinzhi Jiang, Mustafa Naseem, Jamie Lai, Kentaro Toyama, Panos Papalambros. [Understanding Power Differentials and Cultural Differences in Co-design with Marginalized Populations](https://dl.acm.org/doi/10.1145/3530190.3534819). 6/2022. 21 | 22 | ## 2021 23 | - Valeria Mazzeo, Andrea Rapisarda, Giovanni Giuffrida. [Detection of Fake News on COVID-19 on Web Search Engines](https://www.researchgate.net/publication/352838694_Detection_of_Fake_News_on_COVID-19_on_Web_Search_Engines). 6/2021.
24 | 25 | ## 2019 26 | - Juhi Kulshrestha, Motahhare Eslami, Johnnatan Messias, Muhammad Bilal Zafar, Saptarshi Ghosh, Krishna P. Gummadi, Karrie Karahalios. [Search bias quantification: investigating political bias in social media and web search](https://www.researchgate.net/publication/327146029_Search_bias_quantification_investigating_political_bias_in_social_media_and_web_search). 4/2019.* 27 | 28 | ## 2018 29 | - Will Serrano. [Neural Networks in Big Data and Web Search](https://www.researchgate.net/publication/330028298_Neural_Networks_in_Big_Data_and_Web_Search). 12/2018. 30 | 31 | ## 2014 32 | - Xinyu Xing, Wei Meng, Dan Doozan, Nick Feamster, Wenke Lee, Alex C. Snoeren. [Exposing Inconsistent Web Search Results with Bobble](https://www.researchgate.net/publication/301967705_Exposing_Inconsistent_Web_Search_Results_with_Bobble). 3/2014. 33 | 34 | ## 2013 35 | - Ryen White. [Beliefs and biases in web search](https://www.researchgate.net/publication/262393954_Beliefs_and_biases_in_web_search). 7/2013. 36 | - Rodrygo L. T. Santos. [Explicit web search result diversification](https://www.researchgate.net/publication/262272502_Explicit_web_search_result_diversification). 6/2013. -------------------------------------------------------------------------------- /specific-engines/solr/solr-resources-utlities.md: -------------------------------------------------------------------------------- 1 | # Solr Resources: Utilities 2 | - [Solr Ansible role](https://github.com/idealista/solr_role) - Stars: 23 - Updated: 1/2023 - Checked: 5/2023 3 | - [Apache Solr Container (Built with Ansible)](https://github.com/geerlingguy/solr-container) - Stars: 17 - Updated: 11/2022 - Checked: 5/2023 4 | - [NLA's blacklight-solrcloud-repository](https://github.com/nla/blacklight-solrcloud-repository) - Stars: 0 - Updated: 5/2023 - Checked: 5/2023 5 | - "A Blacklight repository to connect with a collection on a ZooKeeper managed SolrCloud cluster."
6 | - [Solr Bulk Indexing](https://github.com/miku/solrbulk) - Stars: 39 - Updated: 4/2023 - Checked: 5/2023 7 | - For indexing "a bunch of documents really, really, fast" 8 | - [solr-cmd-utils](https://github.com/tblsoft/solr-cmd-utils) - Stars: 3 - Updated: 12/2022 - Checked: 5/2023 9 | - Includes solr-pipeline, solr-dump, solr-extract-nouns, solr-numfound. 10 | - [Solr DB Importer](https://github.com/saro-lab/solr-db-importer) - Stars: 10 - Updated: 3/2023 - Checked: 5/2023 11 | - Supports MariaDB, Oracle, MSSQL, MySQL, PostgreSQL, and H2. 12 | - [ik-analyzer-solr](https://github.com/magese/ik-analyzer-solr) - Stars: 1.1k - Updated: 1/2022 - Checked: 5/2023 13 | - [Data Import Handler](https://github.com/SearchScale/dataimporthandler) - Stars: 51 - Updated: 4/2023 - Checked: 5/2023 14 | - Assists in importing records from databases into Solr. 15 | - [Multi Tier Annotation Search (MTAS)](https://github.com/textexploration/mtas) - Stars: 6 - Updated: 1/2022 - Checked: 5/2023 16 | - [RequestSanitizer](https://github.com/cominvent/request-sanitizer-component) - Stars: 4 - Updated: 12/2022 - Checked: 5/2023 17 | - Sanitizes request parameter input. 18 | - [Relevancy Dashboard](https://github.com/sul-dlss/relevancy_dashboard) - Stars: 3 - Updated: 5/2023 - Checked: 5/2023 19 | - "Analyzing relevancy changes across solr versions" 20 | - [solr-diagnostics](https://github.com/sematext/solr-diagnostics) - Stars: 5 - Updated: 5/2021 - Checked: 5/2023 21 | - "Gathers info from Solr that should help diagnose issues" 22 | - [OSC's Solr Dump](https://github.com/o19s/solr_dump) - Stars: 7 - Updated: 3/2022 - Checked: 5/2023 23 | - "Dump a Solr index to file; Read from dumped file to Solr." 
24 | - [solrdump](https://github.com/ubleipzig/solrdump) - Stars: 35 - Updated: 4/2023 - Checked: 5/2023 25 | - "Export documents from a SOLR index as JSON" 26 | - [SolrCloud Manager](https://github.com/ekataglobal/solrcloud_manager) - Stars: 23 - Updated: 1/2022 - Checked: 5/2023 27 | - "Provides easy SolrCloud cluster management." 28 | - [solrcopy](https://github.com/juarezr/solrcopy) - Stars: 6 - Updated: 3/2022 - Checked: 5/2023 29 | - CLI for backup/restore of documents in Solr cores. 30 | - [Solr Grouping](https://github.com/nla/solr-grouping) - Stars: 2 - Updated: 1/2023 - Checked: 5/2023 31 | - Allows two levels of grouping (from the National Library of Australia) 32 | - [Solr Operator](https://github.com/apache/solr-operator) - Stars: 208 - Updated: 5/2023 - Checked: 5/2023 33 | - "Kubernetes Operator for Apache Solr" 34 | - [Lucidworks Spark/Solr Integration](https://github.com/lucidworks/spark-solr) - Stars: 440 - Updated: 5/2023 - Checked: 5/2023 35 | - "tools for reading data from Solr as a Spark DataFrame/RDD and indexing objects from Spark into Solr using SolrJ." 36 | - [solrscripts](https://github.com/tokee/solrscripts) - Stars: 10 - Updated: 4/2022 - Checked: 5/2023 37 | - Includes a tool for diffing schema configs and another for validating configs. 38 | - [Cominvent's Solr tools](https://github.com/cominvent/solr-tools) - Stars: 38 - Updated: 11/2022 - Checked: 5/2023 39 | - [SolrUtils](https://github.com/InterNations/SolrUtils) - Stars: 8 - Updated: 8/2022 - Checked: 5/2023 40 | - "helps with recurring tasks when working with Solr like escaping and sanitizing user input" 41 | - [Traject](https://github.com/traject/traject) - Stars: 98 - Updated: 4/2023 - Checked: 5/2023 42 | - "An easy to use, high-performance, flexible and extensible metadata transformation system, focused on library-archives-museums input, and indexing to Solr as output." 
43 | - [Zeppelin Solr Interpreter](https://github.com/lucidworks/zeppelin-solr) - Stars: 28 - Updated: 1/2021 - Checked: 5/2023 44 | - "allows user to issue Solr queries and display results in the Zeppelin UI" 45 | - [solr-bench](https://github.com/fullstorydev/solr-bench) - Stars: 15 - Updated: 4/2023 - Checked: 5/2023 46 | - "Solr benchmarking and load testing harness" -------------------------------------------------------------------------------- /WebCrawlers.md: -------------------------------------------------------------------------------- 1 | # Open Source Web Crawlers 2 | 3 | ## Table of Contents 4 | - Comments 5 | - General Resources 6 | - Apache Nutch 7 | - StormCrawler 8 | - Scrapy 9 | - Norconex Web Crawler 10 | - PulsarR 11 | - Heritrix 12 | - Sparkler 13 | - CoCrawler 14 | - Comparisons 15 | - Other 16 | - Maybe...? 17 | 18 | ## Comments 19 | - This page focuses on web crawlers/spiders as opposed to web scrapers. While there can be significant overlap between the two, our goal is to evaluate systems that are meant for web scale crawling. 20 | - This document focuses on general purpose web crawlers. There is a growing niche of crawlers created specifically for security purposes which are not covered here. 21 | - We focus primarily on projects which are being actively developed. Projects which are showing limited signs of life may not be included. If you feel we've passed over a project that should be included, please create an issue or pull request. 22 | 23 | ## General Resources 24 | - [Awesome Crawler](https://github.com/BruceDone/awesome-crawler) - Stars: 5.5k - Updated: 12/2022 - Checked: 2/2025. 25 | 26 | ## Apache Nutch 27 | - https://nutch.apache.org/ 28 | - [GitHub Repo](https://github.com/apache/nutch) 29 | - Stars: 2.6k - Updated: 3/2023 - Checked: 4/2023. 30 | - Probably the best known and most utilized open source web crawler.
31 | - [Nutch Tutorial](https://cwiki.apache.org/confluence/display/NUTCH/NutchTutorial) - The official tutorial for getting started with Nutch. 32 | 33 | ## StormCrawler 34 | - http://stormcrawler.net/index.html 35 | - [GitHub Repo](https://github.com/DigitalPebble/storm-crawler) 36 | - Stars: 795 - Updated: 4/2023 - Checked: 4/2023. 37 | - Open source web crawler built on Apache Storm. 38 | - OpenWebSearch.eu's [Owler](https://openwebsearch.eu/owler/) web crawler is built off of StormCrawler. 39 | 40 | ## Scrapy 41 | - https://scrapy.org/ 42 | - [GitHub Repo](https://github.com/scrapy/scrapy) 43 | - Stars: 9.9k - Updated: 4/2023 - Checked: 4/2023. 44 | - A popular, open source web crawler/scraper written in Python. 45 | - [Scrapy Documentation on Broad Crawls](https://docs.scrapy.org/en/latest/topics/broad-crawls.html). 46 | - [WebScraping API's Web Crawling With Python](https://www.webscrapingapi.com/web-crawling-with-python). 12/2022. 47 | 48 | ## Norconex Web Crawler 49 | - https://opensource.norconex.com/crawlers/web/ 50 | - [GitHub Repo](https://github.com/Norconex/collector-http) 51 | - Stars: 153 - Updated: 2/2023 - Checked: 4/2023. 52 | - Open source Java web crawler. 53 | 54 | ## PulsarR 55 | - https://github.com/platonai/pulsarr 56 | - Open source web crawler written in Kotlin. 57 | 58 | ## Heritrix 59 | - https://heritrix.readthedocs.io/en/latest/ 60 | - [GitHub Repo](https://github.com/internetarchive/heritrix3) 61 | - Stars: 2.4k - Updated: 3/2023 - Checked: 4/2023. 62 | - Open source web crawler written in Java by the Internet Archive. 63 | - See also Internet Archive's browser-based distributed crawler, [brozzler](https://github.com/internetarchive/brozzler). 64 | 65 | ## Sparkler 66 | - http://irds.usc.edu/sparkler/ 67 | - [GitHub Repo](https://github.com/USCDataScience/sparkler) 68 | - Stars: 400 - Updated: 4/2023 - Checked: 4/2023. 69 | - A next-generation successor to Apache Nutch that uses Spark, Kafka, Lucene/Solr, Tika, and pf4j. 
70 | 71 | ## CoCrawler 72 | - [GitHub Repo](https://github.com/cocrawler/cocrawler) 73 | - Stars: 166 - Updated: 4/2022 - Checked: 4/2023. 74 | - Authored by Greg Lindahl (Blekko) in Python, pre-release. 75 | - Included primarily because Lindahl has a proven track record in web crawling. 76 | 77 | ## Comparisons 78 | - Rody. [Comparison of Open Source Web Crawlers for Data Mining and Web Scraping: Pros & Cons](https://outsourceit.today/comparison-open-source-web-crawlers/). outsourceit.today, 10/2022. 79 | - Covers Scrapy, Heritrix, Nutch, and PySpider. 80 | 81 | ## Other 82 | - [Crawlab](https://github.com/crawlab-team/crawlab) - Stars: 9.7k - Updated: 4/2023 - Checked: 4/2023. 83 | - A Go language, distributed web crawler admin platform that works with multiple languages and frameworks including Scrapy. 84 | - NOTE: Does not appear to have integrations with most web scale crawlers, e.g. Nutch or StormCrawler. 85 | - [ACHE](https://ache.readthedocs.io/en/latest/) - Stars: 461 - Updated: 2023 - Checked: 2/2025. 86 | - "ACHE is a web crawler for domain-specific search." 87 | - [Browsertrix Crawler](https://github.com/webrecorder/browsertrix-crawler) - Stars: 702 - Updated: 2/2025 - Checked: 2/2025 88 | - "Run a high-fidelity browser-based web archiving crawler in a single Docker container" 89 | 90 | ## Maybe...? 91 | - This section includes a few crawlers that are in development and show some promise. 92 | 93 | - [Crawler](https://github.com/crwlrsoft/crawler) - Stars: 233 - Updated: 3/2023 - Checked: 4/2023. 94 | - Including this one because it's written in PHP, which isn't particularly common for web crawlers. 95 | - [SeimiCrawler](https://github.com/zhegexiaohuozi/SeimiCrawler) - Stars: 1.9k - Updated: 4/2023 - Checked: 4/2023. 96 | - A Java-based, distributed, open source web crawler. 97 | - [XXL-CRAWLER](https://www.xuxueli.com/xxl-crawler/) - Stars: 654 - Updated: 10/2022 - Checked: 4/2023. 98 | - A Java-based, distributed, open source web crawler.
99 | - [Sparkler-Crawler](https://github.com/USCDataScience/sparkler) - Stars: 400 - Updated: 4/2023 - Checked: 4/2023. 100 | - A Java/Scala based web crawler built on Spark. 101 | - [crawler](https://github.com/a11ywatch/crawler) - Stars: 22 - Updated: 4/2023 - Checked: 4/2023. 102 | - A Rust, open source web crawler that claims it is "capable of handling millions of pages per second efficiently." 103 | - [colly](https://github.com/gocolly/colly) - Stars: 19.3k - Updated: 4/2023 - Checked: 4/2023. 104 | - A Go language open source framework for building crawlers/scrapers/spiders. 105 | - [Montferret](https://www.montferret.dev/) - Stars: 5.3k - Updated: 4/2023 - Checked: 4/2023. 106 | - A Go language, open source web scraper. Letting it slide in for its interesting declarative approach. -------------------------------------------------------------------------------- /research/research-main.md: -------------------------------------------------------------------------------- 1 | # Research on Search and Information Retrieval 2 | 3 | ## Collaborative 4 | - See [the page dedicated to collaborative research](collaborative-research.md). 5 | 6 | ## Diversity, Fairness, Bias 7 | - See [the page dedicated to diversity, fairness, and bias research](fairness-research.md). 8 | 9 | ## Federation 10 | - See [the page dedicated to federation research](federation-research.md). 11 | 12 | ## Personalization 13 | - See [the page dedicated to personalization research](personalization-research.md). 14 | 15 | ## Ranking 16 | - See [the page dedicated to ranking research](ranking-research.md). 17 | 18 | ## Web Crawling 19 | - See [the page dedicated to web crawling research](crawling-research.md). 20 | 21 | ## Decentralization 22 | - See [the page dedicated to decentralization research](decentralization-research.md). 23 | 24 | ## Semantic 25 | - See [the page dedicated to semantic research](semantic-research.md).
26 | 27 | ## Uncategorized 28 | - See [the page dedicated to uncategorized research](uncategorized-research.md). 29 | 30 | ## Recommendations 31 | - See [the page dedicated to recommendations research](recommendations-research.md). 32 | 33 | ## Trustworthiness 34 | - See [the page dedicated to trustworthiness research](trust-research.md). 35 | 36 | ## Conversational 37 | - Jeffrey Dalton, Sophie Fischer, Paul Owoicho, Filip Radlinski, Federico Rossetto, Johanne R. Trippas, Hamed Zamani. [Conversational Information Seeking: Theory and Application](https://dl.acm.org/doi/10.1145/3477495.3532678). 7/2022.* 38 | 39 | ## Internationalization and Localization 40 | - Zhuliu Li, Yiming Wang, Xiao Yan, Weizhi Meng, Yanen Li, Jaewon Yang. [TaxoTrans: Taxonomy-Guided Entity Translation](https://dl.acm.org/doi/10.1145/3534678.3539188). 8/2022. 41 | 42 | ## Privacy 43 | - Amit Kumar, Marc Spaniol. [There is a fine Line between Personalization and Surveillance: Semantic User Interest Tracing via Entity-Level Analytics](https://dl.acm.org/doi/10.1145/3501247.3531592). 6/2022. 44 | 45 | ## Storage 46 | - Laurens Debackere, Pieter Colpaert, Ruben Taelman, Ruben Verborgh. [A Policy-Oriented Architecture for Enforcing Consent in Solid](https://dl.acm.org/doi/10.1145/3487553.3524630). 8/2022. 47 | 48 | ## Performance, Scalability 49 | - B. Barla Cambazoglu, Ricardo Baeza-Yates. [Scalability and Efficiency Challenges in Large-Scale Web Search Engines](https://dl.acm.org/doi/10.1145/2684822.2697039). 2/2015. 50 | - Kamlesh Kumar Pandey, Narendra Pradhan. [Internet Search Engine: Performance Evaluating the Google, Yahoo and Bing Web Search Engine based on their Searching Capabilities](https://www.researchgate.net/publication/324482784_Internet_Search_Engine_Performance_Evaluating_the_Google_Yahoo_and_Bing_Web_Search_Engine_based_on_their_Searching_Capabilities). 2/2018. 51 | 52 | ## SPAM 53 | - Asim Shahzad, Nazri Mohd Nawi, Syed Muhammad Zubair Rehman Gillani, Abdullah Khan.
[An Improved Framework for Content- and Link-Based Web-Spam Detection: A Combined Approach](https://www.researchgate.net/publication/356258479_An_Improved_Framework_for_Content-_and_Link-Based_Web-Spam_Detection_A_Combined_Approach). 11/2021. 54 | - Jayakrishnan Ashok, Pankaj Badoni. [Web Content Authentication: A Machine Learning Approach to Identify Fake and Authentic Web Pages on Internet](https://www.researchgate.net/publication/353005229_Web_Content_Authentication_A_Machine_Learning_Approach_to_Identify_Fake_and_Authentic_Web_Pages_on_Internet). 7/2021. 55 | - Asim Shahzad, Jamaluddin Mir, Aamer Khan, Muhammad Asshad, Muhammad Zeeshan, Ahsan Zubair, Muhammad Naeem. [The Web Spam Taxonomy and Algorithms for Detection and Prevention of Web Spamming - A Systematic Review](https://www.researchgate.net/publication/362861741_The_Web_Spam_Taxonomy_and_Algorithms_for_Detection_and_Prevention_of_Web_Spamming_-A_Systematic_Review). 7/2021.* 56 | 57 | ## Older Research 58 | - Na Dai, Brian D. Davison. [Capturing page freshness for web search](https://dl.acm.org/doi/10.1145/1835449.1835658). 7/2010.* 59 | - Carlos Castillo, Brian Davison. [Adversarial Web Search](https://www.researchgate.net/publication/220613785_Adversarial_Web_Search). 1/2010. 60 | - Amanda Spink, Michael Zimmer. [Web Search: Multidisciplinary Perspectives](https://www.researchgate.net/publication/321614743_Web_Search_Multidisciplinary_Perspectives). 1/2008. 61 | - Fang Qi-Ming, Yang Guang-Wen, Wu Yong-Wei, Zheng Wei Min. [P2P Web Search Technology](https://www.researchgate.net/publication/253605198_P2P_Web_Search_Technology). 1/2008. 62 | - Jim Jansen, Sherry Koshman, Amanda Spink. [Web Searching on the Vivisimo Search Engine](https://www.researchgate.net/publication/27479615_Web_Searching_on_the_Vivisimo_Search_Engine). 12/2006. 63 | - Nils Kammenhuber, Julia Luxenburger, Anja Feldmann, Gerhard Weikum.
[Web search clickstreams](https://www.researchgate.net/publication/221611907_Web_search_clickstreams). 10/2006. 64 | - Amanda Spink, Minsoo Park, Jim Jansen, Jan Pedersen. [Multitasking during Web search sessions](https://www.researchgate.net/publication/222436299_Multitasking_during_Web_search_sessions). 1/2006. 65 | - Yiping Ke, Lin Deng, Wee Keong Ng, Dik Lee. [Web dynamics and their ramifications for the development of Web search engines](https://www.researchgate.net/publication/222416900_Web_dynamics_and_their_ramifications_for_the_development_of_Web_search_engines). 66 | - Jim Gray. [A Conversation with Tim Bray: Searching for ways to tame the world's vast stores of information](https://dl.acm.org/doi/10.1145/1046931.1046941). 2/2005.* 67 | - Mike Cafarella, Doug Cutting. [Building Nutch: Open Source Search: A case study in writing an open source search engine](https://dl.acm.org/doi/10.1145/988392.988408). 4/2004.* 68 | - Anna Patterson. [Why Writing Your Own Search Engine Is Hard: Big or small, proprietary or open source, Web or intranet, it's a tough job](https://dl.acm.org/doi/10.1145/988392.988407). 4/2004.* 69 | - Amanda Spink. [Web Search: Emerging Patterns](https://www.researchgate.net/publication/32962078_Web_Search_Emerging_Patterns). 9/2003. 70 | - Upendra Shardanand, Pattie Maes. [Social Information Filtering: Algorithms for Automating "Word of Mouth"](https://dl.acm.org/doi/10.1145/223904.223931). CHI '95. 5/1995. -------------------------------------------------------------------------------- /specific-engines/aws-opensearch/aws-opensearch-main.md: -------------------------------------------------------------------------------- 1 | # AWS OpenSearch Service 2 | 3 | ## Introduction 4 | Amazon Web Services (AWS) used to offer a managed service called Amazon Elasticsearch Service (Amazon ES) that utilized the open source Elasticsearch engine. 
Elastic changed its licensing in an attempt to prevent Amazon from using its software without paying what Elastic saw as a reasonable price. AWS forked the last fully open source version of Elasticsearch and rebranded it as OpenSearch. OpenSearch itself is an open source search engine application that does not depend on AWS. AWS OpenSearch Service is a managed OpenSearch service. There is also a serverless offering available, but we will be focusing primarily on the managed offering at this point. 5 | 6 | ## Caveat 7 | Both Elasticsearch and OpenSearch are powerful document search engines but much of the documentation on them focuses on their usage within DevOps and Security analytics contexts. We will not be covering these topics as indexes although we will address them as they apply to the proper configuration and maintenance of OpenSearch clusters. 8 | 9 | ## Resources 10 | - [Amazon OpenSearch Service Documentation](https://docs.aws.amazon.com/opensearch-service/) 11 | - [Amazon OpenSearch Service Developer Guide](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/what-is.html) 12 | - [Searching data in Amazon OpenSearch Service](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/searching.html) 13 | - [Amazon OpenSearch Service API Operations](https://docs.aws.amazon.com/opensearch-service/latest/APIReference/API_Operations_Amazon_OpenSearch_Service.html) 14 | - [Amazon OpenSearch Service AWS CLI Command Reference](https://docs.aws.amazon.com/cli/latest/reference/opensearch/) 15 | - [Amazon OpenSearch Ingestion API Operations Documentation](https://docs.aws.amazon.com/opensearch-service/latest/APIReference/API_Operations_Amazon_OpenSearch_Ingestion.html) 16 | 17 | 18 | 19 | 20 | ## Ways to Search 21 | - URI - Simple but limited in what functionality it can utilize. 22 | - Request Body - Slightly more complex but able to utilize the full range of OpenSearch DSL.
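
To make the difference between the two ways of searching concrete, here is a minimal sketch in Python that only builds the requests and sends nothing over the network. The endpoint, the `movies` index, and the `title` and `plot` field names are hypothetical placeholders, not values from this page; a real domain would typically also require authentication.

```python
import json
from urllib.parse import urlencode

# Hypothetical values for illustration only; substitute your own
# domain endpoint and index name.
ENDPOINT = "https://search-example-domain.us-east-1.es.amazonaws.com"
INDEX = "movies"


def uri_search_url(query, size=10):
    """URI search: the whole query travels in the `q` query-string parameter."""
    params = urlencode({"q": query, "size": size})
    return f"{ENDPOINT}/{INDEX}/_search?{params}"


def request_body_search(text, fields, size=10):
    """Request body search: a JSON document written in the query DSL."""
    body = {
        "size": size,
        "query": {
            "multi_match": {
                "query": text,
                # A caret suffix such as "title^2" boosts matches in that field.
                "fields": fields,
            }
        },
    }
    return json.dumps(body)


print(uri_search_url("title:house"))
print(request_body_search("house", ["title^2", "plot"]))
```

The URI form fits in a single GET request, which is why it is simple but limited. The request body form is sent as JSON to the same `_search` path and can carry nested clauses, per-field boosts, and the other DSL features mentioned above.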
23 | 24 | ## Boosting 25 | 26 | ## Search Result Highlighting 27 | 28 | ## Pagination 29 | - Point in Time (PIT) - Runs the queries against the data as it was at a specific point in time. This is the preferred method. 30 | - Using From and Size Parameters - Slightly less complicated but may not be as accurate as PIT. 31 | 32 | ## Packages (Dictionaries, Plugins) 33 | - See [Custom packages for Amazon OpenSearch Service](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/custom-packages.html) and [Plugins by engine version in Amazon OpenSearch Service](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/supported-plugins.html) 34 | 35 | Somewhat confusingly named, these are custom dictionary files and plugins that can improve the quality of results returned. 36 | 37 | The plugins provided by Amazon for OpenSearch Service currently include analyzers for Japanese, Chinese, Pinyin, and Korean as well as the more generally applicable Amazon Personalized Ranking plugin ("re-ranks OpenSearch results based on each user's past behavior and preferences"). 38 | 39 | We can use a synonym token filter to add tokens and a stop token filter to remove tokens when a specific token is found. 40 | 41 | ## SQL 42 | - See [Querying your Amazon OpenSearch Service data with SQL](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/sql-support.html) and [SQL and PPL](https://opensearch.org/docs/latest/search-plugins/sql/index/) 43 | - It is what it sounds like: you can use SQL instead of the JSON-based DSL to query your OpenSearch cluster. 44 | - A SQL Workbench is available in OpenSearch Dashboards, along with a SQL CLI and a JDBC driver. There is also a read-only ODBC driver. 45 | 46 | ## k-NN search 47 | - See [k-Nearest Neighbor (k-NN) search in Amazon OpenSearch Service](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/knn.html) and [k-NN search](https://opensearch.org/docs/latest/search-plugins/knn/index/).
48 | 49 | ## Cross-cluster search 50 | - See [Cross-cluster search in Amazon OpenSearch Service](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/cross-cluster-search.html) 51 | - Sometimes one may want to create multiple smaller domains rather than one large domain, this is helpful when the data will be used for different types of workloads and the clusters can be optimized to support that specific workload. 52 | 53 | ## Learning to Rank 54 | - See [Learning to Rank for Amazon OpenSearch Service](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/learning-to-rank.html), [Elastic Learning to Rank Documentation](https://elasticsearch-learning-to-rank.readthedocs.io/en/latest/index.html) 55 | - Uses machine learning and behavioral data to tune the relevance of search results. 56 | - Based on the Elasticsearch LTR plugin which utilizes models from the XGBoost and Ranklib libraries for rescoring results. 57 | 58 | ## Asynchronous Search 59 | - See [Asynchronous search in Amazon OpenSearch Service](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/asynchronous-search.html) 60 | - "With asynchronous search for Amazon OpenSearch Service you can submit a search query that gets executed in the background, monitor the progress of the request, and retrieve results at a later stage. You can retrieve partial results as they become available before the search has completed. After the search finishes, save the results for later retrieval and analysis." 61 | 62 | ## Point in Time (PIT) 63 | - See [Point in time in Amazon OpenSearch Service](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/pit.html) 64 | - "The point in time (PIT) feature is a type of search that lets you run different queries against a dataset that's fixed in time. 
Typically, when you run the same query on the same index at different points in time, you receive different results because documents are constantly indexed, updated, and deleted. With PIT, you can query against a constant state of your dataset." 65 | 66 | ## Semantic Search 67 | - See [Semantic search in Amazon OpenSearch Service](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/semantic-search.html) 68 | - Allows one to perform semantic search using neural search or k-NN. 69 | - DM: Explore this section further. 70 | -------------------------------------------------------------------------------- /OpenSourceSearchEngines.md: -------------------------------------------------------------------------------- 1 | # Open Source Search Engines 2 | 3 | ## Table of Contents 4 | - Apache Lucene 5 | - Lucene++ 6 | - Apache Solr 7 | - Open Semantic Search 8 | - Subprojects 9 | - Solr PHP UI 10 | - Elasticsearch 11 | - Other Projects 12 | - dejavu 13 | - Fess 14 | - Searchkit 15 | - OpenSearch 16 | - Other Projects 17 | - Gigablast 18 | - [YaCy](/specific-engines/yacy.md) 19 | - Articles 20 | - Vald 21 | - Weaviate 22 | - MWMBL 23 | - Alexandria 24 | - Wiby 25 | - OpenSearchServer 26 | - Metasearch 27 | - MetaGer 28 | - Not Web Scale 29 | - meilisearch 30 | - Typesense 31 | - Smaller Engines 32 | - Sonic 33 | - ZincSearch 34 | 35 | ## Apache Lucene 36 | - https://lucene.apache.org/ 37 | - The open source Java library that powers Apache Solr and Elasticsearch, among many other search projects. 38 | 39 | ### Lucene++ 40 | - https://github.com/luceneplusplus/LucenePlusPlus 41 | - An open source C++ port of Lucene. 
42 | 43 | ## Apache Solr 44 | - https://solr.apache.org/ 45 | - See also dedicated pages [on Solr](/specific-engines/apache-solr.md) 46 | 47 | ## Open Semantic Search 48 | - https://opensemanticsearch.org/ 49 | - Under the hood one is running Apache Solr, but there are some significant changes that make listing Open Semantic Search separately worthwhile.[^opensemanticsearch] 50 | 51 | ### Subprojects 52 | - [Solr PHP UI](https://opensemanticsearch.org/solr-php-ui/) - Stars: 20 - Updated: 12/2021 - Checked: 2/2024 53 | - A frontend for Open Semantic Search. 54 | - [GitHub Repo](https://github.com/opensemanticsearch/solr-php-ui) 55 | - [Solr Ontology Tagger](https://github.com/opensemanticsearch/solr-ontology-tagger) - Stars: 39 - Updated: 1/2022 - Checked: 5/2023 56 | - [Solr Synonames](https://github.com/opensemanticsearch/solr-synonames) - Stars: 5 - Updated: 10/2020 - Checked: 5/2023 57 | 58 | ## Elasticsearch 59 | - https://elastic.co/ 60 | - See also the [dedicated pages on Elasticsearch](/specific-engines/elasticsearch.md). 61 | 62 | ### Other Projects 63 | - [dejavu](https://github.com/appbaseio/dejavu) - Open source, JS web-based UI for Elasticsearch and OpenSearch. 64 | - [Fess](https://fess.codelibs.org/) - Open source, enterprise search server with web crawler and GUI. Written in Java. 65 | - [Searchkit](https://github.com/searchkit/searchkit) - Updated: 3/2023 - Checked: 3/2023 - Stars: 4.6k - Open source library for building search UIs with JS, React, Vue, Angular, etc. Written in TypeScript primarily. 66 | 67 | ## OpenSearch 68 | - https://opensearch.org/ 69 | - An open source fork of Elasticsearch started by Amazon.[^controversy] 70 | - See also the [dedicated pages on OpenSearch](/specific-engines/opensearch.md) 71 | 72 | ### Other Projects 73 | - Please see Other Projects under Elasticsearch. Only projects that are for OpenSearch exclusively will be listed here.
74 | 75 | ## Gigablast 76 | - https://gigablast.com/ 77 | - [GitHub Repo](https://github.com/gigablast/open-source-search-engine) 78 | - Founded in 2000 by Matt Wells as a closed source search engine, it was later open sourced. It is written in C++, is distributed, and includes both the engine and a crawler. 79 | 80 | ## YaCy 81 | - Please see the [dedicated page on YaCy](/specific-engines/yacy.md). 82 | 83 | ## Vald 84 | - https://vald.vdaas.org/ 85 | - [GitHub Repo](https://github.com/vdaas/vald) 86 | - An open source, distributed vector search engine built using Go, utilized by Yahoo Japan. 87 | 88 | ## Weaviate 89 | - https://weaviate.io/ 90 | - [GitHub Repo](https://github.com/weaviate/weaviate) 91 | - Open source vector search engine written in Go. 92 | - [Semantic Search through Wikipedia with Weaviate](https://github.com/weaviate/semantic-search-through-wikipedia-with-weaviate) 93 | 94 | ## MWMBL 95 | - https://mwmbl.org/ 96 | - [GitHub Repo](https://github.com/mwmbl/mwmbl) 97 | - Open source, non-profit search engine written in Python.[^mwmbl] 98 | 99 | ## Alexandria 100 | - https://www.alexandria.org/ 101 | - [GitHub Repo](https://www.alexandria.org/) 102 | - Open source search engine that uses CommonCrawl and is written in C++. 103 | 104 | ## Wiby 105 | - https://wiby.me/ 106 | - [GitHub Repo](https://github.com/wibyweb/wiby) 107 | - [Installation and Setup Instructions](https://wiby.me/about/guide.html) 108 | - Open source search engine written in PHP, C, and Go. 109 | 110 | ## OpenSearchServer 111 | - https://www.opensearchserver.com/ 112 | - [GitHub Repo](https://github.com/jaeksoft/opensearchserver) 113 | - Open source search engine written in Java, includes bundled crawler. 114 | - Note: No updates since 8/2021 as of 3/2023. 115 | 116 | ## Metasearch 117 | 118 | ### MetaGer 119 | - https://metager.org/ 120 | - [Git Repo](https://gitlab.metager.de/open-source/MetaGer) 121 | - Open source metasearch engine run by a nonprofit.
122 | 123 | ## Not Web Scale 124 | 125 | ### meilisearch 126 | - https://www.meilisearch.com/ 127 | - [GitHub Repo](https://github.com/meilisearch/meilisearch) 128 | - An open source search engine written in Rust. 129 | 130 | ### Typesense 131 | - https://typesense.org/ 132 | - [GitHub Repo](https://github.com/typesense/typesense) 133 | - An open source Algolia alternative written in C/C++.[^typesense] 134 | 135 | ## Smaller Engines 136 | - [Sonic](https://github.com/valeriansaliou/sonic) - Updated: 1/2023 - Checked: 3/2023 - Stars: 18k - A lightweight, speedy search backend written in Rust. 137 | - [ZincSearch](https://github.com/zincsearch/zincsearch) - Updated: 3/2023 - Checked: 3/2023 - Stars: 14.7k - Lightweight alternative to Elasticsearch, written in Go. Includes a web UI. 138 | 139 | ## Footnotes 140 | [^controversy]: The fork was started following controversial licensing changes by Elasticsearch. For more on the history of this controversy see Graham Gillen's [Elasticsearch vs OpenSearch series](https://pureinsights.com/blog/2021/elasticsearch-vs-opensearch-user-point-of-view-part-1-of-3/). For a brief evaluation of OpenSearch's progress see Matt Asay's [One year of OpenSearch: Grading AWS’ open source effort](https://www.techrepublic.com/article/opensearch-grading-aws-open-source/). 141 | [^typesense]: Some interesting functionality includes tunable ranking, sorting, faceting & filtering, grouping & distinct, federated search, and curation. It doesn't appear to be in web scale usage but they've expressed interest in benchmarking larger datasets so I submitted an [issue requesting CommonCrawl be benchmarked](https://github.com/typesense/typesense/issues/933). 142 | [^opensemanticsearch]: It isn't meant for web search particularly but it offers a number of features which could be useful in a search engine - e.g. exploratory search as well as collaborative annotation and tagging.
143 | [^mwmbl]: The project has some similarities with what I'm looking to do with [Phoebe](https://github.com/davidshq/next-search/). It is open source, a non-profit, and the code is written in Python. -------------------------------------------------------------------------------- /specific-engines/apache-solr.md: -------------------------------------------------------------------------------- 1 | # Apache Solr 2 | - Introduction 3 | - My Notes 4 | - Related Projects 5 | - General 6 | - [Discovery and/or UI](./solr/solr-resources-ui.md) 7 | - [(Coding) Language Integrations](./solr/solr-resources-code.md) 8 | - [Application/Framework Integrations](./solr/solr-resources-app-framework-integrations.md) 9 | - [Utilities](./solr/solr-resources-utilities.md) 10 | - Vector Search 11 | - [Projects Using Solr](./solr/solr-resources-used-by.md) 12 | - Discussion 13 | - Solr as a Service Options 14 | - Consulting Companies Working with Solr 15 | - Blogs 16 | - Demos 17 | - [Interesting But Old](./solr/solr-resources-interesting-old.md) 18 | - Bibliography/Resources 19 | 20 | ## Introduction 21 | Apache Solr is a search engine built on top of Apache Lucene (a Java library, also used in Elasticsearch). Solr receives queries via HTTP requests and provides responses in JSON by default (but can also output XML, CSV, etc.). 22 | 23 | ## My Notes 24 | These are largely pulled from Solr's documentation, consider them cliff notes / cheat sheets. 
25 | - [Basic Solr Tutorial](./solr/basic-tutorial.md) 26 | - [Basic Solr Admin UI Tutorial](./solr/basic-admin-ui-tutorial.md) 27 | - [Basic SolrCloud Tutorial](./solr/basic-solrcloud-tutorial.md) 28 | - [Basic Indexing Your Own Data Tutorial](./solr/basic-indexing-your-own-data.md) 29 | - [Solr Development](./solr/solr-development.md) 30 | - [Solr Terminology](./solr/solr-terminology.md) 31 | - [Solr Notes Unorganized](./solr/solr-notes.md) 32 | 33 | ## Related Projects 34 | 35 | ### General 36 | - [Solr Plugin Directory](https://solr.cool/) - A directory of plugins/extensions available for Solr including query parsers, analyzers, response writers, search components, document transformers, and utilities. 37 | 38 | ### Neural 39 | - [Neural Solr](https://github.com/maxdotio/neural-solr) - Stars: 14 - Updated: 6/2022 - Checked: 5/2023 40 | - "This project provides a complete and working semantic search application, using Mighty Inference Server, Apache Solr v9, and an example Node.js express application." 41 | 42 | ### Other 43 | - [solr-constant-similarity](https://github.com/freedev/solr-constant-similarity) - Stars: 2 - Updated: 4/2022 - Checked: 5/2023 44 | 45 | ### Plugins 46 | - [sematext's Solr Redis Extensions](https://github.com/sematext/solr-redis) - Stars: 51 - Updated: 5/2022 - Checked: 5/2023. 47 | - "a ParserPlugin that provides a Solr query parser based on data stored in Redis." 48 | - [solr-sandbox](https://github.com/apache/solr-sandbox) - Stars: 7 - Updated: 5/2023 - Checked: 5/2023 49 | - "The solr sandbox repository serves as a place to host contributions that are not a part of core solr." 50 | 51 | ### Security 52 | - [solr-proxy](https://github.com/Trott/solr-proxy) - Stars: 7 - Updated: 5/2023 - Checked: 5/2023 53 | - "Reverse proxy to make a Solr instance read-only, rejecting requests that have the potential to modify the Solr index." 
54 | 55 | ### Semantic Search 56 | - [Solr-SBERT-semantic-search](https://github.com/tkhang1999/Solr-SBERT-semantic-search) - Stars: 5 - Updated: 4/2023 - Checked: 5/2023 57 | - "a simple web demo of semantic search (search by meaning)...using Solr and BERT embeddings." 58 | 59 | ### Vector Search 60 | - [BERT Solr Search](https://github.com/DmitryKey/bert-solr-search) - Stars: 134 - Updated: 6/2022 - Checked: 3/2023 - Allows one to search with BERT vectors in Solr, also compatible with Elasticsearch/OpenSearch. 61 | - Has associated articles explaining the process that was used to build the solution. 62 | - [Vector Search for E-commerce with Chorus](https://opensourceconnections.com/blog/2023/03/22/building-vector-search-in-chorus-a-technical-deep-dive/) - a blog post from OpenSource Connections showing how to add vector features to a Solr-powered e-commerce platform 63 | 64 | ## Learning Resources 65 | - [Solr for newbies workshop](https://github.com/hectorcorrea/solr-for-newbies) - Stars: 68 - Updated: 3/2023 - Checked: 5/2023. 66 | - [solr-tmdb](https://github.com/o19s/solr-tmdb) - Stars: 18 - Updated: 5/2023 - Checked: 5/2023 67 | - "part of the Think Like a Relevancy Engineer training provided by OpenSource Connections." 68 | - [OSC's pdf-discovery-demo](https://github.com/o19s/pdf-discovery-demo) - Stars: 25 - Updated: 4/2023 - Checked: 5/2023 69 | - "leverages the Solr Payload Component...and the Offset Highlighter Component...as well as pdf.js to make PDF documents searchable and have highlighting of matches with the text in context of the PDF."
70 | - [Apache Lucene Solr Guide](https://github.com/mikeroyal/Apache-Lucene-Solr-Guide) - Stars: 7 - Updated: 10/2021 - Checked: 5/2023 71 | - [Videos featuring Solr from OSC](https://www.youtube.com/playlist?list=PLCoJWKqBHERuLJgmR0PhiXmS3TUYjWatW) - a playlist of videos of talks featuring Solr 72 | 73 | ## Discussion 74 | - [Official Solr Users Mailing List](https://lists.apache.org/list.html?users@solr.apache.org) 75 | 76 | ## Solr as a Service Options 77 | - [OpenSolr](https://opensolr.com/) 78 | - 30-day free trial 79 | - Pricing starts at €10/mo. 80 | - [SearchStax](https://www.searchstax.com/) 81 | - Limited free account 82 | - Pricing starts at $9/mo. 83 | - [WebSolr](https://www.websolr.com/) 84 | - Standard from $59/mo with enterprise options 85 | 86 | ## Companies Working with Solr 87 | - [sematext](https://sematext.com/) 88 | - [Training on Solr](https://sematext.com/training/solr/) 89 | - [Monitoring of Solr](https://sematext.com/docs/integration/solr/), [SolrCloud](https://sematext.com/docs/integration/solrcloud/), and [Solr Logs](https://sematext.com/docs/integration/solr-logs/). 
90 | - [SeaseLtd](https://sease.io/) 91 | - [OpenSource Connections](https://www.opensourceconnections.com) 92 | 93 | ## Blogs 94 | - [Joel Bernstein's Solr Analytics Blog](https://joelsolr.blogspot.com/) 95 | 96 | ## Demos 97 | - [Apache Solr Manual Search Demo - Multi-Language Model](https://demo.rondhuit.com/en/solr-manual) 98 | - [Slide Deck on setting up the demo](https://www.rondhuit.com/download/RONDHUIT-solrmanual-1.0.0.pdf) 99 | - [YouTube Video on setting up the demo](https://www.youtube.com/watch?v=rh3fP9qQAhw) 100 | - [KandaSearch blog post on setting up the demo](https://kandasearch.com/blogs/9c7ec12f-c09b-4ddd-b5eb-aafc3bb8b1a6) 101 | 102 | 103 | 104 | ## Bibliography / Resources 105 | - [Apache Solr](https://solr.apache.org/) 106 | - Solr Reference Guide 107 | - Getting Started 108 | - [Introduction to Solr](https://solr.apache.org/guide/solr/latest/getting-started/introduction.html) 109 | - Solr Concepts 110 | - [Documents, Fields, and Schema Design](https://solr.apache.org/guide/solr/latest/getting-started/documents-fields-schema-design.html) 111 | - [Solr Indexing](https://solr.apache.org/guide/solr/latest/getting-started/solr-indexing.html) 112 | - [Searching in Solr](https://solr.apache.org/guide/solr/latest/getting-started/searching-in-solr.html) 113 | - [Relevance](https://solr.apache.org/guide/solr/latest/getting-started/relevance.html) 114 | - [Solr Glossary](https://solr.apache.org/guide/solr/latest/getting-started/solr-glossary.html) 115 | - [Solr Tutorials](https://solr.apache.org/guide/solr/latest/getting-started/solr-tutorial.html) 116 | 117 | -------------------------------------------------------------------------------- /common-crawl/basic-manually-accessing-common-crawl.md: -------------------------------------------------------------------------------- 1 | # How To Manually Access CommonCrawl 2 | 3 | 4 | ## Introduction 5 | 6 | I sometimes find it helpful to understand the intricacies of a process by performing it manually before 
attempting to automate it. This document outlines a manual process for retrieving data from CommonCrawl. 7 | 8 | **NOTE**: There is a web search interface one can use and one can download the files over HTTP but both of these are quite slow. I'll only be covering how to retrieve the data using Amazon S3 (where the data is stored). 9 | 10 | ## Prerequisites 11 | 12 | You'll need to have an AWS account, the AWS CLI installed, and programmatic access set up on your local system. 13 | 14 | ## Choose a Crawl 15 | 16 | Crawls are organized by date. You can go to http://index.commoncrawl.org/ to view the list of available crawls. 17 | 18 | For this example I'm using CC-MAIN-2023-14. 19 | 20 | ## Download the Index List 21 | 22 | You can see the list of index files on the previously mentioned page. We'll grab the one for our selected crawl using the AWS CLI: 23 | `aws s3 cp s3://commoncrawl/crawl-data/CC-MAIN-2023-14/cc-index.paths.gz .` 24 | 25 | NOTE: The `.` in the above command is very important. The command tells the AWS CLI to copy from the specified path on S3 to the specified path on the local system. The dot `.` will drop the file wherever you are running the command from. You can also specify a specific path instead of using `.` 26 | 27 | ## Opening the File 28 | 29 | Extract the contents of the gzip file and the result should be a file called `cc-index.paths`. This can be opened with a regular text editor. 30 | 31 | Scroll to the bottom and you'll find the path to get the `cluster.idx` file. 32 | 33 | ## Downloading the Cluster Index 34 | 35 | In our case the path shown is `cc-index/collections/CC-MAIN-2023-14/indexes/cluster.idx`. Using the AWS CLI our command should look like: 36 | `aws s3 cp s3://commoncrawl/cc-index/collections/CC-MAIN-2023-14/indexes/cluster.idx .` 37 | 38 | This file is quite a bit larger (~150 MB) and may take a few minutes to download.
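
If you'd rather not scroll through `cc-index.paths` by hand, the lookup can be scripted. The following is a minimal Python sketch, assuming `cc-index.paths.gz` has already been downloaded to the current directory. The function names `find_cluster_index_path` and `s3_copy_command` are my own, chosen for illustration, and are not part of any Common Crawl tooling.

```python
import gzip
from pathlib import Path


def find_cluster_index_path(paths_file):
    """Return the first path in a gzipped listing that ends with 'cluster.idx'."""
    # gzip.open in text mode lets us read the listing without extracting it first.
    with gzip.open(paths_file, "rt", encoding="utf-8") as handle:
        for line in handle:
            path = line.strip()
            if path.endswith("cluster.idx"):
                return path
    return None


def s3_copy_command(path, bucket="commoncrawl"):
    """Build the `aws s3 cp` command for a path taken from the listing."""
    return f"aws s3 cp s3://{bucket}/{path} ."


if __name__ == "__main__":
    listing = Path("cc-index.paths.gz")  # assumed local file name
    if listing.exists():
        cluster_path = find_cluster_index_path(listing)
        if cluster_path:
            print(s3_copy_command(cluster_path))
```

The script only prints the command, so you can check the path before running the download yourself.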
39 | 40 | ## Opening the Cluster Index 41 | 42 | There are a number of ways we could manipulate the file without opening it in a normal editor but for the moment let's do just that. A cross-platform program that can handle large files with a GUI is [klogg](https://klogg.filimonov.dev/). 43 | 44 | You should see data that looks something like this: 45 | 46 | ``` 47 | 0,1,184,137)/igplay 20230325212225 cdx-00000.gz 0 158603 1 48 | 10,124,97,161)/paito-warna-trinidad-tobago-afternoon 20230330213447 cdx-00000.gz 158603 187900 2 49 | ``` 50 | 51 | The above is two lines of data. Each contains the reversed host of a site indexed by CommonCrawl (`0,1,184,137`, `10,124,97,161`) followed by the specific path that was indexed (e.g. `/igplay`, `/paito-warna-trinidad`...). 52 | 53 | This can be a little confusing at first glance. Where are the domain names? Sometimes servers don't have a domain name and are instead accessed by IP address. Because numeric characters sort before alphabetic characters, the file starts off with all the IPs that were crawled. 54 | 55 | Scroll down a bit in the file and you should start to see records that look like this: 56 | 57 | ``` 58 | com,homesandgardens)/gardens/how-to-split-irises 20230330183612 cdx-00077.gz 736773153 222043 311253 59 | ``` 60 | 61 | Note that whether the site is a domain or an IP it is reversed and separated by levels. If we reassemble the IPs/domains from the above examples we get: 62 | ``` 63 | 137.184.1.0 64 | 161.97.124.10 65 | homesandgardens.com 66 | ``` 67 | 68 | ## Finding and Downloading the CDX We Need 69 | 70 | If we are looking to access specific site data we need to figure out what CDX file serves as the index. In the case of the `homesandgardens.com` URL we can see that the CDX is `cdx-00077.gz`. 71 | 72 | We'll use the AWS CLI to download this file: 73 | ``` 74 | aws s3 cp s3://commoncrawl/cc-index/collections/CC-MAIN-2023-14/indexes/cdx-00077.gz .
75 | ``` 76 | 77 | This file, again, is much larger than the previous files (closing in on 1 GB) so it may take some time to download. 78 | 79 | ## Finding and Downloading the WARC We Need 80 | 81 | Once downloaded we need to extract the contents from the gzip, which should be a plain text file called `cdx-00077`. Once extracted, this file will likely be several GB in size. 82 | 83 | When we open the file (using `klogg` or something similar) we should see records that look like this: 84 | 85 | ``` 86 | com,homes-n-gardens)/adult-coloring-pages-garden 20230402111251 {"url": "https://homes-n-gardens.com/adult-coloring-pages-garden/", "mime": "text/html", "mime-detected": "text/html", "status": "200", "digest": "XJHZYZYMKOL62PIT56QEUFJT5MYNYVOT", "length": "3991", "offset": "343066379", "filename": "crawl-data/CC-MAIN-2023-14/segments/1679296950528.96/warc/CC-MAIN-20230402105054-20230402135054-00060.warc.gz", "charset": "UTF-8", "languages": "eng"} 87 | ``` 88 | 89 | NOTE: See Appendix A for a prettified version of the JSON above. 90 | 91 | Note that we again have the reversed domain name at the beginning followed by the path to the specific file/document accessed and a little later we have `"filename":`. The filename specified here is the file we need to access to retrieve the data for the specific URL we are looking at. In this case it is: 92 | 93 | ``` 94 | "crawl-data/CC-MAIN-2023-14/segments/1679296950528.96/warc/CC-MAIN-20230402105054-20230402135054-00060.warc.gz" 95 | ``` 96 | 97 | We can get the WARC file using the AWS CLI: 98 | ``` 99 | aws s3 cp s3://commoncrawl/crawl-data/CC-MAIN-2023-14/segments/1679296950528.96/warc/CC-MAIN-20230402105054-20230402135054-00060.warc.gz . 100 | ``` 101 | 102 | This file is likely to clock in at over 1 GB in its compressed form - so expect the download to take some time. 103 | 104 | ## Viewing the WARC 105 | 106 | Once again we'll extract the file from its gzip.
We should end up with a file titled: `CC-MAIN-20230402105054-20230402135054-00060.warc`. We can open this file using klogg as well. Much of the file will be human readable but there will also be binary files that have been included (e.g. images) and these will appear as a long series of garbled characters. 107 | 108 | # Appendix A: Prettified JSON CDX Record 109 | ```json 110 | { 111 | "url": "https://homes-n-gardens.com/adult-coloring-pages-garden/", 112 | "mime": "text/html", 113 | "mime-detected": "text/html", 114 | "status": "200", 115 | "digest": "XJHZYZYMKOL62PIT56QEUFJT5MYNYVOT", 116 | "length": "3991", 117 | "offset": "343066379", 118 | "filename": "crawl-data/CC-MAIN-2023-14/segments/1679296950528.96/warc/CC-MAIN-20230402105054-20230402135054-00060.warc.gz", 119 | "charset": "UTF-8", 120 | "languages": "eng" 121 | } 122 | ``` 123 | 124 | 125 | # Bibliography 126 | 127 | - [StackOverflow: How to view huge txt files in Linux?](https://stackoverflow.com/questions/21246752/how-to-view-huge-txt-files-in-linux) 128 | - Samuel Schaffhauser. [Using the Common Crawl as a Data Source](https://medium.com/@samuel.schaffhauser/using-the-common-crawl-as-a-data-source-693a41b3baa9). 6/2022. 129 | - Chillar Anand. [Common Crawl On Laptop - Extracting Subset of Data](https://avilpage.com/2022/11/common-crawl-laptop-extract-subset.html). 11/2022. 130 | - Derek Morgan. [Exploring the Common Crawl with Python](https://dmorgan.info/posts/common-crawl-python/). 2016. -------------------------------------------------------------------------------- /research/ranking-research.md: -------------------------------------------------------------------------------- 1 | # Ranking and Recommendations 2 | 3 | ## 2023 4 | - Yi Ren, Xaio Han, Xu Zhao, Shenzheng Zhang, Yan Zhang. [Slate-Aware Ranking for Recommendation](https://dl.acm.org/doi/10.1145/3539597.3570380). 2/2023. 5 | - Yunjia Xi, Jianghao Lin, Weiwen Liu, Xinyi Dai, Weinan Zhang, Rui Zhang, Ruming Tang, Yong Yu. 
[A Bird's-eye View of Reranking: From List Level to Page Level](https://dl.acm.org/doi/10.1145/3539597.3570399). 2/2023. 6 | - Shuting Wang, Zhicheng Dou, Yutao Zhu. [Heterogeneous Graph-based Context-aware Document Ranking](https://dl.acm.org/doi/10.1145/3539597.3570390). 2/2023.* 7 | - Dan Luo, Lixin Zou, Qingyao Ai, Zhiyu Chen, Dawei Yin, Brian D. Davison. [Model-based Unbiased Learning to Rank](https://dl.acm.org/doi/10.1145/3539597.3570395). 2/2023. 8 | 9 | ## 2022 10 | - Syed Ahmed Yasin, P. V. R. D. Prasada Rao. [Enhanced CRNN-Based Optimal Web Page Classification and Improved Tunicate Swarm Algorithm-Based Re-Ranking](https://www.researchgate.net/publication/365619498_Enhanced_CRNN-Based_Optimal_Web_Page_Classification_and_Improved_Tunicate_Swarm_Algorithm-Based_Re-Ranking). 11/2022. 11 | - Yuexin Wu, Xiaolei Huang. [A Gumbel-based Rating Prediction Framework for Imbalanced Recommendation](https://dl.acm.org/doi/10.1145/3511808.3557341). 10/2022. 12 | - Sejoon Oh, Berk Ustun, Julian McAuley, Srijan Kumar. [Rank List Sensitivity of Recommender Systems to Interaction Perturbations](https://dl.acm.org/doi/10.1145/3511808.3557425). 10/2022.* 13 | - Haolun Wu, Chen Ma, Yingxue Zhang, Xue Liu, Ruiming Tang, Mark Coates. [Adapting Triplet Importance of Implicit Feedback for Personalized Recommendation](https://dl.acm.org/doi/10.1145/3511808.3557229). 10/2022. 14 | - Yi Ren, Hongyan Tang, Siwen Zhu. [Unbiased Learning to Rank with Biased Continuous Feedback](https://dl.acm.org/doi/10.1145/3511808.3557483). 10/2022. 15 | - Weiwen Liu, Jiarui Qin, Ruiming Tang, Bo Chen. [Neural Re-ranking for Multi-stage Recommender Systems](https://dl.acm.org/doi/10.1145/3523227.3547369). 9/2022.* 16 | - Roberto Pellgrini, Wenjie Zhao, Iain Murray. [Don't recommend the obvious: estimate probability ratios](https://dl.acm.org/doi/10.1145/3523227.3546753). 9/2022.* 17 | - Giacomo Balloccu, Ludovico Boratto, Gianni Fenu, Mirko Marras.
[Hands on Explaining Recommender Systems with Knowledge Graphs](https://dl.acm.org/doi/10.1145/3523227.3547374). 9/2022.* 18 | - Xiang Li, Xioajiang Zhou, Yao Xiao, Peihao Huang, Dayao Chen, Sheng Chen, Yunsen Xian. [AutoFAS: Automatic Feature and Architecture Selection for Pre-Ranking System](https://dl.acm.org/doi/10.1145/3534678.3539083). 8/2022. 19 | - Ruobing Xe, Qi Liu, Liangdong Wang, Shukai Liu, Bo Zhang, Leyu Lin. [Contrastive Cross-domain Recommendation in Matching](https://dl.acm.org/doi/10.1145/3534678.3539125). 8/2022. 20 | - Yankai Chen, Huifeng Guo, Yingxue Zhang, Chen Ma, Ruiming Tang, Jingjie Li, Irwin King. [Learning Binarized Graph Representations with Multi-faceted Quantization Reinforcement for Top-K Recommendation](https://dl.acm.org/doi/10.1145/3534678.3539452). 8/2022. 21 | - Zihan Lin, Hui Wang, Jingshu Mao, Wayne Xin Zhao, Cheng Wang, Peng Jiang, Ji-Rong Wen. [Feature-aware Diversified Re-ranking with Disentangled Representations for Relevant Recommendation](https://dl.acm.org/doi/10.1145/3534678.3539130). 8/2022. 22 | - Linsey Pang, Wei Liu, Keng-Hao Chang, Xue Li, Moumita Bhattacharya, Xianjing Liu, Stephen Guo. [Deep Search Relevance Ranking in Practice](https://dl.acm.org/doi/10.1145/3534678.3542632). 8/2022.* 23 | - Alessandro Magnani, Feng Liu, Suthee Chaidaroon, Sachin Yadav, Praveen Reddy Suram, Ajit Puthenputhussery, Sijie Chen, Min Xie, Anirudh Kashi, Tony Lee, Ciya Liao.[Search Retrieval at Walmart](https://dl.acm.org/doi/10.1145/3534678.3539164). 8/2022. 24 | - Yi Li, Jieming Zhu, Weiwen Liu, Liangcai Su, Guohao Cai, Qi Zhang, Ruiming Tang, Xi Xiao, Xiuqiang He. [PEAR: Personalized Re-Ranking with Contextualized Transformer for Recommendation](https://dl.acm.org/doi/10.1145/3487553.3524208). 8/2022. 25 | - Egor Markovskiy, Fiana Raiber, Shoham Sabach, Oren Kurland. [From Cluster Ranking to Document Ranking](https://dl.acm.org/doi/10.1145/3477495.3531819). 7/2022. 
26 | - Wenchao Xiu, Yiran Wang, Taofeng Xue, Kai Zhang, Qin Zhang, Zhonghuo Wu, Yifan Yang, Gong Zhang. [DDEN: A Heterogeneous Learning-to-Rank Approach with Deep Debiasing Experts Network.](https://dl.acm.org/doi/10.1145/3477495.3536320). 7/2022. 27 | - Xinyan Fan, Jianxun Lian, Wayne Xin Zhao, Zheng Liu, Chaozhuo Li, Xing Xie. [Ada-Ranker: A Data Distribution Adaptive Ranking Paradigm for Sequential Recommendation](https://dl.acm.org/doi/10.1145/3477495.3531931). 7/2022. 28 | - Amifa Raj, Michael D. Ekstrand. [Measuring Fairness in Ranked Results: An Analytical and Empirical Comparison](https://dl.acm.org/doi/10.1145/3477495.3532018). 7/2022. 29 | - Enrique Amigó, Stefano Mizzaro, Damiano Spina. [Ranking Interruptus: When Truncated Rankings Are Better and How To Measure That](https://dl.acm.org/doi/10.1145/3477495.3532051). 7/2022. 30 | - Ziyi Ye, Xiaohui Xie, Yiqun Liu, Zhihong Wang, Xuancheng Li, Jiaji Li, Xuesong Chen, Min Zhang, Shaoping Ma. [Why Don't You Click: Understanding Non-Click Results in Web Search with Brain Signals](https://dl.acm.org/doi/10.1145/3477495.3532082). 7/2022. 31 | - George Zerveas, Navid Rekabsaz, Daniel Cohen, Carsten Eickhoff. [Mitigating Bias in Search Results Through Contextual Document Reranking and Neutrality Regularization.](https://dl.acm.org/doi/10.1145/3477495.3531891). 7/2022.* 32 | - Virginie Do, Nicolas Usunier. [Optimizing Generalized Gini Indices for Fairness in Rankings](https://dl.acm.org/doi/10.1145/3477495.3532035). 7/2022. 33 | - Shubham Chatterjee, Laura Dietz. [BERT-ER: Query-specific BERT Entity Representations for Entity Ranking](https://dl.acm.org/doi/10.1145/3477495.3531944). 7/2022.* 34 | - Ayushi Prakash, Sandeep Kumar Gupta, Mukesh Rawat. [Keyword Based Ranking of Web Pages by Normalizing Link Score](https://www.researchgate.net/publication/361785686_Keyword_Based_Ranking_of_Web_Pages_by_Normalizing_Link_Score). 6/2022. 35 | - Prem Sharma, Divakar Yadav, R N Thakur. 
[Web Page Ranking Using Web Mining Techniques: A Comprehensive Survey](https://www.researchgate.net/publication/360999317_Web_Page_Ranking_Using_Web_Mining_Techniques_A_Comprehensive_Survey). 5/2022. 36 | - Seonghwan Choi, Hyeondey Kim, Manjun Gim. [Do Not Read the Same News! Enhancing Diversity and Personalization of News Recommendation](https://dl.acm.org/doi/10.1145/3487553.3524936). 4/2022. 37 | 38 | ## 2021 39 | - Yiding Liu, Weixue Lu, Suqi Cheng, Daiting Shi, Shuaiqiang Wang, Zhicong Cheng, Dawei Yin. [Pre-trained Language Model for Web-scale Retrieval in Baidu Search](https://www.researchgate.net/publication/352209105_Pre-trained_Language_Model_for_Web-scale_Retrieval_in_Baidu_Search). 6/2021. 40 | - Anton Oleinik. [Relevance in Web search: between content, authority and popularity](https://www.researchgate.net/publication/349706191_Relevance_in_Web_search_between_content_authority_and_popularity). 3/2021. 41 | 42 | ## 2020 43 | - N. Mehala, Divyansh Bhatia. [A Concept-Based Approach for Generating Better Topics for Web Search Results](https://www.researchgate.net/publication/344268302_A_Concept-Based_Approach_for_Generating_Better_Topics_for_Web_Search_Results). 9/2020. 44 | 45 | 46 | # Older 47 | - Anlei Dong, Yi Chang, Zhaohui Zheng, Gilad Mishne, Jing Bai, Ruqiang Zhang, Karolina Buchner, Ciya Liao, Fernando Diaz. [Towards recency ranking in web search](https://dl.acm.org/doi/10.1145/1718487.1718490). 2/2010. 48 | - Ron Bekkerman, Shlomo Zilberstein, James Allan. [Web Page Clustering Using Heuristic Search in the Web Graph](https://www.researchgate.net/publication/220812214_Web_Page_Clustering_Using_Heuristic_Search_in_the_Web_Graph). 1/2007.
-------------------------------------------------------------------------------- /CommonCrawl.md: -------------------------------------------------------------------------------- 1 | # Common Crawl 2 | Common Crawl is a non-profit organization that maintains a large index of the web that is updated on a bi-monthly basis and freely available. 3 | 4 | ## General 5 | - Official Site: https://commoncrawl.org/ 6 | - Common Crawl Index Server: https://index.commoncrawl.org/ 7 | - GitHub Repositories: https://github.com/commoncrawl - A few of the repositories are listed below, but there are many more. 8 | - [Common Crawl WARC Examples](https://github.com/commoncrawl/cc-warc-examples) - "This repository contains both wrappers for processing WARC files in Hadoop MapReduce jobs and also Hadoop examples to get you started." 9 | - [Jupyter Notebooks to Analyze Common Crawl Data](https://github.com/commoncrawl/cc-notebooks) - This includes several different notebooks, some may be especially interested in [running a notebook on AWS EMR](https://github.com/commoncrawl/cc-notebooks/blob/main/cc-emr-notebook/cluster_setup.md). 10 | - [Common Crawl PySpark Examples](https://github.com/commoncrawl/cc-pyspark) - "This project provides examples [of] how to process the Common Crawl dataset with Apache Spark and Python". 11 | - [Common Crawl Index Server](https://github.com/commoncrawl/cc-index-server) - "This project is a deployment of the pywb web archive replay and index server to provide an index query mechanism for datasets provided by Common Crawl". 12 | 13 | ## Tooling 14 | - [cdx_toolkit](https://github.com/cocrawler/cdx_toolkit) - Star: 127 - Updated: 3/2022 - Checked: 5/2023 - "a set of tools for working with CDX indices of web crawls and archives, including those at CommonCrawl and the Internet Archive's Wayback Machine." 
15 | - rokasramas' [fork of comcrawl](https://github.com/rokasramas/comcrawl) - Stars: 0 - Updated: 4/2020 - Checked: 5/2023 - Includes a fix that hasn't been applied to the [original comcrawl library](https://github.com/michaelharms/comcrawl/) that allows it to work. 16 | - [getallurls](https://github.com/lc/gau) - Stars: 2.8k - Updated: 2/2023 - Checked: 5/2023 - Can fetch URLs from Common Crawl as well as Open Threat Exchange, the Wayback Machine, and URLScan. 17 | - [CommonCrawlDocumentDownload](https://github.com/centic9/CommonCrawlDocumentDownload) - Stars: 50 - Updated: 4/2023 - Checked: 5/2023 - Downloads documents by file/mime type from CC. 18 | - [WARCannon](https://github.com/c6fc/warcannon) - Stars: 212 - Updated: 9/2022 - Checked: 5/2023 - Uses AWS to search Common Crawl data at scale with regex patterns. 19 | 20 | ## Other 21 | - [NewsFetch](https://newsfetch.tech/) - Stars: 13 - Updated: 10/2022 - Checked: 5/2023 - Can fetch news articles from the Common Crawl API. 22 | - [news-please](https://github.com/fhamborg/news-please) - Stars: 1.6k - Updated: 4/2023 - Checked: 5/2023 - Along with significant other functionality it can fetch articles from Common Crawl. 23 | - [PWA Store](https://github.com/Tarasa24/PWA-Store) - Stars: 5 - Updated: 9/2022 - Checked: 5/2023 - Uses Common Crawl and EMR to find as many PWA apps on the web as possible. 24 | 25 | ## What Is? 26 | - C4 Dataset - Text data extracted from Common Crawl. 27 | - https://github.com/shjwudp/c4-dataset-script 28 | - [CDX](https://github.com/webrecorder/pywb/wiki/CDX-Index-Format) - Capture/Crawl inDeX - Standard index format for WARCs. 29 | 30 | ## Tutorials 31 | 32 | ### General 33 | - Edward Ross. [CommonCrawl Category](https://skeptric.com/#category=commoncrawl). skeptric. 34 | - Ross has published a number of well-written articles on Common Crawl. A great place to start if you are looking to go through the basics and beyond.
35 | - [Searching 100 Billion Webpages With Capture Index](https://skeptric.com/searching-100b-pages-cdx/). 6/2020. 36 | - Explains how to use the web interface (slow) as well as the CDX Toolkit, comcrawl, and directly in Python without using a custom CommonCrawl library. Unfortunately both comcrawl and the CDX Toolkit require some tweaks to get running. 37 | - [Read Common Crawl Parquet Metadata with Python](https://skeptric.com/reading-parquet-metadata/). 4/2022. 38 | - Covers reading Parquet metadata using PyArrow, fastparquet, manually (in Python), and using asyncio to speed things up. 39 | - [CommonCrawl.org So you're ready to get started](https://commoncrawl.org/the-data/get-started/). 40 | - Covers a lot of ground, perhaps not the best for true beginners. Covers data locations, file formats (WARC, WAT, WET), indexes, as well as processing the files. 41 | - [CommonCrawl.org Examples using Common Crawl Data](https://commoncrawl.org/the-data/examples/). 42 | - Unfortunately the vast majority of the examples available here are quite old. 43 | 44 | ### AWS Athena 45 | - Sebastian Nagel. [Index to WARC Files and URLs in Columnar Format](https://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/). commoncrawl, 3/2018. 46 | - Stanislas Girard. [Parse Petabytes of data from CommonCrawl in seconds](https://www.primates.dev/parse-petabytes-of-data-from-commoncrawl-in-seconds/). primates.dev, 1/2020. 47 | - Simple and straightforward, short, fairly basic, but good place to start. 48 | - Athul Jayson. [Extracting Data from Common Crawl Dataset](https://blog.qburst.com/2020/07/extracting-data-from-common-crawl-dataset/). qburst, 7/2020. 49 | - Also has an associated GitHub repository. 50 | - Ryan Elkins. [Search the html across 25 billion websites for passive reconnaissance using common crawl](https://medium.com/@brevityinmotion/search-the-html-across-25-billion-websites-for-passive-reconnaissance-using-common-crawl-7fe109250b83). 7/2020.
51 | - While written from a security perspective it provides solid guidance to using AWS Athena with Common Crawl. It also utilizes Amazon SageMaker, S3, and AWS IAM. There is an associated repo. 52 | 53 | ### AWS EMR 54 | - Basil Latif. [Measuring Internet Links: Accessing the Common Crawl Dataset Using EMR and Pyspark in AWS](https://basil-latif.medium.com/measuring-internet-links-accessing-the-common-crawl-dataset-using-emr-and-pyspark-in-aws-fcf5eb26afd9). 6/2020. 55 | - [Common Crawl EMR Tutorial](https://github.com/haydenhw/commoncrawl-emr-tutorial) - Stars: 9 - Updated: 3/2021 - Checked: 5/2023 - "This guide walks you through submitting a Scala Spark application to EMR that queries 500k job urls from Common Crawl and saves the results to an S3 bucket in CSV format." 56 | 57 | ### AWS Lambda 58 | - Chris Madden, Aaron Bawcom. [Analyzing Performance and Cost of Large-Scale Data Processing with AWS Lambda](https://aws.amazon.com/blogs/apn/analyzing-performance-and-cost-of-large-scale-data-processing-with-aws-lambda/). 6/2019. 59 | - Covers the high-level process with associated GitHub repository. 60 | - Jader Dias. [One-click to download all the web pages you may want](https://medium.com/@jaderd/one-click-to-download-exactly-the-web-pages-you-may-want-no-matter-how-many-they-are-d4834265a0a3). 6/2022. 61 | - Builds on using Athena to get data from Common Crawl and AWS Lambda to download it. 62 | 63 | ### Snowflake 64 | - Venkat Sekar. [Querying TB sized External Tables with Snowflake](https://medium.com/snowflake/querying-tb-sized-external-tables-with-snowflake-5ab14e807d3). 2/1/2022. 65 | 66 | 67 | ### Basic 68 | - David Mackey. [Basic Information About CommonCrawl](common-crawl/basic-info-common-crawl.md). 5/2023. 69 | - David Mackey. [How To Manually Access CommonCrawl](common-crawl/basic-manually-accessing-common-crawl.md). 5/2023. 70 | - Chillar Anand. 
[Common Crawl On Laptop - Extracting Subset of Data](https://avilpage.com/2022/11/common-crawl-laptop-extract-subset.html). 11/2022. 71 | 72 | ### Other 73 | - Colin Dellow. [S3 Throughput: Scans vs Indexes](https://code402.com/blog/s3-scans-vs-index/). 2/2020. 74 | - Is it faster to scan entire WARC files or to pull just the data required from each WARC file using the index? -------------------------------------------------------------------------------- /research/uncategorized-research.md: -------------------------------------------------------------------------------- 1 | # Uncategorized Research on Search Engines and Information Retrieval 2 | 3 | ## 2023 4 | - Christopher Akiki, Odunayo Ogundepo, Aleksandra Piktus, Xinyu Zhang, Akintunde Oladipo, Jimmy Lin, Martin Potthast. [Spacerini: Plug-and-play Search Engines with Pyserini and Hugging Face](https://www.researchgate.net/publication/368877450_Spacerini_Plug-and-play_Search_Engines_with_Pyserini_and_Hugging_Face). 2/2023. 5 | - Xinyuan Chang. [The Analysis of Open Source Search Engines](https://www.researchgate.net/publication/368910014_The_Analysis_of_Open_Source_Search_Engines). 2/2023.* 6 | 7 | ## 2022 8 | - Ali Abbasov, Vagif Gasimov. [Domain-oriented information search on the Internet](https://www.researchgate.net/publication/367004891_Domain-oriented_information_search_on_the_Internet). 11/2022. 9 | - Fenil Kaneria, Shafaq Khan, Nishara Nizamuddin. [Swift Search An open-source search engine](https://www.researchgate.net/publication/362646181_Swift_Search_An_open-source_search_engine). 11/2022. 10 | - Lingjun Xu, Shiyin Zhang, Guojie Song, Junshan Wang, Tianshu Wu, Guojun Liu. [Taxonomy-Enhanced Graph Neural Networks](https://dl.acm.org/doi/10.1145/3511808.3557467). 10/2022. 11 | - Masaki Suzuki, Yusuke Yamamoto.
[Don't Judge by Looks: Search User Interface to Make Searchers Reflect on Their Relevance Criteria and Promote Content-Quality-Oriented Web Searches](https://dl.acm.org/doi/10.1145/3524458.3547222). 9/2022. 12 | - Martha Viviana Zuluaga, Sebastián Robledo, Oscar Arbelaez-Echeverri, Néstor Duque, Germán A. Osorio-Zuluaga. [Tree of Science - ToS: A Web-Based Tool for Scientific Literature Recommendation. Search Less, Research More!](https://www.researchgate.net/publication/362728432_Tree_of_Science_-_ToS_A_Web-Based_Tool_for_Scientific_Literature_Recommendation_Search_Less_Research_More). 8/2022. 13 | - Gaurav Gupta, Tharun Medini, Anshumali Shrivastava, Alexander J. Smola. [BLISS: A Billion scale Index using Iterative Re-partitioning](https://dl.acm.org/doi/10.1145/3534678.3539414). 8/2022.* 14 | - Birgit Pfitzmann, Christoph Auer, Michele Dolfi, Ahmed S. Nassar, Peter Staar. [DocLayNet: A Large Human-Annotated Dataset for Document-Layout Segmentation](https://dl.acm.org/doi/10.1145/3534678.3539043). 8/2022. 15 | - Sean Zhang, Varun Ursekar, Leman Akoglu. [Sparx: Distributed Outlier Detection at Scale](https://dl.acm.org/doi/10.1145/3534678.3539076). 8/2022. 16 | - Aleksandra Urman, Mykola Makhortykh, Roberto Ulloa, Juhi Kulshrestha. [Where the earth is flat and 9/11 is an inside job: A comparative algorithm audit of conspiratorial information in web search results](https://www.researchgate.net/publication/361863464_Where_the_earth_is_flat_and_911_is_an_inside_job_A_comparative_algorithm_audit_of_conspiratorial_information_in_web_search_results). 8/2022.* 17 | - Chenxu Zhu, Peng Du, Xianghui Zhu, Weinan Zhang, Yong Yu, Yang Cao. [User-tag Profile Modeling in Recommendation System via Contrast Weighted Tag Masking](https://dl.acm.org/doi/10.1145/3534678.3539102). 8/2022. 18 | - Marwah Alaofi, Luke Gallagher, Dana Mckay, Lauren L. Saling, Mark Sanderson, Falk Scholer, Damiano Spina, Ryen W. White.
[Where Do Queries Come From?](https://dl.acm.org/doi/10.1145/3477495.3531711). 7/2022. 19 | - Yu Guo, Zhengyi Ma, Jiaxin Mao, Hongjin Qian, Xinyu Zhang, Hao Jiang, Zhao Cao, Zhicheng Dou. [Webformer: Pre-training with Web Pages for Information Retrieval](https://dl.acm.org/doi/10.1145/3477495.3532086). 7/2022. 20 | - Arnold Overwijk, Chenyan Xiong, Jamie Callan. [ClueWeb22: 10 Billion Documents with Rich Information](https://dl.acm.org/doi/10.1145/3477495.3536321). 7/2022. 21 | - Connor Lennox, Sumanta Kashyapi, Pooja Oza, Ben Gamari. [Wikimarks: Harvesting Relevance Benchmarks from Wikipedia](https://dl.acm.org/doi/10.1145/3477495.3531731). 7/2022. 22 | - Zelong Li, Jianchao Ji, Yingqiang Ge, Yongfeng Zhang. [AutoLossGen: Automatic Loss Function Generation for Recommender Systems](https://dl.acm.org/doi/10.1145/3477495.3531731). 7/2022.* 23 | - Thomas Grubb, Bill Anderson, Omar Alonso. [On Reliability Scores for Knowledge Graphs](https://dl.acm.org/doi/10.1145/3487553.3524212). 4/2022. 24 | - Mustafa Abualsaud, Mark Smucker. [The Dark Side of Relevance: The Effect of Non-Relevant Results on Search Behavior](https://dl.acm.org/doi/10.1145/3498366.3505770). 3/2022. 25 | - Chirag Shah, Emily M. Bender. [Situating Search](https://dl.acm.org/doi/10.1145/3498366.3505816). 3/2022.* 26 | 27 | ## 2021 28 | - Stefan Voigt, Tobias Hecking, Dennis Jankowski, Julius Moller, Maximilian Schwinger. [Open Search @ DLR - towards transparent access to web-based information in science](https://www.researchgate.net/publication/356602703_Open_Search_DLR_-_towards_transparent_access_to_web-based_information_in_science). 1/2021. 29 | 30 | ## 2020 31 | - Mario Kubek. [Contemporary Web Search](https://www.researchgate.net/publication/333931478_Contemporary_Web_Search). 1/2020. 32 | - Hitweshwar Kumar Azad, Akshay Deepak, Kumar Abhishek. [Query Expansion for Improving Web Search](https://www.researchgate.net/publication/339480386_Query_Expansion_for_Improving_Web_Search). 1/2020.
33 | 34 | ## 2019 35 | - Rashmi P Sarode, Shelly Sachdeva, Wanming Chu, Subhash Bhalla. [Segment-Search vs Knowledge Graphs: Making a Key-Word Search Engine for Web Documents](https://www.researchgate.net/publication/337923115_Segment-Search_vs_Knowledge_Graphs_Making_a_Key-Word_Search_Engine_for_Web_Documents). 12/2019. 36 | - Peilu Wang, Hao Jiang, Jingfang Xu, Qi Zhang. [Knowledge Graph Construction and Applications for Web Search and Beyond](https://www.researchgate.net/publication/336978553_Knowledge_Graph_Construction_and_Applications_for_Web_Search_and_Beyond). 11/2019.* 37 | - Dan Brickley, Matthew Burgess, Natasha Noy. [Google Dataset Search: Building a search engine for datasets in an open Web ecosystem](https://www.researchgate.net/publication/333067368_Google_Dataset_Search_Building_a_search_engine_for_datasets_in_an_open_Web_ecosystem). 5/2019. 38 | 39 | ## 2016 40 | - Weize Kong. [Extending Faceted Search to the Open-Domain Web](https://www.researchgate.net/publication/304618602_Extending_Faceted_Search_to_the_Open-Domain_Web). 6/2016.* 41 | - Bosubabu Sambana. [Web Search Engine](https://www.researchgate.net/publication/336265320_Web_Search_Engine). 3/2016. 42 | 43 | ## 2015 44 | - Evi Yulianti. [Finding Answers in Web Search](https://www.researchgate.net/publication/283659235_Finding_Answers_in_Web_Search). 8/2015. 45 | - Aleksandr Chuklin, Ilya Markov, Maarten de Rijke. [Click Models for Web Search](https://www.researchgate.net/publication/282201593_Click_Models_for_Web_Search). 7/2015. 46 | - Sonali Tanaji Kadam, Sanchika Bajpai. [Development of Web Annotation Technique for Search Result Records Using Web Database.](https://www.researchgate.net/publication/283779983_Development_of_Web_Annotation_Technique_for_Search_Result_Records_Using_Web_Database). 7/2015. 47 | 48 | ## 2014 49 | - Weize Kong, James Allan. [Extending Faceted Search to the General Web](https://www.researchgate.net/publication/284346690_Extending_Faceted_Search_to_the_General_Web).
11/2014.* 50 | - Guillem Frances, Xiao Bai, Berkant Barla Cambazoglu, Ricardo Baeza-Yates. [Improving the efficiency of multi-site web search engines](https://www.researchgate.net/publication/262172401_Improving_the_efficiency_of_multi-site_web_search_engines). 2/2014.* 51 | 52 | ## 2011 53 | - Yue Wang, Hongsong Li, Haixun Wang, Kenny Q. Zhu. [Toward Topic Search on the Web](https://www.researchgate.net/publication/255563891_Toward_Topic_Search_on_the_Web). 5/2011. 54 | 55 | ## 2010 56 | - Michael Zimmer. [Web Search Studies: Multidisciplinary Perspectives on Web Search Engines](https://www.researchgate.net/publication/226672921_Web_Search_Studies_Multidisciplinary_Perspectives_on_Web_Search_Engines). 6/2010. 57 | 58 | ## 2004 59 | - Deng Cai, Shipeng Yu, Ji-Rong Wen, Wei-Ying Ma. [Block-based web search](https://www.researchgate.net/publication/221301159_Block-based_web_search). 7/2004. -------------------------------------------------------------------------------- /specific-engines/opensearch.md: -------------------------------------------------------------------------------- 1 | # OpenSearch 2 | - Official Website: https://opensearch.org/ 3 | - Forums: https://forum.opensearch.org/ 4 | - An open source fork of Elasticsearch started by Amazon. 5 | 6 | ## Some Basics 7 | - Built on Apache Lucene 8 | - Node(s) - A server that contains data and responds to search queries. 9 | - Cluster(s) - A group of nodes that work together to store and search data. 10 | - Index / Indices - A collection of documents. 11 | - Mapping(s) - The collection of fields that documents in an index have. 12 | - Setting(s) - Configurations for an index. 13 | - Shard(s) - A portion of an index. 14 | - Shards are split evenly across the nodes in a cluster. 15 | - Each shard is a full Lucene index. 16 | - Rule of Thumb: Shards should be 10-50 GB each. 17 | - Primary Shard - The node is responsible for the shard.
18 | - Replica Shard - The node acts as a backup and can take some of the read load off the primary shard. 19 | - REST API - OpenSearch provides a REST API for interacting with the server using HTTP requests. 20 | 21 | ``` 22 | // Add a JSON doc to index 23 | PUT https://://_doc/ 24 | { 25 | "title": "The Wind Rises", 26 | "release_date": "2013-07-20" 27 | } 28 | 29 | // Search for a document 30 | GET https://://_search?q=wind 31 | 32 | // Delete a document 33 | DELETE https://://_doc/ 34 | ``` 35 | 36 | ## Quickstart 37 | 1. You'll need Docker installed. 38 | 2. There are several settings OpenSearch recommends tweaking on the host machine. I'm holding off on tweaking these in a dev environment. 39 | 3. Download the Docker Compose config file: `curl -O https://raw.githubusercontent.com/opensearch-project/documentation-website/2.6/assets/examples/docker-compose.yml` 40 | 4. Start the cluster: `docker-compose up -d` 41 | 5. Query the API: `curl https://localhost:9200 -ku admin:admin` 42 | - `-k` (or `--insecure`) - Skip TLS certificate verification (since it's using demo certs). 43 | - `-u` - Allows one to provide username and password (`admin:admin`). 44 | 6. Explore OpenSearch Dashboards: http://localhost:5601 (use same user/pass as above). 45 | - Click Explore on My Own. 46 | - Select tenant (Global or Private is fine). 47 | 7. Go to Management -> Dev Tools and perform a query (the default query shows all results). 48 | - Can paste queries in cURL format and they'll be converted to Console syntax. 49 | - Green triangle runs query. 50 | - Keyboard shortcuts are under Help. 51 | 52 | ## Configuring OpenSearch 53 | - Most changes can be accomplished using the cluster settings API. 54 | - A few changes need to be made by modifying `opensearch.yml` and restarting the cluster; prefer the API whenever possible. 55 | - Note: `opensearch.yml` applies settings to the local node, while the API applies to all nodes in the cluster.
56 | - One can also pass settings as `-E` arguments when launching OpenSearch like so: 57 | `./opensearch -Ecluster.name=opensearch-cluster -Enode.name=opensearch-node1 -Ehttp.host=0.0.0.0 -Ediscovery.type=single-node` 58 | 59 | ### Using Cluster Settings API 60 | - View current settings: `GET _cluster/settings?include_defaults=true` 61 | - Non-default settings only: `GET _cluster/settings` 62 | - The types of settings, in order of precedence: 63 | 1. Transient (cleared after restart) 64 | 2. Persistent 65 | 3. opensearch.yml 66 | 4. Default 67 | - To change settings, specify the setting and whether it is persistent or transient: 68 | ``` 69 | PUT _cluster/settings 70 | { 71 | "persistent": { 72 | "action.auto_create_index": "false" 73 | } 74 | } 75 | ``` 76 | - Can also copy and paste from GET response and change the existing values: 77 | ``` 78 | PUT _cluster/settings 79 | { 80 | "persistent": { 81 | "action": { 82 | "auto_create_index": false 83 | } 84 | } 85 | } 86 | ``` 87 | 88 | ### Using opensearch.yml 89 | - Docker: `/usr/share/opensearch/config/opensearch.yml` 90 | - Linux: `/etc/opensearch/opensearch.yml` 91 | - Example Settings: 92 | ``` 93 | cluster.name: my-application 94 | action.auto_create_index: true 95 | compatibility.override_main_response_version: true 96 | ``` 97 | - To allow a client app on a different domain to connect to OpenSearch: 98 | ``` 99 | http.host: 0.0.0.0 100 | http.port: 9200 101 | http.cors.allow-origin: "http://localhost" 102 | http.cors.enabled: true 103 | http.cors.allow-headers: X-Requested-With,X-Auth-Token,Content-Type,Content-Length,Authorization 104 | http.cors.allow-credentials: true 105 | ``` 106 | 107 | ## Plugins 108 | - One can use the `opensearch-plugin` command to list, install, and remove plugins. 109 | - If using OpenSearch in Docker, plugins must be managed by modifying the Docker image.
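
Because plugins are installed per node, it can be useful to confirm that every node reports the same set. A minimal sketch follows, assuming the plain-text response of `GET _cat/plugins` has three whitespace-separated columns (node name, plugin name, version) and no header row (that is, no `?v`). The function names and the sample data are hypothetical and are not part of OpenSearch.

```python
from collections import defaultdict


def parse_cat_plugins(text):
    """Group `_cat/plugins` text output into {node: {plugin: version}}.

    Assumes three whitespace-separated columns per row and no header row.
    """
    plugins_by_node = defaultdict(dict)
    for line in text.strip().splitlines():
        parts = line.split()
        if len(parts) < 3:
            continue  # skip blank or malformed rows
        node, plugin, version = parts[0], parts[1], parts[2]
        plugins_by_node[node][plugin] = version
    return dict(plugins_by_node)


def nodes_missing(plugins_by_node, plugin):
    """Return the sorted names of nodes that do not report `plugin`."""
    return sorted(
        node for node, plugins in plugins_by_node.items() if plugin not in plugins
    )


# Hypothetical sample response; node names and versions are made up.
SAMPLE = """
opensearch-node1 opensearch-knn 2.6.0.0
opensearch-node1 opensearch-security 2.6.0.0
opensearch-node2 opensearch-security 2.6.0.0
"""

if __name__ == "__main__":
    parsed = parse_cat_plugins(SAMPLE)
    print(nodes_missing(parsed, "opensearch-knn"))
```

One limitation of this sketch: a node with no plugins at all produces no rows, so it never appears in the parsed result. Comparing against a node list from another source would be needed to catch that case.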
110 | 111 | ### List 112 | - `bin/opensearch-plugin list` 113 | - Or use CAT API: `GET _cat/plugins` 114 | 115 | ### Install 116 | - By Name: `bin/opensearch-plugin install ` 117 | - From Zip: `bin/opensearch-plugin install ` (must use HTTP(S), for local files use `file://`) 118 | - Using Maven Coordinates: `bin/opensearch-plugin install ::` 119 | 120 | ### Remove 121 | - `bin/opensearch-plugin remove ` 122 | 123 | ### Restart 124 | - Restart the node after installing or removing a plugin. 125 | 126 | ### Batch Mode 127 | - To skip confirmation prompts when installing plugins: `bin/opensearch-plugin install --batch ` 128 | 129 | ### Bundled Plugins 130 | - Alerting - `opensearch-alerting` 131 | - Anomaly Detection - `opensearch-anomaly-detection` 132 | - Asynchronous Search - `opensearch-asynchronous-search` 133 | - Cross Cluster Replication - `opensearch-cross-cluster-replication` 134 | - Notifications - `notifications` 135 | - Reports Scheduler - `opensearch-reports-scheduler` 136 | - Geospatial - `opensearch-geospatial` 137 | - Index Management - `opensearch-index-management` 138 | - Job Scheduler - `opensearch-job-scheduler` 139 | - k-NN - `opensearch-knn` 140 | - ML Commons - `opensearch-ml` 141 | - Neural Search - `neural-search` 142 | - [Neural Search GitHub Repo](https://github.com/opensearch-project/neural-search) 143 | - Observability - `opensearch-observability` 144 | - Notebooks (`opensearch-notebooks`) has been merged into Observability. 145 | - Performance Analyzer - `opensearch-performance-analyzer` 146 | - Not available on Windows. 147 | - Security - `opensearch-security` 148 | - Security Analytics - `opensearch-security-analytics` 149 | - SQL - `opensearch-sql` 150 | 151 | ### Additional Plugins 152 | - These are available for install using `bin/opensearch-plugin install `; additional ones are also available outside of OpenSearch's GitHub.
153 | - `analysis-icu` 154 | - `analysis-kuromoji` 155 | - `analysis-nori` 156 | - `analysis-phonetic` 157 | - `analysis-smartcn` 158 | - `analysis-stempel` 159 | - `analysis-ukrainian` 160 | - `discovery-azure-classic` 161 | - `discovery-ec2` 162 | - `discovery-gce` 163 | - `ingest-attachment` 164 | - `mapper-annotated-text` 165 | - `mapper-murmur3` 166 | - `mapper-size` 167 | - `repository-azure` 168 | - `repository-gcs` 169 | - `repository-hdfs` 170 | - `repository-s3` 171 | - `store-smb` 172 | - `transport-nio` 173 | 174 | ## Data Prepper 175 | - "Data Prepper is a component of the OpenSearch project that accepts, filters, transforms, enriches, and routes data at scale." 176 | - [GitHub Repo](https://github.com/opensearch-project/data-prepper) 177 | 178 | ## Companies Working with OpenSearch 179 | - [Sematext](https://sematext.com/) 180 | 181 | ## Bibliography 182 | - https://opensearch.org/docs/latest/about/ 183 | - https://opensearch.org/docs/latest/quickstart/ 184 | - https://opensearch.org/docs/latest/install-and-configure/configuration/ 185 | - https://opensearch.org/docs/latest/install-and-configure/plugins/ -------------------------------------------------------------------------------- /specific-engines/solr/solr-terminology.md: -------------------------------------------------------------------------------- 1 | ## Terminology 2 | 3 | ### Concepts 4 | - Atomic Updates - "An approach to updating only one or more fields of a document, instead of reindexing the entire document." 5 | - Boolean operators - "control the inclusion or exclusion of keywords in a query by using operators such as AND, OR, and NOT." 6 | - Clustering (of Results) - "groups search results by similarities discovered when a search is executed, rather than when content is indexed." 7 | - "It can reveal unexpected commonalities among search results" 8 | - Document - "A group of fields and their values. Documents are the basic unit of data in a collection." 
9 | - "basic unit of information is a document, which is a set of data that describes something." 10 | - Facet / Faceting - "The arrangement of search results into categories based on indexed terms." 11 | - Facet Constraint - A specific facet value within a category that further constrains the results. 12 | - Field - "The content to be indexed/searched along with metadata defining how the content should be processed by Solr." 13 | - "documents are composed of fields, which are more specific pieces of information." 14 | - Fields can be of different types - e.g. text, number. 15 | - Index - where Solr stores all of the data. 16 | - Indexing - adding data to Solr. 17 | - Inverse Document Frequency (IDF) - "A measure of the general importance of a term. It is calculated as the number of total Documents divided by the number of Documents that a particular word occurs in the collection." 18 | - Inverted Index - "A way of creating a searchable index that lists every word and the documents that contain those words, similar to an index in the back of a book which lists words and the pages on which they can be found." 19 | - Metadata - "Literally, data about data. Metadata is information about a document, such as its title, author, or location." 20 | - Natural Language Query - "A search that is entered as a user would normally speak or write". 21 | - Precision - "the percentage of documents in the returned results that are relevant." 22 | - Query - asking a question of the data using Solr. 23 | - Query Parser - "processes the terms entered by a user." 24 | - Recall - "the percentage of relevant results returned out of all relevant results in the system." 25 | - "The ability of a search engine to retrieve all of the possible matches to a user’s query." 26 | - Relevance - "the degree to which a query response satisfies a user who is searching for information." 27 | - "The appropriateness of a document to the search conducted by the user."
28 | - Stopwords - "Generally, words that have little meaning to a user’s search but which may have been entered as part of a natural language query." 29 | - "Stopwords are generally very small pronouns, conjunctions and prepositions (such as, "the", "with", or "and")" 30 | - Synonyms - "Synonyms generally are terms which are near to each other in meaning and may substitute for one another. In a search engine implementation, synonyms may be abbreviations as well as words, or terms that are not consistently hyphenated." 31 | - Term Frequency - "The number of times a word occurs in a given document." 32 | - Wildcard - "A wildcard allows a substitution of one or more letters of a word to account for possible variations in spelling or tenses." 33 | - Zookeeper - "The system used by SolrCloud to keep track of configuration files and node names for a cluster. A ZooKeeper cluster is used as the central configuration store for the cluster, a coordinator for operations requiring distributed synchronization, and the system of record for cluster topology." 34 | 35 | ### Infrastructure 36 | - Cluster - "a set of Solr nodes operating in coordination with each other via ZooKeeper, and managed as a unit." 37 | - Collection - "one or more Documents grouped together in a single logical index using a single configuration and Schema." 38 | - A similar concept is Cores in single-node installations and user-managed clusters. 39 | - Core - "An individual Solr instance (represents a logical index). Multiple cores can run on a single node." 40 | - Compare to Collection above. 41 | - Ensemble - "A ZooKeeper term to indicate multiple ZooKeeper instances running simultaneously and in coordination with each other for fault tolerance." 42 | - Leader - "A single Replica for each Shard that takes charge of coordinating index updates (document additions or deletions) to other replicas in the same shard."
43 | - "This is a transient responsibility assigned to a node via an election, if the current Shard Leader goes down, a new node will automatically be elected to take its place." 44 | - Node - "A JVM instance running Solr. Also known as a Solr server." 45 | - [Operator](https://solr.apache.org/operator/) - "[B]uilt to reliably manage Apache Solr on Kubernetes." 46 | - Overseer - "A single node in SolrCloud that is responsible for processing and coordinating actions involving the entire cluster. It keeps track of the state of existing nodes, collections, shards, and replicas, and assigns new replicas to nodes." 47 | - "This is a transient responsibility assigned to a node via an election, if the current Overseer goes down, a new node will be automatically elected to take its place." 48 | - Replica - "A Core that acts as a physical copy of a Shard." 49 | - Replication - "A method of copying a leader index from one server to one or more 'follower' or 'child' servers." 50 | - SolrCloud - "Umbrella term for a suite of functionality in Solr which allows managing a Cluster of Solr Nodes for scalability, fault tolerance, and high availability." 51 | 52 | ### Other 53 | - Common Query Parameters - Parameters that are "accepted by all query parsers." 54 | - Distributed Search - "queries are processed across more than one Shard." 55 | - Field Analysis - "tells Solr what to do with incoming data when building an index." 56 | - "A more accurate name for this process would be processing or even digestion, but the official name is analysis." 57 | - Filter 58 | - Filter Query - "a filter query runs a query against the entire index and caches the results...the strategic use of filter queries can improve search performance." 59 | - Operates on data already existing in the index. 60 | - Analysis Filter - Operates on data being ingested. 61 | - MoreLikeThis - "enables users to submit new queries that focus on particular terms returned in an earlier query." 
62 | - `maxDoc` - The number of documents in the index including those which have been logically but not physically deleted. 63 | - `numDocs` - The "number of searchable documents in the index." 64 | - Some files may contain multiple documents, e.g. XML, JSON, or CSV. In this case the `numDocs` will be greater than the number of files indexed. 65 | - Request Handler (RequestHandler) - receives and processes requests. 66 | - "Logic and configuration parameters that tell Solr how to handle incoming "requests", whether the requests are to return search results, to index documents, or to handle other custom situations." 67 | - SearchComponent - "Logic and configuration parameters used by request handlers to process query requests." 68 | - "Examples of search components include faceting, highlighting, and "more like this" functionality." 69 | - Response Writer - "manages the final presentation of the query response." 70 | - Solr is bundled with both XML and JSON response writers. 71 | - SolrConfig (solrconfig.xml) - "The Apache Solr configuration file. Defines indexing options, RequestHandlers, highlighting, spellchecking and various other configurations." 72 | - "The file, solrconfig.xml, is located in the Solr home conf directory." 73 | - `solr.home` - "the location under the main Solr installation where Solr's collections and their `conf` and `data` directories are stored." 74 | - Solr Schema (managed-schema.xml or schema.xml) - Defines how Solr builds indexes from data sent to it. 75 | - Stores information about the fields and data types. 76 | - Shard - "In SolrCloud, a logical partition of a single Collection." 77 | - "Every shard consists of at least one physical replica". 78 | 79 | # Bibliography / Resources 80 | - See Bibliography section of the main [Apache Solr document](../apache-solr.md).
-------------------------------------------------------------------------------- /research/annotated-collaborative-research.md: -------------------------------------------------------------------------------- 1 | ## 2009 2 | - Sharoda A. Paul, Meredith Ringel Morris. [CoSense: enhancing sensemaking for collaborative web search](https://dl.acm.org/doi/10.1145/1518701.1518974). 4/2009.* 3 | - DM: This article is interesting from the perspective of advanced collaborative, ongoing research uses - e.g., several individuals working together on surfacing the best results on a specific topic. 4 | - "Broadly speaking, sensemaking is *finding meaning* in a situation. In HCI, sensemaking refers to the cognitive act of understanding information [24]." - 1. 5 | - "One of the important problems facing HCI research today is the design of computer interfaces to enable us to make sense of the vast amounts of information we encounter every day [24]." - 1. 6 | - "One of the prominent methodologies in this thread of research is Dervin’s “Sense-making” [4]. Sense-making occurs when a person, embedded in a particular context and moving through time-space, experiences a “gap” in reality. The person bridges this gap by constructing bridges consisting of ideas, thoughts, emotions, feelings, and memories. In the education literature...sensemaking refers to how students derive meanings about their learning experiences and how they identify particular ideas as important [5]. Weick [22], has explored sensemaking in the context of organizations. According to Weick, people organize their world to make sense of ambiguous situations they encounter and enact this sense back into the world to make that world more orderly. In HCI, sensemaking has focused on how users understand complex information spaces [16]. When interacting with large amounts of information, people create representations to organize information in order to make sense of it.
Sensemaking is the process of encoding information into external representations to answer complex, task-specific questions." - 2. 7 | - "...several collaborative search tools have recently been proposed by the research community [7]...they tend to offer two classes of support proposed by Morris & Horvitz [12]: awareness features (e.g., sharing of group members’ query histories, browsing histories, and/or comments on results) and division of labor features (e.g., chat systems, the ability to manually divide search results or URLs among group members, and/or algorithmic techniques for modifying group members’ search results based on others’ actions)." - 2. 8 | - "The temporality of the search process was important for participants’ sensemaking. Many participants wanted to see a unified chronological ordering of all events in the search process. They wanted to see the complete information path that was followed by other group members, and hence would have liked SearchTogether to make the browsing (in addition to searching) behavior of others more visible." - 3. 9 | - "The concept of query evolution seemed important to participants’ sensemaking; that is, participants wanted to take others’ queries and build upon them." - 3. 10 | - DM: See article for concepts like action awareness, context awareness, query evolution, sensemaking handoffs. 11 | 12 | ## 2008 13 | - Jeremy Pickens, Gene Golovchinsky, Chirag Shah, Pernilla Qvarfordt, Maribeth Back. [Algorithmic mediation for collaborative exploratory search](https://dl.acm.org/doi/10.1145/1390334.1390389). 7/2008. 14 | - "Using our system, two or more users with a common information need search together, simultaneously. The collaborative system provides tools, user interfaces and, most importantly, algorithmically-mediated retrieval to focus, enhance and augment the team’s search and communication activities." - 1.
15 | - "Information seeking can be more effective as a collaboration than as a solitary activity: different people bring different perspectives, experiences, expertise, and vocabulary to the search process. A retrieval system that takes advantage of this breadth of experience should improve the quality of results obtained by its users [4]." - 1. 16 | - "In this work we explore the possibilities of synchronous, explicit, algorithmically-mediated collaboration for search tasks [10]." - 1. 17 | 18 | ## 2007 19 | - Athanasios Papagelis, Christos Zaroliagis. [Searchius: A Collaborative Search Engine](https://www.researchgate.net/publication/4282197_Searchius_A_Collaborative_Search_Engine). 10/2007. 20 | - "Searchius is a collaborative search engine that produces search results based solely on user provided web-related data. We discuss the architecture of the system and how it compares to current state-of-the-art search engines." - 1. 21 | - "URLs can be explicitly collected (e.g., bookmarks) or implicitly collected (e.g., web-browsing history). These collections of web-related data can be combined, without loosing [sp] their discrete nature, to produce a view of the web from the user perspective." - 1. 22 | - DM: It would be interesting to add to the ranking algorithm an analysis of sites visited versus sites bookmarked. e.g., it seems likely that sites visited once and never bookmarked across a large number of users are links of low quality. 23 | - "Our approach is based on the observation that the web users act as small crawlers seeking information on the web using various media, which they subsequently store and organize into tree-like structures inside their information spaces." - 1. 24 | - "...Searchius is not capital intensive, since it concentrates on a small portion of the data that typical search engines collect and analyze." - 1.
25 | - "To order pages by importance, Searchius uses an aggregation function based on the preference to pages by different users, thus avoiding the expensive iterative procedure of PageRank." - 1. 26 | - "Finally, the way people organize their bookmarks can be used to segment the URL space to relative sub-spaces. This property can be exploited to provide efficient solutions to other applications, including the construction of web catalogs and finding related URLs." - 1. 27 | - "Under the above context, the ranking of pages in Searchius is based on how many *different* users have voted for a specific page p. The total number of such votes is called the *UsersRank of page p*." - 2. 28 | - "The Noise Reducer uses heuristic filters to remove from the database low quality URLs." - 3. 29 | - "Searchius uses a similar approach. The IR score of a page is calculated based on the page title, URL and semantic tagging given by users. This semantic tagging is the exact analog to anchor links and can produce diverse descriptions of pages." - 3. 30 | - DM: The use of aggregate user provided titles, descriptions, and semantic tags instead of actual page content is an interesting idea. I would see this as additive, e.g., one could perform ranking on page rank, user rank, IR score, and user IR score. 31 | - "To cope with such problems, we introduce an aging procedure for the collected pages. At predefined time intervals we reduce the value of each page in the Searchius database. Since we use a simple value aggregation procedure to find the most important pages for a search query, the effect of old URL collections to searching results diminishes through time allowing more room for fresh data." - 4. 32 | - DM: This is an interesting approach to ensuring that newer pages have a chance to surface in the results while also allowing older pages to retain their authority if they continue to be widely utilized.
33 | - "The quality of page ordering will be benefited by implicit collection of URLs for three reasons. First, the pages we visit determine with high accuracy our current interests. If we can monitor the users browsing history then we can always produce a fresh view of how the web dynamics evolve. On the contrary, bookmark collections may be somewhat outdated. Second, the frequency of visiting specific pages gives a much better indication of our relative preference for specific sites than the bookmark collections. Third, the order in which we visit sites can also be used and exploited in many ways. For example, frequently subsequent sites in a browsing history can be linked as belonging to the same theme." - 4. 34 | - "When a user adds several pages under a folder he actually groups them by some sort of similarity. This infers that folders partition the URL-space to conceptually related sub-spaces." - 4. -------------------------------------------------------------------------------- /specific-engines/solr/basic-tutorial.md: -------------------------------------------------------------------------------- 1 | # Introduction 2 | This is meant to be a cheatsheet-like condensation of [the official Solr tutorials](https://solr.apache.org/guide/solr/latest/getting-started/solr-tutorial.html). See the Bibliography/Resources section for links to the individual tutorials utilized. 3 | 4 | In a few places I've included additional details I thought relevant, but overall this cheatsheet leaves out a lot of material; you may want to read the original tutorials first and use this for later reference. 5 | 6 | # Getting Solr 7 | - [Download the latest Solr](https://solr.apache.org/downloads.html).
8 | - Unzip the downloaded file: `tar -xzf solr-9.2.1.tgz` 9 | - Enter the extracted folder: `cd solr-9.2.1` 10 | 11 | # Launch Solr in SolrCloud Mode 12 | - `bin/solr start -c` 13 | - Add another Solr node to cluster: `bin/solr -c -z localhost:9983 -p 9984` 14 | - May need to increase open file limit to 65000. 15 | - May need to adjust available entropy. 16 | - Open Admin UI: http://localhost:8983/ 17 | 18 | # Create a Collection 19 | ``` 20 | curl --request POST \ 21 | --url http://localhost:8983/api/collections \ 22 | --header 'Content-Type: application/json' \ 23 | --data '{ 24 | "create": { 25 | "name": "techproducts", 26 | "numShards": 1, 27 | "replicationFactor": 1 28 | } 29 | }' 30 | ``` 31 | 32 | # Define a Schema 33 | ``` 34 | curl --request POST \ 35 | --url http://localhost:8983/api/collections/techproducts/schema \ 36 | --header 'Content-Type: application/json' \ 37 | --data '{ 38 | "add-field": [ 39 | {"name": "name", "type": "text_general", "multiValued": false}, 40 | {"name": "cat", "type": "string", "multiValued": true}, 41 | {"name": "manu", "type": "string"}, 42 | {"name": "features", "type": "text_general", "multiValued": true}, 43 | {"name": "weight", "type": "pfloat"}, 44 | {"name": "price", "type": "pfloat"}, 45 | {"name": "popularity", "type": "pint"}, 46 | {"name": "inStock", "type": "boolean", "stored": true}, 47 | {"name": "store", "type": "location"} 48 | ] 49 | }' 50 | ``` 51 | 52 | # Index Some Documents 53 | - Single Document: 54 | ``` 55 | curl --request POST \ 56 | --url 'http://localhost:8983/api/collections/techproducts/update' \ 57 | --header 'Content-Type: application/json' \ 58 | --data ' { 59 | "id" : "978-0641723445", 60 | "cat" : ["book","hardcover"], 61 | "name" : "The Lightning Thief", 62 | "author" : "Rick Riordan", 63 | "series_t" : "Percy Jackson and the Olympians", 64 | "sequence_i" : 1, 65 | "genre_s" : "fantasy", 66 | "inStock" : true, 67 | "price" : 12.50, 68 | "pages_i" : 384 69 | }' 70 | ``` 71 | - Multiple 
Documents: 72 | ``` 73 | curl --request POST \ 74 | --url 'http://localhost:8983/api/collections/techproducts/update' \ 75 | --header 'Content-Type: application/json' \ 76 | --data ' [ 77 | { 78 | "id" : "978-0641723445", 79 | "cat" : ["book","hardcover"], 80 | "name" : "The Lightning Thief", 81 | "author" : "Rick Riordan", 82 | "series_t" : "Percy Jackson and the Olympians", 83 | "sequence_i" : 1, 84 | "genre_s" : "fantasy", 85 | "inStock" : true, 86 | "price" : 12.50, 87 | "pages_i" : 384 88 | } 89 | , 90 | { 91 | "id" : "978-1423103349", 92 | "cat" : ["book","paperback"], 93 | "name" : "The Sea of Monsters", 94 | "author" : "Rick Riordan", 95 | "series_t" : "Percy Jackson and the Olympians", 96 | "sequence_i" : 2, 97 | "genre_s" : "fantasy", 98 | "inStock" : true, 99 | "price" : 6.49, 100 | "pages_i" : 304 101 | } 102 | ]' 103 | ``` 104 | - A file containing the documents: 105 | - NOTE: This file does not exist, so this import will not work as-is. 106 | ``` 107 | curl -H "Content-Type: application/json" \ 108 | -X POST \ 109 | -d @example/products.json \ 110 | --url 'http://localhost:8983/api/collections/techproducts/update?commit=true' 111 | ``` 112 | 113 | # Commit the Changes 114 | - "After documents are indexed into a collection, they are not immediately available for searching. In order to have them searchable, a commit operation (also called refresh in other search engines like OpenSearch etc.) is needed. Commits can be scheduled at periodic intervals using auto-commits as follows." 
115 | - `curl -X POST -H 'Content-type: application/json' -d '{"set-property":{"updateHandler.autoCommit.maxTime":15000}}' http://localhost:8983/api/collections/techproducts/config` 116 | 117 | # Make Some Queries 118 | - `curl 'http://localhost:8983/solr/techproducts/select?q=name%3Alightning'` 119 | 120 | # Reset Solr to Original State 121 | - `bin/solr stop -all ; rm -Rf example/cloud` 122 | 123 | # Start Solr in SolrCloud Mode 124 | - `./bin/solr start -e cloud` 125 | - Set how many Solr nodes `2`. 126 | - Set the port for node1 to `8983`. 127 | - Set the port for node2 to `7574`. 128 | - The commands run by Solr are shown in the terminal and can be run in the future: 129 | - Start up node1: `bin/solr start -cloud -p 8983 -s "example/cloud/node1/solr"` 130 | - Start up node2: `bin/solr start -cloud -p 7574 -s "example/cloud/node2/solr" -z localhost:9983` 131 | - Create a collection `techproducts`. 132 | - Set the number of shards to `2`. 133 | - Set the number of replicas per shard to `2`. 134 | - Set the configuration to `sample_techproducts_configs`. 135 | 136 | # Index the Techproducts Data 137 | - `bin/post -c techproducts example/exampledocs/*` 138 | 139 | # Use the Solr Admin UI to Query 140 | - Open the Solr Admin UI in the browser: http://localhost:8983/ 141 | - On the left-hand side select the dropdown "Collection Selector" and choose "techproducts". 142 | - A new menu opens on the left-hand side under the Collection Selector, click on Query. 143 | - Click on Execute Query and you'll see the results of the query (ten documents in the collection). 144 | - Note the URL above the results, this can be used with `curl` or similar to make the same query. 145 | - Note that clicking on the URL above the results returns the raw response. 146 | - The URL should look something like this: `http://localhost:8983/solr/techproducts/select?indent=true&q.op=OR&q=*%3A*&useParams=` 147 | - The `q=` stands for query. 
148 | - The query `*:*` means match all documents in the collection. 149 | - Used as-is in curl, this returns a parse error, "Cannot parse ''*:*'':..." 150 | - If we use the URL (percent) encoding for the colon, `%3A`, then it works. 151 | 152 | # Returning Only Specific Fields in Response to a Query 153 | - In the Query UI search for 'foundation' 154 | - One can choose which fields are returned using the `fl` parameter, e.g.: 155 | `curl "http://localhost:8983/solr/techproducts/select?q=foundation&fl=id"` 156 | 157 | # Limit the Fields Searched for a Query 158 | - We can also limit the fields searched using a field name and the desired search query, e.g.: 159 | `curl "http://localhost:8983/solr/techproducts/select?q=cat:electronics"` 160 | 161 | # Searching for a Phrase 162 | - Enclose the phrase in double quotes, e.g.: 163 | `curl "http://localhost:8983/solr/techproducts/select?q=\"CAS+latency\""` 164 | - Note that we had to escape the inner set of quotes with backslashes; this wouldn't be necessary in the Admin Query UI. 165 | 166 | # Query on Exact Multiple Terms / Phrases 167 | - By default Solr requires only one term to be present in a document for it to be included in the results. To require multiple terms or specific phrases you can use `+` before the terms, e.g.: `+electronics +music`: 168 | `curl "http://localhost:8983/solr/techproducts/select?q=%2Belectronics%20%2Bmusic"` 169 | - Can exclude specific terms / phrases using a `-`, for example: 170 | `curl "http://localhost:8983/solr/techproducts/select?q=%2Belectronics+-music"` 171 | 172 | # Create a New Collection 173 | - `bin/solr create -c films -s 2 -rf 2` 174 | - Automatically utilizes the `_default` configset. 175 | - `-s` sets the number of shards for the collection. 176 | - `-rf` sets the number of replicas. 177 | - If we open the Solr Admin UI we can select the films collection from the Collection Selector dropdown as we selected the techproducts collection earlier.
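The query URLs used earlier in this tutorial can also be built programmatically, which avoids hand-escaping characters such as `:`, `+`, and spaces. The following is a minimal Python sketch, assuming the same local Solr node and the `techproducts` collection; `build_select_url` is a hypothetical helper written for this note, not part of Solr.

```python
from urllib.parse import quote, urlencode

# Assumption: the local Solr node started earlier in this tutorial.
SOLR_BASE = "http://localhost:8983/solr"


def build_select_url(collection, **params):
    """Return a /select URL with every parameter value percent-encoded."""
    # quote_via=quote encodes a space as %20 and a literal "+" as %2B,
    # so "+term" keeps its "required term" meaning once Solr decodes it.
    query = urlencode(params, quote_via=quote)
    return f"{SOLR_BASE}/{collection}/select?{query}"


if __name__ == "__main__":
    # Match-all query: the colon (and the asterisks) are percent-encoded.
    print(build_select_url("techproducts", q="*:*"))
    # Require "electronics", exclude "music", return only the id field.
    print(build_select_url("techproducts", q="+electronics -music", fl="id"))
```

The printed URLs can be passed to `curl` or any HTTP client in place of the hand-encoded ones.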
178 | 179 | # Working with the Schema API 180 | 181 | ## Creating the "names" Field 182 | - `curl -X POST -H 'Content-type:application/json' --data-binary '{"add-field": {"name":"name", "type":"text_general", "multiValued":false, "stored":true}}' http://localhost:8983/solr/films/schema` 183 | - This can also be accomplished with slightly less control using the Admin UI. 184 | 185 | ## Creating a "catchall" Copy Field 186 | - A catchall field is created "defining a copy field that will take all the data from all fields and index it into a field named `_text_`." 187 | - `curl -X POST -H 'Content-type:application/json' --data-binary '{"add-copy-field" : {"source":"*","dest":"_text_"}}' http://localhost:8983/solr/films/schema` 188 | - This can also be accomplished through the Admin UI. 189 | 190 | # Index the Film Data 191 | - Import JSON: `bin/post -c films example/films/films.json` 192 | - Or import XML: `bin/post -c films example/films/films.xml` 193 | - Or import CSV: `bin/post -c films example/films/films.csv -params "f.genre.split=true&f.directed_by.split=true&f.genre.separator=|&f.directed_by.separator=|"` 194 | 195 | # Query the Film Data 196 | - Open the Admin UI and run Query, you should see 1100 results in the `numFound` field of the `response`, the first ten of which will be displayed. 197 | 198 | # Using Faceting 199 | - "Faceting allows the search results to be arranged into subsets (or buckets, or categories)..." 200 | - Types of Faceting: 201 | - Field Values 202 | - Numeric and Date Ranges 203 | - Pivots (Decision Tree) 204 | - Arbitrary Query Faceting 205 | 206 | ## Field Facets 207 | - In the Admin UI Query tab check the facet checkbox to see facet-related options appear. 208 | - "To see facet counts from all documents (q=*:*): turn on faceting (facet=true), and specify the field to facet on via the facet.field parameter." 209 | - If you want a list of facets but don't want any of the details of the results you can use `rows=0`. 
210 | - Example in curl: `curl "http://localhost:8983/solr/films/select?q=*:*&rows=0&facet=true&facet.field=genre_str"` 211 | - One can use `facet.mincount` to only show facets with at least x documents in them. 212 | - Example in curl: `curl "http://localhost:8983/solr/films/select?q=*:*&facet.field=genre_str&facet.mincount=200&facet=on&rows=0"` 213 | 214 | ## Range Facets 215 | - Admin UI does not support range facet options. 216 | - curl: 217 | ``` 218 | curl 'http://localhost:8983/solr/films/select?q=*:*&rows=0'\ 219 | '&facet=true'\ 220 | '&facet.range=initial_release_date'\ 221 | '&facet.range.start=NOW/YEAR-25YEAR'\ 222 | '&facet.range.end=NOW'\ 223 | '&facet.range.gap=%2B1YEAR' 224 | ``` 225 | - The above counts films by release year, in one-year buckets starting 25 years ago and ending today. 226 | 227 | ## Pivot Facets 228 | - `curl "http://localhost:8983/solr/films/select?q=*:*&rows=0&facet=on&facet.pivot=genre_str,directed_by_str"` 229 | 230 | ## Remove Films Collection 231 | - If desired the films collection can be removed using: `bin/solr delete -c films` 232 | 233 | 234 | # Bibliography/Resources 235 | - https://solr.apache.org/guide/solr/latest/getting-started/solr-tutorial.html 236 | - https://solr.apache.org/guide/solr/latest/getting-started/tutorial-five-minutes.html 237 | - https://solr.apache.org/guide/solr/latest/getting-started/tutorial-techproducts.html 238 | - https://solr.apache.org/guide/solr/latest/getting-started/tutorial-films.html -------------------------------------------------------------------------------- /research/books-research.md: -------------------------------------------------------------------------------- 1 | - NOTE: There are currently 36 books in Springer's The Information Retrieval Series; not all are listed below.
2 | - NOTE: If you are interested in volumes published by Pearson, Manning, Apress, Packt, O'Reilly, Chapman Hall/CRC, Morgan Kaufmann, or Addison-Wesley, you may want to consider a subscription to [O'Reilly Learning](https://oreilly.com/learning) which includes access to a number of volumes from these publishers (check if your specific volumes are included). 3 | 4 | ### General Audience 5 | - Dirk Lewandowski. *Understanding Search Engines*. Springer, 3/2023. 307 pp. 6 | - Alexander Halavais. *Search Engine Society*. Digital Media and Society, 11/2017. 240 pp. 7 | - Ian H. Witten, Marco Gori, Teresa Numerico. *Web Dragons: Inside the Myths of Search Engine Technology*. Morgan Kaufmann, 7/2010. 8 | 9 | #### Ethics 10 | - Rosie Graham. *Investigating Google's Search Engine: Ethics, Algorithms, and the Machines Built to Read Us*. Bloomsbury Academic, 1/2023. 256 pp. 11 | - Safiya Umoja Noble. *Algorithms of Oppression: How Search Engines Reinforce Racism*. NYU Press, 2/2018. 248 pp. 12 | 13 | ### Core 14 | - W. Bruce Croft, Donald Metzler, Trevor Strohman. *Search Engines: Information Retrieval in Practice*. Pearson, 2/2009. 552 pp. 15 | - Available for free from the [University of Massachusetts](https://ciir.cs.umass.edu/irbook/). 16 | - Christopher D. Manning, Hinrich Schütze, Prabhakar Raghavan. *Introduction to Information Retrieval*. Cambridge University Press, 7/2008. 506 pp. 17 | - Available for free from [Stanford](https://nlp.stanford.edu/IR-book/information-retrieval-book.html). 18 | - Stefan Buttcher, Charles L. A. Clarke, Gordon V. Cormack. *Information Retrieval: Implementing and Evaluating Search Engines*. The MIT Press, 2/2016. 632 pp. 19 | - Significant portions of the 2010 edition of this book are [available for free from the official site](https://plg.uwaterloo.ca/~ir/ir/book/). There are 16 chapters in that edition with 7 available for free. 20 | - It appears that the 2016 edition is a reprint of the 2010 edition verbatim. 
21 | - Ricardo Baeza-Yates, Berthier Ribeiro-Neto. *Modern Information Retrieval: The Concepts and Technology Behind Search*, 2nd edition. Addison-Wesley Professional, 2/2011. 913 pp. 22 | - Chapters 1-2, 11, and 15 are available here: https://www.baeza.cl/mir2ed/contents.php.html. 23 | 24 | ### User Perspective 25 | - Karen Markey. *Online Searching: A Guide to Finding Quality Information Efficiently and Effectively*. Rowman & Littlefield Publishers, 3rd edition. 2/2023. 294 pp. 26 | - Focused on the user experience of searching, not building, but may be helpful, especially for those new to the field as it addresses some terminology and usability concerns. 27 | - Duncan O. Case. *Looking for Information: A Survey of Research on Information Seeking, Needs, and Behavior*, 4th edition. Emerald Publishing Limited, 4/2016. 528 pp. 28 | 29 | ### Practical 30 | - Jay M. Patel. *Getting Structured Data from the Internet: Running Web Crawlers/Scrapers on a Big Data Production Scale*. Apress, 11/2020. 420 pp. 31 | - Tommaso Teofili. *Deep Learning for Search*. Manning Publications, 6/2019. 328 pp. 32 | - John Berryman, Doug Turnbull. *Relevant Search: With applications for Solr and Elasticsearch*. Manning Publications, 6/2016. 360 pp. 33 | - Martin White. *Enterprise Search*, 2nd edition. O'Reilly Media, 10/2015. 310 pp. 34 | 35 | #### Lucene 36 | - Atri Sharma. Practical Lucene 8: Uncover the Search Capabilities of Your Application. Apress, 10/2020. 114 pp. 37 | - Edwood Ng, Vineeth Mohan. Lucene 4 Cookbook. Packt Publishing, 6/2015. 220 pp. 38 | - Michael McCandless, Erik Hatcher, Otis Gospodnetic. Lucene in Action, 2nd edition. Manning Publications, 7/2010. 475 pp. 39 | 40 | #### Solr 41 | - Dikshant Shahi. Apache Solr: A Practical Approach to Enterprise Search. Apress, 12/2015. 328 pp. 42 | - Xavier Morera. Apache Solr Succinctly. Syncfusion, 4/2015. 141 pp. 43 | - Available free from: https://www.syncfusion.com/succinctly-free-ebooks/apachesolr 44 | - Rafal Kuc. 
Solr Cookbook, 3rd edition. Packt Publishing, 1/2015. 356 pp. 45 | - Trey Grainger, Timothy Potter. Solr in Action. Manning Publications, 3/2014. 664 pp. 46 | 47 | #### Elasticsearch 48 | - Note: Some of these volumes address, to a greater or lesser extent, the Elastic / ELK Stack; this is still IR, but oriented more toward monitoring and logging than web search. 49 | - Madhusudhan Konda. Elasticsearch in Action, 2nd edition. Manning Publications, 10/2023. 592 pp. 50 | - Alberto Paro. Elasticsearch 8.x Cookbook, 5th edition. Packt Publishing, 5/2022. 750 pp. 51 | - Asjad Athick, Shay Banon. Getting Started with Elastic Stack 8.0. Packt Publishing, 3/2022. 474 pp. 52 | - Wai Tak Wong. Advanced Elasticsearch 7.0. Packt Publishing, 8/2019. 560 pp. 53 | - Pranav Shukla, Sharath Kumar M N. Learning Elastic Stack 7.0, 2nd edition. 5/2019. 474 pp. 54 | - Abhishek Andhavarapu. Learning Elasticsearch. Packt Publishing, 6/2017. 404 pp. 55 | - Bharvi Dixit. Mastering Elasticsearch 5.x, 3rd edition. 2/2017. 428 pp. 56 | - Bharvi Dixit. Elasticsearch Essentials. Packt Publishing, 1/2016. 240 pp. 57 | - Rafał Kuć, Marek Rogoziński. Mastering Elasticsearch, 2nd edition. Packt Publishing, 2/2015. 434 pp. 58 | - Clinton Gormley, Zachary Tong. Elasticsearch: The Definitive Guide. O'Reilly Media, 1/2015. 724 pp. 59 | - Rafał Kuć, Marek Rogoziński. Elasticsearch Server, 3rd edition. Packt Publishing, 2/2016. 556 pp. 60 | 61 | #### Spark 62 | - Alex Thomas. Natural Language Processing with Spark NLP. O'Reilly Media, 6/2020. 364 pp. 63 | 64 | #### Sphinx 65 | - Andrew Aksyonoff. Introduction to Search with Sphinx. O'Reilly Media, 4/2011. 148 pp. 66 | 67 | ### Information Architecture 68 | - Louis Rosenfeld, Peter Morville, Jorge Arango. Information Architecture: For the Web and Beyond, 4th edition. O'Reilly, 9/2015. 483 pp. 69 | - Gerald Kowalski. Information Retrieval Architecture and Algorithms. Springer, 2011. 70 | 71 | ### Collaborative 72 | - Chirag Shah.
Social Information Seeking: Leveraging the Wisdom of the Crowd. Springer, 7/2017. 204 pp. 73 | - Chirag Shah. Collaborative Information Seeking: The Art and Science of Making the Whole Greater Than the Sum of All. Springer, 8/2014. 206 pp. (Vol. 34) 74 | - Satnam Alag. Collective Intelligence in Action. Manning Publications, 9/2008. 425 pp. 75 | 76 | ### Springer The Information Retrieval Series 77 | - Jiqun Liu. A Behavioral Economics Approach to Interactive Information Retrieval. Springer, 2/2023. 400 pp. (Vol. 48) 78 | - Yi Chang, Hongbo Deng (editors). Query Understanding for Search Engines. Springer, 12/2021. 236 pp. (Vol. 46) 79 | - Jianfeng Gao, Chenyan Xiong, Paul Bennett, Nick Craswell. Neural Approaches to Conversational Information Retrieval. Springer, 3/2023. 405 pp. (Vol. 44) 80 | - Tetsuya Sakai, Douglas W. Oard, Noriko Kando. Evaluating Information Retrieval and Access Tasks. Springer, 9/2020. (Vol. 43) 81 | - Deepak P., et al. Data Science for Fake News. Springer, 4/2021. (Vol. 42) 82 | - Nicola Ferro, Carol Peters. Information Retrieval Evaluation in a Changing World: Lessons Learned from 20 Years of CLEF. Springer, 8/2019. (Vol. 41) 83 | - Tetsuya Sakai. Laboratory Experiments in Information Retrieval. Springer, 9/2018. (Vol. 40) 84 | - Krisztian Balog. Entity-Oriented Search. Springer, 10/2018. 370 pp. (Vol. 39) 85 | - Available for free for Amazon Kindle as well as at https://eos-book.org/. 86 | - Chirag Shah. Social Information Seeking. Springer, 7/2017. (Vol. 38) 87 | - Peter Knees, Markus Schedl. Music Similarity and Retrieval. Springer, 5/2016. 319 pp. (Vol. 36) 88 | - Massimo Melucci. Introduction to Information Retrieval and Quantum Mechanics. Springer, 12/2015. 250 pp. (Vol. 35) 89 | - Chirag Shah. Collaborative Information Seeking. Springer, 7/2012. (Vol. 34) 90 | - Donald Metzler. A Feature-Centric View of Information Retrieval. Springer, 9/2011. 334 pp. (Vol. 27) 91 | - Giovanni Maria Sacco, Yannis Tzitzikas.
Dynamic Taxonomies and Faceted Search: Theory, Practice, and Experience. Springer, 8/2009. 357 pp. (Vol. 25) 92 | 93 | ### Other Springer 94 | - Aidan Hogan. The Web of Data. Springer, 2020. 697 pp.* 95 | - C. Maria Keet. The What and How of Modelling Information and Knowledge: Mind Maps to Ontologies. Springer, 2023. 192 pp. 96 | - Juanzi Li, Guilin Qi, Dongyan Zhao, Wolfgang Nejdl, Hai-Tao Zheng, eds. Semantic Web and Web Science. Springer, 2013. 413 pp. 97 | 98 | ### Uncategorized 99 | - Anuradha D. Thakare, Shilpa Laddha, Ambika Pawar. Hybrid Intelligent Systems for Information Retrieval. Chapman Hall/CRC, 11/2022. 252 pp. 100 | - Jiawei Han, Jian Pei, Hanghang Tong. Data Mining: Concepts and Techniques, 4th edition. Morgan Kaufmann, 7/2022. 752 pp. 101 | - Nicola Tonellotto, Craig Macdonald, Iadh Ounis. Efficient Query Processing for Scalable Web Search. 6/2019. 132 pp. 102 | - Available for free from https://tonellotto.github.io/publication/fntir/fntir_main.pdf 103 | - Jutta Haider, Olof Sundin. Invisible Search and Online Search Engines. Routledge: Taylor & Francis Group, 2019. 151 pp. 104 | - Available for free from https://library.oapen.org/bitstream/handle/20.500.12657/51256/9780429828027.pdf 105 | - ChengXiang Zhai and Sean Massung. Text Data Management and Analysis: A Practical Introduction to Information Retrieval and Text Mining. ACM Books, 6/2016. 532 pp. 106 | - Bo Long, Yi Chang. Relevance Ranking for Vertical Search Engines. Morgan Kaufmann, 1/2014. 264 pp. 107 | - Gerald Kowalski. Information Retrieval Systems: Theory and Implementation. Springer, 3/2013. 300 pp. 108 | - Tyler Tate, Tony Russell-Rose. Designing the Search Experience: The Information Architecture of Discovery. Morgan Kaufmann, 1/2013. 320 pp. 109 | - Dirk Lewandowski, ed. Web Search Engine Research. Emerald Publishing Limited, 4/2012. 322 pp. 110 | - Giovanni Maria Sacco, Yannis Tzitzikas (editors). Dynamic Taxonomies and Faceted Search: Theory, Practice, and Experience.
Springer, 3/2012. 357 pp. 111 | - Amy N. Langville, Carl D. Meyer. Google's PageRank and Beyond: The Science of Search Engine Rankings. Princeton University Press, 2/2012. 240 pp. 112 | - Marcia J. Bates. Understanding Information Retrieval Systems: Management, Types, and Standards. Auerbach Publications, 12/2011. 752 pp. 113 | - Tie-Yan Liu. Learning to Rank for Information Retrieval. Springer, 4/2011. 302 pp. 114 | - Peter Morville, Jeffery Callender. Search Patterns: Design for Discovery. O'Reilly, 2/2010. 192 pp. 115 | - Maria Stone. Understanding and Evaluating Search Experience. Morgan & Claypool Publishers, 2020. 116 | - Irina Shamaeva, David Michael Galley. Custom Search - Discover more: A Complete Guide to Google Programmable Search. CRC Press, 2021. 184 pp. 117 | - Grace Hui Yang, Marc Sloan, Jun Wang. Dynamic Information Retrieval Modeling. Springer, 2016. 146 pp. 118 | - Wei Ding, Xia Lin. Information Architecture: The Design and Integration of Information Spaces. Springer, 2009. 149 pp. 119 | 120 | ### Older 121 | - Michael W. Berry, Murray Browne. Understanding Search Engines: Modeling and Text Retrieval, 2nd edition. Society for Industrial and Applied Mathematics, 5/2005. 184 pp. 122 | - David A. Grossman, Ophir Frieder. Information Retrieval: Algorithms and Heuristics. Springer, 10/2004. 352 pp. 123 | - Gerhard Weikum, Gottfried Vossen. Transactional Information Systems: Theory, Algorithms, and the Practice of Concurrency Control and Recovery. Morgan Kaufmann, 6/2001. 872 pp. 124 | - Richard K. Belew. Finding Out About: A Cognitive Perspective on Search Engine Technology and the WWW. Cambridge University Press, 1/2001. 384 pp. 125 | - Ian H. Witten, Alistair Moffat, Timothy C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images, 2nd edition. Morgan Kaufmann, 5/1999. 560 pp. 126 | - Karen Sparck Jones, Peter Willett (editors). Readings in Information Retrieval. Morgan Kaufmann, 7/1997. 587 pp.
127 | - William Frakes, Ricardo Baeza-Yates (editors). Information Retrieval: Data Structures & Algorithms. Pearson College Division, 1/1992. 464 pp. -------------------------------------------------------------------------------- /specific-engines/opensearch/opensearch-python-client.md: -------------------------------------------------------------------------------- 1 | # OpenSearch Python 2 | 3 | ## OpenSearch Python (opensearch-py) Client 4 | This section provides a brief overview of some portions of the OpenSearch Python client. See the [official documentation](https://opensearch-project.github.io/opensearch-py/index.html) for details. 5 | 6 | ## OpenSearch Client 7 | - `OpenSearch()` 8 | - `bulk()` 9 | - `count()` 10 | - `create()` - A document in the index. 11 | - `create_pit()` - Creates point in time context. 12 | - `delete()` - Remove a document from the index. 13 | - `delete_all_pits()` 14 | - `delete_by_query()` - Delete documents that match a specific query. 15 | - `delete_pit()` - Deletes one or more pits. 16 | - `delete_script()` 17 | - `exists()` - Whether a document exists in an index. 18 | - `exists_source()` 19 | - `explain()` - Why specific documents matched or did not match a query. 20 | - `field_caps()` 21 | - `get()` - Retrieve a document. 22 | - `get_all_pits()` 23 | - `get_script()` 24 | - `get_script_context()` 25 | - `get_script_languages()` 26 | - `get_source()` 27 | - `index()` - Creates or updates a document in an index. 28 | - `info()` - Returns info about cluster. 29 | - `mget()` - Retrieve multiple documents. 30 | - `msearch()` - Multiple search operations in a single request. 31 | - `msearch_template()` 32 | - `mtermvectors()` - Retrieve multiple term vectors in a single request. 33 | - `ping()` - Check if the cluster is up. 34 | - `put_script()` - Create or update a script. 35 | - `rank_eval()` - Evaluate the quality of ranked search results over a set of typical search queries.
36 | - `reindex()` - Reindex documents from one index to another. 37 | - `render_search_template()` - Use the Mustache language to pre-render a search definition. 38 | - `scripts_painless_execute()` 39 | - `scroll()` - Retrieve a large number of results from a single search request. 40 | - `search()` - Search for documents in one or more indices. 41 | - `search_shards()` - Returns info about the indices and shards a request would be executed against. 42 | - `search_template()` 43 | - `termvectors()` - Retrieve information and statistics about terms in the fields of a particular document. 44 | - `update()` - Update a document. 45 | - `update_by_query()` - Update documents that match a specific query. 46 | 47 | ### Http Client 48 | - `delete()`, `get()`, `head()`, `post()`, `put()` 49 | 50 | ### Compact and aligned text (CAT) Client 51 | - `aliases()` 52 | - `all_pit_segments()` 53 | - `allocation()` - How many shards are allocated to each data node and how much disk they are using. 54 | - `cluster_manager()` - Info about cluster manager node. 55 | - `count()` 56 | - `fielddata()` - How much heap memory is being used by fielddata on each data node. 57 | - `health()` 58 | - `help()` - Regarding the CAT APIs 59 | - `indices()` - Info about indices. 60 | - `nodeattrs()` - Info about custom node attributes. 61 | - `nodes()` - Info about nodes. 62 | - `pending_tasks()` - Info about pending tasks. 63 | - `pit_segments()` 64 | - `recovery()` - Info about shard recoveries.
65 | - `plugins()` 66 | - `recovery()` 67 | - `repositories()` 68 | - `segment_replication()` 69 | - `segments()` 70 | - `shards()` 71 | - `snapshots()` 72 | - `tasks()` 73 | - `templates()` 74 | - `thread_pool()` 75 | 76 | ### Cluster Client 77 | - `allocation_explain()` 78 | - `delete_component_template()` 79 | - `delete_weighted_routing()` 80 | - `exists_component_template()` 81 | - `get_component_template()` 82 | - `get_settings()` - Of a cluster 83 | - `get_weighted_routing()` 84 | - `health()` 85 | - `pending_tasks()` 86 | - `put_component_template()` 87 | - `put_settings()` 88 | - `put_weighted_routing()` 89 | - `remote_info()` 90 | - `reroute()` - Manually change allocation of individual shards in the cluster. 91 | - `state()` - Comprehensive info about state of cluster. 92 | - `stats()` - High-level overview of cluster stats. 93 | 94 | ### Dangling Indices Client 95 | - https://opensearch-project.github.io/opensearch-py/api-ref/clients/dangling_indices_client.html 96 | 97 | ### Ingest Client 98 | - `delete_pipeline()` 99 | - `get_pipeline()` 100 | - `processor_grok()` - Returns a list of the built-in patterns. 101 | - `put_pipeline()` 102 | - `simulate()` - Simulate a pipeline with example docs. 
103 | 104 | ### Indices Client 105 | - `add_block()` 106 | - `analyze()` 107 | - `clear_cache()` 108 | - `clone()` 109 | - `close()` 110 | - `create()` 111 | - `create_data_stream()` 112 | - `data_streams_stats()` 113 | - `delete()` 114 | - `delete_alias()` 115 | - `delete_data_stream()` 116 | - `delete_template()` 117 | - `exists()` - Whether a particular index exists 118 | - `exists_alias()` 119 | - `exists_template()` 120 | - `flush()` 121 | - `forcemerge()` - Of one or more indices 122 | - `get()` - Returns info about one or more indices 123 | - `get_alias()` 124 | - `get_data_stream()` 125 | - `get_field_mapping()` - Returns mapping for one or more fields 126 | - `get_mapping()` - Returns mapping for one or more indices 127 | - `get_settings()` - Returns settings for one or more indices 128 | - `get_template()` 129 | - `open()` - Opens an index 130 | - `put_alias()` - Creates or updates an alias 131 | - `put_mapping()` - Updates the index mappings 132 | - `put_settings()` 133 | - `put_template()` 134 | - `recovery()` - Info about shard recoveries 135 | - `refresh()` 136 | - `resolve_index()` - Resolves info about any matching indices, aliases, data streams. 137 | - `rollover()` - Updates an alias to point to a new index when old index is too large/old. 138 | - `segments()` - Low-level info about Lucene segments in one or more indices 139 | - `shard_stores()` 140 | - `shrink()` - Shrinks an existing index into a new index with fewer primary shards 141 | - `simulate_template()` 142 | - `split()` - Splits an existing index into a new index with more primary shards 143 | - `stats()` 144 | - `update_aliases()` 145 | - `validate_query()` - For validating a potentially expensive query without executing it. 
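A few of the calls listed above can be combined into a short round trip: check for an index, create it, add a document, search, and clean up. This is a minimal sketch rather than a recommended setup. It assumes an unsecured development node on `localhost:9200`, and the index name `films-demo`, its mapping, and the `ensure_index` helper are hypothetical names chosen for illustration.

```python
def ensure_index(client, name, body):
    """Create index `name` with `body` unless it already exists.

    Returns True if the index was created, False if it was already there.
    """
    if client.indices.exists(index=name):
        return False
    client.indices.create(index=name, body=body)
    return True


if __name__ == "__main__":
    # Imported here so the helper above can be used without the package.
    from opensearchpy import OpenSearch

    # Assumption: an unsecured development node on localhost:9200.
    client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

    # Hypothetical index name and mapping, for illustration only.
    index_name = "films-demo"
    index_body = {
        "settings": {"index": {"number_of_shards": 1}},
        "mappings": {"properties": {"name": {"type": "text"}}},
    }

    print("created:", ensure_index(client, index_name, index_body))

    # index() creates or updates a document; refresh=True makes it searchable now.
    client.index(index=index_name, id="1",
                 body={"name": "The Lightning Thief"}, refresh=True)

    # search() runs a match query against the new index.
    result = client.search(index=index_name,
                           body={"query": {"match": {"name": "lightning"}}})
    print(result["hits"]["total"])

    # Remove the demo index again.
    client.indices.delete(index=index_name)
```

Keeping the existence check in a small function that takes the client as an argument makes it easy to exercise without a running cluster.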
146 | 147 | ### Nodes Client 148 | - `hot_threads()` 149 | - `info()` 150 | - `reload_secure_settings()` 151 | - `stats()` 152 | - `usage()` - Low-level info about REST actions usage 153 | 154 | ### Remote Client 155 | 156 | ### Security Client 157 | - `change_password()` - For current user 158 | - `create_action_group()` - Create or replace 159 | - `create_role()` - Create or replace 160 | - `create_role_mapping()` - Create or replace 161 | - `create_tenant()` - Create or replace 162 | - `create_user()` - Create or replace 163 | - `delete_action_group()` 164 | - `delete_distinguished_names()` 165 | - `delete_role()` 166 | - `delete_role_mapping()` 167 | - `delete_tenant()` 168 | - `delete_user()` 169 | - `flush_cache()` - Flushes the Security plugin user, authentication, and authorization caches. 170 | - `get_account_details()` - For current user 171 | - `get_action_group()` 172 | - `get_action_groups()` 173 | - `get_audit_configuration()` 174 | - `get_certificates()` 175 | - `get_configuration()` 176 | - `get_distinguished_names()` 177 | - `get_role()` 178 | - `get_role_mapping()` 179 | - `get_role_mappings()` 180 | - `get_roles()` 181 | - `get_tenant()` 182 | - `get_tenants()` 183 | - `get_user()` 184 | - `get_users()` 185 | - `health()` 186 | - `patch_action_group()` - Updates individual attributes of an action group 187 | - `patch_action_groups()` 188 | - `patch_audit_configuration()` 189 | - `patch_configuration()` 190 | - `patch_distinguished_names()` 191 | - `patch_role()` 192 | - `patch_role_mapping()` 193 | - `patch_role_mappings()` 194 | - `patch_roles()` 195 | - `patch_tenant()` 196 | - `patch_tenants()` 197 | - `patch_user()` 198 | - `patch_users()` 199 | - `reload_http_certificates()` 200 | - `reload_transport_certificates()` 201 | - `update_audit_configuration()` 202 | - `update_configuration()` 203 | - `update_distinguished_names()` 204 | 205 | ### Snapshot Client 206 | - `cleanup_repository()` - Removes stale data 207 | - `clone()` - Clones indices 
from one snapshot to another in the same repository 208 | - `create()` - Creates a snapshot in the repository 209 | - `create_repository()` - Creates a repository 210 | - `delete()` - Deletes a snapshot 211 | - `delete_repository()` - Deletes a repository 212 | - `get()` - Returns info about a snapshot 213 | - `get_repository()` - Returns info about a repository 214 | - `restore()` - Restores a snapshot 215 | - `status()` - Returns info about the status of a snapshot 216 | - `verify_repository()` - Verifies that a repository is working correctly 217 | 218 | ### Tasks Client 219 | - `cancel()` 220 | - `get()` - Return info about a task 221 | - `list()` - Return info about all tasks 222 | 223 | ### Features Client 224 | - `get_features()` - List of features which can be included in snapshots using the feature_states field when creating a snapshot 225 | 226 | ### Exceptions 227 | - `AuthenticationException` - 401 228 | - `AuthorizationException` - 403 229 | - `ConflictError` - 409 230 | - `ConnectionError` 231 | - `ConnectionTimeout` 232 | - `ImproperlyConfigured` 233 | - `NotFoundError` - 404 234 | - `RequestError` - 400 235 | - `SerializationError` 236 | - `SSLError` 237 | - `TransportError` 238 | 239 | ### Helpers 240 | - `aggs.Agg()` 241 | - `to_dict()` 242 | - `analysis.Analyzer()` 243 | - `document.Document()` 244 | - `delete()` 245 | - `exists()` 246 | - `get()` 247 | - `init()` - Create an index and populate the mappings 248 | - `mget()` 249 | - `save()` - Create or overwrite a document 250 | - `search()` 251 | - `to_dict()` 252 | - `update()` 253 | - `faceted_search.FacetedSearch()` 254 | - `add_filter()` 255 | - `aggregate()` 256 | - `build_search()` - Construct the search object 257 | - `execute()` - Execute the search and return response 258 | - `filter()` - Add a `post_filter` that narrows results based on facet filters 259 | - `highlight()` - For all the fields 260 | - `query()` 261 | - `search()` - Returns the base Search object to which the facets are 
added 262 | - `sort()` 263 | - `field.Field()` 264 | - `to_dict()` 265 | - `function.ScoreFunction()` 266 | - `to_dict()` 267 | - `index.Index()` 268 | - `aliases()` - Add to the index definition 269 | - `analyze()` 270 | - `analyzer()` 271 | - `clear_cache()` 272 | - `clone()` 273 | - `close()` 274 | - `create()` 275 | - `delete()` 276 | - `delete_alias()` 277 | - `document()` - Associates a Document subclass with an index. 278 | - `exists()` 279 | - `exists_alias()` 280 | - `flush()` 281 | - `forcemerge()` 282 | - `get()` 283 | - `get_alias()` 284 | - `get_field_mapping()` - For a specific field 285 | - `get_mapping()` - For a specific type 286 | - `get_settings()` 287 | - `get_upgrade()` - How much of index is upgraded 288 | - `mapping()` - Associate a mapping with index 289 | - `open()` 290 | - `put_alias()` 291 | - `put_mapping()` 292 | - `put_settings()` 293 | - `recovery()` 294 | - `refresh()` 295 | - `save()` - Sync the index definition with opensearch, create index if it doesn't exist, update settings/mappings if it does 296 | - `search()` 297 | - `segments()` 298 | - `settings()` - Add settings 299 | - `shard_stores()` 300 | - `shrink()` 301 | - `stats()` 302 | - `updateByQuery()` 303 | - `upgrade()` 304 | - `validate_query()` 305 | - `Mapping` 306 | - `Query` 307 | - `search.Search()` 308 | - `__getitem__(n)` - slicing Search instance for pagination 309 | - `__iter__()` - over the hits 310 | - `_clone()` - Clone of current search request, performs shallow copy of underlying objects. 311 | - Used internally by most state modifying APIs. 312 | - `collapse()` 313 | - `count()` - Number of matching hits 314 | - `delete()` - Delegates to `delete_by_query()` 315 | - `execute()` - Executes the search, returns an instance of Response wrapping all the data. 
316 | - `from_dict()` - Constructs a new Search instance from a raw dict 317 | - `highlight()` - For some fields 318 | - `highlight_options()` - Set global options for current request 319 | - `response_class(cls)` - Override default wrapper used for the response 320 | - `scan()` - Returns a generator that will iterate over all the documents matching the query. 321 | - `script_field()` - Define script field to be calculated on hits 322 | - `sort()` 323 | - `source()` - Control how the `_source` field is returned. 324 | - `suggest()` - Add a suggestions request to the search 325 | - `to_dict()` 326 | - `update_from_dict()` - Apply options from a serialized body to the current instance. 327 | - Modifies object in place. 328 | - Used mostly by `from_dict()`. 329 | - `update_by_query.UpdateByQuery()` 330 | - `_clone()` - Clone of current update request, performs shallow copy of underlying objects. 331 | - `execute()` - Executes the update, returns an instance of Response wrapping all the data. 332 | - `from_dict()` - Constructs a new UpdateByQuery instance from a raw dict 333 | - `response_class(cls)` - Override default wrapper used for the response 334 | - `script()` - Define update action to take 335 | - Only accepts a single script 336 | - `to_dict()` 337 | - `update_from_dict()` - Apply options from a serialized body to the current instance. 338 | - Modifies object in place. 339 | - Used mostly by `from_dict()`. 340 | - `wrappers.Range()` 341 | 342 | ## Plugins 343 | 344 | ### Alerting Plugin 345 | 346 | ### Index Management Plugin 347 | 348 | ### Serializer 349 | 350 | ### Transport --------------------------------------------------------------------------------