├── research
├── books-history.md
├── books-nlp.md
├── books-information-science.md
├── books-library.md
├── crawling-research.md
├── trust-research.md
├── books-knowledge-management.md
├── decentralization-research.md
├── federation-research.md
├── recommendations-research.md
├── semantic-research.md
├── personalization-research.md
├── fairness-research.md
├── research-main.md
├── ranking-research.md
├── uncategorized-research.md
├── annotated-collaborative-research.md
└── books-research.md
├── specific-engines
├── opensearch
│ ├── opensearch-python.md
│ ├── opensearch-vector.md
│ └── opensearch-python-client.md
├── pg_search.md
├── vector.md
├── aws-opensearch
│ ├── aws-opensearch-misc.md
│ ├── aws-opensearch-dql.md
│ ├── aws-opensearch-serverless.md
│ ├── aws-opensearch-vector.md
│ ├── aws-opensearch-code.md
│ └── aws-opensearch-main.md
├── solr
│ ├── solr-resources-other.md
│ ├── solr-extend.md
│ ├── basic-admin-ui-tutorial.md
│ ├── solr-resources-used-by.md
│ ├── solr-notes.md
│ ├── basic-indexing-own-data-tutorial.md
│ ├── solr-resources-code.md
│ ├── solr-resources-ui.md
│ ├── solr-development.md
│ ├── solr-resources-interesting-old.md
│ ├── basic-solrcloud-tutorial.md
│ ├── solr-resources-app-framework-integrations.md
│ ├── solr-resources-utlities.md
│ ├── solr-terminology.md
│ └── basic-tutorial.md
├── elasticsearch
│ ├── elasticsearch-build-ui.md
│ ├── elasticsearch-ui.md
│ ├── elasticsearch-ingestion.md
│ └── elasticsearch-clients.md
├── apache-lucene.md
├── elasticsearch.md
├── yacy.md
├── apache-solr.md
└── opensearch.md
├── SearchFrontEnd.md
├── ToCategorize.md
├── front-end
├── ui-component-libraries-for-search.md
└── ui-components-of-search.md
├── LICENSE
├── Glossary.md
├── features
└── faceting.md
├── common-crawl
├── common-crawl-resources.md
├── basic-info-common-crawl.md
└── basic-manually-accessing-common-crawl.md
├── vector-search
└── vector-basics.md
├── BuildingSearchEngines.md
├── web-archiving
└── archiving-introduction.md
├── collaborative
└── README.md
├── WebCrawlers.md
├── OpenSourceSearchEngines.md
└── CommonCrawl.md
/research/books-history.md:
--------------------------------------------------------------------------------
1 | - Simon Winchester. Knowing What We Know: The Transmission of Knowledge: From Ancient Wisdom to Modern Magic. Harper, 4/2023. 431 pp.*
--------------------------------------------------------------------------------
/specific-engines/opensearch/opensearch-python.md:
--------------------------------------------------------------------------------
1 | - Nabila Abraham. [Semantic Search with OpenSearch and Cohere: A Comprehensive Demo](https://cohere.com/blog/semantic-search-open-search-demo). cohere, 6/2023.
--------------------------------------------------------------------------------
/specific-engines/pg_search.md:
--------------------------------------------------------------------------------
1 | - Ming Ying. [pg_search: Elastic-Quality Full Text Search Inside Postgres](https://blog.paradedb.com/pages/introducing_search). ParadeDB, 10/2023.
2 | - Introduces pg_search with a high-level overview.
--------------------------------------------------------------------------------
/specific-engines/vector.md:
--------------------------------------------------------------------------------
1 | - [Milvus](https://milvus.io/) - specifically built for similarity search
2 | - [Qdrant](https://qdrant.io/) - open source
3 | - [GitHub](https://github.com/qdrant/qdrant) - Stars: 10.1k - Updated: 5/2023 - Checked: 5/2023.
--------------------------------------------------------------------------------
/specific-engines/aws-opensearch/aws-opensearch-misc.md:
--------------------------------------------------------------------------------
1 | - Count API - Get the count of documents without getting the actual documents.
2 | - [ev2900's OpenSearch Audit Logs Repo](https://github.com/ev2900/OpenSearch_Audit_Logs). Updated: 3/2024. Checked: 3/2024.
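
As a rough illustration of the Count API note above, the sketch below builds (but does not send) a request to the `_count` endpoint using only the Python standard library. The host, index name, and query are hypothetical placeholders, and authentication (which a managed AWS domain would normally require) is deliberately left out.

```python
import json
from urllib.request import Request


def build_count_request(base_url, index, query=None):
    """Build a request for <base_url>/<index>/_count without sending it.

    base_url, index and query are illustrative values, not taken from the notes.
    """
    url = f"{base_url.rstrip('/')}/{index}/_count"
    if query is None:
        # No body: count every document in the index.
        return Request(url, method="GET")
    # With a body: count only the documents matching the query.
    body = json.dumps({"query": query}).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_count_request(
    "https://localhost:9200", "books", {"match": {"title": "search"}}
)
print(req.get_method(), req.full_url)
```

Sending the request (for example with `urllib.request.urlopen`) should return a small JSON document whose `count` field holds the number of matching documents, with no hits included.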
--------------------------------------------------------------------------------
/research/books-nlp.md:
--------------------------------------------------------------------------------
1 | - Paul Azunre. Transfer Learning for Natural Language Processing. Manning, 8/2021.
2 | - Hannes Hapke, Cole Howard, Hobson Lane. Natural Language Processing in Action: Understanding, Analyzing, and Generating Text in Python. Manning, 3/2019. 1114 pp.
--------------------------------------------------------------------------------
/research/books-information-science.md:
--------------------------------------------------------------------------------
1 | - David Bawden, Lyn Robinson. Introduction to Information Science, 2nd edition. Facet Publishing, 2/2022. 536 pp.
2 | - G. Edward Evans, Stacey Greenwell. Management Basics for Information Professionals, 4th edition. ALA Neal-Schuman, 1/2020. 352 pp.
--------------------------------------------------------------------------------
/specific-engines/solr/solr-resources-other.md:
--------------------------------------------------------------------------------
1 | - [Dual Indexing Neo4j and Solr for a Unified Platform](https://neo4j.com/blog/dual-indexing-neo4j-and-solr-for-a-unified-platform/). neo4j, 3/2021.
2 | - Written by Nathan Maynes who worked on implementing such a solution at Thomson Reuters.
--------------------------------------------------------------------------------
/SearchFrontEnd.md:
--------------------------------------------------------------------------------
1 | # Search on the Front-end
2 |
3 | ## Introduction
4 | This document covers resources available for building UIs for search engines.
5 |
6 |
7 | ## Contents
8 | - [Pre-Built Search Component Libraries](./front-end/ui-component-libraries-for-search.md)
9 | - [Understanding the UI Components of Search](./front-end/ui-components-of-search.md)
--------------------------------------------------------------------------------
/specific-engines/solr/solr-extend.md:
--------------------------------------------------------------------------------
1 | - [Solr Extension Directory](https://solr.cool/) - A directory of extensions available for Solr. - Checked: 2/2025.
2 | - [solr-compound-word-filter](https://github.com/redlink-gmbh/solr-compound-word-filter) - Stars: 3 - Updated: 5/2024 - Checked: 5/2024
3 | - Redlink version of `solr.HyphenationCompoundWordTokenFilterFactory` with fix for LUCENE-8183 and support for epenthesis parameter.
--------------------------------------------------------------------------------
/specific-engines/aws-opensearch/aws-opensearch-dql.md:
--------------------------------------------------------------------------------
1 | # Dashboards Query Language (DQL)
2 |
3 | Used within OpenSearch Dashboards. Consists of four primary query types:
4 | - Terms Query - Matches a specific term.
5 | - Boolean Query - Combine multiple queries using AND, OR, and NOT.
6 | - Date and Range Query - Matches documents within a specific date or range.
7 | - Nested Field Query - Retrieves specific portions of a document that contains nested fields.
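
To make the four query types above a little more concrete, here is a small sketch that collects one example DQL string per type. All field names and values (`title`, `author`, `page_views`, `published`, `authors.last_name`) are hypothetical, and the exact syntax should be checked against the DQL documentation for the OpenSearch Dashboards version in use.

```python
# Hypothetical example queries, one per DQL query type.
# These strings are meant to be typed into the Dashboards search bar;
# Python is used here only to keep them together in one place.
DQL_EXAMPLES = {
    # Terms query: match a specific term in a field.
    "terms": "title: search",
    # Boolean query: combine queries with AND, OR, and NOT.
    "boolean": 'title: search AND NOT author: "Jane Doe"',
    # Range query: match documents within a numeric range.
    "range": "page_views >= 100 AND page_views < 300",
    # Date query: match documents after a given date.
    "date": 'published > "2023-01-01"',
    # Nested field query: match on a field inside a nested object.
    "nested": "authors: {last_name: Doe}",
}

for kind, query in DQL_EXAMPLES.items():
    print(f"{kind:>8}: {query}")
```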
--------------------------------------------------------------------------------
/specific-engines/opensearch/opensearch-vector.md:
--------------------------------------------------------------------------------
1 | - Valentin Crettaz. [How to Set Up Vector Search in OpenSearch](https://opster.com/guides/opensearch/opensearch-machine-learning/how-to-set-up-vector-search-in-opensearch/). Opster, 10/2023.
2 | - Dylan Castillo. [Semantic Search with OpenSearch, Cohere, and FastAPI](https://dylancastillo.co/posts/semantic-search-with-opensearch-cohere-and-fastapi.html). 4/2023, updated: 7/2024.
3 | - Seems useful but written for OpenSearch 2.6 so doesn't take advantage of some of the latest vector features in OpenSearch.
--------------------------------------------------------------------------------
/specific-engines/elasticsearch/elasticsearch-build-ui.md:
--------------------------------------------------------------------------------
1 | - [Elastic UI Framework](https://github.com/elastic/eui) - Stars: 6k - Updated: 5/2024 - Checked: 5/2024.
2 | - "...a collection of React UI components for quickly building user interfaces at Elastic."
3 | - [Elastic Search UI](https://github.com/elastic/search-ui) - Stars: 1.9k - Updated: 5/2024 - Checked: 5/2024.
4 | - "A React-based library for building search user interfaces."
5 | - "A JavaScript library for the fast development of modern, engaging search experiences with Elastic."
6 | - Note: This is not React specific and can be used with Elasticsearch "or any other search API."
--------------------------------------------------------------------------------
/specific-engines/aws-opensearch/aws-opensearch-serverless.md:
--------------------------------------------------------------------------------
1 | # AWS OpenSearch Serverless
2 |
3 | We won't be discussing the Serverless option much at this time but will highlight here a few differences between this offering and the managed offering which is covered more extensively.
4 |
5 | ## Resources
6 | - [Amazon OpenSearch Serverless Developer Guide](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/serverless.html)
7 | - [Working with Vector Search Collections](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/serverless-vector-search.html)
8 | - [Build a search application with Amazon OpenSearch Serverless](https://aws.amazon.com/blogs/big-data/build-a-search-application-with-amazon-opensearch-serverless/). AWS Big Data Blog, 1/2023.
9 |
--------------------------------------------------------------------------------
/ToCategorize.md:
--------------------------------------------------------------------------------
1 | Wikimedia Foundation The Anatomy of Search Blog Series (Trey Jones)
2 | - [A Token of My Affection](https://wikimediafoundation.org/news/2018/08/07/anatomy-search-token-affection/). 8/2018.
3 | - On tokenization.
4 | - [Variation Under Nature](https://wikimediafoundation.org/news/2018/09/13/anatomy-search-variation-under-nature/). 9/2018.
5 | - On normalization.
6 | - [The Root of the Problem](https://wikimediafoundation.org/news/2018/11/28/anatomy-search-the-root-of-the-problem/). 11/2018.
7 | - On stemming.
8 | - [A Place for My Stuff](https://wikimediafoundation.org/news/2019/03/12/the-anatomy-of-search-a-place-for-my-stuff/). 3/2019.
9 | - On the inverted index.
10 | - [In Search Of...](https://wikimediafoundation.org/news/2019/09/05/the-anatomy-of-search-in-search-of/). 9/2019.
11 | - On querying/searching.
--------------------------------------------------------------------------------
/research/books-library.md:
--------------------------------------------------------------------------------
1 | - Peter Botticelli, Martha R. Mahard, Michele V. Cloonan. Libraries, Archives, and Museums Today: Insights from the Field. Rowman & Littlefield Publishers, 2/2019. 193 pp.*
2 | - Barbara B. Moran, Claudia J. Morner. Library and Information Center Management, 9th edition. Libraries Unlimited, 11/2017. 1006 pp.*
3 | - Lisa K. Hussey, Diane L. Velasquez. Library Management 101: A Practical Guide, 2nd edition. ALA Editions, 4/2019. 312 pp.
4 | - Bridgit McCafferty. Library Management: A Practical Guide for Librarians. Rowman & Littlefield Publishers, 5/2021. 169 pp.
5 | - Claire B. Joseph, Priscilla L. Stephenson, eds. Managing Health Sciences Libraries in a Time of Change. Rowman & Littlefield, 1/2024. 123 pp.*
6 | - Stacey Marien, ed. Library Technical Services: Adapting to a Changing Environment. Purdue University Press, 8/2020. 494 pp.
--------------------------------------------------------------------------------
/research/crawling-research.md:
--------------------------------------------------------------------------------
1 | # Crawling Research on Search Engines and Information Retrieval
2 |
3 | - Ajay Sudhir Bale, Naveen Ghorpade, S Kamalesh, et al. [Web Scraping Approaches and their Performance on Modern Websites](https://www.researchgate.net/publication/363669276_Web_Scraping_Approaches_and_their_Performance_on_Modern_Websites). 9/2022.
4 | - Jesse Sayles, Ryan P. Furey, Marilyn R. ten Brink. [How deep to dig: effects of web-scraping search depth on hyperlink network analysis of environmental stewardship organizations](https://www.researchgate.net/publication/361121252_How_deep_to_dig_effects_of_web-scraping_search_depth_on_hyperlink_network_analysis_of_environmental_stewardship_organizations). 6/2022.
5 | - Dvijesh Bhatt, Daiwat Amit Vyas, Sharnil Pandya. [Focused Web Crawler](https://www.researchgate.net/publication/344579272_Focused_Web_Crawler). 10/2020.
6 |
--------------------------------------------------------------------------------
/specific-engines/solr/basic-admin-ui-tutorial.md:
--------------------------------------------------------------------------------
1 | # Basic Solr Admin UI Tutorial
2 |
3 | ## Overview
4 | - Dashboard: http://hostname:8983/solr/
5 | - Navigation:
6 | - Logging Screen
7 | - Collections / Core Admin
8 | - Java Properties Screen
9 | - Dropdown: Collection Selector
10 | - Only available when running SolrCloud; otherwise the relevant options are available under Core Selector
11 | - Dropdown: Core Selector
12 | - Selecting a Core or Collection shows additional specific menu items
13 | - Login Screen
14 | - Getting Assistance
15 | - Documentation
16 | - Issue Tracker (JIRA)
17 | - IRC
18 | - Community Forum (mailing lists)
19 | - Solr Query Syntax
20 | - Security UI
21 | - Schema Designer
22 | - Only available when using SolrCloud
23 |
24 |
25 |
26 | ## Bibliography / Resources
27 | - https://solr.apache.org/guide/solr/latest/getting-started/solr-admin-ui.html
--------------------------------------------------------------------------------
/research/trust-research.md:
--------------------------------------------------------------------------------
1 | # Trust / Trustworthiness
2 |
3 | - Maarten de Rijke. [Beyond-Accuracy Goals, Again](https://dl.acm.org/doi/10.1145/3539597.3572332). 2/2023.
4 | - Markus Schedl, Emilia Gómez, Elisabeth Lex. [Trustworthy Algorithmic Ranking Systems.](https://dl.acm.org/doi/10.1145/3539597.3572723). 2/2023.*
5 | - Thomas Wadlow. [Who Must You Trust?: You must have some trust if you want to get anything done](https://dl.acm.org/doi/10.1145/2620660.2630691). 5/2014.
6 | - Dirk Lewandowski. [Credibility in Web Search Engines](https://www.researchgate.net/publication/230609381_Credibility_in_Web_Search_Engines). 8/2012.*
7 | - Yusuke Yamamoto, Katsumi Tanaka. [Enhancing Credibility Judgment of Web Search Results](https://www.researchgate.net/publication/221518035_Enhancing_Credibility_Judgment_of_Web_Search_Results). 5/2011.
8 | - Ken Thompson. [Reflections on trusting trust](https://dl.acm.org/doi/10.1145/358198.358210). 8/1984.
--------------------------------------------------------------------------------
/specific-engines/elasticsearch/elasticsearch-ui.md:
--------------------------------------------------------------------------------
1 | # Query Builders
2 | - [mirage](https://github.com/appbaseio/mirage) - Stars: 2.2k - Updated: 2019 - Checked: 5/2024.
3 | - Created by Appbase / ReactiveSearch.
4 | - "Mirage is a modern, open-source web based query explorer for Elasticsearch...It offers a blocks based GUI for composing Elasticsearch queries and comes with an on-the-fly transformer to show the corresponding JSON query API of Elasticsearch."
5 |
6 | # Cluster Management
7 | - [elasticvue](https://github.com/cars10/elasticvue) - Stars: 1.6k - Updated: 4/2024 - Checked: 5/2024
8 | - "Elasticsearch gui for the browser"
9 | - Available as both a desktop app and a browser extension.
10 | - [elasticsearch-comrade](https://github.com/moshe/elasticsearch-comrade) - Stars: 271 - Updated: 2022 - Checked: 5/2024
11 | - "Elasticsearch admin panel built for ops and monitoring...highly inspired by Cerebro."
--------------------------------------------------------------------------------
/research/books-knowledge-management.md:
--------------------------------------------------------------------------------
1 | - Manlio Del Giudice, Veronica Scuotto, Armando Papa. Knowledge Management and AI in Society 5.0. Routledge, 3/2023. 91 pp.
2 | - Donald Hislop, Rachelle Bosua, Remko Helms. Knowledge Management in Organizations: A Critical Introduction, 4th edition. Oxford University Press, 4/2018.*
3 | - Kimiz Dalkir. Knowledge Management in Theory and Practice. Routledge, 9/2013.
4 | - Jennifer A. Bartlett. Knowledge Management: A Practical Guide for Librarians. Rowman & Littlefield Publishers, 5/2021. 151 pp.*
5 | - Irma Becerra-Fernandez, Rajiv Sabherwal, Richard Kumi. Knowledge Management: Systems and Processes in the AI Era. Routledge, 2/2024. 388 pp.*
6 | - Klaus North, Gita Kumta. Knowledge Management: Value Creation Through Organizational Learning. Springer, 4/2018. 642 pp.
7 | - Olivier Serrat. Knowledge Solutions: Tools, Methods, and Approaches to Drive Organizational Performance. Springer, 5/2017. 1503 pp.
8 | - Available under CC license.
9 | - Patrick Lambe. Organising Knowledge: Taxonomies, Knowledge and Organisational Effectiveness. Chandos Publishing, 3/2007. 300 pp.
--------------------------------------------------------------------------------
/front-end/ui-component-libraries-for-search.md:
--------------------------------------------------------------------------------
1 | ## Pre-Built Search Component Libraries
2 |
3 | ### Reactive Search
4 | - https://www.reactivesearch.io/product/search-ui
5 | - GitHub: https://github.com/appbaseio/reactivesearch
6 | - Stars: 4.9k, Updated: 1/2024, Checked: 2/2024.
7 | - Supports Elasticsearch, OpenSearch, Solr, MongoDB.
8 | - Available for React, Vue, Vanilla JS.
9 | - Note that this requires one to host an open source backend ReactiveSearch provides
10 | or to use ReactiveSearch's cloud service.
11 |
12 | ### Instantsearch
13 | - https://www.algolia.com/doc/guides/building-search-ui/what-is-instantsearch/js/
14 | - GitHub: https://github.com/algolia/instantsearch
15 | - Stars: 3.5k, Updated: 3/2024, Checked: 3/2024.
16 | - While open source, by default it only supports Algolia. See Searchkit below for a wrapper
17 | around instantsearch that supports Elasticsearch and OpenSearch.
18 |
19 | ### Searchkit
20 | - https://www.searchkit.co/
21 | - GitHub: https://github.com/searchkit/searchkit
22 | - Stars: 4.7k, Updated: 2/2024, Checked: 2/2024.
23 | - Supports Elasticsearch and OpenSearch.
24 | - Built on top of Algolia's instantsearch.
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2023 Dave Mackey
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/research/decentralization-research.md:
--------------------------------------------------------------------------------
1 | # Decentralization in Search Engines and Information Retrieval
2 |
3 | - Mario Kubek, Herwig Unger. [WebEngine Version 1.0: Building a Decentralised Web Search Engine](https://www.researchgate.net/publication/357504886_WebEngine_Version_10_Building_a_Decentralised_Web_Search_Engine). 1/2022.
4 | - See also Kubek and Unger's [The WebEngine - A Fully Integrated, Decentralised Web Search Engine](https://www.researchgate.net/publication/329183704_The_WebEngine_-_A_Fully_Integrated_Decentralised_Web_Search_Engine) from 11/2018.
5 | - Thanassis Tiropanis, Alexandra Poulovassilis, Adriane Chapman, George Roussos. [Search in a Redecentralised Web](https://www.researchgate.net/publication/357026616_Search_in_a_Redecentralised_Web). 12/2021.*
6 | - Hongsheng Xu, Ganglong Fan, Ke Li. [Construction of Search Engine System Based on Multithread Distributed Web Crawler](https://www.researchgate.net/publication/333206095_Construction_of_Search_Engine_System_Based_on_Multithread_Distributed_Web_Crawler). 5/2019.
7 | - Reaz Ahmed, Md. Faizul Bari, Rakibul Haque, R. Boutaba, Bertrand Mathieu. [DEWS: A decentralized engine for Web search](https://www.researchgate.net/publication/282931363_DEWS_A_decentralized_engine_for_Web_search). 1/2015.
--------------------------------------------------------------------------------
/Glossary.md:
--------------------------------------------------------------------------------
1 | - Back-pressure
2 | - Cluster
3 | - Documents
4 | - Fields
5 | - Graph
6 | - Indexes
7 | - Indexing
8 | - Information Retrieval (IR)
9 | - k-NN
10 | - Knowledge Graph (KG)
11 | - Lucene
12 | - Mapping
13 | - [Named Entity Recognition (NER)](https://en.wikipedia.org/wiki/Named-entity_recognition) - "(also known as (named) entity identification, entity chunking, and entity extraction) is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc."
14 | - Natural Language Processing (NLP)
15 | - Neural Networks
16 | - Nodes
17 | - Pretrained Language Models (PLM)
18 | - Primary Shard
19 | - Replica Shard
20 | - Semantics
21 | - [Semantic Search](https://en.wikipedia.org/wiki/Semantic_search) - "Semantic search denotes search with meaning, as distinguished from lexical search where the search engine looks for literal matches of the query words or variants of them, without understanding the overall meaning of the query."
22 | - [Semantic Triple](https://en.wikipedia.org/wiki/Semantic_triple)
23 | - [Semantic Web](https://en.wikipedia.org/wiki/Semantic_Web)
24 | - Shards
--------------------------------------------------------------------------------
/features/faceting.md:
--------------------------------------------------------------------------------
1 | - [Wikipedia: Faceted Search](https://en.wikipedia.org/wiki/Faceted_search)
2 | - [Elastic Docs / App Search / Guides / Facets Guide](https://www.elastic.co/guide/en/app-search/current/facets-guide.html) - Fairly short and general introduction to facets.
3 | - [Elastic Docs / App Search / API Reference / Search API Facets](https://www.elastic.co/guide/en/app-search/current/facets.html)
4 | - [Elastic Search Labs: Faceted Search](https://www.elastic.co/search-labs/tutorials/search-tutorial/full-text-search/facets)
5 | - [Elastic Docs / App Search / Guides / Hierarchical Facets Guide](https://www.elastic.co/guide/en/app-search/current/hierarchical-facets-guide.html)
6 | - [OpenSearch Python Client: API Reference / helpers / faceted_search](https://opensearch-project.github.io/opensearch-py/api-ref/helpers/faceted_search.html)
7 | - [WPSOLR: What Are Search Facets, How To Use Them, And What Are Their Limitations?](https://www.wpsolr.com/what-are-search-facets-how-to-use-them-and-what-are-their-limitations/)
8 | - [Vinted Engineering: Faceted Search Using Elasticsearch](https://vinted.engineering/2023/03/21/faceted-search-using-elasticsearch/)
9 | - Interesting article on how Vinted implemented faceting with Elasticsearch and why they made specific decisions.
10 | - [StackOverflow: Custom Facets Using Elastic Search](https://stackoverflow.com/questions/60688229/custom-facets-using-elastic-search)
--------------------------------------------------------------------------------
/specific-engines/elasticsearch/elasticsearch-ingestion.md:
--------------------------------------------------------------------------------
1 | # Django
2 | - [django-elasticsearch-dsl](https://github.com/django-es/django-elasticsearch-dsl) - Stars: 1k - Updated: 9/2023 - Checked: 5/2024
3 | - "Django Elasticsearch DSL is a package that allows indexing of django models in elasticsearch. It is built as a thin wrapper around elasticsearch-dsl-py so you can use all the features developed by the elasticsearch-dsl-py team."
4 | - [elasticsearch-django](https://github.com/yunojuno/elasticsearch-django) - Stars: 74 - Updated: 11/2023 - Checked: 5/2024
5 | - "This is a lightweight Django app for people who are using Elasticsearch with Django, and want to manage their indexes."
6 |
7 | # Gmail
8 | - [elasticsearch-gmail](https://github.com/oliver006/elasticsearch-gmail) - Stars: 2k - Updated: 8/2023 - Checked: 5/2024
9 |
10 | # Multiple
11 | - [elasticsearch_loader](https://github.com/moshe/elasticsearch_loader) - Stars: 395 - Updated: 2022 - Checked: 5/2024
12 | - Batch loading data files (json, parquet, csv, tsv).
13 |
14 | # Mongoose (Mongo)
15 | - [mongoosastic](https://github.com/mongoosastic/mongoosastic) - Stars: 1.1k - Updated: 2022 - Checked: 5/2024.
16 |
17 | # Fluentd
18 | - [fluent-plugin-elasticsearch](https://github.com/uken/fluent-plugin-elasticsearch) - Stars: 885 - Updated: 1/2024 - Checked: 5/2024
19 | - "Send your logs to Elasticsearch (and search them with Kibana maybe?)"
--------------------------------------------------------------------------------
/common-crawl/common-crawl-resources.md:
--------------------------------------------------------------------------------
1 | ## Accessing the CommonCrawl Index - General
2 | - [Searching 100 Billion Webpages Pages With Capture Index](https://skeptric.com/searching-100b-pages-cdx/). skeptric, 6/2020.
3 | - Shows how to do so with CDX Toolkit, with the Capture Index API directly, and with comcrawl.
4 | - [Common Crawl on Laptop - Extracting Subset of Data](https://avilpage.com/2022/11/common-crawl-laptop-extract-subset.html). Avil Page, 11/2022.
5 |
6 | ## Understanding CommonCrawl WARC Files
7 | - [Extracting Text, Metadata and Data from Common Crawl](https://skeptric.com/text-meta-data-commoncrawl/). skeptric, 6/2020.
8 | - Covers WARC, WET, and WAT.
9 |
10 | ## Understanding CommonCrawl Parquet Files
11 | - [Read Common Crawl Parquet Metadata with Python](https://skeptric.com/reading-parquet-metadata/). skeptric, 4/2022.
12 | - Demonstrates PyArrow, fastparquet, manual methods.
13 |
14 | ## AWS Athena
15 | - [Common Crawl Index Athena](https://skeptric.com/common-crawl-index-athena/). skeptric, 6/2020.
16 |
17 | ## CDX Toolkit
18 | - [Extracting Job Ads from Common Crawl](https://skeptric.com/common-crawl-job-ads/). skeptric, 6/2020.
19 |
20 | ## SQL
21 | - [Accessing WARC files via SQL](https://digital.library.unt.edu/ark:/67531/metadc1608961/). UNT Digital Library, 2019.
22 |
23 | ## For Security Purposes
24 | - [All Around the World: The Common Crawl Dataset](https://labs.watchtowr.com/all-around-the-world-the-common-crawl-dataset/). watchtowr, 10/2022.
--------------------------------------------------------------------------------
/research/federation-research.md:
--------------------------------------------------------------------------------
1 | # Federation in Search Engines and Information Retrieval
2 |
3 | - Shuchang Liu, Yingqiang Ge, Shuyuan Xu, Yongfeng Zhang, Amelie Marian. [Fairness-aware Federated Matrix Factorization](https://dl.acm.org/doi/10.1145/3523227.3546771). 9/2022.
4 | - Qi Zhang, Tiancheng Wu, Peichen Zhou, Shan Zhou, Yuan Yang, Xiulang Jin. [Felicitas: Federated Learning in Distributed Cross Device Collaborative Frameworks](https://dl.acm.org/doi/10.1145/3534678.3539039). 8/2022.
5 | - Seok-Ju Hahn, Minwoo Jeong, Junghye Lee. [Connecting Low-Loss Subspace for Personalized Federated Learning](https://dl.acm.org/doi/10.1145/3534678.3539254). 8/2022.
6 | - Sen Cui, Jian Liang, Weishen Pan, Kun Chen, Changshui Zhang, Fei Wang. [Collaboration Equilibrium in Federated Learning](https://dl.acm.org/doi/10.1145/3534678.3539237). 8/2022.
7 | - Yaliang Li, Bolin Ding, Jingren Zhou. [A Practical Introduction to Federated Learning](https://dl.acm.org/doi/10.1145/3534678.3542631). 8/2022.
8 | - Minas Pergantis, Iraklis Varlamis, Andreas Giannakoulopoulos. [User Evaluation and Metrics Analysis of a Prototype Web-Based Federated Search Engine for Art and Cultural Heritage](https://www.researchgate.net/publication/361097561_User_Evaluation_and_Metrics_Analysis_of_a_Prototype_Web-Based_Federated_Search_Engine_for_Art_and_Cultural_Heritage). 6/2022.
9 | - Dong Nguyen, Thomas Demeester, Dolf Trieschnigg, Djoerd Hiemstra. [Resource Selection for Federated Search on the Web](https://www.researchgate.net/publication/308152481_Resource_Selection_for_Federated_Search_on_the_Web). 9/2016.*
10 |
--------------------------------------------------------------------------------
/specific-engines/solr/solr-resources-used-by.md:
--------------------------------------------------------------------------------
1 | ### Projects Using Solr
2 | - [Fulcrum (Heliotrope)](https://github.com/mlibrary/heliotrope) - Stars: 41 - Updated: 5/2023 - Checked: 5/2023
3 | - "a Samvera-based digital publishing platform built by the University of Michigan Library"
4 | - [FXDesktopSearch](https://github.com/mirkosertic/FXDesktopSearch) - Stars: 164 - Updated: 5/2023 - Checked: 5/2023
5 | - "a JavaFX based desktop search application"
6 | - [Hydroshare](https://github.com/hydroshare/hydroshare) - Stars: 166 - Updated: 5/2023 - Checked: 5/2023
7 | - "collaborative website for better access to data and models in the hydrologic sciences."
8 | - [Islandora](https://www.islandora.ca/)
9 | - "Open-source Digital Asset Management"
10 | - [MontySolr](https://github.com/adsabs/montysolr) - Stars: 50 - Updated: 12/2022 - Checked: 5/2023
11 | - "the search engine behind Astrophysics Data System (ADS 2.0)"
12 | - [National Information Exchange Model (NIEM) Movement](https://github.com/NIEM/movement-solr) - Stars: 3 - Updated: 2/2023 - Checked: 5/2023
13 | - [Nelmio](https://github.com/nelmio/NelmioSolariumBundle) - Stars: 152 - Updated: 4/2023 - Checked: 5/2023
14 | - [Samvera](https://github.com/samvera/hyrax) - Stars: 166 - Updated: 5/2023 - Checked: 5/2023
15 | - "provides a foundation for creating many different digital repository applications."
16 | - [netarchivesuite's SolrWayback](https://github.com/netarchivesuite/solrwayback) - Stars: 77 - Updated: 5/2023 - Checked: 5/2023
17 | - "A search interface and wayback machine for the UKWA Solr based warc-indexer framework"
18 | - [OpenSextant's Xponents](https://github.com/OpenSextant/Xponents) - Stars: 42 - Updated: 5/2023 - Checked: 5/2023
--------------------------------------------------------------------------------
/common-crawl/basic-info-common-crawl.md:
--------------------------------------------------------------------------------
1 | # Basic Information About Common Crawl
2 |
3 | ## Introduction
4 | Common Crawl is a non-profit organization (founded by Gil Elbaz) which for years has been crawling the web in a similar manner to major search engines such as Google and Bing. The data is then made available for [free and under a non-restrictive license](https://commoncrawl.org/terms-of-use/) to anyone who wishes to use it.
5 |
6 | Common Crawl has been around for eons in internet years (since 2008) and has a number of well-known individuals on its Board of Directors and Advisory Board.
7 |
8 | Amazingly, the CC is currently primarily maintained by a single individual - Sebastian Nagel.
9 |
10 | ## That Said...
11 | It does appear that CC has probably seen a decrease in funding over the years. The variety of voices on the blog has decreased markedly since ~2015, roughly when Lisa Green left her position as Director of Common Crawl (although she remains on the Board of Advisors), and more recently crawls have become a bi-monthly event (formerly monthly). My sincere appreciation to Nagel, who continues to carry the torch of an important project! I hope that funding and contributors will increase in the near future!
12 |
13 | ## Crawler
14 | The web crawler (CCBot) is built on the famous open source Apache Nutch engine and utilizes Apache Hadoop.
15 |
16 | ## Data
17 | The data is stored on Amazon Web Services' (AWS) S3 storage service. It can be accessed over HTTP (slow), using other AWS systems (e.g. EC2, Athena), and using the AWS CLI/SDK.
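
As a rough sketch of the access paths above, the same object path can be addressed either over HTTP or through S3. The base URL `https://data.commoncrawl.org/` and the bucket name `commoncrawl` are what Common Crawl publishes to the best of my knowledge; the helper name and the crawl ID in the example are only illustrative and do not come from these notes.

```python
# Public HTTP endpoint and S3 bucket for the dataset (assumed current values).
HTTP_BASE = "https://data.commoncrawl.org/"
S3_BASE = "s3://commoncrawl/"


def crawl_locations(path: str) -> dict:
    """Return the HTTP URL and the S3 URI for one object path in the dataset."""
    relative = path.lstrip("/")
    return {"http": HTTP_BASE + relative, "s3": S3_BASE + relative}


# "CC-MAIN-2023-23" is an illustrative crawl ID, not taken from these notes.
locations = crawl_locations("crawl-data/CC-MAIN-2023-23/warc.paths.gz")
print(locations["http"])
print(locations["s3"])
```

The HTTP form works with any plain download tool, while the S3 form is what the AWS CLI/SDK and services such as EC2 or Athena would use.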
18 |
19 | ## Bibliography
20 | - [Common Crawl](https://commoncrawl.org/)
21 | - [About](https://commoncrawl.org/about/)
22 | - [Our Team](https://commoncrawl.org/about/team/)
23 |
--------------------------------------------------------------------------------
/specific-engines/solr/solr-notes.md:
--------------------------------------------------------------------------------
1 | ## Goal
2 | My memory isn't amazing so I tend to make concise notes that help me remember technologies. Much of this document will be a summarization of documentation for Apache Solr. On occasion it may also include information not in the source materials (or at least not in the same place); this is done to fill in gaps in my knowledge.
3 |
4 | ## Loading Data into Solr
5 | - Can ingest data from many sources (XML, CSV, Microsoft Word, PDF, etc.)
6 |
7 | ### Common Ways to Load Data
8 | - Using Solr Cell, which is built on Apache Tika, to ingest binary files (e.g. PDF, Microsoft Word).
9 | - Uploading XML files using HTTP requests.
10 | - Creating a custom application that utilizes the Java Client API.
11 | - Note: By default the `-e` option when starting Solr sets the `example` directory as base directory for the Solr instance.
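
To make the "XML over HTTP" option a bit more concrete, here is a minimal sketch that builds a document in Solr's XML update format (`<add><doc><field name="...">`). The function name and the sample field values are hypothetical; actually sending the payload (typically a POST to the collection's `/update` handler) is left out.

```python
import xml.etree.ElementTree as ET


def build_add_xml(docs: list[dict]) -> str:
    """Serialize documents into Solr's XML update format (<add><doc>...)."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            # Each document field becomes <field name="...">value</field>.
            field = ET.SubElement(doc_el, "field", name=name)
            field.text = str(value)
    return ET.tostring(add, encoding="unicode")


# Hypothetical sample document; ElementTree escapes the "&" for us.
payload = build_add_xml([{"id": "doc-1", "title": "Hello & welcome"}])
print(payload)
```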
12 |
13 | ## Searching with Solr
14 | - A query is made to Solr and processed by a *request handler*, which calls a *query parser*, "which interprets the terms and parameters of a query."
15 |
16 | ### Query Parsers
17 | - "Different query parsers support different syntax."
18 | - Default: Standard Query Parser (aka "lucene" query parser).
19 | - Other Included Query Parsers:
20 | - DisMax Query Parser
21 | - Extended DisMax (eDisMax) Query Parser
22 | - The Standard Query Parser is precise (but throws syntax errors when queries are incorrect) while the DisMax/eDisMax Query Parsers act similarly to web search engines.
23 | - eDisMax extends and improves on DisMax's functionality.
24 |
25 | ### Query Parser Input Types
26 | - Queries can be made using:
27 | - strings (e.g. words, terms)
28 | - "parameters for fine-tuning the query"
29 | - "parameters for controlling the presentation of the query response"
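
The three input types above map onto ordinary request parameters. The sketch below assembles a `/select` URL that picks the eDisMax parser via `defType`; the host, collection name, and field names (`title`, `body`) are placeholders I made up, and nothing is sent over the network.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_select_url(base_url, collection, text, fields, rows=10):
    """Build a Solr /select URL that uses the eDisMax query parser."""
    params = {
        "q": text,               # the query string itself
        "defType": "edismax",    # which query parser interprets q
        "qf": " ".join(fields),  # fine-tuning: fields to search
        "fl": "id,score",        # presentation: fields to return
        "rows": rows,            # presentation: number of results
    }
    return f"{base_url}/solr/{collection}/select?{urlencode(params)}"


# Hypothetical host, collection, and field names.
url = build_select_url(
    "http://localhost:8983", "yourCollectionName", "solr tutorial", ["title", "body"]
)
print(url)
print(parse_qs(urlsplit(url).query))
```

Leaving out `defType` would fall back to the Standard ("lucene") parser, which is stricter about syntax.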
30 |
31 | ## Bibliography / Resources
32 | - See the main [Apache Solr](../apache-solr.md) document for a list of resources.
--------------------------------------------------------------------------------
/research/recommendations-research.md:
--------------------------------------------------------------------------------
1 | # Recommendations / Recommender Systems
2 |
3 | ## 2023
4 | - Zhe Fu, Xi Niu, Li Yu. [Wisdom of Crowds and Fine-Grained Learning for Serendipity Recommendations](https://dl.acm.org/doi/10.1145/3539618.3591787). SIGIR '23, 2023.
5 | - Yuanhao Liu, Qi Cao, Huawei Shen, Yunfan Wu, Shuchang Tao, Xueqi Cheng. [Popularity Debiasing from Exposure to Interaction in Collaborative Filtering](https://dl.acm.org/doi/10.1145/3539618.3591947). SIGIR '23, 7/2023.
6 | - Ziwei Fan, Ke Xu, Zhang Dong, Hao Peng, Jiawei Zhang, Philip S. Yu. [Graph Collaborative Signals Denoising and Augmentation for Recommendation](https://dl.acm.org/doi/10.1145/3539618.3591994). SIGIR '23, 7/2023.
7 | - Jiazheng Jing, Yinan Zhang, Xin Zhou, Zhiqi Shen. [Capturing Popularity Trends: A Simplistic Non-Personalized Approach for Enhanced Item Recommendation](https://dl.acm.org/doi/10.1145/3583780.3614801). CIKM '23, 10/2023.
8 | - Zhuang Liu, Haoxuan Li, Guanming Chen, Yuanxin Ouyang, Wenge Rong, Zhang Xiong. [PopDCL: Popularity-aware Debiased Contrastive Loss for Collaborative Filtering](https://dl.acm.org/doi/10.1145/3583780.3615009). CIKM '23, 10/2023.*
9 | - Sichun Luo, Chen Ma, Yuanzhang Xiao, Linqi Song. [Improving Long-Tail Item Recommendation with Graph Augmentation](https://dl.acm.org/doi/10.1145/3583780.3614929). CIKM '23, 10/2023.*
10 |
11 | ## 2020
12 | - Lijo Abraham. [Building a Recommendation System with Spark ML and Elasticsearch](https://towardsdatascience.com/building-a-recommendation-system-with-spark-ml-and-elasticsearch-abbd0fb59454). towardsdatascience, 9/2020.
13 |
14 | ## 2018
15 | - Dawen Liang, Rahul G. Krishnan, Matthew D. Hoffman, Tony Jebara. [Variational Autoencoders for Collaborative Filtering](https://dl.acm.org/doi/10.1145/3178876.3186150). WWW '18, 4/2018.
16 |
17 | ## 2015
18 | - Carlos A. Gomez-Uribe, Neil Hunt. [The Netflix Recommender System: Algorithms, Business Value, and Innovation](https://dl.acm.org/doi/10.1145/2843948). 12/2015.
--------------------------------------------------------------------------------
/research/semantic-research.md:
--------------------------------------------------------------------------------
1 | # Semantic Research in Search Engines and Information Retrieval
2 |
3 | - Anand Kumar, B.P. Singh. [Semantic Web: Past, Present and Future](https://www.researchgate.net/publication/369038964_Semantic_Web_Past_Present_and_Future). 3/2023.
4 | - Hiteshwar Kumar Azaad, Akshay Deepak, Amisha Azad. [LOD search engine: A semantic search over linked data](https://www.researchgate.net/publication/356389086_LOD_search_engine_A_semantic_search_over_linked_data). 8/2022.
5 | - Uzoma Peter Ozioma, Amanze Bethran Chibuike, Agbakwuru Alphonsus Onyekachi, Agbasonu V.C. [Development of a Visual Semantic Web Ontology Based Learning Management System](https://www.researchgate.net/publication/361056506_DEVELOPMENT_OF_A_VISUAL_SEMANTIC_WEB_ONTOLOGY_BASED_LEARNING_MANAGEMENT_SYSTEM). 2/2022.
6 | - Anita Kumari, Jawahar Thalur. [Semantic Web Search Engines: A Comparative Study](https://www.researchgate.net/publication/330602787_Semantic_Web_Search_Engines_A_Comparative_Survey). 1/2019.
7 | - Amit Upadhyay, Amit Paul, Pijush Kanti Dutta Pramanik. [Semantic Web Crawler for More Relevant Search Using Ontology](https://www.researchgate.net/publication/282121492_Semantic_Web_Crawler_for_More_Relevant_Search_Using_Ontology). 12/2014.
8 | - G Sudeepthi, G Anuradha, M Surenda, Prasad Babu. [A Survey on Semantic Web Search Engine](https://www.researchgate.net/publication/268436376_A_Survey_on_Semantic_Web_Search_Engine). 3/2012.
9 | - S. Latha Shanmuga Vadivu, M. Rajaram, S.N. Sivanandam. [A Survey on semantic web mining based web search engines](https://www.researchgate.net/publication/283249382_A_Survey_on_semantic_web_mining_based_web_search_engines). 10/2011.
10 | - Bettina Fazzinga, Giorgio Gianforme, Georg Gottlob, Thomas Lukasiewicz. [Semantic Web search based on ontological conjunctive queries](https://www.researchgate.net/publication/220291161_Semantic_Web_search_based_on_ontological_conjunctive_queries). 12/2011.
11 | - Bettina Fazzinga, Thomas Lukasiewicz. [Semantic search on the Web](https://www.researchgate.net/publication/220575552_Semantic_search_on_the_Web). 1/2010.
--------------------------------------------------------------------------------
/specific-engines/solr/basic-indexing-own-data-tutorial.md:
--------------------------------------------------------------------------------
1 | # Indexing One's Own Data Tutorial
2 |
3 | The content here is pulled from the official Solr Tutorials, [Exercise 3: Index Your Own Data](https://solr.apache.org/guide/solr/latest/getting-started/tutorial-diy.html).
4 |
5 | These are essentially my notes on the course materials.
6 |
7 | ## Preparations
8 | - What sort of data will I index?
9 | - What do I need to prepare Solr for this data?
10 | - e.g., create fields, set up copy fields, determine analysis rules, etc.
11 | - What kind of search options should the users have?
12 | - How much testing is needed to ensure it is working correctly?
13 |
14 | ## Create Your Own Collection
15 | - Make sure that Solr is running, e.g. `./bin/solr start`
16 | - `./bin/solr create -c yourCollectionName -s 2 -rf 2`
17 |
18 | ## Indexing Ideas
19 |
20 | ### Local Files with bin/post
21 | - If files are local, can handle JSON, XML, CSV, HTML, PDF, MS Office, plain text, and other formats.
22 | - `./bin/post -c yourCollectionName ~/pathToYourData`
23 |
24 | ### SolrJ or Other Client APIs
25 |
26 | ### Documents Screen
27 | - http://localhost:8983/solr/#/yourCollectionName/documents
28 | - Paste in document or use Document Builder from Document Type to create a document field by field.
29 |
30 | ## Updating Data
31 | - Documents with an identical `uniqueKey` value in the field `id` are updated rather than added in future indexing operations.
32 |
33 |
34 | ## Deleting Data
35 | - "You can delete data by POSTing a delete command to the update URL and specifying the value of the document’s unique key field, or a query that matches multiple documents (be careful with that one!). We can use bin/post to delete documents also if we structure the request properly."
36 | - Specific Document: `bin/post -c yourCollectionName -d "<delete><id>someUniqueId</id></delete>"`
37 | - All Documents: `bin/post -c yourCollectionName -d "<delete><query>*:*</query></delete>"`
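
The delete commands that get POSTed to the update URL are small XML documents. A minimal sketch of building them follows; the helper names are hypothetical, and the only real work is XML-escaping the ID or query before embedding it.

```python
from xml.sax.saxutils import escape


def delete_by_id(doc_id: str) -> str:
    """XML delete command for a single document, by unique key value."""
    return f"<delete><id>{escape(doc_id)}</id></delete>"


def delete_by_query(query: str) -> str:
    """XML delete command for every document matching a query."""
    return f"<delete><query>{escape(query)}</query></delete>"


print(delete_by_id("someUniqueId"))
print(delete_by_query("*:*"))  # matches everything, so use with care
```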
38 |
39 |
40 | ## Bibliography / Resources
41 | - https://solr.apache.org/guide/solr/latest/getting-started/tutorial-diy.html
--------------------------------------------------------------------------------
/research/personalization-research.md:
--------------------------------------------------------------------------------
1 | # Personalization in Search Engines and Information Retrieval
2 |
3 | - Ulrich Matter, Roland Hodler, Johannes Ladwig. [Personalization of Web Search During the 2020 US Elections](https://www.researchgate.net/publication/363920168_Personalization_of_Web_Search_During_the_2020_US_Elections). 9/2022.
4 | - Sunny Sharma, Vijay Rana. [A Systematic Literature Review of Web Search Personalization](https://www.researchgate.net/publication/339466998_A_Systematic_Literature_Review_Of_Web_Search_Personalization). 2/2020.*
5 | - Wiem Chebil, Mohammad Wedyan, Haiyan Lu, Omar Elshaweesh. [Context-Aware Personalized Web Search Using Navigation History](https://www.researchgate.net/publication/339435401_Context-Aware_Personalized_Web_Search_Using_Navigation_History). 2/2020.
6 | - Eugene Agichtein, Eric Brill, Susan Dumais. [Improving Web Search Ranking by Incorporating User Behavior Information](https://www.researchgate.net/publication/330459175_Improving_Web_Search_Ranking_by_Incorporating_User_Behavior_Information). 1/2019.
7 | - S. Salehi, J.T. Du, H. Ashman. [Use of web search engines and personalisation in information searching for educational purposes](https://www.researchgate.net/publication/326197645_Use_of_web_search_engines_and_personalisation_in_information_searching_for_educational_purposes). 6/2018.
8 | - Eric Utrera, Alfredo Simón-Cuevas, José A. Olivas. [Analysis of trends in the customization of results in web search engines](https://www.researchgate.net/publication/325157476_Analysis_of_trends_in_the_customization_of_results_in_web_search_engines). 4/2018.
9 | - M. Omair Shafiq, Reda Alhajj, John G. Rokne. [On personalizing Web search using social network analysis](https://www.researchgate.net/publication/275219347_On_personalizing_Web_search_using_social_network_analysis). 9/2015.
10 | - Aniko Hannak, Piotr Sapiezynski, Arash Molavi Kakhki, Balachander Krishnamurthy, David Lazer, Alan Mislove, Christo Wilson. [Measuring personalization of web search](https://www.researchgate.net/publication/262424460_Measuring_personalization_of_Web_search). 5/2013.
--------------------------------------------------------------------------------
/specific-engines/aws-opensearch/aws-opensearch-vector.md:
--------------------------------------------------------------------------------
1 | # AWS OpenSearch Vector Functionality
2 |
3 | # High-Level
4 | - Jon Handler, Dylan Tong, Jianwei Li, Vamshi Vijay Nakkirtha. [Amazon OpenSearch Service's Vector Database Capabilities Explained.](https://aws.amazon.com/blogs/big-data/amazon-opensearch-services-vector-database-capabilities-explained/) AWS Big Data Blog, 6/2023.
5 |
6 | # Workshop
7 | - [Semantic and Vector Search with Amazon OpenSearch Service](https://catalog.workshops.aws/semantic-search/)
8 | - Covers Search Basics, Text Search, Semantic Search, Fullstack Semantic Search, Fine Tuning Semantic Search, Neural Search, Retrieval Augmented Generation, and Conversational Search.
9 | - I went through this workshop but ran into issues in Module 4. Even before this point the workshop instructions and the notebook were diverging in what they were saying, and at this step I started getting errors. I think the fix for this particular issue is pretty straightforward, but I'm not going to spend more time with this workshop; hopefully Amazon will fix it at some point.
10 |
11 | # Tutorials
12 | - Matt Barlow. [How to Augment ChatGPT with AWS OpenSearch](https://stratusgrid.com/blog/augmenting-chatgpt-with-amazon-opensearch?locale=en). StratusGrid, 3/2024.
13 | - Uses LlamaIndex.
14 | - Arun Shankar. [Augmenting Large Language Models with Verified Information Sources: Leveraging AWS SageMaker and OpenSearch for Knowledge-Driven Question Answering](https://medium.com/@shankar.arunp/augmenting-large-language-models-with-verified-information-sources-leveraging-aws-sagemaker-and-f6be17fb10a8). Medium, 4/2023.
15 | - James Matson. [So, You Want to Store Your GPT/LLM Data? AWS OpenSearch to the Rescue](https://betterprogramming.pub/%EF%B8%8Fso-you-want-to-store-your-llm-data-aws-opensearch-to-the-rescue-f704a0f70558). Better Programming, 5/2023.
16 |
17 | # Documentation
18 | - [OpenSearch Service Flow Framework Templates](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/ml-workflow-framework.html)
19 | - These flow templates can be used to automate ML connector workflows.
--------------------------------------------------------------------------------
/specific-engines/solr/solr-resources-code.md:
--------------------------------------------------------------------------------
1 | # Language Integrations
2 |
3 | ## .NET
4 | - [SolrNet](https://github.com/SolrNet/SolrNet) - Stars: 913 - Updated: 5/2023 - Checked: 5/2023
5 | - [solr-net-linq](https://github.com/IharYakimush/solr-net-linq) - Stars: 4 - Updated: 4/2023 - Checked: 5/2023
6 |
7 | ## Elixir
8 | - [Elsol](https://github.com/findmypast/elsol) - Stars: 8 - Updated: 3/2023 - Checked: 5/2023
9 |
10 | ## Go
11 | - [Solr Go](https://github.com/stevenferrer/solr-go) - Stars: 36 - Updated: 3/2023 - Checked: 5/2023
12 | - [go-solr](https://github.com/vanng822/go-solr) - Stars: 67 - Updated: 9/2022 - Checked: 5/2023
13 |
14 | ## JS
15 | - [solr-client for Node.js](https://github.com/lbdremy/solr-node-client) - Stars: 456 - Updated: 3/2022 - Checked: 5/2023
16 |
17 | ## JVM (includes Java, Clojure, Scala, etc.)
18 | - [flux](https://github.com/mwmitchell/flux) - Stars: 35 - Updated: 7/2022 - Checked: 5/2023
19 | - "A Clojure based Solr client."
20 | - [solrs](https://github.com/inoio/solrs) - Stars: 103 - Updated: 4/2023 - Checked: 5/2023
21 | - "An async, non-blocking solr client for java/scala, providing a query interface like SolrJ"
22 | - [solr-scala-client](https://github.com/takezoe/solr-scala-client) - Stars: 91 - Updated: 9/2022 - Checked: 5/2023
23 |
24 | ## PHP
25 | - [Solarium](https://github.com/solariumphp/solarium) - Stars: 911 - Updated: 5/2023 - Checked: 5/2023
26 | - [SolrQueryComponent](https://github.com/InterNations/SolrQueryComponent) - Stars: 64 - Updated: 8/2022 - Checked: 5/2023
27 | - [pecl-search_engine-solr](https://github.com/php/pecl-search_engine-solr) - Stars: 56 - Updated: 5/2023 - Checked: 5/2023
28 |
29 | ## Python
30 | - [Pysolr](https://github.com/django-haystack/pysolr) - Stars: 639 - Updated: 5/2023 - Checked: 5/2023
31 | - [solrcloudpy](https://github.com/solrcloudpy/solrcloudpy) - Stars: 37 - Updated: 2/2021 - Checked: 5/2023
32 | - [solrq](https://github.com/swistakm/solrq) - Stars: 22 - Updated: 11/2022 - Checked: 5/2023
33 |
34 | ## Ruby
35 | - [rsolr](https://github.com/rsolr/rsolr) - Stars: 416 - Updated: 2/2022 - Checked: 5/2023
36 | - [solrb](https://github.com/machinio/solrb) - Stars: 38 - Updated: 9/2022 - Checked: 5/2023
37 | - [LSolr](https://github.com/supercaracal/lsolr) - Stars: 10 - Updated: 12/2022 - Checked: 5/2023
38 | - [Sunspot](https://github.com/sunspot/sunspot) - Stars: 3k - Updated: 3/2023 - Checked: 5/2023
--------------------------------------------------------------------------------
/specific-engines/aws-opensearch/aws-opensearch-code.md:
--------------------------------------------------------------------------------
1 | # AWS OpenSearch - Writing Code
2 |
3 | ## Introduction
4 |
5 | One of the stranger aspects of AWS' documentation on OpenSearch is its relative lack of code samples. One would expect to find something comprehensive in the [Amazon OpenSearch Service Developer Guide's Sample Code](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/samplecode.html) section, but the section is extremely brief and only tangentially mentions Elasticsearch (not even OpenSearch) language clients. If you dig through the [Using the AWS SDKs](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/configuration-samples.html) subsection there is some reference, but you have to dig for it (at least imho).
6 |
7 | There is also a tutorial on [Creating a search application](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/search-example.html) but this is quite limited in scope and depends on a specific architecture (Amazon API Gateway <--> AWS Lambda <--> Amazon OpenSearch Service).
8 |
9 | Ironically, nested under OpenSearch Serverless one can find some additional code (though beware that the serverless offering and managed offering have significant differences) in [Using the AWS SDKs to interact with Amazon OpenSearch Serverless](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/serverless-sdk.html) and [Ingesting data into collections](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/serverless-clients.html).
10 |
11 | This section attempts to provide a gentler and more rounded discussion of writing code for AWS OpenSearch Service.
12 |
13 | ## Clients
14 |
15 | One can perform operational tasks on an OpenSearch cluster using the AWS SDK in various languages but this does not provide an interface for querying the cluster.
16 | - [Java AWS SDK: OpenSearch](https://sdk.amazonaws.com/java/api/latest/software/amazon/awssdk/services/opensearch/package-summary.html)
17 | - [JavaScript AWS SDK: OpenSearch](https://docs.aws.amazon.com/AWSJavaScriptSDK/v3/latest/client/opensearch/)
18 | - [Python AWS Boto3: OpenSearch](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/opensearch.html)
19 |
20 | Instead one needs to use the [OpenSearch language-specific clients](https://opensearch.org/docs/latest/clients/) for this task. Currently these are available for Python, Java, JavaScript, Go, Ruby, PHP, .NET, and Rust.
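
To illustrate what "querying the cluster" means at the wire level, here is a rough sketch that builds (but does not send) a search request against the `_search` endpoint using only the standard library. The domain endpoint, index name, and field are hypothetical, and authentication (for example request signing or basic auth) is deliberately omitted; in practice one of the language clients above would handle this.

```python
import json
from urllib.request import Request


def build_search_request(endpoint, index, field, text, size=5):
    """Build a POST request carrying a simple `match` query for one index."""
    body = {"size": size, "query": {"match": {field: text}}}
    return Request(
        url=f"{endpoint.rstrip('/')}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical domain endpoint, index, and field; nothing is sent here.
request = build_search_request(
    "https://search-example.us-east-1.es.amazonaws.com",
    "articles",
    "title",
    "vector search",
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```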
21 |
22 |
--------------------------------------------------------------------------------
/specific-engines/solr/solr-resources-ui.md:
--------------------------------------------------------------------------------
1 | # Discovery and/or UI
2 |
3 | ## Blacklight
4 | - [Blacklight](https://projectblacklight.org/) - Stars: 731 - Updated: 4/2023 - Checked: 5/2023
5 | - A Ruby on Rails open source frontend for querying and discovery of results from Solr.
6 | - [Boston Public Library](https://github.com/boston-library/commonwealth-vlr-engine)
7 | - [Penn State University](https://github.com/psu-libraries/psulib_blacklight)
8 | - [Stanford University](https://github.com/sul-dlss/exhibits)
9 | - [Temple University](https://github.com/tulibraries/funcake-solr)
10 | - [warclight](https://github.com/archivesunleashed/warclight)
11 |
12 | ## Everything Else
13 | - [AJAX Solr](https://github.com/evolvingweb/ajax-solr) - Stars: 654 - Updated: 10/2021 - Checked: 4/2023 - JS library for building UIs for Solr.
14 | - [SolrDora](https://github.com/hectorcorrea/solrdora) - Stars: 6 - Updated: 1/2023 - Checked: 5/2023*
15 | - Provides a UI for browsing a Solr collection.
16 | - [YASA (Yet Another Solr Admin)](https://github.com/yasa-org/yasa) - Stars: 47 - Updated: 4/2023 - Checked: 5/2023
17 | - A web-based UI for administering Solr written with Vue in TypeScript.
18 | - [RecordManager](https://github.com/NatLibFi/RecordManager) - Stars: 44 - Updated: 5/2023 - Checked: 5/2023
19 | - [Goobi](https://goobi.io/)
20 | - "an open source software suite for the control and presentation of digitization projects."
21 | - [DS-Discover](https://github.com/kb-dk/ds-discover) - Stars: 0 - Updated: 4/2023 - Checked: 5/2023
22 | - "Gateway for Solr text search, image similarity, sound location and other discovery technologies....Developed and maintained by the Royal Danish Library."
23 | - [Search Management UI](https://github.com/querqy/smui) - Stars: 49 - Updated: 5/2023 - Checked: 5/2023
24 | - [MOPSY Search](https://github.com/Der-Henning/mopsy-react) - Stars: 0 - Updated: 1/2022 - Checked: 5/2023
25 | - Hasn't been updated in a while, but has a live demo.
26 | - [solrkit](https://github.com/garysieling/solrkit) - Stars: 11 - Updated: 4/2018 - Checked: 5/2023
27 |
28 | ## Learning
29 | - [Multi-Select Facet Example](https://github.com/stevenferrer/multi-select-facet) - Stars: 32 - Updated: 3/2023 - Checked: 5/2023
30 | - "An example of multi-select facet using Solr, Vue and Go."
31 | - [Solr-JavaScript-Search-Client](https://github.com/BLE-LTER/Solr-JavaScript-Search-Client) - Stars: 7 - Updated: 11/2022 - Checked: 5/2023
--------------------------------------------------------------------------------
/specific-engines/solr/solr-development.md:
--------------------------------------------------------------------------------
1 | # Solr Development
2 |
3 | ## Start With
4 | - Documentation:
5 | - dev-docs/README.adoc
6 | - dev-docs/solr-source-code.adoc
7 | - dev-docs/how-to-contribute.adoc
8 | - dev-docs/git.adoc
9 | - dev-docs/FAQ.adoc
10 | - dev-docs/running-in-docker.adoc
11 |
12 |
13 | ## More Advanced
14 | - Documentation:
15 | - dev-docs/lucene-upgrade.md
16 | - dev-docs/working-between-major-versions.adoc
17 | - dev-docs/cloud-script.adoc
18 | - dev-docs/dependency-upgrades.adoc
19 | - dev-docs/overseer/overseer.adoc
20 | - dev-docs/shard-split/shard-split.adoc
21 |
22 | ## Core Dependencies
23 | - [Java 11 Java Development Kit (JDK)](https://adoptium.net/)
24 | - Documentation: dev-docs/solr-source-code.adoc
25 |
26 | ## Bundled Dependencies
27 | - General Documentation - help/dependencies.txt
28 | - [Antora](https://antora.org/) - static site generator that generates Solr Ref Guide HTML.
29 | - Documentation: dev-docs/ref-guide/antora.adoc
30 | - [AsciiDoc](https://asciidoc.org/) - markup language used by the Solr project.
31 | - Documentation: dev-docs/ref-guide/asciidoc-syntax.adoc
32 | - [Gradle](https://gradle.org/) - build system.
33 | - Documentation: dev-docs/solr-source-code.adoc
34 | - Documentation: dev-docs/jvms.adoc
35 |
36 | ## Development
37 | - Plugins, Modules, and Packages Overview: dev-docs/plugins-modules-packages.adoc
38 | - IDEs: dev-docs/IDEs.adoc
39 | - Dev Tools:
40 | - dev-tools/README.txt
41 | - dev-tools/scripts/README.md
42 | - Benchmarking: solr/benchmark/README.md
43 | - Docker:
44 | - solr/docker/README.md
45 | - solr/docker/gradle-help.txt
46 | - Examples:
47 | - solr/examples/README.md
48 | - solr/examples/films/README.md
49 | - solr/examples/films/vectors/README.md
50 | - Modules: Each module under solr/modules contains a README.md file.
51 | - Server: solr/server/README.md
52 | - Solr Reference Guide: solr/solr-ref-guide/README.adoc
53 |
54 | ## Discussions
55 | - UI:
56 | - [Shifting Execution Strategy to a New UI Plugin](https://lists.apache.org/thread/f3r6ymgpggrv38hyozmf2n9cgox5ck7k)
57 | - CLI:
58 | - [Improving the Solr CLI](https://lists.apache.org/thread/39fglyc5rwwsnso9bldhowxcr80jddwg)
59 | - SolrCell (Tika):
60 | - [Future of SolrCell in Solr](https://lists.apache.org/thread/lbm6wb88gd1cfktgs6sfvw5xf73o8trd)
61 | - Documentation:
62 | - [Solr documentation questions](https://lists.apache.org/thread/wlbvg71b6f608ddpho9jbxtl0vf04jds)
--------------------------------------------------------------------------------
/specific-engines/apache-lucene.md:
--------------------------------------------------------------------------------
1 | # Apache Lucene
2 |
3 | ## Introduction
4 |
5 | Lucene is the best known open source search engine library. It forms the core of popular software like Elasticsearch, OpenSearch, Apache Solr, and Neo4j.
6 |
7 | ## Implementations
8 | - [Lucene.NET](https://lucenenet.apache.org/index.html)
9 | - [Examine](https://github.com/Shazwazza/Examine)
10 | - A search engine implementation on top of Lucene.NET, somewhat similar to Solr.
11 | - [PyLucene](https://lucene.apache.org/pylucene/)
12 | - A wrapper around the Java library, not an actual port.
13 | - [lupyne](https://github.com/coady/lupyne)
14 | - A search engine built using PyLucene.
15 |
16 | ## Articles
17 | - [Lucene: The Good Parts](https://www.parse.ly/lucene/)
18 | - An opinionated and interesting article that starts high-level and then dives into some technical details.
19 | - [IONOS Apache Lucene Tutorial](https://www.ionos.com/digitalguide/server/configuration/apache-lucene/)
20 | - Beginner tutorial.
21 | - [Baeldung Introduction to Apache Lucene](https://www.baeldung.com/lucene)
22 | - Another beginner tutorial.
23 | - [Data Warrior's Building a search engine (Lucene tutorial)](https://datawarrior.medium.com/building-a-search-engine-lucene-tutorial-a515e3bfb44b)
24 | - [Han Bo Sun's Lucene Full-Text Search - A Very Basic Tutorial](https://www.codeproject.com/Articles/5246976/Lucene-Full-Text-Search-A-Very-Basic-Tutorial)
25 | - [LuceneTutorial.com](https://lucenetutorial.com/)
26 | - [Ishan Upamanyu's Apache Lucene Tutorial](https://ishanupamanyu.com/blog/apache-lucene-tutorial/)
27 | - [A Simple Tutorial of Lucene's Indexing and Search Systems](https://github.com/jiepujiang/LuceneTutorial)
28 |
29 | ## Code
30 | - [shaikhu/lucene-in-action](https://github.com/shaikhu/lucene_in_action)
31 | - Provides updated code examples for the book Lucene in Action.
32 | - [Michael Froh's Lucene University](https://github.com/msfroh/lucene-university)
33 |
34 | ## Tooling
35 | - [clue](https://github.com/javasoze/clue) - CLI for interacting with Lucene.
36 | - [Luqum](https://github.com/jurismarches/luqum) - ""luqum" (as in LUcene QUery Manipolator) is a tool to parse queries written in the Lucene Query DSL and build an abstract syntax tree to inspect, analyze or otherwise manipulate search queries."
37 |
38 | ## Other
39 | - [Yelp's nrtsearch](https://github.com/Yelp/nrtsearch) - gRPC server on top of Lucene.
40 | - [Blacklab](https://github.com/INL/BlackLab) - "It allows fast, complex searches with accurate hit highlighting on large, tagged and annotated, bodies of text."
--------------------------------------------------------------------------------
/specific-engines/solr/solr-resources-interesting-old.md:
--------------------------------------------------------------------------------
1 | # Interesting But Old
2 | - [Elsevier Labs' Solr Dictionary Annotator](https://github.com/elsevierlabs-os/soda) - 2/2020.
3 | - [OpenSextant's Solr Text Tagger](https://github.com/OpenSextant/SolrTextTagger) - 7/2020.
4 | - [O'Reilly Media's Solr Plugin](https://github.com/oreillymedia/ifpress-solr-plugin) - 8/2020.
5 | - [Vector Scoring Plugin for Solr](https://github.com/saaay71/solr-vector-scoring) - 9/2019.
6 | - [Solr Recommender](https://github.com/pferrel/solr-recommender) - 6/2016.
7 | - [solr-movielens-recommender](https://github.com/o19s/solr-movielens-recommender) - 10/2016.
8 | - [solr-resource-recommender](https://github.com/lacic/solr-resource-recommender) - 12/2014.
9 | - [Selective Search](https://github.com/rajanim/selective-search) - 9/2019.
10 | - [solr-mlt](https://github.com/dfdeshom/custom-mlt) - 10/2019.
11 | - [UPenn's solrplugins](https://github.com/upenn-libraries/solrplugins) - 7/2017.
12 | - [solrgraph](https://github.com/kwatters/solrgraph) - 8/2021.
13 | - [carrot2's solr-integration-strategies](https://github.com/carrot2/solr-integration-strategies) - 1/2014.
14 | - [solr-fusion](https://github.com/outermedia/solr-fusion) - 3/2017.
15 | - [solr-quantities-detection-qparsers](https://github.com/SeaseLtd/solr-quantities-detection-qparsers) - 10/2019.
16 | - [kafka-solr-sink-connector](https://github.com/bkatwal/kafka-solr-sink-connector) - 11/2020.
17 | - [scrapy-solr](https://github.com/scalingexcellence/scrapy-solr) - 4/2016.
18 | - [SKOS Support for Apache Lucene and Solr](https://github.com/behas/lucene-skos) - 2/2016.
19 | - [Solr Mongo Importer](https://github.com/james75/SolrMongoImporter) - 11/2018.
20 | - [ftw.crawler](https://github.com/4teamwork/ftw.crawler) - 11/2017.
21 | - [solr-sql](https://github.com/bluejoe2008/solr-sql) - 4/2020.
22 | - [rdf-graph-search-with-solr-custom-streaming-expression](https://github.com/spoddutur/rdf-graph-search-with-solr-custom-streaming-expression) - 2/2018.
23 | - [nutch-solr-integration](https://github.com/basraven/nutch-solr-integration) - 10/2018.
24 | - [solrj-nested-docs](https://github.com/lucidworks/solrj-nested-docs) - 7/2014.
25 |
26 | ## UI
27 | - [Solrstrap](https://github.com/fergiemcdowall/solrstrap) - Stars: 86 - Updated: 4/2017 - Checked: 5/2023
28 | - [ngSolr](https://github.com/elmarquez/ngSolr) - Stars: 7 - Updated: 4/2016 - Checked: 5/2023
29 | - [HN-Search](https://github.com/agustingrigoriu/HN-Search) - Stars: 0 - Updated: 12/2019 - Checked: 5/2023
30 | - [Banana](https://github.com/lucidworks/banana/tree/release) - Stars: 672 - Updated: 6/2020 - Checked: 5/2023
--------------------------------------------------------------------------------
/specific-engines/solr/basic-solrcloud-tutorial.md:
--------------------------------------------------------------------------------
1 | # SolrCloud Tutorial
2 |
3 | - "SolrCloud is designed to provide a highly available, fault tolerant environment for distributing your indexed content and query requests across multiple servers."
4 | - "It’s a system in which data is organized into multiple pieces, or shards, that can be hosted on multiple machines, with replicas providing redundancy for both scalability and fault tolerance, and a ZooKeeper server that helps manage the overall structure so that both indexing and search requests can be routed properly.
5 | "
6 |
7 | ## Interactive Startup
8 | - `bin/solr -e cloud`
9 | - How many nodes? 2
10 | - node1 port? 8983
11 | - node2 port? 7574
12 | - Solr starts the nodes and displays the command it used to start each node.
13 | - The first node starts a ZooKeeper server on port 9983.
14 | - name of collection? gettingstarted
15 | - number of shards? 2
16 | - replicas per shard? 2
17 | - configuration? _default
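
The answers given to the interactive prompts correspond to parameters of the Collections API `CREATE` action. As a sketch under that assumption, the snippet below builds the equivalent URL without calling it; the function name is hypothetical and the host and port simply reuse node1 from above.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_create_collection_url(base_url, name, shards, replicas, config="_default"):
    """Build a Collections API URL that would create a sharded collection."""
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": shards,              # pieces the index is split into
        "replicationFactor": replicas,    # copies kept of each shard
        "collection.configName": config,  # configset to use
    }
    return f"{base_url}/solr/admin/collections?{urlencode(params)}"


url = build_create_collection_url("http://localhost:8983", "gettingstarted", 2, 2)
print(url)
# 2 shards with 2 replicas each means 4 replicas spread over the nodes.
print("total replicas:", 2 * 2)
```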
18 |
19 | ## Check Status
20 | - `bin/solr status`
21 |
22 | ## Log Files
23 | - You can find log files in `example/cloud/node1/logs` and `example/cloud/node2/logs`.
24 |
25 | ## See Collection Deployed Across Cluster
26 | - http://localhost:8983/solr/#/~cloud
27 |
28 | ## Basic Diagnostics
29 | - `bin/solr healthcheck -c gettingstarted`
30 | - "The healthcheck command gathers basic information about each replica in a collection, such as number of docs, current status (active, down, etc.), and address (where the replica lives in the cluster)."
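
Similar per-replica information is also available over HTTP from the Collections API `CLUSTERSTATUS` action. The following is a minimal sketch, not part of the official tutorial: it assumes the `gettingstarted` collection and the node on port 8983 created above, and `cluster_status_url`, `summarize_replicas`, and `fetch_cluster_status` are hypothetical helper names. The response shape it reads (`cluster` → `collections` → `shards` → `replicas`) may vary between Solr versions.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def cluster_status_url(base_url, collection):
    """Build the Collections API URL for the CLUSTERSTATUS action."""
    query = urlencode({"action": "CLUSTERSTATUS", "collection": collection, "wt": "json"})
    return f"{base_url.rstrip('/')}/admin/collections?{query}"


def summarize_replicas(status, collection):
    """Return sorted (shard, replica, state, node_name) tuples from a CLUSTERSTATUS response."""
    shards = status["cluster"]["collections"][collection]["shards"]
    rows = []
    for shard_name, shard in shards.items():
        for replica_name, replica in shard["replicas"].items():
            rows.append((shard_name, replica_name, replica.get("state"), replica.get("node_name")))
    return sorted(rows)


def fetch_cluster_status(base_url, collection):
    """Fetch and decode CLUSTERSTATUS from a running node (needs a live cluster)."""
    with urlopen(cluster_status_url(base_url, collection)) as response:
        return json.load(response)


# Usage against the tutorial cluster (assumed to be running locally):
#   status = fetch_cluster_status("http://localhost:8983/solr", "gettingstarted")
#   for row in summarize_replicas(status, "gettingstarted"):
#       print(*row)
```

Keeping the parsing separate from the HTTP call makes it possible to check the summary logic against a saved response without a running cluster.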
31 |
32 | ## Stopping SolrCloud
33 | - `bin/solr stop -all`
34 |
35 | ## Starting with -noprompt
36 | - Use the defaults instead of interactive
37 | - `bin/solr -e cloud -noprompt`
38 |
39 | ## Restarting Nodes
40 | - `bin/solr restart -c -p 8983 -s example/cloud/node1/solr`
41 | - `bin/solr restart -c -p 7574 -z localhost:9983 -s example/cloud/node2/solr`
42 | - Note that in the first case we don't need to specify the ZooKeeper address because ZooKeeper runs embedded in that node, but in the second case we need to tell the command where ZooKeeper is located.
43 |
44 | ## Adding a Node to a Cluster
45 | - `mkdir solrHomeForNewNode`
46 | - `cp AnExistingSolrHomePath solrHomeForNewNodePath`
47 | - `bin/solr start -cloud -s solrHomeForNewNode/solr -p portNumber -z zooKeeperHostAndPort`
48 | - Example:
49 | - `mkdir -p example/cloud/node3/solr`
50 | - `cp example/cloud/node1/solr/solr.xml example/cloud/node3/solr/solr.xml`
51 | - `bin/solr start -cloud -s example/cloud/node3/solr -p 8987 -z localhost:9983`
52 |
53 | ## Bibliography / Resources
54 | - https://solr.apache.org/guide/solr/latest/getting-started/tutorial-solrcloud.html
--------------------------------------------------------------------------------
/vector-search/vector-basics.md:
--------------------------------------------------------------------------------
1 | # Vector Search
2 | - Doug Turnbull. [Vector Search for the Uninitiated](https://softwaredoug.com/blog/2023/02/13/why-vector-search). 2/2023.
3 | - Provides a brief overview of how traditional search works and how vector search differs as well as the relative strengths and weaknesses of each.
4 | - Greg Kogan. [Introduction to Vector Search for Developers](https://www.pinecone.io/learn/vector-search-basics/).
5 | - A high-level overview of vector search with a slight emphasis on Pinecone's product. Touches on traditional search, vector embeddings, semantic similarity, etc.
6 | - Doug Turnbull. [Vector Search: The Hard Way](https://softwaredoug.com/blog/2023/09/05/vector-search-the-hard-way). 9/2023.
7 | - 75 slides on the challenges of Vector Search.
8 | - Dmitry Kan. [Keynote: Where Vector Search is taking us](https://haystackconf.com/eu2022/2022/09/27/keynote.html). Haystack Conference, 9/2022.
9 | - Slide deck and video presentation on the state of Vector Search and its future.
10 | - Ethan Steininger. [Vector Search](https://github.com/esteininger/vector-search). 6/2023.
11 | - A GitHub repo with a collection of articles and links relating to Vector Search.
12 | - Panda Smith. [Build a search engine, not a vector DB](https://blog.elicit.com/search-vs-vector-db/). Elicit, 12/2023.
13 | - Guidance on using a solid search engine as the foundation for vector search.
14 |
15 | # Vector Search Challenges
16 | - James Briggs. [The Missing WHERE Clause in Vector Search](https://www.pinecone.io/learn/vector-search-filtering/). Pinecone.
17 | - Discusses the difficult challenge of filtering results in vector search, explains the pre/post filtering techniques and Pinecone's single stage filtering.
18 | - Gibbs Cullen has an additional post on Pinecone's implementation titled, [Introducing the hybrid index to enable keyword-aware semantic search](https://www.pinecone.io/blog/hybrid-search/). Pinecone, 10/2022.
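
The difference between pre-filtering and post-filtering can be illustrated with a toy example. This is a minimal sketch using brute-force cosine similarity over a tiny in-memory list. Real engines use approximate nearest-neighbor indexes, and this is not Pinecone's single-stage approach. All function names and the sample documents are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query, docs, k):
    """Brute-force nearest neighbours: rank every doc by similarity to the query."""
    return sorted(docs, key=lambda d: cosine(query, d["vector"]), reverse=True)[:k]


def pre_filter_search(query, docs, k, predicate):
    """Apply the metadata filter first, then rank only the surviving docs."""
    return top_k(query, [d for d in docs if predicate(d)], k)


def post_filter_search(query, docs, k, predicate):
    """Rank first, then filter the top k; this may return fewer than k results."""
    return [d for d in top_k(query, docs, k) if predicate(d)]


docs = [
    {"id": "a", "vector": [1.0, 0.0], "lang": "en"},
    {"id": "b", "vector": [0.9, 0.1], "lang": "de"},
    {"id": "c", "vector": [0.0, 1.0], "lang": "en"},
]
query = [1.0, 0.0]


def is_english(doc):
    return doc["lang"] == "en"


print([d["id"] for d in pre_filter_search(query, docs, 2, is_english)])   # ['a', 'c']
print([d["id"] for d in post_filter_search(query, docs, 2, is_english)])  # ['a']
```

Post-filtering returns only one result here because the second-closest document is discarded after ranking, which is the shortfall the article describes. Pre-filtering avoids that, but at scale it gives up the speed of a prebuilt index.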
19 |
20 | # Vector Embeddings
21 | - Roie Schwaber-Cohen. [Vector Embeddings for Developers: The Basics](https://www.pinecone.io/learn/vector-embeddings-for-developers/). Pinecone.
22 | - Solid article for beginners looking for high-level overview. Touches on vectors, vector embeddings, embedding models, word2vec, and semantic similarity.
23 |
24 | # Neo4j Vector Search
25 | - [Documentation: Vector Search Indexes](https://neo4j.com/docs/cypher-manual/current/indexes/semantic-indexes/vector-indexes/)
26 | - Utilizes Apache Lucene, which uses an HNSW graph for approximate k-nearest-neighbor (k-ANN) querying.
27 |
28 | # Vector Companies / Databases
29 | - [Vector DB Comparison](https://superlinked.com/vector-db-comparison/)
30 | - [Milvus](https://milvus.io/)
31 | - [pgvector](https://github.com/pgvector/pgvector)
32 | - [Pinecone](https://www.pinecone.io/)
33 | - [Qdrant](https://qdrant.tech/)
34 | - [Vespa](https://vespa.ai/)
35 | - [Weaviate](https://weaviate.io/)
--------------------------------------------------------------------------------
/specific-engines/solr/solr-resources-app-framework-integrations.md:
--------------------------------------------------------------------------------
1 | ### Application/Framework Integrations
2 | - [Dokku Solr](https://github.com/dokku/dokku-solr) - Stars: 13 - Updated: 5/2023 - Checked: 5/2023
3 | - [Drupal Search API Solr](https://github.com/mkalkbrenner/search_api_solr) - Stars: 6 - Updated: 5/2023 - Checked: 5/2023
4 | - [eZ Platform](https://github.com/ezsystems/ezplatform-solr-search-engine) - Stars: 45 - Updated: 5/2023 - Checked: 5/2023
5 | - [Feathersjs Solr Client (feathers-solr)](https://github.com/sajov/feathers-solr) - Stars: 29 - Updated: 3/2023 - Checked: 5/2023
6 | - [Flask Backend Solr Service](https://github.com/NeuroBridge/Backend_service) - Stars: 0 - Updated: 5/2023 - Checked: 5/2023
7 | - [@florajs/datasource-solr](https://github.com/florajs/datasource-solr) - Stars: 6 - Updated: 5/2023 - Checked: 5/2023
8 | - [Ibexa DXP Solr Integration](https://github.com/ibexa/solr) - Stars: 2 - Updated: 5/2023 - Checked: 5/2023
9 | - [Kafka Connect Solr](https://github.com/jcustenborder/kafka-connect-solr) - Stars: 37 - Updated: 9/2021 - Checked: 5/2023
10 | - [Solr Lando Plugin](https://github.com/lando/solr) - Stars: 0 - Updated: 5/2023 - Checked: 5/2023
11 | - [MusicBrainz Solr](https://github.com/metabrainz/mb-solr) - Stars: 3 - Updated: 1/2023 - Checked: 5/2023
12 | - [Omeka-S-module-SearchSolr](https://github.com/Daniel-KM/Omeka-S-module-SearchSolr) - Stars: 2 - Updated: 5/2023 - Checked: 5/2023
13 | - [collective.solr for Plone CMS](https://github.com/collective/collective.solr) - Stars: 19 - Updated: 5/2023 - Checked: 5/2023
14 | - [Solr Search for NodeBB](https://github.com/julianlam/nodebb-plugin-solr) - Stars: 22 - Updated: 1/2023 - Checked: 5/2023
15 | - [Spring Data Solr](https://github.com/spring-projects/spring-data-solr) - Stars: 386 - Updated: 5/2023 - Checked: 5/2023
16 | - [Modern SilverStripe Solr Search](https://github.com/Firesphere/silverstripe-solr-search) - Stars: 10 - Updated: 5/2023 - Checked: 5/2023
17 | - [TYPO3-Find](https://github.com/subugoe/typo3-find) - Stars: 17 - Updated: 9/2022 - Checked: 5/2023
18 | - For providing a UI using TYPO3 to a given Solr instance.
19 | - [TYPO3-Solr](https://github.com/TYPO3-Solr/ext-solr) - Stars: 126 - Updated: 5/2023 - Checked: 5/2023
20 | - For searching the contents of a TYPO3 CMS instance.
21 | - [Solr Engine for Laravel Scout](https://github.com/pxslip/laravel-scout-solr) - Stars: 17 - Updated: 5/2023 - Checked: 5/2023
22 | - [Apache Solr Dialect for SQLAlchemy and Apache Superset](https://github.com/aadel/sqlalchemy-solr) - Stars: 8 - Updated: 8/2022 - Checked: 5/2023
23 | - [Sitecore SmartSolrSchema](https://github.com/dataweaversio/SmartSolrSchema) - Stars: 17 - Updated: 12/2022 - Checked: 5/2023
24 | - Populates "not only standard Sitecore dynamic fields but also reads any custom languages that are set up"
25 | - [solr-power (for WordPress)](https://github.com/pantheon-systems/solr-power) - Stars: 122 - Updated: 4/2023 - Checked: 5/2023
--------------------------------------------------------------------------------
/specific-engines/elasticsearch.md:
--------------------------------------------------------------------------------
1 | # Extending Elasticsearch
2 | - [elastiknn](https://github.com/alexklibisz/elastiknn) - Stars: 324 - Updated: 5/2023 - Checked: 5/2023
3 | - Nearest Neighbor plugin.
4 | - [elasticsearch-sql](https://github.com/iamazy/elasticsearch-sql) - Stars: 321 - Updated: 11/2022 - Checked: 5/2023
5 | - [elasticsearch-carrot2](https://github.com/carrot2/elasticsearch-carrot2) - Stars: 289 - Updated: 1/2023 - Checked: 11/2023
6 | - On-the-fly clustering capabilities.
7 | - [search-extra](https://github.com/wikimedia/search-extra) - Stars: 48 - Updated: 11/2023 - Checked: 11/2023
8 | - A number of queries, filters, etc. from MediaWiki.
9 | - [search-highlighter](https://github.com/wikimedia/search-highlighter)
10 | - Stars: 96 - Updated: 11/2022 - Checked: 11/2023
11 | - MediaWiki highlighter meant for easy experimentation with weights and groupings for hits.
12 | - [zentity](https://github.com/zentity-io/zentity) - Stars: 142 - Updated: 2/2022 - Checked: 11/2023
13 | - A simple, fast, generic, transitive, multi-source, accommodating, logical entity resolution plugin.
14 |
15 | # Deployment
16 | - [elasticsearch-cloud-deploy](https://github.com/BigDataBoutique/elasticsearch-cloud-deploy) - Stars: 329 - Updated: 11/2022 - Checked: 5/2023
17 |
18 | # Development Integrations
19 | - [Nest.js Elasticsearch](https://github.com/nestjs/elasticsearch) - Stars: 331 - Updated: 5/2023 - Checked: 5/2023
20 |
21 | # Learn
22 | - [Elasticsearch Cheatsheet for developers](https://github.com/jolicode/elasticsearch-cheatsheet) - Stars: 336 - Updated: 5/2022 - Checked: 5/2023
23 | - [The Elasticsearch Handbook](https://elasticsearchbook.com/) - eBook, $29.
24 |
25 |
26 | # Applications Using
27 | - [DataCap](https://github.com/EdurtIO/datacap)
28 | - [DataShare](https://github.com/ICIJ/datashare)
29 | - [Diskover Community Edition](https://github.com/lacic/solr-resource-recommender)
30 | - [Magda](https://github.com/magda-io/magda)
31 | - [Tigris](https://github.com/tigrisdata/tigris)
32 | - [Zenodo](https://github.com/zenodo/zenodo)
33 |
34 | # Resources
35 | - [awesome-elasticsearch](https://github.com/dzharii/awesome-elasticsearch) - Stars: 4.7k - Updated: 9/2022 - Checked: 10/2023
36 | - [codingexplained/complete-guide-to-elasticsearch](https://github.com/codingexplained/complete-guide-to-elasticsearch) - Stars: 1.6k - Updated: 1/2024 - Checked: 5/2024
37 | - "Contains all of the queries used within the Complete Guide to Elasticsearch course."
38 |
39 | ## Books
40 | - Anurag Srivastava. Elasticsearch 8 for Developers: A beginner's guide to indexing, analyzing, searching, and aggregating data, 2nd edition. BPB Publications, 10/2023. 392 pp.
41 | - Rafał Kuć, Marek Rogoziński. Elasticsearch Server, 3rd edition. Packt Publishing, 2/2016. 556 pp.
42 | - Steve Hoberman, Rafid Reaz. Elasticsearch Data Modeling and Schema Design. Technics Publications, 8/2023. 196 pp.
43 |
44 | # Companies Working with Elasticsearch
45 | - [Sematext](https://sematext.com/)
46 |
47 | # Terminology
48 | - Data Streams
49 | - Index Aliases
50 | - Index Rollups
51 | - Index State Management
52 | - Index Templates
53 | - Index Transforms
54 | - Ingest Pipeline
55 | - Ingest Processors
56 | - Processors
57 | - Search Analyzer
58 | - Tasks
--------------------------------------------------------------------------------
/BuildingSearchEngines.md:
--------------------------------------------------------------------------------
1 | # Building Search Engines
2 |
3 | ## Table of Contents
4 | - [Open Source Search Engines](OpenSourceSearchEngines.md)
5 | - [Open Source Web Crawlers](WebCrawlers.md)
6 | - Building Search Engines General Resources
7 | - [Common Crawl](CommonCrawl.md)
8 | - Wikimedia Search
9 | - Sites Covering Search Related Topics
10 | - Tutorials
11 | - [Books on Search and Information Retrieval](/research/books-research.md)
12 | - [Research on Search and Information Retrieval](/research/research-main.md)
13 | - [Vector Search](/vector-search/vector-basics.md)
14 |
15 |
16 | ## General Resources
17 | - [The Open Guide to Search Engineering](https://github.com/open-guides/og-search-engineering)
18 | - [Web Archiving Introduction](/web-archiving/archiving-introduction.md)
19 |
20 | ## Wikimedia Search
21 | - https://www.mediawiki.org/wiki/Wikimedia_Search_Platform
22 | - https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes
23 |
24 | ## Sites Covering Search Related Topics
25 | A number of the sites listed below are commercial search providers. While there may be useful resources throughout each site, the blog is often a good place to start.
26 | - [Algolia](https://algolia.com/) - A popular SaaS search engine.
27 | - [Bonsai](https://bonsai.io/) - Managed Elasticsearch, OpenSearch, and SolrCloud platform.
28 |
29 | ## Community
30 | - [Relevance & Matching Tech](https://www.opensourceconnections.com/slack) - A long-lived and popular Slack community for the IR/search community run by [OpenSource Connections](https://www.opensourceconnections.com/).
31 | - [IR-Relevant](https://ir-relevant.net/) - A new (2023) community forum for those interested in Information Retrieval. Created and run by [Sease](https://sease.io/), a well-known information retrieval and search provider.
32 |
33 | ## Tutorials
34 |
35 | ### For Beginners
36 | - [Build Your Own Search Engine and Web Crawler in 5 Minutes with Node.js, MySQL, and Elasticsearch](https://coderdose.com/build-your-own-search-engine-and-web-crawler-in-5-minutes-with-node-js-mysql-and-elasticsearch/). Coderdose, 3/2023.
37 | - [Web Search Engine: Design and implementation of a Web Search Engine Using Text Mining Techniques](https://www.codeproject.com/Articles/5319612/Web-Search-Engine). Code Project, 12/2021-3/2023.
38 | - Uses Python and a SQL back-end.
39 |
40 | ## ACM Conferences Related to Search
41 | - [WSDM: Web Search and Data Mining](https://dl.acm.org/conference/wsdm)
42 | - [IR: Research and Development in Information Retrieval](https://dl.acm.org/conference/ir)
43 | - [KDD: Knowledge Discovery and Data Mining](https://dl.acm.org/conference/kdd)
44 | - [CHI: Conference on Human Factors in Computing Systems](https://dl.acm.org/conference/chi) - Not focused on search but covers a lot of topics that are/should be of interest to those working with search.
45 | - [CIKM: Conference on Information and Knowledge Management](https://dl.acm.org/conference/cikm) - Largely focused on AI/ML.
46 | - [RECSYS: ACM Conference on Recommender Systems](https://dl.acm.org/conference/recsys)
47 | - [IDEAS: International Database Engineering & Applications Symposium](https://dl.acm.org/conference/ideas)
48 | - [WWW: International World Wide Web Conference](https://dl.acm.org/conference/www)
49 | - [MOD: International Conference on Management of Data](https://dl.acm.org/conference/mod)
50 |
--------------------------------------------------------------------------------
/specific-engines/yacy.md:
--------------------------------------------------------------------------------
1 | # YaCy
2 | - Official Site: https://yacy.net/
3 | - [GitHub Repo](https://github.com/yacy/yacy_search_server)
4 | - [Forums](https://community.searchlab.eu/)
5 | - [Subreddit](https://www.reddit.com/r/YaCy/)
6 | - [HackerNews](https://news.ycombinator.com/item?id=32597309)
7 | - [Wikipedia](https://en.wikipedia.org/wiki/YaCy)
8 | - An open source, distributed, P2P search engine built in Java with a focus on user privacy and decentralization. It's been around for a long time and continues to be actively developed. Includes a crawler.
9 |
10 | ## Other Projects
11 | - [Susper](https://github.com/fossasia/susper.com) - Stars: 1.7k - Updated: 3/2022 - Checked: 3/2023 - Built on top of YaCy and Apache Solr.
12 |
13 | ## Implementations
14 | - [Susper](https://susper.com)
15 | - [Land](https://www.land.nrw)
16 |
17 | ## YaCy Grid
18 | - YaCy is a P2P search engine while YaCy Grid is a distributed search engine but not P2P. Read more at:
19 | - Michael Christen. [The Story of YaCy Grid](https://community.searchlab.eu/t/the-story-of-yacy-grid/48). 6/2019.
20 | - Covers the origins of YaCy Grid and its basic architecture.
21 |
22 | ## YaCy Searchlab
23 | - https://searchlab.eu/
24 | - [GitHub Repo](https://github.com/yacy/searchlab)
25 | - Provides a UI on top of YaCy Grid.
26 | - Michael Christen. [The Searchlab Project](https://community.searchlab.eu/t/the-searchlab-project/867). 10/2021-10/2022.
27 | - Covers the launch of Searchlab, an implementation of YaCy Grid with corresponding open source projects. Note that you should read through the thread as the initial post has not been updated.
28 |
29 | ## Articles
30 | - Keyhan Vakil. [Personal internet search engine with YaCy](https://www.kvakil.me/posts/2022-07-03-yacy-private-search-engine.html). 7/2022.
31 | - Covers using YaCy as a personal, private search engine. Not web-scale, but a good introduction to YaCy.
32 | - Arunmozhi. [Personal Bookmarking using YACY & yacy-it](https://arunmozhi.in/2022/06/27/personal-bookmarking-using-yacy-yacy-it/). 6/2022.
33 | - A good followup to Vakil's article. Arunmozhi created a Firefox extension to reduce the friction of adding sites to one's personal search engine.
34 | - Richard Osgood. [YaCy Personal Search Engine](https://www.richardosgood.com/posts/yacy-personal-search-engine/). 2/2023.
35 | - A followup on Vakil's article with some additional good ideas.
36 | - LinuxReviews
37 | - LinuxReviews has a fairly negative opinion of YaCy overall but sees a glimmer of opportunity for the engine. Unfortunately, the articles have not been updated since 2019 and 2021 respectively, leaving a gap between what is described and YaCy's current status. That said, there is a decent amount of helpful technical information in the articles.
38 | - [YaCy](https://linuxreviews.org/YaCy). 9/2019.
39 | - [The YaCy Search Server Is Sort-Of Being Actively Developed Again After Half a Decade Of Inactivity](https://linuxreviews.org/The_YaCy_Search_Server_Is_Sort-Of_Being_Actively_Developed_Again_After_Half_A_Decade_Of_Inactivity). 4/2021.
40 | - Michael Herrmann, Kai-Chun Ning, Claudia Díaz, B. Praneel. [Description of the YaCy Distributed Web Search Engine](https://www.semanticscholar.org/paper/Description-of-the-YaCy-Distributed-Web-Search-Herrmann-Ning/8d0c816ab14ca3748a1887d7f2ef088d630f831d). 2014.
41 | - An academic article providing a technical description of YaCy.
42 | - Jeremy Rand. [Relevance and Privacy Improvements to the YaCy Decentralized Web Search Engine](https://shareok.org/handle/11244/299892). University of Oklahoma, 2018.
--------------------------------------------------------------------------------
/specific-engines/elasticsearch/elasticsearch-clients.md:
--------------------------------------------------------------------------------
1 | # Ruby
2 | - [elasticsearch-ruby](https://github.com/elastic/elasticsearch-ruby) - Stars: 2k - Updated: 4/2024 - Checked: 5/2024
3 | - [elasticsearch-rails](https://github.com/elastic/elasticsearch-rails) - Stars: 3.1k - Updated: 4/2024 - Checked: 5/2024.
4 | - [chewy](https://github.com/toptal/chewy) - Stars: 1.9k - Updated: 5/2024 - Checked: 5/2024.
5 | - "High-level Elasticsearch Ruby framework based on the official elasticsearch-ruby client."
6 |
7 | # Go
8 | - [go-elasticsearch](https://github.com/elastic/go-elasticsearch) - Stars: 5.5k - Updated: 4/2024 - Checked: 5/2024
9 | - Official.
10 | - [elasticsql](https://github.com/cch123/elasticsql) - Stars: 1.1k - Updated: 8/2023 - Checked: 5/2024
11 | - "Convert sql to elasticsearch DSL"
12 | - [vulcanizer](https://github.com/github/vulcanizer) - Stars: 663 - Updated: 4/2024 - Checked: 5/2024
13 | - By GitHub
14 | - "...a golang library for interacting with an Elasticsearch cluster...to provide a high level API to help with common tasks...operating an Elasticsearch cluster such as querying health status..., migrating data off of nodes, updating cluster settings, etc."
15 |
16 | # Rust
17 | - [elasticsearch-rs](https://github.com/elastic/elasticsearch-rs) - Stars: 686 - Updated: 12/2023 - Checked: 5/2024.
18 | - Official.
19 |
20 | # .NET
21 | - [elasticsearch-net](https://github.com/elastic/elasticsearch-net) - Stars: 3.5k - Updated: 4/2024 - Checked: 5/2024.
22 | - Official.
23 | - [ElasticLINQ](https://github.com/ElasticLINQ/ElasticLINQ) - Stars: 384 - Updated: 2023 - Checked: 5/2024.
24 |
25 | # PHP
26 | - [elasticsearch-php](https://github.com/elastic/elasticsearch-php) - Stars: 5.2k - Updated: 3/2024 - Checked: 5/2024
27 | - Official.
28 | - [Elastica](https://github.com/ruflin/Elastica) - Stars: 2.3k - Updated: 5/2024 - Checked: 5/2024.
29 | - Updated by CirrusSearch which is used by MediaWiki/Wikipedia.
30 |
31 | ## Laravel
32 | - [laravel-scout-elasticsearch](https://github.com/matchish/laravel-scout-elasticsearch) - Stars: 684 - Updated: 2/2024 - Checked: 5/2024
33 |
34 | ## Symfony
35 | - [FOSElasticaBundle](https://github.com/FriendsOfSymfony/FOSElasticaBundle) - Stars: 1.2k - Updated: 4/2024 - Checked: 5/2024
36 | - Uses Elastica to integrate Elasticsearch with Symfony.
37 |
38 | ## Yii
39 | - [yii2-elasticsearch](https://github.com/yiisoft/yii2-elasticsearch) - Stars: 430 - Updated: 9/2023 - Checked: 5/2024
40 |
41 | # Python
42 | - [elasticsearch-py](https://github.com/elastic/elasticsearch-py) - Stars: 4.1k - Updated: 5/2024 - Checked: 5/2024
43 | - Official.
44 | - [elasticsearch-dsl-py](https://github.com/elastic/elasticsearch-dsl-py) - Stars: 3.8k - Updated: 5/2024 - Checked: 5/2024
45 | - "High level Python client for Elasticsearch"
46 |
47 | ## Django
48 | - [django-elasticsearch-dsl-drf](https://github.com/barseghyanartur/django-elasticsearch-dsl-drf) - Stars: 364 - Updated: 2022 - Checked: 5/2024
49 |
50 | # JavaScript
51 | - [elasticsearch-js](https://github.com/elastic/elasticsearch-js) - Stars: 5.2k - Updated: 5/2024 - Checked: 5/2024
52 | - Official.
53 | - [elastic-builder](https://github.com/sudo-suhas/elastic-builder) - Stars: 503 - Updated: 5/2024 - Checked: 5/2024
54 | - "A Node.js implementation of the elasticsearch Query DSL"
55 |
56 | # Elixir
57 | - [elasticsearch-elixir](https://github.com/danielberkompas/elasticsearch-elixir) - Stars: 416 - Updated: 9/2023 - Checked: 5/2024
58 |
59 | # Haskell
60 | - [bloodhound](https://github.com/bitemyapp/bloodhound) - Stars: 420 - Updated: 2/2024 - Checked: 5/2024.
61 | - "Haskell Elasticsearch client and query DSL"
62 |
63 | # Scala
64 | - [elastic4s](https://github.com/Philippus/elastic4s) - Stars: 1.6k - Updated: 5/2024 - Checked: 5/2024
--------------------------------------------------------------------------------
/web-archiving/archiving-introduction.md:
--------------------------------------------------------------------------------
1 | # Web Archiving Introduction
2 |
3 | ## Introduction
4 | Search engines depend on indexes, which are built by crawling and caching the content of the web. For search engines this caching can be temporary (lasting only until the relevant data is extracted into an index), but it overlaps significantly with web archiving, and using open crawl data often involves working with web archiving formats and tooling.
5 |
6 | In this document we'll be particularly interested in discussing the file formats utilized in modern web archiving.
7 |
8 | ## Origins
9 | The [ARC](https://archive.org/web/researcher/ArcFileFormat.php) format was created by the [Internet Archive](https://archive.org/) for use with its Wayback Machine. Its successor is WARC, released in a "finalized" form in 2009; this format (and subsequent revisions) continues to be the mainstay of web archiving.
10 |
11 | ## WARC Format
12 | - [Official Standard Specifications for WARC 1.1](https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/)
13 | - Karl-Rainer Blumenthal. [The stack: An introduction to the WARC file](https://ait.blog.archive.org/post/the-stack-warc-file/). Archive.org, 4/2021.
14 | - A great introductory article to WARC including its history, purpose, and implementation.
15 | - [Wikipedia on Web ARChive](https://en.wikipedia.org/wiki/Web_ARChive)
16 |
17 | A WARC file contains WARC records, which come in eight types, six of which are actually utilized currently:
18 | - warcinfo - Information about the records that follow it, "good provenance information" as Blumenthal puts it.
19 | - request - The HTTP request made by the archiving tool to the website that results in the response received.
20 | - response - The response received from the website (including file contents).
21 | - revisit - A record that has been previously archived and hasn't changed in a subsequent visit.
22 | - resource - May include screenshots or videos of the page.
23 | - conversion - Conversion of older data into a current format (e.g. as an image standard is deprecated this might contain a replacement image in a new format).
24 | - continuation - Allows one to reference another WARC record that contains the remainder of the record.
25 | - metadata - Various metadata depending on the archiving source.
26 |
27 | They can be opened with a text editor (although many WARC files are quite large and may require a special editor with large file support).
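
Because each record is a block of plain-text headers followed by a content block, an uncompressed WARC file can be walked with the Python standard library alone. The following is a minimal sketch, assuming well-formed, uncompressed input and no folded header lines. `iter_warc_records` and the sample bytes are hypothetical. Real-world WARC files are usually gzip-compressed per record and are better handled with a dedicated library.

```python
import io


def iter_warc_records(stream):
    """Yield (headers, body) for each record in an uncompressed WARC byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # skip the blank lines that separate records
        version = line.strip().decode("utf-8")
        if not version.startswith("WARC/"):
            raise ValueError(f"expected a WARC version line, got {version!r}")
        headers = {"version": version}
        while True:
            header_line = stream.readline()
            if header_line in (b"\r\n", b"\n", b""):
                break  # a blank line ends the header block
            name, _, value = header_line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # Content-Length gives the exact size of the content block in bytes.
        length = int(headers["Content-Length"])
        yield headers, stream.read(length)


sample = (
    b"WARC/1.1\r\n"
    b"WARC-Type: warcinfo\r\n"
    b"Content-Length: 18\r\n"
    b"\r\n"
    b"software: example\n"
    b"\r\n\r\n"
    b"WARC/1.1\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: http://example.com/\r\n"
    b"Content-Length: 17\r\n"
    b"\r\n"
    b"HTTP/1.1 200 OK\r\n"
    b"\r\n\r\n"
)

for headers, body in iter_warc_records(io.BytesIO(sample)):
    print(headers["WARC-Type"], len(body))
```

The content block is read by `Content-Length` rather than by scanning for the next `WARC/` line, since the payload itself may contain arbitrary bytes.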
28 |
29 | ## WAT Format
30 | Contains data extracted from a WARC file, stored as JSON. Includes metadata, request, and response as well as the links extracted from the page.
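
As a rough illustration, the links for a page can be pulled out of one WAT record's JSON payload as sketched below. The key names (`Envelope`, `Payload-Metadata`, `HTTP-Response-Metadata`, `HTML-Metadata`, `Links`) are assumptions based on the layout of Common Crawl's WAT files, so check them against a real file before relying on them. `extract_links` and the sample record are hypothetical.

```python
import json


def extract_links(wat_json):
    """Return (page_url, [link_url, ...]) from one WAT record's JSON payload."""
    envelope = json.loads(wat_json).get("Envelope", {})
    page_url = envelope.get("WARC-Header-Metadata", {}).get("WARC-Target-URI")
    html_meta = (
        envelope.get("Payload-Metadata", {})
        .get("HTTP-Response-Metadata", {})
        .get("HTML-Metadata", {})
    )
    # Each link entry is expected to be a dict; entries without a URL are skipped.
    links = [link["url"] for link in html_meta.get("Links", []) if "url" in link]
    return page_url, links


sample_record = json.dumps({
    "Envelope": {
        "WARC-Header-Metadata": {"WARC-Target-URI": "http://example.com/"},
        "Payload-Metadata": {
            "HTTP-Response-Metadata": {
                "HTML-Metadata": {
                    "Links": [
                        {"path": "A@/href", "url": "http://example.com/about"},
                        {"path": "IMG@/src", "url": "/logo.png"},
                    ]
                }
            }
        },
    }
})

print(extract_links(sample_record))
```

Using `.get` with empty defaults means records that lack HTML metadata (request or metadata records, for example) yield an empty link list instead of raising an error.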
31 |
32 | ## WET Format
33 | Contains the plain text extracted from a WARC file.
34 |
35 | ## Bibliography/Resources
36 | - Archive.org
37 | - Note that the Internet Archive maintains a [general blog](https://blog.archive.org/) but for those interested in more technical aspects of the Archive, see the [Archive-It blog](https://ait.blog.archive.org/), which also covers general Archive-It news along with some technical posts.
38 | - [A New Wayback: Improving Web Archive Replay](https://ait.blog.archive.org/post/archive-it-wayback-release/). 9/2021.
39 | - Karl-Rainer Blumenthal. [The stack: A guide to A/V web archiving with youtube-dl](https://ait.blog.archive.org/post/the-stack-youtube-dl-guide/). 1/2021.
40 | - Karl-Rainer Blumenthal. [The stack: High fidelity web collecting at scale with Brozzler](https://ait.blog.archive.org/post/the-stack-brozzler/). 11/2020.
41 | - Molly Bragg, Kristine Hanna, et al. [Web Archiving Lifecycle Model](https://ait.blog.archive.org/learn-more/publications/web-archiving-life-cycle-model/). 3/2013.
42 | - CommonCrawl.org
43 | - Stephen Merity. [Navigating the WARC file format](https://commoncrawl.org/2014/04/navigating-the-warc-file-format/). 4/2014.
44 | - Brief introduction to the WARC format, but perhaps more importantly (as the info seems less readily available around the web), discusses the WET and WAT formats.
--------------------------------------------------------------------------------
/collaborative/README.md:
--------------------------------------------------------------------------------
1 | # Collaborative Search Engines Till Now
2 |
3 | ## Introduction
4 |
5 | I am particularly interested in human augmented search engines and am [working towards building one](https://github.com/davidshq/next-search). I've decided to collect information regarding collaborative search engines here.
6 |
7 | ## Engines
8 | - [ApexKB / Jumper](https://en.wikipedia.org/wiki/ApexKB)
9 | - Last Release: 11/2010
10 | - [Seeks](https://en.wikipedia.org/wiki/Seeks)
11 | - Last Release: 4/2012
12 | - [Searx](https://en.wikipedia.org/wiki/Searx) - While inspired by Seeks it does not provide the collaborative search aspect.
13 |
14 | ## Wikipedia Articles
15 | - [Collaborative search engine](https://en.wikipedia.org/wiki/Collaborative_search_engine)
16 | - "let users combine their efforts in information retrieval (IR) activities, share information resources collaboratively using knowledge tags, and allow experts to guide less experienced people through their searches. Collaboration partners do so by providing query terms, collective tagging, adding comments or opinions, rating search results, and links clicked of former (successful) IR activities to users having the same or a related information need."
17 | - Implicit Collaboration: Collaborative filtering, recommendation systems.
18 | - I-Spy
19 | - Jumper 2.0
20 | - Seeks
21 | - Community Search Assistance
22 | - Burghardt et al. CSE
23 | - Longo et al.
24 | - Explicit Collaboration
25 | - SearchTogether (2007)
26 | - PlayByPlay
27 | - MUSE
28 | - MUST
29 | - Cerciamo
30 | - Papagelis et al.
31 | - CoSense
32 | - CoSearch
33 | - GroupWeb
34 | - ClassSearch
35 | - [Collaborative filtering](https://en.wikipedia.org/wiki/Collaborative_filtering)
36 | - [Robust collaborative filtering](https://en.wikipedia.org/wiki/Robust_collaborative_filtering)
37 | - Social Bookmarking
38 | - StumbleUpon
39 | - Digg
40 | - [Firefly](https://en.wikipedia.org/wiki/Firefly_(website))
41 | - [Recommender system](https://en.wikipedia.org/wiki/Recommender_system)
42 | - [Information filtering system](https://en.wikipedia.org/wiki/Information_filtering_system)
43 | - [Collaborative intelligence](https://en.wikipedia.org/wiki/Collaborative_intelligence)
44 | - [Crowdsourcing](https://en.wikipedia.org/wiki/Crowdsourcing)
45 | - Reddit
46 | - [Citizen science](https://en.wikipedia.org/wiki/Citizen_science)
47 | - [Enterprise bookmarking](https://en.wikipedia.org/wiki/Enterprise_bookmarking)
48 | - IBM Dogear (Lotus Connections)
49 | - Cogenz
50 | - Jumper 2.0
51 | - del.icio.us
52 | - [Social bookmarking](https://en.wikipedia.org/wiki/Social_bookmarking)
53 | - 1996 - itList
54 | - 1997 - NASA WebTagger
55 | - Backflip
56 | - Blink
57 | - Clip2
58 | - ClickMarks
59 | - HotLinks
60 | - 2003 - Delicious
61 | - Furl
62 | - Simpy
63 | - Spurl.net
64 | - unalog
65 | - CiteULike
66 | - Connotea
67 | - StumbleUpon
68 | - 2006 - Ma.gnolia (Gnolia)
69 | - Blue Dot (Faves)
70 | - Mister Wong
71 | - Diigo
72 | - Connectbeam
73 | - 2007 - IBM Lotus Connections
74 | - 2009 - Pinboard
75 | - Digg
76 | - Reddit
77 | - Newsvine
78 | - [Tag](https://en.wikipedia.org/wiki/Tag_(metadata))
79 | - [Comparison of enterprise bookmarking platforms](https://en.wikipedia.org/wiki/Comparison_of_enterprise_bookmarking_platforms)
80 | - [Knowledge management](https://en.wikipedia.org/wiki/Knowledge_management)
81 | - [Web directory](https://en.wikipedia.org/wiki/Web_directory)
82 | - [Folksonomy](https://en.wikipedia.org/wiki/Folksonomy) - aka Collaborative tagging
83 | - [List of social bookmarking sites](https://en.wikipedia.org/wiki/List_of_social_bookmarking_websites)
84 | - [List of social software](https://en.wikipedia.org/wiki/List_of_social_software)
85 | - [Elium (Knowledge Plaza)](https://en.wikipedia.org/wiki/Elium)
86 | - [Models of collaborative tagging](https://en.wikipedia.org/wiki/Models_of_collaborative_tagging)
--------------------------------------------------------------------------------
/front-end/ui-components-of-search.md:
--------------------------------------------------------------------------------
1 | # Understanding the UI Components of Search
2 |
3 | In this section we look at the various UI components commonly utilized in building a search engine. We will initially base our discussion on the components provided by [Pre-Built Search Component Libraries](./ui-component-libraries-for-search.md).
4 |
5 | ## Core Components
6 |
7 | ### Search Box
8 | This is the main input field where a user enters their queries.
9 |
10 | ### Search Results
11 | This is the area where search results are displayed. It typically includes a list of items with titles, descriptions, and other relevant information.
12 |
13 | #### Result Item (Hit)
14 | This is an individual search result displayed within the search results. It usually includes a title, description, and other relevant information.
15 |
16 | ### Pagination
17 | Pagination allows users to navigate through multiple pages of search results.
18 |
19 | ### Per Page
20 | This is a dropdown menu that allows users to select the number of search results displayed per page.
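
Pagination and the per-page setting together determine which slice of the result list is shown. The following is a minimal sketch of that arithmetic, assuming 1-based page numbers and an in-memory list of results. `paginate` is a hypothetical helper, and a real search engine would normally apply the offset and limit on the server side.

```python
import math


def paginate(results, page, per_page):
    """Return the items for a 1-based page plus the total number of pages."""
    if page < 1 or per_page < 1:
        raise ValueError("page and per_page must be positive")
    total_pages = math.ceil(len(results) / per_page)
    start = (page - 1) * per_page
    return results[start:start + per_page], total_pages


hits = [f"doc-{n}" for n in range(1, 26)]  # 25 pretend results
items, total_pages = paginate(hits, page=3, per_page=10)
print(items, total_pages)  # the last five docs, 3 pages in total
```

Requesting a page past the end returns an empty list, which the UI can treat as "no more results".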
21 |
22 | ### Facets (Refinements, Filters)
23 | Facets are used to refine search results. They are typically displayed as a list of checkboxes or radio buttons that allow users to filter search results by various criteria such as category, price, date, etc.
24 |
25 | #### Types of Facet
26 | - Radio Buttons
27 | - Checkboxes
28 | - Hierarchical
29 | - Range (slider)
30 | - Rating
31 | - Toggle
32 |
33 | #### Selected Facets (Active Filters)
34 | This is a list of the currently selected facets. It is usually displayed above the search results and allows users to easily remove filters.
35 |
36 | ### Sort By
37 | This is generally a dropdown menu that allows users to sort search results by criteria such as relevance, date, or popularity, though it is sometimes implemented as radio buttons or similar.
38 |
39 | ## Optional Components
40 | - Breadcrumbs
41 | - Geographic
42 | - History
43 | - Menus
44 | - Multi-Select Items
45 | - Stats
46 | - Related Items
47 | - Tag Cloud
48 | - Date Picker
49 | - Tree View
50 | - Tag Filters
51 |
52 | ## Functionality
53 | - Autocomplete (Typeahead)
54 | - Conditional Facets
55 | - Grouping Results
56 | - Highlight
57 | - Search-as-you-type
58 | - Snippet
59 | - Suggestions
60 | - Voice Search
61 |
62 | ## UI Elements
63 | - Input Box
64 | - Button
65 | - List
66 | - Checkbox
67 | - Range Slider
68 | - Radio Button
69 | - Dropdown
70 |
71 | ## Resources
72 | - [InstantSearch.js UI Library Widgets](https://www.algolia.com/doc/api-reference/widgets/js/).
73 | - [ReactiveSearch Components](https://docs.reactivesearch.io/docs/reactivesearch/react/overview/components/).
74 | - They have sub-pages for each type of component and this includes a screenshot of the visual appearance of the component, which can be quite helpful.
75 | - [Elastic Search UI Components](https://docs.elastic.co/search-ui/api/react/components/search-box)
76 | - David Urbansky's article [Search UI Patterns: Elements](https://ddsky.medium.com/search-ui-patterns-elements-80ea9d241f97)
77 |
78 | ### Generic Component Libraries
79 | Just a few sample UI component libraries that may be worth consideration. Not meant to be representative or exhaustive.
80 |
81 | #### General
82 | - [Bootstrap](https://getbootstrap.com/)
83 | - [Semantic UI](https://semantic-ui.com/)
84 | - Has some components specifically dedicated to search.
85 | - [Shoelace](https://shoelace.style/)
86 |
87 | #### Angular
88 | - [PrimeNG](https://primeng.org/)
89 |
90 | #### React
91 | - [Material UI (MUI)](https://mui.com/material-ui/all-components/)
92 | - [MP React Components](https://materialsproject.github.io/mp-react-components/?path=/story/introduction-mp-react-components--page)
93 | - Focused on material sciences.
94 | - [Chakra](https://chakra-ui.com/docs/components)
95 | - [Radix](https://www.radix-ui.com/)
96 | - Headless UI.
97 | - [React Bootstrap](https://react-bootstrap.github.io/)
98 |
99 | #### Vue
100 | - [PrimeVue](https://primevue.org/)
101 | - Has DataTable and DataView.
--------------------------------------------------------------------------------
/research/fairness-research.md:
--------------------------------------------------------------------------------
1 | # Diversity, Fairness, and Bias in Search Engines and Information Retrieval
2 |
3 | ## 2023
4 | - Ya-Lin Zhang, Yi-Xuan Sun, Fangfang Fan, Meng Li, Yeyu Zhao, Wei Wang, Longfei Li, Jun Zhou, Jinghua Feng. [A Framework for Detecting Frauds from Extremely Few Labels](https://dl.acm.org/doi/10.1145/3539597.3573022). 2/2023.
5 | - Gang Chen, Jiawei Chen, Fuli Feng, Sheng Zhou, Xiangnan He. [Unbiased Knowledge Distillation for Recommendation](https://dl.acm.org/doi/10.1145/3539597.3570477). 2/2023.
6 | - Xiaoying Zhang, Hongning Wang, Hang Li. [Disentangled Representation for Diversified Recommendations](https://dl.acm.org/doi/10.1145/3539597.3570389). 2/2023.
7 | - Sophie Scharf, Monika Wiegelmann, Arnt Bröder. [Information search in everyday decisions: The generalizability of the attraction search effect](https://www.researchgate.net/publication/366827841_Information_search_in_everyday_decisions_The_generalizability_of_the_attraction_search_effect). 1/2023.
8 | - Zheng Hu, Satoshi Nakagawa, Liang Luo, Yu Gu, Fuji Ren. [Celebrity-aware Graph Contrastive Learning Framework for Social Recommendation](https://dl.acm.org/doi/10.1145/3583780.3614806). CIKM '23. 10/2023.*
9 |
10 | ## 2022
11 | - Amifa Raj. [Fair Ranking Metrics](https://dl.acm.org/doi/10.1145/3523227.3547430). 9/2022.
12 | - Yuta Saito, Thorsten Joachims. [Fair Ranking as Fair Division: Impact-Based Individual Fairness in Ranking](https://dl.acm.org/doi/10.1145/3534678.3539353). 8/2022.*
13 | - Ji Liu, Zenan Li, Yuan Yao, Feng Xu, Xiaoxing Ma, Miao Xu, Hanghang Tong. [Fair Representation Learning: An Alternative to Mutual Information](https://dl.acm.org/doi/10.1145/3534678.3539302). 8/2022.
14 | - Mouxiang Chen, Chenghao Liu, Zemin Liu, Jianling Sun. [Scalar is Not Enough: Vectorization-based Unbiased Learning to Rank](https://dl.acm.org/doi/10.1145/3534678.3539468). 8/2022.
15 | - Zhaolin Gao, Tianshu Shen, Zheda Mai, Mohamed Reda Bouadjenek, Isaac Waller, Ashton Anderson. [Mitigating the Filter Bubble While Maintaining Relevance: Targeted Diversification with VAE-based Recommender Systems](https://dl.acm.org/doi/10.1145/3477495.3531890). 7/2022.
16 | - Wenjie Wang, Fuli Feng, Liqiang Nie, Tat-Seng Chua. [User-controllable Recommendation Against Filter Bubbles](https://dl.acm.org/doi/10.1145/3477495.3532075). 7/2022.*
17 | - Yuan Wang, Zhiqiang Tao, Yi Fang. [A Meta-learning Approach to Fair Ranking](https://dl.acm.org/doi/10.1145/3477495.3531892). 7/2022.*
18 | - Mohammadmehdi Naghiaei, Hossein A. Rahmani, Yashar Deldjoo. [CPFair: Personalized Consumer and Producer Fairness Re-ranking for Recommendation Systems](https://dl.acm.org/doi/10.1145/3477495.3531959). 7/2022.
19 | - Anja Klasnja, Negar Arabzadeh, Mahbod Mehrvarz, Ebrahim Bagheri. [On the Characteristics of Ranking-based Gender Bias Measures](https://dl.acm.org/doi/10.1145/3501247.3531540). 6/2022.*
20 | - Qinzhi Jiang, Mustafa Naseem, Jamie Lai, Kentaro Toyama, Panos Papalambros. [Understanding Power Differentials and Cultural Differences in Co-design with Marginalized Populations](https://dl.acm.org/doi/10.1145/3530190.3534819). 6/2022.
21 |
22 | ## 2021
23 | - Valeria Mazzeo, Andrea Rapisarda, Giovanni Giuffrida. [Detection of Fake News on COVID-19 on Web Search Engines](https://www.researchgate.net/publication/352838694_Detection_of_Fake_News_on_COVID-19_on_Web_Search_Engines). 6/2021.
24 |
25 | ## 2019
26 | - Juhi Kulshrestha, Motahhare Eslami, Johnnatan Messias, Muhammad Bilal Zafar, Saptarshi Ghosh, Krishna P. Gummadi, Karrie Karahalios. [Search bias quantification: investigating political bias in social media and web search](https://www.researchgate.net/publication/327146029_Search_bias_quantification_investigating_political_bias_in_social_media_and_web_search). 4/2019.*
27 |
28 | ## 2018
29 | - Will Serrano. [Neural Networks in Big Data and Web Search](https://www.researchgate.net/publication/330028298_Neural_Networks_in_Big_Data_and_Web_Search). 12/2018.
30 |
31 | ## 2014
32 | - Xinyu Xing, Wei Meng, Dan Doozan, Nick Feamster, Wenke Lee, Alex C. Snoeren. [Exposing Inconsistent Web Search Results with Bobble](https://www.researchgate.net/publication/301967705_Exposing_Inconsistent_Web_Search_Results_with_Bobble). 3/2014.
33 |
34 | ## 2013
35 | - Ryen White. [Beliefs and biases in web search](https://www.researchgate.net/publication/262393954_Beliefs_and_biases_in_web_search). 7/2013.
36 | - Rodrygo L. T. Santos. [Explicit web search result diversification](https://www.researchgate.net/publication/262272502_Explicit_web_search_result_diversification). 6/2013.
--------------------------------------------------------------------------------
/specific-engines/solr/solr-resources-utlities.md:
--------------------------------------------------------------------------------
1 | # Solr Resources: Utilities
2 | - [Solr Ansible role](https://github.com/idealista/solr_role) - Stars: 23 - Updated: 1/2023 - Checked: 5/2023
3 | - [Apache Solr Container (Built with Ansible)](https://github.com/geerlingguy/solr-container) - Stars: 17 - Updated: 11/2022 - Checked: 5/2023
4 | - [NLA's blacklight-solrcloud-repository](https://github.com/nla/blacklight-solrcloud-repository) - Stars: 0 - Updated: 5/2023 - Checked: 5/2023
5 | - "A Blacklight repository to connect with a collection on a ZooKeeper managed SolrCloud cluster."
6 | - [Solr Bulk Indexing](https://github.com/miku/solrbulk) - Stars: 39 - Updated: 4/2023 - Checked: 5/2023
7 | - For indexing "a bunch of documents really, really, fast"
8 | - [solr-cmd-utils](https://github.com/tblsoft/solr-cmd-utils) - Stars: 3 - Updated: 12/2022 - Checked: 5/2023
9 | - Includes solr-pipeline, solr-dump, solr-extract-nouns, solr-numfound.
10 | - [Solr DB Importer](https://github.com/saro-lab/solr-db-importer) - Stars: 10 - Updated: 3/2023 - Checked: 5/2023
11 | - Supports MariaDB, Oracle, MSSQL, MySQL, PostgreSQL, and H2.
12 | - [ik-analyzer-solr](https://github.com/magese/ik-analyzer-solr) - Stars: 1.1k - Updated: 1/2022 - Checked: 5/2023
13 | - [Data Import Handler](https://github.com/SearchScale/dataimporthandler) - Stars: 51 - Updated: 4/2023 - Checked: 5/2023
14 | - Assists in importing records from databases into Solr.
15 | - [Multi Tier Annotation Search (MTAS)](https://github.com/textexploration/mtas) - Stars: 6 - Updated: 1/2022 - Checked: 5/2023
16 | - [RequestSanitizer](https://github.com/cominvent/request-sanitizer-component) - Stars: 4 - Updated: 12/2022 - Checked: 5/2023
17 | - Sanitizes request parameter input.
18 | - [Relevancy Dashboard](https://github.com/sul-dlss/relevancy_dashboard) - Stars: 3 - Updated: 5/2023 - Checked: 5/2023
19 | - "Analyzing relevancy changes across solr versions"
20 | - [solr-diagnostics](https://github.com/sematext/solr-diagnostics) - Stars: 5 - Updated: 5/2021 - Checked: 5/2023
21 | - "Gathers info from Solr that should help diagnose issues"
22 | - [OSC's Solr Dump](https://github.com/o19s/solr_dump) - Stars: 7 - Updated: 3/2022 - Checked: 5/2023
23 | - "Dump a Solr index to file; Read from dumped file to Solr."
24 | - [solrdump](https://github.com/ubleipzig/solrdump) - Stars: 35 - Updated: 4/2023 - Checked: 5/2023
25 | - "Export documents from a SOLR index as JSON"
26 | - [SolrCloud Manager](https://github.com/ekataglobal/solrcloud_manager) - Stars: 23 - Updated: 1/2022 - Checked: 5/2023
27 | - "Provides easy SolrCloud cluster management."
28 | - [solrcopy](https://github.com/juarezr/solrcopy) - Stars: 6 - Updated: 3/2022 - Checked: 5/2023
29 | - CLI for backup/restore of documents in Solr cores.
30 | - [Solr Grouping](https://github.com/nla/solr-grouping) - Stars: 2 - Updated: 1/2023 - Checked: 5/2023
31 | - Allows two levels of grouping (from the National Library of Australia)
32 | - [Solr Operator](https://github.com/apache/solr-operator) - Stars: 208 - Updated: 5/2023 - Checked: 5/2023
33 | - "Kubernetes Operator for Apache Solr"
34 | - [Lucidworks Spark/Solr Integration](https://github.com/lucidworks/spark-solr) - Stars: 440 - Updated: 5/2023 - Checked: 5/2023
35 | - "tools for reading data from Solr as a Spark DataFrame/RDD and indexing objects from Spark into Solr using SolrJ."
36 | - [solrscripts](https://github.com/tokee/solrscripts) - Stars: 10 - Updated: 4/2022 - Checked: 5/2023
37 | - Includes a tool for diffing schema configs and another for validating configs.
38 | - [Cominvent's Solr tools](https://github.com/cominvent/solr-tools) - Stars: 38 - Updated: 11/2022 - Checked: 5/2023
39 | - [SolrUtils](https://github.com/InterNations/SolrUtils) - Stars: 8 - Updated: 8/2022 - Checked: 5/2023
40 | - "helps with recurring tasks when working with Solr like escaping and sanitizing user input"
41 | - [Traject](https://github.com/traject/traject) - Stars: 98 - Updated: 4/2023 - Checked: 5/2023
42 | - "An easy to use, high-performance, flexible and extensible metadata transformation system, focused on library-archives-museums input, and indexing to Solr as output."
43 | - [Zeppelin Solr Interpreter](https://github.com/lucidworks/zeppelin-solr) - Stars: 28 - Updated: 1/2021 - Checked: 5/2023
44 | - "allows user to issue Solr queries and display results in the Zeppelin UI"
45 | - [solr-bench](https://github.com/fullstorydev/solr-bench) - Stars: 15 - Updated: 4/2023 - Checked: 5/2023
46 | - "Solr benchmarking and load testing harness"
--------------------------------------------------------------------------------
/WebCrawlers.md:
--------------------------------------------------------------------------------
1 | # Open Source Web Crawlers
2 |
3 | ## Table of Contents
4 | - Comments
5 | - General Resources
6 | - Apache Nutch
7 | - StormCrawler
8 | - Scrapy
9 | - Norconex Web Crawler
10 | - PulsarR
11 | - Heritrix
12 | - Sparkler
13 | - CoCrawler
14 | - Comparisons
15 | - Other
16 | - Maybe...?
17 |
18 | ## Comments
19 | - This page focuses on web crawlers/spiders as opposed to web scrapers. While there can be significant overlap between the two, our goal is to evaluate systems that are meant for web scale crawling.
20 | - This document focuses on general purpose web crawlers. There is a growing niche of crawlers created specifically for security purposes which are not covered here.
21 | - We focus primarily on projects which are being actively developed. Projects which are showing limited signs of life may not be included. If you feel we've passed over a project that should be included, please create an issue or pull request.
22 |
23 | ## General Resources
24 | - [Awesome Crawler](https://github.com/BruceDone/awesome-crawler) - Stars: 5.5k - Updated: 12/2022 - Checked: 2/2025.
25 |
26 | ## Apache Nutch
27 | - https://nutch.apache.org/
28 | - [GitHub Repo](https://github.com/apache/nutch)
29 | - Stars: 2.6k - Updated: 3/2023 - Checked: 4/2023.
30 | - Probably the best known and most utilized open source web crawler.
31 | - [Nutch Tutorial](https://cwiki.apache.org/confluence/display/NUTCH/NutchTutorial) - The official tutorial for getting started with Nutch.
32 |
33 | ## StormCrawler
34 | - http://stormcrawler.net/index.html
35 | - [GitHub Repo](https://github.com/DigitalPebble/storm-crawler)
36 | - Stars: 795 - Updated: 4/2023 - Checked: 4/2023.
37 | - Open source web crawler built on Apache Storm.
38 | - OpenWebSearch.eu's [Owler](https://openwebsearch.eu/owler/) web crawler is built off of StormCrawler.
39 |
40 | ## Scrapy
41 | - https://scrapy.org/
42 | - [GitHub Repo](https://github.com/scrapy/scrapy)
43 | - Stars: 9.9k - Updated: 4/2023 - Checked: 4/2023.
44 | - A popular, open source web crawler/scraper written in Python.
45 | - [Scrapy Documentation on Broad Crawls](https://docs.scrapy.org/en/latest/topics/broad-crawls.html).
46 | - [WebScraping API's Web Crawling With Python](https://www.webscrapingapi.com/web-crawling-with-python). 12/2022.
47 |
48 | ## Norconex Web Crawler
49 | - https://opensource.norconex.com/crawlers/web/
50 | - [GitHub Repo](https://github.com/Norconex/collector-http)
51 | - Stars: 153 - Updated: 2/2023 - Checked: 4/2023.
52 | - Open source Java web crawler.
53 |
54 | ## PulsarR
55 | - https://github.com/platonai/pulsarr
56 | - Open source web crawler written in Kotlin.
57 |
58 | ## Heritrix
59 | - https://heritrix.readthedocs.io/en/latest/
60 | - [GitHub Repo](https://github.com/internetarchive/heritrix3)
61 | - Stars: 2.4k - Updated: 3/2023 - Checked: 4/2023.
62 | - Open source web crawler written in Java by the Internet Archive.
63 | - See also Internet Archive's browser-based distributed crawler, [brozzler](https://github.com/internetarchive/brozzler).
64 |
65 | ## Sparkler
66 | - http://irds.usc.edu/sparkler/
67 | - [GitHub Repo](https://github.com/USCDataScience/sparkler)
68 | - Stars: 400 - Updated: 4/2023 - Checked: 4/2023.
69 | - A next-generation successor to Apache Nutch that uses Spark, Kafka, Lucene/Solr, Tika, and pf4j.
70 |
71 | ## CoCrawler
72 | - [GitHub Repo](https://github.com/cocrawler/cocrawler)
73 | - Stars: 166 - Updated: 4/2022 - Checked: 4/2023.
74 | - Authored by Greg Lindahl (Blekko) in Python, pre-release.
75 | - Included primarily because Lindahl has a proven track record in web crawling.
76 |
77 | ## Comparisons
78 | - Rody. [Comparison of Open Source Web Crawlers for Data Mining and Web Scraping: Pros & Cons](https://outsourceit.today/comparison-open-source-web-crawlers/). outsourceit.today, 10/2022.
79 | - Covers Scrapy, Heritrix, Nutch, and PySpider.
80 |
81 | ## Other
82 | - [Crawlab](https://github.com/crawlab-team/crawlab) - Stars: 9.7k - Updated: 4/2023 - Checked: 4/2023.
83 | - A Go language, distributed web crawler admin platform that works with multiple languages and frameworks including Scrapy.
84 | - NOTE: Does not appear to have integrations with most web scale crawlers, e.g. Nutch or StormCrawler.
85 | - [ACHE](https://ache.readthedocs.io/en/latest/) - Stars: 461 - Updated: 2023 - Checked: 2/2025.
86 | - "ACHE is a web crawler for domain-specific search."
87 | - [Browsertrix Crawler](https://github.com/webrecorder/browsertrix-crawler) - Stars: 702 - Updated: 2/2025 - Checked: 2/2025
88 | - "Run a high-fidelity browser-based web archiving crawler in a single Docker container"
89 |
90 | ## Maybe...?
91 | - This section includes a few crawlers that are in development and show some promise.
92 |
93 | - [Crawler](https://github.com/crwlrsoft/crawler) - Stars: 233 - Updated: 3/2023 - Checked: 4/2023.
94 | - Including this one because it's written in PHP, which isn't particularly common for web crawlers.
95 | - [SeimiCrawler](https://github.com/zhegexiaohuozi/SeimiCrawler) - Stars: 1.9k - Updated: 4/2023 - Checked: 4/2023.
96 | - A Java-based, distributed, open source web crawler.
97 | - [XXL-CRAWLER](https://www.xuxueli.com/xxl-crawler/) - Stars: 654 - Updated: 10/2022 - Checked: 4/2023.
98 | - A Java-based, distributed, open source web crawler.
99 | - [Sparkler-Crawler](https://github.com/USCDataScience/sparkler) - Stars: 400 - Updated: 4/2023 - Checked: 4/2023.
100 | - A Java/Scala based web crawler built on Spark.
101 | - [crawler](https://github.com/a11ywatch/crawler) - Stars: 22 - Updated: 4/2023 - Checked: 4/2023.
102 | - A Rust, open source web crawler that claims it is "capable of handling millions of pages per second efficiently."
103 | - [colly](https://github.com/gocolly/colly) - Stars: 19.3k - Updated: 4/2023 - Checked: 4/2023.
104 | - A Go language open source framework for building crawlers/scrapers/spiders.
105 | - [Montferret](https://www.montferret.dev/) - Stars: 5.3k - Updated: 4/2023 - Checked: 4/2023.
106 | - A Go language, open source web scraper. Letting it slide in for its interesting declarative approach.
--------------------------------------------------------------------------------
/research/research-main.md:
--------------------------------------------------------------------------------
1 | # Research on Search and Information Retrieval
2 |
3 | ## Collaborative
4 | - See [the page dedicated to collaborative research](collaborative-research.md).
5 |
6 | ## Diversity, Fairness, Bias
7 | - See [the page dedicated to diversity, fairness, and bias research](fairness-research.md).
8 |
9 | ## Federation
10 | - See [the page dedicated to federation research](federation-research.md).
11 |
12 | ## Personalization
13 | - See [the page dedicated to personalization research](personalization-research.md).
14 |
15 | ## Ranking
16 | - See [the page dedicated to ranking research](ranking-research.md).
17 |
18 | ## Web Crawling
19 | - See [the page dedicated to web crawling research](crawling-research.md).
20 |
21 | ## Decentralization
22 | - See [the page dedicated to decentralization research](decentralization-research.md).
23 |
24 | ## Semantic
25 | - See [the page dedicated to semantic research](semantic-research.md).
26 |
27 | ## Uncategorized
28 | - See [the page dedicated to uncategorized research](uncategorized-research.md).
29 |
30 | ## Recommendations
31 | - See [the page dedicated to recommendations research](recommendations-research.md).
32 |
33 | ## Trustworthiness
34 | - See [the page dedicated to trustworthiness research](trust-research.md).
35 |
36 | ## Conversational
37 | - Jeffrey Dalton, Sophie Fischer, Paul Owoicho, Filip Radlinski, Federico Rossetto, Johanne R. Trippas, Hamed Zamani. [Conversational Information Seeking: Theory and Application](https://dl.acm.org/doi/10.1145/3477495.3532678). 7/2022.*
38 |
39 | ## Internationalization and Localization
40 | - Zhuliu Li, Yiming Wang, Xiao Yan, Weizhi Meng, Yanen Li, Jaewon Yang. [TaxoTrans: Taxonomy-Guided Entity Translation](https://dl.acm.org/doi/10.1145/3534678.3539188). 8/2022.
41 |
42 | ## Privacy
43 | - Amit Kumar, Marc Spaniol. [There is a fine Line between Personalization and Surveillance: Semantic User Interest Tracing via Entity-Level Analytics](https://dl.acm.org/doi/10.1145/3501247.3531592). 6/2022.
44 |
45 | ## Storage
46 | - Laurens Debackere, Pieter Colpaert, Ruben Taelman, Ruben Verborgh. [A Policy-Oriented Architecture for Enforcing Consent in Solid](https://dl.acm.org/doi/10.1145/3487553.3524630). 8/2022.
47 |
48 | ## Performance, Scalability
49 | - B. Barla Cambazoglu, Ricardo Baeza-Yates. [Scalability and Efficiency Challenges in Large-Scale Web Search Engines](https://dl.acm.org/doi/10.1145/2684822.2697039). 2/2015.
50 | - Kamlesh Kumar Pandey, Narendra Pradhan. [Internet Search Engine: Performance Evaluating the Google, Yahoo and Bing Web Search Engine based on their Searching Capabilities](https://www.researchgate.net/publication/324482784_Internet_Search_Engine_Performance_Evaluating_the_Google_Yahoo_and_Bing_Web_Search_Engine_based_on_their_Searching_Capabilities). 2/2018.
51 |
52 | ## SPAM
53 | - Asim Shahzad, Nazri Mohd Nawi, Syed Muhammad Zubair Rehman Gillani, Abdullah Khan. [An Improved Framework for Content- and Link-Based Web-Spam Detection: A Combined Approach](https://www.researchgate.net/publication/356258479_An_Improved_Framework_for_Content-_and_Link-Based_Web-Spam_Detection_A_Combined_Approach). 11/2021.
54 | - Jayakrishnan Ashok, Pankaj Badoni. [Web Content Authentication: A Machine Learning Approach to Identify Fake and Authentic Web Pages on Internet](https://www.researchgate.net/publication/353005229_Web_Content_Authentication_A_Machine_Learning_Approach_to_Identify_Fake_and_Authentic_Web_Pages_on_Internet). 7/2021.
55 | - Asim Shahzad, Jamaluddin Mir, Aamer Khan, Muhammad Asshad, Muhammad Zeeshan, Ahsan Zubair, Muhammad Naeem. [The Web Spam Taxonomy and Algorithms for Detection and Prevention of Web Spamming - A Systematic Review](https://www.researchgate.net/publication/362861741_The_Web_Spam_Taxonomy_and_Algorithms_for_Detection_and_Prevention_of_Web_Spamming_-A_Systematic_Review). 7/2021.*
56 |
57 | ## Older Research
58 | - Na Dai, Brian D. Davison. [Capturing page freshness for web search](https://dl.acm.org/doi/10.1145/1835449.1835658). 7/2010.*
59 | - Carlos Castillo, Brian Davison. [Adversarial Web Search](https://www.researchgate.net/publication/220613785_Adversarial_Web_Search). 1/2010.
60 | - Amanda Spink, Michael Zimmer. [Web Search: Multidisciplinary Perspectives](https://www.researchgate.net/publication/321614743_Web_Search_Multidisciplinary_Perspectives). 1/2008.
61 | - Fang Qi-Ming, Yang Guang-Wen, Wu Yong-Wei, Zheng Wei Min. [P2P Web Search Technology](https://www.researchgate.net/publication/253605198_P2P_Web_Search_Technology). 1/2008.
62 | - Jim Jansen, Sherry Koshman, Amanda Spink. [Web Searching on the Vivisimo Search Engine](https://www.researchgate.net/publication/27479615_Web_Searching_on_the_Vivisimo_Search_Engine). 12/2006.
63 | - Nils Kammenhuber, Julia Luxenburger, Anja Feldmann, Gerhard Weikum. [Web search clickstreams](https://www.researchgate.net/publication/221611907_Web_search_clickstreams). 10/2006.
64 | - Amanda Spink, Minsoo Park, Jim Jansen, Jan Pedersen. [Multitasking during Web search sessions](https://www.researchgate.net/publication/222436299_Multitasking_during_Web_search_sessions). 1/2006.
65 | - Yiping Ke, Lin Deng, Wee Keong Ng, Dik Lee. [Web dynamics and their ramifications for the development of Web search engines](https://www.researchgate.net/publication/222416900_Web_dynamics_and_their_ramifications_for_the_development_of_Web_search_engines).
66 | - Jim Gray. [A Conversation with Tim Bray: Searching for ways to tame the world's vast stores of information](https://dl.acm.org/doi/10.1145/1046931.1046941). 2/2005.*
67 | - Mike Cafarella, Doug Cutting. [Building Nutch: Open Source Search: A case study in writing an open source search engine](https://dl.acm.org/doi/10.1145/988392.988408). 4/2004.*
68 | - Anna Patterson. [Why Writing Your Own Search Engine Is Hard: Big or small, proprietary or open source, Web or intranet, it's a tough job](https://dl.acm.org/doi/10.1145/988392.988407). 4/2004.*
69 | - Amanda Spink. [Web Search: Emerging Patterns](https://www.researchgate.net/publication/32962078_Web_Search_Emerging_Patterns). 9/2003.
70 | - Upendra Shardanand, Pattie Maes. [Social Information Filtering: Algorithms for Automating "Word of Mouth"](https://dl.acm.org/doi/10.1145/223904.223931). CHI '95. 5/1995.
--------------------------------------------------------------------------------
/specific-engines/aws-opensearch/aws-opensearch-main.md:
--------------------------------------------------------------------------------
1 | # AWS OpenSearch Service
2 |
3 | ## Introduction
4 | Amazon Web Services (AWS) used to offer a managed service called Amazon Elasticsearch Service (Amazon ES) that utilized the open source Elasticsearch engine. Elastic changed its licensing in an attempt to prevent Amazon from using its software without paying what Elastic saw as a reasonable price. AWS forked the last fully open source version of Elasticsearch and rebranded it as OpenSearch. OpenSearch itself is an open source search engine application that does not depend on AWS. AWS OpenSearch Service is a managed OpenSearch service. There is also a serverless offering available, but we will be focusing primarily on the managed offering at this point.
5 |
6 | ## Caveat
7 | Both Elasticsearch and OpenSearch are powerful document search engines, but much of the documentation on them focuses on their use in DevOps and security analytics contexts. We will not be covering these topics in their own right, although we will address them as they apply to the proper configuration and maintenance of OpenSearch clusters.
8 |
9 | ## Resources
10 | - [Amazon OpenSearch Service Documentation](https://docs.aws.amazon.com/opensearch-service/)
11 | - [Amazon OpenSearch Service Developer Guide](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/what-is.html)
12 | - [Searching data in Amazon OpenSearch Service](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/searching.html)
13 | - [Amazon OpenSearch Service API Operations](https://docs.aws.amazon.com/opensearch-service/latest/APIReference/API_Operations_Amazon_OpenSearch_Service.html)
14 | - [Amazon OpenSearch Service AWS CLI Command Reference](https://docs.aws.amazon.com/cli/latest/reference/opensearch/)
15 | - [Amazon OpenSearch Ingestion API Operations Documentation](https://docs.aws.amazon.com/opensearch-service/latest/APIReference/API_Operations_Amazon_OpenSearch_Ingestion.html)
16 |
17 |
18 |
19 |
20 | ## Ways to Search
21 | - URI - Simple but limited in what functionality it can utilize.
22 | - Request Body - Slightly more complex but able to utilize the full range of OpenSearch DSL.
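A minimal sketch of the difference between the two styles, built offline without sending anything. The endpoint, index name, and field names (`title`, `plot`) are hypothetical placeholders, not values from any real domain.

```python
import json
from urllib.parse import urlencode

# Hypothetical values for illustration only.
ENDPOINT = "https://search-example.us-east-1.es.amazonaws.com"
INDEX = "movies"


def uri_search_url(text: str, size: int = 10) -> str:
    """URI search: the whole query travels in the URL query string."""
    params = urlencode({"q": text, "size": size})
    return f"{ENDPOINT}/{INDEX}/_search?{params}"


def request_body_search(text: str, size: int = 10) -> str:
    """Request body search: a JSON document written in the query DSL."""
    body = {
        "size": size,
        "query": {
            "multi_match": {
                "query": text,
                # "title^2" weights title matches twice as heavily (assumed fields).
                "fields": ["title^2", "plot"],
            }
        },
    }
    return json.dumps(body)


print(uri_search_url("star wars"))
print(request_body_search("star wars"))
```

The URI form fits in a browser address bar or a one-line `curl`, while the body form is what would be sent as the payload of a request to the same `_search` path when per-field weighting or other DSL features are needed.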
23 |
24 | ## Boosting
25 |
26 | ## Search Result Highlighting
27 |
28 | ## Pagination
29 | - Point in Time (PIT) - Runs the queries against the data as it was at a specific point in time. This is the preferred method.
30 | - Using From and Size Parameters - Slightly less complicated but may not be as accurate as PIT.
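The two approaches can be sketched as request bodies. This is an illustration under assumptions: the sort fields `published_at` and `doc_id` are hypothetical, and the PIT ID is assumed to have been obtained beforehand (in OpenSearch, from `POST /<index>/_search/point_in_time?keep_alive=1m`).

```python
from typing import Optional


def from_size_body(query: dict, page: int, per_page: int = 10) -> dict:
    """Offset pagination: skip (page - 1) * per_page hits, then return per_page."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    return {"from": (page - 1) * per_page, "size": per_page, "query": query}


def pit_body(query: dict, pit_id: str, per_page: int = 10,
             search_after: Optional[list] = None) -> dict:
    """PIT pagination: query a frozen view and continue from the last hit's sort values."""
    body = {
        "size": per_page,
        "query": query,
        "pit": {"id": pit_id, "keep_alive": "1m"},
        # Hypothetical sort fields; the second acts as a tiebreaker so the order is stable.
        "sort": [{"published_at": "desc"}, {"doc_id": "asc"}],
    }
    if search_after is not None:
        # Sort values copied from the last hit of the previous page.
        body["search_after"] = search_after
    return body


print(from_size_body({"match_all": {}}, page=3, per_page=20))
print(pit_body({"match_all": {}}, pit_id="example-pit-id",
               search_after=["2023-04-01", "doc-42"]))
```

With `from`/`size`, documents indexed or deleted between requests can shift hits across page boundaries, which is one reason the PIT approach tends to be more consistent.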
31 |
32 | ## Packages (Dictionaries, Plugins)
33 | - See [Custom packages for Amazon OpenSearch Service](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/custom-packages.html) and [Plugins by engine version in Amazon OpenSearch Service](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/supported-plugins.html)
34 |
35 | Somewhat confusingly named, these are custom dictionary files and plugins that can improve the quality of results returned.
36 |
37 | The plugins provided by Amazon for OpenSearch Service currently include analyzers for Japanese, Chinese, Pinyin, and Korean as well as the more generally applicable Amazon Personalized Ranking plugin ("re-ranks OpenSearch results based on each user's past behavior and preferences").
38 |
39 | We can use a synonym token filter to add tokens and a stop token filter to remove tokens when a specific token is found.
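As a rough sketch of what such a configuration might look like, the index settings below define one synonym filter and one stop filter and chain them in a custom analyzer. All names (`product_synonyms`, `product_stopwords`, `product_text`) and the word lists are made up for illustration, and the lists are written inline; with a custom package, the filter would instead reference the uploaded dictionary file.

```python
import json

# Hypothetical index settings; filter and analyzer names are illustrative only.
index_settings = {
    "settings": {
        "analysis": {
            "filter": {
                "product_synonyms": {
                    "type": "synonym",
                    # When "laptop" is seen, "notebook" is added as a token too (and vice versa).
                    "synonyms": ["laptop, notebook"],
                },
                "product_stopwords": {
                    "type": "stop",
                    # These tokens are dropped from the token stream.
                    "stopwords": ["the", "a", "an"],
                },
            },
            "analyzer": {
                "product_text": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "product_stopwords", "product_synonyms"],
                }
            },
        }
    }
}

print(json.dumps(index_settings, indent=2))
```

A body like this would be supplied when creating an index, after which text fields can name `product_text` as their analyzer.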
40 |
41 | ## SQL
42 | - See [Querying your Amazon OpenSearch Service data with SQL](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/sql-support.html) and [SQL and PPL](https://opensearch.org/docs/latest/search-plugins/sql/index/)
43 | - It is what it sounds like: you can use SQL instead of the JSON-based DSL to query your OpenSearch cluster.
44 | - There is a SQL Workbench in OpenSearch Dashboards, a SQL CLI is available, as well as a JDBC driver. There is also a read-only ODBC driver.
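For a sense of what this looks like over HTTP, the sketch below builds the path and JSON payload for the OpenSearch SQL plugin's `_plugins/_sql` endpoint. The index name `movies` and its columns are hypothetical, and nothing is actually sent.

```python
import json
from typing import Tuple


def sql_request(sql: str) -> Tuple[str, str]:
    """Return the (path, JSON body) pair for a SQL query against the SQL plugin."""
    return "/_plugins/_sql", json.dumps({"query": sql})


# Hypothetical index and columns, for illustration only.
path, body = sql_request(
    "SELECT title, year FROM movies WHERE year > 2000 ORDER BY year DESC LIMIT 5"
)
print("POST", path)
print(body)
```

The same filter expressed in the query DSL would need a `range` clause, a `sort` array, and a `size` setting, so SQL can be the quicker route for people who already think in tables and rows.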
45 |
46 | ## k-NN search
47 | - See [k-Nearest Neighbor (k-NN) search in Amazon OpenSearch Service](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/knn.html) and [k-NN search](https://opensearch.org/docs/latest/search-plugins/knn/index/).
48 |
49 | ## Cross-cluster search
50 | - See [Cross-cluster search in Amazon OpenSearch Service](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/cross-cluster-search.html)
51 | - Sometimes one may want to create multiple smaller domains rather than one large domain. This is helpful when the data will be used for different types of workloads, since each cluster can be optimized for its specific workload.
52 |
53 | ## Learning to Rank
54 | - See [Learning to Rank for Amazon OpenSearch Service](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/learning-to-rank.html), [Elastic Learning to Rank Documentation](https://elasticsearch-learning-to-rank.readthedocs.io/en/latest/index.html)
55 | - Uses machine learning and behavioral data to tune the relevance of search results.
56 | - Based on the Elasticsearch LTR plugin, which utilizes models from the XGBoost and RankLib libraries for rescoring results.
57 |
58 | ## Asynchronous Search
59 | - See [Asynchronous search in Amazon OpenSearch Service](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/asynchronous-search.html)
60 | - "With asynchronous search for Amazon OpenSearch Service you can submit a search query that gets executed in the background, monitor the progress of the request, and retrieve results at a later stage. You can retrieve partial results as they become available before the search has completed. After the search finishes, save the results for later retrieval and analysis."
61 |
62 | ## Point in Time (PIT)
63 | - See [Point in time in Amazon OpenSearch Service](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/pit.html)
64 | - "The point in time (PIT) feature is a type of search that lets you run different queries against a dataset that's fixed in time. Typically, when you run the same query on the same index at different points in time, you receive different results because documents are constantly indexed, updated, and deleted. With PIT, you can query against a constant state of your dataset."
65 |
66 | ## Semantic Search
67 | - See [Semantic search in Amazon OpenSearch Service](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/semantic-search.html)
68 | - Allows one to perform semantic search using neural search or k-NN.
69 | - DM: Explore this section further.
70 |
--------------------------------------------------------------------------------
/OpenSourceSearchEngines.md:
--------------------------------------------------------------------------------
1 | # Open Source Search Engines
2 |
3 | ## Table of Contents
4 | - Apache Lucene
5 | - Lucene++
6 | - Apache Solr
7 | - Open Semantic Search
8 | - Subprojects
9 | - Solr PHP UI
10 | - Elasticsearch
11 | - Other Projects
12 | - dejavu
13 | - Fess
14 | - Searchkit
15 | - OpenSearch
16 | - Other Projects
17 | - Gigablast
18 | - [YaCy](/specific-engines/yacy.md)
19 | - Articles
20 | - Vald
21 | - Weaviate
22 | - MWMBL
23 | - Alexandria
24 | - Wiby
25 | - OpenSearchServer
26 | - Metasearch
27 | - MetaGer
28 | - Not Web Scale
29 | - meilisearch
30 | - Typesense
31 | - Smaller Engines
32 | - Sonic
33 | - ZincSearch
34 |
35 | ## Apache Lucene
36 | - https://lucene.apache.org/
37 | - The open source Java library that powers Apache Solr and Elasticsearch, among many other search projects.
38 |
39 | ### Lucene++
40 | - https://github.com/luceneplusplus/LucenePlusPlus
41 | - An open source C++ port of Lucene.
42 |
43 | ## Apache Solr
44 | - https://solr.apache.org/
45 | - See also dedicated pages [on Solr](/specific-engines/apache-solr.md)
46 |
47 | ## Open Semantic Search
48 | - https://opensemanticsearch.org/
49 | - Under the hood it runs Apache Solr, but there are some significant changes that make listing Open Semantic Search separately worthwhile.[^opensemanticsearch]
50 |
51 | ### Subprojects
52 | - [Solr PHP UI](https://opensemanticsearch.org/solr-php-ui/) - Stars: 20 - Updated: 12/2021 - Checked: 2/2024
53 | - A frontend for Open Semantic Search.
54 | - [GitHub Repo](https://github.com/opensemanticsearch/solr-php-ui)
55 | - [Solr Ontology Tagger](https://github.com/opensemanticsearch/solr-ontology-tagger) - Stars: 39 - Updated: 1/2022 - Checked: 5/2023
56 | - [Solr Synonames](https://github.com/opensemanticsearch/solr-synonames) - Stars: 5 - Updated: 10/2020 - Checked: 5/2023
57 |
58 | ## Elasticsearch
59 | - https://elastic.co/
60 | - See also the [dedicated pages on Elasticsearch](/specific-engines/elasticsearch.md).
61 |
62 | ### Other Projects
63 | - [dejavu](https://github.com/appbaseio/dejavu) - Open source, JS web-based UI for Elasticsearch and OpenSearch.
64 | - [Fess](https://fess.codelibs.org/) - Open source, enterprise search server with web crawler and GUI. Written in Java.
65 | - [Searchkit](https://github.com/searchkit/searchkit) - Updated: 3/2023 - Checked: 3/2023 - Stars: 4.6k - Open source library for building search UIs with JS, React, Vue, Angular, etc. Written primarily in TypeScript.
66 |
67 | ## OpenSearch
68 | - https://opensearch.org/
69 | - An open source fork of Elasticsearch started by Amazon.[^controversy]
70 | - See also the [dedicated pages on OpenSearch](/specific-engines/opensearch.md)
71 |
72 | ### Other Projects
73 | - Please see Other Projects under Elasticsearch. Only projects that are for OpenSearch exclusively will be listed here.
74 |
75 | ## Gigablast
76 | - https://gigablast.com/
77 | - [GitHub Repo](https://github.com/gigablast/open-source-search-engine)
78 | - Founded in 2000 by Matt Wells as a closed source search engine, it was later open sourced. It is written in C++, is distributed, and includes both the engine and a crawler.
79 |
80 | ## YaCy
81 | - Please see the [dedicated page on YaCy](/specific-engines/yacy.md).
82 |
83 | ## Vald
84 | - https://vald.vdaas.org/
85 | - [GitHub Repo](https://github.com/vdaas/vald)
86 | - An open source, distributed vector search engine built using Go, utilized by Yahoo Japan.
87 |
88 | ## Weaviate
89 | - https://weaviate.io/
90 | - [GitHub Repo](https://github.com/weaviate/weaviate)
91 | - Open source vector search engine written in Go.
92 | - [Semantic Search through Wikipedia with Weaviate](https://github.com/weaviate/semantic-search-through-wikipedia-with-weaviate)
93 |
94 | ## MWMBL
95 | - https://mwmbl.org/
96 | - [GitHub Repo](https://github.com/mwmbl/mwmbl)
97 | - Open source, non-profit search engine written in Python.[^mwmbl]
98 |
99 | ## Alexandria
100 | - https://www.alexandria.org/
101 | - [GitHub Repo](https://www.alexandria.org/)
102 | - Open source search engine that uses CommonCrawl and is written in C++.
103 |
104 | ## Wiby
105 | - https://wiby.me/
106 | - [GitHub Repo](https://github.com/wibyweb/wiby)
107 | - [Installation and Setup Instructions](https://wiby.me/about/guide.html)
108 | - Open source search engine written in PHP, C, and Go.
109 |
110 | ## OpenSearchServer
111 | - https://www.opensearchserver.com/
112 | - [GitHub Repo](https://github.com/jaeksoft/opensearchserver)
113 | - Open source search engine written in Java, includes bundled crawler.
114 | - Note: No updates since 8/2021 as of 3/2023.
115 |
116 | ## Metasearch
117 |
118 | ### MetaGer
119 | - https://metager.org/
120 | - [Git Repo](https://gitlab.metager.de/open-source/MetaGer)
121 | - Open source metasearch engine run by a nonprofit.
122 |
123 | ## Not Web Scale
124 |
125 | ### meilisearch
126 | - https://www.meilisearch.com/
127 | - [GitHub Repo](https://github.com/meilisearch/meilisearch)
128 | - An open source search engine written in Rust.
129 |
130 | ### Typesense
131 | - https://typesense.org/
132 | - [GitHub Repo](https://github.com/typesense/typesense)
133 | - An open source Algolia alternative written in C/C++.[^typesense]
134 |
135 | ## Smaller Engines
136 | - [Sonic](https://github.com/valeriansaliou/sonic) - Updated: 1/2023 - Checked: 3/2023 - Stars: 18k - A lightweight, speedy search backend written in Rust.
137 | - [ZincSearch](https://github.com/zincsearch/zincsearch) - Updated: 3/2023 - Checked: 3/2023 - Stars: 14.7k - Lightweight alternative to Elasticsearch, written in Go. Includes a web UI.
138 |
139 | ## Footnotes
140 | [^controversy]: The fork was started following controversial licensing changes by Elasticsearch. For more on the history of this controversy see Graham Gillen's [Elasticsearch vs OpenSearch series](https://pureinsights.com/blog/2021/elasticsearch-vs-opensearch-user-point-of-view-part-1-of-3/). For a brief evaluation of OpenSearch's progress see Matt Asay's [One year of OpenSearch: Grading AWS’ open source effort](https://www.techrepublic.com/article/opensearch-grading-aws-open-source/).
141 | [^typesense]: Some interesting functionality includes tunable ranking, sorting, faceting & filtering, grouping & distinct, federated search, and curation. It doesn't appear to be in web scale usage, but they've expressed interest in benchmarking larger datasets, so I submitted an [issue requesting CommonCrawl be benchmarked](https://github.com/typesense/typesense/issues/933).
142 | [^opensemanticsearch]: It isn't meant for web search particularly but it offers a number of features which could be useful in a search engine - e.g. exploratory search as well as collaborative annotation and tagging.
143 | [^mwmbl]: The project has some similarities with what I'm looking to do with [Phoebe](https://github.com/davidshq/next-search/). It is open source, a non-profit, and the code is written in Python.
--------------------------------------------------------------------------------
/specific-engines/apache-solr.md:
--------------------------------------------------------------------------------
1 | # Apache Solr
2 | - Introduction
3 | - My Notes
4 | - Related Projects
5 | - General
6 | - [Discovery and/or UI](./solr/solr-resources-ui.md)
7 | - [(Coding) Language Integrations](./solr/solr-resources-code.md)
8 | - [Application/Framework Integrations](./solr/solr-resources-app-framework-integrations.md)
9 | - [Utilities](./solr/solr-resources-utilities.md)
10 | - Vector Search
11 | - [Projects Using Solr](./solr/solr-resources-used-by.md)
12 | - Discussion
13 | - Solr as a Service Options
14 | - Consulting Companies Working with Solr
15 | - Blogs
16 | - Demos
17 | - [Interesting But Old](./solr/solr-resources-interesting-old.md)
18 | - Bibliography/Resources
19 |
20 | ## Introduction
21 | Apache Solr is a search engine built on top of Apache Lucene (a Java library, also used in Elasticsearch). Solr receives queries via HTTP requests and provides responses in JSON by default (but can also output XML, CSV, etc.).
22 |
23 | ## My Notes
24 | These are largely pulled from Solr's documentation; consider them cliff notes / cheat sheets.
25 | - [Basic Solr Tutorial](./solr/basic-tutorial.md)
26 | - [Basic Solr Admin UI Tutorial](./solr/basic-admin-ui-tutorial.md)
27 | - [Basic SolrCloud Tutorial](./solr/basic-solrcloud-tutorial.md)
28 | - [Basic Indexing Your Own Data Tutorial](./solr/basic-indexing-own-data-tutorial.md)
29 | - [Solr Development](./solr/solr-development.md)
30 | - [Solr Terminology](./solr/solr-terminology.md)
31 | - [Solr Notes Unorganized](./solr/solr-notes.md)
32 |
33 | ## Related Projects
34 |
35 | ### General
36 | - [Solr Plugin Directory](https://solr.cool/) - A directory of plugins/extensions available for Solr including query parsers, analyzers, response writers, search components, document transformers, and utilities.
37 |
38 | ### Neural
39 | - [Neural Solr](https://github.com/maxdotio/neural-solr) - Stars: 14 - Updated: 6/2022 - Checked: 5/2023
40 | - "This project provides a complete and working semantic search application, using Mighty Inference Server, Apache Solr v9, and an example Node.js express application."
41 |
42 | ### Other
43 | - [solr-constant-similarity](https://github.com/freedev/solr-constant-similarity) - Stars: 2 - Updated: 4/2022 - Checked: 5/2023
44 |
45 | ### Plugins
46 | - [sematext's Solr Redis Extensions](https://github.com/sematext/solr-redis) - Stars: 51 - Updated: 5/2022 - Checked: 5/2023.
47 | - "a ParserPlugin that provides a Solr query parser based on data stored in Redis."
48 | - [solr-sandbox](https://github.com/apache/solr-sandbox) - Stars: 7 - Updated: 5/2023 - Checked: 5/2023
49 | - "The solr sandbox repository serves as a place to host contributions that are not a part of core solr."
50 |
51 | ### Security
52 | - [solr-proxy](https://github.com/Trott/solr-proxy) - Stars: 7 - Updated: 5/2023 - Checked: 5/2023
53 | - "Reverse proxy to make a Solr instance read-only, rejecting requests that have the potential to modify the Solr index."
54 |
55 | ### Semantic Search
56 | - [Solr-SBERT-semantic-search](https://github.com/tkhang1999/Solr-SBERT-semantic-search) - Stars: 5 - Updated: 4/2023 - Checked: 5/2023
57 | - "a simple web demo of semantic search (search by meaning)...using Solr and BERT embeddings."
58 |
59 | ### Vector Search
60 | - [BERT Solr Search](https://github.com/DmitryKey/bert-solr-search) - Stars: 134 - Updated: 6/2022 - Checked: 3/2023 - Allows one to search with BERT vectors in Solr, also compatible with Elasticsearch/OpenSearch.
61 | - Has associated articles explaining the process that was used to build the solution.
62 | - [Vector Search for E-commerce with Chorus](https://opensourceconnections.com/blog/2023/03/22/building-vector-search-in-chorus-a-technical-deep-dive/) - a blog from OpenSource Connections showing how to add vector features to a Solr-powered e-commerce platform
63 |
64 | ## Learning Resources
65 | - [Solr for newbies workshop](https://github.com/hectorcorrea/solr-for-newbies) - Stars: 68 - Updated: 3/2023 - Checked: 5/2023.
66 | - [solr-tmdb](https://github.com/o19s/solr-tmdb) - Stars: 18 - Updated: 5/2023 - Checked: 5/2023
67 | - "part of the Think Like a Relevancy Engineer training provided by OpenSource Connections."
68 | - [OSC's pdf-discovery-demo](https://github.com/o19s/pdf-discovery-demo) - Stars: 25 - Updated: 4/2023 - Checked: 5/2023
69 | - "leverages the Solr Payload Component...and the Offset Highlighter Component...as well as pdf.js to make PDF documents searchable and have highlighting of matches with the text in context of the PDF."
70 | - [Apache Lucene Solr Guide](https://github.com/mikeroyal/Apache-Lucene-Solr-Guide) - Stars: 7 - Updated: 10/2021 - Checked: 5/2023
71 | - [Videos featuring Solr from OSC](https://www.youtube.com/playlist?list=PLCoJWKqBHERuLJgmR0PhiXmS3TUYjWatW) - a playlist of videos of talks featuring Solr
72 |
73 | ## Discussion
74 | - [Official Solr Users Mailing List](https://lists.apache.org/list.html?users@solr.apache.org)
75 |
76 | ## Solr as a Service Options
77 | - [OpenSolr](https://opensolr.com/)
78 | - 30-day free trial
79 | - Pricing starts at €10/mo.
80 | - [SearchStax](https://www.searchstax.com/)
81 | - Limited free account
82 | - Pricing starts at $9/mo.
83 | - [WebSolr](https://www.websolr.com/)
84 | - Standard from $59/mo with enterprise options
85 |
86 | ## Companies Working with Solr
87 | - [sematext](https://sematext.com/)
88 | - [Training on Solr](https://sematext.com/training/solr/)
89 | - [Monitoring of Solr](https://sematext.com/docs/integration/solr/), [SolrCloud](https://sematext.com/docs/integration/solrcloud/), and [Solr Logs](https://sematext.com/docs/integration/solr-logs/).
90 | - [SeaseLtd](https://sease.io/)
91 | - [OpenSource Connections](https://www.opensourceconnections.com)
92 |
93 | ## Blogs
94 | - [Joel Bernstein's Solr Analytics Blog](https://joelsolr.blogspot.com/)
95 |
96 | ## Demos
97 | - [Apache Solr Manual Search Demo - Multi-Language Model](https://demo.rondhuit.com/en/solr-manual)
98 | - [Slide Deck on setting up the demo](https://www.rondhuit.com/download/RONDHUIT-solrmanual-1.0.0.pdf)
99 | - [YouTube Video on setting up the demo](https://www.youtube.com/watch?v=rh3fP9qQAhw)
100 | - [KandaSearch blog post on setting up the demo](https://kandasearch.com/blogs/9c7ec12f-c09b-4ddd-b5eb-aafc3bb8b1a6)
101 |
102 |
103 |
104 | ## Bibliography / Resources
105 | - [Apache Solr](https://solr.apache.org/)
106 | - Solr Reference Guide
107 | - Getting Started
108 | - [Introduction to Solr](https://solr.apache.org/guide/solr/latest/getting-started/introduction.html)
109 | - Solr Concepts
110 | - [Documents, Fields, and Schema Design](https://solr.apache.org/guide/solr/latest/getting-started/documents-fields-schema-design.html)
111 | - [Solr Indexing](https://solr.apache.org/guide/solr/latest/getting-started/solr-indexing.html)
112 | - [Searching in Solr](https://solr.apache.org/guide/solr/latest/getting-started/searching-in-solr.html)
113 | - [Relevance](https://solr.apache.org/guide/solr/latest/getting-started/relevance.html)
114 | - [Solr Glossary](https://solr.apache.org/guide/solr/latest/getting-started/solr-glossary.html)
115 | - [Solr Tutorials](https://solr.apache.org/guide/solr/latest/getting-started/solr-tutorial.html)
116 |
117 |
--------------------------------------------------------------------------------
/common-crawl/basic-manually-accessing-common-crawl.md:
--------------------------------------------------------------------------------
1 | # How To Manually Access CommonCrawl
2 |
3 |
4 | ## Introduction
5 |
6 | I sometimes find it helpful to understand the intricacies of a process by performing it manually before attempting to automate it. This document outlines a manual process for retrieving data from CommonCrawl.
7 |
8 | **NOTE**: There is a web search interface one can use, and one can download the files over HTTP, but both of these are quite slow. I'll only be covering how to retrieve the data using Amazon S3 (where the data is stored).
9 |
10 | ## Prerequisites
11 |
12 | You'll need to have an AWS account, the AWS CLI installed, and programmatic access set up on your local system.
13 |
14 | ## Choose a Crawl
15 |
16 | Crawls are organized by date. You can go to http://index.commoncrawl.org/ to view the list of available crawls.
17 |
18 | For this example I'm using CC-MAIN-2023-14.
19 |
20 | ## Download the Index List
21 |
22 | You can see the list of index files on the previously mentioned page. We'll grab the one for our selected crawl using the AWS CLI:
23 | `aws s3 cp s3://commoncrawl/crawl-data/CC-MAIN-2023-14/cc-index.paths.gz .`
24 |
25 | NOTE: The `.` at the end of the above command is important. The AWS CLI copies from the specified path on S3 to the specified path on the local system, and `.` places the file in the directory you run the command from. You can also give an explicit path instead of `.`.
26 |
27 | ## Opening the File
28 |
29 | Extract the contents of the gzip file and the result should be a file called `cc-index.paths`. This can be opened with a regular text editor.
30 |
31 | Scroll to the bottom and you'll find the path to get the `cluster.idx` file.
32 |
33 | ## Downloading the Cluster Index
34 |
35 | In our case the path shown is `cc-index/collections/CC-MAIN-2023-14/indexes/cluster.idx`. Using the AWS CLI our command should look like:
36 | `aws s3 cp s3://commoncrawl/cc-index/collections/CC-MAIN-2023-14/indexes/cluster.idx .`
37 |
38 | This file is quite a bit larger (~150 MB) and may take a few minutes to download.
39 |
40 | ## Opening the Cluster Index
41 |
42 | There are a number of ways we could manipulate the file without opening it in a normal editor, but for the moment let's do just that. A cross-platform program that can handle large files with a GUI is [klogg](https://klogg.filimonov.dev/).
43 |
44 | You should see data that looks something like this:
45 |
46 | ```
47 | 0,1,184,137)/igplay 20230325212225 cdx-00000.gz 0 158603 1
48 | 10,124,97,161)/paito-warna-trinidad-tobago-afternoon 20230330213447 cdx-00000.gz 158603 187900 2
49 | ```
50 |
51 | The above is two lines of data. Each contains the reversed domain of a site indexed by CommonCrawl (`0,1,184,137`, `10,124,97,161`) followed by the specific path that was indexed (e.g. `/igplay`, `/paito-warna-trinidad`...).
52 |
53 | This can be a little confusing at first glance. Where are the domain names? Some servers don't have a domain name and are instead accessed by IP address. Because numeric characters sort before alphabetic characters, the file starts with all the IPs that were crawled.
54 |
55 | Scroll down a bit in the file and you should start to see records that look like this:
56 |
57 | ```
58 | com,homesandgardens)/gardens/how-to-split-irises 20230330183612 cdx-00077.gz 736773153 222043 311253
59 | ```
60 |
61 | Note that whether the site is a domain or an IP it is reversed and separated by levels. If we reassemble the IPs/domains from the above examples we get:
62 | ```
63 | 137.184.1.0
64 | 161.97.124.10
65 | homesandgardens.com
66 | ```
67 |
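The reassembly above can also be sketched in a few lines of Python. This is a minimal sketch, assuming every key has the form `reversed,host,parts)/path` as in the examples; the function name `host_from_key` is hypothetical and not part of any Common Crawl tooling, and keys carrying extra details (such as a port) are not handled.

```python
def host_from_key(key: str) -> str:
    """Turn a reversed index key such as 'com,homesandgardens)/x' into a host."""
    # Everything before the first ')' is the reversed, comma-separated host.
    reversed_host = key.split(")", 1)[0]
    # Reverse the parts and join them with dots to recover the host.
    return ".".join(reversed(reversed_host.split(",")))


if __name__ == "__main__":
    examples = (
        "0,1,184,137)/igplay",
        "10,124,97,161)/paito-warna-trinidad-tobago-afternoon",
        "com,homesandgardens)/gardens/how-to-split-irises",
    )
    for key in examples:
        print(host_from_key(key))
```

Run against the three sample keys, this prints the same three hosts listed above.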
68 | ## Finding and Downloading the CDX We Need
69 |
70 | If we are looking to access specific site data we need to figure out what CDX file serves as the index. In the case of the `homesandgardens.com` URL we can see that the CDX is `cdx-00077.gz`.
71 |
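If you would rather not read the CDX file name off the line by eye, the lookup can be sketched in Python. This is a rough sketch, assuming each `cluster.idx` line is whitespace-separated with the key, timestamp, and CDX file name as the first three fields, as in the samples shown earlier. The names `ClusterEntry` and `parse_cluster_line` are hypothetical, and the trailing numeric columns are kept but deliberately left uninterpreted.

```python
from typing import NamedTuple


class ClusterEntry(NamedTuple):
    key: str        # reversed host plus path, e.g. 'com,homesandgardens)/...'
    timestamp: str  # capture time as it appears in the file
    cdx_file: str   # CDX file the entry points at, e.g. 'cdx-00077.gz'
    numbers: tuple  # remaining numeric columns, not interpreted here


def parse_cluster_line(line: str) -> ClusterEntry:
    # Split on any whitespace so both tabs and spaces are accepted.
    fields = line.split()
    key, timestamp, cdx_file = fields[0], fields[1], fields[2]
    return ClusterEntry(key, timestamp, cdx_file, tuple(int(n) for n in fields[3:]))


if __name__ == "__main__":
    sample = (
        "com,homesandgardens)/gardens/how-to-split-irises "
        "20230330183612 cdx-00077.gz 736773153 222043 311253"
    )
    entry = parse_cluster_line(sample)
    print(entry.cdx_file)
```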
72 | We'll use the AWS CLI to download this file:
73 | ```
74 | aws s3 cp s3://commoncrawl/cc-index/collections/CC-MAIN-2023-14/indexes/cdx-00077.gz .
75 | ```
76 |
77 | This file, again, is much larger than the previous files (closing in on 1 GB) so it may take some time to download.
78 |
79 | ## Finding and Downloading the WARC We Need
80 |
81 | Once downloaded, we need to extract the contents from the gzip, which should be a plain text file called `cdx-00077`. Once extracted, this file will likely be several GB in size.
82 |
83 | When we open the file (using `klogg` or something similar) we should see records that look like this:
84 |
85 | ```
86 | com,homes-n-gardens)/adult-coloring-pages-garden 20230402111251 {"url": "https://homes-n-gardens.com/adult-coloring-pages-garden/", "mime": "text/html", "mime-detected": "text/html", "status": "200", "digest": "XJHZYZYMKOL62PIT56QEUFJT5MYNYVOT", "length": "3991", "offset": "343066379", "filename": "crawl-data/CC-MAIN-2023-14/segments/1679296950528.96/warc/CC-MAIN-20230402105054-20230402135054-00060.warc.gz", "charset": "UTF-8", "languages": "eng"}
87 | ```
88 |
89 | NOTE: See Appendix A for a prettified version of the JSON above.
90 |
91 | Note that we again have the reversed domain name at the beginning followed by the path to the specific file/document accessed and a little later we have `"filename":`. The filename specified here is the file we need to access to retrieve the data for the specific URL we are looking at. In this case it is:
92 |
93 | ```
94 | "crawl-data/CC-MAIN-2023-14/segments/1679296950528.96/warc/CC-MAIN-20230402105054-20230402135054-00060.warc.gz"
95 | ```
96 |
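Pulling the filename out of a CDX record can also be done programmatically. The following is a minimal sketch, assuming each line has the shape shown above: a key, a timestamp, and then a JSON object, separated by single spaces. The names `parse_cdx_line` and `SAMPLE_LINE` are hypothetical, and the sample is a shortened copy of the record above.

```python
import json

# Shortened copy of the record shown above (some JSON fields omitted).
SAMPLE_LINE = (
    'com,homes-n-gardens)/adult-coloring-pages-garden 20230402111251 '
    '{"url": "https://homes-n-gardens.com/adult-coloring-pages-garden/", '
    '"status": "200", "length": "3991", "offset": "343066379", '
    '"filename": "crawl-data/CC-MAIN-2023-14/segments/1679296950528.96/warc/'
    'CC-MAIN-20230402105054-20230402135054-00060.warc.gz"}'
)


def parse_cdx_line(line: str) -> tuple:
    # The key and timestamp are the first two space-separated fields;
    # everything after them is a JSON object.
    key, timestamp, payload = line.split(" ", 2)
    return key, timestamp, json.loads(payload)


if __name__ == "__main__":
    key, timestamp, record = parse_cdx_line(SAMPLE_LINE)
    # Prefix the filename with the bucket to get the path used with `aws s3 cp`.
    print("s3://commoncrawl/" + record["filename"])
```

Note that the numeric values in the record (`length`, `offset`, `status`) are JSON strings, so convert them with `int()` before doing arithmetic on them.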
97 | We can get the WARC file using the AWS CLI:
98 | ```
99 | aws s3 cp s3://commoncrawl/crawl-data/CC-MAIN-2023-14/segments/1679296950528.96/warc/CC-MAIN-20230402105054-20230402135054-00060.warc.gz .
100 | ```
101 |
102 | This file is likely to clock in at over 1 GB in its compressed form, so expect the download to take some time.
103 |
104 | ## Viewing the WARC
105 |
106 | Once again we'll extract the file from its gzip. We should end up with a file titled `CC-MAIN-20230402105054-20230402135054-00060.warc`. We can open this file using klogg as well. Much of the file will be human readable, but there will also be binary files that have been included (e.g. images), and these will appear as a long series of garbled characters.
107 |
108 | # Appendix A: Prettified JSON CDX Record
109 | ```json
110 | {
111 | "url": "https://homes-n-gardens.com/adult-coloring-pages-garden/",
112 | "mime": "text/html",
113 | "mime-detected": "text/html",
114 | "status": "200",
115 | "digest": "XJHZYZYMKOL62PIT56QEUFJT5MYNYVOT",
116 | "length": "3991",
117 | "offset": "343066379",
118 | "filename": "crawl-data/CC-MAIN-2023-14/segments/1679296950528.96/warc/CC-MAIN-20230402105054-20230402135054-00060.warc.gz",
119 | "charset": "UTF-8",
120 | "languages": "eng"
121 | }
122 | ```
123 |
124 |
125 | # Bibliography
126 |
127 | - [StackOverflow: How to view huge txt files in Linux?](https://stackoverflow.com/questions/21246752/how-to-view-huge-txt-files-in-linux)
128 | - Samuel Schaffhauser. [Using the Common Crawl as a Data Source](https://medium.com/@samuel.schaffhauser/using-the-common-crawl-as-a-data-source-693a41b3baa9). 6/2022.
129 | - Chillar Anand. [Common Crawl On Laptop - Extracting Subset of Data](https://avilpage.com/2022/11/common-crawl-laptop-extract-subset.html). 11/2022.
130 | - Derek Morgan. [Exploring the Common Crawl with Python](https://dmorgan.info/posts/common-crawl-python/). 2016.
--------------------------------------------------------------------------------
/research/ranking-research.md:
--------------------------------------------------------------------------------
1 | # Ranking and Recommendations
2 |
3 | ## 2023
4 | - Yi Ren, Xiao Han, Xu Zhao, Shenzheng Zhang, Yan Zhang. [Slate-Aware Ranking for Recommendation](https://dl.acm.org/doi/10.1145/3539597.3570380). 2/2023.
5 | - Yunjia Xi, Jianghao Lin, Weiwen Liu, Xinyi Dai, Weinan Zhang, Rui Zhang, Ruiming Tang, Yong Yu. [A Bird's-eye View of Reranking: From List Level to Page Level](https://dl.acm.org/doi/10.1145/3539597.3570399). 2/2023.
6 | - Shuting Wang, Zhicheng Dou, Yutao Zhu. [Heterogeneous Graph-based Context-aware Document Ranking](https://dl.acm.org/doi/10.1145/3539597.3570390). 2/2023.*
7 | - Dan Luo, Lixin Zou, Qingyao Ai, Zhiyu Chen, Dawei Yin, Brian D. Davison. [Model-based Unbiased Learning to Rank](https://dl.acm.org/doi/10.1145/3539597.3570395). 2/2023.
8 |
9 | ## 2022
10 | - Syed Ahmed Yasin, P. V. R. D. Prasada Rao. [Enhanced CRNN-Based Optimal Web Page Classification and Improved Tunicate Swarm Algorithm-Based Re-Ranking](https://www.researchgate.net/publication/365619498_Enhanced_CRNN-Based_Optimal_Web_Page_Classification_and_Improved_Tunicate_Swarm_Algorithm-Based_Re-Ranking). 11/2022.
11 | - Yuexin Wu, Xiaolei Huang. [A Gumbel-based Rating Prediction Framework for Imbalanced Recommendation](https://dl.acm.org/doi/10.1145/3511808.3557341). 10/2022.
12 | - Sejoon Oh, Berk Ustun, Julian McAuley, Srijan Kumar. [Rank List Sensitivity of Recommender Systems to Interaction Perturbations](https://dl.acm.org/doi/10.1145/3511808.3557425). 10/2022.*
13 | - Haolun Wu, Chen Ma, Yingxue Zhang, Xue Liu, Ruiming Tang, Mark Coates. [Adapting Triplet Importance of Implicit Feedback for Personalized Recommendation](https://dl.acm.org/doi/10.1145/3511808.3557229). 10/2022.
14 | - Yi Ren, Hongyan Tang, Siwen Zhu. [Unbiased Learning to Rank with Biased Continuous Feedback](https://dl.acm.org/doi/10.1145/3511808.3557483). 10/2022.
15 | - Weiwen Liu, Jiarui Qin, Ruiming Tang, Bo Chen. [Neural Re-ranking for Multi-stage Recommender Systems](https://dl.acm.org/doi/10.1145/3523227.3547369). 9/2022.*
16 | - Roberto Pellegrini, Wenjie Zhao, Iain Murray. [Don't recommend the obvious: estimate probability ratios](https://dl.acm.org/doi/10.1145/3523227.3546753). 9/2022.*
17 | - Giacomo Balloccu, Ludovico Boratto, Gianni Fenu, Mirko Marras. [Hands on Explaining Recommender Systems with Knowledge Graphs](https://dl.acm.org/doi/10.1145/3523227.3547374). 9/2022.*
18 | - Xiang Li, Xiaojiang Zhou, Yao Xiao, Peihao Huang, Dayao Chen, Sheng Chen, Yunsen Xian. [AutoFAS: Automatic Feature and Architecture Selection for Pre-Ranking System](https://dl.acm.org/doi/10.1145/3534678.3539083). 8/2022.
19 | - Ruobing Xie, Qi Liu, Liangdong Wang, Shukai Liu, Bo Zhang, Leyu Lin. [Contrastive Cross-domain Recommendation in Matching](https://dl.acm.org/doi/10.1145/3534678.3539125). 8/2022.
20 | - Yankai Chen, Huifeng Guo, Yingxue Zhang, Chen Ma, Ruiming Tang, Jingjie Li, Irwin King. [Learning Binarized Graph Representations with Multi-faceted Quantization Reinforcement for Top-K Recommendation](https://dl.acm.org/doi/10.1145/3534678.3539452). 8/2022.
21 | - Zihan Lin, Hui Wang, Jingshu Mao, Wayne Xin Zhao, Cheng Wang, Peng Jiang, Ji-Rong Wen. [Feature-aware Diversified Re-ranking with Disentangled Representations for Relevant Recommendation](https://dl.acm.org/doi/10.1145/3534678.3539130). 8/2022.
22 | - Linsey Pang, Wei Liu, Keng-Hao Chang, Xue Li, Moumita Bhattacharya, Xianjing Liu, Stephen Guo. [Deep Search Relevance Ranking in Practice](https://dl.acm.org/doi/10.1145/3534678.3542632). 8/2022.*
23 | - Alessandro Magnani, Feng Liu, Suthee Chaidaroon, Sachin Yadav, Praveen Reddy Suram, Ajit Puthenputhussery, Sijie Chen, Min Xie, Anirudh Kashi, Tony Lee, Ciya Liao. [Search Retrieval at Walmart](https://dl.acm.org/doi/10.1145/3534678.3539164). 8/2022.
24 | - Yi Li, Jieming Zhu, Weiwen Liu, Liangcai Su, Guohao Cai, Qi Zhang, Ruiming Tang, Xi Xiao, Xiuqiang He. [PEAR: Personalized Re-Ranking with Contextualized Transformer for Recommendation](https://dl.acm.org/doi/10.1145/3487553.3524208). 8/2022.
25 | - Egor Markovskiy, Fiana Raiber, Shoham Sabach, Oren Kurland. [From Cluster Ranking to Document Ranking](https://dl.acm.org/doi/10.1145/3477495.3531819). 7/2022.
26 | - Wenchao Xiu, Yiran Wang, Taofeng Xue, Kai Zhang, Qin Zhang, Zhonghuo Wu, Yifan Yang, Gong Zhang. [DDEN: A Heterogeneous Learning-to-Rank Approach with Deep Debiasing Experts Network](https://dl.acm.org/doi/10.1145/3477495.3536320). 7/2022.
27 | - Xinyan Fan, Jianxun Lian, Wayne Xin Zhao, Zheng Liu, Chaozhuo Li, Xing Xie. [Ada-Ranker: A Data Distribution Adaptive Ranking Paradigm for Sequential Recommendation](https://dl.acm.org/doi/10.1145/3477495.3531931). 7/2022.
28 | - Amifa Raj, Michael D. Ekstrand. [Measuring Fairness in Ranked Results: An Analytical and Empirical Comparison](https://dl.acm.org/doi/10.1145/3477495.3532018). 7/2022.
29 | - Enrique Amigó, Stefano Mizzaro, Damiano Spina. [Ranking Interruptus: When Truncated Rankings Are Better and How To Measure That](https://dl.acm.org/doi/10.1145/3477495.3532051). 7/2022.
30 | - Ziyi Ye, Xiaohui Xie, Yiqun Liu, Zhihong Wang, Xuancheng Li, Jiaji Li, Xuesong Chen, Min Zhang, Shaoping Ma. [Why Don't You Click: Understanding Non-Click Results in Web Search with Brain Signals](https://dl.acm.org/doi/10.1145/3477495.3532082). 7/2022.
31 | - George Zerveas, Navid Rekabsaz, Daniel Cohen, Carsten Eickhoff. [Mitigating Bias in Search Results Through Contextual Document Reranking and Neutrality Regularization](https://dl.acm.org/doi/10.1145/3477495.3531891). 7/2022.*
32 | - Virginie Do, Nicolas Usunier. [Optimizing Generalized Gini Indices for Fairness in Rankings](https://dl.acm.org/doi/10.1145/3477495.3532035). 7/2022.
33 | - Shubham Chatterjee, Laura Dietz. [BERT-ER: Query-specific BERT Entity Representations for Entity Ranking](https://dl.acm.org/doi/10.1145/3477495.3531944). 7/2022.*
34 | - Ayushi Prakash, Sandeep Kumar Gupta, Mukesh Rawat. [Keyword Based Ranking of Web Pages by Normalizing Link Score](https://www.researchgate.net/publication/361785686_Keyword_Based_Ranking_of_Web_Pages_by_Normalizing_Link_Score). 6/2022.
35 | - Prem Sharma, Divakar Yadav, R N Thakur. [Web Page Ranking Using Web Mining Techniques: A Comprehensive Survey](https://www.researchgate.net/publication/360999317_Web_Page_Ranking_Using_Web_Mining_Techniques_A_Comprehensive_Survey). 5/2022.
36 | - Seonghwan Choi, Hyeondey Kim, Manjun Gim. [Do Not Read the Same News! Enhancing Diversity and Personalization of News Recommendation](https://dl.acm.org/doi/10.1145/3487553.3524936). 4/2022.
37 |
38 | ## 2021
39 | - Yiding Liu, Weixue Lu, Suqi Cheng, Daiting Shi, Shuaiqiang Wang, Zhicong Cheng, Dawei Yin. [Pre-trained Language Model for Web-scale Retrieval in Baidu Search](https://www.researchgate.net/publication/352209105_Pre-trained_Language_Model_for_Web-scale_Retrieval_in_Baidu_Search). 6/2021.
40 | - Anton Oleinik. [Relevance in Web search: between content, authority and popularity](https://www.researchgate.net/publication/349706191_Relevance_in_Web_search_between_content_authority_and_popularity). 3/2021.
41 |
42 | ## 2020
43 | - N. Mehala, Divyansh Bhatia. [A Concept-Based Approach for Generating Better Topics for Web Search Results](https://www.researchgate.net/publication/344268302_A_Concept-Based_Approach_for_Generating_Better_Topics_for_Web_Search_Results). 9/2020.
44 |
45 |
46 | # Older
47 | - Anlei Dong, Yi Chang, Zhaohui Zheng, Gilad Mishne, Jing Bai, Ruqiang Zhang, Karolina Buchner, Ciya Liao, Fernando Diaz. [Towards recency ranking in web search](https://dl.acm.org/doi/10.1145/1718487.1718490). 2/2010.
48 | - Ron Bekkerman, Shlomo Zilberstein, James Allan. [Web Page Clustering Using Heuristic Search in the Web Graph](https://www.researchgate.net/publication/220812214_Web_Page_Clustering_Using_Heuristic_Search_in_the_Web_Graph). 1/2007.
--------------------------------------------------------------------------------
/CommonCrawl.md:
--------------------------------------------------------------------------------
1 | # Common Crawl
2 | Common Crawl is a non-profit organization that maintains a large index of the web that is updated on a bi-monthly basis and freely available.
3 |
4 | ## General
5 | - Official Site: https://commoncrawl.org/
6 | - Common Crawl Index Server: https://index.commoncrawl.org/
7 | - GitHub Repositories: https://github.com/commoncrawl - A few of the repositories are listed below, but there are many more.
8 | - [Common Crawl WARC Examples](https://github.com/commoncrawl/cc-warc-examples) - "This repository contains both wrappers for processing WARC files in Hadoop MapReduce jobs and also Hadoop examples to get you started."
9 | - [Jupyter Notebooks to Analyze Common Crawl Data](https://github.com/commoncrawl/cc-notebooks) - This includes several different notebooks, some may be especially interested in [running a notebook on AWS EMR](https://github.com/commoncrawl/cc-notebooks/blob/main/cc-emr-notebook/cluster_setup.md).
10 | - [Common Crawl PySpark Examples](https://github.com/commoncrawl/cc-pyspark) - "This project provides examples [of] how to process the Common Crawl dataset with Apache Spark and Python".
11 | - [Common Crawl Index Server](https://github.com/commoncrawl/cc-index-server) - "This project is a deployment of the pywb web archive replay and index server to provide an index query mechanism for datasets provided by Common Crawl".
12 |
13 | ## Tooling
14 | - [cdx_toolkit](https://github.com/cocrawler/cdx_toolkit) - Stars: 127 - Updated: 3/2022 - Checked: 5/2023 - "a set of tools for working with CDX indices of web crawls and archives, including those at CommonCrawl and the Internet Archive's Wayback Machine."
15 | - rokasramas' [fork of comcrawl](https://github.com/rokasramas/comcrawl) - Stars: 0 - Updated: 4/2020 - Checked: 5/2023 - Includes a fix that hasn't been applied to the [original comcrawl library](https://github.com/michaelharms/comcrawl/) that allows it to work.
16 | - [getallurls](https://github.com/lc/gau) - Stars: 2.8k - Updated: 2/2023 - Checked: 5/2023 - Can fetch urls from Common Crawl as well as Open Threat Exchange, the Wayback Machine, and URLScan.
17 | - [CommonCrawlDocumentDownload](https://github.com/centic9/CommonCrawlDocumentDownload) - Stars: 50 - Updated: 4/2023 - Checked: 5/2023 - Downloads documents by file/mime type from CC.
18 | - [WARCannon](https://github.com/c6fc/warcannon) - Stars: 212 - Updated: 9/2022 - Checked: 5/2023 - Uses AWS to search Common Crawl data at scale with regex patterns.
19 |
20 | ## Other
21 | - [NewsFetch](https://newsfetch.tech/) - Stars: 13 - Updated: 10/2022 - Checked: 5/2023 - Can fetch news articles from the Common Crawl API.
22 | - [news-please](https://github.com/fhamborg/news-please) - Stars: 1.6k - Updated: 4/2023 - Checked: 5/2023 - Along with significant other functionality it can fetch articles from Common Crawl.
23 | - [PWA Store](https://github.com/Tarasa24/PWA-Store) - Stars: 5 - Updated: 9/2022 - Checked: 5/2023 - Uses Common Crawl and EMR to find as many PWA apps on the web as possible.
24 |
25 | ## What Is?
26 | - C4 Dataset - Text data extracted from Common Crawl.
27 | - https://github.com/shjwudp/c4-dataset-script
28 | - [CDX](https://github.com/webrecorder/pywb/wiki/CDX-Index-Format) - Capture/Crawl inDeX - Standard index format for WARCs.
29 |
30 | ## Tutorials
31 |
32 | ### General
33 | - Edward Ross. [CommonCrawl Category](https://skeptric.com/#category=commoncrawl). skeptric.
34 | - Ross has published a number of well-written articles on Common Crawl. A great place to start if you are looking to go through the basics and beyond.
35 | - [Searching 100 Billion Webpages With Capture Index](https://skeptric.com/searching-100b-pages-cdx/). 6/2020.
36 | - Explains how to use the web interface (slow) as well as the CDX Toolkit, comcrawl, and directly in Python without using a custom CommonCrawl library. Unfortunately both comcrawl and the CDX Toolkit require some tweaks to get running.
37 | - [Read Common Crawl Parquet Metadata with Python](https://skeptric.com/reading-parquet-metadata/). 4/2022.
38 | - Covers reading Parquet metadata using PyArrow, fastparquet, manually (in Python), and using asyncio to speed things up.
39 | - [CommonCrawl.org So you're ready to get started](https://commoncrawl.org/the-data/get-started/).
40 | - Covers a lot of ground, perhaps not the best for true beginners. Covers data locations, file formats (WARC, WAT, WET), indexes, as well as processing the files.
41 | - [CommonCrawl.org Examples using Common Crawl Data](https://commoncrawl.org/the-data/examples/).
42 | - Unfortunately the vast majority of the examples available here are quite old.
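For readers who want to try the "directly in Python" approach mentioned above without any Common Crawl library, here is a minimal sketch of the two steps involved: building a query URL for the index server and turning one returned CDX record into a ranged request for the underlying WARC record. Assumptions: the crawl name `CC-MAIN-2023-14`, the `https://data.commoncrawl.org/` download prefix, and the sample record are illustrative, and the function names are hypothetical. No network call is made here.

```python
import json
from urllib.parse import urlencode

INDEX_SERVER = "https://index.commoncrawl.org"
DATA_SERVER = "https://data.commoncrawl.org"  # assumed download prefix


def build_index_query(collection: str, url_pattern: str) -> str:
    """Build a CDX index query URL that asks for one JSON object per line."""
    params = urlencode({"url": url_pattern, "output": "json"})
    return f"{INDEX_SERVER}/{collection}-index?{params}"


def warc_record_request(cdx_line: str) -> tuple[str, dict[str, str]]:
    """Turn one CDX JSON line into the URL and Range header for its WARC record."""
    record = json.loads(cdx_line)
    offset = int(record["offset"])
    length = int(record["length"])
    # HTTP byte ranges are inclusive, so the last byte is offset + length - 1.
    headers = {"Range": f"bytes={offset}-{offset + length - 1}"}
    return f"{DATA_SERVER}/{record['filename']}", headers


# Hypothetical CDX line, shaped like the index server's JSON output.
sample_line = json.dumps({
    "urlkey": "org,example)/",
    "timestamp": "20230320000000",
    "url": "https://example.org/",
    "filename": "crawl-data/CC-MAIN-2023-14/segments/0/warc/example.warc.gz",
    "offset": "1000",
    "length": "500",
})

query_url = build_index_query("CC-MAIN-2023-14", "example.org/*")
record_url, record_headers = warc_record_request(sample_line)
print(query_url)
print(record_url, record_headers)
```

Fetching `record_url` with `record_headers` should return a single gzip-compressed WARC record rather than the whole file, which is what makes index lookups practical on a laptop.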
43 |
44 | ### AWS Athena
45 | - Sebastian Nagel. [Index to WARC Files and URLs in Columnar Format](https://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/). commoncrawl, 3/2018.
46 | - Stanislas Girard. [Parse Petabytes of data from CommonCrawl in seconds](https://www.primates.dev/parse-petabytes-of-data-from-commoncrawl-in-seconds/). primates.dev, 1/2020.
47 | - Simple and straightforward, short, fairly basic, but good place to start.
48 | - Athul Jayson. [Extracting Data from Common Crawl Dataset](https://blog.qburst.com/2020/07/extracting-data-from-common-crawl-dataset/). qburst, 7/2020.
49 | - Also has an associated GitHub repository.
50 | - Ryan Elkins. [Search the html across 25 billion websites for passive reconnaissance using common crawl](https://medium.com/@brevityinmotion/search-the-html-across-25-billion-websites-for-passive-reconnaissance-using-common-crawl-7fe109250b83). 7/2020.
51 | - While written from a security perspective it provides solid guidance to using AWS Athena with Common Crawl. It also utilizes Amazon SageMaker, S3, and AWS IAM. There is an associated repo.
52 |
53 | ### AWS EMR
54 | - Basil Latif. [Measuring Internet Links: Accessing the Common Crawl Dataset Using EMR and Pyspark in AWS](https://basil-latif.medium.com/measuring-internet-links-accessing-the-common-crawl-dataset-using-emr-and-pyspark-in-aws-fcf5eb26afd9). 6/2020.
55 | - [Common Crawl EMR Tutorial](https://github.com/haydenhw/commoncrawl-emr-tutorial) - Stars: 9 - Updated: 3/2021 - Checked: 5/2023 - "This guide walks you through submitting a Scala Spark application to EMR that queries 500k job urls from Common Crawl and saves the results to an S3 bucket in CSV format."
56 |
57 | ### AWS Lambda
58 | - Chris Madden, Aaron Bawcom. [Analyzing Performance and Cost of Large-Scale Data Processing with AWS Lambda](https://aws.amazon.com/blogs/apn/analyzing-performance-and-cost-of-large-scale-data-processing-with-aws-lambda/). 6/2019.
59 | - Covers the high-level process with associated GitHub repository.
60 | - Jader Dias. [One-click to download all the web pages you may want](https://medium.com/@jaderd/one-click-to-download-exactly-the-web-pages-you-may-want-no-matter-how-many-they-are-d4834265a0a3). 6/2022.
61 | - Builds on using Athena to get data from Common Crawl and AWS Lambda to download it.
62 |
63 | ### Snowflake
64 | - Venkat Sekar. [Querying TB sized External Tables with Snowflake](https://medium.com/snowflake/querying-tb-sized-external-tables-with-snowflake-5ab14e807d3). 2/1/2022.
65 |
66 |
67 | ### Basic
68 | - David Mackey. [Basic Information About CommonCrawl](common-crawl/basic-info-common-crawl.md). 5/2023.
69 | - David Mackey. [How To Manually Access CommonCrawl](common-crawl/basic-manually-accessing-common-crawl.md). 5/2023.
70 | - Chillar Anand. [Common Crawl On Laptop - Extracting Subset of Data](https://avilpage.com/2022/11/common-crawl-laptop-extract-subset.html). 11/2022.
71 |
72 | ### Other
73 | - Colin Dellow. [S3 Throughput: Scans vs Indexes](https://code402.com/blog/s3-scans-vs-index/). 2/2020.
74 | - Is it faster to scan entire WARC files, or to pull just the required data from each WARC file using the index?
--------------------------------------------------------------------------------
/research/uncategorized-research.md:
--------------------------------------------------------------------------------
1 | # Uncategorized Research on Search Engines and Information Retrieval
2 |
3 | ## 2023
4 | - Christopher Akiki, Odunayo Ogundepo, Aleksandra Piktus, Xinyu Zhang, Akintunde Oladipo, Jimmy Lin, Martin Potthast. [Spacerini: Plug-and-play Search Engines with Pyserini and Hugging Face](https://www.researchgate.net/publication/368877450_Spacerini_Plug-and-play_Search_Engines_with_Pyserini_and_Hugging_Face). 2/2023.
5 | - Xinyuan Chang. [The Analysis of Open Source Search Engines](https://www.researchgate.net/publication/368910014_The_Analysis_of_Open_Source_Search_Engines). 2/2023.*
6 |
7 | ## 2022
8 | - Ali Abbasov, Vagif Gasimov. [Domain-oriented information search on the Internet](https://www.researchgate.net/publication/367004891_Domain-oriented_information_search_on_the_Internet). 11/2022.
9 | - Fenil Kaneria, Shafaq Khan, Nishara Nizamuddin. [Swift Search An open-source search engine](https://www.researchgate.net/publication/362646181_Swift_Search_An_open-source_search_engine). 11/2022.
10 | - Lingjun Xu, Shiyin Zhang, Guojie Song, Junshan Wang, Tianshu Wu, Guojun Liu. [Taxonomy-Enhanced Graph Neural Networks](https://dl.acm.org/doi/10.1145/3511808.3557467). 10/2022.
11 | - Masaki Suzuki, Yusuke Yamamoto. [Don't Judge by Looks: Search User Interface to Make Searchers Reflect on Their Relevance Criteria and Promote Content-Quality-Oriented Web Searches](https://dl.acm.org/doi/10.1145/3524458.3547222). 9/2022.
12 | - Martha Viviana Zuluaga, Sebastián Robledo, Oscar Arbelaez-Echeverri, Néstor Duque, Germán A. Osorio-Zuluaga. [Tree of Science - ToS: A Web-Based Tool for Scientific Literature Recommendation. Search Less, Research More!](https://www.researchgate.net/publication/362728432_Tree_of_Science_-_ToS_A_Web-Based_Tool_for_Scientific_Literature_Recommendation_Search_Less_Research_More). 8/2022.
13 | - Gaurav Gupta, Tharun Medini, Anshumali Shrivastava, Alexander J. Smola. [BLISS: A Billion scale Index using Iterative Re-partitioning](https://dl.acm.org/doi/10.1145/3534678.3539414). 8/2022.*
14 | - Birgit Pfitzmann, Christoph Auer, Michele Dolfi, Ahmed S. Nassar, Peter Staar. [DocLayNet: A Large Human-Annotated Dataset for Document-Layout Segmentation](https://dl.acm.org/doi/10.1145/3534678.3539043). 8/2022.
15 | - Sean Zhang, Varun Ursekar, Leman Akoglu. [Sparx: Distributed Outlier Detection at Scale](https://dl.acm.org/doi/10.1145/3534678.3539076). 8/2022.
16 | - Aleksandra Urman, Mykola Makhortykh, Roberto Ulloa, Juhi Kulshrestha. [Where the earth is flat and 9/11 is an inside job: A comparative algorithm audit of conspiratorial information in web search results](https://www.researchgate.net/publication/361863464_Where_the_earth_is_flat_and_911_is_an_inside_job_A_comparative_algorithm_audit_of_conspiratorial_information_in_web_search_results). 8/2022.*
17 | - Chenxu Zhu, Peng Du, Xianghui Zhu, Weinan Zhang, Yong Yu, Yang Cao. [User-tag Profile Modeling in Recommendation System via Contrast Weighted Tag Masking](https://dl.acm.org/doi/10.1145/3534678.3539102). 8/2022.
18 | - Marwah Alaofi, Luke Gallagher, Dana McKay, Lauren L. Saling, Mark Sanderson, Falk Scholer, Damiano Spina, Ryen W. White. [Where Do Queries Come From?](https://dl.acm.org/doi/10.1145/3477495.3531711). 7/2022.
19 | - Yu Guo, Zhengyi Ma, Jiaxin Mao, Hongjin Qian, Xinyu Zhang, Hao Jiang, Zhao Cao, Zhicheng Dou. [Webformer: Pre-training with Web Pages for Information Retrieval](https://dl.acm.org/doi/10.1145/3477495.3532086). 7/2022.
20 | - Arnold Overwijk, Chenyan Xiong, Jamie Callan. [ClueWeb22: 10 Billion Documents with Rich Information](https://dl.acm.org/doi/10.1145/3477495.3536321). 7/2022.
21 | - Connor Lennox, Sumanta Kashyapi, Pooja Oza, Ben Gamari. [Wikimarks: Harvesting Relevance Benchmarks from Wikipedia](https://dl.acm.org/doi/10.1145/3477495.3531731). 7/2022.
22 | - Zelong Li, Jianchao Ji, Yingqiang Ge, Yongfeng Zhang. [AutoLossGen: Automatic Loss Function Generation for Recommender Systems](https://dl.acm.org/doi/10.1145/3477495.3531731). 7/2022.*
23 | - Thomas Grubb, Bill Anderson, Omar Alonso. [On Reliability Scores for Knowledge Graphs](https://dl.acm.org/doi/10.1145/3487553.3524212). 4/2022.
24 | - Mustafa Abualsaud, Mark Smucker. [The Dark Side of Relevance: The Effect of Non-Relevant Results on Search Behavior](https://dl.acm.org/doi/10.1145/3498366.3505770). 3/2022.
25 | - Chirag Shah, Emily M. Bender. [Situating Search](https://dl.acm.org/doi/10.1145/3498366.3505816). 3/2022.*
26 |
27 | ## 2021
28 | - Stefan Voigt, Tobias Hecking, Dennis Jankoswski, Julius Moller, Maximilian Schwinger. [Open Search @ DLR - towards transparent access to web-based information in science](https://www.researchgate.net/publication/356602703_Open_Search_DLR_-_towards_transparent_access_to_web-based_information_in_science). 1/2021.
29 |
30 | ## 2020
31 | - Mario Kubek. [Contemporary Web Search](https://www.researchgate.net/publication/333931478_Contemporary_Web_Search). 1/2020.
32 | - Hitweshwar Kumar Azad, Akshay Deepak, Kumar Abhishek. [Query Expansion for Improving Web Search](https://www.researchgate.net/publication/339480386_Query_Expansion_for_Improving_Web_Search). 1/2020.
33 |
34 | ## 2019
35 | - Rashmi P Sarode, Shelly Sachdeva, Wanming Chu, Subhash Bhalla. [Segment-Search vs Knowledge Graphs: Making a Key-Word Search Engine for Web Documents](https://www.researchgate.net/publication/337923115_Segment-Search_vs_Knowledge_Graphs_Making_a_Key-Word_Search_Engine_for_Web_Documents). 12/2019.
36 | - Peilu Wang, Hao Jiang, Jingfang Xu, Qi Zhang. [Knowledge Graph Construction and Applications for Web Search and Beyond](https://www.researchgate.net/publication/336978553_Knowledge_Graph_Construction_and_Applications_for_Web_Search_and_Beyond). 11/2019.*
37 | - Dan Brickley, Matthew Burgess, Natasha Noy. [Google Dataset Search: Building a search engine for datasets in an open Web ecosystem](https://www.researchgate.net/publication/333067368_Google_Dataset_Search_Building_a_search_engine_for_datasets_in_an_open_Web_ecosystem). 5/2019.
38 |
39 | ## 2016
40 | - Weize Kong. [Extending Faceted Search to the Open-Domain Web](https://www.researchgate.net/publication/304618602_Extending_Faceted_Search_to_the_Open-Domain_Web). 6/2016.*
41 | - Bosubabu Sambana. [Web Search Engine](https://www.researchgate.net/publication/336265320_Web_Search_Engine). 3/2016.
42 |
43 | ## 2015
44 | - Evi Yulianti. [Finding Answers in Web Search](https://www.researchgate.net/publication/283659235_Finding_Answers_in_Web_Search). 8/2015.
45 | - Aleksandr Chuklin, Ilya Markov, Maarten de Rijke. [Click Models for Web Search](https://www.researchgate.net/publication/282201593_Click_Models_for_Web_Search). 7/2015.
46 | - Sonali Tanaji Kadam, Sanchika Bajpai. [Development of Web Annotation Technique for Search Result Records Using Web Database.](https://www.researchgate.net/publication/283779983_Development_of_Web_Annotation_Technique_for_Search_Result_Records_Using_Web_Database). 7/2015.
47 |
48 | ## 2014
49 | - Weize Kong, James Allan. [Extending Faceted Search to the General Web](https://www.researchgate.net/publication/284346690_Extending_Faceted_Search_to_the_General_Web). 11/2014.*
50 | - Guillem Frances, Xiao Bai, Berkant Barla Cambazoglu, Ricardo Baeza-Yates. [Improving the efficiency of multi-site web search engines](https://www.researchgate.net/publication/262172401_Improving_the_efficiency_of_multi-site_web_search_engines). 2/2014.*
51 |
52 | ## 2011
53 | - Yue Wang, Hongsong Li, Haixun Wang, Kenny Q. Zhu. [Toward Topic Search on the Web](https://www.researchgate.net/publication/255563891_Toward_Topic_Search_on_the_Web). 5/2011.
54 |
55 | ## 2010
56 | - Michael Zimmer. [Web Search Studies: Multidisciplinary Perspectives on Web Search Engines](https://www.researchgate.net/publication/226672921_Web_Search_Studies_Multidisciplinary_Perspectives_on_Web_Search_Engines). 6/2010.
57 |
58 | ## 2004
59 | - Deng Cai, Shipeng Yu, Ji-Rong Wen, Wei-Ying Ma. [Block-based web search](https://www.researchgate.net/publication/221301159_Block-based_web_search). 7/2004.
--------------------------------------------------------------------------------
/specific-engines/opensearch.md:
--------------------------------------------------------------------------------
1 | # OpenSearch
2 | - Official Website: https://opensearch.org/
3 | - Forums: https://forum.opensearch.org/
4 | - An open source fork of Elasticsearch started by Amazon.
5 |
6 | ## Some Basics
7 | - Built on Apache Lucene
8 | - Node(s) - A server that contains data and responds to search queries.
9 | - Cluster(s) - A group of nodes that work together to store and search data.
10 | - Index / Indices - A collection of documents.
11 | - Mapping(s) - The collection of fields that documents in an index have.
12 | - Setting(s) - Configurations for an index.
13 | - Shard(s) - A portion of an index.
14 | - Shards are split evenly across the nodes in a cluster.
15 | - Each shard is a full Lucene index.
16 | - Rule of Thumb: Shards should be 10-50 GB each.
17 | - Primary Shard - The node is responsible for the shard.
18 | - Replica Shard - The node acts as a backup and can take some of the read load off the primary shard.
19 | - REST API - OpenSearch provides a REST API for interacting with the server using HTTP requests.
20 |
21 | ```
22 | // Add a JSON doc to index
23 | PUT https://<host>:<port>/<index-name>/_doc/<document-id>
24 | {
25 | "title": "The Wind Rises",
26 | "release_date": "2013-07-20"
27 | }
28 |
29 | // Search for a document
30 | GET https://<host>:<port>/<index-name>/_search?q=wind
31 |
32 | // Delete a document
33 | DELETE https://<host>:<port>/<index-name>/_doc/<document-id>
34 | ```
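The 10-50 GB rule of thumb for shard size listed above can be turned into a rough sizing estimate. The sketch below is only arithmetic on that rule: the function name and the 300 GB example are hypothetical, and a real sizing decision would also weigh node count, growth, and query load.

```python
import math


def primary_shard_range(
    index_size_gb: float,
    min_shard_gb: float = 10,
    max_shard_gb: float = 50,
) -> tuple[int, int]:
    """Return (fewest, most) primary shards that keep each shard in the size band."""
    # Fewest shards: make every shard as large as the band allows.
    fewest = max(1, math.ceil(index_size_gb / max_shard_gb))
    # Most shards: make every shard as small as the band allows.
    most = max(1, math.floor(index_size_gb / min_shard_gb))
    # Small indexes cannot reach the lower bound; fall back to a single value.
    return fewest, max(fewest, most)


# A hypothetical 300 GB index: 300 / 50 = 6 shards at most 50 GB each,
# 300 / 10 = 30 shards at least 10 GB each.
print(primary_shard_range(300))
# An index smaller than the lower bound still needs one primary shard.
print(primary_shard_range(5))
```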
35 |
36 | ## Quickstart
37 | 1. You'll need Docker installed.
38 | 2. There are several settings OpenSearch recommends tweaking on the host machine. I'm holding off on tweaking these in a dev environment.
39 | 3. Download the Docker Compose config file: `curl -O https://raw.githubusercontent.com/opensearch-project/documentation-website/2.6/assets/examples/docker-compose.yml`
40 | 4. Start the cluster: `docker-compose up -d`
41 | 5. Query the API: `curl https://localhost:9200 -ku admin:admin`
42 | - `-k` (or `--insecure`) - Skip TLS certificate verification (since it's using demo certs).
43 | - `-u` - Allows one to provide username and password (`admin:admin`).
44 | 6. Explore OpenSearch Dashboards: http://localhost:5601 (use same user/pass as above).
45 | - Click Explore on My Own.
46 | - Select tenant (Global or Private is fine).
47 | 7. Go to Management -> Dev Tools and perform a query (default query is to show all results)
48 | - Can paste queries in cURL format and they'll be converted to Console syntax.
49 | - Green triangle runs query.
50 | - Keyboard shortcuts are under Help.
51 |
52 | ## Configuring OpenSearch
53 | - Most changes can be accomplished using the cluster settings API.
54 | - A few changes need to be made by modifying `opensearch.yml` and restarting the cluster; prefer the API whenever possible.
55 | - Note: `opensearch.yml` applies settings to the local node, the API applies to all nodes in the cluster.
56 | - One can also use environment variables when launching OpenSearch like so:
57 | `./opensearch -Ecluster.name=opensearch-cluster -Enode.name=opensearch-node1 -Ehttp.host=0.0.0.0 -Ediscovery.type=single-node`
58 |
59 | ### Using Cluster Settings API
60 | - View current settings: `GET _cluster/settings?include_defaults=true`
61 | - Non-default settings only: `GET _cluster/settings`
62 | - The types of settings and precedence:
63 | 1. Transient (cleared after restart)
64 | 2. Persistent
65 | 3. opensearch.yml
66 | 4. Default
67 | - To change settings specify the setting and whether it is persistent or transient:
68 | ```
69 | PUT _cluster/settings
70 | {
71 | "persistent": {
72 | "action.auto_create_index": "false"
73 | }
74 | }
75 | ```
76 | - Can also copy and paste from GET response and change the existing values:
77 | ```
78 | PUT _cluster/settings
79 | {
80 | "persistent": {
81 | "action": {
82 | "auto_create_index": false
83 | }
84 | }
85 | }
86 | ```
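To make the precedence order and the two equivalent request shapes (flat dotted keys versus nested objects) concrete, here is a small model in Python. It is an illustration of the rules listed above, not how OpenSearch resolves settings internally; the function names and sample values are hypothetical.

```python
from typing import Any


def flatten(settings: dict[str, Any], prefix: str = "") -> dict[str, Any]:
    """Convert nested settings into flat dotted keys; flat input passes through."""
    flat: dict[str, Any] = {}
    for key, value in settings.items():
        dotted = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, dotted))
        else:
            flat[dotted] = value
    return flat


def effective_setting(
    name: str,
    transient: dict[str, Any],
    persistent: dict[str, Any],
    yml: dict[str, Any],
    defaults: dict[str, Any],
) -> Any:
    """Return the first value found, checking sources in precedence order."""
    for source in (transient, persistent, yml, defaults):
        flat = flatten(source)
        if name in flat:
            return flat[name]
    raise KeyError(name)


# The nested and flat forms describe the same setting.
nested = {"action": {"auto_create_index": False}}
assert flatten(nested) == {"action.auto_create_index": False}

value = effective_setting(
    "action.auto_create_index",
    transient={},                                # nothing set transiently
    persistent=nested,                           # set through the API
    yml={"action.auto_create_index": True},      # set in opensearch.yml
    defaults={"action.auto_create_index": True},
)
print(value)  # the persistent value overrides opensearch.yml and the default
```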
87 |
88 | ### Using opensearch.yml
89 | - Docker: `/usr/share/opensearch/config/opensearch.yml`
90 | - Linux: `/etc/opensearch/opensearch.yml`
91 | - Example Settings:
92 | ```
93 | cluster.name: my-application
94 | action.auto_create_index: true
95 | compatibility.override_main_response_version: true
96 | ```
97 | - To allow client app to connect to OpenSearch on a different domain:
98 | ```
99 | http.host: 0.0.0.0
100 | http.port: 9200
101 | http.cors.allow-origin: "http://localhost"
102 | http.cors.enabled: true
103 | http.cors.allow-headers: X-Requested-With,X-Auth-Token,Content-Type,Content-Length,Authorization
104 | http.cors.allow-credentials: true
106 |
107 | ## Plugins
108 | - One can use the `opensearch-plugin` command to list, install, and remove plugins.
109 | - If using OpenSearch in Docker, plugins must be managed by modifying the Docker image.
110 |
111 | ### List
112 | - `bin/opensearch-plugin list`
113 | - Or use CAT API: `GET _cat/plugins`
114 |
115 | ### Install
116 | - By Name: `bin/opensearch-plugin install <plugin-name>`
117 | - From Zip: `bin/opensearch-plugin install <zip-file>` (must use HTTP(S), for local files use `file://`)
118 | - Using Maven Coordinates: `bin/opensearch-plugin install <groupId>:<artifactId>:<version>`
119 |
120 | ### Remove
121 | - `bin/opensearch-plugin remove <plugin-name>`
122 |
123 | ### Restart
124 | - Restart the node after installing or removing a plugin.
125 |
126 | ### Batch Mode
127 | - To skip confirmation prompts when installing plugins: `bin/opensearch-plugin install --batch <plugin-name>`
128 |
129 | ### Bundled Plugins
130 | - Alerting - `opensearch-alerting`
131 | - Anomaly Detection - `opensearch-anomaly-detection`
132 | - Asynchronous Search - `opensearch-asynchronous-search`
133 | - Cross Cluster Replication - `opensearch-cross-cluster-replication`
134 | - Notifications - `notifications`
135 | - Reports Scheduler - `opensearch-reports-scheduler`
136 | - Geospatial - `opensearch-geospatial`
137 | - Index Management - `opensearch-index-management`
138 | - Job Scheduler - `opensearch-job-scheduler`
139 | - k-NN - `opensearch-knn`
140 | - ML Commons - `opensearch-ml`
141 | - Neural Search - `neural-search`
142 | - [Neural Search GitHub Repo](https://github.com/opensearch-project/neural-search)
143 | - Observability - `opensearch-observability`
144 | - Notebooks (`opensearch-notebooks`) has been merged into Observability.
145 | - Performance Analyzer - `opensearch-performance-analyzer`
146 | - Not available on Windows.
147 | - Security - `opensearch-security`
148 | - Security Analytics - `opensearch-security-analytics`
149 | - SQL - `opensearch-sql`
150 |
151 | ### Additional Plugins
152 | - These are available for install using `bin/opensearch-plugin install <plugin-name>`; additional ones are also available outside of OpenSearch's GitHub.
153 | - `analysis-icu`
154 | - `analysis-kuromoji`
155 | - `analysis-nori`
156 | - `analysis-phonetic`
157 | - `analysis-smartcn`
158 | - `analysis-stempel`
159 | - `analysis-ukrainian`
160 | - `discovery-azure-classic`
161 | - `discovery-ec2`
162 | - `discovery-gce`
163 | - `ingest-attachment`
164 | - `mapper-annotated-text`
165 | - `mapper-murmur3`
166 | - `mapper-size`
167 | - `repository-azure`
168 | - `repository-gcs`
169 | - `repository-hdfs`
170 | - `repository-s3`
171 | - `store-smb`
172 | - `transport-nio`
173 |
174 | ## Data Prepper
175 | - "Data Prepper is a component of the OpenSearch project that accepts, filters, transforms, enriches, and routes data at scale."
176 | - [GitHub Repo](https://github.com/opensearch-project/data-prepper)
177 |
178 | ## Companies Working with OpenSearch
179 | - [Sematext](https://sematext.com/)
180 |
181 | ## Bibliography
182 | - https://opensearch.org/docs/latest/about/
183 | - https://opensearch.org/docs/latest/quickstart/
184 | - https://opensearch.org/docs/latest/install-and-configure/configuration/
185 | - https://opensearch.org/docs/latest/install-and-configure/plugins/
--------------------------------------------------------------------------------
/specific-engines/solr/solr-terminology.md:
--------------------------------------------------------------------------------
1 | ## Terminology
2 |
3 | ### Concepts
4 | - Atomic Updates - "An approach to updating only one or more fields of a document, instead of reindexing the entire document."
5 | - Boolean operators - "control the inclusion or exclusion of keywords in a query by using operators such as AND, OR, and NOT."
6 | - Clustering (of Results) - "groups search results by similarities discovered when a search is executed, rather than when content is indexed."
7 | - "It can reveal unexpected commonalities among search results"
8 | - Document - "A group of fields and their values. Documents are the basic unit of data in a collection."
9 | - "basic unit of information is a document, which is a set of data that describes something."
10 | - Facet / Faceting - "The arrangement of search results into categories based on indexed terms."
11 | - Facet Constraint - A specific facet value within a category that further constrains the results.
12 | - Field - "The content to be indexed/searched along with metadata defining how the content should be processed by Solr."
13 | - "documents are composed of fields, which are more specific pieces of information."
14 | - Fields can be of different types - e.g. text, number.
15 | - Index - where Solr stores all of the data.
16 | - Indexing - adding data to Solr.
17 | - Inverse Document Frequency (IDF) - "A measure of the general importance of a term. It is calculated as the number of total Documents divided by the number of Documents that a particular word occurs in the collection."
18 | - Inverted Index - "A way of creating a searchable index that lists every word and the documents that contain those words, similar to an index in the back of a book which lists words and the pages on which they can be found."
19 | - Metadata - "Literally, data about data. Metadata is information about a document, such as its title, author, or location."
20 | - Natural Language Query - "A search that is entered as a user would normally speak or write".
21 | - Precision - "the percentage of documents in the returned results that are relevant."
22 | - Query - asking a question of the data using Solr.
23 | - Query Parser - "processes the terms entered by a user."
24 | - Recall - "the percentage of relevant results returned out of all relevant results in the system."
25 | - "The ability of a search engine to retrieve all of the possible matches to a user’s query."
26 | - Relevance - "the degree to which a query response satisfies a user who is searching for information."
27 | - "The appropriateness of a document to the search conducted by the user."
28 | - Stopwords - "Generally, words that have little meaning to a user’s search but which may have been entered as part of a natural language query."
29 | - "Stopwords are generally very small pronouns, conjunctions and prepositions (such as, "the", "with", or "and")"
30 | - Synonyms - "Synonyms generally are terms which are near to each other in meaning and may substitute for one another. In a search engine implementation, synonyms may be abbreviations as well as words, or terms that are not consistently hyphenated."
31 | - Term Frequency - "The number of times a word occurs in a given document."
32 | - Wildcard - "A wildcard allows a substitution of one or more letters of a word to account for possible variations in spelling or tenses."
33 | - Zookeeper - "The system used by SolrCloud to keep track of configuration files and node names for a cluster. A ZooKeeper cluster is used as the central configuration store for the cluster, a coordinator for operations requiring distributed synchronization, and the system of record for cluster topology."
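Several of the concepts above (inverted index, term frequency, inverse document frequency, precision, recall) are easier to see on a toy corpus. The sketch below follows the glossary definitions literally; it is not how Solr or Lucene implement them, and the three sample documents and function names are made up for illustration. In particular, IDF is computed here as the raw ratio from the definition, whereas production scoring formulas typically apply a logarithm and smoothing.

```python
from collections import defaultdict

# Hypothetical toy corpus: document id -> text.
docs = {
    1: "solr is a search server",
    2: "lucene is a search library",
    3: "solr is built on lucene",
}


def build_inverted_index(documents: dict[int, str]) -> dict[str, set[int]]:
    """Map every word to the set of document ids that contain it."""
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return dict(index)


def term_frequency(term: str, text: str) -> int:
    """Number of times a word occurs in a given document."""
    return text.lower().split().count(term)


def inverse_document_frequency(term: str, index: dict[str, set[int]], total_docs: int) -> float:
    """Total documents divided by the number of documents containing the term."""
    containing = len(index.get(term, ()))
    if containing == 0:
        raise ValueError(f"term {term!r} does not occur in the collection")
    return total_docs / containing


def precision(returned: set[int], relevant: set[int]) -> float:
    """Share of returned documents that are relevant."""
    return len(returned & relevant) / len(returned)


def recall(returned: set[int], relevant: set[int]) -> float:
    """Share of all relevant documents that were returned."""
    return len(returned & relevant) / len(relevant)


index = build_inverted_index(docs)
print(sorted(index["solr"]))                               # documents containing "solr"
print(term_frequency("is", docs[1]))                       # occurrences of "is" in doc 1
print(inverse_document_frequency("solr", index, len(docs)))

# Suppose a query returned docs 1 and 2 while docs 1 and 3 were actually relevant.
print(precision({1, 2}, {1, 3}), recall({1, 2}, {1, 3}))
```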
34 |
35 | ### Infrastructure
36 | - Cluster - "a set of Solr nodes operating in coordination with each other via ZooKeeper, and managed as a unit."
37 | - Collection - "one or more Documents grouped together in a single logical index using a single configuration and Schema."
38 | - A similar concept is Cores in single-node installations and user-managed clusters.
39 | - Core - "An individual Solr instance (represents a logical index). Multiple cores can run on a single node."
40 | - Compare to Collection above.
41 | - Ensemble - "A ZooKeeper term to indicate multiple ZooKeeper instances running simultaneously and in coordination with each other for fault tolerance."
42 | - Leader - "A single Replica for each Shard that takes charge of coordinating index updates (document additions or deletions) to other replicas in the same shard."
43 | - "This is a transient responsibility assigned to a node via an election, if the current Shard Leader goes down, a new node will automatically be elected to take its place."
44 | - Node - "A JVM instance running Solr. Also known as a Solr server."
45 | - [Operator](https://solr.apache.org/operator/) - "[B]uilt to reliably manage Apache Solr on Kubernetes."
46 | - Overseer - "A single node in SolrCloud that is responsible for processing and coordinating actions involving the entire cluster. It keeps track of the state of existing nodes, collections, shards, and replicas, and assigns new replicas to nodes."
47 | - "This is a transient responsibility assigned to a node via an election, if the current Overseer goes down, a new node will be automatically elected to take its place."
48 | - Replica - "A Core that acts as a physical copy of a Shard."
49 | - Replication - "A method of copying a leader index from one server to one or more 'follower' or 'child' servers."
50 | - SolrCloud - "Umbrella term for a suite of functionality in Solr which allows managing a Cluster of Solr Nodes for scalability, fault tolerance, and high availability."
51 |
52 | ### Other
53 | - Common Query Parameters - Parameters that are "accepted by all query parsers."
54 | - Distributed Search - "queries are processed across more than one Shard."
55 | - Field Analysis - "tells Solr what to do with incoming data when building an index."
56 | - "A more accurate name for this process would be processing or even digestion, but the official name is analysis."
57 | - Filter
58 | - Filter Query - "a filter query runs a query against the entire index and caches the results...the strategic use of filter queries can improve search performance."
59 | - Operates on data already existing in the index.
60 | - Analysis Filter - Operates on data being ingested.
61 | - MoreLikeThis - "enables users to submit new queries that focus on particular terms returned in an earlier query."
62 | - `maxDoc` - The number of documents in the index including those which have been logically but not physically deleted.
63 | - `numDocs` - The "number of searchable documents in the index."
64 | - Some files may contain multiple documents, e.g. XML, JSON, or CSV. In this case the `numDocs` will be greater than the number of files indexed.
65 | - Request Handler (RequestHandler) - receives and processes requests.
66 | - "Logic and configuration parameters that tell Solr how to handle incoming "requests", whether the requests are to return search results, to index documents, or to handle other custom situations."
67 | - SearchComponent - "Logic and configuration parameters used by request handlers to process query requests."
68 | - "Examples of search components include faceting, highlighting, and "more like this" functionality."
69 | - Response Writer - "manages the final presentation of the query response."
70 | - Solr is bundled with both XML and JSON response writers.
71 | - SolrConfig (solrconfig.xml) - "The Apache Solr configuration file. Defines indexing options, RequestHandlers, highlighting, spellchecking and various other configurations."
72 | - "The file, solrconfig.xml, is located in the Solr home conf directory."
73 | - `solr.home` - "the location under the main Solr installation where Solr's collections and their `conf` and `data` directories are stored."
74 | - Solr Schema (managed-schema.xml or schema.xml) - Defines how Solr builds indexes from data sent to it.
75 | - Stores information about the fields and data types.
76 | - Shard - "In SolrCloud, a logical partition of a single Collection."
77 | - "Every shard consists of at least one physical replica".
78 |
79 | # Bibliography / Resources
80 | - See Bibliography section of the main [Apache Solr document](../apache-solr.md).
--------------------------------------------------------------------------------
/research/annotated-collaborative-research.md:
--------------------------------------------------------------------------------
1 | ## 2009
2 | - Sharoda A. Paul, Meredith Ringel Morris. [CoSense: enhancing sensemaking for collaborative web search](https://dl.acm.org/doi/10.1145/1518701.1518974). 4/2009.*
3 | - DM: This article is interesting from the perspective of advanced collaborative, ongoing research uses - e.g., several individuals working together on surfacing the best results on a specific topic.
4 | - "Broadly speaking, sensemaking is *finding meaning* in a situation. In HCI, sensemaking refers to the cognitive act of understanding information [24]." - 1.
5 | - "One of the important problems facing HCI research today is the design of computer interfaces to enable us to make sense of the vast amounts of information we encounter every day [24]." - 1.
6 | - "One of the prominent methodologies in this thread of research is Dervin’s “Sense-making” [4]. Sense-making occurs when a person, embedded in a particular context and moving through time-space, experiences a “gap” in reality. The person bridges this gap by constructing bridges consisting of ideas, thoughts, emotions, feelings, and memories. In the education literature...sensemaking refers to how students derive meanings about their learning experiences and how they identify particular ideas as important [5]. Weick [22], has explored sensemaking in the context of organizations. According to Weick, people organize their world to make sense of ambiguous situations they encounter and enact this sense back into the world to make that world more orderly. In HCI, sensemaking has focused on how users understand complex information spaces [16]. When interacting with large amounts of information, people create representations to organize information in order to make sense of it. Sensemaking is the process of encoding information into external representations to answer complex, task-specific questions." - 2.
7 | - "...several collaborative search tools have recently been proposed by the research community [7]...they tend to offer two classes of support proposed by Morris & Horvitz [12]: awareness features (e.g., sharing of group members’ query histories, browsing histories, and/or comments on results) and division of labor features (e.g., chat systems, the ability to manually divide search results or URLs among group members, and/or algorithmic techniques for modifying group members’ search results based on others’ actions)." - 2.
8 | - "The temporality of the search process was important for participants’ sensemaking. Many participants wanted to see a unified chronological ordering of all events in the search process. They wanted to see the complete information path that was followed by other group members, and hence would have liked SearchTogether to make the browsing (in addition to searching) behavior of others more visible." - 3.
9 | - "The concept of query evolution seemed important to participants’ sensemaking; that is, participants wanted to take others’ queries and build upon them." - 3.
10 | - DM: See article for concepts like action awareness, context awareness, query evolution, sensemaking handoffs.
11 |
12 | ## 2008
13 | - Jeremy Pickens, Gene Golovchinsky, Chirag Shah, Pernilla Qvarfordt, Maribeth Back. [Algorithmic mediation for collaborative exploratory search](https://dl.acm.org/doi/10.1145/1390334.1390389). 7/2008.
14 | - "Using our system, two or more users with a common information need search together, simultaneously. The collaborative system provides tools, user interfaces and, most importantly, algorithmically-mediated retrieval to focus, enhance and augment the team’s search and communication activities." - 1.
15 | - "Information seeking can be more effective as a collaboration than as a solitary activity: different people bring different perspectives, experiences, expertise, and vocabulary to the search process. A retrieval system that takes advantage of this breadth of experience should improve the quality of results obtained by its users [4]." - 1.
16 | - "In this work we explore the possibilities of synchronous, explicit, algorithmically-mediated collaboration for search tasks [10]." - 1.
17 |
18 | ## 2007
19 | - Athanasios Papagelis, Christos Zaroliagis. [Searchius: A Collaborative Search Engine](https://www.researchgate.net/publication/4282197_Searchius_A_Collaborative_Search_Engine). 10/2007.
20 | - "Searchius is a collaborative search engine that produces search results based solely on user provided web-related data. We discuss the architecture of the system and how it compares to current state-of-the-art search engines." - 1.
21 | - "URLs can be explicitly collected (e.g., bookmarks) or implicitly collected (e.g., web-browsing history). These collections of web-related data can be combined, without loosing [sp] their discrete nature, to produce a view of the web from the user perspective." - 1.
22 | - DM: It would be interesting to add to the ranking algorithm an analysis of sites visited versus sites bookmarked. e.g., it seems likely that sites visited once and never bookmarked across a large number of users are links of low quality.
23 | - "Our approach is based on the observation that the web users act as small crawlers seeking information on the web using various media, which they subsequently store and organize into tree-like structures inside their information spaces." - 1.
24 | - "...Searchius is not capital intensive, since it concentrates on a small portion of the data that typical search engines collect and analyze." - 1.
25 | - "To order pages by importance, Searchius uses an aggregation function based on the preference to pages by different users, thus avoiding the expensive iterative procedure of PageRank." - 1.
26 | - "Finally, the way people organize their bookmarks can be used to segment the URL space to relative sub-spaces. This property can be exploited to provide efficient solutions to other applications, including the construction of web catalogs and finding related URLs." - 1.
27 | - "Under the above context, the ranking of pages in Searchius is based on how many *different* users have voted for a specific page p. The total number of such votes is called the *UsersRank of page p*." - 2.
28 | - "The Noise Reducer uses heuristic filters to remove from the database low quality URLs." - 3.
29 | - "Searchius uses a similar approach. The IR score of a page is calculated based on the page title, URL and semantic tagging given by users. This semantic tagging is the exact analog to anchor links and can produce diverse descriptions of pages." - 3.
30 | - DM: The use of aggregate user provided titles, descriptions, and semantic tags instead of actual page content is an interesting idea. I would see this as additive, e.g., one could perform ranking on page rank, user rank, IR score, and user IR score.
31 | - "To cope with such problems, we introduce an aging procedure for the collected pages. At predefined time intervals we reduce the value of each page in the Searchius database. Since we use a simple value aggregation procedure to find the most important pages for a search query, the effect of old URL collections to searching results diminishes through time allowing more room for fresh data." - 4.
32 | - DM: This is an interesting approach to ensuring that newer pages have a chance to surface in the results while also allowing older pages to retain their authority if they continue to be widely utilized.
33 | - "The quality of page ordering will be benefited by implicit collection of URLs for three reasons. First, the pages we visit determine with high accuracy our current interests. If we can monitor the users browsing history then we can always produce a fresh view of how the web dynamics evolve. On the contrary, bookmark collections may be somewhat outdated. Second, the frequency of visiting specific pages gives a much better indication of our relative preference for specific sites than the bookmark collections. Third, the order in which we visit sites can also be used and exploited in many ways. For example, frequently subsequent sites in a browsing history can be linked as belonging to the same theme." - 4.
34 | - "When a user adds several pages under a folder he actually groups them by some sort of similarity. This infers that folders partition the URL-space to conceptually related sub-spaces." - 4.
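
The UsersRank and aging ideas quoted above can be sketched in a few lines. This is only an illustration under stated assumptions, not the Searchius implementation: the function names, the multiplicative `decay` factor, and the sample bookmark data are all hypothetical, and the paper only says that page values are reduced at predefined intervals.

```python
from collections import defaultdict


def users_rank(collections):
    """Count how many *different* users hold each URL (one vote per user)."""
    votes = defaultdict(int)
    for urls in collections.values():
        for url in set(urls):  # de-duplicate so a user votes at most once per URL
            votes[url] += 1
    return dict(votes)


def age(scores, decay=0.5):
    """Hypothetical aging step: shrink every stored value by a fixed factor."""
    return {url: value * decay for url, value in scores.items()}


# Hypothetical bookmark collections keyed by user.
collections = {
    "alice": ["https://a.example", "https://b.example", "https://a.example"],
    "bob": ["https://a.example"],
}
scores = users_rank(collections)  # votes before aging
aged = age(scores)                # votes after one aging interval
```

Under this sketch, a page keeps its position only if fresh collections continue to add votes between aging steps, which matches the intent described in the quote.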
--------------------------------------------------------------------------------
/specific-engines/solr/basic-tutorial.md:
--------------------------------------------------------------------------------
1 | # Introduction
2 | This is meant to be a cheatsheet-like condensation of [the official Solr tutorials](https://solr.apache.org/guide/solr/latest/getting-started/solr-tutorial.html). See the Bibliography/Resources section for links to the individual tutorials used.
3 |
4 | In a few places I've included additional details I thought relevant but overall this cheatsheet leaves out a lot of material and you may want to read the original tutorials first and use this for later reference.
5 |
6 | # Getting Solr
7 | - [Download the latest Solr](https://solr.apache.org/downloads.html).
8 | - Unzip the downloaded file: `tar -xzf solr-9.2.1.tgz`
9 | - Enter the extracted folder: `cd solr-9.2.1`
10 |
11 | # Launch Solr in SolrCloud Mode
12 | - `bin/solr start -c`
13 | - Add another Solr node to the cluster: `bin/solr start -c -z localhost:9983 -p 9984`
14 | - May need to increase open file limit to 65000.
15 | - May need to adjust available entropy.
16 | - Open Admin UI: http://localhost:8983/
17 |
18 | # Create a Collection
19 | ```
20 | curl --request POST \
21 | --url http://localhost:8983/api/collections \
22 | --header 'Content-Type: application/json' \
23 | --data '{
24 | "create": {
25 | "name": "techproducts",
26 | "numShards": 1,
27 | "replicationFactor": 1
28 | }
29 | }'
30 | ```
31 |
32 | # Define a Schema
33 | ```
34 | curl --request POST \
35 | --url http://localhost:8983/api/collections/techproducts/schema \
36 | --header 'Content-Type: application/json' \
37 | --data '{
38 | "add-field": [
39 | {"name": "name", "type": "text_general", "multiValued": false},
40 | {"name": "cat", "type": "string", "multiValued": true},
41 | {"name": "manu", "type": "string"},
42 | {"name": "features", "type": "text_general", "multiValued": true},
43 | {"name": "weight", "type": "pfloat"},
44 | {"name": "price", "type": "pfloat"},
45 | {"name": "popularity", "type": "pint"},
46 | {"name": "inStock", "type": "boolean", "stored": true},
47 | {"name": "store", "type": "location"}
48 | ]
49 | }'
50 | ```
51 |
52 | # Index Some Documents
53 | - Single Document:
54 | ```
55 | curl --request POST \
56 | --url 'http://localhost:8983/api/collections/techproducts/update' \
57 | --header 'Content-Type: application/json' \
58 | --data ' {
59 | "id" : "978-0641723445",
60 | "cat" : ["book","hardcover"],
61 | "name" : "The Lightning Thief",
62 | "author" : "Rick Riordan",
63 | "series_t" : "Percy Jackson and the Olympians",
64 | "sequence_i" : 1,
65 | "genre_s" : "fantasy",
66 | "inStock" : true,
67 | "price" : 12.50,
68 | "pages_i" : 384
69 | }'
70 | ```
71 | - Multiple Documents:
72 | ```
73 | curl --request POST \
74 | --url 'http://localhost:8983/api/collections/techproducts/update' \
75 | --header 'Content-Type: application/json' \
76 | --data ' [
77 | {
78 | "id" : "978-0641723445",
79 | "cat" : ["book","hardcover"],
80 | "name" : "The Lightning Thief",
81 | "author" : "Rick Riordan",
82 | "series_t" : "Percy Jackson and the Olympians",
83 | "sequence_i" : 1,
84 | "genre_s" : "fantasy",
85 | "inStock" : true,
86 | "price" : 12.50,
87 | "pages_i" : 384
88 | }
89 | ,
90 | {
91 | "id" : "978-1423103349",
92 | "cat" : ["book","paperback"],
93 | "name" : "The Sea of Monsters",
94 | "author" : "Rick Riordan",
95 | "series_t" : "Percy Jackson and the Olympians",
96 | "sequence_i" : 2,
97 | "genre_s" : "fantasy",
98 | "inStock" : true,
99 | "price" : 6.49,
100 | "pages_i" : 304
101 | }
102 | ]'
103 | ```
104 | - A file containing the documents:
105 | - NOTE: This file does not exist, so this import will not work as-is.
106 | ```
107 | curl -H "Content-Type: application/json" \
108 | -X POST \
109 | -d @example/products.json \
110 | --url 'http://localhost:8983/api/collections/techproducts/update?commit=true'
111 | ```
112 |
113 | # Commit the Changes
114 | - "After documents are indexed into a collection, they are not immediately available for searching. In order to have them searchable, a commit operation (also called refresh in other search engines like OpenSearch etc.) is needed. Commits can be scheduled at periodic intervals using auto-commits as follows."
115 | - `curl -X POST -H 'Content-type: application/json' -d '{"set-property":{"updateHandler.autoCommit.maxTime":15000}}' http://localhost:8983/api/collections/techproducts/config`
116 |
117 | # Make Some Queries
118 | - `curl 'http://localhost:8983/solr/techproducts/select?q=name%3Alightning'`
119 |
120 | # Reset Solr to Original State
121 | - `bin/solr stop -all ; rm -Rf example/cloud`
122 |
123 | # Start Solr in SolrCloud Mode
124 | - `./bin/solr start -e cloud`
125 | - Set how many Solr nodes `2`.
126 | - Set the port for node1 to `8983`.
127 | - Set the port for node2 to `7574`.
128 | - The commands run by Solr are shown in the terminal and can be run in the future:
129 | - Start up node1: `bin/solr start -cloud -p 8983 -s "example/cloud/node1/solr"`
130 | - Start up node2: `bin/solr start -cloud -p 7574 -s "example/cloud/node2/solr" -z localhost:9983`
131 | - Create a collection `techproducts`.
132 | - Set the number of shards to `2`.
133 | - Set the number of replicas per shard to `2`.
134 | - Set the configuration to `sample_techproducts_configs`.
135 |
136 | # Index the Techproducts Data
137 | - `bin/post -c techproducts example/exampledocs/*`
138 |
139 | # Use the Solr Admin UI to Query
140 | - Open the Solr Admin UI in the browser: http://localhost:8983/
141 | - On the left-hand side select the dropdown "Collection Selector" and choose "techproducts".
142 | - A new menu opens on the left-hand side under the Collection Selector, click on Query.
143 | - Click on Execute Query and you'll see the results of the query (ten documents in the collection).
144 | - Note the URL above the results, this can be used with `curl` or similar to make the same query.
145 | - Note that clicking on the URL above the results returns the raw response.
146 | - The URL should look something like this: `http://localhost:8983/solr/techproducts/select?indent=true&q.op=OR&q=*%3A*&useParams=`
147 | - The `q=` stands for query.
148 | - The operator `*:*` means match all documents in the collection.
149 | - This returns a parse error in curl, "Cannot parse ''*:*'':..."
150 | - If we use the URL percent-encoding for the colon, `%3A`, then it works.
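
Encoding the query by hand is error-prone, so one option is to let a library do it. The sketch below uses Python's standard `urllib.parse.urlencode`; the `BASE` URL and the `select_url` helper are assumptions made for illustration and are not part of Solr.

```python
from urllib.parse import urlencode

# Assumed local Solr node and collection, matching the tutorial above.
BASE = "http://localhost:8983/solr/techproducts/select"


def select_url(**params):
    """Build a select URL with every parameter value percent-encoded."""
    return f"{BASE}?{urlencode(params)}"


# The match-all query: both the asterisks and the colon are encoded.
match_all = select_url(q="*:*")
```

The resulting URL can be passed to `curl` in single quotes without further escaping.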
151 |
152 | # Returning Only Specific Fields in Response to a Query
153 | - In the Query UI search for 'foundation'
154 | - One can choose which fields are returned using the `fl` parameter, e.g.:
155 | `curl "http://localhost:8983/solr/techproducts/select?q=foundation&fl=id"`
156 |
157 | # Limit the Fields Searched for a Query
158 | - We can also limit the fields searched using a field name and the desired search query, e.g.:
159 | `curl "http://localhost:8983/solr/techproducts/select?q=cat:electronics"`
160 |
161 | # Searching for a Phrase
162 | - Enclose the phrase in double quotes, e.g.:
163 | `curl "http://localhost:8983/solr/techproducts/select?q=\"CAS+latency\""`
164 | - Note above that we had to escape the inner set of quotes with backslashes; this isn't necessary in the Admin Query UI.
165 |
166 | # Query on Exact Multiple Terms / Phrases
167 | - By default Solr requires only one term to be present in a document for it to be included in the results. To require multiple terms or specific phrases you can use `+` before the terms, e.g.: `+electronics +music`:
168 | `curl "http://localhost:8983/solr/techproducts/select?q=%2Belectronics%20%2Bmusic"`
169 | - Can exclude specific terms / phrases using a `-`, for example:
170 | `curl "http://localhost:8983/solr/techproducts/select?q=%2Belectronics+-music"`
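
The `+` and `-` prefixes have to be percent-encoded carefully, because a bare `+` in a URL means a space. A minimal sketch of building such a query string follows; `term_query` is a hypothetical helper, and it handles single terms only (phrases would additionally need double quotes).

```python
from urllib.parse import urlencode


def term_query(required=(), excluded=()):
    """Prefix required terms with + and excluded terms with -."""
    parts = [f"+{term}" for term in required] + [f"-{term}" for term in excluded]
    return " ".join(parts)


q = term_query(required=["electronics"], excluded=["music"])
# urlencode turns the literal + into %2B and the separating space into +.
query_string = urlencode({"q": q})
```

The encoded string reproduces the query portion of the exclusion example above.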
171 |
172 | # Create a New Collection
173 | - `bin/solr create -c films -s 2 -rf 2`
174 | - Automatically utilizes the `_default` configset.
175 | - `-s` sets the number of shards for the collection.
176 | - `-rf` sets the number of replicas.
177 | - If we open the Solr Admin UI we can select the films collection from the Collection Selector dropdown as we selected the techproducts collection earlier.
178 |
179 | # Working with the Schema API
180 |
181 | ## Creating the "name" Field
182 | - `curl -X POST -H 'Content-type:application/json' --data-binary '{"add-field": {"name":"name", "type":"text_general", "multiValued":false, "stored":true}}' http://localhost:8983/solr/films/schema`
183 | - This can also be accomplished with slightly less control using the Admin UI.
184 |
185 | ## Creating a "catchall" Copy Field
186 | - A catchall field is created "defining a copy field that will take all the data from all fields and index it into a field named `_text_`."
187 | - `curl -X POST -H 'Content-type:application/json' --data-binary '{"add-copy-field" : {"source":"*","dest":"_text_"}}' http://localhost:8983/solr/films/schema`
188 | - This can also be accomplished through the Admin UI.
189 |
190 | # Index the Film Data
191 | - Import JSON: `bin/post -c films example/films/films.json`
192 | - Or import XML: `bin/post -c films example/films/films.xml`
193 | - Or import CSV: `bin/post -c films example/films/films.csv -params "f.genre.split=true&f.directed_by.split=true&f.genre.separator=|&f.directed_by.separator=|"`
194 |
195 | # Query the Film Data
196 | - Open the Admin UI and run Query, you should see 1100 results in the `numFound` field of the `response`, the first ten of which will be displayed.
197 |
198 | # Using Faceting
199 | - "Faceting allows the search results to be arranged into subsets (or buckets, or categories)..."
200 | - Types of Faceting:
201 | - Field Values
202 | - Numeric and Date Ranges
203 | - Pivots (Decision Tree)
204 | - Arbitrary Query Faceting
205 |
206 | ## Field Facets
207 | - In the Admin UI Query tab check the facet checkbox to see facet-related options appear.
208 | - "To see facet counts from all documents (q=*:*): turn on faceting (facet=true), and specify the field to facet on via the facet.field parameter."
209 | - If you want a list of facets but don't want any of the details of the results you can use `rows=0`.
210 | - Example in curl: `curl "http://localhost:8983/solr/films/select?q=*:*&rows=0&facet=true&facet.field=genre_str"`
211 | - One can use `facet.mincount` to only show facets with at least x documents in them.
212 | - Example in curl: `curl "http://localhost:8983/solr/films/select?q=*:*&facet.field=genre_str&facet.mincount=200&facet=on&rows=0"`
213 |
214 | ## Range Facets
215 | - Admin UI does not support range facet options.
216 | - curl:
217 | ```
218 | curl 'http://localhost:8983/solr/films/select?q=*:*&rows=0'\
219 | '&facet=true'\
220 | '&facet.range=initial_release_date'\
221 | '&facet.range.start=NOW/YEAR-25YEAR'\
222 | '&facet.range.end=NOW'\
223 | '&facet.range.gap=%2B1YEAR'
224 | ```
225 | - The above returns all films and groups them by year, starting at the beginning of the year 25 years ago and ending now.
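
The same range-facet request can be assembled from a dictionary, which keeps the parameters readable and leaves the encoding of `+1YEAR` to the library. This is a sketch that assumes the local `films` collection from this tutorial; it only builds the URL and does not contact Solr.

```python
from urllib.parse import urlencode

# Range-facet parameters, mirroring the curl command above.
params = {
    "q": "*:*",
    "rows": 0,
    "facet": "true",
    "facet.range": "initial_release_date",
    "facet.range.start": "NOW/YEAR-25YEAR",
    "facet.range.end": "NOW",
    "facet.range.gap": "+1YEAR",  # encoded as %2B1YEAR, as in the curl version
}
url = "http://localhost:8983/solr/films/select?" + urlencode(params)
```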
226 |
227 | ## Pivot Facets
228 | - `curl "http://localhost:8983/solr/films/select?q=*:*&rows=0&facet=on&facet.pivot=genre_str,directed_by_str"`
229 |
230 | ## Remove Films Collection
231 | - If desired the films collection can be removed using: `bin/solr delete -c films`
232 |
233 |
234 | # Bibliography/Resources
235 | - https://solr.apache.org/guide/solr/latest/getting-started/solr-tutorial.html
236 | - https://solr.apache.org/guide/solr/latest/getting-started/tutorial-five-minutes.html
237 | - https://solr.apache.org/guide/solr/latest/getting-started/tutorial-techproducts.html
238 | - https://solr.apache.org/guide/solr/latest/getting-started/tutorial-films.html
--------------------------------------------------------------------------------
/research/books-research.md:
--------------------------------------------------------------------------------
1 | - NOTE: There are currently 36 books in Springer's The Information Retrieval Series; not all are listed below.
2 | - NOTE: If you are interested in volumes published by Pearson, Manning, Apress, Packt, O'Reilly, Chapman Hall/CRC, Morgan Kaufmann, or Addison-Wesley, you may want to consider a subscription to [O'Reilly Learning](https://oreilly.com/learning) which includes access to a number of volumes from these publishers (check if your specific volumes are included).
3 |
4 | ### General Audience
5 | - Dirk Lewandowski. *Understanding Search Engines*. Springer, 3/2023. 307 pp.
6 | - Alexander Halavais. *Search Engine Society*. Digital Media and Society, 11/2017. 240 pp.
7 | - Ian H. Witten, Marco Gori, Teresa Numerico. *Web Dragons: Inside the Myths of Search Engine Technology*. Morgan Kaufmann, 7/2010.
8 |
9 | #### Ethics
10 | - Rosie Graham. *Investigating Google's Search Engine: Ethics, Algorithms, and the Machines Built to Read Us*. Bloomsbury Academic, 1/2023. 256 pp.
11 | - Safiya Umoja Noble. *Algorithms of Oppression: How Search Engines Reinforce Racism*. NYU Press, 2/2018. 248 pp.
12 |
13 | ### Core
14 | - W. Bruce Croft, Donald Metzler, Trevor Strohman. *Search Engines: Information Retrieval in Practice*. Pearson, 2/2009. 552 pp.
15 | - Available for free from the [University of Massachusetts](https://ciir.cs.umass.edu/irbook/).
16 | - Christopher D. Manning, Hinrich Schütze, Prabhakar Raghavan. *Introduction to Information Retrieval*. Cambridge University Press, 7/2008. 506 pp.
17 | - Available for free from [Stanford](https://nlp.stanford.edu/IR-book/information-retrieval-book.html).
18 | - Stefan Buttcher, Charles L. A. Clarke, Gordon V. Cormack. *Information Retrieval: Implementing and Evaluating Search Engines*. The MIT Press, 2/2016. 632 pp.
19 | - Significant portions of the 2010 edition of this book are [available for free from the official site](https://plg.uwaterloo.ca/~ir/ir/book/). There are 16 chapters in that edition with 7 available for free.
20 | - It appears that the 2016 edition is a reprint of the 2010 edition verbatim.
21 | - Ricardo Baeza-Yates, Berthier Ribeiro-Neto. *Modern Information Retrieval: The Concepts and Technology Behind Search*, 2nd edition. Addison-Wesley Professional, 2/2011. 913 pp.
22 | - Chapters 1-2, 11, and 15 are available here: https://www.baeza.cl/mir2ed/contents.php.html.
23 |
24 | ### User Perspective
25 | - Karen Markey. *Online Searching: A Guide to Finding Quality Information Efficiently and Effectively*. Rowman & Littlefield Publishers, 3rd edition. 2/2023. 294 pp.
26 | - Focused on the user experience of searching, not building, but may be helpful, especially for those new to the field as it addresses some terminology and usability concerns.
27 | - Duncan O. Case. *Looking for Information: A Survey of Research on Information Seeking, Needs, and Behavior*, 4th edition. Emerald Publishing Limited, 4/2016. 528 pp.
28 |
29 | ### Practical
30 | - Jay M. Patel. *Getting Structured Data from the Internet: Running Web Crawlers/Scrapers on a Big Data Production Scale*. Apress, 11/2020. 420 pp.
31 | - Tommaso Teofili. *Deep Learning for Search*. Manning Publications, 6/2019. 328 pp.
32 | - John Berryman, Doug Turnbull. *Relevant Search: With applications for Solr and Elasticsearch*. Manning Publications, 6/2016. 360 pp.
33 | - Martin White. *Enterprise Search*, 2nd edition. O'Reilly Media, 10/2015. 310 pp.
34 |
35 | #### Lucene
36 | - Atri Sharma. Practical Lucene 8: Uncover the Search Capabilities of Your Application. Apress, 10/2020. 114 pp.
37 | - Edwood Ng, Vineeth Mohan. Lucene 4 Cookbook. Packt Publishing, 6/2015. 220 pp.
38 | - Michael McCandless, Erik Hatcher, Otis Gospodnetic. Lucene in Action, 2nd edition. Manning Publications, 7/2010. 475 pp.
39 |
40 | #### Solr
41 | - Dikshant Shahi. Apache Solr: A Practical Approach to Enterprise Search. Apress, 12/2015. 328 pp.
42 | - Xavier Morera. Apache Solr Succinctly. Syncfusion, 4/2015. 141 pp.
43 | - Available free from: https://www.syncfusion.com/succinctly-free-ebooks/apachesolr
44 | - Rafal Kuc. Solr Cookbook, 3rd edition. Packt Publishing, 1/2015. 356 pp.
45 | - Trey Grainger, Timothy Potter. Solr in Action. Manning Publications, 3/2014. 664 pp.
46 |
47 | #### Elasticsearch
48 | - Note: Some of these volumes cover, to a greater or lesser extent, the Elastic / ELK Stack. That material is still IR, but it leans toward monitoring and logging rather than web search.
49 | - Madhusudhan Konda. Elasticsearch in Action, 2nd edition. Manning Publications, 10/2023. 592 pp.
50 | - Alberto Paro. Elasticsearch 8.x Cookbook, 5th edition. Packt Publishing, 5/2022. 750 pp.
51 | - Asjad Athick, Shay Banon. Getting Started with Elastic Stack 8.0. Packt Publishing, 3/2022. 474 pp.
52 | - Wai Tak Wong. Advanced Elasticsearch 7.0. Packt Publishing, 8/2019. 560 pp.
53 | - Pranav Shukla, Sharath Kumar M N. Learning Elastic Stack 7.0, 2nd edition. 5/2019. 474 pp.
54 | - Abhishek Andhavarapu. Learning Elasticsearch. Packt Publishing, 6/2017. 404 pp.
55 | - Bharvi Dixit. Mastering Elasticsearch 5.x, 3rd edition. 2/2017. 428 pp.
56 | - Bharvi Dixit. Elasticsearch Essentials. Packt Publishing, 1/2016. 240 pp.
57 | - Rafał Kuć, Marek Rogoziński. Mastering Elasticsearch, 2nd edition. Packt Publishing, 2/2015. 434 pp.
58 | - Clinton Gormley, Zachary Tong. Elasticsearch: The Definitive Guide. O'Reilly Media, 1/2015. 724 pp.
59 | - Rafał Kuć, Marek Rogoziński. Elasticsearch Server, 3rd edition. Packt Publishing, 2/2016. 556 pp.
60 |
61 | #### Spark
62 | - Alex Thomas. Natural Language Processing with Spark NLP. O'Reilly Media, 6/2020. 364 pp.
63 |
64 | #### Sphinx
65 | - Andrew Aksyonoff. Introduction to Search with Sphinx. O'Reilly Media, 4/2011. 148 pp.
66 |
67 | ### Information Architecture
68 | - Louis Rosenfeld, Peter Morville, Jorge Arango. Information Architecture: For the Web and Beyond, 4th edition. O'Reilly, 9/2015. 483 pp.
69 | - Gerald Kowalski. Information Retrieval Architecture and Algorithms. Springer, 2011.
70 |
71 | ### Collaborative
72 | - Chirag Shah. Social Information Seeking: Leveraging the Wisdom of the Crowd. Springer, 7/2017. 204 pp.
73 | - Chirag Shah. Collaborative Information Seeking: The Art and Science of Making the Whole Greater Than the Sum of All. Springer, 8/2014. 206 pp. (Vol. 34)
74 | - Satnam Alag. Collective Intelligence in Action. Manning Publications, 9/2008. 425 pp.
75 |
76 | ### Springer The Information Retrieval Series
77 | - Jiqun Liu. A Behavioral Economics Approach to Interactive Information Retrieval. Springer, 2/2023. 400 pp. (Vol. 48)
78 | - Yi Chang, Hongbo Deng (editors). Query Understanding for Search Engines. Springer, 12/2021. 236 pp. (Vol. 46)
79 | - Jianfeng Gao, Chenyan Xiong, Paul Bennett, Nick Craswell. Neural Approaches to Conversational Information Retrieval. Springer, 3/2023. 405 pp. (Vol. 44)
80 | - Tetsuya Sakai, Douglas W. Oard, Noriko Kando. Evaluating Information Retrieval and Access Tasks. Springer, 9/2020. (Vol. 43)
81 | - Deepak P., et al. Data Science for Fake News. Springer, 4/2021. (Vol. 42)
82 | - Nicola Ferro, Carol Peters. Information Retrieval Evaluation in a Changing World: Lessons Learned from 20 Years of CLEF. Springer, 8/2019. (Vol. 41)
83 | - Tetsuya Sakai. Laboratory Experiments in Information Retrieval. Springer, 9/2018. (Vol. 40)
84 | - Krisztian Balog. Entity-Oriented Search. Springer, 10/2018. 370 pp. (Vol. 39)
85 | - Available for free for Amazon Kindle as well as at https://eos-book.org/.
86 | - Chirag Shah. Social Information Seeking. Springer, 7/2017. (Vol. 38)
87 | - Peter Knees, Markus Schedl. Music Similarity and Retrieval. Springer, 5/2016. 319 pp. (Vol. 36)
88 | - Massimo Melucci. Introduction to Information Retrieval and Quantum Mechanics. Springer, 12/2015. 250 pp. (Vol. 35)
89 | - Chirag Shah. Collaborative Information Seeking. Springer, 7/2012. (Vol. 34)
90 | - Donald Metzler. A Feature-Centric View of Information Retrieval. Springer, 9/2011. 334 pp. (Vol. 27)
91 | - Giovanni Maria Sacco, Yannis Tzitzikas. Dynamic Taxonomies and Faceted Search: Theory, Practice, and Experience. Springer, 8/2009. 357 pp. (Vol. 25)
92 |
93 | ### Other Springer
94 | - Aidan Hogan. The Web of Data. Springer, 2020. 697 pp.*
95 | - C. Maria Keet. The What and How of Modelling Information and Knowledge: Mind Maps to Ontologies. Springer, 2023. 192 pp.
96 | - Juanzi Li, Guilin Qi, Dongyan Zhao, Wolfgang Nejdl, Hai-Tao Zheng, eds. Semantic Web and Web Science. Springer, 2013. 413 pp.
97 |
98 | ### Uncategorized
99 | - Anuradha D. Thakare, Shilpa Laddha, Ambika Pawar. Hybrid Intelligent Systems for Information Retrieval. Chapman Hall/CRC, 11/2022. 252 pp.
100 | - Jiawei Han, Jian Pei, Hanghang Tong. Data Mining: Concepts and Techniques, 4th edition. Morgan Kaufmann, 7/2022. 752 pp.
101 | - Nicole Tonellotto, Craig Macdonald, Iadh Ounis. Efficient Query Processing for Scalable Web Search. 6/2019. 132 pp.
102 | - Available for free from https://tonellotto.github.io/publication/fntir/fntir_main.pdf
103 | - Jutta Haider, Olof Sundin. Invisible Search and Online Search Engines. Routledge: Taylor & Francis Group, 2019. 151 pp.
104 | - Available for free from https://library.oapen.org/bitstream/handle/20.500.12657/51256/9780429828027.pdf
105 | - ChengXiang Zhai and Sean Massung. Text Data Management and Analysis: A Practical Introduction to Information Retrieval and Text Mining. ACM Books, 6/2016. 532 pp.
106 | - Bo Long, Yi Chang. Relevance Ranking for Vertical Search Engines. Morgan Kaufmann, 1/2014. 264 pp.
107 | - Gerald Kowalski. Information Retrieval Systems: Theory and Implementation. Springer, 3/2013. 300 pp.
108 | - Tyler Tate, Tony Russell-Rose. Designing the Search Experience: The Information Architecture of Discovery. Morgan Kaufmann, 1/2013. 320 pp.
109 | - Dirk Lewandowski, ed. Web Search Engine Research. Emerald Publishing Limited, 4/2012. 322 pp.
110 | - Giovanni Maria Sacco, Yannis Tzitzikas (editors). Dynamic Taxonomies and Faceted Search: Theory, Practice, and Experience. Springer, 3/2012. 357 pp.
111 | - Amy N. Langville, Carl D. Meyer. Google's PageRank and Beyond: The Science of Search Engine Rankings. Princeton University Press, 2/2012. 240 pp.
112 | - Marcia J. Bates. Understanding Information Retrieval Systems: Management, Types, and Standards. Auerbach Publications, 12/2011. 752 pp.
113 | - Tie-Yan Liu. Learning to Rank for Information Retrieval. Springer, 4/2011. 302 pp.
114 | - Peter Morville, Jeffery Callender. Search Patterns: Design for Discovery. O'Reilly, 2/2010. 192 pp.
115 | - Maria Stone. Understanding and Evaluating Search Experience. Morgan & Claypool Publishers, 2020.
116 | - Irina Shamaeva, David Michael Galley. Custom Search - Discover more: A Complete Guide to Google Programmable Search. CRC Press, 2021. 184 pp.
117 | - Grace Hui Yang, Marc Sloan, Jun Wang. Dynamic Information Retrieval Modeling. Springer, 2016. 146 pp.
118 | - Wei Ding, Xia Lin. Information Architecture: The Design and Integration of Information Spaces. Springer, 2009. 149 pp.
119 |
120 | ### Older
121 | - Michael W. Berry, Murray Browne. Understanding Search Engines: Modeling and Text Retrieval, 2nd edition. Society for Industrial and Applied Mathematics, 5/2005. 184 pp.
122 | - David A. Grossman, Ophir Frieder. Information Retrieval: Algorithms and Heuristics. Springer, 10/2004. 352 pp.
123 | - Gerhard Weikum, Gottfried Vossen. Transactional Information Systems: Theory, Algorithms, and the Practice of Concurrency Control and Recovery. Morgan Kaufmann, 6/2001. 872 pp.
124 | - Richard K. Belew. Finding Out About: A Cognitive Perspective on Search Engine Technology and the WWW. Cambridge University Press, 1/2001. 384 pp.
125 | - Ian H. Witten, Alistair Moffat, Timothy C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images, 2nd edition. Morgan Kaufmann, 5/1999. 560 pp.
126 | - Karen Sparck Jones, Peter Willett (editors). Readings in Information Retrieval. Morgan Kaufmann, 7/1997. 587 pp.
127 | - William Frakes, Ricardo Baeza-Yates (editors). Information Retrieval: Data Structures & Algorithms. Pearson College Division, 1/1992. 464 pp.
--------------------------------------------------------------------------------
/specific-engines/opensearch/opensearch-python-client.md:
--------------------------------------------------------------------------------
1 | # OpenSearch Python
2 |
3 | ## OpenSearch Python (opensearch-py) Client
4 | This section provides a brief overview of some portions of the OpenSearch Python client. See the [official documentation](https://opensearch-project.github.io/opensearch-py/index.html) for details.
5 |
6 | ## OpenSearch Client
7 | - `OpenSearch()`
8 | - `bulk()`
9 | - `count()`
10 | - `create()` - Create a document in the index.
11 | - `create_pit()` - Creates point in time context.
12 | - `delete()` - Remove a document from the index.
13 | - `delete_all_pits()`
14 | - `delete_by_query()` - Delete documents that match a specific query.
15 | - `delete_pit()` - Deletes one or more pits.
16 | - `delete_script()`
17 | - `exists()` - Whether a document exists in an index.
18 | - `exists_source()`
19 | - `explain()` - Why specific documents matched or did not match a query.
20 | - `field_caps()`
21 | - `get()` - Retrieve a document.
22 | - `get_all_pits()`
23 | - `get_script()`
24 | - `get_script_context()`
25 | - `get_script_languages()`
26 | - `get_source()`
27 | - `index()` - Creates or updates a document in an index.
28 | - `info()` - Returns info about cluster.
29 | - `mget()` - Retrieve multiple documents.
30 | - `msearch()` - Multiple search operations in a single request.
31 | - `msearch_template()`
32 | - `mtermvectors()` - Retrieve multiple term vectors in a single request.
33 | - `ping()` - Check if the cluster is up.
34 | - `put_script()` - Create or update a script.
35 | - `rank_eval()` - Evaluate the quality of ranked search results over a set of typical search queries.
36 | - `reindex()` - Reindex documents from one index to another.
37 | - `render_search_template()` - Use the Mustache language to pre-render a search definition.
38 | - `scripts_painless_execute()`
39 | - `scroll()` - Retrieves a large number of results from a single search request.
40 | - `search()` - Search for documents in one or more indices.
41 | - `search_shards()` - Returns info about the indices and shards a request would be executed against.
42 | - `search_template()`
43 | - `termvectors()` - Retrieve information and statistics about terms in the fields of a particular document.
44 | - `update()` - Update a document.
45 | - `update_by_query()` - Update documents that match a specific query.
46 |
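As a minimal sketch of how `index()` and `search()` fit together, the helper below writes one document and then runs a `match` query against it. The host, port, index name, and field names are assumptions for illustration only, and `make_client()` assumes an unsecured local cluster with the `opensearch-py` package installed.

```python
from typing import Any, Dict, List


def make_client(host: str = "localhost", port: int = 9200) -> Any:
    """Build a client for a local cluster (host and port are assumptions)."""
    from opensearchpy import OpenSearch  # third-party: pip install opensearch-py

    return OpenSearch(hosts=[{"host": host, "port": port}])


def index_and_search(
    client: Any,
    index: str,
    doc_id: str,
    doc: Dict[str, Any],
    field: str,
    text: str,
) -> List[Dict[str, Any]]:
    """Index one document, then return the sources of all matching hits."""
    # index() creates or updates the document; refresh=True makes it
    # visible to the search that follows.
    client.index(index=index, id=doc_id, body=doc, refresh=True)

    # A basic full-text match query on a single field.
    query = {"query": {"match": {field: text}}}
    response = client.search(index=index, body=query)

    # Each hit carries the stored document under "_source".
    return [hit["_source"] for hit in response["hits"]["hits"]]
```

Against a running cluster this could be called as `index_and_search(make_client(), "books", "1", {"title": "Managing Gigabytes"}, "title", "gigabytes")`. Passing the client in as a parameter keeps the helper easy to exercise with a stub.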
47 | ### HTTP Client
48 | - `delete()`, `get()`, `head()`, `post()`, `put()`
49 |
50 | ### Compact and aligned text (CAT) Client
51 | - `aliases()`
52 | - `all_pit_segments()`
53 | - `allocation()` - How many shards are allocated to each data node and how much disk they are using.
54 | - `cluster_manager()` - Info about cluster manager node.
55 | - `count()`
56 | - `fielddata()` - How much heap memory is used by fielddata on each data node.
57 | - `health()`
58 | - `help()` - Help for the CAT APIs.
59 | - `indices()` - Info about indices.
60 | - `nodeattrs()` - Info about custom node attributes.
61 | - `nodes()` - Info about nodes.
62 | - `pending_tasks()` - Info about pending tasks.
63 | - `pit_segments()`
64 | - `recovery()` - Info about shard recoveries.
65 | - `plugins()`
67 | - `repositories()`
68 | - `segment_replication()`
69 | - `segments()`
70 | - `shards()`
71 | - `snapshots()`
72 | - `tasks()`
73 | - `templates()`
74 | - `thread_pool()`
75 |
76 | ### Cluster Client
77 | - `allocation_explain()`
78 | - `delete_component_template()`
79 | - `delete_weighted_routing()`
80 | - `exists_component_template()`
81 | - `get_component_template()`
82 | - `get_settings()` - Of a cluster
83 | - `get_weighted_routing()`
84 | - `health()`
85 | - `pending_tasks()`
86 | - `put_component_template()`
87 | - `put_settings()`
88 | - `put_weighted_routing()`
89 | - `remote_info()`
90 | - `reroute()` - Manually change allocation of individual shards in the cluster.
91 | - `state()` - Comprehensive info about state of cluster.
92 | - `stats()` - High-level overview of cluster stats.
93 |
94 | ### Dangling Indices Client
95 | - https://opensearch-project.github.io/opensearch-py/api-ref/clients/dangling_indices_client.html
96 |
97 | ### Ingest Client
98 | - `delete_pipeline()`
99 | - `get_pipeline()`
100 | - `processor_grok()` - Returns a list of the built-in patterns.
101 | - `put_pipeline()`
102 | - `simulate()` - Simulate a pipeline with example docs.
103 |
104 | ### Indices Client
105 | - `add_block()`
106 | - `analyze()`
107 | - `clear_cache()`
108 | - `clone()`
109 | - `close()`
110 | - `create()`
111 | - `create_data_stream()`
112 | - `data_streams_stats()`
113 | - `delete()`
114 | - `delete_alias()`
115 | - `delete_data_stream()`
116 | - `delete_template()`
117 | - `exists()` - Whether a particular index exists
118 | - `exists_alias()`
119 | - `exists_template()`
120 | - `flush()`
121 | - `forcemerge()` - Of one or more indices
122 | - `get()` - Returns info about one or more indices
123 | - `get_alias()`
124 | - `get_data_stream()`
125 | - `get_field_mapping()` - Returns mapping for one or more fields
126 | - `get_mapping()` - Returns mapping for one or more indices
127 | - `get_settings()` - Returns settings for one or more indices
128 | - `get_template()`
129 | - `open()` - Opens an index
130 | - `put_alias()` - Creates or updates an alias
131 | - `put_mapping()` - Updates the index mappings
132 | - `put_settings()`
133 | - `put_template()`
134 | - `recovery()` - Info about shard recoveries
135 | - `refresh()`
136 | - `resolve_index()` - Resolves info about any matching indices, aliases, data streams.
137 | - `rollover()` - Updates an alias to point to a new index when old index is too large/old.
138 | - `segments()` - Low-level info about Lucene segments in one or more indices
139 | - `shard_stores()`
140 | - `shrink()` - Shrinks an existing index into a new index with fewer primary shards
141 | - `simulate_template()`
142 | - `split()` - Splits an existing index into a new index with more primary shards
143 | - `stats()`
144 | - `update_aliases()`
145 | - `validate_query()` - For validating a potentially expensive query without executing it.
146 |
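A common use of `exists()` and `create()` together is to create an index only when it is missing. The sketch below assumes `client` is an `OpenSearch` instance (the Indices Client is reached through `client.indices`); the index name and the mapping body in the usage note are illustrative assumptions.

```python
from typing import Any, Dict, Optional


def ensure_index(client: Any, name: str, body: Optional[Dict[str, Any]] = None) -> bool:
    """Create the index if it is missing; return True when it was created."""
    # exists() reports whether the index is already present.
    if client.indices.exists(index=name):
        return False

    # body may carry "settings" and "mappings"; an empty body uses defaults.
    client.indices.create(index=name, body=body or {})
    return True
```

For example, `ensure_index(client, "books", {"mappings": {"properties": {"title": {"type": "text"}}}})` would create a `books` index with a single text field on the first call and do nothing on later calls. The check and the create are two separate requests, so this sketch does not guard against another process creating the index in between.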
147 | ### Nodes Client
148 | - `hot_threads()`
149 | - `info()`
150 | - `reload_secure_settings()`
151 | - `stats()`
152 | - `usage()` - Low-level info about REST actions usage
153 |
154 | ### Remote Client
155 |
156 | ### Security Client
157 | - `change_password()` - For current user
158 | - `create_action_group()` - Create or replace
159 | - `create_role()` - Create or replace
160 | - `create_role_mapping()` - Create or replace
161 | - `create_tenant()` - Create or replace
162 | - `create_user()` - Create or replace
163 | - `delete_action_group()`
164 | - `delete_distinguished_names()`
165 | - `delete_role()`
166 | - `delete_role_mapping()`
167 | - `delete_tenant()`
168 | - `delete_user()`
169 | - `flush_cache()` - Flushes the Security plugin user, authentication, and authorization caches.
170 | - `get_account_details()` - For current user
171 | - `get_action_group()`
172 | - `get_action_groups()`
173 | - `get_audit_configuration()`
174 | - `get_certificates()`
175 | - `get_configuration()`
176 | - `get_distinguished_names()`
177 | - `get_role()`
178 | - `get_role_mapping()`
179 | - `get_role_mappings()`
180 | - `get_roles()`
181 | - `get_tenant()`
182 | - `get_tenants()`
183 | - `get_user()`
184 | - `get_users()`
185 | - `health()`
186 | - `patch_action_group()` - Updates individual attributes of an action group
187 | - `patch_action_groups()`
188 | - `patch_audit_configuration()`
189 | - `patch_configuration()`
190 | - `patch_distinguished_names()`
191 | - `patch_role()`
192 | - `patch_role_mapping()`
193 | - `patch_role_mappings()`
194 | - `patch_roles()`
195 | - `patch_tenant()`
196 | - `patch_tenants()`
197 | - `patch_user()`
198 | - `patch_users()`
199 | - `reload_http_certificates()`
200 | - `reload_transport_certificates()`
201 | - `update_audit_configuration()`
202 | - `update_configuration()`
203 | - `update_distinguished_names()`
204 |
205 | ### Snapshot Client
206 | - `cleanup_repository()` - Removes stale data
207 | - `clone()` - Clones indices from one snapshot to another in the same repository
208 | - `create()` - Creates a snapshot in the repository
209 | - `create_repository()` - Creates a repository
210 | - `delete()` - Deletes a snapshot
211 | - `delete_repository()` - Deletes a repository
212 | - `get()` - Returns info about a snapshot
213 | - `get_repository()` - Returns info about a repository
214 | - `restore()` - Restores a snapshot
215 | - `status()` - Returns info about the status of a snapshot
216 | - `verify_repository()` - Verifies that a repository is working correctly
217 |
218 | ### Tasks Client
219 | - `cancel()`
220 | - `get()` - Return info about a task
221 | - `list()` - Return info about all tasks
222 |
223 | ### Features Client
224 | - `get_features()` - List of features which can be included in snapshots using the feature_states field when creating a snapshot
225 |
226 | ### Exceptions
227 | - `AuthenticationException` - 401
228 | - `AuthorizationException` - 403
229 | - `ConflictError` - 409
230 | - `ConnectionError`
231 | - `ConnectionTimeout`
232 | - `ImproperlyConfigured`
233 | - `NotFoundError` - 404
234 | - `RequestError` - 400
235 | - `SerializationError`
236 | - `SSLError`
237 | - `TransportError`
238 |
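The status-specific exceptions make it possible to treat an expected failure, such as a missing document, differently from a real error. The sketch below is one hedged way to do that: the exception class is passed in as a parameter (a choice made here so the helper can be tried without a cluster), and with the real library it would be `NotFoundError`.

```python
from typing import Any, Dict, Optional, Type


def get_source_or_none(
    client: Any,
    index: str,
    doc_id: str,
    not_found_error: Type[Exception],
) -> Optional[Dict[str, Any]]:
    """Return the document source, or None when the lookup raises a 404-style error."""
    try:
        return client.get(index=index, id=doc_id)["_source"]
    except not_found_error:
        # Only the "not found" case is swallowed; other errors propagate.
        return None


# Usage with the real library (assumes opensearch-py is installed and a
# cluster is reachable on localhost:9200):
#
#   from opensearchpy import OpenSearch, NotFoundError
#   client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])
#   get_source_or_none(client, "books", "1", NotFoundError)
```

Catching only the narrow exception keeps connection problems (`ConnectionError`, `ConnectionTimeout`) and bad requests (`RequestError`) visible to the caller.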
239 | ### Helpers
240 | - `aggs.Agg()`
241 |   - `to_dict()`
242 | - `analysis.Analyzer()`
243 | - `document.Document()`
244 |   - `delete()`
245 |   - `exists()`
246 |   - `get()`
247 |   - `init()` - Create an index and populate the mappings
248 |   - `mget()`
249 |   - `save()` - Create or overwrite a document
250 |   - `search()`
251 |   - `to_dict()`
252 |   - `update()`
253 | - `faceted_search.FacetedSearch()`
254 |   - `add_filter()`
255 |   - `aggregate()`
256 |   - `build_search()` - Construct the search object
257 |   - `execute()` - Execute the search and return the response
258 |   - `filter()` - Add a `post_filter` that narrows results based on facet filters
259 |   - `highlight()` - For all the fields
260 |   - `query()`
261 |   - `search()` - Returns the base Search object to which the facets are added
262 |   - `sort()`
263 | - `field.Field()`
264 |   - `to_dict()`
265 | - `function.ScoreFunction()`
266 |   - `to_dict()`
267 | - `index.Index()`
268 |   - `aliases()` - Add to the index definition
269 |   - `analyze()`
270 |   - `analyzer()`
271 |   - `clear_cache()`
272 |   - `clone()`
273 |   - `close()`
274 |   - `create()`
275 |   - `delete()`
276 |   - `delete_alias()`
277 |   - `document()` - Associates a Document subclass with an index.
278 |   - `exists()`
279 |   - `exists_alias()`
280 |   - `flush()`
281 |   - `forcemerge()`
282 |   - `get()`
283 |   - `get_alias()`
284 |   - `get_field_mapping()` - For a specific field
285 |   - `get_mapping()` - For a specific type
286 |   - `get_settings()`
287 |   - `get_upgrade()` - How much of the index is upgraded
288 |   - `mapping()` - Associate a mapping with the index
289 |   - `open()`
290 |   - `put_alias()`
291 |   - `put_mapping()`
292 |   - `put_settings()`
293 |   - `recovery()`
294 |   - `refresh()`
295 |   - `save()` - Sync the index definition with OpenSearch: create the index if it doesn't exist, update settings/mappings if it does
296 |   - `search()`
297 |   - `segments()`
298 |   - `settings()` - Add settings
299 |   - `shard_stores()`
300 |   - `shrink()`
301 |   - `stats()`
302 |   - `updateByQuery()`
303 |   - `upgrade()`
304 |   - `validate_query()`
305 | - `Mapping`
306 | - `Query`
307 | - `search.Search()`
308 |   - `__getitem__(n)` - Slices the Search instance for pagination
309 |   - `__iter__()` - Iterates over the hits
310 |   - `_clone()` - Clone of current search request, performs shallow copy of underlying objects.
311 |     - Used internally by most state-modifying APIs.
312 |   - `collapse()`
313 |   - `count()` - Number of matching hits
314 |   - `delete()` - Delegates to `delete_by_query()`
315 |   - `execute()` - Executes the search, returns an instance of Response wrapping all the data.
316 |   - `from_dict()` - Constructs a new Search instance from a raw dict
317 |   - `highlight()` - For some fields
318 |   - `highlight_options()` - Set global options for current request
319 |   - `response_class(cls)` - Override default wrapper used for the response
320 |   - `scan()` - Returns a generator that will iterate over all the documents matching the query.
321 |   - `script_field()` - Define script field to be calculated on hits
322 |   - `sort()`
323 |   - `source()` - Control how the `_source` field is returned.
324 |   - `suggest()` - Add a suggestions request to the search
325 |   - `to_dict()`
326 |   - `update_from_dict()` - Apply options from a serialized body to the current instance.
327 |     - Modifies object in place.
328 |     - Used mostly by `from_dict()`.
329 | - `update_by_query.UpdateByQuery()`
330 |   - `_clone()` - Clone of current search request, performs shallow copy of underlying objects.
331 |   - `execute()` - Executes the update, returns an instance of Response wrapping all the data.
332 |   - `from_dict()` - Constructs a new UpdateByQuery instance from a raw dict
333 |   - `response_class(cls)` - Override default wrapper used for the response
334 |   - `script()` - Define update action to take
335 |     - Only accepts a single script
336 |   - `to_dict()`
337 |   - `update_from_dict()` - Apply options from a serialized body to the current instance.
338 |     - Modifies object in place.
339 |     - Used mostly by `from_dict()`.
340 | - `wrappers.Range()`
341 |
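Slicing a `Search` instance for pagination can be pictured as translating a Python slice into the `from` and `size` parameters of a search request. The function below is only an illustration of that idea in plain Python, not the helper's own code; its name is hypothetical.

```python
from typing import Dict


def slice_to_paging(start: int, stop: int) -> Dict[str, int]:
    """Translate a slice [start:stop] into search paging parameters."""
    if start < 0 or stop < start:
        raise ValueError("expected 0 <= start <= stop")
    # "from" is the offset of the first hit, "size" the number of hits to return.
    return {"from": start, "size": stop - start}
```

Under this reading, a slice such as `[10:20]` on a search corresponds to a request for ten hits starting at offset 10.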
342 | ## Plugins
343 |
344 | ### Alerting Plugin
345 |
346 | ### Index Management Plugin
347 |
348 | ### Serializer
349 |
350 | ### Transport
--------------------------------------------------------------------------------