├── .gitignore
├── slides.pdf
├── slides.pptx
├── tutorial.pdf
├── scripts
│   ├── cover_page_art.jpg
│   ├── core_default.sh
│   ├── to_pdf.sh
│   └── cover_page.html
├── LICENSE
├── README.md
├── install_solr_jdk.md
└── tutorial.md

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
tmp/
schedule.md
.DS_Store

--------------------------------------------------------------------------------
/slides.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hectorcorrea/solr-for-newbies/HEAD/slides.pdf

--------------------------------------------------------------------------------
/slides.pptx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hectorcorrea/solr-for-newbies/HEAD/slides.pptx

--------------------------------------------------------------------------------
/tutorial.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hectorcorrea/solr-for-newbies/HEAD/tutorial.pdf

--------------------------------------------------------------------------------
/scripts/cover_page_art.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hectorcorrea/solr-for-newbies/HEAD/scripts/cover_page_art.jpg

--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
This work is licensed under the
Creative Commons Attribution 4.0 International License.
To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

--------------------------------------------------------------------------------
/scripts/core_default.sh:
--------------------------------------------------------------------------------
# Re-creates an empty bibdata core
# (use this if you installed Solr locally)
solr delete -c bibdata
solr create -c bibdata

# Re-creates the bibdata core and re-imports the sample data
# (use this if you installed Solr via a Docker container)
docker exec -it solr-container solr delete -c bibdata
docker exec -it solr-container solr create_core -c bibdata
docker exec -it solr-container post -c bibdata books.json


# Other docker commands
# docker stop solr-container
# docker rm solr-container
#
# docker system prune -a -f

--------------------------------------------------------------------------------
/scripts/to_pdf.sh:
--------------------------------------------------------------------------------
# Creates a PDF version of tutorial.md
# via pandoc and wkhtmltopdf.

# Convert the markdown file to HTML
# https://pandoc.org/MANUAL.html
#
pandoc ../tutorial.md \
  -f markdown \
  -t html -s -o tutorial.html \
  --toc \
  --include-before-body=cover_page.html \
  --metadata pagetitle="Solr for newbies workshop"

# Convert the HTML file to PDF
# https://wkhtmltopdf.org/usage/wkhtmltopdf.txt
# (use the installer from wkhtmltopdf.org, not from Homebrew)
#
wkhtmltopdf \
  --footer-line \
  --footer-left "Solr for newbies workshop" \
  --footer-right "[page]/[toPage]" \
  --footer-spacing 20 \
  --margin-top 15 \
  --margin-left 15 \
  --margin-bottom 30 \
  --margin-right 15 \
  --dpi 120 \
  --enable-local-file-access \
  tutorial.html ../tutorial.pdf

# Other settings that I tried
#
# --dpi 200 \
# --zoom 1.3 \
# --disable-smart-shrinking \
# --print-media-type \
# --lowquality \
# --page-size Letter \
#

rm tutorial.html

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Solr for newbies workshop

This repository contains the materials for the **Solr for newbies** workshop.

File [tutorial.md](https://github.com/hectorcorrea/solr-for-newbies/blob/main/tutorial.md) has most of the material that we will cover during the workshop. In order to have a textual record of the commands to execute (along with their parameters and output), all the examples in this file are shown as executed via command line utilities like cURL. However, during the workshop we will use a combination of approaches: some steps will be executed through the command line while others will be done through the admin web page that Solr provides out of the box.

File **tutorial.pdf** has the same content as tutorial.md but in PDF format and it's [downloadable](https://github.com/hectorcorrea/solr-for-newbies/raw/main/tutorial.pdf).

File **slides.pdf** contains the slides used in the workshop to support the material in tutorial.md.

File **books.json** has the sample data that we will use during the workshop.

File **install_solr_jdk.md** has instructions on how to install Solr directly on your machine (instead of using a Docker container).

Folder `scripts/` has a few scripts that can be used to automate the steps of adding documents or fields to the Solr core used in the tutorial. These scripts just bundle the steps documented in tutorial.md.

--------------------------------------------------------------------------------
/scripts/cover_page.html:
--------------------------------------------------------------------------------
[The HTML markup of this file did not survive extraction; the text content of the cover page follows.]

Solr for newbies workshop

Hector Correa
hector@hectorcorrea.com
http://hectorcorrea.com/solr-for-newbies

This work is licensed under the Creative Commons Attribution 4.0 International
License. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

Table of Contents
--------------------------------------------------------------------------------
/install_solr_jdk.md:
--------------------------------------------------------------------------------
## Installing Solr with the Java Development Kit

### Prerequisites

To run Solr on our machine we need to have the Java Development Kit (JDK) installed. The version of Solr that we'll use in this tutorial requires a recent version of Java (Java 8 or greater). To verify if the JDK is installed run the following command from the Terminal:

```
$ java -version

#
# java version "11.0.2" 2019-01-15 LTS
# Java(TM) SE Runtime Environment 18.9 (build 11.0.2+9-LTS)
# Java HotSpot(TM) 64-Bit Server VM 18.9 (build 11.0.2+9-LTS, mixed mode)
#
```

If the JDK is installed on our machine we'll see text indicating the version that we have (e.g. "11.0.2" above). If the version number is "11.x", "10.x", "9.x" or "1.8" we should be OK; otherwise, follow the steps below to install a recent version.

If we don't have the JDK installed we'll see something like

```
#
# -bash: java: command not found
#
```

If a recent version of Java *is installed* on our machine skip the "Installing Java" section below and jump to the "Installing Solr" section. If Java *is not installed* on our machine, or we have an old version, follow the steps below to install a recent version.


### Installing Java

To install the Java Development Kit (JDK) go to http://www.oracle.com/technetwork/java/javase/downloads/index.html and click the "JDK Download" link.

From there, under "Java SE Development Kit 13.0.2" select the file appropriate for our operating system. For Mac download the ".dmg" file (`jdk-13.0.2_osx-x64_bin.dmg`) and for Windows download the ".exe" file (`jdk-13.0.2_windows-x64_bin.exe`). Accept the license and download the file.

Run the installer that we downloaded and follow the instructions on screen. Once the installer has completed run the `java -version` command again. We should see the text with the Java version number this time.


### Installing Solr

Once Java has been installed on your machine, installing Solr just requires *downloading* a compressed file, *extracting* it on our machine, and running it.

You can download Solr from the [Apache](https://solr.apache.org/) web site. To make it easy, below are the steps to download and install version 9.1.0, which is the one that we will be using.

First, download Solr and save it to a file on your machine:

```
$ cd
$ curl -OL http://archive.apache.org/dist/solr/solr/9.1.0/solr-9.1.0.tgz

#
# You'll see something like this...
#   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
#                                  Dload  Upload   Total   Spent    Left  Speed
# 100  146M  100  146M    0     0  7081k      0  0:00:21  0:00:21 --:--:-- 8597k
#
```

Then extract the downloaded file with the following command:

```
$ tar zxvf solr-9.1.0.tgz

#
# A ton of information will be displayed here as Solr is being
# decompressed/extracted. Most of the lines will say something like
# "x solr-9.1.0/the-name-of-a-file"
#
```

...and that's it, Solr is now available on your machine under the `solr-9.1.0` folder. Most of the utilities that we will use in this tutorial are under the `solr-9.1.0/bin` folder.
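
To avoid typing the path to these utilities on every command you could optionally add the `bin` folder to your PATH. This is just a sketch for macOS/Linux (it assumes Solr was extracted in your home folder, and it only lasts for the current terminal session); the rest of this document spells out the paths explicitly:

```
# Make the Solr utilities available without the full path
# (assumes Solr lives in ~/solr-9.1.0)
$ export PATH="$HOME/solr-9.1.0/bin:$PATH"
```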

First, let's make sure we can run Solr by executing the `solr` shell script with the `status` parameter:

```
$ cd ~/solr-9.1.0/bin
$ ./solr status

#
# No Solr nodes are running.
#
```

The "No Solr nodes are running" message is a bit anticlimactic but it's exactly what we want since it indicates that Solr is ready to be run.

**Note for Windows users:** In Windows use the `solr.cmd` batch file instead of the `solr` shell script, in other words, use `solr.cmd status` instead of `./solr status`.


### Let's get Solr started

To start Solr run the `solr` script again but with the `start` parameter:

```
$ ./solr start

# [a couple of WARN messages plus...]
#
# Waiting up to 180 seconds to see Solr running on port 8983 [/]
# Started Solr server on port 8983 (pid=31160). Happy searching!
#
```

Notice that the message says that Solr is now running on port `8983`.

You can validate this by opening your browser and going to http://localhost:8983/. This will display the Solr Admin page from where you can perform administrative tasks as well as add, update, and query data from Solr.

You can also issue the `status` command again from the Terminal and Solr will report something like this:

```
$ ./solr status

# Found 1 Solr nodes:
#
# Solr process 79276 running on port 8983
# {
#   "solr_home":"/Users/user-id/solr-9.1.0/server/solr",
#   "version":"9.1.0 aa4f3d98ab19c201e7f3c74cd14c99174148616d - ishan - 2022-11-11 13:00:47",
#   "startTime":"2023-01-17T00:36:52.104Z",
#   "uptime":"0 days, 0 hours, 0 minutes, 40 seconds",
#   "memory":"167 MB (%32.6) of 512 MB"}
#
```

Notice how Solr now reports "Found 1 Solr nodes". Yay!


## Creating our first Solr core

Solr uses the concept of *cores* to represent independent environments in which
we configure data schemas and store data. This is similar to the concept of a
"database" in MySQL or PostgreSQL.

For our purposes, let's create a core named `bibdata` as follows (notice these commands require that Solr be running; if you stopped it, make sure to run `solr start` first):

```
$ ./solr create -c bibdata

# WARNING: Using _default configset with data driven schema functionality.
#          NOT RECOMMENDED for production use.
#
#          To turn off: bin/solr config -c bibdata -p 8983 -action set-user-property -property update.autoCreateFields -value false
#
# Created new core 'bibdata'
#
```

Now we have a new core available to store documents. We'll ignore the warning because we are not in production, but we'll discuss this later on.
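
As a sanity check we can also ask Solr's CoreAdmin API to confirm that the core exists. This is a minimal sketch; the exact fields in the response vary by Solr version:

```
$ curl 'http://localhost:8983/solr/admin/cores?action=STATUS&core=bibdata'

# The response will include a "bibdata" entry with details about the core,
# such as its instanceDir, dataDir, and index statistics.
```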

For now our core is empty (since we haven't added anything to it) and you can check this with the following command from the terminal:

```
$ curl 'http://localhost:8983/solr/bibdata/select?q=*:*'

#
# {
#   "responseHeader":{
#     "status":0,
#     "QTime":0,
#     "params":{
#       "q":"*:*"}},
#   "response":{"numFound":0,"start":0,"docs":[]
#   }}
#
```

(or you can also point your browser to http://localhost:8983/solr/#/bibdata/query and click the "Execute Query" button at the bottom of the page)

In either case you'll see `"numFound":0` indicating that there are no documents on it.


## Adding documents to Solr

Now let's add a few documents to our `bibdata` core. First, [download this sample data](https://raw.githubusercontent.com/hectorcorrea/solr-for-newbies/main/books.json) file:

```
$ curl -OL https://raw.githubusercontent.com/hectorcorrea/solr-for-newbies/main/books.json

#
# You'll see something like this...
#   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
#                                  Dload  Upload   Total   Spent    Left  Speed
# 100  1998  100  1998    0     0   5561      0 --:--:-- --:--:-- --:--:--  5581
#
```

File `books.json` contains a small sample data set with information about a
few thousand books. You can take a look at it via `cat books.json` or using the text editor of your choice. Below is an example of one of the books in this file:

```
{
  "id":"00008027",
  "author_txt_en":"Patent, Dorothy Hinshaw.",
  "authors_other_txts_en":["Muñoz, William,"],
  "title_txt_en":"Horses /",
  "responsibility_txt_en":"by Dorothy Hinshaw Patent ; photographs by William Muñoz.",
  "publisher_place_str":"Minneapolis, Minn. :",
  "publisher_name_str":"Lerner Publications,",
  "publisher_date_str":"c2001.",
  "subjects_txts_en":["Horses","Horses"],
  "subjects_form_txts_en":["Juvenile literature"]
}
```

Then, import this file to our `bibdata` core with the `post` utility that Solr
provides out of the box (Windows users see note below):

```
$ ./post -c bibdata books.json

#
# (some text here...)
# POSTing file books.json (application/json) to [base]/json/docs
# 1 files indexed.
# COMMITting Solr index changes to http://localhost:8983/solr/bibdata/update...
# Time spent: 0:00:00.324
#
```

Now if we run our query again we should see some results

```
$ curl 'http://localhost:8983/solr/bibdata/select?q=*:*'

# Response would be something like...
# {
#   "responseHeader":{
#     "status":0,
#     "QTime":36,
#     "params":{
#       "q":"*:*"}},
#   "response":{"numFound":30424,"start":0,"docs":[
#     ... lots of information will display here ...
#   ]}
# }
#
```

Notice how the number of documents found is greater than zero (e.g. `"numFound":30424`).

**Note for Windows users:** Unfortunately the `post` utility that comes out of the box with Solr only works for Linux and Mac. However, there is another `post` utility buried under the `exampledocs` folder in Solr that we can use in Windows.
Here is what you'll need to do:

```
> cd C:\Users\you\solr-9.1.0\example\exampledocs
> curl -OL https://raw.githubusercontent.com/hectorcorrea/solr-for-newbies/main/books.json
> java -Dtype=application/json -Dc=bibdata -jar post.jar books.json

#
# you should see something along the lines of
#
# POSTing file books.json
# 1 files indexed
# COMMITting Solr index changes to http://...
#
```

--------------------------------------------------------------------------------
/tutorial.md:
--------------------------------------------------------------------------------

# PART I: INTRODUCTION

## What is Solr

Solr is an open source *search engine* developed by the Apache Software Foundation. On its [home page](https://solr.apache.org/) Solr advertises itself as

    Solr is the popular, blazing-fast,
    open source enterprise search platform built on Apache Lucene.

and the book [Solr in Action](https://www.worldcat.org/title/solr-in-action/oclc/879605085) describes Solr as

    Solr is a scalable, ready-to-deploy enterprise search engine
    that’s optimized to search large volumes of text-centric data
    and return results sorted by relevance [p. 4]

The fact that Solr is a search engine means that there is a strong focus on speed, large volumes of text data, and the ability to sort the results by relevance.

Although Solr could technically be described as a NoSQL database (i.e. it allows us to store and retrieve data in a non-relational form) it is better to think of it as a search engine to emphasize the fact that it is better suited for text-centric and read-mostly environments [Solr in Action, p. 4].


### What is Lucene

The core functionality that Solr makes available is provided by a Java library called Lucene. Lucene is [the brain behind](https://lucene.apache.org/) the "indexing and search technology, as well as spellchecking, hit highlighting and advanced analysis/tokenization capabilities" that we will see in this tutorial.

But Lucene is a Java library that can only be used from other Java programs. Solr, on the other hand, is a wrapper around Lucene that allows us to use the Lucene functionality from any programming language that can submit HTTP requests.

```
                                              -------------------
                                              | Java Runtime    |
[client application] ----> HTTP request ----> | Solr --> Lucene |
                                              -------------------
```

In this diagram the *client application* could be a program written in Ruby or Python. In fact, as we will see throughout this tutorial, it can also be a system utility like cURL or a web browser. Anything that can submit HTTP requests can communicate with Solr.


## Installing Solr for the first time

To install Solr we are going to use a tool called Docker that allows us to download small virtual machines (called containers) with pre-installed software. In our case we'll download a container with Solr 9.1.0 installed on it and use that during the workshop.

**NOTE:** You can also download and install the Solr binaries directly on your machine without using Docker. You'll need to have the Java Development Kit (JDK) for this method to work. If you are interested in this approach take a look at [these instructions](https://github.com/hectorcorrea/solr-for-newbies/blob/code4lib-2023/install_solr_jdk.md) instead.

For the Docker installation, start by going to https://www.docker.com/, downloading "Docker Desktop", installing it, and running it.
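
Once the installation finishes, a quick way to confirm that the Docker command line tool is available is to ask for its version (the exact version and build reported will vary):

```
$ docker --version

# Docker version 24.x.x, build xxxxxxx
```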

Next, run the following command from the terminal to make sure Docker is running:

```
$ docker ps

#
# You'll see something like this
# CONTAINER ID   IMAGE   COMMAND   CREATED   STATUS   PORTS   NAMES
```

If Docker is *not* running we'll see an error that will indicate something along the lines of

```
Error response from daemon: dial unix docker.raw.sock: connect: connection refused
```

If we see this error it could be that the Docker Desktop app has not fully started. Wait a few seconds and try again. We can also open the "Docker Desktop" app and see its status.


### Creating a Solr container

Once Docker has been installed and it's up and running we can create a container to host Solr 9.1.0 with the following command:

```
$ docker run -d -p 8983:8983 --name solr-container solr:9.1.0

#
# You'll see something like this...
#
# Unable to find image 'solr:9.1.0' locally
# 9.1.0: Pulling from library/solr
# 846c0b181fff: Pull complete
# ...
# fc8f2125142b: Pull complete
# Digest: sha256:971cd7a5c682390f8b1541ef74a8fd64d56c6a36e5c0849f6b48210a47b16fa2
# Status: Downloaded newer image for solr:9.1.0
# 47e8cd4d281db5a19e7bfc98ee02ca73e19af66e392e5d8d3532938af5a76e96

```

The parameter `-d` in the previous command tells Docker to run the container in the background (i.e. detached) and the parameter `-p 8983:8983` tells Docker to forward calls to *our* local port `8983` to port `8983` on the container.

We can check that the new container is running with the following command:

```
$ docker ps

#
# You'll see something like this...
#
# CONTAINER ID   IMAGE        COMMAND                  CREATED         STATUS         PORTS                                       NAMES
# 47e8cd4d281d   solr:9.1.0   "docker-entrypoint.s…"   2 minutes ago   Up 2 minutes   0.0.0.0:8983->8983/tcp, :::8983->8983/tcp   solr-container
```

Notice that now we have a container NAMED `solr-container` using the IMAGE `solr:9.1.0`. We can check the status of Solr with the following command:

```
$ docker exec -it solr-container solr status

# Found 1 Solr nodes:
#
# Solr process 15 running on port 8983
# {
#   "solr_home":"/var/solr/data",
#   "version":"9.1.0 aa4f3d98ab19c201e7f3c74cd14c99174148616d - ishan - 2022-11-11 13:00:47",
#   "startTime":"2023-01-12T20:48:46.084Z",
#   "uptime":"0 days, 0 hours, 9 minutes, 15 seconds",
#   "memory":"178.3 MB (%34.8) of 512 MB"}
```

We can also see Solr running by pointing our browser to http://localhost:8983/solr/ which will show the Solr Admin web page. On this page we can see that we do not have any cores defined to store data; we'll fix that in the next section. WARNING: Do not attempt to create Solr cores via the "Add Core" button in the Solr Admin page -- that button only leads to pain.


### Creating our first Solr core

Solr uses the concept of *cores* to represent independent environments in which we configure data schemas and store data. This is similar to the concept of a "database" in MySQL or PostgreSQL.

For our purposes, let's create a core named `bibdata` with the following command:

```
$ docker exec -it solr-container solr create_core -c bibdata

#
# WARNING: Using _default configset with data driven schema functionality.
#          NOT RECOMMENDED for production use.
#          To turn off: bin/solr config -c bibdata -p 8983 -action set-user-property -property update.autoCreateFields -value false
#
# Created new core 'bibdata'
```

If we go back to http://localhost:8983/solr/ on our browser (we might need to refresh the page) we should see our newly created `bibdata` core available in the "Core Selector" dropdown list.

Now that our core has been created we can query it with the following command:

```
$ curl 'http://localhost:8983/solr/bibdata/select?q=*'

#
# {
#   "responseHeader":{
#     "status":0,
#     "QTime":0,
#     "params":{
#       "q":"*"}},
#   "response":{"numFound":0,"start":0,"numFoundExact":true,"docs":[]
#   }}
```

and we'll see `"numFound":0` indicating that there are no documents on it. We can also point our browser to http://localhost:8983/solr/#/bibdata/query and click the "Execute Query" button at the bottom of the page and see the same result.


### Adding documents to Solr

Now let's add a few documents to our `bibdata` core. First, [download this sample data](https://raw.githubusercontent.com/hectorcorrea/solr-for-newbies/main/books.json) file:

```
$ curl -OL https://raw.githubusercontent.com/hectorcorrea/solr-for-newbies/main/books.json

#
# You'll see something like this...
#   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
#                                  Dload  Upload   Total   Spent    Left  Speed
# 100  1998  100  1998    0     0   5561      0 --:--:-- --:--:-- --:--:--  5581
#
```

File `books.json` contains a small sample data set with information about a few thousand books. We can take a look at it with something like `head books.json` or using the text editor of our choice. Below is an example of one of the books in this file:

```
{
  "id": "00008027",
  "author_txt_en": "Patent, Dorothy Hinshaw.",
  "authors_other_txts_en": [
    "Muñoz, William,"
  ],
  "title_txt_en": "Horses /",
  "responsibility_txt_en": "by Dorothy Hinshaw Patent ; photographs by William Muñoz.",
  "publisher_place_s": "Minneapolis, Minn. :",
  "publisher_name_s": "Lerner Publications,",
  "publisher_date_s": "c2001.",
  "subjects_ss": [
    "Horses",
    "Horses"
  ],
  "subjects_form_ss": [
    "Juvenile literature"
  ]
}
```

To import this data to our Solr we'll first *copy* the file to the Docker container

```
$ docker cp books.json solr-container:/opt/solr-9.1.0/books.json
```

and then we *load it* into Solr:

```
$ docker exec -it solr-container post -c bibdata books.json

#
# /opt/java/openjdk/bin/java -classpath /opt/solr/server/solr-webapp/webapp/WEB-INF/lib/solr-core-9.1.0.jar ...
# SimplePostTool version 5.0.0
# Posting files to [base] url http://localhost:8983/solr/bibdata/update...
# POSTing file books.json (application/json) to [base]/json/docs
# 1 files indexed.
# COMMITting Solr index changes to http://localhost:8983/solr/bibdata/update...
# Time spent: 0:00:01.951
```

Now if we re-run our query we should see some results:

```
$ curl 'http://localhost:8983/solr/bibdata/select?q=*'

#
# {
#   "responseHeader":{
#     "status":0,
#     "QTime":0,
#     "params":{
#       "q":"*"}},
#   "response":{"numFound":30424,"start":0,"numFoundExact":true,"docs":[
#   {
#     ...the information for the first 10 documents will be displayed here..
#
```

Notice how the number of documents found is greater than zero (e.g. `"numFound":30424`).


## Searching for documents

Now that we have added a few documents to our `bibdata` core we can query Solr for those documents. In a subsequent section we'll explore more advanced searching options and how our schema definition is key to enable different kinds of searches, but for now we'll start with a few basic searches to get familiar with the way querying works in Solr.

If you look at the content of the `books.json` file that we imported into our `bibdata` core you'll notice that the documents have the following fields:

* **id**: string to identify each document ([MARC](https://www.loc.gov/marc/bibliographic/) 001)
* **author_txt_en**: string for the main author (MARC 100a)
* **authors_other_txts_en**: list of other authors (MARC 700a)
* **title_txt_en**: title of the book (MARC 245ab)
* **publisher_name_s**: publisher name (MARC 260b)
* **subjects_ss**: an array of subjects (MARC 650a)

The suffix added to each field (e.g. `_txt_en`) is a hint for Solr to pick the appropriate field type for each field as it ingests the data. We will look closely into this in a later section.


### Fetching data

To fetch data from Solr we make an HTTP request to the `select` handler. For example:

```
$ curl 'http://localhost:8983/solr/bibdata/select?q=*'
```

There are many parameters that we can pass to this handler to define which documents we want to fetch and which fields we want returned.


### Selecting what fields to fetch

We can use the `fl` parameter to indicate what fields we want to fetch. For example to request the `id` and the `title_txt_en` of the documents we would use `fl=id,title_txt_en` as in the following example:

```
$ curl 'http://localhost:8983/solr/bibdata/select?q=*&fl=id,title_txt_en'
```

**Note:** When issuing the commands via cURL (as in the previous example) make sure that the fields are separated by a comma *without any spaces in between them*. In other words make sure the URL says `fl=id,title_txt_en` and not `fl=id,` `title_txt_en`. If the parameter includes spaces Solr will not return any results and will give you a cryptic error message instead.

Try adding and removing some other fields to this list, for example, `fl=id,title_txt_en,author_txt_en` or `fl=id,title_txt_en,author_txt_en,subjects_ss`


### Filtering the documents to fetch

In the previous examples you might have seen an inconspicuous `q=*` parameter in the URL. The `q` (query) parameter tells Solr what documents to retrieve. This is somewhat similar to the `WHERE` clause in a SQL SELECT query.

If we want to retrieve all the documents we can just pass `q=*`.
But if we want to filter the results we can use the syntax `q=field:value` to filter documents where a specific field has a particular value. For example, to include only documents where the `title_txt_en` has the word "teachers" we would use `q=title_txt_en:teachers`:

```
$ curl 'http://localhost:8983/solr/bibdata/select?q=title_txt_en:teachers'
```

We can filter by many different fields. For example, to request documents where the `title_txt_en` includes the word "teachers" **or** the `author_txt_en` includes the word "Alice" we would use `q=title_txt_en:teachers author_txt_en:Alice`

```
$ curl 'http://localhost:8983/solr/bibdata/select?q=title_txt_en:teachers+author_txt_en:Alice'
```

As we saw in the previous example, by default Solr searches for either of the terms. If we want to force that both conditions are matched we must explicitly use the `AND` operator in the `q` value, as in `q=title_txt_en:teachers AND author_txt_en:Alice`. Notice that the `AND` operator **must be in uppercase**.

```
$ curl 'http://localhost:8983/solr/bibdata/select?q=title_txt_en:teachers+AND+author_txt_en:Alice'
```

Now let's try something else. Let's issue a search for books where the title says "art history" using `q=title_txt_en:"art history"` (make sure the text "art history" is in quotes)

```
$ curl 'http://localhost:8983/solr/bibdata/select?fl=id,title_txt_en&q=title_txt_en:"art+history"'

# the results will include 6 documents, all of them with the
# exact phrase "art history" somewhere in the title
#
```

Notice how all the results have the phrase "art history" somewhere in the title. Now let's issue a slightly different query using `q=title_txt_en:"art history"~3` to indicate that we want the words "art" and "history" to be present in the `title_txt_en` but they can be a few words apart (notice the `~3`):

```
$ curl 'http://localhost:8983/solr/bibdata/select?fl=id,title_txt_en&q=title_txt_en:"art+history"~3'
```

The result for this query will include a few more books (notice that `numFound` is now `10` instead of `6`) and some of the new titles include

```
# "title_txt_en":"History of art /"},
# "title_txt_en":"American art : a cultural history /"},
# "title_txt_en":"The invention of art : a cultural history /"},
# "title_txt_en":"A history of art in Africa /"}]
```

These new books include the words "art" and "history" but not necessarily next to each other; as long as the words are close to each other they are considered a match (the `~3` in our query asks for an "edit distance of 3").

When searching multi-word keywords for a given field make sure the keywords are surrounded by quotes, that is, use `q=title_txt_en:"art history"` and not `q=title_txt_en:art history`. The latter will execute a search for "art" in the `title_txt_en` field and "history" in the `_text_` field.

You can validate this by running the query with the `debug` flag and looking at the `parsedquery` value.
For example, in the following command we surround both search terms in quotes:

```
$ curl -s 'http://localhost:8983/solr/bibdata/select?debug=all&q=title_txt_en:"art+history"' | grep parsedquery

#
# "parsedquery":"PhraseQuery(title_txt_en:\"art histori\")",
#
```

Notice that the `parsedquery` shows that Solr is searching for, as we would expect, both words in the `title_txt_en` field.

Now let's look at the `parsedquery` when we don't surround the search terms in quotes:

```
$ curl -s 'http://localhost:8983/solr/bibdata/select?debugQuery=on&q=title_txt_en:art+history' | grep parsedquery

#
# "parsedquery":"title_txt_en:art _text_:history",
#
```

Notice that Solr searched for the word "art" in the `title_txt_en` field but searched for the word "history" in the `_text_` field. Certainly not what we were expecting. We'll elaborate in a later section on the significance of the `_text_` field but for now make sure to surround the search terms in quotes when issuing multi-word searches.

One last thing to notice is that Solr returns results paginated; by default it returns the first 10 documents that match the query. We'll see later in this tutorial how we can request a larger page size (via the `rows` parameter) or another page (via the `start` parameter). But for now just notice that at the top of the results Solr always tells us the total number of results found:

```
$ curl 'http://localhost:8983/solr/bibdata/select?q=title_txt_en:education&fl=id,title_txt_en'

#
# response will include
# "response":{"numFound":340,"start":0,"docs":[
#
```


### Getting facets

When we issue a search, Solr is able to return facet information about the data in our core. This is a built-in feature of Solr and easy to use: we just need to include the `facet=on` parameter and the `facet.field` parameter with the name of the field that we want to facet the information on.

For example, to search for all documents with title "education" (`q=title_txt_en:education`) and retrieve facets (`facet=on`) based on the subjects (`facet.field=subjects_ss`) we'll use a query like this:

```
$ curl 'http://localhost:8983/solr/bibdata/select?q=title_txt_en:education&facet=on&facet.field=subjects_ss'

# response will include something like this
#
# "facet_counts":{
#   "facet_fields":{
#     "subjects_ss":[
#       "Education",58,
#       "Educational change",16,
#       "Multicultural education",15,
#       "Education, Higher",14,
#       "Education and state",13,
#
```


## Updating documents

To update a document in Solr we have two options. The most common option is to post the data for that document again to Solr and let Solr overwrite the old document with the new data. The key for this to work is to provide *the same ID in the new data* as the ID of an existing document.

For example, if we query the document with ID `00007345` we would get:

```
$ curl 'http://localhost:8983/solr/bibdata/select?q=id:00007345'

# "response":{"numFound":1,"start":0,"numFoundExact":true,"docs":[
#   {
#     "id":"00007345",
#     "authors_other_txts_en":["Giannakis, Georgios B."],
#     "title_txt_en":"Signal processing advances in wireless and mobile communications /",
#     "responsibility_txt_en":"edited by G.B. Giannakis ... [et al.].",
#     "publisher_place_s":"Upper Saddle River, NJ :",
#     "publisher_name_s":"Prentice Hall PTR,",
#     "publisher_date_s":"c2001.",
#     "subjects_ss":["Signal processing", "Wireless communication systems"],
#     "_version_":1755414312334131200
#   }
#
```

If we post to Solr a new document with the **same ID** Solr will **overwrite** the existing document with the new data. Below is an example of how to update this document with new JSON data using `curl` to post the data to Solr. Notice that the command is issued against the `update` endpoint rather than the `select` endpoint we used in our previous commands.

```
$ curl -X POST --data '[{"id":"00007345","title_txt_en":"the new title"}]' 'http://localhost:8983/solr/bibdata/update?commit=true'
```

Out of the box Solr supports multiple input formats (JSON, XML, CSV); section [Uploading Data with Index Handlers](https://solr.apache.org/guide/solr/9_0/indexing-guide/indexing-with-update-handlers.html) in the Solr guide provides more details about this.

If we query for the document with ID `00007345` again we will see the new data. Notice that the fields that we did not provide during the update are now gone from the document; that's because Solr overwrote the old document with ID `00007345` with our new data that included only two fields (`id` and `title_txt_en`).

```
$ curl 'http://localhost:8983/solr/bibdata/select?q=id:00007345'

# "response":{"numFound":1,"start":0,"docs":[
#   {
#     "id":"00007345",
#     "title_txt_en":"the new title",
#   }]}
#
```

The second option to update a document in Solr is via [atomic updates](https://solr.apache.org/guide/solr/9_0/indexing-guide/partial-document-updates.html) in which we can indicate what fields of the document will be updated. Details of this method are out of scope for this tutorial but below is a very simple example to show the basic syntax. Notice how we are using the `set` operation on the `title_txt_en` field to indicate a different kind of update:

```
$ curl 'http://localhost:8983/solr/bibdata/select?q=id:00007450'
#
# "title_txt_en":"Principles of fluid mechanics /",
#

$ curl -X POST --data '[{"id":"00007450","title_txt_en":{"set":"the new title for 00007450"}}]' 'http://localhost:8983/solr/bibdata/update?commit=true'

$ curl 'http://localhost:8983/solr/bibdata/select?q=id:00007450'
#
# title will say "the new title for 00007450"
# and the rest of the fields will remain unchanged
#
...
```


## Deleting documents

To delete documents from the `bibdata` core we also use the `update` endpoint but the structure of the command is as follows:

```
$ curl -X POST -H 'Content-Type: text/xml' --data '<delete><query>id:00008056</query></delete>' 'http://localhost:8983/solr/bibdata/update?commit=true'
```

The body of the request (`--data`) indicates to Solr that we want to delete a specific document (notice the `id:00008056` query).

We can also pass a less specific query like `title_txt_en:teachers` to delete all documents where the title includes the word "teachers" (or a variation of it). Or we can delete *all documents* with a query like `*:*`.

Be aware that even if you delete all documents from a Solr core the schema and the core's configuration will remain intact. For example, the fields that were defined are still available in the schema even if no documents exist in the core anymore.

If you want to delete the entire core (documents, schema, and other configuration associated with it) you can use the Solr delete command instead:

```
$ docker exec -it solr-container solr delete -c bibdata

# Deleting core 'bibdata' using command:
# http://localhost:8983/solr/admin/cores?action=UNLOAD&core=bibdata&deleteIndex=true&deleteDataDir=true&deleteInstanceDir=true
```

You will need to re-create the core if you want to re-import data to it.
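
For reference, the `scripts/core_default.sh` file in this repository bundles the delete/re-create steps; for the Docker setup it boils down to something like this (the `docker cp` step is only needed if the data file is no longer in the container):

```
# Re-create an empty bibdata core and re-import the sample data
$ docker exec -it solr-container solr delete -c bibdata
$ docker exec -it solr-container solr create_core -c bibdata
$ docker cp books.json solr-container:/opt/solr-9.1.0/books.json
$ docker exec -it solr-container post -c bibdata books.json
```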


# PART II: SCHEMA

## Solr's document model

Solr uses a document model to represent data. Documents are [Solr's basic unit of information](https://solr.apache.org/guide/solr/9_0/getting-started/documents-fields-schema-design.html#how-solr-sees-the-world) and they can contain different fields depending on what information they represent. For example a book in a library catalog stored as a document in Solr might contain fields for author, title, and subjects, whereas information about a house in a real estate system using Solr might include fields for address, taxes, price, and number of rooms.

In earlier versions of Solr documents were self-contained and did not support nested documents. Starting with version 8 Solr provides [support for nested documents](https://solr.apache.org/guide/8_0/indexing-nested-documents.html). This tutorial does not cover nested documents.


## Inverted index

Search engines like Solr use a data structure called an [inverted index](https://en.wikipedia.org/wiki/Inverted_index) to support fast retrieval of documents, even with complex query expressions on large datasets. The basic idea of an inverted index is to use the *terms* inside a document as the *key* of the index rather than the *document's ID* as the key.

Let's illustrate this with an example. Suppose we have three books that we want to index. With a traditional index we would create something like this:

```
ID  TITLE
--  ------------------------------
1   Princeton guide for dog owners
2   Princeton tour guide
3   Cats and dogs
```

With an inverted index Solr would take each of the words in the title of our books and use those words as the index key:

```
KEY        DOCUMENT ID
---------  -----------
princeton  1, 2
owners     1
dogs       1, 3
guide      1, 2
tour       2
cats       3
```

Notice that the inverted index allows us to do searches for individual *words within the title*. For example a search for the word "guide" immediately tells us that documents 1 and 2 are a match. Likewise a search for "tour" tells us that document 2 is a match.

Chapter 3 in Solr in Action has a more comprehensive explanation of how Solr uses inverted indexes to allow for partial matches as well as to aid with the ranking of the results.

## Field types, fields, dynamic fields, and copy fields

The schema in Solr is the definition of the *field types* and *fields* configured for a given core.

**Field Types** are the building blocks to define fields in our schema. Examples of field types are: `binary`, `boolean`, `pfloat`, `string`, `text_general`, and `text_en`. These are similar to the field types that are supported in a relational database like MySQL but, as we will see later, they are far more configurable than what you can do in a relational database.

There are three kinds of fields that can be defined in a Solr schema:

* **Fields** are the specific fields that you define for your particular core. Fields are based on a field type, for example, we might define field `title` based on the `string` field type, `description` based on the `text` field type, and `price` based on the `pfloat` field type.

* **dynamicFields** are field patterns that we define to automatically create new fields when the data submitted to Solr matches the given pattern.
For example, we can define that if we receive data for a field that ends with `_txt` the field will be created as a `text_general` field.

* **copyFields** are instructions to tell Solr how to automatically copy the value given for one field to another field. This is useful if we want to perform different transformations to the values as we ingest them. For example, we might want to remove punctuation characters for searching but preserve them for display purposes.

Our newly created `bibdata` core already has a schema and you can view the definition through the Solr Admin web page via the [Schema Browser Screen](https://solr.apache.org/guide/solr/latest/indexing-guide/schema-browser-screen.html) at http://localhost:8983/solr/#/bibdata/schema or by exploring the `managed-schema` file via the [Files Screen](https://solr.apache.org/guide/solr/latest/configuration-guide/configuration-files.html#files-screen).

You can also view this information with the [Schema API](https://solr.apache.org/guide/solr/latest/indexing-guide/schema-api.html) as shown in the following example. The (rather long) response will be organized in four categories: `fieldTypes`, `fields`, `dynamicFields`, and `copyFields` as shown below:

```
$ curl localhost:8983/solr/bibdata/schema

# {
#   "responseHeader": {"status": 0, "QTime": 2},
#   "schema": {
#     "fieldTypes":   [lots of field types defined],
#
#     "fields":       [lots of fields defined],
#
#     "dynamicFields":[lots of dynamic fields defined],
#
#     "copyFields":   [a few copy fields defined]
#   }
# }
#
```

The advantage of the Schema API is that it allows you to view *and update* the information programmatically, which is useful if you need to recreate identical Solr cores without manually configuring each field definition (e.g. development vs. production).

You can request information about each of these categories individually in the Schema API with the following commands (notice that combined words like `fieldTypes` and `dynamicFields` are *not* capitalized in the URLs below):

```
$ curl localhost:8983/solr/bibdata/schema/fieldtypes
$ curl localhost:8983/solr/bibdata/schema/fields
$ curl localhost:8983/solr/bibdata/schema/dynamicfields
$ curl localhost:8983/solr/bibdata/schema/copyfields
```

Notice that unlike a relational database, where only a handful of field types are available to choose from (e.g. integer, date, boolean, char, and varchar), in Solr there are lots of predefined field types available out of the box, each with its own configuration.

**Note for Solr 4.x users:** In Solr 4 the default mechanism to update the schema was by editing the file `schema.xml`. Starting in Solr 5 the default mechanism is through the "Managed Schema Definition" which uses the Schema API to add, edit, and remove fields. There is now a `managed-schema` file with the same information as `schema.xml` but you are not supposed to edit this file. See section "Managed Schema Definition in SolrConfig" in the [Solr Reference Guide 5.0 (PDF)](https://archive.apache.org/dist/lucene/solr/ref-guide/apache-solr-ref-guide-5.0.pdf) for more information about this.
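
As an example of updating the schema programmatically, below is a minimal sketch that adds a new field via the Schema API. The field name `my_new_field` is made up for illustration; any field type already defined in the schema can be used:

```
$ curl -X POST -H 'Content-Type: application/json' \
    --data '{"add-field": {"name": "my_new_field", "type": "string", "stored": true}}' \
    http://localhost:8983/solr/bibdata/schema

# a "responseHeader" with "status":0 indicates the field was created
```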


## Fields in our schema

You might be wondering how fields like `id`, `title_txt_en`, `author_txt_en`, and `subjects_ss` in our `bibdata` core were created if we never explicitly defined them.

Solr automatically created most of these fields when we imported the data from the `books.json` file. If you look at a few of the elements in the `books.json` file you'll recognize that they match *most* of the fields defined in our schema. Below is the data for one of the records in our sample data:

```
{
  "id":"00007345",
  "authors_other_txts_en":["Giannakis, Georgios B."],
  "title_txt_en":"Signal processing advances in wireless and mobile communications /",
  "responsibility_txt_en":"edited by G.B. Giannakis ... [et al.].",
  "publisher_place_s":"Upper Saddle River, NJ :",
  "publisher_name_s":"Prentice Hall PTR,",
  "publisher_date_s":"c2001.",
  "subjects_ss":["Signal processing", "Wireless communication systems"]
}
```

The process that Solr follows when a new document is ingested into Solr is more or less as follows:

1. If there is an exact match between a field being ingested and the fields defined in the schema then Solr will use the definition in the schema to ingest the data. This is what happened for the `id` field. Our JSON data has an `id` field and so does the schema, therefore Solr stored the `id` value in the `id` field as indicated in the schema (i.e. as a single-value string).

2. If there is no exact match in the schema then Solr will look at the **dynamicFields** definitions to see if the field can be handled with some predefined settings. This is what happened with the `title_txt_en` field. Because there is no `title_txt_en` definition in the schema, Solr used the dynamic field definition for `*_txt_en`, which indicates that the value should be indexed using the English text (`text_en`) field type.

3. If no match is found in the dynamic fields either, Solr will [guess the best type to use](https://bryanbende.com/development/2015/11/14/solr-schemas) based on the data for this field in the first document. This is what happened with the `authors_other_txts_en` field (notice that this field ends with `_txts_en` rather than `_txt_en`). In this case, since there is no dynamic field definition to handle this ending, Solr guessed and created field `authors_other_txts_en` as `text_general`. For production use Solr recommends disabling this automatic guessing; this is what the "WARNING: Using _default configset with data driven schema functionality. NOT RECOMMENDED for production use" was about when we first created our Solr core.

In the following sections we are going to drill down into some of the specifics of the field and dynamic field definitions that are configured in our Solr core.


### Field: id

Let's look at the details of the `id` field in our schema

```
$ curl localhost:8983/solr/bibdata/schema/fields/id

#
# Will return something like this
# {
#   "responseHeader":{...},
#   "field":{
#     "name":"id",
#     "type":"string",
#     "multiValued":false,
#     "indexed":true,
#     "required":true,
#     "stored":true
#   }
# }
#
```

Notice how the field is of type `string` and is also marked as not multi-valued, indexed, required, and stored.

The `string` type also has its own definition, which we can view via:

```
$ curl localhost:8983/solr/bibdata/schema/fieldtypes/string

# {
#   "responseHeader":{...},
#   "fieldType":{
#     "name":"string",
#     "class":"solr.StrField",
#     "sortMissingLast":true,
#     "docValues":true
#   }
# }
#
```

In this case the `class` points to an internal Solr class (`solr.StrField`) that will be used to handle values of the string type.


### Field: title_txt_en

Now let's look at a more complex field and field type. If we look for a definition for the `title_txt_en` field Solr will report that we don't have one:

```
$ curl localhost:8983/solr/bibdata/schema/fields/title_txt_en

# {
#   "responseHeader":{...
#   "error":{
#     "metadata":[
#       "error-class","org.apache.solr.common.SolrException",
#       "root-error-class","org.apache.solr.common.SolrException"],
#     "msg":"No such path /schema/fields/title_txt_en",
#     "code":404}}
#
```

However, if we look at the dynamic field definitions we'll notice that there is one for fields that end in `_txt_en`:

```
$ curl localhost:8983/solr/bibdata/schema/dynamicfields/*_txt_en

# {
#   "responseHeader":{...
#   "dynamicField":{
#     "name":"*_txt_en",
#     "type":"text_en",
#     "indexed":true,
#     "stored":true}}
#
```

This tells Solr that any field name in the source data that does not already exist in the schema and that ends in `_txt_en` should be created as a field of type `text_en`. That looks innocent enough, so let's take a closer look at what the `text_en` field type means:

```
$ curl localhost:8983/solr/bibdata/schema/fieldtypes/text_en

# {
#   "responseHeader":{...}
#   "fieldType":{
#     "name":"text_en",
#     "class":"solr.TextField",
#     "positionIncrementGap":"100",
#     "indexAnalyzer":{
#       "tokenizer":{
#         "class":"solr.StandardTokenizerFactory"
#       },
#       "filters":[
#         { "class":"solr.StopFilterFactory" ... },
#         { "class":"solr.LowerCaseFilterFactory" },
#         { "class":"solr.EnglishPossessiveFilterFactory" },
#         { "class":"solr.KeywordMarkerFilterFactory" ... },
#         { "class":"solr.PorterStemFilterFactory" }
#       ]
#     },
#     "queryAnalyzer":{
#       "tokenizer":{
#         "class":"solr.StandardTokenizerFactory"
#       },
#       "filters":[
#         { "class":"solr.SynonymGraphFilterFactory" ... },
#         { "class":"solr.StopFilterFactory" ... },
#         { "class":"solr.LowerCaseFilterFactory" },
#         { "class":"solr.EnglishPossessiveFilterFactory" },
#         { "class":"solr.KeywordMarkerFilterFactory" ... },
#         { "class":"solr.PorterStemFilterFactory" }
#       ]
#     }
#   }
# }
```

This is obviously a much more complex definition than the ones we saw before. Although the basics are the same (e.g. the field type points to class `solr.TextField`), notice that there are two new sections, `indexAnalyzer` and `queryAnalyzer`, for this field type. We will explore those in the next section.
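
As an aside, the same kind of lookup explains other fields in our data; for example the multi-valued `subjects_ss` field is covered by the `*_ss` dynamic field pattern. A quick sketch (the exact attributes reported depend on your configset):

```
$ curl localhost:8983/solr/bibdata/schema/dynamicfields/*_ss

# "dynamicField":{
#   "name":"*_ss",
#   "type":"strings",
#   "indexed":true,
#   "stored":true}
```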

**Note:** The fact that the Solr schema API does not show dynamically created fields (like `title_txt_en`) is baffling, particularly since they do show in the [Schema Browser Screen](https://solr.apache.org/guide/solr/latest/indexing-guide/schema-browser-screen.html) of the Solr Admin screen. This has been a known issue for many years as shown in this [Stack Overflow question from 2010](https://stackoverflow.com/questions/3211139/solr-retrieve-field-names-from-a-solr-index) in which one of the answers suggests using the following command to list all fields, including those created via `dynamicField` definitions: `curl localhost:8983/solr/bibdata/admin/luke?numTerms=0`


## Analyzers, Tokenizers, and Filters

The `indexAnalyzer` section defines the transformations to perform *as the data is indexed* in Solr and `queryAnalyzer` defines transformations to perform *as we query for data* out of Solr. It's important to notice that the output of the `indexAnalyzer` affects the terms *indexed*, but not the value *stored*. The [Solr Reference Guide](https://solr.apache.org/guide/solr/latest/indexing-guide/analyzers.html) says:

    The output of an Analyzer affects the terms indexed in a given field
    (and the terms used when parsing queries against those fields) but
    it has no impact on the stored value for the fields. For example:
    an analyzer might split "Brown Cow" into two indexed terms "brown"
    and "cow", but the stored value will still be a single String: "Brown Cow"

When a value is *indexed* for a particular field the value is first passed to a `tokenizer` and then to the `filters` defined in the `indexAnalyzer` section for that field type. Similarly, when we *query* for a value in a given field the value of our query is first processed by a `tokenizer` and then by the `filters` defined in the `queryAnalyzer` section for that field.

If we look again at the definition for the `text_en` field type we'll notice that "stop words" (i.e. words to be ignored) are handled at index and query time (notice the `StopFilterFactory` filter appears in the `indexAnalyzer` and the `queryAnalyzer` sections). However, notice that "synonyms" will only be applied at query time since the filter `SynonymGraphFilterFactory` only appears in the `queryAnalyzer` section.

We can customize field type definitions to use different filters and tokenizers via the Schema API, which we will discuss later in this tutorial.


### Tokenizers

For most purposes we can think of a tokenizer as something that splits a given text into individual tokens or words. The [Solr Reference Guide](https://solr.apache.org/guide/solr/9_0/indexing-guide/tokenizers.html) defines Tokenizers as follows:

    Tokenizers are responsible for breaking
    field data into lexical units, or tokens.

For example if we give the text "hello world" to a tokenizer it might split the text into two tokens like "hello" and "world".

Solr comes with several [built-in tokenizers](https://solr.apache.org/guide/solr/9_0/indexing-guide/tokenizers.html) that handle a variety of data. For example if we expect a field to have information about a person's name the [Standard Tokenizer](https://solr.apache.org/guide/solr/9_0/indexing-guide/tokenizers.html#standard-tokenizer) might be appropriate for it.
However, for a field that contains e-mail addresses the [UAX29 URL Email Tokenizer](https://solr.apache.org/guide/solr/9_0/indexing-guide/tokenizers.html#uax29-url-email-tokenizer) might be a better option.

You can only have [one tokenizer per analyzer](https://solr.apache.org/guide/solr/9_0/indexing-guide/tokenizers.html).


### Filters

Whereas a `tokenizer` takes a string of text and produces a set of tokens, a `filter` takes a set of tokens, processes them, and produces a different set of tokens. The [Solr Reference Guide](https://solr.apache.org/guide/solr/9_0/indexing-guide/filters.html) says that

    in most cases a filter looks at each token in the stream sequentially
    and decides whether to pass it along, replace it or discard it.

Notice that unlike tokenizers, whose job is to split text into tokens, the job of filters is a bit more complex since they might replace a token with a new one or discard it altogether.

Solr comes with many [built-in Filters](https://solr.apache.org/guide/solr/9_0/indexing-guide/filters.html) that we can use to perform useful transformations. For example, the ASCII Folding Filter converts non-ASCII characters to their ASCII equivalent (e.g. "México" is converted to "Mexico"). Likewise the English Possessive Filter removes singular possessives (trailing 's) from words. Another useful filter is the Porter Stem Filter that calculates word stems using English language rules (e.g. both "jumping" and "jumped" will be reduced to "jump".)


### Putting it all together

When we looked at the definition for the `text_en` field type we noticed that at *index time* several filters were applied (`StopFilterFactory`, `LowerCaseFilterFactory`, `EnglishPossessiveFilterFactory`, `KeywordMarkerFilterFactory`, and `PorterStemFilterFactory`.)

That means that if we *index* the text "The Television is Broken!" in a `text_en` field the filters defined in the `indexAnalyzer` will transform this text into two tokens: "televis" and "broken". Notice how the tokens were lowercased, the stop words ("the" and "is") dropped, and only the stem of the word "television" was indexed.

Likewise, the definition for `text_en` included the additional filter `SynonymGraphFilter` at *query time*. So if we were to *query* for the text "The TV is Broken!" Solr will run this text through the filters indicated in the `queryAnalyzer` section and generate the following tokens: "televis", "tv", and "broken". Notice that an additional transformation was done to this text, namely, the word "TV" was expanded to its synonyms. This is because the `queryAnalyzer` uses the `SynonymGraphFilter` and a standard Solr configuration comes with those synonyms predefined in the `synonyms.txt` file.

The [Analysis Screen](https://solr.apache.org/guide/solr/9_0/indexing-guide/analysis-screen.html) in the Solr Admin tool is a great way to see how a particular text is either indexed or queried by Solr *depending on the field type*. Point your browser to http://localhost:8983/solr/#/bibdata/analysis and try the following examples:

* Enter "The quick brown fox jumps over the lazy dog" in the "Field Value (*index*)" box, select `string` as the field type and see how it is indexed. Then select `text_general` and click "Analyze Values" to see how it's indexed. Lastly, select `text_en` and see how it's indexed.
You might want to uncheck the "Verbose output" checkbox to see the differences more clearly.

* With the text still in the "Field Value (*index*)" text box, enter "The quick brown fox jumps over the LAZY dog" in the "Field Value (*query*)" box and try the different field types (`string/text_general/text_en`) again to see how each of them shows different matches.

* Try changing the text in the "Field Value (*query*)" text box to "The quick brown foxes jumped over the LAZY dogs". Compare the results using `text_general` versus `text_en`.

* Now enter "The TV is broken!" in the "Field Value (*index*)" text box, clear the "Field Value (*query*)" text box, select `text_en`, and see how the value is indexed. Then do the reverse: clear the indexed value and enter "The TV is broken!" in the "Field Value (*query*)" text box and notice the synonyms being applied.

* Now enter "The TV is broken!" in the "Field Value (*index*)" text box and "the television is broken" in the "Field Value (*query*)" box. Notice how they are matched because of the synonyms applied to `text_en` fields.

* Now enter "The TV is broken!" in the "Field Value (*index*)" text box and clear the "Field Value (*query*)" text box, select `text_general`, and notice how the stop words were not removed because we are not using English-specific rules.


## Handling text in Chinese, Japanese, and Korean (optional)

If your data has text in Chinese, Japanese, or Korean (CJK) Solr has built-in support for searching text in these languages using the proper transformations. Just as Solr uses different transformations when using field type `text_en` instead of `text_general`, Solr applies different rules when using field type `text_cjk`.

You can see the definition of this field type with the following command. Notice how there are two new filters (`CJKWidthFilterFactory` and `CJKBigramFilterFactory`) that are different from what we saw in the `text_en` definition.

```
$ curl localhost:8983/solr/bibdata/schema/fieldtypes/text_cjk

# ...
# "fieldType":{
#   "name":"text_cjk",
#   "class":"solr.TextField",
#   "positionIncrementGap":"100",
#   "analyzer":{
#     "tokenizer":{
#       "class":"solr.StandardTokenizerFactory"},
#     "filters":[
#       {"class":"solr.CJKWidthFilterFactory"},
#       {"class":"solr.LowerCaseFilterFactory"},
#       {"class":"solr.CJKBigramFilterFactory"}]}}}
#
```

If you go to the Analysis Screen again and enter "胡志明" (Ho Chi Minh) as the "Field Value (*index*)", select `text_general` as the field type, and analyze the values, you'll notice how Solr produced three tokens ("胡", "志", and "明"), which is incorrect in Chinese. However, if you select `text_cjk` and analyze the values again you'll notice that you end up with two tokens ("胡志" and "志明") thanks to the `CJKBigramFilterFactory`, and that is the expected behavior for text in Chinese.

The data for this section was taken from this [blog post](https://opensourceconnections.com/blog/2011/12/23/indexing-chinese-in-solr/). Although the technology referenced in the blog post is a bit dated, the basic concepts explained are still relevant, particularly if you, like me, are not a CJK speaker. Naomi Dushay's [CJK with Solr for Libraries](http://discovery-grindstone.blogspot.com/2013/10/cjk-with-solr-for-libraries-part-1.html) is a great resource on this topic.
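
If you'd rather run these analysis experiments from the command line instead of the browser, Solr also exposes them through the field analysis endpoint. A minimal sketch (assuming the `/analysis/field` handler, which Solr registers implicitly) that analyzes the same Chinese text with the `text_cjk` field type:

```
$ curl 'http://localhost:8983/solr/bibdata/analysis/field?analysis.fieldtype=text_cjk&analysis.fieldvalue=胡志明'

# the response lists the tokens produced by each tokenizer and filter
# in the chain, ending with the two bigrams ("胡志" and "志明")
```

Swap `text_cjk` for `text_general` in the `analysis.fieldtype` parameter to reproduce the three single-character tokens from the previous example.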

## Stored vs indexed fields (optional)

There are two properties on a Solr field that control whether its values are `stored`, `indexed`, or both.

* Fields that are *stored but not indexed* can be fetched once a document has been found, but you cannot search by those fields (i.e. you cannot reference them in the `q` parameter).
* Fields that are *indexed but not stored* are the reverse: you can search by them but you cannot fetch their values once a document has been found (i.e. you cannot reference them in the `fl` parameter).

Technically it's also possible to [add a field that is neither stored nor indexed](https://stackoverflow.com/a/22298265/446681) but that's beyond the scope of this tutorial.

There are many reasons to toggle the stored and indexed properties of a field. For example, perhaps we want to store a complex object as a string in Solr so that we can display it to the user, but we really don't want to index its values because we don't expect to ever search by this field. Conversely, perhaps we want to create a field with a combination of values to optimize a particular kind of search, but we don't want to display it to the users (the default `_text_` field in our schema is such an example).


## Customizing our schema
So far we have only worked with the fields that were automatically added to our `bibdata` core as we imported the data. Because the fields in our source data had suffixes (e.g. `_txt_en`) that match the default `dynamicField` definitions in a standard Solr installation, most of our fields were created with the proper field type except, as we saw earlier, the `_txts_en` fields, which were created as `text_general` fields rather than `text_en` fields (because there was no definition for `_txts_en` fields).

Also, although it's nice that we can do sophisticated searches by title (because it is a `text_en` field) we [could not sort](https://stackoverflow.com/a/7992380/446681) the results by this field because it's a tokenized field (technically we can sort by it but the results will not be what we would expect.)

Let's customize our schema a little bit to get the most out of Solr.


### Recreating our Solr core
Let's begin by recreating our Solr core so that we have a clean slate.

Delete the existing `bibdata` core in Solr

```
$ docker exec -it solr-container solr delete -c bibdata

# Deleting core 'bibdata' using command:
# http://localhost:8983/solr/admin/cores?action=UNLOAD&core=bibdata...
```

Then re-create it

```
$ docker exec -it solr-container solr create_core -c bibdata

# WARNING: Using _default configset with data driven schema functionality.
# ...
#
# Created new core 'bibdata'
```

And finally query it (you should have zero documents since we have not re-imported the data)
```
$ curl 'http://localhost:8983/solr/bibdata/select?q=*:*'

#
# "response":{"numFound":0,"start":0,"docs":[]
#
```

This time, *before* we import the data in the `books.json` file, we are going to add a few field definitions to the schema to make sure the data is indexed and stored in the way that we want.
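
Before adding our custom definitions, we can double-check what the freshly created schema already contains. A quick sketch using the same Schema API from before (at this point it should list only a handful of built-in fields such as `id` and `_text_`):

```
$ curl localhost:8983/solr/bibdata/schema/fields
```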

### Handling `_txts_en` fields
The first thing we'll do is add a new `dynamicField` definition to account for multi-value text fields in English, i.e. fields that end with `_txts_en` in our JSON data:

```
$ curl -X POST -H 'Content-type:application/json' --data-binary '{
  "add-dynamic-field":{
    "name":"*_txts_en",
    "type":"text_en",
    "multiValued":true}
}' http://localhost:8983/solr/bibdata/schema
```

This will make sure Solr indexes these fields as `text_en` rather than the default `text_general` that it used when we did not have a `dynamicField` to account for them.


### Customizing the title field
Secondly, we'll ask Solr to store a string version of the title (in addition to the text version) so we can sort results by title. To do this we'll add a `copy-field` directive to our schema to copy the value of `title_txt_en` to another field (`title_s`). This way we'll have a text version for searching and a string version for sorting.

```
$ curl -X POST -H 'Content-type:application/json' --data-binary '{
  "add-copy-field":[
    {
      "source":"title_txt_en",
      "dest":[ "title_s" ]
    }
  ]
}' http://localhost:8983/solr/bibdata/schema
```


### Customizing the author fields
Right now we have two separate fields for author information (`author_txt_en` for the main author and `authors_other_txts_en` for additional authors) which means that if we want to find books by a particular author we have to issue a query against two separate fields: `author_txt_en:"Sarah" OR authors_other_txts_en:"Sarah"`

Let's use a `copy-field` directive to have Solr automatically combine the main author and additional authors into a new field. Notice that the new field `authors_all_txts_en` matches the `dynamicField` directive that we just created, meaning that it will be indexed as a multi-valued `text_en` field.

```
$ curl -X POST -H 'Content-type:application/json' --data-binary '{
  "add-copy-field":[
    {
      "source":"author_txt_en",
      "dest":[ "authors_all_txts_en" ]
    },
    {
      "source":"authors_other_txts_en",
      "dest":[ "authors_all_txts_en" ]
    }
  ]
}' http://localhost:8983/solr/bibdata/schema
```


### Customizing the subject field (optional)
Another customization that we'll do is to aggregate all the subject fields (`subjects_ss`, `subjects_geo_ss`, `subjects_chrono_ss`) into a new single field `subjects_all_txts_en`, and we'll make that field a text field so that we can search by subject easily. We'll do this via copy fields:

```
$ curl -X POST -H 'Content-type:application/json' --data-binary '{
  "add-copy-field":[
    {
      "source":"subjects_ss",
      "dest": "subjects_all_txts_en"
    },
    {
      "source":"subjects_geo_ss",
      "dest": "subjects_all_txts_en"
    },
    {
      "source":"subjects_chrono_ss",
      "dest": "subjects_all_txts_en"
    }
  ]
}' http://localhost:8983/solr/bibdata/schema
```


### Populating the _text_ field
As we saw earlier, by default, if no field is indicated in a search, Solr searches in the `_text_` field. This field is already defined in our schema but we are currently not populating it with anything since the field does not exist in our `books.json` data file.
Let's fix that by telling Solr to copy the value of every field into the `_text_` field using a `copyField` definition like the one below:

```
$ curl -X POST -H 'Content-type:application/json' --data-binary '{
  "add-copy-field":[
    {
      "source":"*",
      "dest":[ "_text_" ]
    }
  ]
}' http://localhost:8983/solr/bibdata/schema
```

In a production environment we would probably want to be more selective about how we populate `_text_`, but this will do for us.


### Testing our changes
Now that we have configured our schema with a few specific field definitions, let's re-import the data so that fields are indexed using the new configuration.

```
$ docker exec -it solr-container post -c bibdata books.json

# /opt/java/openjdk/bin/java -classpath ...
# SimplePostTool version 5.0.0
# Posting files to [base] url http://localhost:8983/solr/bibdata/update...
# Entering auto mode. ...
# POSTing file books.json (application/json) to [base]/json/docs
# 1 files indexed.
# COMMITting Solr index changes to http://localhost:8983/solr/bibdata/update...
# Time spent: 0:00:02.871
```


### Testing changes to the title field
Now that we have a string version of the title field it is possible for us to sort our search results by this field. For example, let's search for books that have the word "water" in the title (`q=title_txt_en:water`) and sort them by title (`sort=title_s+asc`):

```
$ curl 'http://localhost:8983/solr/bibdata/select?fl=id,title_txt_en&q=title_txt_en:water&sort=title_s+asc'

#
# response will include
# ...
# "title_txt_en":"A practical guide to creating and maintaining water quality /",
# "title_txt_en":"A practical guide to particle counting for drinking water treatment /",
# "title_txt_en":"Applied ground-water hydrology and well hydraulics /",
# "title_txt_en":"Assessment of blue-green algal toxins in raw and finished drinking water /",
# "title_txt_en":"Bureau of Reclamation..."
# "title_txt_en":"Carry me across the water : a novel /",
# "title_txt_en":"Clean Water Act : proposed revisions to EPA regulations to clean up polluted waters /",
# "title_txt_en":"Cold water burning /"
# ...
#
```

Notice that the results are sorted alphabetically by title because we are using the string version of the field (`title_s`) for sorting. Try and see what the results look like if you sort by the text version of the title (`title_txt_en`):

```
$ curl 'http://localhost:8983/solr/bibdata/select?fl=id,title_txt_en&q=title_txt_en:water&sort=title_txt_en+asc'
```

The results in this case will not look correct because Solr will be using the tokenized value of the `title_txt_en` field to sort rather than the string version.
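
As a side note, the `sort` parameter accepts more than one field. A small sketch that sorts by title and breaks any ties by `id`:

```
$ curl 'http://localhost:8983/solr/bibdata/select?fl=id,title_txt_en&q=title_txt_en:water&sort=title_s+asc,id+asc'
```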

### Testing changes to the author field

Take a look at the data for this particular book that has more than one author and notice how the `authors_all_txts_en` field has the combination of `author_txt_en` and `authors_other_txts_en` even though our source data didn't have an `authors_all_txts_en` field:

```
$ curl 'http://localhost:8983/solr/bibdata/select?q=id:00009214'

#
# {
#   "id":"00009214",
#   "author_txt_en":"Everett, Barbara,",
#   "authors_other_txts_en":["Gallop, Ruth,"],
#   "authors_all_txts_en":["Everett, Barbara,", "Gallop, Ruth,"],
# }
#
```

Likewise, let's search for books authored by "Gallop" using our new `authors_all_txts_en` field (`q=authors_all_txts_en:Gallop`) and notice how this document will be in the results regardless of whether Ruth Gallop is the main author or an additional author.

```
$ curl 'http://localhost:8983/solr/bibdata/select?q=authors_all_txts_en:Gallop'
```


### Testing the _text_ field
Let's run a query *without* specifying what field to search on, for example `q=biology`:

```
$ curl 'http://localhost:8983/solr/bibdata/select?q=biology&debug=all'
```

The result will include all documents where the word "biology" is found in the `_text_` field, and since we are now populating this field with a copy of every value in our documents this means that we'll get back any document that has the word "biology" in the title, the author, or the subject.

We can confirm that Solr is searching on the `_text_` field by looking at the information in the parsed query; it will look like this:

```
"debug":{
  "rawquerystring":"biology",
  "querystring":"biology",
  "parsedquery":"_text_:biology",
```

Notice that our raw query `"biology"` got parsed as `"_text_:biology"`.
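
You can double-check this equivalence by naming the `_text_` field explicitly in the query; the following sketch should return the same documents as the previous query:

```
$ curl 'http://localhost:8983/solr/bibdata/select?q=_text_:biology'
```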

# PART III: SEARCHING

When we issue a search to Solr we pass the search parameters in the query string. In previous examples we passed values in the `q` parameter to indicate the values that we want to search for and `fl` to indicate what fields we want to retrieve. For example:

```
$ curl 'http://localhost:8983/solr/bibdata/select?q=*&fl=id,title_txt_en'
```

In some instances we passed rather sophisticated values for these parameters, for example we used `q=title_txt_en:"art history"~3` when we wanted to search for books with the words "art" and "history" in the title within a few words of each other.

The components in Solr that parse these parameters are called query parsers. Their job is to extract the parameters and create a query that Lucene can understand. Remember that Lucene is the search engine underneath Solr.


## Query Parsers

Out of the box Solr comes with three query parsers: Standard, DisMax, and Extended DisMax (eDisMax). Each of them has its own advantages and disadvantages.

* The [Standard](https://solr.apache.org/guide/solr/9_0/query-guide/standard-query-parser.html) query parser (aka the Lucene Parser) "supports a robust and fairly intuitive syntax allowing you to create a variety of structured queries. The largest disadvantage is that it’s very intolerant of syntax errors, as compared with something like the DisMax Query Parser which is designed to throw as few errors as possible."

* The [DisMax](https://solr.apache.org/guide/solr/9_0/query-guide/dismax-query-parser.html) query parser interface "is more like that of Google than the interface of the 'lucene' Solr query parser. This similarity makes DisMax the appropriate query parser for many consumer applications. It accepts a simple syntax, and it rarely produces error messages."

* The [Extended DisMax](https://solr.apache.org/guide/solr/9_0/query-guide/edismax-query-parser.html) (eDisMax) query parser is an improved version of the DisMax parser that is also very forgiving on errors when parsing user-entered queries and, like the Standard query parser, supports complex query expressions.

One key difference among these parsers is that they recognize different parameters. For example, the *DisMax* and *eDisMax* parsers support a `qf` parameter to specify what fields should be searched, but this parameter is not supported by the *Standard* parser.

The rest of the examples in this section are going to use the eDisMax parser; notice the `defType=edismax` in our queries to Solr to make this selection. As we will see later in this tutorial, you can also set the default query parser of your Solr core to eDisMax by updating the `defType` parameter in your `solrconfig.xml` so that you don't have to explicitly set it on every query.


## Basic searching in Solr
The number of search parameters that you can pass to Solr is rather large and, as we've mentioned, they also depend on what query parser you are using.

To see a comprehensive list of the parameters that apply to all parsers take a look at the [Common Query Parameters](https://solr.apache.org/guide/solr/9_0/query-guide/common-query-parameters.html) and the [Standard Query Parser](https://solr.apache.org/guide/solr/9_0/query-guide/standard-query-parser.html) sections in the Solr Reference Guide.
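
To make the parser selection concrete, here is a sketch of the same query issued twice: once with the default (lucene) parser and once with eDisMax. Only the `defType` parameter changes:

```
$ curl 'http://localhost:8983/solr/bibdata/select?q=title_txt_en:washington'
$ curl 'http://localhost:8983/solr/bibdata/select?q=title_txt_en:washington&defType=edismax'
```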

Below are some of the parameters that are supported by all parsers:

* `defType`: Query parser to use (default is `lucene`; other possible values are `dismax` and `edismax`)
* `q`: Search query, the basic syntax is `field:"value"`.
* `sort`: Sorting of the results (default is `score desc`, i.e. highest ranked document first)
* `rows`: Number of documents to return (default is `10`)
* `start`: Index of the first document to return (default is `0`)
* `fl`: List of fields to return in the result.
* `fq`: Filters results without calculating a score.

Below are a few sample queries to show these parameters in action. Notice that spaces are URL encoded as `+` in the `curl` commands below; you do not need to encode them if you are submitting these queries via the Solr Admin interface in your browser.

* Retrieve the first 10 documents where the `title_txt_en` includes the word "washington" (`q=title_txt_en:washington`)
```
$ curl 'http://localhost:8983/solr/bibdata/select?q=title_txt_en:washington'
```

* The next 15 documents for the same query (notice the `start=10` and `rows=15` parameters)
```
$ curl 'http://localhost:8983/solr/bibdata/select?q=title_txt_en:washington&start=10&rows=15'
```

* Retrieve the `id` and `title_txt_en` (`fl=id,title_txt_en`) where the title includes the words "women writers", but allowing for a word in between, e.g. "women nature writers" (`q=title_txt_en:"women writers"~1`). Technically the `~N` means "N edit distance away" (see Solr in Action, p. 63).
```
$ curl 'http://localhost:8983/solr/bibdata/select?q=title_txt_en:"women+writers"~1&fl=id,title_txt_en'
```

* Documents that have additional authors (`q=authors_other_txts_en:*`); the `*` means "any value".
```
$ curl 'http://localhost:8983/solr/bibdata/select?fl=id,title_txt_en,author_txt_en,authors_other_txts_en&q=authors_other_txts_en:*'
```

* Documents that do *not* have additional authors (`q=NOT authors_other_txts_en:*`). Be aware that the `NOT` **must be in uppercase**.
```
$ curl 'http://localhost:8983/solr/bibdata/select?fl=*&q=NOT+authors_other_txts_en:*'
```

* Documents where at least one of the subjects is about "communication" (`q=subjects_all_txts_en:communication`). In reality, because this is a Text in English field, this query will return all documents where `subjects_all_txts_en` has the word "commun", the stem of "communication"; you can validate this in the debug output:
```
$ curl 'http://localhost:8983/solr/bibdata/select?fl=id,title_txt_en,subjects_all_txts_en&q=subjects_all_txts_en:communication&debug=all'
```

* Documents where the title includes "science" *and* at least one of the subjects is "women" (`q=title_txt_en:science AND subjects_all_txts_en:women`; notice that both search conditions are indicated in the `q` parameter). Again, notice that the `AND` operator must be in uppercase.
```
$ curl 'http://localhost:8983/solr/bibdata/select?fl=id,title_txt_en,subjects_all_txts_en&q=title_txt_en:science+AND+subjects_all_txts_en:women'
```

* Documents where the title *includes* the word "history" but *does not include* the word "art" (`q=title_txt_en:history AND NOT title_txt_en:art`)
```
$ curl 'http://localhost:8983/solr/bibdata/select?fl=id,title_txt_en&q=title_txt_en:history+AND+NOT+title_txt_en:art'
```

The [Solr Reference Guide](https://solr.apache.org/guide/solr/9_0/query-guide/standard-query-parser.html) and
[this Lucene tutorial](http://www.solrtutorial.com/solr-query-syntax.html) are good places to check for quick reference on the query syntax.


### The qf parameter

The DisMax and eDisMax query parsers provide another parameter, Query Fields (`qf`), that should not be confused with the `q` or `fq` parameters. The `qf` parameter is used to indicate the *list of fields* that the search should be executed on, along with their boost values.

If we want to search for the same value in multiple fields at once (e.g. if we want to find all books where *the title or the author* includes the text "Washington") we must indicate each field/value pair individually: `q=title_txt_en:"Washington" authors_all_txts_en:"Washington"`.

The `qf` parameter allows us to specify the fields separately from the terms so that we can use instead: `q="Washington"` and `qf=title_txt_en authors_all_txts_en`. This is really handy if we want to customize what fields are searched in an application in which the user enters a single text (say "Washington") and the application automatically searches multiple fields.

Below is an example of this (remember to select the eDisMax parser (`defType=edismax`) when using the `qf` parameter):

```
$ curl 'http://localhost:8983/solr/bibdata/select?q="washington"&qf=title_txt_en+authors_all_txts_en&defType=edismax'
```


### debugQuery
Solr provides an extra parameter, `debug=all`, that we can use to get debug information about a query. This is particularly useful if the results that we get are not what we were expecting. For example, let's run the same query again but this time passing the `debug=all` parameter:

```
$ curl 'http://localhost:8983/solr/bibdata/select?q="washington"&qf=title_txt_en+authors_all_txts_en&defType=edismax&debug=all'

# response will include
# {
#   "responseHeader":{...}
#   "response":{...}
#   "debug":{
#     "rawquerystring":"\"washington\"",
#     "querystring":"\"washington\"",
#     "parsedquery":"+DisjunctionMaxQuery((title_txt_en:washington | authors_all_txts_en:washington))",
#     "parsedquery_toString":"+(title_txt_en:washington | authors_all_txts_en:washington)",
#     "explain":{
#       ... tons of information here ...
#     }
#     "QParser":"ExtendedDismaxQParser",
#   }
# }
#
```

Notice the `debug` property in the output. Inside this property there is information about:

* what value the server received for the search (`querystring`), which is useful to detect if you are not properly URL encoding the value sent to the server
* how the server parsed the query (`parsedquery`), which is useful to detect if the syntax in the `q` parameter was parsed as we expected (e.g.
remember the example earlier when we passed two words `art history` without surrounding them in quotes and the parsed query showed that it was querying two different fields: `title_txt_en` for "art" and `_text_` for "history")
* you can also see that some of the search terms were stemmed (e.g. if you query for "running" you'll notice that the parsed query will show "run")
* how each document was ranked (`explain`)
* what query parser (`QParser`) was used

Check out this [blog post](https://hectorcorrea.com/blog/solr-debugquery/2021-11-11-00001) for more information about `debugQuery`.


### Ranking of documents

When Solr finds documents that match the query it ranks them so that the most relevant documents show up first. You can provide Solr guidance on what fields are more important to you so that Solr considers this when ranking documents that match a given query.

Let's say that we want documents where the word "Washington" (`q=washington`) is found in the title or in the author (`qf=title_txt_en authors_all_txts_en`)

```
$ curl 'http://localhost:8983/solr/bibdata/select?&q=washington&qf=title_txt_en+authors_all_txts_en&defType=edismax'
```

Now let's say that we want to boost the documents where the author has the word "Washington" ahead of the documents where "Washington" was found in the title. To do this we update the `qf` parameter as follows: `qf=title_txt_en authors_all_txts_en^5` (notice the `^5` to boost the `authors_all_txts_en` field)

```
$ curl 'http://localhost:8983/solr/bibdata/select?&q=washington&qf=title_txt_en+authors_all_txts_en^5&defType=edismax'
```

Notice how documents where the author is named "Washington" come first, but we still get documents where the title includes the word "Washington".

Boost values are arbitrary: you can use 1, 20, 789, 76.2, 1000, or whatever number you like; you can even use negative numbers (`qf=title_txt_en authors_all_txts_en^-10`). They are just a way for us to hint to Solr which fields we consider more important in a particular search.

If we want to see why Solr ranked a result higher than another we can pass an additional parameter, `debug.explain.structured=true`, to see the explanation of how Solr ranked each of the documents in the result:

```
$ curl 'http://localhost:8983/solr/bibdata/select?q=title_txt_en:west+authors_all_txts_en:washington&debug=all&debug.explain.structured=true'
```

The result will include an `explain` node with a ton of information for each of the documents ranked. This information is rather complex but it has a wealth of details that could help us figure out why a particular document is ranked higher or lower than what we would expect. Take a look at [this blog post](https://library.brown.edu/create/digitaltechnologies/understanding-scoring-of-documents-in-solr/) to get an idea of how to interpret this information.


### Filtering with ranges

You can also filter a field to be within a range by using the bracket operator with the following syntax: `field:[firstValue TO lastValue]`. For example, to request documents with `id` between `00010500` and `00012050` we could do: `id:[00010500 TO 00012050]`. You can also indicate open-ended ranges by passing an asterisk as the value, for example: `id:[* TO 00012050]`.

```
$ curl 'http://localhost:8983/solr/bibdata/select?q=id:\[00010500+TO+00012050\]'
```

Be aware that range filtering with `string` fields works as you would expect it to, but with `text_general` and `text_en` fields it will filter on the *terms indexed*, not on the value of the field.


### Where to find more
Searching is a large and complex topic. I've found the book "Relevant search with applications for Solr and Elasticsearch" (see references) to be a good conceptual reference with specifics on how to understand and configure Solr to improve search results. Chapter 3 of this book goes into great detail on how to read and understand the ranking of results.


## Facets
One of the most popular features of Solr is the concept of *facets*. The [Solr Reference Guide](https://solr.apache.org/guide/solr/9_0/query-guide/faceting.html) defines it as:

    Faceting is the arrangement of search results into categories
    based on indexed terms.

    Searchers are presented with the indexed terms, along with numerical
    counts of how many matching documents were found for each term.
    Faceting makes it easy for users to explore search results, narrowing
    in on exactly the results they are looking for.

You can easily get facet information from a query by selecting what field (or fields) you want to use to generate the categories and the counts. The basic syntax is `facet=on` followed by `facet.field=name-of-field`. For example, to facet our dataset by *subjects* we would use the syntax `facet.field=subjects_ss`, as in the following example:

```
$ curl 'http://localhost:8983/solr/bibdata/select?q=*&facet=on&facet.field=subjects_ss'

# result will include
#
# "facet_counts":{
#   "facet_queries":{},
#   "facet_fields":{
#     "subjects_ss":[
#       "Women",435,
#       "Large type books",415,
#       "African Americans",337,
#       "English language",330,
#       "World War, 1939-1945",196,
#       ...
#
```

IMPORTANT: You might have noticed that we are using the `string` representation of the subjects (`subjects_ss`) to generate the facets rather than the `text_en` version stored in the `subjects_all_txts_en` field. This is because, as the Solr Reference Guide indicates, facets are calculated "based on indexed terms". The indexed version of the `subjects_all_txts_en` field is tokenized whereas the indexed version of `subjects_ss` is the entire string.

You can indicate more than one `facet.field` in a query to Solr (e.g. `facet.field=publisher_name_s&facet.field=subjects_ss`) to get facets for more than one field.

There are several extra parameters that you can pass to Solr to customize how many facets are returned in the result set. For example, if you want to list only the top 20 subjects in the facets rather than all of them you can indicate this with the following syntax: `f.subjects_ss.facet.limit=20`.
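
For instance, a quick sketch that limits the subject facet to its top 20 values:

```
$ curl 'http://localhost:8983/solr/bibdata/select?q=*&facet=on&facet.field=subjects_ss&f.subjects_ss.facet.limit=20'
```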
You can also request only facets that have *at least* a certain number of matches; for example, only subjects that have at least 50 books (`f.subjects_ss.facet.mincount=50`) as shown in the following example:

```
$ curl 'http://localhost:8983/solr/bibdata/select?q=*&facet=on&facet.field=subjects_ss&f.subjects_ss.facet.limit=20&f.subjects_ss.facet.mincount=50'
```

You can also facet **by multiple fields at once**; this is called [Pivot Faceting](https://solr.apache.org/guide/solr/9_0/query-guide/faceting.html#pivot-decision-tree-faceting). The way to do this is via the `facet.pivot` parameter.

Note: Unfortunately the `facet.pivot` parameter is not available via the Solr Admin web page; if you want to try this example you will have to do it via the command line in the terminal.

This parameter allows you to list the fields that should be used to facet the data. For example, to facet the information *by subject and then by publisher* (`facet.pivot=subjects_ss,publisher_name_s`) you could issue the following command:

```
$ curl 'http://localhost:8983/solr/bibdata/select?q=*&facet=on&facet.pivot=subjects_ss,publisher_name_s&facet.limit=5'

#
# response will include facets organized as follows:
#
# "facet_counts":{
#   "facet_pivot":{
#     "subjects_ss,publisher_name_s":[{
#       "field":"subjects_ss",
#       "value":"Women",
#       "count":435,
#       "pivot":[{
#         "field":"publisher_name_s",
#         "value":"Chelsea House Publishers,",
#         "count":22},
#        {
#         "field":"publisher_name_s",
#         "value":"Enslow Publishers,",
#         "count":13},
#       ...
#       ]
#      }
#     ]
# ...
#
```

Notice how the results for the subject "Women" (435 results) are broken down by publisher under the "pivot" section.


## Hit highlighting

Another Solr feature is the ability to return a fragment of the document where the match was found for a given search term. This is called [highlighting](https://solr.apache.org/guide/solr/9_0/query-guide/highlighting.html).

Let's say that we search for books where one of the authors or the title includes the word "Washington". To do this we'll set our parameters as follows:
* `q=washington`
* `qf=title_txt_en authors_all_txts_en`
* `defType=edismax` (the `qf` parameter does not work with the Standard parser so we explicitly select eDisMax)
* `hl=on` (this is what enables hit highlighting)

```
$ curl 'http://localhost:8983/solr/bibdata/select?defType=edismax&q=washington&qf=title_txt_en+authors_all_txts_en&hl=on'

#
# response will include a highlight section like this
#
# "highlighting":{
#   "00065343":{
#     "title_txt_en":["<em>Washington</em> Irving's The legend of Sleepy Hollow.."],
#     "authors_all_txts_en":["Irving, <em>Washington</em>,"]},
#   "00107795":{
#     "authors_all_txts_en":["<em>Washington</em>, Durthy."]},
#   "00044606":{
#     "title_txt_en":["University of <em>Washington</em> /"]},
#
```

Notice how the `highlighting` property includes the `id` of each document in the result (e.g. `00065343`), the field where the match was found (e.g. `authors_all_txts_en` and/or `title_txt_en`), and the text that matched within the field (e.g. `University of <em>Washington</em> /`) with the matching term wrapped in `<em>` tags.
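
You can also control which fields highlighting is calculated on with the `hl.fl` parameter. A sketch that only requests highlights from the title field:

```
$ curl 'http://localhost:8983/solr/bibdata/select?defType=edismax&q=washington&qf=title_txt_en+authors_all_txts_en&hl=on&hl.fl=title_txt_en'
```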
You can display the highlighting information along with your search results to allow the user to "preview" why each result was returned.


# PART IV: MISCELLANEOUS (optional)

In the next sections we'll make a few changes to the configuration of our `bibdata` core in order to enable some other features of Solr like synonyms and spell checking.

## Solr's directories and configuration files

In Linux, Solr is typically installed under the `/opt/solr` folder and the data for our cores is stored under the `/var/solr/data` folder. We can see this in our Docker container if we log into it.

*Open a separate terminal window* and execute the following commands to log into the container and see the files inside it:

```
$ docker exec -it solr-container /bin/bash
$ ls -la

#
# You'll see something like this
#
# bin         CHANGES.txt  docker   lib       LICENSE.txt  NOTICE.txt           README.txt
# books.json  contrib      example  licenses  modules      prometheus-exporter  server
#
```

While still in the Docker container, issue a command as follows to see the files with the configuration for our `bibdata` core:

```
$ ls -la /var/solr/data/bibdata/conf/

#
# You'll see something like this
#
# drwxr-xr-x 2 solr solr     4096 Nov 11 07:31 lang
# -rw-r--r-- 1 solr solr    26665 Jan 15 18:07 managed-schema.xml
# -rw-r--r-- 1 solr solr      873 Nov 11 07:31 protwords.txt
# -rw-r--r-- 1 503  dialout 48192 Jan 15 19:45 solrconfig.xml
# -rw-r--r-- 1 solr solr      781 Nov 11 07:31 stopwords.txt
# -rw-r--r-- 1 solr solr     1124 Nov 11 07:31 synonyms.txt
```

Notice the `solrconfig.xml`, `managed-schema.xml`, and `synonyms.txt` files. These are the files that we saw before under the "Files" option in the Solr Admin web page.

File `managed-schema.xml` is where field definitions are declared. File `solrconfig.xml` is where we configure many of the features of Solr for our particular `bibdata` core. File `synonyms.txt` is where we define what words are considered synonyms, and we'll look closely into this next.

Before we continue let's exit from the Docker container with the `exit` command (don't worry, the Docker container is still up and running in the background):

```
$ exit
```


## Synonyms

In a previous section, when we looked at the `text_general` and `text_en` field types, we noticed that they use a filter to handle synonyms at query time.

Here is how to view that definition again:

```
$ curl 'http://localhost:8983/solr/bibdata/schema/fieldtypes/text_en'

#
# "queryAnalyzer":{
#   "tokenizer":{
#     ...
#   },
#   "filters":[
#     ...
#     a few filters go here
#     ...
#     {
#       "class":"solr.SynonymGraphFilterFactory",
#       "expand":"true",
#       "ignoreCase":"true",
#       "synonyms":"synonyms.txt"
#     },
#     ...
#
```

Notice how one of the filters uses the `SynonymGraphFilterFactory` to handle synonyms and references a file `synonyms.txt`.
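
Incidentally, the raw `synonyms.txt` file (like everything else in the core's `conf` folder) can also be fetched from the command line. A sketch, assuming the implicit `/admin/file` handler is available in your Solr version:

```
$ curl 'http://localhost:8983/solr/bibdata/admin/file?file=synonyms.txt&contentType=text/plain'
```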

You can also view the contents of the `synonyms.txt` file for our `bibdata` core through the Files option in the Solr Admin web page: http://localhost:8983/solr/#/bibdata/files?file=synonyms.txt

The contents of this file look more or less like this:

```
# Some synonym groups specific to this example
GB,gib,gigabyte,gigabytes
MB,mib,megabyte,megabytes
Television, Televisions, TV, TVs
#notice we use "gib" instead of "GiB" so any WordDelimiterGraphFilter coming
#after us won't split it into two words.

# Synonym mappings can be used for spelling correction too
pixima => pixma
```

### Life without synonyms

In the data in our `bibdata` core several of the books have the words "twentieth century" in the title, but these books would not be retrieved if a user were to search for "20th century".

Let's try it. First let's search for `q=title_txt_en:"twentieth century"`:

```
$ curl 'http://localhost:8983/solr/bibdata/select?fl=id,title_txt_en&q=title_txt_en:"twentieth+century"'

#
# result will include 84 results
#
```

And now let's search for `q=title_txt_en:"20th century"`:

```
$ curl 'http://localhost:8983/solr/bibdata/select?fl=id,title_txt_en&q=title_txt_en:"20th+century"'

#
# result will include 22 results
#
```

### Adding synonyms

We can tell Solr that "twentieth" and "20th" are synonyms by updating the `synonyms.txt` file, adding a line as follows:

```
20th,twentieth
```

Because our Solr is running inside a Docker container we need to update the `synonyms.txt` file *inside* the container. We are going to do this in four steps:

1. First we'll copy `synonyms.txt` from the Docker container to our machine
2. Then we'll update the file in our machine (with whatever editor we are comfortable with)
3. Next we'll copy our updated local copy back to the container
4. And lastly, we'll tell Solr to reload the core's configuration so the changes take effect.

To copy `synonyms.txt` from the container to our machine we'll issue the following command:

```
$ docker cp solr-container:/var/solr/data/bibdata/conf/synonyms.txt .
$ ls

#
# drwxr-xr-x   3 user-id  staff    96 Jan 16 18:02 .
# drwxr-xr-x  51 user-id  staff  1632 Jan 12 20:10 ..
# -rw-r--r--@  1 user-id  staff  1124 Nov 11 02:31 synonyms.txt
#
```

We can view the contents of the file with a command as follows:

```
$ cat synonyms.txt

#
# will include a few lines including
#
# GB,gib,gigabyte,gigabytes
# Television, Televisions, TV, TVs
#
```

Let's edit this file with whatever editor you are comfortable with.
Our goal is to add a new line to make `20th` and `twentieth` synonyms; we can do it like this:

```
$ echo "20th,twentieth" >> synonyms.txt
```

Now that we have updated our local copy of the synonyms file we need to copy this new version back to the Docker container. We can do this with a command like this:

```
$ docker cp synonyms.txt solr-container:/var/solr/data/bibdata/conf/
```

If we refresh the page http://localhost:8983/solr/#/bibdata/files?file=synonyms.txt in our browser we should see that the new line has been added to the `synonyms.txt` file. However, we *must reload our core* for the changes to take effect. We can do this as follows:

```
$ curl 'http://localhost:8983/solr/admin/cores?action=RELOAD&core=bibdata'

# response will look similar to this
# {
#   "responseHeader":{
#     "status":0,
#     "QTime":221}}
#
```

You can also reload the core via the [Solr Admin](http://localhost:8983/solr/#/) page. Select "Core Admin", then "bibdata", and click "Reload".

If you run the queries again they will both report 106 results, regardless of whether you search for `q=title_txt_en:"twentieth century"` or `q=title_txt_en:"20th century"`:

```
$ curl 'http://localhost:8983/solr/bibdata/select?fl=id,title_txt_en&q=title_txt_en:"twentieth+century"'

#
# result will include 106 results
# (84 with "twentieth century" plus 22 with "20th century")
#
```

To find more about synonyms take a look at this [blog post](https://library.brown.edu/create/digitaltechnologies/using-synonyms-in-solr/) where I talk about the different ways of adding synonyms, how to test them in the Solr Admin tool, and the differences between applying synonyms at index time versus query time.


## Core-specific configuration

One of the most important configuration files for a Solr core is `solrconfig.xml`, located in the configuration folder for the core. We can view the content of this file for our `bibdata` core at this URL: http://localhost:8983/solr/#/bibdata/files?file=solrconfig.xml

A default `solrconfig.xml` file is about 1100 lines of heavily documented XML. We won't need to make changes to most of the content of this file, but there are a couple of areas that are worth knowing about: request handlers and search components.

**Note:** Despite its name, file `solrconfig.xml` controls the configuration *for our core*, not for the entire Solr installation. Each core has its own `solrconfig.xml` file.

To make things easier for the rest of this section let's download two copies of this file to our local machine:

```
$ docker cp solr-container:/var/solr/data/bibdata/conf/solrconfig.xml solrconfig.xml
$ docker cp solr-container:/var/solr/data/bibdata/conf/solrconfig.xml solrconfig.bak
$ ls

#
# drwxr-xr-x   4 user-id  staff    128 Jan 16 18:19 .
# drwxr-xr-x  51 user-id  staff   1632 Jan 12 20:10 ..
# -rw-r--r--@  1 user-id  staff  47746 Jan 16 18:36 solrconfig.bak
# -rw-r--r--@  1 user-id  staff  47746 Jan 16 18:36 solrconfig.xml
# -rw-r--r--@  1 user-id  staff   1151 Jan 16 18:12 synonyms.txt
#
```

`solrconfig.xml` is the file that we will be working on.
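
Since we kept a pristine copy, at any point we can check exactly what we have changed with a plain `diff` (a quick sketch):

```
$ diff solrconfig.bak solrconfig.xml
```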
Like with the `synonyms.txt` file before, the general workflow will be to make changes to this local version of the file, copy the updated file to the Docker container, and reload the Solr core to pick up the changes.

`solrconfig.bak` on the other hand is just a backup, in case we mess up `solrconfig.xml` and need to go back to a well-known state.


### Request Handlers

When we submit a request to Solr the request is processed by a request handler. Throughout this tutorial all our queries to Solr have gone to a URL that ends with `/select`, for example:

```
$ curl 'http://localhost:8983/solr/bibdata/select?q=*'
```

The `/select` in the URL points to a request handler defined in `solrconfig.xml`. If we look at the content of this file we'll notice (around line 733) a definition that looks like the one below; notice the `"/select"` in this request handler definition:

```
# <requestHandler name="/select" class="solr.SearchHandler">
#   <lst name="defaults">
#     <str name="echoParams">explicit</str>
#     <int name="rows">10</int>
#   </lst>
# </requestHandler>
#
```

We can make changes to this section to indicate that we want to use the eDisMax query parser (`defType`) by default and set the default query fields (`qf`) to title and author. To do so we could update the "defaults" section as follows:

```
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">10</int>
    <str name="defType">edismax</str>
    <str name="qf">title_txt_en authors_all_txts_en</str>
  </lst>
</requestHandler>
```

We need to copy our updated file back to the Docker container and reload the core for the changes to take effect. Let's do this with the following commands:

```
$ docker cp solrconfig.xml solr-container:/var/solr/data/bibdata/conf/
$ curl 'http://localhost:8983/solr/admin/cores?action=RELOAD&core=bibdata'
```

Notice how now we can issue a much simpler query since we don't have to specify the `qf` or `defType` parameters in the URL:

```
$ curl 'http://localhost:8983/solr/bibdata/select?q=west'
```

Be careful: an incorrect setting in the `solrconfig.xml` file can take our core down or cause queries to give unexpected results. For example, entering the `qf` value as `title_txt_en, authors_all_txts_en` (notice the comma to separate the fields) will cause Solr to ignore this parameter.

The [Solr Reference Guide](https://solr.apache.org/guide/solr/latest/configuration-guide/requesthandlers-searchcomponents.html) has excellent documentation on what the values for a request handler mean and how we can configure them.


### Search Components

Request handlers in turn use *search components* to execute different operations on a search. Solr comes with several built-in [default search components](https://solr.apache.org/guide/solr/latest/configuration-guide/requesthandlers-searchcomponents.html#default-components) to implement faceting, highlighting, and spell checking, to name a few.

You can find the definition of the search components in `solrconfig.xml` by looking at the `searchComponent` elements defined in this file. For example, in our `solrconfig.xml` there is a section like this for the highlighting feature that we used before:

```
<searchComponent class="solr.HighlightComponent" name="highlight">
  <highlighting>
    ... lots of other properties are defined here...
    <formatter name="html"
               default="true"
               class="solr.highlight.HtmlFormatter">
      <lst name="defaults">
        <str name="hl.simple.pre"><![CDATA[<em>]]></str>
        <str name="hl.simple.post"><![CDATA[</em>]]></str>
      </lst>
    </formatter>
    ... lots of other properties are defined here...
```

Notice that the HTML tokens (`<em>` and `</em>`) that we saw in the highlighting results in the previous section are defined here.


### Spellchecker

Solr provides spellcheck functionality out of the box that we can use to help users when they misspell a word in their queries. For example, if a user searches for "Washingon" (notice the missing "t") most likely Solr will return zero results, but with the spellcheck turned on Solr is able to suggest the correct spelling for the query (i.e. "Washington").

In our current `bibdata` core a search for "Washingon" will return zero results:

```
$ curl 'http://localhost:8983/solr/bibdata/select?fl=id,title_txt_en&q=title_txt_en:washingon'

#
# response will indicate
# {
#   "responseHeader":{
#     "status":0,
#     "params":{
#       "q":"title_txt_en:washingon",
#       "fl":"id,title_txt_en"}},
#   "response":{"numFound":0,"start":0,"docs":[]
#   }}
#
```

Spellchecking is configured under the `/select` request handler in `solrconfig.xml`. To enable it we need to update the `defaults` settings and enable the `spellcheck` search component.

To do this let's edit our local `solrconfig.xml` and replace the `<requestHandler name="/select">` node again, but now with the following content:

```
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">10</int>
    <str name="defType">edismax</str>
    <str name="spellcheck">on</str>
    <str name="spellcheck.extendedResults">false</str>
    <str name="spellcheck.count">5</str>
    <str name="spellcheck.alternativeTermCount">2</str>
    <str name="spellcheck.maxResultsForSuggest">5</str>
    <str name="spellcheck.collate">true</str>
    <str name="spellcheck.collateExtendedResults">true</str>
    <str name="spellcheck.maxCollationTries">5</str>
    <str name="spellcheck.maxCollations">3</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>
```

Then we copy our updated version back to the Docker container and reload the core:

```
$ docker cp solrconfig.xml solr-container:/var/solr/data/bibdata/conf/
$ curl 'http://localhost:8983/solr/admin/cores?action=RELOAD&core=bibdata'
```

The `spellcheck` component indicated above is already defined in `solrconfig.xml` with the following defaults:

```
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <str name="queryAnalyzerFieldType">text_general</str>
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="field">_text_</str>
    <str name="classname">solr.DirectSolrSpellChecker</str>
    ...
  </lst>
</searchComponent>
```

books.json` to include only books published in 2001. The original MARC file has 250,000 books but the resulting file only includes 30,424 records.

`marcli` is a small utility program that I wrote in Go to parse MARC files. If you are interested in the part that generates the JSON out of the MARC records, take a look at the [processorSolr.go](https://github.com/hectorcorrea/marcli/blob/main/cmd/marcli/solr.go) file.


## Acknowledgements
I would like to thank my former team at the Brown University Library for their support and recommendations as I prepared the initial version of this tutorial back in 2017, as well as those that attended the workshop at the Code4Lib conference in Washington, DC in 2018 and San Jose, CA in 2019. A special thanks goes to [Birkin Diana](https://github.com/birkin/) for helping me run the workshop in 2018 and 2019 and for taking the time to review the materials (multiple times!) and painstakingly test each of the examples.

Likewise, a big thanks to [Bess Sadler](https://github.com/bess), [Carolyn Cole](https://github.com/carolyncole), [Francis Kayiwa](https://github.com/kayiwa), and [James Griffin](https://github.com/jrgriffiniii) from the Princeton University Library for helping me run the workshop at Code4Lib 2023.