# ElasticSearch in Python

## Table of Contents
- [Concepts](#concepts)
- [Connecting to Elasticsearch server in Python](#connecting-to-elasticsearch-server-in-python)
- [Creating index](#creating-index)
- [Data types](#data-types)
- [Basic Operations](#basic-operations)
- [Bulk API](#bulk-api)
- [Search API](#search-api)
- [SQL Search API](#sql-search-api)
- [Aggregations](#aggregations)
- [Filter context](#filter-context)
- [Embedding vectors in Elasticsearch](#embedding-vectors-in-elasticsearch)
- [Pagination](#pagination)
- [Ingest pipelines](#ingest-pipelines)
- [Ingest processors](#ingest-processors)
- [Index Lifecycle Management (ILM)](#index-lifecycle-management-ilm)
- [Analyzers](#analyzers)

### I highly recommend checking out [this video](https://youtu.be/a4HBKEda_F8) by freeCodeCamp. This notebook is based on their tutorial.

## Concepts

Elasticsearch is a distributed, RESTful search and analytics engine with a document-oriented model that uses an inverted index.

**Inverted index**: maps each word to the documents in which it occurs, unlike databases, which are forward indexed (document → words).

### Compared to SQL

| SQL | Elastic |
|-|-|
| Table | Index |
| Column | Field |
| Row | Document |

**Index**: a set of documents.
Every index has settings for shards and replicas.
- **Shard**: a fragment of an index; the index is split across its shards.
- **Replicas**: *read-only* copies of the index.
*Replicas improve availability and allow searches to run in parallel.*

**Cluster**: multiple nodes running Elasticsearch.

**Mapping**: when documents are inserted into indices, Elasticsearch tries to infer the data type of each field. This process is called mapping.
Mapping is done automatically, but we can also set it manually.

## Connecting to Elasticsearch server in Python
```python
from elasticsearch import Elasticsearch

es = Elasticsearch(url)
```

## Creating index

1. Simplest way

In this method the **mappings**, which define the structure of the documents, are inferred automatically.

```python
es.indices.create(index='index_name')
```
2. Specify the number of replicas and shards
```python
es.indices.create(
    index='name',
    settings={
        'index': {
            'number_of_shards': 1,
            'number_of_replicas': 2,
        }
    }
)
```
3. Specify the mappings *(code in the next section)*

## Data types

### Common types
- Binary: accepts a binary value as a **Base64**-encoded string
- Boolean: true/false
- Numbers: long, integer, short, byte, etc.
- Dates
- Keyword: IDs, email addresses, status codes, zip codes, etc.
*Keyword fields are not analyzed, so we can't search for parts of their text.*
### Objects (JSON)
- Object
- Flattened
  - efficient for deeply nested JSON objects
  - hierarchical structure is not preserved
- Nested
  - good for arrays of objects
  - maintains the relationships between the object fields

### Text search types
- Text
  - used for full-text content
  *Text fields are analyzed when indexed, so queries can match parts of the text.*
- Completion
  - search-as-you-type
- Annotated text

### Spatial data types
- Geo point (lat, lng)
- Geo shape (e.g. a list of points forming a shape)

### Dense vector
- Stores vectors of numeric values.
- **Dense** means the vector has few or no zero elements.
- This type **does not** support aggregations or sorting
- Nested vectors are **not** supported
- Use **kNN search** to retrieve the nearest vectors
- Max size is 4096
- Elasticsearch does not automatically infer the mapping for dense vectors

> [!NOTE]
> By default, Elasticsearch creates two data types for **strings**: the **text** type and the **keyword** type.
> With **text** we can search for individual words in the text (*full-text search*), but with **keyword** the whole search phrase must be present (*keyword search*).

#### Set mapping manually:
```python
mapping = {
    'properties': {
        'id': {
            'type': 'integer',
        },
        'title': {
            'type': 'text',
            'fields': {
                'keyword': {
                    'type': 'keyword',
                    'ignore_above': 256,
                },
            },
        },
        'text': {
            'type': 'text',
            'fields': {
                'keyword': {
                    'type': 'keyword',
                    'ignore_above': 256,
                },
            },
        },
    }
}

es.indices.create(index='name', mappings=mapping)
```

## Basic Operations

### Inserting documents

```python
documents = [
    {
        'title': 'title1',
        'text': 'this is a test document',
        'num': 1,
    },
    {
        'title': 'title2',
        'text': 'this is a test document',
        'num': 2,
    },
]

for doc in documents:
    es.index(index='index_name', document=doc)
```

### Deleting documents

```python
es.delete(index='index_name', id=0)
```
*If the **id** doesn't exist, the operation throws an error.*

### Getting documents

```python
es.get(index='index_name', id=0)
```
*If the **id** doesn't exist, the operation throws an error.*

### Counting documents

```python
res = es.count(index='index_name')
print(res['count'])

query = {
    'range': {
        'id': {
            'gt': 0,
        },
    },
}
res = es.count(index='index_name', query=query)
print(res['count'])
```

### Updating documents
1. If the document exists

The update follows these steps:
1. Get the document
2. Update it (using a script)
3. Re-index the result

```python
# update fields using a script
es.update(
    index='index_name',
    id=id,
    script={
        'source': 'ctx._source.title = params.title',
        'params': {
            'title': 'New title',
        },
    }
)

# add a new field using doc
es.update(
    index='index_name',
    id=id,
    doc={'new_field': 'value'}
)

# remove fields using a script
es.update(
    index='index_name',
    id=id,
    script={
        'source': 'ctx._source.remove("field")'
    }
)
```
2. If the document doesn't exist

The update operation can create the document if **doc_as_upsert** is set to **True**:

```python
res = es.update(
    index='index_name',
    id='new_id',
    doc={
        'book_id': 1,
        'book_name': 'A book',
    },
    doc_as_upsert=True,
)
```

### Exists API
1. Check index
```python
res = es.indices.exists(index='index_name')
print(res.body) # True/False
```
2. Check document (with a specific id)
```python
res = es.exists(index='index_name', id=doc_id)
print(res.body)
```

## Bulk API

The bulk API performs multiple operations in one API call. This increases indexing speed.
The bulk API consists of two parts:
- Action
  *The action can be one of:*
  - index (creates or updates a document)
  - create (creates a document only if it doesn't exist)
  - update (updates a document only if it already exists)
  - delete
- Source (the data)

```python
res = es.bulk(
    operations=[
        # action (index)
        {
            'index': {
                '_index': 'index_name',
                '_id': 1,
            },
        },
        # source
        {
            'field1': 'value1',
        },
        # action (update)
        {
            'update': {
                '_index': 'index_name',
                '_id': 2,
            },
        },
        # source (update expects the fields wrapped in 'doc')
        {
            'doc': {'field1': 'value2'},
        },
        # action (delete doesn't require a source)
        {
            'delete': {
                '_index': 'index_name',
                '_id': 3,
            }
        },
    ]
)

# checking for operation errors
if res.body['errors']:
    print('Error occurred')
```

## Search API

The search API offers several arguments:
- index: where to search
- q: used for simple searches; uses **Lucene** syntax
- query: used for complex, structured queries; uses the **DSL**
- timeout: max time to wait for the search
- size: number of results (default is 10)
- from: search starting point (for pagination)

### Simple query
```python
# simple query:
es.search(
    index='index*',
    body={
        'query': {'match_all': {}}
    }
)

# this call does the same:
es.search(index='index*')
```

### DSL
The DSL (Domain-Specific Language) consists of two types of clauses:
- Leaf clauses
  - **match**
    Full-text search; returns documents that match a provided text, number, date, or boolean. The field must be of type **text**.
  - **term**
    Returns documents that contain an exact term.
    The field must be of type **keyword** or numeric.
  - **range**
    Returns documents that contain values within a provided range (gt, lt, gte, ...).
- Compound clauses (bool)
  Combine multiple queries using these boolean clauses:
  - **must**: all queries must match (they contribute to the score)
  - **should**: queries are optional
  - **must_not**: none of the queries may match
  - **filter**: all queries must match, but without affecting the score

```python
# Leaf
es.search(
    index='index*',
    body={
        'query': {
            'match': {
                'field_name': 'value',
            },
        },
        'size': 5,
        'from': 10,
    }
)

# Compound
es.search(
    index='index_name',
    body={
        'query': {
            'bool': {
                'must': [
                    {
                        'match': {
                            'field_name': 'value',
                        },
                    },
                    {
                        'range': {
                            'field_name': {
                                'gt': 0,
                                'lte': 10,
                            }
                        }
                    },
                ],
                'filter': [
                    {
                        'term': {
                            'field_name': 'value',
                        },
                    },
                ],
            },
        },
    }
)
```

## SQL Search API

As an alternative to DSL queries, we can also use SQL queries to search documents in Elasticsearch.

The SQL API supports the following parameters:
- delimiter
- cursor
- format
- filter
- fetch_size
- etc.

Supported formats are:
- txt
- csv
- json
- yaml
- binary
- etc.
```python
res = es.sql.query(
    format='txt',
    query='SELECT * FROM index ORDER BY field DESC LIMIT 10',
    filter={
        'range': {
            'field': {
                'gte': 10,
            },
        },
    },
    fetch_size=5,
)
```

We can also convert an SQL query to DSL using **translate**:

```python
es.sql.translate(
    query='SELECT * FROM index',
)
```

## Aggregations

An **aggregation** performs a calculation on the data (avg, min, max, count, sum).

```python
res = es.search(
    index='index',
    body={
        'query': {
            'match_all': {}
        },
        'aggs': {
            'avg_agg': {
                'avg': {
                    'field': 'age',
                }
            }
        }
    }
)

print(res['aggregations']['avg_agg']['value'])
```

## Filter context

When searching in Elasticsearch, we can use either the **query context** or the **filter context**.

- Query: (*How well does this document match this query clause?*)
  Elasticsearch generates a **score** for each result.

- Filter: (*Does this document match this query clause?*)
  Elasticsearch answers with **yes/no**.

Filters execute **faster** and require **less processing** than queries, and their results can be cached.

## Embedding vectors in Elasticsearch
**Embedding** transforms text into numerical vectors.
Deep learning models are used to embed documents; these models preserve the meaning of the text.

The **dense vector** data type can be used to store embedding vectors in Elasticsearch,
and we can use the **kNN algorithm** to search for similar vectors.
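To make the idea concrete, here is a brute-force sketch of kNN retrieval in plain Python (this is an illustration, not the Elasticsearch API; the function names are made up): rank every stored vector by cosine similarity to the query vector and keep the top `k`.

```python
import math

def cosine_similarity(a, b):
    # similarity of two equal-length numeric vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def knn_search(query_vector, stored_vectors, k):
    # rank all stored vectors by similarity to the query and keep the top k
    ranked = sorted(stored_vectors,
                    key=lambda v: cosine_similarity(query_vector, v),
                    reverse=True)
    return ranked[:k]

print(knn_search([1.0, 0.0], [[0.0, 1.0], [1.0, 0.1], [1.0, 0.0]], k=2))
# [[1.0, 0.0], [1.0, 0.1]]
```

Elasticsearch avoids this linear scan by indexing dense vectors in an HNSW graph and running an approximate nearest-neighbor search instead.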
```python
# creating a field called 'embedding' of type dense_vector
es.indices.create(
    index='index_name',
    mappings={
        'properties': {
            'embedding': {
                'type': 'dense_vector',
            }
        }
    }
)

# retrieving documents similar to the 'embedded_query' vector
res = es.search(
    index='index_name',
    knn={
        'field': 'embedding',
        'query_vector': embedded_query,
        'num_candidates': 50, # candidates considered per shard before picking the top k
        'k': 10, # number of nearest neighbors to return
    }
)

print(res['hits']['hits'])
```

## Pagination

In Elasticsearch, there are two ways to paginate:
1. **from/size**
   - commonly used for small result sets
   - limited to 10,000 results by default
   - high memory usage for deep pages

2. **search_after**
   - documents must have a sortable field (numeric, date)
   - efficient for large datasets

```python
# from/size example
es.search(
    index='index_name',
    body={
        'from': 0, # for the next pages, set from to 10, 20, ...
        'size': 10,
        'sort': [
            {'field1': 'desc'},
        ],
    }
)

# search_after example
# we pass the sort values of the last doc of the previous page as the search_after parameter
last_doc = res['hits']['hits'][-1]['sort']

es.search(
    index='index_name',
    body={
        'size': 10,
        'sort': [
            {'field1': 'desc'},
        ],
        'search_after': last_doc,
    }
)
```

## Ingest pipelines

With pipelines, we can transform data before indexing it.

Some common transformations: remove fields, lowercase text, remove HTML tags, etc.
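Conceptually, an ingest pipeline is just an ordered list of processors, each transforming the document before it reaches the index. A toy plain-Python illustration of that idea (the field names here are made up for the example):

```python
def remove_field(doc, field):
    # drop one field, like the 'remove' processor
    return {k: v for k, v in doc.items() if k != field}

def lowercase_field(doc, field):
    # lowercase one string field, like the 'lowercase' processor
    return {**doc, field: doc[field].lower()}

def run_pipeline(doc, processors):
    # apply each processor in order, feeding its output to the next
    for processor in processors:
        doc = processor(doc)
    return doc

pipeline = [
    lambda doc: remove_field(doc, 'internal_id'),
    lambda doc: lowercase_field(doc, 'title'),
]

print(run_pipeline({'title': 'Hello World', 'internal_id': 42}, pipeline))
# {'title': 'hello world'}
```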
*The different pipeline processors are discussed in the next section.*

```python
# creating the pipeline
es.ingest.put_pipeline(
    id='pipeline_name',
    description='desc',
    processors=[
        {
            'set': {
                'description': 'desc',
                'field': 'field_name',
                'value': 'val',
            },
        },
        {
            'lowercase': {
                'field': 'field_name',
            },
        }
    ]
)

# using the pipeline
es.bulk(operations=ops, pipeline='pipeline_name')
es.index(index='index_name', document=doc, pipeline='pipeline_name')
```
We can also test a pipeline (with a specified id) on mock documents using **simulate**:
```python
res = es.ingest.simulate(
    id='pipeline_name',
    docs=[
        {
            '_index': 'index',
            '_id': 'id1',
            '_source': {
                'key': 'val',
            }
        },
        {
            '_index': 'index',
            '_id': 'id2',
            '_source': {
                'key': 'val',
            }
        }
    ]
)
```

Pipelines can fail. We can either **ignore** or **handle** failures.
If we ignore a failure, the pipeline skips the failed step.

```python
es.ingest.put_pipeline(
    id='id',
    processors=[
        {
            'rename': {
                'description': 'desc',
                'field': 'field_name',
                'target_field': 'new_field_name', # rename requires a target field
                'ignore_failure': False, # handling failure with on_failure instead
                'on_failure': [
                    {
                        'set': {
                            'field': 'error',
                            'value': "can't rename",
                            'ignore_failure': True, # ignoring failure
                        }
                    }
                ]
            }
        }
    ]
)
```

## Ingest processors

Processors are organized into 5 categories:

- Data enrichment (append, inference, attachment, etc.)
- Data filtering (drop, remove)
- Array/JSON handling (foreach, json, sort)
- Pipeline handling (fail, pipeline)
- Data transformation (convert, rename, set, lowercase/uppercase, trim, split, etc.)

[More info in the official documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/processors.html)

## Index Lifecycle Management (ILM)

ILM automates the **rollover** and **management** of indices.
It helps with storage optimization, automated data retention, efficient management of index size, etc.

**Rollover** is the process where the current index becomes *read-only* and new documents are written to a new index.

The main ILM phases are:
- Hot phase: the index is frequently updated and queried
- Warm phase: less frequently accessed data
- Cold phase: archived data for long-term storage
- Delete phase: data older than a threshold is deleted

```python
policy = {
    'phases': {
        'hot': {
            'actions': {
                'rollover': {
                    'max_age': '30d',
                }
            }
        },
        'delete': {
            'min_age': '90d',
            'actions': {
                'delete': {}
            }
        }
    }
}

es.ilm.put_lifecycle(name='policy_name', policy=policy)
```

## Analyzers

Analyzers process text during indexing and searching; they turn text into tokens.

Analyzers consist of 3 components:
- Character filters (optional)
  Some common filters: ['html_strip', 'mapping']

- Tokenizer (exactly one)
  Some common tokenizers: ['standard', 'lowercase', 'whitespace']

- Token filters (optional)
  Some common filters: ['apostrophe', 'decimal_digit', 'reverse', 'synonym']

Elasticsearch provides ready-made analyzers for processing text in various ways.
Each built-in analyzer is designed for a specific type of data.

Here are some common analyzers:
- **Standard analyzer**:
  - no character filter
  - standard tokenizer
  - lowercase filter & stop filter (the stop filter is disabled by default)
- **Simple analyzer**:
  - no character filter
  - lowercase tokenizer
  - no token filter
- **Whitespace analyzer**:
  - no character filter
  - whitespace tokenizer
  - no token filter
- **Stop analyzer**:
  - no character filter
  - lowercase tokenizer
  - stop filter

```python
# using built-in analyzers
res = es.indices.analyze(
    analyzer='standard',
    text='text to analyze'
)
tokens = res.body['tokens']

# defining a custom analyzer
settings = {
    'settings': {
        'analysis': {
            'char_filter': { # defining a character filter
                'ampersand_replace': {
                    'type': 'mapping',
                    'mappings': ['& => and'],
                },
            },
            'filter': { # defining a token filter
                'synonym_filter': {
                    'type': 'synonym',
                    'synonyms': [
                        'car, vehicle',
                        'tv, television',
                    ]
                }
            },
            'analyzer': { # defining the custom analyzer
                'custom_analyzer': {
                    'type': 'custom',
                    'char_filter': ['html_strip', 'ampersand_replace'],
                    'tokenizer': 'standard',
                    'filter': ['lowercase', 'synonym_filter']
                }
            }
        }
    },
    'mappings': {
        'properties': {
            'text_field': {
                'type': 'text',
                'analyzer': 'custom_analyzer', # using the analyzer on a field
            }
        }
    }
}

es.indices.create(index='index-name', body=settings)
```
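As a rough plain-Python illustration (not Elasticsearch itself) of what such a custom analyzer does to a string — strip HTML, map `&` to `and`, tokenize, lowercase (synonym expansion omitted):

```python
import re

def simulate_custom_analyzer(text):
    # character filters: strip HTML tags, then map '&' to 'and'
    text = re.sub(r'<[^>]+>', ' ', text)
    text = text.replace('&', 'and')
    # tokenizer: split on non-word characters (roughly like 'standard')
    tokens = re.findall(r'\w+', text)
    # token filter: lowercase (the synonym filter is omitted here)
    return [token.lower() for token in tokens]

print(simulate_custom_analyzer('<b>Cars & Televisions</b>'))
# ['cars', 'and', 'televisions']
```

The real analyzer additionally expands synonyms, so a search for "vehicle" would also match documents containing "car".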