# ElasticSearch in Python

## Table of Contents
- [Concepts](#concepts)
- [Connecting to Elasticsearch server in Python](#connecting-to-elasticsearch-server-in-python)
- [Creating index](#creating-index)
- [Data types](#data-types)
- [Basic Operations](#basic-operations)
- [Bulk API](#bulk-api)
- [Search API](#search-api)
- [SQL Search API](#sql-search-api)
- [Aggregations](#aggregations)
- [Filter context](#filter-context)
- [Embedding vectors in Elasticsearch](#embedding-vectors-in-elasticsearch)
- [Pagination](#pagination)
- [Ingest pipelines](#ingest-pipelines)
- [Ingest processors](#ingest-processors)
- [Index Lifecycle Management (ILM)](#index-lifecycle-management-ilm)
- [Analyzers](#analyzers)

### I highly recommend checking out [this video](https://youtu.be/a4HBKEda_F8) by freeCodeCamp. This notebook is based on their tutorial.

## Concepts

Elasticsearch is a distributed, RESTful search and analytics engine with a document-oriented model that uses an inverted index.

**Inverted index**: maps each word to the documents in which it occurs, unlike databases, which are forward indexed (document → words).

### Compared to SQL

| SQL | Elastic |
|-|-|
| Table | Index |
| Column | Field |
| Row | Document |

**Index**: a set of documents.
Every index has settings for shards and replicas.
- **Shard**: a fragment of an index; the index is split across its shards.
- **Replicas**: *read-only* copies of the index.
*Replicas improve availability and allow searches to run in parallel.*

**Cluster**: multiple nodes running Elasticsearch.

**Mapping**: when documents are inserted into indices, Elasticsearch tries to infer the data type of each field. This process is called mapping.
Mapping is done automatically, but we can also set it manually.

## Connecting to Elasticsearch server in Python
```python
from elasticsearch import Elasticsearch

es = Elasticsearch(url)
```

## Creating index

1. Simplest way

In this method the **mappings**, which define the structure of the documents, are inferred automatically.

```python
es.indices.create(index='index_name')
```
2. Specify the number of replicas and shards
```python
es.indices.create(
    index='name',
    settings={
        'index': {
            'number_of_shards': 1,
            'number_of_replicas': 2,
        }
    }
)
```
3. Specify the mappings *(code in the next section)*

## Data types

### Common types
- Binary: accepts a binary value as a **Base64**-encoded string
- Boolean: true/false
- Numbers: long, integer, short, byte, etc.
- Dates
- Keyword: IDs, email addresses, status codes, zip codes, etc.
*Keyword fields are not analyzed, so we can't search for parts of their text.*
### Objects (JSON)
- Object
- Flattened
  - efficient for deeply nested JSON objects
  - hierarchical structure is not preserved
- Nested
  - good for arrays of objects
  - maintains the relationships between the object fields

### Text search types
- Text
  - used for full-text content
  *Text fields are analyzed when indexed, so queries can match parts of the text.*
- Completion
  - search-as-you-type
- Annotated text

### Spatial data types
- Geo point (lat, lng)
- Geo shape (e.g. a list of points forming a shape)

### Dense vector
- Stores vectors of numeric values.
- **Dense** means the vector has few or no zero elements.
- This type **does not** support aggregations or sorting
- Nested vectors are **not** supported
- Use **kNN search** to retrieve the nearest vectors
- Max size is 4096
- Elasticsearch does not automatically infer the mapping for dense vectors

> [!NOTE]
> By default, Elasticsearch creates two data types for **strings**: the **text** type and the **keyword** type.
> With **text** we can search for individual words in the text (*full-text search*), but with **keyword** the whole search phrase must be present (*keyword search*).

#### Set mapping manually:
```python
mapping = {
    'properties': {
        'id': {
            'type': 'integer',
        },
        'title': {
            'type': 'text',
            'fields': {
                'keyword': {
                    'type': 'keyword',
                    'ignore_above': 256,
                },
            },
        },
        'text': {
            'type': 'text',
            'fields': {
                'keyword': {
                    'type': 'keyword',
                    'ignore_above': 256,
                },
            },
        },
    }
}

es.indices.create(index='name', mappings=mapping)
```

## Basic Operations

### Inserting documents

```python
documents = [
    {
        'title': 'title1',
        'text': 'this is a test document',
        'num': 1,
    },
    {
        'title': 'title2',
        'text': 'this is a test document',
        'num': 2,
    },
]

for doc in documents:
    es.index(index='index_name', document=doc)
```

### Deleting documents

```python
es.delete(index='index_name', id=0)
```
*If the **id** doesn't exist, the operation throws an error.*

### Getting documents

```python
es.get(index='index_name', id=0)
```
*If the **id** doesn't exist, the operation throws an error.*

### Counting documents

```python
res = es.count(index='index_name')
print(res['count'])

query = {
    'range': {
        'id': {
            'gt': 0,
        },
    },
}
res = es.count(index='index_name', query=query)
print(res['count'])
```

### Updating documents
1. If the document exists

The update follows these steps:
1. Get the document
2. Update it (using a script)
3. Re-index the result

```python
# update fields using a script
es.update(
    index='index_name',
    id=id,
    script={
        'source': 'ctx._source.title = params.title',
        'params': {
            'title': 'New title',
        },
    }
)

# add a new field using doc
es.update(
    index='index_name',
    id=id,
    doc={'new_field': 'value'}
)

# remove fields using a script
es.update(
    index='index_name',
    id=id,
    script={
        'source': 'ctx._source.remove("field")'
    }
)
```
2. If the document doesn't exist

The update operation can create the document if **doc_as_upsert** is set to **True**:

```python
res = es.update(
    index='index_name',
    id='new_id',
    doc={
        'book_id': 1,
        'book_name': 'A book',
    },
    doc_as_upsert=True,
)
```

### Exists API
1. Check index
```python
res = es.indices.exists(index='index_name')
print(res.body) # True/False
```
2. Check document (with a specific id)
```python
res = es.exists(index='index_name', id=doc_id)
print(res.body)
```

## Bulk API

The bulk API performs multiple operations in one API call. This increases indexing speed.
The bulk API consists of two parts:
- Action
  *The action can be one of:*
  - index (creates or updates a document)
  - create (creates a document only if it doesn't exist)
  - update (updates a document only if it already exists)
  - delete
- Source (the data)

```python
res = es.bulk(
    operations=[
        # action (index)
        {
            'index': {
                '_index': 'index_name',
                '_id': 1,
            },
        },
        # source
        {
            'field1': 'value1',
        },
        # action (update)
        {
            'update': {
                '_index': 'index_name',
                '_id': 2,
            },
        },
        # source (update expects the fields wrapped in 'doc')
        {
            'doc': {'field1': 'value2'},
        },
        # action (delete doesn't require a source)
        {
            'delete': {
                '_index': 'index_name',
                '_id': 3,
            }
        },
    ]
)

# checking for operation errors
if res.body['errors']:
    print('Error occurred')
```

## Search API

The search API offers several arguments:
- index: where to search
- q: used for simple searches; uses **Lucene** syntax
- query: used for complex, structured queries; uses the **DSL**
- timeout: max time to wait for the search
- size: number of results (default is 10)
- from: search starting point (for pagination)

### Simple query
```python
# simple query:
es.search(
    index='index*',
    body={
        'query': {'match_all': {}}
    }
)

# this call does the same:
es.search(index='index*')
```

### DSL
The DSL (Domain-Specific Language) consists of two types of clauses:
- Leaf clauses
  - **match**
    Full-text search; returns documents that match a provided text, number, date, or boolean. The field must be of type **text**.
  - **term**
    Returns documents that contain an exact term.
    The field must be of type **keyword** or numeric.
  - **range**
    Returns documents that contain values within a provided range (gt, lt, gte, ...).
- Compound clauses (bool)
  Combine multiple queries using these boolean clauses:
  - **must**: all queries must match (they contribute to the score)
  - **should**: queries are optional
  - **must_not**: none of the queries may match
  - **filter**: all queries must match, but without affecting the score

```python
# Leaf
es.search(
    index='index*',
    body={
        'query': {
            'match': {
                'field_name': 'value',
            },
        },
        'size': 5,
        'from': 10,
    }
)

# Compound
es.search(
    index='index_name',
    body={
        'query': {
            'bool': {
                'must': [
                    {
                        'match': {
                            'field_name': 'value',
                        },
                    },
                    {
                        'range': {
                            'field_name': {
                                'gt': 0,
                                'lte': 10,
                            }
                        }
                    },
                ],
                'filter': [
                    {
                        'term': {
                            'field_name': 'value',
                        },
                    },
                ],
            },
        },
    }
)
```

## SQL Search API

As an alternative to DSL queries, we can also use SQL queries to search documents in Elasticsearch.

The SQL API supports the following parameters:
- delimiter
- cursor
- format
- filter
- fetch_size
- etc.

Supported formats are:
- txt
- csv
- json
- yaml
- binary
- etc.
```python
res = es.sql.query(
    format='txt',
    query='SELECT * FROM index ORDER BY field DESC LIMIT 10',
    filter={
        'range': {
            'field': {
                'gte': 10,
            },
        },
    },
    fetch_size=5,
)
```

We can also convert an SQL query to DSL using **translate**:

```python
es.sql.translate(
    query='SELECT * FROM index',
)
```

## Aggregations

An **aggregation** performs a calculation on the data (avg, min, max, count, sum).

```python
res = es.search(
    index='index',
    body={
        'query': {
            'match_all': {}
        },
        'aggs': {
            'avg_agg': {
                'avg': {
                    'field': 'age',
                }
            }
        }
    }
)

print(res['aggregations']['avg_agg']['value'])
```

## Filter context

When searching in Elasticsearch, we can use either the **query context** or the **filter context**.

- Query: (*How well does this document match this query clause?*)
  Elasticsearch generates a **score** for each result.

- Filter: (*Does this document match this query clause?*)
  Elasticsearch answers with **yes/no**.

Filters execute **faster** and require **less processing** than queries, and their results can be cached.

## Embedding vectors in Elasticsearch
**Embedding** transforms text into numerical vectors.
Deep learning models are used to embed documents; these models preserve the meaning of the text.

The **dense vector** data type can be used to store embedding vectors in Elasticsearch,
and we can use the **kNN algorithm** to search for similar vectors.
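To make the idea concrete, here is a brute-force sketch of kNN retrieval in plain Python (this is an illustration, not the Elasticsearch API; the function names are made up): rank every stored vector by cosine similarity to the query vector and keep the top `k`.

```python
import math

def cosine_similarity(a, b):
    # similarity of two equal-length numeric vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def knn_search(query_vector, stored_vectors, k):
    # rank all stored vectors by similarity to the query and keep the top k
    ranked = sorted(stored_vectors,
                    key=lambda v: cosine_similarity(query_vector, v),
                    reverse=True)
    return ranked[:k]

print(knn_search([1.0, 0.0], [[0.0, 1.0], [1.0, 0.1], [1.0, 0.0]], k=2))
# [[1.0, 0.0], [1.0, 0.1]]
```

Elasticsearch avoids this linear scan by indexing dense vectors in an HNSW graph and running an approximate nearest-neighbor search instead.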
```python
# creating a field called 'embedding' of type dense_vector
es.indices.create(
    index='index_name',
    mappings={
        'properties': {
            'embedding': {
                'type': 'dense_vector',
            }
        }
    }
)

# retrieving documents similar to the 'embedded_query' vector
res = es.search(
    index='index_name',
    knn={
        'field': 'embedding',
        'query_vector': embedded_query,
        'num_candidates': 50, # candidates considered per shard before picking the top k
        'k': 10, # number of nearest neighbors to return
    }
)

print(res['hits']['hits'])
```

## Pagination

In Elasticsearch, there are two ways to paginate:
1. **from/size**
   - commonly used for small result sets
   - limited to 10,000 results by default
   - high memory usage for deep pages

2. **search_after**
   - documents must have a sortable field (numeric, date)
   - efficient for large datasets

```python
# from/size example
es.search(
    index='index_name',
    body={
        'from': 0, # for the next pages, set from to 10, 20, ...
        'size': 10,
        'sort': [
            {'field1': 'desc'},
        ],
    }
)

# search_after example
# we pass the sort values of the last doc of the previous page as the search_after parameter
last_doc = res['hits']['hits'][-1]['sort']

es.search(
    index='index_name',
    body={
        'size': 10,
        'sort': [
            {'field1': 'desc'},
        ],
        'search_after': last_doc,
    }
)
```

## Ingest pipelines

With pipelines, we can transform data before indexing it.

Some common transformations: remove fields, lowercase text, remove HTML tags, etc.
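Conceptually, an ingest pipeline is just an ordered list of processors, each transforming the document before it reaches the index. A toy plain-Python illustration of that idea (the field names here are made up for the example):

```python
def remove_field(doc, field):
    # drop one field, like the 'remove' processor
    return {k: v for k, v in doc.items() if k != field}

def lowercase_field(doc, field):
    # lowercase one string field, like the 'lowercase' processor
    return {**doc, field: doc[field].lower()}

def run_pipeline(doc, processors):
    # apply each processor in order, feeding its output to the next
    for processor in processors:
        doc = processor(doc)
    return doc

pipeline = [
    lambda doc: remove_field(doc, 'internal_id'),
    lambda doc: lowercase_field(doc, 'title'),
]

print(run_pipeline({'title': 'Hello World', 'internal_id': 42}, pipeline))
# {'title': 'hello world'}
```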
*The different pipeline processors are discussed in the next section.*

```python
# creating the pipeline
es.ingest.put_pipeline(
    id='pipeline_name',
    description='desc',
    processors=[
        {
            'set': {
                'description': 'desc',
                'field': 'field_name',
                'value': 'val',
            },
        },
        {
            'lowercase': {
                'field': 'field_name',
            },
        }
    ]
)

# using the pipeline
es.bulk(operations=ops, pipeline='pipeline_name')
es.index(index='index_name', document=doc, pipeline='pipeline_name')
```
We can also test a pipeline (with a specified id) on mock documents using **simulate**:
```python
res = es.ingest.simulate(
    id='pipeline_name',
    docs=[
        {
            '_index': 'index',
            '_id': 'id1',
            '_source': {
                'key': 'val',
            }
        },
        {
            '_index': 'index',
            '_id': 'id2',
            '_source': {
                'key': 'val',
            }
        }
    ]
)
```

Pipelines can fail. We can either **ignore** or **handle** failures.
If we ignore a failure, the pipeline skips the failed step.

```python
es.ingest.put_pipeline(
    id='id',
    processors=[
        {
            'rename': {
                'description': 'desc',
                'field': 'field_name',
                'target_field': 'new_field_name', # rename requires a target field
                'ignore_failure': False, # handling failure with on_failure instead
                'on_failure': [
                    {
                        'set': {
                            'field': 'error',
                            'value': "can't rename",
                            'ignore_failure': True, # ignoring failure
                        }
                    }
                ]
            }
        }
    ]
)
```

## Ingest processors

Processors are organized into 5 categories:

- Data enrichment (append, inference, attachment, etc.)
- Data filtering (drop, remove)
- Array/JSON handling (foreach, json, sort)
- Pipeline handling (fail, pipeline)
- Data transformation (convert, rename, set, lowercase/uppercase, trim, split, etc.)

[More info in the official documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/processors.html)

## Index Lifecycle Management (ILM)

ILM automates the **rollover** and **management** of indices.
It helps with storage optimization, automated data retention, efficient management of index size, etc.

**Rollover** is the process where the current index becomes *read-only* and new documents are written to a new index.

The main ILM phases are:
- Hot phase: the index is frequently updated and queried
- Warm phase: less frequently accessed data
- Cold phase: archived data for long-term storage
- Delete phase: data older than a threshold is deleted

```python
policy = {
    'phases': {
        'hot': {
            'actions': {
                'rollover': {
                    'max_age': '30d',
                }
            }
        },
        'delete': {
            'min_age': '90d',
            'actions': {
                'delete': {}
            }
        }
    }
}

es.ilm.put_lifecycle(name='policy_name', policy=policy)
```

## Analyzers

Analyzers process text during indexing and searching; they turn text into tokens.

Analyzers consist of 3 components:
- Character filters (optional)
  Some common filters: ['html_strip', 'mapping']

- Tokenizer (exactly one)
  Some common tokenizers: ['standard', 'lowercase', 'whitespace']

- Token filters (optional)
  Some common filters: ['apostrophe', 'decimal_digit', 'reverse', 'synonym']

Elasticsearch provides ready-made analyzers for processing text in various ways.
Each built-in analyzer is designed for a specific type of data.

Here are some common analyzers:
- **Standard analyzer**:
  - no character filter
  - standard tokenizer
  - lowercase filter & stop filter (the stop filter is disabled by default)
- **Simple analyzer**:
  - no character filter
  - lowercase tokenizer
  - no token filter
- **Whitespace analyzer**:
  - no character filter
  - whitespace tokenizer
  - no token filter
- **Stop analyzer**:
  - no character filter
  - lowercase tokenizer
  - stop filter

```python
# using built-in analyzers
res = es.indices.analyze(
    analyzer='standard',
    text='text to analyze'
)
tokens = res.body['tokens']

# defining a custom analyzer
settings = {
    'settings': {
        'analysis': {
            'char_filter': { # defining a character filter
                'ampersand_replace': {
                    'type': 'mapping',
                    'mappings': ['& => and'],
                },
            },
            'filter': { # defining a token filter
                'synonym_filter': {
                    'type': 'synonym',
                    'synonyms': [
                        'car, vehicle',
                        'tv, television',
                    ]
                }
            },
            'analyzer': { # defining the custom analyzer
                'custom_analyzer': {
                    'type': 'custom',
                    'char_filter': ['html_strip', 'ampersand_replace'],
                    'tokenizer': 'standard',
                    'filter': ['lowercase', 'synonym_filter']
                }
            }
        }
    },
    'mappings': {
        'properties': {
            'text_field': {
                'type': 'text',
                'analyzer': 'custom_analyzer', # using the analyzer on a field
            }
        }
    }
}

es.indices.create(index='index-name', body=settings)
```
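As a rough plain-Python illustration (not Elasticsearch itself) of what such a custom analyzer does to a string — strip HTML, map `&` to `and`, tokenize, lowercase (synonym expansion omitted):

```python
import re

def simulate_custom_analyzer(text):
    # character filters: strip HTML tags, then map '&' to 'and'
    text = re.sub(r'<[^>]+>', ' ', text)
    text = text.replace('&', 'and')
    # tokenizer: split on non-word characters (roughly like 'standard')
    tokens = re.findall(r'\w+', text)
    # token filter: lowercase (the synonym filter is omitted here)
    return [token.lower() for token in tokens]

print(simulate_custom_analyzer('<b>Cars & Televisions</b>'))
# ['cars', 'and', 'televisions']
```

The real analyzer additionally expands synonyms, so a search for "vehicle" would also match documents containing "car".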