├── LICENSE ├── README.md ├── chapter1 ├── classification.ipynb ├── question-answering.ipynb ├── sentiment-analysis.ipynb └── text-summarization.ipynb ├── chapter10 └── PacktGPT_Plugin │ ├── .well-known │ └── ai-plugin.json │ ├── Dockerfile │ ├── README.md │ ├── app.py │ ├── chart.py │ ├── chart_soc.py │ ├── logo.png │ ├── openapi.yaml │ └── requirements.txt ├── chapter2 ├── dense-vs-sparse.ipynb ├── distance-magnitude.ipynb ├── knn-search.ipynb └── mapping-and-indexing-vectors.ipynb ├── chapter3 └── Chapter_3_Model_Management_and_Vector_Considerations_in_Elastic.ipynb ├── chapter5 ├── Chapter_5_Image_Search.ipynb └── images │ ├── citations.txt │ └── images.tar ├── chapter6 └── chapter6_pii_redaction.ipynb ├── chapter7 └── vector-search-for-observability.ipynb ├── chapter9 ├── README.md ├── config.py ├── data-preparation.ipynb ├── main.py ├── recipe_generator.py └── style.css └── cover.jpg /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2023 Packt 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # 엘라스틱을 활용한 벡터 검색 실무 가이드 2 | 3 |

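4 | 5 | This is the code repository for [엘라스틱을 활용한 벡터 검색 실무 가이드](https://wikibook.co.kr/vector-search/). 6 | 7 | **A toolkit for building natural language processing solutions for search, observability, and security using vector search** 8 | 9 | ## What is this book about? 10 | Natural language processing (NLP) is mostly used in search use cases, but this book aims to inspire you to use vectors to overcome challenges in other important domains such as observability and cybersecurity. 11 | Each chapter focuses on integrating Elastic with vector search to improve observability and cybersecurity capabilities alongside search. 12 | 13 | This book covers the following exciting features: 14 | * Optimizing performance with vector search 15 | * Exploring image vector search and its applications 16 | * Detecting and masking personally identifiable information 17 | * Implementing log prediction for next-generation observability 18 | * Using vector-based bot detection for cybersecurity 19 | * Visualizing the vector space and exploring Search.Next with Elastic 20 | * Implementing a RAG-enhanced application using Streamlit 21 | 22 | ## Instructions 23 | All of the code is organized into folders. 24 | 25 | The code looks like the following: 26 | ``` 27 | { 28 | '_source': { 29 | 'redacted': ' 전화번호는 이고, 주민등록번호는 입니다.', 30 | 'status': '비활성화' 31 | } 32 | } 33 | ``` 34 | 35 | 36 | **Who this book is for:** 37 | This book is for data professionals with experience in Elastic observability, search, or cybersecurity who want to expand their knowledge of vector search. It provides practical knowledge useful for search application owners, product managers, observability platform owners, and security operations center professionals. Experience with Python, machine learning models, and data management will help you get more out of this book. 38 | 39 | With the following software and hardware list, you can run all the source code in the book (Chapters 1-10). 40 | 41 | 42 | ### Software and hardware requirements 43 | 44 | To get the most out of this book, you should have a basic understanding of working with Elasticsearch, Python programming, and search concepts. With this foundation, you can more effectively follow the advanced techniques and applications covered in the book. 45 | 46 | The system requirements are as follows. 47 | 48 | | Software/hardware | Operating system requirements | 49 | |------------------|----------------------------| 50 | | Elasticsearch 8.11+ | Windows, Linux, and macOS | 51 | | Python 3.9+ | | 52 | | Jupyter Notebook | | 53 | 54 | 55 | ## About the authors 56 | **Bahaaldine Azarmi** : Global Vice President of Customer Engineering at Elastic, he guides companies in making good use of data architecture, distributed systems, machine learning, and generative AI. He leads a customer engineering team focused on cloud consumption and is passionate about sharing knowledge to build and inspire a skilled community in AI. 57 | 58 | **Jeff Vestal** : He has a rich background from more than ten years at financial trading firms and extensive experience with Elasticsearch. He brings a unique combination of operational acumen, engineering skills, and machine learning expertise. 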
As a Principal Customer Enterprise Architect at Elastic, he excels at crafting innovative solutions that use Elasticsearch's advanced search capabilities, machine learning features, and generative AI integrations, adeptly guiding users to turn complex data challenges into actionable insights. 59 | -------------------------------------------------------------------------------- /chapter1/classification.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": { 7 | "id": "N7qcJH8gHaEp" 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "from sklearn.datasets import fetch_20newsgroups\n", 12 | "from sklearn.feature_extraction.text import CountVectorizer\n", 13 | "from sklearn.naive_bayes import MultinomialNB\n", 14 | "from sklearn.metrics import accuracy_score\n", 15 | "\n", 16 | "# Load the dataset\n", 17 | "newsgroups_train = fetch_20newsgroups(subset='train')\n", 18 | "newsgroups_test = fetch_20newsgroups(subset='test')\n", 19 | "\n", 20 | "# Vectorize the text data\n", 21 | "vectorizer = CountVectorizer()\n", 22 | "X_train = vectorizer.fit_transform(newsgroups_train.data)\n", 23 | "X_test = vectorizer.transform(newsgroups_test.data)\n", 24 | "\n", 25 | "# Train a Naive Bayes classifier\n", 26 | "clf = MultinomialNB()\n", 27 | "clf.fit(X_train, newsgroups_train.target)\n", 28 | "\n", 29 | "# Predict on the test set\n", 30 | "y_pred = clf.predict(X_test)\n", 31 | "\n", 32 | "# Print the accuracy and the predicted class labels\n", 33 | "print(f\"Accuracy: {accuracy_score(newsgroups_test.target, y_pred)}\")\n", 34 | "print(f\"Predicted classes: {y_pred}\")" 35 | ] 36 | } 37 | ], 38 | "metadata": { 39 | "language_info": { 40 | "name": "python" 41 | }, 42 | "orig_nbformat": 4, 43 | "colab": { 44 | "provenance": [] 45 | }, 46 | "kernelspec": { 47 | "name": "python3", 48 | "display_name": "Python 3" 49 | } 50 | }, 51 | "nbformat": 4, 52 | "nbformat_minor": 0 53 | } 54 | -------------------------------------------------------------------------------- /chapter1/question-answering.ipynb: 
-------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": { 7 | "id": "xFv77o5OOmVt" 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "!pip install transformers\n", 12 | "\n", 13 | "from transformers import pipeline\n", 14 | "\n", 15 | "# Load the question-answering (QA) model\n", 16 | "model = pipeline(\"question-answering\", model=\"distilbert-base-cased-distilled-squad\", tokenizer=\"distilbert-base-cased\")\n", 17 | "\n", 18 | "# Define the question and context\n", 19 | "question = \"What is the capital city of Korea?\"\n", 20 | "context = \"South Korea is located in East Asia on the southern part of the Korean Peninsula, bordered by North Korea and surrounded by the Yellow Sea and the East Sea. Its capital is Seoul, and the official language is Korean.\"\n", 21 | "\n", 22 | "# Get the answer\n", 23 | "result = model(question=question, context=context)\n", 24 | "\n", 25 | "# Print the answer\n", 26 | "print(f\"Answer: {result['answer']}\")" 27 | ] 28 | } 29 | ], 30 | "metadata": { 31 | "language_info": { 32 | "name": "python" 33 | }, 34 | "orig_nbformat": 4, 35 | "colab": { 36 | "provenance": [] 37 | }, 38 | "kernelspec": { 39 | "name": "python3", 40 | "display_name": "Python 3" 41 | } 42 | }, 43 | "nbformat": 4, 44 | "nbformat_minor": 0 45 | } -------------------------------------------------------------------------------- /chapter1/sentiment-analysis.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "!pip install nltk\n", 10 | "\n", 11 | "import nltk \n", 12 | "nltk.download('vader_lexicon')\n", 13 | "\n", 14 | "from nltk.sentiment import SentimentIntensityAnalyzer \n", 15 | "\n", 16 | "sia = SentimentIntensityAnalyzer() \n", 17 | "text = \"I really enjoyed the new movie. 
The acting was great and the plot was engaging.\" \n", 18 | "scores = sia.polarity_scores(text) \n", 19 | "print(scores) " 20 | ] 21 | } 22 | ], 23 | "metadata": { 24 | "language_info": { 25 | "name": "python" 26 | }, 27 | "orig_nbformat": 4 28 | }, 29 | "nbformat": 4, 30 | "nbformat_minor": 2 31 | } 32 | -------------------------------------------------------------------------------- /chapter1/text-summarization.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": { 7 | "id": "8ldUEgHGLeEv" 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "!pip install transformers\n", 12 | "\n", 13 | "from transformers import pipeline\n", 14 | "\n", 15 | "# 요약할 문장 정의\n", 16 | "text = \"The blue-throated macaw (Ara glaucogularis) is a species of macaw that is endemic to a small area of north-central Bolivia, known as the Llanos de Moxos. Recent population and range estimates suggest that about 350 to 400 individuals remain in the wild. Its demise was brought on by nesting competition, avian predation, and a small native range, exacerbated by indigenous hunting and capture for the pet trade. Although plentiful in captivity, it is critically endangered in the wild and protected by trading prohibitions. In 2014, the species was designated a natural patrimony of Bolivia. 
This blue-throated macaw in flight was photographed at Loro Parque, on the Spanish island of Tenerife in the Canary Islands.\"\n", 17 | "\n", 18 | "# 요약 파이브라인 초기화\n", 19 | "summarizer = pipeline(\"summarization\")\n", 20 | "\n", 21 | "# 문장에 대한 요약문 생성\n", 22 | "summary = summarizer(text, max_length=100, min_length=30, do_sample=False)[0][\"summary_text\"]\n", 23 | "\n", 24 | "# 요약문 출력\n", 25 | "print(summary)" 26 | ] 27 | } 28 | ], 29 | "metadata": { 30 | "language_info": { 31 | "name": "python" 32 | }, 33 | "orig_nbformat": 4, 34 | "colab": { 35 | "provenance": [] 36 | }, 37 | "kernelspec": { 38 | "name": "python3", 39 | "display_name": "Python 3" 40 | } 41 | }, 42 | "nbformat": 4, 43 | "nbformat_minor": 0 44 | } -------------------------------------------------------------------------------- /chapter10/PacktGPT_Plugin/.well-known/ai-plugin.json: -------------------------------------------------------------------------------- 1 | { 2 | "schema_version": "v1", 3 | "name_for_human": "PacktGPT_Plugin", 4 | "name_for_model": "PacktGPT_Plugin", 5 | "description_for_human": "An Assistant, you know, for searching in the Elastic documentation", 6 | "description_for_model": "Get most recent elastic documentation post 2021 release, anything after release 7.15", 7 | "auth": { 8 | "type": "none" 9 | }, 10 | "api": { 11 | "type": "openapi", 12 | "url": "PLUGIN_HOSTNAME/openapi.yaml", 13 | "is_user_authenticated": false 14 | }, 15 | "logo_url": "PLUGIN_HOSTNAME/logo.png", 16 | "contact_email": "info@info.co", 17 | "legal_info_url": "http://www.example.com/legal" 18 | } -------------------------------------------------------------------------------- /chapter10/PacktGPT_Plugin/Dockerfile: -------------------------------------------------------------------------------- 1 | FROM python:3.8-slim 2 | 3 | WORKDIR /app 4 | 5 | COPY app.py /app/app.py 6 | COPY logo.png /app/logo.png 7 | COPY .well-known /app/.well-known 8 | COPY openapi.yaml /app/openapi.yaml 9 | 10 | RUN pip install 
--no-cache-dir \ 11 | quart \ 12 | quart-cors \ 13 | openai \ 14 | elasticsearch \ 15 | requests \ 16 | embedchain 17 | 18 | EXPOSE 5001 19 | 20 | ENTRYPOINT ["python", "/app/app.py"] 21 | -------------------------------------------------------------------------------- /chapter10/PacktGPT_Plugin/README.md: -------------------------------------------------------------------------------- 1 | # ElasticGPT_Plugin 2 | 3 | See the full documentation online in this Elastic Blog Post 4 | -------------------------------------------------------------------------------- /chapter10/PacktGPT_Plugin/app.py: -------------------------------------------------------------------------------- 1 | import quart 2 | import quart_cors 3 | from quart import request 4 | 5 | import os 6 | import openai 7 | from elasticsearch import Elasticsearch 8 | 9 | from embedchain import App 10 | 11 | app = quart_cors.cors(quart.Quart(__name__), allow_origin="*") 12 | 13 | openai.api_key = os.environ['openai_api'] 14 | os.environ["OPENAI_API_KEY"] = openai.api_key 15 | 16 | model = "gpt-3.5-turbo" 17 | 18 | 19 | # Connect to Elastic Cloud 20 | def es_connect(cid, user, passwd): 21 | es = Elasticsearch(cloud_id=cid, basic_auth=(user, passwd)) 22 | return es 23 | 24 | 25 | # Search the Elasticsearch index and answer the query from the pages behind the hit URLs 26 | def ESSearch(query_text): 27 | cloud_url = os.environ['cloud_url'] 28 | cid = os.environ['cloud_id'] 29 | cp = os.environ['cloud_pass'] 30 | cu = os.environ['cloud_user'] 31 | es = es_connect(cid, cu, cp) 32 | 33 | # Elasticsearch BM25 query 34 | query = { 35 | "bool": { 36 | "filter": [ 37 | { 38 | "prefix": { 39 | "url": "https://www.elastic.co/guide" 40 | } 41 | } 42 | 43 | ], 44 | "must": [{ 45 | "match": { 46 | "title": { 47 | "query": query_text, 48 | "boost": 1 49 | } 50 | } 51 | }] 52 | } 53 | } 54 | 55 | fields = ["title", "body_content", "url"] 56 | index = 'search-wikibook-cdl-source' 57 | resp = es.search(index=index, 58 | query=query, 59 | fields=fields, 60 | size=10, 61 | source=False) 62 | 63 | if not 
resp['hits']['hits']: 64 | return "No results found." 65 | 66 | body = resp['hits']['hits'][0]['fields']['body_content'][0] 67 | url = resp['hits']['hits'][0]['fields']['url'][0] 68 | 69 | print(len(resp['hits']['hits'])) 70 | 71 | elastic_bot = App() 72 | 73 | # Assumes 'resp' is the response object described above 74 | for hit in resp['hits']['hits']: 75 | for url in hit['fields']['url']: 76 | print(url) 77 | elastic_bot.add(url) 78 | 79 | return elastic_bot.query("What can you tell me about " + query_text) 80 | 81 | 82 | def truncate_text(text, max_tokens): 83 | tokens = text.split() 84 | if len(tokens) <= max_tokens: 85 | return text 86 | 87 | return ' '.join(tokens[:max_tokens]) 88 | 89 | 90 | # Generate a ChatGPT response for the given prompt 91 | def chat_gpt(prompt, 92 | model="gpt-3.5-turbo", 93 | max_tokens=1024, 94 | max_context_tokens=4000, 95 | safety_margin=5): 96 | # Truncate the prompt to fit within the model's context length 97 | truncated_prompt = truncate_text( 98 | prompt, max_context_tokens - max_tokens - safety_margin) 99 | 100 | response = openai.chat.completions.create(model=model, 101 | messages=[{ 102 | "role": 103 | "system", 104 | "content": 105 | "You are a helpful assistant." 
106 | }, { 107 | "role": "user", 108 | "content": truncated_prompt 109 | }]) 110 | 111 | return response.choices[0].message.content 112 | 113 | 114 | @app.get("/search") 115 | async def search(): 116 | query = request.args.get("query") 117 | url = ESSearch(query) 118 | return quart.Response(url) 119 | 120 | 121 | @app.get("/logo.png") 122 | async def plugin_logo(): 123 | filename = 'logo.png' 124 | return await quart.send_file(filename, mimetype='image/png') 125 | 126 | 127 | @app.get("/.well-known/ai-plugin.json") 128 | async def plugin_manifest(): 129 | host = request.headers['Host'] 130 | with open("./.well-known/ai-plugin.json") as f: 131 | text = f.read() 132 | text = text.replace("PLUGIN_HOSTNAME", f"https://{host}") 133 | return quart.Response(text, mimetype="text/json") 134 | 135 | 136 | @app.get("/openapi.yaml") 137 | async def openapi_spec(): 138 | host = request.headers['Host'] 139 | with open("openapi.yaml") as f: 140 | text = f.read() 141 | text = text.replace("PLUGIN_HOSTNAME", f"https://{host}") 142 | return quart.Response(text, mimetype="text/yaml") 143 | 144 | 145 | def main(): 146 | port = int(os.environ.get("PORT", 5001)) 147 | app.run(debug=True, host="127.0.0.1", port=port) 148 | 149 | 150 | if __name__ == "__main__": 151 | main() 152 | -------------------------------------------------------------------------------- /chapter10/PacktGPT_Plugin/chart.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import matplotlib.pyplot as plt 3 | import seaborn as sns 4 | 5 | # Constants 6 | k = 5 7 | threshold = 2 8 | 9 | # Define the sigmoid function 10 | def soc_function(R, C, T, A, k, wR=1, wC=1, wT=1, wA=1, threshold=0): 11 | return 1 / (1 + np.exp(-k * (wR*R + wC*C + wT*T + wA*A - threshold))) 12 | 13 | # 1. 
Sigmoid function curve 14 | x = np.linspace(-10, 10, 400) 15 | y = 1 / (1 + np.exp(-x)) 16 | 17 | plt.figure(figsize=(8, 6)) 18 | plt.plot(x, y) 19 | plt.title("Figure 1: Sigmoid Function Curve") 20 | plt.xlabel("x") 21 | plt.ylabel("Sigmoid(x)") 22 | plt.grid(True) 23 | plt.show() 24 | 25 | # 2. 3D interaction plot 26 | from mpl_toolkits.mplot3d import Axes3D 27 | 28 | # Arbitrary data is generated for this example; real data may work better. 29 | R_vals = np.linspace(0, 1, 10) 30 | C_vals = np.linspace(0, 1, 10) 31 | T_vals = np.linspace(0, 1, 10) 32 | 33 | R, C, T = np.meshgrid(R_vals, C_vals, T_vals) 34 | A_val = 0.5 # held fixed in this example for simplicity 35 | SoC = soc_function(R, C, T, A_val, k) 36 | 37 | fig = plt.figure(figsize=(10, 8)) 38 | ax = fig.add_subplot(111, projection='3d') 39 | scatter = ax.scatter(R, C, T, c=SoC.ravel(), cmap='viridis') 40 | ax.set_xlabel('Relevance (R)') 41 | ax.set_ylabel('Continuity (C)') 42 | ax.set_zlabel('Timeliness (T)') 43 | ax.set_title("Figure 2: Interaction of R, C, and T") 44 | fig.colorbar(scatter, ax=ax, label='SoC Value') # This adds the colorbar 45 | plt.show() 46 | 47 | # 3. Effect of varying k 48 | ks = [1, 5, 10] 49 | R_val, C_val, T_val, A_val = 0.5, 0.6, 0.7, 0.8 50 | 51 | plt.figure(figsize=(8, 6)) 52 | for k_val in ks: # assumed reconstruction: the plotting loop is missing from this listing 53 |     plt.plot(x, 1 / (1 + np.exp(-k_val * x)), label=f"k={k_val}") 54 | plt.title("Figure 3: Effect of Varying k on SoC") 55 | plt.xlabel("Combined Factors") 56 | plt.ylabel("SoC Value") 57 | plt.legend() 58 | plt.grid(True) 59 | plt.show() 60 | 61 | # 4. Heatmap of weights 62 | # For simplicity, consider the interaction between two weights (wR and wC). 
63 | 64 | weights = np.linspace(0, 2, 50) 65 | wR, wC = np.meshgrid(weights, weights) 66 | SoC_heatmap = soc_function(R_val, C_val, T_val, A_val, k, wR, wC) 67 | 68 | plt.figure(figsize=(8, 6)) 69 | sns.heatmap(SoC_heatmap, cmap='viridis', cbar_kws={'label': 'SoC Value'}) 70 | plt.title("Figure 4: Heatmap of Varying wR and wC") 71 | plt.xlabel("Weight of Relevance (wR)") 72 | plt.ylabel("Weight of Continuity (wC)") 73 | plt.xticks(ticks=np.linspace(0, len(weights)-1, 5), labels=np.linspace(0, 2, 5)) 74 | plt.yticks(ticks=np.linspace(0, len(weights)-1, 5), labels=np.linspace(0, 2, 5)) 75 | plt.show() 76 | 77 | # Note : 나머지 그림과 표를 효과적으로 그리기 위해서는 더 구체적인 시나리오나 데이터 세트가 필요합니다. 78 | -------------------------------------------------------------------------------- /chapter10/PacktGPT_Plugin/chart_soc.py: -------------------------------------------------------------------------------- 1 | # 필요한 라이브러리 가져오기 2 | import numpy as np 3 | import matplotlib.pyplot as plt 4 | 5 | 6 | # 시그모이드를 사용해 SoC 함수 정의하기 7 | def soc_function(R, C, T, A, k=1, threshold=2): 8 | return 1 / (1 + np.exp(-k * (R + C + T + A - threshold))) 9 | 10 | 11 | # 샘플 R, C, T, A 값 12 | R_values = np.linspace(0.8, 1, 50) 13 | C_values = np.linspace(0.8, 1, 50) 14 | T_values = np.linspace(0.8, 1, 50) 15 | A_values = np.linspace(0.8, 1, 50) 16 | 17 | # 샘플 R, C, T, A 값에 대한 SoC 값 계산. 18 | soc_values = soc_function(R_values, C_values, T_values, A_values) 19 | 20 | # 함수 표시 21 | plt.figure(figsize=(10, 6)) 22 | plt.plot(R_values, soc_values, label="SoC values", color='blue') 23 | plt.axhline(y=0.5, color='r', linestyle='--', label="Midpoint") 24 | plt.title("Significance of Context (SoC) vs. 
R (with C, T, A held constant)") 25 | plt.xlabel("Relevance (R)") 26 | plt.ylabel("SoC Value") 27 | plt.legend() 28 | plt.grid(True) 29 | plt.show() 30 | -------------------------------------------------------------------------------- /chapter10/PacktGPT_Plugin/logo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/wikibook/vector-search/0558d712dcd51b2e7ca72bddfb6a3581086c3798/chapter10/PacktGPT_Plugin/logo.png -------------------------------------------------------------------------------- /chapter10/PacktGPT_Plugin/openapi.yaml: -------------------------------------------------------------------------------- 1 | openapi: 3.0.1 2 | info: 3 | title: ElasticGPT_Plugin 4 | description: Retrieve information from the most recent Elastic documentation 5 | version: 'v1' 6 | servers: 7 | - url: PLUGIN_HOSTNAME 8 | paths: 9 | /search: 10 | get: 11 | operationId: search 12 | summary: retrieves the document matching the query 13 | parameters: 14 | - in: query 15 | name: query 16 | schema: 17 | type: string 18 | description: use to filter relevant part of the elasticsearch documentations 19 | responses: 20 | "200": 21 | description: OK -------------------------------------------------------------------------------- /chapter10/PacktGPT_Plugin/requirements.txt: -------------------------------------------------------------------------------- 1 | elasticsearch==8.13.1 2 | embedchain==0.1.103 3 | openai==1.28.1 4 | quart==0.19.5 5 | quart-cors==0.7.0 6 | numpy==1.26.4 7 | matplotlib==3.8.4 8 | seaborn==0.13.2 -------------------------------------------------------------------------------- /chapter2/dense-vs-sparse.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import numpy as np \n", 10 | "from scipy.sparse import random \n", 11 | 
"from sklearn.decomposition import TruncatedSVD \n", 12 | "import matplotlib.pyplot as plt \n", 13 | "\n", 14 | "# 1000개의 단어가 포함된 100개의 문서를 생성\n", 15 | "vocab_size = 10000\n", 16 | "num_docs = 100 \n", 17 | "doc_len = 1000 \n", 18 | "\n", 19 | "# 10000개의 단어로 구성된 어휘 집합을 생성\n", 20 | "vocab = [f'word{i}' for i in range(vocab_size)] \n", 21 | "\n", 22 | "# 각각의 문서에 대한 밀집 벡터를 랜덤하게 생성\n", 23 | "dense_vectors = np.zeros((num_docs, vocab_size)) \n", 24 | "for i in range(num_docs): \n", 25 | " word_indices = np.random.choice(vocab_size, doc_len) \n", 26 | " for j in word_indices: \n", 27 | " dense_vectors[i, j] += 1 \n", 28 | "\n", 29 | "# 밀집 벡터를 희소 벡터로 변환\n", 30 | "sparse_vectors = random(num_docs, vocab_size, density=0.01, format='csr') \n", 31 | "for i in range(num_docs): \n", 32 | " word_indices = np.random.choice(vocab_size, doc_len) \n", 33 | " for j in word_indices: \n", 34 | " sparse_vectors[i, j] += 1 \n", 35 | "\n", 36 | "# TruncatedSVD를 사용하여 밀집 벡터의 차원을 축소\n", 37 | "svd = TruncatedSVD(n_components=2) \n", 38 | "dense_vectors_svd = svd.fit_transform(dense_vectors) \n", 39 | "\n", 40 | "# TruncatedSVD를 희소 벡터에 적용\n", 41 | "sparse_vectors_svd = svd.transform(sparse_vectors)\n", 42 | "\n", 43 | "# 각각의 차원 축소 결과를 산점도 표시\n", 44 | "fig, ax = plt.subplots(figsize=(10, 8)) \n", 45 | "ax.scatter(dense_vectors_svd[:, 0], dense_vectors_svd[:, 1], c='b', label='Dense vectors') \n", 46 | "ax.scatter(sparse_vectors_svd[:, 0], sparse_vectors_svd[:, 1], c='r', label='Sparse vectors') \n", 47 | "ax.set_title('2D embeddings of dense and sparse document vectors after TruncatedSVD dimensionality reduction')\n", 48 | "ax.set_xlabel('Dimension 1') \n", 49 | "ax.set_ylabel('Dimension 2') \n", 50 | "ax.legend() \n", 51 | "plt.show()\n" 52 | ] 53 | } 54 | ], 55 | "metadata": { 56 | "language_info": { 57 | "name": "python" 58 | }, 59 | "orig_nbformat": 4 60 | }, 61 | "nbformat": 4, 62 | "nbformat_minor": 2 63 | } 64 | 
-------------------------------------------------------------------------------- /chapter2/distance-magnitude.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "# spaCy 설치, 'en_core_web_md' 모델을 다운로드\n", 10 | "!pip install spacy\n", 11 | "!python -m spacy download en_core_web_md\n", 12 | "\n", 13 | "# 라이브러리 가져오기\n", 14 | "import spacy\n", 15 | "import numpy as np\n", 16 | "from scipy.spatial.distance import cosine, euclidean\n", 17 | "\n", 18 | "# 사전 학습된 단어 임베딩 모델 로드\n", 19 | "nlp = spacy.load('en_core_web_md')\n", 20 | "\n", 21 | "# 텍스트 정의\n", 22 | "text_a = \"The cat is playing with a toy.\"\n", 23 | "text_b = \"A kitten is interacting with a plaything.\"\n", 24 | "text_c = \"The chef is cooking a delicious meal.\"\n", 25 | "text_d = \"Economics is the social science that studies the production, distribution, and consumption of goods and services.\"\n", 26 | "text_e = \"Economics studies goods and services.\"\n", 27 | "\n", 28 | "# spaCy 모델을 사용해 텍스트를 벡터 표현으로 변환\n", 29 | "vector_a = nlp(text_a).vector\n", 30 | "vector_b = nlp(text_b).vector\n", 31 | "vector_c = nlp(text_c).vector\n", 32 | "vector_d = nlp(text_d).vector\n", 33 | "vector_e = nlp(text_e).vector\n", 34 | "\n", 35 | "# 벡터 간 코사인 유사도 계산\n", 36 | "cosine_sim_ab = 1 - cosine(vector_a, vector_b)\n", 37 | "cosine_sim_ac = 1 - cosine(vector_a, vector_c)\n", 38 | "cosine_sim_de = 1 - cosine(vector_d, vector_e)\n", 39 | "\n", 40 | "print(f\"Cosine similarity between Text A and Text B: {cosine_sim_ab:.2f}\")\n", 41 | "print(f\"Cosine similarity between Text A and Text C: {cosine_sim_ac:.2f}\")\n", 42 | "print(f\"Cosine similarity between Text D and Text E: {cosine_sim_de:.2f}\")\n", 43 | "\n", 44 | "# 벡터 간 유클리드 거리 계산\n", 45 | "euclidean_dist_ab = euclidean(vector_a, vector_b)\n", 46 | "euclidean_dist_ac = euclidean(vector_a, vector_c)\n", 
47 | "euclidean_dist_de = euclidean(vector_d, vector_e)\n", 48 | "\n", 49 | "print(f\"Euclidean distance between Text A and Text B: {euclidean_dist_ab:.2f}\")\n", 50 | "print(f\"Euclidean distance between Text A and Text C: {euclidean_dist_ac:.2f}\")\n", 51 | "print(f\"Euclidean distance between Text D and Text E: {euclidean_dist_de:.2f}\")\n", 52 | "\n", 53 | "# Compute vector magnitudes\n", 54 | "magnitude_d = np.linalg.norm(vector_d)\n", 55 | "magnitude_e = np.linalg.norm(vector_e)\n", 56 | "\n", 57 | "print(f\"Magnitude of Text D's vector: {magnitude_d:.2f}\")\n", 58 | "print(f\"Magnitude of Text E's vector: {magnitude_e:.2f}\")\n" 59 | ] 60 | } 61 | ], 62 | "metadata": { 63 | "language_info": { 64 | "name": "python" 65 | }, 66 | "orig_nbformat": 4 67 | }, 68 | "nbformat": 4, 69 | "nbformat_minor": 2 70 | } 71 | -------------------------------------------------------------------------------- /chapter2/knn-search.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "!pip install transformers elasticsearch \n", 10 | "\n", 11 | "import numpy as np \n", 12 | "from transformers import AutoTokenizer, AutoModel \n", 13 | "from elasticsearch import Elasticsearch \n", 14 | "import torch \n", 15 | "\n", 16 | "# Define the Elasticsearch connection using credentials\n", 17 | "es = Elasticsearch(\n", 18 | " ['https://host:port'],\n", 19 | " http_auth=('username', 'password'),\n", 20 | " verify_certs=False\n", 21 | ")\n", 22 | " \n", 23 | "# Define the mapping for the index that will store the data\n", 24 | "mapping = { \n", 25 | " 'properties': { \n", 26 | " 'embedding': { \n", 27 | " 'type': 'dense_vector', \n", 28 | " 'dims': 768, # Defines the dimensionality of the dense vector field. 
\n", 29 | " 'index': 'true',\n", 30 | " \"similarity\": \"cosine\"\n", 31 | " } \n", 32 | " } \n", 33 | "} \n", 34 | "\n", 35 | "# 정의한 매핑으로 인덱스 생성\n", 36 | "es.indices.create(index='jokes-index', body={'mappings': mapping}) \n", 37 | "\n", 38 | "# 색인 할 유머 데이터 세트 구성\n", 39 | "jokes = [ \n", 40 | " { \n", 41 | " 'text': 'Why do cats make terrible storytellers? Because they only have one tail.', \n", 42 | " 'category': 'cat' \n", 43 | " }, \n", 44 | " { \n", 45 | " 'text': 'What did the cat say when he lost all his money? I am paw.', \n", 46 | " 'category': 'cat' \n", 47 | " }, \n", 48 | " { \n", 49 | " 'text': 'Why don\\'t cats play poker in the jungle? Too many cheetahs.', \n", 50 | " 'category': 'cat' \n", 51 | " },\n", 52 | " { \n", 53 | " 'text': 'Why did the tomato turn red? Because it saw the salad dressing!', \n", 54 | " 'category': 'vegetable' \n", 55 | " },\n", 56 | " { \n", 57 | " 'text': 'Why did the scarecrow win an award? Because he was outstanding in his field.', \n", 58 | " 'category': 'farm' \n", 59 | " },\n", 60 | " { \n", 61 | " 'text': 'Why did the hipster burn his tongue? Because he drank his coffee before it was cool.', \n", 62 | " 'category': 'hipster' \n", 63 | " }, \n", 64 | " {\n", 65 | " 'text': 'Why did the tomato turn red? Because it saw the salad dressing!', \n", 66 | " 'category': 'food' \n", 67 | " },\n", 68 | " {\n", 69 | " 'text': 'Why did the scarecrow win an award? Because he was out-standing in his field!', \n", 70 | " 'category': 'puns' \n", 71 | " },\n", 72 | " {\n", 73 | " 'text': 'What do you call a fake noodle? An impasta!', \n", 74 | " 'category': 'food' \n", 75 | " },\n", 76 | " {\n", 77 | " 'text': 'What do you call a belt made out of watches? A waist of time!', \n", 78 | " 'category': 'puns' \n", 79 | " },\n", 80 | " {\n", 81 | " 'text': 'Why did the math book look sad? Because it had too many problems!', \n", 82 | " 'category': 'math' \n", 83 | " },\n", 84 | " {\n", 85 | " 'text': 'Why did the gym close down? 
It just didn\\'t work out!', \n", 86 | " 'category': 'exercise' \n", 87 | " },\n", 88 | " {\n", 89 | " 'text': 'Why don\\'t scientists trust atoms? Because they make up everything!', \n", 90 | " 'category': 'science' \n", 91 | " },\n", 92 | " {\n", 93 | " 'text': 'What do you call a fake noodle? An impasta!', \n", 94 | " 'category': 'food' \n", 95 | " },\n", 96 | " {\n", 97 | " 'text': 'Why did the chicken cross the playground? To get to the other slide!', \n", 98 | " 'category': 'kids' \n", 99 | " },\n", 100 | " {\n", 101 | " 'text': 'Why did the frog call his insurance company? He had a jump in his car!', \n", 102 | " 'category': 'puns' \n", 103 | " }\n", 104 | "\n", 105 | "] \n", 106 | "\n", 107 | "# BERT 토크나이저와 모델 로드\n", 108 | "tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased') \n", 109 | "model = AutoModel.from_pretrained('bert-base-uncased') \n", 110 | "\n", 111 | "# BERT를 활용하여 유머 데이터에 대한 임베딩 생성\n", 112 | "for joke in jokes: \n", 113 | " text = joke['text'] \n", 114 | " inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True) \n", 115 | " with torch.no_grad(): \n", 116 | " output = model(**inputs).last_hidden_state.mean(dim=1).squeeze(0).numpy() \n", 117 | " joke['embedding'] = output.tolist() \n", 118 | "\n", 119 | "# 일래스틱서치에 유머 데이터 색인\n", 120 | "for joke in jokes: \n", 121 | " es.index(index='jokes-index', body=joke) \n", 122 | "\n", 123 | "# 질의 벡터 생성\n", 124 | "# 질의 텍스트를 정의하고 BERT를 활용해 질의 텍스트를 벡터로 변환\n", 125 | "query = \"What do you get when you cross a snowman and a shark?\"\n", 126 | "inputs = tokenizer(query, return_tensors='pt', padding=True, truncation=True)\n", 127 | "with torch.no_grad():\n", 128 | " output = model(**inputs).last_hidden_state.mean(dim=1).squeeze(0).numpy()\n", 129 | "query_vector = output\n", 130 | "\n", 131 | "# 일래스틱서치 kNN 검색 쿼리 정의\n", 132 | "search = {\n", 133 | " \"knn\": {\n", 134 | " \"field\": \"embedding\",\n", 135 | " \"query_vector\": query_vector.tolist(),\n", 136 | " \"k\": 3,\n", 137 | " 
\"num_candidates\": 100\n", 138 | " },\n", 139 | " \"fields\": [ \"text\" ]\n", 140 | "}\n", 141 | "\n", 142 | "# kNN 검색 수행 및 결과 출력\n", 143 | "response = es.search(index='jokes-index', body=search)\n", 144 | "for hit in response['hits']['hits']:\n", 145 | " print(f\"Joke: {hit['_source']['text']}\")\n" 146 | ] 147 | } 148 | ], 149 | "metadata": { 150 | "language_info": { 151 | "name": "python" 152 | }, 153 | "orig_nbformat": 4 154 | }, 155 | "nbformat": 4, 156 | "nbformat_minor": 2 157 | } 158 | -------------------------------------------------------------------------------- /chapter2/mapping-and-indexing-vectors.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": { 7 | "pycharm": { 8 | "is_executing": true 9 | } 10 | }, 11 | "outputs": [], 12 | "source": [ 13 | "!pip install transformers elasticsearch\n", 14 | "\n", 15 | "import numpy as np\n", 16 | "from transformers import AutoTokenizer, AutoModel\n", 17 | "from elasticsearch import Elasticsearch\n", 18 | "import torch\n", 19 | "\n", 20 | "# 인증정보를 사용해 일래스틱서치 접속 정보 정의\n", 21 | "es = Elasticsearch(\n", 22 | " ['https://hostname:port'],\n", 23 | " http_auth=('username', 'password'),\n", 24 | " verify_certs=False\n", 25 | ")\n", 26 | "\n", 27 | "# dense vector 필드를 위한 매핑 정의\n", 28 | "mapping = {\n", 29 | " 'properties': {\n", 30 | " 'embedding': {\n", 31 | " 'type': 'dense_vector',\n", 32 | " 'dims': 768 # dense vector의 차원수\n", 33 | " }\n", 34 | " }\n", 35 | "}\n", 36 | "\n", 37 | "# 정의한 매핑으로 인덱스 생성\n", 38 | "es.indices.create(index='chapter-2', body={'mappings': mapping})\n", 39 | "\n", 40 | "# 색인 할 문서 데이터 세트 구성\n", 41 | "docs = [\n", 42 | " {\n", 43 | " 'title': 'Document 1',\n", 44 | " 'text': 'This is the first document.'\n", 45 | " },\n", 46 | " {\n", 47 | " 'title': 'Document 2',\n", 48 | " 'text': 'This is the second document.'\n", 49 | " },\n", 50 | " {\n", 51 | " 'title': 
'Document 3',\n", 52 | " 'text': 'This is the third document.'\n", 53 | " }\n", 54 | "]\n", 55 | "\n", 56 | "# BERT 토크나이저와 모델 로드\n", 57 | "tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')\n", 58 | "model = AutoModel.from_pretrained('bert-base-uncased')\n", 59 | "\n", 60 | "# BERT를 활용하여 각 문서의 임베딩 생성\n", 61 | "for doc in docs:\n", 62 | " text = doc['text']\n", 63 | " inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True)\n", 64 | " with torch.no_grad():\n", 65 | " output = model(**inputs).last_hidden_state.mean(dim=1).squeeze(0).numpy()\n", 66 | " doc['embedding'] = output.tolist()\n", 67 | "\n", 68 | "# 일래스틱서치에 문서 색인\n", 69 | "for doc in docs:\n", 70 | " es.index(index='chapter-2', body=doc)\n" 71 | ] 72 | } 73 | ], 74 | "metadata": { 75 | "language_info": { 76 | "name": "python" 77 | }, 78 | "orig_nbformat": 4 79 | }, 80 | "nbformat": 4, 81 | "nbformat_minor": 2 82 | } 83 | -------------------------------------------------------------------------------- /chapter3/Chapter_3_Model_Management_and_Vector_Considerations_in_Elastic.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "source": [ 6 | "# **3.1.2 Datasets 예제 코드**" 7 | ], 8 | "metadata": { 9 | "id": "8-FYIJj299o2" 10 | } 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": null, 15 | "metadata": { 16 | "id": "uA0d_STPW7za" 17 | }, 18 | "outputs": [], 19 | "source": [ 20 | "!pip install datasets transformers" 21 | ] 22 | }, 23 | { 24 | "cell_type": "markdown", 25 | "metadata": { 26 | "id": "8Ud4E6TlXFXP" 27 | }, 28 | "source": [ 29 | "# IMDB 데이터 세트 불러오기" 30 | ] 31 | }, 32 | { 33 | "cell_type": "code", 34 | "execution_count": null, 35 | "metadata": { 36 | "id": "GpXxbN_bXBfo" 37 | }, 38 | "outputs": [], 39 | "source": [ 40 | "from datasets import load_dataset\n", 41 | "imdb_dataset = load_dataset(\"imdb\")" 42 | ] 43 | }, 44 | { 45 | "cell_type": "markdown", 46 | 
"metadata": { 47 | "id": "VCFJODgdXW33" 48 | }, 49 | "source": [ 50 | "# imdb 데이터 세트 토큰화" 51 | ] 52 | }, 53 | { 54 | "cell_type": "code", 55 | "execution_count": null, 56 | "metadata": { 57 | "id": "uFW5jLzJXe6d" 58 | }, 59 | "outputs": [], 60 | "source": [ 61 | "from datasets import load_dataset\n", 62 | "from transformers import AutoTokenizer\n", 63 | "\n", 64 | "# IMDB 데이터 세트 일부(샘플 100개) 적재\n", 65 | "imdb_dataset = load_dataset(\"imdb\", split=\"train[:100]\")\n", 66 | "\n", 67 | "# 토크나이저 초기화\n", 68 | "tokenizer = AutoTokenizer.from_pretrained(\"bert-base-uncased\")\n", 69 | "\n", 70 | "# 자르기와 패딩을 사용해 IMDB 데이터 세트 토큰화\n", 71 | "tokenized_imdb_dataset = imdb_dataset.map(\n", 72 | "lambda x: tokenizer(x[\"text\"], truncation=True, padding=\"max_length\")\n", 73 | ")\n", 74 | "\n", 75 | "print(tokenized_imdb_dataset)\n", 76 | "\n", 77 | "# 토큰의 첫 번째 행 가져오기\n", 78 | "first_row_tokens = tokenized_imdb_dataset[0][\"input_ids\"]\n", 79 | "\n", 80 | "# 처음 10개 토큰과 토큰에 해당하는 단어 출력\n", 81 | "for token in first_row_tokens[:10]:\n", 82 | " print(f\"토큰: {token}, 단어: {tokenizer.decode([token])}\")" 83 | ] 84 | }, 85 | { 86 | "cell_type": "markdown", 87 | "metadata": { 88 | "id": "4sEIuAX1Y12I" 89 | }, 90 | "source": [ 91 | "# **3.1.2 Spaces 예제 코드**\n", 92 | "Gradio 인터페이스 설정" 93 | ] 94 | }, 95 | { 96 | "cell_type": "code", 97 | "execution_count": null, 98 | "metadata": { 99 | "id": "6b9qSiukY9kK" 100 | }, 101 | "outputs": [], 102 | "source": [ 103 | "!pip install gradio transformers" 104 | ] 105 | }, 106 | { 107 | "cell_type": "code", 108 | "execution_count": null, 109 | "metadata": { 110 | "id": "MEvaRRUpY5tB" 111 | }, 112 | "outputs": [], 113 | "source": [ 114 | "import gradio as gr\n", 115 | "from transformers import pipeline\n", 116 | "\n", 117 | "sentiment_pipeline = pipeline(\"sentiment-analysis\")\n", 118 | "\n", 119 | "def sentiment_analysis(text):\n", 120 | " result = sentiment_pipeline(text)\n", 121 | " return result[0][\"label\"]\n", 122 | "\n", 123 | "iface = 
gr.Interface(fn=sentiment_analysis, inputs=\"text\", outputs=\"text\")\n", 124 | "iface.launch()" 125 | ] 126 | }, 127 | { 128 | "cell_type": "markdown", 129 | "metadata": { 130 | "id": "1821xMLhZRGw" 131 | }, 132 | "source": [ 133 | "# **3.2 Eland 예제 코드**" 134 | ] 135 | }, 136 | { 137 | "cell_type": "code", 138 | "execution_count": null, 139 | "metadata": { 140 | "id": "zeHEK6A-ZbA7" 141 | }, 142 | "outputs": [], 143 | "source": [ 144 | "!pip install eland elasticsearch seaborn" 145 | ] 146 | }, 147 | { 148 | "cell_type": "markdown", 149 | "metadata": { 150 | "id": "wB_055jlaH2H" 151 | }, 152 | "source": [ 153 | "# 인덱스 생성 예제" 154 | ] 155 | }, 156 | { 157 | "cell_type": "markdown", 158 | "metadata": { 159 | "id": "iLkd_1KWo6ba" 160 | }, 161 | "source": [ 162 | "## 일래스틱서치에 접속하여 샘플 인덱스 생성" 163 | ] 164 | }, 165 | { 166 | "cell_type": "code", 167 | "execution_count": null, 168 | "metadata": { 169 | "id": "8vnErFw4o8eh" 170 | }, 171 | "outputs": [], 172 | "source": [ 173 | "import getpass\n", 174 | "from elasticsearch import Elasticsearch\n", 175 | "from datetime import datetime\n", 176 | "\n", 177 | "es_cloud_id = getpass.getpass('Enter Elastic Cloud ID: ')\n", 178 | "es_api_key = getpass.getpass('Enter user API key: ')\n", 179 | "\n", 180 | "es = Elasticsearch(cloud_id=es_cloud_id,\n", 181 | " api_key=es_api_key\n", 182 | " )\n", 183 | "es.info() # 클러스터 정보 확인" 184 | ] 185 | }, 186 | { 187 | "cell_type": "code", 188 | "execution_count": null, 189 | "metadata": { 190 | "id": "KODxMIAM753X" 191 | }, 192 | "outputs": [], 193 | "source": [ 194 | "mapping = {\n", 195 | " \"settings\": {\n", 196 | " \"number_of_shards\": 1,\n", 197 | " \"number_of_replicas\": 0\n", 198 | " },\n", 199 | " \"mappings\": {\n", 200 | " \"properties\": {\n", 201 | " \"some_field\": {\"type\": \"float\"},\n", 202 | " \"column_a\": {\"type\": \"float\"},\n", 203 | " \"column_b\": {\"type\": \"float\"},\n", 204 | " \"category\": {\"type\": \"keyword\"},\n", 205 | " \"value\": {\"type\": 
\"float\"}\n", 206 | " }\n", 207 | " }\n", 208 | "}\n", 209 | "\n", 210 | "# 인덱스 생성\n", 211 | "es.indices.create(index=\"sample_eland_index\", body=mapping)\n", 212 | "\n", 213 | "# 인덱스에 샘플 데이터 넣기\n", 214 | "documents = [\n", 215 | " {\"some_field\": 95.0, \"column_a\": 5.0, \"column_b\": 10.0, \"category\": \"A\", \"value\": 50.0},\n", 216 | " {\"some_field\": 150.0, \"column_a\": 7.0, \"column_b\": 20.0, \"category\": \"B\", \"value\": 140.0},\n", 217 | " {\"some_field\": 200.0, \"column_a\": 8.0, \"column_b\": 25.0, \"category\": \"A\", \"value\": 200.0},\n", 218 | " {\"some_field\": 50.0, \"column_a\": 4.0, \"column_b\": 12.5, \"category\": \"C\", \"value\": 50.0}\n", 219 | "]\n", 220 | "\n", 221 | "for doc in documents:\n", 222 | " es.index(index=\"sample_eland_index\", body=doc)" 223 | ] 224 | }, 225 | { 226 | "cell_type": "markdown", 227 | "metadata": { 228 | "id": "VJzKKd6opNLf" 229 | }, 230 | "source": [ 231 | "## Eland 예제" 232 | ] 233 | }, 234 | { 235 | "cell_type": "code", 236 | "execution_count": null, 237 | "metadata": { 238 | "id": "puzcRKXHZTeB" 239 | }, 240 | "outputs": [], 241 | "source": [ 242 | "import eland as ed\n", 243 | "\n", 244 | "df = ed.DataFrame(es_client=es, es_index_pattern=\"sample_eland_index\")\n", 245 | "filtered_df = df[df['some_field'] > 100]\n", 246 | "filtered_df" 247 | ] 248 | }, 249 | { 250 | "cell_type": "code", 251 | "execution_count": null, 252 | "metadata": { 253 | "id": "kchuV7LmZ94y" 254 | }, 255 | "outputs": [], 256 | "source": [ 257 | "average_value = df['some_field'].mean()\n", 258 | "average_value" 259 | ] 260 | }, 261 | { 262 | "cell_type": "code", 263 | "execution_count": null, 264 | "metadata": { 265 | "id": "OUt_BZrQaA1l" 266 | }, 267 | "outputs": [], 268 | "source": [ 269 | "import seaborn as sns\n", 270 | "import pandas as pd\n", 271 | "\n", 272 | "filtered_df = df[df['some_field'] > 100]\n", 273 | "pandas_df = filtered_df.to_pandas()\n", 274 | "sns.boxplot(x='category', y='value', data=pandas_df)" 275 | ] 276 
| }, 277 | { 278 | "cell_type": "markdown", 279 | "metadata": { 280 | "id": "9LyU11jZaK_s" 281 | }, 282 | "source": [ 283 | "# 허깅 페이스의 모델을 일래스틱서치에서 불러오기" 284 | ] 285 | }, 286 | { 287 | "cell_type": "code", 288 | "execution_count": null, 289 | "metadata": { 290 | "id": "ERArF6WFaNt6" 291 | }, 292 | "outputs": [], 293 | "source": [ 294 | "pip -q install eland elasticsearch transformers sentence_transformers torch" 295 | ] 296 | }, 297 | { 298 | "cell_type": "code", 299 | "execution_count": null, 300 | "metadata": { 301 | "id": "WD-ajPzoaiCf" 302 | }, 303 | "outputs": [], 304 | "source": [ 305 | "from pathlib import Path\n", 306 | "from eland.ml.pytorch import PyTorchModel\n", 307 | "from eland.ml.pytorch.transformers import TransformerModel\n", 308 | "from elasticsearch import Elasticsearch\n", 309 | "from elasticsearch.client import MlClient\n", 310 | "import getpass" 311 | ] 312 | }, 313 | { 314 | "cell_type": "code", 315 | "execution_count": null, 316 | "metadata": { 317 | "id": "8CboLpvAatQv" 318 | }, 319 | "outputs": [], 320 | "source": [ 321 | "es_cloud_id = getpass.getpass('Enter Elastic Cloud ID: ')\n", 322 | "es_api_key = getpass.getpass('Enter user API key: ')" 323 | ] 324 | }, 325 | { 326 | "cell_type": "code", 327 | "execution_count": null, 328 | "metadata": { 329 | "id": "NlNW7YKJau7O" 330 | }, 331 | "outputs": [], 332 | "source": [ 333 | "# 일래스틱 클라우드 연결\n", 334 | "es = Elasticsearch(cloud_id=es_cloud_id,\n", 335 | " api_key=es_api_key\n", 336 | " )\n", 337 | "es.info() # 클러스터 정보 확인" 338 | ] 339 | }, 340 | { 341 | "cell_type": "code", 342 | "execution_count": null, 343 | "metadata": { 344 | "id": "hSyXy5aJa10s" 345 | }, 346 | "outputs": [], 347 | "source": [ 348 | "hf_model_id='sentence-transformers/msmarco-MiniLM-L-12-v3'\n", 349 | "tm = TransformerModel(model_id=hf_model_id, task_type=\"text_embedding\")" 350 | ] 351 | }, 352 | { 353 | "cell_type": "code", 354 | "execution_count": null, 355 | "metadata": { 356 | "id": "I4MsWWhga4CX" 357 | }, 358 | 
"outputs": [], 359 | "source": [ 360 | "es_model_id = tm.elasticsearch_model_id()\n", 361 | "es_model_id" 362 | ] 363 | }, 364 | { 365 | "cell_type": "code", 366 | "execution_count": null, 367 | "metadata": { 368 | "id": "eKtX28bTa50r" 369 | }, 370 | "outputs": [], 371 | "source": [ 372 | "tmp_path = \"models\"\n", 373 | "Path(tmp_path).mkdir(parents=True, exist_ok=True)\n", 374 | "model_path, config, vocab_path = tm.save(tmp_path)" 375 | ] 376 | }, 377 | { 378 | "cell_type": "code", 379 | "execution_count": null, 380 | "metadata": { 381 | "id": "nI3Ppm9ea7vW" 382 | }, 383 | "outputs": [], 384 | "source": [ 385 | "ptm = PyTorchModel(es, es_model_id)\n", 386 | "ptm.import_model(model_path=model_path, config_path=None, vocab_path=vocab_path, config=config)" 387 | ] 388 | }, 389 | { 390 | "cell_type": "code", 391 | "execution_count": null, 392 | "metadata": { 393 | "id": "AvRE1GPwa9aZ" 394 | }, 395 | "outputs": [], 396 | "source": [ 397 | "# Retrieve information about the model stored in Elasticsearch\n", 398 | "m = MlClient.get_trained_models(es, model_id=es_model_id)\n", 399 | "m.body" 400 | ] 401 | }, 402 | { 403 | "cell_type": "code", 404 | "execution_count": null, 405 | "metadata": { 406 | "id": "_L1rccH1a-zu" 407 | }, 408 | "outputs": [], 409 | "source": [ 410 | "s = MlClient.start_trained_model_deployment(es, model_id=es_model_id)\n", 411 | "s.body" 412 | ] 413 | }, 414 | { 415 | "cell_type": "code", 416 | "execution_count": null, 417 | "metadata": { 418 | "id": "-BQB368zbAyT" 419 | }, 420 | "outputs": [], 421 | "source": [ 422 | "# If the model deployment started in the previous cell has not finished, this cell raises an error such as IndexError: list index out of range. In that case, wait a moment and run the cell again.\n",
423 | "stats = MlClient.get_trained_models_stats(es, model_id=es_model_id)\n", 424 | "stats.body['trained_model_stats'][0]['deployment_stats']['nodes'][0]['routing_state']" 425 | ] 426 | }, 427 | { 428 | "cell_type": "code", 429 | "execution_count": null, 430 | "metadata": { 431 | "id": "ZELyEJvUbLOW" 432 | }, 433 | "outputs": [], 434 | "source": [ 435 | "docs = [\n", 436 | " {\n", 437 | " \"text_field\": \"Last week I upgraded my iOS version and ever since then my phone has been overheating whenever I use your app.\"\n", 438 | " }\n", 439 | " ]" 440 | ] 441 | }, 442 | { 443 | "cell_type": "code", 444 | "execution_count": null, 445 | "metadata": { 446 | "id": "hZ8U1oHGbM4l" 447 | }, 448 | "outputs": [], 449 | "source": [ 450 | "z = MlClient.infer_trained_model(es, model_id=es_model_id, docs=docs)" 451 | ] 452 | }, 453 | { 454 | "cell_type": "code", 455 | "execution_count": null, 456 | "metadata": { 457 | "id": "5q3j1WDmbNRT" 458 | }, 459 | "outputs": [], 460 | "source": [ 461 | "doc_0_vector = z['inference_results'][0]['predicted_value']\n", 462 | "doc_0_vector" 463 | ] 464 | }, 465 | { 466 | "cell_type": "markdown", 467 | "metadata": { 468 | "id": "UJT9EOxmn6pZ" 469 | }, 470 | "source": [ 471 | "# **3.6.1 Dimensionality reduction example code**" 472 | ] 473 | }, 474 | { 475 | "cell_type": "code", 476 | "execution_count": null, 477 | "metadata": { 478 | "id": "Yh5QvWj-n9Ri" 479 | }, 480 | "outputs": [], 481 | "source": [ 482 | "import numpy as np\n", 483 | "import matplotlib.pyplot as plt\n", 484 | "from sklearn import datasets\n", 485 | "from sklearn.decomposition import PCA" 486 | ] 487 | }, 488 | { 489 | "cell_type": "code", 490 | "execution_count": null, 491 | "metadata": { 492 | "id": "TyYMdCXNoChF" 493 | }, 494 | "outputs": [], 495 | "source": [ 496 | "# Load the Iris dataset\n", 497 | "iris = datasets.load_iris()\n", 498 | "X = iris.data\n", 499 | "y = iris.target\n", 500 | "\n", 501 | "# Apply PCA for dimensionality reduction\n", 502 | "pca = PCA(n_components=2)\n", 503 | "X_reduced = 
pca.fit_transform(X)" 504 | ] 505 | }, 506 | { 507 | "cell_type": "code", 508 | "execution_count": null, 509 | "metadata": { 510 | "id": "HgQwUpBJoFIg" 511 | }, 512 | "outputs": [], 513 | "source": [ 514 | "# 원 데이터 시각화\n", 515 | "plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Set1, edgecolor='k')\n", 516 | "plt.xlabel('Sepal length')\n", 517 | "plt.ylabel('Sepal width')\n", 518 | "plt.title('Original Iris dataset')\n", 519 | "plt.show()" 520 | ] 521 | }, 522 | { 523 | "cell_type": "code", 524 | "execution_count": null, 525 | "metadata": { 526 | "id": "w2x7VthJoIMu" 527 | }, 528 | "outputs": [], 529 | "source": [ 530 | "# 차원 축소된 데이터 시각화\n", 531 | "plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y, cmap=plt.cm.Set1, edgecolor='k')\n", 532 | "plt.xlabel('First Principal Component')\n", 533 | "plt.ylabel('Second Principal Component')\n", 534 | "plt.title('Iris dataset after PCA')\n", 535 | "plt.show()\n" 536 | ] 537 | }, 538 | { 539 | "cell_type": "markdown", 540 | "metadata": { 541 | "id": "eUluTeh2oXYX" 542 | }, 543 | "source": [ 544 | "# **3.6.2 양자화 예제 코드**" 545 | ] 546 | }, 547 | { 548 | "cell_type": "code", 549 | "execution_count": null, 550 | "metadata": { 551 | "id": "RPY1tGMvoY7L" 552 | }, 553 | "outputs": [], 554 | "source": [ 555 | "import numpy as np\n", 556 | "from sklearn import datasets\n", 557 | "from sklearn.decomposition import PCA\n", 558 | "from sklearn.preprocessing import MinMaxScaler, QuantileTransformer" 559 | ] 560 | }, 561 | { 562 | "cell_type": "code", 563 | "execution_count": null, 564 | "metadata": { 565 | "id": "1zUhu904oede" 566 | }, 567 | "outputs": [], 568 | "source": [ 569 | "# 숫자 데이터 세트 불러오기\n", 570 | "digits = datasets.load_digits()\n", 571 | "X = digits.data\n", 572 | "\n", 573 | "# 원본 데이터 세트에서 첫 번째 예시 출력\n", 574 | "print(\"Original dataset (first example):\\n\", X[0])" 575 | ] 576 | }, 577 | { 578 | "cell_type": "code", 579 | "execution_count": null, 580 | "metadata": { 581 | "id": "xWSyzBeqoltn" 582 | }, 583 | "outputs": [], 584 | 
"source": [ 585 | "# 차원 축소를 위해 PCA 적용\n", 586 | "pca = PCA(n_components=10)\n", 587 | "X_reduced = pca.fit_transform(X)\n", 588 | "\n", 589 | "# PCA 적용 후 첫 번째 예시 출력\n", 590 | "print(\"\\nReduced dataset after PCA (first example):\\n\", X_reduced[0])" 591 | ] 592 | }, 593 | { 594 | "cell_type": "code", 595 | "execution_count": null, 596 | "metadata": { 597 | "id": "AcNn_VWqon0t" 598 | }, 599 | "outputs": [], 600 | "source": [ 601 | "# 축소된 벡터를 [0, 255] 범위로 정규화\n", 602 | "scaler = MinMaxScaler((0, 255))\n", 603 | "X_scaled = scaler.fit_transform(X_reduced)\n", 604 | "\n", 605 | "# 정규화 후 첫 번째 예시 출력\n", 606 | "print(\"\\nScaled dataset after normalization (first example):\\n\", X_scaled[0])" 607 | ] 608 | }, 609 | { 610 | "cell_type": "code", 611 | "execution_count": null, 612 | "metadata": { 613 | "id": "B-R0I53hoqNj" 614 | }, 615 | "outputs": [], 616 | "source": [ 617 | "# 스케일링 된 벡터를 8비트 정수로 양자화\n", 618 | "X_quantized = np.round(X_scaled).astype(np.uint8)\n", 619 | "\n", 620 | "# 양자화 후 첫 번째 예시 출력\n", 621 | "print(\"\\nQuantized dataset (first example):\\n\", X_quantized[0])" 622 | ] 623 | } 624 | ], 625 | "metadata": { 626 | "colab": { 627 | "provenance": [] 628 | }, 629 | "kernelspec": { 630 | "display_name": "Python 3 (ipykernel)", 631 | "language": "python", 632 | "name": "python3" 633 | }, 634 | "language_info": { 635 | "codemirror_mode": { 636 | "name": "ipython", 637 | "version": 3 638 | }, 639 | "file_extension": ".py", 640 | "mimetype": "text/x-python", 641 | "name": "python", 642 | "nbconvert_exporter": "python", 643 | "pygments_lexer": "ipython3", 644 | "version": "3.8.10" 645 | } 646 | }, 647 | "nbformat": 4, 648 | "nbformat_minor": 0 649 | } -------------------------------------------------------------------------------- /chapter5/Chapter_5_Image_Search.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "id": "t0H_tQ34VU_I" 7 | }, 8 | 
"source": [ 9 | "# 이미지에 대한 벡터 표현 생성" 10 | ] 11 | }, 12 | { 13 | "cell_type": "markdown", 14 | "metadata": { 15 | "id": "F8E8p2pQVPQE" 16 | }, 17 | "source": [ 18 | "## 필수 라이브러리 설치" 19 | ] 20 | }, 21 | { 22 | "cell_type": "code", 23 | "execution_count": null, 24 | "metadata": { 25 | "id": "GmYAUfaGj7Xm" 26 | }, 27 | "outputs": [], 28 | "source": [ 29 | "!pip install sentence_transformers elasticsearch" 30 | ] 31 | }, 32 | { 33 | "cell_type": "code", 34 | "execution_count": null, 35 | "metadata": { 36 | "id": "i1McWk6ziuuF" 37 | }, 38 | "outputs": [], 39 | "source": [ 40 | "import getpass\n", 41 | "import torch\n", 42 | "import os\n", 43 | "import torchvision.transforms as transforms\n", 44 | "import json\n", 45 | "from PIL import Image\n", 46 | "from sentence_transformers import SentenceTransformer\n", 47 | "from elasticsearch import Elasticsearch, helpers" 48 | ] 49 | }, 50 | { 51 | "cell_type": "markdown", 52 | "metadata": { 53 | "id": "589m23IlSAkZ" 54 | }, 55 | "source": [ 56 | "## 샘플 이미지 다운로드" 57 | ] 58 | }, 59 | { 60 | "cell_type": "code", 61 | "execution_count": null, 62 | "metadata": { 63 | "id": "A1WxCPEkqhiF" 64 | }, 65 | "outputs": [], 66 | "source": [ 67 | "!curl -LJO https://raw.githubusercontent.com/PacktPublishing/Vector-Search-with-Elastic/main/chapter5/images/images.tar\n", 68 | "!tar xvf ./images.tar" 69 | ] 70 | }, 71 | { 72 | "cell_type": "code", 73 | "execution_count": null, 74 | "metadata": { 75 | "id": "TzQoiaW1i86l" 76 | }, 77 | "outputs": [], 78 | "source": [ 79 | "# 이미지를 포함하는 폴더 경로 지정 \n", 80 | "image_dir = './images/index'\n", 81 | "\n", 82 | "# 인덱스 명 정의\n", 83 | "index_name = 'images_book_demo'\n", 84 | "\n", 85 | "# 일래스틱 클라우드에 접속 \n", 86 | "es_cloud_id = getpass.getpass('Enter Elastic Cloud ID: ') \n", 87 | "es_api_key = getpass.getpass('Enter user API key: ')\n", 88 | "\n", 89 | "es = Elasticsearch(cloud_id=es_cloud_id, api_key=es_api_key)\n", 90 | "es.info() # 클러스터 정보 확인" 91 | ] 92 | }, 93 | { 94 | "cell_type": "code", 95 | 
"execution_count": null, 96 | "metadata": { 97 | "id": "ZOffdKqji9a-" 98 | }, 99 | "outputs": [], 100 | "source": [ 101 | "# 이미지를 임베딩 할 수 있는 모델 다운로드 및 로드 \n", 102 | "model = SentenceTransformer('clip-ViT-B-32-multilingual-v1')\n", 103 | "\n", 104 | "# 이미지 전처리 함수 구성\n", 105 | "transform = transforms.Compose([\n", 106 | " transforms.Resize(224),\n", 107 | " transforms.CenterCrop(224),\n", 108 | " lambda image: image.convert(\"RGB\"),\n", 109 | " transforms.ToTensor(),\n", 110 | " transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),\n", 111 | "])" 112 | ] 113 | }, 114 | { 115 | "cell_type": "code", 116 | "execution_count": null, 117 | "metadata": { 118 | "id": "-Mhvf2Jujm2f" 119 | }, 120 | "outputs": [], 121 | "source": [ 122 | "def create_mapping_if_new(index_name, es):\n", 123 | "\n", 124 | " # 맵핑 정의\n", 125 | " mapping = {\n", 126 | " \"mappings\": {\n", 127 | " \"properties\": {\n", 128 | " \"image_vector\": {\n", 129 | " \"type\": \"dense_vector\",\n", 130 | " \"dims\": 512,\n", 131 | " \"index\": True,\n", 132 | " \"similarity\": \"cosine\"\n", 133 | " } ,\n", 134 | " \"filename\": {\n", 135 | " \"type\": \"keyword\"\n", 136 | " }\n", 137 | " }\n", 138 | " }\n", 139 | " }\n", 140 | "\n", 141 | " # 동일한 이름의 인덱스가 존재하는지 체크\n", 142 | " if not es.indices.exists(index=index_name):\n", 143 | " # 정의된 맵핑으로 인덱스 생성 \n", 144 | " es.indices.create(index=index_name, body=mapping)\n", 145 | "\n", 146 | "def embed_image(image_path):\n", 147 | " # 이미지 파일 오픈 \n", 148 | " with Image.open(image_path) as img:\n", 149 | " # 앞서 정의한 방식으로 이미지 전처리 \n", 150 | " image = transform(img).unsqueeze(0)\n", 151 | "\n", 152 | " # 컴퓨터 환경을 체크하여 가능한 경우 GPU 활용 \n", 153 | " if torch.cuda.is_available():\n", 154 | " image = image.to('cuda')\n", 155 | " model.to('cuda')\n", 156 | "\n", 157 | " # 모델을 활용하여 이미지 벡터 생성\n", 158 | " image_vector = model.encode(image)\n", 159 | "\n", 160 | " # 이미지 벡터가 torch tensor 형태로 생성된 경우, CPU에서 처리 가능하도록 데이터를 변환\n", 161 | " if isinstance(image_vector, torch.Tensor):\n", 
162 | " image_vector = image_vector.cpu().numpy()\n", 163 | "\n", 164 | " # 리스트 형태로 변환 후,\n", 165 | " image_vector = image_vector.tolist()\n", 166 | "\n", 167 | " # 이미지 벡터 반환\n", 168 | " return image_vector" 169 | ] 170 | }, 171 | { 172 | "cell_type": "code", 173 | "execution_count": null, 174 | "metadata": { 175 | "id": "j_DIChZvjt9_" 176 | }, 177 | "outputs": [], 178 | "source": [ 179 | "# 동일한 인덱스가 존재하지 않는 경우, 정의된 맵핑으로 인덱스 생성 \n", 180 | "create_mapping_if_new(index_name, es)\n", 181 | "\n", 182 | "# 이미지 파일과 벡터를 저장할 딕셔너리 형태의 변수 선언\n", 183 | "data = {}\n", 184 | "\n", 185 | "# 폴더에 있는 이미지 순환 \n", 186 | "for image_file in os.listdir(image_dir):\n", 187 | " # 이미지 벡터 생성 \n", 188 | " image_vector = embed_image(os.path.join(image_dir, image_file))\n", 189 | "\n", 190 | " # 이미지 벡터를 data 딕셔너리에 저장 \n", 191 | " data[image_file] = image_vector[0]\n", 192 | "\n", 193 | "# 일래스틱서치에 이미지 벡터를 색인\n", 194 | "documents = []\n", 195 | "for filename, vector in data.items():\n", 196 | "\n", 197 | " # 색인 대상 문서 구성\n", 198 | " document = {'_index': index_name,\n", 199 | " '_source': {\"filename\": filename,\n", 200 | " \"image_vector\": vector\n", 201 | " }\n", 202 | " }\n", 203 | "\n", 204 | "\n", 205 | " documents.append(document)\n", 206 | "\n", 207 | "#documents" 208 | ] 209 | }, 210 | { 211 | "cell_type": "code", 212 | "execution_count": null, 213 | "metadata": { 214 | "id": "V04KgWyHpBMU" 215 | }, 216 | "outputs": [], 217 | "source": [ 218 | "from elasticsearch.helpers import BulkIndexError\n", 219 | "\n", 220 | "# bulk API를 통해 앞서 구성한 doc를 대량 색인\n", 221 | "try:\n", 222 | " helpers.bulk(es, documents)\n", 223 | "except BulkIndexError as e:\n", 224 | " for x in e.errors:\n", 225 | " print(x)" 226 | ] 227 | }, 228 | { 229 | "cell_type": "markdown", 230 | "metadata": { 231 | "id": "Fy-pUd8jucRy" 232 | }, 233 | "source": [ 234 | "# kNN 검색" 235 | ] 236 | }, 237 | { 238 | "cell_type": "markdown", 239 | "metadata": { 240 | "id": "gsl9xgJJHyRd" 241 | }, 242 | "source": [ 243 | "검색 쿼리용 이미지 벡터화" 
244 | ] 245 | }, 246 | { 247 | "cell_type": "code", 248 | "execution_count": null, 249 | "metadata": { 250 | "id": "Fk9yL15SGp5T" 251 | }, 252 | "outputs": [], 253 | "source": [ 254 | "search_image = './images/search/patrice-bouchard-Yu_ejF2s_dI-unsplash.jpg'\n", 255 | "search_image_vector = embed_image(search_image)" 256 | ] 257 | }, 258 | { 259 | "cell_type": "markdown", 260 | "metadata": { 261 | "id": "3BZKfbi7H7H6" 262 | }, 263 | "source": [ 264 | "kNN 벡터 검색 수행" 265 | ] 266 | }, 267 | { 268 | "cell_type": "code", 269 | "execution_count": null, 270 | "metadata": { 271 | "id": "fgV6DIoCFBlM" 272 | }, 273 | "outputs": [], 274 | "source": [ 275 | "knn = {\n", 276 | " \"field\": \"image_vector\",\n", 277 | " \"query_vector\": search_image_vector[0],\n", 278 | " \"k\": 1,\n", 279 | " \"num_candidates\": 10\n", 280 | " }\n", 281 | "fields = [\"filename\"]\n", 282 | "size = 1\n", 283 | "source = False" 284 | ] 285 | }, 286 | { 287 | "cell_type": "code", 288 | "execution_count": null, 289 | "metadata": { 290 | "id": "Zl9ys10oFBlM" 291 | }, 292 | "outputs": [], 293 | "source": [ 294 | "results = es.search(index=index_name,\n", 295 | " knn=knn,\n", 296 | " source=source,\n", 297 | " fields=fields,\n", 298 | " size=size\n", 299 | " )" 300 | ] 301 | }, 302 | { 303 | "cell_type": "code", 304 | "execution_count": null, 305 | "metadata": { 306 | "id": "LNG006HaWwtV" 307 | }, 308 | "outputs": [], 309 | "source": [ 310 | "result_filename = results['hits']['hits'][0]['fields']['filename'][0]" 311 | ] 312 | }, 313 | { 314 | "cell_type": "markdown", 315 | "metadata": { 316 | "id": "wZi0WvzyXyks" 317 | }, 318 | "source": "## 검색 결과 이미지 표시" 319 | }, 320 | { 321 | "cell_type": "code", 322 | "execution_count": null, 323 | "metadata": { 324 | "id": "Or986-SWXEcV" 325 | }, 326 | "outputs": [], 327 | "source": [ 328 | "from IPython.display import Image\n", 329 | "Image('./images/index/'+result_filename, width=400)" 330 | ] 331 | } 332 | ], 333 | "metadata": { 334 | "colab": { 335 | 
"provenance": [], 336 | "toc_visible": true 337 | }, 338 | "kernelspec": { 339 | "display_name": "Python 3", 340 | "name": "python3" 341 | }, 342 | "language_info": { 343 | "name": "python" 344 | } 345 | }, 346 | "nbformat": 4, 347 | "nbformat_minor": 0 348 | } 349 | -------------------------------------------------------------------------------- /chapter5/images/citations.txt: -------------------------------------------------------------------------------- 1 | All images sourced from unsplash.com which allows for commercial use 2 | https://unsplash.com/license 3 | 4 | Photographer attributions: 5 | Amy Reed 6 | Brock Kirk 7 | Jack Bulmeron 8 | Jeffrey Kelso 9 | Jody Confer 10 | Joshua J. Cotten 11 | Patrice Bouchard 12 | Timothy Kindrachuk 13 | Gerhard Crous 14 | -------------------------------------------------------------------------------- /chapter5/images/images.tar: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/wikibook/vector-search/0558d712dcd51b2e7ca72bddfb6a3581086c3798/chapter5/images/images.tar -------------------------------------------------------------------------------- /chapter6/chapter6_pii_redaction.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": { 7 | "id": "yD06d33fOK-y" 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "!pip install faker" 12 | ] 13 | }, 14 | { 15 | "cell_type": "markdown", 16 | "metadata": { 17 | "id": "cJtau0HASK4K" 18 | }, 19 | "source": [ 20 | "# **Basic pipeline setup example code**\n", 21 | "## Set up and load the NER model\n" 22 | ] 23 | }, 24 | { 25 | "cell_type": "markdown", 26 | "metadata": { 27 | "id": "Ly1f1P-l9ri8" 28 | }, 29 | "source": [ 30 | "### Install the required Python packages\n" 31 | ] 32 | }, 33 | { 34 | "cell_type": "markdown", 35 | "metadata": { 36 | "id": "MJAb_8zlPFhQ" 37 | }, 38 | "source": [ 39 | "Elastic uses the [eland Python library](https://github.com/elastic/eland) to download models from Hugging Face and load them into Elasticsearch."
40 | ] 41 | }, 42 | { 43 | "cell_type": "code", 44 | "execution_count": null, 45 | "metadata": { 46 | "id": "rUedSzQW9FIF" 47 | }, 48 | "outputs": [], 49 | "source": [ 50 | "!pip install eland elasticsearch transformers sentence_transformers torch" 51 | ] 52 | }, 53 | { 54 | "cell_type": "code", 55 | "execution_count": null, 56 | "metadata": { 57 | "id": "wyUZXUi4RWWL" 58 | }, 59 | "outputs": [], 60 | "source": [ 61 | "from pathlib import Path\n", 62 | "from eland.ml.pytorch import PyTorchModel\n", 63 | "from eland.ml.pytorch.transformers import TransformerModel\n", 64 | "from elasticsearch import Elasticsearch\n", 65 | "from elasticsearch.client import MlClient" 66 | ] 67 | }, 68 | { 69 | "cell_type": "markdown", 70 | "metadata": { 71 | "id": "r7nMIbHke37Q" 72 | }, 73 | "source": [ 74 | "### Configure Elasticsearch authentication\n", 75 | "We recommend authenticating with an [Elastic Cloud ID](https://www.elastic.co/guide/en/cloud/current/ec-cloud-id.html) and a [cluster API key](https://www.elastic.co/guide/en/kibana/current/api-keys.html).\n", 76 | "\n", 77 | "Any method of supplying the credentials works. The example below reads them with the getpass module so that they stay secure and are never stored in the Git repository." 
78 | ] 79 | }, 80 | { 81 | "cell_type": "code", 82 | "execution_count": null, 83 | "metadata": { 84 | "id": "Xsd2m7HoTCLm" 85 | }, 86 | "outputs": [], 87 | "source": [ 88 | "import getpass" 89 | ] 90 | }, 91 | { 92 | "cell_type": "code", 93 | "execution_count": null, 94 | "metadata": { 95 | "id": "SSGgYHome69o" 96 | }, 97 | "outputs": [], 98 | "source": [ 99 | "es_cloud_id = getpass.getpass('Enter Elastic Cloud ID: ')\n", 100 | "es_api_key = getpass.getpass('Enter user API key: ')" 101 | ] 102 | }, 103 | { 104 | "cell_type": "markdown", 105 | "metadata": { 106 | "id": "jL4VDnVp96lf" 107 | }, 108 | "source": [ 109 | "### Connect to Elastic Cloud" 110 | ] 111 | }, 112 | { 113 | "cell_type": "code", 114 | "execution_count": null, 115 | "metadata": { 116 | "id": "I8mVJkKmetXo" 117 | }, 118 | "outputs": [], 119 | "source": [ 120 | "es = Elasticsearch(cloud_id=es_cloud_id,\n", 121 | " api_key=es_api_key\n", 122 | " )\n", 123 | "es.info() # Check the cluster information" 124 | ] 125 | }, 126 | { 127 | "cell_type": "markdown", 128 | "metadata": { 129 | "id": "pOuF_1VnmVU-" 130 | }, 131 | "source": [ 132 | "## Download a model from the Hugging Face model hub and import it into Elasticsearch" 133 | ] 134 | }, 135 | { 136 | "cell_type": "markdown", 137 | "metadata": { 138 | "id": "NwIItJyhoWeT" 139 | }, 140 | "source": [ 141 | "The machine learning features of the Elastic Stack support transformer models that conform to the standard BERT model interface and use the WordPiece tokenization algorithm.\n", 142 | "\n", 143 | "The list of currently supported architectures is available here: [Elastic NLP supported models](https://www.elastic.co/guide/en/machine-learning/current/ml-nlp-model-ref.html)\n",
144 | "\n" 145 | ] 146 | }, 147 | { 148 | "cell_type": "markdown", 149 | "metadata": { 150 | "id": "10VvWJ87alld" 151 | }, 152 | "source": [ 153 | "### Download the NER model from Hugging Face using the copied HF model ID\n", 154 | "This example uses the [dslim/bert-base-NER](https://huggingface.co/dslim/bert-base-NER) model.\n" 155 | ] 156 | }, 157 | { 158 | "cell_type": "markdown", 159 | "metadata": { 160 | "id": "uBMWHj-ZmtvE" 161 | }, 162 | "source": [ 163 | "### Download the model\n", 164 | "Here we specify the model ID from Hugging Face. The easiest way to get this ID is to click the copy icon next to the model name on the model page.\n", 165 | "\n", 166 | "When calling `TransformerModel`, you specify the HF model ID and the task type. You can try specifying auto so that eland attempts to determine the type from the model configuration. This is not always possible, however, so the list of specific task_type values can be viewed in the [supported task_type code](https://github.com/elastic/eland/blob/15a300728876022b206161d71055c67b500a0192/eland/ml/pytorch/transformers.py#L41).\n", 167 | "\n", 168 | "_The warning about \"Some weights of the model checkpoint\" can be ignored._" 169 | ] 170 | }, 171 | { 172 | "cell_type": "code", 173 | "execution_count": null, 174 | "metadata": { 175 | "id": "zPV3oFsKiYFL" 176 | }, 177 | "outputs": [], 178 | "source": [ 179 | "hf_model_id='dslim/bert-base-NER'\n", 180 | "tm = TransformerModel(model_id=hf_model_id, task_type=\"ner\")" 181 | ] 182 | }, 183 | { 184 | "cell_type": "markdown", 185 | "metadata": { 186 | "id": "sX-9dHuDmwgX" 187 | }, 188 | "source": [ 189 | "### Set and verify the model ID\n", 190 | "To keep the name compatible with Elasticsearch, '/' is replaced with '__'.\n", 191 | "\n" 192 | ] 193 | }, 194 | { 195 | "cell_type": "code", 196 | "execution_count": null, 197 | "metadata": { 198 | "id": "XkIQBBCbdqvQ" 199 | }, 200 | "outputs": [], 201 | "source": [ 202 | "es_model_id = tm.elasticsearch_model_id()\n", 203 | "es_model_id" 204 | ] 205 | }, 206 | { 207 | "cell_type": "markdown", 208 | "metadata": { 209 | "id": "p0L2cfYwbIld" 210 | }, 211 | 
"source": [ 212 | "## 일래스틱서치가 사용하는 TorchScrpt로 모델 내보내기" 213 | ] 214 | }, 215 | { 216 | "cell_type": "code", 217 | "execution_count": null, 218 | "metadata": { 219 | "id": "GsSpvvP-nbCK" 220 | }, 221 | "outputs": [], 222 | "source": [ 223 | "tmp_path = \"models\"\n", 224 | "Path(tmp_path).mkdir(parents=True, exist_ok=True)\n", 225 | "model_path, config, vocab_path = tm.save(tmp_path)" 226 | ] 227 | }, 228 | { 229 | "cell_type": "markdown", 230 | "metadata": { 231 | "id": "k1a_yNo6ba2E" 232 | }, 233 | "source": [ 234 | "## 일래스틱서치에 모델을 로드\n", 235 | "일래스틱서치에 모델이 없어야 합니다." 236 | ] 237 | }, 238 | { 239 | "cell_type": "code", 240 | "execution_count": null, 241 | "metadata": { 242 | "id": "Z4QD71Apnj4j" 243 | }, 244 | "outputs": [], 245 | "source": [ 246 | "ptm = PyTorchModel(es, es_model_id)\n", 247 | "ptm.import_model(model_path=model_path, config_path=None, vocab_path=vocab_path, config=config)" 248 | ] 249 | }, 250 | { 251 | "cell_type": "markdown", 252 | "metadata": { 253 | "id": "4UYSzFp3vHdB" 254 | }, 255 | "source": [ 256 | "## 모델 시작" 257 | ] 258 | }, 259 | { 260 | "cell_type": "markdown", 261 | "metadata": { 262 | "id": "wQwfozwznK4Y" 263 | }, 264 | "source": [ 265 | "### 모델에 대한 정보 확인\n", 266 | "필수는 아니지만 모델 개요를 얻는 데 유용합니다." 267 | ] 268 | }, 269 | { 270 | "cell_type": "code", 271 | "execution_count": null, 272 | "metadata": { 273 | "id": "b4Wv8EJvpfZI" 274 | }, 275 | "outputs": [], 276 | "source": [ 277 | "# 일래스틱서치에 있는 목록\n", 278 | "m = MlClient.get_trained_models(es, model_id=es_model_id)\n", 279 | "m.body" 280 | ] 281 | }, 282 | { 283 | "cell_type": "markdown", 284 | "metadata": { 285 | "id": "oMGw3sk-pbaN" 286 | }, 287 | "source": [ 288 | "### 모델 배포\n", 289 | "ML 노드에 모델이 로드되고 NLP 작업에 사용할 수 있도록 프로세스가 시작됩니다.\n", 290 | "\n", 291 | "※ 머신러닝 노드의 메모리가 부족하면 ‘not enough memory on node~’라는 오류 메시지가 나타납니다. 
If memory is short because of the model used in the Chapter 3 exercises, you can delete that model by running the following command in Kibana under Dev Tools > Console.\n", 292 | "\n", 293 | "DELETE _ml/trained_models/sentence-transformers__msmarco-minilm-l-12-v3?force" 294 | ] 295 | }, 296 | { 297 | "cell_type": "code", 298 | "execution_count": null, 299 | "metadata": { 300 | "id": "w5muJ1rLqvUW" 301 | }, 302 | "outputs": [], 303 | "source": [ 304 | "s = MlClient.start_trained_model_deployment(es, model_id=es_model_id)\n", 305 | "s.body" 306 | ] 307 | }, 308 | { 309 | "cell_type": "markdown", 310 | "metadata": { 311 | "id": "ZytlELrsnn_O" 312 | }, 313 | "source": [ 314 | "### Verify that the model started without problems" 315 | ] 316 | }, 317 | { 318 | "cell_type": "code", 319 | "execution_count": null, 320 | "metadata": { 321 | "id": "ZaQUUWe0Hxwz" 322 | }, 323 | "outputs": [], 324 | "source": [ 325 | "stats = MlClient.get_trained_models_stats(es, model_id=es_model_id)\n", 326 | "stats.body['trained_model_stats'][0]['deployment_stats']['nodes'][0]['routing_state']" 327 | ] 328 | }, 329 | { 330 | "cell_type": "markdown", 331 | "metadata": { 332 | "id": "iO7YxIbTRZ-b" 333 | }, 334 | "source": [ 335 | "## Create the pipeline" 336 | ] 337 | }, 338 | { 339 | "cell_type": "code", 340 | "execution_count": null, 341 | "metadata": { 342 | "id": "DsUkWlNkNsSB" 343 | }, 344 | "outputs": [], 345 | "source": [ 346 | "description= \"PII redacting ingest pipeline\"\n", 347 | "\n", 348 | "processors= [\n", 349 | " {\n", 350 | " \"set\": {\n", 351 | " \"field\": \"redacted\",\n", 352 | " \"value\": \"{{{message}}}\"\n", 353 | " }\n", 354 | " },\n", 355 | " {\n", 356 | " \"inference\": {\n", 357 | " \"model_id\": es_model_id,\n", 358 | " \"field_map\": {\n", 359 | " \"message\": \"text_field\"\n", 360 | " }\n", 361 | " }\n", 362 | " },\n", 363 | " {\n", 364 | " \"script\": {\n", 365 | " \"lang\": \"painless\",\n", 366 | " \"source\": \"\"\"String msg = ctx['message'];\n", 367 | " for (item in ctx['ml']['inference']['entities']) {\n", 368 | " msg = msg.replace(item['entity'], '<' + 
item['class_name'] + '>')\n", 369 | " }\n", 370 | " ctx['redacted']=msg\"\"\",\n", 371 | " \"if\": \"return ctx['ml']['inference']['entities'].isEmpty() == false\",\n", 372 | " \"tag\": \"ner_redact\",\n", 373 | " \"description\": \"Redact NER entities\"\n", 374 | " }\n", 375 | " },\n", 376 | " {\n", 377 | " \"redact\": {\n", 378 | " \"field\": \"redacted\",\n", 379 | " \"patterns\": [\n", 380 | " \"%{PHONE:PHONE}\",\n", 381 | " \"%{SSN:SSN}\"\n", 382 | " ],\n", 383 | " \"pattern_definitions\": {\n", 384 | " \"SSN\": r\"\"\"\\d{6}-?\\d{7}\"\"\",\n", 385 | " \"PHONE\": r\"\"\"\\d{3}-?\\d{3}-?\\d{4}\"\"\"\n", 386 | " }\n", 387 | " }\n", 388 | " },\n", 389 | " {\n", 390 | " \"remove\": {\n", 391 | " \"field\": [\n", 392 | " \"message\",\n", 393 | " \"ml\"\n", 394 | " ]\n", 395 | " }\n", 396 | " }\n", 397 | " ]\n", 398 | "\n", 399 | "on_failure= [\n", 400 | " {\n", 401 | " \"set\": {\n", 402 | " \"field\": \"failure\",\n", 403 | " \"value\": \"pii_script-redact\"\n", 404 | " }\n", 405 | " }\n", 406 | " ]\n", 407 | "\n", 408 | "\n", 409 | "\n", 410 | "response = es.ingest.put_pipeline(id=\"pii_redaction_pipeline_book\",\n", 411 | " description=description,\n", 412 | " processors=processors,\n", 413 | " on_failure=on_failure\n", 414 | ")\n", 415 | "\n", 416 | "\n", 417 | "# Print the response\n", 418 | "print(response)" 419 | ] 420 | }, 421 | { 422 | "cell_type": "markdown", 423 | "metadata": { 424 | "id": "YuORHcRMTd0B" 425 | }, 426 | "source": [ 427 | "## Configure the index template\n", 428 | "The template applies to every index whose name matches the `pii_data*` pattern.\n", 429 | "It defines the mappings for the fields.\n", 430 | "It configures new data to use `pii_redaction_pipeline_book` by default.\n" 431 | ] 432 | }, 433 | { 434 | "cell_type": "code", 435 | "execution_count": null, 436 | "metadata": { 437 | "id": "dD4Q_tHxTjYi" 438 | }, 439 | "outputs": [], 440 | "source": [ 441 | "index_patterns =[\n", 442 | " \"pii_data*\"\n", 443 | " ]\n", 444 | "order = 1\n", 445 | "settings = {\n", 446 | " \"number_of_shards\": 1,\n", 447 | " \"number_of_replicas\": 1,\n", 448 | 
" \"index.default_pipeline\": \"pii_redaction_pipeline_book\"\n", 449 | " }\n", 450 | "mappings = {\n", 451 | " \"properties\": {\n", 452 | " \"message\": {\n", 453 | " \"type\": \"text\"\n", 454 | " },\n", 455 | " \"status\": {\n", 456 | " \"type\": \"keyword\"\n", 457 | " },\n", 458 | " \"redacted\": {\n", 459 | " \"type\": \"text\"\n", 460 | " }\n", 461 | " }\n", 462 | " }\n", 463 | "\n", 464 | "\n", 465 | "# 인덱스 템플릿 생성\n", 466 | "response = es.indices.put_template(name=\"pii_book_template\",\n", 467 | " index_patterns=index_patterns,\n", 468 | " order=order,\n", 469 | " settings=settings,\n", 470 | " mappings=mappings\n", 471 | " )\n", 472 | "\n", 473 | "# 응답값 출력\n", 474 | "print(response)\n" 475 | ] 476 | }, 477 | { 478 | "cell_type": "markdown", 479 | "metadata": { 480 | "id": "5obtia7wRSae" 481 | }, 482 | "source": [ 483 | "# **가짜 PII 만들기 예제 코드**\n", 484 | "## 가짜 PII 데이터 생성" 485 | ] 486 | }, 487 | { 488 | "cell_type": "code", 489 | "execution_count": null, 490 | "metadata": { 491 | "id": "99PLiMoHNo75" 492 | }, 493 | "outputs": [], 494 | "source": [ 495 | "from faker import Faker\n", 496 | "import json\n", 497 | "from pprint import pprint\n", 498 | "\n", 499 | "# Faker 클래스 인스턴스 생성\n", 500 | "fake = Faker('ko_kr')\n", 501 | "\n", 502 | "# 가짜 PII 생성 함수 정의\n", 503 | "def generate_fake_pii(num_records):\n", 504 | "\n", 505 | " fake_data = []\n", 506 | "\n", 507 | " for x in range(num_records):\n", 508 | " # 가짜 PII 생성\n", 509 | " fn = fake.first_name()\n", 510 | " ln = fake.last_name()\n", 511 | " pn = fake.phone_number()\n", 512 | " sn = fake.ssn()\n", 513 | " ai = fake.random_element(elements=('활성화', '비활성화'))\n", 514 | "\n", 515 | " call_log = {\n", 516 | " 'message' : f'{ln}{fn}의 전화번호는 {pn} 이고 주민등록번호는 {sn} 입니다.',\n", 517 | " 'status' : ai\n", 518 | " }\n", 519 | "\n", 520 | " fake_data.append(call_log)\n", 521 | " return fake_data\n", 522 | "\n", 523 | "# N 명의 가짜 PII 정보 생성\n", 524 | "num_records = 10 # 가짜 PII 정보를 생성할 숫자 설정\n", 525 | "fake_pii_data = 
generate_fake_pii(num_records)\n", 526 | "\n", 527 | "pprint(fake_pii_data)" 528 | ] 529 | }, 530 | { 531 | "cell_type": "markdown", 532 | "metadata": { 533 | "id": "cQKs5Y79TPsw" 534 | }, 535 | "source": [ 536 | "## 일래스틱서치에 가짜 데이터 수집\n", 537 | "데이터를 pii_data 인덱스로 저장하면 pii 제거 파이프라인을 통해 저장됩니다." 538 | ] 539 | }, 540 | { 541 | "cell_type": "code", 542 | "execution_count": null, 543 | "metadata": { 544 | "id": "paxHUfguOu88" 545 | }, 546 | "outputs": [], 547 | "source": [ 548 | "from elasticsearch import Elasticsearch, helpers\n", 549 | "\n", 550 | "# 대량 인덱서\n", 551 | "# 일래스틱서치 문서 배열을 만드는 함수 정의\n", 552 | "def generate_documents_array(fake_data, index_name):\n", 553 | "\n", 554 | " # 문서를 저장할 빈 배열 만들기\n", 555 | " documents = []\n", 556 | "\n", 557 | " # 가짜 데이터 목록을 문서 형식에 맞게 변환\n", 558 | " for call in fake_data:\n", 559 | " # _index와 _source 키를 사용해 문서 만들기\n", 560 | " document = {\n", 561 | " '_index': index_name,\n", 562 | " '_source': call\n", 563 | " }\n", 564 | "\n", 565 | " # 문서 배열에 문서 추가\n", 566 | " documents.append(document)\n", 567 | "\n", 568 | " return documents\n", 569 | "\n", 570 | "\n", 571 | "# 일래스틱서치 문서 배열 만들기\n", 572 | "index_name = 'pii_data-book' # 인덱스 이름 설정\n", 573 | "documents_array = generate_documents_array(fake_pii_data, index_name)\n", 574 | "\n", 575 | "# 대량 색인 요청 본문을 개행 구분자를 사용해 단일 문자로 출력\n", 576 | "print(\"Bulk request: \")\n", 577 | "print(documents_array)\n", 578 | "\n", 579 | "try:\n", 580 | "\tresponse = helpers.bulk(es, documents_array)\n", 581 | "\tprint (\"\\nRESPONSE:\", response)\n", 582 | "except Exception as e:\n", 583 | "\tprint(\"\\nERROR:\", e)" 584 | ] 585 | }, 586 | { 587 | "cell_type": "code", 588 | "execution_count": null, 589 | "metadata": { 590 | "id": "1t7EKO5xFj3_" 591 | }, 592 | "outputs": [], 593 | "source": [ 594 | "query = {\"match_all\":{}}\n", 595 | "\n", 596 | "response = es.search(\n", 597 | " index=index_name,\n", 598 | " query=query\n", 599 | " )\n", 600 | "\n", 601 | "pprint(response['hits']['hits'])" 602 | ] 603 
| } 604 | ], 605 | "metadata": { 606 | "colab": { 607 | "collapsed_sections": [ 608 | "5obtia7wRSae", 609 | "cJtau0HASK4K", 610 | "iO7YxIbTRZ-b" 611 | ], 612 | "provenance": [], 613 | "toc_visible": true 614 | }, 615 | "kernelspec": { 616 | "display_name": "Python 3 (ipykernel)", 617 | "language": "python", 618 | "name": "python3" 619 | }, 620 | "language_info": { 621 | "codemirror_mode": { 622 | "name": "ipython", 623 | "version": 3 624 | }, 625 | "file_extension": ".py", 626 | "mimetype": "text/x-python", 627 | "name": "python", 628 | "nbconvert_exporter": "python", 629 | "pygments_lexer": "ipython3", 630 | "version": "3.8.10" 631 | } 632 | }, 633 | "nbformat": 4, 634 | "nbformat_minor": 0 635 | } -------------------------------------------------------------------------------- /chapter7/vector-search-for-observability.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Install the required libraries" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": null, 13 | "metadata": {}, 14 | "outputs": [], 15 | "source": [ 16 | "!pip install faker\n", 17 | "!pip install openai\n", 18 | "!pip install elasticsearch\n", 19 | "!pip install scikit-learn" 20 | ] 21 | }, 22 | { 23 | "cell_type": "markdown", 24 | "metadata": {}, 25 | "source": "## Synthetic log generation functions" 26 | }, 27 | { 28 | "cell_type": "code", 29 | "execution_count": 1, 30 | "metadata": {}, 31 | "outputs": [], 32 | "source": [ 33 | "import openai\n", 34 | "import ast\n", 35 | "import json\n", 36 | "from elasticsearch import Elasticsearch, helpers\n", 37 | "from sklearn.feature_extraction.text import TfidfVectorizer\n", 38 | "from faker import Faker\n", 39 | "import random\n", 40 | "\n", 41 | "fake = Faker()\n", 42 | "\n", 43 | "# 1. 
아파치 HTTP 서버 (일반 로그 형식)\n", 44 | "def generate_apache_log():\n", 45 | " return '{RemoteHost} - - [{Timestamp}] \"{RequestMethod} {RequestURI} {Protocol}\" {StatusCode} {ResponseSize}'.format(\n", 46 | " RemoteHost=fake.ipv4(),\n", 47 | " Timestamp=fake.date_time_this_year().strftime('%d/%b/%Y:%H:%M:%S %z'),\n", 48 | " RequestMethod=fake.http_method(),\n", 49 | " RequestURI=fake.uri(),\n", 50 | " Protocol='HTTP/1.1',\n", 51 | " StatusCode=random.choice([200, 404, 500]),\n", 52 | " ResponseSize=random.randint(100, 10000)\n", 53 | " )\n", 54 | "\n", 55 | "# 2. 엔진엑스 (혼합 로그 형식)\n", 56 | "def generate_nginx_log():\n", 57 | " return '{RemoteAddress} - {RemoteUser} [{Timestamp}] \"{RequestMethod} {RequestURI} {Protocol}\" {StatusCode} {ResponseSize} \\\n", 58 | "\"{Referer}\" \"{UserAgent}\"'.format(\n", 59 | " RemoteAddress=fake.ipv4(),\n", 60 | " RemoteUser='-',\n", 61 | " Timestamp=fake.date_time_this_year().strftime('%d/%b/%Y:%H:%M:%S %z'),\n", 62 | " RequestMethod=fake.http_method(),\n", 63 | " RequestURI=fake.uri(),\n", 64 | " Protocol='HTTP/1.1',\n", 65 | " StatusCode=random.choice([200, 404, 500]),\n", 66 | " ResponseSize=random.randint(100, 10000),\n", 67 | " Referer=fake.uri(),\n", 68 | " UserAgent=fake.user_agent()\n", 69 | " )\n", 70 | "\n", 71 | "# 3. 시스로그 (RFC 5424)\n", 72 | "def generate_syslog():\n", 73 | " return '<{Priority}>{Version} {Timestamp} {Hostname} {AppName} {ProcID} {MsgID} {StructuredData} {Message}'.format(\n", 74 | " Priority=random.randint(1, 191),\n", 75 | " Version=1,\n", 76 | " Timestamp=fake.date_time_this_year().isoformat(),\n", 77 | " Hostname=fake.hostname(),\n", 78 | " AppName=fake.word(),\n", 79 | " ProcID=random.randint(1000, 9999),\n", 80 | " MsgID=random.randint(1000, 9999),\n", 81 | " StructuredData='-',\n", 82 | " Message=fake.sentence()\n", 83 | " )\n", 84 | "\n", 85 | "# 4. 
아마존 웹 서비스 클라우드 트레일\n", 86 | "def generate_aws_cloudtrail_log():\n", 87 | " return '{{\"eventVersion\": \"{EventVersion}\", \"userIdentity\": {{\"type\": \"IAMUser\", \"userName\": \"{UserName}\"}}, \\\n", 88 | "\"eventTime\": \"{Timestamp}\", \"eventSource\": \"{EventSource}\", \"eventName\": \"{EventName}\", \"awsRegion\": \"{AwsRegion}\" , \\\n", 89 | "\"sourceIPAddress\": \"{SourceIPAddress}\", \"userAgent\": \"{UserAgent}\", \"requestParameters\": {{\"key\": \"value\"}}, \\\n", 90 | "\"responseElements\": {{\"key\": \"value\"}}, \"requestID\": \"{RequestId}\", \"eventID\": \"{EventId}\", \"eventType\": \"AwsApiCall\", \\\n", 91 | "\"recipientAccountId\": \"{RecipientAccountId}\"}}'.format(\n", 92 | " EventVersion='1.08',\n", 93 | " UserName=fake.user_name(),\n", 94 | " Timestamp=fake.date_time_this_year().isoformat(),\n", 95 | " EventSource='s3.amazonaws.com',\n", 96 | " EventName='GetObject',\n", 97 | " AwsRegion='us-east-1',\n", 98 | " SourceIPAddress=fake.ipv4(),\n", 99 | " UserAgent=fake.user_agent(),\n", 100 | " RequestId=fake.uuid4(),\n", 101 | " EventId=fake.uuid4(),\n", 102 | " RecipientAccountId=fake.random_number(digits=12)\n", 103 | " )\n", 104 | "\n", 105 | "# 5. 마이크로소프트 윈도우즈 이벤트 로그\n", 106 | "def generate_windows_event_log():\n", 107 | " return '\\\n", 108 | "{EventID}{Level}{SourceName}\\\n", 109 | "{Computer}{Message}'.format(\n", 110 | " ProviderName=fake.word(),\n", 111 | " EventID=random.randint(1000, 9999),\n", 112 | " Level=random.randint(1, 5),\n", 113 | " Timestamp=fake.date_time_this_year().isoformat(),\n", 114 | " SourceName=fake.word(),\n", 115 | " Computer=fake.hostname(),\n", 116 | " Message=fake.sentence()\n", 117 | " )\n", 118 | "\n", 119 | "# 6. 
리눅스 감사 로그\n", 120 | "def generate_linux_audit_log():\n", 121 | " return 'type={AuditType} msg=audit({Timestamp}): {Message}'.format(\n", 122 | " AuditType=fake.word(),\n", 123 | " Timestamp=fake.date_time_this_year().isoformat(),\n", 124 | " Message=fake.sentence()\n", 125 | " )\n", 126 | "\n", 127 | "def generate_logs(sources, total_logs, random_logs):\n", 128 | " # 함수 이름과 로그 종류 맵핑\n", 129 | " source_to_function = {\n", 130 | " 'apache': generate_apache_log,\n", 131 | " 'nginx': generate_nginx_log,\n", 132 | " 'syslog': generate_syslog,\n", 133 | " 'aws_cloudtrail': generate_aws_cloudtrail_log,\n", 134 | " 'windows_event': generate_windows_event_log,\n", 135 | " 'linux_audit': generate_linux_audit_log,\n", 136 | " }\n", 137 | " \n", 138 | " # 각 로그 종류별로 생성할 로그의 수 계산\n", 139 | " num_sources = len(sources)\n", 140 | " logs_per_source = [total_logs // num_sources] * num_sources\n", 141 | " if random_logs:\n", 142 | " for i in range(total_logs % num_sources):\n", 143 | " logs_per_source[i] += 1\n", 144 | " random.shuffle(logs_per_source)\n", 145 | " \n", 146 | " # 로그를 생성하고 리스트에 생성된 합성 로그 추가\n", 147 | " generated_logs = []\n", 148 | " for source, num_logs in zip(sources, logs_per_source):\n", 149 | " log_function = source_to_function[source]\n", 150 | " for _ in range(num_logs):\n", 151 | " generated_logs.append(log_function())\n", 152 | " \n", 153 | " return generated_logs\n", 154 | "\n" 155 | ] 156 | }, 157 | { 158 | "cell_type": "markdown", 159 | "metadata": {}, 160 | "source": "## OpenAI를 활용한 로그 확장" 161 | }, 162 | { 163 | "cell_type": "code", 164 | "metadata": {}, 165 | "source": [ 166 | "# 활용 예\n", 167 | "sources_to_use = ['apache']\n", 168 | "total_logs_to_generate = 15\n", 169 | "random_logs_per_source = True\n", 170 | "logs = generate_logs(sources_to_use, total_logs_to_generate, random_logs_per_source)\n", 171 | "\n", 172 | "\n", 173 | "stringifiedPromptsArray = json.dumps(logs)\n", 174 | "\n", 175 | "print(\"Logs: \")\n", 176 | "print(logs)\n", 177 | "\n", 178 
| "prompts = [\n", 179 | " {\n", 180 | " \"role\": \"user\",\n", 181 | " \"content\": stringifiedPromptsArray\n", 182 | " }\n", 183 | "]\n", 184 | "\n", 185 | "batchInstruction = {\n", 186 | " \"role\": \"system\",\n", 187 | " \"content\": \"Explain what happened for each log line of the array. Return a python array of the explanation. Only the array, no text around it or any extra comment, nothing else than the array should be in the answer. Don't forget in your completion to give the day, date and year of the log. Interpret some of the log content if you can, for example you have to translate what an error code 500.\"\n", 188 | "}\n", 189 | "\n", 190 | "prompts.append(batchInstruction)\n", 191 | "print(\"ChatGPT: \")\n", 192 | "\n", 193 | "\n", 194 | "# OpenAI API 키 선언\n", 195 | "openai_api_key = \"OPENAI_API_KEY\"\n", 196 | "\n", 197 | "# OpenAI 클라이언트 초기화\n", 198 | "openai.api_key = openai_api_key\n", 199 | "\n", 200 | "stringifiedBatchCompletion = openai.chat.completions.create(model=\"gpt-3.5-turbo\", messages=prompts, max_tokens=1000)\n", 201 | "print(stringifiedBatchCompletion.choices[0].message.content)\n", 202 | "batchCompletion = ast.literal_eval(stringifiedBatchCompletion.choices[0].message.content)\n", 203 | "\n", 204 | "#batchCompletion" 205 | ], 206 | "outputs": [], 207 | "execution_count": null 208 | }, 209 | { 210 | "cell_type": "markdown", 211 | "metadata": {}, 212 | "source": [ 213 | "## 로그 벡터화" 214 | ] 215 | }, 216 | { 217 | "cell_type": "code", 218 | "execution_count": null, 219 | "metadata": {}, 220 | "outputs": [], 221 | "source": [ 222 | "# 일래스틱서치 접속 정보 입력\n", 223 | "import getpass\n", 224 | "es_cloud_id = getpass.getpass('Enter Elastic Cloud ID: ') \n", 225 | "es_api_key = getpass.getpass('Enter user API key: ')\n", 226 | "\n", 227 | "# 일래스틱 클라우드 접속\n", 228 | "es = Elasticsearch(cloud_id=es_cloud_id, api_key=es_api_key)\n", 229 | "\n", 230 | "# 인덱스 맵핑 설정\n", 231 | "index_config = {\n", 232 | " \"mappings\": {\n", 233 | " \"properties\": 
{\n", 234 | " \"description_vectorized\": {\n", 235 | " \"type\": \"dense_vector\",\n", 236 | " \"dims\": 768,\n", 237 | " \"index\": True,\n", 238 | " \"similarity\": \"cosine\"\n", 239 | " }\n", 240 | " }\n", 241 | " }\n", 242 | "}\n", 243 | "\n", 244 | "# 인덱스 생성\n", 245 | "response = es.indices.create(index='logs', body=index_config)" 246 | ] 247 | }, 248 | { 249 | "cell_type": "markdown", 250 | "metadata": {}, 251 | "source": "#### ※ 일괄 색인 수행 전 7.4.2 모델 적재, 7.4.3 수집 파이프라인 생성 단계 수행이 필요합니다. " 252 | }, 253 | { 254 | "cell_type": "code", 255 | "metadata": {}, 256 | "source": [ 257 | "# 일괄 색인을 위한 JSON 문서 생성 \n", 258 | "bulk_index_body = []\n", 259 | "for index, log in enumerate(batchCompletion):\n", 260 | " document = {\n", 261 | " \"_index\": \"logs\", \n", 262 | " \"pipeline\": \"vectorize-log\",\n", 263 | " \"_source\": {\n", 264 | " \"text_field\": log, \"log\": logs[index]\n", 265 | " }\n", 266 | " }\n", 267 | " bulk_index_body.append(document)\n", 268 | "\n", 269 | "# 일괄 색인 문서 확인 \n", 270 | "print(\"Bulk request: \")\n", 271 | "print(bulk_index_body)\n", 272 | "\n", 273 | "try:\n", 274 | " response = helpers.bulk(es, bulk_index_body)\n", 275 | " print (\"\\nRESPONSE:\", response)\n", 276 | "except Exception as e:\n", 277 | " print(\"\\nERROR:\", e)\n" 278 | ], 279 | "outputs": [], 280 | "execution_count": null 281 | }, 282 | { 283 | "cell_type": "markdown", 284 | "metadata": {}, 285 | "source": [ 286 | "## 시맨틱 검색" 287 | ] 288 | }, 289 | { 290 | "cell_type": "code", 291 | "metadata": {}, 292 | "source": [ 293 | "\n", 294 | "def ESSearch(query_text):\n", 295 | " # 일래스틱서치 BM25와 kNN 하이브리드 검색\n", 296 | " query = {\n", 297 | " \"bool\": {\n", 298 | " \"filter\": [{\n", 299 | " \"exists\": {\n", 300 | " \"field\": \"description_vectorized\"\n", 301 | " }\n", 302 | " }]\n", 303 | " }\n", 304 | " }\n", 305 | "\n", 306 | " knn = {\n", 307 | " \"field\": \"description_vectorized\",\n", 308 | " \"k\": 1,\n", 309 | " \"num_candidates\": 20,\n", 310 | " 
\"query_vector_builder\": {\n", 311 | " \"text_embedding\": {\n", 312 | " \"model_id\": \"sentence-transformers__all-distilroberta-v1\",\n", 313 | " \"model_text\": query_text\n", 314 | " }\n", 315 | " },\n", 316 | " \"boost\": 24\n", 317 | " }\n", 318 | "\n", 319 | " fields = [\"text_field\"]\n", 320 | " index = 'logs'\n", 321 | " resp = es.search(index=index,\n", 322 | " query=query,\n", 323 | " knn=knn,\n", 324 | " fields=fields,\n", 325 | " size=1,\n", 326 | " source=False)\n", 327 | "\n", 328 | "\n", 329 | " #print(resp['hits']['hits'][0]['fields']['text_field'][0])\n", 330 | " return resp['hits']['hits'][0]['fields']['text_field'][0]\n", 331 | "\n", 332 | "\n", 333 | "ESSearch(\"Were there any error in March?\")" 334 | ], 335 | "outputs": [], 336 | "execution_count": null 337 | }, 338 | { 339 | "cell_type": "markdown", 340 | "metadata": {}, 341 | "source": [ 342 | "# 7.4.2 모델 적재 - 일래스틱서치에 임베딩 모델 적재하기\n", 343 | "#### 임베딩에 필요한 모델을 일래스틱서치에 적재합니다." 344 | ] 345 | }, 346 | { 347 | "cell_type": "code", 348 | "execution_count": null, 349 | "metadata": {}, 350 | "outputs": [], 351 | "source": [ 352 | "# 필요시 관련 라이브러리 설치 \n", 353 | "from pathlib import Path\n", 354 | "from eland.ml.pytorch import PyTorchModel\n", 355 | "from eland.ml.pytorch.transformers import TransformerModel\n", 356 | "\n", 357 | "# 허깅 페이스의 모델명과 작업 유형 설정\n", 358 | "hf_model_id='sentence-transformers/all-distilroberta-v1'\n", 359 | "tm = TransformerModel(model_id=hf_model_id, task_type=\"text_embedding\")\n", 360 | "\n", 361 | "# 일래스틱서치에서 이름으로 사용할 modelID 설정\n", 362 | "es_model_id = tm.elasticsearch_model_id()\n", 363 | "\n", 364 | "# 허깅 페이스에서 모델 다운로드\n", 365 | "tmp_path = \"models\"\n", 366 | "Path(tmp_path).mkdir(parents=True, exist_ok=True)\n", 367 | "model_path, config, vocab_path = tm.save(tmp_path)\n", 368 | "\n", 369 | "# 일래스틱서치에 모델 적재\n", 370 | "ptm = PyTorchModel(es, es_model_id)\n", 371 | "ptm.import_model(model_path=model_path, config_path=None, vocab_path=vocab_path, config=config)\n" 372 
| ] 373 | }, 374 | { 375 | "cell_type": "code", 376 | "execution_count": 46, 377 | "metadata": {}, 378 | "outputs": [], 379 | "source": [ 380 | "from elasticsearch import Elasticsearch\n", 381 | "from elasticsearch.client import MlClient\n", 382 | "\n", 383 | "# 모델을 사용할 수 있는 상태로 배포\n", 384 | "s = MlClient.start_trained_model_deployment(es, model_id=es_model_id)" 385 | ] 386 | }, 387 | { 388 | "cell_type": "code", 389 | "metadata": {}, 390 | "source": [ 391 | "# 임베딩 모델 정상 동작 여부 확인 \n", 392 | "docs = [\n", 393 | " {\n", 394 | " \"text_field\": \"Last week I upgraded my iOS version and ever since then my phone has been overheating whenever I use your app.\"\n", 395 | " }\n", 396 | "]\n", 397 | "\n", 398 | "z = MlClient.infer_trained_model(es, model_id=es_model_id, docs=docs)\n", 399 | "doc_0_vector = z['inference_results'][0]['predicted_value']\n", 400 | "doc_0_vector" 401 | ], 402 | "outputs": [], 403 | "execution_count": null 404 | }, 405 | { 406 | "cell_type": "markdown", 407 | "metadata": {}, 408 | "source": [ 409 | "# 7.4.3 수집 파이프라인 생성\n", 410 | "#### 아래의 일래스틱서치 API를 키바나에서 수행하여 일괄 색인 API에서 활용되는 vectorize-log 파이프라인을 정의합니다. 
" 411 | ] 412 | }, 413 | { 414 | "cell_type": "code", 415 | "execution_count": null, 416 | "metadata": {}, 417 | "outputs": [], 418 | "source": [ 419 | "#PUT _ingest/pipeline/vectorize-log\n", 420 | "#{\n", 421 | "# \"description\": \"ingest pipe for chapter 7\",\n", 422 | "# \"processors\": [\n", 423 | "# {\n", 424 | "# \"inference\": {\n", 425 | "# \"model_id\": \"sentence-transformers__all-distilroberta-v1\",\n", 426 | "# \"target_field\": \"description_vectorized\"\n", 427 | "# }\n", 428 | "# },\n", 429 | "# {\n", 430 | "# \"set\": {\n", 431 | "# \"field\": \"description_vectorized\",\n", 432 | "# \"copy_from\": \"description_vectorized.predicted_value\"\n", 433 | "# }\n", 434 | "# }\n", 435 | "# ]\n", 436 | "#}" 437 | ] 438 | } 439 | ], 440 | "metadata": { 441 | "kernelspec": { 442 | "display_name": "venv", 443 | "language": "python", 444 | "name": "python3" 445 | }, 446 | "language_info": { 447 | "codemirror_mode": { 448 | "name": "ipython", 449 | "version": 3 450 | }, 451 | "file_extension": ".py", 452 | "mimetype": "text/x-python", 453 | "name": "python", 454 | "nbconvert_exporter": "python", 455 | "pygments_lexer": "ipython3", 456 | "version": "3.11.5" 457 | }, 458 | "orig_nbformat": 4 459 | }, 460 | "nbformat": 4, 461 | "nbformat_minor": 2 462 | } 463 | -------------------------------------------------------------------------------- /chapter9/README.md: -------------------------------------------------------------------------------- 1 | # CookBot : 당신의 개인 요리 도우미 2 | 3 | CookBot은 OpenAI의 GPT-4 모델, 일래스틱서치, Streamlit을 기반으로 작동하며, 사용자의 질문에 맞는 레시피를 찾고 단계별 가이드를 통해 상호작용형 요리를 하는 것을 목표로 합니다. 4 | 5 | 6 | ## 기술 요구 사항 7 | 8 | 개발 환경이 아래 요구 사항을 만족하는지 먼저 확인해 주세요. 9 | 10 | - Python 3.8 또는 그 이상 11 | - [Elasticsearch](https://www.elastic.co/guide/en/elasticsearch/reference/current/install-elasticsearch.html) 12 | - [Streamlit](https://docs.streamlit.io/library/get-started/installation) 13 | 14 | 필요한 파이썬 라이브러리는 아래와 같습니다. 
15 | 16 | - `openai` 17 | - `streamlit` 18 | - `elasticsearch` 19 | - `requests` 20 | 21 | You can install the libraries with pip. 22 | 23 | ```bash 24 | pip install openai streamlit elasticsearch requests 25 | ``` 26 | 27 | ## Getting started 28 | 29 | ### Clone the repository 30 | 31 | First, clone the repository to your local environment. 32 | 33 | ```bash 34 | git clone https://github.com/your-username/CookBot.git 35 | cd CookBot 36 | ``` 37 | 38 | ### Set up the environment 39 | Create and activate a virtual environment. 40 | 41 | ```bash 42 | python3 -m venv env 43 | source env/bin/activate 44 | ``` 45 | 46 | ### Install the required dependencies 47 | 48 | ```bash 49 | pip install -r requirements.txt 50 | ``` 51 | 52 | 53 | ### Set environment variables 54 | 55 | Set your OpenAI API key as an environment variable. 56 | 57 | ```bash 58 | export OPENAI_API_KEY='your-api-key' 59 | ``` 60 | 61 | Replace `'your-api-key'` with your actual OpenAI API key. 62 | 63 | ### Run the application 64 | 65 | To start the Streamlit server, run the following command. 66 | 67 | ```bash 68 | streamlit run main.py 69 | ``` 70 | 71 | The Streamlit app runs at `http://localhost:8501`. Open this URL in a web browser to use CookBot. 72 | 73 | ## Usage 74 | 75 | Type a cooking-related question into the input box. CookBot finds a relevant recipe and generates a step-by-step guide based on it. 
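The lookup step described under Usage can be sketched in Python. This is a minimal illustration, assuming the hybrid query shape that `main.py` sends to Elasticsearch: a `match` on `ingredient` combined with an ELSER `text_expansion` on `ml.tokens`, fused with RRF. The helper names `build_recipe_query` and `first_recipe` are hypothetical and not part of the project, and the empty-result guard is an addition that `main.py` does not have.

```python
def build_recipe_query(question: str) -> dict:
    """Build a hybrid search body: keyword match plus ELSER expansion, fused with RRF.

    Hypothetical helper; field names, model ID, and RRF settings mirror main.py.
    """
    return {
        "sub_searches": [
            # Lexical leg: every term of the question must appear in `ingredient`.
            {"query": {"match": {"ingredient": {"query": question, "operator": "and"}}}},
            # Semantic leg: ELSER expands the question into weighted tokens.
            {
                "query": {
                    "text_expansion": {
                        "ml.tokens": {"model_id": ".elser_model_2", "model_text": question}
                    }
                }
            },
        ],
        # Reciprocal rank fusion merges the two ranked lists.
        "rank": {"rrf": {"window_size": 50, "rank_constant": 20}},
    }


def first_recipe(search_response: dict):
    """Return name/group/ingredient of the top hit, or None when nothing matched."""
    hits = search_response.get("hits", {}).get("hits", [])
    if not hits:
        return None
    source = hits[0]["_source"]
    return {key: source.get(key) for key in ("name", "group", "ingredient")}
```

Passing the parsed JSON of the search response to `first_recipe` yields the same three fields that `main.py` hands to `RecipeGenerator.generate`, while returning `None` instead of raising when the search has no hits.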
-------------------------------------------------------------------------------- /chapter9/config.py: -------------------------------------------------------------------------------- 1 | ES_ENDPOINT = 'https://ELASTICSEARCH_HOST/recipes/_search' 2 | ES_USERNAME = 'USERNAME' 3 | ES_PASSWORD = 'PASSWORD' 4 | OPENAI_API_KEY = "OPENAI_KEY" -------------------------------------------------------------------------------- /chapter9/data-preparation.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "id": "kNW7qszZOsrk" 7 | }, 8 | "source": [ 9 | "## Library dependencies" 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": null, 15 | "metadata": { 16 | "id": "qX4nvkGbOsrm" 17 | }, 18 | "outputs": [], 19 | "source": [ 20 | "!pip install elasticsearch pandas\n", 21 | "\n", 22 | "from elasticsearch import Elasticsearch, helpers\n", 23 | "from elasticsearch.helpers import BulkIndexError\n", 24 | "import pandas as pd" 25 | ] 26 | }, 27 | { 28 | "cell_type": "markdown", 29 | "metadata": { 30 | "id": "8inTWj_QOsrn" 31 | }, 32 | "source": [ 33 | "## Connect to Elasticsearch" 34 | ] 35 | }, 36 | { 37 | "cell_type": "code", 38 | "execution_count": null, 39 | "metadata": { 40 | "id": "FzAkhHOYOsro" 41 | }, 42 | "outputs": [], 43 | "source": [ 44 | "# Initialize the Elasticsearch client. 
Replace HOST:PORT, USERNAME, and PASSWORD with the details of your own Elasticsearch server\n", 45 | "es = Elasticsearch(\n", 46 | " ['HOST:PORT'],\n", 47 | " basic_auth=('USERNAME', 'PASSWORD'),\n", 48 | " verify_certs=False\n", 49 | ")" 50 | ] 51 | }, 52 | { 53 | "cell_type": "markdown", 54 | "metadata": { 55 | "id": "aWAiK1_TOsro" 56 | }, 57 | "source": [ 58 | "## Parse the dataset" 59 | ] 60 | }, 61 | { 62 | "cell_type": "code", 63 | "execution_count": null, 64 | "metadata": { 65 | "id": "PmmONo2bOsro" 66 | }, 67 | "outputs": [], 68 | "source": [ 69 | "# Define the index\n", 70 | "index_name = 'recipes'\n", 71 | "\n", 72 | "# Define the mapping\n", 73 | "mapping = {\n", 74 | " \"mappings\": {\n", 75 | " \"properties\": {\n", 76 | " \"group\": {\"type\": \"text\"},\n", 77 | " \"name\": {\"type\": \"text\"},\n", 78 | " \"rating\": {\"type\": \"text\"},\n", 79 | " \"n_rater\": {\"type\": \"text\"},\n", 80 | " \"n_reviewer\": {\"type\": \"text\"},\n", 81 | " \"summary\": {\n", 82 | " \"type\": \"text\",\n", 83 | " \"analyzer\": \"english\"\n", 84 | " },\n", 85 | " \"process\": {\"type\": \"text\"},\n", 86 | " \"ingredient\": {\n", 87 | " \"type\": \"text\"\n", 88 | " },\n", 89 | " \"ml.tokens\": {\n", 90 | " \"type\": \"rank_features\"\n", 91 | " }\n", 92 | " }\n", 93 | " }\n", 94 | "}\n", 95 | "\n", 96 | "# Create the index\n", 97 | "es.indices.create(index=index_name, body=mapping)\n", 98 | "\n", 99 | "# Load the CSV file with pandas (download allrecipes.csv from Kaggle and upload it to the Colab folder first)\n", 100 | "with open('allrecipes.csv', 'r', encoding='utf-8', errors='ignore') as file:\n", 101 | " df = pd.read_csv(file)\n", 102 | "\n", 103 | "# Replace missing values in the rating column of allrecipes.csv with 0\n", 104 | "df['rating'] = df['rating'].fillna(0)\n", 105 | "\n", 106 | "# Convert the DataFrame to a list of Python dictionaries for indexing\n", 107 | "recipes = df.to_dict('records')\n", 108 | "print(f\"Number of documents: {len(recipes)}\")" 109 | ] 110 | }, 111 | { 112 | "cell_type": "markdown", 113 | "metadata": { 114 | "id": "qswBVy5ZOsrp" 115 | }, 116 | "source": [ 117 | "## Bulk indexing" 118 | ] 119 | }, 120 | { 121 | 
"cell_type": "code", 122 | "execution_count": null, 123 | "metadata": { 124 | "id": "unyUk7RFOsrp" 125 | }, 126 | "outputs": [], 127 | "source": [ 128 | "# 대량 색인을 위한 JSON 문서 배열 생성(키바나에서 elser 모델 설치 및 파이프라인 생성 필요)\n", 129 | "bulk_index_body = []\n", 130 | "for index, recipe in enumerate(recipes):\n", 131 | " document = {\n", 132 | " \"_index\": \"recipes\",\n", 133 | " \"pipeline\": \"elser-v2-recipes\",\n", 134 | " \"_source\": recipe\n", 135 | " }\n", 136 | " bulk_index_body.append(document)\n", 137 | "\n", 138 | "# 대량 색인 및 BulkIndexError 처리\n", 139 | "try:\n", 140 | " response = helpers.bulk(es, bulk_index_body, chunk_size=500, request_timeout=60 * 3)\n", 141 | " print(\"\\nRESPONSE:\", response)\n", 142 | "except BulkIndexError as e:\n", 143 | " for error in e.errors:\n", 144 | " print(f\"Document ID: {error['index']['_id']}\")\n", 145 | " print(f\"Error Type: {error['index']['error']['type']}\")\n", 146 | " print(f\"Error Reason: {error['index']['error']['reason']}\")" 147 | ] 148 | } 149 | ], 150 | "metadata": { 151 | "language_info": { 152 | "name": "python" 153 | }, 154 | "orig_nbformat": 4, 155 | "colab": { 156 | "provenance": [] 157 | }, 158 | "kernelspec": { 159 | "name": "python3", 160 | "display_name": "Python 3" 161 | } 162 | }, 163 | "nbformat": 4, 164 | "nbformat_minor": 0 165 | } -------------------------------------------------------------------------------- /chapter9/main.py: -------------------------------------------------------------------------------- 1 | import json 2 | 3 | import requests 4 | import streamlit as st 5 | 6 | from config import OPENAI_API_KEY, ES_ENDPOINT, ES_USERNAME, ES_PASSWORD 7 | from recipe_generator import RecipeGenerator 8 | 9 | # RecipeGenerator 초기화 10 | generator = RecipeGenerator(OPENAI_API_KEY) 11 | 12 | # 일래스틱서치 조회 13 | def elasticsearch_query(query): 14 | headers = {'Content-Type': 'application/json'} 15 | response = requests.get(ES_ENDPOINT, headers=headers, data=json.dumps(query), auth=(ES_USERNAME, 
ES_PASSWORD)) 16 | 17 | result = { 18 | "name": response.json()["hits"]["hits"][0]["_source"]["name"], 19 | "group": response.json()["hits"]["hits"][0]["_source"]["group"], 20 | "ingredient": response.json()["hits"]["hits"][0]["_source"]["ingredient"], 21 | } 22 | return result 23 | 24 | # 나머지 Streamlit 코드 25 | with open("style.css") as f: 26 | st.markdown(f'', unsafe_allow_html=True) 27 | 28 | st.markdown( 29 | """ 30 | 42 | 43 | """, 44 | unsafe_allow_html=True, 45 | ) 46 | 47 | 48 | st.title('CookBot') 49 | st.caption('당신의 개인 요리 도우미') 50 | col1, col2 = st.columns([1, 6]) 51 | with col1: 52 | st.image("https://i.ibb.co/bWDhmTg/Screenshot-2023-07-28-at-11-06-56-PM.png") 53 | with col2: 54 | input_text = st.text_input(" ", placeholder="요리에 대해 궁금한 것이 있다면 무엇이든 물어보세요.") 55 | 56 | if input_text: 57 | query = { 58 | "sub_searches": [ 59 | { 60 | "query": { 61 | "match": { 62 | "ingredient": { 63 | "query": input_text, 64 | "operator": "and" 65 | } 66 | } 67 | } 68 | }, 69 | { 70 | "query": { 71 | "text_expansion": { 72 | "ml.tokens": { 73 | "model_id": ".elser_model_2", 74 | "model_text": input_text 75 | } 76 | } 77 | } 78 | } 79 | ], 80 | "rank": { 81 | "rrf": { 82 | "window_size": 50, 83 | "rank_constant": 20 84 | } 85 | } 86 | } 87 | 88 | recipe = elasticsearch_query(query) 89 | st.write(recipe) 90 | st.write(generator.generate(recipe)) 91 | -------------------------------------------------------------------------------- /chapter9/recipe_generator.py: -------------------------------------------------------------------------------- 1 | import json 2 | 3 | from openai import OpenAI 4 | 5 | 6 | class RecipeGenerator: 7 | def __init__(self, api_key): 8 | self.api_key = api_key 9 | # openai.api_key = self.api_key 10 | self.client = OpenAI( 11 | api_key=self.api_key 12 | ) 13 | 14 | def generate(self, recipe): 15 | prompts = [{"role": "user", "content": json.dumps(recipe)}] 16 | instruction = { 17 | "role": "system", 18 | "content": "Take the recipes information and 
generate a recipe with a mouthwatering intro and a step by step guide." 19 | } 20 | prompts.append(instruction) 21 | 22 | generated_content = self.client.chat.completions.create( 23 | model="gpt-4", 24 | messages=prompts, 25 | max_tokens=1000 26 | ) 27 | return generated_content.choices[0].message.content -------------------------------------------------------------------------------- /chapter9/style.css: -------------------------------------------------------------------------------- 1 | header { 2 | background-color: #3e6879 !important; 3 | } 4 | 5 | .main { 6 | background-color: #3e6879; 7 | } 8 | 9 | #column-1 { 10 | background-image: url("https://i.ibb.co/bWDhmTg/Screenshot-2023-07-28-at-11-06-56-PM.png"); 11 | background-repeat: no-repeat; 12 | background-size: contain; 13 | height: 500px; 14 | width: 350px; 15 | color: transparent; 16 | } 17 | 18 | #column-2 { 19 | color: transparent; 20 | padding-top: 80px; 21 | width: 550px; 22 | } -------------------------------------------------------------------------------- /cover.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/wikibook/vector-search/0558d712dcd51b2e7ca72bddfb6a3581086c3798/cover.jpg --------------------------------------------------------------------------------