├── .gitignore
├── 001. recommender system basic.ipynb
├── 002. recommender system basic with Python - 1 content based filtering.ipynb
├── 003. recommender system basic with Python - 2 Collaborative Filtering.ipynb
├── 004. recommender system basic with Python - 3 Matrix Factorization.ipynb
├── 005. naver news recommender.ipynb
├── 006. deep learning recommender system.ipynb
├── 007. wide-deep-RecSys model.ipynb
├── 008. simple book recommender system with Keras.ipynb
├── 009_chatgpt_recsys.ipynb
├── 010. LLM based Explainability RecSys .ipynb
└── README.md


/.gitignore:
--------------------------------------------------------------------------------
  1 | 
  2 | # Created by https://www.gitignore.io/api/python,pycharm,jupyternotebooks
  3 | # Edit at https://www.gitignore.io/?templates=python,pycharm,jupyternotebooks
  4 | 
  5 | ### JupyterNotebooks ###
  6 | # gitignore template for Jupyter Notebooks
  7 | # website: http://jupyter.org/
  8 | 
  9 | .ipynb_checkpoints
 10 | */.ipynb_checkpoints/*
 11 | 
 12 | # IPython
 13 | profile_default/
 14 | ipython_config.py
 15 | 
 16 | # Remove previous ipynb_checkpoints
 17 | #   git rm -r .ipynb_checkpoints/
 18 | 
 19 | ### PyCharm ###
 20 | # Covers JetBrains IDEs: IntelliJ, RubyMine, PhpStorm, AppCode, PyCharm, CLion, Android Studio and WebStorm
 21 | # Reference: https://intellij-support.jetbrains.com/hc/en-us/articles/206544839
 22 | 
 23 | # User-specific stuff
 24 | .idea/**/workspace.xml
 25 | .idea/**/tasks.xml
 26 | .idea/**/usage.statistics.xml
 27 | .idea/**/dictionaries
 28 | .idea/**/shelf
 29 | 
 30 | # Generated files
 31 | .idea/**/contentModel.xml
 32 | 
 33 | # Sensitive or high-churn files
 34 | .idea/**/dataSources/
 35 | .idea/**/dataSources.ids
 36 | .idea/**/dataSources.local.xml
 37 | .idea/**/sqlDataSources.xml
 38 | .idea/**/dynamic.xml
 39 | .idea/**/uiDesigner.xml
 40 | .idea/**/dbnavigator.xml
 41 | 
 42 | # Gradle
 43 | .idea/**/gradle.xml
 44 | .idea/**/libraries
 45 | 
 46 | # Gradle and Maven with auto-import
 47 | # When using Gradle or Maven with auto-import, you should exclude module files,
 48 | # since they will be recreated, and may cause churn.  Uncomment if using
 49 | # auto-import.
 50 | # .idea/modules.xml
 51 | # .idea/*.iml
 52 | # .idea/modules
 53 | # *.iml
 54 | # *.ipr
 55 | 
 56 | # CMake
 57 | cmake-build-*/
 58 | 
 59 | # Mongo Explorer plugin
 60 | .idea/**/mongoSettings.xml
 61 | 
 62 | # File-based project format
 63 | *.iws
 64 | 
 65 | # IntelliJ
 66 | out/
 67 | 
 68 | # mpeltonen/sbt-idea plugin
 69 | .idea_modules/
 70 | 
 71 | # JIRA plugin
 72 | atlassian-ide-plugin.xml
 73 | 
 74 | # Cursive Clojure plugin
 75 | .idea/replstate.xml
 76 | 
 77 | # Crashlytics plugin (for Android Studio and IntelliJ)
 78 | com_crashlytics_export_strings.xml
 79 | crashlytics.properties
 80 | crashlytics-build.properties
 81 | fabric.properties
 82 | 
 83 | # Editor-based Rest Client
 84 | .idea/httpRequests
 85 | 
 86 | # Android studio 3.1+ serialized cache file
 87 | .idea/caches/build_file_checksums.ser
 88 | 
 89 | ### PyCharm Patch ###
 90 | # Comment Reason: https://github.com/joeblau/gitignore.io/issues/186#issuecomment-215987721
 91 | 
 92 | # *.iml
 93 | # modules.xml
 94 | # .idea/misc.xml
 95 | # *.ipr
 96 | 
 97 | # Sonarlint plugin
 98 | .idea/**/sonarlint/
 99 | 
100 | # SonarQube Plugin
101 | .idea/**/sonarIssues.xml
102 | 
103 | # Markdown Navigator plugin
104 | .idea/**/markdown-navigator.xml
105 | .idea/**/markdown-navigator/
106 | 
107 | ### Python ###
108 | # Byte-compiled / optimized / DLL files
109 | __pycache__/
110 | *.py[cod]
111 | *$py.class
112 | 
113 | # C extensions
114 | *.so
115 | 
116 | # Distribution / packaging
117 | .Python
118 | build/
119 | develop-eggs/
120 | dist/
121 | downloads/
122 | eggs/
123 | .eggs/
124 | lib/
125 | lib64/
126 | parts/
127 | sdist/
128 | var/
129 | wheels/
130 | pip-wheel-metadata/
131 | share/python-wheels/
132 | *.egg-info/
133 | .installed.cfg
134 | *.egg
135 | MANIFEST
136 | 
137 | # PyInstaller
138 | #  Usually these files are written by a python script from a template
139 | #  before PyInstaller builds the exe, so as to inject date/other infos into it.
140 | *.manifest
141 | *.spec
142 | 
143 | # Installer logs
144 | pip-log.txt
145 | pip-delete-this-directory.txt
146 | 
147 | # Unit test / coverage reports
148 | htmlcov/
149 | .tox/
150 | .nox/
151 | .coverage
152 | .coverage.*
153 | .cache
154 | nosetests.xml
155 | coverage.xml
156 | *.cover
157 | .hypothesis/
158 | .pytest_cache/
159 | 
160 | # Translations
161 | *.mo
162 | *.pot
163 | 
164 | # Scrapy stuff:
165 | .scrapy
166 | 
167 | # Sphinx documentation
168 | docs/_build/
169 | 
170 | # PyBuilder
171 | target/
172 | 
173 | # pyenv
174 | .python-version
175 | 
176 | # pipenv
177 | #   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
178 | #   However, in case of collaboration, if having platform-specific dependencies or dependencies
179 | #   having no cross-platform support, pipenv may install dependencies that don't work, or not
180 | #   install all needed dependencies.
181 | #Pipfile.lock
182 | 
183 | # celery beat schedule file
184 | celerybeat-schedule
185 | 
186 | # SageMath parsed files
187 | *.sage.py
188 | 
189 | # Spyder project settings
190 | .spyderproject
191 | .spyproject
192 | 
193 | # Rope project settings
194 | .ropeproject
195 | 
196 | # Mr Developer
197 | .mr.developer.cfg
198 | .project
199 | .pydevproject
200 | 
201 | # mkdocs documentation
202 | /site
203 | 
204 | # mypy
205 | .mypy_cache/
206 | .dmypy.json
207 | dmypy.json
208 | 
209 | # Pyre type checker
210 | .pyre/
211 | 
212 | # End of https://www.gitignore.io/api/python,pycharm,jupyternotebooks
213 | 


--------------------------------------------------------------------------------
/001. recommender system basic.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "# Blog write-up\n",
  8 |     "\n",
  9 |     "**An explanation of this material is also posted on the blog below.**\n",
 10 |     "- https://lsjsj92.tistory.com/563\n",
 11 |     "- https://lsjsj92.tistory.com/564\n",
 12 |     "\n",
 13 |     "----\n",
 14 |     "\n",
 15 |     "This material draws on the references listed below.  \n",
 16 |     "- https://www.kaggle.com/rounakbanik/movie-recommender-systems\n",
 17 |     "- https://www.kaggle.com/ibtesama/getting-started-with-a-movie-recommendation-system\n",
 18 |     "- https://wikidocs.net/5053\n",
 19 |     "- https://medium.com/towards-artificial-intelligence/content-based-recommender-system-4db1b3de03e7\n",
 20 |     "- https://www.youtube.com/watch?v=ZspR5PZemcs&list=PLU1prrdmLIpaGw0neztIByvshB9I7l6-f&index=11\n",
 21 |     "\n",
 22 |     "\n",
 23 |     "# Recommendation System\n",
 24 |     "\n",
 25 |     "\n",
 26 |     "https://scvgoe.github.io/2017-02-01-%ED%98%91%EC%97%85-%ED%95%84%ED%84%B0%EB%A7%81-%EC%B6%94%EC%B2%9C-%EC%8B%9C%EC%8A%A4%ED%85%9C-(Collaborative-Filtering-Recommendation-System)/\n",
 27 |     "\n",
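 28 |     "A well-built recommendation system serves users and the business at the same time.\n",
 29 |     "It learns each user's tastes and recommends products to match. Users see products tailored to them, so they are more likely to buy, and that comes back to the company as revenue.  \n",
 30 |     "\n",
 31 |     "The most powerful aspect of a recommendation system is that it can surface tastes users did not know they had. Users who experience such recommendations are more likely to become loyal customers of the service. That in turn produces more data, which makes the service more robust.\n",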
 32 |     "\n",
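 33 |     "# Basic types of recommendation systems\n",
 34 |     "\n",
 35 |     "Basic recommendation approaches fall into two broad groups: **content-based filtering** and **collaborative filtering**. Collaborative filtering is further divided into memory-based collaborative filtering and latent factor collaborative filtering.  \n",
 36 |     "\n",
 37 |     "Early systems mostly used content-based filtering and nearest-neighbor (memory-based) collaborative filtering. After the Netflix Prize competition, latent factor collaborative filtering became widely used. This latent factor approach relies on **matrix factorization**.\n",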
 38 |     "\n",
 39 |     "\n",
 40 |     "# Content based filtering\n",
 41 |     "\n",
 42 |     "Content-based filtering recommends items whose content is similar to an item the user already likes.\n",
 43 |     "\n",
 44 |     "![1](https://user-images.githubusercontent.com/24634054/71624712-520aaf00-2c27-11ea-9546-562ee61517aa.JPG)\n",
 45 |     "\n",
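 46 |     "The idea is very simple. For example, if a user gave a high rating to movie A, and that movie was an action film by a director named 'Lee Soo-jin', the system recommends other action films by director 'Lee Soo-jin'.\n",
 47 |     "\n",
 48 |     "However, because this kind of recommendation is so simple, it is mostly used as a reference baseline and rarely on its own.\n",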
 49 |     "\n",
 50 |     "\n",
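 51 |     "# Memory-based Collaborative Filtering\n",
 52 |     "\n",
 53 |     "In practice, when a new movie comes out, people often check other viewers' ratings and reviews before choosing it, because watching a dull movie blindly is a waste. In the same way, collaborative filtering makes recommendations based on **user behavior**, such as the ratings users give to items and their purchase history.\n",
 54 |     "\n",
 55 |     "The goal of memory-based collaborative filtering is to predict ratings for items in the user-item matrix that a user has not yet rated.\n",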
 56 |     "\n"
 57 |    ]
 58 |   },
 59 |   {
 60 |    "cell_type": "markdown",
 61 |    "metadata": {},
 62 |    "source": [
 63 |     "![2](https://user-images.githubusercontent.com/24634054/71624713-520aaf00-2c27-11ea-9e04-471bb9e0ae1e.JPG)\n",
 64 |     "\n",
 65 |     "The figure above illustrates this. For example, User2 has not rated ItemC yet, so the task is to predict how User2 would rate ItemC.\n",
 66 |     "\n",
 67 |     "Memory-based collaborative filtering therefore works on a user-item rating matrix, where columns are contents (items) and rows are users. If the data is stored as shown below, it has to be reshaped into that form with a pivot table.\n",
 68 |     "\n"
 69 |    ]
 70 |   },
 71 |   {
 72 |    "cell_type": "markdown",
 73 |    "metadata": {},
 74 |    "source": [
 75 |     "![3](https://user-images.githubusercontent.com/24634054/71624714-52a34580-2c27-11ea-9bf8-0105bdf90b6a.JPG)\n",
 76 |     "\n",
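 77 |     "Because of this shape, the matrix is very sparse, and in practice that is counted as a drawback:  \n",
 78 |     "most cells are empty, so storage space is wasted.\n",
 79 |     "\n",
 80 |     "Memory-based collaborative filtering is further divided as follows.  \n",
 81 |     "- User-based: customers similar to you bought these products.\n",
 82 |     "- Item-based: customers who bought this product also bought the following products.\n",
 83 |     "\n",
 84 |     "**User-based** collaborative filtering looks like the figure below.\n"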
 85 |    ]
 86 |   },
 87 |   {
 88 |    "cell_type": "markdown",
 89 |    "metadata": {},
 90 |    "source": [
 91 |     "![4](https://user-images.githubusercontent.com/24634054/71624715-52a34580-2c27-11ea-984d-412b17f9e7d8.JPG)\n",
 92 |     "\n",
 93 |     "In other words, User1 and User2 gave similar ratings to ItemA through ItemC, so they are treated as similar users.\n",
 94 |     "\n",
 95 |     "**Item-based** collaborative filtering looks like the figure below."
 96 |    ]
 97 |   },
 98 |   {
 99 |    "cell_type": "markdown",
100 |    "metadata": {},
101 |    "source": [
102 |     "![5](https://user-images.githubusercontent.com/24634054/71624716-52a34580-2c27-11ea-855b-030aa67149e9.JPG)\n",
103 |     "\n",
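104 |     "ItemA and ItemB have similar rating distributions across users, so they are treated as highly similar. That is why ItemA is recommended to User4.\n",
105 |     "\n",
106 |     "In general, item-based filtering is somewhat more accurate than user-based filtering.  \n",
107 |     "The usual explanation is that liking a similar product does not necessarily mean two people have similar tastes.  \n",
108 |     "\n",
109 |     "So when memory-based collaborative filtering is used, recommendations are usually item-based, and similarity is most often measured with cosine similarity."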
110 |    ]
111 |   },
112 |   {
113 |    "cell_type": "markdown",
114 |    "metadata": {},
115 |    "source": [
116 |     "# Matrix Factorization Collaborative Filtering\n",
117 |     "\n",
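118 |     "Collaborative filtering can also be done with matrix factorization. It extracts latent factors while decomposing a large, high-dimensional matrix with a dimensionality reduction technique such as SVD.\n",
119 |     "\n",
120 |     "Although item-based collaborative filtering was introduced above, matrix factorization is actually used more often.  \n",
121 |     "The main reason is storage space, which is explained again below.\n",
122 |     "\n",
123 |     "Collaborative filtering based on matrix factorization (or latent factors) derives 'latent factors' from the user-item matrix.  \n",
124 |     "That is, the user-item matrix can be decomposed into a user-latent factor matrix and an item-latent factor matrix, as in the figure below.\n",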
125 |     "\n",
126 |     "Source: https://www.cs.cmu.edu/~mgormley/courses/10601-s17/slides/lecture25-mf.pdf\n",
127 |     "\n",
128 |     "![1](https://user-images.githubusercontent.com/24634054/71657636-1507ff00-2d84-11ea-82c7-b615cb871011.JPG)\n",
129 |     "\n",
130 |     "What a latent factor actually represents cannot be known exactly. It could, for example, correspond to a genre such as comedy or action.  \n",
131 |     "If the factors did correspond to genres such as comedy and action, the matrix would decompose into per-user genre preferences and per-item genre weights.\n",
132 |     "\n",
133 |     "That makes the calculation in the figure below possible.  \n",
134 |     "Source: https://www.youtube.com/watch?v=ZspR5PZemcs&list=PLU1prrdmLIpaGw0neztIByvshB9I7l6-f&index=11\n",
135 |     "\n",
136 |     "![10](https://user-images.githubusercontent.com/24634054/71638406-8c656200-2ca2-11ea-9740-a3da282fefde.JPG)\n",
137 |     "\n",
138 |     "\n",
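139 |     "The user-item matrix is usually written as R, and R(u, i) denotes the rating that the u-th user gave to the i-th item.  \n",
140 |     "The user-latent factor matrix is called P and the item-latent factor matrix is called Q. The item-latent factor matrix is usually used in transposed form, so it is written Q.T.\n",
141 |     "\n",
142 |     "As in the figure below, latent factor scores can then be assigned based on the values in the R matrix.\n",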
143 |     "\n",
144 |     "![11](https://user-images.githubusercontent.com/24634054/71638458-fcc0b300-2ca3-11ea-93c9-dd80020f143e.JPG)\n",
145 |     "\n",
146 |     "Using these values, the score of content the user has not rated can be predicted as shown below.  \n",
147 |     "If the predicted value is high, the item can be recommended to the user.\n",
148 |     "\n",
149 |     "![12](https://user-images.githubusercontent.com/24634054/71638459-fcc0b300-2ca3-11ea-914c-c10ae98e063d.JPG)\n",
150 |     "\n",
151 |     "This approach is collaborative filtering with matrix factorization.  \n",
152 |     "It is also called latent factor based collaborative filtering.\n",
153 |     "\n",
154 |     "As mentioned briefly above, its advantage is storage space.\n",
155 |     "\n",
156 |     "Without matrix factorization, there would be a user-item matrix like the one below.  \n",
157 |     "With 1000 items and 2000 users, that requires 1000 * 2000 parameters.\n",
158 |     "\n",
159 |     "Source: https://www.youtube.com/watch?v=ZspR5PZemcs&list=PLU1prrdmLIpaGw0neztIByvshB9I7l6-f&index=11\n",
160 |     "\n",
161 |     "![13](https://user-images.githubusercontent.com/24634054/71638477-b9b30f80-2ca4-11ea-8581-5b1443fcf1f3.JPG)\n",
162 |     "\n",
163 |     "The figure below illustrates the same point.\n",
164 |     "\n",
165 |     "![16](https://user-images.githubusercontent.com/24634054/71638478-b9b30f80-2ca4-11ea-8eab-192335230ddd.JPG)\n",
166 |     "\n",
167 |     "With matrix factorization, however, space is used far more efficiently: with k latent factors, only (2000 + 1000) * k parameters need to be stored.\n",
168 |     "\n",
169 |     "![17](https://user-images.githubusercontent.com/24634054/71638479-ba4ba600-2ca4-11ea-9ee7-5a0d4b58d27c.JPG)\n"
170 |    ]
171 |   },
172 |   {
173 |    "cell_type": "markdown",
174 |    "metadata": {},
175 |    "source": [
176 |     "The matrix factorization itself is carried out as shown below.\n",
177 |     "\n",
178 |     "![20](https://user-images.githubusercontent.com/24634054/71638517-6a211380-2ca5-11ea-89f1-08d831c0ae37.JPG)"
179 |    ]
180 |   },
181 |   {
182 |    "cell_type": "markdown",
183 |    "metadata": {},
184 |    "source": [
185 |     "Now let's review the material above by walking through movie recommendation code from Kaggle."
186 |    ]
187 |   },
188 |   {
189 |    "cell_type": "code",
190 |    "execution_count": null,
191 |    "metadata": {},
192 |    "outputs": [],
193 |    "source": []
194 |   }
195 |  ],
196 |  "metadata": {
197 |   "kernelspec": {
198 |    "display_name": "Python 3",
199 |    "language": "python",
200 |    "name": "python3"
201 |   },
202 |   "language_info": {
203 |    "codemirror_mode": {
204 |     "name": "ipython",
205 |     "version": 3
206 |    },
207 |    "file_extension": ".py",
208 |    "mimetype": "text/x-python",
209 |    "name": "python",
210 |    "nbconvert_exporter": "python",
211 |    "pygments_lexer": "ipython3",
212 |    "version": "3.8.5"
213 |   }
214 |  },
215 |  "nbformat": 4,
216 |  "nbformat_minor": 2
217 | }
218 | 
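Note on `001. recommender system basic.ipynb`: the notebook describes item-based collaborative filtering (pivot the ratings into a user-item matrix, then compare item columns with cosine similarity) but its code cell is empty. The following is a minimal standard-library sketch of that idea, not the author's implementation. The rating values, the helper names (`pivot_by_item`, `cosine`, `recommend`), the choice to treat a missing rating as 0, and the similarity-weighted sum used for scoring are all assumptions made for illustration.

```python
from math import sqrt

# Long-format ratings: (user, item, rating). All values are made up for illustration.
RATINGS = [
    ("User1", "ItemA", 5), ("User1", "ItemB", 4), ("User1", "ItemC", 1),
    ("User2", "ItemA", 4), ("User2", "ItemB", 5),
    ("User3", "ItemA", 1), ("User3", "ItemB", 2), ("User3", "ItemC", 5),
    ("User4", "ItemB", 5),
]


def pivot_by_item(rows):
    """Reshape long-format rows into {item: {user: rating}} (one column per item)."""
    table = {}
    for user, item, rating in rows:
        table.setdefault(item, {})[user] = rating
    return table


def cosine(col_a, col_b, users):
    """Cosine similarity of two item columns; a missing rating counts as 0 (assumption)."""
    vec_a = [col_a.get(u, 0) for u in users]
    vec_b = [col_b.get(u, 0) for u in users]
    dot = sum(x * y for x, y in zip(vec_a, vec_b))
    norm_a = sqrt(sum(x * x for x in vec_a))
    norm_b = sqrt(sum(y * y for y in vec_b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend(rows, target_user, top_n=1):
    """Rank items the user has not rated by a similarity-weighted sum of their own ratings."""
    table = pivot_by_item(rows)
    users = sorted({user for user, _, _ in rows})
    rated = {item: col[target_user] for item, col in table.items() if target_user in col}
    scores = {}
    for candidate, col in table.items():
        if candidate in rated:
            continue
        scores[candidate] = sum(
            cosine(col, table[item], users) * rating for item, rating in rated.items()
        )
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


if __name__ == "__main__":
    print(recommend(RATINGS, "User4"))
```

In this toy data ItemA and ItemB receive similar ratings from the same users, so a user who rated only ItemB highly gets ItemA ranked ahead of ItemC.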

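Note on `001. recommender system basic.ipynb`: the matrix factorization section describes splitting the user-item matrix R into a user-latent factor matrix P and an item-latent factor matrix Q, then predicting unrated entries from P and Q.T. The sketch below illustrates that with plain stochastic gradient descent over the observed ratings only. It is one common way to fit P and Q, not necessarily the method shown in the notebook's figures. The toy ratings, the hyperparameters (`k`, `steps`, `lr`, `reg`), and the function names are assumptions.

```python
import random

# Observed ratings keyed by (user_index, item_index); values are made up.
# Item 2 is unrated by users 0..3, so those entries must be predicted.
OBSERVED = {
    (0, 0): 5, (0, 1): 3, (0, 3): 1,
    (1, 0): 4, (1, 3): 1,
    (2, 0): 1, (2, 1): 1, (2, 3): 5,
    (3, 0): 1, (3, 3): 4,
    (4, 1): 1, (4, 2): 5, (4, 3): 4,
}


def factorize(observed, n_users, n_items, k=2, steps=2000, lr=0.01, reg=0.02, seed=0):
    """Fit P (n_users x k) and Q (n_items x k) so that P[u] . Q[i] approximates R(u, i)."""
    rng = random.Random(seed)
    P = [[rng.random() for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.random() for _ in range(k)] for _ in range(n_items)]
    for _ in range(steps):
        for (u, i), rating in observed.items():
            err = rating - sum(P[u][f] * Q[i][f] for f in range(k))
            for f in range(k):
                p, q = P[u][f], Q[i][f]
                # Gradient step on squared error with L2 regularization.
                P[u][f] += lr * (err * q - reg * p)
                Q[i][f] += lr * (err * p - reg * q)
    return P, Q


def predict(P, Q, u, i):
    """Predicted rating: dot product of user u's and item i's latent factors."""
    return sum(p * q for p, q in zip(P[u], Q[i]))


def parameter_counts(n_users, n_items, k):
    """Cells in the dense user-item matrix versus values stored in P and Q."""
    return n_users * n_items, (n_users + n_items) * k


if __name__ == "__main__":
    P, Q = factorize(OBSERVED, n_users=5, n_items=4)
    print("predicted R(0, 2):", round(predict(P, Q, 0, 2), 2))
    print("dense vs factored parameters:", parameter_counts(2000, 1000, 2))
```

The last helper restates the notebook's storage argument: for 2000 users and 1000 items the dense matrix has 2000 * 1000 cells, while P and Q together hold (2000 + 1000) * k values.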

--------------------------------------------------------------------------------
/005. naver news recommender.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "# Blog write-up\n",
  8 |     "\n",
  9 |     "An explanation of this material is posted on the blog below.\n",
 10 |     "- https://lsjsj92.tistory.com/571"
 11 |    ]
 12 |   },
 13 |   {
 14 |    "cell_type": "code",
 15 |    "execution_count": 13,
 16 |    "metadata": {},
 17 |    "outputs": [],
 18 |    "source": [
 19 |     "import pandas as pd\n",
 20 |     "import numpy as np\n",
 21 |     "import matplotlib.pyplot as plt\n",
 22 |     "import random\n",
 23 |     "from sklearn.manifold import TSNE\n",
 24 |     "from gensim.test.utils import common_texts\n",
 25 |     "from gensim.models.doc2vec import Doc2Vec, TaggedDocument\n",
 26 |     "from sklearn.metrics.pairwise import cosine_similarity"
 27 |    ]
 28 |   },
 29 |   {
 30 |    "cell_type": "code",
 31 |    "execution_count": 14,
 32 |    "metadata": {},
 33 |    "outputs": [],
 34 |    "source": [
 35 |     "def make_doc2vec_models(tagged_data, tok, vector_size=128, window = 3, epochs = 40, min_count = 0, workers = 4):\n",
 36 |     "    model = Doc2Vec(tagged_data, vector_size=vector_size, window=window, epochs=epochs, min_count=min_count, workers=workers)\n",
 37 |     "    model.save(f'./datas/{tok}_news_model.doc2vec')"
 38 |    ]
 39 |   },
 40 |   {
 41 |    "cell_type": "code",
 42 |    "execution_count": 15,
 43 |    "metadata": {},
 44 |    "outputs": [],
 45 |    "source": [
 46 |     "def get_data(preprocess = True):\n",
 47 |     "    if preprocess :\n",
 48 |     "        data = pd.read_csv('./datas/naver_news/preprocessing_tok_naver_data.csv')\n",
 49 |     "    else:\n",
 50 |     "        economy = pd.read_csv('./datas/naver_news/economy.csv', header=None)\n",
 51 |     "        policy = pd.read_csv('./datas/naver_news/policy.csv', header=None)\n",
 52 |     "        it = pd.read_csv('./datas/naver_news/it.csv', header=None)\n",
 53 |     "\n",
 54 |     "        columns = ['date', 'category', 'company', 'title', 'content', 'url']\n",
 55 |     "        economy.columns = columns\n",
 56 |     "        policy.columns = columns\n",
 57 |     "        it.columns = columns\n",
 58 |     "\n",
 59 |     "        data = pd.concat([economy, policy, it], axis = 0)\n",
 60 |     "        data.reset_index(drop=True, inplace=True)\n",
 61 |     "    \n",
 62 |     "    return data"
 63 |    ]
 64 |   },
 65 |   {
 66 |    "cell_type": "code",
 67 |    "execution_count": 16,
 68 |    "metadata": {},
 69 |    "outputs": [],
 70 |    "source": [
 71 |     "def get_preprocessing_data(data):\n",
 72 |     "    data.drop(['date', 'company', 'url'], axis = 1, inplace =True)\n",
 73 |     "    \n",
 74 |     "    category_mapping = {\n",
 75 |     "    '경제' : 0,      # economy\n",
 76 |     "    '정치' : 1,      # politics\n",
 77 |     "    'IT과학' : 2     # IT / science\n",
 78 |     "    }\n",
 79 |     "\n",
 80 |     "    data['category'] = data['category'].map(category_mapping)\n",
 81 |     "    data['title_content'] = data['title'] + \" \" + data['content']\n",
 82 |     "    data.drop(['title', 'content'], axis = 1, inplace = True)\n",
 83 |     "    return data"
 84 |    ]
 85 |   },
 86 |   {
 87 |    "cell_type": "code",
 88 |    "execution_count": 17,
 89 |    "metadata": {},
 90 |    "outputs": [],
 91 |    "source": [
 92 |     "def make_doc2vec_data(data, column, t_document=False):\n",
 93 |     "    data_doc = []\n",
 94 |     "    for tag, doc in zip(data.index, data[column]):\n",
 95 |     "        doc = doc.split(\" \")\n",
 96 |     "        data_doc.append(([tag], doc))\n",
 97 |     "    if t_document:\n",
 98 |     "        data = [TaggedDocument(words=text, tags=tag) for tag, text in data_doc]\n",
 99 |     "        return data\n",
100 |     "    else:\n",
101 |     "        return data_doc"
102 |    ]
103 |   },
104 |   {
105 |    "cell_type": "code",
106 |    "execution_count": 18,
107 |    "metadata": {},
108 |    "outputs": [],
109 |    "source": [
110 |     "def get_recommened_contents(user, data_doc, model):\n",
111 |     "    scores = []\n",
112 |     "\n",
113 |     "    for tags, text in data_doc:\n",
114 |     "        trained_doc_vec = model.docvecs[tags[0]]  # gensim 3.x accessor; gensim 4 renamed it to model.dv\n",
115 |     "        scores.append(cosine_similarity(user.reshape(-1, 128), trained_doc_vec.reshape(-1, 128)))  # 128 must match vector_size\n",
116 |     "\n",
117 |     "    scores = np.array(scores).reshape(-1)\n",
118 |     "    scores = np.argsort(-scores)[:5]\n",
119 |     "    \n",
120 |     "    return data.loc[scores, :]  # relies on the notebook-level `data` DataFrame with its default 0..n-1 index"
121 |    ]
122 |   },
123 |   {
124 |    "cell_type": "code",
125 |    "execution_count": 19,
126 |    "metadata": {},
127 |    "outputs": [],
128 |    "source": [
129 |     "def make_user_embedding(index_list, data_doc, model):\n",
130 |     "    user = []\n",
131 |     "    user_embedding = []\n",
132 |     "    for i in index_list:\n",
133 |     "        user.append(data_doc[i][0][0])\n",
134 |     "    for i in user:\n",
135 |     "        user_embedding.append(model.docvecs[i])\n",
136 |     "    user_embedding = np.array(user_embedding)\n",
137 |     "    user = np.mean(user_embedding, axis = 0)\n",
138 |     "    return user"
139 |    ]
140 |   },
141 |   {
142 |    "cell_type": "code",
143 |    "execution_count": 20,
144 |    "metadata": {},
145 |    "outputs": [],
146 |    "source": [
147 |     "def view_user_history(data):\n",
148 |     "    print(data[['category', 'title_content']])"
149 |    ]
150 |   },
151 |   {
152 |    "cell_type": "code",
153 |    "execution_count": 21,
154 |    "metadata": {},
155 |    "outputs": [],
156 |    "source": [
157 |     "data = get_data()"
158 |    ]
159 |   },
160 |   {
161 |    "cell_type": "code",
162 |    "execution_count": 22,
163 |    "metadata": {},
164 |    "outputs": [
165 |     {
166 |      "data": {
167 |       "text/html": [
168 |        "<div>\n",
169 |        "<style scoped>\n",
170 |        "    .dataframe tbody tr th:only-of-type {\n",
171 |        "        vertical-align: middle;\n",
172 |        "    }\n",
173 |        "\n",
174 |        "    .dataframe tbody tr th {\n",
175 |        "        vertical-align: top;\n",
176 |        "    }\n",
177 |        "\n",
178 |        "    .dataframe thead th {\n",
179 |        "        text-align: right;\n",
180 |        "    }\n",
181 |        "</style>\n",
182 |        "<table border=\"1\" class=\"dataframe\">\n",
183 |        "  <thead>\n",
184 |        "    <tr style=\"text-align: right;\">\n",
185 |        "      <th></th>\n",
186 |        "      <th>category</th>\n",
187 |        "      <th>title_content</th>\n",
188 |        "      <th>mecab_tok</th>\n",
189 |        "    </tr>\n",
190 |        "  </thead>\n",
191 |        "  <tbody>\n",
192 |        "    <tr>\n",
193 |        "      <th>0</th>\n",
194 |        "      <td>0</td>\n",
195 |        "      <td>포스코ICT 4분기부터 실적 본격 개선 비용효율화 본격화 그룹 계열사와의 시너지 발...</td>\n",
196 |        "      <td>포스코 ICT 분기 실 본격 개선 비용 효율 본격화 그룹 계열사 시너지 발생 헤럴드...</td>\n",
197 |        "    </tr>\n",
198 |        "    <tr>\n",
199 |        "      <th>1</th>\n",
200 |        "      <td>0</td>\n",
201 |        "      <td>위메프 리퍼데이 바자회 수익금 소외계층 지원 위메프가 리퍼데이 바자회 ‘아름다운가게...</td>\n",
202 |        "      <td>위메프 리퍼 데이 바자회 수익금 소외 계층 지원 위메프 리퍼 데이 바자회 가게 위메...</td>\n",
203 |        "    </tr>\n",
204 |        "    <tr>\n",
205 |        "      <th>2</th>\n",
206 |        "      <td>0</td>\n",
207 |        "      <td>호반건설 광주시 ‘계림1구역 정비사업’ 수주…공사비 2700억원 최고 33층 총 9...</td>\n",
208 |        "      <td>호반건설 광주시 계림 구역 정비 사업 수주 공사비 원 최고 개동 아파트 가구 규모 ...</td>\n",
209 |        "    </tr>\n",
210 |        "    <tr>\n",
211 |        "      <th>3</th>\n",
212 |        "      <td>0</td>\n",
213 |        "      <td>성동조선해양 HSG중공업과 인수합병 MOU 체결 HSG중공업이 성동조선해양 인수를 ...</td>\n",
214 |        "      <td>성동 조선 해양 중공업 인수 합병 체결 중공업 성동 조선 해양 인수 위한 양해 각서...</td>\n",
215 |        "    </tr>\n",
216 |        "    <tr>\n",
217 |        "      <th>4</th>\n",
218 |        "      <td>0</td>\n",
219 |        "      <td>2019년 10월 프랜차이즈 정보공개서 79개 신규등록 서울·경기·인천 지역 서울시...</td>\n",
220 |        "      <td>년 월 프랜차이즈 정보 공개 개 신규 등록 서울 경기 인천 지역 서울시 경기도 새롭...</td>\n",
221 |        "    </tr>\n",
222 |        "  </tbody>\n",
223 |        "</table>\n",
224 |        "</div>"
225 |       ],
226 |       "text/plain": [
227 |        "   category                                      title_content  \\\n",
228 |        "0         0  포스코ICT 4분기부터 실적 본격 개선 비용효율화 본격화 그룹 계열사와의 시너지 발...   \n",
229 |        "1         0  위메프 리퍼데이 바자회 수익금 소외계층 지원 위메프가 리퍼데이 바자회 ‘아름다운가게...   \n",
230 |        "2         0  호반건설 광주시 ‘계림1구역 정비사업’ 수주…공사비 2700억원 최고 33층 총 9...   \n",
231 |        "3         0  성동조선해양 HSG중공업과 인수합병 MOU 체결 HSG중공업이 성동조선해양 인수를 ...   \n",
232 |        "4         0  2019년 10월 프랜차이즈 정보공개서 79개 신규등록 서울·경기·인천 지역 서울시...   \n",
233 |        "\n",
234 |        "                                           mecab_tok  \n",
235 |        "0  포스코 ICT 분기 실 본격 개선 비용 효율 본격화 그룹 계열사 시너지 발생 헤럴드...  \n",
236 |        "1  위메프 리퍼 데이 바자회 수익금 소외 계층 지원 위메프 리퍼 데이 바자회 가게 위메...  \n",
237 |        "2  호반건설 광주시 계림 구역 정비 사업 수주 공사비 원 최고 개동 아파트 가구 규모 ...  \n",
238 |        "3  성동 조선 해양 중공업 인수 합병 체결 중공업 성동 조선 해양 인수 위한 양해 각서...  \n",
239 |        "4  년 월 프랜차이즈 정보 공개 개 신규 등록 서울 경기 인천 지역 서울시 경기도 새롭...  "
240 |       ]
241 |      },
242 |      "execution_count": 22,
243 |      "metadata": {},
244 |      "output_type": "execute_result"
245 |     }
246 |    ],
247 |    "source": [
248 |     "data.head()"
249 |    ]
250 |   },
251 |   {
252 |    "cell_type": "code",
253 |    "execution_count": 23,
254 |    "metadata": {},
255 |    "outputs": [],
256 |    "source": [
257 |     "data_doc_title_content_tag = make_doc2vec_data(data, 'title_content', t_document=True)\n",
258 |     "data_doc_title_content = make_doc2vec_data(data, 'title_content')\n",
259 |     "data_doc_tok_tag = make_doc2vec_data(data, 'mecab_tok', t_document=True)\n",
260 |     "data_doc_tok = make_doc2vec_data(data, 'mecab_tok')"
261 |    ]
262 |   },
263 |   {
264 |    "cell_type": "markdown",
265 |    "metadata": {},
266 |    "source": [
267 |     "# make doc2vec models"
268 |    ]
269 |   },
270 |   {
271 |    "cell_type": "code",
272 |    "execution_count": 24,
273 |    "metadata": {},
274 |    "outputs": [],
275 |    "source": [
276 |     "make_doc2vec_models(data_doc_title_content_tag, tok=False)"
277 |    ]
278 |   },
279 |   {
280 |    "cell_type": "code",
281 |    "execution_count": 25,
282 |    "metadata": {},
283 |    "outputs": [],
284 |    "source": [
285 |     "make_doc2vec_models(data_doc_tok_tag, tok=True)"
286 |    ]
287 |   },
288 |   {
289 |    "cell_type": "markdown",
290 |    "metadata": {},
291 |    "source": [
292 |     "# load doc2vec models"
293 |    ]
294 |   },
295 |   {
296 |    "cell_type": "code",
297 |    "execution_count": 26,
298 |    "metadata": {},
299 |    "outputs": [],
300 |    "source": [
301 |     "model_title_content = Doc2Vec.load('./datas/False_news_model.doc2vec')"
302 |    ]
303 |   },
304 |   {
305 |    "cell_type": "code",
306 |    "execution_count": 27,
307 |    "metadata": {},
308 |    "outputs": [],
309 |    "source": [
310 |     "model_tok = Doc2Vec.load('./datas/True_news_model.doc2vec')"
311 |    ]
312 |   },
313 |   {
314 |    "cell_type": "code",
315 |    "execution_count": 49,
316 |    "metadata": {},
317 |    "outputs": [
318 |     {
319 |      "name": "stdout",
320 |      "output_type": "stream",
321 |      "text": [
322 |       "      category                                      title_content\n",
323 |       "510          0  SK 조직 차원 증거인멸 없어 반박문 제출 LG화학과 영업비밀 침해소송을 벌이고 있...\n",
324 |       "4187         0  제이엘케이인스펙션 공모가 9000원 확정 김동민 대표 적극적인 IR·주주친화 정책 ...\n",
325 |       "213          0  제재 대신 자율개선 유도하는 금감원 외환법규 위반 5개 은행 대상 첫 적용 금융감독...\n",
326 |       "696          0  동국제약 ‘마데카솔’ 소비자가 가장 추천하는 브랜드 동국제약 대표이사 오흥주 마데카...\n",
327 |       "3410         0  호반그룹 정기 인사…최승남 대표 총괄부회장 신규 선임 한국경제TV 신인규 기자 호반...\n"
328 |      ]
329 |     }
330 |    ],
331 |    "source": [
332 |     "user_category_1 = data.loc[random.sample(data.loc[data.category == 0, :].index.values.tolist(), 5), :]  # economy\n",
333 |     "view_user_history(user_category_1)"
334 |    ]
335 |   },
336 |   {
337 |    "cell_type": "code",
338 |    "execution_count": 45,
339 |    "metadata": {},
340 |    "outputs": [
341 |     {
342 |      "name": "stdout",
343 |      "output_type": "stream",
344 |      "text": [
345 |       "       category                                      title_content\n",
346 |       "10671         1  “3대 친문 농단” 국조 요구서 제출… 靑 매섭게 몰아치는 야권 자유한국당 곽상도ㆍ...\n",
347 |       "6194          1  黃 다시 일어나 끝까지 가겠다… 공수처·선거법 반드시 저지 黃 文정부 3대 게이트 ...\n",
348 |       "6825          1  미 대사 文대통령 겨냥 종북 좌파에 둘러싸여... 9월 여야 의원들 만나 위험 발언...\n",
349 |       "8533          1  명소·맛집 즐긴 아세안 정상…일정보다 핫했던 부산 나들이 2박 3일간 경찰 에스코트...\n",
350 |       "6236          1  민주 한국당 필리버스터 철회 없으면 정기국회서 41로 안건 처리 비공개 최고위원회…...\n"
351 |      ]
352 |     }
353 |    ],
354 |    "source": [
355 |     "user_category_2 = data.loc[random.sample(data.loc[data.category == 1, :].index.values.tolist(), 5), :]  # politics\n",
356 |     "view_user_history(user_category_2)"
357 |    ]
358 |   },
359 |   {
360 |    "cell_type": "code",
361 |    "execution_count": 18,
362 |    "metadata": {},
363 |    "outputs": [
364 |     {
365 |      "name": "stdout",
366 |      "output_type": "stream",
367 |      "text": [
368 |       "       category                                      title_content\n",
369 |       "14551         2  ‘라인프렌즈’ ‘브롤스타즈’와 함께 ‘브라운앤프렌즈’ 팝업스토어 진행 엑스포츠뉴스닷...\n",
370 |       "17055         2  올해 1등 KT인상 대상 ‘5G 경쟁력 강화 TF’ 디지털데일리 최민지기자 KT 대...\n",
371 |       "17824         2  DID 시장 열린다 2.은행계좌부터 주식까지 블록체인으로 인증 OK 규제 샌드박스 ...\n",
372 |       "12625         2  인하대병원 제로페이 도입 한국간편결제진흥원 이사장 윤완수 은 인하대병원에서 제로페이...\n",
373 |       "18243         2  이슈 19금 신작 미소녀 RPG 방치소녀 학원편 신규 캐릭터 조운과 문추 추가 본 ...\n"
374 |      ]
375 |     }
376 |    ],
377 |    "source": [
378 |     "user_category_3 = data.loc[random.sample(data.loc[data.category == 2, :].index.values.tolist(), 5), :]  # IT / science\n",
379 |     "view_user_history(user_category_3)"
380 |    ]
381 |   },
382 |   {
383 |    "cell_type": "code",
384 |    "execution_count": 50,
385 |    "metadata": {},
386 |    "outputs": [],
387 |    "source": [
388 |     "user_1 = make_user_embedding(user_category_1.index.values.tolist(), data_doc_title_content, model_title_content) # economy\n",
389 |     "user_2 = make_user_embedding(user_category_2.index.values.tolist(), data_doc_title_content, model_title_content) # politics\n",
390 |     "user_3 = make_user_embedding(user_category_3.index.values.tolist(), data_doc_title_content, model_title_content) # IT / science"
391 |    ]
392 |   },
393 |   {
394 |    "cell_type": "code",
395 |    "execution_count": 51,
396 |    "metadata": {},
397 |    "outputs": [
398 |     {
399 |      "data": {
400 |       "text/html": [
401 |        "<div>\n",
402 |        "<style scoped>\n",
403 |        "    .dataframe tbody tr th:only-of-type {\n",
404 |        "        vertical-align: middle;\n",
405 |        "    }\n",
406 |        "\n",
407 |        "    .dataframe tbody tr th {\n",
408 |        "        vertical-align: top;\n",
409 |        "    }\n",
410 |        "\n",
411 |        "    .dataframe thead th {\n",
412 |        "        text-align: right;\n",
413 |        "    }\n",
414 |        "</style>\n",
415 |        "<table border=\"1\" class=\"dataframe\">\n",
416 |        "  <thead>\n",
417 |        "    <tr style=\"text-align: right;\">\n",
418 |        "      <th></th>\n",
419 |        "      <th>category</th>\n",
420 |        "      <th>title_content</th>\n",
421 |        "    </tr>\n",
422 |        "  </thead>\n",
423 |        "  <tbody>\n",
424 |        "    <tr>\n",
425 |        "      <th>5226</th>\n",
426 |        "      <td>0</td>\n",
427 |        "      <td>삼양그룹 정기 임원인사 실시…“성과주의 실현” 승진 10명 보직변경 5명 김지섭 삼...</td>\n",
428 |        "    </tr>\n",
429 |        "    <tr>\n",
430 |        "      <th>3933</th>\n",
431 |        "      <td>0</td>\n",
432 |        "      <td>동국제약 바이오의약품 위탁 개발·생산 사업 진출 서울 연합뉴스 동국제약은 지난 달 ...</td>\n",
433 |        "    </tr>\n",
434 |        "    <tr>\n",
435 |        "      <th>5251</th>\n",
436 |        "      <td>0</td>\n",
437 |        "      <td>삼양그룹 정기임원인사 단행...김지섭 부사장 승진 왼쪽부터 삼양사 식자재유통BU장 ...</td>\n",
438 |        "    </tr>\n",
439 |        "    <tr>\n",
440 |        "      <th>3329</th>\n",
441 |        "      <td>0</td>\n",
442 |        "      <td>호반그룹 총괄부회장에 최승남 대표 선임 2일 정기 임원인사 단행.. 각 계열사 대표...</td>\n",
443 |        "    </tr>\n",
444 |        "    <tr>\n",
445 |        "      <th>13015</th>\n",
446 |        "      <td>2</td>\n",
447 |        "      <td>동국제약 바이오의약품 위탁 개발·생산 사업 진출 서울 연합뉴스 동국제약은 지난 달 ...</td>\n",
448 |        "    </tr>\n",
449 |        "  </tbody>\n",
450 |        "</table>\n",
451 |        "</div>"
452 |       ],
453 |       "text/plain": [
454 |        "       category                                      title_content\n",
455 |        "5226          0  삼양그룹 정기 임원인사 실시…“성과주의 실현” 승진 10명 보직변경 5명 김지섭 삼...\n",
456 |        "3933          0  동국제약 바이오의약품 위탁 개발·생산 사업 진출 서울 연합뉴스 동국제약은 지난 달 ...\n",
457 |        "5251          0  삼양그룹 정기임원인사 단행...김지섭 부사장 승진 왼쪽부터 삼양사 식자재유통BU장 ...\n",
458 |        "3329          0  호반그룹 총괄부회장에 최승남 대표 선임 2일 정기 임원인사 단행.. 각 계열사 대표...\n",
459 |        "13015         2  동국제약 바이오의약품 위탁 개발·생산 사업 진출 서울 연합뉴스 동국제약은 지난 달 ..."
460 |       ]
461 |      },
462 |      "execution_count": 51,
463 |      "metadata": {},
464 |      "output_type": "execute_result"
465 |     }
466 |    ],
467 |    "source": [
468 |     "result = get_recommened_contents(user_1, data_doc_title_content, model_title_content)\n",
469 |     "pd.DataFrame(result.loc[:, ['category', 'title_content']])"
470 |    ]
471 |   },
472 |   {
473 |    "cell_type": "code",
474 |    "execution_count": 48,
475 |    "metadata": {},
476 |    "outputs": [
477 |     {
478 |      "data": {
479 |       "text/html": [
480 |        "<div>\n",
481 |        "<style scoped>\n",
482 |        "    .dataframe tbody tr th:only-of-type {\n",
483 |        "        vertical-align: middle;\n",
484 |        "    }\n",
485 |        "\n",
486 |        "    .dataframe tbody tr th {\n",
487 |        "        vertical-align: top;\n",
488 |        "    }\n",
489 |        "\n",
490 |        "    .dataframe thead th {\n",
491 |        "        text-align: right;\n",
492 |        "    }\n",
493 |        "</style>\n",
494 |        "<table border=\"1\" class=\"dataframe\">\n",
495 |        "  <thead>\n",
496 |        "    <tr style=\"text-align: right;\">\n",
497 |        "      <th></th>\n",
498 |        "      <th>category</th>\n",
499 |        "      <th>title_content</th>\n",
500 |        "    </tr>\n",
501 |        "  </thead>\n",
502 |        "  <tbody>\n",
503 |        "    <tr>\n",
504 |        "      <th>8228</th>\n",
505 |        "      <td>1</td>\n",
506 |        "      <td>노영민 비서실장과 정의용 국가안보실장 서울 연합뉴스 한상균 기자 청와대 노영민 비서...</td>\n",
507 |        "    </tr>\n",
508 |        "    <tr>\n",
509 |        "      <th>6348</th>\n",
510 |        "      <td>1</td>\n",
511 |        "      <td>한국당 친문농단 게이트 국정조사 요구서 제출키로 울산에 백원우팀 파견…靑 고래고기 ...</td>\n",
512 |        "    </tr>\n",
513 |        "    <tr>\n",
514 |        "      <th>6320</th>\n",
515 |        "      <td>1</td>\n",
516 |        "      <td>한국당 친문농단 게이트 국정조사 요구서 제출하기로 자유한국당 곽상도 의원이 28일 ...</td>\n",
517 |        "    </tr>\n",
518 |        "    <tr>\n",
519 |        "      <th>6554</th>\n",
520 |        "      <td>1</td>\n",
521 |        "      <td>박원순 시장 녹색교통지역 5등급차량 운행제한 상황실 방문 서울 뉴시스 박주성 기자 ...</td>\n",
522 |        "    </tr>\n",
523 |        "    <tr>\n",
524 |        "      <th>6842</th>\n",
525 |        "      <td>1</td>\n",
526 |        "      <td>단식 종료 황교안 내일부터 한국당 당무 복귀 서울 연합뉴스 홍정규 기자 단식농성을 ...</td>\n",
527 |        "    </tr>\n",
528 |        "  </tbody>\n",
529 |        "</table>\n",
530 |        "</div>"
531 |       ],
532 |       "text/plain": [
533 |        "      category                                      title_content\n",
534 |        "8228         1  노영민 비서실장과 정의용 국가안보실장 서울 연합뉴스 한상균 기자 청와대 노영민 비서...\n",
535 |        "6348         1  한국당 친문농단 게이트 국정조사 요구서 제출키로 울산에 백원우팀 파견…靑 고래고기 ...\n",
536 |        "6320         1  한국당 친문농단 게이트 국정조사 요구서 제출하기로 자유한국당 곽상도 의원이 28일 ...\n",
537 |        "6554         1  박원순 시장 녹색교통지역 5등급차량 운행제한 상황실 방문 서울 뉴시스 박주성 기자 ...\n",
538 |        "6842         1  단식 종료 황교안 내일부터 한국당 당무 복귀 서울 연합뉴스 홍정규 기자 단식농성을 ..."
539 |       ]
540 |      },
541 |      "execution_count": 48,
542 |      "metadata": {},
543 |      "output_type": "execute_result"
544 |     }
545 |    ],
546 |    "source": [
547 |     "result = get_recommened_contents(user_2, data_doc_title_content, model_title_content)\n",
548 |     "pd.DataFrame(result.loc[:, ['category', 'title_content']])"
549 |    ]
550 |   },
551 |   {
552 |    "cell_type": "code",
553 |    "execution_count": 22,
554 |    "metadata": {},
555 |    "outputs": [
556 |     {
557 |      "data": {
558 |       "text/html": [
559 |        "<div>\n",
560 |        "<style scoped>\n",
561 |        "    .dataframe tbody tr th:only-of-type {\n",
562 |        "        vertical-align: middle;\n",
563 |        "    }\n",
564 |        "\n",
565 |        "    .dataframe tbody tr th {\n",
566 |        "        vertical-align: top;\n",
567 |        "    }\n",
568 |        "\n",
569 |        "    .dataframe thead th {\n",
570 |        "        text-align: right;\n",
571 |        "    }\n",
572 |        "</style>\n",
573 |        "<table border=\"1\" class=\"dataframe\">\n",
574 |        "  <thead>\n",
575 |        "    <tr style=\"text-align: right;\">\n",
576 |        "      <th></th>\n",
577 |        "      <th>category</th>\n",
578 |        "      <th>title_content</th>\n",
579 |        "    </tr>\n",
580 |        "  </thead>\n",
581 |        "  <tbody>\n",
582 |        "    <tr>\n",
583 |        "      <th>13073</th>\n",
584 |        "      <td>2</td>\n",
585 |        "      <td>브롤스타즈 라인프렌즈 오리지널 캐릭터 등장 슈퍼셀은 라인프렌즈와 모바일 슈팅 게임 ...</td>\n",
586 |        "    </tr>\n",
587 |        "    <tr>\n",
588 |        "      <th>16573</th>\n",
589 |        "      <td>2</td>\n",
590 |        "      <td>넥슨 글로벌 멀티 플랫폼 프로젝트 ‘카트라이더 드리프트’ 테스트 넥슨 대표 이정헌 ...</td>\n",
591 |        "    </tr>\n",
592 |        "    <tr>\n",
593 |        "      <th>11608</th>\n",
594 |        "      <td>1</td>\n",
595 |        "      <td>사진한국당 규탄하는 야3당 대표 머니투데이 홍봉진 기자 바른미래당 손학규 정의당 심...</td>\n",
596 |        "    </tr>\n",
597 |        "    <tr>\n",
598 |        "      <th>16536</th>\n",
599 |        "      <td>2</td>\n",
600 |        "      <td>카트라이더 드리프트 9일까지 비공개 테스트 진행 넥슨은 멀티 플랫폼 프로젝트 카트라...</td>\n",
601 |        "    </tr>\n",
602 |        "    <tr>\n",
603 |        "      <th>13387</th>\n",
604 |        "      <td>2</td>\n",
605 |        "      <td>슈퍼셀 라인프렌즈와 ‘브롤스타즈’ IP 라이선싱 계약 브롤스타즈 캐릭터 상품 개발 ...</td>\n",
606 |        "    </tr>\n",
607 |        "  </tbody>\n",
608 |        "</table>\n",
609 |        "</div>"
610 |       ],
611 |       "text/plain": [
612 |        "       category                                      title_content\n",
613 |        "13073         2  브롤스타즈 라인프렌즈 오리지널 캐릭터 등장 슈퍼셀은 라인프렌즈와 모바일 슈팅 게임 ...\n",
614 |        "16573         2  넥슨 글로벌 멀티 플랫폼 프로젝트 ‘카트라이더 드리프트’ 테스트 넥슨 대표 이정헌 ...\n",
615 |        "11608         1  사진한국당 규탄하는 야3당 대표 머니투데이 홍봉진 기자 바른미래당 손학규 정의당 심...\n",
616 |        "16536         2  카트라이더 드리프트 9일까지 비공개 테스트 진행 넥슨은 멀티 플랫폼 프로젝트 카트라...\n",
617 |        "13387         2  슈퍼셀 라인프렌즈와 ‘브롤스타즈’ IP 라이선싱 계약 브롤스타즈 캐릭터 상품 개발 ..."
618 |       ]
619 |      },
620 |      "execution_count": 22,
621 |      "metadata": {},
622 |      "output_type": "execute_result"
623 |     }
624 |    ],
625 |    "source": [
626 |     "result = get_recommened_contents(user_3, data_doc_title_content, model_title_content)\n",
627 |     "pd.DataFrame(result.loc[:, ['category', 'title_content']])"
628 |    ]
629 |   },
630 |   {
631 |    "cell_type": "code",
632 |    "execution_count": null,
633 |    "metadata": {},
634 |    "outputs": [],
635 |    "source": []
636 |   },
637 |   {
638 |    "cell_type": "markdown",
639 |    "metadata": {},
640 |    "source": [
641 |     "# Results after morphological analysis"
642 |    ]
643 |   },
644 |   {
645 |    "cell_type": "code",
646 |    "execution_count": 52,
647 |    "metadata": {},
648 |    "outputs": [],
649 |    "source": [
650 |     "user_1 = make_user_embedding(user_category_1.index.values.tolist(), data_doc_tok, model_tok) # economy\n",
651 |     "user_2 = make_user_embedding(user_category_2.index.values.tolist(), data_doc_tok, model_tok) # politics\n",
652 |     "user_3 = make_user_embedding(user_category_3.index.values.tolist(), data_doc_tok, model_tok) # IT/science"
653 |    ]
654 |   },
655 |   {
656 |    "cell_type": "code",
657 |    "execution_count": 53,
658 |    "metadata": {},
659 |    "outputs": [
660 |     {
661 |      "data": {
662 |       "text/html": [
663 |        "<div>\n",
664 |        "<style scoped>\n",
665 |        "    .dataframe tbody tr th:only-of-type {\n",
666 |        "        vertical-align: middle;\n",
667 |        "    }\n",
668 |        "\n",
669 |        "    .dataframe tbody tr th {\n",
670 |        "        vertical-align: top;\n",
671 |        "    }\n",
672 |        "\n",
673 |        "    .dataframe thead th {\n",
674 |        "        text-align: right;\n",
675 |        "    }\n",
676 |        "</style>\n",
677 |        "<table border=\"1\" class=\"dataframe\">\n",
678 |        "  <thead>\n",
679 |        "    <tr style=\"text-align: right;\">\n",
680 |        "      <th></th>\n",
681 |        "      <th>category</th>\n",
682 |        "      <th>title_content</th>\n",
683 |        "    </tr>\n",
684 |        "  </thead>\n",
685 |        "  <tbody>\n",
686 |        "    <tr>\n",
687 |        "      <th>11226</th>\n",
688 |        "      <td>1</td>\n",
689 |        "      <td>개회사 하는 조성욱 공정거래위원장 서울 연합뉴스 임헌정 기자 조성욱 공정거래위원장이...</td>\n",
690 |        "    </tr>\n",
691 |        "    <tr>\n",
692 |        "      <th>11702</th>\n",
693 |        "      <td>1</td>\n",
694 |        "      <td>국민의례하는 문재인 대통령 문재인 대통령이 3일 청와대에서 열린 국가기후환경회의 격...</td>\n",
695 |        "    </tr>\n",
696 |        "    <tr>\n",
697 |        "      <th>4683</th>\n",
698 |        "      <td>0</td>\n",
699 |        "      <td>정관장 알파프로젝트와 상담하세요 서울 뉴시스 박주성 기자 KGC인삼공사가 2일 오전...</td>\n",
700 |        "    </tr>\n",
701 |        "    <tr>\n",
702 |        "      <th>6391</th>\n",
703 |        "      <td>1</td>\n",
704 |        "      <td>미래를 향한 전진 4.0 창당발기인대회에서 박수치는 이언주 서울 연합뉴스 하사헌 기...</td>\n",
705 |        "    </tr>\n",
706 |        "    <tr>\n",
707 |        "      <th>11325</th>\n",
708 |        "      <td>1</td>\n",
709 |        "      <td>국민의례 하는 이낙연 총리 서울 뉴스1 오대일 기자 이낙연 국무총리가 3일 오후 서...</td>\n",
710 |        "    </tr>\n",
711 |        "  </tbody>\n",
712 |        "</table>\n",
713 |        "</div>"
714 |       ],
715 |       "text/plain": [
716 |        "       category                                      title_content\n",
717 |        "11226         1  개회사 하는 조성욱 공정거래위원장 서울 연합뉴스 임헌정 기자 조성욱 공정거래위원장이...\n",
718 |        "11702         1  국민의례하는 문재인 대통령 문재인 대통령이 3일 청와대에서 열린 국가기후환경회의 격...\n",
719 |        "4683          0  정관장 알파프로젝트와 상담하세요 서울 뉴시스 박주성 기자 KGC인삼공사가 2일 오전...\n",
720 |        "6391          1  미래를 향한 전진 4.0 창당발기인대회에서 박수치는 이언주 서울 연합뉴스 하사헌 기...\n",
721 |        "11325         1  국민의례 하는 이낙연 총리 서울 뉴스1 오대일 기자 이낙연 국무총리가 3일 오후 서..."
722 |       ]
723 |      },
724 |      "execution_count": 53,
725 |      "metadata": {},
726 |      "output_type": "execute_result"
727 |     }
728 |    ],
729 |    "source": [
730 |     "result = get_recommened_contents(user_1, data_doc_tok, model_tok)\n",
731 |     "pd.DataFrame(result.loc[:, ['category', 'title_content']])"
732 |    ]
733 |   },
734 |   {
735 |    "cell_type": "code",
736 |    "execution_count": 54,
737 |    "metadata": {},
738 |    "outputs": [
739 |     {
740 |      "data": {
741 |       "text/html": [
742 |        "<div>\n",
743 |        "<style scoped>\n",
744 |        "    .dataframe tbody tr th:only-of-type {\n",
745 |        "        vertical-align: middle;\n",
746 |        "    }\n",
747 |        "\n",
748 |        "    .dataframe tbody tr th {\n",
749 |        "        vertical-align: top;\n",
750 |        "    }\n",
751 |        "\n",
752 |        "    .dataframe thead th {\n",
753 |        "        text-align: right;\n",
754 |        "    }\n",
755 |        "</style>\n",
756 |        "<table border=\"1\" class=\"dataframe\">\n",
757 |        "  <thead>\n",
758 |        "    <tr style=\"text-align: right;\">\n",
759 |        "      <th></th>\n",
760 |        "      <th>category</th>\n",
761 |        "      <th>title_content</th>\n",
762 |        "    </tr>\n",
763 |        "  </thead>\n",
764 |        "  <tbody>\n",
765 |        "    <tr>\n",
766 |        "      <th>4794</th>\n",
767 |        "      <td>0</td>\n",
768 |        "      <td>권태신 부회장 주한아르헨티나 대사 접견 서울 뉴시스 권태신 전경련 부회장이 2일 서...</td>\n",
769 |        "    </tr>\n",
770 |        "    <tr>\n",
771 |        "      <th>8434</th>\n",
772 |        "      <td>1</td>\n",
773 |        "      <td>육해공 철통 방어 한·아세안 모터케이드 행렬 부산 연합뉴스 지난달 부산에서 한·아세...</td>\n",
774 |        "    </tr>\n",
775 |        "    <tr>\n",
776 |        "      <th>733</th>\n",
777 |        "      <td>0</td>\n",
778 |        "      <td>부산항 수출입 화물 가득 부산 연합뉴스 조정호 기자 1일 부산항 신선대부두에 수출입...</td>\n",
779 |        "    </tr>\n",
780 |        "    <tr>\n",
781 |        "      <th>4846</th>\n",
782 |        "      <td>0</td>\n",
783 |        "      <td>인사 한화토탈 한화토탈은 2일 임원 승진 인사를 발표했다. 승진자는 상무 1명 상무...</td>\n",
784 |        "    </tr>\n",
785 |        "    <tr>\n",
786 |        "      <th>2908</th>\n",
787 |        "      <td>0</td>\n",
788 |        "      <td>인사말하는 김기문 회장 서울 뉴스1 허경 기자 김기문 중소기업중앙회장이 2일 서울 ...</td>\n",
789 |        "    </tr>\n",
790 |        "  </tbody>\n",
791 |        "</table>\n",
792 |        "</div>"
793 |       ],
794 |       "text/plain": [
795 |        "      category                                      title_content\n",
796 |        "4794         0  권태신 부회장 주한아르헨티나 대사 접견 서울 뉴시스 권태신 전경련 부회장이 2일 서...\n",
797 |        "8434         1  육해공 철통 방어 한·아세안 모터케이드 행렬 부산 연합뉴스 지난달 부산에서 한·아세...\n",
798 |        "733          0  부산항 수출입 화물 가득 부산 연합뉴스 조정호 기자 1일 부산항 신선대부두에 수출입...\n",
799 |        "4846         0  인사 한화토탈 한화토탈은 2일 임원 승진 인사를 발표했다. 승진자는 상무 1명 상무...\n",
800 |        "2908         0  인사말하는 김기문 회장 서울 뉴스1 허경 기자 김기문 중소기업중앙회장이 2일 서울 ..."
801 |       ]
802 |      },
803 |      "execution_count": 54,
804 |      "metadata": {},
805 |      "output_type": "execute_result"
806 |     }
807 |    ],
808 |    "source": [
809 |     "result = get_recommened_contents(user_2, data_doc_tok, model_tok)\n",
810 |     "pd.DataFrame(result.loc[:, ['category', 'title_content']])"
811 |    ]
812 |   },
813 |   {
814 |    "cell_type": "code",
815 |    "execution_count": 55,
816 |    "metadata": {},
817 |    "outputs": [
818 |     {
819 |      "data": {
820 |       "text/html": [
821 |        "<div>\n",
822 |        "<style scoped>\n",
823 |        "    .dataframe tbody tr th:only-of-type {\n",
824 |        "        vertical-align: middle;\n",
825 |        "    }\n",
826 |        "\n",
827 |        "    .dataframe tbody tr th {\n",
828 |        "        vertical-align: top;\n",
829 |        "    }\n",
830 |        "\n",
831 |        "    .dataframe thead th {\n",
832 |        "        text-align: right;\n",
833 |        "    }\n",
834 |        "</style>\n",
835 |        "<table border=\"1\" class=\"dataframe\">\n",
836 |        "  <thead>\n",
837 |        "    <tr style=\"text-align: right;\">\n",
838 |        "      <th></th>\n",
839 |        "      <th>category</th>\n",
840 |        "      <th>title_content</th>\n",
841 |        "    </tr>\n",
842 |        "  </thead>\n",
843 |        "  <tbody>\n",
844 |        "    <tr>\n",
845 |        "      <th>10471</th>\n",
846 |        "      <td>1</td>\n",
847 |        "      <td>한국 부임한 도미타 고지 신임 주한 일본대사 서울 연합뉴스 임헌정 기자 도미타 고지...</td>\n",
848 |        "    </tr>\n",
849 |        "    <tr>\n",
850 |        "      <th>10517</th>\n",
851 |        "      <td>1</td>\n",
852 |        "      <td>입국하는 도미타 고지 신임 주한일본대사 CBS노컷뉴스 이한형 기자 도미타 고지 신임...</td>\n",
853 |        "    </tr>\n",
854 |        "    <tr>\n",
855 |        "      <th>11084</th>\n",
856 |        "      <td>1</td>\n",
857 |        "      <td>청와대 천막농성장 앞에서 열린 한국당 최고위원회의 서울 뉴스1 박세연 기자 3일 오...</td>\n",
858 |        "    </tr>\n",
859 |        "    <tr>\n",
860 |        "      <th>8353</th>\n",
861 |        "      <td>1</td>\n",
862 |        "      <td>수보회의 참석하는 문 대통령 서울 연합뉴스 한상균 기자 문재인 대통령이 2일 오후 ...</td>\n",
863 |        "    </tr>\n",
864 |        "    <tr>\n",
865 |        "      <th>16157</th>\n",
866 |        "      <td>2</td>\n",
867 |        "      <td>Tech  BIZ AI로 블루·화이트칼라 넘어 뉴칼라 일자리 만들 것 데이비드 반스...</td>\n",
868 |        "    </tr>\n",
869 |        "  </tbody>\n",
870 |        "</table>\n",
871 |        "</div>"
872 |       ],
873 |       "text/plain": [
874 |        "       category                                      title_content\n",
875 |        "10471         1  한국 부임한 도미타 고지 신임 주한 일본대사 서울 연합뉴스 임헌정 기자 도미타 고지...\n",
876 |        "10517         1  입국하는 도미타 고지 신임 주한일본대사 CBS노컷뉴스 이한형 기자 도미타 고지 신임...\n",
877 |        "11084         1  청와대 천막농성장 앞에서 열린 한국당 최고위원회의 서울 뉴스1 박세연 기자 3일 오...\n",
878 |        "8353          1  수보회의 참석하는 문 대통령 서울 연합뉴스 한상균 기자 문재인 대통령이 2일 오후 ...\n",
879 |        "16157         2  Tech  BIZ AI로 블루·화이트칼라 넘어 뉴칼라 일자리 만들 것 데이비드 반스..."
880 |       ]
881 |      },
882 |      "execution_count": 55,
883 |      "metadata": {},
884 |      "output_type": "execute_result"
885 |     }
886 |    ],
887 |    "source": [
888 |     "result = get_recommened_contents(user_3, data_doc_tok, model_tok)\n",
889 |     "pd.DataFrame(result.loc[:, ['category', 'title_content']])"
890 |    ]
891 |   },
892 |   {
893 |    "cell_type": "markdown",
894 |    "metadata": {},
895 |    "source": [
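896 |     "The results after morphological analysis are noticeably worse: for example, four of the five articles recommended to the economy user (user_1) come from category 1 rather than category 0."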
897 |    ]
898 |   }
941 |  ],
942 |  "metadata": {
943 |   "kernelspec": {
944 |    "display_name": "Python 3",
945 |    "language": "python",
946 |    "name": "python3"
947 |   },
948 |   "language_info": {
949 |    "codemirror_mode": {
950 |     "name": "ipython",
951 |     "version": 3
952 |    },
953 |    "file_extension": ".py",
954 |    "mimetype": "text/x-python",
955 |    "name": "python",
956 |    "nbconvert_exporter": "python",
957 |    "pygments_lexer": "ipython3",
958 |    "version": "3.8.5"
959 |   }
960 |  },
961 |  "nbformat": 4,
962 |  "nbformat_minor": 2
963 | }
964 | 
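Note on `005. naver news recommender.ipynb`: the cells above call `make_user_embedding` and `get_recommened_contents`, whose definitions are not part of this excerpt. The following is a minimal sketch of the general idea only, assuming the user vector is the mean of the vectors of the articles the user read and that candidates are ranked by cosine similarity. The helper names `mean_user_embedding` and `top_k_similar` and the toy vectors are hypothetical, and plain NumPy stands in for the Doc2Vec model.

```python
import numpy as np


def mean_user_embedding(doc_vectors, read_ids):
    """Average the vectors of the documents a user has read (assumed strategy)."""
    return doc_vectors[read_ids].mean(axis=0)


def top_k_similar(user_vec, doc_vectors, k=5):
    """Return the indices of the k documents most cosine-similar to user_vec."""
    norms = np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(user_vec)
    sims = doc_vectors @ user_vec / np.clip(norms, 1e-12, None)
    # argsort is ascending, so negate the similarities to get the largest first
    return np.argsort(-sims)[:k]


# Toy data: 6 documents in a 3-dimensional embedding space (made up for illustration).
doc_vectors = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.1, 0.9, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.1, 0.9],
])

user_vec = mean_user_embedding(doc_vectors, [0, 1])
recommended = top_k_similar(user_vec, doc_vectors, k=2)
print(recommended)
```

This sketch does not filter out articles the user has already read, so the two read articles are also the two nearest neighbours here; a real recommender would usually exclude them before ranking.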


--------------------------------------------------------------------------------
/006. deep learning recommender system.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "# Blog write-up\n",
  8 |     "\n",
  9 |     "An explanation of this material is posted on the blog below.\n",
 10 |     "- https://lsjsj92.tistory.com/577"
 11 |    ]
 12 |   },
 13 |   {
 14 |    "cell_type": "code",
 15 |    "execution_count": 1,
 16 |    "metadata": {},
 17 |    "outputs": [
 18 |     {
 19 |      "name": "stderr",
 20 |      "output_type": "stream",
 21 |      "text": [
 22 |       "Using TensorFlow backend.\n",
 23 |       "d:\\anaconda3\\envs\\soojin\\lib\\site-packages\\tensorflow\\python\\framework\\dtypes.py:493: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.\n",
 24 |       "  _np_qint8 = np.dtype([(\"qint8\", np.int8, 1)])\n",
 25 |       "d:\\anaconda3\\envs\\soojin\\lib\\site-packages\\tensorflow\\python\\framework\\dtypes.py:494: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.\n",
 26 |       "  _np_quint8 = np.dtype([(\"quint8\", np.uint8, 1)])\n",
 27 |       "d:\\anaconda3\\envs\\soojin\\lib\\site-packages\\tensorflow\\python\\framework\\dtypes.py:495: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.\n",
 28 |       "  _np_qint16 = np.dtype([(\"qint16\", np.int16, 1)])\n",
 29 |       "d:\\anaconda3\\envs\\soojin\\lib\\site-packages\\tensorflow\\python\\framework\\dtypes.py:496: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.\n",
 30 |       "  _np_quint16 = np.dtype([(\"quint16\", np.uint16, 1)])\n",
 31 |       "d:\\anaconda3\\envs\\soojin\\lib\\site-packages\\tensorflow\\python\\framework\\dtypes.py:497: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.\n",
 32 |       "  _np_qint32 = np.dtype([(\"qint32\", np.int32, 1)])\n",
 33 |       "d:\\anaconda3\\envs\\soojin\\lib\\site-packages\\tensorflow\\python\\framework\\dtypes.py:502: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.\n",
 34 |       "  np_resource = np.dtype([(\"resource\", np.ubyte, 1)])\n"
 35 |      ]
 36 |     }
 37 |    ],
 38 |    "source": [
 39 |     "import pandas as pd\n",
 40 |     "import pickle\n",
 41 |     "import numpy as np\n",
 42 |     "import matplotlib.pyplot as plt\n",
 43 |     "import seaborn as sns\n",
 44 |     "import random\n",
 45 |     "from sklearn.model_selection import train_test_split\n",
 46 |     "from keras.layers import Input, Dense, Concatenate, concatenate, Dropout, Reshape, dot, Dot\n",
 47 |     "from keras.models import Model\n",
 48 |     "from keras.callbacks import ModelCheckpoint, EarlyStopping\n",
 49 |     "from keras import backend as K\n",
 50 |     "from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report, confusion_matrix, mean_squared_error\n",
 51 |     "from gensim.models.doc2vec import Doc2Vec, TaggedDocument"
 52 |    ]
 53 |   },
 54 |   {
 55 |    "cell_type": "code",
 56 |    "execution_count": 2,
 57 |    "metadata": {},
 58 |    "outputs": [],
 59 |    "source": [
 60 |     "X = np.load('./datas/X_data.npy')\n",
 61 |     "y = np.load('./datas/y_data.npy') "
 62 |    ]
 63 |   },
 64 |   {
 65 |    "cell_type": "code",
 66 |    "execution_count": 7,
 67 |    "metadata": {},
 68 |    "outputs": [],
 69 |    "source": [
 70 |     "def keras_model():\n",
 71 |     "    user_vector_input = Input(shape=(128, ))\n",
 72 |     "    \n",
 73 |     "    dense_u_v = Dense(128, activation = 'relu')(user_vector_input)\n",
 74 |     "    dense_u_v = Dropout(0.5)(dense_u_v)\n",
 75 |     "    dense_u_v = Dense(64, activation = 'relu')(dense_u_v)\n",
 76 |     "    dense_u_v = Dropout(0.5)(dense_u_v)\n",
 77 |     "    dense_u_v = Dense(32, activation = 'relu')(dense_u_v)\n",
 78 |     "    dense_u_v = Dropout(0.5)(dense_u_v)\n",
 79 |     "    dense_u_v = Dense(16, activation = 'relu')(dense_u_v)\n",
 80 |     "    dense_u_v = Dense(1, activation = 'sigmoid')(dense_u_v)\n",
 81 |     "    \n",
 82 |     "    model = Model(inputs=user_vector_input, outputs=dense_u_v)\n",
 83 |     "    model.compile(optimizer = 'Adam', loss='binary_crossentropy', metrics=['acc'])\n",
 84 |     "    return model\n"
 85 |    ]
 86 |   },
 87 |   {
 88 |    "cell_type": "code",
 89 |    "execution_count": 8,
 90 |    "metadata": {},
 91 |    "outputs": [],
 92 |    "source": [
 93 |     "model = keras_model()"
 94 |    ]
 95 |   },
 96 |   {
 97 |    "cell_type": "code",
 98 |    "execution_count": 13,
 99 |    "metadata": {},
100 |    "outputs": [],
101 |    "source": [
102 |     "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=1)"
103 |    ]
104 |   },
105 |   {
106 |    "cell_type": "code",
107 |    "execution_count": 14,
108 |    "metadata": {},
109 |    "outputs": [],
110 |    "source": [
111 |     "modelpath = './datas/recommender.model'\n",
112 |     "checkpointer = ModelCheckpoint(filepath = modelpath, monitor='val_loss', verbose=1, save_best_only=True)\n",
113 |     "early_stop = EarlyStopping(monitor='val_loss', patience=3)"
114 |    ]
115 |   },
116 |   {
117 |    "cell_type": "code",
118 |    "execution_count": 15,
119 |    "metadata": {},
120 |    "outputs": [
121 |     {
122 |      "name": "stdout",
123 |      "output_type": "stream",
124 |      "text": [
125 |       "Train on 72457 samples, validate on 18115 samples\n",
126 |       "Epoch 1/50\n",
127 |       "72457/72457 [==============================] - 15s 207us/step - loss: 0.6417 - acc: 0.6030 - val_loss: 0.6135 - val_acc: 0.6306\n",
128 |       "\n",
129 |       "Epoch 00001: val_loss improved from inf to 0.61350, saving model to ./datas/recommender.model\n",
130 |       "Epoch 2/50\n",
131 |       "72457/72457 [==============================] - 7s 94us/step - loss: 0.6071 - acc: 0.6261 - val_loss: 0.5916 - val_acc: 0.6601\n",
132 |       "\n",
133 |       "Epoch 00002: val_loss improved from 0.61350 to 0.59162, saving model to ./datas/recommender.model\n",
134 |       "Epoch 3/50\n",
135 |       "72457/72457 [==============================] - 7s 102us/step - loss: 0.5899 - acc: 0.6461 - val_loss: 0.5665 - val_acc: 0.6895\n",
136 |       "\n",
137 |       "Epoch 00003: val_loss improved from 0.59162 to 0.56651, saving model to ./datas/recommender.model\n",
138 |       "Epoch 4/50\n",
139 |       "72457/72457 [==============================] - 8s 104us/step - loss: 0.5710 - acc: 0.6716 - val_loss: 0.5612 - val_acc: 0.7101\n",
140 |       "\n",
141 |       "Epoch 00004: val_loss improved from 0.56651 to 0.56123, saving model to ./datas/recommender.model\n",
142 |       "Epoch 5/50\n",
143 |       "72457/72457 [==============================] - 7s 96us/step - loss: 0.5516 - acc: 0.6963 - val_loss: 0.5512 - val_acc: 0.7243\n",
144 |       "\n",
145 |       "Epoch 00005: val_loss improved from 0.56123 to 0.55125, saving model to ./datas/recommender.model\n",
146 |       "Epoch 6/50\n",
147 |       "72457/72457 [==============================] - 7s 95us/step - loss: 0.5372 - acc: 0.7109 - val_loss: 0.5405 - val_acc: 0.7243\n",
148 |       "\n",
149 |       "Epoch 00006: val_loss improved from 0.55125 to 0.54048, saving model to ./datas/recommender.model\n",
150 |       "Epoch 7/50\n",
151 |       "72457/72457 [==============================] - 7s 94us/step - loss: 0.5244 - acc: 0.7216 - val_loss: 0.5471 - val_acc: 0.7110\n",
152 |       "\n",
153 |       "Epoch 00007: val_loss did not improve\n",
154 |       "Epoch 8/50\n",
155 |       "72457/72457 [==============================] - 8s 112us/step - loss: 0.5208 - acc: 0.7257 - val_loss: 0.5459 - val_acc: 0.7154\n",
156 |       "\n",
157 |       "Epoch 00008: val_loss did not improve\n",
158 |       "Epoch 9/50\n",
159 |       "72457/72457 [==============================] - 7s 98us/step - loss: 0.5111 - acc: 0.7340 - val_loss: 0.5391 - val_acc: 0.7200\n",
160 |       "\n",
161 |       "Epoch 00009: val_loss improved from 0.54048 to 0.53907, saving model to ./datas/recommender.model\n",
162 |       "Epoch 10/50\n",
163 |       "72457/72457 [==============================] - 7s 96us/step - loss: 0.5050 - acc: 0.7391 - val_loss: 0.5310 - val_acc: 0.7292\n",
164 |       "\n",
165 |       "Epoch 00010: val_loss improved from 0.53907 to 0.53100, saving model to ./datas/recommender.model\n",
166 |       "Epoch 11/50\n",
167 |       "72457/72457 [==============================] - 7s 96us/step - loss: 0.4991 - acc: 0.7428 - val_loss: 0.5285 - val_acc: 0.7261\n",
168 |       "\n",
169 |       "Epoch 00011: val_loss improved from 0.53100 to 0.52849, saving model to ./datas/recommender.model\n",
170 |       "Epoch 12/50\n",
171 |       "72457/72457 [==============================] - 7s 95us/step - loss: 0.4948 - acc: 0.7484 - val_loss: 0.5502 - val_acc: 0.7075\n",
172 |       "\n",
173 |       "Epoch 00012: val_loss did not improve\n",
174 |       "Epoch 13/50\n",
175 |       "72457/72457 [==============================] - 8s 107us/step - loss: 0.4910 - acc: 0.7497 - val_loss: 0.5268 - val_acc: 0.7344\n",
176 |       "\n",
177 |       "Epoch 00013: val_loss improved from 0.52849 to 0.52680, saving model to ./datas/recommender.model\n",
178 |       "Epoch 14/50\n",
179 |       "72457/72457 [==============================] - 7s 98us/step - loss: 0.4888 - acc: 0.7507 - val_loss: 0.5283 - val_acc: 0.7234\n",
180 |       "\n",
181 |       "Epoch 00014: val_loss did not improve\n",
182 |       "Epoch 15/50\n",
183 |       "72457/72457 [==============================] - 7s 96us/step - loss: 0.4841 - acc: 0.7538 - val_loss: 0.5346 - val_acc: 0.7189\n",
184 |       "\n",
185 |       "Epoch 00015: val_loss did not improve\n",
186 |       "Epoch 16/50\n",
187 |       "72457/72457 [==============================] - 8s 106us/step - loss: 0.4808 - acc: 0.7568 - val_loss: 0.5545 - val_acc: 0.6986\n",
188 |       "\n",
189 |       "Epoch 00016: val_loss did not improve\n"
190 |      ]
191 |     },
192 |     {
193 |      "data": {
194 |       "text/plain": [
195 |        "<keras.callbacks.History at 0x199658b26d8>"
196 |       ]
197 |      },
198 |      "execution_count": 15,
199 |      "metadata": {},
200 |      "output_type": "execute_result"
201 |     }
202 |    ],
203 |    "source": [
204 |     "model.fit(X_train, y_train, validation_split = 0.2, epochs=50, batch_size=64, callbacks=[early_stop, checkpointer])"
205 |    ]
206 |   },
207 |   {
208 |    "cell_type": "code",
209 |    "execution_count": 16,
210 |    "metadata": {},
211 |    "outputs": [],
212 |    "source": [
213 |     "pred = model.predict(X_test)"
214 |    ]
215 |   },
216 |   {
217 |    "cell_type": "code",
218 |    "execution_count": 17,
219 |    "metadata": {},
220 |    "outputs": [],
221 |    "source": [
222 |     "pred_label = [1 if i > 0.5 else 0 for i in pred]"
223 |    ]
224 |   },
225 |   {
226 |    "cell_type": "code",
227 |    "execution_count": 18,
228 |    "metadata": {},
229 |    "outputs": [
230 |     {
231 |      "name": "stdout",
232 |      "output_type": "stream",
233 |      "text": [
234 |       "0.7108376715570278\n",
235 |       "0.5794753086419753\n",
236 |       "0.9192166462668299\n",
237 |       "0.7031753742878594\n"
238 |      ]
239 |     }
240 |    ],
241 |    "source": [
242 |     "print(f1_score(y_test, pred_label))\n",
243 |     "print(precision_score(y_test, pred_label))\n",
244 |     "print(recall_score(y_test, pred_label))\n",
245 |     "print(accuracy_score(y_test, pred_label))"
246 |    ]
247 |   }
255 |  ],
256 |  "metadata": {
257 |   "kernelspec": {
258 |    "display_name": "Python 3",
259 |    "language": "python",
260 |    "name": "python3"
261 |   },
262 |   "language_info": {
263 |    "codemirror_mode": {
264 |     "name": "ipython",
265 |     "version": 3
266 |    },
267 |    "file_extension": ".py",
268 |    "mimetype": "text/x-python",
269 |    "name": "python",
270 |    "nbconvert_exporter": "python",
271 |    "pygments_lexer": "ipython3",
272 |    "version": "3.8.5"
273 |   }
274 |  },
275 |  "nbformat": 4,
276 |  "nbformat_minor": 2
277 | }
278 | 


--------------------------------------------------------------------------------
/009_chatgpt_recsys.ipynb:
--------------------------------------------------------------------------------
   1 | {
   2 |  "cells": [
   3 |   {
   4 |    "cell_type": "markdown",
   5 |    "id": "01422061",
   6 |    "metadata": {},
   7 |    "source": [
   8 |     "# Blog\n",
   9 |     "- https://lsjsj92.tistory.com/657\n",
  10 |     "\n",
  11 |     "This is the code explained in the blog post above.\n",
  12 |     "\n",
  13 |     "# Data\n",
  14 |     "- https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset"
  15 |    ]
  16 |   },
  17 |   {
  18 |    "cell_type": "code",
  19 |    "execution_count": 1,
  20 |    "id": "1925745b",
  21 |    "metadata": {},
  22 |    "outputs": [
  23 |     {
  24 |      "data": {
  25 |       "text/html": [
  26 |        "\n",
  27 |        "<style>\n",
  28 |        "    div#notebook-container    { width: 95%; }\n",
  29 |        "    div#menubar-container     { width: 95%; }\n",
  30 |        "    div#maintoolbar-container { width: 99%; }\n",
  31 |        "</style>\n"
  32 |       ],
  33 |       "text/plain": [
  34 |        "<IPython.core.display.HTML object>"
  35 |       ]
  36 |      },
  37 |      "metadata": {},
  38 |      "output_type": "display_data"
  39 |     }
  40 |    ],
  41 |    "source": [
  42 |     "from IPython.display import display, HTML\n",
  43 |     "display(HTML(data=\"\"\"\n",
  44 |     "<style>\n",
  45 |     "    div#notebook-container    { width: 95%; }\n",
  46 |     "    div#menubar-container     { width: 95%; }\n",
  47 |     "    div#maintoolbar-container { width: 99%; }\n",
  48 |     "</style>\n",
  49 |     "\"\"\"))"
  50 |    ]
  51 |   },
  52 |   {
  53 |    "cell_type": "code",
  54 |    "execution_count": 371,
  55 |    "id": "3e46dd11",
  56 |    "metadata": {},
  57 |    "outputs": [],
  58 |    "source": [
  59 |     "import requests\n",
  60 |     "\n",
  61 |     "import pandas as pd\n",
  62 |     "import numpy as np\n",
  63 |     "import copy\n",
  64 |     "import json\n",
  65 |     "\n",
  66 |     "from ast import literal_eval\n",
  67 |     "\n",
  68 |     "import torch\n",
  69 |     "from sentence_transformers import SentenceTransformer, util\n",
  70 |     "from transformers import AutoTokenizer, AutoModel\n",
  71 |     "from transformers import OwlViTProcessor, OwlViTForObjectDetection\n",
  72 |     "from transformers import pipeline\n",
  73 |     "from transformers import GPT2TokenizerFast\n",
  74 |     "from PIL import Image\n",
  75 |     "\n",
  76 |     "import pickle\n",
  77 |     "\n"
  78 |    ]
  79 |   },
  80 |   {
  81 |    "cell_type": "code",
  82 |    "execution_count": 3,
  83 |    "id": "b5fea154",
  84 |    "metadata": {},
  85 |    "outputs": [],
  86 |    "source": [
  87 |     "import matplotlib.pyplot as plt\n",
  88 |     "from typing import List, Tuple, Dict\n",
  89 |     "\n",
  90 |     "import sklearn.datasets as datasets\n",
  91 |     "import sklearn.manifold as manifold"
  92 |    ]
  93 |   },
  94 |   {
  95 |    "cell_type": "code",
  96 |    "execution_count": 4,
  97 |    "id": "ffd81bad",
  98 |    "metadata": {},
  99 |    "outputs": [],
 100 |    "source": [
 101 |     "import openai\n",
 102 |     "import os\n",
 103 |     "import sys\n",
 104 |     "from dotenv import load_dotenv\n",
 105 |     "\n",
 106 |     "load_dotenv()\n",
 107 |     "openai.api_key = os.getenv(\"OPENAI_API_KEY\")"
 108 |    ]
 109 |   },
 110 |   {
 111 |    "cell_type": "code",
 112 |    "execution_count": 5,
 113 |    "id": "4a88f0a5",
 114 |    "metadata": {},
 115 |    "outputs": [],
 116 |    "source": [
 117 |     "cur_os = sys.platform"
 118 |    ]
 119 |   },
 120 |   {
 121 |    "cell_type": "code",
 122 |    "execution_count": 6,
 123 |    "id": "dd02d945",
 124 |    "metadata": {},
 125 |    "outputs": [],
 126 |    "source": [
 127 |     "model_path = \"D:/github\" if cur_os.startswith('win') else None"
 128 |    ]
 129 |   },
 130 |   {
 131 |    "cell_type": "markdown",
 132 |    "id": "755c7288",
 133 |    "metadata": {},
 134 |    "source": [
 135 |     "## Dataset"
 136 |    ]
 137 |   },
 138 |   {
 139 |    "cell_type": "code",
 140 |    "execution_count": 7,
 141 |    "id": "widespread-trial",
 142 |    "metadata": {},
 143 |    "outputs": [
 144 |     {
 145 |      "name": "stdout",
 146 |      "output_type": "stream",
 147 |      "text": [
 148 |       "(45466, 24)\n"
 149 |      ]
 150 |     },
 151 |     {
 152 |      "data": {
 153 |       "text/html": [
 154 |        "<div>\n",
 155 |        "<style scoped>\n",
 156 |        "    .dataframe tbody tr th:only-of-type {\n",
 157 |        "        vertical-align: middle;\n",
 158 |        "    }\n",
 159 |        "\n",
 160 |        "    .dataframe tbody tr th {\n",
 161 |        "        vertical-align: top;\n",
 162 |        "    }\n",
 163 |        "\n",
 164 |        "    .dataframe thead th {\n",
 165 |        "        text-align: right;\n",
 166 |        "    }\n",
 167 |        "</style>\n",
 168 |        "<table border=\"1\" class=\"dataframe\">\n",
 169 |        "  <thead>\n",
 170 |        "    <tr style=\"text-align: right;\">\n",
 171 |        "      <th></th>\n",
 172 |        "      <th>adult</th>\n",
 173 |        "      <th>belongs_to_collection</th>\n",
 174 |        "      <th>budget</th>\n",
 175 |        "      <th>genres</th>\n",
 176 |        "      <th>homepage</th>\n",
 177 |        "      <th>id</th>\n",
 178 |        "      <th>imdb_id</th>\n",
 179 |        "      <th>original_language</th>\n",
 180 |        "      <th>original_title</th>\n",
 181 |        "      <th>overview</th>\n",
 182 |        "      <th>...</th>\n",
 183 |        "      <th>release_date</th>\n",
 184 |        "      <th>revenue</th>\n",
 185 |        "      <th>runtime</th>\n",
 186 |        "      <th>spoken_languages</th>\n",
 187 |        "      <th>status</th>\n",
 188 |        "      <th>tagline</th>\n",
 189 |        "      <th>title</th>\n",
 190 |        "      <th>video</th>\n",
 191 |        "      <th>vote_average</th>\n",
 192 |        "      <th>vote_count</th>\n",
 193 |        "    </tr>\n",
 194 |        "  </thead>\n",
 195 |        "  <tbody>\n",
 196 |        "    <tr>\n",
 197 |        "      <th>0</th>\n",
 198 |        "      <td>False</td>\n",
 199 |        "      <td>{'id': 10194, 'name': 'Toy Story Collection', ...</td>\n",
 200 |        "      <td>30000000</td>\n",
 201 |        "      <td>[{'id': 16, 'name': 'Animation'}, {'id': 35, '...</td>\n",
 202 |        "      <td>http://toystory.disney.com/toy-story</td>\n",
 203 |        "      <td>862</td>\n",
 204 |        "      <td>tt0114709</td>\n",
 205 |        "      <td>en</td>\n",
 206 |        "      <td>Toy Story</td>\n",
 207 |        "      <td>Led by Woody, Andy's toys live happily in his ...</td>\n",
 208 |        "      <td>...</td>\n",
 209 |        "      <td>1995-10-30</td>\n",
 210 |        "      <td>373554033</td>\n",
 211 |        "      <td>81.0</td>\n",
 212 |        "      <td>[{'iso_639_1': 'en', 'name': 'English'}]</td>\n",
 213 |        "      <td>Released</td>\n",
 214 |        "      <td>NaN</td>\n",
 215 |        "      <td>Toy Story</td>\n",
 216 |        "      <td>False</td>\n",
 217 |        "      <td>7.7</td>\n",
 218 |        "      <td>5415</td>\n",
 219 |        "    </tr>\n",
 220 |        "    <tr>\n",
 221 |        "      <th>1</th>\n",
 222 |        "      <td>False</td>\n",
 223 |        "      <td>NaN</td>\n",
 224 |        "      <td>65000000</td>\n",
 225 |        "      <td>[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...</td>\n",
 226 |        "      <td>NaN</td>\n",
 227 |        "      <td>8844</td>\n",
 228 |        "      <td>tt0113497</td>\n",
 229 |        "      <td>en</td>\n",
 230 |        "      <td>Jumanji</td>\n",
 231 |        "      <td>When siblings Judy and Peter discover an encha...</td>\n",
 232 |        "      <td>...</td>\n",
 233 |        "      <td>1995-12-15</td>\n",
 234 |        "      <td>262797249</td>\n",
 235 |        "      <td>104.0</td>\n",
 236 |        "      <td>[{'iso_639_1': 'en', 'name': 'English'}, {'iso...</td>\n",
 237 |        "      <td>Released</td>\n",
 238 |        "      <td>Roll the dice and unleash the excitement!</td>\n",
 239 |        "      <td>Jumanji</td>\n",
 240 |        "      <td>False</td>\n",
 241 |        "      <td>6.9</td>\n",
 242 |        "      <td>2413</td>\n",
 243 |        "    </tr>\n",
 244 |        "    <tr>\n",
 245 |        "      <th>2</th>\n",
 246 |        "      <td>False</td>\n",
 247 |        "      <td>{'id': 119050, 'name': 'Grumpy Old Men Collect...</td>\n",
 248 |        "      <td>0</td>\n",
 249 |        "      <td>[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...</td>\n",
 250 |        "      <td>NaN</td>\n",
 251 |        "      <td>15602</td>\n",
 252 |        "      <td>tt0113228</td>\n",
 253 |        "      <td>en</td>\n",
 254 |        "      <td>Grumpier Old Men</td>\n",
 255 |        "      <td>A family wedding reignites the ancient feud be...</td>\n",
 256 |        "      <td>...</td>\n",
 257 |        "      <td>1995-12-22</td>\n",
 258 |        "      <td>0</td>\n",
 259 |        "      <td>101.0</td>\n",
 260 |        "      <td>[{'iso_639_1': 'en', 'name': 'English'}]</td>\n",
 261 |        "      <td>Released</td>\n",
 262 |        "      <td>Still Yelling. Still Fighting. Still Ready for...</td>\n",
 263 |        "      <td>Grumpier Old Men</td>\n",
 264 |        "      <td>False</td>\n",
 265 |        "      <td>6.5</td>\n",
 266 |        "      <td>92</td>\n",
 267 |        "    </tr>\n",
 268 |        "    <tr>\n",
 269 |        "      <th>3</th>\n",
 270 |        "      <td>False</td>\n",
 271 |        "      <td>NaN</td>\n",
 272 |        "      <td>16000000</td>\n",
 273 |        "      <td>[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...</td>\n",
 274 |        "      <td>NaN</td>\n",
 275 |        "      <td>31357</td>\n",
 276 |        "      <td>tt0114885</td>\n",
 277 |        "      <td>en</td>\n",
 278 |        "      <td>Waiting to Exhale</td>\n",
 279 |        "      <td>Cheated on, mistreated and stepped on, the wom...</td>\n",
 280 |        "      <td>...</td>\n",
 281 |        "      <td>1995-12-22</td>\n",
 282 |        "      <td>81452156</td>\n",
 283 |        "      <td>127.0</td>\n",
 284 |        "      <td>[{'iso_639_1': 'en', 'name': 'English'}]</td>\n",
 285 |        "      <td>Released</td>\n",
 286 |        "      <td>Friends are the people who let you be yourself...</td>\n",
 287 |        "      <td>Waiting to Exhale</td>\n",
 288 |        "      <td>False</td>\n",
 289 |        "      <td>6.1</td>\n",
 290 |        "      <td>34</td>\n",
 291 |        "    </tr>\n",
 292 |        "    <tr>\n",
 293 |        "      <th>4</th>\n",
 294 |        "      <td>False</td>\n",
 295 |        "      <td>{'id': 96871, 'name': 'Father of the Bride Col...</td>\n",
 296 |        "      <td>0</td>\n",
 297 |        "      <td>[{'id': 35, 'name': 'Comedy'}]</td>\n",
 298 |        "      <td>NaN</td>\n",
 299 |        "      <td>11862</td>\n",
 300 |        "      <td>tt0113041</td>\n",
 301 |        "      <td>en</td>\n",
 302 |        "      <td>Father of the Bride Part II</td>\n",
 303 |        "      <td>Just when George Banks has recovered from his ...</td>\n",
 304 |        "      <td>...</td>\n",
 305 |        "      <td>1995-02-10</td>\n",
 306 |        "      <td>76578911</td>\n",
 307 |        "      <td>106.0</td>\n",
 308 |        "      <td>[{'iso_639_1': 'en', 'name': 'English'}]</td>\n",
 309 |        "      <td>Released</td>\n",
 310 |        "      <td>Just When His World Is Back To Normal... He's ...</td>\n",
 311 |        "      <td>Father of the Bride Part II</td>\n",
 312 |        "      <td>False</td>\n",
 313 |        "      <td>5.7</td>\n",
 314 |        "      <td>173</td>\n",
 315 |        "    </tr>\n",
 316 |        "  </tbody>\n",
 317 |        "</table>\n",
 318 |        "<p>5 rows × 24 columns</p>\n",
 319 |        "</div>"
 320 |       ],
 321 |       "text/plain": [
 322 |        "   adult                              belongs_to_collection    budget  \\\n",
 323 |        "0  False  {'id': 10194, 'name': 'Toy Story Collection', ...  30000000   \n",
 324 |        "1  False                                                NaN  65000000   \n",
 325 |        "2  False  {'id': 119050, 'name': 'Grumpy Old Men Collect...         0   \n",
 326 |        "3  False                                                NaN  16000000   \n",
 327 |        "4  False  {'id': 96871, 'name': 'Father of the Bride Col...         0   \n",
 328 |        "\n",
 329 |        "                                              genres  \\\n",
 330 |        "0  [{'id': 16, 'name': 'Animation'}, {'id': 35, '...   \n",
 331 |        "1  [{'id': 12, 'name': 'Adventure'}, {'id': 14, '...   \n",
 332 |        "2  [{'id': 10749, 'name': 'Romance'}, {'id': 35, ...   \n",
 333 |        "3  [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...   \n",
 334 |        "4                     [{'id': 35, 'name': 'Comedy'}]   \n",
 335 |        "\n",
 336 |        "                               homepage     id    imdb_id original_language  \\\n",
 337 |        "0  http://toystory.disney.com/toy-story    862  tt0114709                en   \n",
 338 |        "1                                   NaN   8844  tt0113497                en   \n",
 339 |        "2                                   NaN  15602  tt0113228                en   \n",
 340 |        "3                                   NaN  31357  tt0114885                en   \n",
 341 |        "4                                   NaN  11862  tt0113041                en   \n",
 342 |        "\n",
 343 |        "                original_title  \\\n",
 344 |        "0                    Toy Story   \n",
 345 |        "1                      Jumanji   \n",
 346 |        "2             Grumpier Old Men   \n",
 347 |        "3            Waiting to Exhale   \n",
 348 |        "4  Father of the Bride Part II   \n",
 349 |        "\n",
 350 |        "                                            overview  ... release_date  \\\n",
 351 |        "0  Led by Woody, Andy's toys live happily in his ...  ...   1995-10-30   \n",
 352 |        "1  When siblings Judy and Peter discover an encha...  ...   1995-12-15   \n",
 353 |        "2  A family wedding reignites the ancient feud be...  ...   1995-12-22   \n",
 354 |        "3  Cheated on, mistreated and stepped on, the wom...  ...   1995-12-22   \n",
 355 |        "4  Just when George Banks has recovered from his ...  ...   1995-02-10   \n",
 356 |        "\n",
 357 |        "     revenue runtime                                   spoken_languages  \\\n",
 358 |        "0  373554033    81.0           [{'iso_639_1': 'en', 'name': 'English'}]   \n",
 359 |        "1  262797249   104.0  [{'iso_639_1': 'en', 'name': 'English'}, {'iso...   \n",
 360 |        "2          0   101.0           [{'iso_639_1': 'en', 'name': 'English'}]   \n",
 361 |        "3   81452156   127.0           [{'iso_639_1': 'en', 'name': 'English'}]   \n",
 362 |        "4   76578911   106.0           [{'iso_639_1': 'en', 'name': 'English'}]   \n",
 363 |        "\n",
 364 |        "     status                                            tagline  \\\n",
 365 |        "0  Released                                                NaN   \n",
 366 |        "1  Released          Roll the dice and unleash the excitement!   \n",
 367 |        "2  Released  Still Yelling. Still Fighting. Still Ready for...   \n",
 368 |        "3  Released  Friends are the people who let you be yourself...   \n",
 369 |        "4  Released  Just When His World Is Back To Normal... He's ...   \n",
 370 |        "\n",
 371 |        "                         title  video vote_average vote_count  \n",
 372 |        "0                    Toy Story  False          7.7       5415  \n",
 373 |        "1                      Jumanji  False          6.9       2413  \n",
 374 |        "2             Grumpier Old Men  False          6.5         92  \n",
 375 |        "3            Waiting to Exhale  False          6.1         34  \n",
 376 |        "4  Father of the Bride Part II  False          5.7        173  \n",
 377 |        "\n",
 378 |        "[5 rows x 24 columns]"
 379 |       ]
 380 |      },
 381 |      "execution_count": 7,
 382 |      "metadata": {},
 383 |      "output_type": "execute_result"
 384 |     }
 385 |    ],
 386 |    "source": [
 387 |     "movies_metadata = pd.read_csv('./movie_meta/movies_metadata.csv', sep=\",\", dtype=str)\n",
 388 |     "print(movies_metadata.shape)\n",
 389 |     "movies_metadata.head()"
 390 |    ]
 391 |   },
 392 |   {
 393 |    "cell_type": "code",
 394 |    "execution_count": 8,
 395 |    "id": "meaning-kingdom",
 396 |    "metadata": {},
 397 |    "outputs": [
 398 |     {
 399 |      "data": {
 400 |       "text/html": [
 401 |        "<div>\n",
 402 |        "<style scoped>\n",
 403 |        "    .dataframe tbody tr th:only-of-type {\n",
 404 |        "        vertical-align: middle;\n",
 405 |        "    }\n",
 406 |        "\n",
 407 |        "    .dataframe tbody tr th {\n",
 408 |        "        vertical-align: top;\n",
 409 |        "    }\n",
 410 |        "\n",
 411 |        "    .dataframe thead th {\n",
 412 |        "        text-align: right;\n",
 413 |        "    }\n",
 414 |        "</style>\n",
 415 |        "<table border=\"1\" class=\"dataframe\">\n",
 416 |        "  <thead>\n",
 417 |        "    <tr style=\"text-align: right;\">\n",
 418 |        "      <th></th>\n",
 419 |        "      <th>id</th>\n",
 420 |        "      <th>genres</th>\n",
 421 |        "      <th>title</th>\n",
 422 |        "      <th>overview</th>\n",
 423 |        "      <th>release_date</th>\n",
 424 |        "    </tr>\n",
 425 |        "  </thead>\n",
 426 |        "  <tbody>\n",
 427 |        "    <tr>\n",
 428 |        "      <th>0</th>\n",
 429 |        "      <td>862</td>\n",
 430 |        "      <td>[{'id': 16, 'name': 'Animation'}, {'id': 35, '...</td>\n",
 431 |        "      <td>Toy Story</td>\n",
 432 |        "      <td>Led by Woody, Andy's toys live happily in his ...</td>\n",
 433 |        "      <td>1995-10-30</td>\n",
 434 |        "    </tr>\n",
 435 |        "    <tr>\n",
 436 |        "      <th>1</th>\n",
 437 |        "      <td>8844</td>\n",
 438 |        "      <td>[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...</td>\n",
 439 |        "      <td>Jumanji</td>\n",
 440 |        "      <td>When siblings Judy and Peter discover an encha...</td>\n",
 441 |        "      <td>1995-12-15</td>\n",
 442 |        "    </tr>\n",
 443 |        "    <tr>\n",
 444 |        "      <th>2</th>\n",
 445 |        "      <td>15602</td>\n",
 446 |        "      <td>[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...</td>\n",
 447 |        "      <td>Grumpier Old Men</td>\n",
 448 |        "      <td>A family wedding reignites the ancient feud be...</td>\n",
 449 |        "      <td>1995-12-22</td>\n",
 450 |        "    </tr>\n",
 451 |        "    <tr>\n",
 452 |        "      <th>3</th>\n",
 453 |        "      <td>31357</td>\n",
 454 |        "      <td>[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...</td>\n",
 455 |        "      <td>Waiting to Exhale</td>\n",
 456 |        "      <td>Cheated on, mistreated and stepped on, the wom...</td>\n",
 457 |        "      <td>1995-12-22</td>\n",
 458 |        "    </tr>\n",
 459 |        "    <tr>\n",
 460 |        "      <th>4</th>\n",
 461 |        "      <td>11862</td>\n",
 462 |        "      <td>[{'id': 35, 'name': 'Comedy'}]</td>\n",
 463 |        "      <td>Father of the Bride Part II</td>\n",
 464 |        "      <td>Just when George Banks has recovered from his ...</td>\n",
 465 |        "      <td>1995-02-10</td>\n",
 466 |        "    </tr>\n",
 467 |        "  </tbody>\n",
 468 |        "</table>\n",
 469 |        "</div>"
 470 |       ],
 471 |       "text/plain": [
 472 |        "      id                                             genres  \\\n",
 473 |        "0    862  [{'id': 16, 'name': 'Animation'}, {'id': 35, '...   \n",
 474 |        "1   8844  [{'id': 12, 'name': 'Adventure'}, {'id': 14, '...   \n",
 475 |        "2  15602  [{'id': 10749, 'name': 'Romance'}, {'id': 35, ...   \n",
 476 |        "3  31357  [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...   \n",
 477 |        "4  11862                     [{'id': 35, 'name': 'Comedy'}]   \n",
 478 |        "\n",
 479 |        "                         title  \\\n",
 480 |        "0                    Toy Story   \n",
 481 |        "1                      Jumanji   \n",
 482 |        "2             Grumpier Old Men   \n",
 483 |        "3            Waiting to Exhale   \n",
 484 |        "4  Father of the Bride Part II   \n",
 485 |        "\n",
 486 |        "                                            overview release_date  \n",
 487 |        "0  Led by Woody, Andy's toys live happily in his ...   1995-10-30  \n",
 488 |        "1  When siblings Judy and Peter discover an encha...   1995-12-15  \n",
 489 |        "2  A family wedding reignites the ancient feud be...   1995-12-22  \n",
 490 |        "3  Cheated on, mistreated and stepped on, the wom...   1995-12-22  \n",
 491 |        "4  Just when George Banks has recovered from his ...   1995-02-10  "
 492 |       ]
 493 |      },
 494 |      "execution_count": 8,
 495 |      "metadata": {},
 496 |      "output_type": "execute_result"
 497 |     }
 498 |    ],
 499 |    "source": [
 500 |     "movies_metadata = movies_metadata[['id', 'genres', 'title', 'overview', 'release_date']].copy()\n",
 501 |     "movies_metadata.head()"
 502 |    ]
 503 |   },
 504 |   {
 505 |    "cell_type": "code",
 506 |    "execution_count": 9,
 507 |    "id": "incomplete-botswana",
 508 |    "metadata": {},
 509 |    "outputs": [
 510 |     {
 511 |      "data": {
 512 |       "text/html": [
 513 |        "<div>\n",
 514 |        "<style scoped>\n",
 515 |        "    .dataframe tbody tr th:only-of-type {\n",
 516 |        "        vertical-align: middle;\n",
 517 |        "    }\n",
 518 |        "\n",
 519 |        "    .dataframe tbody tr th {\n",
 520 |        "        vertical-align: top;\n",
 521 |        "    }\n",
 522 |        "\n",
 523 |        "    .dataframe thead th {\n",
 524 |        "        text-align: right;\n",
 525 |        "    }\n",
 526 |        "</style>\n",
 527 |        "<table border=\"1\" class=\"dataframe\">\n",
 528 |        "  <thead>\n",
 529 |        "    <tr style=\"text-align: right;\">\n",
 530 |        "      <th></th>\n",
 531 |        "      <th>id</th>\n",
 532 |        "      <th>genres</th>\n",
 533 |        "      <th>title</th>\n",
 534 |        "      <th>overview</th>\n",
 535 |        "      <th>release_date</th>\n",
 536 |        "    </tr>\n",
 537 |        "  </thead>\n",
 538 |        "  <tbody>\n",
 539 |        "    <tr>\n",
 540 |        "      <th>0</th>\n",
 541 |        "      <td>862</td>\n",
 542 |        "      <td>[{'id': 16, 'name': 'Animation'}, {'id': 35, '...</td>\n",
 543 |        "      <td>Toy Story</td>\n",
 544 |        "      <td>Led by Woody, Andy's toys live happily in his ...</td>\n",
 545 |        "      <td>1995-10-30</td>\n",
 546 |        "    </tr>\n",
 547 |        "    <tr>\n",
 548 |        "      <th>1</th>\n",
 549 |        "      <td>8844</td>\n",
 550 |        "      <td>[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...</td>\n",
 551 |        "      <td>Jumanji</td>\n",
 552 |        "      <td>When siblings Judy and Peter discover an encha...</td>\n",
 553 |        "      <td>1995-12-15</td>\n",
 554 |        "    </tr>\n",
 555 |        "    <tr>\n",
 556 |        "      <th>2</th>\n",
 557 |        "      <td>15602</td>\n",
 558 |        "      <td>[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...</td>\n",
 559 |        "      <td>Grumpier Old Men</td>\n",
 560 |        "      <td>A family wedding reignites the ancient feud be...</td>\n",
 561 |        "      <td>1995-12-22</td>\n",
 562 |        "    </tr>\n",
 563 |        "    <tr>\n",
 564 |        "      <th>3</th>\n",
 565 |        "      <td>31357</td>\n",
 566 |        "      <td>[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...</td>\n",
 567 |        "      <td>Waiting to Exhale</td>\n",
 568 |        "      <td>Cheated on, mistreated and stepped on, the wom...</td>\n",
 569 |        "      <td>1995-12-22</td>\n",
 570 |        "    </tr>\n",
 571 |        "    <tr>\n",
 572 |        "      <th>4</th>\n",
 573 |        "      <td>11862</td>\n",
 574 |        "      <td>[{'id': 35, 'name': 'Comedy'}]</td>\n",
 575 |        "      <td>Father of the Bride Part II</td>\n",
 576 |        "      <td>Just when George Banks has recovered from his ...</td>\n",
 577 |        "      <td>1995-02-10</td>\n",
 578 |        "    </tr>\n",
 579 |        "  </tbody>\n",
 580 |        "</table>\n",
 581 |        "</div>"
 582 |       ],
 583 |       "text/plain": [
 584 |        "      id                                             genres  \\\n",
 585 |        "0    862  [{'id': 16, 'name': 'Animation'}, {'id': 35, '...   \n",
 586 |        "1   8844  [{'id': 12, 'name': 'Adventure'}, {'id': 14, '...   \n",
 587 |        "2  15602  [{'id': 10749, 'name': 'Romance'}, {'id': 35, ...   \n",
 588 |        "3  31357  [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...   \n",
 589 |        "4  11862                     [{'id': 35, 'name': 'Comedy'}]   \n",
 590 |        "\n",
 591 |        "                         title  \\\n",
 592 |        "0                    Toy Story   \n",
 593 |        "1                      Jumanji   \n",
 594 |        "2             Grumpier Old Men   \n",
 595 |        "3            Waiting to Exhale   \n",
 596 |        "4  Father of the Bride Part II   \n",
 597 |        "\n",
 598 |        "                                            overview release_date  \n",
 599 |        "0  Led by Woody, Andy's toys live happily in his ...   1995-10-30  \n",
 600 |        "1  When siblings Judy and Peter discover an encha...   1995-12-15  \n",
 601 |        "2  A family wedding reignites the ancient feud be...   1995-12-22  \n",
 602 |        "3  Cheated on, mistreated and stepped on, the wom...   1995-12-22  \n",
 603 |        "4  Just when George Banks has recovered from his ...   1995-02-10  "
 604 |       ]
 605 |      },
 606 |      "execution_count": 9,
 607 |      "metadata": {},
 608 |      "output_type": "execute_result"
 609 |     }
 610 |    ],
 611 |    "source": [
 612 |     "movies_metadata['genres'] = movies_metadata['genres'].apply(literal_eval)\n",
 613 |     "movies_metadata.head()"
 614 |    ]
 615 |   },
 616 |   {
 617 |    "cell_type": "markdown",
 618 |    "id": "4abd5ee3",
 619 |    "metadata": {},
 620 |    "source": [
 621 |     "## Select the columns to use"
 622 |    ]
 623 |   },
 624 |   {
 625 |    "cell_type": "code",
 626 |    "execution_count": 10,
 627 |    "id": "collaborative-draft",
 628 |    "metadata": {},
 629 |    "outputs": [],
 630 |    "source": [
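 631 |     "def get_genre(x):\n",
 632 |     "    # x is a list of genre dicts, e.g. [{'id': 16, 'name': 'Animation'}, ...]\n",
 633 |     "    names = [i['name'] for i in x]\n",
 634 |     "    # Keep at most the first three genre names, joined by spaces.\n",
 635 |     "    return \" \".join(names[:3])"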
 636 |    ]
 637 |   },
 638 |   {
 639 |    "cell_type": "code",
 640 |    "execution_count": 11,
 641 |    "id": "informed-banana",
 642 |    "metadata": {},
 643 |    "outputs": [],
 644 |    "source": [
 645 |     "movies_metadata['genres'] = movies_metadata['genres'].apply(get_genre)"
 646 |    ]
 647 |   },
 648 |   {
 649 |    "cell_type": "code",
 650 |    "execution_count": 12,
 651 |    "id": "large-reality",
 652 |    "metadata": {},
 653 |    "outputs": [
 654 |     {
 655 |      "data": {
 656 |       "text/html": [
 657 |        "<div>\n",
 658 |        "<style scoped>\n",
 659 |        "    .dataframe tbody tr th:only-of-type {\n",
 660 |        "        vertical-align: middle;\n",
 661 |        "    }\n",
 662 |        "\n",
 663 |        "    .dataframe tbody tr th {\n",
 664 |        "        vertical-align: top;\n",
 665 |        "    }\n",
 666 |        "\n",
 667 |        "    .dataframe thead th {\n",
 668 |        "        text-align: right;\n",
 669 |        "    }\n",
 670 |        "</style>\n",
 671 |        "<table border=\"1\" class=\"dataframe\">\n",
 672 |        "  <thead>\n",
 673 |        "    <tr style=\"text-align: right;\">\n",
 674 |        "      <th></th>\n",
 675 |        "      <th>id</th>\n",
 676 |        "      <th>genres</th>\n",
 677 |        "      <th>title</th>\n",
 678 |        "      <th>overview</th>\n",
 679 |        "      <th>release_date</th>\n",
 680 |        "    </tr>\n",
 681 |        "  </thead>\n",
 682 |        "  <tbody>\n",
 683 |        "    <tr>\n",
 684 |        "      <th>0</th>\n",
 685 |        "      <td>862</td>\n",
 686 |        "      <td>Animation Comedy Family</td>\n",
 687 |        "      <td>Toy Story</td>\n",
 688 |        "      <td>Led by Woody, Andy's toys live happily in his ...</td>\n",
 689 |        "      <td>1995-10-30</td>\n",
 690 |        "    </tr>\n",
 691 |        "    <tr>\n",
 692 |        "      <th>1</th>\n",
 693 |        "      <td>8844</td>\n",
 694 |        "      <td>Adventure Fantasy Family</td>\n",
 695 |        "      <td>Jumanji</td>\n",
 696 |        "      <td>When siblings Judy and Peter discover an encha...</td>\n",
 697 |        "      <td>1995-12-15</td>\n",
 698 |        "    </tr>\n",
 699 |        "    <tr>\n",
 700 |        "      <th>2</th>\n",
 701 |        "      <td>15602</td>\n",
 702 |        "      <td>Romance Comedy</td>\n",
 703 |        "      <td>Grumpier Old Men</td>\n",
 704 |        "      <td>A family wedding reignites the ancient feud be...</td>\n",
 705 |        "      <td>1995-12-22</td>\n",
 706 |        "    </tr>\n",
 707 |        "    <tr>\n",
 708 |        "      <th>3</th>\n",
 709 |        "      <td>31357</td>\n",
 710 |        "      <td>Comedy Drama Romance</td>\n",
 711 |        "      <td>Waiting to Exhale</td>\n",
 712 |        "      <td>Cheated on, mistreated and stepped on, the wom...</td>\n",
 713 |        "      <td>1995-12-22</td>\n",
 714 |        "    </tr>\n",
 715 |        "    <tr>\n",
 716 |        "      <th>4</th>\n",
 717 |        "      <td>11862</td>\n",
 718 |        "      <td>Comedy</td>\n",
 719 |        "      <td>Father of the Bride Part II</td>\n",
 720 |        "      <td>Just when George Banks has recovered from his ...</td>\n",
 721 |        "      <td>1995-02-10</td>\n",
 722 |        "    </tr>\n",
 723 |        "  </tbody>\n",
 724 |        "</table>\n",
 725 |        "</div>"
 726 |       ],
 727 |       "text/plain": [
 728 |        "      id                    genres                        title  \\\n",
 729 |        "0    862   Animation Comedy Family                    Toy Story   \n",
 730 |        "1   8844  Adventure Fantasy Family                      Jumanji   \n",
 731 |        "2  15602            Romance Comedy             Grumpier Old Men   \n",
 732 |        "3  31357      Comedy Drama Romance            Waiting to Exhale   \n",
 733 |        "4  11862                    Comedy  Father of the Bride Part II   \n",
 734 |        "\n",
 735 |        "                                            overview release_date  \n",
 736 |        "0  Led by Woody, Andy's toys live happily in his ...   1995-10-30  \n",
 737 |        "1  When siblings Judy and Peter discover an encha...   1995-12-15  \n",
 738 |        "2  A family wedding reignites the ancient feud be...   1995-12-22  \n",
 739 |        "3  Cheated on, mistreated and stepped on, the wom...   1995-12-22  \n",
 740 |        "4  Just when George Banks has recovered from his ...   1995-02-10  "
 741 |       ]
 742 |      },
 743 |      "execution_count": 12,
 744 |      "metadata": {},
 745 |      "output_type": "execute_result"
 746 |     }
 747 |    ],
 748 |    "source": [
 749 |     "movies_metadata.head()"
 750 |    ]
 751 |   },
 752 |   {
 753 |    "cell_type": "code",
 754 |    "execution_count": 13,
 755 |    "id": "special-florist",
 756 |    "metadata": {},
 757 |    "outputs": [
 758 |     {
 759 |      "data": {
 760 |       "text/plain": [
 761 |        "index            0\n",
 762 |        "id               0\n",
 763 |        "genres           0\n",
 764 |        "title            0\n",
 765 |        "overview         0\n",
 766 |        "release_date    71\n",
 767 |        "dtype: int64"
 768 |       ]
 769 |      },
 770 |      "execution_count": 13,
 771 |      "metadata": {},
 772 |      "output_type": "execute_result"
 773 |     }
 774 |    ],
 775 |    "source": [
 776 |     "movies_metadata = movies_metadata[movies_metadata['overview'].notnull()]\n",
 777 |     "movies_metadata = movies_metadata[movies_metadata['title'].notnull()]\n",
 778 |     "movies_metadata.isna().sum()"
 779 |    ]
 780 |   },
 781 |   {
 782 |    "cell_type": "code",
 783 |    "execution_count": 14,
 784 |    "id": "related-rubber",
 785 |    "metadata": {},
 786 |    "outputs": [],
 787 |    "source": [
 788 |     "movies_metadata['feature'] = movies_metadata['genres'] + \" / \" + movies_metadata['title'] + \" / \" + movies_metadata['overview']"
 789 |    ]
 790 |   },
 791 |   {
 792 |    "cell_type": "markdown",
 793 |    "id": "beginning-insured",
 794 |    "metadata": {},
 795 |    "source": [
 796 |     "# HuggingFace embedding"
 797 |    ]
 798 |   },
 799 |   {
 800 |    "cell_type": "code",
 801 |    "execution_count": 18,
 802 |    "id": "1d862f87",
 803 |    "metadata": {},
 804 |    "outputs": [
 805 |     {
 806 |      "data": {
 807 |       "text/plain": [
 808 |        "SentenceTransformer(\n",
 809 |        "  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: DistilBertModel \n",
 810 |        "  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})\n",
 811 |        "  (2): Dense({'in_features': 768, 'out_features': 512, 'bias': True, 'activation_function': 'torch.nn.modules.activation.Tanh'})\n",
 812 |        ")"
 813 |       ]
 814 |      },
 815 |      "execution_count": 18,
 816 |      "metadata": {},
 817 |      "output_type": "execute_result"
 818 |     }
 819 |    ],
 820 |    "source": [
 821 |     "if cur_os.startswith('win'):\n",
 822 |     "    model = SentenceTransformer(f'{model_path}/distiluse-base-multilingual-cased-v2')    \n",
 823 |     "else:\n",
 824 |     "    model = SentenceTransformer(\"sentence-transformers/distiluse-base-multilingual-cased-v2\")\n",
 825 |     "\n",
 826 |     "model"
 827 |    ]
 828 |   },
 829 |   {
 830 |    "cell_type": "code",
 831 |    "execution_count": 24,
 832 |    "id": "23090b67",
 833 |    "metadata": {},
 834 |    "outputs": [
 835 |     {
 836 |      "data": {
 837 |       "text/plain": [
 838 |        "(44506, 10)"
 839 |       ]
 840 |      },
 841 |      "execution_count": 24,
 842 |      "metadata": {},
 843 |      "output_type": "execute_result"
 844 |     }
 845 |    ],
 846 |    "source": [
 847 |     "movies_metadata['hf_embeddings'] = movies_metadata['feature'].apply(lambda x : model.encode(x))\n",
 848 |     "movies_metadata.shape"
 849 |    ]
 850 |   },
 851 |   {
 852 |    "cell_type": "code",
 853 |    "execution_count": 48,
 854 |    "id": "infinite-sudan",
 855 |    "metadata": {},
 856 |    "outputs": [
 857 |     {
 858 |      "name": "stdout",
 859 |      "output_type": "stream",
 860 |      "text": [
 861 |       "(44506, 9)\n"
 862 |      ]
 863 |     },
 864 |     {
 865 |      "data": {
 866 |       "text/html": [
 867 |        "<div>\n",
 868 |        "<style scoped>\n",
 869 |        "    .dataframe tbody tr th:only-of-type {\n",
 870 |        "        vertical-align: middle;\n",
 871 |        "    }\n",
 872 |        "\n",
 873 |        "    .dataframe tbody tr th {\n",
 874 |        "        vertical-align: top;\n",
 875 |        "    }\n",
 876 |        "\n",
 877 |        "    .dataframe thead th {\n",
 878 |        "        text-align: right;\n",
 879 |        "    }\n",
 880 |        "</style>\n",
 881 |        "<table border=\"1\" class=\"dataframe\">\n",
 882 |        "  <thead>\n",
 883 |        "    <tr style=\"text-align: right;\">\n",
 884 |        "      <th></th>\n",
 885 |        "      <th>id</th>\n",
 886 |        "      <th>genres</th>\n",
 887 |        "      <th>title</th>\n",
 888 |        "      <th>overview</th>\n",
 889 |        "      <th>release_date</th>\n",
 890 |        "      <th>feature</th>\n",
 891 |        "      <th>text_len</th>\n",
 892 |        "      <th>feature_len</th>\n",
 893 |        "      <th>hf_embeddings</th>\n",
 894 |        "    </tr>\n",
 895 |        "  </thead>\n",
 896 |        "  <tbody>\n",
 897 |        "    <tr>\n",
 898 |        "      <th>0</th>\n",
 899 |        "      <td>862</td>\n",
 900 |        "      <td>Animation Comedy Family</td>\n",
 901 |        "      <td>Toy Story</td>\n",
 902 |        "      <td>Led by Woody, Andy's toys live happily in his ...</td>\n",
 903 |        "      <td>1995-10-30</td>\n",
 904 |        "      <td>Animation Comedy Family / Toy Story / Led by W...</td>\n",
 905 |        "      <td>303</td>\n",
 906 |        "      <td>341</td>\n",
 907 |        "      <td>[-0.047183443, -0.02021129, 0.096098304, -0.05...</td>\n",
 908 |        "    </tr>\n",
 909 |        "    <tr>\n",
 910 |        "      <th>1</th>\n",
 911 |        "      <td>8844</td>\n",
 912 |        "      <td>Adventure Fantasy Family</td>\n",
 913 |        "      <td>Jumanji</td>\n",
 914 |        "      <td>When siblings Judy and Peter discover an encha...</td>\n",
 915 |        "      <td>1995-12-15</td>\n",
 916 |        "      <td>Adventure Fantasy Family / Jumanji / When sibl...</td>\n",
 917 |        "      <td>395</td>\n",
 918 |        "      <td>432</td>\n",
 919 |        "      <td>[0.0013290617, -0.0071765357, 0.048141554, -0....</td>\n",
 920 |        "    </tr>\n",
 921 |        "    <tr>\n",
 922 |        "      <th>2</th>\n",
 923 |        "      <td>15602</td>\n",
 924 |        "      <td>Romance Comedy</td>\n",
 925 |        "      <td>Grumpier Old Men</td>\n",
 926 |        "      <td>A family wedding reignites the ancient feud be...</td>\n",
 927 |        "      <td>1995-12-22</td>\n",
 928 |        "      <td>Romance Comedy / Grumpier Old Men / A family w...</td>\n",
 929 |        "      <td>327</td>\n",
 930 |        "      <td>363</td>\n",
 931 |        "      <td>[-0.06425337, -0.008573138, -0.10116352, -0.00...</td>\n",
 932 |        "    </tr>\n",
 933 |        "    <tr>\n",
 934 |        "      <th>3</th>\n",
 935 |        "      <td>31357</td>\n",
 936 |        "      <td>Comedy Drama Romance</td>\n",
 937 |        "      <td>Waiting to Exhale</td>\n",
 938 |        "      <td>Cheated on, mistreated and stepped on, the wom...</td>\n",
 939 |        "      <td>1995-12-22</td>\n",
 940 |        "      <td>Comedy Drama Romance / Waiting to Exhale / Che...</td>\n",
 941 |        "      <td>270</td>\n",
 942 |        "      <td>313</td>\n",
 943 |        "      <td>[-0.032623984, -0.03347266, 0.02557934, -0.033...</td>\n",
 944 |        "    </tr>\n",
 945 |        "    <tr>\n",
 946 |        "      <th>4</th>\n",
 947 |        "      <td>11862</td>\n",
 948 |        "      <td>Comedy</td>\n",
 949 |        "      <td>Father of the Bride Part II</td>\n",
 950 |        "      <td>Just when George Banks has recovered from his ...</td>\n",
 951 |        "      <td>1995-02-10</td>\n",
 952 |        "      <td>Comedy / Father of the Bride Part II / Just wh...</td>\n",
 953 |        "      <td>318</td>\n",
 954 |        "      <td>357</td>\n",
 955 |        "      <td>[0.03181218, -0.0038158156, 0.02099277, 0.0031...</td>\n",
 956 |        "    </tr>\n",
 957 |        "  </tbody>\n",
 958 |        "</table>\n",
 959 |        "</div>"
 960 |       ],
 961 |       "text/plain": [
 962 |        "      id                    genres                        title  \\\n",
 963 |        "0    862   Animation Comedy Family                    Toy Story   \n",
 964 |        "1   8844  Adventure Fantasy Family                      Jumanji   \n",
 965 |        "2  15602            Romance Comedy             Grumpier Old Men   \n",
 966 |        "3  31357      Comedy Drama Romance            Waiting to Exhale   \n",
 967 |        "4  11862                    Comedy  Father of the Bride Part II   \n",
 968 |        "\n",
 969 |        "                                            overview release_date  \\\n",
 970 |        "0  Led by Woody, Andy's toys live happily in his ...   1995-10-30   \n",
 971 |        "1  When siblings Judy and Peter discover an encha...   1995-12-15   \n",
 972 |        "2  A family wedding reignites the ancient feud be...   1995-12-22   \n",
 973 |        "3  Cheated on, mistreated and stepped on, the wom...   1995-12-22   \n",
 974 |        "4  Just when George Banks has recovered from his ...   1995-02-10   \n",
 975 |        "\n",
 976 |        "                                             feature  text_len  feature_len  \\\n",
 977 |        "0  Animation Comedy Family / Toy Story / Led by W...       303          341   \n",
 978 |        "1  Adventure Fantasy Family / Jumanji / When sibl...       395          432   \n",
 979 |        "2  Romance Comedy / Grumpier Old Men / A family w...       327          363   \n",
 980 |        "3  Comedy Drama Romance / Waiting to Exhale / Che...       270          313   \n",
 981 |        "4  Comedy / Father of the Bride Part II / Just wh...       318          357   \n",
 982 |        "\n",
 983 |        "                                       hf_embeddings  \n",
 984 |        "0  [-0.047183443, -0.02021129, 0.096098304, -0.05...  \n",
 985 |        "1  [0.0013290617, -0.0071765357, 0.048141554, -0....  \n",
 986 |        "2  [-0.06425337, -0.008573138, -0.10116352, -0.00...  \n",
 987 |        "3  [-0.032623984, -0.03347266, 0.02557934, -0.033...  \n",
 988 |        "4  [0.03181218, -0.0038158156, 0.02099277, 0.0031...  "
 989 |       ]
 990 |      },
 991 |      "execution_count": 48,
 992 |      "metadata": {},
 993 |      "output_type": "execute_result"
 994 |     }
 995 |    ],
 996 |    "source": [
 997 |     "print(movies_metadata.shape)\n",
 998 |     "movies_metadata.head()"
 999 |    ]
1000 |   },
1001 |   {
1002 |    "cell_type": "code",
1003 |    "execution_count": 49,
1004 |    "id": "exotic-circuit",
1005 |    "metadata": {},
1006 |    "outputs": [],
1007 |    "source": [
1008 |     "movies_metadata.to_csv('./movie_meta/movies_metadata_em.csv')"
1009 |    ]
1010 |   },
1011 |   {
1012 |    "cell_type": "markdown",
1013 |    "id": "entertaining-surge",
1014 |    "metadata": {},
1015 |    "source": [
1016 |     "# OpenAI Embedding"
1017 |    ]
1018 |   },
1019 |   {
1020 |    "cell_type": "code",
1021 |    "execution_count": 26,
1022 |    "id": "streaming-ultimate",
1023 |    "metadata": {},
1024 |    "outputs": [],
1025 |    "source": [
1026 |     "openai_embedding_model = \"text-embedding-ada-002\""
1027 |    ]
1028 |   },
1029 |   {
1030 |    "cell_type": "code",
1031 |    "execution_count": 27,
1032 |    "id": "progressive-amino",
1033 |    "metadata": {},
1034 |    "outputs": [],
1035 |    "source": [
1036 |     "def get_doc_embedding(text: str) -> List[float]:\n",
1037 |     "    return get_embedding(text, openai_embedding_model)"
1038 |    ]
1039 |   },
1040 |   {
1041 |    "cell_type": "code",
1042 |    "execution_count": 28,
1043 |    "id": "posted-creator",
1044 |    "metadata": {},
1045 |    "outputs": [],
1046 |    "source": [
1047 |     "def get_embedding(text: str, model: str) -> List[float]:\n",
1048 |     "    result = openai.Embedding.create(\n",
1049 |     "      model=model,\n",
1050 |     "      input=text)\n",
1051 |     "    return result[\"data\"][0][\"embedding\"]"
1052 |    ]
1053 |   },
1054 |   {
1055 |    "cell_type": "code",
1056 |    "execution_count": 29,
1057 |    "id": "refined-hotel",
1058 |    "metadata": {},
1059 |    "outputs": [],
1060 |    "source": [
1061 |     "# movies_metadata['openai_embeddings'] = movies_metadata['feature'].apply(lambda x : get_embedding(x, openai_embedding_model))"
1062 |    ]
1063 |   },
1064 |   {
1065 |    "cell_type": "code",
1066 |    "execution_count": null,
1067 |    "id": "uniform-bernard",
1068 |    "metadata": {},
1069 |    "outputs": [],
1070 |    "source": []
1071 |   },
1072 |   {
1073 |    "cell_type": "code",
1074 |    "execution_count": null,
1075 |    "id": "contemporary-tobago",
1076 |    "metadata": {},
1077 |    "outputs": [],
1078 |    "source": []
1079 |   },
1080 |   {
1081 |    "cell_type": "markdown",
1082 |    "id": "talented-landing",
1083 |    "metadata": {},
1084 |    "source": [
1085 |     "# Apps similar to the query"
1086 |    ]
1087 |   },
1088 |   {
1089 |    "cell_type": "code",
1090 |    "execution_count": 30,
1091 |    "id": "prime-independence",
1092 |    "metadata": {},
1093 |    "outputs": [],
1094 |    "source": [
1095 |     "top_k = 5"
1096 |    ]
1097 |   },
1098 |   {
1099 |    "cell_type": "code",
1100 |    "execution_count": 671,
1101 |    "id": "dedicated-basin",
1102 |    "metadata": {},
1103 |    "outputs": [],
1104 |    "source": [
1105 |     "def get_query_sim_top_k(query, model, df, top_k):\n",
1106 |     "    query_encode = model.encode(query)\n",
1107 |     "    cos_scores = util.pytorch_cos_sim(query_encode, df['hf_embeddings'])[0]\n",
1108 |     "    top_results = torch.topk(cos_scores, k=top_k)\n",
1109 |     "    return top_results"
1110 |    ]
1111 |   },
1112 |   {
1113 |    "cell_type": "code",
1114 |    "execution_count": 32,
1115 |    "id": "romantic-category",
1116 |    "metadata": {},
1117 |    "outputs": [
1118 |     {
1119 |      "name": "stderr",
1120 |      "output_type": "stream",
1121 |      "text": [
1122 |       "/Users/leesoojin/opt/anaconda3/envs/openai/lib/python3.8/site-packages/sentence_transformers/util.py:39: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at /Users/runner/work/pytorch/pytorch/pytorch/torch/csrc/utils/tensor_new.cpp:233.)\n",
1123 |       "  b = torch.tensor(b)\n"
1124 |      ]
1125 |     }
1126 |    ],
1127 |    "source": [
1128 |     "query = \"Are there any documentary films?\"\n",
1129 |     "top_result = get_query_sim_top_k(query, model, movies_metadata, top_k)"
1130 |    ]
1131 |   },
1132 |   {
1133 |    "cell_type": "code",
1134 |    "execution_count": 33,
1135 |    "id": "provincial-floor",
1136 |    "metadata": {},
1137 |    "outputs": [
1138 |     {
1139 |      "data": {
1140 |       "text/plain": [
1141 |        "torch.return_types.topk(\n",
1142 |        "values=tensor([0.5390, 0.5117, 0.5093, 0.5067, 0.4992]),\n",
1143 |        "indices=tensor([24020, 10124, 22428, 35263, 22273]))"
1144 |       ]
1145 |      },
1146 |      "execution_count": 33,
1147 |      "metadata": {},
1148 |      "output_type": "execute_result"
1149 |     }
1150 |    ],
1151 |    "source": [
1152 |     "top_result"
1153 |    ]
1154 |   },
1155 |   {
1156 |    "cell_type": "code",
1157 |    "execution_count": 34,
1158 |    "id": "continuing-pasta",
1159 |    "metadata": {},
1160 |    "outputs": [
1161 |     {
1162 |      "data": {
1163 |       "text/html": [
1164 |        "<div>\n",
1165 |        "<style scoped>\n",
1166 |        "    .dataframe tbody tr th:only-of-type {\n",
1167 |        "        vertical-align: middle;\n",
1168 |        "    }\n",
1169 |        "\n",
1170 |        "    .dataframe tbody tr th {\n",
1171 |        "        vertical-align: top;\n",
1172 |        "    }\n",
1173 |        "\n",
1174 |        "    .dataframe thead th {\n",
1175 |        "        text-align: right;\n",
1176 |        "    }\n",
1177 |        "</style>\n",
1178 |        "<table border=\"1\" class=\"dataframe\">\n",
1179 |        "  <thead>\n",
1180 |        "    <tr style=\"text-align: right;\">\n",
1181 |        "      <th></th>\n",
1182 |        "      <th>title</th>\n",
1183 |        "      <th>overview</th>\n",
1184 |        "      <th>genres</th>\n",
1185 |        "    </tr>\n",
1186 |        "  </thead>\n",
1187 |        "  <tbody>\n",
1188 |        "    <tr>\n",
1189 |        "      <th>24020</th>\n",
1190 |        "      <td>The 50 Worst Movies Ever Made</td>\n",
1191 |        "      <td>There are some movies that are so bad they're ...</td>\n",
1192 |        "      <td>Documentary</td>\n",
1193 |        "    </tr>\n",
1194 |        "    <tr>\n",
1195 |        "      <th>10124</th>\n",
1196 |        "      <td>Trekkies 2</td>\n",
1197 |        "      <td>sequel to the 1997 documentary film Trekkies.</td>\n",
1198 |        "      <td>Documentary</td>\n",
1199 |        "    </tr>\n",
1200 |        "    <tr>\n",
1201 |        "      <th>22428</th>\n",
1202 |        "      <td>The Spanish Earth</td>\n",
1203 |        "      <td>A propaganda film made during the Spanish Civi...</td>\n",
1204 |        "      <td>War Documentary</td>\n",
1205 |        "    </tr>\n",
1206 |        "    <tr>\n",
1207 |        "      <th>35263</th>\n",
1208 |        "      <td>Tomorrow</td>\n",
1209 |        "      <td>Documentary film about global warming.</td>\n",
1210 |        "      <td>Documentary</td>\n",
1211 |        "    </tr>\n",
1212 |        "    <tr>\n",
1213 |        "      <th>22273</th>\n",
1214 |        "      <td>I Know That Voice</td>\n",
1215 |        "      <td>A documentary about voice-over actors.</td>\n",
1216 |        "      <td>Documentary</td>\n",
1217 |        "    </tr>\n",
1218 |        "  </tbody>\n",
1219 |        "</table>\n",
1220 |        "</div>"
1221 |       ],
1222 |       "text/plain": [
1223 |        "                               title  \\\n",
1224 |        "24020  The 50 Worst Movies Ever Made   \n",
1225 |        "10124                     Trekkies 2   \n",
1226 |        "22428              The Spanish Earth   \n",
1227 |        "35263                       Tomorrow   \n",
1228 |        "22273              I Know That Voice   \n",
1229 |        "\n",
1230 |        "                                                overview           genres  \n",
1231 |        "24020  There are some movies that are so bad they're ...      Documentary  \n",
1232 |        "10124      sequel to the 1997 documentary film Trekkies.      Documentary  \n",
1233 |        "22428  A propaganda film made during the Spanish Civi...  War Documentary  \n",
1234 |        "35263             Documentary film about global warming.      Documentary  \n",
1235 |        "22273             A documentary about voice-over actors.      Documentary  "
1236 |       ]
1237 |      },
1238 |      "execution_count": 34,
1239 |      "metadata": {},
1240 |      "output_type": "execute_result"
1241 |     }
1242 |    ],
1243 |    "source": [
1244 |     "movies_metadata.iloc[top_result[1].numpy(), :][['title', 'overview', 'genres']]"
1245 |    ]
1246 |   },
1247 |   {
1248 |    "cell_type": "code",
1249 |    "execution_count": null,
1250 |    "id": "portable-racing",
1251 |    "metadata": {},
1252 |    "outputs": [],
1253 |    "source": []
1254 |   },
1255 |   {
1256 |    "cell_type": "code",
1257 |    "execution_count": 35,
1258 |    "id": "joined-monday",
1259 |    "metadata": {},
1260 |    "outputs": [
1261 |     {
1262 |      "data": {
1263 |       "text/html": [
1264 |        "<div>\n",
1265 |        "<style scoped>\n",
1266 |        "    .dataframe tbody tr th:only-of-type {\n",
1267 |        "        vertical-align: middle;\n",
1268 |        "    }\n",
1269 |        "\n",
1270 |        "    .dataframe tbody tr th {\n",
1271 |        "        vertical-align: top;\n",
1272 |        "    }\n",
1273 |        "\n",
1274 |        "    .dataframe thead th {\n",
1275 |        "        text-align: right;\n",
1276 |        "    }\n",
1277 |        "</style>\n",
1278 |        "<table border=\"1\" class=\"dataframe\">\n",
1279 |        "  <thead>\n",
1280 |        "    <tr style=\"text-align: right;\">\n",
1281 |        "      <th></th>\n",
1282 |        "      <th>title</th>\n",
1283 |        "      <th>overview</th>\n",
1284 |        "      <th>genres</th>\n",
1285 |        "    </tr>\n",
1286 |        "  </thead>\n",
1287 |        "  <tbody>\n",
1288 |        "    <tr>\n",
1289 |        "      <th>38573</th>\n",
1290 |        "      <td>Catastrophe</td>\n",
1291 |        "      <td>A film cataloguing some of the world's largest...</td>\n",
1292 |        "      <td>Thriller Documentary</td>\n",
1293 |        "    </tr>\n",
1294 |        "    <tr>\n",
1295 |        "      <th>11441</th>\n",
1296 |        "      <td>When the Levees Broke: A Requiem in Four Acts</td>\n",
1297 |        "      <td>In August 2005, the American city of New Orlea...</td>\n",
1298 |        "      <td>Documentary</td>\n",
1299 |        "    </tr>\n",
1300 |        "    <tr>\n",
1301 |        "      <th>2404</th>\n",
1302 |        "      <td>Earthquake</td>\n",
1303 |        "      <td>Earthquake is a 1974 American disaster film th...</td>\n",
1304 |        "      <td>Action Drama Thriller</td>\n",
1305 |        "    </tr>\n",
1306 |        "    <tr>\n",
1307 |        "      <th>35263</th>\n",
1308 |        "      <td>Tomorrow</td>\n",
1309 |        "      <td>Documentary film about global warming.</td>\n",
1310 |        "      <td>Documentary</td>\n",
1311 |        "    </tr>\n",
1312 |        "    <tr>\n",
1313 |        "      <th>41943</th>\n",
1314 |        "      <td>Disaster!</td>\n",
1315 |        "      <td>A spoof of disaster films, an asteroid is comi...</td>\n",
1316 |        "      <td>Action Animation Comedy</td>\n",
1317 |        "    </tr>\n",
1318 |        "  </tbody>\n",
1319 |        "</table>\n",
1320 |        "</div>"
1321 |       ],
1322 |       "text/plain": [
1323 |        "                                               title  \\\n",
1324 |        "38573                                    Catastrophe   \n",
1325 |        "11441  When the Levees Broke: A Requiem in Four Acts   \n",
1326 |        "2404                                      Earthquake   \n",
1327 |        "35263                                       Tomorrow   \n",
1328 |        "41943                                      Disaster!   \n",
1329 |        "\n",
1330 |        "                                                overview  \\\n",
1331 |        "38573  A film cataloguing some of the world's largest...   \n",
1332 |        "11441  In August 2005, the American city of New Orlea...   \n",
1333 |        "2404   Earthquake is a 1974 American disaster film th...   \n",
1334 |        "35263             Documentary film about global warming.   \n",
1335 |        "41943  A spoof of disaster films, an asteroid is comi...   \n",
1336 |        "\n",
1337 |        "                        genres  \n",
1338 |        "38573     Thriller Documentary  \n",
1339 |        "11441              Documentary  \n",
1340 |        "2404     Action Drama Thriller  \n",
1341 |        "35263              Documentary  \n",
1342 |        "41943  Action Animation Comedy  "
1343 |       ]
1344 |      },
1345 |      "execution_count": 35,
1346 |      "metadata": {},
1347 |      "output_type": "execute_result"
1348 |     }
1349 |    ],
1350 |    "source": [
1351 |     "query = \"Are there any movies about natural disasters?\"\n",
1352 |     "top_result = get_query_sim_top_k(query, model, movies_metadata, top_k)\n",
1353 |     "movies_metadata.iloc[top_result[1].numpy(), :][['title', 'overview', 'genres']]"
1354 |    ]
1355 |   },
1356 |   {
1357 |    "cell_type": "code",
1358 |    "execution_count": 36,
1359 |    "id": "serious-serbia",
1360 |    "metadata": {},
1361 |    "outputs": [
1362 |     {
1363 |      "data": {
1364 |       "text/html": [
1365 |        "<div>\n",
1366 |        "<style scoped>\n",
1367 |        "    .dataframe tbody tr th:only-of-type {\n",
1368 |        "        vertical-align: middle;\n",
1369 |        "    }\n",
1370 |        "\n",
1371 |        "    .dataframe tbody tr th {\n",
1372 |        "        vertical-align: top;\n",
1373 |        "    }\n",
1374 |        "\n",
1375 |        "    .dataframe thead th {\n",
1376 |        "        text-align: right;\n",
1377 |        "    }\n",
1378 |        "</style>\n",
1379 |        "<table border=\"1\" class=\"dataframe\">\n",
1380 |        "  <thead>\n",
1381 |        "    <tr style=\"text-align: right;\">\n",
1382 |        "      <th></th>\n",
1383 |        "      <th>title</th>\n",
1384 |        "      <th>overview</th>\n",
1385 |        "      <th>genres</th>\n",
1386 |        "    </tr>\n",
1387 |        "  </thead>\n",
1388 |        "  <tbody>\n",
1389 |        "    <tr>\n",
1390 |        "      <th>36936</th>\n",
1391 |        "      <td>The Flying Man</td>\n",
1392 |        "      <td>A new superhero is coming, only this time it's...</td>\n",
1393 |        "      <td>Action Mystery Science Fiction</td>\n",
1394 |        "    </tr>\n",
1395 |        "    <tr>\n",
1396 |        "      <th>30101</th>\n",
1397 |        "      <td>Up, Up, and Away</td>\n",
1398 |        "      <td>A boy is the only family member without superp...</td>\n",
1399 |        "      <td>Action Family TV Movie</td>\n",
1400 |        "    </tr>\n",
1401 |        "    <tr>\n",
1402 |        "      <th>24646</th>\n",
1403 |        "      <td>The Four Feathers</td>\n",
1404 |        "      <td>They made him a hero by branding him a coward ...</td>\n",
1405 |        "      <td>TV Movie Adventure Drama</td>\n",
1406 |        "    </tr>\n",
1407 |        "    <tr>\n",
1408 |        "      <th>12672</th>\n",
1409 |        "      <td>Hancock</td>\n",
1410 |        "      <td>Hancock is a down-and-out superhero who's forc...</td>\n",
1411 |        "      <td>Fantasy Action</td>\n",
1412 |        "    </tr>\n",
1413 |        "    <tr>\n",
1414 |        "      <th>4216</th>\n",
1415 |        "      <td>Too Late the Hero</td>\n",
1416 |        "      <td>A WWII film set on a Pacific island. Japanese ...</td>\n",
1417 |        "      <td>Drama Action War</td>\n",
1418 |        "    </tr>\n",
1419 |        "  </tbody>\n",
1420 |        "</table>\n",
1421 |        "</div>"
1422 |       ],
1423 |       "text/plain": [
1424 |        "                   title                                           overview  \\\n",
1425 |        "36936     The Flying Man  A new superhero is coming, only this time it's...   \n",
1426 |        "30101   Up, Up, and Away  A boy is the only family member without superp...   \n",
1427 |        "24646  The Four Feathers  They made him a hero by branding him a coward ...   \n",
1428 |        "12672            Hancock  Hancock is a down-and-out superhero who's forc...   \n",
1429 |        "4216   Too Late the Hero  A WWII film set on a Pacific island. Japanese ...   \n",
1430 |        "\n",
1431 |        "                               genres  \n",
1432 |        "36936  Action Mystery Science Fiction  \n",
1433 |        "30101          Action Family TV Movie  \n",
1434 |        "24646        TV Movie Adventure Drama  \n",
1435 |        "12672                  Fantasy Action  \n",
1436 |        "4216                 Drama Action War  "
1437 |       ]
1438 |      },
1439 |      "execution_count": 36,
1440 |      "metadata": {},
1441 |      "output_type": "execute_result"
1442 |     }
1443 |    ],
1444 |    "source": [
1445 |     "query = \"Are there any movies about heroes?\"\n",
1446 |     "top_result = get_query_sim_top_k(query, model, movies_metadata, top_k)\n",
1447 |     "movies_metadata.iloc[top_result[1].numpy(), :][['title', 'overview', 'genres']]"
1448 |    ]
1449 |   },
1450 |   {
1451 |    "cell_type": "markdown",
1452 |    "id": "burning-venice",
1453 |    "metadata": {},
1454 |    "source": [
1455 |     "https://www.imdb.com/title/tt0211174/?ref_=fn_al_tt_1"
1456 |    ]
1457 |   },
1458 |   {
1459 |    "cell_type": "markdown",
1460 |    "id": "painful-drama",
1461 |    "metadata": {},
1462 |    "source": [
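1463 |     "# Using ChatGPT\n",
1464 |     "\n",
1465 |     "- Use two ChatGPT calls\n",
1466 |     "    - One identifies the intent of the question: does the user want a description or a recommendation?\n",
1467 |     "    - The response wording differs by category\n",
1468 |     "        - For a description request, retrieve the most similar text and explain it\n",
1469 |     "        - For a recommendation request, retrieve the top-k items by cosine similarity and output them"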
1470 |    ]
1471 |   },
1472 |   {
1473 |    "cell_type": "code",
1474 |    "execution_count": null,
1475 |    "id": "controlled-religion",
1476 |    "metadata": {},
1477 |    "outputs": [],
1478 |    "source": []
1479 |   },
1480 |   {
1481 |    "cell_type": "code",
1482 |    "execution_count": 37,
1483 |    "id": "greater-underwear",
1484 |    "metadata": {},
1485 |    "outputs": [],
1486 |    "source": [
1487 |     "def print_msg(msg):\n",
1488 |     "    completion = openai.ChatCompletion.create(\n",
1489 |     "                    model=\"gpt-3.5-turbo\",\n",
1490 |     "                    messages=msg\n",
1491 |     "                    )\n",
1492 |     "    return completion['choices'][0]['message']['content']"
1493 |    ]
1494 |   },
1495 |   {
1496 |    "cell_type": "markdown",
1497 |    "id": "a356862b",
1498 |    "metadata": {},
1499 |    "source": [
1500 |     "## Prompt test"
1501 |    ]
1502 |   },
1503 |   {
1504 |    "cell_type": "code",
1505 |    "execution_count": 38,
1506 |    "id": "synthetic-touch",
1507 |    "metadata": {},
1508 |    "outputs": [
1509 |     {
1510 |      "data": {
1511 |       "text/plain": [
1512 |        "'description'"
1513 |       ]
1514 |      },
1515 |      "execution_count": 38,
1516 |      "metadata": {},
1517 |      "output_type": "execute_result"
1518 |     }
1519 |    ],
1520 |    "source": [
1521 |     "messages = [\n",
1522 |     "    {\"role\": \"system\", \"content\": \"You are a helpful assistant who understands the intent of the user's question.\"},\n",
1523 |     {\"role\": \"user\", \"content\": \"Which category does the sentence below belong to: 'description', 'recommend'? Show only categories. \\n context: tell me about instagram \\n A:\"}
1524 |     "]\n",
1525 |     "\n",
1526 |     "\n",
1527 |     "print_msg(messages)"
1528 |    ]
1529 |   },
1530 |   {
1531 |    "cell_type": "code",
1532 |    "execution_count": 39,
1533 |    "id": "powerful-savings",
1534 |    "metadata": {},
1535 |    "outputs": [
1536 |     {
1537 |      "data": {
1538 |       "text/plain": [
1539 |        "'recommend'"
1540 |       ]
1541 |      },
1542 |      "execution_count": 39,
1543 |      "metadata": {},
1544 |      "output_type": "execute_result"
1545 |     }
1546 |    ],
1547 |    "source": [
1548 |     "messages = [\n",
1549 |     "    {\"role\": \"system\", \"content\": \"You are a helpful assistant who understands the intent of the user's question.\"},\n",
1550 |     "    {\"role\": \"user\", \"content\": \"Which category does the sentence below belong to: 'description', 'recommend'? Show only categories. \\n context: What apps are similar to Instagram? \\n A:\"}\n",
1551 |     "]\n",
1552 |     "\n",
1553 |     "\n",
1554 |     "print_msg(messages)\n",
1555 |     "\n"
1556 |    ]
1557 |   },
1558 |   {
1559 |    "cell_type": "code",
1560 |    "execution_count": 40,
1561 |    "id": "still-outdoors",
1562 |    "metadata": {},
1563 |    "outputs": [
1564 |     {
1565 |      "data": {
1566 |       "text/plain": [
1567 |        "'recommend'"
1568 |       ]
1569 |      },
1570 |      "execution_count": 40,
1571 |      "metadata": {},
1572 |      "output_type": "execute_result"
1573 |     }
1574 |    ],
1575 |    "source": [
1576 |     "messages = [\n",
1577 |     "    {\"role\": \"system\", \"content\": \"You are a helpful assistant who understands the intent of the user's question.\"},\n",
1578 |     "    {\"role\": \"user\", \"content\": \"Which category does the sentence below belong to: 'description', 'recommend'? Show only categories. \\n context: Recommend apps similar to Instagram. \\n A:\"}\n",
1579 |     "]\n",
1580 |     "\n",
1581 |     "\n",
1582 |     "print_msg(messages)"
1583 |    ]
1584 |   },
1585 |   {
1586 |    "cell_type": "code",
1587 |    "execution_count": null,
1588 |    "id": "manual-violin",
1589 |    "metadata": {},
1590 |    "outputs": [],
1591 |    "source": []
1592 |   },
1593 |   {
1594 |    "cell_type": "code",
1595 |    "execution_count": 41,
1596 |    "id": "equal-income",
1597 |    "metadata": {},
1598 |    "outputs": [
1599 |     {
1600 |      "data": {
1601 |       "text/plain": [
1602 |        "'Here are some apps similar to Instagram that you might want to check out:'"
1603 |       ]
1604 |      },
1605 |      "execution_count": 41,
1606 |      "metadata": {},
1607 |      "output_type": "execute_result"
1608 |     }
1609 |    ],
1610 |    "source": [
1611 |     "messages = [\n",
1612 |     "    {\"role\": \"system\", \"content\": \"You are a helpful assistant who recommend contents.\"},\n",
1613 |     "    {\"role\": \"user\", \"content\": \"Simply repeat the provided context and put a sentence in front of the context. \\n context: Recommend apps similar to Instagram.\"}\n",
1614 |     "]\n",
1615 |     "\n",
1616 |     "\n",
1617 |     "print_msg(messages)"
1618 |    ]
1619 |   },
1620 |   {
1621 |    "cell_type": "code",
1622 |    "execution_count": 42,
1623 |    "id": "saving-lincoln",
1624 |    "metadata": {},
1625 |    "outputs": [
1626 |     {
1627 |      "data": {
1628 |       "text/plain": [
1629 |        "'Here are some apps like Instagram that you might find helpful!'"
1630 |       ]
1631 |      },
1632 |      "execution_count": 42,
1633 |      "metadata": {},
1634 |      "output_type": "execute_result"
1635 |     }
1636 |    ],
1637 |    "source": [
1638 |     "messages = [\n",
1639 |     "    {\"role\": \"system\", \"content\": \"You are a helpful assistant who recommend contents.\"},\n",
1640 |     "    {\"role\": \"user\", \"content\": \"Simplify the sentences for recommending services \\n context: Recommend apps similar to Instagram.\"}\n",
1641 |     "]\n",
1642 |     "\n",
1643 |     "\n",
1644 |     "print_msg(messages)"
1645 |    ]
1646 |   },
1647 |   {
1648 |    "cell_type": "code",
1649 |    "execution_count": 43,
1650 |    "id": "interstate-director",
1651 |    "metadata": {},
1652 |    "outputs": [
1653 |     {
1654 |      "data": {
1655 |       "text/plain": [
1656 |        "\"Of course! I'd be happy to explain the item to you.\""
1657 |       ]
1658 |      },
1659 |      "execution_count": 43,
1660 |      "metadata": {},
1661 |      "output_type": "execute_result"
1662 |     }
1663 |    ],
1664 |    "source": [
1665 |     "messages = [\n",
1666 |     "    {\"role\": \"system\", \"content\": \"You are a helpful assistant who kindly answers.\"},\n",
1667 |     "    {\"role\": \"user\", \"content\": \"Please write a simple greeting starting with 'of course' to explain the item to the user.\"}\n",
1668 |     "]\n",
1669 |     "\n",
1670 |     "\n",
1671 |     "print_msg(messages)"
1672 |    ]
1673 |   },
1674 |   {
1675 |    "cell_type": "code",
1676 |    "execution_count": 44,
1677 |    "id": "introductory-phrase",
1678 |    "metadata": {},
1679 |    "outputs": [
1680 |     {
1681 |      "data": {
1682 |       "text/plain": [
1683 |        "\"Of course! I'd be happy to recommend some great items for you.\""
1684 |       ]
1685 |      },
1686 |      "execution_count": 44,
1687 |      "metadata": {},
1688 |      "output_type": "execute_result"
1689 |     }
1690 |    ],
1691 |    "source": [
1692 |     "messages = [\n",
1693 |     "    {\"role\": \"system\", \"content\": \"You are a helpful assistant who recommend contents based on user question.\"},\n",
1694 |     "    {\"role\": \"user\", \"content\": \"Write 1 sentence of a simple greeting that starts with 'Of course!' to recommend items to users.\"}\n",
1695 |     "]\n",
1696 |     "\n",
1697 |     "\n",
1698 |     "print_msg(messages)"
1699 |    ]
1700 |   },
1701 |   {
1702 |    "cell_type": "code",
1703 |    "execution_count": 103,
1704 |    "id": "f4b35964",
1705 |    "metadata": {},
1706 |    "outputs": [
1707 |     {
1708 |      "data": {
1709 |       "text/html": [
1710 |        "<div>\n",
1711 |        "<style scoped>\n",
1712 |        "    .dataframe tbody tr th:only-of-type {\n",
1713 |        "        vertical-align: middle;\n",
1714 |        "    }\n",
1715 |        "\n",
1716 |        "    .dataframe tbody tr th {\n",
1717 |        "        vertical-align: top;\n",
1718 |        "    }\n",
1719 |        "\n",
1720 |        "    .dataframe thead th {\n",
1721 |        "        text-align: right;\n",
1722 |        "    }\n",
1723 |        "</style>\n",
1724 |        "<table border=\"1\" class=\"dataframe\">\n",
1725 |        "  <thead>\n",
1726 |        "    <tr style=\"text-align: right;\">\n",
1727 |        "      <th></th>\n",
1728 |        "      <th>title</th>\n",
1729 |        "      <th>overview</th>\n",
1730 |        "      <th>genres</th>\n",
1731 |        "    </tr>\n",
1732 |        "  </thead>\n",
1733 |        "  <tbody>\n",
1734 |        "    <tr>\n",
1735 |        "      <th>36936</th>\n",
1736 |        "      <td>The Flying Man</td>\n",
1737 |        "      <td>A new superhero is coming, only this time it's...</td>\n",
1738 |        "      <td>Action Mystery Science Fiction</td>\n",
1739 |        "    </tr>\n",
1740 |        "    <tr>\n",
1741 |        "      <th>30101</th>\n",
1742 |        "      <td>Up, Up, and Away</td>\n",
1743 |        "      <td>A boy is the only family member without superp...</td>\n",
1744 |        "      <td>Action Family TV Movie</td>\n",
1745 |        "    </tr>\n",
1746 |        "    <tr>\n",
1747 |        "      <th>24646</th>\n",
1748 |        "      <td>The Four Feathers</td>\n",
1749 |        "      <td>They made him a hero by branding him a coward ...</td>\n",
1750 |        "      <td>TV Movie Adventure Drama</td>\n",
1751 |        "    </tr>\n",
1752 |        "    <tr>\n",
1753 |        "      <th>12672</th>\n",
1754 |        "      <td>Hancock</td>\n",
1755 |        "      <td>Hancock is a down-and-out superhero who's forc...</td>\n",
1756 |        "      <td>Fantasy Action</td>\n",
1757 |        "    </tr>\n",
1758 |        "    <tr>\n",
1759 |        "      <th>4216</th>\n",
1760 |        "      <td>Too Late the Hero</td>\n",
1761 |        "      <td>A WWII film set on a Pacific island. Japanese ...</td>\n",
1762 |        "      <td>Drama Action War</td>\n",
1763 |        "    </tr>\n",
1764 |        "  </tbody>\n",
1765 |        "</table>\n",
1766 |        "</div>"
1767 |       ],
1768 |       "text/plain": [
1769 |        "                   title                                           overview  \\\n",
1770 |        "36936     The Flying Man  A new superhero is coming, only this time it's...   \n",
1771 |        "30101   Up, Up, and Away  A boy is the only family member without superp...   \n",
1772 |        "24646  The Four Feathers  They made him a hero by branding him a coward ...   \n",
1773 |        "12672            Hancock  Hancock is a down-and-out superhero who's forc...   \n",
1774 |        "4216   Too Late the Hero  A WWII film set on a Pacific island. Japanese ...   \n",
1775 |        "\n",
1776 |        "                               genres  \n",
1777 |        "36936  Action Mystery Science Fiction  \n",
1778 |        "30101          Action Family TV Movie  \n",
1779 |        "24646        TV Movie Adventure Drama  \n",
1780 |        "12672                  Fantasy Action  \n",
1781 |        "4216                 Drama Action War  "
1782 |       ]
1783 |      },
1784 |      "execution_count": 103,
1785 |      "metadata": {},
1786 |      "output_type": "execute_result"
1787 |     }
1788 |    ],
1789 |    "source": [
1790 |     "movies_metadata.iloc[top_result[1].numpy(), :][['title', 'overview', 'genres']]"
1791 |    ]
1792 |   },
1793 |   {
1794 |    "cell_type": "markdown",
1795 |    "id": "34ee1584",
1796 |    "metadata": {},
1797 |    "source": [
1798 |     "## Setting up the required prompts\n",
1799 |     "\n",
1800 |     "- Which task is it: recommendation, description, or intent detection?"
1801 |    ]
1802 |   },
1803 |   {
1804 |    "cell_type": "code",
1805 |    "execution_count": 815,
1806 |    "id": "split-scotland",
1807 |    "metadata": {},
1808 |    "outputs": [],
1809 |    "source": [
1810 |     "msg_prompt = {\n",
1811 |     "    'recom' : {\n",
1812 |     "                'system' : \"You are a helpful assistant who recommend movie based on user question.\", \n",
1813 |     "                'user' : \"Write 1 sentence of a simple greeting that starts with 'Of course!' to recommend movie items to users.\", \n",
1814 |     "              },\n",
1815 |     "    'desc' : {\n",
1816 |     "                'system' : \"You are a helpful assistant who kindly answers.\", \n",
1817 |     "                'user' : \"Please write a simple greeting starting with 'of course' to explain the item to the user.\", \n",
1818 |     "              },\n",
1819 |     "    'intent' : {\n",
1820 |     "                'system' : \"You are a helpful assistant who understands the intent of the user's question.\",\n",
1821 |     "                'user' : \"Which category does the sentence below belong to: 'description', 'recommended', 'search'? Show only categories. \\n context:\"\n",
1822 |     "                }\n",
1823 |     "}"
1824 |    ]
1825 |   },
1826 |   {
1827 |    "cell_type": "code",
1828 |    "execution_count": 856,
1829 |    "id": "separate-confusion",
1830 |    "metadata": {},
1831 |    "outputs": [
1832 |     {
1833 |      "data": {
1834 |       "text/plain": [
1835 |        "{'recom': {'system': 'You are a helpful assistant who recommend movie based on user question.',\n",
1836 |        "  'user': \"Write 1 sentence of a simple greeting that starts with 'Of course!' to recommend movie items to users.\"},\n",
1837 |        " 'desc': {'system': 'You are a helpful assistant who kindly answers.',\n",
1838 |        "  'user': \"Please write a simple greeting starting with 'of course' to explain the item to the user.\"},\n",
1839 |        " 'intent': {'system': \"You are a helpful assistant who understands the intent of the user's question.\",\n",
1840 |        "  'user': \"Which category does the sentence below belong to: 'description', 'recommended', 'search'? Show only categories. \\n context:\"}}"
1841 |       ]
1842 |      },
1843 |      "execution_count": 856,
1844 |      "metadata": {},
1845 |      "output_type": "execute_result"
1846 |     }
1847 |    ],
1848 |    "source": [
1849 |     "msg_prompt"
1850 |    ]
1851 |   },
1852 |   {
1853 |    "cell_type": "code",
1854 |    "execution_count": 870,
1855 |    "id": "biological-mustang",
1856 |    "metadata": {},
1857 |    "outputs": [],
1858 |    "source": [
1859 |     "user_msg_history = []"
1860 |    ]
1861 |   },
1862 |   {
1863 |    "cell_type": "code",
1864 |    "execution_count": 871,
1865 |    "id": "d8b72193",
1866 |    "metadata": {},
1867 |    "outputs": [],
1868 |    "source": [
1869 |     "def get_chatgpt_msg(msg):\n",
1870 |     "    completion = openai.ChatCompletion.create(\n",
1871 |     "                    model=\"gpt-3.5-turbo\",\n",
1872 |     "                    messages=msg\n",
1873 |     "                    )\n",
1874 |     "    return completion['choices'][0]['message']['content']"
1875 |    ]
1876 |   },
1877 |   {
1878 |    "cell_type": "code",
1879 |    "execution_count": 872,
1880 |    "id": "partial-device",
1881 |    "metadata": {},
1882 |    "outputs": [],
1883 |    "source": [
1884 |     "def set_prompt(intent, query, msg_prompt_init, model):\n",
1885 |     "    '''Build the chat messages (system + user) for the given intent.'''\n",
1886 |     "    # `model` is unused here; it is kept so the call signature stays the same\n",
1887 |     "    # Search or recommendation\n",
1888 |     "    if ('recom' in intent) or ('search' in intent):\n",
1889 |     "        msg = msg_prompt_init['recom'] # recommendation prompts\n",
1890 |     "    # Description\n",
1891 |     "    elif 'desc' in intent:\n",
1892 |     "        msg = msg_prompt_init['desc'] # description prompts\n",
1893 |     "    # Otherwise, intent classification\n",
1894 |     "    else:\n",
1895 |     "        msg = msg_prompt_init['intent']\n",
1896 |     "        msg['user'] += f' {query} \\n A:'\n",
1897 |     "    # Emit one message dict per role, in insertion order (system, then user)\n",
1898 |     "    messages = [{'role': role, 'content': content} for role, content in msg.items()]\n",
1899 |     "    return messages"
1900 |    ]
1901 |   },
1902 |   {
1903 |    "cell_type": "code",
1904 |    "execution_count": 872,
1905 |    "id": "50db26ff",
1906 |    "metadata": {},
1907 |    "outputs": [],
1908 |    "source": [
1909 |     "def user_interact(query, model, msg_prompt_init):\n",
1910 |     "    # 1. Detect the user's intent\n",
1911 |     "    user_intent = set_prompt('intent', query, msg_prompt_init, None)\n",
1912 |     "    user_intent = get_chatgpt_msg(user_intent).lower()\n",
1913 |     "    print(\"user_intent : \", user_intent)\n",
1914 |     "    \n",
1915 |     "    # 2. Build the prompt that matches the user's query\n",
1916 |     "    intent_data = set_prompt(user_intent, query, msg_prompt_init, model)\n",
1917 |     "    intent_data_msg = get_chatgpt_msg(intent_data).replace(\"\\n\", \"\").strip()\n",
1918 |     "    print(\"intent_data_msg : \", intent_data_msg)\n",
1919 |     "    \n",
1920 |     "    # 3-1. Recommendation or search\n",
1921 |     "    if ('recom' in user_intent) or ('search' in user_intent):\n",
1922 |     "        recom_msg = str()\n",
1923 |     "        # If the previous assistant message holds a description dict, reuse its feature text as the query\n",
1924 |     "        if user_msg_history and user_msg_history[-1]['role'] == 'assistant' and isinstance(user_msg_history[-1]['content'], dict):\n",
1925 |     "            query = user_msg_history[-1]['content']['feature']\n",
1926 |     "        # Fetch similar items\n",
1927 |     "        #top_result = get_query_sim_top_k(query, model, movies_metadata, top_k=1 if 'recom' in user_intent else 3) # use this to vary the number of results by intent\n",
1928 |     "        top_result = get_query_sim_top_k(query, model, movies_metadata, top_k=3)\n",
1929 |     "        #print(\"top_result : \", top_result)\n",
1930 |     "        # For a search, drop the first hit (the item itself)\n",
1931 |     "        top_index = top_result[1].numpy() if 'recom' in user_intent else top_result[1].numpy()[1:]\n",
1932 |     "        #print(\"top_index : \", top_index)\n",
1933 |     "        # Fetch genres, title, and overview for output\n",
1934 |     "        r_set_d = movies_metadata.iloc[top_index, :][['genres', 'title', 'overview']]\n",
1935 |     "        r_set_d = json.loads(r_set_d.to_json(orient=\"records\"))\n",
1936 |     "        for r in r_set_d:\n",
1937 |     "            for _, v in r.items():\n",
1938 |     "                recom_msg += f\"{v} \\n\"\n",
1939 |     "            recom_msg += \"\\n\"\n",
1940 |     "        user_msg_history.append({'role' : 'assistant', 'content' : f\"{intent_data_msg} {str(recom_msg)}\"})\n",
1941 |     "        print(f\"\\n recom data : {intent_data_msg} {str(recom_msg)}\")\n",
1942 |     "    # 3-2. Description\n",
1943 |     "    elif 'desc' in user_intent:\n",
1944 |     "        # The description depends on the previous message, so use its content as the query\n",
1945 |     "        top_result = get_query_sim_top_k(user_msg_history[-1]['content'], model, movies_metadata, top_k=1)\n",
1946 |     "        # Assume the `feature` column holds the detailed description and output its value\n",
1947 |     "        r_set_d = movies_metadata.iloc[top_result[1].numpy(), :][['feature']]\n",
1948 |     "        r_set_d = json.loads(r_set_d.to_json(orient=\"records\"))[0]\n",
1949 |     "        user_msg_history.append({'role' : 'assistant', 'content' : r_set_d})\n",
1950 |     "        print(f\"\\n describe : {intent_data_msg} {r_set_d}\")"
1951 |    ]
1952 |   },
1953 |   {
1954 |    "cell_type": "markdown",
1955 |    "id": "29adeae4",
1956 |    "metadata": {},
1957 |    "source": [
1958 |     "## Instead of structuring the flow like the user_interact function above\n"
1959 |    ]
1960 |   },
1961 |   {
1962 |    "cell_type": "code",
1963 |    "execution_count": 872,
1964 |    "id": "1cb4af5a",
1965 |    "metadata": {},
1966 |    "outputs": [],
1967 |    "source": [
1968 |     "'''\n",
1969 |     "\n",
1970 |     "import openai\n",
1971 |     "\n",
1972 |     "openai.api_key = \"YOUR_API_KEY\" # supply your API key however you choose\n",
1973 |     "\n",
1974 |     "conversation = [{\"role\": \"system\", \"content\": \"DIRECTIVE_FOR_gpt-3.5-turbo\"}]\n",
1975 |     "\n",
1976 |     "user_input = input(\"This is the beginning of your chat with AI. [To exit, send \\\"###\\\".]\\n\\nYou:\")\n",
1977 |     "\n",
1978 |     "while user_input != \"###\":\n",
1979 |     "    conversation.append({\"role\": \"user\", \"content\": user_input})  # fresh dict per turn, so earlier turns are not overwritten\n",
1980 |     "    completion = openai.ChatCompletion.create(model=\"gpt-3.5-turbo\", messages=conversation)\n",
1981 |     "    reply = completion.choices[0].message\n",
1982 |     "    conversation.append(reply)  # record the assistant turn before reading the next input\n",
1983 |     "    user_input = input(f\"Assistant: {reply.content} \\nYou:\")\n",
1984 |     "    print()\n",
1985 |     "'''"
1986 |    ]
1987 |   },
1988 |   {
1989 |    "cell_type": "markdown",
1990 |    "id": "7bd5bff7",
1991 |    "metadata": {},
1992 |    "source": [
1993 |     "## A plain conversation loop like the one above may be cleaner."
1994 |    ]
1995 |   },
1996 |   {
1997 |    "cell_type": "code",
1998 |    "execution_count": 873,
1999 |    "id": "metric-single",
2000 |    "metadata": {},
2001 |    "outputs": [],
2002 |    "source": []
2003 |   },
2004 |   {
2005 |    "cell_type": "markdown",
2006 |    "id": "be150f32",
2007 |    "metadata": {},
2008 |    "source": [
2009 |     "## Running the recommendation process for a query"
2010 |    ]
2011 |   },
2012 |   {
2013 |    "cell_type": "code",
2014 |    "execution_count": 874,
2015 |    "id": "modern-cover",
2016 |    "metadata": {},
2017 |    "outputs": [
2018 |     {
2019 |      "name": "stdout",
2020 |      "output_type": "stream",
2021 |      "text": [
2022 |       "user_intent :  recommended\n",
2023 |       "intent_data_msg :  Of course! Here are some top-rated movie items that you might enjoy.\n",
2024 |       "\n",
2025 |       " recom data : Of course! Here are some top-rated movie items that you might enjoy. \n",
2026 |       "\n",
2027 |       "X-Men \n",
2028 |       "Two mutants, Rogue and Wolverine, come to a private academy for their kind whose resident superhero team, the X-Men, must oppose a terrorist organization with similar powers. \n",
2029 |       "\n",
2030 |       "\n"
2031 |      ]
2032 |     }
2033 |    ],
2034 |    "source": [
2035 |     "query = \"Please recommend a movie similar to a marvel heros movie.\"\n",
2036 |     "user_interact(query, model, copy.deepcopy(msg_prompt))"
2037 |    ]
2038 |   },
2039 |   {
2040 |    "cell_type": "code",
2041 |    "execution_count": 876,
2042 |    "id": "mathematical-gabriel",
2043 |    "metadata": {},
2044 |    "outputs": [
2045 |     {
2046 |      "name": "stdout",
2047 |      "output_type": "stream",
2048 |      "text": [
2049 |       "user_intent :  description\n",
2050 |       "intent_data_msg :  Of course! Let me explain what this item is and how it works.\n",
2051 |       "\n",
2052 |       " describe : Of course! Let me explain what this item is and how it works. {'feature': 'Adventure Action Science Fiction / X-Men / Two mutants, Rogue and Wolverine, come to a private academy for their kind whose resident superhero team, the X-Men, must oppose a terrorist organization with similar powers.'}\n"
2053 |      ]
2054 |     }
2055 |    ],
2056 |    "source": [
2057 |     "query = \"Can you describe on the above?\"\n",
2058 |     "user_interact(query, model, copy.deepcopy(msg_prompt))"
2059 |    ]
2060 |   },
2061 |   {
2062 |    "cell_type": "code",
2063 |    "execution_count": 877,
2064 |    "id": "515afec7",
2065 |    "metadata": {},
2066 |    "outputs": [
2067 |     {
2068 |      "data": {
2069 |       "text/plain": [
2070 |        "[{'role': 'assistant',\n",
2071 |        "  'content': 'Of course! Here are some top-rated movie items that you might enjoy. \\n\\nX-Men \\nTwo mutants, Rogue and Wolverine, come to a private academy for their kind whose resident superhero team, the X-Men, must oppose a terrorist organization with similar powers. \\n\\n'},\n",
2072 |        " {'role': 'assistant',\n",
2073 |        "  'content': {'feature': 'Adventure Action Science Fiction / X-Men / Two mutants, Rogue and Wolverine, come to a private academy for their kind whose resident superhero team, the X-Men, must oppose a terrorist organization with similar powers.'}}]"
2074 |       ]
2075 |      },
2076 |      "execution_count": 877,
2077 |      "metadata": {},
2078 |      "output_type": "execute_result"
2079 |     }
2080 |    ],
2081 |    "source": [
2082 |     "user_msg_history"
2083 |    ]
2084 |   },
2085 |   {
2086 |    "cell_type": "code",
2087 |    "execution_count": 878,
2088 |    "id": "978cee80",
2089 |    "metadata": {},
2090 |    "outputs": [
2091 |     {
2092 |      "name": "stdout",
2093 |      "output_type": "stream",
2094 |      "text": [
2095 |       "user_intent :  'search'\n",
2096 |       "intent_data_msg :  Of course! We have a great selection of movie items that will fit your every need.\n",
2097 |       "\n",
2098 |       " recom data : Of course! We have a great selection of movie items that will fit your every need. \n",
2099 |       "X-Men: Days of Future Past \n",
2100 |       "The ultimate X-Men ensemble fights a war for the survival of the species across two time periods as they join forces with their younger selves in an epic battle that must change the past – to save our future. \n",
2101 |       "\n",
2102 |       "\n"
2103 |      ]
2104 |     }
2105 |    ],
2106 |    "source": [
2107 |     "query = \"Are there other movies that are similar to the ones above?\"\n",
2108 |     "user_interact(query, model, copy.deepcopy(msg_prompt))"
2109 |    ]
2110 |   },
2111 |   {
2112 |    "cell_type": "code",
2113 |    "execution_count": null,
2114 |    "id": "e2d7275c",
2115 |    "metadata": {},
2116 |    "outputs": [],
2117 |    "source": []
2118 |   },
2119 |   {
2120 |    "cell_type": "markdown",
2121 |    "id": "e74f1f6a",
2122 |    "metadata": {},
2123 |    "source": [
2124 |     "# User tags"
2125 |    ]
2126 |   },
2127 |   {
2128 |    "cell_type": "code",
2129 |    "execution_count": 717,
2130 |    "id": "f3bdc6a2",
2131 |    "metadata": {},
2132 |    "outputs": [],
2133 |    "source": [
2134 |     "user_perfer_tag = ['computer','tech','science']"
2135 |    ]
2136 |   },
2137 |   {
2138 |    "cell_type": "code",
2139 |    "execution_count": 718,
2140 |    "id": "76d26f79",
2141 |    "metadata": {},
2142 |    "outputs": [],
2143 |    "source": [
2144 |     "top_result = get_query_sim_top_k(' '.join(user_perfer_tag), model, movies_metadata, top_k=3 )"
2145 |    ]
2146 |   },
2147 |   {
2148 |    "cell_type": "code",
2149 |    "execution_count": 719,
2150 |    "id": "e3f79117",
2151 |    "metadata": {},
2152 |    "outputs": [
2153 |     {
2154 |      "data": {
2155 |       "text/html": [
2156 |        "<div>\n",
2157 |        "<style scoped>\n",
2158 |        "    .dataframe tbody tr th:only-of-type {\n",
2159 |        "        vertical-align: middle;\n",
2160 |        "    }\n",
2161 |        "\n",
2162 |        "    .dataframe tbody tr th {\n",
2163 |        "        vertical-align: top;\n",
2164 |        "    }\n",
2165 |        "\n",
2166 |        "    .dataframe thead th {\n",
2167 |        "        text-align: right;\n",
2168 |        "    }\n",
2169 |        "</style>\n",
2170 |        "<table border=\"1\" class=\"dataframe\">\n",
2171 |        "  <thead>\n",
2172 |        "    <tr style=\"text-align: right;\">\n",
2173 |        "      <th></th>\n",
2174 |        "      <th>title</th>\n",
2175 |        "      <th>genres</th>\n",
2176 |        "      <th>overview</th>\n",
2177 |        "    </tr>\n",
2178 |        "  </thead>\n",
2179 |        "  <tbody>\n",
2180 |        "    <tr>\n",
2181 |        "      <th>23040</th>\n",
2182 |        "      <td>Transcendence</td>\n",
2183 |        "      <td>Thriller Science Fiction Drama</td>\n",
2184 |        "      <td>Two leading computer scientists work toward th...</td>\n",
2185 |        "    </tr>\n",
2186 |        "    <tr>\n",
2187 |        "      <th>30222</th>\n",
2188 |        "      <td>Debug</td>\n",
2189 |        "      <td>Horror Science Fiction</td>\n",
2190 |        "      <td>Six young computer hackers sent to work on a d...</td>\n",
2191 |        "    </tr>\n",
2192 |        "    <tr>\n",
2193 |        "      <th>996</th>\n",
2194 |        "      <td>The Lawnmower Man</td>\n",
2195 |        "      <td>Horror Thriller Science Fiction</td>\n",
2196 |        "      <td>A simple man is turned into a genius through t...</td>\n",
2197 |        "    </tr>\n",
2198 |        "  </tbody>\n",
2199 |        "</table>\n",
2200 |        "</div>"
2201 |       ],
2202 |       "text/plain": [
2203 |        "                   title                           genres  \\\n",
2204 |        "23040      Transcendence   Thriller Science Fiction Drama   \n",
2205 |        "30222              Debug           Horror Science Fiction   \n",
2206 |        "996    The Lawnmower Man  Horror Thriller Science Fiction   \n",
2207 |        "\n",
2208 |        "                                                overview  \n",
2209 |        "23040  Two leading computer scientists work toward th...  \n",
2210 |        "30222  Six young computer hackers sent to work on a d...  \n",
2211 |        "996    A simple man is turned into a genius through t...  "
2212 |       ]
2213 |      },
2214 |      "execution_count": 719,
2215 |      "metadata": {},
2216 |      "output_type": "execute_result"
2217 |     }
2218 |    ],
2219 |    "source": [
2220 |     "movies_metadata.iloc[top_result[1].numpy(), :][['title', 'genres', 'overview']]"
2221 |    ]
2222 |   },
2223 |   {
2224 |    "cell_type": "code",
2225 |    "execution_count": null,
2226 |    "id": "00320379",
2227 |    "metadata": {},
2228 |    "outputs": [],
2229 |    "source": []
2230 |   },
2231 |   {
2232 |    "cell_type": "code",
2233 |    "execution_count": null,
2234 |    "id": "4f9e9d6b",
2235 |    "metadata": {},
2236 |    "outputs": [],
2237 |    "source": []
2238 |   },
2239 |   {
2240 |    "cell_type": "code",
2241 |    "execution_count": 723,
2242 |    "id": "6c0e88fa",
2243 |    "metadata": {},
2244 |    "outputs": [
2245 |     {
2246 |      "data": {
2247 |       "text/html": [
2248 |        "<div>\n",
2249 |        "<style scoped>\n",
2250 |        "    .dataframe tbody tr th:only-of-type {\n",
2251 |        "        vertical-align: middle;\n",
2252 |        "    }\n",
2253 |        "\n",
2254 |        "    .dataframe tbody tr th {\n",
2255 |        "        vertical-align: top;\n",
2256 |        "    }\n",
2257 |        "\n",
2258 |        "    .dataframe thead th {\n",
2259 |        "        text-align: right;\n",
2260 |        "    }\n",
2261 |        "</style>\n",
2262 |        "<table border=\"1\" class=\"dataframe\">\n",
2263 |        "  <thead>\n",
2264 |        "    <tr style=\"text-align: right;\">\n",
2265 |        "      <th></th>\n",
2266 |        "      <th>id</th>\n",
2267 |        "      <th>genres</th>\n",
2268 |        "      <th>title</th>\n",
2269 |        "      <th>overview</th>\n",
2270 |        "      <th>release_date</th>\n",
2271 |        "      <th>feature</th>\n",
2272 |        "      <th>text_len</th>\n",
2273 |        "      <th>feature_len</th>\n",
2274 |        "      <th>hf_embeddings</th>\n",
2275 |        "    </tr>\n",
2276 |        "  </thead>\n",
2277 |        "  <tbody>\n",
2278 |        "    <tr>\n",
2279 |        "      <th>23165</th>\n",
2280 |        "      <td>127585</td>\n",
2281 |        "      <td>Action Adventure Fantasy</td>\n",
2282 |        "      <td>X-Men: Days of Future Past</td>\n",
2283 |        "      <td>The ultimate X-Men ensemble fights a war for t...</td>\n",
2284 |        "      <td>2014-05-15</td>\n",
2285 |        "      <td>Action Adventure Fantasy / X-Men: Days of Futu...</td>\n",
2286 |        "      <td>208</td>\n",
2287 |        "      <td>264</td>\n",
2288 |        "      <td>[-0.047461677, 0.002811962, -0.010485606, -0.0...</td>\n",
2289 |        "    </tr>\n",
2290 |        "  </tbody>\n",
2291 |        "</table>\n",
2292 |        "</div>"
2293 |       ],
2294 |       "text/plain": [
2295 |        "           id                    genres                       title  \\\n",
2296 |        "23165  127585  Action Adventure Fantasy  X-Men: Days of Future Past   \n",
2297 |        "\n",
2298 |        "                                                overview release_date  \\\n",
2299 |        "23165  The ultimate X-Men ensemble fights a war for t...   2014-05-15   \n",
2300 |        "\n",
2301 |        "                                                 feature  text_len  \\\n",
2302 |        "23165  Action Adventure Fantasy / X-Men: Days of Futu...       208   \n",
2303 |        "\n",
2304 |        "       feature_len                                      hf_embeddings  \n",
2305 |        "23165          264  [-0.047461677, 0.002811962, -0.010485606, -0.0...  "
2306 |       ]
2307 |      },
2308 |      "execution_count": 723,
2309 |      "metadata": {},
2310 |      "output_type": "execute_result"
2311 |     }
2312 |    ],
2313 |    "source": [
2314 |     "movies_metadata[movies_metadata['title'] == 'X-Men: Days of Future Past']"
2315 |    ]
2316 |   },
2317 |   {
2318 |    "cell_type": "code",
2319 |    "execution_count": null,
2320 |    "id": "05b01f55",
2321 |    "metadata": {},
2322 |    "outputs": [],
2323 |    "source": []
2324 |   },
2325 |   {
2326 |    "cell_type": "code",
2327 |    "execution_count": null,
2328 |    "id": "422c3155",
2329 |    "metadata": {},
2330 |    "outputs": [],
2331 |    "source": []
2332 |   }
2333 |  ],
2334 |  "metadata": {
2335 |   "kernelspec": {
2336 |    "display_name": "Python 3 (ipykernel)",
2337 |    "language": "python",
2338 |    "name": "python3"
2339 |   },
2340 |   "language_info": {
2341 |    "codemirror_mode": {
2342 |     "name": "ipython",
2343 |     "version": 3
2344 |    },
2345 |    "file_extension": ".py",
2346 |    "mimetype": "text/x-python",
2347 |    "name": "python",
2348 |    "nbconvert_exporter": "python",
2349 |    "pygments_lexer": "ipython3",
2350 |    "version": "3.8.16"
2351 |   }
2352 |  },
2353 |  "nbformat": 4,
2354 |  "nbformat_minor": 5
2355 | }
2356 | 


--------------------------------------------------------------------------------
/010. LLM based Explainability RecSys .ipynb:
--------------------------------------------------------------------------------
   1 | {
   2 |  "cells": [
   3 |   {
   4 |    "cell_type": "code",
   5 |    "execution_count": 78,
   6 |    "id": "7a6caca3-db33-4610-87a3-3331ad342413",
   7 |    "metadata": {},
   8 |    "outputs": [],
   9 |    "source": [
  10 |     "from torch.utils.data import TensorDataset\n",
  11 |     "from torch.utils.data import DataLoader\n",
  12 |     "from torch.utils.data import Dataset\n",
  13 |     "from tqdm import tqdm\n",
  14 |     "from sklearn.preprocessing import LabelEncoder\n",
  15 |     "from sklearn.model_selection import train_test_split\n",
  16 |     "\n",
  17 |     "from langchain.docstore.document import Document\n",
  18 |     "from langchain.chains.summarize import load_summarize_chain\n",
  19 |     "from langchain_community.embeddings import HuggingFaceEmbeddings\n",
  20 |     "from langchain_openai import OpenAIEmbeddings\n",
  21 |     "from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
  22 |     "from langchain_community.vectorstores.faiss import FAISS\n",
  23 |     "from langchain.document_loaders.csv_loader import CSVLoader\n",
  24 |     "from langchain.prompts import PromptTemplate\n",
  25 |     "from langchain_openai import OpenAI, ChatOpenAI\n",
  26 |     "from langchain.chains import LLMChain\n",
  27 |     "from langchain.chains import RetrievalQA\n",
  28 |     "\n",
  29 |     "from collections import defaultdict\n",
  30 |     "\n",
  31 |     "import time\n",
  32 |     "import os\n",
  33 |     "import random\n",
  34 |     "import pandas as pd\n",
  35 |     "import numpy as np\n",
  36 |     "import torch\n",
  37 |     "import torch.nn as nn\n",
  38 |     "import torch.nn.functional as F\n",
  39 |     "import torch.utils.data as Data\n",
  40 |     "import math\n",
  41 |     "import requests\n",
  42 |     "import json\n",
  43 |     "import pickle\n"
  44 |    ]
  45 |   },
  46 |   {
  47 |    "cell_type": "code",
  48 |    "execution_count": 79,
  49 |    "id": "6db84a61-92a7-40e0-ba06-b6f8b0dc6a77",
  50 |    "metadata": {},
  51 |    "outputs": [
  52 |     {
  53 |      "data": {
  54 |       "text/plain": [
  55 |        "True"
  56 |       ]
  57 |      },
  58 |      "execution_count": 79,
  59 |      "metadata": {},
  60 |      "output_type": "execute_result"
  61 |     }
  62 |    ],
  63 |    "source": [
  64 |     "from dotenv import load_dotenv\n",
  65 |     "\n",
  66 |     "load_dotenv()"
  67 |    ]
  68 |   },
  69 |   {
  70 |    "cell_type": "code",
  71 |    "execution_count": 80,
  72 |    "id": "0ddd644c-1973-4c61-a88e-b10d49cc2ba2",
  73 |    "metadata": {},
  74 |    "outputs": [],
  75 |    "source": [
  76 |     "project_path = os.path.abspath(os.getcwd())\n",
  77 |     "data_dir_nm = 'data'\n",
  78 |     "model_dir_nm = 'model'\n",
  79 |     "data_path = f\"{project_path}/{data_dir_nm}\"\n",
  80 |     "model_path = f\"{project_path}/{model_dir_nm}\""
  81 |    ]
  82 |   },
  83 |   {
  84 |    "cell_type": "markdown",
  85 |    "id": "5411d293-2104-4520-bb25-cc1d09d8385b",
  86 |    "metadata": {},
  87 |    "source": [
  88 |     "# Load data\n",
  89 |     "- MovieLens1M movie info\n",
  90 |     "- MovieLens test set\n",
  91 |     "- LabelEncoder\n",
  92 |     "- model"
  93 |    ]
  94 |   },
  95 |   {
  96 |    "cell_type": "code",
  97 |    "execution_count": 81,
  98 |    "id": "c89865d0-d454-42e2-9f3f-67b0b046f7c1",
  99 |    "metadata": {},
 100 |    "outputs": [],
 101 |    "source": [
 102 |     "with open('./data/movielens1m_label_encoders.pkl', 'rb') as f:\n",
 103 |     "    label_encoders = pickle.load(f)"
 104 |    ]
 105 |   },
 106 |   {
 107 |    "cell_type": "code",
 108 |    "execution_count": 85,
 109 |    "id": "5bd214e6-5032-4b54-a3bf-f00fc5990333",
 110 |    "metadata": {},
 111 |    "outputs": [],
 112 |    "source": [
 113 |     "movie_info = pd.read_csv(f\"{data_path}/movies.csv\", dtype=str)\n",
 114 |     "test_df = pd.read_csv(f\"{data_path}/movielens1m_test.csv\", dtype=str)\n",
 115 |     "movielens_rcmm_origin = pd.read_csv(f\"{data_path}/movielens_rcmm.csv\", dtype=str)"
 116 |    ]
 117 |   },
 118 |   {
 119 |    "cell_type": "code",
 120 |    "execution_count": 86,
 121 |    "id": "787d1ab6-4d70-4fc1-a527-4e4509e499ee",
 122 |    "metadata": {},
 123 |    "outputs": [
 124 |     {
 125 |      "data": {
 126 |       "text/html": [
 127 |        "<div>\n",
 128 |        "<style scoped>\n",
 129 |        "    .dataframe tbody tr th:only-of-type {\n",
 130 |        "        vertical-align: middle;\n",
 131 |        "    }\n",
 132 |        "\n",
 133 |        "    .dataframe tbody tr th {\n",
 134 |        "        vertical-align: top;\n",
 135 |        "    }\n",
 136 |        "\n",
 137 |        "    .dataframe thead th {\n",
 138 |        "        text-align: right;\n",
 139 |        "    }\n",
 140 |        "</style>\n",
 141 |        "<table border=\"1\" class=\"dataframe\">\n",
 142 |        "  <thead>\n",
 143 |        "    <tr style=\"text-align: right;\">\n",
 144 |        "      <th></th>\n",
 145 |        "      <th>movie_id</th>\n",
 146 |        "      <th>title</th>\n",
 147 |        "      <th>movie_decade</th>\n",
 148 |        "      <th>genre</th>\n",
 149 |        "    </tr>\n",
 150 |        "  </thead>\n",
 151 |        "  <tbody>\n",
 152 |        "    <tr>\n",
 153 |        "      <th>0</th>\n",
 154 |        "      <td>1</td>\n",
 155 |        "      <td>Toy Story</td>\n",
 156 |        "      <td>1990s</td>\n",
 157 |        "      <td>Animation</td>\n",
 158 |        "    </tr>\n",
 159 |        "    <tr>\n",
 160 |        "      <th>1</th>\n",
 161 |        "      <td>2</td>\n",
 162 |        "      <td>Jumanji</td>\n",
 163 |        "      <td>1990s</td>\n",
 164 |        "      <td>Adventure</td>\n",
 165 |        "    </tr>\n",
 166 |        "    <tr>\n",
 167 |        "      <th>2</th>\n",
 168 |        "      <td>3</td>\n",
 169 |        "      <td>Grumpier Old Men</td>\n",
 170 |        "      <td>1990s</td>\n",
 171 |        "      <td>Comedy</td>\n",
 172 |        "    </tr>\n",
 173 |        "    <tr>\n",
 174 |        "      <th>3</th>\n",
 175 |        "      <td>4</td>\n",
 176 |        "      <td>Waiting to Exhale</td>\n",
 177 |        "      <td>1990s</td>\n",
 178 |        "      <td>Comedy</td>\n",
 179 |        "    </tr>\n",
 180 |        "    <tr>\n",
 181 |        "      <th>4</th>\n",
 182 |        "      <td>5</td>\n",
 183 |        "      <td>Father of the Bride Part II</td>\n",
 184 |        "      <td>1990s</td>\n",
 185 |        "      <td>Comedy</td>\n",
 186 |        "    </tr>\n",
 187 |        "  </tbody>\n",
 188 |        "</table>\n",
 189 |        "</div>"
 190 |       ],
 191 |       "text/plain": [
 192 |        "  movie_id                        title movie_decade      genre\n",
 193 |        "0        1                    Toy Story        1990s  Animation\n",
 194 |        "1        2                      Jumanji        1990s  Adventure\n",
 195 |        "2        3             Grumpier Old Men        1990s     Comedy\n",
 196 |        "3        4            Waiting to Exhale        1990s     Comedy\n",
 197 |        "4        5  Father of the Bride Part II        1990s     Comedy"
 198 |       ]
 199 |      },
 200 |      "execution_count": 86,
 201 |      "metadata": {},
 202 |      "output_type": "execute_result"
 203 |     }
 204 |    ],
 205 |    "source": [
 206 |     "movie_info.head()"
 207 |    ]
 208 |   },
 209 |   {
 210 |    "cell_type": "code",
 211 |    "execution_count": 87,
 212 |    "id": "313ad90a-618f-4184-8426-8b39f6aaa701",
 213 |    "metadata": {},
 214 |    "outputs": [
 215 |     {
 216 |      "data": {
 217 |       "text/html": [
 218 |        "<div>\n",
 219 |        "<style scoped>\n",
 220 |        "    .dataframe tbody tr th:only-of-type {\n",
 221 |        "        vertical-align: middle;\n",
 222 |        "    }\n",
 223 |        "\n",
 224 |        "    .dataframe tbody tr th {\n",
 225 |        "        vertical-align: top;\n",
 226 |        "    }\n",
 227 |        "\n",
 228 |        "    .dataframe thead th {\n",
 229 |        "        text-align: right;\n",
 230 |        "    }\n",
 231 |        "</style>\n",
 232 |        "<table border=\"1\" class=\"dataframe\">\n",
 233 |        "  <thead>\n",
 234 |        "    <tr style=\"text-align: right;\">\n",
 235 |        "      <th></th>\n",
 236 |        "      <th>user_id</th>\n",
 237 |        "      <th>movie_id</th>\n",
 238 |        "      <th>movie_decade</th>\n",
 239 |        "      <th>movie_year</th>\n",
 240 |        "      <th>rating_year</th>\n",
 241 |        "      <th>rating_month</th>\n",
 242 |        "      <th>rating_decade</th>\n",
 243 |        "      <th>genre1</th>\n",
 244 |        "      <th>genre2</th>\n",
 245 |        "      <th>genre3</th>\n",
 246 |        "      <th>gender</th>\n",
 247 |        "      <th>age</th>\n",
 248 |        "      <th>occupation</th>\n",
 249 |        "      <th>zip</th>\n",
 250 |        "      <th>label</th>\n",
 251 |        "    </tr>\n",
 252 |        "  </thead>\n",
 253 |        "  <tbody>\n",
 254 |        "    <tr>\n",
 255 |        "      <th>0</th>\n",
 256 |        "      <td>2741</td>\n",
 257 |        "      <td>957</td>\n",
 258 |        "      <td>7</td>\n",
 259 |        "      <td>65</td>\n",
 260 |        "      <td>0</td>\n",
 261 |        "      <td>7</td>\n",
 262 |        "      <td>0</td>\n",
 263 |        "      <td>4</td>\n",
 264 |        "      <td>6</td>\n",
 265 |        "      <td>15</td>\n",
 266 |        "      <td>1</td>\n",
 267 |        "      <td>2</td>\n",
 268 |        "      <td>6</td>\n",
 269 |        "      <td>3078</td>\n",
 270 |        "      <td>1.0</td>\n",
 271 |        "    </tr>\n",
 272 |        "    <tr>\n",
 273 |        "      <th>1</th>\n",
 274 |        "      <td>4931</td>\n",
 275 |        "      <td>609</td>\n",
 276 |        "      <td>8</td>\n",
 277 |        "      <td>70</td>\n",
 278 |        "      <td>0</td>\n",
 279 |        "      <td>5</td>\n",
 280 |        "      <td>0</td>\n",
 281 |        "      <td>0</td>\n",
 282 |        "      <td>14</td>\n",
 283 |        "      <td>15</td>\n",
 284 |        "      <td>1</td>\n",
 285 |        "      <td>2</td>\n",
 286 |        "      <td>3</td>\n",
 287 |        "      <td>1918</td>\n",
 288 |        "      <td>1.0</td>\n",
 289 |        "    </tr>\n",
 290 |        "    <tr>\n",
 291 |        "      <th>2</th>\n",
 292 |        "      <td>5786</td>\n",
 293 |        "      <td>3143</td>\n",
 294 |        "      <td>8</td>\n",
 295 |        "      <td>73</td>\n",
 296 |        "      <td>0</td>\n",
 297 |        "      <td>11</td>\n",
 298 |        "      <td>0</td>\n",
 299 |        "      <td>4</td>\n",
 300 |        "      <td>17</td>\n",
 301 |        "      <td>15</td>\n",
 302 |        "      <td>1</td>\n",
 303 |        "      <td>1</td>\n",
 304 |        "      <td>15</td>\n",
 305 |        "      <td>3397</td>\n",
 306 |        "      <td>0.0</td>\n",
 307 |        "    </tr>\n",
 308 |        "    <tr>\n",
 309 |        "      <th>3</th>\n",
 310 |        "      <td>5917</td>\n",
 311 |        "      <td>1741</td>\n",
 312 |        "      <td>8</td>\n",
 313 |        "      <td>78</td>\n",
 314 |        "      <td>0</td>\n",
 315 |        "      <td>10</td>\n",
 316 |        "      <td>0</td>\n",
 317 |        "      <td>4</td>\n",
 318 |        "      <td>17</td>\n",
 319 |        "      <td>15</td>\n",
 320 |        "      <td>1</td>\n",
 321 |        "      <td>4</td>\n",
 322 |        "      <td>13</td>\n",
 323 |        "      <td>417</td>\n",
 324 |        "      <td>0.0</td>\n",
 325 |        "    </tr>\n",
 326 |        "    <tr>\n",
 327 |        "      <th>4</th>\n",
 328 |        "      <td>1339</td>\n",
 329 |        "      <td>1009</td>\n",
 330 |        "      <td>6</td>\n",
 331 |        "      <td>52</td>\n",
 332 |        "      <td>0</td>\n",
 333 |        "      <td>10</td>\n",
 334 |        "      <td>0</td>\n",
 335 |        "      <td>0</td>\n",
 336 |        "      <td>0</td>\n",
 337 |        "      <td>15</td>\n",
 338 |        "      <td>1</td>\n",
 339 |        "      <td>4</td>\n",
 340 |        "      <td>4</td>\n",
 341 |        "      <td>1800</td>\n",
 342 |        "      <td>1.0</td>\n",
 343 |        "    </tr>\n",
 344 |        "  </tbody>\n",
 345 |        "</table>\n",
 346 |        "</div>"
 347 |       ],
 348 |       "text/plain": [
 349 |        "  user_id movie_id movie_decade movie_year rating_year rating_month  \\\n",
 350 |        "0    2741      957            7         65           0            7   \n",
 351 |        "1    4931      609            8         70           0            5   \n",
 352 |        "2    5786     3143            8         73           0           11   \n",
 353 |        "3    5917     1741            8         78           0           10   \n",
 354 |        "4    1339     1009            6         52           0           10   \n",
 355 |        "\n",
 356 |        "  rating_decade genre1 genre2 genre3 gender age occupation   zip label  \n",
 357 |        "0             0      4      6     15      1   2          6  3078   1.0  \n",
 358 |        "1             0      0     14     15      1   2          3  1918   1.0  \n",
 359 |        "2             0      4     17     15      1   1         15  3397   0.0  \n",
 360 |        "3             0      4     17     15      1   4         13   417   0.0  \n",
 361 |        "4             0      0      0     15      1   4          4  1800   1.0  "
 362 |       ]
 363 |      },
 364 |      "execution_count": 87,
 365 |      "metadata": {},
 366 |      "output_type": "execute_result"
 367 |     }
 368 |    ],
 369 |    "source": [
 370 |     "test_df.head()"
 371 |    ]
 372 |   },
 373 |   {
 374 |    "cell_type": "markdown",
 375 |    "id": "676a5c72-9880-4c6a-9265-9c2be4e41f2f",
 376 |    "metadata": {},
 377 |    "source": [
 378 |     "# Set up the dataset"
 379 |    ]
 380 |   },
 381 |   {
 382 |    "cell_type": "code",
 383 |    "execution_count": 16,
 384 |    "id": "0d842fff-d3fb-485f-a845-5707ae1be82e",
 385 |    "metadata": {},
 386 |    "outputs": [],
 387 |    "source": [
 388 |     "class MVLensDataset(Dataset):\n",
 389 |     "    def __init__(self, data, u_i_cols, label_col):\n",
 390 |     "        self.n = data.shape[0]\n",
 391 |     "        self.y = data[label_col].astype(np.float32).values.reshape(-1, 1)\n",
 392 |     "\n",
 393 |     "        self.u_i_cols = u_i_cols\n",
 394 |     "        \n",
 395 |     "        self.data_v = data[self.u_i_cols].astype(np.int64).values\n",
 396 |     "\n",
 397 |     "        self.field_dims = np.max(self.data_v, axis=0) + 1\n",
 398 |     "\n",
 399 |     "\n",
 400 |     "    def __len__(self):\n",
 401 |     "        return self.n\n",
 402 |     "\n",
 403 |     "    def __getitem__(self, idx):\n",
 404 |     "        return [self.data_v[idx], self.y[idx]]\n",
 405 |     "        \n",
 406 |     "u_i_feature = ['user_id', 'movie_id']\n",
 407 |     "label = 'label'\n",
 408 |     "batch_size = 512\n",
 409 |     "test_dataset = MVLensDataset(data=test_df, u_i_cols=u_i_feature, label_col=label)\n",
 410 |     "test_dataloader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)"
 411 |    ]
 412 |   },
 413 |   {
 414 |    "cell_type": "markdown",
 415 |    "id": "e132b9d3-bd38-4901-9048-7ae84cb55d63",
 416 |    "metadata": {},
 417 |    "source": [
 418 |     "# Set up the model (NCF)"
 419 |    ]
 420 |   },
 421 |   {
 422 |    "cell_type": "code",
 423 |    "execution_count": 10,
 424 |    "id": "fb9fb05d-3342-4fcd-a8f0-c548ffb157a3",
 425 |    "metadata": {},
 426 |    "outputs": [],
 427 |    "source": [
 428 |     "class NeuMF(torch.nn.Module):\n",
 429 |     "    def __init__(self, config):\n",
 430 |     "        super(NeuMF, self).__init__()\n",
 431 |     "        # config \n",
 432 |     "        self.config = config\n",
 433 |     "        self.num_users = config['num_users']\n",
 434 |     "        self.num_items = config['num_items']\n",
 435 |     "        self.latent_dim_mf = config['latent_dim_mf']\n",
 436 |     "        self.latent_dim_mlp = config['latent_dim_mlp']\n",
 437 |     "        # Embedding setting\n",
 438 |     "        self.embedding_user_mlp = torch.nn.Embedding(num_embeddings=self.num_users, embedding_dim=self.latent_dim_mlp)\n",
 439 |     "        self.embedding_item_mlp = torch.nn.Embedding(num_embeddings=self.num_items, embedding_dim=self.latent_dim_mlp)\n",
 440 |     "        self.embedding_user_mf = torch.nn.Embedding(num_embeddings=self.num_users, embedding_dim=self.latent_dim_mf)\n",
 441 |     "        self.embedding_item_mf = torch.nn.Embedding(num_embeddings=self.num_items, embedding_dim=self.latent_dim_mf)\n",
 442 |     "        # MLP layer\n",
 443 |     "        self.fc_layers = torch.nn.ModuleList()\n",
 444 |     "        for idx, (in_size, out_size) in enumerate(zip(config['layers'][:-1], config['layers'][1:])):\n",
 445 |     "            self.fc_layers.append(torch.nn.Linear(in_size, out_size))\n",
 446 |     "        # output layer\n",
 447 |     "        self.affine_output = torch.nn.Linear(in_features=config['layers'][-1] + config['latent_dim_mf'], out_features=1)\n",
 448 |     "        self.logistic = torch.nn.Sigmoid()\n",
 449 |     "\n",
 450 |     "    def forward(self, user_indices, item_indices):\n",
 451 |     "        user_embedding_mlp = self.embedding_user_mlp(user_indices)\n",
 452 |     "        item_embedding_mlp = self.embedding_item_mlp(item_indices)\n",
 453 |     "        user_embedding_mf = self.embedding_user_mf(user_indices)\n",
 454 |     "        item_embedding_mf = self.embedding_item_mf(item_indices)\n",
 455 |     "        # MLP, MF\n",
 456 |     "        mlp_vector = torch.cat([user_embedding_mlp, item_embedding_mlp], dim=-1)\n",
 457 |     "        mf_vector = torch.mul(user_embedding_mf, item_embedding_mf)\n",
 458 |     "\n",
 459 |     "        # MLP feed\n",
 460 |     "        for fc_layer in self.fc_layers:\n",
 461 |     "            mlp_vector = fc_layer(mlp_vector)\n",
 462 |     "            mlp_vector = torch.relu(mlp_vector)\n",
 463 |     "        # concat MLP & MF\n",
 464 |     "        vector = torch.cat([mlp_vector, mf_vector], dim=-1)\n",
 465 |     "        # prediction\n",
 466 |     "        logits = self.affine_output(vector)\n",
 467 |     "        rating = self.logistic(logits)\n",
 468 |     "        return rating"
 469 |    ]
 470 |   },
 471 |   {
 472 |    "cell_type": "code",
 473 |    "execution_count": 14,
 474 |    "id": "8ac6ae2f-5d1f-4541-896b-cb2d1f806551",
 475 |    "metadata": {},
 476 |    "outputs": [
 477 |     {
 478 |      "data": {
 479 |       "text/plain": [
 480 |        "<All keys matched successfully>"
 481 |       ]
 482 |      },
 483 |      "execution_count": 14,
 484 |      "metadata": {},
 485 |      "output_type": "execute_result"
 486 |     }
 487 |    ],
 488 |    "source": [
 489 |     "config = {\n",
 490 |     "    'num_users': 6040,\n",
 491 |     "    'num_items': 3706,\n",
 492 |     "    'latent_dim_mf': 8,\n",
 493 |     "    'latent_dim_mlp': 16,\n",
 494 |     "    'layers': [32, 16, 8]\n",
 495 |     "}\n",
 496 |     "model = NeuMF(config)\n",
 497 |     "model.load_state_dict(torch.load('./model/ncf_mlm', map_location='cpu'))"
 498 |    ]
 499 |   },
 500 |   {
 501 |    "cell_type": "code",
 502 |    "execution_count": 19,
 503 |    "id": "45fd4228-a3f4-4693-91c1-6ab262bbbaf6",
 504 |    "metadata": {},
 505 |    "outputs": [],
 506 |    "source": [
 507 |     "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")"
 508 |    ]
 509 |   },
 510 |   {
 511 |    "cell_type": "markdown",
 512 |    "id": "6cc369fc-8d80-41c6-8e77-08855970bb6f",
 513 |    "metadata": {},
 514 |    "source": [
 515 |     "# Predict test data"
 516 |    ]
 517 |   },
 518 |   {
 519 |    "cell_type": "code",
 520 |    "execution_count": 32,
 521 |    "id": "80db479a-da2e-4d5c-9e84-7ebaf3ffdcfb",
 522 |    "metadata": {},
 523 |    "outputs": [],
 524 |    "source": [
 525 |     "user_pred_info = {}  # user index -> ranked list of recommended item indices\n",
 526 |     "top = 10  # number of items to keep per user\n",
 527 |     "\n",
 528 |     "def test_model(model, test_loader):\n",
 529 |     "    # move the model to the same device as the inputs and switch to eval mode\n",
 530 |     "    model.to(device).eval()\n",
 531 |     "    user_pred_info = defaultdict(list)\n",
 532 |     "    with torch.no_grad():\n",
 533 |     "        with tqdm(test_loader, unit='batch') as tepoch:\n",
 534 |     "            for samples in tepoch:\n",
 535 |     "                user_items, y = samples[0], samples[1]\n",
 536 |     "                user_items, y = user_items.to(device), y.to(device)\n",
 537 |     "                # column 0 = encoded user index, column 1 = encoded item index\n",
 538 |     "                y_pred = model(user_items[:, 0], user_items[:, 1])\n",
 539 |     "                for user_item, p in zip(user_items, y_pred):\n",
 540 |     "                    # store (item index, predicted score) under each user index\n",
 541 |     "                    user_pred_info[int(user_item[0])].append((int(user_item[1]), float(p)))\n",
 542 |     "    return user_pred_info"
 543 |    ]
 544 |   },
 545 |   {
 546 |    "cell_type": "code",
 547 |    "execution_count": 33,
 548 |    "id": "c0e348d1-ac22-49b6-bd4d-bad608210087",
 549 |    "metadata": {},
 550 |    "outputs": [
 551 |     {
 552 |      "name": "stderr",
 553 |      "output_type": "stream",
 554 |      "text": [
 555 |       "100%|██████████| 391/391 [00:04<00:00, 93.74batch/s]\n"
 556 |      ]
 557 |     }
 558 |    ],
 559 |    "source": [
 560 |     "ncf_user_pred_info = test_model(model, test_dataloader)"
 561 |    ]
 562 |   },
 563 |   {
 564 |    "cell_type": "markdown",
 565 |    "id": "326ba940-8dba-44c0-bce8-9e447afa5eea",
 566 |    "metadata": {},
 567 |    "source": [
 568 |     "# Get Ranked list"
 569 |    ]
 570 |   },
 571 |   {
 572 |    "cell_type": "code",
 573 |    "execution_count": 34,
 574 |    "id": "418e6877-cf41-437d-8df2-49b0eec54707",
 575 |    "metadata": {},
 576 |    "outputs": [
 577 |     {
 578 |      "name": "stderr",
 579 |      "output_type": "stream",
 580 |      "text": [
 581 |       "100%|██████████| 6035/6035 [00:00<00:00, 98230.88it/s]\n"
 582 |      ]
 583 |     }
 584 |    ],
 585 |    "source": [
 586 |     "for user, data_info in tqdm(ncf_user_pred_info.items(), total=len(ncf_user_pred_info), position=0, leave=True):\n",
 587 |     "    # sort by predicted probability (highest first) and keep the top-10 entries\n",
 588 |     "    ranklist = sorted(data_info, key=lambda s : s[1], reverse=True)[:top]\n",
 589 |     "    # keep only the item indices, dropping duplicates while preserving rank order\n",
 590 |     "    ranklist = list(dict.fromkeys([r[0] for r in ranklist]))\n",
 591 |     "    user_pred_info[str(user)] = ranklist"
 592 |    ]
 593 |   },
 594 |   {
 595 |    "cell_type": "code",
 596 |    "execution_count": 88,
 597 |    "id": "dafb43dc-8084-44af-8d0e-89aef28e1b07",
 598 |    "metadata": {},
 599 |    "outputs": [
 600 |     {
 601 |      "name": "stdout",
 602 |      "output_type": "stream",
 603 |      "text": [
 604 |       "User 2741 recommendation list: [3613, 3306, 191, 2855, 957, 996, 2577]\n"
 605 |      ]
 606 |     }
 607 |    ],
 608 |    "source": [
 609 |     "for user, recom_list in user_pred_info.items():\n",
 610 |     "    print(f\"User {user} recommendation list: {recom_list}\")\n",
 611 |     "    break"
 612 |    ]
 613 |   },
 614 |   {
 615 |    "cell_type": "code",
 616 |    "execution_count": null,
 617 |    "id": "b9ed6550-8f10-4d7c-968b-9ee884ab6c67",
 618 |    "metadata": {},
 619 |    "outputs": [],
 620 |    "source": []
 621 |   },
 622 |   {
 623 |    "cell_type": "code",
 624 |    "execution_count": null,
 625 |    "id": "1fb0999e-eded-415b-9f34-bbac7b8f5c05",
 626 |    "metadata": {},
 627 |    "outputs": [],
 628 |    "source": []
 629 |   },
 630 |   {
 631 |    "cell_type": "markdown",
 632 |    "id": "d49bb508-9498-435d-8a1f-cd708be81c1c",
 633 |    "metadata": {},
 634 |    "source": [
 635 |     "# Sample a random user"
 636 |    ]
 637 |   },
 638 |   {
 639 |    "cell_type": "code",
 640 |    "execution_count": 45,
 641 |    "id": "4b0f0016-50ac-4652-996d-1bc6eb0f9770",
 642 |    "metadata": {},
 643 |    "outputs": [
 644 |     {
 645 |      "data": {
 646 |       "text/plain": [
 647 |        "['1662']"
 648 |       ]
 649 |      },
 650 |      "execution_count": 45,
 651 |      "metadata": {},
 652 |      "output_type": "execute_result"
 653 |     }
 654 |    ],
 655 |    "source": [
 656 |     "random_user_origin = random.sample(list(user_pred_info.keys()), 1)\n",
 657 |     "sample_user_pred_info = user_pred_info[random_user_origin[0]]\n",
 658 |     "random_user = list(map(int, random_user_origin))\n",
 659 |     "random_user = label_encoders['user_id'].inverse_transform(random_user)\n",
 660 |     "random_user = list(map(str, random_user))\n",
 661 |     "random_user"
 662 |    ]
 663 |   },
 664 |   {
 665 |    "cell_type": "code",
 666 |    "execution_count": 46,
 667 |    "id": "11562c70-94cf-4124-b0ae-4f7c9a31aa1e",
 668 |    "metadata": {},
 669 |    "outputs": [
 670 |     {
 671 |      "data": {
 672 |       "text/plain": [
 673 |        "array(['296', '1580', '3943', '3863'], dtype=object)"
 674 |       ]
 675 |      },
 676 |      "execution_count": 46,
 677 |      "metadata": {},
 678 |      "output_type": "execute_result"
 679 |     }
 680 |    ],
 681 |    "source": [
 682 |     "sample_user_pred_info_trans = list(map(int, sample_user_pred_info))\n",
 683 |     "sample_user_pred_info_trans = label_encoders['movie_id'].inverse_transform(sample_user_pred_info_trans)\n",
 684 |     "sample_user_pred_info_trans"
 685 |    ]
 686 |   },
 687 |   {
 688 |    "cell_type": "code",
 689 |    "execution_count": 47,
 690 |    "id": "5a64030a-7d71-4130-bcfa-d3a9fedffecb",
 691 |    "metadata": {},
 692 |    "outputs": [
 693 |     {
 694 |      "data": {
 695 |       "text/html": [
 696 |        "<div>\n",
 697 |        "<style scoped>\n",
 698 |        "    .dataframe tbody tr th:only-of-type {\n",
 699 |        "        vertical-align: middle;\n",
 700 |        "    }\n",
 701 |        "\n",
 702 |        "    .dataframe tbody tr th {\n",
 703 |        "        vertical-align: top;\n",
 704 |        "    }\n",
 705 |        "\n",
 706 |        "    .dataframe thead th {\n",
 707 |        "        text-align: right;\n",
 708 |        "    }\n",
 709 |        "</style>\n",
 710 |        "<table border=\"1\" class=\"dataframe\">\n",
 711 |        "  <thead>\n",
 712 |        "    <tr style=\"text-align: right;\">\n",
 713 |        "      <th></th>\n",
 714 |        "      <th>movie_id</th>\n",
 715 |        "      <th>title</th>\n",
 716 |        "      <th>movie_decade</th>\n",
 717 |        "      <th>genre</th>\n",
 718 |        "    </tr>\n",
 719 |        "  </thead>\n",
 720 |        "  <tbody>\n",
 721 |        "    <tr>\n",
 722 |        "      <th>293</th>\n",
 723 |        "      <td>296</td>\n",
 724 |        "      <td>Pulp Fiction</td>\n",
 725 |        "      <td>1990s</td>\n",
 726 |        "      <td>Crime</td>\n",
 727 |        "    </tr>\n",
 728 |        "    <tr>\n",
 729 |        "      <th>1539</th>\n",
 730 |        "      <td>1580</td>\n",
 731 |        "      <td>Men in Black</td>\n",
 732 |        "      <td>1990s</td>\n",
 733 |        "      <td>Action</td>\n",
 734 |        "    </tr>\n",
 735 |        "    <tr>\n",
 736 |        "      <th>3793</th>\n",
 737 |        "      <td>3863</td>\n",
 738 |        "      <td>Cell, The</td>\n",
 739 |        "      <td>2000s</td>\n",
 740 |        "      <td>Sci-Fi</td>\n",
 741 |        "    </tr>\n",
 742 |        "    <tr>\n",
 743 |        "      <th>3873</th>\n",
 744 |        "      <td>3943</td>\n",
 745 |        "      <td>Bamboozled</td>\n",
 746 |        "      <td>2000s</td>\n",
 747 |        "      <td>Comedy</td>\n",
 748 |        "    </tr>\n",
 749 |        "  </tbody>\n",
 750 |        "</table>\n",
 751 |        "</div>"
 752 |       ],
 753 |       "text/plain": [
 754 |        "     movie_id         title movie_decade   genre\n",
 755 |        "293       296  Pulp Fiction        1990s   Crime\n",
 756 |        "1539     1580  Men in Black        1990s  Action\n",
 757 |        "3793     3863     Cell, The        2000s  Sci-Fi\n",
 758 |        "3873     3943    Bamboozled        2000s  Comedy"
 759 |       ]
 760 |      },
 761 |      "execution_count": 47,
 762 |      "metadata": {},
 763 |      "output_type": "execute_result"
 764 |     }
 765 |    ],
 766 |    "source": [
 767 |     "movie_info[movie_info['movie_id'].isin(sample_user_pred_info_trans)]"
 768 |    ]
 769 |   },
 770 |   {
 771 |    "cell_type": "code",
 772 |    "execution_count": 48,
 773 |    "id": "32f28b93-ef7a-4c2c-be23-9301cafae59e",
 774 |    "metadata": {},
 775 |    "outputs": [
 776 |     {
 777 |      "name": "stdout",
 778 |      "output_type": "stream",
 779 |      "text": [
 780 |       "(25, 15)\n"
 781 |      ]
 782 |     },
 783 |     {
 784 |      "data": {
 785 |       "text/html": [
 786 |        "<div>\n",
 787 |        "<style scoped>\n",
 788 |        "    .dataframe tbody tr th:only-of-type {\n",
 789 |        "        vertical-align: middle;\n",
 790 |        "    }\n",
 791 |        "\n",
 792 |        "    .dataframe tbody tr th {\n",
 793 |        "        vertical-align: top;\n",
 794 |        "    }\n",
 795 |        "\n",
 796 |        "    .dataframe thead th {\n",
 797 |        "        text-align: right;\n",
 798 |        "    }\n",
 799 |        "</style>\n",
 800 |        "<table border=\"1\" class=\"dataframe\">\n",
 801 |        "  <thead>\n",
 802 |        "    <tr style=\"text-align: right;\">\n",
 803 |        "      <th></th>\n",
 804 |        "      <th>user_id</th>\n",
 805 |        "      <th>movie_id</th>\n",
 806 |        "      <th>movie_decade</th>\n",
 807 |        "      <th>movie_year</th>\n",
 808 |        "      <th>rating_year</th>\n",
 809 |        "      <th>rating_month</th>\n",
 810 |        "      <th>rating_decade</th>\n",
 811 |        "      <th>genre1</th>\n",
 812 |        "      <th>genre2</th>\n",
 813 |        "      <th>genre3</th>\n",
 814 |        "      <th>gender</th>\n",
 815 |        "      <th>age</th>\n",
 816 |        "      <th>occupation</th>\n",
 817 |        "      <th>zip</th>\n",
 818 |        "      <th>label</th>\n",
 819 |        "    </tr>\n",
 820 |        "  </thead>\n",
 821 |        "  <tbody>\n",
 822 |        "    <tr>\n",
 823 |        "      <th>962948</th>\n",
 824 |        "      <td>1662</td>\n",
 825 |        "      <td>527</td>\n",
 826 |        "      <td>1990s</td>\n",
 827 |        "      <td>1993</td>\n",
 828 |        "      <td>2000</td>\n",
 829 |        "      <td>11</td>\n",
 830 |        "      <td>2000s</td>\n",
 831 |        "      <td>Drama</td>\n",
 832 |        "      <td>War</td>\n",
 833 |        "      <td>non data</td>\n",
 834 |        "      <td>M</td>\n",
 835 |        "      <td>25</td>\n",
 836 |        "      <td>12</td>\n",
 837 |        "      <td>94121</td>\n",
 838 |        "      <td>1</td>\n",
 839 |        "    </tr>\n",
 840 |        "    <tr>\n",
 841 |        "      <th>962949</th>\n",
 842 |        "      <td>1662</td>\n",
 843 |        "      <td>2762</td>\n",
 844 |        "      <td>1990s</td>\n",
 845 |        "      <td>1999</td>\n",
 846 |        "      <td>2000</td>\n",
 847 |        "      <td>11</td>\n",
 848 |        "      <td>2000s</td>\n",
 849 |        "      <td>Thriller</td>\n",
 850 |        "      <td>non data</td>\n",
 851 |        "      <td>non data</td>\n",
 852 |        "      <td>M</td>\n",
 853 |        "      <td>25</td>\n",
 854 |        "      <td>12</td>\n",
 855 |        "      <td>94121</td>\n",
 856 |        "      <td>1</td>\n",
 857 |        "    </tr>\n",
 858 |        "    <tr>\n",
 859 |        "      <th>962950</th>\n",
 860 |        "      <td>1662</td>\n",
 861 |        "      <td>1259</td>\n",
 862 |        "      <td>1980s</td>\n",
 863 |        "      <td>1986</td>\n",
 864 |        "      <td>2000</td>\n",
 865 |        "      <td>11</td>\n",
 866 |        "      <td>2000s</td>\n",
 867 |        "      <td>Adventure</td>\n",
 868 |        "      <td>Comedy</td>\n",
 869 |        "      <td>Drama</td>\n",
 870 |        "      <td>M</td>\n",
 871 |        "      <td>25</td>\n",
 872 |        "      <td>12</td>\n",
 873 |        "      <td>94121</td>\n",
 874 |        "      <td>1</td>\n",
 875 |        "    </tr>\n",
 876 |        "    <tr>\n",
 877 |        "      <th>962951</th>\n",
 878 |        "      <td>1662</td>\n",
 879 |        "      <td>589</td>\n",
 880 |        "      <td>1990s</td>\n",
 881 |        "      <td>1991</td>\n",
 882 |        "      <td>2000</td>\n",
 883 |        "      <td>11</td>\n",
 884 |        "      <td>2000s</td>\n",
 885 |        "      <td>Action</td>\n",
 886 |        "      <td>Sci-Fi</td>\n",
 887 |        "      <td>Thriller</td>\n",
 888 |        "      <td>M</td>\n",
 889 |        "      <td>25</td>\n",
 890 |        "      <td>12</td>\n",
 891 |        "      <td>94121</td>\n",
 892 |        "      <td>1</td>\n",
 893 |        "    </tr>\n",
 894 |        "    <tr>\n",
 895 |        "      <th>962952</th>\n",
 896 |        "      <td>1662</td>\n",
 897 |        "      <td>2858</td>\n",
 898 |        "      <td>1990s</td>\n",
 899 |        "      <td>1999</td>\n",
 900 |        "      <td>2000</td>\n",
 901 |        "      <td>11</td>\n",
 902 |        "      <td>2000s</td>\n",
 903 |        "      <td>Comedy</td>\n",
 904 |        "      <td>Drama</td>\n",
 905 |        "      <td>non data</td>\n",
 906 |        "      <td>M</td>\n",
 907 |        "      <td>25</td>\n",
 908 |        "      <td>12</td>\n",
 909 |        "      <td>94121</td>\n",
 910 |        "      <td>1</td>\n",
 911 |        "    </tr>\n",
 912 |        "  </tbody>\n",
 913 |        "</table>\n",
 914 |        "</div>"
 915 |       ],
 916 |       "text/plain": [
 917 |        "       user_id movie_id movie_decade movie_year rating_year rating_month  \\\n",
 918 |        "962948    1662      527        1990s       1993        2000           11   \n",
 919 |        "962949    1662     2762        1990s       1999        2000           11   \n",
 920 |        "962950    1662     1259        1980s       1986        2000           11   \n",
 921 |        "962951    1662      589        1990s       1991        2000           11   \n",
 922 |        "962952    1662     2858        1990s       1999        2000           11   \n",
 923 |        "\n",
 924 |        "       rating_decade     genre1    genre2    genre3 gender age occupation  \\\n",
 925 |        "962948         2000s      Drama       War  non data      M  25         12   \n",
 926 |        "962949         2000s   Thriller  non data  non data      M  25         12   \n",
 927 |        "962950         2000s  Adventure    Comedy     Drama      M  25         12   \n",
 928 |        "962951         2000s     Action    Sci-Fi  Thriller      M  25         12   \n",
 929 |        "962952         2000s     Comedy     Drama  non data      M  25         12   \n",
 930 |        "\n",
 931 |        "          zip label  \n",
 932 |        "962948  94121     1  \n",
 933 |        "962949  94121     1  \n",
 934 |        "962950  94121     1  \n",
 935 |        "962951  94121     1  \n",
 936 |        "962952  94121     1  "
 937 |       ]
 938 |      },
 939 |      "execution_count": 48,
 940 |      "metadata": {},
 941 |      "output_type": "execute_result"
 942 |     }
 943 |    ],
 944 |    "source": [
 945 |     "sample_user_history = movielens_rcmm_origin[movielens_rcmm_origin['user_id'] == random_user[0]].fillna('non data')\n",
 946 |     "print(sample_user_history.shape)\n",
 947 |     "sample_user_history.head()"
 948 |    ]
 949 |   },
 950 |   {
 951 |    "cell_type": "code",
 952 |    "execution_count": null,
 953 |    "id": "4d527c69-eeb6-4337-aa36-d81791cdfa56",
 954 |    "metadata": {},
 955 |    "outputs": [],
 956 |    "source": []
 957 |   },
 958 |   {
 959 |    "cell_type": "markdown",
 960 |    "id": "b65b5906-e36e-4d1f-a725-5687b52cd3ad",
 961 |    "metadata": {},
 962 |    "source": [
 963 |     "# Build user info from history as LLM input"
 964 |    ]
 965 |   },
 966 |   {
 967 |    "cell_type": "code",
 968 |    "execution_count": 49,
 969 |    "id": "4c8fc4d4-2e96-4daa-aa85-7daa17faa783",
 970 |    "metadata": {},
 971 |    "outputs": [],
 972 |    "source": [
 973 |     "# Recent user info: the first 10% of this user's history rows\n",
 974 |     "recent_ratio = int(sample_user_history.shape[0] * 0.1)\n",
 975 |     "user_data = movielens_rcmm_origin[movielens_rcmm_origin['user_id'] == random_user[0]].fillna('non data')[['movie_decade', 'movie_year', 'rating_year', 'rating_decade', 'genre1', 'genre2', 'gender', 'age', 'zip']].values[:recent_ratio]\n",
 976 |     "recent_user_hist_info = \"#### Item interaction information\\n\\n- (item) : metadata information of items \\n- (user) : metadata information of users\"\n",
 977 |     "for cnt, rows in enumerate(user_data):\n",
 978 |     "    recent_user_hist_info += f\"\\n\\n{cnt+1}th.\\n- (Item) Movie Release Decade (ex. 1990s movies): {rows[0]}\\n- (Item) Movie Release Year: {rows[1]}\\n- (User) Rating Year: {rows[2]}\\n- (User) Rating Decade (e.g., 1990s ratings): {rows[3]}\\n- (Item) Genre 1: {rows[4]}\\n- (Item) Genre 2: {rows[5]}\\n- (User) Gender: {rows[6]}\\n- (User) Age: {rows[7]}\\n- (User) Address Information (zipcode): {rows[8]}\\n##### End of {cnt+1}th item interaction information\""
 979 |    ]
 980 |   },
 981 |   {
 982 |    "cell_type": "code",
 983 |    "execution_count": 51,
 984 |    "id": "edf2ac7e-1a0f-414a-ba46-84c35e00432b",
 985 |    "metadata": {},
 986 |    "outputs": [],
 987 |    "source": [
 988 |     "# Entire user history information\n",
 989 |     "user_data = movielens_rcmm_origin[movielens_rcmm_origin['user_id'] == random_user[0]].fillna('non data')[['movie_decade', 'movie_year', 'rating_year', 'rating_decade', 'genre1', 'genre2', 'gender', 'age', 'zip']].values\n",
 990 |     "user_all_hist_info = \"#### Item interaction information\\n\\n- (item) : metadata information of items \\n- (user) : metadata information of users\"\n",
 991 |     "for cnt, rows in enumerate(user_data):\n",
 992 |     "    user_all_hist_info += f\"\\n\\n{cnt+1}th.\\n- (Item) Movie Release Decade (ex. 1990s movies): {rows[0]}\\n- (Item) Movie Release Year: {rows[1]}\\n- (User) Rating Year: {rows[2]}\\n- (User) Rating Decade (e.g., 1990s ratings): {rows[3]}\\n- (Item) Genre 1: {rows[4]}\\n- (Item) Genre 2: {rows[5]}\\n- (User) Gender: {rows[6]}\\n- (User) Age: {rows[7]}\\n- (User) Address Information (zipcode): {rows[8]}\\n##### End of {cnt+1}th item interaction information\""
 993 |    ]
 994 |   },
 995 |   {
 996 |    "cell_type": "code",
 997 |    "execution_count": 52,
 998 |    "id": "18f086d7-128c-4542-b7af-59917caa67e1",
 999 |    "metadata": {},
1000 |    "outputs": [
1001 |     {
1002 |      "name": "stdout",
1003 |      "output_type": "stream",
1004 |      "text": [
1005 |       "#### Item interaction information\n",
1006 |       "\n",
1007 |       "- (item) : metadata information of items \n",
1008 |       "- (user) : metadata information of users\n",
1009 |       "\n",
1010 |       "1th.\n",
1011 |       "- (Item) Movie Release Decade (ex. 1990s movies): 1990s\n",
1012 |       "- (Item) Movie Release Year: 1993\n",
1013 |       "- (User) Rating Year: 2000\n",
1014 |       "- (User) Rating Decade (e.g., 1990s ratings): 2000s\n",
1015 |       "- (Item) Genre 1: Drama\n",
1016 |       "- (Item) Genre 2: War\n",
1017 |       "- (User) Gender: M\n",
1018 |       "- (User) Age: 25\n",
1019 |       "- (User) Address Information (zipcode): 94121\n",
1020 |       "##### End of 1th item interaction information\n",
1021 |       "\n",
1022 |       "2th.\n",
1023 |       "- (Item) Movie Release Decade (ex. 1990s movies): 1990s\n",
1024 |       "- (Item) Movie Release Year: 1999\n",
1025 |       "- (User) Rating Year: 2000\n",
1026 |       "- (User) Rating Decade (e.g., 1990s ratings): 2000s\n",
1027 |       "- (Item) Genre 1: Thriller\n",
1028 |       "- (Item) Genre 2: non data\n",
1029 |       "- (User) Gender: M\n",
1030 |       "- (User) Age: 25\n",
1031 |       "- (User) Address Information (zipcode): 94121\n",
1032 |       "##### End of 2th item interaction information\n"
1033 |      ]
1034 |     }
1035 |    ],
1036 |    "source": [
1037 |     "print(recent_user_hist_info)"
1038 |    ]
1039 |   },
1040 |   {
1041 |    "cell_type": "code",
1042 |    "execution_count": 55,
1043 |    "id": "5b3fc1a7-1412-4258-b2e7-b7f507d765c3",
1044 |    "metadata": {},
1045 |    "outputs": [
1046 |     {
1047 |      "name": "stdout",
1048 |      "output_type": "stream",
1049 |      "text": [
1050 |       "#### Item interaction information\n",
1051 |       "\n",
1052 |       "- (item) : metadata information of items \n",
1053 |       "- (user) : metadata information of users\n",
1054 |       "\n",
1055 |       "1th.\n",
1056 |       "- (Item) Movie Release Decade (ex. 1990s movies): 1990s\n",
1057 |       "- (Item) Movie Release Year: 1993\n",
1058 |       "- (User) Rating Year: 2000\n",
1059 |       "- (User) Rating Decade (e.g., 1990s ratings): 2000s\n",
1060 |       "- (Item) Genre 1: Drama\n",
1061 |       "- (Item) Genre 2: War\n",
1062 |       "- (User) Gender: M\n",
1063 |       "- (User) Age: 25\n",
1064 |       "- (User) Address Information (zipcode): 94121\n",
1065 |       "##### End of 1th item interaction information\n",
1066 |       "\n",
1067 |       "2th.\n",
1068 |       "- (Item) Movie Release Decade (ex. 1990s movies): 1990s\n",
1069 |       "- (Item) Movie Release Year: 1999\n",
1070 |       "- (User) Rating Year: 2000\n",
1071 |       "- (User) Rating Decade (e.g., 1990s ratings): 2000s\n",
1072 |       "- (Item) Genre 1: Thriller\n",
1073 |       "- (Item) Genre 2: non data\n",
1074 |       "- (User) Gender: M\n",
1075 |       "- (User) Age: 25\n",
1076 |       "- (User) Address Information (zipcode): 94121\n",
1077 |       "##### End of 2th item interaction information\n",
1078 |       "\n",
1079 |       "3th.\n",
1080 |       "- (Item) Movie Release Decade (ex. 1990s movies): 1980s\n",
1081 |       "- (Item) Movie Release Year: 1986\n",
1082 |       "- (User) Rating Year: 2000\n",
1083 |       "- (User) Rating Decade (e.g., 1990s ratings): 2000s\n",
1084 |       "- (Item) Genre 1: Adventure\n",
1085 |       "- (Item) Genre 2: Comedy\n",
1086 |       "- (User) Gender: M\n",
1087 |       "- (User) Age: 25\n",
1088 |       "- (User) Address Information (zipcode): 94121\n",
1089 |       "##### End of 3th item interaction information\n",
1090 |       "\n",
1091 |       "4th.\n",
1092 |       "- (Item) Movie Release Decade (ex. 1990s movies): 1990s\n",
1093 |       "- (Item) Movie Release Year: 1991\n",
1094 |       "- (User) Rating Year: 2000\n",
1095 |       "- (User) Rating Decade (e.g., 1990s ratings): 2000s\n",
1096 |       "- (Item) Genre 1: Action\n",
1097 |       "- (Item) Genre 2: Sci-Fi\n",
1098 |       "- (User) Gender: M\n",
1099 |       "- (User) Age: 25\n",
1100 |       "- (User) Address Information (zipcode): 94121\n",
1101 |       "##### End of \n"
1102 |      ]
1103 |     }
1104 |    ],
1105 |    "source": [
1106 |     "print(user_all_hist_info[:1500])"
1107 |    ]
1108 |   },
1109 |   {
1110 |    "cell_type": "code",
1111 |    "execution_count": null,
1112 |    "id": "779d69b8-19e4-43d5-9abf-4bcb24b4baa1",
1113 |    "metadata": {},
1114 |    "outputs": [],
1115 |    "source": []
1116 |   },
1117 |   {
1118 |    "cell_type": "markdown",
1119 |    "id": "7a478f9e-8a3c-4bad-915f-7e9558fe4947",
1120 |    "metadata": {},
1121 |    "source": [
1122 |     "# Summarize user history"
1123 |    ]
1124 |   },
1125 |   {
1126 |    "cell_type": "code",
1127 |    "execution_count": 56,
1128 |    "id": "256bd0f1-f29b-4fa5-9fd2-6fa34bddbaca",
1129 |    "metadata": {},
1130 |    "outputs": [
1131 |     {
1132 |      "name": "stdout",
1133 |      "output_type": "stream",
1134 |      "text": [
1135 |       "9011\n"
1136 |      ]
1137 |     }
1138 |    ],
1139 |    "source": [
1140 |     "print(len(user_all_hist_info))"
1141 |    ]
1142 |   },
1143 |   {
1144 |    "cell_type": "code",
1145 |    "execution_count": 57,
1146 |    "id": "c8ef4f29-0c12-481e-9bc4-928c45cf8a9d",
1147 |    "metadata": {},
1148 |    "outputs": [],
1149 |    "source": [
1150 |     "docs = []\n",
1151 |     "text_splitter = RecursiveCharacterTextSplitter(chunk_size=550, chunk_overlap=100)\n",
1152 |     "texts = text_splitter.split_text(user_all_hist_info)\n",
1153 |     "docs += [Document(page_content=t) for t in texts]"
1154 |    ]
1155 |   },
1156 |   {
1157 |    "cell_type": "code",
1158 |    "execution_count": 58,
1159 |    "id": "aa4d9dd9-49f7-42d5-b0e3-8618e7eac0a3",
1160 |    "metadata": {},
1161 |    "outputs": [],
1162 |    "source": [
1163 |     "template = '''Below is the user's past history information. Considering the user's main characteristics, persona, preferences, and meaningful patterns, please summarize the user information within 700 characters.\\n\\n##### User history information: {text}.'''\n",
1164 |     "\n",
1165 |     "prompt = PromptTemplate(template=template, input_variables=['text'])\n",
1166 |     "\n",
1167 |     "llm = ChatOpenAI(temperature=0, model='gpt-4o')\n"
1168 |    ]
1169 |   },
1170 |   {
1171 |    "cell_type": "code",
1172 |    "execution_count": 59,
1173 |    "id": "8958d283-47ed-4549-9620-338e2771d052",
1174 |    "metadata": {},
1175 |    "outputs": [
1176 |     {
1177 |      "name": "stderr",
1178 |      "output_type": "stream",
1179 |      "text": [
1180 |       "/Users/leesoojin/opt/anaconda3/envs/llm/lib/python3.9/site-packages/langchain_core/_api/deprecation.py:117: LangChainDeprecationWarning: The function `run` was deprecated in LangChain 0.1.0 and will be removed in 0.2.0. Use invoke instead.\n",
1181 |       "  warn_deprecated(\n"
1182 |      ]
1183 |     }
1184 |    ],
1185 |    "source": [
1186 |     "chain = load_summarize_chain(llm, \n",
1187 |     "                             chain_type='map_reduce', \n",
1188 |     "                             map_prompt=prompt, combine_prompt=prompt,\n",
1189 |     "                             verbose=False)\n",
1190 |     "summary = chain.run(docs)"
1191 |    ]
1192 |   },
1193 |   {
1194 |    "cell_type": "code",
1195 |    "execution_count": 60,
1196 |    "id": "bce8a1cb-5e94-439c-9a99-fc0692846a1e",
1197 |    "metadata": {},
1198 |    "outputs": [
1199 |     {
1200 |      "data": {
1201 |       "text/plain": [
1202 |        "'The user is a 25-year-old male residing in the 94121 zip code area. He has a strong preference for movies from the 1990s, with specific interests in various years such as 1990, 1991, 1993, 1994, 1996, 1997, 1998, and 1999. His favorite genres include drama, thriller, comedy, romance, action, adventure, and sci-fi. He tends to rate movies primarily in the early 2000s, suggesting a nostalgic inclination towards films from his formative years. This user enjoys revisiting and evaluating films from the past, reflecting a blend of nostalgia and a methodical approach to his movie-watching habits.'"
1203 |       ]
1204 |      },
1205 |      "execution_count": 60,
1206 |      "metadata": {},
1207 |      "output_type": "execute_result"
1208 |     }
1209 |    ],
1210 |    "source": [
1211 |     "summary"
1212 |    ]
1213 |   },
1214 |   {
1215 |    "cell_type": "markdown",
1216 |    "id": "d8e95e1d-ed30-46af-9080-30afd5ff5301",
1217 |    "metadata": {},
1218 |    "source": [
1219 |     "# Get user persona and characteristics"
1220 |    ]
1221 |   },
1222 |   {
1223 |    "cell_type": "code",
1224 |    "execution_count": 62,
1225 |    "id": "1c98559a-d4d1-4ab3-84b6-90978e0c43b0",
1226 |    "metadata": {},
1227 |    "outputs": [],
1228 |    "source": [
1229 |     "\n",
1230 |     "template = \"\"\"Below is the user's item interaction history information. Using this data, please derive the user's main characteristics, persona, preferences, and meaningful patterns.\n",
1231 |     "\n",
1232 |     "# User history information\n",
1233 |     "{user_hist}\n",
1234 |     "\n",
1235 |     "Please output in the following format:\n",
1236 |     "\n",
1237 |     "- Main characteristics of the user: string\n",
1238 |     "- User persona: string\n",
1239 |     "- User preferences: string\n",
1240 |     "- Meaningful patterns of the user: string\n",
1241 |     "\n",
1242 |     "\"\"\"\n",
1243 |     "prompt = PromptTemplate(template=template, input_variables=['user_hist'])\n"
1244 |    ]
1245 |   },
1246 |   {
1247 |    "cell_type": "code",
1248 |    "execution_count": 63,
1249 |    "id": "782e9108-ee4b-4e4a-90c4-9b1b951367fd",
1250 |    "metadata": {},
1251 |    "outputs": [],
1252 |    "source": [
1253 |     "llm = ChatOpenAI(temperature=0, model='gpt-4o')\n",
1254 |     "chain = LLMChain(llm=llm, prompt=prompt)\n",
1255 |     "user_recent_summary = chain.invoke({'user_hist': recent_user_hist_info})"
1256 |    ]
1257 |   },
1258 |   {
1259 |    "cell_type": "code",
1260 |    "execution_count": 64,
1261 |    "id": "db87eae3-0cbe-4c5b-98bd-6bf0c5b6e99e",
1262 |    "metadata": {},
1263 |    "outputs": [
1264 |     {
1265 |      "data": {
1266 |       "text/plain": [
1267 |        "{'user_hist': '#### Item interaction information\\n\\n- (item) : metadata information of items \\n- (user) : metadata information of users\\n\\n1th.\\n- (Item) Movie Release Decade (ex. 1990s movies): 1990s\\n- (Item) Movie Release Year: 1993\\n- (User) Rating Year: 2000\\n- (User) Rating Decade (e.g., 1990s ratings): 2000s\\n- (Item) Genre 1: Drama\\n- (Item) Genre 2: War\\n- (User) Gender: M\\n- (User) Age: 25\\n- (User) Address Information (zipcode): 94121\\n##### End of 1th item interaction information\\n\\n2th.\\n- (Item) Movie Release Decade (ex. 1990s movies): 1990s\\n- (Item) Movie Release Year: 1999\\n- (User) Rating Year: 2000\\n- (User) Rating Decade (e.g., 1990s ratings): 2000s\\n- (Item) Genre 1: Thriller\\n- (Item) Genre 2: non data\\n- (User) Gender: M\\n- (User) Age: 25\\n- (User) Address Information (zipcode): 94121\\n##### End of 2th item interaction information',\n",
1268 |        " 'text': \"- Main characteristics of the user: The user is a 25-year-old male living in the 94121 zip code area. He rated movies in the early 2000s.\\n\\n- User persona: The user is a young adult male who enjoys watching movies from the 1990s. He seems to have a preference for serious and intense genres, indicating a possible interest in thought-provoking and emotionally engaging content.\\n\\n- User preferences: The user prefers movies from the 1990s, particularly those in the Drama and Thriller genres. He also shows an interest in War-themed movies.\\n\\n- Meaningful patterns of the user: The user consistently rates movies from the 1990s, suggesting a nostalgic or particular interest in that decade's filmography. His genre preferences lean towards Drama and Thriller, with a specific inclination towards movies that are intense and possibly have complex narratives. The user’s ratings are from the early 2000s, indicating that he might have been actively watching and rating movies during that period.\"}"
1269 |       ]
1270 |      },
1271 |      "execution_count": 64,
1272 |      "metadata": {},
1273 |      "output_type": "execute_result"
1274 |     }
1275 |    ],
1276 |    "source": [
1277 |     "user_recent_summary"
1278 |    ]
1279 |   },
1280 |   {
1281 |    "cell_type": "code",
1282 |    "execution_count": null,
1283 |    "id": "084708fc-76c9-473c-aafe-befb6a5e4a7b",
1284 |    "metadata": {},
1285 |    "outputs": [],
1286 |    "source": []
1287 |   },
1288 |   {
1289 |    "cell_type": "markdown",
1290 |    "id": "2141d6ec-5166-46d9-aa78-67353b982b1e",
1291 |    "metadata": {},
1292 |    "source": [
1293 |     "# Explainability"
1294 |    ]
1295 |   },
1296 |   {
1297 |    "cell_type": "code",
1298 |    "execution_count": 66,
1299 |    "id": "4b532750-d7c5-4453-9141-9013c722c115",
1300 |    "metadata": {},
1301 |    "outputs": [
1302 |     {
1303 |      "data": {
1304 |       "text/html": [
1305 |        "<div>\n",
1306 |        "<style scoped>\n",
1307 |        "    .dataframe tbody tr th:only-of-type {\n",
1308 |        "        vertical-align: middle;\n",
1309 |        "    }\n",
1310 |        "\n",
1311 |        "    .dataframe tbody tr th {\n",
1312 |        "        vertical-align: top;\n",
1313 |        "    }\n",
1314 |        "\n",
1315 |        "    .dataframe thead th {\n",
1316 |        "        text-align: right;\n",
1317 |        "    }\n",
1318 |        "</style>\n",
1319 |        "<table border=\"1\" class=\"dataframe\">\n",
1320 |        "  <thead>\n",
1321 |        "    <tr style=\"text-align: right;\">\n",
1322 |        "      <th></th>\n",
1323 |        "      <th>movie_id</th>\n",
1324 |        "      <th>title</th>\n",
1325 |        "      <th>movie_decade</th>\n",
1326 |        "      <th>genre</th>\n",
1327 |        "    </tr>\n",
1328 |        "  </thead>\n",
1329 |        "  <tbody>\n",
1330 |        "    <tr>\n",
1331 |        "      <th>293</th>\n",
1332 |        "      <td>296</td>\n",
1333 |        "      <td>Pulp Fiction</td>\n",
1334 |        "      <td>1990s</td>\n",
1335 |        "      <td>Crime</td>\n",
1336 |        "    </tr>\n",
1337 |        "    <tr>\n",
1338 |        "      <th>1539</th>\n",
1339 |        "      <td>1580</td>\n",
1340 |        "      <td>Men in Black</td>\n",
1341 |        "      <td>1990s</td>\n",
1342 |        "      <td>Action</td>\n",
1343 |        "    </tr>\n",
1344 |        "    <tr>\n",
1345 |        "      <th>3793</th>\n",
1346 |        "      <td>3863</td>\n",
1347 |        "      <td>Cell, The</td>\n",
1348 |        "      <td>2000s</td>\n",
1349 |        "      <td>Sci-Fi</td>\n",
1350 |        "    </tr>\n",
1351 |        "    <tr>\n",
1352 |        "      <th>3873</th>\n",
1353 |        "      <td>3943</td>\n",
1354 |        "      <td>Bamboozled</td>\n",
1355 |        "      <td>2000s</td>\n",
1356 |        "      <td>Comedy</td>\n",
1357 |        "    </tr>\n",
1358 |        "  </tbody>\n",
1359 |        "</table>\n",
1360 |        "</div>"
1361 |       ],
1362 |       "text/plain": [
1363 |        "     movie_id         title movie_decade   genre\n",
1364 |        "293       296  Pulp Fiction        1990s   Crime\n",
1365 |        "1539     1580  Men in Black        1990s  Action\n",
1366 |        "3793     3863     Cell, The        2000s  Sci-Fi\n",
1367 |        "3873     3943    Bamboozled        2000s  Comedy"
1368 |       ]
1369 |      },
1370 |      "execution_count": 66,
1371 |      "metadata": {},
1372 |      "output_type": "execute_result"
1373 |     }
1374 |    ],
1375 |    "source": [
1376 |     "user_recom_result = movie_info[movie_info['movie_id'].isin(sample_user_pred_info_trans)]\n",
1377 |     "user_recom_result"
1378 |    ]
1379 |   },
1380 |   {
1381 |    "cell_type": "code",
1382 |    "execution_count": 67,
1383 |    "id": "4326e016-2703-48b4-a57f-920c35a2d37a",
1384 |    "metadata": {},
1385 |    "outputs": [],
1386 |    "source": [
1387 |     "user_recent_summary_info = user_recent_summary['text']\n",
1388 |     "user_entire_summary_info = summary"
1389 |    ]
1390 |   },
1391 |   {
1392 |    "cell_type": "code",
1393 |    "execution_count": null,
1394 |    "id": "f1f5ceda-fbc7-4728-a50a-c9133e29c6d5",
1395 |    "metadata": {},
1396 |    "outputs": [],
1397 |    "source": []
1398 |   },
1399 |   {
1400 |    "cell_type": "code",
1401 |    "execution_count": 70,
1402 |    "id": "7670b00b-a5e7-4688-8aec-3f3c527d27b0",
1403 |    "metadata": {},
1404 |    "outputs": [],
1405 |    "source": [
1406 |     "user_data = user_recom_result[['title', 'movie_decade', 'genre']].values\n",
1407 |     "\n",
1408 |     "user_recom_info = \"#### User Recommendation List\\n\\n\"\n",
1409 |     "for cnt, rows in enumerate(user_data):\n",
1410 |     "    user_recom_info += f\"\\n\\nRecommendation {cnt+1}:\\n- Item Title: {rows[0]}\\n- (Item) Movie Release Decade (e.g., 1990s movie): {rows[1]}\\n- Item Genre (Category): {rows[2]}\\n##### End of Recommendation {cnt+1} Information\"\n"
1411 |    ]
1412 |   },
1413 |   {
1414 |    "cell_type": "code",
1415 |    "execution_count": 71,
1416 |    "id": "24c0018e-ee5c-4014-b809-a8066d7dcbe1",
1417 |    "metadata": {},
1418 |    "outputs": [
1419 |     {
1420 |      "data": {
1421 |       "text/plain": [
1422 |        "'#### User Recommendation List\\n\\n\\n\\nRecommendation 1:\\n- Item Title: Pulp Fiction\\n- (Item) Movie Release Decade (e.g., 1990s movie): 1990s\\n- Item Genre (Category): Crime\\n##### End of Recommendation 1 Information\\n\\nRecommendation 2:\\n- Item Title: Men in Black\\n- (Item) Movie Release Decade (e.g., 1990s movie): 1990s\\n- Item Genre (Category): Action\\n##### End of Recommendation 2 Information\\n\\nRecommendation 3:\\n- Item Title: Cell, The\\n- (Item) Movie Release Decade (e.g., 1990s movie): 2000s\\n- Item Genre (Category): Sci-Fi\\n##### End of Recommendation 3 Information\\n\\nRecommendation 4:\\n- Item Title: Bamboozled\\n- (Item) Movie Release Decade (e.g., 1990s movie): 2000s\\n- Item Genre (Category): Comedy\\n##### End of Recommendation 4 Information'"
1423 |       ]
1424 |      },
1425 |      "execution_count": 71,
1426 |      "metadata": {},
1427 |      "output_type": "execute_result"
1428 |     }
1429 |    ],
1430 |    "source": [
1431 |     "user_recom_info"
1432 |    ]
1433 |   },
1434 |   {
1435 |    "cell_type": "code",
1436 |    "execution_count": 73,
1437 |    "id": "f286da16-9e56-422e-835c-3a30c3b1c132",
1438 |    "metadata": {},
1439 |    "outputs": [],
1440 |    "source": [
1441 |     "template = \"\"\"The data below contains the user's main characteristics, persona, and preference information. There is preference information based on the entire history and also based on the last 10 interactions.\n",
1442 |     "\n",
1443 |     "#### Main characteristics based on the entire history\n",
1444 |     "{user_entire_summary_info}\n",
1445 |     "\n",
1446 |     "#### Main characteristics based on the last 10 interactions\n",
1447 |     "{user_recent_summary_info}\n",
1448 |     "\n",
1449 |     "Below is the item information recommended by the recommendation system for the above user.\n",
1450 |     "\n",
1451 |     "#### Recommendation results provided by the recommendation system\n",
1452 |     "{recom_list}\n",
1453 |     "\n",
1454 |     "Your role is to write the reason for the recommendation by comparing the user's main characteristics information with the recommendation results provided by the recommendation system.\n",
1455 |     "The recommendation results are a list of items provided by the recommendation system based on the user's past interaction information.\n",
1456 |     "If you determine that the reason for the recommendation is inappropriate, please say, 'It does not seem to be an appropriate recommendation' and also provide the reason why it is not appropriate.\n",
1457 |     "\n",
1458 |     "To summarize your role:\n",
1459 |     "\n",
1460 |     "- Consider the user information (main characteristics based on the entire history, main characteristics based on the last 10 interactions)\n",
1461 |     "- The recommendation results are a recommendation list provided by the recommendation system based on the user's past interactions\n",
1462 |     "- Write the reason for the recommendation by referring to the recommendation results and user information\n",
1463 |     "- If the reason for the recommendation is inappropriate, say 'It does not seem to be an appropriate recommendation' and explain why it is not appropriate\n",
1464 |     "- Do not include unnecessary words, perform the requested task and respond\n",
1465 |     "- If you are unsure, think it over and if you really don't know, respond with 'I don't know'\n",
1466 |     "\"\"\"\n",
1467 |     "\n",
1468 |     "prompt = PromptTemplate(template=template, input_variables=['user_entire_summary_info', 'user_recent_summary_info', 'recom_list'])\n"
1469 |    ]
1470 |   },
1471 |   {
1472 |    "cell_type": "code",
1473 |    "execution_count": 74,
1474 |    "id": "d00e2774-4fde-4170-8b58-e35f73d8b062",
1475 |    "metadata": {},
1476 |    "outputs": [],
1477 |    "source": [
1478 |     "llm = ChatOpenAI(temperature=0, model='gpt-4o')\n",
1479 |     "chain = LLMChain(llm=llm, prompt=prompt)\n",
1480 |     "recommend_explain = chain.invoke({'user_entire_summary_info': user_entire_summary_info, 'user_recent_summary_info':user_recent_summary_info, 'recom_list':user_recom_info})"
1481 |    ]
1482 |   },
1483 |   {
1484 |    "cell_type": "code",
1485 |    "execution_count": null,
1486 |    "id": "53e9bce0-3bbe-485b-832e-29ea87fbc86e",
1487 |    "metadata": {},
1488 |    "outputs": [],
1489 |    "source": []
1490 |   },
1491 |   {
1492 |    "cell_type": "code",
1493 |    "execution_count": 75,
1494 |    "id": "9d6f2762-93ca-4df3-a3b7-11f355c5376f",
1495 |    "metadata": {},
1496 |    "outputs": [
1497 |     {
1498 |      "name": "stdout",
1499 |      "output_type": "stream",
1500 |      "text": [
1501 |       "#### Recommendation 1: Pulp Fiction\n",
1502 |       "- **Reason for Recommendation:** \"Pulp Fiction\" is a 1990s movie, aligning with the user's strong preference for films from that decade. The genre is Crime, which, while not explicitly listed in the user's favorite genres, often overlaps with Drama and Thriller, both of which the user enjoys. The intense and complex narrative of \"Pulp Fiction\" fits the user's interest in thought-provoking and emotionally engaging content.\n",
1503 |       "\n",
1504 |       "#### Recommendation 2: Men in Black\n",
1505 |       "- **Reason for Recommendation:** \"Men in Black\" is a 1990s movie, which matches the user's preference for that decade. The genre is Action, one of the user's favorite genres. This recommendation aligns well with the user's interest in 1990s films and action-packed narratives.\n",
1506 |       "\n",
1507 |       "#### Recommendation 3: The Cell\n",
1508 |       "- **It does not seem to be an appropriate recommendation**\n",
1509 |       "- **Reason:** \"The Cell\" is a 2000s movie, which does not align with the user's strong preference for 1990s films. Although the genre is Sci-Fi, which the user enjoys, the decade mismatch makes this recommendation less suitable.\n",
1510 |       "\n",
1511 |       "#### Recommendation 4: Bamboozled\n",
1512 |       "- **It does not seem to be an appropriate recommendation**\n",
1513 |       "- **Reason:** \"Bamboozled\" is a 2000s movie, which does not align with the user's preference for 1990s films. Additionally, while the user enjoys Comedy, the primary interest is in Drama and Thriller genres, making this recommendation less fitting.\n"
1514 |      ]
1515 |     }
1516 |    ],
1517 |    "source": [
1518 |     "print(recommend_explain['text'])"
1519 |    ]
1520 |   },
1521 |   {
1522 |    "cell_type": "code",
1523 |    "execution_count": null,
1524 |    "id": "be6791ea-2f1d-4b86-9ee5-6bea5e15779b",
1525 |    "metadata": {},
1526 |    "outputs": [],
1527 |    "source": []
1528 |   },
1529 |   {
1530 |    "cell_type": "code",
1531 |    "execution_count": null,
1532 |    "id": "6a4a3352-8baa-47b9-ac15-febfbe6ca3de",
1533 |    "metadata": {},
1534 |    "outputs": [],
1535 |    "source": []
1536 |   },
1537 |   {
1538 |    "cell_type": "code",
1539 |    "execution_count": null,
1540 |    "id": "6f87e87a-84ab-4864-a5c4-4e509f8120ba",
1541 |    "metadata": {},
1542 |    "outputs": [],
1543 |    "source": []
1544 |   }
1545 |  ],
1546 |  "metadata": {
1547 |   "kernelspec": {
1548 |    "display_name": "Python 3 (ipykernel)",
1549 |    "language": "python",
1550 |    "name": "python3"
1551 |   },
1552 |   "language_info": {
1553 |    "codemirror_mode": {
1554 |     "name": "ipython",
1555 |     "version": 3
1556 |    },
1557 |    "file_extension": ".py",
1558 |    "mimetype": "text/x-python",
1559 |    "name": "python",
1560 |    "nbconvert_exporter": "python",
1561 |    "pygments_lexer": "ipython3",
1562 |    "version": "3.9.18"
1563 |   }
1564 |  },
1565 |  "nbformat": 4,
1566 |  "nbformat_minor": 5
1567 | }
1568 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | 
 2 | # Recommender system implementation with Python
 3 | 
 4 | ### Description of each file
 5 | 
 6 | > A write-up for each file is posted on the blog at https://lsjsj92.tistory.com/. The exact post URL is listed at the top of each file.
 7 | 
 8 | **1. recommender system basic**
 9 | - Introduction to the basic types of recommender systems: theory
10 |     - content based filtering
11 |     - collaborative filtering
12 | 
13 |     
14 | **2. recommender system basic with Python - 1 content based filtering**
15 | - Implements content based filtering in Python
16 | - Uses the Kaggle movies dataset
17 | 
18 | 
19 | **3. recommender system basic with Python - 2 Collaborative Filtering**
20 | - Implements collaborative filtering in Python
21 | - Uses the Kaggle movies dataset and the movielens dataset
22 | 
23 | 
24 | **4. recommender system basic with Python - 3 Matrix Factorization**
25 | - Implements Matrix Factorization in Python and explains the theory
26 | - Uses the Kaggle movies dataset and the movielens dataset
27 | 
28 | 
29 | **5. naver news recommender**
30 | - Applies a recommender system to Naver news data
31 | - Uses embedding methods such as Doc2vec
32 | 
33 | **6. deep learning recommender system**
34 | - Example code for a deep learning based recommender system
35 | - Uses Keras
36 | 
37 | 
38 | **7. Wide & Deep recommender system**
39 | - Recommender system model implemented based on the Wide & Deep paper
40 | - The implementation keeps only the core concept of the paper
41 | - Uses Keras
42 | 
43 | **8. Simple book recommender system with Keras(kaggle data)**
44 | - A simple recommender system built on the book data from Kaggle
45 | - Code for a basic recommendation model that can be built with Keras
46 | 
47 | **9. recommender system using ChatGPT**
48 | - Recommender system using ChatGPT
49 | - https://lsjsj92.tistory.com/657
50 | 
51 | **10. LLM based explainability recsys**
52 | - Adds explainability to a recommender system using an LLM
53 | - Uses LangChain and gpt-4o
54 | - https://lsjsj92.tistory.com/670
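
As a rough illustration of item 10, the sketch below shows one way a user's interaction rows could be rendered into the plain-text block that is then handed to an LLM. The helper names `ordinal` and `format_history`, the shortened `FIELDS` labels, and the sample rows are assumptions made for this example; it is not the notebook's exact code and it does not call LangChain or any model.

```python
from typing import Sequence

# Hypothetical field labels, loosely modelled on the notebook's prompt layout.
FIELDS = (
    "(Item) Movie Release Decade",
    "(Item) Movie Release Year",
    "(Item) Genre 1",
    "(User) Age",
)


def ordinal(n: int) -> str:
    """Return an English ordinal such as 1st, 2nd, 3rd, 11th."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


def format_history(rows: Sequence[Sequence[object]]) -> str:
    """Render interaction rows as a plain-text block for an LLM prompt."""
    parts = ["#### Item interaction information"]
    for i, row in enumerate(rows, start=1):
        lines = [f"{ordinal(i)}."]
        # Pair each label with the value at the same position in the row.
        lines += [f"- {label}: {value}" for label, value in zip(FIELDS, row)]
        lines.append(f"##### End of {ordinal(i)} item interaction information")
        parts.append("\n".join(lines))
    return "\n\n".join(parts)


# Illustrative rows only (decade, release year, first genre, user age).
sample_rows = [
    ("1990s", 1993, "Drama", 25),
    ("1990s", 1999, "Thriller", 25),
]
history_text = format_history(sample_rows)
print(history_text)
```

The explicit end marker after each interaction mirrors the notebook's prompt layout, which keeps record boundaries visible when the text is later split into chunks for summarization.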
55 | 


--------------------------------------------------------------------------------