├── .gitignore ├── 001. recommender system basic.ipynb ├── 002. recommender system basic with Python - 1 content based filtering.ipynb ├── 003. recommender system basic with Python - 2 Collaborative Filtering.ipynb ├── 004. recommender system basic with Python - 3 Matrix Factorization.ipynb ├── 005. naver news recommender.ipynb ├── 006. deep learning recommender system.ipynb ├── 007. wide-deep-RecSys model.ipynb ├── 008. simple book recommender system with Keras.ipynb ├── 009_chatgpt_recsys.ipynb ├── 010. LLM based Explainability RecSys .ipynb └── README.md /.gitignore: -------------------------------------------------------------------------------- 1 | 2 | # Created by https://www.gitignore.io/api/python,pycharm,jupyternotebooks 3 | # Edit at https://www.gitignore.io/?templates=python,pycharm,jupyternotebooks 4 | 5 | ### JupyterNotebooks ### 6 | # gitignore template for Jupyter Notebooks 7 | # website: http://jupyter.org/ 8 | 9 | .ipynb_checkpoints 10 | */.ipynb_checkpoints/* 11 | 12 | # IPython 13 | profile_default/ 14 | ipython_config.py 15 | 16 | # Remove previous ipynb_checkpoints 17 | # git rm -r .ipynb_checkpoints/ 18 | 19 | ### PyCharm ### 20 | # Covers JetBrains IDEs: IntelliJ, RubyMine, PhpStorm, AppCode, PyCharm, CLion, Android Studio and WebStorm 21 | # Reference: https://intellij-support.jetbrains.com/hc/en-us/articles/206544839 22 | 23 | # User-specific stuff 24 | .idea/**/workspace.xml 25 | .idea/**/tasks.xml 26 | .idea/**/usage.statistics.xml 27 | .idea/**/dictionaries 28 | .idea/**/shelf 29 | 30 | # Generated files 31 | .idea/**/contentModel.xml 32 | 33 | # Sensitive or high-churn files 34 | .idea/**/dataSources/ 35 | .idea/**/dataSources.ids 36 | .idea/**/dataSources.local.xml 37 | .idea/**/sqlDataSources.xml 38 | .idea/**/dynamic.xml 39 | .idea/**/uiDesigner.xml 40 | .idea/**/dbnavigator.xml 41 | 42 | # Gradle 43 | .idea/**/gradle.xml 44 | .idea/**/libraries 45 | 46 | # Gradle and Maven with auto-import 47 | # When using Gradle or Maven with 
auto-import, you should exclude module files, 48 | # since they will be recreated, and may cause churn. Uncomment if using 49 | # auto-import. 50 | # .idea/modules.xml 51 | # .idea/*.iml 52 | # .idea/modules 53 | # *.iml 54 | # *.ipr 55 | 56 | # CMake 57 | cmake-build-*/ 58 | 59 | # Mongo Explorer plugin 60 | .idea/**/mongoSettings.xml 61 | 62 | # File-based project format 63 | *.iws 64 | 65 | # IntelliJ 66 | out/ 67 | 68 | # mpeltonen/sbt-idea plugin 69 | .idea_modules/ 70 | 71 | # JIRA plugin 72 | atlassian-ide-plugin.xml 73 | 74 | # Cursive Clojure plugin 75 | .idea/replstate.xml 76 | 77 | # Crashlytics plugin (for Android Studio and IntelliJ) 78 | com_crashlytics_export_strings.xml 79 | crashlytics.properties 80 | crashlytics-build.properties 81 | fabric.properties 82 | 83 | # Editor-based Rest Client 84 | .idea/httpRequests 85 | 86 | # Android studio 3.1+ serialized cache file 87 | .idea/caches/build_file_checksums.ser 88 | 89 | ### PyCharm Patch ### 90 | # Comment Reason: https://github.com/joeblau/gitignore.io/issues/186#issuecomment-215987721 91 | 92 | # *.iml 93 | # modules.xml 94 | # .idea/misc.xml 95 | # *.ipr 96 | 97 | # Sonarlint plugin 98 | .idea/**/sonarlint/ 99 | 100 | # SonarQube Plugin 101 | .idea/**/sonarIssues.xml 102 | 103 | # Markdown Navigator plugin 104 | .idea/**/markdown-navigator.xml 105 | .idea/**/markdown-navigator/ 106 | 107 | ### Python ### 108 | # Byte-compiled / optimized / DLL files 109 | __pycache__/ 110 | *.py[cod] 111 | *$py.class 112 | 113 | # C extensions 114 | *.so 115 | 116 | # Distribution / packaging 117 | .Python 118 | build/ 119 | develop-eggs/ 120 | dist/ 121 | downloads/ 122 | eggs/ 123 | .eggs/ 124 | lib/ 125 | lib64/ 126 | parts/ 127 | sdist/ 128 | var/ 129 | wheels/ 130 | pip-wheel-metadata/ 131 | share/python-wheels/ 132 | *.egg-info/ 133 | .installed.cfg 134 | *.egg 135 | MANIFEST 136 | 137 | # PyInstaller 138 | # Usually these files are written by a python script from a template 139 | # before PyInstaller builds 
the exe, so as to inject date/other infos into it. 140 | *.manifest 141 | *.spec 142 | 143 | # Installer logs 144 | pip-log.txt 145 | pip-delete-this-directory.txt 146 | 147 | # Unit test / coverage reports 148 | htmlcov/ 149 | .tox/ 150 | .nox/ 151 | .coverage 152 | .coverage.* 153 | .cache 154 | nosetests.xml 155 | coverage.xml 156 | *.cover 157 | .hypothesis/ 158 | .pytest_cache/ 159 | 160 | # Translations 161 | *.mo 162 | *.pot 163 | 164 | # Scrapy stuff: 165 | .scrapy 166 | 167 | # Sphinx documentation 168 | docs/_build/ 169 | 170 | # PyBuilder 171 | target/ 172 | 173 | # pyenv 174 | .python-version 175 | 176 | # pipenv 177 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. 178 | # However, in case of collaboration, if having platform-specific dependencies or dependencies 179 | # having no cross-platform support, pipenv may install dependencies that don't work, or not 180 | # install all needed dependencies. 181 | #Pipfile.lock 182 | 183 | # celery beat schedule file 184 | celerybeat-schedule 185 | 186 | # SageMath parsed files 187 | *.sage.py 188 | 189 | # Spyder project settings 190 | .spyderproject 191 | .spyproject 192 | 193 | # Rope project settings 194 | .ropeproject 195 | 196 | # Mr Developer 197 | .mr.developer.cfg 198 | .project 199 | .pydevproject 200 | 201 | # mkdocs documentation 202 | /site 203 | 204 | # mypy 205 | .mypy_cache/ 206 | .dmypy.json 207 | dmypy.json 208 | 209 | # Pyre type checker 210 | .pyre/ 211 | 212 | # End of https://www.gitignore.io/api/python,pycharm,jupyternotebooks 213 | -------------------------------------------------------------------------------- /001. 
recommender system basic.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Blog posts\n", 8 | "\n", 9 | "**The material here is also explained in the blog posts below.**\n", 10 | "- https://lsjsj92.tistory.com/563\n", 11 | "- https://lsjsj92.tistory.com/564\n", 12 | "\n", 13 | "----\n", 14 | "\n", 15 | "This material draws on the sources listed below. \n", 16 | "- https://www.kaggle.com/rounakbanik/movie-recommender-systems\n", 17 | "- https://www.kaggle.com/ibtesama/getting-started-with-a-movie-recommendation-system\n", 18 | "- https://wikidocs.net/5053\n", 19 | "- https://medium.com/towards-artificial-intelligence/content-based-recommender-system-4db1b3de03e7\n", 20 | "- https://www.youtube.com/watch?v=ZspR5PZemcs&list=PLU1prrdmLIpaGw0neztIByvshB9I7l6-f&index=11\n", 21 | "\n", 22 | "\n", 23 | "# Recommendation System\n", 24 | "\n", 25 | "\n", 26 | "https://scvgoe.github.io/2017-02-01-%ED%98%91%EC%97%85-%ED%95%84%ED%84%B0%EB%A7%81-%EC%B6%94%EC%B2%9C-%EC%8B%9C%EC%8A%A4%ED%85%9C-(Collaborative-Filtering-Recommendation-System)/\n", 27 | "\n", 28 | "A well-built recommendation system serves users and businesses alike.\n", 29 | "It learns a user's tastes and recommends items that match them. Users are more likely to buy because they see products tailored to them, and the business gains revenue as a result. \n", 30 | "\n", 31 | "The most striking thing about a recommendation system is that it can surface tastes users did not know they had. Users who experience such recommendations are more likely to become loyal customers of the service. That in turn yields more data, which makes the service even more robust.\n", 32 | "\n", 33 | "# Basic types of recommendation systems\n", 34 | "\n", 35 | "Recommendation systems fall broadly into **content-based filtering** and **collaborative filtering**. Collaborative filtering is further divided into memory-based collaborative filtering and latent factor collaborative filtering. \n", 36 | "\n", 37 | "Early systems mostly used content-based filtering and nearest-neighbor (memory-based) collaborative filtering. After the Netflix case, however, latent factor collaborative filtering became widely used. 
This latent factor collaborative filtering approach relies on **matrix factorization**.\n", 38 | "\n", 39 | "\n", 40 | "# Content based filtering\n", 41 | "\n", 42 | "When a user likes a particular item, content-based filtering recommends other items whose content is similar to it.\n", 43 | "\n", 44 | "![1](https://user-images.githubusercontent.com/24634054/71624712-520aaf00-2c27-11ea-9546-562ee61517aa.JPG)\n", 45 | "\n", 46 | "The idea is very simple. For example, if a user gave a high rating to movie A, and that movie was an action film by a director named '이수진', the system recommends other action films by director '이수진'.\n", 47 | "\n", 48 | "However, because this kind of recommendation is so simple, it is mostly used as a reference rather than as the main method.\n", 49 | "\n", 50 | "\n", 51 | "# Memory-based Collaborative Filtering\n", 52 | "\n", 53 | "In practice, when a new movie comes out, people often check other people's ratings and reviews before choosing it, because watching blindly risks wasting time on a bad movie. In the same way, collaborative filtering makes recommendations from **user behavior**, such as the ratings users give to items and their purchase history.\n", 54 | "\n", 55 | "The goal of memory-based collaborative filtering is to predict ratings for items that a user has not yet rated in the user-item matrix.\n", 56 | "\n" 57 | ] 58 | }, 59 | { 60 | "cell_type": "markdown", 61 | "metadata": {}, 62 | "source": [ 63 | "![2](https://user-images.githubusercontent.com/24634054/71624713-520aaf00-2c27-11ea-9e04-471bb9e0ae1e.JPG)\n", 64 | "\n", 65 | "The figure shows the idea. For example, User2 has not yet rated ItemC, so the task is to predict how User2 would rate ItemC.\n", 66 | "\n", 67 | "Memory-based collaborative filtering therefore works on a user-item rating matrix, in which the columns are items and the rows are users. If the data comes in the form shown below, it has to be converted into a pivot table.\n", 68 | "\n" 69 | ] 70 | }, 71 | { 72 | "cell_type": "markdown", 73 | "metadata": {}, 74 | "source": [ 75 | "![3](https://user-images.githubusercontent.com/24634054/71624714-52a34580-2c27-11ea-9bf8-0105bdf90b6a.JPG)\n", 76 | "\n", 77 | "Because of this layout, the matrix is very sparse, and in practice this sparsity is considered a drawback. \n", 78 | "It wastes space.\n", 79 | "\n", 80 | "Memory-based collaborative filtering can in turn be divided as follows. 
\n", 81 | "- User-based: similar customers bought such-and-such products.\n", 82 | "- Item-based: customers who bought this product also bought the following products.\n", 83 | "\n", 84 | "**User-based** collaborative filtering looks like the following.\n" 85 | ] 86 | }, 87 | { 88 | "cell_type": "markdown", 89 | "metadata": {}, 90 | "source": [ 91 | "![4](https://user-images.githubusercontent.com/24634054/71624715-52a34580-2c27-11ea-984d-412b17f9e7d8.JPG)\n", 92 | "\n", 93 | "That is, User1 and User2 are considered similar because their ratings for ItemA through ItemC are similar.\n", 94 | "\n", 95 | "**Item-based** collaborative filtering looks like the following." 96 | ] 97 | }, 98 | { 99 | "cell_type": "markdown", 100 | "metadata": {}, 101 | "source": [ 102 | "![5](https://user-images.githubusercontent.com/24634054/71624716-52a34580-2c27-11ea-855b-030aa67149e9.JPG)\n", 103 | "\n", 104 | "ItemA and ItemB are considered highly similar because users rate them in similar ways. So ItemA is recommended to User4.\n", 105 | "\n", 106 | "In general, item-based filtering tends to be somewhat more accurate than user-based filtering. \n", 107 | "The reason commonly given is that liking similar products does not necessarily mean that two people have similar tastes. \n", 108 | "\n", 109 | "So memory-based collaborative filtering is usually applied in its item-based form, and cosine similarity is the similarity measure used most often." 110 | ] 111 | }, 112 | { 113 | "cell_type": "markdown", 114 | "metadata": {}, 115 | "source": [ 116 | "# Matrix Factorization Collaborative Filtering\n", 117 | "\n", 118 | "There is also collaborative filtering based on matrix factorization. It extracts latent factors by decomposing a large, high-dimensional matrix with a dimensionality-reduction technique such as SVD.\n", 119 | "\n", 120 | "Although item-based collaborative filtering was introduced above, matrix factorization is in fact used more often. \n", 121 | "The main reason is storage space, as explained again below.\n", 122 | "\n", 123 | "Collaborative filtering by matrix factorization (or latent factors) derives 'latent factors' from the user-item matrix. \n", 124 | "That is, the user-item matrix can be decomposed into a user-latent factor matrix and an item-latent factor matrix, as in the figure below.\n", 125 | "\n", 126 | "Source: https://www.cs.cmu.edu/~mgormley/courses/10601-s17/slides/lecture25-mf.pdf\n", 127 | "\n", 128 | "![1](https://user-images.githubusercontent.com/24634054/71657636-1507ff00-2d84-11ea-82c7-b615cb871011.JPG)\n", 129 | "\n", 130 | "What a given latent factor represents cannot be known exactly. 
It could, however, correspond to something like a genre, such as comedy or action. \n", 131 | "If the factors did correspond to genres such as comedy and action, the matrix would decompose into per-user genre preferences and per-item genre weights.\n", 132 | "\n", 133 | "That makes a calculation like the one in the figure below possible. \n", 134 | "Source: https://www.youtube.com/watch?v=ZspR5PZemcs&list=PLU1prrdmLIpaGw0neztIByvshB9I7l6-f&index=11\n", 135 | "\n", 136 | "![10](https://user-images.githubusercontent.com/24634054/71638406-8c656200-2ca2-11ea-9740-a3da282fefde.JPG)\n", 137 | "\n", 138 | "\n", 139 | "The user-item matrix is usually denoted R, and R(u, i) is the rating that the u-th user gave to the i-th item. \n", 140 | "The user-latent factor matrix is denoted P and the item-latent factor matrix Q. The item-latent factor matrix is usually used in transposed form, so it is written Q.T.\n", 141 | "\n", 142 | "So, as in the figure below, latent factor scores can be derived from the values in R.\n", 143 | "\n", 144 | "![11](https://user-images.githubusercontent.com/24634054/71638458-fcc0b300-2ca3-11ea-93c9-dd80020f143e.JPG)\n", 145 | "\n", 146 | "These values can then be used to predict scores for content the user has not rated, as shown below. \n", 147 | "If the predicted score is high, the item can be recommended to the user.\n", 148 | "\n", 149 | "![12](https://user-images.githubusercontent.com/24634054/71638459-fcc0b300-2ca3-11ea-914c-c10ae98e063d.JPG)\n", 150 | "\n", 151 | "This is collaborative filtering using matrix factorization. \n", 152 | "It is also called latent factor based collaborative filtering.\n", 153 | "\n", 154 | "As mentioned briefly above, the advantage of this approach is storage space.\n", 155 | "\n", 156 | "Without matrix factorization, we would have a user-item matrix like the one below. 
\n", 157 | "That is, with 1000 items and 2000 users, 1000 * 2000 parameters are needed.\n", 158 | "\n", 159 | "Source: https://www.youtube.com/watch?v=ZspR5PZemcs&list=PLU1prrdmLIpaGw0neztIByvshB9I7l6-f&index=11\n", 160 | "\n", 161 | "![13](https://user-images.githubusercontent.com/24634054/71638477-b9b30f80-2ca4-11ea-8581-5b1443fcf1f3.JPG)\n", 162 | "\n", 163 | "The figure below illustrates the same point.\n", 164 | "\n", 165 | "![16](https://user-images.githubusercontent.com/24634054/71638478-b9b30f80-2ca4-11ea-8eab-192335230ddd.JPG)\n", 166 | "\n", 167 | "With matrix factorization, however, space can be used far more efficiently.\n", 168 | "\n", 169 | "![17](https://user-images.githubusercontent.com/24634054/71638479-ba4ba600-2ca4-11ea-9ee7-5a0d4b58d27c.JPG)\n" 170 | ] 171 | }, 172 | { 173 | "cell_type": "markdown", 174 | "metadata": {}, 175 | "source": [ 176 | "The factorization itself is carried out as shown below.\n", 177 | "\n", 178 | "![20](https://user-images.githubusercontent.com/24634054/71638517-6a211380-2ca5-11ea-89f1-08d831c0ae37.JPG)" 179 | ] 180 | }, 181 | { 182 | "cell_type": "markdown", 183 | "metadata": {}, 184 | "source": [ 185 | "Now let's review the material above by walking through the movie recommendation code on Kaggle." 186 | ] 187 | }, 188 | { 189 | "cell_type": "code", 190 | "execution_count": null, 191 | "metadata": {}, 192 | "outputs": [], 193 | "source": [] 194 | } 195 | ], 196 | "metadata": { 197 | "kernelspec": { 198 | "display_name": "Python 3", 199 | "language": "python", 200 | "name": "python3" 201 | }, 202 | "language_info": { 203 | "codemirror_mode": { 204 | "name": "ipython", 205 | "version": 3 206 | }, 207 | "file_extension": ".py", 208 | "mimetype": "text/x-python", 209 | "name": "python", 210 | "nbconvert_exporter": "python", 211 | "pygments_lexer": "ipython3", 212 | "version": "3.8.5" 213 | } 214 | }, 215 | "nbformat": 4, 216 | "nbformat_minor": 2 217 | } 218 | -------------------------------------------------------------------------------- /005. 
naver news recommender.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Blog post\n", 8 | "\n", 9 | "The material here is explained in the blog post below.\n", 10 | "- https://lsjsj92.tistory.com/571" 11 | ] 12 | }, 13 | { 14 | "cell_type": "code", 15 | "execution_count": 13, 16 | "metadata": {}, 17 | "outputs": [], 18 | "source": [ 19 | "import pandas as pd\n", 20 | "import numpy as np\n", 21 | "import matplotlib.pyplot as plt\n", 22 | "import random\n", 23 | "from sklearn.manifold import TSNE\n", 24 | "from gensim.test.utils import common_texts\n", 25 | "from gensim.models.doc2vec import Doc2Vec, TaggedDocument\n", 26 | "from sklearn.metrics.pairwise import cosine_similarity" 27 | ] 28 | }, 29 | { 30 | "cell_type": "code", 31 | "execution_count": 14, 32 | "metadata": {}, 33 | "outputs": [], 34 | "source": [ 35 | "def make_doc2vec_models(tagged_data, tok, vector_size=128, window = 3, epochs = 40, min_count = 0, workers = 4):\n", 36 | " model = Doc2Vec(tagged_data, vector_size=vector_size, window=window, epochs=epochs, min_count=min_count, workers=workers)\n", 37 | " model.save(f'./datas/{tok}_news_model.doc2vec')" 38 | ] 39 | }, 40 | { 41 | "cell_type": "code", 42 | "execution_count": 15, 43 | "metadata": {}, 44 | "outputs": [], 45 | "source": [ 46 | "def get_data(preprocess = True):\n", 47 | " if preprocess :\n", 48 | " data = pd.read_csv('./datas/naver_news/preprocessing_tok_naver_data.csv')\n", 49 | " else:\n", 50 | " economy = pd.read_csv('./datas/naver_news/economy.csv', header=None)\n", 51 | " policy = pd.read_csv('./datas/naver_news/policy.csv', header=None)\n", 52 | " it = pd.read_csv('./datas/naver_news/it.csv', header=None)\n", 53 | "\n", 54 | " columns = ['date', 'category', 'company', 'title', 'content', 'url']\n", 55 | " economy.columns = columns\n", 56 | " policy.columns = columns\n", 57 | " it.columns = columns\n", 58 | "\n", 59 | " data = 
pd.concat([economy, policy, it], axis = 0)\n", 60 | " data.reset_index(drop=True, inplace=True)\n", 61 | " \n", 62 | " return data" 63 | ] 64 | }, 65 | { 66 | "cell_type": "code", 67 | "execution_count": 16, 68 | "metadata": {}, 69 | "outputs": [], 70 | "source": [ 71 | "def get_preprocessing_data(data):\n", 72 | " data.drop(['date', 'company', 'url'], axis = 1, inplace =True)\n", 73 | " \n", 74 | " category_mapping = {\n", 75 | " '경제' : 0,\n", 76 | " '정치' : 1,\n", 77 | " 'IT과학' : 2\n", 78 | " }\n", 79 | "\n", 80 | " data['category'] = data['category'].map(category_mapping)\n", 81 | " data['title_content'] = data['title'] + \" \" + data['content']\n", 82 | " data.drop(['title', 'content'], axis = 1, inplace = True)\n", 83 | " return data" 84 | ] 85 | }, 86 | { 87 | "cell_type": "code", 88 | "execution_count": 17, 89 | "metadata": {}, 90 | "outputs": [], 91 | "source": [ 92 | "def make_doc2vec_data(data, column, t_document=False):\n", 93 | " data_doc = []\n", 94 | " for tag, doc in zip(data.index, data[column]):\n", 95 | " doc = doc.split(\" \")\n", 96 | " data_doc.append(([tag], doc))\n", 97 | " if t_document:\n", 98 | " data = [TaggedDocument(words=text, tags=tag) for tag, text in data_doc]\n", 99 | " return data\n", 100 | " else:\n", 101 | " return data_doc" 102 | ] 103 | }, 104 | { 105 | "cell_type": "code", 106 | "execution_count": 18, 107 | "metadata": {}, 108 | "outputs": [], 109 | "source": [ 110 | "def get_recommened_contents(user, data_doc, model):\n", 111 | " scores = []\n", 112 | "\n", 113 | " for tags, text in data_doc:\n", 114 | " trained_doc_vec = model.docvecs[tags[0]]\n", 115 | " scores.append(cosine_similarity(user.reshape(1, -1), trained_doc_vec.reshape(1, -1)))\n", 116 | "\n", 117 | " scores = np.array(scores).reshape(-1)\n", 118 | " scores = np.argsort(-scores)[:5]\n", 119 | " \n", 120 | " return data.loc[scores, :]" 121 | ] 122 | }, 123 | { 124 | "cell_type": "code", 125 | "execution_count": 19, 126 | "metadata": {}, 127 | "outputs": [], 
128 | "source": [ 129 | "def make_user_embedding(index_list, data_doc, model):\n", 130 | " user = []\n", 131 | " user_embedding = []\n", 132 | " for i in index_list:\n", 133 | " user.append(data_doc[i][0][0])\n", 134 | " for i in user:\n", 135 | " user_embedding.append(model.docvecs[i])\n", 136 | " user_embedding = np.array(user_embedding)\n", 137 | " user = np.mean(user_embedding, axis = 0)\n", 138 | " return user" 139 | ] 140 | }, 141 | { 142 | "cell_type": "code", 143 | "execution_count": 20, 144 | "metadata": {}, 145 | "outputs": [], 146 | "source": [ 147 | "def view_user_history(data):\n", 148 | " print(data[['category', 'title_content']])" 149 | ] 150 | }, 151 | { 152 | "cell_type": "code", 153 | "execution_count": 21, 154 | "metadata": {}, 155 | "outputs": [], 156 | "source": [ 157 | "data = get_data()" 158 | ] 159 | }, 160 | { 161 | "cell_type": "code", 162 | "execution_count": 22, 163 | "metadata": {}, 164 | "outputs": [ 165 | { 166 | "data": { 167 | "text/html": [ 168 | "
" 225 | ], 226 | "text/plain": [ 227 | " category title_content \\\n", 228 | "0 0 포스코ICT 4분기부터 실적 본격 개선 비용효율화 본격화 그룹 계열사와의 시너지 발... \n", 229 | "1 0 위메프 리퍼데이 바자회 수익금 소외계층 지원 위메프가 리퍼데이 바자회 ‘아름다운가게... \n", 230 | "2 0 호반건설 광주시 ‘계림1구역 정비사업’ 수주…공사비 2700억원 최고 33층 총 9... \n", 231 | "3 0 성동조선해양 HSG중공업과 인수합병 MOU 체결 HSG중공업이 성동조선해양 인수를 ... \n", 232 | "4 0 2019년 10월 프랜차이즈 정보공개서 79개 신규등록 서울·경기·인천 지역 서울시... \n", 233 | "\n", 234 | " mecab_tok \n", 235 | "0 포스코 ICT 분기 실 본격 개선 비용 효율 본격화 그룹 계열사 시너지 발생 헤럴드... \n", 236 | "1 위메프 리퍼 데이 바자회 수익금 소외 계층 지원 위메프 리퍼 데이 바자회 가게 위메... \n", 237 | "2 호반건설 광주시 계림 구역 정비 사업 수주 공사비 원 최고 개동 아파트 가구 규모 ... \n", 238 | "3 성동 조선 해양 중공업 인수 합병 체결 중공업 성동 조선 해양 인수 위한 양해 각서... \n", 239 | "4 년 월 프랜차이즈 정보 공개 개 신규 등록 서울 경기 인천 지역 서울시 경기도 새롭... " 240 | ] 241 | }, 242 | "execution_count": 22, 243 | "metadata": {}, 244 | "output_type": "execute_result" 245 | } 246 | ], 247 | "source": [ 248 | "data.head()" 249 | ] 250 | }, 251 | { 252 | "cell_type": "code", 253 | "execution_count": 23, 254 | "metadata": {}, 255 | "outputs": [], 256 | "source": [ 257 | "data_doc_title_content_tag = make_doc2vec_data(data, 'title_content', t_document=True)\n", 258 | "data_doc_title_content = make_doc2vec_data(data, 'title_content')\n", 259 | "data_doc_tok_tag = make_doc2vec_data(data, 'mecab_tok', t_document=True)\n", 260 | "data_doc_tok = make_doc2vec_data(data, 'mecab_tok')" 261 | ] 262 | }, 263 | { 264 | "cell_type": "markdown", 265 | "metadata": {}, 266 | "source": [ 267 | "# make doc2vec models" 268 | ] 269 | }, 270 | { 271 | "cell_type": "code", 272 | "execution_count": 24, 273 | "metadata": {}, 274 | "outputs": [], 275 | "source": [ 276 | "make_doc2vec_models(data_doc_title_content_tag, tok=False)" 277 | ] 278 | }, 279 | { 280 | "cell_type": "code", 281 | "execution_count": 25, 282 | "metadata": {}, 283 | "outputs": [], 284 | "source": [ 285 | "make_doc2vec_models(data_doc_tok_tag, tok=True)" 286 | ] 287 | }, 288 | { 289 | "cell_type": "markdown", 290 | "metadata": {}, 291 | 
"source": [ 292 | "# load doc2vec models" 293 | ] 294 | }, 295 | { 296 | "cell_type": "code", 297 | "execution_count": 26, 298 | "metadata": {}, 299 | "outputs": [], 300 | "source": [ 301 | "model_title_content = Doc2Vec.load('./datas/False_news_model.doc2vec')" 302 | ] 303 | }, 304 | { 305 | "cell_type": "code", 306 | "execution_count": 27, 307 | "metadata": {}, 308 | "outputs": [], 309 | "source": [ 310 | "model_tok = Doc2Vec.load('./datas/True_news_model.doc2vec')" 311 | ] 312 | }, 313 | { 314 | "cell_type": "code", 315 | "execution_count": 49, 316 | "metadata": {}, 317 | "outputs": [ 318 | { 319 | "name": "stdout", 320 | "output_type": "stream", 321 | "text": [ 322 | " category title_content\n", 323 | "510 0 SK 조직 차원 증거인멸 없어 반박문 제출 LG화학과 영업비밀 침해소송을 벌이고 있...\n", 324 | "4187 0 제이엘케이인스펙션 공모가 9000원 확정 김동민 대표 적극적인 IR·주주친화 정책 ...\n", 325 | "213 0 제재 대신 자율개선 유도하는 금감원 외환법규 위반 5개 은행 대상 첫 적용 금융감독...\n", 326 | "696 0 동국제약 ‘마데카솔’ 소비자가 가장 추천하는 브랜드 동국제약 대표이사 오흥주 마데카...\n", 327 | "3410 0 호반그룹 정기 인사…최승남 대표 총괄부회장 신규 선임 한국경제TV 신인규 기자 호반...\n" 328 | ] 329 | } 330 | ], 331 | "source": [ 332 | "user_category_1 = data.loc[random.sample(data.loc[data.category == 0, :].index.values.tolist(), 5), :] #경제\n", 333 | "view_user_history(user_category_1)" 334 | ] 335 | }, 336 | { 337 | "cell_type": "code", 338 | "execution_count": 45, 339 | "metadata": {}, 340 | "outputs": [ 341 | { 342 | "name": "stdout", 343 | "output_type": "stream", 344 | "text": [ 345 | " category title_content\n", 346 | "10671 1 “3대 친문 농단” 국조 요구서 제출… 靑 매섭게 몰아치는 야권 자유한국당 곽상도ㆍ...\n", 347 | "6194 1 黃 다시 일어나 끝까지 가겠다… 공수처·선거법 반드시 저지 黃 文정부 3대 게이트 ...\n", 348 | "6825 1 미 대사 文대통령 겨냥 종북 좌파에 둘러싸여... 
9월 여야 의원들 만나 위험 발언...\n", 349 | "8533 1 명소·맛집 즐긴 아세안 정상…일정보다 핫했던 부산 나들이 2박 3일간 경찰 에스코트...\n", 350 | "6236 1 민주 한국당 필리버스터 철회 없으면 정기국회서 41로 안건 처리 비공개 최고위원회…...\n" 351 | ] 352 | } 353 | ], 354 | "source": [ 355 | "user_category_2 = data.loc[random.sample(data.loc[data.category == 1, :].index.values.tolist(), 5), :] #정치\n", 356 | "view_user_history(user_category_2)" 357 | ] 358 | }, 359 | { 360 | "cell_type": "code", 361 | "execution_count": 18, 362 | "metadata": {}, 363 | "outputs": [ 364 | { 365 | "name": "stdout", 366 | "output_type": "stream", 367 | "text": [ 368 | " category title_content\n", 369 | "14551 2 ‘라인프렌즈’ ‘브롤스타즈’와 함께 ‘브라운앤프렌즈’ 팝업스토어 진행 엑스포츠뉴스닷...\n", 370 | "17055 2 올해 1등 KT인상 대상 ‘5G 경쟁력 강화 TF’ 디지털데일리 최민지기자 KT 대...\n", 371 | "17824 2 DID 시장 열린다 2.은행계좌부터 주식까지 블록체인으로 인증 OK 규제 샌드박스 ...\n", 372 | "12625 2 인하대병원 제로페이 도입 한국간편결제진흥원 이사장 윤완수 은 인하대병원에서 제로페이...\n", 373 | "18243 2 이슈 19금 신작 미소녀 RPG 방치소녀 학원편 신규 캐릭터 조운과 문추 추가 본 ...\n" 374 | ] 375 | } 376 | ], 377 | "source": [ 378 | "user_category_3 = data.loc[random.sample(data.loc[data.category == 2, :].index.values.tolist(), 5), :] #IT 과학\n", 379 | "view_user_history(user_category_3)" 380 | ] 381 | }, 382 | { 383 | "cell_type": "code", 384 | "execution_count": 50, 385 | "metadata": {}, 386 | "outputs": [], 387 | "source": [ 388 | "user_1 = make_user_embedding(user_category_1.index.values.tolist(), data_doc_title_content, model_title_content) # 경제\n", 389 | "user_2 = make_user_embedding(user_category_2.index.values.tolist(), data_doc_title_content, model_title_content) # 정치\n", 390 | "user_3 = make_user_embedding(user_category_3.index.values.tolist(), data_doc_title_content, model_title_content) # IT과학" 391 | ] 392 | }, 393 | { 394 | "cell_type": "code", 395 | "execution_count": 51, 396 | "metadata": {}, 397 | "outputs": [ 398 | { 399 | "data": { 400 | "text/html": [ 401 | "
" 452 | ], 453 | "text/plain": [ 454 | " category title_content\n", 455 | "5226 0 삼양그룹 정기 임원인사 실시…“성과주의 실현” 승진 10명 보직변경 5명 김지섭 삼...\n", 456 | "3933 0 동국제약 바이오의약품 위탁 개발·생산 사업 진출 서울 연합뉴스 동국제약은 지난 달 ...\n", 457 | "5251 0 삼양그룹 정기임원인사 단행...김지섭 부사장 승진 왼쪽부터 삼양사 식자재유통BU장 ...\n", 458 | "3329 0 호반그룹 총괄부회장에 최승남 대표 선임 2일 정기 임원인사 단행.. 각 계열사 대표...\n", 459 | "13015 2 동국제약 바이오의약품 위탁 개발·생산 사업 진출 서울 연합뉴스 동국제약은 지난 달 ..." 460 | ] 461 | }, 462 | "execution_count": 51, 463 | "metadata": {}, 464 | "output_type": "execute_result" 465 | } 466 | ], 467 | "source": [ 468 | "result = get_recommened_contents(user_1, data_doc_title_content, model_title_content)\n", 469 | "pd.DataFrame(result.loc[:, ['category', 'title_content']])" 470 | ] 471 | }, 472 | { 473 | "cell_type": "code", 474 | "execution_count": 48, 475 | "metadata": {}, 476 | "outputs": [ 477 | { 478 | "data": { 479 | "text/html": [ 480 | "
" 531 | ], 532 | "text/plain": [ 533 | " category title_content\n", 534 | "8228 1 노영민 비서실장과 정의용 국가안보실장 서울 연합뉴스 한상균 기자 청와대 노영민 비서...\n", 535 | "6348 1 한국당 친문농단 게이트 국정조사 요구서 제출키로 울산에 백원우팀 파견…靑 고래고기 ...\n", 536 | "6320 1 한국당 친문농단 게이트 국정조사 요구서 제출하기로 자유한국당 곽상도 의원이 28일 ...\n", 537 | "6554 1 박원순 시장 녹색교통지역 5등급차량 운행제한 상황실 방문 서울 뉴시스 박주성 기자 ...\n", 538 | "6842 1 단식 종료 황교안 내일부터 한국당 당무 복귀 서울 연합뉴스 홍정규 기자 단식농성을 ..." 539 | ] 540 | }, 541 | "execution_count": 48, 542 | "metadata": {}, 543 | "output_type": "execute_result" 544 | } 545 | ], 546 | "source": [ 547 | "result = get_recommened_contents(user_2, data_doc_title_content, model_title_content)\n", 548 | "pd.DataFrame(result.loc[:, ['category', 'title_content']])" 549 | ] 550 | }, 551 | { 552 | "cell_type": "code", 553 | "execution_count": 22, 554 | "metadata": {}, 555 | "outputs": [ 556 | { 557 | "data": { 558 | "text/html": [ 559 | "
" 610 | ], 611 | "text/plain": [ 612 | " category title_content\n", 613 | "13073 2 브롤스타즈 라인프렌즈 오리지널 캐릭터 등장 슈퍼셀은 라인프렌즈와 모바일 슈팅 게임 ...\n", 614 | "16573 2 넥슨 글로벌 멀티 플랫폼 프로젝트 ‘카트라이더 드리프트’ 테스트 넥슨 대표 이정헌 ...\n", 615 | "11608 1 사진한국당 규탄하는 야3당 대표 머니투데이 홍봉진 기자 바른미래당 손학규 정의당 심...\n", 616 | "16536 2 카트라이더 드리프트 9일까지 비공개 테스트 진행 넥슨은 멀티 플랫폼 프로젝트 카트라...\n", 617 | "13387 2 슈퍼셀 라인프렌즈와 ‘브롤스타즈’ IP 라이선싱 계약 브롤스타즈 캐릭터 상품 개발 ..." 618 | ] 619 | }, 620 | "execution_count": 22, 621 | "metadata": {}, 622 | "output_type": "execute_result" 623 | } 624 | ], 625 | "source": [ 626 | "result = get_recommened_contents(user_3, data_doc_title_content, model_title_content)\n", 627 | "pd.DataFrame(result.loc[:, ['category', 'title_content']])" 628 | ] 629 | }, 630 | { 631 | "cell_type": "code", 632 | "execution_count": null, 633 | "metadata": {}, 634 | "outputs": [], 635 | "source": [] 636 | }, 637 | { 638 | "cell_type": "markdown", 639 | "metadata": {}, 640 | "source": [ 641 | "# 형태소 분석 후 결과" 642 | ] 643 | }, 644 | { 645 | "cell_type": "code", 646 | "execution_count": 52, 647 | "metadata": {}, 648 | "outputs": [], 649 | "source": [ 650 | "user_1 = make_user_embedding(user_category_1.index.values.tolist(), data_doc_tok, model_tok) # 경제\n", 651 | "user_2 = make_user_embedding(user_category_2.index.values.tolist(), data_doc_tok, model_tok) # 정치\n", 652 | "user_3 = make_user_embedding(user_category_3.index.values.tolist(), data_doc_tok, model_tok) # IT과학" 653 | ] 654 | }, 655 | { 656 | "cell_type": "code", 657 | "execution_count": 53, 658 | "metadata": {}, 659 | "outputs": [ 660 | { 661 | "data": { 662 | "text/html": [ 663 | "
" 714 | ], 715 | "text/plain": [ 716 | " category title_content\n", 717 | "11226 1 개회사 하는 조성욱 공정거래위원장 서울 연합뉴스 임헌정 기자 조성욱 공정거래위원장이...\n", 718 | "11702 1 국민의례하는 문재인 대통령 문재인 대통령이 3일 청와대에서 열린 국가기후환경회의 격...\n", 719 | "4683 0 정관장 알파프로젝트와 상담하세요 서울 뉴시스 박주성 기자 KGC인삼공사가 2일 오전...\n", 720 | "6391 1 미래를 향한 전진 4.0 창당발기인대회에서 박수치는 이언주 서울 연합뉴스 하사헌 기...\n", 721 | "11325 1 국민의례 하는 이낙연 총리 서울 뉴스1 오대일 기자 이낙연 국무총리가 3일 오후 서..." 722 | ] 723 | }, 724 | "execution_count": 53, 725 | "metadata": {}, 726 | "output_type": "execute_result" 727 | } 728 | ], 729 | "source": [ 730 | "result = get_recommened_contents(user_1, data_doc_tok, model_tok)\n", 731 | "pd.DataFrame(result.loc[:, ['category', 'title_content']])" 732 | ] 733 | }, 734 | { 735 | "cell_type": "code", 736 | "execution_count": 54, 737 | "metadata": {}, 738 | "outputs": [ 739 | { 740 | "data": { 741 | "text/html": [ 742 | "
" 793 | ], 794 | "text/plain": [ 795 | " category title_content\n", 796 | "4794 0 권태신 부회장 주한아르헨티나 대사 접견 서울 뉴시스 권태신 전경련 부회장이 2일 서...\n", 797 | "8434 1 육해공 철통 방어 한·아세안 모터케이드 행렬 부산 연합뉴스 지난달 부산에서 한·아세...\n", 798 | "733 0 부산항 수출입 화물 가득 부산 연합뉴스 조정호 기자 1일 부산항 신선대부두에 수출입...\n", 799 | "4846 0 인사 한화토탈 한화토탈은 2일 임원 승진 인사를 발표했다. 승진자는 상무 1명 상무...\n", 800 | "2908 0 인사말하는 김기문 회장 서울 뉴스1 허경 기자 김기문 중소기업중앙회장이 2일 서울 ..." 801 | ] 802 | }, 803 | "execution_count": 54, 804 | "metadata": {}, 805 | "output_type": "execute_result" 806 | } 807 | ], 808 | "source": [ 809 | "result = get_recommened_contents(user_2, data_doc_tok, model_tok)\n", 810 | "pd.DataFrame(result.loc[:, ['category', 'title_content']])" 811 | ] 812 | }, 813 | { 814 | "cell_type": "code", 815 | "execution_count": 55, 816 | "metadata": {}, 817 | "outputs": [ 818 | { 819 | "data": { 820 | "text/html": [ 821 | "
" 872 | ], 873 | "text/plain": [ 874 | " category title_content\n", 875 | "10471 1 한국 부임한 도미타 고지 신임 주한 일본대사 서울 연합뉴스 임헌정 기자 도미타 고지...\n", 876 | "10517 1 입국하는 도미타 고지 신임 주한일본대사 CBS노컷뉴스 이한형 기자 도미타 고지 신임...\n", 877 | "11084 1 청와대 천막농성장 앞에서 열린 한국당 최고위원회의 서울 뉴스1 박세연 기자 3일 오...\n", 878 | "8353 1 수보회의 참석하는 문 대통령 서울 연합뉴스 한상균 기자 문재인 대통령이 2일 오후 ...\n", 879 | "16157 2 Tech BIZ AI로 블루·화이트칼라 넘어 뉴칼라 일자리 만들 것 데이비드 반스..." 880 | ] 881 | }, 882 | "execution_count": 55, 883 | "metadata": {}, 884 | "output_type": "execute_result" 885 | } 886 | ], 887 | "source": [ 888 | "result = get_recommened_contents(user_3, data_doc_tok, model_tok)\n", 889 | "pd.DataFrame(result.loc[:, ['category', 'title_content']])" 890 | ] 891 | }, 892 | { 893 | "cell_type": "markdown", 894 | "metadata": {}, 895 | "source": [ 896 | "The results after morphological analysis show that performance is not very good." 897 | ] 898 | }, 899 | { 900 | "cell_type": "code", 901 | "execution_count": null, 902 | "metadata": {}, 903 | "outputs": [], 904 | "source": [] 905 | }, 906 | { 907 | "cell_type": "code", 908 | "execution_count": null, 909 | "metadata": {}, 910 | "outputs": [], 911 | "source": [] 912 | }, 913 | { 914 | "cell_type": "code", 915 | "execution_count": null, 916 | "metadata": {}, 917 | "outputs": [], 918 | "source": [] 919 | }, 920 | { 921 | "cell_type": "code", 922 | "execution_count": null, 923 | "metadata": {}, 924 | "outputs": [], 925 | "source": [] 926 | }, 927 | { 928 | "cell_type": "code", 929 | "execution_count": null, 930 | "metadata": {}, 931 | "outputs": [], 932 | "source": [] 933 | }, 934 | { 935 | "cell_type": "code", 936 | "execution_count": null, 937 | "metadata": {}, 938 | "outputs": [], 939 | "source": [] 940 | } 941 | ], 942 | "metadata": { 943 | "kernelspec": { 944 | "display_name": "Python 3", 945 | "language": "python", 946 | "name": "python3" 947 | }, 948 | "language_info": { 949 | "codemirror_mode": { 950 | "name": "ipython", 951 | "version": 3 952 | }, 953 | "file_extension": ".py", 954 | "mimetype": "text/x-python", 955 | "name": 
"python", 956 | "nbconvert_exporter": "python", 957 | "pygments_lexer": "ipython3", 958 | "version": "3.8.5" 959 | } 960 | }, 961 | "nbformat": 4, 962 | "nbformat_minor": 2 963 | } 964 | -------------------------------------------------------------------------------- /006. deep learning recommender system.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Blog post\n", 8 | "\n", 9 | "An explanation of this material is posted on the blog below.\n", 10 | "- https://lsjsj92.tistory.com/577" 11 | ] 12 | }, 13 | { 14 | "cell_type": "code", 15 | "execution_count": 1, 16 | "metadata": {}, 17 | "outputs": [ 18 | { 19 | "name": "stderr", 20 | "output_type": "stream", 21 | "text": [ 22 | "Using TensorFlow backend.\n", 23 | "d:\\anaconda3\\envs\\soojin\\lib\\site-packages\\tensorflow\\python\\framework\\dtypes.py:493: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.\n", 24 | " _np_qint8 = np.dtype([(\"qint8\", np.int8, 1)])\n", 25 | "d:\\anaconda3\\envs\\soojin\\lib\\site-packages\\tensorflow\\python\\framework\\dtypes.py:494: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.\n", 26 | " _np_quint8 = np.dtype([(\"quint8\", np.uint8, 1)])\n", 27 | "d:\\anaconda3\\envs\\soojin\\lib\\site-packages\\tensorflow\\python\\framework\\dtypes.py:495: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.\n", 28 | " _np_qint16 = np.dtype([(\"qint16\", np.int16, 1)])\n", 29 | "d:\\anaconda3\\envs\\soojin\\lib\\site-packages\\tensorflow\\python\\framework\\dtypes.py:496: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future 
version of numpy, it will be understood as (type, (1,)) / '(1,)type'.\n", 30 | " _np_quint16 = np.dtype([(\"quint16\", np.uint16, 1)])\n", 31 | "d:\\anaconda3\\envs\\soojin\\lib\\site-packages\\tensorflow\\python\\framework\\dtypes.py:497: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.\n", 32 | " _np_qint32 = np.dtype([(\"qint32\", np.int32, 1)])\n", 33 | "d:\\anaconda3\\envs\\soojin\\lib\\site-packages\\tensorflow\\python\\framework\\dtypes.py:502: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.\n", 34 | " np_resource = np.dtype([(\"resource\", np.ubyte, 1)])\n" 35 | ] 36 | } 37 | ], 38 | "source": [ 39 | "import pandas as pd\n", 40 | "import pickle\n", 41 | "import numpy as np\n", 42 | "import matplotlib.pyplot as plt\n", 43 | "import seaborn as sns\n", 44 | "import random\n", 45 | "from sklearn.model_selection import train_test_split\n", 46 | "from keras.layers import Input, Dense, Concatenate, concatenate, Dropout, Reshape, dot, Dot\n", 47 | "from keras.models import Model\n", 48 | "from keras.callbacks import ModelCheckpoint, EarlyStopping\n", 49 | "from keras import backend as K\n", 50 | "from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report, confusion_matrix, mean_squared_error\n", 51 | "from gensim.models.doc2vec import Doc2Vec, TaggedDocument" 52 | ] 53 | }, 54 | { 55 | "cell_type": "code", 56 | "execution_count": 2, 57 | "metadata": {}, 58 | "outputs": [], 59 | "source": [ 60 | "X = np.load('./datas/X_data.npy')\n", 61 | "y = np.load('./datas/y_data.npy') " 62 | ] 63 | }, 64 | { 65 | "cell_type": "code", 66 | "execution_count": 7, 67 | "metadata": {}, 68 | "outputs": [], 69 | "source": [ 70 | "def keras_model():\n", 71 | "    user_vector_input = Input(shape=(128, ))\n", 72 | "    \n", 
73 | "    dense_u_v = Dense(128, activation = 'relu')(user_vector_input)\n", 74 | "    dense_u_v = Dropout(0.5)(dense_u_v)\n", 75 | "    dense_u_v = Dense(64, activation = 'relu')(dense_u_v)\n", 76 | "    dense_u_v = Dropout(0.5)(dense_u_v)\n", 77 | "    dense_u_v = Dense(32, activation = 'relu')(dense_u_v)\n", 78 | "    dense_u_v = Dropout(0.5)(dense_u_v)\n", 79 | "    dense_u_v = Dense(16, activation = 'relu')(dense_u_v)\n", 80 | "    dense_u_v = Dense(1, activation = 'sigmoid')(dense_u_v)\n", 81 | "    \n", 82 | "    model = Model(inputs=user_vector_input, outputs=dense_u_v)\n", 83 | "    model.compile(optimizer = 'Adam', loss='binary_crossentropy', metrics=['acc'])\n", 84 | "    return model\n" 85 | ] 86 | }, 87 | { 88 | "cell_type": "code", 89 | "execution_count": 8, 90 | "metadata": {}, 91 | "outputs": [], 92 | "source": [ 93 | "model = keras_model()" 94 | ] 95 | }, 96 | { 97 | "cell_type": "code", 98 | "execution_count": 13, 99 | "metadata": {}, 100 | "outputs": [], 101 | "source": [ 102 | "X_train, X_test, y_train, y_test = train_test_split(X1, y, test_size = 0.2, random_state=1)" 103 | ] 104 | }, 105 | { 106 | "cell_type": "code", 107 | "execution_count": 14, 108 | "metadata": {}, 109 | "outputs": [], 110 | "source": [ 111 | "modelpath = './datas/recommender.model'\n", 112 | "checkpointer = ModelCheckpoint(filepath = modelpath, monitor='val_loss', verbose=1, save_best_only=True)\n", 113 | "early_stop = EarlyStopping(monitor='val_loss', patience=3)" 114 | ] 115 | }, 116 | { 117 | "cell_type": "code", 118 | "execution_count": 15, 119 | "metadata": {}, 120 | "outputs": [ 121 | { 122 | "name": "stdout", 123 | "output_type": "stream", 124 | "text": [ 125 | "Train on 72457 samples, validate on 18115 samples\n", 126 | "Epoch 1/50\n", 127 | "72457/72457 [==============================] - 15s 207us/step - loss: 0.6417 - acc: 0.6030 - val_loss: 0.6135 - val_acc: 0.6306\n", 128 | "\n", 129 | "Epoch 00001: val_loss improved from inf to 0.61350, saving model to ./datas/recommender.model\n", 130 | "Epoch 
2/50\n", 131 | "72457/72457 [==============================] - 7s 94us/step - loss: 0.6071 - acc: 0.6261 - val_loss: 0.5916 - val_acc: 0.6601\n", 132 | "\n", 133 | "Epoch 00002: val_loss improved from 0.61350 to 0.59162, saving model to ./datas/recommender.model\n", 134 | "Epoch 3/50\n", 135 | "72457/72457 [==============================] - 7s 102us/step - loss: 0.5899 - acc: 0.6461 - val_loss: 0.5665 - val_acc: 0.6895\n", 136 | "\n", 137 | "Epoch 00003: val_loss improved from 0.59162 to 0.56651, saving model to ./datas/recommender.model\n", 138 | "Epoch 4/50\n", 139 | "72457/72457 [==============================] - 8s 104us/step - loss: 0.5710 - acc: 0.6716 - val_loss: 0.5612 - val_acc: 0.7101\n", 140 | "\n", 141 | "Epoch 00004: val_loss improved from 0.56651 to 0.56123, saving model to ./datas/recommender.model\n", 142 | "Epoch 5/50\n", 143 | "72457/72457 [==============================] - 7s 96us/step - loss: 0.5516 - acc: 0.6963 - val_loss: 0.5512 - val_acc: 0.7243\n", 144 | "\n", 145 | "Epoch 00005: val_loss improved from 0.56123 to 0.55125, saving model to ./datas/recommender.model\n", 146 | "Epoch 6/50\n", 147 | "72457/72457 [==============================] - 7s 95us/step - loss: 0.5372 - acc: 0.7109 - val_loss: 0.5405 - val_acc: 0.7243\n", 148 | "\n", 149 | "Epoch 00006: val_loss improved from 0.55125 to 0.54048, saving model to ./datas/recommender.model\n", 150 | "Epoch 7/50\n", 151 | "72457/72457 [==============================] - 7s 94us/step - loss: 0.5244 - acc: 0.7216 - val_loss: 0.5471 - val_acc: 0.7110\n", 152 | "\n", 153 | "Epoch 00007: val_loss did not improve\n", 154 | "Epoch 8/50\n", 155 | "72457/72457 [==============================] - 8s 112us/step - loss: 0.5208 - acc: 0.7257 - val_loss: 0.5459 - val_acc: 0.7154\n", 156 | "\n", 157 | "Epoch 00008: val_loss did not improve\n", 158 | "Epoch 9/50\n", 159 | "72457/72457 [==============================] - 7s 98us/step - loss: 0.5111 - acc: 0.7340 - val_loss: 0.5391 - val_acc: 0.7200\n", 160 | 
"\n", 161 | "Epoch 00009: val_loss improved from 0.54048 to 0.53907, saving model to ./datas/recommender.model\n", 162 | "Epoch 10/50\n", 163 | "72457/72457 [==============================] - 7s 96us/step - loss: 0.5050 - acc: 0.7391 - val_loss: 0.5310 - val_acc: 0.7292\n", 164 | "\n", 165 | "Epoch 00010: val_loss improved from 0.53907 to 0.53100, saving model to ./datas/recommender.model\n", 166 | "Epoch 11/50\n", 167 | "72457/72457 [==============================] - 7s 96us/step - loss: 0.4991 - acc: 0.7428 - val_loss: 0.5285 - val_acc: 0.7261\n", 168 | "\n", 169 | "Epoch 00011: val_loss improved from 0.53100 to 0.52849, saving model to ./datas/recommender.model\n", 170 | "Epoch 12/50\n", 171 | "72457/72457 [==============================] - 7s 95us/step - loss: 0.4948 - acc: 0.7484 - val_loss: 0.5502 - val_acc: 0.7075\n", 172 | "\n", 173 | "Epoch 00012: val_loss did not improve\n", 174 | "Epoch 13/50\n", 175 | "72457/72457 [==============================] - 8s 107us/step - loss: 0.4910 - acc: 0.7497 - val_loss: 0.5268 - val_acc: 0.7344\n", 176 | "\n", 177 | "Epoch 00013: val_loss improved from 0.52849 to 0.52680, saving model to ./datas/recommender.model\n", 178 | "Epoch 14/50\n", 179 | "72457/72457 [==============================] - 7s 98us/step - loss: 0.4888 - acc: 0.7507 - val_loss: 0.5283 - val_acc: 0.7234\n", 180 | "\n", 181 | "Epoch 00014: val_loss did not improve\n", 182 | "Epoch 15/50\n", 183 | "72457/72457 [==============================] - 7s 96us/step - loss: 0.4841 - acc: 0.7538 - val_loss: 0.5346 - val_acc: 0.7189\n", 184 | "\n", 185 | "Epoch 00015: val_loss did not improve\n", 186 | "Epoch 16/50\n", 187 | "72457/72457 [==============================] - 8s 106us/step - loss: 0.4808 - acc: 0.7568 - val_loss: 0.5545 - val_acc: 0.6986\n", 188 | "\n", 189 | "Epoch 00016: val_loss did not improve\n" 190 | ] 191 | }, 192 | { 193 | "data": { 194 | "text/plain": [ 195 | "" 196 | ] 197 | }, 198 | "execution_count": 15, 199 | "metadata": {}, 200 | 
"output_type": "execute_result" 201 | } 202 | ], 203 | "source": [ 204 | "model.fit(X_train, y_train, validation_split = 0.2, epochs=50, batch_size=64, callbacks=[early_stop, checkpointer])" 205 | ] 206 | }, 207 | { 208 | "cell_type": "code", 209 | "execution_count": 16, 210 | "metadata": {}, 211 | "outputs": [], 212 | "source": [ 213 | "pred = model.predict(X_test)" 214 | ] 215 | }, 216 | { 217 | "cell_type": "code", 218 | "execution_count": 17, 219 | "metadata": {}, 220 | "outputs": [], 221 | "source": [ 222 | "pred_label = [1 if i > 0.5 else 0 for i in pred]" 223 | ] 224 | }, 225 | { 226 | "cell_type": "code", 227 | "execution_count": 18, 228 | "metadata": {}, 229 | "outputs": [ 230 | { 231 | "name": "stdout", 232 | "output_type": "stream", 233 | "text": [ 234 | "0.7108376715570278\n", 235 | "0.5794753086419753\n", 236 | "0.9192166462668299\n", 237 | "0.7031753742878594\n" 238 | ] 239 | } 240 | ], 241 | "source": [ 242 | "print(f1_score(y_test, pred_label))\n", 243 | "print(precision_score(y_test, pred_label))\n", 244 | "print(recall_score(y_test, pred_label))\n", 245 | "print(accuracy_score(y_test, pred_label))" 246 | ] 247 | }, 248 | { 249 | "cell_type": "code", 250 | "execution_count": null, 251 | "metadata": {}, 252 | "outputs": [], 253 | "source": [] 254 | } 255 | ], 256 | "metadata": { 257 | "kernelspec": { 258 | "display_name": "Python 3", 259 | "language": "python", 260 | "name": "python3" 261 | }, 262 | "language_info": { 263 | "codemirror_mode": { 264 | "name": "ipython", 265 | "version": 3 266 | }, 267 | "file_extension": ".py", 268 | "mimetype": "text/x-python", 269 | "name": "python", 270 | "nbconvert_exporter": "python", 271 | "pygments_lexer": "ipython3", 272 | "version": "3.8.5" 273 | } 274 | }, 275 | "nbformat": 4, 276 | "nbformat_minor": 2 277 | } 278 | -------------------------------------------------------------------------------- /009_chatgpt_recsys.ipynb: -------------------------------------------------------------------------------- 1 | { 
2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "01422061", 6 | "metadata": {}, 7 | "source": [ 8 | "# Blog\n", 9 | "- https://lsjsj92.tistory.com/657\n", 10 | "\n", 11 | "This is the code explained in the blog post above.\n", 12 | "\n", 13 | "# Data\n", 14 | "- https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset" 15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": 1, 20 | "id": "1925745b", 21 | "metadata": {}, 22 | "outputs": [ 23 | { 24 | "data": { 25 | "text/html": [ 26 | "\n", 27 | "\n" 32 | ], 33 | "text/plain": [ 34 | "" 35 | ] 36 | }, 37 | "metadata": {}, 38 | "output_type": "display_data" 39 | } 40 | ], 41 | "source": [ 42 | "from IPython.display import display, HTML\n", 43 | "display(HTML(data=\"\"\"\n", 44 | "\n", 49 | "\"\"\"))" 50 | ] 51 | }, 52 | { 53 | "cell_type": "code", 54 | "execution_count": 371, 55 | "id": "3e46dd11", 56 | "metadata": {}, 57 | "outputs": [], 58 | "source": [ 59 | "import requests\n", 60 | "\n", 61 | "import pandas as pd\n", 62 | "import numpy as np\n", 63 | "import copy\n", 64 | "import json\n", 65 | "\n", 66 | "from ast import literal_eval\n", 67 | "\n", 68 | "import torch\n", 69 | "from sentence_transformers import SentenceTransformer, util\n", 70 | "from transformers import AutoTokenizer, AutoModel\n", 71 | "from transformers import OwlViTProcessor, OwlViTForObjectDetection\n", 72 | "from transformers import pipeline\n", 73 | "from transformers import GPT2TokenizerFast\n", 74 | "from PIL import Image\n", 75 | "\n", 76 | "import pickle\n", 77 | "\n" 78 | ] 79 | }, 80 | { 81 | "cell_type": "code", 82 | "execution_count": 3, 83 | "id": "b5fea154", 84 | "metadata": {}, 85 | "outputs": [], 86 | "source": [ 87 | "import matplotlib.pyplot as plt\n", 88 | "from typing import List, Tuple, Dict\n", 89 | "\n", 90 | "import sklearn.datasets as datasets\n", 91 | "import sklearn.manifold as manifold" 92 | ] 93 | }, 94 | { 95 | "cell_type": "code", 96 | "execution_count": 4, 97 | "id": "ffd81bad", 98 | "metadata": {}, 99 | 
"outputs": [], 100 | "source": [ 101 | "import openai\n", 102 | "import os\n", 103 | "import sys\n", 104 | "from dotenv import load_dotenv\n", 105 | "\n", 106 | "load_dotenv() \n", 107 | "openai.api_key = os.getenv(\"OPENAI_API_KEY\")" 108 | ] 109 | }, 110 | { 111 | "cell_type": "code", 112 | "execution_count": 5, 113 | "id": "4a88f0a5", 114 | "metadata": {}, 115 | "outputs": [], 116 | "source": [ 117 | "cur_os = sys.platform" 118 | ] 119 | }, 120 | { 121 | "cell_type": "code", 122 | "execution_count": 6, 123 | "id": "dd02d945", 124 | "metadata": {}, 125 | "outputs": [], 126 | "source": [ 127 | "model_path = \"D:/github\" if cur_os.startswith('win') else None" 128 | ] 129 | }, 130 | { 131 | "cell_type": "markdown", 132 | "id": "755c7288", 133 | "metadata": {}, 134 | "source": [ 135 | "## Dataset" 136 | ] 137 | }, 138 | { 139 | "cell_type": "code", 140 | "execution_count": 7, 141 | "id": "widespread-trial", 142 | "metadata": {}, 143 | "outputs": [ 144 | { 145 | "name": "stdout", 146 | "output_type": "stream", 147 | "text": [ 148 | "(45466, 24)\n" 149 | ] 150 | }, 151 | { 152 | "data": { 153 | "text/html": [ 154 | "
" 320 | ], 321 | "text/plain": [ 322 | " adult belongs_to_collection budget \\\n", 323 | "0 False {'id': 10194, 'name': 'Toy Story Collection', ... 30000000 \n", 324 | "1 False NaN 65000000 \n", 325 | "2 False {'id': 119050, 'name': 'Grumpy Old Men Collect... 0 \n", 326 | "3 False NaN 16000000 \n", 327 | "4 False {'id': 96871, 'name': 'Father of the Bride Col... 0 \n", 328 | "\n", 329 | " genres \\\n", 330 | "0 [{'id': 16, 'name': 'Animation'}, {'id': 35, '... \n", 331 | "1 [{'id': 12, 'name': 'Adventure'}, {'id': 14, '... \n", 332 | "2 [{'id': 10749, 'name': 'Romance'}, {'id': 35, ... \n", 333 | "3 [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam... \n", 334 | "4 [{'id': 35, 'name': 'Comedy'}] \n", 335 | "\n", 336 | " homepage id imdb_id original_language \\\n", 337 | "0 http://toystory.disney.com/toy-story 862 tt0114709 en \n", 338 | "1 NaN 8844 tt0113497 en \n", 339 | "2 NaN 15602 tt0113228 en \n", 340 | "3 NaN 31357 tt0114885 en \n", 341 | "4 NaN 11862 tt0113041 en \n", 342 | "\n", 343 | " original_title \\\n", 344 | "0 Toy Story \n", 345 | "1 Jumanji \n", 346 | "2 Grumpier Old Men \n", 347 | "3 Waiting to Exhale \n", 348 | "4 Father of the Bride Part II \n", 349 | "\n", 350 | " overview ... release_date \\\n", 351 | "0 Led by Woody, Andy's toys live happily in his ... ... 1995-10-30 \n", 352 | "1 When siblings Judy and Peter discover an encha... ... 1995-12-15 \n", 353 | "2 A family wedding reignites the ancient feud be... ... 1995-12-22 \n", 354 | "3 Cheated on, mistreated and stepped on, the wom... ... 1995-12-22 \n", 355 | "4 Just when George Banks has recovered from his ... ... 1995-02-10 \n", 356 | "\n", 357 | " revenue runtime spoken_languages \\\n", 358 | "0 373554033 81.0 [{'iso_639_1': 'en', 'name': 'English'}] \n", 359 | "1 262797249 104.0 [{'iso_639_1': 'en', 'name': 'English'}, {'iso... 
\n", 360 | "2 0 101.0 [{'iso_639_1': 'en', 'name': 'English'}] \n", 361 | "3 81452156 127.0 [{'iso_639_1': 'en', 'name': 'English'}] \n", 362 | "4 76578911 106.0 [{'iso_639_1': 'en', 'name': 'English'}] \n", 363 | "\n", 364 | " status tagline \\\n", 365 | "0 Released NaN \n", 366 | "1 Released Roll the dice and unleash the excitement! \n", 367 | "2 Released Still Yelling. Still Fighting. Still Ready for... \n", 368 | "3 Released Friends are the people who let you be yourself... \n", 369 | "4 Released Just When His World Is Back To Normal... He's ... \n", 370 | "\n", 371 | " title video vote_average vote_count \n", 372 | "0 Toy Story False 7.7 5415 \n", 373 | "1 Jumanji False 6.9 2413 \n", 374 | "2 Grumpier Old Men False 6.5 92 \n", 375 | "3 Waiting to Exhale False 6.1 34 \n", 376 | "4 Father of the Bride Part II False 5.7 173 \n", 377 | "\n", 378 | "[5 rows x 24 columns]" 379 | ] 380 | }, 381 | "execution_count": 7, 382 | "metadata": {}, 383 | "output_type": "execute_result" 384 | } 385 | ], 386 | "source": [ 387 | "movies_metadata = pd.read_csv('./movie_meta/movies_metadata.csv', sep=\",\", dtype=str)\n", 388 | "print(movies_metadata.shape)\n", 389 | "movies_metadata.head()" 390 | ] 391 | }, 392 | { 393 | "cell_type": "code", 394 | "execution_count": 8, 395 | "id": "meaning-kingdom", 396 | "metadata": {}, 397 | "outputs": [ 398 | { 399 | "data": { 400 | "text/html": [ 401 | "
" 470 | ], 471 | "text/plain": [ 472 | " id genres \\\n", 473 | "0 862 [{'id': 16, 'name': 'Animation'}, {'id': 35, '... \n", 474 | "1 8844 [{'id': 12, 'name': 'Adventure'}, {'id': 14, '... \n", 475 | "2 15602 [{'id': 10749, 'name': 'Romance'}, {'id': 35, ... \n", 476 | "3 31357 [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam... \n", 477 | "4 11862 [{'id': 35, 'name': 'Comedy'}] \n", 478 | "\n", 479 | " title \\\n", 480 | "0 Toy Story \n", 481 | "1 Jumanji \n", 482 | "2 Grumpier Old Men \n", 483 | "3 Waiting to Exhale \n", 484 | "4 Father of the Bride Part II \n", 485 | "\n", 486 | " overview release_date \n", 487 | "0 Led by Woody, Andy's toys live happily in his ... 1995-10-30 \n", 488 | "1 When siblings Judy and Peter discover an encha... 1995-12-15 \n", 489 | "2 A family wedding reignites the ancient feud be... 1995-12-22 \n", 490 | "3 Cheated on, mistreated and stepped on, the wom... 1995-12-22 \n", 491 | "4 Just when George Banks has recovered from his ... 1995-02-10 " 492 | ] 493 | }, 494 | "execution_count": 8, 495 | "metadata": {}, 496 | "output_type": "execute_result" 497 | } 498 | ], 499 | "source": [ 500 | "movies_metadata = movies_metadata[['id', 'genres', 'title', 'overview', 'release_date']]\n", 501 | "movies_metadata.head()" 502 | ] 503 | }, 504 | { 505 | "cell_type": "code", 506 | "execution_count": 9, 507 | "id": "incomplete-botswana", 508 | "metadata": {}, 509 | "outputs": [ 510 | { 511 | "data": { 512 | "text/html": [ 513 | "
" 582 | ], 583 | "text/plain": [ 584 | " id genres \\\n", 585 | "0 862 [{'id': 16, 'name': 'Animation'}, {'id': 35, '... \n", 586 | "1 8844 [{'id': 12, 'name': 'Adventure'}, {'id': 14, '... \n", 587 | "2 15602 [{'id': 10749, 'name': 'Romance'}, {'id': 35, ... \n", 588 | "3 31357 [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam... \n", 589 | "4 11862 [{'id': 35, 'name': 'Comedy'}] \n", 590 | "\n", 591 | " title \\\n", 592 | "0 Toy Story \n", 593 | "1 Jumanji \n", 594 | "2 Grumpier Old Men \n", 595 | "3 Waiting to Exhale \n", 596 | "4 Father of the Bride Part II \n", 597 | "\n", 598 | " overview release_date \n", 599 | "0 Led by Woody, Andy's toys live happily in his ... 1995-10-30 \n", 600 | "1 When siblings Judy and Peter discover an encha... 1995-12-15 \n", 601 | "2 A family wedding reignites the ancient feud be... 1995-12-22 \n", 602 | "3 Cheated on, mistreated and stepped on, the wom... 1995-12-22 \n", 603 | "4 Just when George Banks has recovered from his ... 1995-02-10 " 604 | ] 605 | }, 606 | "execution_count": 9, 607 | "metadata": {}, 608 | "output_type": "execute_result" 609 | } 610 | ], 611 | "source": [ 612 | "movies_metadata['genres'] = movies_metadata['genres'].apply(literal_eval)\n", 613 | "movies_metadata.head()" 614 | ] 615 | }, 616 | { 617 | "cell_type": "markdown", 618 | "id": "4abd5ee3", 619 | "metadata": {}, 620 | "source": [ 621 | "## Selecting the columns to use" 622 | ] 623 | }, 624 | { 625 | "cell_type": "code", 626 | "execution_count": 10, 627 | "id": "collaborative-draft", 628 | "metadata": {}, 629 | "outputs": [], 630 | "source": [ 631 | "def get_genre(x):\n", 632 | "    names = [i['name'] for i in x]\n", 633 | "    if len(names) > 3:\n", 634 | "        names = names[:3]\n", 635 | "    return \" \".join(names)" 636 | ] 637 | }, 638 | { 639 | "cell_type": "code", 640 | "execution_count": 11, 641 | "id": "informed-banana", 642 | "metadata": {}, 643 | "outputs": [], 644 | "source": [ 645 | "movies_metadata['genres'] = movies_metadata['genres'].apply(lambda x : get_genre(x))" 646 
| ] 647 | }, 648 | { 649 | "cell_type": "code", 650 | "execution_count": 12, 651 | "id": "large-reality", 652 | "metadata": {}, 653 | "outputs": [ 654 | { 655 | "data": { 656 | "text/html": [ 657 | "
\n", 658 | "\n", 671 | "\n", 672 | " \n", 673 | " \n", 674 | " \n", 675 | " \n", 676 | " \n", 677 | " \n", 678 | " \n", 679 | " \n", 680 | " \n", 681 | " \n", 682 | " \n", 683 | " \n", 684 | " \n", 685 | " \n", 686 | " \n", 687 | " \n", 688 | " \n", 689 | " \n", 690 | " \n", 691 | " \n", 692 | " \n", 693 | " \n", 694 | " \n", 695 | " \n", 696 | " \n", 697 | " \n", 698 | " \n", 699 | " \n", 700 | " \n", 701 | " \n", 702 | " \n", 703 | " \n", 704 | " \n", 705 | " \n", 706 | " \n", 707 | " \n", 708 | " \n", 709 | " \n", 710 | " \n", 711 | " \n", 712 | " \n", 713 | " \n", 714 | " \n", 715 | " \n", 716 | " \n", 717 | " \n", 718 | " \n", 719 | " \n", 720 | " \n", 721 | " \n", 722 | " \n", 723 | " \n", 724 | "
\n", 725 | "
" 726 | ], 727 | "text/plain": [ 728 | " id genres title \\\n", 729 | "0 862 Animation Comedy Family Toy Story \n", 730 | "1 8844 Adventure Fantasy Family Jumanji \n", 731 | "2 15602 Romance Comedy Grumpier Old Men \n", 732 | "3 31357 Comedy Drama Romance Waiting to Exhale \n", 733 | "4 11862 Comedy Father of the Bride Part II \n", 734 | "\n", 735 | " overview release_date \n", 736 | "0 Led by Woody, Andy's toys live happily in his ... 1995-10-30 \n", 737 | "1 When siblings Judy and Peter discover an encha... 1995-12-15 \n", 738 | "2 A family wedding reignites the ancient feud be... 1995-12-22 \n", 739 | "3 Cheated on, mistreated and stepped on, the wom... 1995-12-22 \n", 740 | "4 Just when George Banks has recovered from his ... 1995-02-10 " 741 | ] 742 | }, 743 | "execution_count": 12, 744 | "metadata": {}, 745 | "output_type": "execute_result" 746 | } 747 | ], 748 | "source": [ 749 | "movies_metadata.head()" 750 | ] 751 | }, 752 | { 753 | "cell_type": "code", 754 | "execution_count": 13, 755 | "id": "special-florist", 756 | "metadata": {}, 757 | "outputs": [ 758 | { 759 | "data": { 760 | "text/plain": [ 761 | "index 0\n", 762 | "id 0\n", 763 | "genres 0\n", 764 | "title 0\n", 765 | "overview 0\n", 766 | "release_date 71\n", 767 | "dtype: int64" 768 | ] 769 | }, 770 | "execution_count": 13, 771 | "metadata": {}, 772 | "output_type": "execute_result" 773 | } 774 | ], 775 | "source": [ 776 | "movies_metadata = movies_metadata[movies_metadata['overview'].notnull()]\n", 777 | "movies_metadata = movies_metadata[movies_metadata['title'].notnull()]\n", 778 | "movies_metadata.isna().sum()" 779 | ] 780 | }, 781 | { 782 | "cell_type": "code", 783 | "execution_count": 14, 784 | "id": "related-rubber", 785 | "metadata": {}, 786 | "outputs": [], 787 | "source": [ 788 | "movies_metadata['feature'] = movies_metadata['genres'] + \" / \" + movies_metadata['title'] + \" / \" + movies_metadata['overview']" 789 | ] 790 | }, 791 | { 792 | "cell_type": "markdown", 793 | "id": 
"beginning-insured", 794 | "metadata": {}, 795 | "source": [ 796 | "# HuggingFace embedding" 797 | ] 798 | }, 799 | { 800 | "cell_type": "code", 801 | "execution_count": 18, 802 | "id": "1d862f87", 803 | "metadata": {}, 804 | "outputs": [ 805 | { 806 | "data": { 807 | "text/plain": [ 808 | "SentenceTransformer(\n", 809 | " (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: DistilBertModel \n", 810 | " (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})\n", 811 | " (2): Dense({'in_features': 768, 'out_features': 512, 'bias': True, 'activation_function': 'torch.nn.modules.activation.Tanh'})\n", 812 | ")" 813 | ] 814 | }, 815 | "execution_count": 18, 816 | "metadata": {}, 817 | "output_type": "execute_result" 818 | } 819 | ], 820 | "source": [ 821 | "if cur_os.startswith('win'):\n", 822 | " model = SentenceTransformer(f'{model_path}/distiluse-base-multilingual-cased-v2') \n", 823 | "else:\n", 824 | " model = SentenceTransformer(\"sentence-transformers/distiluse-base-multilingual-cased-v2\")\n", 825 | "\n", 826 | "model" 827 | ] 828 | }, 829 | { 830 | "cell_type": "code", 831 | "execution_count": 24, 832 | "id": "23090b67", 833 | "metadata": {}, 834 | "outputs": [ 835 | { 836 | "data": { 837 | "text/plain": [ 838 | "(44506, 10)" 839 | ] 840 | }, 841 | "execution_count": 24, 842 | "metadata": {}, 843 | "output_type": "execute_result" 844 | } 845 | ], 846 | "source": [ 847 | "movies_metadata['hf_embeddings'] = movies_metadata['feature'].apply(lambda x : model.encode(x))\n", 848 | "movies_metadata.shape" 849 | ] 850 | }, 851 | { 852 | "cell_type": "code", 853 | "execution_count": 48, 854 | "id": "infinite-sudan", 855 | "metadata": {}, 856 | "outputs": [ 857 | { 858 | "name": "stdout", 859 | "output_type": "stream", 860 | "text": [ 861 | "(44506, 9)\n" 862 | ] 863 | }, 864 | { 865 | "data": { 866 
| "text/html": [ 867 | "
\n", 868 | "\n", 881 | "\n", 882 | " \n", 883 | " \n", 884 | " \n", 885 | " \n", 886 | " \n", 887 | " \n", 888 | " \n", 889 | " \n", 890 | " \n", 891 | " \n", 892 | " \n", 893 | " \n", 894 | " \n", 895 | " \n", 896 | " \n", 897 | " \n", 898 | " \n", 899 | " \n", 900 | " \n", 901 | " \n", 902 | " \n", 903 | " \n", 904 | " \n", 905 | " \n", 906 | " \n", 907 | " \n", 908 | " \n", 909 | " \n", 910 | " \n", 911 | " \n", 912 | " \n", 913 | " \n", 914 | " \n", 915 | " \n", 916 | " \n", 917 | " \n", 918 | " \n", 919 | " \n", 920 | " \n", 921 | " \n", 922 | " \n", 923 | " \n", 924 | " \n", 925 | " \n", 926 | " \n", 927 | " \n", 928 | " \n", 929 | " \n", 930 | " \n", 931 | " \n", 932 | " \n", 933 | " \n", 934 | " \n", 935 | " \n", 936 | " \n", 937 | " \n", 938 | " \n", 939 | " \n", 940 | " \n", 941 | " \n", 942 | " \n", 943 | " \n", 944 | " \n", 945 | " \n", 946 | " \n", 947 | " \n", 948 | " \n", 949 | " \n", 950 | " \n", 951 | " \n", 952 | " \n", 953 | " \n", 954 | " \n", 955 | " \n", 956 | " \n", 957 | " \n", 958 | "
\n", 959 | "
" 960 | ], 961 | "text/plain": [ 962 | " id genres title \\\n", 963 | "0 862 Animation Comedy Family Toy Story \n", 964 | "1 8844 Adventure Fantasy Family Jumanji \n", 965 | "2 15602 Romance Comedy Grumpier Old Men \n", 966 | "3 31357 Comedy Drama Romance Waiting to Exhale \n", 967 | "4 11862 Comedy Father of the Bride Part II \n", 968 | "\n", 969 | " overview release_date \\\n", 970 | "0 Led by Woody, Andy's toys live happily in his ... 1995-10-30 \n", 971 | "1 When siblings Judy and Peter discover an encha... 1995-12-15 \n", 972 | "2 A family wedding reignites the ancient feud be... 1995-12-22 \n", 973 | "3 Cheated on, mistreated and stepped on, the wom... 1995-12-22 \n", 974 | "4 Just when George Banks has recovered from his ... 1995-02-10 \n", 975 | "\n", 976 | " feature text_len feature_len \\\n", 977 | "0 Animation Comedy Family / Toy Story / Led by W... 303 341 \n", 978 | "1 Adventure Fantasy Family / Jumanji / When sibl... 395 432 \n", 979 | "2 Romance Comedy / Grumpier Old Men / A family w... 327 363 \n", 980 | "3 Comedy Drama Romance / Waiting to Exhale / Che... 270 313 \n", 981 | "4 Comedy / Father of the Bride Part II / Just wh... 318 357 \n", 982 | "\n", 983 | " hf_embeddings \n", 984 | "0 [-0.047183443, -0.02021129, 0.096098304, -0.05... \n", 985 | "1 [0.0013290617, -0.0071765357, 0.048141554, -0.... \n", 986 | "2 [-0.06425337, -0.008573138, -0.10116352, -0.00... \n", 987 | "3 [-0.032623984, -0.03347266, 0.02557934, -0.033... \n", 988 | "4 [0.03181218, -0.0038158156, 0.02099277, 0.0031... 
" 989 | ] 990 | }, 991 | "execution_count": 48, 992 | "metadata": {}, 993 | "output_type": "execute_result" 994 | } 995 | ], 996 | "source": [ 997 | "print(movies_metadata.shape)\n", 998 | "movies_metadata.head()" 999 | ] 1000 | }, 1001 | { 1002 | "cell_type": "code", 1003 | "execution_count": 49, 1004 | "id": "exotic-circuit", 1005 | "metadata": {}, 1006 | "outputs": [], 1007 | "source": [ 1008 | "movies_metadata.to_csv('./movie_meta/movies_metadata_em.csv')" 1009 | ] 1010 | }, 1011 | { 1012 | "cell_type": "markdown", 1013 | "id": "entertaining-surge", 1014 | "metadata": {}, 1015 | "source": [ 1016 | "# OpenAI Embedding" 1017 | ] 1018 | }, 1019 | { 1020 | "cell_type": "code", 1021 | "execution_count": 26, 1022 | "id": "streaming-ultimate", 1023 | "metadata": {}, 1024 | "outputs": [], 1025 | "source": [ 1026 | "openai_embedding_model = \"text-embedding-ada-002\"" 1027 | ] 1028 | }, 1029 | { 1030 | "cell_type": "code", 1031 | "execution_count": 27, 1032 | "id": "progressive-amino", 1033 | "metadata": {}, 1034 | "outputs": [], 1035 | "source": [ 1036 | "def get_doc_embedding(text: str) -> List[float]:\n", 1037 | " return get_embedding(text, openai_embedding_model)" 1038 | ] 1039 | }, 1040 | { 1041 | "cell_type": "code", 1042 | "execution_count": 28, 1043 | "id": "posted-creator", 1044 | "metadata": {}, 1045 | "outputs": [], 1046 | "source": [ 1047 | "def get_embedding(text: str, model: str) -> List[float]:\n", 1048 | " result = openai.Embedding.create(\n", 1049 | " model=model,\n", 1050 | " input=text)\n", 1051 | " return result[\"data\"][0][\"embedding\"]" 1052 | ] 1053 | }, 1054 | { 1055 | "cell_type": "code", 1056 | "execution_count": 29, 1057 | "id": "refined-hotel", 1058 | "metadata": {}, 1059 | "outputs": [], 1060 | "source": [ 1061 | "# movies_metadata['openai_embeddings'] = movies_metadata['feature'].apply(lambda x : get_embedding(x, openai_embedding_model))" 1062 | ] 1063 | }, 1064 | { 1065 | "cell_type": "code", 1066 | "execution_count": null, 1067 | "id": 
"uniform-bernard", 1068 | "metadata": {}, 1069 | "outputs": [], 1070 | "source": [] 1071 | }, 1072 | { 1073 | "cell_type": "code", 1074 | "execution_count": null, 1075 | "id": "contemporary-tobago", 1076 | "metadata": {}, 1077 | "outputs": [], 1078 | "source": [] 1079 | }, 1080 | { 1081 | "cell_type": "markdown", 1082 | "id": "talented-landing", 1083 | "metadata": {}, 1084 | "source": [ 1085 | "# Items similar to the query" 1086 | ] 1087 | }, 1088 | { 1089 | "cell_type": "code", 1090 | "execution_count": 30, 1091 | "id": "prime-independence", 1092 | "metadata": {}, 1093 | "outputs": [], 1094 | "source": [ 1095 | "top_k = 5" 1096 | ] 1097 | }, 1098 | { 1099 | "cell_type": "code", 1100 | "execution_count": 671, 1101 | "id": "dedicated-basin", 1102 | "metadata": {}, 1103 | "outputs": [], 1104 | "source": [ 1105 | "def get_query_sim_top_k(query, model, df, top_k):\n", 1106 | "    query_encode = model.encode(query)\n", 1107 | "    cos_scores = util.pytorch_cos_sim(query_encode, df['hf_embeddings'])[0]\n", 1108 | "    top_results = torch.topk(cos_scores, k=top_k)\n", 1109 | "    return top_results" 1110 | ] 1111 | }, 1112 | { 1113 | "cell_type": "code", 1114 | "execution_count": 32, 1115 | "id": "romantic-category", 1116 | "metadata": {}, 1117 | "outputs": [ 1118 | { 1119 | "name": "stderr", 1120 | "output_type": "stream", 1121 | "text": [ 1122 | "/Users/leesoojin/opt/anaconda3/envs/openai/lib/python3.8/site-packages/sentence_transformers/util.py:39: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. 
(Triggered internally at /Users/runner/work/pytorch/pytorch/pytorch/torch/csrc/utils/tensor_new.cpp:233.)\n", 1123 | " b = torch.tensor(b)\n" 1124 | ] 1125 | } 1126 | ], 1127 | "source": [ 1128 | "query = \"Are there any documentary films?\"\n", 1129 | "top_result = get_query_sim_top_k(query, model, movies_metadata, top_k)" 1130 | ] 1131 | }, 1132 | { 1133 | "cell_type": "code", 1134 | "execution_count": 33, 1135 | "id": "provincial-floor", 1136 | "metadata": {}, 1137 | "outputs": [ 1138 | { 1139 | "data": { 1140 | "text/plain": [ 1141 | "torch.return_types.topk(\n", 1142 | "values=tensor([0.5390, 0.5117, 0.5093, 0.5067, 0.4992]),\n", 1143 | "indices=tensor([24020, 10124, 22428, 35263, 22273]))" 1144 | ] 1145 | }, 1146 | "execution_count": 33, 1147 | "metadata": {}, 1148 | "output_type": "execute_result" 1149 | } 1150 | ], 1151 | "source": [ 1152 | "top_result" 1153 | ] 1154 | }, 1155 | { 1156 | "cell_type": "code", 1157 | "execution_count": 34, 1158 | "id": "continuing-pasta", 1159 | "metadata": {}, 1160 | "outputs": [ 1161 | { 1162 | "data": { 1163 | "text/html": [ 1164 | "
\n", 1165 | "\n", 1178 | "\n", 1179 | " \n", 1180 | " \n", 1181 | " \n", 1182 | " \n", 1183 | " \n", 1184 | " \n", 1185 | " \n", 1186 | " \n", 1187 | " \n", 1188 | " \n", 1189 | " \n", 1190 | " \n", 1191 | " \n", 1192 | " \n", 1193 | " \n", 1194 | " \n", 1195 | " \n", 1196 | " \n", 1197 | " \n", 1198 | " \n", 1199 | " \n", 1200 | " \n", 1201 | " \n", 1202 | " \n", 1203 | " \n", 1204 | " \n", 1205 | " \n", 1206 | " \n", 1207 | " \n", 1208 | " \n", 1209 | " \n", 1210 | " \n", 1211 | " \n", 1212 | " \n", 1213 | " \n", 1214 | " \n", 1215 | " \n", 1216 | " \n", 1217 | " \n", 1218 | " \n", 1219 | "
\n", 1220 | "
" 1221 | ], 1222 | "text/plain": [ 1223 | " title \\\n", 1224 | "24020 The 50 Worst Movies Ever Made \n", 1225 | "10124 Trekkies 2 \n", 1226 | "22428 The Spanish Earth \n", 1227 | "35263 Tomorrow \n", 1228 | "22273 I Know That Voice \n", 1229 | "\n", 1230 | " overview genres \n", 1231 | "24020 There are some movies that are so bad they're ... Documentary \n", 1232 | "10124 sequel to the 1997 documentary film Trekkies. Documentary \n", 1233 | "22428 A propaganda film made during the Spanish Civi... War Documentary \n", 1234 | "35263 Documentary film about global warming. Documentary \n", 1235 | "22273 A documentary about voice-over actors. Documentary " 1236 | ] 1237 | }, 1238 | "execution_count": 34, 1239 | "metadata": {}, 1240 | "output_type": "execute_result" 1241 | } 1242 | ], 1243 | "source": [ 1244 | "movies_metadata.iloc[top_result[1].numpy(), :][['title', 'overview', 'genres']]" 1245 | ] 1246 | }, 1247 | { 1248 | "cell_type": "code", 1249 | "execution_count": null, 1250 | "id": "portable-racing", 1251 | "metadata": {}, 1252 | "outputs": [], 1253 | "source": [] 1254 | }, 1255 | { 1256 | "cell_type": "code", 1257 | "execution_count": 35, 1258 | "id": "joined-monday", 1259 | "metadata": {}, 1260 | "outputs": [ 1261 | { 1262 | "data": { 1263 | "text/html": [ 1264 | "
\n", 1265 | "\n", 1278 | "\n", 1279 | " \n", 1280 | " \n", 1281 | " \n", 1282 | " \n", 1283 | " \n", 1284 | " \n", 1285 | " \n", 1286 | " \n", 1287 | " \n", 1288 | " \n", 1289 | " \n", 1290 | " \n", 1291 | " \n", 1292 | " \n", 1293 | " \n", 1294 | " \n", 1295 | " \n", 1296 | " \n", 1297 | " \n", 1298 | " \n", 1299 | " \n", 1300 | " \n", 1301 | " \n", 1302 | " \n", 1303 | " \n", 1304 | " \n", 1305 | " \n", 1306 | " \n", 1307 | " \n", 1308 | " \n", 1309 | " \n", 1310 | " \n", 1311 | " \n", 1312 | " \n", 1313 | " \n", 1314 | " \n", 1315 | " \n", 1316 | " \n", 1317 | " \n", 1318 | " \n", 1319 | "
\n", 1320 | "
" 1321 | ], 1322 | "text/plain": [ 1323 | " title \\\n", 1324 | "38573 Catastrophe \n", 1325 | "11441 When the Levees Broke: A Requiem in Four Acts \n", 1326 | "2404 Earthquake \n", 1327 | "35263 Tomorrow \n", 1328 | "41943 Disaster! \n", 1329 | "\n", 1330 | " overview \\\n", 1331 | "38573 A film cataloguing some of the world's largest... \n", 1332 | "11441 In August 2005, the American city of New Orlea... \n", 1333 | "2404 Earthquake is a 1974 American disaster film th... \n", 1334 | "35263 Documentary film about global warming. \n", 1335 | "41943 A spoof of disaster films, an asteroid is comi... \n", 1336 | "\n", 1337 | " genres \n", 1338 | "38573 Thriller Documentary \n", 1339 | "11441 Documentary \n", 1340 | "2404 Action Drama Thriller \n", 1341 | "35263 Documentary \n", 1342 | "41943 Action Animation Comedy " 1343 | ] 1344 | }, 1345 | "execution_count": 35, 1346 | "metadata": {}, 1347 | "output_type": "execute_result" 1348 | } 1349 | ], 1350 | "source": [ 1351 | "query = \"Are there any movies about natural disasters?\"\n", 1352 | "top_result = get_query_sim_top_k(query, model, movies_metadata, top_k)\n", 1353 | "movies_metadata.iloc[top_result[1].numpy(), :][['title', 'overview', 'genres']]" 1354 | ] 1355 | }, 1356 | { 1357 | "cell_type": "code", 1358 | "execution_count": 36, 1359 | "id": "serious-serbia", 1360 | "metadata": {}, 1361 | "outputs": [ 1362 | { 1363 | "data": { 1364 | "text/html": [ 1365 | "
\n", 1366 | "\n", 1379 | "\n", 1380 | " \n", 1381 | " \n", 1382 | " \n", 1383 | " \n", 1384 | " \n", 1385 | " \n", 1386 | " \n", 1387 | " \n", 1388 | " \n", 1389 | " \n", 1390 | " \n", 1391 | " \n", 1392 | " \n", 1393 | " \n", 1394 | " \n", 1395 | " \n", 1396 | " \n", 1397 | " \n", 1398 | " \n", 1399 | " \n", 1400 | " \n", 1401 | " \n", 1402 | " \n", 1403 | " \n", 1404 | " \n", 1405 | " \n", 1406 | " \n", 1407 | " \n", 1408 | " \n", 1409 | " \n", 1410 | " \n", 1411 | " \n", 1412 | " \n", 1413 | " \n", 1414 | " \n", 1415 | " \n", 1416 | " \n", 1417 | " \n", 1418 | " \n", 1419 | " \n", 1420 | "
\n", 1421 | "
" 1422 | ], 1423 | "text/plain": [ 1424 | " title overview \\\n", 1425 | "36936 The Flying Man A new superhero is coming, only this time it's... \n", 1426 | "30101 Up, Up, and Away A boy is the only family member without superp... \n", 1427 | "24646 The Four Feathers They made him a hero by branding him a coward ... \n", 1428 | "12672 Hancock Hancock is a down-and-out superhero who's forc... \n", 1429 | "4216 Too Late the Hero A WWII film set on a Pacific island. Japanese ... \n", 1430 | "\n", 1431 | " genres \n", 1432 | "36936 Action Mystery Science Fiction \n", 1433 | "30101 Action Family TV Movie \n", 1434 | "24646 TV Movie Adventure Drama \n", 1435 | "12672 Fantasy Action \n", 1436 | "4216 Drama Action War " 1437 | ] 1438 | }, 1439 | "execution_count": 36, 1440 | "metadata": {}, 1441 | "output_type": "execute_result" 1442 | } 1443 | ], 1444 | "source": [ 1445 | "query = \"Are there any movies about heros?\"\n", 1446 | "top_result = get_query_sim_top_k(query, model, movies_metadata, top_k)\n", 1447 | "movies_metadata.iloc[top_result[1].numpy(), :][['title', 'overview', 'genres']]" 1448 | ] 1449 | }, 1450 | { 1451 | "cell_type": "markdown", 1452 | "id": "burning-venice", 1453 | "metadata": {}, 1454 | "source": [ 1455 | "https://www.imdb.com/title/tt0211174/?ref_=fn_al_tt_1" 1456 | ] 1457 | }, 1458 | { 1459 | "cell_type": "markdown", 1460 | "id": "painful-drama", 1461 | "metadata": {}, 1462 | "source": [ 1463 | "# Using ChatGPT\n", 1464 | "\n", 1465 | "- Use two ChatGPT calls\n", 
1466 | "    - The first identifies the intent of the question: does the user want a description or a recommendation?\n", 1467 | "    - The wording of the reply differs by category\n", 1468 | "    - For a description, fetch the most similar text and explain it\n", 1469 | "    - For a recommendation, fetch the cosine-similarity top-k items and print them" 1470 | ] 1471 | }, 1472 | { 1473 | "cell_type": "code", 1474 | "execution_count": null, 1475 | "id": "controlled-religion", 1476 | "metadata": {}, 1477 | "outputs": [], 1478 | "source": [] 1479 | }, 1480 | { 1481 | "cell_type": "code", 1482 | "execution_count": 37, 1483 | "id": "greater-underwear", 1484 | "metadata": {}, 1485 | "outputs": [], 1486 | "source": [ 1487 | "def print_msg(msg):\n", 1488 | "    completion = openai.ChatCompletion.create(\n", 1489 | "        model=\"gpt-3.5-turbo\",\n", 1490 | "        messages=msg\n", 1491 | "    )\n", 1492 | "    return completion['choices'][0]['message']['content']" 1493 | ] 1494 | }, 1495 | { 1496 | "cell_type": "markdown", 1497 | "id": "a356862b", 1498 | "metadata": {}, 1499 | "source": [ 1500 | "## Prompt test" 1501 | ] 1502 | }, 1503 | { 1504 | "cell_type": "code", 1505 | "execution_count": 38, 1506 | "id": "synthetic-touch", 1507 | "metadata": {}, 1508 | "outputs": [ 1509 | { 1510 | "data": { 1511 | "text/plain": [ 1512 | "'description'" 1513 | ] 1514 | }, 1515 | "execution_count": 38, 1516 | "metadata": {}, 1517 | "output_type": "execute_result" 1518 | } 1519 | ], 1520 | "source": [ 1521 | "messages = [\n", 1522 | " {\"role\": \"system\", \"content\": \"You are a helpful assistant who understands the intent of the user's question.\"},\n", 1523 | " {\"role\": \"user\", \"content\": \"Which category does the sentence below belong to: 'description', 'recommended'? Show only categories. 
\\n context: tell me about instagram \\n A:\"}\n", 1524 | "]\n", 1525 | "\n", 1526 | "\n", 1527 | "print_msg(messages)" 1528 | ] 1529 | }, 1530 | { 1531 | "cell_type": "code", 1532 | "execution_count": 39, 1533 | "id": "powerful-savings", 1534 | "metadata": {}, 1535 | "outputs": [ 1536 | { 1537 | "data": { 1538 | "text/plain": [ 1539 | "'recommend'" 1540 | ] 1541 | }, 1542 | "execution_count": 39, 1543 | "metadata": {}, 1544 | "output_type": "execute_result" 1545 | } 1546 | ], 1547 | "source": [ 1548 | "messages = [\n", 1549 | " {\"role\": \"system\", \"content\": \"You are a helpful assistant who understands the intent of the user's question.\"},\n", 1550 | " {\"role\": \"user\", \"content\": \"Which category does the sentence below belong to: 'description', 'recommend'? Show only categories. \\n context: What apps are similar to Instagram? \\n A:\"}\n", 1551 | "]\n", 1552 | "\n", 1553 | "\n", 1554 | "print_msg(messages)\n", 1555 | "\n" 1556 | ] 1557 | }, 1558 | { 1559 | "cell_type": "code", 1560 | "execution_count": 40, 1561 | "id": "still-outdoors", 1562 | "metadata": {}, 1563 | "outputs": [ 1564 | { 1565 | "data": { 1566 | "text/plain": [ 1567 | "'recommend'" 1568 | ] 1569 | }, 1570 | "execution_count": 40, 1571 | "metadata": {}, 1572 | "output_type": "execute_result" 1573 | } 1574 | ], 1575 | "source": [ 1576 | "messages = [\n", 1577 | " {\"role\": \"system\", \"content\": \"You are a helpful assistant who understands the intent of the user's question.\"},\n", 1578 | " {\"role\": \"user\", \"content\": \"Which category does the sentence below belong to: 'description', 'recommend'? Show only categories. \\n context: Recommend apps similar to Instagram. 
\\n A:\"}\n", 1579 | "]\n", 1580 | "\n", 1581 | "\n", 1582 | "print_msg(messages)" 1583 | ] 1584 | }, 1585 | { 1586 | "cell_type": "code", 1587 | "execution_count": null, 1588 | "id": "manual-violin", 1589 | "metadata": {}, 1590 | "outputs": [], 1591 | "source": [] 1592 | }, 1593 | { 1594 | "cell_type": "code", 1595 | "execution_count": 41, 1596 | "id": "equal-income", 1597 | "metadata": {}, 1598 | "outputs": [ 1599 | { 1600 | "data": { 1601 | "text/plain": [ 1602 | "'Here are some apps similar to Instagram that you might want to check out:'" 1603 | ] 1604 | }, 1605 | "execution_count": 41, 1606 | "metadata": {}, 1607 | "output_type": "execute_result" 1608 | } 1609 | ], 1610 | "source": [ 1611 | "messages = [\n", 1612 | " {\"role\": \"system\", \"content\": \"You are a helpful assistant who recommend contents.\"},\n", 1613 | " {\"role\": \"user\", \"content\": \"Simply repeat the provided context and put a sentence in front of the context. \\n context: Recommend apps similar to Instagram.\"}\n", 1614 | "]\n", 1615 | "\n", 1616 | "\n", 1617 | "print_msg(messages)" 1618 | ] 1619 | }, 1620 | { 1621 | "cell_type": "code", 1622 | "execution_count": 42, 1623 | "id": "saving-lincoln", 1624 | "metadata": {}, 1625 | "outputs": [ 1626 | { 1627 | "data": { 1628 | "text/plain": [ 1629 | "'Here are some apps like Instagram that you might find helpful!'" 1630 | ] 1631 | }, 1632 | "execution_count": 42, 1633 | "metadata": {}, 1634 | "output_type": "execute_result" 1635 | } 1636 | ], 1637 | "source": [ 1638 | "messages = [\n", 1639 | " {\"role\": \"system\", \"content\": \"You are a helpful assistant who recommend contents.\"},\n", 1640 | " {\"role\": \"user\", \"content\": \"Simplify the sentences for recommending services \\n context: Recommend apps similar to Instagram.\"}\n", 1641 | "]\n", 1642 | "\n", 1643 | "\n", 1644 | "print_msg(messages)" 1645 | ] 1646 | }, 1647 | { 1648 | "cell_type": "code", 1649 | "execution_count": 43, 1650 | "id": "interstate-director", 1651 | 
"metadata": {}, 1652 | "outputs": [ 1653 | { 1654 | "data": { 1655 | "text/plain": [ 1656 | "\"Of course! I'd be happy to explain the item to you.\"" 1657 | ] 1658 | }, 1659 | "execution_count": 43, 1660 | "metadata": {}, 1661 | "output_type": "execute_result" 1662 | } 1663 | ], 1664 | "source": [ 1665 | "messages = [\n", 1666 | " {\"role\": \"system\", \"content\": \"You are a helpful assistant who kindly answers.\"},\n", 1667 | " {\"role\": \"user\", \"content\": \"Please write a simple greeting starting with 'of course' to explain the item to the user.\"}\n", 1668 | "]\n", 1669 | "\n", 1670 | "\n", 1671 | "print_msg(messages)" 1672 | ] 1673 | }, 1674 | { 1675 | "cell_type": "code", 1676 | "execution_count": 44, 1677 | "id": "introductory-phrase", 1678 | "metadata": {}, 1679 | "outputs": [ 1680 | { 1681 | "data": { 1682 | "text/plain": [ 1683 | "\"Of course! I'd be happy to recommend some great items for you.\"" 1684 | ] 1685 | }, 1686 | "execution_count": 44, 1687 | "metadata": {}, 1688 | "output_type": "execute_result" 1689 | } 1690 | ], 1691 | "source": [ 1692 | "messages = [\n", 1693 | " {\"role\": \"system\", \"content\": \"You are a helpful assistant who recommend contents based on user question.\"},\n", 1694 | " {\"role\": \"user\", \"content\": \"Write 1 sentence of a simple greeting that starts with 'Of course!' to recommend items to users.\"}\n", 1695 | "]\n", 1696 | "\n", 1697 | "\n", 1698 | "print_msg(messages)" 1699 | ] 1700 | }, 1701 | { 1702 | "cell_type": "code", 1703 | "execution_count": 103, 1704 | "id": "f4b35964", 1705 | "metadata": {}, 1706 | "outputs": [ 1707 | { 1708 | "data": { 1709 | "text/html": [ 1710 | "
\n", 1711 | "\n", 1724 | "\n", 1725 | " \n", 1726 | " \n", 1727 | " \n", 1728 | " \n", 1729 | " \n", 1730 | " \n", 1731 | " \n", 1732 | " \n", 1733 | " \n", 1734 | " \n", 1735 | " \n", 1736 | " \n", 1737 | " \n", 1738 | " \n", 1739 | " \n", 1740 | " \n", 1741 | " \n", 1742 | " \n", 1743 | " \n", 1744 | " \n", 1745 | " \n", 1746 | " \n", 1747 | " \n", 1748 | " \n", 1749 | " \n", 1750 | " \n", 1751 | " \n", 1752 | " \n", 1753 | " \n", 1754 | " \n", 1755 | " \n", 1756 | " \n", 1757 | " \n", 1758 | " \n", 1759 | " \n", 1760 | " \n", 1761 | " \n", 1762 | " \n", 1763 | " \n", 1764 | " \n", 1765 | "
\n", 1766 | "
" 1767 | ], 1768 | "text/plain": [ 1769 | " title overview \\\n", 1770 | "36936 The Flying Man A new superhero is coming, only this time it's... \n", 1771 | "30101 Up, Up, and Away A boy is the only family member without superp... \n", 1772 | "24646 The Four Feathers They made him a hero by branding him a coward ... \n", 1773 | "12672 Hancock Hancock is a down-and-out superhero who's forc... \n", 1774 | "4216 Too Late the Hero A WWII film set on a Pacific island. Japanese ... \n", 1775 | "\n", 1776 | " genres \n", 1777 | "36936 Action Mystery Science Fiction \n", 1778 | "30101 Action Family TV Movie \n", 1779 | "24646 TV Movie Adventure Drama \n", 1780 | "12672 Fantasy Action \n", 1781 | "4216 Drama Action War " 1782 | ] 1783 | }, 1784 | "execution_count": 103, 1785 | "metadata": {}, 1786 | "output_type": "execute_result" 1787 | } 1788 | ], 1789 | "source": [ 1790 | "movies_metadata.iloc[top_result[1].numpy(), :][['title', 'overview', 'genres']]" 1791 | ] 1792 | }, 1793 | { 1794 | "cell_type": "markdown", 1795 | "id": "34ee1584", 1796 | "metadata": {}, 1797 | "source": [ 1798 | "## Define the required prompts\n", 1799 | "\n", 1800 | "- Is the task a recommendation, a description, or intent detection?" 1801 | ] 1802 | }, 1803 | { 1804 | "cell_type": "code", 1805 | "execution_count": 815, 1806 | "id": "split-scotland", 1807 | "metadata": {}, 1808 | "outputs": [], 1809 | "source": [ 1810 | "msg_prompt = {\n", 1811 | " 'recom' : {\n", 1812 | " 'system' : \"You are a helpful assistant who recommend movie based on user question.\", \n", 1813 | " 'user' : \"Write 1 sentence of a simple greeting that starts with 'Of course!' 
to recommend movie items to users.\", \n", 1814 | " },\n", 1815 | " 'desc' : {\n", 1816 | " 'system' : \"You are a helpful assistant who kindly answers.\", \n", 1817 | " 'user' : \"Please write a simple greeting starting with 'of course' to explain the item to the user.\", \n", 1818 | " },\n", 1819 | " 'intent' : {\n", 1820 | " 'system' : \"You are a helpful assistant who understands the intent of the user's question.\",\n", 1821 | " 'user' : \"Which category does the sentence below belong to: 'description', 'recommended', 'search'? Show only categories. \\n context:\"\n", 1822 | " }\n", 1823 | "}" 1824 | ] 1825 | }, 1826 | { 1827 | "cell_type": "code", 1828 | "execution_count": 856, 1829 | "id": "separate-confusion", 1830 | "metadata": {}, 1831 | "outputs": [ 1832 | { 1833 | "data": { 1834 | "text/plain": [ 1835 | "{'recom': {'system': 'You are a helpful assistant who recommend movie based on user question.',\n", 1836 | " 'user': \"Write 1 sentence of a simple greeting that starts with 'Of course!' to recommend movie items to users.\"},\n", 1837 | " 'desc': {'system': 'You are a helpful assistant who kindly answers.',\n", 1838 | " 'user': \"Please write a simple greeting starting with 'of course' to explain the item to the user.\"},\n", 1839 | " 'intent': {'system': \"You are a helpful assistant who understands the intent of the user's question.\",\n", 1840 | " 'user': \"Which category does the sentence below belong to: 'description', 'recommended', 'search'? Show only categories. 
\\n context:\"}}" 1841 | ] 1842 | }, 1843 | "execution_count": 856, 1844 | "metadata": {}, 1845 | "output_type": "execute_result" 1846 | } 1847 | ], 1848 | "source": [ 1849 | "msg_prompt" 1850 | ] 1851 | }, 1852 | { 1853 | "cell_type": "code", 1854 | "execution_count": 870, 1855 | "id": "biological-mustang", 1856 | "metadata": {}, 1857 | "outputs": [], 1858 | "source": [ 1859 | "user_msg_history = []" 1860 | ] 1861 | }, 1862 | { 1863 | "cell_type": "code", 1864 | "execution_count": 871, 1865 | "id": "d8b72193", 1866 | "metadata": {}, 1867 | "outputs": [], 1868 | "source": [ 1869 | "def get_chatgpt_msg(msg):\n", 1870 | " completion = openai.ChatCompletion.create(\n", 1871 | " model=\"gpt-3.5-turbo\",\n", 1872 | " messages=msg\n", 1873 | " )\n", 1874 | " return completion['choices'][0]['message']['content']" 1875 | ] 1876 | }, 1877 | { 1878 | "cell_type": "code", 1879 | "execution_count": 872, 1880 | "id": "partial-device", 1881 | "metadata": {}, 1882 | "outputs": [], 1883 | "source": [ 1884 | "def set_prompt(intent, query, msg_prompt_init, model):\n", 1885 | " '''Build the chat messages (prompt) for the given intent.'''\n", 1886 | " # Returns a list with one message dict per role (system first, then user)\n", 1887 | " # Recommendation or search\n", 1888 | " if ('recom' in intent) or ('search' in intent):\n", 1889 | " msg = msg_prompt_init['recom'] # use the recommendation prompts\n", 1890 | " # Description\n", 1891 | " elif 'desc' in intent:\n", 1892 | " msg = msg_prompt_init['desc'] # use the description prompts\n", 1893 | " # Otherwise, classify the intent\n", 1894 | " else:\n", 1895 | " msg = msg_prompt_init['intent']\n", 1896 | " msg['user'] += f' {query} \\n A:'\n", 1897 | " # Emit one dict per role; reusing a single dict would keep only the last role\n", 1898 | " messages = [{'role': k, 'content': v} for k, v in msg.items()]\n", 1899 | " return messages" 1900 | ] 1901 | }, 1902 | { 1903 | "cell_type": "code", 1904 | "execution_count": 872, 1905 | "id": "50db26ff", 1906 | "metadata": {}, 1907 | "outputs": [], 1908 | "source": [ 1909 | "def user_interact(query, model, msg_prompt_init):\n", 1910 | " # 1. 
Identify the user's intent\n", 1911 | " user_intent = set_prompt('intent', query, msg_prompt_init, None)\n", 1912 | " user_intent = get_chatgpt_msg(user_intent).lower()\n", 1913 | " print(\"user_intent : \", user_intent)\n", 1914 | " \n", 1915 | " # 2. Build the prompt that matches the detected intent \n", 1916 | " intent_data = set_prompt(user_intent, query, msg_prompt_init, model)\n", 1917 | " intent_data_msg = get_chatgpt_msg(intent_data).replace(\"\\n\", \"\").strip()\n", 1918 | " print(\"intent_data_msg : \", intent_data_msg)\n", 1919 | " \n", 1920 | " # 3-1. Recommendation or search\n", 1921 | " if ('recom' in user_intent) or ('search' in user_intent):\n", 1922 | " recom_msg = str()\n", 1923 | " # If the previous message carries a 'feature' dict (from a description step), use it as the query\n", 1924 | " if (len(user_msg_history) > 0 ) and isinstance(user_msg_history[-1]['content'], dict):\n", 1925 | " query = user_msg_history[-1]['content']['feature']\n", 1926 | " # Fetch similar items\n", 1927 | " #top_result = get_query_sim_top_k(query, model, movies_metadata, top_k=1 if 'recom' in user_intent else 3) # to set the number of recommendations\n", 1928 | " top_result = get_query_sim_top_k(query, model, movies_metadata, top_k=3)\n", 1929 | " #print(\"top_result : \", top_result)\n", 1930 | " # For a search, exclude the item itself (the first hit)\n", 1931 | " top_index = top_result[1].numpy() if 'recom' in user_intent else top_result[1].numpy()[1:]\n", 1932 | " #print(\"top_index : \", top_index)\n", 1933 | " # Fetch genres, title, and overview and print them\n", 1934 | " r_set_d = movies_metadata.iloc[top_index, :][['genres', 'title', 'overview']]\n", 1935 | " r_set_d = json.loads(r_set_d.to_json(orient=\"records\"))\n", 1936 | " for r in r_set_d:\n", 1937 | " for _, v in r.items():\n", 1938 | " recom_msg += f\"{v} \\n\"\n", 1939 | " recom_msg += \"\\n\"\n", 1940 | " user_msg_history.append({'role' : 'assistant', 'content' : f\"{intent_data_msg} {str(recom_msg)}\"})\n", 1941 | " print(f\"\\n recom data : {intent_data_msg} {str(recom_msg)}\")\n", 1942 | " # 3-2. 
Description\n", 1943 | " elif 'desc' in user_intent:\n", 1944 | " # The description depends on the previous message, so use that message's content as the query\n", 1945 | " top_result = get_query_sim_top_k(user_msg_history[-1]['content'], model, movies_metadata, top_k=1)\n", 1946 | " # Assume the 'feature' column holds the detailed description; fetch and print its value\n", 1947 | " r_set_d = movies_metadata.iloc[top_result[1].numpy(), :][['feature']]\n", 1948 | " r_set_d = json.loads(r_set_d.to_json(orient=\"records\"))[0]\n", 1949 | " user_msg_history.append({'role' : 'assistant', 'content' : r_set_d})\n", 1950 | " print(f\"\\n describe : {intent_data_msg} {r_set_d}\")" 1951 | ] 1952 | }, 1953 | { 1954 | "cell_type": "markdown", 1955 | "id": "29adeae4", 1956 | "metadata": {}, 1957 | "source": [ 1958 | "## An alternative to the user_interact structure above \n" 1959 | ] 1960 | }, 1961 | { 1962 | "cell_type": "code", 1963 | "execution_count": 872, 1964 | "id": "1cb4af5a", 1965 | "metadata": {}, 1966 | "outputs": [], 1967 | "source": [ 1968 | "'''\n", 1969 | "\n", 1970 | "import openai\n", 1971 | "\n", 1972 | "openai.api_key = \"YOUR_API_KEY\" # supply your API key however you choose\n", 1973 | "\n", 1974 | "message = {\"role\":\"user\", \"content\": input(\"This is the beginning of your chat with AI. [To exit, send \\\"###\\\".]\\n\\nYou:\")}\n", 1975 | "\n", 1976 | "conversation = [{\"role\": \"system\", \"content\": \"DIRECTIVE_FOR_gpt-3.5-turbo\"}]\n", 1977 | "\n", 1978 | "while(message[\"content\"]!=\"###\"):\n", 1979 | " conversation.append(message)\n", 1980 | " completion = openai.ChatCompletion.create(model=\"gpt-3.5-turbo\", messages=conversation) \n", 1981 | " conversation.append(completion.choices[0].message)\n", 1982 | " message = {\"role\": \"user\", \"content\": input(f\"Assistant: {completion.choices[0].message.content} \\nYou:\")}\n", 1983 | " print()\n", 1984 | " \n", 1985 | "'''" 1986 | ] 1987 | }, 1988 | { 1989 | "cell_type": "markdown", 1990 | "id": "7bd5bff7", 1991 | "metadata": {}, 1992 | "source": [ 1993 | "## A conversation loop like the one above can be cleaner."
1994 | ] 1995 | }, 1996 | { 1997 | "cell_type": "code", 1998 | "execution_count": 873, 1999 | "id": "metric-single", 2000 | "metadata": {}, 2001 | "outputs": [], 2002 | "source": [] 2003 | }, 2004 | { 2005 | "cell_type": "markdown", 2006 | "id": "be150f32", 2007 | "metadata": {}, 2008 | "source": [ 2009 | "## Run the recommendation process for each query" 2010 | ] 2011 | }, 2012 | { 2013 | "cell_type": "code", 2014 | "execution_count": 874, 2015 | "id": "modern-cover", 2016 | "metadata": {}, 2017 | "outputs": [ 2018 | { 2019 | "name": "stdout", 2020 | "output_type": "stream", 2021 | "text": [ 2022 | "user_intent : recommended\n", 2023 | "intent_data_msg : Of course! Here are some top-rated movie items that you might enjoy.\n", 2024 | "\n", 2025 | " recom data : Of course! Here are some top-rated movie items that you might enjoy. \n", 2026 | "\n", 2027 | "X-Men \n", 2028 | "Two mutants, Rogue and Wolverine, come to a private academy for their kind whose resident superhero team, the X-Men, must oppose a terrorist organization with similar powers. \n", 2029 | "\n", 2030 | "\n" 2031 | ] 2032 | } 2033 | ], 2034 | "source": [ 2035 | "query = \"Please recommend a movie similar to a marvel heros movie.\"\n", 2036 | "user_interact(query, model, copy.deepcopy(msg_prompt))" 2037 | ] 2038 | }, 2039 | { 2040 | "cell_type": "code", 2041 | "execution_count": 876, 2042 | "id": "mathematical-gabriel", 2043 | "metadata": {}, 2044 | "outputs": [ 2045 | { 2046 | "name": "stdout", 2047 | "output_type": "stream", 2048 | "text": [ 2049 | "user_intent : description\n", 2050 | "intent_data_msg : Of course! Let me explain what this item is and how it works.\n", 2051 | "\n", 2052 | " describe : Of course! Let me explain what this item is and how it works. 
{'feature': 'Adventure Action Science Fiction / X-Men / Two mutants, Rogue and Wolverine, come to a private academy for their kind whose resident superhero team, the X-Men, must oppose a terrorist organization with similar powers.'}\n" 2053 | ] 2054 | } 2055 | ], 2056 | "source": [ 2057 | "query = \"Can you describe on the above?\"\n", 2058 | "user_interact(query, model, copy.deepcopy(msg_prompt))" 2059 | ] 2060 | }, 2061 | { 2062 | "cell_type": "code", 2063 | "execution_count": 877, 2064 | "id": "515afec7", 2065 | "metadata": {}, 2066 | "outputs": [ 2067 | { 2068 | "data": { 2069 | "text/plain": [ 2070 | "[{'role': 'assistant',\n", 2071 | " 'content': 'Of course! Here are some top-rated movie items that you might enjoy. \\n\\nX-Men \\nTwo mutants, Rogue and Wolverine, come to a private academy for their kind whose resident superhero team, the X-Men, must oppose a terrorist organization with similar powers. \\n\\n'},\n", 2072 | " {'role': 'assistant',\n", 2073 | " 'content': {'feature': 'Adventure Action Science Fiction / X-Men / Two mutants, Rogue and Wolverine, come to a private academy for their kind whose resident superhero team, the X-Men, must oppose a terrorist organization with similar powers.'}}]" 2074 | ] 2075 | }, 2076 | "execution_count": 877, 2077 | "metadata": {}, 2078 | "output_type": "execute_result" 2079 | } 2080 | ], 2081 | "source": [ 2082 | "user_msg_history" 2083 | ] 2084 | }, 2085 | { 2086 | "cell_type": "code", 2087 | "execution_count": 878, 2088 | "id": "978cee80", 2089 | "metadata": {}, 2090 | "outputs": [ 2091 | { 2092 | "name": "stdout", 2093 | "output_type": "stream", 2094 | "text": [ 2095 | "user_intent : 'search'\n", 2096 | "intent_data_msg : Of course! We have a great selection of movie items that will fit your every need.\n", 2097 | "\n", 2098 | " recom data : Of course! We have a great selection of movie items that will fit your every need. 
\n", 2099 | "X-Men: Days of Future Past \n", 2100 | "The ultimate X-Men ensemble fights a war for the survival of the species across two time periods as they join forces with their younger selves in an epic battle that must change the past – to save our future. \n", 2101 | "\n", 2102 | "\n" 2103 | ] 2104 | } 2105 | ], 2106 | "source": [ 2107 | "query = \"Are there other movies that are similar to the ones above?\"\n", 2108 | "user_interact(query, model, copy.deepcopy(msg_prompt))" 2109 | ] 2110 | }, 2111 | { 2112 | "cell_type": "code", 2113 | "execution_count": null, 2114 | "id": "e2d7275c", 2115 | "metadata": {}, 2116 | "outputs": [], 2117 | "source": [] 2118 | }, 2119 | { 2120 | "cell_type": "markdown", 2121 | "id": "e74f1f6a", 2122 | "metadata": {}, 2123 | "source": [ 2124 | "# User tags" 2125 | ] 2126 | }, 2127 | { 2128 | "cell_type": "code", 2129 | "execution_count": 717, 2130 | "id": "f3bdc6a2", 2131 | "metadata": {}, 2132 | "outputs": [], 2133 | "source": [ 2134 | "user_perfer_tag = ['computer','tech','science']" 2135 | ] 2136 | }, 2137 | { 2138 | "cell_type": "code", 2139 | "execution_count": 718, 2140 | "id": "76d26f79", 2141 | "metadata": {}, 2142 | "outputs": [], 2143 | "source": [ 2144 | "top_result = get_query_sim_top_k(' '.join(user_perfer_tag), model, movies_metadata, top_k=3 )" 2145 | ] 2146 | }, 2147 | { 2148 | "cell_type": "code", 2149 | "execution_count": 719, 2150 | "id": "e3f79117", 2151 | "metadata": {}, 2152 | "outputs": [ 2153 | { 2154 | "data": { 2155 | "text/html": [ 2156 | "
\n", 2157 | "\n", 2170 | "\n", 2171 | " \n", 2172 | " \n", 2173 | " \n", 2174 | " \n", 2175 | " \n", 2176 | " \n", 2177 | " \n", 2178 | " \n", 2179 | " \n", 2180 | " \n", 2181 | " \n", 2182 | " \n", 2183 | " \n", 2184 | " \n", 2185 | " \n", 2186 | " \n", 2187 | " \n", 2188 | " \n", 2189 | " \n", 2190 | " \n", 2191 | " \n", 2192 | " \n", 2193 | " \n", 2194 | " \n", 2195 | " \n", 2196 | " \n", 2197 | " \n", 2198 | " \n", 2199 | "
titlegenresoverview
23040TranscendenceThriller Science Fiction DramaTwo leading computer scientists work toward th...
30222DebugHorror Science FictionSix young computer hackers sent to work on a d...
996The Lawnmower ManHorror Thriller Science FictionA simple man is turned into a genius through t...
\n", 2200 | "
" 2201 | ], 2202 | "text/plain": [ 2203 | " title genres \\\n", 2204 | "23040 Transcendence Thriller Science Fiction Drama \n", 2205 | "30222 Debug Horror Science Fiction \n", 2206 | "996 The Lawnmower Man Horror Thriller Science Fiction \n", 2207 | "\n", 2208 | " overview \n", 2209 | "23040 Two leading computer scientists work toward th... \n", 2210 | "30222 Six young computer hackers sent to work on a d... \n", 2211 | "996 A simple man is turned into a genius through t... " 2212 | ] 2213 | }, 2214 | "execution_count": 719, 2215 | "metadata": {}, 2216 | "output_type": "execute_result" 2217 | } 2218 | ], 2219 | "source": [ 2220 | "movies_metadata.iloc[top_result[1].numpy(), :][['title', 'genres', 'overview']]" 2221 | ] 2222 | }, 2223 | { 2224 | "cell_type": "code", 2225 | "execution_count": null, 2226 | "id": "00320379", 2227 | "metadata": {}, 2228 | "outputs": [], 2229 | "source": [] 2230 | }, 2231 | { 2232 | "cell_type": "code", 2233 | "execution_count": null, 2234 | "id": "4f9e9d6b", 2235 | "metadata": {}, 2236 | "outputs": [], 2237 | "source": [] 2238 | }, 2239 | { 2240 | "cell_type": "code", 2241 | "execution_count": 723, 2242 | "id": "6c0e88fa", 2243 | "metadata": {}, 2244 | "outputs": [ 2245 | { 2246 | "data": { 2247 | "text/html": [ 2248 | "
\n", 2249 | "\n", 2262 | "\n", 2263 | " \n", 2264 | " \n", 2265 | " \n", 2266 | " \n", 2267 | " \n", 2268 | " \n", 2269 | " \n", 2270 | " \n", 2271 | " \n", 2272 | " \n", 2273 | " \n", 2274 | " \n", 2275 | " \n", 2276 | " \n", 2277 | " \n", 2278 | " \n", 2279 | " \n", 2280 | " \n", 2281 | " \n", 2282 | " \n", 2283 | " \n", 2284 | " \n", 2285 | " \n", 2286 | " \n", 2287 | " \n", 2288 | " \n", 2289 | " \n", 2290 | " \n", 2291 | "
idgenrestitleoverviewrelease_datefeaturetext_lenfeature_lenhf_embeddings
23165127585Action Adventure FantasyX-Men: Days of Future PastThe ultimate X-Men ensemble fights a war for t...2014-05-15Action Adventure Fantasy / X-Men: Days of Futu...208264[-0.047461677, 0.002811962, -0.010485606, -0.0...
\n", 2292 | "
" 2293 | ], 2294 | "text/plain": [ 2295 | " id genres title \\\n", 2296 | "23165 127585 Action Adventure Fantasy X-Men: Days of Future Past \n", 2297 | "\n", 2298 | " overview release_date \\\n", 2299 | "23165 The ultimate X-Men ensemble fights a war for t... 2014-05-15 \n", 2300 | "\n", 2301 | " feature text_len \\\n", 2302 | "23165 Action Adventure Fantasy / X-Men: Days of Futu... 208 \n", 2303 | "\n", 2304 | " feature_len hf_embeddings \n", 2305 | "23165 264 [-0.047461677, 0.002811962, -0.010485606, -0.0... " 2306 | ] 2307 | }, 2308 | "execution_count": 723, 2309 | "metadata": {}, 2310 | "output_type": "execute_result" 2311 | } 2312 | ], 2313 | "source": [ 2314 | "movies_metadata[movies_metadata['title'] == 'X-Men: Days of Future Past']" 2315 | ] 2316 | }, 2317 | { 2318 | "cell_type": "code", 2319 | "execution_count": null, 2320 | "id": "05b01f55", 2321 | "metadata": {}, 2322 | "outputs": [], 2323 | "source": [] 2324 | }, 2325 | { 2326 | "cell_type": "code", 2327 | "execution_count": null, 2328 | "id": "422c3155", 2329 | "metadata": {}, 2330 | "outputs": [], 2331 | "source": [] 2332 | } 2333 | ], 2334 | "metadata": { 2335 | "kernelspec": { 2336 | "display_name": "Python 3 (ipykernel)", 2337 | "language": "python", 2338 | "name": "python3" 2339 | }, 2340 | "language_info": { 2341 | "codemirror_mode": { 2342 | "name": "ipython", 2343 | "version": 3 2344 | }, 2345 | "file_extension": ".py", 2346 | "mimetype": "text/x-python", 2347 | "name": "python", 2348 | "nbconvert_exporter": "python", 2349 | "pygments_lexer": "ipython3", 2350 | "version": "3.8.16" 2351 | } 2352 | }, 2353 | "nbformat": 4, 2354 | "nbformat_minor": 5 2355 | } 2356 | -------------------------------------------------------------------------------- /010. 
LLM based Explainability RecSys .ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 78, 6 | "id": "7a6caca3-db33-4610-87a3-3331ad342413", 7 | "metadata": {}, 8 | "outputs": [], 9 | "source": [ 10 | "from torch.utils.data import TensorDataset\n", 11 | "from torch.utils.data import DataLoader\n", 12 | "from torch.utils.data import Dataset\n", 13 | "from tqdm import tqdm\n", 14 | "from sklearn.preprocessing import LabelEncoder\n", 15 | "from sklearn.model_selection import train_test_split\n", 16 | "\n", 17 | "from langchain.docstore.document import Document\n", 18 | "from langchain.chains.summarize import load_summarize_chain\n", 19 | "from langchain_community.embeddings import HuggingFaceEmbeddings\n", 20 | "from langchain_openai import OpenAIEmbeddings\n", 21 | "from langchain.text_splitter import RecursiveCharacterTextSplitter\n", 22 | "from langchain_community.vectorstores.faiss import FAISS\n", 23 | "from langchain.document_loaders.csv_loader import CSVLoader\n", 24 | "from langchain.prompts import PromptTemplate\n", 25 | "from langchain_openai import OpenAI, ChatOpenAI\n", 26 | "from langchain.chains import LLMChain\n", 27 | "from langchain.chains import RetrievalQA\n", 28 | "\n", 29 | "from collections import defaultdict\n", 30 | "\n", 31 | "import time\n", 32 | "import os\n", 33 | "import random\n", 34 | "import pandas as pd\n", 35 | "import numpy as np\n", 36 | "import torch\n", 37 | "import torch.nn as nn\n", 38 | "import torch.nn.functional as F\n", 39 | "import torch.utils.data as Data\n", 40 | "import math\n", 41 | "import requests\n", 42 | "import json\n", 43 | "import pickle\n" 44 | ] 45 | }, 46 | { 47 | "cell_type": "code", 48 | "execution_count": 79, 49 | "id": "6db84a61-92a7-40e0-ba06-b6f8b0dc6a77", 50 | "metadata": {}, 51 | "outputs": [ 52 | { 53 | "data": { 54 | "text/plain": [ 55 | "True" 56 | ] 57 | }, 58 | "execution_count": 79, 59 | 
"metadata": {}, 60 | "output_type": "execute_result" 61 | } 62 | ], 63 | "source": [ 64 | "from dotenv import load_dotenv\n", 65 | "\n", 66 | "load_dotenv()" 67 | ] 68 | }, 69 | { 70 | "cell_type": "code", 71 | "execution_count": 80, 72 | "id": "0ddd644c-1973-4c61-a88e-b10d49cc2ba2", 73 | "metadata": {}, 74 | "outputs": [], 75 | "source": [ 76 | "project_path = os.path.abspath(os.getcwd())\n", 77 | "data_dir_nm = 'data'\n", 78 | "model_dir_nm = 'model'\n", 79 | "data_path = f\"{project_path}/{data_dir_nm}\"\n", 80 | "model_path = f\"{project_path}/{model_dir_nm}\"" 81 | ] 82 | }, 83 | { 84 | "cell_type": "markdown", 85 | "id": "5411d293-2104-4520-bb25-cc1d09d8385b", 86 | "metadata": {}, 87 | "source": [ 88 | "# Load data\n", 89 | "- MovieLens1M movie info\n", 90 | "- MovieLens test set\n", 91 | "- LabelEncoder\n", 92 | "- model" 93 | ] 94 | }, 95 | { 96 | "cell_type": "code", 97 | "execution_count": 81, 98 | "id": "c89865d0-d454-42e2-9f3f-67b0b046f7c1", 99 | "metadata": {}, 100 | "outputs": [], 101 | "source": [ 102 | "with open('./data/movielens1m_label_encoders.pkl', 'rb') as f:\n", 103 | " label_encoders = pickle.load(f)" 104 | ] 105 | }, 106 | { 107 | "cell_type": "code", 108 | "execution_count": 85, 109 | "id": "5bd214e6-5032-4b54-a3bf-f00fc5990333", 110 | "metadata": {}, 111 | "outputs": [], 112 | "source": [ 113 | "movie_info = pd.read_csv(f\"{data_path}/movies.csv\", dtype=str)\n", 114 | "test_df = pd.read_csv(f\"{data_path}/movielens1m_test.csv\", dtype=str)\n", 115 | "movielens_rcmm_origin = pd.read_csv(f\"{data_path}/movielens_rcmm.csv\", dtype=str)" 116 | ] 117 | }, 118 | { 119 | "cell_type": "code", 120 | "execution_count": 86, 121 | "id": "787d1ab6-4d70-4fc1-a527-4e4509e499ee", 122 | "metadata": {}, 123 | "outputs": [ 124 | { 125 | "data": { 126 | "text/html": [ 127 | "
\n", 128 | "\n", 141 | "\n", 142 | " \n", 143 | " \n", 144 | " \n", 145 | " \n", 146 | " \n", 147 | " \n", 148 | " \n", 149 | " \n", 150 | " \n", 151 | " \n", 152 | " \n", 153 | " \n", 154 | " \n", 155 | " \n", 156 | " \n", 157 | " \n", 158 | " \n", 159 | " \n", 160 | " \n", 161 | " \n", 162 | " \n", 163 | " \n", 164 | " \n", 165 | " \n", 166 | " \n", 167 | " \n", 168 | " \n", 169 | " \n", 170 | " \n", 171 | " \n", 172 | " \n", 173 | " \n", 174 | " \n", 175 | " \n", 176 | " \n", 177 | " \n", 178 | " \n", 179 | " \n", 180 | " \n", 181 | " \n", 182 | " \n", 183 | " \n", 184 | " \n", 185 | " \n", 186 | " \n", 187 | " \n", 188 | "
movie_idtitlemovie_decadegenre
01Toy Story1990sAnimation
12Jumanji1990sAdventure
23Grumpier Old Men1990sComedy
34Waiting to Exhale1990sComedy
45Father of the Bride Part II1990sComedy
\n", 189 | "
" 190 | ], 191 | "text/plain": [ 192 | " movie_id title movie_decade genre\n", 193 | "0 1 Toy Story 1990s Animation\n", 194 | "1 2 Jumanji 1990s Adventure\n", 195 | "2 3 Grumpier Old Men 1990s Comedy\n", 196 | "3 4 Waiting to Exhale 1990s Comedy\n", 197 | "4 5 Father of the Bride Part II 1990s Comedy" 198 | ] 199 | }, 200 | "execution_count": 86, 201 | "metadata": {}, 202 | "output_type": "execute_result" 203 | } 204 | ], 205 | "source": [ 206 | "movie_info.head()" 207 | ] 208 | }, 209 | { 210 | "cell_type": "code", 211 | "execution_count": 87, 212 | "id": "313ad90a-618f-4184-8426-8b39f6aaa701", 213 | "metadata": {}, 214 | "outputs": [ 215 | { 216 | "data": { 217 | "text/html": [ 218 | "
\n", 219 | "\n", 232 | "\n", 233 | " \n", 234 | " \n", 235 | " \n", 236 | " \n", 237 | " \n", 238 | " \n", 239 | " \n", 240 | " \n", 241 | " \n", 242 | " \n", 243 | " \n", 244 | " \n", 245 | " \n", 246 | " \n", 247 | " \n", 248 | " \n", 249 | " \n", 250 | " \n", 251 | " \n", 252 | " \n", 253 | " \n", 254 | " \n", 255 | " \n", 256 | " \n", 257 | " \n", 258 | " \n", 259 | " \n", 260 | " \n", 261 | " \n", 262 | " \n", 263 | " \n", 264 | " \n", 265 | " \n", 266 | " \n", 267 | " \n", 268 | " \n", 269 | " \n", 270 | " \n", 271 | " \n", 272 | " \n", 273 | " \n", 274 | " \n", 275 | " \n", 276 | " \n", 277 | " \n", 278 | " \n", 279 | " \n", 280 | " \n", 281 | " \n", 282 | " \n", 283 | " \n", 284 | " \n", 285 | " \n", 286 | " \n", 287 | " \n", 288 | " \n", 289 | " \n", 290 | " \n", 291 | " \n", 292 | " \n", 293 | " \n", 294 | " \n", 295 | " \n", 296 | " \n", 297 | " \n", 298 | " \n", 299 | " \n", 300 | " \n", 301 | " \n", 302 | " \n", 303 | " \n", 304 | " \n", 305 | " \n", 306 | " \n", 307 | " \n", 308 | " \n", 309 | " \n", 310 | " \n", 311 | " \n", 312 | " \n", 313 | " \n", 314 | " \n", 315 | " \n", 316 | " \n", 317 | " \n", 318 | " \n", 319 | " \n", 320 | " \n", 321 | " \n", 322 | " \n", 323 | " \n", 324 | " \n", 325 | " \n", 326 | " \n", 327 | " \n", 328 | " \n", 329 | " \n", 330 | " \n", 331 | " \n", 332 | " \n", 333 | " \n", 334 | " \n", 335 | " \n", 336 | " \n", 337 | " \n", 338 | " \n", 339 | " \n", 340 | " \n", 341 | " \n", 342 | " \n", 343 | " \n", 344 | " \n", 345 | "
user_idmovie_idmovie_decademovie_yearrating_yearrating_monthrating_decadegenre1genre2genre3genderageoccupationziplabel
02741957765070461512630781.0
149316098700500141512319181.0
257863143873011041715111533970.0
35917174187801004171514134170.0
4133910096520100001514418001.0
\n", 346 | "
" 347 | ], 348 | "text/plain": [ 349 | " user_id movie_id movie_decade movie_year rating_year rating_month \\\n", 350 | "0 2741 957 7 65 0 7 \n", 351 | "1 4931 609 8 70 0 5 \n", 352 | "2 5786 3143 8 73 0 11 \n", 353 | "3 5917 1741 8 78 0 10 \n", 354 | "4 1339 1009 6 52 0 10 \n", 355 | "\n", 356 | " rating_decade genre1 genre2 genre3 gender age occupation zip label \n", 357 | "0 0 4 6 15 1 2 6 3078 1.0 \n", 358 | "1 0 0 14 15 1 2 3 1918 1.0 \n", 359 | "2 0 4 17 15 1 1 15 3397 0.0 \n", 360 | "3 0 4 17 15 1 4 13 417 0.0 \n", 361 | "4 0 0 0 15 1 4 4 1800 1.0 " 362 | ] 363 | }, 364 | "execution_count": 87, 365 | "metadata": {}, 366 | "output_type": "execute_result" 367 | } 368 | ], 369 | "source": [ 370 | "test_df.head()" 371 | ] 372 | }, 373 | { 374 | "cell_type": "markdown", 375 | "id": "676a5c72-9880-4c6a-9265-9c2be4e41f2f", 376 | "metadata": {}, 377 | "source": [ 378 | "# Set dataset" 379 | ] 380 | }, 381 | { 382 | "cell_type": "code", 383 | "execution_count": 16, 384 | "id": "0d842fff-d3fb-485f-a845-5707ae1be82e", 385 | "metadata": {}, 386 | "outputs": [], 387 | "source": [ 388 | "class MVLensDataset(Dataset):\n", 389 | " def __init__(self, data, u_i_cols, label_col):\n", 390 | " self.n = data.shape[0]\n", 391 | " self.y = data[label_col].astype(np.float32).values.reshape(-1, 1)\n", 392 | "\n", 393 | " self.u_i_cols = u_i_cols\n", 394 | " \n", 395 | " self.data_v = data[self.u_i_cols].astype(np.int64).values\n", 396 | "\n", 397 | " self.field_dims = np.max(self.data_v, axis=0) + 1\n", 398 | "\n", 399 | "\n", 400 | " def __len__(self):\n", 401 | " return self.n\n", 402 | "\n", 403 | " def __getitem__(self, idx):\n", 404 | " return [self.data_v[idx], self.y[idx]]\n", 405 | " \n", 406 | "u_i_feature = ['user_id', 'movie_id']\n", 407 | "label = 'label'\n", 408 | "batch_size = 512\n", 409 | "test_dataset = MVLensDataset(data=test_df, u_i_cols=u_i_feature, label_col=label)\n", 410 | "test_dataloader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)" 411 | 
] 412 | }, 413 | { 414 | "cell_type": "markdown", 415 | "id": "e132b9d3-bd38-4901-9048-7ae84cb55d63", 416 | "metadata": {}, 417 | "source": [ 418 | "# Set Model (NCF)" 419 | ] 420 | }, 421 | { 422 | "cell_type": "code", 423 | "execution_count": 10, 424 | "id": "fb9fb05d-3342-4fcd-a8f0-c548ffb157a3", 425 | "metadata": {}, 426 | "outputs": [], 427 | "source": [ 428 | "class NeuMF(torch.nn.Module):\n", 429 | " def __init__(self, config):\n", 430 | " super(NeuMF, self).__init__()\n", 431 | " # config \n", 432 | " self.config = config\n", 433 | " self.num_users = config['num_users']\n", 434 | " self.num_items = config['num_items']\n", 435 | " self.latent_dim_mf = config['latent_dim_mf']\n", 436 | " self.latent_dim_mlp = config['latent_dim_mlp']\n", 437 | " # Embedding setting\n", 438 | " self.embedding_user_mlp = torch.nn.Embedding(num_embeddings=self.num_users, embedding_dim=self.latent_dim_mlp)\n", 439 | " self.embedding_item_mlp = torch.nn.Embedding(num_embeddings=self.num_items, embedding_dim=self.latent_dim_mlp)\n", 440 | " self.embedding_user_mf = torch.nn.Embedding(num_embeddings=self.num_users, embedding_dim=self.latent_dim_mf)\n", 441 | " self.embedding_item_mf = torch.nn.Embedding(num_embeddings=self.num_items, embedding_dim=self.latent_dim_mf)\n", 442 | " # MLP layer\n", 443 | " self.fc_layers = torch.nn.ModuleList()\n", 444 | " for idx, (in_size, out_size) in enumerate(zip(config['layers'][:-1], config['layers'][1:])):\n", 445 | " self.fc_layers.append(torch.nn.Linear(in_size, out_size))\n", 446 | " # output layer\n", 447 | " self.affine_output = torch.nn.Linear(in_features=config['layers'][-1] + config['latent_dim_mf'], out_features=1)\n", 448 | " self.logistic = torch.nn.Sigmoid()\n", 449 | "\n", 450 | " def forward(self, user_indices, item_indices):\n", 451 | " user_embedding_mlp = self.embedding_user_mlp(user_indices)\n", 452 | " item_embedding_mlp = self.embedding_item_mlp(item_indices)\n", 453 | " user_embedding_mf = 
self.embedding_user_mf(user_indices)\n", 454 | " item_embedding_mf = self.embedding_item_mf(item_indices)\n", 455 | " # MLP, MF\n", 456 | " mlp_vector = torch.cat([user_embedding_mlp, item_embedding_mlp], dim=-1)\n", 457 | " mf_vector =torch.mul(user_embedding_mf, item_embedding_mf)\n", 458 | "\n", 459 | " # MLP feed\n", 460 | " for idx, _ in enumerate(range(len(self.fc_layers))):\n", 461 | " mlp_vector = self.fc_layers[idx](mlp_vector)\n", 462 | " mlp_vector = torch.nn.ReLU()(mlp_vector)\n", 463 | " # concat MLP & MF\n", 464 | " vector = torch.cat([mlp_vector, mf_vector], dim=-1)\n", 465 | " # prediction\n", 466 | " logits = self.affine_output(vector)\n", 467 | " rating = self.logistic(logits)\n", 468 | " return rating" 469 | ] 470 | }, 471 | { 472 | "cell_type": "code", 473 | "execution_count": 14, 474 | "id": "8ac6ae2f-5d1f-4541-896b-cb2d1f806551", 475 | "metadata": {}, 476 | "outputs": [ 477 | { 478 | "data": { 479 | "text/plain": [ 480 | "" 481 | ] 482 | }, 483 | "execution_count": 14, 484 | "metadata": {}, 485 | "output_type": "execute_result" 486 | } 487 | ], 488 | "source": [ 489 | "config = {\n", 490 | " 'num_users': 6040,\n", 491 | " 'num_items': 3706,\n", 492 | " 'latent_dim_mf': 8,\n", 493 | " 'latent_dim_mlp': 16,\n", 494 | " 'layers': [32, 16, 8]\n", 495 | "}\n", 496 | "model = NeuMF(config)\n", 497 | "model.load_state_dict(torch.load('./model/ncf_mlm'))" 498 | ] 499 | }, 500 | { 501 | "cell_type": "code", 502 | "execution_count": 19, 503 | "id": "45fd4228-a3f4-4693-91c1-6ab262bbbaf6", 504 | "metadata": {}, 505 | "outputs": [], 506 | "source": [ 507 | "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")" 508 | ] 509 | }, 510 | { 511 | "cell_type": "markdown", 512 | "id": "6cc369fc-8d80-41c6-8e77-08855970bb6f", 513 | "metadata": {}, 514 | "source": [ 515 | "# Predict test data" 516 | ] 517 | }, 518 | { 519 | "cell_type": "code", 520 | "execution_count": 32, 521 | "id": "80db479a-da2e-4d5c-9e84-7ebaf3ffdcfb", 522 | "metadata": {}, 
523 | "outputs": [], 524 | "source": [ 525 | "user_pred_info = {}\n", 526 | "top = 10\n", 527 | "\n", 528 | "def test_model(model, test_loader):\n", 529 | " # eval mode\n", 530 | " model.eval()\n", 531 | " user_pred_info = defaultdict(list)\n", 532 | " with torch.no_grad():\n", 533 | " with tqdm(test_loader, unit='batch') as tepoch:\n", 534 | " for samples in tepoch:\n", 535 | " user_items, y = samples[0], samples[1]\n", 536 | " user_items, y = user_items.to(device), y.to(device)\n", 537 | " # user=0, item=1\n", 538 | " y_pred = model(user_items[:, 0], user_items[:, 1])\n", 539 | " for user_item, p in zip(user_items, y_pred):\n", 540 | " # save model predict result\n", 541 | " user_pred_info[int(user_item[0])].append((int(user_item[1]), float(p)))\n", 542 | " return user_pred_info" 543 | ] 544 | }, 545 | { 546 | "cell_type": "code", 547 | "execution_count": 33, 548 | "id": "c0e348d1-ac22-49b6-bd4d-bad608210087", 549 | "metadata": {}, 550 | "outputs": [ 551 | { 552 | "name": "stderr", 553 | "output_type": "stream", 554 | "text": [ 555 | "100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 391/391 [00:04<00:00, 93.74batch/s]\n" 556 | ] 557 | } 558 | ], 559 | "source": [ 560 | "ncf_user_pred_info = test_model(model, test_dataloader)" 561 | ] 562 | }, 563 | { 564 | "cell_type": "markdown", 565 | "id": "326ba940-8dba-44c0-bce8-9e447afa5eea", 566 | "metadata": {}, 567 | "source": [ 568 | "# Get Ranked list" 569 | ] 570 | }, 571 | { 572 | "cell_type": "code", 573 | "execution_count": 34, 574 | "id": "418e6877-cf41-437d-8df2-49b0eec54707", 575 | "metadata": {}, 576 | "outputs": [ 577 | { 578 | "name": "stderr", 579 | "output_type": "stream", 580 | "text": [ 581 | 
"100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6035/6035 [00:00<00:00, 98230.88it/s]\n" 582 | ] 583 | } 584 | ], 585 | "source": [ 586 | "for user, data_info in tqdm(ncf_user_pred_info.items(), total=len(ncf_user_pred_info), position=0, leave=True):\n", 587 | " # sort by predicted probability (descending) and keep the top 10\n", 588 | " ranklist = sorted(data_info, key=lambda s : s[1], reverse=True)[:top]\n", 589 | " # keep item ids only, de-duplicated in rank order\n", 590 | " ranklist = list(dict.fromkeys([r[0] for r in ranklist]))\n", 591 | " user_pred_info[str(user)] = ranklist" 592 | ] 593 | }, 594 | { 595 | "cell_type": "code", 596 | "execution_count": 88, 597 | "id": "dafb43dc-8084-44af-8d0e-89aef28e1b07", 598 | "metadata": {}, 599 | "outputs": [ 600 | { 601 | "name": "stdout", 602 | "output_type": "stream", 603 | "text": [ 604 | "Recommendation list for user 2741 : [3613, 3306, 191, 2855, 957, 996, 2577]\n" 605 | ] 606 | } 607 | ], 608 | "source": [ 609 | "for user, recom_list in user_pred_info.items():\n", 610 | " print(f\"Recommendation list for user {user} : {recom_list}\")\n", 611 | " break" 612 | ] 613 | }, 614 | { 615 | "cell_type": "code", 616 | "execution_count": null, 617 | "id": "b9ed6550-8f10-4d7c-968b-9ee884ab6c67", 618 | "metadata": {}, 619 | "outputs": [], 620 | "source": [] 621 | }, 622 | { 623 | "cell_type": "code", 624 | "execution_count": null, 625 | "id": "1fb0999e-eded-415b-9f34-bbac7b8f5c05", 626 | "metadata": {}, 627 | "outputs": [], 628 | "source": [] 629 | }, 630 | { 631 | "cell_type": "markdown", 632 | "id": "d49bb508-9498-435d-8a1f-cd708be81c1c", 633 | "metadata": {}, 634 | "source": [ 635 | "# Sampling random users" 636 | ] 637 | }, 638 | { 639 | "cell_type": "code", 640 | "execution_count": 45, 641 | "id": "4b0f0016-50ac-4652-996d-1bc6eb0f9770", 642 | "metadata": {}, 643 | "outputs": [ 644 | { 645 | "data": { 646 | "text/plain": [ 647
| "['1662']" 648 | ] 649 | }, 650 | "execution_count": 45, 651 | "metadata": {}, 652 | "output_type": "execute_result" 653 | } 654 | ], 655 | "source": [ 656 | "random_user_origin = random.sample(list(user_pred_info.keys()), 1)\n", 657 | "sample_user_pred_info = user_pred_info[random_user_origin[0]]\n", 658 | "random_user = list(map(int, random_user_origin))\n", 659 | "random_user = label_encoders['user_id'].inverse_transform(random_user)\n", 660 | "random_user = list(map(str, random_user))\n", 661 | "random_user" 662 | ] 663 | }, 664 | { 665 | "cell_type": "code", 666 | "execution_count": 46, 667 | "id": "11562c70-94cf-4124-b0ae-4f7c9a31aa1e", 668 | "metadata": {}, 669 | "outputs": [ 670 | { 671 | "data": { 672 | "text/plain": [ 673 | "array(['296', '1580', '3943', '3863'], dtype=object)" 674 | ] 675 | }, 676 | "execution_count": 46, 677 | "metadata": {}, 678 | "output_type": "execute_result" 679 | } 680 | ], 681 | "source": [ 682 | "sample_user_pred_info_trans = list(map(int, sample_user_pred_info)) \n", 683 | "sample_user_pred_info_trans = label_encoders['movie_id'].inverse_transform(sample_user_pred_info_trans)\n", 684 | "sample_user_pred_info_trans" 685 | ] 686 | }, 687 | { 688 | "cell_type": "code", 689 | "execution_count": 47, 690 | "id": "5a64030a-7d71-4130-bcfa-d3a9fedffecb", 691 | "metadata": {}, 692 | "outputs": [ 693 | { 694 | "data": { 695 | "text/html": [ 696 | "
" 752 | ], 753 | "text/plain": [ 754 | " movie_id title movie_decade genre\n", 755 | "293 296 Pulp Fiction 1990s Crime\n", 756 | "1539 1580 Men in Black 1990s Action\n", 757 | "3793 3863 Cell, The 2000s Sci-Fi\n", 758 | "3873 3943 Bamboozled 2000s Comedy" 759 | ] 760 | }, 761 | "execution_count": 47, 762 | "metadata": {}, 763 | "output_type": "execute_result" 764 | } 765 | ], 766 | "source": [ 767 | "movie_info[movie_info['movie_id'].isin(sample_user_pred_info_trans)]" 768 | ] 769 | }, 770 | { 771 | "cell_type": "code", 772 | "execution_count": 48, 773 | "id": "32f28b93-ef7a-4c2c-be23-9301cafae59e", 774 | "metadata": {}, 775 | "outputs": [ 776 | { 777 | "name": "stdout", 778 | "output_type": "stream", 779 | "text": [ 780 | "(25, 15)\n" 781 | ] 782 | }, 783 | { 784 | "data": { 785 | "text/html": [ 786 | "
" 915 | ], 916 | "text/plain": [ 917 | " user_id movie_id movie_decade movie_year rating_year rating_month \\\n", 918 | "962948 1662 527 1990s 1993 2000 11 \n", 919 | "962949 1662 2762 1990s 1999 2000 11 \n", 920 | "962950 1662 1259 1980s 1986 2000 11 \n", 921 | "962951 1662 589 1990s 1991 2000 11 \n", 922 | "962952 1662 2858 1990s 1999 2000 11 \n", 923 | "\n", 924 | " rating_decade genre1 genre2 genre3 gender age occupation \\\n", 925 | "962948 2000s Drama War non data M 25 12 \n", 926 | "962949 2000s Thriller non data non data M 25 12 \n", 927 | "962950 2000s Adventure Comedy Drama M 25 12 \n", 928 | "962951 2000s Action Sci-Fi Thriller M 25 12 \n", 929 | "962952 2000s Comedy Drama non data M 25 12 \n", 930 | "\n", 931 | " zip label \n", 932 | "962948 94121 1 \n", 933 | "962949 94121 1 \n", 934 | "962950 94121 1 \n", 935 | "962951 94121 1 \n", 936 | "962952 94121 1 " 937 | ] 938 | }, 939 | "execution_count": 48, 940 | "metadata": {}, 941 | "output_type": "execute_result" 942 | } 943 | ], 944 | "source": [ 945 | "sample_user_history = movielens_rcmm_origin[movielens_rcmm_origin['user_id'] == random_user[0]].fillna('non data')\n", 946 | "print(sample_user_history.shape)\n", 947 | "sample_user_history.head()" 948 | ] 949 | }, 950 | { 951 | "cell_type": "code", 952 | "execution_count": null, 953 | "id": "4d527c69-eeb6-4337-aa36-d81791cdfa56", 954 | "metadata": {}, 955 | "outputs": [], 956 | "source": [] 957 | }, 958 | { 959 | "cell_type": "markdown", 960 | "id": "b65b5906-e36e-4d1f-a725-5687b52cd3ad", 961 | "metadata": {}, 962 | "source": [ 963 | "# Set user info by history --> to LLM input" 964 | ] 965 | }, 966 | { 967 | "cell_type": "code", 968 | "execution_count": 49, 969 | "id": "4c8fc4d4-2e96-4daa-aa85-7daa17faa783", 970 | "metadata": {}, 971 | "outputs": [], 972 | "source": [ 973 | "# Recent user info\n", 974 | "recent_ratio = int(sample_user_history.shape[0] * 0.1)\n", 975 | "user_data = movielens_rcmm_origin[movielens_rcmm_origin['user_id'] == 
random_user[0]].fillna('non data')[['movie_decade', 'movie_year', 'rating_year', 'rating_decade', 'genre1', 'genre2', 'gender', 'age', 'zip']].values[:recent_ratio]\n", 976 | "recent_user_hist_info = \"#### Item interaction information\\n\\n- (item) : metadata information of items \\n- (user) : metadata information of users\"\n", 977 | "for cnt, rows in enumerate(user_data):\n", 978 | " recent_user_hist_info += f\"\\n\\n{cnt+1}th.\\n- (Item) Movie Release Decade (ex. 1990s movies): {rows[0]}\\n- (Item) Movie Release Year: {rows[1]}\\n- (User) Rating Year: {rows[2]}\\n- (User) Rating Decade (e.g., 1990s ratings): {rows[3]}\\n- (Item) Genre 1: {rows[4]}\\n- (Item) Genre 2: {rows[5]}\\n- (User) Gender: {rows[6]}\\n- (User) Age: {rows[7]}\\n- (User) Address Information (zipcode): {rows[8]}\\n##### End of {cnt+1}th item interaction information\"" 979 | ] 980 | }, 981 | { 982 | "cell_type": "code", 983 | "execution_count": 51, 984 | "id": "edf2ac7e-1a0f-414a-ba46-84c35e00432b", 985 | "metadata": {}, 986 | "outputs": [], 987 | "source": [ 988 | "# Entire user history information\n", 989 | "user_data = movielens_rcmm_origin[movielens_rcmm_origin['user_id'] == random_user[0]].fillna('non data')[['movie_decade', 'movie_year', 'rating_year', 'rating_decade', 'genre1', 'genre2', 'gender', 'age', 'zip']].values\n", 990 | "user_all_hist_info = \"#### Item interaction information\\n\\n- (item) : metadata information of items \\n- (user) : metadata information of users\"\n", 991 | "for cnt, rows in enumerate(user_data):\n", 992 | " user_all_hist_info += f\"\\n\\n{cnt+1}th.\\n- (Item) Movie Release Decade (ex. 
1990s movies): {rows[0]}\\n- (Item) Movie Release Year: {rows[1]}\\n- (User) Rating Year: {rows[2]}\\n- (User) Rating Decade (e.g., 1990s ratings): {rows[3]}\\n- (Item) Genre 1: {rows[4]}\\n- (Item) Genre 2: {rows[5]}\\n- (User) Gender: {rows[6]}\\n- (User) Age: {rows[7]}\\n- (User) Address Information (zipcode): {rows[8]}\\n##### End of {cnt+1}th item interaction information\"" 993 | ] 994 | }, 995 | { 996 | "cell_type": "code", 997 | "execution_count": 52, 998 | "id": "18f086d7-128c-4542-b7af-59917caa67e1", 999 | "metadata": {}, 1000 | "outputs": [ 1001 | { 1002 | "name": "stdout", 1003 | "output_type": "stream", 1004 | "text": [ 1005 | "#### Item interaction information\n", 1006 | "\n", 1007 | "- (item) : metadata information of items \n", 1008 | "- (user) : metadata information of users\n", 1009 | "\n", 1010 | "1th.\n", 1011 | "- (Item) Movie Release Decade (ex. 1990s movies): 1990s\n", 1012 | "- (Item) Movie Release Year: 1993\n", 1013 | "- (User) Rating Year: 2000\n", 1014 | "- (User) Rating Decade (e.g., 1990s ratings): 2000s\n", 1015 | "- (Item) Genre 1: Drama\n", 1016 | "- (Item) Genre 2: War\n", 1017 | "- (User) Gender: M\n", 1018 | "- (User) Age: 25\n", 1019 | "- (User) Address Information (zipcode): 94121\n", 1020 | "##### End of 1th item interaction information\n", 1021 | "\n", 1022 | "2th.\n", 1023 | "- (Item) Movie Release Decade (ex. 
1990s movies): 1990s\n", 1024 | "- (Item) Movie Release Year: 1999\n", 1025 | "- (User) Rating Year: 2000\n", 1026 | "- (User) Rating Decade (e.g., 1990s ratings): 2000s\n", 1027 | "- (Item) Genre 1: Thriller\n", 1028 | "- (Item) Genre 2: non data\n", 1029 | "- (User) Gender: M\n", 1030 | "- (User) Age: 25\n", 1031 | "- (User) Address Information (zipcode): 94121\n", 1032 | "##### End of 2th item interaction information\n" 1033 | ] 1034 | } 1035 | ], 1036 | "source": [ 1037 | "print(recent_user_hist_info)" 1038 | ] 1039 | }, 1040 | { 1041 | "cell_type": "code", 1042 | "execution_count": 55, 1043 | "id": "5b3fc1a7-1412-4258-b2e7-b7f507d765c3", 1044 | "metadata": {}, 1045 | "outputs": [ 1046 | { 1047 | "name": "stdout", 1048 | "output_type": "stream", 1049 | "text": [ 1050 | "#### Item interaction information\n", 1051 | "\n", 1052 | "- (item) : metadata information of items \n", 1053 | "- (user) : metadata information of users\n", 1054 | "\n", 1055 | "1th.\n", 1056 | "- (Item) Movie Release Decade (ex. 1990s movies): 1990s\n", 1057 | "- (Item) Movie Release Year: 1993\n", 1058 | "- (User) Rating Year: 2000\n", 1059 | "- (User) Rating Decade (e.g., 1990s ratings): 2000s\n", 1060 | "- (Item) Genre 1: Drama\n", 1061 | "- (Item) Genre 2: War\n", 1062 | "- (User) Gender: M\n", 1063 | "- (User) Age: 25\n", 1064 | "- (User) Address Information (zipcode): 94121\n", 1065 | "##### End of 1th item interaction information\n", 1066 | "\n", 1067 | "2th.\n", 1068 | "- (Item) Movie Release Decade (ex. 
1990s movies): 1990s\n", 1069 | "- (Item) Movie Release Year: 1999\n", 1070 | "- (User) Rating Year: 2000\n", 1071 | "- (User) Rating Decade (e.g., 1990s ratings): 2000s\n", 1072 | "- (Item) Genre 1: Thriller\n", 1073 | "- (Item) Genre 2: non data\n", 1074 | "- (User) Gender: M\n", 1075 | "- (User) Age: 25\n", 1076 | "- (User) Address Information (zipcode): 94121\n", 1077 | "##### End of 2th item interaction information\n", 1078 | "\n", 1079 | "3th.\n", 1080 | "- (Item) Movie Release Decade (ex. 1990s movies): 1980s\n", 1081 | "- (Item) Movie Release Year: 1986\n", 1082 | "- (User) Rating Year: 2000\n", 1083 | "- (User) Rating Decade (e.g., 1990s ratings): 2000s\n", 1084 | "- (Item) Genre 1: Adventure\n", 1085 | "- (Item) Genre 2: Comedy\n", 1086 | "- (User) Gender: M\n", 1087 | "- (User) Age: 25\n", 1088 | "- (User) Address Information (zipcode): 94121\n", 1089 | "##### End of 3th item interaction information\n", 1090 | "\n", 1091 | "4th.\n", 1092 | "- (Item) Movie Release Decade (ex. 1990s movies): 1990s\n", 1093 | "- (Item) Movie Release Year: 1991\n", 1094 | "- (User) Rating Year: 2000\n", 1095 | "- (User) Rating Decade (e.g., 1990s ratings): 2000s\n", 1096 | "- (Item) Genre 1: Action\n", 1097 | "- (Item) Genre 2: Sci-Fi\n", 1098 | "- (User) Gender: M\n", 1099 | "- (User) Age: 25\n", 1100 | "- (User) Address Information (zipcode): 94121\n", 1101 | "##### End of \n" 1102 | ] 1103 | } 1104 | ], 1105 | "source": [ 1106 | "print(user_all_hist_info[:1500])" 1107 | ] 1108 | }, 1109 | { 1110 | "cell_type": "code", 1111 | "execution_count": null, 1112 | "id": "779d69b8-19e4-43d5-9abf-4bcb24b4baa1", 1113 | "metadata": {}, 1114 | "outputs": [], 1115 | "source": [] 1116 | }, 1117 | { 1118 | "cell_type": "markdown", 1119 | "id": "7a478f9e-8a3c-4bad-915f-7e9558fe4947", 1120 | "metadata": {}, 1121 | "source": [ 1122 | "# Summarize user history" 1123 | ] 1124 | }, 1125 | { 1126 | "cell_type": "code", 1127 | "execution_count": 56, 1128 | "id": 
"256bd0f1-f29b-4fa5-9fd2-6fa34bddbaca", 1129 | "metadata": {}, 1130 | "outputs": [ 1131 | { 1132 | "name": "stdout", 1133 | "output_type": "stream", 1134 | "text": [ 1135 | "9011\n" 1136 | ] 1137 | } 1138 | ], 1139 | "source": [ 1140 | "print(len(user_all_hist_info))" 1141 | ] 1142 | }, 1143 | { 1144 | "cell_type": "code", 1145 | "execution_count": 57, 1146 | "id": "c8ef4f29-0c12-481e-9bc4-928c45cf8a9d", 1147 | "metadata": {}, 1148 | "outputs": [], 1149 | "source": [ 1150 | "docs = []\n", 1151 | "text_splitter = RecursiveCharacterTextSplitter(chunk_size=550, chunk_overlap=100)\n", 1152 | "texts = text_splitter.split_text(user_all_hist_info)\n", 1153 | "docs += [Document(page_content=t) for t in texts]" 1154 | ] 1155 | }, 1156 | { 1157 | "cell_type": "code", 1158 | "execution_count": 58, 1159 | "id": "aa4d9dd9-49f7-42d5-b0e3-8618e7eac0a3", 1160 | "metadata": {}, 1161 | "outputs": [], 1162 | "source": [ 1163 | "template = '''Below is the user's past history information. Considering the user's main characteristics, persona, preferences, and meaningful patterns, please summarize the user information within 700 characters.\\n\\n##### User history information: {text}.'''\n", 1164 | "\n", 1165 | "prompt = PromptTemplate(template=template, input_variables=['text'])\n", 1166 | "\n", 1167 | "llm = ChatOpenAI(temperature=0, model='gpt-4o')\n" 1168 | ] 1169 | }, 1170 | { 1171 | "cell_type": "code", 1172 | "execution_count": 59, 1173 | "id": "8958d283-47ed-4549-9620-338e2771d052", 1174 | "metadata": {}, 1175 | "outputs": [ 1176 | { 1177 | "name": "stderr", 1178 | "output_type": "stream", 1179 | "text": [ 1180 | "/Users/leesoojin/opt/anaconda3/envs/llm/lib/python3.9/site-packages/langchain_core/_api/deprecation.py:117: LangChainDeprecationWarning: The function `run` was deprecated in LangChain 0.1.0 and will be removed in 0.2.0. 
Use invoke instead.\n", 1181 | " warn_deprecated(\n" 1182 | ] 1183 | } 1184 | ], 1185 | "source": [ 1186 | "chain = load_summarize_chain(llm, \n", 1187 | " chain_type='map_reduce', \n", 1188 | " map_prompt=prompt, combine_prompt=prompt,\n", 1189 | " verbose=False)\n", 1190 | "summary = chain.run(docs)" 1191 | ] 1192 | }, 1193 | { 1194 | "cell_type": "code", 1195 | "execution_count": 60, 1196 | "id": "bce8a1cb-5e94-439c-9a99-fc0692846a1e", 1197 | "metadata": {}, 1198 | "outputs": [ 1199 | { 1200 | "data": { 1201 | "text/plain": [ 1202 | "'The user is a 25-year-old male residing in the 94121 zip code area. He has a strong preference for movies from the 1990s, with specific interests in various years such as 1990, 1991, 1993, 1994, 1996, 1997, 1998, and 1999. His favorite genres include drama, thriller, comedy, romance, action, adventure, and sci-fi. He tends to rate movies primarily in the early 2000s, suggesting a nostalgic inclination towards films from his formative years. This user enjoys revisiting and evaluating films from the past, reflecting a blend of nostalgia and a methodical approach to his movie-watching habits.'" 1203 | ] 1204 | }, 1205 | "execution_count": 60, 1206 | "metadata": {}, 1207 | "output_type": "execute_result" 1208 | } 1209 | ], 1210 | "source": [ 1211 | "summary" 1212 | ] 1213 | }, 1214 | { 1215 | "cell_type": "markdown", 1216 | "id": "d8e95e1d-ed30-46af-9080-30afd5ff5301", 1217 | "metadata": {}, 1218 | "source": [ 1219 | "# Get user persona and characteristics" 1220 | ] 1221 | }, 1222 | { 1223 | "cell_type": "code", 1224 | "execution_count": 62, 1225 | "id": "1c98559a-d4d1-4ab3-84b6-90978e0c43b0", 1226 | "metadata": {}, 1227 | "outputs": [], 1228 | "source": [ 1229 | "\n", 1230 | "template = \"\"\"Below is the user's item interaction history information. 
Using this data, please derive the user's main characteristics, persona, preferences, and meaningful patterns.\n", 1231 | "\n", 1232 | "# User history information\n", 1233 | "{user_hist}\n", 1234 | "\n", 1235 | "Please output in the following format:\n", 1236 | "\n", 1237 | "- Main characteristics of the user: string\n", 1238 | "- User persona: string\n", 1239 | "- User preferences: string\n", 1240 | "- Meaningful patterns of the user: string\n", 1241 | "\n", 1242 | "\"\"\"\n", 1243 | "prompt = PromptTemplate(template=template, input_variables=['user_hist'])\n" 1244 | ] 1245 | }, 1246 | { 1247 | "cell_type": "code", 1248 | "execution_count": 63, 1249 | "id": "782e9108-ee4b-4e4a-90c4-9b1b951367fd", 1250 | "metadata": {}, 1251 | "outputs": [], 1252 | "source": [ 1253 | "llm = ChatOpenAI(temperature=0, model='gpt-4o')\n", 1254 | "chain = LLMChain(llm=llm, prompt=prompt)\n", 1255 | "user_recent_summary = chain.invoke({'user_hist': recent_user_hist_info})" 1256 | ] 1257 | }, 1258 | { 1259 | "cell_type": "code", 1260 | "execution_count": 64, 1261 | "id": "db87eae3-0cbe-4c5b-98bd-6bf0c5b6e99e", 1262 | "metadata": {}, 1263 | "outputs": [ 1264 | { 1265 | "data": { 1266 | "text/plain": [ 1267 | "{'user_hist': '#### Item interaction information\\n\\n- (item) : metadata information of items \\n- (user) : metadata information of users\\n\\n1th.\\n- (Item) Movie Release Decade (ex. 1990s movies): 1990s\\n- (Item) Movie Release Year: 1993\\n- (User) Rating Year: 2000\\n- (User) Rating Decade (e.g., 1990s ratings): 2000s\\n- (Item) Genre 1: Drama\\n- (Item) Genre 2: War\\n- (User) Gender: M\\n- (User) Age: 25\\n- (User) Address Information (zipcode): 94121\\n##### End of 1th item interaction information\\n\\n2th.\\n- (Item) Movie Release Decade (ex. 
1990s movies): 1990s\\n- (Item) Movie Release Year: 1999\\n- (User) Rating Year: 2000\\n- (User) Rating Decade (e.g., 1990s ratings): 2000s\\n- (Item) Genre 1: Thriller\\n- (Item) Genre 2: non data\\n- (User) Gender: M\\n- (User) Age: 25\\n- (User) Address Information (zipcode): 94121\\n##### End of 2th item interaction information',\n", 1268 | " 'text': \"- Main characteristics of the user: The user is a 25-year-old male living in the 94121 zip code area. He rated movies in the early 2000s.\\n\\n- User persona: The user is a young adult male who enjoys watching movies from the 1990s. He seems to have a preference for serious and intense genres, indicating a possible interest in thought-provoking and emotionally engaging content.\\n\\n- User preferences: The user prefers movies from the 1990s, particularly those in the Drama and Thriller genres. He also shows an interest in War-themed movies.\\n\\n- Meaningful patterns of the user: The user consistently rates movies from the 1990s, suggesting a nostalgic or particular interest in that decade's filmography. His genre preferences lean towards Drama and Thriller, with a specific inclination towards movies that are intense and possibly have complex narratives. 
The user’s ratings are from the early 2000s, indicating that he might have been actively watching and rating movies during that period.\"}" 1269 | ] 1270 | }, 1271 | "execution_count": 64, 1272 | "metadata": {}, 1273 | "output_type": "execute_result" 1274 | } 1275 | ], 1276 | "source": [ 1277 | "user_recent_summary" 1278 | ] 1279 | }, 1280 | { 1281 | "cell_type": "code", 1282 | "execution_count": null, 1283 | "id": "084708fc-76c9-473c-aafe-befb6a5e4a7b", 1284 | "metadata": {}, 1285 | "outputs": [], 1286 | "source": [] 1287 | }, 1288 | { 1289 | "cell_type": "markdown", 1290 | "id": "2141d6ec-5166-46d9-aa78-67353b982b1e", 1291 | "metadata": {}, 1292 | "source": [ 1293 | "# Explainability" 1294 | ] 1295 | }, 1296 | { 1297 | "cell_type": "code", 1298 | "execution_count": 66, 1299 | "id": "4b532750-d7c5-4453-9141-9013c722c115", 1300 | "metadata": {}, 1301 | "outputs": [ 1302 | { 1303 | "data": { 1304 | "text/html": [ 1305 | "
" 1361 | ], 1362 | "text/plain": [ 1363 | " movie_id title movie_decade genre\n", 1364 | "293 296 Pulp Fiction 1990s Crime\n", 1365 | "1539 1580 Men in Black 1990s Action\n", 1366 | "3793 3863 Cell, The 2000s Sci-Fi\n", 1367 | "3873 3943 Bamboozled 2000s Comedy" 1368 | ] 1369 | }, 1370 | "execution_count": 66, 1371 | "metadata": {}, 1372 | "output_type": "execute_result" 1373 | } 1374 | ], 1375 | "source": [ 1376 | "user_recom_result = movie_info[movie_info['movie_id'].isin(sample_user_pred_info_trans)]\n", 1377 | "user_recom_result" 1378 | ] 1379 | }, 1380 | { 1381 | "cell_type": "code", 1382 | "execution_count": 67, 1383 | "id": "4326e016-2703-48b4-a57f-920c35a2d37a", 1384 | "metadata": {}, 1385 | "outputs": [], 1386 | "source": [ 1387 | "user_recent_summary_info = user_recent_summary['text']\n", 1388 | "user_entire_summary_info = summary" 1389 | ] 1390 | }, 1391 | { 1392 | "cell_type": "code", 1393 | "execution_count": null, 1394 | "id": "f1f5ceda-fbc7-4728-a50a-c9133e29c6d5", 1395 | "metadata": {}, 1396 | "outputs": [], 1397 | "source": [] 1398 | }, 1399 | { 1400 | "cell_type": "code", 1401 | "execution_count": 70, 1402 | "id": "7670b00b-a5e7-4688-8aec-3f3c527d27b0", 1403 | "metadata": {}, 1404 | "outputs": [], 1405 | "source": [ 1406 | "user_data = user_recom_result[['title', 'movie_decade', 'genre']].values\n", 1407 | "\n", 1408 | "user_recom_info = \"#### User Recommendation List\\n\\n\"\n", 1409 | "for cnt, rows in enumerate(user_data):\n", 1410 | " user_recom_info += f\"\\n\\nRecommendation {cnt+1}:\\n- Item Title: {rows[0]}\\n- (Item) Movie Release Decade (e.g., 1990s movie): {rows[1]}\\n- Item Genre (Category): {rows[2]}\\n##### End of Recommendation {cnt+1} Information\"\n" 1411 | ] 1412 | }, 1413 | { 1414 | "cell_type": "code", 1415 | "execution_count": 71, 1416 | "id": "24c0018e-ee5c-4014-b809-a8066d7dcbe1", 1417 | "metadata": {}, 1418 | "outputs": [ 1419 | { 1420 | "data": { 1421 | "text/plain": [ 1422 | "'#### User Recommendation 
List\\n\\n\\n\\nRecommendation 1:\\n- Item Title: Pulp Fiction\\n- (Item) Movie Release Decade (e.g., 1990s movie): 1990s\\n- Item Genre (Category): Crime\\n##### End of Recommendation 1 Information\\n\\nRecommendation 2:\\n- Item Title: Men in Black\\n- (Item) Movie Release Decade (e.g., 1990s movie): 1990s\\n- Item Genre (Category): Action\\n##### End of Recommendation 2 Information\\n\\nRecommendation 3:\\n- Item Title: Cell, The\\n- (Item) Movie Release Decade (e.g., 1990s movie): 2000s\\n- Item Genre (Category): Sci-Fi\\n##### End of Recommendation 3 Information\\n\\nRecommendation 4:\\n- Item Title: Bamboozled\\n- (Item) Movie Release Decade (e.g., 1990s movie): 2000s\\n- Item Genre (Category): Comedy\\n##### End of Recommendation 4 Information'" 1423 | ] 1424 | }, 1425 | "execution_count": 71, 1426 | "metadata": {}, 1427 | "output_type": "execute_result" 1428 | } 1429 | ], 1430 | "source": [ 1431 | "user_recom_info" 1432 | ] 1433 | }, 1434 | { 1435 | "cell_type": "code", 1436 | "execution_count": 73, 1437 | "id": "f286da16-9e56-422e-835c-3a30c3b1c132", 1438 | "metadata": {}, 1439 | "outputs": [], 1440 | "source": [ 1441 | "template = \"\"\"The data below contains the user's main characteristics, persona, and preference information. 
There is preference information based on the entire history and also based on the last 10 interactions.\n", 1442 | "\n", 1443 | "#### Main characteristics based on the entire history\n", 1444 | "{user_entire_summary_info}\n", 1445 | "\n", 1446 | "#### Main characteristics based on the last 10 interactions\n", 1447 | "{user_recent_summary_info}\n", 1448 | "\n", 1449 | "Below is the item information recommended by the recommendation system for the above user.\n", 1450 | "\n", 1451 | "#### Recommendation results provided by the recommendation system\n", 1452 | "{recom_list}\n", 1453 | "\n", 1454 | "Your role is to write the reason for the recommendation by comparing the user's main characteristics information with the recommendation results provided by the recommendation system.\n", 1455 | "The recommendation results are a list of items provided by the recommendation system based on the user's past interaction information.\n", 1456 | "If you determine that the reason for the recommendation is inappropriate, please say, 'It does not seem to be an appropriate recommendation' and also provide the reason why it is not appropriate.\n", 1457 | "\n", 1458 | "To summarize your role:\n", 1459 | "\n", 1460 | "- Consider the user information (main characteristics based on the entire history, main characteristics based on the last 10 interactions)\n", 1461 | "- The recommendation results are a recommendation list provided by the recommendation system based on the user's past interactions\n", 1462 | "- Write the reason for the recommendation by referring to the recommendation results and user information\n", 1463 | "- If the reason for the recommendation is inappropriate, say 'It does not seem to be an appropriate recommendation' and explain why it is not appropriate\n", 1464 | "- Do not include unnecessary words, perform the requested task and respond\n", 1465 | "- If you are unsure, think it over and if you really don't know, respond with 'I don't know'\n", 1466 | "\"\"\"\n", 
1467 | "\n", 1468 | "prompt = PromptTemplate(template=template, input_variables=['user_entire_summary_info', 'user_recent_summary_info', 'recom_list'])\n" 1469 | ] 1470 | }, 1471 | { 1472 | "cell_type": "code", 1473 | "execution_count": 74, 1474 | "id": "d00e2774-4fde-4170-8b58-e35f73d8b062", 1475 | "metadata": {}, 1476 | "outputs": [], 1477 | "source": [ 1478 | "llm = ChatOpenAI(temperature=0, model='gpt-4o')\n", 1479 | "chain = LLMChain(llm=llm, prompt=prompt)\n", 1480 | "recommend_explain = chain.invoke({'user_entire_summary_info': user_entire_summary_info, 'user_recent_summary_info':user_recent_summary_info, 'recom_list':user_recom_info})" 1481 | ] 1482 | }, 1483 | { 1484 | "cell_type": "code", 1485 | "execution_count": null, 1486 | "id": "53e9bce0-3bbe-485b-832e-29ea87fbc86e", 1487 | "metadata": {}, 1488 | "outputs": [], 1489 | "source": [] 1490 | }, 1491 | { 1492 | "cell_type": "code", 1493 | "execution_count": 75, 1494 | "id": "9d6f2762-93ca-4df3-a3b7-11f355c5376f", 1495 | "metadata": {}, 1496 | "outputs": [ 1497 | { 1498 | "name": "stdout", 1499 | "output_type": "stream", 1500 | "text": [ 1501 | "#### Recommendation 1: Pulp Fiction\n", 1502 | "- **Reason for Recommendation:** \"Pulp Fiction\" is a 1990s movie, aligning with the user's strong preference for films from that decade. The genre is Crime, which, while not explicitly listed in the user's favorite genres, often overlaps with Drama and Thriller, both of which the user enjoys. The intense and complex narrative of \"Pulp Fiction\" fits the user's interest in thought-provoking and emotionally engaging content.\n", 1503 | "\n", 1504 | "#### Recommendation 2: Men in Black\n", 1505 | "- **Reason for Recommendation:** \"Men in Black\" is a 1990s movie, which matches the user's preference for that decade. The genre is Action, one of the user's favorite genres. 
This recommendation aligns well with the user's interest in 1990s films and action-packed narratives.\n", 1506 | "\n", 1507 | "#### Recommendation 3: The Cell\n", 1508 | "- **It does not seem to be an appropriate recommendation**\n", 1509 | "- **Reason:** \"The Cell\" is a 2000s movie, which does not align with the user's strong preference for 1990s films. Although the genre is Sci-Fi, which the user enjoys, the decade mismatch makes this recommendation less suitable.\n", 1510 | "\n", 1511 | "#### Recommendation 4: Bamboozled\n", 1512 | "- **It does not seem to be an appropriate recommendation**\n", 1513 | "- **Reason:** \"Bamboozled\" is a 2000s movie, which does not align with the user's preference for 1990s films. Additionally, while the user enjoys Comedy, the primary interest is in Drama and Thriller genres, making this recommendation less fitting.\n" 1514 | ] 1515 | } 1516 | ], 1517 | "source": [ 1518 | "print(recommend_explain['text'])" 1519 | ] 1520 | }, 1521 | { 1522 | "cell_type": "code", 1523 | "execution_count": null, 1524 | "id": "be6791ea-2f1d-4b86-9ee5-6bea5e15779b", 1525 | "metadata": {}, 1526 | "outputs": [], 1527 | "source": [] 1528 | }, 1529 | { 1530 | "cell_type": "code", 1531 | "execution_count": null, 1532 | "id": "6a4a3352-8baa-47b9-ac15-febfbe6ca3de", 1533 | "metadata": {}, 1534 | "outputs": [], 1535 | "source": [] 1536 | }, 1537 | { 1538 | "cell_type": "code", 1539 | "execution_count": null, 1540 | "id": "6f87e87a-84ab-4864-a5c4-4e509f8120ba", 1541 | "metadata": {}, 1542 | "outputs": [], 1543 | "source": [] 1544 | } 1545 | ], 1546 | "metadata": { 1547 | "kernelspec": { 1548 | "display_name": "Python 3 (ipykernel)", 1549 | "language": "python", 1550 | "name": "python3" 1551 | }, 1552 | "language_info": { 1553 | "codemirror_mode": { 1554 | "name": "ipython", 1555 | "version": 3 1556 | }, 1557 | "file_extension": ".py", 1558 | "mimetype": "text/x-python", 1559 | "name": "python", 1560 | "nbconvert_exporter": "python", 1561 | "pygments_lexer": 
"ipython3", 1562 | "version": "3.9.18" 1563 | } 1564 | }, 1565 | "nbformat": 4, 1566 | "nbformat_minor": 5 1567 | } 1568 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | 2 | # 파이썬을 활용한 추천 시스템 구현(recommender system with Python) 3 | 4 | ### 각 파일에 대한 자료 설명 5 | 6 | > 각 파일에 대한 설명은 https://lsjsj92.tistory.com/ 블로그에 올려두었습니다. 상세주소는 각 파일 최상단에 있으니 참고바랍니다. 7 | 8 | **1. recommender system basic** 9 | - 추천 시스템 기본 유형 소개 : 이론 10 | - content based filtering 11 | - collaborative filtering 12 | 13 | 14 | **2. recommender system basic with Python - 1 content based filtering** 15 | - 파이썬을 활용해 content based filtering 구현 16 | - kaggle의 movies dataset 활용 17 | 18 | 19 | **3. recommender system basic with Python - 2 Collaborative Filtering** 20 | - 파이썬을 활용해 collaborative filtering 구현 21 | - kaggle의 movies dataset, movielens dataset 활용 22 | 23 | 24 | **4. recommender system basic with Python - 3 Matrix Factorization** 25 | - 파이썬을 활용해 Matrix Factorization 구현 및 이론 설명 26 | - kaggle의 movies dataset, movielens dataset 활용 27 | 28 | 29 | **5. naver news recommender** 30 | - Naver news 데이터를 활용해 추천 시스템 적용 31 | - Doc2vec 등의 embedding 방법을 사용 32 | 33 | **6. deep learning recommender system** 34 | - 딥러닝 기반의 추천 시스템 활용 예제 코드 35 | - Keras 활용 36 | 37 | 38 | **7. Wide & Deep recommender system** 39 | - Wide & Deep paper를 기반으로 한 추천 시스템 모델 구현 40 | - 컨셉만 유지하면서 구현하였음 41 | - Keras를 활용 42 | 43 | **8. Simple book recommender system with Keras(kaggle data)** 44 | - Kaggle에 있는 book 데이터를 활용한 간단한 추천 시스템 구현 45 | - Keras를 활용해 만들 수 있는 기본적인 추천 모형 코드 46 | 47 | **9. recommender system using ChatGPT** 48 | - ChatGPT을 활용한 추천 시스템 49 | - https://lsjsj92.tistory.com/657 50 | 51 | **10. LLM based explainability recsys** 52 | - LLM을 활용한 추천 시스템의 설명 가능성 부여 53 | - LangChain, gpt-4o 활용 54 | - https://lsjsj92.tistory.com/670 55 | --------------------------------------------------------------------------------