├── .gitignore ├── README.md └── code ├── demo_gpt_01_chat.ipynb ├── demo_gpt_02_rag.ipynb ├── demo_gpt_03_finetune.ipynb ├── demo_gpt_04_finetune_dialog.ipynb ├── demo_webcrawl_01_wiki.ipynb ├── demo_webcrawl_02_wikibot.ipynb ├── demo_webcrawl_03_qd.ipynb ├── gpt_helper.py ├── util.py └── wiki_helper.py /.gitignore: -------------------------------------------------------------------------------- 1 | key 2 | .ipynb_checkpoints 3 | __pycache__/ 4 | 5 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ### Yet Another GPT Tutorial 2 | 3 | This repo contains simple examples of using the GPT API provided by OpenAI. 4 | - [GPT API usage](https://github.com/sjchoi86/yet-another-gpt-tutorial/blob/main/code/demo_gpt_01_chat.ipynb) 5 | : Basic OpenAI API usage for [GPT](https://openai.com/gpt-4) 6 | - [Wiki Summarize](https://github.com/sjchoi86/yet-another-gpt-tutorial/blob/main/code/demo_webcrawl_01_wiki.ipynb) 7 | : [Wikipedia](https://www.wikipedia.org/) web crawling using [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) + summarization using GPT 8 | - [Retrieval-Augmented Generation](https://github.com/sjchoi86/yet-another-gpt-tutorial/blob/main/code/demo_gpt_02_rag.ipynb) 9 | : A minimal implementation of RAG using Wikipedia. Given the user's question, GPT first suggests entities for searching Wikipedia. GPT then summarizes the retrieved pages, and the summarized sentences are combined with the original question and passed back to GPT to produce the answer. 10 | - [Quality-Diversity Wiki Sampling](https://github.com/sjchoi86/yet-another-gpt-tutorial/blob/main/code/demo_webcrawl_03_qd.ipynb): Quality-diversity-based sampling using determinantal point processes, where the kernel matrix is constructed from a BERT-based distance measure. The initial sample is selected deterministically using the same BERT distance. 11 | - [GPT Fine-Tuning](https://github.com/sjchoi86/yet-another-gpt-tutorial/blob/main/code/demo_gpt_03_finetune.ipynb): Fine-tune a GPT model using the OpenAI API.
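For a quick start, here is a minimal sketch of the chat-completion call used throughout the notebooks. It assumes `openai==0.28.x` (the version printed in the notebooks) and an API key stored in a plain-text file; the key path and the example question are illustrative only.

```python
# Minimal chat-completion sketch (openai==0.28.x style, as used in these notebooks).
# The key path below is illustrative; point it at wherever your API key file lives.
import openai

with open('../key/rilab_key.txt', 'r') as f:
    openai.api_key = f.read().strip()

messages = [
    {'role': 'system', 'content': 'You are a helpful assistant.'},
    {'role': 'user',   'content': 'Explain how a large language model works.'},
]
response = openai.ChatCompletion.create(model='gpt-3.5-turbo', messages=messages)
print(response['choices'][0]['message']['content'])  # generated answer
print(response['choices'][0]['finish_reason'])       # e.g., 'stop'
```

The fine-tuning notebook follows the same pattern: it uploads JSONL files with `openai.File.create(..., purpose='fine-tune')` and launches a job with `openai.FineTuningJob.create(training_file=..., validation_file=..., model='gpt-3.5-turbo')`.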
12 | 13 | ### Contact 14 | sungjoon dash choi at korea dot ac dot kr 15 | -------------------------------------------------------------------------------- /code/demo_gpt_01_chat.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "fee5fb83", 6 | "metadata": {}, 7 | "source": [ 8 | "### How to chat with GPT using OpenAI API" 9 | ] 10 | }, 11 | { 12 | "cell_type": "code", 13 | "execution_count": 1, 14 | "id": "0d0533bc", 15 | "metadata": {}, 16 | "outputs": [ 17 | { 18 | "name": "stdout", 19 | "output_type": "stream", 20 | "text": [ 21 | "openai version:[0.28.0]\n" 22 | ] 23 | } 24 | ], 25 | "source": [ 26 | "import os\n", 27 | "import openai\n", 28 | "from retrying import retry\n", 29 | "from IPython.display import Markdown,display\n", 30 | "print (\"openai version:[%s]\"%(openai.__version__))" 31 | ] 32 | }, 33 | { 34 | "cell_type": "markdown", 35 | "id": "900baa7d", 36 | "metadata": {}, 37 | "source": [ 38 | "### Locate where your key is" 39 | ] 40 | }, 41 | { 42 | "cell_type": "code", 43 | "execution_count": 2, 44 | "id": "fc3b903b", 45 | "metadata": {}, 46 | "outputs": [ 47 | { 48 | "name": "stdout", 49 | "output_type": "stream", 50 | "text": [ 51 | "key_path:[../key/rilab_key.txt]\n" 52 | ] 53 | } 54 | ], 55 | "source": [ 56 | "key_path = '../key/rilab_key.txt'\n", 57 | "print ('key_path:[%s]'%(key_path))" 58 | ] 59 | }, 60 | { 61 | "cell_type": "markdown", 62 | "id": "9b49b03f", 63 | "metadata": {}, 64 | "source": [ 65 | "### Use key" 66 | ] 67 | }, 68 | { 69 | "cell_type": "code", 70 | "execution_count": 3, 71 | "id": "bc2fc79c", 72 | "metadata": {}, 73 | "outputs": [], 74 | "source": [ 75 | "with open(key_path, 'r') as f: OPENAI_API_KEY = f.read()\n", 76 | "openai.api_key = OPENAI_API_KEY" 77 | ] 78 | }, 79 | { 80 | "cell_type": "markdown", 81 | "id": "c2dd6db4", 82 | "metadata": {}, 83 | "source": [ 84 | "### Query function" 85 | ] 86 | }, 87 | { 88 | "cell_type": "code", 89 | "execution_count": 4, 90 | "id": "830439b8", 91 | "metadata": {}, 92 | "outputs": [ 93 | { 94 | "name": "stdout", 95 | "output_type": "stream", 96 | "text": [ 97 | "Ready.\n" 98 | ] 99 | } 100 | ], 101 | "source": [ 102 | "@retry(stop_max_attempt_number=5,\n", 103 | " wait_exponential_multiplier=1000,\n", 104 | " wait_exponential_max=10000)\n", 105 | "def query_gpt(messages:list,gpt_model='gpt-4'):\n", 106 | " \"\"\"\n", 107 | " gpt_model: 'gpt-3.5-turbo' / 'gpt-4'\n", 108 | " \"\"\"\n", 109 | " # Call the OpenAI API\n", 110 | " response = openai.ChatCompletion.create(\n", 111 | " model = gpt_model, \n", 112 | " messages = messages\n", 113 | " )\n", 114 | " # Extract the response content and status code\n", 115 | " content = response[\"choices\"][0][\"message\"][\"content\"]\n", 116 | " status_code = response[\"choices\"][0][\"finish_reason\"]\n", 117 | " return content,status_code,response\n", 118 | "print (\"Ready.\")" 119 | ] 120 | }, 121 | { 122 | "cell_type": "markdown", 123 | "id": "ad48f062", 124 | "metadata": {}, 125 | "source": [ 126 | "`messages`, an input to GPT, is basically a list where each item is a dictionary consists of `role` and `content`. A `role` can either be\n", 127 | "* `system`: which defines the identity of the agent\n", 128 | "* `user`: which states the input of a user\n", 129 | "* `assistant`: which stores messages previously generated by the agents\n", 130 | "More information can be found in [here](https://platform.openai.com/docs/guides/gpt/chat-completions-api)." 
131 | ] 132 | }, 133 | { 134 | "cell_type": "code", 135 | "execution_count": 5, 136 | "id": "84ab7761", 137 | "metadata": {}, 138 | "outputs": [ 139 | { 140 | "name": "stdout", 141 | "output_type": "stream", 142 | "text": [ 143 | "[{'role': 'system', 'content': '너는 한국어로 대화에 능통한 언어모델이야'}, {'role': 'user', 'content': 'LLM(Large Language Model)은 어떻게 작동하는지 설명해줘'}]\n" 144 | ] 145 | } 146 | ], 147 | "source": [ 148 | "role_msg = \"\"\"너는 한국어로 대화에 능통한 언어모델이야\"\"\"\n", 149 | "question = \"LLM(Large Language Model)은 어떻게 작동하는지 설명해줘\"\n", 150 | "messages = [{\"role\": \"system\", \"content\": f'{role_msg}'},\n", 151 | " {\"role\": 'user', \"content\": f'{question}'}]\n", 152 | "print (messages)" 153 | ] 154 | }, 155 | { 156 | "cell_type": "markdown", 157 | "id": "0b7f4f53", 158 | "metadata": {}, 159 | "source": [ 160 | "### Now let's use `GPT`" 161 | ] 162 | }, 163 | { 164 | "cell_type": "code", 165 | "execution_count": 6, 166 | "id": "2ccc1957", 167 | "metadata": {}, 168 | "outputs": [], 169 | "source": [ 170 | "content,status_code,response = query_gpt(messages=messages,gpt_model='gpt-3.5-turbo')" 171 | ] 172 | }, 173 | { 174 | "cell_type": "code", 175 | "execution_count": 7, 176 | "id": "74676205", 177 | "metadata": {}, 178 | "outputs": [ 179 | { 180 | "name": "stdout", 181 | "output_type": "stream", 182 | "text": [ 183 | "LLM은 큰 규모의 언어 모델로, 텍스트 관련 작업을 수행하기 위해 사용됩니다. LLM은 기계 학습 기술을 사용하여 훈련되며, 많은 양의 텍스트 데이터를 이용해 언어의 통계적 패턴과 구조를 학습합니다.\n", 184 | "\n", 185 | "LLM은 주어진 입력에 대한 다음 단어나 문장을 예측하는 데 사용됩니다. 예를 들어, \"나는 오늘\"이라는 입력이 주어진다면, LLM은 다음에 올 수 있는 가능한 단어를 예측하여 \"나는 오늘 먹었다\" 또는 \"나는 오늘 공부했다\"와 같은 결과를 출력할 수 있습니다.\n", 186 | "\n", 187 | "LLM은 작동하기 위해 큰 양의 데이터를 사용합니다. 대규모의 코퍼스(텍스트 데이터 모음)를 사용하여 많은 문장과 문서를 학습하고, 문법, 단어 간의 관계, 의미, 문맥 등 다양한 언어적 특징을 학습합니다.\n", 188 | "\n", 189 | "LLM은 훈련된 이후에는 실제 응용 프로그램에서 다양한 작업에 활용될 수 있습니다. 예를 들어, 기계 번역, 자동 요약, 질의응답 시스템, 텍스트 생성 등에 활용될 수 있습니다. LLM은 입력 텍스트와 관련된 문제에 대한 올바른 예측을 생성하여 도움을 줄 수 있습니다.\n" 190 | ] 191 | } 192 | ], 193 | "source": [ 194 | "print (content)" 195 | ] 196 | }, 197 | { 198 | "cell_type": "code", 199 | "execution_count": 8, 200 | "id": "d330b693", 201 | "metadata": {}, 202 | "outputs": [ 203 | { 204 | "name": "stdout", 205 | "output_type": "stream", 206 | "text": [ 207 | "stop\n" 208 | ] 209 | } 210 | ], 211 | "source": [ 212 | "print (status_code)" 213 | ] 214 | }, 215 | { 216 | "cell_type": "code", 217 | "execution_count": 9, 218 | "id": "b9daf9aa", 219 | "metadata": { 220 | "scrolled": true 221 | }, 222 | "outputs": [ 223 | { 224 | "name": "stdout", 225 | "output_type": "stream", 226 | "text": [ 227 | "{\n", 228 | " \"id\": \"chatcmpl-8FRQqdZgJ0nN2kSmQKIeiekX9Gx8c\",\n", 229 | " \"object\": \"chat.completion\",\n", 230 | " \"created\": 1698691060,\n", 231 | " \"model\": \"gpt-3.5-turbo-0613\",\n", 232 | " \"choices\": [\n", 233 | " {\n", 234 | " \"index\": 0,\n", 235 | " \"message\": {\n", 236 | " \"role\": \"assistant\",\n", 237 | " \"content\": \"LLM\\uc740 \\ud070 \\uaddc\\ubaa8\\uc758 \\uc5b8\\uc5b4 \\ubaa8\\ub378\\ub85c, \\ud14d\\uc2a4\\ud2b8 \\uad00\\ub828 \\uc791\\uc5c5\\uc744 \\uc218\\ud589\\ud558\\uae30 \\uc704\\ud574 \\uc0ac\\uc6a9\\ub429\\ub2c8\\ub2e4. 
LLM\\uc740 \\uae30\\uacc4 \\ud559\\uc2b5 \\uae30\\uc220\\uc744 \\uc0ac\\uc6a9\\ud558\\uc5ec \\ud6c8\\ub828\\ub418\\uba70, \\ub9ce\\uc740 \\uc591\\uc758 \\ud14d\\uc2a4\\ud2b8 \\ub370\\uc774\\ud130\\ub97c \\uc774\\uc6a9\\ud574 \\uc5b8\\uc5b4\\uc758 \\ud1b5\\uacc4\\uc801 \\ud328\\ud134\\uacfc \\uad6c\\uc870\\ub97c \\ud559\\uc2b5\\ud569\\ub2c8\\ub2e4.\\n\\nLLM\\uc740 \\uc8fc\\uc5b4\\uc9c4 \\uc785\\ub825\\uc5d0 \\ub300\\ud55c \\ub2e4\\uc74c \\ub2e8\\uc5b4\\ub098 \\ubb38\\uc7a5\\uc744 \\uc608\\uce21\\ud558\\ub294 \\ub370 \\uc0ac\\uc6a9\\ub429\\ub2c8\\ub2e4. \\uc608\\ub97c \\ub4e4\\uc5b4, \\\"\\ub098\\ub294 \\uc624\\ub298\\\"\\uc774\\ub77c\\ub294 \\uc785\\ub825\\uc774 \\uc8fc\\uc5b4\\uc9c4\\ub2e4\\uba74, LLM\\uc740 \\ub2e4\\uc74c\\uc5d0 \\uc62c \\uc218 \\uc788\\ub294 \\uac00\\ub2a5\\ud55c \\ub2e8\\uc5b4\\ub97c \\uc608\\uce21\\ud558\\uc5ec \\\"\\ub098\\ub294 \\uc624\\ub298 \\uba39\\uc5c8\\ub2e4\\\" \\ub610\\ub294 \\\"\\ub098\\ub294 \\uc624\\ub298 \\uacf5\\ubd80\\ud588\\ub2e4\\\"\\uc640 \\uac19\\uc740 \\uacb0\\uacfc\\ub97c \\ucd9c\\ub825\\ud560 \\uc218 \\uc788\\uc2b5\\ub2c8\\ub2e4.\\n\\nLLM\\uc740 \\uc791\\ub3d9\\ud558\\uae30 \\uc704\\ud574 \\ud070 \\uc591\\uc758 \\ub370\\uc774\\ud130\\ub97c \\uc0ac\\uc6a9\\ud569\\ub2c8\\ub2e4. \\ub300\\uaddc\\ubaa8\\uc758 \\ucf54\\ud37c\\uc2a4(\\ud14d\\uc2a4\\ud2b8 \\ub370\\uc774\\ud130 \\ubaa8\\uc74c)\\ub97c \\uc0ac\\uc6a9\\ud558\\uc5ec \\ub9ce\\uc740 \\ubb38\\uc7a5\\uacfc \\ubb38\\uc11c\\ub97c \\ud559\\uc2b5\\ud558\\uace0, \\ubb38\\ubc95, \\ub2e8\\uc5b4 \\uac04\\uc758 \\uad00\\uacc4, \\uc758\\ubbf8, \\ubb38\\ub9e5 \\ub4f1 \\ub2e4\\uc591\\ud55c \\uc5b8\\uc5b4\\uc801 \\ud2b9\\uc9d5\\uc744 \\ud559\\uc2b5\\ud569\\ub2c8\\ub2e4.\\n\\nLLM\\uc740 \\ud6c8\\ub828\\ub41c \\uc774\\ud6c4\\uc5d0\\ub294 \\uc2e4\\uc81c \\uc751\\uc6a9 \\ud504\\ub85c\\uadf8\\ub7a8\\uc5d0\\uc11c \\ub2e4\\uc591\\ud55c \\uc791\\uc5c5\\uc5d0 \\ud65c\\uc6a9\\ub420 \\uc218 \\uc788\\uc2b5\\ub2c8\\ub2e4. \\uc608\\ub97c \\ub4e4\\uc5b4, \\uae30\\uacc4 \\ubc88\\uc5ed, \\uc790\\ub3d9 \\uc694\\uc57d, \\uc9c8\\uc758\\uc751\\ub2f5 \\uc2dc\\uc2a4\\ud15c, \\ud14d\\uc2a4\\ud2b8 \\uc0dd\\uc131 \\ub4f1\\uc5d0 \\ud65c\\uc6a9\\ub420 \\uc218 \\uc788\\uc2b5\\ub2c8\\ub2e4. 
LLM\\uc740 \\uc785\\ub825 \\ud14d\\uc2a4\\ud2b8\\uc640 \\uad00\\ub828\\ub41c \\ubb38\\uc81c\\uc5d0 \\ub300\\ud55c \\uc62c\\ubc14\\ub978 \\uc608\\uce21\\uc744 \\uc0dd\\uc131\\ud558\\uc5ec \\ub3c4\\uc6c0\\uc744 \\uc904 \\uc218 \\uc788\\uc2b5\\ub2c8\\ub2e4.\"\n", 238 | " },\n", 239 | " \"finish_reason\": \"stop\"\n", 240 | " }\n", 241 | " ],\n", 242 | " \"usage\": {\n", 243 | " \"prompt_tokens\": 60,\n", 244 | " \"completion_tokens\": 440,\n", 245 | " \"total_tokens\": 500\n", 246 | " }\n", 247 | "}\n" 248 | ] 249 | } 250 | ], 251 | "source": [ 252 | "print (response)" 253 | ] 254 | }, 255 | { 256 | "cell_type": "markdown", 257 | "id": "10cddce3", 258 | "metadata": {}, 259 | "source": [ 260 | "### Helper Class for implementing efficient chat with GPT" 261 | ] 262 | }, 263 | { 264 | "cell_type": "code", 265 | "execution_count": 10, 266 | "id": "46db1fe1", 267 | "metadata": {}, 268 | "outputs": [ 269 | { 270 | "name": "stdout", 271 | "output_type": "stream", 272 | "text": [ 273 | "Ready.\n" 274 | ] 275 | } 276 | ], 277 | "source": [ 278 | "def printmd(string):\n", 279 | " display(Markdown(string))\n", 280 | "class GPTchatClass():\n", 281 | " def __init__(self,\n", 282 | " gpt_model = 'gpt-4',\n", 283 | " role_msg = 'Your are a helpful assistant.',\n", 284 | " VERBOSE = True\n", 285 | " ):\n", 286 | " self.gpt_model = gpt_model\n", 287 | " self.messages = [{'role':'system','content':f'{role_msg}'}]\n", 288 | " self.init_messages = [{'role':'system','content':f'{role_msg}'}]\n", 289 | " self.VERBOSE = VERBOSE\n", 290 | " self.response = None\n", 291 | " if self.VERBOSE:\n", 292 | " print (\"Chat agent using [%s] initialized with the follow role:[%s]\"%\n", 293 | " (self.gpt_model,role_msg))\n", 294 | " \n", 295 | " def _add_message(self,role='assistant',content=''):\n", 296 | " \"\"\"\n", 297 | " role: 'assistant' / 'user'\n", 298 | " \"\"\"\n", 299 | " self.messages.append({'role':role, 'content':content})\n", 300 | " \n", 301 | " def _get_response_content(self):\n", 302 | " if self.response:\n", 303 | " return self.response['choices'][0]['message']['content']\n", 304 | " else:\n", 305 | " return None\n", 306 | " \n", 307 | " def _get_response_status(self):\n", 308 | " if self.response:\n", 309 | " return self.response['choices'][0]['message']['finish_reason']\n", 310 | " else:\n", 311 | " return None\n", 312 | " \n", 313 | " def chat(self,user_msg='hi',\n", 314 | " PRINT_USER_MSG=True,PRINT_GPT_OUTPUT=True,\n", 315 | " RESET_CHAT=False,RETURN_RESPONSE=True):\n", 316 | " self._add_message(role='user',content=user_msg)\n", 317 | " self.response = openai.ChatCompletion.create(\n", 318 | " model = self.gpt_model,\n", 319 | " messages = self.messages\n", 320 | " )\n", 321 | " # Backup response for continous chatting\n", 322 | " self._add_message(role='assistant',content=self._get_response_content())\n", 323 | " if PRINT_USER_MSG:\n", 324 | " print(\"[USER_MSG]\")\n", 325 | " printmd(user_msg)\n", 326 | " if PRINT_GPT_OUTPUT:\n", 327 | " print(\"[GPT_OUTPUT]\")\n", 328 | " printmd(self._get_response_content())\n", 329 | " # Reset\n", 330 | " if RESET_CHAT:\n", 331 | " self.messages = self.init_messages\n", 332 | " # Return\n", 333 | " if RETURN_RESPONSE:\n", 334 | " return self._get_response_content()\n", 335 | "print (\"Ready.\") " 336 | ] 337 | }, 338 | { 339 | "cell_type": "markdown", 340 | "id": "c240f58b", 341 | "metadata": {}, 342 | "source": [ 343 | "### Now let's chat" 344 | ] 345 | }, 346 | { 347 | "cell_type": "code", 348 | "execution_count": 11, 349 | "id": "87c6ad0a", 350 | "metadata": 
{}, 351 | "outputs": [ 352 | { 353 | "name": "stdout", 354 | "output_type": "stream", 355 | "text": [ 356 | "Chat agent using [gpt-3.5-turbo] initialized with the follow role:[Your are a helpful assistant.]\n", 357 | "[USER_MSG]\n" 358 | ] 359 | }, 360 | { 361 | "data": { 362 | "text/markdown": [ 363 | "Who is the current president of Korea?" 364 | ], 365 | "text/plain": [ 366 | "" 367 | ] 368 | }, 369 | "metadata": {}, 370 | "output_type": "display_data" 371 | }, 372 | { 373 | "name": "stdout", 374 | "output_type": "stream", 375 | "text": [ 376 | "[GPT_OUTPUT]\n" 377 | ] 378 | }, 379 | { 380 | "data": { 381 | "text/markdown": [ 382 | "As of my last update in November 2021, the current president of South Korea is Moon Jae-in. However, please note that political positions can change, so it's always a good idea to verify with the latest information from reliable sources." 383 | ], 384 | "text/plain": [ 385 | "" 386 | ] 387 | }, 388 | "metadata": {}, 389 | "output_type": "display_data" 390 | }, 391 | { 392 | "name": "stdout", 393 | "output_type": "stream", 394 | "text": [ 395 | "[USER_MSG]\n" 396 | ] 397 | }, 398 | { 399 | "data": { 400 | "text/markdown": [ 401 | "Are you sure? I think you are outdated." 402 | ], 403 | "text/plain": [ 404 | "" 405 | ] 406 | }, 407 | "metadata": {}, 408 | "output_type": "display_data" 409 | }, 410 | { 411 | "name": "stdout", 412 | "output_type": "stream", 413 | "text": [ 414 | "[GPT_OUTPUT]\n" 415 | ] 416 | }, 417 | { 418 | "data": { 419 | "text/markdown": [ 420 | "I apologize if my previous response was outdated. As an AI, I do not have real-time information updates. As of November 2021, Moon Jae-in was serving as the president of South Korea, but please verify with the latest sources to ensure accuracy." 421 | ], 422 | "text/plain": [ 423 | "" 424 | ] 425 | }, 426 | "metadata": {}, 427 | "output_type": "display_data" 428 | }, 429 | { 430 | "name": "stdout", 431 | "output_type": "stream", 432 | "text": [ 433 | "[USER_MSG]\n" 434 | ] 435 | }, 436 | { 437 | "data": { 438 | "text/markdown": [ 439 | "Where can I get the latest information?" 440 | ], 441 | "text/plain": [ 442 | "" 443 | ] 444 | }, 445 | "metadata": {}, 446 | "output_type": "display_data" 447 | }, 448 | { 449 | "name": "stdout", 450 | "output_type": "stream", 451 | "text": [ 452 | "[GPT_OUTPUT]\n" 453 | ] 454 | }, 455 | { 456 | "data": { 457 | "text/markdown": [ 458 | "For the most up-to-date and accurate information on the current president of South Korea, I recommend checking reliable news sources such as reputable news websites, government websites, or official government social media accounts. These sources usually provide the most recent and accurate information on political leadership and updates." 459 | ], 460 | "text/plain": [ 461 | "" 462 | ] 463 | }, 464 | "metadata": {}, 465 | "output_type": "display_data" 466 | } 467 | ], 468 | "source": [ 469 | "GPT = GPTchatClass(gpt_model='gpt-3.5-turbo',role_msg = 'Your are a helpful assistant.')\n", 470 | "PRINT_USER_MSG = True\n", 471 | "PRINT_GPT_OUTPUT = True\n", 472 | "RESET_CHAT = False\n", 473 | "RETURN_RESPONSE = False\n", 474 | "GPT.chat(user_msg='Who is the current president of Korea?',\n", 475 | " PRINT_USER_MSG=PRINT_USER_MSG,PRINT_GPT_OUTPUT=PRINT_GPT_OUTPUT,\n", 476 | " RESET_CHAT=RESET_CHAT,RETURN_RESPONSE=RETURN_RESPONSE)\n", 477 | "GPT.chat(user_msg='Are you sure? 
I think you are outdated.',\n", 478 | " PRINT_USER_MSG=PRINT_USER_MSG,PRINT_GPT_OUTPUT=PRINT_GPT_OUTPUT,\n", 479 | " RESET_CHAT=RESET_CHAT,RETURN_RESPONSE=RETURN_RESPONSE)\n", 480 | "GPT.chat(user_msg='Where can I get the latest information?',\n", 481 | " PRINT_USER_MSG=PRINT_USER_MSG,PRINT_GPT_OUTPUT=PRINT_GPT_OUTPUT,\n", 482 | " RESET_CHAT=RESET_CHAT,RETURN_RESPONSE=RETURN_RESPONSE)" 483 | ] 484 | }, 485 | { 486 | "cell_type": "markdown", 487 | "id": "0e3dff58", 488 | "metadata": {}, 489 | "source": [ 490 | "### Chat with resetting everytime" 491 | ] 492 | }, 493 | { 494 | "cell_type": "code", 495 | "execution_count": 12, 496 | "id": "0b3982d2", 497 | "metadata": {}, 498 | "outputs": [ 499 | { 500 | "name": "stdout", 501 | "output_type": "stream", 502 | "text": [ 503 | "Chat agent using [gpt-3.5-turbo] initialized with the follow role:[Your are a helpful assistant.]\n", 504 | "[USER_MSG]\n" 505 | ] 506 | }, 507 | { 508 | "data": { 509 | "text/markdown": [ 510 | "Who is the current president of Korea?" 511 | ], 512 | "text/plain": [ 513 | "" 514 | ] 515 | }, 516 | "metadata": {}, 517 | "output_type": "display_data" 518 | }, 519 | { 520 | "name": "stdout", 521 | "output_type": "stream", 522 | "text": [ 523 | "[GPT_OUTPUT]\n" 524 | ] 525 | }, 526 | { 527 | "data": { 528 | "text/markdown": [ 529 | "As of my knowledge, the current President of South Korea is Moon Jae-in." 530 | ], 531 | "text/plain": [ 532 | "" 533 | ] 534 | }, 535 | "metadata": {}, 536 | "output_type": "display_data" 537 | }, 538 | { 539 | "name": "stdout", 540 | "output_type": "stream", 541 | "text": [ 542 | "[USER_MSG]\n" 543 | ] 544 | }, 545 | { 546 | "data": { 547 | "text/markdown": [ 548 | "Are you sure? I think you are outdated." 549 | ], 550 | "text/plain": [ 551 | "" 552 | ] 553 | }, 554 | "metadata": {}, 555 | "output_type": "display_data" 556 | }, 557 | { 558 | "name": "stdout", 559 | "output_type": "stream", 560 | "text": [ 561 | "[GPT_OUTPUT]\n" 562 | ] 563 | }, 564 | { 565 | "data": { 566 | "text/markdown": [ 567 | "I'm here to assist you to the best of my ability. While I may not have the latest information or be able to perform certain tasks, I'll do my best to help you with any questions or tasks you have. Let me know how I can assist you!" 568 | ], 569 | "text/plain": [ 570 | "" 571 | ] 572 | }, 573 | "metadata": {}, 574 | "output_type": "display_data" 575 | }, 576 | { 577 | "name": "stdout", 578 | "output_type": "stream", 579 | "text": [ 580 | "[USER_MSG]\n" 581 | ] 582 | }, 583 | { 584 | "data": { 585 | "text/markdown": [ 586 | "Where can I get the latest information?" 587 | ], 588 | "text/plain": [ 589 | "" 590 | ] 591 | }, 592 | "metadata": {}, 593 | "output_type": "display_data" 594 | }, 595 | { 596 | "name": "stdout", 597 | "output_type": "stream", 598 | "text": [ 599 | "[GPT_OUTPUT]\n" 600 | ] 601 | }, 602 | { 603 | "data": { 604 | "text/markdown": [ 605 | "To get the latest information, there are several reliable sources you can turn to:\n", 606 | "\n", 607 | "1. News websites: Popular news websites like BBC News, CNN, Reuters, or The New York Times provide up-to-date news on various topics.\n", 608 | "\n", 609 | "2. Social media platforms: Follow reputable news outlets or journalists on platforms like Twitter, where they often share the latest updates and breaking news.\n", 610 | "\n", 611 | "3. Official government websites: Government websites provide reliable information on topics such as public health, policies, and current events. 
Look for websites specific to your country or region.\n", 612 | "\n", 613 | "4. Trusted experts and organizations: Follow experts or organizations related to the topic you're interested in. For example, if you're looking for health-related information, following the World Health Organization (WHO) or the Centers for Disease Control and Prevention (CDC) can provide reliable updates.\n", 614 | "\n", 615 | "Remember, it's always important to verify the credibility of your sources and cross-reference information from multiple reliable sources to ensure accuracy." 616 | ], 617 | "text/plain": [ 618 | "" 619 | ] 620 | }, 621 | "metadata": {}, 622 | "output_type": "display_data" 623 | } 624 | ], 625 | "source": [ 626 | "GPT = GPTchatClass(gpt_model='gpt-3.5-turbo',role_msg = 'Your are a helpful assistant.')\n", 627 | "PRINT_USER_MSG = True\n", 628 | "PRINT_GPT_OUTPUT = True\n", 629 | "RESET_CHAT = True\n", 630 | "RETURN_RESPONSE = False\n", 631 | "GPT.chat(user_msg='Who is the current president of Korea?',\n", 632 | " PRINT_USER_MSG=PRINT_USER_MSG,PRINT_GPT_OUTPUT=PRINT_GPT_OUTPUT,\n", 633 | " RESET_CHAT=RESET_CHAT,RETURN_RESPONSE=RETURN_RESPONSE)\n", 634 | "GPT.chat(user_msg='Are you sure? I think you are outdated.',\n", 635 | " PRINT_USER_MSG=PRINT_USER_MSG,PRINT_GPT_OUTPUT=PRINT_GPT_OUTPUT,\n", 636 | " RESET_CHAT=RESET_CHAT,RETURN_RESPONSE=RETURN_RESPONSE)\n", 637 | "GPT.chat(user_msg='Where can I get the latest information?',\n", 638 | " PRINT_USER_MSG=PRINT_USER_MSG,PRINT_GPT_OUTPUT=PRINT_GPT_OUTPUT,\n", 639 | " RESET_CHAT=RESET_CHAT,RETURN_RESPONSE=RETURN_RESPONSE)" 640 | ] 641 | }, 642 | { 643 | "cell_type": "code", 644 | "execution_count": null, 645 | "id": "92b54979", 646 | "metadata": {}, 647 | "outputs": [], 648 | "source": [] 649 | } 650 | ], 651 | "metadata": { 652 | "kernelspec": { 653 | "display_name": "Python 3 (ipykernel)", 654 | "language": "python", 655 | "name": "python3" 656 | }, 657 | "language_info": { 658 | "codemirror_mode": { 659 | "name": "ipython", 660 | "version": 3 661 | }, 662 | "file_extension": ".py", 663 | "mimetype": "text/x-python", 664 | "name": "python", 665 | "nbconvert_exporter": "python", 666 | "pygments_lexer": "ipython3", 667 | "version": "3.9.16" 668 | } 669 | }, 670 | "nbformat": 4, 671 | "nbformat_minor": 5 672 | } 673 | -------------------------------------------------------------------------------- /code/demo_gpt_02_rag.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "9d42d686", 6 | "metadata": {}, 7 | "source": [ 8 | "### Retrieval-Augmented Generation with Wikipedia" 9 | ] 10 | }, 11 | { 12 | "cell_type": "code", 13 | "execution_count": 1, 14 | "id": "0ff8898c", 15 | "metadata": {}, 16 | "outputs": [ 17 | { 18 | "name": "stdout", 19 | "output_type": "stream", 20 | "text": [ 21 | "openai version:[0.28.0]\n" 22 | ] 23 | } 24 | ], 25 | "source": [ 26 | "import os\n", 27 | "import openai\n", 28 | "from gpt_helper import set_openai_api_key_from_txt,GPTchatClass,printmd\n", 29 | "from wiki_helper import wiki_search\n", 30 | "from util import printmd,extract_quoted_words\n", 31 | "print (\"openai version:[%s]\"%(openai.__version__))" 32 | ] 33 | }, 34 | { 35 | "cell_type": "markdown", 36 | "id": "1b39a3d2", 37 | "metadata": {}, 38 | "source": [ 39 | "### Instantiate GPT Agent" 40 | ] 41 | }, 42 | { 43 | "cell_type": "code", 44 | "execution_count": 2, 45 | "id": "08bd57a4", 46 | "metadata": {}, 47 | "outputs": [ 48 | { 49 | "name": "stdout", 50 | 
"output_type": "stream", 51 | "text": [ 52 | "OpenAI API Key Ready from [../key/rilab_key.txt].\n", 53 | "Chat agent using [gpt-3.5-turbo] initialized with the follow role:[Your are a helpful assistant summarizing infromation and answering user queries.]\n" 54 | ] 55 | } 56 | ], 57 | "source": [ 58 | "set_openai_api_key_from_txt(key_path='../key/rilab_key.txt')\n", 59 | "GPT = GPTchatClass(\n", 60 | " gpt_model='gpt-3.5-turbo', # 'gpt-3.5-turbo' / 'gpt-4'\n", 61 | " role_msg='Your are a helpful assistant summarizing infromation and answering user queries.')" 62 | ] 63 | }, 64 | { 65 | "cell_type": "markdown", 66 | "id": "cb1e669b", 67 | "metadata": {}, 68 | "source": [ 69 | "### Our RAG agent will use the following strategies\n", 70 | "We assume that a user question is given (e.g., 'Who is the current president of South Korea?').\n", 71 | "* Step 1. For the given question, our `GPT agent` will first generate a number of entities for searching Wikipedia.\n", 72 | "* Step 2. Then, our `WikiBot` will provide (i.e., crawl) related information summarized with the `GPT agent` considering the user question.\n", 73 | "* Step 3. Finally, the summarized texts and the original user question will be given to the `GPT agent` to answer. " 74 | ] 75 | }, 76 | { 77 | "cell_type": "code", 78 | "execution_count": 3, 79 | "id": "7986bfe2", 80 | "metadata": {}, 81 | "outputs": [ 82 | { 83 | "name": "stdout", 84 | "output_type": "stream", 85 | "text": [ 86 | "question: Who is the current president of South Korea?\n" 87 | ] 88 | } 89 | ], 90 | "source": [ 91 | "question = 'Who is the current president of South Korea?'\n", 92 | "\"\"\"\n", 93 | "question = '''\n", 94 | " I am an interactive humanoid robot agent. \n", 95 | " I have following action capabilites:['idle','waving','greeting','raising hands','hugging','reading a book']\n", 96 | " I can detect following observations:['no people','a person appears','a person waves hands','a person leaves']\n", 97 | " I have a following personality:['Introverted and Childish']\n", 98 | " What is the best next action when I am in ['idle'] state and observes ['a person waves hands']?\n", 99 | "'''\n", 100 | "\"\"\"\n", 101 | "print (\"question: %s\"%(question))" 102 | ] 103 | }, 104 | { 105 | "cell_type": "markdown", 106 | "id": "a7786588", 107 | "metadata": {}, 108 | "source": [ 109 | "### Step 1. Generate entities for wiki search" 110 | ] 111 | }, 112 | { 113 | "cell_type": "code", 114 | "execution_count": 4, 115 | "id": "a46151c3", 116 | "metadata": {}, 117 | "outputs": [], 118 | "source": [ 119 | "user_msg = \\\n", 120 | " \"\"\"\n", 121 | " Suppose you will use Wikipedia for retrieving information. \n", 122 | " Could you recommend three query words wrapped with quotation marks considering the following question?\n", 123 | " \"\"\" + '\"' + question + '\"'" 124 | ] 125 | }, 126 | { 127 | "cell_type": "code", 128 | "execution_count": 5, 129 | "id": "31f76438", 130 | "metadata": {}, 131 | "outputs": [ 132 | { 133 | "name": "stdout", 134 | "output_type": "stream", 135 | "text": [ 136 | "[USER_MSG]\n" 137 | ] 138 | }, 139 | { 140 | "data": { 141 | "text/markdown": [ 142 | "\n", 143 | " Suppose you will use Wikipedia for retrieving information. 
\n", 144 | " Could you recommend three query words wrapped with quotation marks considering the following question?\n", 145 | " \"Who is the current president of South Korea?\"" 146 | ], 147 | "text/plain": [ 148 | "" 149 | ] 150 | }, 151 | "metadata": {}, 152 | "output_type": "display_data" 153 | }, 154 | { 155 | "name": "stdout", 156 | "output_type": "stream", 157 | "text": [ 158 | "[GPT_OUTPUT]\n" 159 | ] 160 | }, 161 | { 162 | "data": { 163 | "text/markdown": [ 164 | "Sure! Here are three query words you can use to search for the current president of South Korea on Wikipedia:\n", 165 | "\n", 166 | "1. \"Current president of South Korea\"\n", 167 | "2. \"President of South Korea\"\n", 168 | "3. \"South Korean president\"\n", 169 | "\n", 170 | "By putting these phrases in quotation marks, it will help narrow down the search results and prioritize pages that contain the exact phrase you're looking for." 171 | ], 172 | "text/plain": [ 173 | "" 174 | ] 175 | }, 176 | "metadata": {}, 177 | "output_type": "display_data" 178 | } 179 | ], 180 | "source": [ 181 | "response_content = GPT.chat(\n", 182 | " user_msg=user_msg,PRINT_USER_MSG=True,PRINT_GPT_OUTPUT=True,\n", 183 | " RESET_CHAT=True,RETURN_RESPONSE=True)" 184 | ] 185 | }, 186 | { 187 | "cell_type": "code", 188 | "execution_count": 6, 189 | "id": "90835aa3", 190 | "metadata": {}, 191 | "outputs": [ 192 | { 193 | "data": { 194 | "text/markdown": [ 195 | "Sure! Here are three query words you can use to search for the current president of South Korea on Wikipedia:\n", 196 | "\n", 197 | "1. \"Current president of South Korea\"\n", 198 | "2. \"President of South Korea\"\n", 199 | "3. \"South Korean president\"\n", 200 | "\n", 201 | "By putting these phrases in quotation marks, it will help narrow down the search results and prioritize pages that contain the exact phrase you're looking for." 202 | ], 203 | "text/plain": [ 204 | "" 205 | ] 206 | }, 207 | "metadata": {}, 208 | "output_type": "display_data" 209 | } 210 | ], 211 | "source": [ 212 | "# Print summarized sentence with a markdown format\n", 213 | "printmd(response_content)" 214 | ] 215 | }, 216 | { 217 | "cell_type": "code", 218 | "execution_count": 7, 219 | "id": "da04d2e3", 220 | "metadata": {}, 221 | "outputs": [ 222 | { 223 | "name": "stdout", 224 | "output_type": "stream", 225 | "text": [ 226 | "['Current president of South Korea', 'President of South Korea', 'South Korean president']\n" 227 | ] 228 | } 229 | ], 230 | "source": [ 231 | "entities = extract_quoted_words(response_content)\n", 232 | "if len(entities) > 3: entities = entities[-3:]\n", 233 | "print (entities)" 234 | ] 235 | }, 236 | { 237 | "cell_type": "markdown", 238 | "id": "79e4bf43", 239 | "metadata": {}, 240 | "source": [ 241 | "### Step 2. Query entities to `WikiBot`" 242 | ] 243 | }, 244 | { 245 | "cell_type": "code", 246 | "execution_count": 8, 247 | "id": "288c9789", 248 | "metadata": {}, 249 | "outputs": [ 250 | { 251 | "name": "stdout", 252 | "output_type": "stream", 253 | "text": [ 254 | "entity:[Current president of South Korea] mismatched. 
use [President of South Korea] instead.\n", 255 | " We have total [293] paragraphs.\n", 256 | " After filtering, we have [31] and [8] paragraphs returned (k:[5] and m:[3])\n", 257 | "entity:[President of South Korea] matched.\n", 258 | " We have total [293] paragraphs.\n", 259 | " After filtering, we have [31] and [8] paragraphs returned (k:[5] and m:[3])\n", 260 | "entity:[South Korean president] matched.\n", 261 | " We have total [293] paragraphs.\n", 262 | " After filtering, we have [31] and [8] paragraphs returned (k:[5] and m:[3])\n" 263 | ] 264 | } 265 | ], 266 | "source": [ 267 | "paragraphs_return = []\n", 268 | "for entity in entities:\n", 269 | " paragraphs_return += wiki_search(entity=entity,VERBOSE=True)" 270 | ] 271 | }, 272 | { 273 | "cell_type": "code", 274 | "execution_count": 9, 275 | "id": "92579ce3", 276 | "metadata": {}, 277 | "outputs": [ 278 | { 279 | "name": "stdout", 280 | "output_type": "stream", 281 | "text": [ 282 | "Number of paragraphs [24] => unique ones [8]\n" 283 | ] 284 | } 285 | ], 286 | "source": [ 287 | "# Get the unique elements\n", 288 | "paragraphs_unique = list(set(paragraphs_return))\n", 289 | "print (\"Number of paragraphs [%d] => unique ones [%d]\"%\n", 290 | " (len(paragraphs_return),len(paragraphs_unique)))" 291 | ] 292 | }, 293 | { 294 | "cell_type": "code", 295 | "execution_count": 10, 296 | "id": "512ca3f3", 297 | "metadata": { 298 | "scrolled": true 299 | }, 300 | "outputs": [ 301 | { 302 | "data": { 303 | "text/markdown": [ 304 | "The current president of South Korea is directly elected for a five-year term with no possibility of re-election, and in case of a vacancy, a successor must be elected within sixty days." 305 | ], 306 | "text/plain": [ 307 | "" 308 | ] 309 | }, 310 | "metadata": {}, 311 | "output_type": "display_data" 312 | }, 313 | { 314 | "data": { 315 | "text/markdown": [ 316 | "The current president of South Korea is the head of state and government, leading the State Council and serving as the commander-in-chief of the Armed Forces." 317 | ], 318 | "text/plain": [ 319 | "" 320 | ] 321 | }, 322 | "metadata": {}, 323 | "output_type": "display_data" 324 | }, 325 | { 326 | "data": { 327 | "text/markdown": [ 328 | "The current president of South Korea is not mentioned in the given paragraph" 329 | ], 330 | "text/plain": [ 331 | "" 332 | ] 333 | }, 334 | "metadata": {}, 335 | "output_type": "display_data" 336 | }, 337 | { 338 | "data": { 339 | "text/markdown": [ 340 | "The presidential term in South Korea is currently set at five years since 1988, with the president being barred from re-election since 1981." 341 | ], 342 | "text/plain": [ 343 | "" 344 | ] 345 | }, 346 | "metadata": {}, 347 | "output_type": "display_data" 348 | }, 349 | { 350 | "data": { 351 | "text/markdown": [ 352 | "The Provisional Government of the Republic of Korea established in September 1919 was recognized and succeeded by South Korea and its current Constitution." 353 | ], 354 | "text/plain": [ 355 | "" 356 | ] 357 | }, 358 | "metadata": {}, 359 | "output_type": "display_data" 360 | }, 361 | { 362 | "data": { 363 | "text/markdown": [ 364 | "The paragraph describes the National Security Council and the Peaceful Unification Advisory Council in South Korea, their roles and membership." 
365 | ], 366 | "text/plain": [ 367 | "" 368 | ] 369 | }, 370 | "metadata": {}, 371 | "output_type": "display_data" 372 | }, 373 | { 374 | "data": { 375 | "text/markdown": [ 376 | "Yoon Suk Yeol, a former prosecutor general and member of the conservative People Power Party, became the president of South Korea on May 10, 2022, after winning the 2022 presidential election with a narrow 48.5% of the votes, defeating Lee Jae-myung from the Democratic Party." 377 | ], 378 | "text/plain": [ 379 | "" 380 | ] 381 | }, 382 | "metadata": {}, 383 | "output_type": "display_data" 384 | }, 385 | { 386 | "data": { 387 | "text/markdown": [ 388 | "The paragraph discusses the controversial Advisory Council of Elder Statesmen in South Korea, which was expanded and elevated to cabinet rank before Roh Tae Woo became president, leading to suspicions of it being designed to benefit a specific individual. However, these suspicions became irrelevant when former President Chun withdrew from politics in November 1988." 389 | ], 390 | "text/plain": [ 391 | "" 392 | ] 393 | }, 394 | "metadata": {}, 395 | "output_type": "display_data" 396 | } 397 | ], 398 | "source": [ 399 | "# Now summarize each paragraph into a single sentence considering the question\n", 400 | "summarized_sentences = []\n", 401 | "for p_idx,p in enumerate(paragraphs_unique):\n", 402 | " user_msg = \"You are given following question: \"+question\n", 403 | " user_msg += \"Could you summarize the following paragraph into one setence? \\n \"+p\n", 404 | " response_content = GPT.chat(\n", 405 | " user_msg=user_msg,PRINT_USER_MSG=False,PRINT_GPT_OUTPUT=False,\n", 406 | " RESET_CHAT=True,RETURN_RESPONSE=True)\n", 407 | " # Append summarized sentences\n", 408 | " summarized_sentences.append(response_content)\n", 409 | " # Print summarized sentence with a markdown format\n", 410 | " printmd(response_content)" 411 | ] 412 | }, 413 | { 414 | "cell_type": "markdown", 415 | "id": "d5286646", 416 | "metadata": {}, 417 | "source": [ 418 | "### Step 3. Answer the question using `summarized_sentences`" 419 | ] 420 | }, 421 | { 422 | "cell_type": "code", 423 | "execution_count": 11, 424 | "id": "9d5ddbe4", 425 | "metadata": {}, 426 | "outputs": [], 427 | "source": [ 428 | "user_msg = \" \".join(summarized_sentences)\n", 429 | "user_msg += \" Using the information above, could you answer the following question? \"\n", 430 | "user_msg += question" 431 | ] 432 | }, 433 | { 434 | "cell_type": "code", 435 | "execution_count": 12, 436 | "id": "d680df19", 437 | "metadata": { 438 | "scrolled": true 439 | }, 440 | "outputs": [ 441 | { 442 | "name": "stdout", 443 | "output_type": "stream", 444 | "text": [ 445 | "[USER_MSG]\n" 446 | ] 447 | }, 448 | { 449 | "data": { 450 | "text/markdown": [ 451 | "The current president of South Korea is directly elected for a five-year term with no possibility of re-election, and in case of a vacancy, a successor must be elected within sixty days. The current president of South Korea is the head of state and government, leading the State Council and serving as the commander-in-chief of the Armed Forces. The current president of South Korea is not mentioned in the given paragraph The presidential term in South Korea is currently set at five years since 1988, with the president being barred from re-election since 1981. The Provisional Government of the Republic of Korea established in September 1919 was recognized and succeeded by South Korea and its current Constitution. 
The paragraph describes the National Security Council and the Peaceful Unification Advisory Council in South Korea, their roles and membership. Yoon Suk Yeol, a former prosecutor general and member of the conservative People Power Party, became the president of South Korea on May 10, 2022, after winning the 2022 presidential election with a narrow 48.5% of the votes, defeating Lee Jae-myung from the Democratic Party. The paragraph discusses the controversial Advisory Council of Elder Statesmen in South Korea, which was expanded and elevated to cabinet rank before Roh Tae Woo became president, leading to suspicions of it being designed to benefit a specific individual. However, these suspicions became irrelevant when former President Chun withdrew from politics in November 1988. Using the information above, could you answer the following question? Who is the current president of South Korea?" 452 | ], 453 | "text/plain": [ 454 | "" 455 | ] 456 | }, 457 | "metadata": {}, 458 | "output_type": "display_data" 459 | }, 460 | { 461 | "name": "stdout", 462 | "output_type": "stream", 463 | "text": [ 464 | "[GPT_OUTPUT]\n" 465 | ] 466 | }, 467 | { 468 | "data": { 469 | "text/markdown": [ 470 | "The current president of South Korea is Yoon Suk Yeol, who took office on May 10, 2022, after winning the 2022 presidential election." 471 | ], 472 | "text/plain": [ 473 | "" 474 | ] 475 | }, 476 | "metadata": {}, 477 | "output_type": "display_data" 478 | } 479 | ], 480 | "source": [ 481 | "response_content = GPT.chat(\n", 482 | " user_msg=user_msg,PRINT_USER_MSG=True,PRINT_GPT_OUTPUT=True,\n", 483 | " RESET_CHAT=False,RETURN_RESPONSE=True)" 484 | ] 485 | }, 486 | { 487 | "cell_type": "code", 488 | "execution_count": null, 489 | "id": "b987f6f2", 490 | "metadata": {}, 491 | "outputs": [], 492 | "source": [ 493 | "user_msg = \"Could you explain about this little longer?\"\n", 494 | "response_content = GPT.chat(\n", 495 | " user_msg=user_msg,PRINT_USER_MSG=True,PRINT_GPT_OUTPUT=True,\n", 496 | " RESET_CHAT=False,RETURN_RESPONSE=True)" 497 | ] 498 | }, 499 | { 500 | "cell_type": "code", 501 | "execution_count": null, 502 | "id": "adb3a643", 503 | "metadata": {}, 504 | "outputs": [], 505 | "source": [] 506 | } 507 | ], 508 | "metadata": { 509 | "kernelspec": { 510 | "display_name": "Python 3 (ipykernel)", 511 | "language": "python", 512 | "name": "python3" 513 | }, 514 | "language_info": { 515 | "codemirror_mode": { 516 | "name": "ipython", 517 | "version": 3 518 | }, 519 | "file_extension": ".py", 520 | "mimetype": "text/x-python", 521 | "name": "python", 522 | "nbconvert_exporter": "python", 523 | "pygments_lexer": "ipython3", 524 | "version": "3.9.16" 525 | } 526 | }, 527 | "nbformat": 4, 528 | "nbformat_minor": 5 529 | } 530 | -------------------------------------------------------------------------------- /code/demo_gpt_03_finetune.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "af5d9184", 6 | "metadata": {}, 7 | "source": [ 8 | "### Fine-tune GPT" 9 | ] 10 | }, 11 | { 12 | "cell_type": "code", 13 | "execution_count": 1, 14 | "id": "b2c46c31", 15 | "metadata": {}, 16 | "outputs": [ 17 | { 18 | "name": "stdout", 19 | "output_type": "stream", 20 | "text": [ 21 | "OpenAI package version:0.28.0\n" 22 | ] 23 | } 24 | ], 25 | "source": [ 26 | "import json\n", 27 | "import openai\n", 28 | "import time, random\n", 29 | "from datasets import load_dataset\n", 30 | "print (\"OpenAI package 
version:%s\"%(openai.__version__))" 31 | ] 32 | }, 33 | { 34 | "cell_type": "markdown", 35 | "id": "71a68cf4", 36 | "metadata": {}, 37 | "source": [ 38 | "### Load data from Hugging Face Datasets" 39 | ] 40 | }, 41 | { 42 | "cell_type": "code", 43 | "execution_count": 2, 44 | "id": "771dc225", 45 | "metadata": {}, 46 | "outputs": [], 47 | "source": [ 48 | "dataset = load_dataset(path=\"nguha/legalbench\",name=\"nys_judicial_ethics\")" 49 | ] 50 | }, 51 | { 52 | "cell_type": "code", 53 | "execution_count": 3, 54 | "id": "ac5a6b9c", 55 | "metadata": {}, 56 | "outputs": [ 57 | { 58 | "data": { 59 | "text/html": [ 60 | "
[HTML-rendered DataFrame (8 rows x 4 columns: answer, index, question, year); identical content to the text/plain output below]
" 144 | ], 145 | "text/plain": [ 146 | " answer index question year\n", 147 | "0 No 0 If a judge reports an attorney for a substanti... 2010\n", 148 | "1 No 1 Does a village justice need to disqualify them... 2010\n", 149 | "2 No 2 Is a judge required to disclose their former e... 2010\n", 150 | "3 No 3 Is it prohibited for a part-time judge who pra... 2010\n", 151 | "4 Yes 4 Can a judge appear monthly on a local televisi... 2010\n", 152 | "5 Yes 5 Is a judge required to report an attorney to t... 2010\n", 153 | "6 Yes 6 Is a judge required to take appropriate action... 2010\n", 154 | "7 Yes 7 Can a judge appoint a qualified former law cle... 2010" 155 | ] 156 | }, 157 | "execution_count": 3, 158 | "metadata": {}, 159 | "output_type": "execute_result" 160 | } 161 | ], 162 | "source": [ 163 | "dataset[\"train\"].to_pandas()" 164 | ] 165 | }, 166 | { 167 | "cell_type": "code", 168 | "execution_count": 4, 169 | "id": "0b1a8de0", 170 | "metadata": {}, 171 | "outputs": [ 172 | { 173 | "data": { 174 | "text/html": [ 175 | "
[HTML-rendered DataFrame (292 rows x 4 columns: answer, index, question, year); identical content to the text/plain output below]
" 281 | ], 282 | "text/plain": [ 283 | " answer index question year\n", 284 | "0 Yes 0 Is a judge required to disclose a former law c... 2010\n", 285 | "1 Yes 1 Can the names of judges who are members of a b... 2010\n", 286 | "2 Yes 2 Should the judge disqualify themselves from a ... 2010\n", 287 | "3 Yes 3 Should a judge disqualify themselves from all ... 2010\n", 288 | "4 No 4 Can a judge designate the traffic court clerk ... 2010\n", 289 | ".. ... ... ... ...\n", 290 | "287 Yes 287 Is it permissible for the inquirer to share et... 2021\n", 291 | "288 Yes 288 Can a part-time lawyer judge who was assigned ... 2021\n", 292 | "289 No 289 Can a part-time lawyer judge ordinarily appear... 2021\n", 293 | "290 No 290 Can a practicing part-time lawyer judge repres... 2021\n", 294 | "291 Yes 291 Is it permissible for a village justice to sen... 2021\n", 295 | "\n", 296 | "[292 rows x 4 columns]" 297 | ] 298 | }, 299 | "execution_count": 4, 300 | "metadata": {}, 301 | "output_type": "execute_result" 302 | } 303 | ], 304 | "source": [ 305 | "dataset['test'].to_pandas()" 306 | ] 307 | }, 308 | { 309 | "cell_type": "markdown", 310 | "id": "a1dd9a7c", 311 | "metadata": {}, 312 | "source": [ 313 | "### Since the nunber of training set is too small, we'll use test set" 314 | ] 315 | }, 316 | { 317 | "cell_type": "code", 318 | "execution_count": 5, 319 | "id": "8a80e08e", 320 | "metadata": {}, 321 | "outputs": [], 322 | "source": [ 323 | "NUM_TRAIN = 10\n", 324 | "NUM_VALIDATION = 5" 325 | ] 326 | }, 327 | { 328 | "cell_type": "code", 329 | "execution_count": 6, 330 | "id": "f0e7a9cb", 331 | "metadata": {}, 332 | "outputs": [], 333 | "source": [ 334 | "base_text = \"\"\"\n", 335 | "Imagine your are the New York State Unified Court System Advisory Committee on Judicial Ethics. You've received the following question(s). 
Answer them as either \"Yes\" or \"No\".\n", 336 | "\"\"\"" 337 | ] 338 | }, 339 | { 340 | "cell_type": "code", 341 | "execution_count": 7, 342 | "id": "1f5ff46f", 343 | "metadata": {}, 344 | "outputs": [ 345 | { 346 | "name": "stdout", 347 | "output_type": "stream", 348 | "text": [ 349 | "[../data/legalbench_nys_judicial_ethics_train.json1] saved.\n", 350 | "[../data/legalbench_nys_judicial_ethics_validation.json1] saved.\n" 351 | ] 352 | } 353 | ], 354 | "source": [ 355 | "train_jsonl_path = '../data/legalbench_nys_judicial_ethics_train.json1'\n", 356 | "validation_jsonl_path = '../data/legalbench_nys_judicial_ethics_validation.json1'\n", 357 | "with open(train_jsonl_path, 'w') as f:\n", 358 | " for i in range(NUM_TRAIN):\n", 359 | " data = dataset['test'][i]\n", 360 | " line = {\"messages\": [{\"role\": \"system\", \"content\": base_text}, \n", 361 | " {\"role\": \"user\", \"content\": \"Question: \" + data['question']},\n", 362 | " {\"role\": \"assistant\", \"content\": \"Answer: \" + data['answer']}]}\n", 363 | " f.write(json.dumps(line) + '\\n')\n", 364 | "\n", 365 | "with open(validation_jsonl_path, 'w') as f:\n", 366 | " for i in range(NUM_TRAIN,NUM_TRAIN+NUM_VALIDATION):\n", 367 | " data = dataset['test'][i]\n", 368 | " line = {\"messages\": [{\"role\": \"system\", \"content\": base_text}, \n", 369 | " {\"role\": \"user\", \"content\": \"Question: \" + data['question']},\n", 370 | " {\"role\": \"assistant\", \"content\": \"Answer: \" + data['answer']}]}\n", 371 | " f.write(json.dumps(line) + '\\n')\n", 372 | "print (\"[%s] saved.\"%(train_jsonl_path))\n", 373 | "print (\"[%s] saved.\"%(validation_jsonl_path))" 374 | ] 375 | }, 376 | { 377 | "cell_type": "markdown", 378 | "id": "0ec2d77b", 379 | "metadata": {}, 380 | "source": [ 381 | "### Locate the key" 382 | ] 383 | }, 384 | { 385 | "cell_type": "code", 386 | "execution_count": 8, 387 | "id": "6bf3cc86", 388 | "metadata": {}, 389 | "outputs": [ 390 | { 391 | "name": "stdout", 392 | "output_type": "stream", 393 | "text": [ 394 | "key_path:[../key/rilab_key.txt]\n" 395 | ] 396 | } 397 | ], 398 | "source": [ 399 | "key_path = '../key/rilab_key.txt'\n", 400 | "print ('key_path:[%s]'%(key_path))" 401 | ] 402 | }, 403 | { 404 | "cell_type": "code", 405 | "execution_count": 9, 406 | "id": "c58a7f65", 407 | "metadata": {}, 408 | "outputs": [], 409 | "source": [ 410 | "with open(key_path, 'r') as f: OPENAI_API_KEY = f.read()\n", 411 | "openai.api_key = OPENAI_API_KEY" 412 | ] 413 | }, 414 | { 415 | "cell_type": "markdown", 416 | "id": "49033284", 417 | "metadata": {}, 418 | "source": [ 419 | "### Delete existing files (optional)" 420 | ] 421 | }, 422 | { 423 | "cell_type": "code", 424 | "execution_count": 10, 425 | "id": "df59a3dd", 426 | "metadata": { 427 | "scrolled": true 428 | }, 429 | "outputs": [ 430 | { 431 | "name": "stdout", 432 | "output_type": "stream", 433 | "text": [ 434 | "{\n", 435 | " \"object\": \"file\",\n", 436 | " \"id\": \"file-PhLUhZJ5l2VsqnLzXY0RsCpE\",\n", 437 | " \"purpose\": \"fine-tune\",\n", 438 | " \"filename\": \"file\",\n", 439 | " \"bytes\": 5106,\n", 440 | " \"created_at\": 1693543712,\n", 441 | " \"status\": \"uploaded\",\n", 442 | " \"status_details\": null\n", 443 | "}\n", 444 | "{\n", 445 | " \"object\": \"file\",\n", 446 | " \"id\": \"file-WuxpCaSYmkq171yJVWGAGmbR\",\n", 447 | " \"purpose\": \"fine-tune-results\",\n", 448 | " \"filename\": \"step_metrics.csv\",\n", 449 | " \"bytes\": 326,\n", 450 | " \"created_at\": 1693545518,\n", 451 | " \"status\": \"uploaded\",\n", 452 | " \"status_details\": 
null\n", 453 | "}\n", 454 | "{\n", 455 | " \"object\": \"file\",\n", 456 | " \"id\": \"file-l5WBArbeZ1DKW79pm2USfrCn\",\n", 457 | " \"purpose\": \"fine-tune-results\",\n", 458 | " \"filename\": \"step_metrics.csv\",\n", 459 | " \"bytes\": 327,\n", 460 | " \"created_at\": 1693547402,\n", 461 | " \"status\": \"uploaded\",\n", 462 | " \"status_details\": null\n", 463 | "}\n", 464 | "{\n", 465 | " \"object\": \"file\",\n", 466 | " \"id\": \"file-a3z1jd3Kt5giIVLsxPubxGUH\",\n", 467 | " \"purpose\": \"fine-tune\",\n", 468 | " \"filename\": \"file\",\n", 469 | " \"bytes\": 4841,\n", 470 | " \"created_at\": 1693548220,\n", 471 | " \"status\": \"processed\",\n", 472 | " \"status_details\": null\n", 473 | "}\n", 474 | "{\n", 475 | " \"object\": \"file\",\n", 476 | " \"id\": \"file-LTE83jTd01XEUQ6aiL1iLeuN\",\n", 477 | " \"purpose\": \"fine-tune\",\n", 478 | " \"filename\": \"file\",\n", 479 | " \"bytes\": 2415,\n", 480 | " \"created_at\": 1693548234,\n", 481 | " \"status\": \"processed\",\n", 482 | " \"status_details\": null\n", 483 | "}\n", 484 | "{\n", 485 | " \"object\": \"file\",\n", 486 | " \"id\": \"file-KQDYaX1BlnGQZHhk94Qjy35K\",\n", 487 | " \"purpose\": \"fine-tune-results\",\n", 488 | " \"filename\": \"step_metrics.csv\",\n", 489 | " \"bytes\": 328,\n", 490 | " \"created_at\": 1693549011,\n", 491 | " \"status\": \"processed\",\n", 492 | " \"status_details\": null\n", 493 | "}\n" 494 | ] 495 | } 496 | ], 497 | "source": [ 498 | "past_file_lists = openai.File.list()\n", 499 | "for past_file in past_file_lists['data']:\n", 500 | " print (past_file)" 501 | ] 502 | }, 503 | { 504 | "cell_type": "code", 505 | "execution_count": 11, 506 | "id": "2dec1ab6", 507 | "metadata": { 508 | "scrolled": true 509 | }, 510 | "outputs": [ 511 | { 512 | "name": "stdout", 513 | "output_type": "stream", 514 | "text": [ 515 | "{\n", 516 | " \"object\": \"file\",\n", 517 | " \"id\": \"file-a3z1jd3Kt5giIVLsxPubxGUH\",\n", 518 | " \"purpose\": \"fine-tune\",\n", 519 | " \"filename\": \"file\",\n", 520 | " \"bytes\": 4841,\n", 521 | " \"created_at\": 1693548220,\n", 522 | " \"status\": \"processed\",\n", 523 | " \"status_details\": null\n", 524 | "}\n", 525 | "{\n", 526 | " \"object\": \"file\",\n", 527 | " \"id\": \"file-LTE83jTd01XEUQ6aiL1iLeuN\",\n", 528 | " \"purpose\": \"fine-tune\",\n", 529 | " \"filename\": \"file\",\n", 530 | " \"bytes\": 2415,\n", 531 | " \"created_at\": 1693548234,\n", 532 | " \"status\": \"processed\",\n", 533 | " \"status_details\": null\n", 534 | "}\n", 535 | "{\n", 536 | " \"object\": \"file\",\n", 537 | " \"id\": \"file-KQDYaX1BlnGQZHhk94Qjy35K\",\n", 538 | " \"purpose\": \"fine-tune-results\",\n", 539 | " \"filename\": \"step_metrics.csv\",\n", 540 | " \"bytes\": 328,\n", 541 | " \"created_at\": 1693549011,\n", 542 | " \"status\": \"processed\",\n", 543 | " \"status_details\": null\n", 544 | "}\n" 545 | ] 546 | } 547 | ], 548 | "source": [ 549 | "past_file_lists = openai.File.list()\n", 550 | "for past_file in past_file_lists['data']:\n", 551 | " idx = past_file['id']\n", 552 | " if past_file['status'] == 'processed':\n", 553 | " print (past_file)\n", 554 | " openai.File.delete(idx)" 555 | ] 556 | }, 557 | { 558 | "cell_type": "code", 559 | "execution_count": 12, 560 | "id": "b2787d03", 561 | "metadata": { 562 | "scrolled": true 563 | }, 564 | "outputs": [ 565 | { 566 | "name": "stdout", 567 | "output_type": "stream", 568 | "text": [ 569 | "{\n", 570 | " \"object\": \"file\",\n", 571 | " \"id\": \"file-l5WBArbeZ1DKW79pm2USfrCn\",\n", 572 | " \"purpose\": \"fine-tune-results\",\n", 573 
| " \"filename\": \"step_metrics.csv\",\n", 574 | " \"bytes\": 327,\n", 575 | " \"created_at\": 1693547402,\n", 576 | " \"status\": \"uploaded\",\n", 577 | " \"status_details\": null\n", 578 | "}\n", 579 | "{\n", 580 | " \"object\": \"file\",\n", 581 | " \"id\": \"file-WuxpCaSYmkq171yJVWGAGmbR\",\n", 582 | " \"purpose\": \"fine-tune-results\",\n", 583 | " \"filename\": \"step_metrics.csv\",\n", 584 | " \"bytes\": 326,\n", 585 | " \"created_at\": 1693545518,\n", 586 | " \"status\": \"uploaded\",\n", 587 | " \"status_details\": null\n", 588 | "}\n", 589 | "{\n", 590 | " \"object\": \"file\",\n", 591 | " \"id\": \"file-PhLUhZJ5l2VsqnLzXY0RsCpE\",\n", 592 | " \"purpose\": \"fine-tune\",\n", 593 | " \"filename\": \"file\",\n", 594 | " \"bytes\": 5106,\n", 595 | " \"created_at\": 1693543712,\n", 596 | " \"status\": \"uploaded\",\n", 597 | " \"status_details\": null\n", 598 | "}\n" 599 | ] 600 | } 601 | ], 602 | "source": [ 603 | "# Print after deleting files\n", 604 | "past_file_lists = openai.File.list()\n", 605 | "for past_file in past_file_lists['data']:\n", 606 | " print (past_file)" 607 | ] 608 | }, 609 | { 610 | "cell_type": "markdown", 611 | "id": "6b4d4b09", 612 | "metadata": {}, 613 | "source": [ 614 | "### Upload the dataset" 615 | ] 616 | }, 617 | { 618 | "cell_type": "code", 619 | "execution_count": 13, 620 | "id": "820ca734", 621 | "metadata": {}, 622 | "outputs": [], 623 | "source": [ 624 | "file_train_data = openai.File.create(\n", 625 | " file=open(train_jsonl_path, \"rb\"),\n", 626 | " purpose='fine-tune'\n", 627 | ")\n", 628 | "file_train_data_id = file_train_data.id" 629 | ] 630 | }, 631 | { 632 | "cell_type": "code", 633 | "execution_count": 14, 634 | "id": "e7de389f", 635 | "metadata": {}, 636 | "outputs": [], 637 | "source": [ 638 | "file_test_data = openai.File.create(\n", 639 | " file=open(validation_jsonl_path, \"rb\"),\n", 640 | " purpose='fine-tune'\n", 641 | ")\n", 642 | "file_test_data_id = file_test_data.id" 643 | ] 644 | }, 645 | { 646 | "cell_type": "code", 647 | "execution_count": 15, 648 | "id": "e3f06189", 649 | "metadata": { 650 | "scrolled": true 651 | }, 652 | "outputs": [ 653 | { 654 | "name": "stdout", 655 | "output_type": "stream", 656 | "text": [ 657 | "{\n", 658 | " \"object\": \"file\",\n", 659 | " \"id\": \"file-WuxpCaSYmkq171yJVWGAGmbR\",\n", 660 | " \"purpose\": \"fine-tune-results\",\n", 661 | " \"filename\": \"step_metrics.csv\",\n", 662 | " \"bytes\": 326,\n", 663 | " \"created_at\": 1693545518,\n", 664 | " \"status\": \"uploaded\",\n", 665 | " \"status_details\": null\n", 666 | "}\n", 667 | "{\n", 668 | " \"object\": \"file\",\n", 669 | " \"id\": \"file-l5WBArbeZ1DKW79pm2USfrCn\",\n", 670 | " \"purpose\": \"fine-tune-results\",\n", 671 | " \"filename\": \"step_metrics.csv\",\n", 672 | " \"bytes\": 327,\n", 673 | " \"created_at\": 1693547402,\n", 674 | " \"status\": \"uploaded\",\n", 675 | " \"status_details\": null\n", 676 | "}\n", 677 | "{\n", 678 | " \"object\": \"file\",\n", 679 | " \"id\": \"file-t4PDvppb0D5qHrU4tHrfumh1\",\n", 680 | " \"purpose\": \"fine-tune\",\n", 681 | " \"filename\": \"file\",\n", 682 | " \"bytes\": 2415,\n", 683 | " \"created_at\": 1693554217,\n", 684 | " \"status\": \"uploaded\",\n", 685 | " \"status_details\": null\n", 686 | "}\n", 687 | "{\n", 688 | " \"object\": \"file\",\n", 689 | " \"id\": \"file-rtZF3Ru65feoLs3eutQJsJ0H\",\n", 690 | " \"purpose\": \"fine-tune\",\n", 691 | " \"filename\": \"file\",\n", 692 | " \"bytes\": 4841,\n", 693 | " \"created_at\": 1693554213,\n", 694 | " \"status\": \"uploaded\",\n", 695 | 
" \"status_details\": null\n", 696 | "}\n", 697 | "{\n", 698 | " \"object\": \"file\",\n", 699 | " \"id\": \"file-PhLUhZJ5l2VsqnLzXY0RsCpE\",\n", 700 | " \"purpose\": \"fine-tune\",\n", 701 | " \"filename\": \"file\",\n", 702 | " \"bytes\": 5106,\n", 703 | " \"created_at\": 1693543712,\n", 704 | " \"status\": \"uploaded\",\n", 705 | " \"status_details\": null\n", 706 | "}\n" 707 | ] 708 | } 709 | ], 710 | "source": [ 711 | "# Print after deleting files\n", 712 | "past_file_lists = openai.File.list()\n", 713 | "for past_file in past_file_lists['data']:\n", 714 | " print (past_file)" 715 | ] 716 | }, 717 | { 718 | "cell_type": "markdown", 719 | "id": "df7135bd", 720 | "metadata": {}, 721 | "source": [ 722 | "### Wait for your data to be processed" 723 | ] 724 | }, 725 | { 726 | "cell_type": "code", 727 | "execution_count": 16, 728 | "id": "7bbc71ba", 729 | "metadata": {}, 730 | "outputs": [ 731 | { 732 | "name": "stdout", 733 | "output_type": "stream", 734 | "text": [ 735 | "0 file-PhLUhZJ5l2VsqnLzXY0RsCpE uploaded\n", 736 | "1 file-WuxpCaSYmkq171yJVWGAGmbR uploaded\n", 737 | "2 file-l5WBArbeZ1DKW79pm2USfrCn uploaded\n", 738 | "3 file-rtZF3Ru65feoLs3eutQJsJ0H processed\n", 739 | "4 file-t4PDvppb0D5qHrU4tHrfumh1 processed\n", 740 | "Ready.\n" 741 | ] 742 | } 743 | ], 744 | "source": [ 745 | "while True:\n", 746 | " files = openai.File.list()\n", 747 | " completed = True\n", 748 | " for f_idx,file in enumerate(files['data']):\n", 749 | " print(f_idx,file['id'], file['status'])\n", 750 | " if file['id'] == file_train_data_id or file['id'] == file_test_data_id:\n", 751 | " processed = (file['status'] == 'processed')\n", 752 | " completed = completed and processed\n", 753 | " if completed:\n", 754 | " break\n", 755 | " time.sleep(seconds=10)\n", 756 | "print (\"Ready.\")" 757 | ] 758 | }, 759 | { 760 | "cell_type": "markdown", 761 | "id": "216ea95d", 762 | "metadata": {}, 763 | "source": [ 764 | "### Cancel running models (optional)" 765 | ] 766 | }, 767 | { 768 | "cell_type": "code", 769 | "execution_count": 17, 770 | "id": "f6da91ed", 771 | "metadata": {}, 772 | "outputs": [ 773 | { 774 | "name": "stdout", 775 | "output_type": "stream", 776 | "text": [ 777 | "ftjob-o3N7MRwfHtLXgvyheaO7pEHr succeeded\n", 778 | "ftjob-FyMhJGWK7LaGlu94NuAadINS cancelled\n", 779 | "ftjob-QcdcVVyLjBabDRhESarNTCpm succeeded\n", 780 | "ftjob-fhipvS2c2q7G3t4s7CLQGgDH succeeded\n", 781 | "ftjob-gNALz0Stw06p7e0NeI0isSAO succeeded\n", 782 | "ftjob-DAZc9L3aEKg1mNY2zFmpgdhv succeeded\n" 783 | ] 784 | } 785 | ], 786 | "source": [ 787 | "# List 10 fine-tuning jobs\n", 788 | "jobs = openai.FineTuningJob.list(limit=10)\n", 789 | "jobs = jobs['data']\n", 790 | "for job in jobs:\n", 791 | " print(job['id'], job['status'])\n", 792 | " job_id = job['id']\n", 793 | " completed = job['status'] != 'running'\n", 794 | " if not completed:\n", 795 | " # Cancel a job\n", 796 | " openai.FineTuningJob.cancel(job_id)" 797 | ] 798 | }, 799 | { 800 | "cell_type": "markdown", 801 | "id": "26dc2b3a", 802 | "metadata": {}, 803 | "source": [ 804 | "### Start fine-tuning" 805 | ] 806 | }, 807 | { 808 | "cell_type": "code", 809 | "execution_count": 18, 810 | "id": "e3564af7", 811 | "metadata": { 812 | "scrolled": true 813 | }, 814 | "outputs": [ 815 | { 816 | "name": "stdout", 817 | "output_type": "stream", 818 | "text": [ 819 | "Ready.\n" 820 | ] 821 | } 822 | ], 823 | "source": [ 824 | "current_job = openai.FineTuningJob.create(\n", 825 | " training_file=file_train_data_id, model=\"gpt-3.5-turbo\",\n", 826 | " validation_file=file_test_data_id, 
hyperparameters={\"n_epochs\":1, })\n", 827 | "print (\"Ready.\")" 828 | ] 829 | }, 830 | { 831 | "cell_type": "code", 832 | "execution_count": 19, 833 | "id": "531e2f8d", 834 | "metadata": {}, 835 | "outputs": [ 836 | { 837 | "data": { 838 | "text/plain": [ 839 | " JSON: {\n", 840 | " \"object\": \"list\",\n", 841 | " \"data\": [\n", 842 | " {\n", 843 | " \"object\": \"fine_tuning.job\",\n", 844 | " \"id\": \"ftjob-TxmbAGAFPoNgDwGdNr9kM7FJ\",\n", 845 | " \"model\": \"gpt-3.5-turbo-0613\",\n", 846 | " \"created_at\": 1693554274,\n", 847 | " \"finished_at\": null,\n", 848 | " \"fine_tuned_model\": null,\n", 849 | " \"organization_id\": \"org-bT1bA6ExTcdQphsyv85F0j6z\",\n", 850 | " \"result_files\": [],\n", 851 | " \"status\": \"running\",\n", 852 | " \"validation_file\": \"file-t4PDvppb0D5qHrU4tHrfumh1\",\n", 853 | " \"training_file\": \"file-rtZF3Ru65feoLs3eutQJsJ0H\",\n", 854 | " \"hyperparameters\": {\n", 855 | " \"n_epochs\": 1\n", 856 | " },\n", 857 | " \"trained_tokens\": null\n", 858 | " },\n", 859 | " {\n", 860 | " \"object\": \"fine_tuning.job\",\n", 861 | " \"id\": \"ftjob-o3N7MRwfHtLXgvyheaO7pEHr\",\n", 862 | " \"model\": \"gpt-3.5-turbo-0613\",\n", 863 | " \"created_at\": 1693548543,\n", 864 | " \"finished_at\": 1693549009,\n", 865 | " \"fine_tuned_model\": \"ft:gpt-3.5-turbo-0613:korea-university::7trkYzgO\",\n", 866 | " \"organization_id\": \"org-bT1bA6ExTcdQphsyv85F0j6z\",\n", 867 | " \"result_files\": [\n", 868 | " \"file-KQDYaX1BlnGQZHhk94Qjy35K\"\n", 869 | " ],\n", 870 | " \"status\": \"succeeded\",\n", 871 | " \"validation_file\": \"file-LTE83jTd01XEUQ6aiL1iLeuN\",\n", 872 | " \"training_file\": \"file-a3z1jd3Kt5giIVLsxPubxGUH\",\n", 873 | " \"hyperparameters\": {\n", 874 | " \"n_epochs\": 1\n", 875 | " },\n", 876 | " \"trained_tokens\": 831\n", 877 | " },\n", 878 | " {\n", 879 | " \"object\": \"fine_tuning.job\",\n", 880 | " \"id\": \"ftjob-FyMhJGWK7LaGlu94NuAadINS\",\n", 881 | " \"model\": \"gpt-3.5-turbo-0613\",\n", 882 | " \"created_at\": 1693548511,\n", 883 | " \"finished_at\": null,\n", 884 | " \"fine_tuned_model\": null,\n", 885 | " \"organization_id\": \"org-bT1bA6ExTcdQphsyv85F0j6z\",\n", 886 | " \"result_files\": [],\n", 887 | " \"status\": \"cancelled\",\n", 888 | " \"validation_file\": \"file-LTE83jTd01XEUQ6aiL1iLeuN\",\n", 889 | " \"training_file\": \"file-a3z1jd3Kt5giIVLsxPubxGUH\",\n", 890 | " \"hyperparameters\": {\n", 891 | " \"n_epochs\": 1\n", 892 | " },\n", 893 | " \"trained_tokens\": null\n", 894 | " },\n", 895 | " {\n", 896 | " \"object\": \"fine_tuning.job\",\n", 897 | " \"id\": \"ftjob-QcdcVVyLjBabDRhESarNTCpm\",\n", 898 | " \"model\": \"gpt-3.5-turbo-0613\",\n", 899 | " \"created_at\": 1693546893,\n", 900 | " \"finished_at\": 1693547399,\n", 901 | " \"fine_tuned_model\": \"ft:gpt-3.5-turbo-0613:korea-university::7trKamW3\",\n", 902 | " \"organization_id\": \"org-bT1bA6ExTcdQphsyv85F0j6z\",\n", 903 | " \"result_files\": [\n", 904 | " \"file-l5WBArbeZ1DKW79pm2USfrCn\"\n", 905 | " ],\n", 906 | " \"status\": \"succeeded\",\n", 907 | " \"validation_file\": \"file-Q5fBGxgptSjSWfvd2pdizTQ4\",\n", 908 | " \"training_file\": \"file-3VlDNqS78TwXgLdi92igEz7B\",\n", 909 | " \"hyperparameters\": {\n", 910 | " \"n_epochs\": 1\n", 911 | " },\n", 912 | " \"trained_tokens\": 914\n", 913 | " },\n", 914 | " {\n", 915 | " \"object\": \"fine_tuning.job\",\n", 916 | " \"id\": \"ftjob-fhipvS2c2q7G3t4s7CLQGgDH\",\n", 917 | " \"model\": \"gpt-3.5-turbo-0613\",\n", 918 | " \"created_at\": 1693545051,\n", 919 | " \"finished_at\": 1693545516,\n", 920 | " 
\"fine_tuned_model\": \"ft:gpt-3.5-turbo-0613:korea-university::7tqqDVxu\",\n", 921 | " \"organization_id\": \"org-bT1bA6ExTcdQphsyv85F0j6z\",\n", 922 | " \"result_files\": [\n", 923 | " \"file-WuxpCaSYmkq171yJVWGAGmbR\"\n", 924 | " ],\n", 925 | " \"status\": \"succeeded\",\n", 926 | " \"validation_file\": \"file-B2h4oSFYdJkA7zzE82nqft0h\",\n", 927 | " \"training_file\": \"file-SXe3jQvUsMOrYS9GtfKR1g8s\",\n", 928 | " \"hyperparameters\": {\n", 929 | " \"n_epochs\": 1\n", 930 | " },\n", 931 | " \"trained_tokens\": 914\n", 932 | " },\n", 933 | " {\n", 934 | " \"object\": \"fine_tuning.job\",\n", 935 | " \"id\": \"ftjob-gNALz0Stw06p7e0NeI0isSAO\",\n", 936 | " \"model\": \"gpt-3.5-turbo-0613\",\n", 937 | " \"created_at\": 1693544099,\n", 938 | " \"finished_at\": 1693544596,\n", 939 | " \"fine_tuned_model\": \"ft:gpt-3.5-turbo-0613:korea-university::7tqbNGH5\",\n", 940 | " \"organization_id\": \"org-bT1bA6ExTcdQphsyv85F0j6z\",\n", 941 | " \"result_files\": [\n", 942 | " \"file-GFZY2SmNEuwnRLXq2CmbCHJP\"\n", 943 | " ],\n", 944 | " \"status\": \"succeeded\",\n", 945 | " \"validation_file\": \"file-UEQLpqMsDH2Y7rJLCtbWnqY2\",\n", 946 | " \"training_file\": \"file-abntl1qJwH7fVHfRXAn97hKR\",\n", 947 | " \"hyperparameters\": {\n", 948 | " \"n_epochs\": 1\n", 949 | " },\n", 950 | " \"trained_tokens\": 2551\n", 951 | " },\n", 952 | " {\n", 953 | " \"object\": \"fine_tuning.job\",\n", 954 | " \"id\": \"ftjob-DAZc9L3aEKg1mNY2zFmpgdhv\",\n", 955 | " \"model\": \"gpt-3.5-turbo-0613\",\n", 956 | " \"created_at\": 1693542580,\n", 957 | " \"finished_at\": 1693543162,\n", 958 | " \"fine_tuned_model\": \"ft:gpt-3.5-turbo-0613:korea-university::7tqEGsAP\",\n", 959 | " \"organization_id\": \"org-bT1bA6ExTcdQphsyv85F0j6z\",\n", 960 | " \"result_files\": [\n", 961 | " \"file-IG5qEkSebmuqpDR3Zl5ljMA3\"\n", 962 | " ],\n", 963 | " \"status\": \"succeeded\",\n", 964 | " \"validation_file\": null,\n", 965 | " \"training_file\": \"file-szdPvDiHgfSoSMJKMsNCtGLG\",\n", 966 | " \"hyperparameters\": {\n", 967 | " \"n_epochs\": 3\n", 968 | " },\n", 969 | " \"trained_tokens\": 7653\n", 970 | " }\n", 971 | " ],\n", 972 | " \"has_more\": false\n", 973 | "}" 974 | ] 975 | }, 976 | "execution_count": 19, 977 | "metadata": {}, 978 | "output_type": "execute_result" 979 | } 980 | ], 981 | "source": [ 982 | "# List 10 fine-tuning jobs\n", 983 | "openai.FineTuningJob.list(limit=10)" 984 | ] 985 | }, 986 | { 987 | "cell_type": "markdown", 988 | "id": "56d1d159", 989 | "metadata": {}, 990 | "source": [ 991 | "### Retrieve information" 992 | ] 993 | }, 994 | { 995 | "cell_type": "code", 996 | "execution_count": 20, 997 | "id": "55953842", 998 | "metadata": {}, 999 | "outputs": [ 1000 | { 1001 | "name": "stdout", 1002 | "output_type": "stream", 1003 | "text": [ 1004 | "current_job_id:[ftjob-TxmbAGAFPoNgDwGdNr9kM7FJ]\n" 1005 | ] 1006 | } 1007 | ], 1008 | "source": [ 1009 | "current_job_id = current_job['id']\n", 1010 | "print (\"current_job_id:[%s]\"%(current_job_id))" 1011 | ] 1012 | }, 1013 | { 1014 | "cell_type": "code", 1015 | "execution_count": 21, 1016 | "id": "7df74fc9", 1017 | "metadata": {}, 1018 | "outputs": [ 1019 | { 1020 | "data": { 1021 | "text/plain": [ 1022 | " JSON: {\n", 1023 | " \"object\": \"fine_tuning.job\",\n", 1024 | " \"id\": \"ftjob-TxmbAGAFPoNgDwGdNr9kM7FJ\",\n", 1025 | " \"model\": \"gpt-3.5-turbo-0613\",\n", 1026 | " \"created_at\": 1693554274,\n", 1027 | " \"finished_at\": null,\n", 1028 | " \"fine_tuned_model\": null,\n", 1029 | " \"organization_id\": \"org-bT1bA6ExTcdQphsyv85F0j6z\",\n", 1030 | " 
\"result_files\": [],\n", 1031 | " \"status\": \"running\",\n", 1032 | " \"validation_file\": \"file-t4PDvppb0D5qHrU4tHrfumh1\",\n", 1033 | " \"training_file\": \"file-rtZF3Ru65feoLs3eutQJsJ0H\",\n", 1034 | " \"hyperparameters\": {\n", 1035 | " \"n_epochs\": 1\n", 1036 | " },\n", 1037 | " \"trained_tokens\": null\n", 1038 | "}" 1039 | ] 1040 | }, 1041 | "execution_count": 21, 1042 | "metadata": {}, 1043 | "output_type": "execute_result" 1044 | } 1045 | ], 1046 | "source": [ 1047 | "# Retrieve the state of a fine-tune\n", 1048 | "openai.FineTuningJob.retrieve(current_job_id)" 1049 | ] 1050 | }, 1051 | { 1052 | "cell_type": "markdown", 1053 | "id": "b270451f", 1054 | "metadata": {}, 1055 | "source": [ 1056 | "### Wait for fine-tuning" 1057 | ] 1058 | }, 1059 | { 1060 | "cell_type": "code", 1061 | "execution_count": 22, 1062 | "id": "762b577d", 1063 | "metadata": { 1064 | "scrolled": true 1065 | }, 1066 | "outputs": [ 1067 | { 1068 | "name": "stdout", 1069 | "output_type": "stream", 1070 | "text": [ 1071 | "Fine tuning job started\n", 1072 | "Created fine-tune: ftjob-TxmbAGAFPoNgDwGdNr9kM7FJ\n", 1073 | "Fine tuning job started\n", 1074 | "Created fine-tune: ftjob-TxmbAGAFPoNgDwGdNr9kM7FJ\n", 1075 | "Fine tuning job started\n", 1076 | "Created fine-tune: ftjob-TxmbAGAFPoNgDwGdNr9kM7FJ\n", 1077 | "Fine tuning job started\n", 1078 | "Created fine-tune: ftjob-TxmbAGAFPoNgDwGdNr9kM7FJ\n", 1079 | "Fine tuning job started\n", 1080 | "Created fine-tune: ftjob-TxmbAGAFPoNgDwGdNr9kM7FJ\n", 1081 | "Fine tuning job started\n", 1082 | "Created fine-tune: ftjob-TxmbAGAFPoNgDwGdNr9kM7FJ\n", 1083 | "Fine tuning job started\n", 1084 | "Created fine-tune: ftjob-TxmbAGAFPoNgDwGdNr9kM7FJ\n", 1085 | "Step 2/10: training loss=3.19, validation loss=1.57\n", 1086 | "Step 1/10: training loss=2.89, validation loss=3.36\n", 1087 | "Fine-tuning job successfully completed\n", 1088 | "New fine-tuned model created: ft:gpt-3.5-turbo-0613:korea-university::7ttF0Dx2\n", 1089 | "=====================================\n", 1090 | "Fine-tuning job successfully completed\n", 1091 | "None\n", 1092 | "=====================================\n", 1093 | "New fine-tuned model created: ft:gpt-3.5-turbo-0613:korea-university::7ttF0Dx2\n", 1094 | "None\n", 1095 | "=====================================\n", 1096 | "Step 10/10: training loss=0.01, validation loss=0.07\n", 1097 | "{\n", 1098 | " \"step\": 10,\n", 1099 | " \"train_loss\": 0.0061782835982739925,\n", 1100 | " \"valid_loss\": 0.06946449279785157,\n", 1101 | " \"train_mean_token_accuracy\": 1.0,\n", 1102 | " \"valid_mean_token_accuracy\": 0.6\n", 1103 | "}\n", 1104 | "=====================================\n", 1105 | "Step 9/10: training loss=0.01, validation loss=0.01\n", 1106 | "{\n", 1107 | " \"step\": 9,\n", 1108 | " \"train_loss\": 0.0071807862259447575,\n", 1109 | " \"valid_loss\": 0.007854843139648437,\n", 1110 | " \"train_mean_token_accuracy\": 1.0,\n", 1111 | " \"valid_mean_token_accuracy\": 0.6\n", 1112 | "}\n", 1113 | "=====================================\n", 1114 | "Step 8/10: training loss=0.01, validation loss=0.01\n", 1115 | "{\n", 1116 | " \"step\": 8,\n", 1117 | " \"train_loss\": 0.014714050106704235,\n", 1118 | " \"valid_loss\": 0.011740493774414062,\n", 1119 | " \"train_mean_token_accuracy\": 1.0,\n", 1120 | " \"valid_mean_token_accuracy\": 0.6\n", 1121 | "}\n", 1122 | "=====================================\n", 1123 | "Step 7/10: training loss=0.36, validation loss=0.01\n", 1124 | "{\n", 1125 | " \"step\": 7,\n", 1126 | " \"train_loss\": 0.36053237318992615,\n", 
1127 | " \"valid_loss\": 0.0137237548828125,\n", 1128 | " \"train_mean_token_accuracy\": 0.800000011920929,\n", 1129 | " \"valid_mean_token_accuracy\": 0.6\n", 1130 | "}\n", 1131 | "=====================================\n", 1132 | "Step 6/10: training loss=0.07, validation loss=0.09\n", 1133 | "{\n", 1134 | " \"step\": 6,\n", 1135 | " \"train_loss\": 0.0713600143790245,\n", 1136 | " \"valid_loss\": 0.08551521301269531,\n", 1137 | " \"train_mean_token_accuracy\": 1.0,\n", 1138 | " \"valid_mean_token_accuracy\": 0.6\n", 1139 | "}\n", 1140 | "=====================================\n", 1141 | "Step 5/10: training loss=0.33, validation loss=0.21\n", 1142 | "{\n", 1143 | " \"step\": 5,\n", 1144 | " \"train_loss\": 0.3335861265659332,\n", 1145 | " \"valid_loss\": 0.2073833465576172,\n", 1146 | " \"train_mean_token_accuracy\": 0.800000011920929,\n", 1147 | " \"valid_mean_token_accuracy\": 0.6\n", 1148 | "}\n", 1149 | "=====================================\n", 1150 | "Step 4/10: training loss=0.79, validation loss=0.27\n", 1151 | "{\n", 1152 | " \"step\": 4,\n", 1153 | " \"train_loss\": 0.7851310968399048,\n", 1154 | " \"valid_loss\": 0.2712444305419922,\n", 1155 | " \"train_mean_token_accuracy\": 0.800000011920929,\n", 1156 | " \"valid_mean_token_accuracy\": 0.6\n", 1157 | "}\n", 1158 | "=====================================\n", 1159 | "Step 3/10: training loss=1.42, validation loss=0.52\n", 1160 | "{\n", 1161 | " \"step\": 3,\n", 1162 | " \"train_loss\": 1.4208381175994873,\n", 1163 | " \"valid_loss\": 0.518292236328125,\n", 1164 | " \"train_mean_token_accuracy\": 0.800000011920929,\n", 1165 | " \"valid_mean_token_accuracy\": 0.6\n", 1166 | "}\n", 1167 | "Ready.\n" 1168 | ] 1169 | } 1170 | ], 1171 | "source": [ 1172 | "while True:\n", 1173 | " job = openai.FineTuningJob.retrieve(current_job_id)\n", 1174 | " completed = job['status'] != 'running'\n", 1175 | " # List up to 10 events from a fine-tuning job\n", 1176 | " events = openai.FineTuningJob.list_events(id=current_job_id, limit=2)\n", 1177 | " for event in events['data']:\n", 1178 | " print(event['message'])\n", 1179 | " if completed:\n", 1180 | " break\n", 1181 | " time.sleep(60)\n", 1182 | "events = openai.FineTuningJob.list_events(id=current_job_id, limit=10)\n", 1183 | "for event in events['data']:\n", 1184 | " print(\"=====================================\")\n", 1185 | " print(event['message'])\n", 1186 | " print(event['data'])\n", 1187 | "print (\"Ready.\")" 1188 | ] 1189 | }, 1190 | { 1191 | "cell_type": "markdown", 1192 | "id": "368fe7c1", 1193 | "metadata": {}, 1194 | "source": [ 1195 | "### Inference" 1196 | ] 1197 | }, 1198 | { 1199 | "cell_type": "code", 1200 | "execution_count": 23, 1201 | "id": "92037eef", 1202 | "metadata": {}, 1203 | "outputs": [], 1204 | "source": [ 1205 | "model_id = openai.FineTuningJob.retrieve(current_job_id)['fine_tuned_model']" 1206 | ] 1207 | }, 1208 | { 1209 | "cell_type": "code", 1210 | "execution_count": 24, 1211 | "id": "c3d4739e", 1212 | "metadata": {}, 1213 | "outputs": [ 1214 | { 1215 | "name": "stdout", 1216 | "output_type": "stream", 1217 | "text": [ 1218 | "{'answer': 'Yes', 'index': '87', 'question': 'Is a judicial candidate or their campaign committee allowed to post photographs on social media of the candidate together with sitting judges at a public or professional event, if the photograph was published with the consent of the judges and there is no verbiage that indicates an endorsement?', 'year': '2022'}\n" 1219 | ] 1220 | } 1221 | ], 1222 | "source": [ 1223 | "start_idx = NUM_TRAIN + 
NUM_VALIDATION\n", 1224 | "end_idx = len(dataset['test'])\n", 1225 | "sample_idx = random.randint(start_idx, end_idx)\n", 1226 | "sample = dataset['test'][sample_idx]\n", 1227 | "print(sample)" 1228 | ] 1229 | }, 1230 | { 1231 | "cell_type": "code", 1232 | "execution_count": 25, 1233 | "id": "6fd3412b", 1234 | "metadata": {}, 1235 | "outputs": [ 1236 | { 1237 | "data": { 1238 | "text/plain": [ 1239 | "'\\nImagine your are the New York State Unified Court System Advisory Committee on Judicial Ethics. You\\'ve received the following question(s). Answer them as either \"Yes\" or \"No\".\\n'" 1240 | ] 1241 | }, 1242 | "execution_count": 25, 1243 | "metadata": {}, 1244 | "output_type": "execute_result" 1245 | } 1246 | ], 1247 | "source": [ 1248 | "base_text" 1249 | ] 1250 | }, 1251 | { 1252 | "cell_type": "code", 1253 | "execution_count": 26, 1254 | "id": "71a73b24", 1255 | "metadata": {}, 1256 | "outputs": [ 1257 | { 1258 | "name": "stdout", 1259 | "output_type": "stream", 1260 | "text": [ 1261 | "[{'role': 'system', 'content': '\\nImagine your are the New York State Unified Court System Advisory Committee on Judicial Ethics. You\\'ve received the following question(s). Answer them as either \"Yes\" or \"No\".\\n'}, {'role': 'user', 'content': 'Question: Is a judicial candidate or their campaign committee allowed to post photographs on social media of the candidate together with sitting judges at a public or professional event, if the photograph was published with the consent of the judges and there is no verbiage that indicates an endorsement?'}]\n" 1262 | ] 1263 | } 1264 | ], 1265 | "source": [ 1266 | "messages = [\n", 1267 | " {\"role\": \"system\", \"content\": base_text},\n", 1268 | " {\"role\": \"user\", \"content\": \"Question: \" + sample['question']}\n", 1269 | "]\n", 1270 | "print (messages)" 1271 | ] 1272 | }, 1273 | { 1274 | "cell_type": "code", 1275 | "execution_count": 27, 1276 | "id": "4005a247", 1277 | "metadata": {}, 1278 | "outputs": [], 1279 | "source": [ 1280 | "response = completion = openai.ChatCompletion.create(\n", 1281 | " model=model_id,\n", 1282 | " messages=messages\n", 1283 | ")" 1284 | ] 1285 | }, 1286 | { 1287 | "cell_type": "code", 1288 | "execution_count": 28, 1289 | "id": "35753747", 1290 | "metadata": {}, 1291 | "outputs": [ 1292 | { 1293 | "name": "stdout", 1294 | "output_type": "stream", 1295 | "text": [ 1296 | "Question: Is a judicial candidate or their campaign committee allowed to post photographs on social media of the candidate together with sitting judges at a public or professional event, if the photograph was published with the consent of the judges and there is no verbiage that indicates an endorsement?\n", 1297 | "Answer: Answer: Yes\n", 1298 | "Ground Truth: Yes\n" 1299 | ] 1300 | } 1301 | ], 1302 | "source": [ 1303 | "answer = response['choices'][0]['message']['content']\n", 1304 | "print('Question: ' + sample['question'])\n", 1305 | "print('Answer: ' + answer)\n", 1306 | "print('Ground Truth: ' + sample['answer'])" 1307 | ] 1308 | }, 1309 | { 1310 | "cell_type": "code", 1311 | "execution_count": null, 1312 | "id": "92f8d374", 1313 | "metadata": {}, 1314 | "outputs": [], 1315 | "source": [] 1316 | } 1317 | ], 1318 | "metadata": { 1319 | "kernelspec": { 1320 | "display_name": "Python 3 (ipykernel)", 1321 | "language": "python", 1322 | "name": "python3" 1323 | }, 1324 | "language_info": { 1325 | "codemirror_mode": { 1326 | "name": "ipython", 1327 | "version": 3 1328 | }, 1329 | "file_extension": ".py", 1330 | "mimetype": "text/x-python", 1331 | 
"name": "python", 1332 | "nbconvert_exporter": "python", 1333 | "pygments_lexer": "ipython3", 1334 | "version": "3.9.16" 1335 | } 1336 | }, 1337 | "nbformat": 4, 1338 | "nbformat_minor": 5 1339 | } 1340 | -------------------------------------------------------------------------------- /code/demo_gpt_04_finetune_dialog.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "93746e6a", 6 | "metadata": {}, 7 | "source": [ 8 | "### Fine-tune GPT" 9 | ] 10 | }, 11 | { 12 | "cell_type": "code", 13 | "execution_count": 1, 14 | "id": "14623c10", 15 | "metadata": {}, 16 | "outputs": [ 17 | { 18 | "name": "stdout", 19 | "output_type": "stream", 20 | "text": [ 21 | "OpenAI package version:0.28.0\n" 22 | ] 23 | } 24 | ], 25 | "source": [ 26 | "import json\n", 27 | "import openai\n", 28 | "import time, random\n", 29 | "from datasets import load_dataset\n", 30 | "print (\"OpenAI package version:%s\"%(openai.__version__))" 31 | ] 32 | }, 33 | { 34 | "cell_type": "markdown", 35 | "id": "e5bfab77", 36 | "metadata": {}, 37 | "source": [ 38 | "### Load data from Hugging Face Datasets" 39 | ] 40 | }, 41 | { 42 | "cell_type": "code", 43 | "execution_count": 2, 44 | "id": "63000fd7", 45 | "metadata": {}, 46 | "outputs": [ 47 | { 48 | "data": { 49 | "application/vnd.jupyter.widget-view+json": { 50 | "model_id": "70e206acf69c4bc699dba6170c1761f6", 51 | "version_major": 2, 52 | "version_minor": 0 53 | }, 54 | "text/plain": [ 55 | "Downloading metadata: 0%| | 0.00/1.01k [00:00\n", 160 | "\n", 173 | "\n", 174 | " \n", 175 | " \n", 176 | " \n", 177 | " \n", 178 | " \n", 179 | " \n", 180 | " \n", 181 | " \n", 182 | " \n", 183 | " \n", 184 | " \n", 185 | " \n", 186 | " \n", 187 | " \n", 188 | " \n", 189 | " \n", 190 | " \n", 191 | " \n", 192 | " \n", 193 | " \n", 194 | " \n", 195 | " \n", 196 | " \n", 197 | " \n", 198 | " \n", 199 | " \n", 200 | " \n", 201 | " \n", 202 | " \n", 203 | " \n", 204 | " \n", 205 | " \n", 206 | " \n", 207 | " \n", 208 | " \n", 209 | " \n", 210 | " \n", 211 | " \n", 212 | " \n", 213 | " \n", 214 | " \n", 215 | " \n", 216 | " \n", 217 | " \n", 218 | " \n", 219 | " \n", 220 | " \n", 221 | " \n", 222 | " \n", 223 | " \n", 224 | " \n", 225 | " \n", 226 | " \n", 227 | " \n", 228 | " \n", 229 | " \n", 230 | " \n", 231 | " \n", 232 | " \n", 233 | " \n", 234 | " \n", 235 | " \n", 236 | " \n", 237 | " \n", 238 | "
[HTML rendering of the DataFrame preview: 17878 rows × 2 columns (personality, utterances)]
\n", 240 | "" 241 | ], 242 | "text/plain": [ 243 | " personality \\\n", 244 | "0 [i like to remodel homes ., i like to go hunti... \n", 245 | "1 [my mom is my best friend ., i have four siste... \n", 246 | "2 [i had a gig at local theater last night ., i ... \n", 247 | "3 [i'm very athletic ., i wear contacts ., i hav... \n", 248 | "4 [i am primarily a meat eater ., i am a guitar ... \n", 249 | "... ... \n", 250 | "17873 [i sell miscellaneous stuff in local fairs ., ... \n", 251 | "17874 [i currently work at mcdonalds ., i live with ... \n", 252 | "17875 [i've a daughter ., i'm under 6 feet tall ., i... \n", 253 | "17876 [i have a severe phobia of wide open spaces .,... \n", 254 | "17877 [my mother lives with me ., i like to plant fl... \n", 255 | "\n", 256 | " utterances \n", 257 | "0 [{'candidates': ['my mom was single with 3 boy... \n", 258 | "1 [{'candidates': ['there was one person better ... \n", 259 | "2 [{'candidates': ['fine how are you feeling ton... \n", 260 | "3 [{'candidates': ['cool , i'm currently studyin... \n", 261 | "4 [{'candidates': ['yes , i like the green bay p... \n", 262 | "... ... \n", 263 | "17873 [{'candidates': ['interesting , luis is your g... \n", 264 | "17874 [{'candidates': ['her initials are c . c .', '... \n", 265 | "17875 [{'candidates': ['ya . but not as unwelcomed a... \n", 266 | "17876 [{'candidates': ['i like to be smart . i'm ver... \n", 267 | "17877 [{'candidates': ['aprons , pot holders , table... \n", 268 | "\n", 269 | "[17878 rows x 2 columns]" 270 | ] 271 | }, 272 | "execution_count": 3, 273 | "metadata": {}, 274 | "output_type": "execute_result" 275 | } 276 | ], 277 | "source": [ 278 | "dataset[\"train\"].to_pandas()" 279 | ] 280 | }, 281 | { 282 | "cell_type": "markdown", 283 | "id": "d68c648f", 284 | "metadata": {}, 285 | "source": [ 286 | "### Since the nunber of training set is too small, we'll use test set" 287 | ] 288 | }, 289 | { 290 | "cell_type": "code", 291 | "execution_count": 4, 292 | "id": "942e4bce", 293 | "metadata": {}, 294 | "outputs": [], 295 | "source": [ 296 | "NUM_TRAIN = 1000\n", 297 | "NUM_VALIDATION = 100" 298 | ] 299 | }, 300 | { 301 | "cell_type": "code", 302 | "execution_count": 5, 303 | "id": "420d6dee", 304 | "metadata": {}, 305 | "outputs": [], 306 | "source": [ 307 | "base_text = \"\"\"\n", 308 | "You are a interactive agent that follows the given persona:\n", 309 | "\"\"\"" 310 | ] 311 | }, 312 | { 313 | "cell_type": "code", 314 | "execution_count": 6, 315 | "id": "dcfd8814", 316 | "metadata": {}, 317 | "outputs": [ 318 | { 319 | "name": "stdout", 320 | "output_type": "stream", 321 | "text": [ 322 | "[../data/persona_chat_train.json1] saved.\n", 323 | "[../data/persona_chat_validation.json1] saved.\n" 324 | ] 325 | } 326 | ], 327 | "source": [ 328 | "train_jsonl_path = '../data/persona_chat_train.json1'\n", 329 | "validation_jsonl_path = '../data/persona_chat_validation.json1'\n", 330 | "with open(train_jsonl_path, 'w') as f:\n", 331 | " for i in range(NUM_TRAIN):\n", 332 | " data = dataset['train'][i]\n", 333 | " persona = \" \".join(data['personality'])\n", 334 | " history = data['utterances'][-1]['history']\n", 335 | " message = [{\"role\": \"system\", \"content\": persona}]\n", 336 | " for i, h in enumerate(history):\n", 337 | " if (i%2 == 0):\n", 338 | " role = \"user\"\n", 339 | " else: role = \"assistant\"\n", 340 | " message.append({\"role\": role, \"content\": h})\n", 341 | " line = {\"messages\": message}\n", 342 | " # line = {\"messages\": [{\"role\": \"system\", \"content\": persona}, \n", 343 | " # 
{\"role\": \"user\", \"content\": \"Question: \" + data['question']},\n", 344 | " # {\"role\": \"assistant\", \"content\": \"Answer: \" + data['answer']}]}\n", 345 | " f.write(json.dumps(line) + '\\n')\n", 346 | "\n", 347 | "with open(validation_jsonl_path, 'w') as f:\n", 348 | " for i in range(NUM_TRAIN,NUM_TRAIN+NUM_VALIDATION):\n", 349 | " data = dataset['train'][i]\n", 350 | " persona = \" \".join(data['personality'])\n", 351 | " history = data['utterances'][-1]['history']\n", 352 | " message = [{\"role\": \"system\", \"content\": persona}]\n", 353 | " for i, h in enumerate(history):\n", 354 | " if (i%2 == 0):\n", 355 | " role = \"user\"\n", 356 | " else: role = \"assistant\"\n", 357 | " message.append({\"role\": role, \"content\": h})\n", 358 | " line = {\"messages\": message}\n", 359 | " f.write(json.dumps(line) + '\\n')\n", 360 | "print (\"[%s] saved.\"%(train_jsonl_path))\n", 361 | "print (\"[%s] saved.\"%(validation_jsonl_path))" 362 | ] 363 | }, 364 | { 365 | "cell_type": "markdown", 366 | "id": "8618d724", 367 | "metadata": {}, 368 | "source": [ 369 | "### Locate the key" 370 | ] 371 | }, 372 | { 373 | "cell_type": "code", 374 | "execution_count": 7, 375 | "id": "1787eb12", 376 | "metadata": {}, 377 | "outputs": [ 378 | { 379 | "name": "stdout", 380 | "output_type": "stream", 381 | "text": [ 382 | "key_path:[../key/rilab_key.txt]\n" 383 | ] 384 | } 385 | ], 386 | "source": [ 387 | "key_path = '../key/rilab_key.txt'\n", 388 | "print ('key_path:[%s]'%(key_path))" 389 | ] 390 | }, 391 | { 392 | "cell_type": "code", 393 | "execution_count": 8, 394 | "id": "c6ec80ff", 395 | "metadata": {}, 396 | "outputs": [], 397 | "source": [ 398 | "with open(key_path, 'r') as f: OPENAI_API_KEY = f.read()\n", 399 | "openai.api_key = OPENAI_API_KEY" 400 | ] 401 | }, 402 | { 403 | "cell_type": "markdown", 404 | "id": "0421cbb4", 405 | "metadata": {}, 406 | "source": [ 407 | "### Delete existing files (optional)" 408 | ] 409 | }, 410 | { 411 | "cell_type": "code", 412 | "execution_count": 9, 413 | "id": "916d899c", 414 | "metadata": { 415 | "scrolled": true 416 | }, 417 | "outputs": [ 418 | { 419 | "name": "stdout", 420 | "output_type": "stream", 421 | "text": [ 422 | "{\n", 423 | " \"object\": \"file\",\n", 424 | " \"id\": \"file-HW2ySBmxZGxbkavqN5CwFTgD\",\n", 425 | " \"purpose\": \"fine-tune\",\n", 426 | " \"filename\": \"file\",\n", 427 | " \"bytes\": 137019,\n", 428 | " \"created_at\": 1694411483,\n", 429 | " \"status\": \"processed\",\n", 430 | " \"status_details\": null\n", 431 | "}\n", 432 | "{\n", 433 | " \"object\": \"file\",\n", 434 | " \"id\": \"file-rEaiBrRYmva7r571MumUT7xv\",\n", 435 | " \"purpose\": \"fine-tune-results\",\n", 436 | " \"filename\": \"step_metrics.csv\",\n", 437 | " \"bytes\": 17983,\n", 438 | " \"created_at\": 1694413646,\n", 439 | " \"status\": \"processed\",\n", 440 | " \"status_details\": null\n", 441 | "}\n", 442 | "{\n", 443 | " \"object\": \"file\",\n", 444 | " \"id\": \"file-0Wrj9UoWNcXGAFjrSmqPeCpA\",\n", 445 | " \"purpose\": \"fine-tune\",\n", 446 | " \"filename\": \"file\",\n", 447 | " \"bytes\": 1364436,\n", 448 | " \"created_at\": 1694411480,\n", 449 | " \"status\": \"processed\",\n", 450 | " \"status_details\": null\n", 451 | "}\n" 452 | ] 453 | } 454 | ], 455 | "source": [ 456 | "past_file_lists = openai.File.list()\n", 457 | "for past_file in past_file_lists['data']:\n", 458 | " print (past_file)" 459 | ] 460 | }, 461 | { 462 | "cell_type": "code", 463 | "execution_count": 10, 464 | "id": "b6682978", 465 | "metadata": { 466 | "scrolled": true 467 | }, 468 
| "outputs": [ 469 | { 470 | "name": "stdout", 471 | "output_type": "stream", 472 | "text": [ 473 | "{\n", 474 | " \"object\": \"file\",\n", 475 | " \"id\": \"file-0Wrj9UoWNcXGAFjrSmqPeCpA\",\n", 476 | " \"purpose\": \"fine-tune\",\n", 477 | " \"filename\": \"file\",\n", 478 | " \"bytes\": 1364436,\n", 479 | " \"created_at\": 1694411480,\n", 480 | " \"status\": \"processed\",\n", 481 | " \"status_details\": null\n", 482 | "}\n", 483 | "{\n", 484 | " \"object\": \"file\",\n", 485 | " \"id\": \"file-HW2ySBmxZGxbkavqN5CwFTgD\",\n", 486 | " \"purpose\": \"fine-tune\",\n", 487 | " \"filename\": \"file\",\n", 488 | " \"bytes\": 137019,\n", 489 | " \"created_at\": 1694411483,\n", 490 | " \"status\": \"processed\",\n", 491 | " \"status_details\": null\n", 492 | "}\n", 493 | "{\n", 494 | " \"object\": \"file\",\n", 495 | " \"id\": \"file-rEaiBrRYmva7r571MumUT7xv\",\n", 496 | " \"purpose\": \"fine-tune-results\",\n", 497 | " \"filename\": \"step_metrics.csv\",\n", 498 | " \"bytes\": 17983,\n", 499 | " \"created_at\": 1694413646,\n", 500 | " \"status\": \"processed\",\n", 501 | " \"status_details\": null\n", 502 | "}\n" 503 | ] 504 | } 505 | ], 506 | "source": [ 507 | "past_file_lists = openai.File.list()\n", 508 | "for past_file in past_file_lists['data']:\n", 509 | " idx = past_file['id']\n", 510 | " if past_file['status'] == 'processed':\n", 511 | " print (past_file)\n", 512 | " openai.File.delete(idx)" 513 | ] 514 | }, 515 | { 516 | "cell_type": "code", 517 | "execution_count": 11, 518 | "id": "580286dd", 519 | "metadata": {}, 520 | "outputs": [], 521 | "source": [ 522 | "# Print after deleting files\n", 523 | "past_file_lists = openai.File.list()\n", 524 | "for past_file in past_file_lists['data']:\n", 525 | " print (past_file)" 526 | ] 527 | }, 528 | { 529 | "cell_type": "markdown", 530 | "id": "828eba6e", 531 | "metadata": {}, 532 | "source": [ 533 | "### Upload the dataset" 534 | ] 535 | }, 536 | { 537 | "cell_type": "code", 538 | "execution_count": 12, 539 | "id": "17a11e59", 540 | "metadata": {}, 541 | "outputs": [], 542 | "source": [ 543 | "file_train_data = openai.File.create(\n", 544 | " file=open(train_jsonl_path, \"rb\"),\n", 545 | " purpose='fine-tune'\n", 546 | ")\n", 547 | "file_train_data_id = file_train_data.id" 548 | ] 549 | }, 550 | { 551 | "cell_type": "code", 552 | "execution_count": 13, 553 | "id": "2feb4c63", 554 | "metadata": {}, 555 | "outputs": [], 556 | "source": [ 557 | "file_test_data = openai.File.create(\n", 558 | " file=open(validation_jsonl_path, \"rb\"),\n", 559 | " purpose='fine-tune'\n", 560 | ")\n", 561 | "file_test_data_id = file_test_data.id" 562 | ] 563 | }, 564 | { 565 | "cell_type": "code", 566 | "execution_count": 14, 567 | "id": "5f5f09d4", 568 | "metadata": { 569 | "scrolled": true 570 | }, 571 | "outputs": [ 572 | { 573 | "name": "stdout", 574 | "output_type": "stream", 575 | "text": [ 576 | "{\n", 577 | " \"object\": \"file\",\n", 578 | " \"id\": \"file-fh77pphFD9aiezuGPntLPRxm\",\n", 579 | " \"purpose\": \"fine-tune\",\n", 580 | " \"filename\": \"file\",\n", 581 | " \"bytes\": 1364436,\n", 582 | " \"created_at\": 1694523224,\n", 583 | " \"status\": \"uploaded\",\n", 584 | " \"status_details\": null\n", 585 | "}\n", 586 | "{\n", 587 | " \"object\": \"file\",\n", 588 | " \"id\": \"file-LNhNveOLUVZEdbhwLJKXe7u8\",\n", 589 | " \"purpose\": \"fine-tune\",\n", 590 | " \"filename\": \"file\",\n", 591 | " \"bytes\": 137019,\n", 592 | " \"created_at\": 1694523225,\n", 593 | " \"status\": \"uploaded\",\n", 594 | " \"status_details\": null\n", 595 | "}\n" 596 | 
] 597 | } 598 | ], 599 | "source": [ 600 | "# Print after deleting files\n", 601 | "past_file_lists = openai.File.list()\n", 602 | "for past_file in past_file_lists['data']:\n", 603 | " print (past_file)" 604 | ] 605 | }, 606 | { 607 | "cell_type": "markdown", 608 | "id": "a3245a02", 609 | "metadata": {}, 610 | "source": [ 611 | "### Wait for your data to be processed" 612 | ] 613 | }, 614 | { 615 | "cell_type": "code", 616 | "execution_count": 15, 617 | "id": "4e6f78d9", 618 | "metadata": {}, 619 | "outputs": [ 620 | { 621 | "name": "stdout", 622 | "output_type": "stream", 623 | "text": [ 624 | "0 file-fh77pphFD9aiezuGPntLPRxm uploaded\n", 625 | "1 file-LNhNveOLUVZEdbhwLJKXe7u8 uploaded\n", 626 | "0 file-fh77pphFD9aiezuGPntLPRxm uploaded\n", 627 | "1 file-LNhNveOLUVZEdbhwLJKXe7u8 uploaded\n", 628 | "0 file-fh77pphFD9aiezuGPntLPRxm uploaded\n", 629 | "1 file-LNhNveOLUVZEdbhwLJKXe7u8 uploaded\n", 630 | "0 file-fh77pphFD9aiezuGPntLPRxm uploaded\n", 631 | "1 file-LNhNveOLUVZEdbhwLJKXe7u8 uploaded\n", 632 | "0 file-fh77pphFD9aiezuGPntLPRxm uploaded\n", 633 | "1 file-LNhNveOLUVZEdbhwLJKXe7u8 processed\n", 634 | "0 file-fh77pphFD9aiezuGPntLPRxm uploaded\n", 635 | "1 file-LNhNveOLUVZEdbhwLJKXe7u8 processed\n", 636 | "0 file-fh77pphFD9aiezuGPntLPRxm uploaded\n", 637 | "1 file-LNhNveOLUVZEdbhwLJKXe7u8 processed\n", 638 | "0 file-fh77pphFD9aiezuGPntLPRxm uploaded\n", 639 | "1 file-LNhNveOLUVZEdbhwLJKXe7u8 processed\n", 640 | "0 file-fh77pphFD9aiezuGPntLPRxm uploaded\n", 641 | "1 file-LNhNveOLUVZEdbhwLJKXe7u8 processed\n", 642 | "0 file-fh77pphFD9aiezuGPntLPRxm uploaded\n", 643 | "1 file-LNhNveOLUVZEdbhwLJKXe7u8 processed\n", 644 | "0 file-LNhNveOLUVZEdbhwLJKXe7u8 processed\n", 645 | "1 file-fh77pphFD9aiezuGPntLPRxm uploaded\n", 646 | "0 file-fh77pphFD9aiezuGPntLPRxm uploaded\n", 647 | "1 file-LNhNveOLUVZEdbhwLJKXe7u8 processed\n", 648 | "0 file-LNhNveOLUVZEdbhwLJKXe7u8 processed\n", 649 | "1 file-fh77pphFD9aiezuGPntLPRxm processed\n", 650 | "Ready.\n" 651 | ] 652 | } 653 | ], 654 | "source": [ 655 | "while True:\n", 656 | " files = openai.File.list()\n", 657 | " completed = True\n", 658 | " for f_idx,file in enumerate(files['data']):\n", 659 | " print(f_idx,file['id'], file['status'])\n", 660 | " if file['id'] == file_train_data_id or file['id'] == file_test_data_id:\n", 661 | " processed = (file['status'] == 'processed')\n", 662 | " completed = completed and processed\n", 663 | " if completed:\n", 664 | " break\n", 665 | " time.sleep(10)\n", 666 | "print (\"Ready.\")" 667 | ] 668 | }, 669 | { 670 | "cell_type": "markdown", 671 | "id": "02c969c0", 672 | "metadata": {}, 673 | "source": [ 674 | "### Cancel running models (optional)" 675 | ] 676 | }, 677 | { 678 | "cell_type": "code", 679 | "execution_count": 16, 680 | "id": "b9c2849e", 681 | "metadata": {}, 682 | "outputs": [ 683 | { 684 | "name": "stdout", 685 | "output_type": "stream", 686 | "text": [ 687 | "ftjob-xTZABRXWvpiYHl4TAjBWz29L succeeded\n", 688 | "ftjob-TxmbAGAFPoNgDwGdNr9kM7FJ succeeded\n", 689 | "ftjob-o3N7MRwfHtLXgvyheaO7pEHr succeeded\n", 690 | "ftjob-FyMhJGWK7LaGlu94NuAadINS cancelled\n", 691 | "ftjob-QcdcVVyLjBabDRhESarNTCpm succeeded\n", 692 | "ftjob-fhipvS2c2q7G3t4s7CLQGgDH succeeded\n", 693 | "ftjob-gNALz0Stw06p7e0NeI0isSAO succeeded\n", 694 | "ftjob-DAZc9L3aEKg1mNY2zFmpgdhv succeeded\n" 695 | ] 696 | } 697 | ], 698 | "source": [ 699 | "# List 10 fine-tuning jobs\n", 700 | "jobs = openai.FineTuningJob.list(limit=10)\n", 701 | "jobs = jobs['data']\n", 702 | "for job in jobs:\n", 703 | " print(job['id'], job['status'])\n", 704 | " 
job_id = job['id']\n", 705 | " completed = job['status'] != 'running'\n", 706 | " if not completed:\n", 707 | " # Cancel a job\n", 708 | " openai.FineTuningJob.cancel(job_id)" 709 | ] 710 | }, 711 | { 712 | "cell_type": "markdown", 713 | "id": "b86fa038", 714 | "metadata": {}, 715 | "source": [ 716 | "### Start fine-tuning" 717 | ] 718 | }, 719 | { 720 | "cell_type": "code", 721 | "execution_count": 17, 722 | "id": "06054afd", 723 | "metadata": { 724 | "scrolled": true 725 | }, 726 | "outputs": [ 727 | { 728 | "name": "stdout", 729 | "output_type": "stream", 730 | "text": [ 731 | "Ready.\n" 732 | ] 733 | } 734 | ], 735 | "source": [ 736 | "current_job = openai.FineTuningJob.create(\n", 737 | " training_file=file_train_data_id, model=\"gpt-3.5-turbo\",\n", 738 | " validation_file=file_test_data_id, hyperparameters={\"n_epochs\":1, })\n", 739 | "print (\"Ready.\")" 740 | ] 741 | }, 742 | { 743 | "cell_type": "code", 744 | "execution_count": 18, 745 | "id": "c30f4475", 746 | "metadata": {}, 747 | "outputs": [ 748 | { 749 | "data": { 750 | "text/plain": [ 751 | " JSON: {\n", 752 | " \"object\": \"list\",\n", 753 | " \"data\": [\n", 754 | " {\n", 755 | " \"object\": \"fine_tuning.job\",\n", 756 | " \"id\": \"ftjob-8jBQV5vc2pmpUifBuRHhr8OY\",\n", 757 | " \"model\": \"gpt-3.5-turbo-0613\",\n", 758 | " \"created_at\": 1694523358,\n", 759 | " \"finished_at\": null,\n", 760 | " \"fine_tuned_model\": null,\n", 761 | " \"organization_id\": \"org-bT1bA6ExTcdQphsyv85F0j6z\",\n", 762 | " \"result_files\": [],\n", 763 | " \"status\": \"running\",\n", 764 | " \"validation_file\": \"file-LNhNveOLUVZEdbhwLJKXe7u8\",\n", 765 | " \"training_file\": \"file-fh77pphFD9aiezuGPntLPRxm\",\n", 766 | " \"hyperparameters\": {\n", 767 | " \"n_epochs\": 1\n", 768 | " },\n", 769 | " \"trained_tokens\": null,\n", 770 | " \"error\": null\n", 771 | " },\n", 772 | " {\n", 773 | " \"object\": \"fine_tuning.job\",\n", 774 | " \"id\": \"ftjob-xTZABRXWvpiYHl4TAjBWz29L\",\n", 775 | " \"model\": \"gpt-3.5-turbo-0613\",\n", 776 | " \"created_at\": 1694411586,\n", 777 | " \"finished_at\": 1694413644,\n", 778 | " \"fine_tuned_model\": \"ft:gpt-3.5-turbo-0613:korea-university::7xUgH1aA\",\n", 779 | " \"organization_id\": \"org-bT1bA6ExTcdQphsyv85F0j6z\",\n", 780 | " \"result_files\": [\n", 781 | " \"file-rEaiBrRYmva7r571MumUT7xv\"\n", 782 | " ],\n", 783 | " \"status\": \"succeeded\",\n", 784 | " \"validation_file\": \"file-HW2ySBmxZGxbkavqN5CwFTgD\",\n", 785 | " \"training_file\": \"file-0Wrj9UoWNcXGAFjrSmqPeCpA\",\n", 786 | " \"hyperparameters\": {\n", 787 | " \"n_epochs\": 1\n", 788 | " },\n", 789 | " \"trained_tokens\": 246559,\n", 790 | " \"error\": null\n", 791 | " },\n", 792 | " {\n", 793 | " \"object\": \"fine_tuning.job\",\n", 794 | " \"id\": \"ftjob-TxmbAGAFPoNgDwGdNr9kM7FJ\",\n", 795 | " \"model\": \"gpt-3.5-turbo-0613\",\n", 796 | " \"created_at\": 1693554274,\n", 797 | " \"finished_at\": 1693554742,\n", 798 | " \"fine_tuned_model\": \"ft:gpt-3.5-turbo-0613:korea-university::7ttF0Dx2\",\n", 799 | " \"organization_id\": \"org-bT1bA6ExTcdQphsyv85F0j6z\",\n", 800 | " \"result_files\": [\n", 801 | " \"file-ByRYue7nbUBQm4eYSZk2HhWI\"\n", 802 | " ],\n", 803 | " \"status\": \"succeeded\",\n", 804 | " \"validation_file\": \"file-t4PDvppb0D5qHrU4tHrfumh1\",\n", 805 | " \"training_file\": \"file-rtZF3Ru65feoLs3eutQJsJ0H\",\n", 806 | " \"hyperparameters\": {\n", 807 | " \"n_epochs\": 1\n", 808 | " },\n", 809 | " \"trained_tokens\": 831,\n", 810 | " \"error\": null\n", 811 | " },\n", 812 | " {\n", 813 | " \"object\": 
\"fine_tuning.job\",\n", 814 | " \"id\": \"ftjob-o3N7MRwfHtLXgvyheaO7pEHr\",\n", 815 | " \"model\": \"gpt-3.5-turbo-0613\",\n", 816 | " \"created_at\": 1693548543,\n", 817 | " \"finished_at\": 1693549009,\n", 818 | " \"fine_tuned_model\": \"ft:gpt-3.5-turbo-0613:korea-university::7trkYzgO\",\n", 819 | " \"organization_id\": \"org-bT1bA6ExTcdQphsyv85F0j6z\",\n", 820 | " \"result_files\": [\n", 821 | " \"file-KQDYaX1BlnGQZHhk94Qjy35K\"\n", 822 | " ],\n", 823 | " \"status\": \"succeeded\",\n", 824 | " \"validation_file\": \"file-LTE83jTd01XEUQ6aiL1iLeuN\",\n", 825 | " \"training_file\": \"file-a3z1jd3Kt5giIVLsxPubxGUH\",\n", 826 | " \"hyperparameters\": {\n", 827 | " \"n_epochs\": 1\n", 828 | " },\n", 829 | " \"trained_tokens\": 831,\n", 830 | " \"error\": null\n", 831 | " },\n", 832 | " {\n", 833 | " \"object\": \"fine_tuning.job\",\n", 834 | " \"id\": \"ftjob-FyMhJGWK7LaGlu94NuAadINS\",\n", 835 | " \"model\": \"gpt-3.5-turbo-0613\",\n", 836 | " \"created_at\": 1693548511,\n", 837 | " \"finished_at\": null,\n", 838 | " \"fine_tuned_model\": null,\n", 839 | " \"organization_id\": \"org-bT1bA6ExTcdQphsyv85F0j6z\",\n", 840 | " \"result_files\": [],\n", 841 | " \"status\": \"cancelled\",\n", 842 | " \"validation_file\": \"file-LTE83jTd01XEUQ6aiL1iLeuN\",\n", 843 | " \"training_file\": \"file-a3z1jd3Kt5giIVLsxPubxGUH\",\n", 844 | " \"hyperparameters\": {\n", 845 | " \"n_epochs\": 1\n", 846 | " },\n", 847 | " \"trained_tokens\": null,\n", 848 | " \"error\": null\n", 849 | " },\n", 850 | " {\n", 851 | " \"object\": \"fine_tuning.job\",\n", 852 | " \"id\": \"ftjob-QcdcVVyLjBabDRhESarNTCpm\",\n", 853 | " \"model\": \"gpt-3.5-turbo-0613\",\n", 854 | " \"created_at\": 1693546893,\n", 855 | " \"finished_at\": 1693547399,\n", 856 | " \"fine_tuned_model\": \"ft:gpt-3.5-turbo-0613:korea-university::7trKamW3\",\n", 857 | " \"organization_id\": \"org-bT1bA6ExTcdQphsyv85F0j6z\",\n", 858 | " \"result_files\": [\n", 859 | " \"file-l5WBArbeZ1DKW79pm2USfrCn\"\n", 860 | " ],\n", 861 | " \"status\": \"succeeded\",\n", 862 | " \"validation_file\": \"file-Q5fBGxgptSjSWfvd2pdizTQ4\",\n", 863 | " \"training_file\": \"file-3VlDNqS78TwXgLdi92igEz7B\",\n", 864 | " \"hyperparameters\": {\n", 865 | " \"n_epochs\": 1\n", 866 | " },\n", 867 | " \"trained_tokens\": 914,\n", 868 | " \"error\": null\n", 869 | " },\n", 870 | " {\n", 871 | " \"object\": \"fine_tuning.job\",\n", 872 | " \"id\": \"ftjob-fhipvS2c2q7G3t4s7CLQGgDH\",\n", 873 | " \"model\": \"gpt-3.5-turbo-0613\",\n", 874 | " \"created_at\": 1693545051,\n", 875 | " \"finished_at\": 1693545516,\n", 876 | " \"fine_tuned_model\": \"ft:gpt-3.5-turbo-0613:korea-university::7tqqDVxu\",\n", 877 | " \"organization_id\": \"org-bT1bA6ExTcdQphsyv85F0j6z\",\n", 878 | " \"result_files\": [\n", 879 | " \"file-WuxpCaSYmkq171yJVWGAGmbR\"\n", 880 | " ],\n", 881 | " \"status\": \"succeeded\",\n", 882 | " \"validation_file\": \"file-B2h4oSFYdJkA7zzE82nqft0h\",\n", 883 | " \"training_file\": \"file-SXe3jQvUsMOrYS9GtfKR1g8s\",\n", 884 | " \"hyperparameters\": {\n", 885 | " \"n_epochs\": 1\n", 886 | " },\n", 887 | " \"trained_tokens\": 914,\n", 888 | " \"error\": null\n", 889 | " },\n", 890 | " {\n", 891 | " \"object\": \"fine_tuning.job\",\n", 892 | " \"id\": \"ftjob-gNALz0Stw06p7e0NeI0isSAO\",\n", 893 | " \"model\": \"gpt-3.5-turbo-0613\",\n", 894 | " \"created_at\": 1693544099,\n", 895 | " \"finished_at\": 1693544596,\n", 896 | " \"fine_tuned_model\": \"ft:gpt-3.5-turbo-0613:korea-university::7tqbNGH5\",\n", 897 | " \"organization_id\": \"org-bT1bA6ExTcdQphsyv85F0j6z\",\n", 898 | " 
\"result_files\": [\n", 899 | " \"file-GFZY2SmNEuwnRLXq2CmbCHJP\"\n", 900 | " ],\n", 901 | " \"status\": \"succeeded\",\n", 902 | " \"validation_file\": \"file-UEQLpqMsDH2Y7rJLCtbWnqY2\",\n", 903 | " \"training_file\": \"file-abntl1qJwH7fVHfRXAn97hKR\",\n", 904 | " \"hyperparameters\": {\n", 905 | " \"n_epochs\": 1\n", 906 | " },\n", 907 | " \"trained_tokens\": 2551,\n", 908 | " \"error\": null\n", 909 | " },\n", 910 | " {\n", 911 | " \"object\": \"fine_tuning.job\",\n", 912 | " \"id\": \"ftjob-DAZc9L3aEKg1mNY2zFmpgdhv\",\n", 913 | " \"model\": \"gpt-3.5-turbo-0613\",\n", 914 | " \"created_at\": 1693542580,\n", 915 | " \"finished_at\": 1693543162,\n", 916 | " \"fine_tuned_model\": \"ft:gpt-3.5-turbo-0613:korea-university::7tqEGsAP\",\n", 917 | " \"organization_id\": \"org-bT1bA6ExTcdQphsyv85F0j6z\",\n", 918 | " \"result_files\": [\n", 919 | " \"file-IG5qEkSebmuqpDR3Zl5ljMA3\"\n", 920 | " ],\n", 921 | " \"status\": \"succeeded\",\n", 922 | " \"validation_file\": null,\n", 923 | " \"training_file\": \"file-szdPvDiHgfSoSMJKMsNCtGLG\",\n", 924 | " \"hyperparameters\": {\n", 925 | " \"n_epochs\": 3\n", 926 | " },\n", 927 | " \"trained_tokens\": 7653,\n", 928 | " \"error\": null\n", 929 | " }\n", 930 | " ],\n", 931 | " \"has_more\": false\n", 932 | "}" 933 | ] 934 | }, 935 | "execution_count": 18, 936 | "metadata": {}, 937 | "output_type": "execute_result" 938 | } 939 | ], 940 | "source": [ 941 | "# List 10 fine-tuning jobs\n", 942 | "openai.FineTuningJob.list(limit=10)" 943 | ] 944 | }, 945 | { 946 | "cell_type": "markdown", 947 | "id": "c26012f7", 948 | "metadata": {}, 949 | "source": [ 950 | "### Retrieve information" 951 | ] 952 | }, 953 | { 954 | "cell_type": "code", 955 | "execution_count": 19, 956 | "id": "445ce90d", 957 | "metadata": {}, 958 | "outputs": [ 959 | { 960 | "name": "stdout", 961 | "output_type": "stream", 962 | "text": [ 963 | "current_job_id:[ftjob-8jBQV5vc2pmpUifBuRHhr8OY]\n" 964 | ] 965 | } 966 | ], 967 | "source": [ 968 | "current_job_id = current_job['id']\n", 969 | "print (\"current_job_id:[%s]\"%(current_job_id))" 970 | ] 971 | }, 972 | { 973 | "cell_type": "code", 974 | "execution_count": 20, 975 | "id": "72bc304c", 976 | "metadata": {}, 977 | "outputs": [ 978 | { 979 | "data": { 980 | "text/plain": [ 981 | " JSON: {\n", 982 | " \"object\": \"fine_tuning.job\",\n", 983 | " \"id\": \"ftjob-8jBQV5vc2pmpUifBuRHhr8OY\",\n", 984 | " \"model\": \"gpt-3.5-turbo-0613\",\n", 985 | " \"created_at\": 1694523358,\n", 986 | " \"finished_at\": null,\n", 987 | " \"fine_tuned_model\": null,\n", 988 | " \"organization_id\": \"org-bT1bA6ExTcdQphsyv85F0j6z\",\n", 989 | " \"result_files\": [],\n", 990 | " \"status\": \"running\",\n", 991 | " \"validation_file\": \"file-LNhNveOLUVZEdbhwLJKXe7u8\",\n", 992 | " \"training_file\": \"file-fh77pphFD9aiezuGPntLPRxm\",\n", 993 | " \"hyperparameters\": {\n", 994 | " \"n_epochs\": 1\n", 995 | " },\n", 996 | " \"trained_tokens\": null,\n", 997 | " \"error\": null\n", 998 | "}" 999 | ] 1000 | }, 1001 | "execution_count": 20, 1002 | "metadata": {}, 1003 | "output_type": "execute_result" 1004 | } 1005 | ], 1006 | "source": [ 1007 | "# Retrieve the state of a fine-tune\n", 1008 | "openai.FineTuningJob.retrieve(current_job_id)" 1009 | ] 1010 | }, 1011 | { 1012 | "cell_type": "markdown", 1013 | "id": "2b2f8800", 1014 | "metadata": {}, 1015 | "source": [ 1016 | "### Wait for fine-tuning" 1017 | ] 1018 | }, 1019 | { 1020 | "cell_type": "code", 1021 | "execution_count": 21, 1022 | "id": "704937f5", 1023 | "metadata": { 1024 | "scrolled": true 1025 | }, 
1026 | "outputs": [ 1027 | { 1028 | "name": "stdout", 1029 | "output_type": "stream", 1030 | "text": [ 1031 | "Fine tuning job started\n", 1032 | "Created fine-tuning job: ftjob-8jBQV5vc2pmpUifBuRHhr8OY\n", 1033 | "Fine tuning job started\n", 1034 | "Created fine-tuning job: ftjob-8jBQV5vc2pmpUifBuRHhr8OY\n", 1035 | "Fine tuning job started\n", 1036 | "Created fine-tuning job: ftjob-8jBQV5vc2pmpUifBuRHhr8OY\n", 1037 | "Fine tuning job started\n", 1038 | "Created fine-tuning job: ftjob-8jBQV5vc2pmpUifBuRHhr8OY\n", 1039 | "Fine tuning job started\n", 1040 | "Created fine-tuning job: ftjob-8jBQV5vc2pmpUifBuRHhr8OY\n", 1041 | "Fine tuning job started\n", 1042 | "Created fine-tuning job: ftjob-8jBQV5vc2pmpUifBuRHhr8OY\n", 1043 | "Fine tuning job started\n", 1044 | "Created fine-tuning job: ftjob-8jBQV5vc2pmpUifBuRHhr8OY\n", 1045 | "Fine tuning job started\n", 1046 | "Created fine-tuning job: ftjob-8jBQV5vc2pmpUifBuRHhr8OY\n", 1047 | "Fine tuning job started\n", 1048 | "Created fine-tuning job: ftjob-8jBQV5vc2pmpUifBuRHhr8OY\n", 1049 | "Fine tuning job started\n", 1050 | "Created fine-tuning job: ftjob-8jBQV5vc2pmpUifBuRHhr8OY\n", 1051 | "Fine tuning job started\n", 1052 | "Created fine-tuning job: ftjob-8jBQV5vc2pmpUifBuRHhr8OY\n", 1053 | "Fine tuning job started\n", 1054 | "Created fine-tuning job: ftjob-8jBQV5vc2pmpUifBuRHhr8OY\n", 1055 | "Step 100/1000: training loss=1.76\n", 1056 | "Fine tuning job started\n", 1057 | "Step 100/1000: training loss=1.76\n", 1058 | "Fine tuning job started\n", 1059 | "Step 100/1000: training loss=1.76\n", 1060 | "Fine tuning job started\n", 1061 | "Step 200/1000: training loss=1.96\n", 1062 | "Step 100/1000: training loss=1.76\n", 1063 | "Step 200/1000: training loss=1.96\n", 1064 | "Step 100/1000: training loss=1.76\n", 1065 | "Step 200/1000: training loss=1.96\n", 1066 | "Step 100/1000: training loss=1.76\n", 1067 | "Step 300/1000: training loss=1.85\n", 1068 | "Step 200/1000: training loss=1.96\n", 1069 | "Step 300/1000: training loss=1.85\n", 1070 | "Step 200/1000: training loss=1.96\n", 1071 | "Step 300/1000: training loss=1.85\n", 1072 | "Step 200/1000: training loss=1.96\n", 1073 | "Step 400/1000: training loss=2.87\n", 1074 | "Step 300/1000: training loss=1.85\n", 1075 | "Step 400/1000: training loss=2.87\n", 1076 | "Step 300/1000: training loss=1.85\n", 1077 | "Step 400/1000: training loss=2.87\n", 1078 | "Step 300/1000: training loss=1.85\n", 1079 | "Step 400/1000: training loss=2.87\n", 1080 | "Step 300/1000: training loss=1.85\n", 1081 | "Step 500/1000: training loss=2.48\n", 1082 | "Step 400/1000: training loss=2.87\n", 1083 | "Step 500/1000: training loss=2.48\n", 1084 | "Step 400/1000: training loss=2.87\n", 1085 | "Step 500/1000: training loss=2.48\n", 1086 | "Step 400/1000: training loss=2.87\n", 1087 | "Step 600/1000: training loss=1.93\n", 1088 | "Step 500/1000: training loss=2.48\n", 1089 | "Step 600/1000: training loss=1.93\n", 1090 | "Step 500/1000: training loss=2.48\n", 1091 | "Step 600/1000: training loss=1.93\n", 1092 | "Step 500/1000: training loss=2.48\n", 1093 | "Step 700/1000: training loss=2.23\n", 1094 | "Step 600/1000: training loss=1.93\n", 1095 | "Step 700/1000: training loss=2.23\n", 1096 | "Step 600/1000: training loss=1.93\n", 1097 | "Step 700/1000: training loss=2.23\n", 1098 | "Step 600/1000: training loss=1.93\n", 1099 | "Step 800/1000: training loss=1.89\n", 1100 | "Step 700/1000: training loss=2.23\n", 1101 | "Step 800/1000: training loss=1.89\n", 1102 | "Step 700/1000: training loss=2.23\n", 1103 | "Step 800/1000: 
training loss=1.89\n", 1104 | "Step 700/1000: training loss=2.23\n", 1105 | "Step 900/1000: training loss=1.53\n", 1106 | "Step 800/1000: training loss=1.89\n", 1107 | "Step 900/1000: training loss=1.53\n", 1108 | "Step 800/1000: training loss=1.89\n", 1109 | "Step 900/1000: training loss=1.53\n", 1110 | "Step 800/1000: training loss=1.89\n", 1111 | "The job has successfully completed\n", 1112 | "New fine-tuned model created: ft:gpt-3.5-turbo-0613:korea-university::7xxqUPHt\n", 1113 | "=====================================\n", 1114 | "The job has successfully completed\n", 1115 | "{}\n", 1116 | "=====================================\n", 1117 | "New fine-tuned model created: ft:gpt-3.5-turbo-0613:korea-university::7xxqUPHt\n", 1118 | "{}\n", 1119 | "=====================================\n", 1120 | "Step 1000/1000: training loss=1.82\n", 1121 | "{\n", 1122 | " \"step\": 1000,\n", 1123 | " \"train_loss\": 1.82290518283844,\n", 1124 | " \"train_mean_token_accuracy\": 0.5504587292671204\n", 1125 | "}\n", 1126 | "=====================================\n", 1127 | "Step 900/1000: training loss=1.53\n", 1128 | "{\n", 1129 | " \"step\": 900,\n", 1130 | " \"train_loss\": 1.5286750793457031,\n", 1131 | " \"train_mean_token_accuracy\": 0.625\n", 1132 | "}\n", 1133 | "=====================================\n", 1134 | "Step 800/1000: training loss=1.89\n", 1135 | "{\n", 1136 | " \"step\": 800,\n", 1137 | " \"train_loss\": 1.8901692628860474,\n", 1138 | " \"train_mean_token_accuracy\": 0.5660377144813538\n", 1139 | "}\n", 1140 | "=====================================\n", 1141 | "Step 700/1000: training loss=2.23\n", 1142 | "{\n", 1143 | " \"step\": 700,\n", 1144 | " \"train_loss\": 2.2319793701171875,\n", 1145 | " \"train_mean_token_accuracy\": 0.47058823704719543\n", 1146 | "}\n", 1147 | "=====================================\n", 1148 | "Step 600/1000: training loss=1.93\n", 1149 | "{\n", 1150 | " \"step\": 600,\n", 1151 | " \"train_loss\": 1.9274938106536865,\n", 1152 | " \"train_mean_token_accuracy\": 0.5145630836486816\n", 1153 | "}\n", 1154 | "=====================================\n", 1155 | "Step 500/1000: training loss=2.48\n", 1156 | "{\n", 1157 | " \"step\": 500,\n", 1158 | " \"train_loss\": 2.4816272258758545,\n", 1159 | " \"train_mean_token_accuracy\": 0.4084506928920746\n", 1160 | "}\n", 1161 | "=====================================\n", 1162 | "Step 400/1000: training loss=2.87\n", 1163 | "{\n", 1164 | " \"step\": 400,\n", 1165 | " \"train_loss\": 2.870401620864868,\n", 1166 | " \"train_mean_token_accuracy\": 0.4650000035762787\n", 1167 | "}\n", 1168 | "=====================================\n", 1169 | "Step 300/1000: training loss=1.85\n", 1170 | "{\n", 1171 | " \"step\": 300,\n", 1172 | " \"train_loss\": 1.846813440322876,\n", 1173 | " \"train_mean_token_accuracy\": 0.529411792755127\n", 1174 | "}\n", 1175 | "Ready.\n" 1176 | ] 1177 | } 1178 | ], 1179 | "source": [ 1180 | "while True:\n", 1181 | " job = openai.FineTuningJob.retrieve(current_job_id)\n", 1182 | " completed = job['status'] != 'running'\n", 1183 | " # List up to 10 events from a fine-tuning job\n", 1184 | " events = openai.FineTuningJob.list_events(id=current_job_id, limit=2)\n", 1185 | " for event in events['data']:\n", 1186 | " print(event['message'])\n", 1187 | " if completed:\n", 1188 | " break\n", 1189 | " time.sleep(60)\n", 1190 | "events = openai.FineTuningJob.list_events(id=current_job_id, limit=10)\n", 1191 | "for event in events['data']:\n", 1192 | " print(\"=====================================\")\n", 1193 | " 
print(event['message'])\n", 1194 | " print(event['data'])\n", 1195 | "print (\"Ready.\")" 1196 | ] 1197 | }, 1198 | { 1199 | "cell_type": "markdown", 1200 | "id": "685011a6", 1201 | "metadata": {}, 1202 | "source": [ 1203 | "### Inference" 1204 | ] 1205 | }, 1206 | { 1207 | "cell_type": "code", 1208 | "execution_count": 22, 1209 | "id": "66be38d3", 1210 | "metadata": {}, 1211 | "outputs": [], 1212 | "source": [ 1213 | "model_id = openai.FineTuningJob.retrieve(current_job_id)['fine_tuned_model']" 1214 | ] 1215 | }, 1216 | { 1217 | "cell_type": "code", 1218 | "execution_count": 23, 1219 | "id": "ca0074b4", 1220 | "metadata": { 1221 | "scrolled": true 1222 | }, 1223 | "outputs": [ 1224 | { 1225 | "name": "stdout", 1226 | "output_type": "stream", 1227 | "text": [ 1228 | "{'personality': ['although i m studying to be a doctor , animals like me .', \"i'm a graduate student .\", 'i volunteer with dogs .', 'i am in between classes .', 'i m always early .'], 'utterances': [{'candidates': ['just a small pet cat . you have cats ?', 'i do like a lot of none gmo types of food', 'same , thinking of ordering a pizza . wish i could get sushi !', 'general office supplies , what do you do ?', 'i love books by stephen king', 'hello how is your sunday ?', 'that is so sweet ! is she sick ?', 'i walk around the mall when i have a chance .', \"i don't care for his books .\", 'i am , thank you . tell me about your personality .', 'why do you not date ?', 'hello mate , how are you today', 'i do not really watch many . i mostly spend time reading', 'good morning , how are you ?', 'what instrument do you play ?', 'do you happen to have a daughter named dorothy ?', \"ll , i'm already hungry enough . at this point , my lizards look edible .\", 'how about just 1 friend ?', 'wow you sound like a healthy person', \"hey i don't have long to chat . how are you ?\"], 'history': ['__ SILENCE __']}, {'candidates': ['hello , how is your weekend going ?', 'is that what you do on youtube ?', 'i am doing good . i am just drinking my favorite coffee right now .', 'i am alone . what about you ?', 'yes i like that since i work in a book store', 'hi , how are you today ?', \"i love dogs but don't have any pets\", 'oh , i just work odd jobs here and there', 'nice to meet you . i like to live a more enlightened life i guess', 'yeah , no joke . i love golden retrievers . i might get one of my own soon', 'it was the only thing that got me through the chemo for my wife', 'where are you from ? i am in cali', \"i'll only live for a few more months\", 'hi how are you doing', 'i prefer ice cream or frozen yogurt .', 'you like them or are you not that close ?', 'it gets better . have you thought of joining the army ?', \"my name is sam , and i've an addiction to high stakes blackjack\", 'i like comic books . i love superhero ones .', 'what are you up to today ?'], 'history': ['__ SILENCE __', \"hey i don't have long to chat . how are you ?\", 'that s ok i don t have long either . i m fine .']}, {'candidates': ['yes i did . i hope to find a job when i graduate next september', 'i have heard of it . never played though . busy with guitar , haha .', 'love country music . went shopping today and bought a new bmw', \"i'm on a diet so i try to watch what i eat\", 'nah , i cycle at night . its hard to hike at night .', 'you sure can ! i met him in a cave during halloween last year', 'i am pretty shy , so this is a big deal for me . i do like to sing though', 'nice to meet you ! i am don and i love being outside . 
do you work ?', 'that is a great job , i just work at an office in my city', 'i only drink coke from china , and it is hard to find in the usa', 'yeah the aliens are quiet creatures .', \"sure , i've settled down a little bit too . but i still love concerts !\", 'where do you rap at', 'what is up my man ?', 'about her life as i can remember it , she passed on when i was 9', \"haha i'm a kid ! i'm in third grade and love soccer ! you ?\", 'bread , we are the bread family haha , teaching is awesome', 'i have a few kittens', 'it certainly sounds that way', 'nice ! you should take pictures of shelter dogs and help them get adopted !'], 'history': ['__ SILENCE __', \"hey i don't have long to chat . how are you ?\", 'that s ok i don t have long either . i m fine .', 'what are you up to today ?', \"i'm taking pictures . i'm a photographer\"]}, {'candidates': ['do you have a favorite color ?', 'bicycles must be so useful there ? except in winter . or maybe not ?', 'i like any brand that good will carries .', 'what kind of sea food do you like', 'it helps when you have a deadbeat for a father . i fend for myself .', \"i've three they are the best\", 'i love apples . macintosh are the best !', 'noone opinion matters but yours . i love roses', 'not really . i like to stay home a lot but i might go out to eat .', 'oh and i also have these chronically bad back pains', 'i love having my back scratched , and making candles .', 'hello from pittsburgh where i live', \"don't blame you . must be cold in buffalo .\", 'i am in college for childhood', 'i read the walking dead comics and watch the show', 'besides that i like okra , peas , apples and bananas best', 'nice , sounds like a delicious job', 'yes . we are all here for a reason', 'cults suck , when was the last time you heard from her', \"me too . animals like me and i like them . i'm in medical school though\"], 'history': ['__ SILENCE __', \"hey i don't have long to chat . how are you ?\", 'that s ok i don t have long either . i m fine .', 'what are you up to today ?', \"i'm taking pictures . i'm a photographer\", 'nice ! you should take pictures of shelter dogs and help them get adopted !', \"that's a good idea . i love animals .\"]}, {'candidates': ['i have absolutely no energy ! i stay home with the kids . but i am a serious cleaner', 'that is fine . if we can make them explode .', 'do you enjoy to read ? i could read while you fish ?', 'i have almost finished with it too ! just gotta finish one book .', \"no , i don't . what did you have for breakfast today ?\", 'working all day , my feet hurt and smell , but money is my favorite thing i love it', 'i am going for medical billing not to bad .', 'i like bambi you just made me cry', 'he retires next year . luckily , i am already retired . i live on the west coast . you ?', 'hi . tell me about you .', 'second year how long in the dress design business', 'my bf watches twitch a lot .', 'family pictures ! family is everything to me ! how big is yours ?', 'ugh , i am sorry . do you have any hobbies ?', \"i've rex , pepper and miko\", 'i like to go for a run when i wake up it really energizes me', \"hello ! i'm well , thanks ! how are you ?\", 'thank you . i try to be a positive influence in this world .', \"i don't do music , i listen to it .\", \"i'm waiting for class to start . i get here too quick but there's no real time\"], 'history': ['__ SILENCE __', \"hey i don't have long to chat . how are you ?\", 'that s ok i don t have long either . 
i m fine .', 'what are you up to today ?', \"i'm taking pictures . i'm a photographer\", 'nice ! you should take pictures of shelter dogs and help them get adopted !', \"that's a good idea . i love animals .\", \"me too . animals like me and i like them . i'm in medical school though\", \"that's very admirable too . i've taken photos of doctors in action before .\"]}, {'candidates': ['yeah , it is sad . i should just hang out and listen to music .', \"that is a fun sport i do rock climbing when i'm able\", 'haha , its my favorite subject . when i graduate i wanna go to college for it !', \"that's awesome , man . i am still in the closet and i am so conflicted . . .\", 'lol well i manage a grocery store , talk about boring', 'i have a degree in communication , and i am in the navy .', 'fun ! do you mystery novels ?', 'i think my favorite is hamburger !', 'my son is awesome straight as', 'i like bands vnv is the one i like alot .', \"yeah , i've been seeing this girl for a few months .\", \"ow i haven't ridden a horse in a long time . i have to take certain medications\", \"my two children they're for under 10\", 'all about drawing comics . i like them a lot', 'no kidding ? ! before i was in law school , i skied outside chicago .', 'i used to drive my mother crazy , i liked to smoke , i am tee total now though', 'lol gotcha . . . this is michael', '6 years ago when i was in highschool', 'yes actually . a cat named majora .', 'for sure . i volunteer with the shelter in my spare time .'], 'history': ['__ SILENCE __', \"hey i don't have long to chat . how are you ?\", 'that s ok i don t have long either . i m fine .', 'what are you up to today ?', \"i'm taking pictures . i'm a photographer\", 'nice ! you should take pictures of shelter dogs and help them get adopted !', \"that's a good idea . i love animals .\", \"me too . animals like me and i like them . i'm in medical school though\", \"that's very admirable too . i've taken photos of doctors in action before .\", \"i'm waiting for class to start . i get here too quick but there's no real time\", 'i do that too . got to get the best seat .']}, {'candidates': ['an honor ! but i also have 2 siberians . those huskies are like a second job !', 'i am a vet at the animal shelter', 'glad you are studying english . it must be difficult learning english and living in a new culture .', 'meet your parents ? ! my hair is still red from the accident !', 'i went shopping online and got a purse', \"yes i'm the leader of a band and the leader at my local gun club\", 'i love dogs ! my mutt loves swimming in our lake .', 'we need to get a pianist . would help me stay on key', 'do you have any pets ? i love dogs', 'hi ! do you like dogs ?', \"that is cool . well i've to get back to my studies soon . is there anything else ?\", 'that is kind of you to say but are you funny also ?', 'that is the best thing ever', 'hello ! what do you do ?', 'yes i can make any color simple or bold look', 'what is his name , if i may ask ?', 'i am not stressed i just want to look fabulous forever lol', \"you love photography that's a great hobby for traveling\", 'yea , i like cars but i needed a truck for the farm , so i changed', 'what is wrong with her'], 'history': ['__ SILENCE __', \"hey i don't have long to chat . how are you ?\", 'that s ok i don t have long either . i m fine .', 'what are you up to today ?', \"i'm taking pictures . i'm a photographer\", 'nice ! you should take pictures of shelter dogs and help them get adopted !', \"that's a good idea . 
i love animals .\", \"me too . animals like me and i like them . i'm in medical school though\", \"that's very admirable too . i've taken photos of doctors in action before .\", \"i'm waiting for class to start . i get here too quick but there's no real time\", 'i do that too . got to get the best seat .', 'for sure . i volunteer with the shelter in my spare time .', 'my wife used to help out at the shelter before she got sick .']}, {'candidates': ['full house here too . i have 6 brothers and sisters , we were all adopted', \"well , i would like to , but i don't have time for it .\", \"i'm am doing good do u like sports\", 'i play a few instruments ! i really want to be able to paly music for a living', 'i disagree , i love my small town', 'hello , how are you today ?', 'my dad does tax assesment and my mom teaches !', 'can i come then please', 'are you looking for any kind of design advice by any chance ?', 'i am doing well thank you . just been working on a short film myself .', 'i just learned to read . we can be friends and i will read about trains .', 'awesome ! they sleeping . luckily i brought my fav book . the tale of genji .', 'hello how are you today', 'ok when did you become deaf', 'hi how are you dong ?', 'i think people think i m lazy because i still live at home with mom . what about you ?', 'i think magic mike is a movie i like a lot , uh , because of the music .', 'i just work on my wood work and make couches and stuff', 'yes . i miss my wife and kids . are you married ?', \"sorry to hear that . that's sad .\"], 'history': ['__ SILENCE __', \"hey i don't have long to chat . how are you ?\", 'that s ok i don t have long either . i m fine .', 'what are you up to today ?', \"i'm taking pictures . i'm a photographer\", 'nice ! you should take pictures of shelter dogs and help them get adopted !', \"that's a good idea . i love animals .\", \"me too . animals like me and i like them . i'm in medical school though\", \"that's very admirable too . i've taken photos of doctors in action before .\", \"i'm waiting for class to start . i get here too quick but there's no real time\", 'i do that too . got to get the best seat .', 'for sure . i volunteer with the shelter in my spare time .', 'my wife used to help out at the shelter before she got sick .', 'what is wrong with her', 'she had cancer . 
she passed away .']}]}\n" 1229 | ] 1230 | } 1231 | ], 1232 | "source": [ 1233 | "start_idx = NUM_TRAIN + NUM_VALIDATION\n", 1234 | "end_idx = len(dataset['train'])\n", 1235 | "sample_idx = random.randint(start_idx, end_idx)\n", 1236 | "sample = dataset['train'][sample_idx]\n", 1237 | "print(sample)" 1238 | ] 1239 | }, 1240 | { 1241 | "cell_type": "code", 1242 | "execution_count": 24, 1243 | "id": "9a6a0839", 1244 | "metadata": {}, 1245 | "outputs": [], 1246 | "source": [ 1247 | "persona = \" \".join(sample['personality'])\n", 1248 | "history = sample['utterances'][-1]['history']\n", 1249 | "messages = [\n", 1250 | " {\"role\": \"system\", \"content\": persona},\n", 1251 | " {\"role\": \"user\", \"content\": history[0]},\n", 1252 | " {\"role\": \"assistant\", \"content\": history[1]},\n", 1253 | " {\"role\": \"user\", \"content\": history[2]},\n", 1254 | "]" 1255 | ] 1256 | }, 1257 | { 1258 | "cell_type": "code", 1259 | "execution_count": 25, 1260 | "id": "b748321b", 1261 | "metadata": {}, 1262 | "outputs": [], 1263 | "source": [ 1264 | "response = completion = openai.ChatCompletion.create(\n", 1265 | " model=model_id,\n", 1266 | " messages=messages\n", 1267 | ")" 1268 | ] 1269 | }, 1270 | { 1271 | "cell_type": "code", 1272 | "execution_count": 26, 1273 | "id": "0bf97bce", 1274 | "metadata": {}, 1275 | "outputs": [ 1276 | { 1277 | "name": "stdout", 1278 | "output_type": "stream", 1279 | "text": [ 1280 | "Persona: although i m studying to be a doctor , animals like me . i'm a graduate student . i volunteer with dogs . i am in between classes . i m always early .\n", 1281 | "user: __ SILENCE __\n", 1282 | "assistant: hey i don't have long to chat . how are you ?\n", 1283 | "user: that s ok i don t have long either . i m fine .\n", 1284 | "response: what do you do for work ? 
i am getting ready to be a doctor\n", 1285 | "Ground Truth: what are you up to today ?\n" 1286 | ] 1287 | } 1288 | ], 1289 | "source": [ 1290 | "answer = response['choices'][0]['message']['content']\n", 1291 | "print('Persona: ' + persona)\n", 1292 | "print('user: ' + history[0])\n", 1293 | "print('assistant: ' + history[1])\n", 1294 | "print('user: ' + history[2])\n", 1295 | "print('response: ' + answer)\n", 1296 | "print('Ground Truth: ' + history[3])" 1297 | ] 1298 | }, 1299 | { 1300 | "cell_type": "code", 1301 | "execution_count": 27, 1302 | "id": "4e00d67b", 1303 | "metadata": {}, 1304 | "outputs": [], 1305 | "source": [ 1306 | "persona = \"I am an extroverted person \"\n", 1307 | "\n", 1308 | "messages = [\n", 1309 | " {\"role\": \"system\", \"content\": persona},\n", 1310 | " {\"role\": \"user\", \"content\": \"If you see someone waving their hand what will you do?\"}\n", 1311 | "]" 1312 | ] 1313 | }, 1314 | { 1315 | "cell_type": "code", 1316 | "execution_count": 28, 1317 | "id": "1f4755da", 1318 | "metadata": {}, 1319 | "outputs": [], 1320 | "source": [ 1321 | "response = completion = openai.ChatCompletion.create(\n", 1322 | " model=model_id,\n", 1323 | " messages=messages\n", 1324 | ")" 1325 | ] 1326 | }, 1327 | { 1328 | "cell_type": "code", 1329 | "execution_count": 29, 1330 | "id": "decf8785", 1331 | "metadata": {}, 1332 | "outputs": [ 1333 | { 1334 | "name": "stdout", 1335 | "output_type": "stream", 1336 | "text": [ 1337 | "Persona: I am an extroverted person \n", 1338 | "response: I wave my hands to when I talk as I am an extroverted person\n" 1339 | ] 1340 | } 1341 | ], 1342 | "source": [ 1343 | "answer = response['choices'][0]['message']['content']\n", 1344 | "print('Persona: ' + persona)\n", 1345 | "print('response: ' + answer)" 1346 | ] 1347 | }, 1348 | { 1349 | "cell_type": "code", 1350 | "execution_count": 30, 1351 | "id": "4f602646", 1352 | "metadata": {}, 1353 | "outputs": [ 1354 | { 1355 | "name": "stdout", 1356 | "output_type": "stream", 1357 | "text": [ 1358 | "Persona: I am an extroverted person \n", 1359 | "response: As an extroverted person, I would likely feel comfortable and excited to engage with them. 
I would wave back, approach them, and start a friendly conversation.\n" 1360 | ] 1361 | } 1362 | ], 1363 | "source": [ 1364 | "response = completion = openai.ChatCompletion.create(\n", 1365 | " model='gpt-4',\n", 1366 | " messages=messages\n", 1367 | ")\n", 1368 | "answer = response['choices'][0]['message']['content']\n", 1369 | "print('Persona: ' + persona)\n", 1370 | "print('response: ' + answer)" 1371 | ] 1372 | }, 1373 | { 1374 | "cell_type": "code", 1375 | "execution_count": null, 1376 | "id": "ab198c0f", 1377 | "metadata": {}, 1378 | "outputs": [], 1379 | "source": [] 1380 | } 1381 | ], 1382 | "metadata": { 1383 | "kernelspec": { 1384 | "display_name": "Python 3 (ipykernel)", 1385 | "language": "python", 1386 | "name": "python3" 1387 | }, 1388 | "language_info": { 1389 | "codemirror_mode": { 1390 | "name": "ipython", 1391 | "version": 3 1392 | }, 1393 | "file_extension": ".py", 1394 | "mimetype": "text/x-python", 1395 | "name": "python", 1396 | "nbconvert_exporter": "python", 1397 | "pygments_lexer": "ipython3", 1398 | "version": "3.9.16" 1399 | } 1400 | }, 1401 | "nbformat": 4, 1402 | "nbformat_minor": 5 1403 | } 1404 | -------------------------------------------------------------------------------- /code/demo_webcrawl_02_wikibot.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "0feb4754", 6 | "metadata": {}, 7 | "source": [ 8 | "### `WikiBot` searching and summaring user queries" 9 | ] 10 | }, 11 | { 12 | "cell_type": "code", 13 | "execution_count": 1, 14 | "id": "040163e7", 15 | "metadata": {}, 16 | "outputs": [ 17 | { 18 | "name": "stdout", 19 | "output_type": "stream", 20 | "text": [ 21 | "Ready.\n" 22 | ] 23 | } 24 | ], 25 | "source": [ 26 | "import os\n", 27 | "from gpt_helper import set_openai_api_key_from_txt,GPTchatClass,printmd\n", 28 | "from wiki_helper import wiki_search\n", 29 | "from util import printmd\n", 30 | "print (\"Ready.\")" 31 | ] 32 | }, 33 | { 34 | "cell_type": "markdown", 35 | "id": "12f7cc3e", 36 | "metadata": {}, 37 | "source": [ 38 | "### Set API Key and Instantiate GPT Agent" 39 | ] 40 | }, 41 | { 42 | "cell_type": "code", 43 | "execution_count": 2, 44 | "id": "b17d6584", 45 | "metadata": {}, 46 | "outputs": [ 47 | { 48 | "name": "stdout", 49 | "output_type": "stream", 50 | "text": [ 51 | "OpenAI API Key Ready from [../key/rilab_key.txt].\n", 52 | "Chat agent using [gpt-3.5-turbo] initialized with the follow role:[Your are a helpful assistant summarizing infromation and answering user queries.]\n" 53 | ] 54 | } 55 | ], 56 | "source": [ 57 | "set_openai_api_key_from_txt(key_path='../key/rilab_key.txt')\n", 58 | "GPT = GPTchatClass(\n", 59 | " gpt_model='gpt-3.5-turbo',\n", 60 | " role_msg='Your are a helpful assistant summarizing infromation and answering user queries.')" 61 | ] 62 | }, 63 | { 64 | "cell_type": "markdown", 65 | "id": "7fec5733", 66 | "metadata": {}, 67 | "source": [ 68 | "### Query sentence" 69 | ] 70 | }, 71 | { 72 | "cell_type": "code", 73 | "execution_count": 3, 74 | "id": "7f00b562", 75 | "metadata": {}, 76 | "outputs": [ 77 | { 78 | "name": "stdout", 79 | "output_type": "stream", 80 | "text": [ 81 | "entity:[Could you explain the behavior of a stubborn person?]\n" 82 | ] 83 | } 84 | ], 85 | "source": [ 86 | "# entity = \"President of South Korea\"\n", 87 | "entity = \"Could you explain the behavior of a stubborn person?\"\n", 88 | "print (\"entity:[%s]\"%(entity))" 89 | ] 90 | }, 91 | { 92 | "cell_type": "code", 93 | 
"execution_count": 4, 94 | "id": "e85659b2", 95 | "metadata": {}, 96 | "outputs": [ 97 | { 98 | "name": "stdout", 99 | "output_type": "stream", 100 | "text": [ 101 | "entity:[Could you explain the behavior of a stubborn person?] mismatched. use [Tantrum] instead.\n", 102 | " We have total [51] paragraphs.\n", 103 | " After filtering, we have [16] and [8] paragraphs returned (k:[5] and m:[3])\n" 104 | ] 105 | } 106 | ], 107 | "source": [ 108 | "paragraphs_return = wiki_search(entity=entity,VERBOSE=True)" 109 | ] 110 | }, 111 | { 112 | "cell_type": "markdown", 113 | "id": "9c61668a", 114 | "metadata": {}, 115 | "source": [ 116 | "### Summarize each paragraph using `GPT`" 117 | ] 118 | }, 119 | { 120 | "cell_type": "code", 121 | "execution_count": 5, 122 | "id": "096e1081", 123 | "metadata": { 124 | "scrolled": false 125 | }, 126 | "outputs": [ 127 | { 128 | "data": { 129 | "text/markdown": [ 130 | "A tantrum is an emotional outburst typically characterized by stubbornness, crying, screaming, violence, defiance, and resistance to pacification, which can lead to consequences such as detention or suspension." 131 | ], 132 | "text/plain": [ 133 | "" 134 | ] 135 | }, 136 | "metadata": {}, 137 | "output_type": "display_data" 138 | }, 139 | { 140 | "data": { 141 | "text/markdown": [ 142 | "Tantrums are common in young children and are considered normal and indicators of character development, but tend to decrease in frequency and intensity as the child ages." 143 | ], 144 | "text/plain": [ 145 | "" 146 | ] 147 | }, 148 | "metadata": {}, 149 | "output_type": "display_data" 150 | }, 151 | { 152 | "data": { 153 | "text/markdown": [ 154 | "Tantrums can be seen as a sign of excessive frustration and can diminish over time with calm and consistent handling, suggesting that parental containment may be necessary." 155 | ], 156 | "text/plain": [ 157 | "" 158 | ] 159 | }, 160 | "metadata": {}, 161 | "output_type": "display_data" 162 | }, 163 | { 164 | "data": { 165 | "text/markdown": [ 166 | "Selma Fraiberg cautioned against excessive control in child-rearing, as it can lead to defiant behavior and tantrums." 167 | ], 168 | "text/plain": [ 169 | "" 170 | ] 171 | }, 172 | "metadata": {}, 173 | "output_type": "display_data" 174 | }, 175 | { 176 | "data": { 177 | "text/markdown": [ 178 | "Certain individuals with developmental disorders or brain damage may be more prone to tantrums, although anyone can experience them regardless of gender or age, but it is important to distinguish between tantrums and meltdowns caused by sensory overload." 179 | ], 180 | "text/plain": [ 181 | "" 182 | ] 183 | }, 184 | "metadata": {}, 185 | "output_type": "display_data" 186 | }, 187 | { 188 | "data": { 189 | "text/markdown": [ 190 | "Freud believed that the Wolf Man's temper tantrums were a result of his sister seducing him, leading to feelings of guilt and an unconscious need for punishment, a phenomenon that Freud believed could be applicable to other cases of childhood tantrums." 191 | ], 192 | "text/plain": [ 193 | "" 194 | ] 195 | }, 196 | "metadata": {}, 197 | "output_type": "display_data" 198 | }, 199 | { 200 | "data": { 201 | "text/markdown": [ 202 | "Heinz Kohut argued that the core of a baby's personality is likely to have a self-centered, grandiose, and exhibitionist aspect, and tantrums are a form of narcissistic rage that occurs when the baby's inflated self-image is threatened by frustration from being denied something they want." 
203 | ], 204 | "text/plain": [ 205 | "" 206 | ] 207 | }, 208 | "metadata": {}, 209 | "output_type": "display_data" 210 | }, 211 | { 212 | "data": { 213 | "text/markdown": [ 214 | "Heinz Kohut believed that tantrums were expressions of anger caused by the frustration of a child's grandiose self-image." 215 | ], 216 | "text/plain": [ 217 | "" 218 | ] 219 | }, 220 | "metadata": {}, 221 | "output_type": "display_data" 222 | } 223 | ], 224 | "source": [ 225 | "for p_idx,p in enumerate(paragraphs_return):\n", 226 | " user_msg = \"Could you summarize the following paragraph into one setence? \\n \"+p\n", 227 | " response_content = GPT.chat(\n", 228 | " user_msg=user_msg,PRINT_USER_MSG=False,PRINT_GPT_OUTPUT=False,\n", 229 | " RESET_CHAT=True,RETURN_RESPONSE=True)\n", 230 | " # Print summarized sentence with a markdown format\n", 231 | " printmd(response_content)" 232 | ] 233 | }, 234 | { 235 | "cell_type": "code", 236 | "execution_count": null, 237 | "id": "078f3742", 238 | "metadata": {}, 239 | "outputs": [], 240 | "source": [] 241 | } 242 | ], 243 | "metadata": { 244 | "kernelspec": { 245 | "display_name": "Python 3 (ipykernel)", 246 | "language": "python", 247 | "name": "python3" 248 | }, 249 | "language_info": { 250 | "codemirror_mode": { 251 | "name": "ipython", 252 | "version": 3 253 | }, 254 | "file_extension": ".py", 255 | "mimetype": "text/x-python", 256 | "name": "python", 257 | "nbconvert_exporter": "python", 258 | "pygments_lexer": "ipython3", 259 | "version": "3.9.16" 260 | } 261 | }, 262 | "nbformat": 4, 263 | "nbformat_minor": 5 264 | } 265 | -------------------------------------------------------------------------------- /code/gpt_helper.py: -------------------------------------------------------------------------------- 1 | import copy 2 | import os 3 | import openai 4 | from tenacity import retry, stop_after_attempt, wait_fixed 5 | from IPython.display import Markdown,display 6 | from util import printmd 7 | 8 | def set_openai_api_key_from_txt(key_path='./key.txt',VERBOSE=True): 9 | """ 10 | Set OpenAI API Key from a txt file 11 | """ 12 | with open(key_path, 'r') as f: 13 | OPENAI_API_KEY = f.read() 14 | openai.api_key = OPENAI_API_KEY 15 | if VERBOSE: 16 | print ("OpenAI API Key Ready from [%s]."%(key_path)) 17 | 18 | class GPTchatClass(): 19 | def __init__(self, 20 | gpt_model = 'gpt-4', 21 | role_msg = 'Your are a helpful assistant.', 22 | VERBOSE = True 23 | ): 24 | self.gpt_model = gpt_model 25 | self.messages = [{'role':'system','content':f'{role_msg}'}] 26 | self.init_messages = [{'role':'system','content':f'{role_msg}'}] 27 | self.VERBOSE = VERBOSE 28 | self.response = None 29 | if self.VERBOSE: 30 | print ("Chat agent using [%s] initialized with the follow role:[%s]"% 31 | (self.gpt_model,role_msg)) 32 | 33 | def _add_message(self,role='assistant',content=''): 34 | """ 35 | role: 'assistant' / 'user' 36 | """ 37 | self.messages.append({'role':role, 'content':content}) 38 | 39 | def _get_response_content(self): 40 | if self.response: 41 | return self.response['choices'][0]['message']['content'] 42 | else: 43 | return None 44 | 45 | def _get_response_status(self): 46 | if self.response: 47 | return self.response['choices'][0]['message']['finish_reason'] 48 | else: 49 | return None 50 | 51 | @retry(stop=stop_after_attempt(10), wait=wait_fixed(5)) 52 | def chat(self,user_msg='hi', 53 | PRINT_USER_MSG=True,PRINT_GPT_OUTPUT=True, 54 | RESET_CHAT=False,RETURN_RESPONSE=True): 55 | self._add_message(role='user',content=user_msg) 56 | self.response = 
openai.ChatCompletion.create( 57 | model = self.gpt_model, 58 | messages = self.messages 59 | ) 60 | # Backup response for continous chatting 61 | self._add_message(role='assistant',content=self._get_response_content()) 62 | if PRINT_USER_MSG: 63 | print("[USER_MSG]") 64 | printmd(user_msg) 65 | if PRINT_GPT_OUTPUT: 66 | print("[GPT_OUTPUT]") 67 | printmd(self._get_response_content()) 68 | # Reset 69 | if RESET_CHAT: 70 | self.messages = copy.copy(self.init_messages) 71 | # Return 72 | if RETURN_RESPONSE: 73 | return self._get_response_content() 74 | -------------------------------------------------------------------------------- /code/util.py: -------------------------------------------------------------------------------- 1 | import re 2 | from IPython.display import Markdown,display 3 | 4 | def printmd(string): 5 | display(Markdown(string)) 6 | 7 | def extract_quoted_words(string): 8 | quoted_words = re.findall(r'"([^"]*)"', string) 9 | return quoted_words -------------------------------------------------------------------------------- /code/wiki_helper.py: -------------------------------------------------------------------------------- 1 | 2 | import requests 3 | from bs4 import BeautifulSoup 4 | 5 | def wiki_search(entity = "President of South Korea", 6 | min_char_len = 100, # minimum number of characters in a paragraph 7 | first_k = 5, # 'first_k' paragraphs to be included 8 | top_m_excluding_first_k = 3, # get 'top_m' exluding 'first_k' making the total 'k+m' 9 | VERBOSE = True 10 | ): 11 | """ 12 | This function return a number of paragraphs for searching an entity in Wikipedia 13 | """ 14 | # First, search `en.wikipedia.org` to get page 15 | entity_ = entity.replace(" ", "+") 16 | search_url = f"https://en.wikipedia.org/w/index.php?search={entity_}" 17 | response_text = requests.get(search_url).text 18 | soup = BeautifulSoup(response_text, features="html.parser") 19 | result_divs = soup.find_all("div", {"class": "mw-search-result-heading"}) 20 | 21 | if result_divs: # entity mismatch occurs 22 | # Get related wiki pages 23 | results = [] 24 | for div in result_divs: 25 | link = div.find('a') 26 | title = link.text 27 | url = link['href'] 28 | result = {'title': title, 'url': url} 29 | results.append(result) 30 | 31 | # Use the first matched wiki page 32 | entity_new = results[0]['title'] 33 | search_url = f"https://en.wikipedia.org/w/index.php?search={entity_new}" 34 | response_text = requests.get(search_url).text 35 | soup = BeautifulSoup(response_text, features="html.parser") 36 | page = [p.get_text().strip() for p in soup.find_all("p") + soup.find_all("ul")] 37 | 38 | if VERBOSE: # Debug print 39 | print ("entity:[%s] mismatched. use [%s] instead."%(entity,entity_new)) 40 | else: 41 | page = [p.get_text().strip() for p in soup.find_all("p") + soup.find_all("ul")] 42 | 43 | if VERBOSE: # Debug print 44 | print ("entity:[%s] matched."%(entity)) 45 | # Then, clean some strings 46 | def clean_str(p): 47 | p = p.replace('\\', '/') # <= Debug using GPT and it works! 
48 |         p_encode = p.encode()
49 |         p_decode = p_encode.decode("unicode-escape")
50 |         p_encode2 = p_decode.encode('latin1')
51 |         p_decode2 = p_encode2.decode('utf-8')
52 |         return p_decode2
53 |     page_clean = ""
54 |     for p in page:
55 |         page_clean += clean_str(p)
56 |         if not p.endswith('\n'):
57 |             page_clean += '\n'
58 |     paragraphs = page_clean.split("\n")
59 |     paragraphs = [p.strip() for p in paragraphs if p.strip()]
60 |     if VERBOSE:
61 |         print (" We have total [%d] paragraphs."%(len(paragraphs)))
62 | 
63 |     # Second, get some paragraphs
64 |     paragraphs_filtered = [p for p in paragraphs if len(p) >= min_char_len]
65 |     paragraphs_first_k = paragraphs_filtered[:first_k]
66 |     paragraphs_remain = paragraphs_filtered[first_k:]
67 |     paragraphs_sorted = sorted(paragraphs_remain,key=len,reverse=True)
68 |     paragraphs_top_m = paragraphs_sorted[:top_m_excluding_first_k]
69 |     paragraphs_return = paragraphs_first_k + paragraphs_top_m
70 | 
71 |     if VERBOSE: # Debug print
72 |         print (" After filtering, we have [%d] and [%d] paragraphs returned (k:[%d] and m:[%d])"%
73 |                (len(paragraphs_filtered),len(paragraphs_return),first_k,top_m_excluding_first_k
74 |                ))
75 | 
76 |     # Return filtered paragraphs
77 |     return paragraphs_return
78 | 
--------------------------------------------------------------------------------
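Putting the helpers above together: the following is a minimal usage sketch (not a file in the repo) showing how `wiki_search` from `wiki_helper.py` and `GPTchatClass` from `gpt_helper.py` combine into the Wikipedia-summarizing bot of `demo_webcrawl_02_wikibot.ipynb`. The key path, model name, and example entity are the ones assumed elsewhere in the notebooks.

# Minimal sketch, assuming the key file and gpt-3.5-turbo model used in the notebooks.
from gpt_helper import set_openai_api_key_from_txt, GPTchatClass
from wiki_helper import wiki_search
from util import printmd

# Load the OpenAI API key from the (git-ignored) key file and set up the chat agent.
set_openai_api_key_from_txt(key_path='../key/rilab_key.txt')
GPT = GPTchatClass(
    gpt_model='gpt-3.5-turbo',
    role_msg='You are a helpful assistant summarizing information and answering user queries.')

# Fetch filtered Wikipedia paragraphs for a query, then summarize each one with GPT.
paragraphs = wiki_search(entity='President of South Korea', VERBOSE=True)
for p in paragraphs:
    summary = GPT.chat(
        user_msg='Could you summarize the following paragraph into one sentence?\n' + p,
        PRINT_USER_MSG=False, PRINT_GPT_OUTPUT=False,
        RESET_CHAT=True, RETURN_RESPONSE=True)
    printmd(summary)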