├── LICENSE ├── OASST_tutorial.ipynb ├── README.md ├── chat-oasst-api.py └── oasstapiv1.py /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2023 Harrison 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /OASST_tutorial.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "attachments": {}, 5 | "cell_type": "markdown", 6 | "metadata": {}, 7 | "source": [ 8 | "# Open Assistant as a Local ChatGPT API\n", 9 | "\n", 10 | "Video tutorial:\n", 11 | "[![Open Assistant as a Local ChatGPT API](https://img.youtube.com/vi/kkTNg_UOCNE/0.jpg)](https://www.youtube.com/watch?v=kkTNg_UOCNE)\n", 12 | "\n", 13 | "\n", 14 | "Welcome everyone to a bit of a showcasing and how-to with Open Assistant's Pythia 12 billion parameter model. This model is meant to be a chat assistant, like ChatGPT, but runnable locally. The model uses 48GB of memory, or 24GB at half precision.\n", 15 | "\n", 16 | "This model is in live development and training, so you will want to keep an eye out for new releases. I started playing with this model's first variant (https://huggingface.co/OpenAssistant/oasst-sft-1-pythia-12b) and the next time I checked for an update, there was a 4th iteration available (https://huggingface.co/OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5). \n", 17 | "\n", 18 | "\n", 19 | "Being a local model, I'd like to also show how to essentially set up your own local API, which makes doing your own R&D and testing much quicker and easier. To start though, let's check out a super basic example. \n", 20 | "\n", 21 | "At their most basic level, these large language GPT models just simply generate text sequentially. An example input might be:\n", 22 | "\n", 23 | "\"<|prompter|>What is a meme, and what's the history behind this word?<|endoftext|><|assistant|>\"\n", 24 | "\n", 25 | "And the output might be.\n", 26 | "\n", 27 | "\"A meme is a cultural idea, behavior, or style that spreads from person to person within a\"\n", 28 | "\n", 29 | "We can then wrap this in some basic logic to handle for the special tokens of <|prompter|>, <|endoftext|>, and <|assistant|> to get a more human readable output to give the chat and response feel. 
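For instance, that wrapping logic could look something like the sketch below. This is illustrative only and not code from this repo; the helper names `build_prompt` and `extract_reply` are mine, while the token strings are the ones the tutorial describes:

```python
USERTOKEN = "<|prompter|>"
ENDTOKEN = "<|endoftext|>"
ASSISTANTTOKEN = "<|assistant|>"

def build_prompt(user_message, history=""):
    # Prepend any prior conversation, mark the user turn, then hand off to the assistant
    return history + USERTOKEN + user_message + ENDTOKEN + ASSISTANTTOKEN

def extract_reply(generated_text):
    # Keep only the text after the last assistant tag, up to the next end-of-text tag
    return generated_text.split(ASSISTANTTOKEN)[-1].split(ENDTOKEN)[0]
```

This is essentially the same logic the chat client later in this repo uses to build context and pull out the latest assistant reply.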
\n", 30 | "\n", 31 | "Let's dive in!\n" 32 | ] 33 | }, 34 | { 35 | "cell_type": "code", 36 | "execution_count": null, 37 | "metadata": {}, 38 | "outputs": [], 39 | "source": [ 40 | "# OPTIONAL TO RUN ON A SPECIFIC GPU:\n", 41 | "import os\n", 42 | "os.environ[\"CUDA_VISIBLE_DEVICES\"] = \"0\"" 43 | ] 44 | }, 45 | { 46 | "attachments": {}, 47 | "cell_type": "markdown", 48 | "metadata": {}, 49 | "source": [ 50 | "First, we'll import AutoTokenizer & AutoModelForCausalLM, which will allow us to load the model and tokenizer from the HuggingFace model hub. We'll also import torch, which we'll use to handle the model's output in a bit." 51 | ] 52 | }, 53 | { 54 | "cell_type": "code", 55 | "execution_count": null, 56 | "metadata": {}, 57 | "outputs": [], 58 | "source": [ 59 | "from transformers import AutoTokenizer, AutoModelForCausalLM\n", 60 | "import torch\n", 61 | "\n", 62 | "MODEL_NAME = \"OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5\"\n", 63 | "\n", 64 | "# load model and tokenizer\n", 65 | "tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)\n", 66 | "model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)" 67 | ] 68 | }, 69 | { 70 | "attachments": {}, 71 | "cell_type": "markdown", 72 | "metadata": {}, 73 | "source": [ 74 | "This will load the model and tokenizer into memory. If this is the first time you're running this with that specific model, it will take a bit to download the model and tokenizer. After that, it should take ~ a minute or so to load into memory. Once you have the model downloaded and loaded, you can optionally move it to your GPU if possible. In this example, I am also using the half precision version of the model, which is a bit faster and uses half the memory:" 75 | ] 76 | }, 77 | { 78 | "cell_type": "code", 79 | "execution_count": null, 80 | "metadata": {}, 81 | "outputs": [], 82 | "source": [ 83 | "# Move the model to GPU and set it to half precision (float16)\n", 84 | "model = model.half().cuda()" 85 | ] 86 | }, 87 | { 88 | "attachments": {}, 89 | "cell_type": "markdown", 90 | "metadata": {}, 91 | "source": [ 92 | "Now we'll start with some input. This could be any text you want, but it probably makes the most sense to structure it how this model was trained, with the special tokens of <|prompter|>, <|endoftext|>, and <|assistant|>. \n", 93 | "\n", 94 | "Imagine that we want to ask this model \"What color is the sky?\"\n", 95 | "\n", 96 | "The way to build this prompt would be to be more like:\n", 97 | "\n", 98 | "\"<|prompter|>What color is the sky?<|endoftext|><|assistant>\"\n", 99 | "\n", 100 | "It feels a bit weird to use this end of ext tag followed by assistant tag, seems maybe redundant, but that's in the example provided by OpenAssistant on their \n", 101 | "HF page, so I assume every string before another \"speaker\" is terminated with that tag. By ending with the <|assistant> tag, we're making very clear to the model that a continue generation would be starting with the assistance's response to that input. The output from the model will likely be a continued generation, something like:\n", 102 | "\n", 103 | "\"<|prompter|>What color is the sky?<|endoftext|><|assistant> The sky is often blue.<|endoftext|>\"\n", 104 | "\n", 105 | "You may find that after that end text tag, another prompter tag is generated and more text is continued to be generated by the model. 
You can either handle this with some Python logic to stop at the end-of-text tag, or you can rely on the transformers package stopping generation at that token (we pass eos_token_id to generate below).\n", 106 | "\n", 107 | "Let's see how to do this in Python:" 108 | ] 109 | }, 110 | { 111 | "cell_type": "code", 112 | "execution_count": null, 113 | "metadata": {}, 114 | "outputs": [], 115 | "source": [ 116 | "inp = \"What color is the sky?\"\n", 117 | "\n", 118 | "input_ids = tokenizer.encode(inp, return_tensors=\"pt\")\n", 119 | "\n", 120 | "# Move the input to GPU (ONLY do this if you're using the GPU for your model.)\n", 121 | "input_ids = input_ids.cuda()" 122 | ] 123 | }, 124 | { 125 | "attachments": {}, 126 | "cell_type": "markdown", 127 | "metadata": {}, 128 | "source": [ 129 | "First, we specify some text input, then we tokenize that input with the model's tokenizer. From here, we move the tokenized input to the GPU, if we're using one. \n", 130 | "\n", 131 | "Next, we're going to use torch's automatic mixed precision (AMP) autocast context manager, which automatically sets operation datatypes. Within AMP's autocast context, we'll generate output with the model:" 132 | ] 133 | }, 134 | { 135 | "cell_type": "code", 136 | "execution_count": null, 137 | "metadata": {}, 138 | "outputs": [], 139 | "source": [ 140 | "# Using automatic mixed precision\n", 141 | "with torch.cuda.amp.autocast():\n", 142 | " # generate text until the output length (which includes the original input/context's length) reaches max_length. do_sample for random sampling vs greedy\n", 143 | " output = model.generate(input_ids, max_length=2048, do_sample=True, early_stopping=True, num_return_sequences=1, eos_token_id=model.config.eos_token_id)" 144 | ] 145 | }, 146 | { 147 | "attachments": {}, 148 | "cell_type": "markdown", 149 | "metadata": {}, 150 | "source": [ 151 | "Now we've got some output, but it's on the GPU. 
Let's move it to the CPU so we can more easily access it:" 152 | ] 153 | }, 154 | { 155 | "cell_type": "code", 156 | "execution_count": null, 157 | "metadata": {}, 158 | "outputs": [], 159 | "source": [ 160 | "# Move the output back to CPU\n", 161 | "output = output.cpu()" 162 | ] 163 | }, 164 | { 165 | "attachments": {}, 166 | "cell_type": "markdown", 167 | "metadata": {}, 168 | "source": [ 169 | "Finally, we can use the tokenizer to decode the output into human readable text:" 170 | ] 171 | }, 172 | { 173 | "cell_type": "code", 174 | "execution_count": null, 175 | "metadata": {}, 176 | "outputs": [], 177 | "source": [ 178 | "# Decode the output\n", 179 | "output_text = tokenizer.decode(output[0], skip_special_tokens=False)\n", 180 | "print(output_text)" 181 | ] 182 | }, 183 | { 184 | "attachments": {}, 185 | "cell_type": "markdown", 186 | "metadata": {}, 187 | "source": [ 188 | "Full code up to this point:" 189 | ] 190 | }, 191 | { 192 | "cell_type": "code", 193 | "execution_count": null, 194 | "metadata": {}, 195 | "outputs": [], 196 | "source": [ 197 | "from transformers import AutoTokenizer, AutoModelForCausalLM\n", 198 | "import torch\n", 199 | "# OPTIONAL TO RUN ON A SPECIFIC GPU:\n", 200 | "import os\n", 201 | "\n", 202 | "MODEL_NAME = \"OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5\"\n", 203 | "os.environ[\"CUDA_VISIBLE_DEVICES\"] = \"0\"\n", 204 | "\n", 205 | "# load model and tokenizer\n", 206 | "tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)\n", 207 | "model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)\n", 208 | "\n", 209 | "# Move the model to GPU and set it to half precision (float16)\n", 210 | "model = model.half().cuda()\n", 211 | "\n", 212 | "inp = \"What color is the sky?\"\n", 213 | "\n", 214 | "input_ids = tokenizer.encode(inp, return_tensors=\"pt\")\n", 215 | "\n", 216 | "# Move the input to GPU (ONLY do this if you're using the GPU for your model.)\n", 217 | "input_ids = input_ids.cuda()\n", 218 | "\n", 219 | "# Using automatic mixed precision\n", 220 | "with torch.cuda.amp.autocast():\n", 221 | " # generate text until the output length (which includes the original input/context's length) reaches max_length\n", 222 | " output = model.generate(input_ids, max_length=2048, do_sample=True, early_stopping=True, num_return_sequences=1, eos_token_id=model.config.eos_token_id)\n", 223 | "\n", 224 | "# Move the output back to CPU\n", 225 | "output = output.cpu()\n", 226 | "# Decode the output\n", 227 | "output_text = tokenizer.decode(output[0], skip_special_tokens=False)\n", 228 | "print(output_text)" 229 | ] 230 | }, 231 | { 232 | "attachments": {}, 233 | "cell_type": "markdown", 234 | "metadata": {}, 235 | "source": [ 236 | "Okay, so that's a very basic example of how to use this model. Let's take a look at how to set up a local API to make this a bit easier to use and workwith. \n", 237 | "\n", 238 | "With an API, even just locally, we can speed up R&D time without needing to re-load the model to memory every run (though you could also just use a notebook or something in this case too!). Beyond that, we can also access this API from anywhere else on our network, or even the internet if we wanted, empowering whatever devices and computers we might want.\n", 239 | "\n", 240 | "For this, I am going to use Flask (pip install flask), but there are certainly many ways you could do this same thing. I'll start a new script, which I'll call `oasst_api.py`. 
We'll start with:" 241 | ] 242 | }, 243 | { 244 | "cell_type": "code", 245 | "execution_count": null, 246 | "metadata": {}, 247 | "outputs": [], 248 | "source": [ 249 | "from flask import Flask, request, jsonify\n", 250 | "from transformers import AutoTokenizer, AutoModelForCausalLM\n", 251 | "import torch\n", 252 | "import os\n", 253 | "\n", 254 | "\n", 255 | "app = Flask(__name__)\n", 256 | "os.environ[\"CUDA_VISIBLE_DEVICES\"] = \"2\"\n", 257 | "\n", 258 | "MODEL_NAME = \"OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5\"\n", 259 | "\n", 260 | "tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)\n", 261 | "model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)\n", 262 | "\n", 263 | "model = model.half().cuda()" 264 | ] 265 | }, 266 | { 267 | "attachments": {}, 268 | "cell_type": "markdown", 269 | "metadata": {}, 270 | "source": [ 271 | "Not too much new here from before yet, other than the flask imports and beginning app defintion. Now, all we need with our flask app is a very basic route to handle our input and output. We'll use the same logic as before, but we'll wrap it in a function, and then we'll use flask's jsonify to return the output as a json object.\n", 272 | "\n", 273 | "Starting with:" 274 | ] 275 | }, 276 | { 277 | "cell_type": "code", 278 | "execution_count": null, 279 | "metadata": {}, 280 | "outputs": [], 281 | "source": [ 282 | "@app.route('/generate', methods=['POST'])\n", 283 | "def generate():\n", 284 | " content = request.json" 285 | ] 286 | }, 287 | { 288 | "attachments": {}, 289 | "cell_type": "markdown", 290 | "metadata": {}, 291 | "source": [ 292 | "This view will take a post request, and that request will have a json object, which will contain our prompt. We can get the prompt with `content.get` and then we will tokenize and pass that to the GPU (if we're using one)." 293 | ] 294 | }, 295 | { 296 | "cell_type": "code", 297 | "execution_count": null, 298 | "metadata": {}, 299 | "outputs": [], 300 | "source": [ 301 | " inp = content.get(\"text\", \"\")\n", 302 | " input_ids = tokenizer.encode(inp, return_tensors=\"pt\")\n", 303 | " input_ids = input_ids.cuda()" 304 | ] 305 | }, 306 | { 307 | "attachments": {}, 308 | "cell_type": "markdown", 309 | "metadata": {}, 310 | "source": [ 311 | "Now we will query the model:" 312 | ] 313 | }, 314 | { 315 | "cell_type": "code", 316 | "execution_count": null, 317 | "metadata": {}, 318 | "outputs": [], 319 | "source": [ 320 | " with torch.cuda.amp.autocast():\n", 321 | " output = model.generate(input_ids, max_length=2048, do_sample=True, early_stopping=True, num_return_sequences=1, eos_token_id=model.config.eos_token_id)" 322 | ] 323 | }, 324 | { 325 | "attachments": {}, 326 | "cell_type": "markdown", 327 | "metadata": {}, 328 | "source": [ 329 | "Similar to before, we're using AMP's autocast and model.generate to get our output. 
From here, we just need to decode and return the output as a json object:" 330 | ] 331 | }, 332 | { 333 | "cell_type": "code", 334 | "execution_count": null, 335 | "metadata": {}, 336 | "outputs": [], 337 | "source": [ 338 | " decoded = tokenizer.decode(output[0], skip_special_tokens=False)\n", 339 | " return jsonify({'generated_text': decoded})" 340 | ] 341 | }, 342 | { 343 | "attachments": {}, 344 | "cell_type": "markdown", 345 | "metadata": {}, 346 | "source": [ 347 | "Finally, we can run the app:" 348 | ] 349 | }, 350 | { 351 | "cell_type": "code", 352 | "execution_count": null, 353 | "metadata": {}, 354 | "outputs": [], 355 | "source": [ 356 | "if __name__ == '__main__':\n", 357 | " app.run(host='0.0.0.0', port=5000)" 358 | ] 359 | }, 360 | { 361 | "attachments": {}, 362 | "cell_type": "markdown", 363 | "metadata": {}, 364 | "source": [ 365 | "Making our full code for `oasst_api.py` now:" 366 | ] 367 | }, 368 | { 369 | "cell_type": "code", 370 | "execution_count": null, 371 | "metadata": {}, 372 | "outputs": [], 373 | "source": [ 374 | "from flask import Flask, request, jsonify\n", 375 | "from transformers import AutoTokenizer, AutoModelForCausalLM\n", 376 | "import torch\n", 377 | "import os\n", 378 | "\n", 379 | "\n", 380 | "app = Flask(__name__)\n", 381 | "os.environ[\"CUDA_VISIBLE_DEVICES\"] = \"2\"\n", 382 | "\n", 383 | "MODEL_NAME = \"OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5\"\n", 384 | "\n", 385 | "tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)\n", 386 | "model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)\n", 387 | "\n", 388 | "model = model.half().cuda()\n", 389 | "\n", 390 | "\n", 391 | "@app.route('/generate', methods=['POST'])\n", 392 | "def generate():\n", 393 | " content = request.json\n", 394 | " inp = content.get(\"text\", \"\")\n", 395 | " input_ids = tokenizer.encode(inp, return_tensors=\"pt\")\n", 396 | " input_ids = input_ids.cuda()\n", 397 | "\n", 398 | " with torch.cuda.amp.autocast():\n", 399 | " output = model.generate(input_ids, max_length=2048, do_sample=True, early_stopping=True, num_return_sequences=1, eos_token_id=model.config.eos_token_id)\n", 400 | "\n", 401 | " decoded = tokenizer.decode(output[0], skip_special_tokens=False)\n", 402 | "\n", 403 | " return jsonify({'generated_text': decoded})\n", 404 | "\n", 405 | "if __name__ == '__main__':\n", 406 | " app.run(host='0.0.0.0', port=5000) # Set the host to '0.0.0.0' to make it accessible from your local network" 407 | ] 408 | }, 409 | { 410 | "attachments": {}, 411 | "cell_type": "markdown", 412 | "metadata": {}, 413 | "source": [ 414 | "Now, we can run this API on whatever machine we want to host the model, and then we can query this machine from whatever machine we want, provided it's on network. \n", 415 | "\n", 416 | "For example, I can create a new file, called `chat-oasst-api.py` to work with my new API. To start, some imports and constants:" 417 | ] 418 | }, 419 | { 420 | "cell_type": "code", 421 | "execution_count": null, 422 | "metadata": {}, 423 | "outputs": [], 424 | "source": [ 425 | "import requests\n", 426 | "import json\n", 427 | "import colorama\n", 428 | "\n", 429 | "SERVER_IP = \"10.0.0.18\" # Change this to the IP of your server that's hosting the API. 
This can be the same machine you're working on too.\n", 430 | "URL = f\"http://{SERVER_IP}:5000/generate\"\n", 431 | "\n", 432 | "USERTOKEN = \"<|prompter|>\"\n", 433 | "ENDTOKEN = \"<|endoftext|>\"\n", 434 | "ASSISTANTTOKEN = \"<|assistant|>\"" 435 | ] 436 | }, 437 | { 438 | "attachments": {}, 439 | "cell_type": "markdown", 440 | "metadata": {}, 441 | "source": [ 442 | "With the imports and constants out of the way, let's write a quick prompt function:" 443 | ] 444 | }, 445 | { 446 | "cell_type": "code", 447 | "execution_count": null, 448 | "metadata": {}, 449 | "outputs": [], 450 | "source": [ 451 | "def prompt(inp):\n", 452 | " data = {\"text\": inp}\n", 453 | " headers = {'Content-type': 'application/json'}\n", 454 | "\n", 455 | " response = requests.post(URL, data=json.dumps(data), headers=headers)\n", 456 | "\n", 457 | " if response.status_code == 200:\n", 458 | " return response.json()[\"generated_text\"]\n", 459 | " else:\n", 460 | " return \"Error: \" + str(response.status_code)" 461 | ] 462 | }, 463 | { 464 | "attachments": {}, 465 | "cell_type": "markdown", 466 | "metadata": {}, 467 | "source": [ 468 | "This function takes input, builds a dictionary which we'll convert to a json object, sets headers, and then sends a post request to our API. We'll use the requests package to do this. From here, we'll grab either the json response, or the error (returned as a string) if there is one. Now we just need some simple logic to handle the chat and its context:\n" 469 | ] 470 | }, 471 | { 472 | "cell_type": "code", 473 | "execution_count": null, 474 | "metadata": {}, 475 | "outputs": [], 476 | "source": [ 477 | "history = \"\"\n", 478 | "while True:\n", 479 | " inp = input(\">>> \")\n", 480 | " context = history + USERTOKEN + inp + ENDTOKEN + ASSISTANTTOKEN\n", 481 | " output = prompt(context)\n", 482 | " history = output\n", 483 | " just_latest_asst_output = output.split(ASSISTANTTOKEN)[-1].split(ENDTOKEN)[0]\n", 484 | " # color just_latest_asst_output green in print:\n", 485 | " print(colorama.Fore.GREEN + just_latest_asst_output + colorama.Style.RESET_ALL)" 486 | ] 487 | }, 488 | { 489 | "attachments": {}, 490 | "cell_type": "markdown", 491 | "metadata": {}, 492 | "source": [ 493 | "The full `chat-oasst-api.py` code:" 494 | ] 495 | }, 496 | { 497 | "cell_type": "code", 498 | "execution_count": null, 499 | "metadata": {}, 500 | "outputs": [], 501 | "source": [ 502 | "import requests\n", 503 | "import json\n", 504 | "import colorama\n", 505 | "\n", 506 | "SERVER_IP = \"10.0.0.18\"\n", 507 | "URL = f\"http://{SERVER_IP}:5000/generate\"\n", 508 | "\n", 509 | "USERTOKEN = \"<|prompter|>\"\n", 510 | "ENDTOKEN = \"<|endoftext|>\"\n", 511 | "ASSISTANTTOKEN = \"<|assistant|>\"\n", 512 | "\n", 513 | "def prompt(inp):\n", 514 | " data = {\"text\": inp}\n", 515 | " headers = {'Content-type': 'application/json'}\n", 516 | "\n", 517 | " response = requests.post(URL, data=json.dumps(data), headers=headers)\n", 518 | "\n", 519 | " if response.status_code == 200:\n", 520 | " return response.json()[\"generated_text\"]\n", 521 | " else:\n", 522 | " return \"Error: \" + str(response.status_code)\n", 523 | " \n", 524 | "history = \"\"\n", 525 | "while True:\n", 526 | " inp = input(\">>> \")\n", 527 | " context = history + USERTOKEN + inp + ENDTOKEN + ASSISTANTTOKEN\n", 528 | " output = prompt(context)\n", 529 | " history = output\n", 530 | " just_latest_asst_output = output.split(ASSISTANTTOKEN)[-1].split(ENDTOKEN)[0]\n", 531 | " # color just_latest_asst_output green in print:\n", 532 | " print(colorama.Fore.GREEN + just_latest_asst_output + colorama.Style.RESET_ALL)"
colorama.Style.RESET_ALL)" 533 | ] 534 | }, 535 | { 536 | "attachments": {}, 537 | "cell_type": "markdown", 538 | "metadata": {}, 539 | "source": [ 540 | "With this, we can fully interact with our model! \n", 541 | "\n", 542 | "Only one slight problem is the context is going to continue growing. The maximum context length for this model is 2048 tokens. That's quite a bit, but if you have a longer conversation, or you even just want to keep an ongoing one for days, this is going to be a problem. \n", 543 | "\n", 544 | "How you handle for context might vary. You could just trim context to keep it in some range. Remember: context includes the prompt as well as the generation. Your generation might want to be 200 tokens long, so this really means your prompt needs to be 1848 tokens or less.\n", 545 | "\n", 546 | "Besides a simple trimming past a certain amount of tokens, you could also get more complex by attempting to also summarize the context to \"compress\" it. I will skip that for now and go straight to a trim. In most cases, this will be fine. If you need to retain history more, then you might try a more complicated approach. \n", 547 | "\n", 548 | "You can also choose whether you want to add this logic to the API, or the client. I think handling for summarization would be done client-side, but a brute trimming of the context to handle for longer conversations can happen API-side I think. This really is up to you though. I'll edit the `oasst_api.py`, and start by adding the following constants:" 549 | ] 550 | }, 551 | { 552 | "cell_type": "code", 553 | "execution_count": null, 554 | "metadata": {}, 555 | "outputs": [], 556 | "source": [ 557 | "# Get max context length and the determine cushion for response\n", 558 | "MAX_CONTEXT_LENGTH = model.config.max_position_embeddings\n", 559 | "print(f\"Max context length: {MAX_CONTEXT_LENGTH}\")\n", 560 | "ROOM_FOR_RESPONSE = 512" 561 | ] 562 | }, 563 | { 564 | "attachments": {}, 565 | "cell_type": "markdown", 566 | "metadata": {}, 567 | "source": [ 568 | "This dynamically pulls the maximum context length from the model's attributes, and then we can opt for how much of a \"cushion\" we want to leave for a plausible generation. I've chosen 512, which is quite large and probably will never happen, but 2048-512=1536, which is still a lot of context!\n", 569 | "\n", 570 | "Now, within the `generate` function, we can add some logic to handle for context length:" 571 | ] 572 | }, 573 | { 574 | "cell_type": "code", 575 | "execution_count": null, 576 | "metadata": {}, 577 | "outputs": [], 578 | "source": [ 579 | " # Calc current size\n", 580 | " print(\"Context length is currently\", input_ids.shape[1], \"tokens. 
Allowed amount is\", MAX_CONTEXT_LENGTH-ROOM_FOR_RESPONSE, \"tokens.\")\n", 581 | " # determine if we need to trim\n", 582 | " if input_ids.shape[1] > (MAX_CONTEXT_LENGTH-ROOM_FOR_RESPONSE):\n", 583 | " print(\"Trimming a bit\")\n", 584 | " # trim as needed AT the first dimension\n", 585 | " input_ids = input_ids[:, -(MAX_CONTEXT_LENGTH-ROOM_FOR_RESPONSE):]" 586 | ] 587 | }, 588 | { 589 | "attachments": {}, 590 | "cell_type": "markdown", 591 | "metadata": {}, 592 | "source": [ 593 | "The full code for `oasst_api.py` is now:" 594 | ] 595 | }, 596 | { 597 | "cell_type": "code", 598 | "execution_count": null, 599 | "metadata": {}, 600 | "outputs": [], 601 | "source": [ 602 | "from flask import Flask, request, jsonify\n", 603 | "from transformers import AutoTokenizer, AutoModelForCausalLM\n", 604 | "import torch\n", 605 | "import os\n", 606 | "\n", 607 | "\n", 608 | "app = Flask(__name__)\n", 609 | "os.environ[\"CUDA_VISIBLE_DEVICES\"] = \"0\"\n", 610 | "\n", 611 | "MODEL_NAME = \"OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5\"\n", 612 | "\n", 613 | "tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)\n", 614 | "model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)\n", 615 | "\n", 616 | "# Get max context length and the determine cushion for response\n", 617 | "MAX_CONTEXT_LENGTH = model.config.max_position_embeddings\n", 618 | "print(f\"Max context length: {MAX_CONTEXT_LENGTH}\")\n", 619 | "ROOM_FOR_RESPONSE = 512\n", 620 | "\n", 621 | "model = model.half().cuda()\n", 622 | "\n", 623 | "\n", 624 | "@app.route('/generate', methods=['POST'])\n", 625 | "def generate():\n", 626 | " content = request.json\n", 627 | " inp = content.get(\"text\", \"\")\n", 628 | " input_ids = tokenizer.encode(inp, return_tensors=\"pt\")\n", 629 | "\n", 630 | " # Calc current size\n", 631 | " print(\"Context length is currently\", input_ids.shape[1], \"tokens. 
Allowed amount is\", MAX_CONTEXT_LENGTH-ROOM_FOR_RESPONSE, \"tokens.\")\n", 632 | " # determine if we need to trim\n", 633 | " if input_ids.shape[1] > (MAX_CONTEXT_LENGTH-ROOM_FOR_RESPONSE):\n", 634 | " print(\"Trimming a bit\")\n", 635 | " # trim as needed AT the first dimension\n", 636 | " input_ids = input_ids[:, -(MAX_CONTEXT_LENGTH-ROOM_FOR_RESPONSE):]\n", 637 | " \n", 638 | " input_ids = input_ids.cuda()\n", 639 | "\n", 640 | " with torch.cuda.amp.autocast():\n", 641 | " output = model.generate(input_ids, max_length=2048, do_sample=True, early_stopping=True, num_return_sequences=1, eos_token_id=model.config.eos_token_id)\n", 642 | "\n", 643 | " decoded = tokenizer.decode(output[0], skip_special_tokens=False)\n", 644 | "\n", 645 | " return jsonify({'generated_text': decoded})\n", 646 | "\n", 647 | "if __name__ == '__main__':\n", 648 | " app.run(host='0.0.0.0', port=5000) # Set the host to '0.0.0.0' to make it accessible from your local network" 649 | ] 650 | } 651 | ], 652 | "metadata": { 653 | "kernelspec": { 654 | "display_name": "Python 3.8.10 64-bit", 655 | "language": "python", 656 | "name": "python3" 657 | }, 658 | "language_info": { 659 | "codemirror_mode": { 660 | "name": "ipython", 661 | "version": 3 662 | }, 663 | "file_extension": ".py", 664 | "mimetype": "text/x-python", 665 | "name": "python", 666 | "nbconvert_exporter": "python", 667 | "pygments_lexer": "ipython3", 668 | "version": "3.8.10" 669 | }, 670 | "orig_nbformat": 4, 671 | "vscode": { 672 | "interpreter": { 673 | "hash": "916dbcbb3f70747c44a77c7bcd40155683ae19c65e1c03b4aa3499c5328201f1" 674 | } 675 | } 676 | }, 677 | "nbformat": 4, 678 | "nbformat_minor": 2 679 | } 680 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # OpenAssistant_API_Pythia_12B 2 | Creating and Using an Open Assistant API locally (Pythia 12B GPT model) 3 | 4 | - OASST_tutorial.ipynb - Breaks down how everything works. 5 | - oasstapiv1.py - Host a local API for the Open Assistant Pythia 12B model. 6 | - chat-oasst-api.py - File to interact with the local API. 
You will need to modify `SERVER_IP`. 7 | 8 | 9 | Video tutorial: 10 | 11 | [![Open Assistant as a Local ChatGPT API](https://img.youtube.com/vi/kkTNg_UOCNE/0.jpg)](https://www.youtube.com/watch?v=kkTNg_UOCNE) 12 | -------------------------------------------------------------------------------- /chat-oasst-api.py: -------------------------------------------------------------------------------- 1 | import requests 2 | import json 3 | import colorama 4 | 5 | SERVER_IP = "10.0.0.18" 6 | URL = f"http://{SERVER_IP}:5000/generate" 7 | 8 | USERTOKEN = "<|prompter|>" 9 | ENDTOKEN = "<|endoftext|>" 10 | ASSISTANTTOKEN = "<|assistant|>" 11 | 12 | def prompt(inp): 13 | data = {"text": inp} 14 | headers = {'Content-type': 'application/json'} 15 | 16 | response = requests.post(URL, data=json.dumps(data), headers=headers) 17 | 18 | if response.status_code == 200: 19 | return response.json()["generated_text"] 20 | else: 21 | return "Error: " + str(response.status_code) 22 | 23 | history = "" 24 | while True: 25 | inp = input(">>> ") 26 | context = history + USERTOKEN + inp + ENDTOKEN + ASSISTANTTOKEN 27 | output = prompt(context) 28 | history = output 29 | just_latest_asst_output = output.split(ASSISTANTTOKEN)[-1].split(ENDTOKEN)[0] 30 | # color just_latest_asst_output green in print: 31 | print(colorama.Fore.GREEN + just_latest_asst_output + colorama.Style.RESET_ALL) 32 | 33 | 34 | -------------------------------------------------------------------------------- /oasstapiv1.py: -------------------------------------------------------------------------------- 1 | from flask import Flask, request, jsonify 2 | from transformers import AutoTokenizer, AutoModelForCausalLM 3 | import torch 4 | import os 5 | 6 | 7 | app = Flask(__name__) 8 | os.environ["CUDA_VISIBLE_DEVICES"] = "0" 9 | 10 | MODEL_NAME = "OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5" 11 | 12 | tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME) 13 | model = AutoModelForCausalLM.from_pretrained(MODEL_NAME) 14 | 15 | # Get max context length and then determine the cushion for the response 16 | MAX_CONTEXT_LENGTH = model.config.max_position_embeddings 17 | print(f"Max context length: {MAX_CONTEXT_LENGTH}") 18 | ROOM_FOR_RESPONSE = 512 19 | 20 | model = model.half().cuda() 21 | 22 | 23 | @app.route('/generate', methods=['POST']) 24 | def generate(): 25 | content = request.json 26 | inp = content.get("text", "") 27 | input_ids = tokenizer.encode(inp, return_tensors="pt") 28 | 29 | # Calc current size 30 | print("Context length is currently", input_ids.shape[1], "tokens. Allowed amount is", MAX_CONTEXT_LENGTH-ROOM_FOR_RESPONSE, "tokens.") 31 | # determine if we need to trim 32 | if input_ids.shape[1] > (MAX_CONTEXT_LENGTH-ROOM_FOR_RESPONSE): 33 | print("Trimming a bit") 34 | # trim as needed AT the first dimension 35 | input_ids = input_ids[:, -(MAX_CONTEXT_LENGTH-ROOM_FOR_RESPONSE):] 36 | 37 | input_ids = input_ids.cuda() 38 | 39 | with torch.cuda.amp.autocast(): 40 | output = model.generate(input_ids, max_length=2048, do_sample=True, early_stopping=True, num_return_sequences=1, eos_token_id=model.config.eos_token_id) 41 | 42 | decoded = tokenizer.decode(output[0], skip_special_tokens=False) 43 | 44 | return jsonify({'generated_text': decoded}) 45 | 46 | if __name__ == '__main__': 47 | app.run(host='0.0.0.0', port=5000) # Set the host to '0.0.0.0' to make it accessible from your local network --------------------------------------------------------------------------------
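As a quick smoke test of the `/generate` endpoint once `oasstapiv1.py` is running, you can send a single one-off request. This is an illustrative sketch rather than a file from this repo; it assumes the server is already up and reachable at the `SERVER_IP` and port configured above:

```python
import json
import requests

SERVER_IP = "10.0.0.18"  # change to wherever oasstapiv1.py is running
URL = f"http://{SERVER_IP}:5000/generate"

# One fully formed prompt using the special tokens the model was trained on
payload = {"text": "<|prompter|>What color is the sky?<|endoftext|><|assistant|>"}

response = requests.post(URL, data=json.dumps(payload), headers={"Content-type": "application/json"})
response.raise_for_status()
print(response.json()["generated_text"])
```

On success this should print the full generated string, special tokens included, which is the same raw output the chat client above parses with its split logic.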