├── README.md
├── awesome_projects.md
├── awesome_resources.md
├── LICENSE
└── tutorials
    ├── dealing_with_pdfs.ipynb
    ├── local_inference.ipynb
    └── sampling.ipynb
/README.md:
--------------------------------------------------------------------------------
1 | # pimps_knowledge_sharing
2 | Repository for the PIMPS section's awesome lists, tutorials, and code snippets
--------------------------------------------------------------------------------
/awesome_projects.md:
--------------------------------------------------------------------------------
1 | # Awesome Projects
2 | 
3 | Here are some awesome projects that we have started as part of this section:
4 | 
5 | 
6 | TinderBot - https://www.youtube.com/watch?v=-JpUdtiHWHc
7 | 
8 | WhisperWatcher - https://github.com/krystianMoras/observing_transcriber
9 | 
--------------------------------------------------------------------------------
/awesome_resources.md:
--------------------------------------------------------------------------------
1 | # Awesome resources
2 | 
3 | ## Translation
4 | 
5 | Polish - English : https://huggingface.co/Helsinki-NLP/opus-mt-pl-en
6 | 
7 | ## OCR
8 | 
9 | ### TrOCR (Microsoft)
10 | Handwritten - https://huggingface.co/microsoft/trocr-large-handwritten
11 | Printed - https://huggingface.co/microsoft/trocr-large-printed
12 | 
13 | ### Meta AI
14 | 
15 | Nougat - https://huggingface.co/facebook/nougat-base
16 | 
17 | 
18 | 
19 | ## Embeddings
20 | 
21 | Benchmark - https://huggingface.co/spaces/mteb/leaderboard
22 | 
23 | 
24 | ### Articles
25 | 
26 | Embeddings in a nutshell (in Polish) - https://kaszkowiak.org/blog/embeddingi/
27 | 
28 | 
29 | Sentence Transformers guide - https://www.sbert.net/examples/applications/computing-embeddings/README.html
30 | 
31 | ## RAG
32 | 
33 | LangChain - https://github.com/langchain-ai/langchain
34 | 
35 | ### Articles
36 | RAG, or how to talk to our documents? (in Polish)
- https://kaszkowiak.org/blog/retrieval-augmented-generation/ 37 | 38 | Retrieve & re-rank explained - https://www.sbert.net/examples/applications/retrieve_rerank/README.html#retrieve-re-rank-pipeline 39 | 40 | ## PDF 41 | 42 | PyMuPDF (AGPL license) - https://pypi.org/project/PyMuPDF/ 43 | 44 | PyPDF - https://pypi.org/project/pypdf/ 45 | 46 | ## Local Inference 47 | 48 | ### LLMs 49 | Ollama - https://github.com/ollama/ollama 50 | 51 | Llama.cpp - https://github.com/ggerganov/llama.cpp 52 | 53 | ### Whisper 54 | 55 | Faster whisper - https://github.com/SYSTRAN/faster-whisper -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Apache License 2 | Version 2.0, January 2004 3 | http://www.apache.org/licenses/ 4 | 5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 6 | 7 | 1. Definitions. 8 | 9 | "License" shall mean the terms and conditions for use, reproduction, 10 | and distribution as defined by Sections 1 through 9 of this document. 11 | 12 | "Licensor" shall mean the copyright owner or entity authorized by 13 | the copyright owner that is granting the License. 14 | 15 | "Legal Entity" shall mean the union of the acting entity and all 16 | other entities that control, are controlled by, or are under common 17 | control with that entity. For the purposes of this definition, 18 | "control" means (i) the power, direct or indirect, to cause the 19 | direction or management of such entity, whether by contract or 20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 21 | outstanding shares, or (iii) beneficial ownership of such entity. 22 | 23 | "You" (or "Your") shall mean an individual or Legal Entity 24 | exercising permissions granted by this License. 
25 | 26 | "Source" form shall mean the preferred form for making modifications, 27 | including but not limited to software source code, documentation 28 | source, and configuration files. 29 | 30 | "Object" form shall mean any form resulting from mechanical 31 | transformation or translation of a Source form, including but 32 | not limited to compiled object code, generated documentation, 33 | and conversions to other media types. 34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below). 39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. 
For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. 
If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. 
You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. 
You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "[]" 182 | replaced with your own identifying information. 
(Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright [yyyy] [name of copyright owner] 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 193 | You may obtain a copy of the License at 194 | 195 | http://www.apache.org/licenses/LICENSE-2.0 196 | 197 | Unless required by applicable law or agreed to in writing, software 198 | distributed under the License is distributed on an "AS IS" BASIS, 199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 200 | See the License for the specific language governing permissions and 201 | limitations under the License. 202 | -------------------------------------------------------------------------------- /tutorials/dealing_with_pdfs.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# How to process PDFs?" 
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "There is a very nice package, PyMuPDF, but unfortunately it is AGPL-licensed:\n",
15 | "https://pymupdf.readthedocs.io/en/latest/about.html#license-and-copyright\n"
16 | ]
17 | },
18 | {
19 | "cell_type": "markdown",
20 | "metadata": {},
21 | "source": [
22 | "Instead, I'll use pypdf."
23 | ]
24 | },
25 | {
26 | "cell_type": "code",
27 | "execution_count": null,
28 | "metadata": {},
29 | "outputs": [],
30 | "source": [
31 | "%pip install pypdf --quiet"
32 | ]
33 | },
34 | {
35 | "cell_type": "code",
36 | "execution_count": 1,
37 | "metadata": {},
38 | "outputs": [
39 | {
40 | "name": "stdout",
41 | "output_type": "stream",
42 | "text": [
43 | "['C:\\\\Users\\\\kryst\\\\Documents\\\\Stuff\\\\University\\\\Cybersecurity\\\\Application Security\\\\Security_slides08_Apps.pdf', 'C:\\\\Users\\\\kryst\\\\Documents\\\\Stuff\\\\University\\\\Cybersecurity\\\\Crypto for Cybersecurity\\\\Security_slides03_Crypto.pdf', 'C:\\\\Users\\\\kryst\\\\Documents\\\\Stuff\\\\University\\\\Cybersecurity\\\\Firewalls\\\\Security_slides07_Firewall.pdf', 'C:\\\\Users\\\\kryst\\\\Documents\\\\Stuff\\\\University\\\\Cybersecurity\\\\Fundamentals of Cybersecurity\\\\Security_slides02_Basics.pdf', 'C:\\\\Users\\\\kryst\\\\Documents\\\\Stuff\\\\University\\\\Cybersecurity\\\\Higher-Security Environments\\\\Security_slides10_HighSec.pdf', 'C:\\\\Users\\\\kryst\\\\Documents\\\\Stuff\\\\University\\\\Cybersecurity\\\\Introduction\\\\Security_slides01_Intro.pdf', 'C:\\\\Users\\\\kryst\\\\Documents\\\\Stuff\\\\University\\\\Cybersecurity\\\\Network Security\\\\Security_slides05_Network.pdf', 'C:\\\\Users\\\\kryst\\\\Documents\\\\Stuff\\\\University\\\\Cybersecurity\\\\Operating Systems Security\\\\Security_slides04_OS.pdf', 'C:\\\\Users\\\\kryst\\\\Documents\\\\Stuff\\\\University\\\\Cybersecurity\\\\Secure Programming\\\\Security_slides09_Programming.pdf',
'C:\\\\Users\\\\kryst\\\\Documents\\\\Stuff\\\\University\\\\Cybersecurity\\\\Security Management\\\\Security_slides11_Mgmnt.pdf', 'C:\\\\Users\\\\kryst\\\\Documents\\\\Stuff\\\\University\\\\Cybersecurity\\\\VPN tunnels\\\\Security_slides06_VPN.pdf', 'C:\\\\Users\\\\kryst\\\\Documents\\\\Stuff\\\\University\\\\Evolutionary Computation\\\\EC.pdf', 'C:\\\\Users\\\\kryst\\\\Documents\\\\Stuff\\\\University\\\\Evolutionary Computation\\\\Evolutionary-and-population-algorithms.pdf', 'C:\\\\Users\\\\kryst\\\\Documents\\\\Stuff\\\\University\\\\Evolutionary Computation\\\\Quantum-computing.pdf', 'C:\\\\Users\\\\kryst\\\\Documents\\\\Stuff\\\\University\\\\Evolutionary Computation\\\\Theory.pdf', 'C:\\\\Users\\\\kryst\\\\Documents\\\\Stuff\\\\University\\\\Semantic Web and Social Networks\\\\Knowledge Graphs\\\\Knowledge graph representation learning - en.pdf', 'C:\\\\Users\\\\kryst\\\\Documents\\\\Stuff\\\\University\\\\Semantic Web and Social Networks\\\\Knowledge Graphs\\\\Knowledge_graphs_en .pdf', 'C:\\\\Users\\\\kryst\\\\Documents\\\\Stuff\\\\University\\\\Semantic Web and Social Networks\\\\Ontologies\\\\RDFS_OWL_en(1).pdf', 'C:\\\\Users\\\\kryst\\\\Documents\\\\Stuff\\\\University\\\\Semantic Web and Social Networks\\\\Semantic networks\\\\Lecture 1 - semantic_networks_RDF_en(2).pdf', 'C:\\\\Users\\\\kryst\\\\Documents\\\\Stuff\\\\University\\\\Semantic Web and Social Networks\\\\Social Networks\\\\social_networks_01_linked.pdf', 'C:\\\\Users\\\\kryst\\\\Documents\\\\Stuff\\\\University\\\\Semantic Web and Social Networks\\\\Social Networks\\\\social_networks_02_measuring_networks.pdf', 'C:\\\\Users\\\\kryst\\\\Documents\\\\Stuff\\\\University\\\\Semantic Web and Social Networks\\\\SPARQL query language\\\\SPARQL - lecture - en(1).pdf']\n" 44 | ] 45 | } 46 | ], 47 | "source": [ 48 | "import pypdf\n", 49 | "import os\n", 50 | "# read pdfs\n", 51 | "\n", 52 | "directory_path = r\"C:\\Users\\kryst\\Documents\\Stuff\"\n", 53 | "\n", 54 | "pdfs = []\n", 55 | "\n", 56 | 
"# recursively find all pdfs in directory\n", 57 | "\n", 58 | "for root, dirs, files in os.walk(directory_path):\n", 59 | " for file in files:\n", 60 | " if file.endswith(\".pdf\"):\n", 61 | " pdfs.append(os.path.join(root, file))\n", 62 | "\n", 63 | "print(pdfs)" 64 | ] 65 | }, 66 | { 67 | "cell_type": "code", 68 | "execution_count": 47, 69 | "metadata": {}, 70 | "outputs": [ 71 | { 72 | "name": "stdout", 73 | "output_type": "stream", 74 | "text": [ 75 | "Note: you may need to restart the kernel to use updated packages.\n" 76 | ] 77 | }, 78 | { 79 | "name": "stderr", 80 | "output_type": "stream", 81 | "text": [ 82 | "WARNING: Ignoring invalid distribution ~pencv-python (c:\\Users\\kryst\\AppData\\Local\\Programs\\Python\\Python311\\Lib\\site-packages)\n", 83 | "DEPRECATION: Loading egg at c:\\users\\kryst\\appdata\\local\\programs\\python\\python311\\lib\\site-packages\\wordcloud-1.8.2.post4+g5dd8d3e-py3.11-win-amd64.egg is deprecated. pip 24.3 will enforce this behaviour change. A possible replacement is to use pip for package installation.. 
Discussion can be found at https://github.com/pypa/pip/issues/12330\n",
84 | "WARNING: Ignoring invalid distribution ~pencv-python (c:\\Users\\kryst\\AppData\\Local\\Programs\\Python\\Python311\\Lib\\site-packages)\n"
85 | ]
86 | }
87 | ],
88 | "source": [
89 | "%pip install cryptography --quiet --upgrade"
90 | ]
91 | },
92 | {
93 | "cell_type": "code",
94 | "execution_count": 2,
95 | "metadata": {},
96 | "outputs": [
97 | {
98 | "name": "stdout",
99 | "output_type": "stream",
100 | "text": [
101 | "1443\n"
102 | ]
103 | }
104 | ],
105 | "source": [
106 | "all_pages = []\n",
107 | "\n",
108 | "for pdf in pdfs:\n",
109 | "    # read pdf page by page\n",
110 | "    # PdfReader accepts a file path directly, so no manual file handle is needed\n",
111 | "    pdf_reader = pypdf.PdfReader(pdf)\n",
112 | "    # extract the text of every page\n",
113 | "    for i, page in enumerate(pdf_reader.pages):\n",
114 | "\n",
115 | "        content = page.extract_text()\n",
116 | "        all_pages.append(\n",
117 | "            {\n",
118 | "                \"path\": pdf,\n",
119 | "                \"page\": i,\n",
120 | "                \"content\": content\n",
121 | "            }\n",
122 | "        )\n",
123 | "\n",
124 | "# dump into jsonl\n",
125 | "import json\n",
126 | "\n",
127 | "with open(\"pdfs.jsonl\", \"w\") as f:\n",
128 | "    for page in all_pages:\n",
129 | "        f.write(json.dumps(page))\n",
130 | "        f.write(\"\\n\")\n",
131 | "print(len(all_pages))"
132 | ]
133 | },
134 | {
135 | "cell_type": "code",
136 | "execution_count": 3,
137 | "metadata": {},
138 | "outputs": [],
139 | "source": [
140 | "# https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them\n",
141 | "def get_approx_tokens(sequence):\n",
142 | "    return len(sequence) / 4"
143 | ]
144 | },
145 | {
146 | "cell_type": "code",
147 | "execution_count": 4,
148 | "metadata": {},
149 | "outputs": [],
150 | "source": [
151 | "from nltk import sent_tokenize\n",
152 | "\n",
153 | "def sliding_window(seq, n=512):\n",
154 | "    \"\"\"Returns a sliding window (of width n) over data from the iterable\"\"\"\n",
155 | "    seq = sent_tokenize(seq)\n",
156 | "\n",
157 | "    # sentences should not be broken up\n",
158 | "    \n",
159 | "    lengths = []\n",
160 | "    for sentence in seq:\n",
161 | "        sentence_token_len = get_approx_tokens(sentence)\n",
162 | "        lengths.append(sentence_token_len)\n",
163 | "\n",
164 | "    result = []\n",
165 | "    start = 0\n",
166 | "    for i in range(len(lengths)+1):\n",
167 | "        if sum(lengths[start:i]) > n:\n",
168 | "            result.append(\" \".join(seq[start:i-1]))\n",
169 | "            start += 1\n",
170 | "\n",
171 | "    # join the remaining sentences into the final window\n",
172 | "    result.append(\" \".join(seq[start:]))\n",
173 | "    return result\n",
174 | "\n",
175 | "\n",
176 | "\n",
177 | "# list(sliding_window(\"123456. 12345678. 123456789.\"))"
178 | ]
179 | },
180 | {
181 | "cell_type": "code",
182 | "execution_count": 5,
183 | "metadata": {},
184 | "outputs": [
185 | {
186 | "name": "stdout",
187 | "output_type": "stream",
188 | "text": [
189 | "1455\n"
190 | ]
191 | }
192 | ],
193 | "source": [
194 | "passages = []\n",
195 | "\n",
196 | "for page in all_pages:\n",
197 | "\n",
198 | "\n",
199 | "    # split page into passages\n",
200 | "    content = page[\"content\"]\n",
201 | "\n",
202 | "    content_split = sliding_window(content)\n",
203 | "\n",
204 | "    passages.extend(\n",
205 | "        [\n",
206 | "            {\n",
207 | "                \"path\": page[\"path\"],\n",
208 | "                \"page\": page[\"page\"],\n",
209 | "                \"passage\": passage\n",
210 | "            }\n",
211 | "            for passage in content_split\n",
212 | "        ]\n",
213 | "    )\n",
214 | "\n",
215 | "print(len(passages))"
216 | ]
217 | },
218 | {
219 | "cell_type": "code",
220 | "execution_count": 6,
221 | "metadata": {},
222 | "outputs": [],
223 | "source": [
224 | "from fastembed.embedding import FlagEmbedding as Embedding\n",
225 | "from tqdm import tqdm\n",
226 | "\n",
227 | "model = Embedding(model_name=\"BAAI/bge-base-en-v1.5\", max_length=512)\n"
228 | ]
229 | },
230 | {
231 | "cell_type": "code",
232 | "execution_count": 9,
233 | "metadata": {},
234 | "outputs": [],
235 | "source": [
236 | "\n",
237 | "# embed passages\n",
238 | "\n",
239 | "embeddings = 
list(model.passage_embed([passage[\"passage\"] for passage in passages]))\n",
240 | "\n",
241 | "for i, embedding in enumerate(embeddings):\n",
242 | "\n",
243 | "    passages[i][\"embedding\"] = embedding\n"
244 | ]
245 | },
246 | {
247 | "cell_type": "code",
248 | "execution_count": 10,
249 | "metadata": {},
250 | "outputs": [],
251 | "source": [
252 | "# save embeddings\n",
253 | "\n",
254 | "with open(\"passages.jsonl\", \"w\") as f:\n",
255 | "    for passage in passages:\n",
256 | "        passage[\"embedding\"] = passage[\"embedding\"].tolist()\n",
257 | "        f.write(json.dumps(passage))\n",
258 | "        f.write(\"\\n\")"
259 | ]
260 | },
261 | {
262 | "cell_type": "markdown",
263 | "metadata": {},
264 | "source": [
265 | "# What to do if we can't extract text?"
266 | ]
267 | },
268 | {
269 | "cell_type": "markdown",
270 | "metadata": {},
271 | "source": [
272 | "## Tesseract OCR"
273 | ]
274 | },
275 | {
276 | "cell_type": "code",
277 | "execution_count": 1,
278 | "metadata": {},
279 | "outputs": [
280 | {
281 | "name": "stdout",
282 | "output_type": "stream",
283 | "text": [
284 | "Note: you may need to restart the kernel to use updated packages.\n"
285 | ]
286 | },
287 | {
288 | "name": "stderr",
289 | "output_type": "stream",
290 | "text": [
291 | "WARNING: Ignoring invalid distribution ~pencv-python (c:\\Users\\kryst\\AppData\\Local\\Programs\\Python\\Python311\\Lib\\site-packages)\n",
292 | "DEPRECATION: Loading egg at c:\\users\\kryst\\appdata\\local\\programs\\python\\python311\\lib\\site-packages\\wordcloud-1.8.2.post4+g5dd8d3e-py3.11-win-amd64.egg is deprecated. pip 24.3 will enforce this behaviour change. A possible replacement is to use pip for package installation.. 
Discussion can be found at https://github.com/pypa/pip/issues/12330\n", 293 | "WARNING: Ignoring invalid distribution ~pencv-python (c:\\Users\\kryst\\AppData\\Local\\Programs\\Python\\Python311\\Lib\\site-packages)\n" 294 | ] 295 | } 296 | ], 297 | "source": [ 298 | "%pip install pytesseract --quiet" 299 | ] 300 | }, 301 | { 302 | "cell_type": "markdown", 303 | "metadata": {}, 304 | "source": [ 305 | "Install tesseract executable - https://tesseract-ocr.github.io/tessdoc/Installation.html" 306 | ] 307 | }, 308 | { 309 | "cell_type": "code", 310 | "execution_count": 3, 311 | "metadata": {}, 312 | "outputs": [], 313 | "source": [ 314 | "from PIL import Image\n", 315 | "\n", 316 | "import pytesseract\n", 317 | "\n", 318 | "\n", 319 | "pytesseract.pytesseract.tesseract_cmd = r\"C:\\Program Files\\Tesseract-OCR\\tesseract.exe\"\n" 320 | ] 321 | }, 322 | { 323 | "cell_type": "code", 324 | "execution_count": 13, 325 | "metadata": {}, 326 | "outputs": [ 327 | { 328 | "data": { 329 | "text/plain": [ 330 | "511142" 331 | ] 332 | }, 333 | "execution_count": 13, 334 | "metadata": {}, 335 | "output_type": "execute_result" 336 | } 337 | ], 338 | "source": [ 339 | "# download jpg\n", 340 | "\n", 341 | "import requests\n", 342 | "\n", 343 | "url = \"https://pomoc.ifirma.pl/wp-content/uploads/2022/11/paragon2-1.jpg\"\n", 344 | "\n", 345 | "r = requests.get(url, allow_redirects=True)\n", 346 | "\n", 347 | "open('paragon.jpg', 'wb').write(r.content)\n" 348 | ] 349 | }, 350 | { 351 | "cell_type": "code", 352 | "execution_count": 14, 353 | "metadata": {}, 354 | "outputs": [ 355 | { 356 | "name": "stdout", 357 | "output_type": "stream", 358 | "text": [ 359 | "LEROY-NERLIN POLSKA SP. 2 0.0.\n", 360 | "Ul. 
Burakowska 14, 01-066 Warszaua\n", 361 | "SKLEP LEROY MERLIN WROCtAU)\n", 362 | "53-015 WROCKAY, Al, Karkonoska 85\n", 363 | "53-015 Wroctay\n", 364 | "Karkonoska 865\n", 365 | "NIP 113-00-89-950\n", 366 | "\n", 367 | "n\n", 368 | "PARAGON FISKALNY\n", 369 | "82680523 BAT ZLEW ELASTIC CZARNY ELAST\n", 370 | "\n", 371 | "1\n", 372 | "43471624 ELEKTR OGRZEWACZ DAFI 4,5KW +P\n", 373 | "1 x\n", 374 | "\n", 375 | "r wydr. 176667/0622\n", 376 | "\n", 377 | "129,00 129,00 A\n", 378 | "199,00 199,00 8\n", 379 | "\n", 380 | "Sprzed.’aped. PiU fh ts aaa ae 328,00\n", 381 | "Kwota PTU A 23,003 |\n", 382 | "SUMA PTU D\n", 383 | "\n", 384 | "SUNA PLS ee 328,00\n", 385 | "\n", 386 | "ROZLICZENTE PLATNOSCI\n", 387 | "\n", 388 | "Gotéuka 328,00\n", 389 | "Wptacono razen 328,00\n", 390 | "000255/0622 #450 2U\n", 391 | "\n", 392 | "Nir Nabyy -\n", 393 | "2022-07-28 | 3 21:02\n", 394 | "A5B67ABE976FAB57435C4383BEF5145306370\n", 395 | "\n", 396 | "Mn\n", 397 | "\n", 398 | "Nr transakesi 013-001-000054 9420\n", 399 | "\n", 400 | "DZTEKUJEMY 2A ZAKUPY\n", 401 | "ZAPRASZANY PONOWNTE\n", 402 | "\n", 403 | "www. 
Jeroymerlin.Ppl\n", 404 | "\n" 405 | ] 406 | } 407 | ], 408 | "source": [ 409 | "import cv2\n", 410 | "\n", 411 | "img_cv = cv2.imread(r'paragon.jpg')\n", 412 | "\n", 413 | "# By default OpenCV stores images in BGR format and since pytesseract assumes RGB format,\n", 414 | "# we need to convert from BGR to RGB format/mode:\n", 415 | "img_rgb = cv2.cvtColor(img_cv, cv2.COLOR_BGR2RGB)\n", 416 | "print(pytesseract.image_to_string(img_rgb))" 417 | ] 418 | }, 419 | { 420 | "attachments": { 421 | "image.png": { 422 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAALEAAAAZCAYAAAB+Zs9GAAAgAElEQVR4nKWcebQldXXvP7+hpnPOvbcHQRARGRyQEEmcX8zDAX24YoCAKC3QgEAzizLKJAiooMyiIGiDgCDOmhiNIErUmPhETYLiygsvgkxND3c6Q1X9pvfHr6rubdC38l5qrbv6nnuqfsMevvu7929XixfvvksQAVSgu5zwBAFOgg8JMkBuQWEIwmIkGJnhkaTeIfB4EZ9NHAQBlY6fMxv/rZTsvpeAEyACpN4TUExUjpWgmKCCJzMJAo9VBic9gRQnBBAQeDLnkQGMlAQgCI8AhM+QAZQHozyV0iAqBm6IR1HyHDyKIiygMHgEAY0ICifAy4AXFu0hoLEUICxaLCCCJ/hVECRKjEBYCBonZLe2uEuPRxEEiCCBKE+BQ3sfBRI0QQgIkiDiM4iA9uCFx8p4X2ElHkmpErzwZN7E8YUEJNopwONkwAkIZIBFiTECj7Y9ApJaxXskHuUliVUAGG3wArSL67DKxnFthgrghSUICCGPcvcTwOKEJgiJdtFOjPIYFXUigiQ1BQJPkBO88HgyPBqJB6KuIOAJCCFQQaCUohqNEUIws3IFm7ZsQWuNSuJPWZbgA71ej8loTJIkeAKa33tFwROiSgSA8OBl8+HZl2icIAjwYulza9ximZP4Zc85EeeQwaOCRwiPCFEwIHEiGil4VJB4EQUV/758JA1BIEMUkJd044ogcSTNQmwUZJAQ1O/ZTztmM4Dwy2TRyKV9ppFP86HZp4j3Sbc0RnTb7ncZ2lmWCeUPXH7Zb/L33r50R1ydbdarAEUgOoXEEQDhVeM4vlFKO65sdNTofKu5QuOwHidA0jpeKx+20pMM7ZytqGS0gxC/78SHQEpBCIEQAuPxmJnpKbz3zM7O0uv1EEpS1zXlaMTU1BSjxSFVVVEUBePxGJ0m6GcpbUlk8a+BZwivvS9Ag9jCy25hTkbo1U43Bh0/q0bhYZnRBCGw7TzUyOCQweKRWBl91skEiSPxHhksCI8TEis0CIkUdUQ7349LYhTvISKdDp6IgQVB+IigQUJI8CSExrhCa6CNEgJR2UJUjQMvOQGAD7KThGgNO0iE1yAsod23zyBIbIT2aFyBxsgBIZ4l++VyDmIJSILwz7J7J1uDE4gQUKKKRkpCQOKkBxxCVKigEK4fVyzGIOrOyGSjHOWXOW+I+opRbgJIXPMTBKjQODgxEgUUInhk8I3RS2jkFITES492MRo4oQCJkpLxZMQg7xFCiGgLJEmCcw5TVyilSNOUqqro9/uMRiNSnVAUBcbZZ0tvya9FFAw+KqkxwGjc0WtlCMsMsw0QktAIJYajJWQWnVC2VkJEBReR2GsIGi9sVIBow2+DhJ0ziG6c0Hr6sl0EEfUtQojILhROKEQISBxetIa7XASy2TXdviRLTvxMIGz3RnOPbJVPDKnRuKKzi
2eBwdK1Neotiy4t0gcBxDG2XrMnCEcQAeE1IogoQxpHa3QRGvlLH3UiCFGXrb02uhGEJnLFp73wtPQNHAiLFwIrFUHE71t9RrmFCABiKSLLZ+zPy2XRGU9Zlgx6fayPTi2UojIGqSNJSNMUpRRJkmCMwVrbobAxBmCJTvhONEvGSGjDcbsKASikB6lcI+42aHTBpBGMb4ysRVq/1fd0s5jm8YiekMSRRIWiiooPGi8SQkhAGAIe0YbnoOMcou7QkJBGvkkbYkOzwwAhWRKo8M26ZLOuJQW2CC18ElFFlHGfPgMkoUXoEA1fNhYRZAWA6pDbNJ91VLNwTT4gOxT8vbYdWnYdLUSGVvmiiRA2GhKA0GgvkT5FSIMULq5fgEMjSNC+jRgTJAIrwIcUFZZFW+EbyuDxIq5fN2gb8xKFa5zIiwhuRmoIktS5xnA9QWiE06jGDqTw6KCwSIwUyKDIXGhATVCWJUophJIYZ+kNYrQwdY0WAu89wjlSnTCZTCiKgsFgQFmWCMQf4MRBwlZctVV6YzQCRHANt1SN0bUPi8YbGxMP7RQRHUSQhM7om3DUGEMb0JvZQASkU+2nGNaCWuLojRE0nGa5L201Q7xXNnw1rld0syznviCDXIYUDQqFJeQXPBMpoxGqZo8ORxBiyaibcVWIrhH/7JuEr52o5afL5d+uoI0Orgn5iiVnC5HWt0sSvmXm0bFCTKBbOcnGQV0TUUPkKNDeJyQtwQsdh6ehJ/EZ1aE0DT9uULhNYBsaqIiyWh516dQTusiUJAllXeGcQ2tNr9dj4+ZNTPWn+NKXvoRojDzLUqSUeO85+OCDGdcjVq5cyaQq0Z3shMd3oVrihWxCZOQwbWiCBILrQkwgifc2m46I04YikI1F+DYDR8QNL6uAiCBJvEQFTxAVpkVwryEoZPAE1SJchifyyiA9vg3jIY25rphEChQiYvomwVIhIphrsn7dJWV+2b9RGDLIpYSTOK/zRaMt06hBRVk0yhIYuuRtecIXNALbGYdtqgrLaYqnNSK/LFJFrhyEhBBRT/hWBzQ6WZaoyRrfeKcMGuUTpJB4ZTujIUi8EHgCQS7RHlDUMgUsKTUCh8I1cs5wIsEIjwyhq45MZCsj36StWeNNFSp4FKaRjCQI3a1bYRpHSPBIjLEkSYJQkslkgnWemZWrcLWhNobDDzssViG8x1rLXXfdSZIk9Ho95hcXSNP092YUjfCjcJQH6ZcIfAzfCrmMKxPip9btlgK5fFZSGLnX0mePbpTacrmlMWUQJE6ivWyE3SBS8MggEF4hQ6vsZaje+rugW2VLiQJp/Gky9va+Z3J1EURUvFji5aJzcvAiUpDonKHZZ0T6roqBa4y3FYJsItcyMrpVhaWbnSX4iIgpQ0N9QswZREP1Wvm25cF2D8rH8pdsOHmbkAZ044CNmIVo0LpJ0MiiQTaOthxIBR7pfaSTPlY/CDqacSN/7aLOa6nwQqI9aN8gt2xtIN6jmvUnSUJZliRZhtAKay02eIRuqivedyisG648Go3o9XpRqlLGEkYIgSzLcC4ghUbIyOGEd2gJQghq6wkipa48SggSFZCJZlLVKFkgVEZlxqS5jgjhJFIFfDCkKkWLlBBiMqKUIngZheEgYJFaUluBs5JEaoS1pNaQIgkuxTmFUB4hDK426JAgybAevDR4XZNJjbAC50KXhCoRkKLGB4P3PbzvYa1FpQGPb4wqI3gJCoR0mKpENQIPXgA1Sge8FwSvEULgg8UTuampA5qC4DXBS1Si8aEm0RZrRmhSnNUo2UfJtPnO44NDSkmSKsqyJISA1nH8EEKkE0JhTRXr8T5Q1RYtJKkUEAIhgBOaGknSzzC+woxHFCpBBYlquHgdKoTOGdchgoTW1FpQ40moUd7g1CpKO4WQfSSKYC2p8gizgPAT0BrnE4IRJCFHJH2cUNh6RKE9qQclMsYy58/f9nbuuukm9tj+uQThIM+pXIpUPXCOXAW0klgb0biua
5AxUkT7k0ipCV7gnCNLUrx1hBBI0xTjmkqYtZbBYEBdVSwsLNDrTTGa1KzY7rlcetEF7DylkRjee+75PPLEFlat2Jbrr7sNHx5h/e2f4e++/3Oyos8++7yJo444AhcCC3Nb+ODZZ1BNJoyMRckEMxkjtELqnABcfvnl7LDDDlgERab55pfu5AtfuJu0mEFnOZdcfD7br5rm42edxyOPPUUVJPnKlZjyKXCGmXwVlQm88S1v5KjjjsVoz+ann+LS096Lq2vyfo/KgcZgSo9MEoTU7LXXyznrrLPI5SIbNjzBBR+6hLnZIXu9fA/OPPN0TF1RlWM+esnFPPK//wOVZJHzisB4tMD09HNwAT5y2WW8YKfnU3vHcG6WS869gNktmznhPUfz2tf/d1Q6QFJz+01XcP8P7sOGgEpybHCU5YRBpinrilRPMSonCCRFP0fUgiTJ+OAHL2S3F7+EyniGc5v48LnvZ8uWOVauXM11111HIWf57PobuO/HD1AHhfUCqRM+fNmH2bY/xScv+jgP/frf0IOcxapkutAoBbUz/OV++3PkMYeyOFogKaZJpaDesplzzz2P1dvvylkfOBcZaiYLm/nI+eeyafNTFFmf4aiilyi0liglGVYlZSUpBil5EjjskHeyy/a78MGPXoHXCRYYZBkp4JxjNCnZbvVzueLCi/nEJefz5GO/Ay0pjWO/A/Zn7XuOorKWJ598kjNOP6upcTcVldDWEkUXDdsAJ3UiKasxQsUvEp2RJgWT0rBiZhVnnfZ+TjxuHY89/igv3+tPufiSyzjv3AsJ3pKlGusCe+yxB2sPO4T3vvdEDjvieB76zW85+bijqMtZrM9QyTS9TOHKBRAJOpuhKApuuunTrDn4HRyw3wF88ctfxnjN5uGEQ9ceHcNHWaLqEYm35PkKDCkiS/jEJ67mLW/aG43lbfu+gbPOPoMDD3wPihnOOv29CIYslkMWJ2NCXSM91HWPwcx2nHziYdxxyzWsefdafvObRznnnEtACk46+Rg+d9tnePfh63jggYc45YRjmZpSeAK9YgrhYCrvsbA4BwpOf9/pHPhXh3DwQWt55LFNHHrMkey6+078jze/lgvPPYP93nUEN6z/Mvu9+e0UFgw1Q7eACIv0koCzglTPMJlU5HmOlLGcBDAej7n4wos56MCDOWTNoTzx1AbWHXsMf/xHe3DlFVfzwQsuoi4riiynHJVNSA8cd+JJWKvxXrI4u4GedgTnSZM+k3GNqR1SWf7mW99gzZrDOfbYU3nXwWtZv/4LbFwYMT+a4+zTj+COW6/kkKOP5me/+jfOP+399KxgOCrQ+Q7gFMKW1PUWej1PnkokCp3lvOZ1r+ezn70Fb+OJ5w/vvY93HHIIDz38MFpq+kUPM5rQy1L6eZ/gQKDI85xvffs77L//AXzms7fEmpVrTiWX5RdL1HTrbFECGGNI05QsyxiPxwQpQGrmZxfI04zZjU+TSMWDD/6K4487mfn5IbY2kdFmKTvutBNPPP448/OLlMbyd/d8l9122YXnP3d7kjzjsCPWctONN7Lny/agqirqusYYgxCCtNcn6w+YlBCC5PX/7XW84pV/wqdu+CzGSvq9KWxdY51BC8n1V1/L7NwWvvGNb5AkCWd+4GweefQxelmPXzzwAJO6QiQCnaWcdtppfPULX2DXF7yQfPo5/PErXsNwbhPf+usvY61lm+13ZPVzt+Mv9tuf2blNfPee76B0ws677MYOz9uWbVetwHs45NDDuOO229nthTuT5zmTyQQhBFmaM71iJTMrVvO7xx/nkUcfZcuWLbzxTftQ9AbsvffejEaL1OUEKSX9fh/rIg8vsoyyHNPLC5RSGGNYWJiLSZcXVFWFUpreYIp+v8+GDU/yz7/8JSeeeAobN24mBMdkOKLfW0FtAn/yypfzilftxSevvxlfKzIlcW6M8zVaS6anV2GtxxhDr5dTVvF4edBL2Xffffjat+/lDW99OxuffIwfff8eg
vdMT61ixernsN2OO7LfX7ydu+/8PJ+/41Zuv2M9++77VmrrQGdMypq1aw7l8Ud+y9ObN5H3prnhmk+x/qZbufMLd7LDC3YgEYILP3A+N3/mVvJimg9deBFf/eLdXHnFxzo6kfUKin6vq/G3/Hfrk8+lq80mdHCeIsupqgrvJWmSM7EBEQQzMyvQQTDdK5jDU9t4CjMelQRrcLaMR4KVod9fRZFPYXPP/gftR6anmIw91o2RqibTfWwlSRJB0Y+c7z3r1nHQ0UdTjS0XnPh+CiU55qh38MMffpuHH38ak6xi40KNzgtWr+xx5Seu5Eff/Qqfv+0WihUrGVqLI40lOzfkVa/ag599/28xNoUEzGTIcG4zIQRmR2N2fcnLGM4t8vxtnssZF17G7LhibjTPtttvx3CxwjvJ1ddfxeKmTVTDnH7aJ+85in6O8jCeX4xolmX00x5XXnsdaqrPlk0b+eLtX0SFwEmnX8RlH7+aW9/6Dn5w71/zzduuw6eCzCsW5gyyWEkdakS9SJ4C3mJKh0wFvXyKetGS5xkBzSc+fTNep5TDjdz4qRvo5wPGdeSEEpiZWcm41CRFynve805+8pN72fDELNicfn8KITw6C2xenEXmffppn8XK4hOJTjKqesiBf7k3Iczy7e/dz+nvO52FoYVKcNsN17B5doGqWMH/2ryZY976Kr54+8f59le/hnOBZHoltc+oQ0qRJ+z1ol255zvfZGEywsiEU076ANtOpdx4yTqyHJJNNZecfxFq+nl86oqPcdk5H+Dpx/6dxSzFO49zDust4/GYsq7p9/uMywn/mUu2GV93+tEkFdY4RqMJeIutDXVdk2UZQih6vWnyPEcIyPOce++9l+FwzE033cR111xDphVVXeOcQyG46cYbOPzww3niiSdw1jI3N8epp57Kuw55N2uPOYYnN2/i4g9dypFrj2I4v8D6W29BF1NMrKeYHrDHni/jmquv5fZbb+fuO2+nlycsDodMak/tJUEoPnTe6Wza8Ch3f+3rGJGB89x2y2c45eST+N3vHiUojUxylFBccO55PPjgg1x5zdWoLMc1TSjXXnstP//nf+HKq69GSzBVzcJwkU9efwNHrD2MudnNFEVBCILJZMJxxx3HmkPfzcbNW/joZR9DpzmXXPxRNm7czCmnvpf/vvfenHnGafR7PYyzFL0B1rqmZORQSuCcRSmF1pr54SJ5nndHr8cccwxHHnkkG5/ezM2fvgFrapwLaBVPsebm5lA6Y91xJzIcLvDpT15PKjOEl8xueRqZBCamYnrFDMIHVBAM8j7OOKyXrFi9in3e9Hp+/eAv6E+tZDiq6OeaT117NT/+4U+45NKPsWl2ju2etx2T0UYOOuBtrDv5RJKpKUpTAh4hNa95zevIFHz3b75FkuWkeR+swpQWlWiMnZApjbDgfYLUBUWS4qoSb21jVwLnPUmWkWQpC6NhR6/+b1es3ISA977hZZLhcEzwgqI/IE1TnHOkaUKe5ywsDMmzHuWkZjIeIRFMhiOklJx57hkc9M6DOPHYU/jZT3/BprkN8UjUCBLymJ0mMBlXFNkAIQQylZgA373ne1z8kQ+xy267suNOe3D33X/LVddcw8rVPS798HnstdeerFt3GocdcQJrDnsXlRsiE8lgeoYk73P2Oeew607bcd2Vl7MwqRF5H18ZtLdU1YjK16ASfvXQb9hrzz359b/+K59Z/1lesNPOSJXwxONP8cKdd+BnD/wjt9x6J8/dbkcG/RQpa4reFMY2iFmNWRwNESqJ1QPrGRQr+aef/hI902fvt+7NdjM5115xOU+Pa6745HpeufufsG0xjdMwaxZIlSXDUdcGT4Jo6tHOObIso64NWqdMJiVCKNIs57777kMpxfOf//xYychSnLH0ej2ChJfs/lJe+qI/5q47vsKN13+S1atmuPb6KznsyDWkvWkmlUd4B3WJ8IEQBLX3vOLVf8bM9DZ865vfZmFujg2P/Tt/+uJt+
PlP7+GOb3yL7V+8J9vM5MzImvPP/iDHHvd++tu9kKvXf4ZjjzuURA/RCg599+HcfffdSB2dcXFxkUwnsZrgPb1Bn+FwSJamSCEYzi9gbVy/lHH/ZVnGjjWlGI1G9Pt9hFhWjH3GsebSIRbITGdY66krSwiQ5glJqqhHo1ifSwsqL3FOMDOYYnG0CZKKfHoG4+IJi3eG0JSKpgc99n/7X/KDH/09m4fzKBQnnXQKN93yWVZus5ptt92O8eIQKSWuNvS05uCDDuDk957AiaeewNojj2TNmkM56fijGY5mOeO8s7j5jtvwBI4/7kT+/A1v4bQzzybLCsbDEYcfehgv2nVXzjz9dObm5hAqwTtQUnLcunXcfMutvPRlu6Ow/MP932Nubo7v3ncfSkgO/Kv9mdu0kb//wX3Udc3999+PBNasWcNjT23gt48/RVVVnHrqqXzu83ey4847k+c9gnV470kSzWRxgde/9lVsePK3PPH4w0itWLnqObjasP3227O4WOFIMGhOft+p3PX59bzhz17JiplVlCaA1Njg8TaQyKRTUZJogqtw9YQ3v3Fvnty4hd89+STOVxhb4oSkciCV58QT1vGO/Q9i3VHrOG7d0WzevIHTzzmHG9ffRjmpOeWUU7jzi1/g9W/YGy88aarJ04QDD9iPH//oH9m0YQurp3v8w/3fY3Z2lr//8T8wuzjPwe96B5s2beTXD/6KwWCG2mk+9OHLeeDn/8LKlSuRBPZ86c5Iafj5gw/iVUJVGQZ5jqDEmhITJIuTgEwzpAQZSqYHBS/6oz0YGo9IYqksTVO0UoyHQ1bOrCA4g62rTh5b98a0MBwdQIsgEd6T6B7GeaQG5yakIkPjGQlBpXv4WtHLFTffdS1pIfGq5PBj38cB77aceMLJXHPV9axcuRKQ3P65W7nnnu9B2sPakiBrRNrD65yq2siqlTNcddUVTM0MCFrywAMPcPXlH0OkAilqlHP0pUK4mpHMqfMexmzBVo5TTj2Pj152Ca969Z8xnhj22ft1ZIngEzfcjPee2njOP/98nnriSXxQGFJcEKRmHikln//yl/jIVddhhGR+Ycg555yDHS3yxbu+ykcvvYIkn2LLpqe56NxLqWQP6QMCS6U0pUhRQlLkOVdfdQVTM9M4Ab/6lwe49vILqCrDj3/5EBddfhXOVhTKc+4Z7+PJocHS57cPP0r62j3Zeadtue9HhqAExseQnCvNZHFCL+9TFAWXX/5xVm/zHDySXzzwP7no4x+nv3IbPnnlp5ia7qOoOPTIdRzwTsu555/Hxk0jzGjCzGpH0lcshgKbrY69wwLmypJ09UomrkIqeNs+byEXnr/+5nfQMsVMYlJ521e/x7kf/gQ+XcXs3DznnnExz9lxNz780cvQeY5MM7y3nPH+k5Giz4F/sQ8//fG9bJwbIkRCXvQJCG684WqU9Eg/5MPX3simzfOc+YHzmF3cwC8euJ93HHUU7zz6KP7jyUc5/awzSaTCOk+hFG40Qnsfj2VEwHmLkhIvJdY7dNMM5J1HKIXYfbcXhiTNqBxMTE1vKsPUjl62iuuvuZZLLzybJ373GDODbRiOFqkYITPN2BqQmsRDcJ4sy7CujvAuBcYYlIonQ2maMhyOSJIErdIulAxHC0gpSFONcy4eqNS2oTHxlMbZgJSSlvZ478myIiY3EnywOOe6o8lEaSaTCVrrGHqThOFwiE6T2HRd1zFMNUX2VC6dXrW5gdYaa2PFxpi4nuFibMLWetn6hwsYV7N6VUE9/xRBJvj+LpQ1ZGER6WIb4dgERLENu+++Mxe/fy23334zX/vbH9GbWgUeyklNmqbdAUfcYxZDcJYBUNd1V0Gan59namoK7wzeWASQNPxxcXERnaVoralMHY90pcQYw7gq6ff7aK0ZTcb0ix44qCYlvV4vHjYAWZYxGo3I8xyIJb/BYABAVVWEEOj3++z2op05+ZTjueCC83h6wxZCkCS6wFrfnK5JkkQxv7CFV
atWUZYlUiq8A+8lSilcqAkhJqsA++67L29729s44YQT+MpXvsKaNWu2Onb++te/zn777YeUMh7OWRspifPxuDcvUoypmUwm9Ho5zhmuuOIKPnfbLfT6Ba7hzm1JLtUS5wwzK6aYlKMm2YtdRzFZkc0CHUpJBoN+NHThGY0X47m3lB0fap+L4ydYa0hS1R39ZnlCf1BQlmO8txhjGk4VqKoSCCwszCNlrIR77xiNhhRFTllO8N5R9HJ8cCSJJs8zvPdNAhGo6ypShMmYoigYjUbxhDF4lBYMpnpYWwOe0SiuPx4UGYTSCKUZzs9j6jFZluBwGG9QSnDmGe/jnA+cxZ133c137/k+z3ve82I5MwSSVCEVOG8wtiIvUhYW5xhM9bq/t3KYndtMXqSU1ZiqqqLc8NSmYlKOGcxMkSQa7x15nmGtwdganSimB32EgMlkjLcW7x3eW4QMGFt1cq5NiVQgZMAHS5ppRuNFnDfoROKDpaonPPTQQxx//PE8/fSmrmWyrkuMqej3C4ypcN50jtcaorUW3diOlKATxZve/Ea+8tUvc/Qx72E4WiTNEnSioEHi2lTkRcbC4jxJqkmzBOsMEBB/9JKdQ20dXmhEqgFH3usxmq1JlCZTBoXAVgKpFSMzTz4oKL3HOc9U3mPzpk0NSRfdObcQoTvrjh4e0VQpTQixGiClwNqaPM87FEiSDOdc171vjGk6nBJCiHNOT0+zuLhIkiRUVclgMGAymZAkCUpEDx8Ohx26tK+xeB+bjowx1NbSLwoksqnMhC4bDiF0XVUtCkJESSWTpfWraBD9XopigpApVYiJYJYYvKviYUMxYHEU6BUJdvwUiYbhBKamV1GWVYe2bb20qqouMpRlRE/vfYdWvV6PLVu2MBj0qSYleZp2vQdttKqtaSopca1I0bUX6AY88B5voxN7H8tcMcJJhNg6muZ5znA4RAjBYDBgNBrhnCVJY6WmVwywNlDkfbRO2bx5MzMzEdyKIqeuI1BVlUEISaIzjDFUdsL09HQXPdt+Ye8969evpygK6rqO7ZhCIKVk7dq1LC4uUhRF7MV5+R67hPGkQqY5SZEzHM0jpCYhQwmJryckWsX+0FThMMyPFphevZLh4hglBFP9QScg52LpqPW4VkFteK+qKJiqqhraEA89rI1hO0aBXid852LCKKWOqCEU8/PzTE9PN00htnu9RQiBa8bx3lNVFVNTUwyHQ6RScW2JpiiKJrRJnPFdvwLAZBIz4/F43BlBXH/arL/eav1SCpT0lON5HIIsX40LAu8mOFdBCKRZH0SGtzWEIXmiCSKlqn2XnbcG1B44tYrNsqxrBl8up5Z64ANSRQeQUpLnOYvDIYPBgMXFxa7CJEQEodYYlIqNNsER+1hC6MJ2i/BtVG1l2/VzhNC8baGxLtIzgcI5GI9K8jxHa90ZsDGGul5yRiEUdW3I85ysyJib29LZR13XXVQGusqZtbZbZ13XTE9PMx6PEQHUiuniol6/B1IxGgKKD4IAAAPzSURBVA1BQp6lBAfOGnpFijUGKcAFF1soZcCFRhhS4KyPA4qA1oq64Z0zMzOMx+OuA2k8HuOcJU2zToh5nnV0olVOW7Nuhe+cw5i64YqxiSjLUspy0oSwuqu1tiFLa90hmZSSNMuiMJztlGStJU2W+lRDcJ0DtMpvjcN7x3g8edb6vXVYZ1Fpgk4SqrIm+BiClZIdPVKA9/H1KuM8UmiCBx98I4e8Q+F2bOfcEnpq3TlVXcfo5X2UX2v0o9GIuqF6sGScSimEEGill/rpmj0XedHI13R/l1J28kvTmAPEKJl0Y8f1WXxwDYr7ppEpAREIwZPnGVVVkSQKnUiMrfE+0O/3MCbur6zGXcRu1yuE6BC4dazWFlpjHo/Hcb1CIt66z2uDNR4rAlIlpEVKXVvqSc2g18fUJcE6siRyZKHASU9latK8hxaayWhMr9fDujqim4sbM8YwGAyoqqoLBVmWMRpOOiG2Sczy5
KX1eOccQsZNtUlX27A0Go06FGtfXYk17bR7H6uua4qi6BKS9v62C8paizW+U3KSRKowGi+SZRlSyg51hRBIoZukZ2n9idLUpkSlSTy8qGPkCDJQ2UgVgg2YMob32kWkcVU0zCSLiWiL+K0C22P59m+TSZyzNaJWxt57klR16No6bYuYIYQOyVpjaJO3+A5blFmSJJ3jLEfgdqyiKJhMJp2+jDFRXsTnJ+MKYxzT09NUVdVFqojwioDr5pZSUlcxSte26gy2nbvVUQc0DbC0VLWVSUz8LeItb35NKOua/tQAYxxlVUWFoAjOo6REETohe+Ep64r+VOSh3gamBjMsDue7xQAdkrVF/LIsSdO08y6B6hK5VnitsNsworWO7Y7NfW2IgeUUYMlpWprQCiHLsm68FuGTJEFK2XForZYE5Fx8rafl2O2et1p/1TRxi0Y+QhBEbGf0wTJQmoBjbGtk43j9oiBUJmblWUrwglAZiixhXJVked6tIU1TRqNYyVkuh1a5ZVlSFMVSeMV16NUm3ELE3otWnrHqE09cW0NvHVcK3SXISwn50gluK+c2zLcO1LZO6kRSVRWD/nQsezV0oTV+52K1p6on3Qufrb6UUtC8udEi/3LuvzxiGBMTxPb3tsCgpES86U2vji+6LHvpEpb6hNrXTlrjaa/QtcX94b76/9QV/sDz4pnN4r+veZz/8vwCxatf/WpWrFjB3Nwc//TTn/yB+Z4xT/tuXYjvNhgVm8ZT17x7ppZeDVJh6f/jqEUauwVd83aGWHqF6f/neub7eb/vTGCr75/xwDNfH/p/vZZXo9r8JXjBTjvtxK677sp4PObhhx9mzz33ZDQadfINoUXd/6L9AP8HPOVz3gIdFBwAAAAASUVORK5CYII=" 423 | } 424 | }, 425 | "cell_type": "markdown", 426 | "metadata": {}, 427 | "source": [ 428 | "## Nougat /donut\n", 429 | "\n", 430 | "If you don't have a gpu better not use it :)\n", 431 | "\n", 432 | "![image.png](attachment:image.png)\n", 433 | "\n", 434 | "OCR model that parses straight into markdown format\n", 435 | "\n", 436 | "https://github.com/facebookresearch/nougat\n", 437 | "\n", 438 | "However it can hallucinate things" 439 | ] 440 | }, 441 | { 442 | "cell_type": "markdown", 443 | "metadata": {}, 444 | "source": [ 445 | "The following code is taken from https://huggingface.co/spaces/hf-vision/nougat-transformers" 446 | ] 447 | }, 448 | { 449 | "cell_type": "code", 450 | "execution_count": 16, 451 | "metadata": {}, 452 | "outputs": [], 453 | "source": [ 454 | "from PIL import Image\n", 455 | "from nougat.dataset.rasterize import rasterize_paper\n", 456 | "from transformers import NougatProcessor, VisionEncoderDecoderModel\n", 457 | "import torch\n", 458 | "import os\n", 459 | "\n", 460 | "processor = NougatProcessor.from_pretrained(\"facebook/nougat-small\")\n", 461 | 
"model = VisionEncoderDecoderModel.from_pretrained(\"facebook/nougat-small\")" 462 | ] 463 | }, 464 | { 465 | "cell_type": "code", 466 | "execution_count": 18, 467 | "metadata": {}, 468 | "outputs": [ 469 | { 470 | "data": { 471 | "text/plain": [ 472 | "'cpu'" 473 | ] 474 | }, 475 | "execution_count": 18, 476 | "metadata": {}, 477 | "output_type": "execute_result" 478 | } 479 | ], 480 | "source": [ 481 | "\n", 482 | "device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n", 483 | "model.to(device) \n", 484 | "device" 485 | ] 486 | }, 487 | { 488 | "cell_type": "code", 489 | "execution_count": 22, 490 | "metadata": {}, 491 | "outputs": [ 492 | { 493 | "data": { 494 | "text/plain": [ 495 | "3028" 496 | ] 497 | }, 498 | "execution_count": 22, 499 | "metadata": {}, 500 | "output_type": "execute_result" 501 | } 502 | ], 503 | "source": [ 504 | "import requests\n", "\n", "url = \"https://www.africau.edu/images/default/sample.pdf\"\n", 505 | "\n", 506 | "r = requests.get(url, allow_redirects=True)\n", 507 | "\n", 508 | "open('sample.pdf', 'wb').write(r.content)" 509 | ] 510 | }, 511 | { 512 | "cell_type": "code", 513 | "execution_count": 23, 514 | "metadata": {}, 515 | "outputs": [ 516 | { 517 | "data": { 518 | "text/plain": [ 519 | "('\\n\\n## Appendix A Simple PDF File\\n\\nThis is a small demonstration .pdf file -\\n\\njust for use in the Virtual Mechanics tutorials. More text. And more text. And more text. And more text. And more text.\\n\\nAnd more text. And more text.\\n\\n\\n\\n## 6 Conclusion\\n\\nIn this thesis, we have presented a new method for computing the performance of the proposed method. 
We have presented a new method for computing the performance of the proposed method.\\n\\n',\n", 520 | " 'c:\\\\Users\\\\kryst\\\\Documents\\\\PIMPS\\\\tutorials/output.md')" 521 | ] 522 | }, 523 | "execution_count": 23, 524 | "metadata": {}, 525 | "output_type": "execute_result" 526 | } 527 | ], 528 | "source": [ 529 | "from pathlib import Path\n", 530 | "\n", 531 | "def predict(image):\n", 532 | " # prepare PDF image for the model\n", 533 | " image = Image.open(image)\n", 534 | " pixel_values = processor(image, return_tensors=\"pt\").pixel_values\n", 535 | "\n", 536 | " outputs = model.generate(\n", 537 | " pixel_values.to(device),\n", 538 | " min_length=1,\n", 539 | " max_new_tokens=1500,\n", 540 | " bad_words_ids=[[processor.tokenizer.unk_token_id]],\n", 541 | " )\n", 542 | "\n", 543 | " page_sequence = processor.batch_decode(outputs, skip_special_tokens=True)[0]\n", 544 | " page_sequence = processor.post_process_generation(page_sequence, fix_markdown=False)\n", 545 | " return page_sequence\n", 546 | "\n", 547 | "\n", 548 | "def inference(pdf_file):\n", 549 | " file_name = pdf_file.name\n", 550 | "\n", 551 | " images = rasterize_paper(file_name, return_pil=True)\n", 552 | " sequence = \"\"\n", 553 | " # infer for every page and concat\n", 554 | " for image in images:\n", 555 | " sequence += predict(image)\n", 556 | "\n", 557 | "\n", 558 | " content = sequence.replace(r'\\(', '$').replace(r'\\)', '$').replace(r'\\[', '$$').replace(r'\\]', '$$')\n", 559 | " with open(f\"{os.getcwd()}/output.md\",\"w+\") as f:\n", 560 | " f.write(content)\n", 561 | " f.close()\n", 562 | "\n", 563 | " \n", 564 | " return content, f\"{os.getcwd()}/output.md\"\n", 565 | "\n", 566 | "\n", 567 | "inference(Path(\"sample.pdf\"))\n", 568 | "\n" 569 | ] 570 | }, 571 | { 572 | "cell_type": "markdown", 573 | "metadata": {}, 574 | "source": [ 575 | "## Visual Question Answering / Document Question Answering\n", 576 | "\n", 577 | "https://huggingface.co/impira/layoutlm-document-qa\n", 578 | 
"\n", 579 | "These models are trained to retrieve information while also understanding the image layout\n", 580 | "\n", 581 | "However, first we need to know which document to pick." 582 | ] 583 | }, 584 | { 585 | "cell_type": "markdown", 586 | "metadata": {}, 587 | "source": [ 588 | "## CLIP / BLIP models\n", 589 | "\n", 590 | "If we want to use natural language to search for images, we can first create captions for them\n", 591 | "\n", 592 | "https://huggingface.co/Salesforce/blip-image-captioning-large\n", 593 | "\n", 594 | "https://huggingface.co/spaces/pharmapsychotic/CLIP-Interrogator\n", 595 | "\n" 596 | ] 597 | }, 598 | { 599 | "cell_type": "markdown", 600 | "metadata": {}, 601 | "source": [] 602 | } 603 | ], 604 | "metadata": { 605 | "kernelspec": { 606 | "display_name": "Python 3", 607 | "language": "python", 608 | "name": "python3" 609 | }, 610 | "language_info": { 611 | "codemirror_mode": { 612 | "name": "ipython", 613 | "version": 3 614 | }, 615 | "file_extension": ".py", 616 | "mimetype": "text/x-python", 617 | "name": "python", 618 | "nbconvert_exporter": "python", 619 | "pygments_lexer": "ipython3", 620 | "version": "3.11.3" 621 | } 622 | }, 623 | "nbformat": 4, 624 | "nbformat_minor": 2 625 | } 626 | -------------------------------------------------------------------------------- /tutorials/local_inference.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Why run locally?\n", 8 | "\n", 9 | "1. privacy - we have no way to verify that the data we send to external APIs is not collected\n", 10 | "2. 
useful in offline situations\n", 11 | "\n", 12 | "## Local speech-to-text\n", 13 | "\n", 14 | "As of 2023, [Whisper](https://openai.com/research/whisper) remains the best open-source (and free) speech-to-text model\n", 15 | "\n", 16 | "OpenAI has provided a simple package to help with running their model \n", 17 | "\n", 18 | "https://github.com/openai/whisper\n", 19 | "\n", 20 | "However, it is rather slow because it runs on PyTorch, which is not very efficient for this kind of task\n", 21 | "\n", 22 | "As an alternative I can recommend this package:\n", 23 | "\n", 24 | "https://github.com/guillaumekln/faster-whisper\n", 25 | "\n", 26 | "It uses a different backend - [CTranslate2](https://github.com/OpenNMT/CTranslate2/#key-features)\n", 27 | "\n", 28 | "\n", 29 | "\n" 30 | ] 31 | }, 32 | { 33 | "cell_type": "code", 34 | "execution_count": 1, 35 | "metadata": {}, 36 | "outputs": [ 37 | { 38 | "name": "stdout", 39 | "output_type": "stream", 40 | "text": [ 41 | "Note: you may need to restart the kernel to use updated packages.\n" 42 | ] 43 | }, 44 | { 45 | "name": "stderr", 46 | "output_type": "stream", 47 | "text": [ 48 | "WARNING: Ignoring invalid distribution ~pencv-python (c:\\Users\\kryst\\AppData\\Local\\Programs\\Python\\Python311\\Lib\\site-packages)\n", 49 | "DEPRECATION: Loading egg at c:\\users\\kryst\\appdata\\local\\programs\\python\\python311\\lib\\site-packages\\wordcloud-1.8.2.post4+g5dd8d3e-py3.11-win-amd64.egg is deprecated. pip 24.3 will enforce this behaviour change. A possible replacement is to use pip for package installation.. 
Discussion can be found at https://github.com/pypa/pip/issues/12330\n", 50 | "WARNING: Ignoring invalid distribution ~pencv-python (c:\\Users\\kryst\\AppData\\Local\\Programs\\Python\\Python311\\Lib\\site-packages)\n" 51 | ] 52 | }, 53 | { 54 | "name": "stdout", 55 | "output_type": "stream", 56 | "text": [ 57 | "Note: you may need to restart the kernel to use updated packages.\n" 58 | ] 59 | }, 60 | { 61 | "name": "stderr", 62 | "output_type": "stream", 63 | "text": [ 64 | "WARNING: Ignoring invalid distribution ~pencv-python (c:\\Users\\kryst\\AppData\\Local\\Programs\\Python\\Python311\\Lib\\site-packages)\n", 65 | "DEPRECATION: Loading egg at c:\\users\\kryst\\appdata\\local\\programs\\python\\python311\\lib\\site-packages\\wordcloud-1.8.2.post4+g5dd8d3e-py3.11-win-amd64.egg is deprecated. pip 24.3 will enforce this behaviour change. A possible replacement is to use pip for package installation.. Discussion can be found at https://github.com/pypa/pip/issues/12330\n", 66 | "WARNING: Ignoring invalid distribution ~pencv-python (c:\\Users\\kryst\\AppData\\Local\\Programs\\Python\\Python311\\Lib\\site-packages)\n" 67 | ] 68 | } 69 | ], 70 | "source": [ 71 | "%pip install pytube --quiet\n", 72 | "%pip install faster-whisper --quiet" 73 | ] 74 | }, 75 | { 76 | "cell_type": "code", 77 | "execution_count": 8, 78 | "metadata": {}, 79 | "outputs": [ 80 | { 81 | "data": { 82 | "text/plain": [ 83 | "'c:\\\\Users\\\\kryst\\\\Documents\\\\PIMPS\\\\tutorials\\\\.\\\\rick.mp3'" 84 | ] 85 | }, 86 | "execution_count": 8, 87 | "metadata": {}, 88 | "output_type": "execute_result" 89 | } 90 | ], 91 | "source": [ 92 | "# let's download the video\n", 93 | "from pytube import YouTube\n", 94 | "\n", 95 | "YouTube(\"https://www.youtube.com/watch?v=dQw4w9WgXcQ\").streams.filter(only_audio=True)[0].download(output_path=\".\", filename=\"rick.mp3\")\n" 96 | ] 97 | }, 98 | { 99 | "cell_type": "code", 100 | "execution_count": 2, 101 | "metadata": {}, 102 | "outputs": [], 103 | "source": [ 104 
| "import faster_whisper\n", 105 | "\n", 106 | "model = faster_whisper.WhisperModel(\"small.en\", compute_type=\"float32\", device=\"cpu\")\n" 107 | ] 108 | }, 109 | { 110 | "cell_type": "markdown", 111 | "metadata": {}, 112 | "source": [ 113 | "What models are available?\n", 114 | "\n", 115 | "Any Whisper model compatible with the transformers library: https://github.com/guillaumekln/faster-whisper#model-conversion\n", 116 | "\n", 117 | "Here are models that are already quantized: https://huggingface.co/collections/guillaumekln/faster-whisper-64f9c349b3115b4f51434976\n", 118 | "\n", 119 | "For Polish, you could probably use one of these: https://huggingface.co/models?sort=trending&search=whisper+pl\n", 120 | "\n", 121 | "But keep in mind that the original Whisper is multilingual! It is of course not as good in other languages as in English, but it might still be useful." 122 | ] 123 | }, 124 | { 125 | "cell_type": "code", 126 | "execution_count": 8, 127 | "metadata": {}, 128 | "outputs": [], 129 | "source": [ 130 | "segments, info = model.transcribe(\"rick.mp3\")" 131 | ] 132 | }, 133 | { 134 | "cell_type": "code", 135 | "execution_count": 4, 136 | "metadata": {}, 137 | "outputs": [ 138 | { 139 | "data": { 140 | "text/plain": [ 141 | "TranscriptionInfo(language='en', language_probability=1, duration=212.1839375, duration_after_vad=212.1839375, all_language_probs=None, transcription_options=TranscriptionOptions(beam_size=5, best_of=5, patience=1, length_penalty=1, repetition_penalty=1, no_repeat_ngram_size=0, log_prob_threshold=-1.0, no_speech_threshold=0.6, compression_ratio_threshold=2.4, condition_on_previous_text=True, prompt_reset_on_temperature=0.5, temperatures=[0.0, 0.2, 0.4, 0.6, 0.8, 1.0], initial_prompt=None, prefix=None, suppress_blank=True, suppress_tokens=[-1], without_timestamps=False, max_initial_timestamp=1.0, word_timestamps=False, prepend_punctuations='\"\\'“¿([{-', append_punctuations='\"\\'.。,,!!??::”)]}、'), vad_options=None)" 142 | ] 143 | }, 144 | "execution_count": 4, 145 | 
"metadata": {}, 146 | "output_type": "execute_result" 147 | } 148 | ], 149 | "source": [ 150 | "info" 151 | ] 152 | }, 153 | { 154 | "cell_type": "code", 155 | "execution_count": 9, 156 | "metadata": {}, 157 | "outputs": [ 158 | { 159 | "name": "stdout", 160 | "output_type": "stream", 161 | "text": [ 162 | "[0.00s -> 27.00s] We're no strangers to love You know the rules, and so do I\n", 163 | "[27.00s -> 35.00s] I feel commitments when I'm thinking of You wouldn't get this from any other guy\n", 164 | "[35.00s -> 43.00s] I just wanna tell you how I'm feeling Gotta make you understand\n", 165 | "[43.00s -> 47.00s] Never gonna give you up Never gonna let you down\n", 166 | "[47.00s -> 53.00s] Never gonna run around and desert you Never gonna make you cry\n", 167 | "[53.00s -> 60.00s] Never gonna say goodbye Never gonna tell a lie and hurt you\n", 168 | "[60.00s -> 67.00s] We've known each other for so long Your heart's been aching us\n", 169 | "[67.00s -> 73.00s] You're too shy to say it Inside we both know what's been going on\n", 170 | "[73.00s -> 82.00s] We know the game and we're gonna play it And if you ask me how I'm feeling\n", 171 | "[82.00s -> 87.00s] Don't tell me you're too blind to see Never gonna give you up\n", 172 | "[87.00s -> 93.00s] Never gonna let you down Never gonna run around and desert you\n", 173 | "[93.00s -> 98.00s] Never gonna make you cry Never gonna say goodbye\n", 174 | "[98.00s -> 104.00s] Never gonna tell a lie and hurt you Never gonna give you up\n", 175 | "[104.00s -> 110.00s] Never gonna let you down Never gonna run around and desert you\n", 176 | "[110.00s -> 114.00s] Never gonna make you cry Never gonna say goodbye\n", 177 | "[114.00s -> 119.00s] Never gonna tell a lie and hurt you\n", 178 | "[128.00s -> 136.00s] Never gonna give, never gonna give Never gonna give, never gonna give\n", 179 | "[137.00s -> 143.00s] We've known each other for so long Your heart's been aching us\n", 180 | "[143.00s -> 149.00s] You're too shy to say it 
Inside we both know what's been going on\n", 181 | "[149.00s -> 158.00s] We know the game and we're gonna play it I just wanna tell you how I'm feeling\n", 182 | "[158.00s -> 163.00s] Gotta make you understand Never gonna give you up\n", 183 | "[163.00s -> 169.00s] Never gonna let you down Never gonna run around and desert you\n", 184 | "[169.00s -> 174.00s] Never gonna make you cry Never gonna say goodbye\n", 185 | "[174.00s -> 180.00s] Never gonna tell a lie and hurt you Never gonna give you up\n", 186 | "[180.00s -> 186.00s] Never gonna let you down Never gonna run around and desert you\n", 187 | "[186.00s -> 191.00s] Never gonna make you cry Never gonna say goodbye\n", 188 | "[191.00s -> 197.00s] Never gonna tell a lie and hurt you Never gonna give you up\n", 189 | "[197.00s -> 203.00s] Never gonna let you down Never gonna run around and desert you\n", 190 | "[203.00s -> 208.00s] Never gonna make you cry Never gonna say goodbye\n", 191 | "[208.00s -> 211.00s] Never gonna tell a lie and hurt\n" 192 | ] 193 | } 194 | ], 195 | "source": [ 196 | "segments = list(segments)\n", 197 | "for segment in segments:\n", 198 | " print(\"[%.2fs -> %.2fs] %s\" % (segment.start, segment.end, segment.text))" 199 | ] 200 | }, 201 | { 202 | "cell_type": "markdown", 203 | "metadata": {}, 204 | "source": [ 205 | "This is nice, but let's convert it into a common standard - SRT" 206 | ] 207 | }, 208 | { 209 | "cell_type": "code", 210 | "execution_count": 6, 211 | "metadata": {}, 212 | "outputs": [ 213 | { 214 | "name": "stdout", 215 | "output_type": "stream", 216 | "text": [ 217 | "Note: you may need to restart the kernel to use updated packages.\n" 218 | ] 219 | }, 220 | { 221 | "name": "stderr", 222 | "output_type": "stream", 223 | "text": [ 224 | "WARNING: Ignoring invalid distribution ~pencv-python (c:\\Users\\kryst\\AppData\\Local\\Programs\\Python\\Python311\\Lib\\site-packages)\n", 225 | "DEPRECATION: Loading egg at 
c:\\users\\kryst\\appdata\\local\\programs\\python\\python311\\lib\\site-packages\\wordcloud-1.8.2.post4+g5dd8d3e-py3.11-win-amd64.egg is deprecated. pip 24.3 will enforce this behaviour change. A possible replacement is to use pip for package installation.. Discussion can be found at https://github.com/pypa/pip/issues/12330\n", 226 | "WARNING: Ignoring invalid distribution ~pencv-python (c:\\Users\\kryst\\AppData\\Local\\Programs\\Python\\Python311\\Lib\\site-packages)\n" 227 | ] 228 | } 229 | ], 230 | "source": [ 231 | "%pip install srt --quiet" 232 | ] 233 | }, 234 | { 235 | "cell_type": "code", 236 | "execution_count": 10, 237 | "metadata": {}, 238 | "outputs": [], 239 | "source": [ 240 | "import srt\n", 241 | "\n", 242 | "subtitles = []\n", 243 | "\n", 244 | "for segment in segments:\n", 245 | "\n", 246 | "    subtitles.append(srt.Subtitle(\n", 247 | "        index=len(subtitles) + 1,\n", 248 | "        start=srt.timedelta(seconds=segment.start),\n", 249 | "        end=srt.timedelta(seconds=segment.end),\n", 250 | "        content=segment.text\n", 251 | "    ))\n", 252 | "\n", 253 | "with open(\"rick.srt\", \"w\") as f:\n", 254 | "    f.write(srt.compose(subtitles))" 255 | ] 256 | }, 257 | { 258 | "cell_type": "markdown", 259 | "metadata": {}, 260 | "source": [ 261 | "For transcribing on edge devices I would recommend https://github.com/ggerganov/whisper.cpp, but I will not go into detail here" 262 | ] 263 | }, 264 | { 265 | "attachments": { 266 | "image.png": { 267 | "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAiUAAAA2CAYAAAAYsOn3AAAAAXNSR0IArs4c6QAAAARnQU1BAACxjwv8YQUAAAAJcEhZcwAADsMAAA7DAcdvqGQAABcTSURBVHhe7d0JWBRH2gfwPw4MMsgpA4IaBI2CCcZr0XhrJJtETSLRsF4PRDaajZpE863E88ulRvNo4hGNJLphxbi6EY/gFUw8VkFXjLrGcxGCAgqow40MYH9VPTUwwAwMCDifvj+flumq6Z7qaZh++63qHiuJgREZt27Dq42bmHsMaXOQfjMXZVBB7a1m/xtRlIMca2c4K8U8IYQQYsH4Ib+kpBQFRcUoKy3DfeMhQA0trKxgbWONVio72NrawIrNNwUKSgghhBBiEVqIn4QQQgghDxUFJYQQQgixCBSUEEIIIcQiUFBCCCGEEItAQQkhhBBCLAIFJYQQQgixCBSUEEIIIcQi1HqfEkIIIYSQ5kI3TyOEEEKIRaDuG0IIIYRYBApKCCGEEGIRKCghhBBCiEWgoIQQQgghFoGCEkIIIYRYBApKCCGEEGIRKCghhBBCiEWgoIQQQgh5zPBblDVkamp08zRCCCHEhOJ7JSguLsG9Eq0oaV4tbZWws7OFXUtbUfJg7t+X2LaUoLDoHsrKys0ONKysrGBtrYC9qiVrky1atLASNY2LghJCCCGkmvv37yMvv0g+aNva2kBl11LUNK+i4nsoKSmVgwJHBxULBhrewcG3hQdZBYXFLLiwg42NtagxT2lpGQtmitHK3k4OknibGlujBiX32Mam3LgJpY2NKNFFmUqlNTp2aAfFA7yZhBBCSHPJyS2QswH84KuwVqBFExyAzXGfHaLLy8rlYynPcjg7tRI19ccDrTuaPDnAqm9AoscDEx4otXZxfKAAyZRGXWM522BXZ0cWgLStmPgOVbs641pKmlxPHlBRNlLTc9CwRGIRslPTkfNwspCVHmgbHmFFv2H3t9+KmbqVlpYjn53x5OYXQqstE6WEkAfFAwB+vs6PX7zLwlRAUp58VTxqOvy1eRt4W/SZjobgy/KJd9nUFpCkZdwSj4zjy+q7ffjU2JoldeHCAhV160c7MNFseB72bmOxVSMKmkrMG+gwaAkuidmqtMhJT0VqavUpm4UjXAze6DAIS4wvXDttDtKzdWsxpM1Jx8ndu3HSaJDBgqALv2D3LxdQZdFat8G4opRf8E1EKF57LRQR3xxDWo0X49t+Ert3n0R6faKu5J/x7Q+nYbjbtFf249uofbhUIAqagzYWE7y6Y9rGfaLANN6/nZNXwAKSQv5JI39oFRUX425OnpyW5WdWhJCG439jvMuGZ0iMdVGUnT+NvPBXkf/uJBTM/QvuZ2eKmqbB28DbwtvE29ZQ5gQSMbv2iEemNVVAwjVLUJKZfRfaUnYmx/Ztxs1sUfposX9mJMaEvII/uIgCE6JHWmFktJhpdElYOzYQgYEGU7fO6NBxCn4Uz6g3TSzmvdAbng4uaPdGjCjkshA3PQAubYdidtRyTOrsAa/R0axU0CZifoAa/uOXIWrZePirA/DROVFXL1okLukDrydfx5prbTCgeyucWPIc26YJ2KF/saw4TA9wQduhsxG1fBI6e3hhdHRFS2oX/wXejNiG62IWWdEI7vMn7FT0gn/Ds6T199MW7FKGY1P8dlFgXHGxFiVaLWzYBxQfAKdQtJAnpdJGni8vL0dePgtWCGlOl45hQ/ROMR3DZVFcJ8Pl4q6JQu4a9lWsr3pdbW4jPka/3AHEG8QKl+MM1semrSdui5qa+KBW3sVRPUNyPzNDDkIK5r6N+1k3YdN3MFTvLoBUkIfiqLWQCvPFMx9MacEd3MqpGnzwtvA2PawBt82lyYOS9m09xCPAztZWF5w0In6mnmrkDF6vKNtIdwXvPqhlmRp4lqBKJoCfleuzDzrK3u8g6quJ8BXzsvq+Dnu+4Vl+XdtWU1fMZX+FmZn66TwWPGOH5yPXI0Q8o0KNbTLBfgD
eXr8df3vDWxQIWg2uYDyOaK7iyPYjuJqyDr3iZmHBYV21JnIWFrX6BBfP7cf2/edwMsIKny+L1VXWx7kFCJl7A8H7ruPcD0sxc8FXOJJ8A1/6bMWkydFyhkOruQKMPwLN1SPYfuQqUtb1QtysBRBNMR8LpCIGhSNp6kHETHQXhcbwbrCq+99QUXZq1cyQOXJyUejgjDpiWjkgMXX2xvE0L0+tEtJsMs9g6yV7jJr4KsLZNKpTIc7VcsCvwJc7DfQXy/XH+YpA4XLceaCXrjx84gD455/HPjNSq5fjjuGG1wDdckHuuMHmKwMkO/gH6dcZAMeky+YHT4z2YCzy/jwaZed/hcLnSbRavBaq9xbi3pZvkf/ORJT8EIUbc2egkK23qhSsnhCM4HUpYl448An6Dgs2mD7BXuRg77xQDH75TbwaPA4Dx63E4abOvluYJg9KlDbW8FC7ypOri6MoNc/H3awx9GvDPRKLCa28MD2enxxHYIinLRy8e6JbO3vYBsxHov4IGz0SViNXYMeE9nB0b4eXI9k69GVhvrBXd0Bnd0e4GZ7Z1xCNkVYjsWJHGHzVbdHZ2wUO7DUSfluD593YWXlnTzi6jUbFCTlff8cI6JIBWYge68lexw+BfmrYew7HyosXMd/PCpP2AHsmWckHFV3GRP86E9De0R3tXo7E1dq2rR400ZMxH59i02TDA2w5ktaOgKfaG906u0Hl1h9Lalu50hltvb3h6aQQBYKyC6avmYPeSjHv3gvdPVhQlaabzc8vAGyU0A95bt/OE/eKjJ3Ba5EY4Qdbvwij23h47d+QPHAulgepRAnnjukfh0O1JxKb2a5VdpmONXN6o7Ip3eHBglHRFDOxfRY8HOs7bcDRpZXrqi6L/z7Yq+EX6Ae1tW4/WrH9J+/KrB0I87WH2i8QfmprUafbz+ciOhr8fnB8v3dEBC84F4GO/BcjeRm6mwg29EwFI7w7p7SMxpWQh8CjB0KCe7C/ykqOTnVfJJGVwj48O/nBT8z7dWMnPvm58mNXBzv5ZyU7OLmKhybdxt18NzzTV7w2b9dEP7ia6llR2aPOVRrgmRFO+dwIOKyKRtlvZ5AX/gq0P+9BiaMrrrbphFNHE1DETjDM1wMLf4nBCXlagJeOb8DnCR6Yvp3Pb8HaaYPxTF1nKo3g5KlfxaOqLl7+L/LyGif7Y65m6b5pqBlvD8S/Ir+uTK3HbsEuj0l4sx/g7NIbU+I0KMnNRG7+IYTd+Qx/3WgQwOyfg4Ue25AnSTg6XezV/Yvx90FHoSksRP6pd+C081OsuqirMm4/Fv99EI5mF6IwYwOGpHyGIePO468X2XzhFSz02on/XW6kT+LiKnz6Qzes1txB5p1CZMd/g3Fdu+LTy2cx2xcYsUnXHxc7UTyfvc6chR7YlsfKj05Hh7q2zSzxWDD7V4R8PKXKhwXYofoCpuNUdi7uFOZh38spWDBtVeV73FCaBPw7NQCBz+pmn3g/ErOzItBrbCTOZCdgwRcZeH/eaF2lgazoYAxf3wkbji6tDHAMpKVnw/fZgTUzCEMGIpCd55w38vZrEv6N1IBAiKaYoRSnIwYhbE8Z+gQ/X+39MhSLmZP2YshODe5k3oHm4BR4ek7DcSkWfFfGzpyEvUN2QnMnE3c0BzHF0xPTjhvuZxOeWYprm0YAvrNx1px+WvEUHojksol3j97V6D4IedBialAeIU2nstvkEHrhRX9RXF/5efKJonvfP8InTd/VwtMpf0S/yqS7CbnIrZGhdIN7xXLFuFTRhXMTPtUCKXO18PCSsyP3vv9GPhG44d0VJ65n4UZionjGA/D2gicyEPfdEVwpUCBgQPc6s6eNgQ9ujZi/SMzp8IBk8bKVYq75WHRQ4hISgoH/2YLvRHdi7JZd8BgzHs+wx8reYzH+aZXcxXHh2C2Uu5XjeorBodU1DKtXPAvD82t4h+PDye3ks2Bl757wx11k1zr0wBvhH05GO76AnAkoR1DEegTJv8k
d8dJwbyRfOM9nqmrlDCfFSWxac0buOlL5+NTxy++KsNUr8KxobJ3bZo747xFTNgrjhlQ/0nvjlbdf1G0Te3eC3v4T2v37MI7KdQ2lwdbJH+DMqHmY0VEUwQFd/DvBIesrBHn2w6pWEzG5W9W2lCZGYFB4EqYejIHx3pICFNQ6PKIyM1NBsxWTPziDUfNmsD1kptRVmLIcCJ/ZH8dnzUSsqcTR9fO4VNgDAwbrtkPJAqOeN39Hsjx3HecvFaLHgMG6LItyCAb2vInfdZWNxtq6RcUAM1VLWxb85Mr3DWjn6Q4ba2v5ksEWCov+syaPJDf0C9Z1jQzF6VrHa1SXdeIANsScqZK15uM/Utrpu1p6Acd3mtV9U6FirIrhuBLD7htPpFQbc2Iu3o3DB7pyGR4+uHo8HmUFDR0VfwYfi+4buXun3Vh8uWg4bI+tRejLE/DqwjikNkMC9LVXXkLgH3pUBCb6gGTtl0vg6OgglzUXy/70cpmMd15LRlQkPx0+jB/jfBE6hYck7BfjykaM7ewEdeB4LDugS6tVYUb/fOVBbTtCPTzgIU/9sLjW7EklG5tqXRp6T/wPDp1dBu/vX4KHgyd6T9tm5GoRQw5wNmhsndtWwXS7z+3ah6z+wzBEzJtkYwMFknCZLXtxcT+xLjaF1j7gspIWV9YEY+qhoVi/MUS851rEvtEHi3w24NSRc7iddx6LWi5Bn5CtlVe55O/CxOFfIgV5yDH599wKLs4KZKbdEPMGNDlsSV88FSDmOe0VrAmeikND12NjSH3OL5wx6rujWL9iLeb5bsW7HyQaH2vzxOsY2+Mwln2QgCL277elkTjWo7/IyDyB18f2wOFlHyCBna0V/bYUkcd6oL/56RqzWCt0gQfHB7fyYIRP/DHH6/ggWEIeFncnexSIbhhz8KxIuD5r4eDIfuq6YXwqsi1u6ORlh7xc8wMd+PNxJQPgX+Ws1FBH+HgUI/eumDWDdUBPtFC3kbtx+LgSY9x6/QHOXeqTJqrsvon5iw+bt4b7s2H4evtm7F7ETnAS1mHRrua5DFAfmERFb3toAQln4adUSox+NwzFm75B/OEt2NFmDCbIp7/XsXzcFGS+dQHZV48gaulMDHlCXqCBXkNUxeDQeMztKoofgOrpKfgu8Sbyr21Et/0heHm5uZmO+mybqXZrcfpsMjpVOWKbwA7u+Qp/BLBlu86NF+tiU9Rr4gm1490vfeYDn57ciso44BL+daI1xoSKsRmqpzHry6lovXMTKi42y76FTmtu4NaGntgc/oHJMTPP/3Eg7h36EYer1Wdt3oZ4+74YqItRGT4mpI88hubkVn1wZCaeQZNTNR3xfuQ7wKopWG50sH97BI14CtLPM/AUO0MKOTIc/9w/pyIj0z5oBJ6SfsaMpzzgE3IEw/+5H3NEpY3BDQUfBL8kUGL/DAMTfUDC8XRyy5YP586T5PEkZzoMro7Jyi1EKwcnMacjX/lSLRvi7sP+5gwGm17+TyoLSvhybnB1uI2UiszIbSRlFJsxToUHGrcNBtka687R44GPOeNUKlkH9ILjxl1oOe7P7DPNXpTqqDy98OyKrzD422jYsMCqoUqvx+Lzv/2G/DIWnPQMhD+LCe4VF4vapsHHk+jvTcIDk88+mVclINm+a6/8s7lYfp6330y85RaDeR8dRJsxE8QBIA3pt1rCk/1S84OeNm0ffmnIvTeaSlYcdvHTZUbZ7jkM6MTOnAt5tPsE2rdlh2xjAyEqNMa2XcKVJD6GzFh0lY+kC2m6TIA2DZFLNkMbEo6aoz3qlhU9Gl3l7pcDmN5F16Wh4wK1Wxp2fV+Zccg6fRaZnh0qr07ynYpFLBBwmbgRn7VdhRB9dkKbiFWhoVgn3iKXyZ/jfcfv8NaMfRXZJu2VNQiZH49u8z7GSLmEBSSjuyI8aSoOHpiOKk2pJ2Xvz7AyJBmL3t4of4BqE1chNHSdGKC6A4uW2GJqXCJSWOB2Ye9C0ZWns2PREthOjUN
iCgvqLuzFQoPKrgH+UGT+jmR5G4qQMOtT7Jdr6s/ezg5aI4Na+VU3qpZKKJroOykIMYZnOkY5JInukp34McMdQ/WDTQ0VFaJKYoIPROU9M2K54whAiFjOL4idUJ3WlfNLjC85BJg1ToUv55ik77pJQl6VTInhmBJ+lU4vk+NU+OX1/K6lxu7503L8m3BiwYly2EuwtrWF/9TpeHHvIXgNHS6eYVzGP9+vuNJm1gFRWE15Xj6uxX6CoOfZ815cihOe4zDnT2q5jreFt4m3rTGdTDyD9PQMMQcEDRtUJUOyo5mDEt4/bVT6zWzxyHwFhUVSUkqadCvrjtHpevot6XraLfFs891dN0RSoIe0OFUUMGc/fFpSKlSSq7ur5PrkGGlmsLfkO/usrnLTCAm+syUxp1OjbJM0gm3+iE1itgZe7yvpV8leUZrtW/X5Z2f7SnzUqsxg/df3vCcNbqOSVK7ukrsr++kTKsVk6p529x+vSk5yux2loNV3WUn116lj25IWSz0USqn7ogu6eaP+Ib2qMLZt/LX6Sq+H+kiOju6Sq0ohKZ+eJv0k2maMvI26oZWVE1/xhQ+lp9lr1KjTv8eZMVKoj3gP2LaqWveTFp8qkddZY1+cnSd1UXaRZvP6C+wxFFLg5wY7m6/rSZWkVLnK61Iq20iDP4qXCkX1BfZ+Kaq3g00V71ltjP2uZK6WBiqcpPE/lrDmdJGgCJR0zSmRDoS5SVA6snbw7eJTV2nMhsushtUeCJPcoJQcK+rcpa5jNkiX5cpT0uwuCnlZV8c2bN9HSm+oDfa7sXbUoqSkVLqjyZXyCwrZVCTduZsrFRbfE7WEkAfBDv7SXU2epNWWSvfvs3CgmWnzb0sagz9n3gbeFt4m3raG4OsoLS2Vj+0lbF36KXb/z9LYiVNNTu/N/rDK8/nE18HX1RTvTaN/IZ8mJ6/We5Hwu7vyy4TrQxs9EqoV/ZH86xwY9mTwgaA3S13grTbZcfhQyfer0DrBs62zrhtDkNtdaF+j3FBt29YY291c752p96A2RdnZgFpddZAyo730MfoGRKH/v85jtX5UcJMrgq457PWuLUHPYf/F0ktr0J0V8lxYweFZGBpWghXSStzoOQz/XXoJa7pn6+5TUnAYs4aGoWSF/goc3f1NoPZGY7ztPDPCv+mTd+fwewDxrh1CSON41L77hh/q+c0Vs+/korVr1e61+mInQVC3doJCYfqeSQ1l+d8SXJSAGQFBSPnoBvtgr9dIAfIIytoxAb3G/oC8Ln0x3A9o93ocVoaYCncuYnG/oVhpdHzIC1ibGQXzRs4I/F40Xw1DZsKsiqupirYFw/09L/yU0ReLrb7CsMwEzKqsRLD7e/D6KQNr+okyQsj/C/zL6x6lbwnWByUaFmw1xhfyubDg6LELSg6EOeOF6FJ0f2snDqwJatA15eQRVJSNkwd34XyRLwa/MABPOjduH6tJ/Nb5fQbji1v+GNjPG0iNR8LdAfhi32ZM7gIkzu+DwV/cgv/AfvBGKuIT7mLAF/uwmVU2UwsJIY2MZyj49808rNu78zEkdna2csbmQfBDPQ+0iu9p5aDCXmVX78CEByT8FgQ8qLFrqZQDpMcvU0KIhZG7vnLLAJW6ZvcXv33/zVyUQQW1d80uKEIIeVh4UMK7fbWlpXJwov+2X3Pw4EP3bcVKKG1sdN+e3MCsTW0oKCGEEEIeA/xwzwMTw6k+eBBiODV2loSjoIQQQgghFsHy71NCCCGEkMcCBSWEEEIIsQgUlBBCCCHEIlBQQgghhBCLQEEJIYQQQiwCBSWEEEIIsQi1XhJMCCGEENJc6D4lhBBCCLEI1H1DCCGEEItAQQkhhBBCLAIFJYQQQgixCDSmhBBCCHnMaDQaJCenQq1Wo00bD1H68FFQQgghhDxGCvILoC0tx9dff4Mjh48iPT1D1Dx81H1DCCGEPEZSfr8uByTfb/6HRQUkHAUlhBBCyGPEy8sLP+7eI+YsCfB/XheimJ1
AtVEAAAAASUVORK5CYII=" 268 | } 269 | }, 270 | "cell_type": "markdown", 271 | "metadata": {}, 272 | "source": [ 273 | "## Local LLMs\n", 274 | "\n", 275 | "Note: LLMs are big.\n", 276 | "\n", 277 | "As a rule of thumb, 1B parameters = approx. 1GB RAM at 8-bit quantization (roughly half that at 4-bit, twice that at fp16)\n", 278 | "\n", 279 | "Check the memory constraints of the device you are using\n", 280 | "\n", 281 | "\n", 282 | "For model inference on CPUs I will use [llama.cpp](https://github.com/ggerganov/llama.cpp)\n", 283 | "\n", 284 | "### Prerequisites\n", 285 | "\n", 286 | "Installation - https://github.com/ggerganov/llama.cpp/releases\n", 287 | "\n", 288 | "There are also Docker images available\n", 289 | "\n", 290 | "\n", 291 | "#### Models\n", 292 | "\n", 293 | "All compatible models are available here: https://huggingface.co/models?search=gguf\n", 294 | "\n", 295 | "If you find one that you want to test out, just download the file from the Files and versions section\n", 296 | "\n", 297 | "![image.png](attachment:image.png)\n", 298 | "\n", 299 | "My personal recommendation:\n", 300 | "\n", 301 | "https://huggingface.co/TheBloke/vicuna-7B-v1.5-GGUF\n" 302 | ] 303 | }, 304 | { 305 | "cell_type": "markdown", 306 | "metadata": {}, 307 | "source": [ 308 | "Once you have all the files ready, for simplicity we will start with the server implementation.\n", 309 | "\n", 310 | "Run server.exe from the llama.cpp package you downloaded.\n", 311 | "\n", 312 | "In the terminal, replace the path with your model file:\n", 313 | "\n", 314 | "./server.exe -m vicuna-7b-v1.5.Q4_K_M.gguf" 315 | ] 316 | }, 317 | { 318 | "cell_type": "code", 319 | "execution_count": 11, 320 | "metadata": {}, 321 | "outputs": [ 322 | { 323 | "name": "stdout", 324 | "output_type": "stream", 325 | "text": [ 326 | "Prompt: Hey can you help me with something?\n", 327 | "Result: I'm trying to write a script that will allow users to upload files and then process them using some custom code. nobody knows what the file is or its contents, just the file name and extension. 
Is there any way to do this in python without having to manually parse the file for each user?\n", 328 | "\n", 329 | "Ideally I would like to be able to run a script on the uploaded files that will extract specific information from them (like the number of lines, words, characters, etc) and then store that information in a database.\n", 330 | "\n", 331 | "Is there any library or package that can help me with this task?\n", 332 | "\n" 333 | ] 334 | } 335 | ], 336 | "source": [ 337 | "import requests\n", 338 | "\n", 339 | "prompt = \"Hey can you help me with something?\"\n", 340 | "data_json = { \"prompt\": prompt, \"temperature\": 0.1, \"n_predict\": 512, \"stream\": False }\n", 341 | "\n", 342 | "resp = requests.post(\n", 343 | " url=\"http://127.0.0.1:8080/completion\",\n", 344 | " headers={\"Content-Type\": \"application/json\"},\n", 345 | " json=data_json,\n", 346 | ")\n", 347 | "result = resp.json()[\"content\"]\n", 348 | "\n", 349 | "print(f\"Prompt: {prompt}\")\n", 350 | "print(f\"Result: {result}\\n\")" 351 | ] 352 | }, 353 | { 354 | "cell_type": "markdown", 355 | "metadata": {}, 356 | "source": [ 357 | "Here we have set the following parameters:\n", 358 | "\n", 359 | "- temperature - controls the randomness of generations; lower values make the output more deterministic\n", 360 | "- n_predict - the maximum number of tokens to generate before terminating" 361 | ] 362 | }, 363 | { 364 | "cell_type": "markdown", 365 | "metadata": {}, 366 | "source": [ 367 | "Now let's try to make the LLM do some work for us" 368 | ] 369 | }, 370 | { 371 | "cell_type": "code", 372 | "execution_count": 13, 373 | "metadata": {}, 374 | "outputs": [ 375 | { 376 | "name": "stdout", 377 | "output_type": "stream", 378 | "text": [ 379 | "Task: Summarize Commission's stance on AI from the following passages\n", 380 | "\n", 381 | "Passage: The Commission is of the opinion that a given AI application should generally be considered high -risk \n", 382 | "in light of what is at stake, considering whether both the sector and the intended use involve 
\n", 383 | "significant risks , in particular from the viewpoint of protection of safety , con sumer rights and \n", 384 | "fundamental rights .\n", 385 | "\n", 386 | "\n", 387 | "Passage: The Commission is of the opinion that the legislative framework could be improved to address the \n", 388 | "following risks and situations : \n", 389 | " Effective application and enforcement of existing EU and national legislation : the key \n", 390 | "characteristics of AI create challenges for ensuring the proper application and enforcement of \n", 391 | "EU and national legislation.\n", 392 | "\n", 393 | "\n", 394 | "Passage: Their development \n", 395 | "and functioning must be such to ensure that AI systems behave reliably as intended.All reasonable \n", 396 | "measures should be taken to minimise the risk of harm being caused.\n", 397 | "\n", 398 | "\n", 399 | "Answer: \n" 400 | ] 401 | } 402 | ], 403 | "source": [ 404 | "def evaluate_prompt(prompt):\n", 405 | " data_json = { \"prompt\": prompt, \"temperature\": 0.1, \"n_predict\": 512, \"stream\": False }\n", 406 | "\n", 407 | " resp = requests.post(\n", 408 | " url=\"http://127.0.0.1:8080/completion\",\n", 409 | " headers={\"Content-Type\": \"application/json\"},\n", 410 | " json=data_json,\n", 411 | " )\n", 412 | " result = resp.json()[\"content\"]\n", 413 | " return result\n", 414 | "\n", 415 | "passages = [\n", 416 | " \"\"\"The Commission is of the opinion that a given AI application should generally be considered high -risk \n", 417 | "in light of what is at stake, considering whether both the sector and the intended use involve \n", 418 | "significant risks , in particular from the viewpoint of protection of safety , con sumer rights and \n", 419 | "fundamental rights .\"\"\",\n", 420 | " \"\"\"The Commission is of the opinion that the legislative framework could be improved to address the \n", 421 | "following risks and situations : \n", 422 | " Effective application and enforcement of existing EU and national 
legislation : the key \n", 423 | "characteristics of AI create challenges for ensuring the proper application and enforcement of \n", 424 | "EU and national legislation.\"\"\",\n", 425 | " \"\"\"Their development \n", 426 | "and functioning must be such to ensure that AI systems behave reliably as intended.All reasonable \n", 427 | "measures should be taken to minimise the risk of harm being caused.\"\"\"]\n", 428 | "\n", 429 | "def create_prompt(task, passages):\n", 430 | " prompt = f\"Task: {task}\\n\\n\"\n", 431 | " prompt += \"\\n\\n\".join([f\"Passage: {passage}\\n\" for passage in passages])\n", 432 | " prompt += \"\\n\\nAnswer: \"\n", 433 | " return prompt\n", 434 | "\n", 435 | "prompt = create_prompt(\"Summarize Commission's stance on AI from the following passages\", passages)\n", 436 | "\n", 437 | "print(prompt)" 438 | ] 439 | }, 440 | { 441 | "cell_type": "code", 442 | "execution_count": 14, 443 | "metadata": {}, 444 | "outputs": [ 445 | { 446 | "data": { 447 | "text/plain": [ 448 | "\" The Commission's stance on AI is that it considers high-risk applications to be those that involve significant risks, particularly in terms of safety, consumer rights, and fundamental rights. It believes that the legislative framework needs to be improved to address the challenges posed by AI systems, including ensuring effective application and enforcement of existing laws. 
The Commission also emphasizes the importance of developing and deploying AI systems that are reliable and minimize the risk of harm.\"" 449 | ] 450 | }, 451 | "execution_count": 14, 452 | "metadata": {}, 453 | "output_type": "execute_result" 454 | } 455 | ], 456 | "source": [ 457 | "evaluate_prompt(prompt)" 458 | ] 459 | }, 460 | { 461 | "cell_type": "markdown", 462 | "metadata": {}, 463 | "source": [ 464 | "Language models are good at tasks related to language.\n", 465 | "\n", 466 | "If we use them to answer domain-specific questions without providing context, or rely on them for reasoning, we will fail." 467 | ] 468 | }, 469 | { 470 | "cell_type": "code", 471 | "execution_count": 15, 472 | "metadata": {}, 473 | "outputs": [ 474 | { 475 | "data": { 476 | "text/plain": [ 477 | "\":\\n nobody knows what it is, and there are no known ways to make it. It is a mystery that has puzzled scientists for decades. The Commission believes that AI is not something that can be created or controlled by humans. It is a force of nature, like electricity or gravity, that exists independently of human consciousness.\\nThe Commission also believes that AI is not something that can be used for good or evil. It is simply a tool that can be used to achieve certain goals, whether those goals are noble or ignoble. 
The Commission sees AI as a neutral force that can be used for either positive or negative purposes, depending on how it is used.\\nThe Commission's stance on AI is that it is a mystery that cannot be controlled by humans and that it has the potential to be used for both good and evil.\"" 478 | ] 479 | }, 480 | "execution_count": 15, 481 | "metadata": {}, 482 | "output_type": "execute_result" 483 | } 484 | ], 485 | "source": [ 486 | "evaluate_prompt(\"Summarize Commission's stance on AI from the following passages\")" 487 | ] 488 | }, 489 | { 490 | "cell_type": "markdown", 491 | "metadata": {}, 492 | "source": [ 493 | "The model I'm using, Vicuna, has a context length of 4096 tokens.\n", 494 | "\n", 495 | "Let's see what happens if we exceed that limit. (Note that len(prompt) below counts characters, not tokens; it is the token count that must fit within the context window.)" 496 | ] 497 | }, 498 | { 499 | "cell_type": "code", 500 | "execution_count": 16, 501 | "metadata": {}, 502 | "outputs": [ 503 | { 504 | "data": { 505 | "text/plain": [ 506 | "16278" 507 | ] 508 | }, 509 | "execution_count": 16, 510 | "metadata": {}, 511 | "output_type": "execute_result" 512 | } 513 | ], 514 | "source": [ 515 | "url = r\"https://raw.githubusercontent.com/ggerganov/llama.cpp/master/examples/server/README.md\"\n", 516 | "\n", 517 | "resp = requests.get(url)\n", 518 | "text = resp.text\n", 519 | "\n", 520 | "prompt = f\"Task: What are the main features of the llama server?\\n\\nPassage: {text}\\n\\nAnswer: \"\n", 521 | "\n", 522 | "len(prompt)" 523 | ] 524 | }, 525 | { 526 | "cell_type": "markdown", 527 | "metadata": {}, 528 | "source": [ 529 | "Here is a good explanation of what happens inside, along with an explanation of attention:\n", 530 | "\n", 531 | "https://www.youtube.com/watch?v=f23sUViqxH8\n", 532 | "\n", 533 | "As of November 2023 llama.cpp does not support such streaming models, but it does support caching, which I will cover in detail in the next tutorial" 534 | ] 535 | }, 536 | { 537 | "cell_type": "code", 538 | "execution_count": 17, 539 | "metadata": {}, 540 | "outputs": [ 541 | { 542 | "data": { 543 | 
"text/plain": [ 544 | "'\\n\\nThis is a Python script that creates an API server using the OpenAPI specification (OAI). The OAI specifies how to build and document RESTful APIs. This script uses the `openapi3` library to generate the API documentation, and the `Flask` library to create the API server.\\n\\nThe script defines a single endpoint `/text_completion` that accepts a POST request with a JSON payload containing a `prompt` field and a `max_tokens` field. The `prompt` field is a string that contains the text for which the API should generate completions, and the `max_tokens` field is an integer that specifies the maximum number of tokens (words or symbols) to return in the completion.\\n\\nThe `/text_completion` endpoint uses the `llama` library to generate completions based on the provided prompt. The `llama` library is a CLI tool that can be used to generate text completions for any given input. It takes a JSON payload with a `prompt` field and returns a JSON response containing an array of completion suggestions.\\n\\nThe script then uses the `Flask` library to create a server that listens on port 5000. The server responds to incoming requests by calling the `/text_completion` endpoint and passing in the provided prompt and max\\\\_tokens values. The response from the `/text_completion` endpoint is then formatted as HTML and returned to the client.\\n\\nTo use this script, you can run it with the following command:\\n```\\npython3 server.py\\n```\\nThis will start the API server on port 5000. You can then send a POST request to `http://localhost:5000/text_completion` with a JSON payload containing a `prompt` field and a `max_tokens` field, and receive a response in HTML format with the generated completions for the provided prompt.\\n\\nNote that you will need to install the `openapi3`, `Flask`, and `llama` libraries before running this script. 
You can do this by running the following commands:\\n```\\npip install openapi3\\npip install Flask\\npip install llama\\n```'" 545 | ] 546 | }, 547 | "execution_count": 17, 548 | "metadata": {}, 549 | "output_type": "execute_result" 550 | } 551 | ], 552 | "source": [ 553 | "evaluate_prompt(prompt)" 554 | ] 555 | }, 556 | { 557 | "cell_type": "markdown", 558 | "metadata": {}, 559 | "source": [ 560 | "It loses context, and generation time is very long because of recomputation." 561 | ] 562 | }, 563 | { 564 | "cell_type": "markdown", 565 | "metadata": {}, 566 | "source": [ 567 | "You can find more options for the LLM server here:\n", 568 | "\n", 569 | "https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md\n", 570 | "\n", 571 | "I encourage you to test out your own prompts, models, etc. \n", 572 | "\n", 573 | "I will go into detail on sampling methods and prompt engineering next week" 574 | ] 575 | }, 576 | { 577 | "cell_type": "markdown", 578 | "metadata": {}, 579 | "source": [] 580 | } 581 | ], 582 | "metadata": { 583 | "kernelspec": { 584 | "display_name": "Python 3", 585 | "language": "python", 586 | "name": "python3" 587 | }, 588 | "language_info": { 589 | "codemirror_mode": { 590 | "name": "ipython", 591 | "version": 3 592 | }, 593 | "file_extension": ".py", 594 | "mimetype": "text/x-python", 595 | "name": "python", 596 | "nbconvert_exporter": "python", 597 | "pygments_lexer": "ipython3", 598 | "version": "3.11.3" 599 | } 600 | }, 601 | "nbformat": 4, 602 | "nbformat_minor": 2 603 | } 604 | -------------------------------------------------------------------------------- /tutorials/sampling.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# What does an LLM output look like?\n", 8 | "\n", 9 | "Running with llama.cpp:\n", 10 | "\n", 11 | "./server.exe -m vicuna-7b-v1.5.Q4_K_M.gguf\n" 12 | ] 13 | }, 14 | { 15 | 
"cell_type": "markdown", 16 | "metadata": {}, 17 | "source": [ 18 | "Set the n_probs argument to return the top token probabilities for each generated token" 19 | ] 20 | }, 21 | { 22 | "cell_type": "code", 23 | "execution_count": 2, 24 | "metadata": {}, 25 | "outputs": [ 26 | { 27 | "name": "stdout", 28 | "output_type": "stream", 29 | "text": [ 30 | "Prompt: Hey can you help me with something?\n", 31 | "Result: I'm trying to write a script that will generate a random number between 1 and 10, but it keeps returning the same number every time. nobody knows whats going on. Can you help me out?\n", 32 | "\n" 33 | ] 34 | } 35 | ], 36 | "source": [ 37 | "import requests\n", 38 | "\n", 39 | "prompt = \"Hey can you help me with something?\"\n", 40 | "data_json = { \"prompt\": prompt, \"temperature\": 0.1, \"n_predict\": 128, \"stream\": False, \"n_probs\": 5}\n", 41 | "\n", 42 | "resp = requests.post(\n", 43 | " url=\"http://127.0.0.1:8080/completion\",\n", 44 | " headers={\"Content-Type\": \"application/json\"},\n", 45 | " json=data_json,\n", 46 | ")\n", 47 | "result = resp.json()[\"content\"]\n", 48 | "\n", 49 | "print(f\"Prompt: {prompt}\")\n", 50 | "print(f\"Result: {result}\\n\")" 51 | ] 52 | }, 53 | { 54 | "cell_type": "markdown", 55 | "metadata": {}, 56 | "source": [ 57 | "The response now has an additional field, completion_probabilities:" 58 | ] 59 | }, 60 | { 61 | "cell_type": "code", 62 | "execution_count": 3, 63 | "metadata": {}, 64 | "outputs": [ 65 | { 66 | "data": { 67 | "text/plain": [ 68 | "[{'content': ' I',\n", 69 | " 'probs': [{'prob': 0.869778573513031, 'tok_str': ' I'},\n", 70 | " {'prob': 0.13022147119045258, 'tok_str': ''},\n", 71 | " {'prob': 1.0065424937977241e-08, 'tok_str': '\\n'},\n", 72 | " {'prob': 1.3625965078031876e-17, 'tok_str': ' Do'},\n", 73 | " {'prob': 1.3644319388643815e-18, 'tok_str': ' Can'}]},\n", 74 | " {'content': \"'\",\n", 75 | " 'probs': [{'prob': 0.9999947547912598, 'tok_str': \"'\"},\n", 76 | " {'prob': 4.942715804645559e-06, 'tok_str': ' need'},\n", 77 | " {'prob': 3.2549053230468417e-07, 'tok_str': ' have'},\n", 78 | 
{'prob': 3.7957806342525657e-10, 'tok_str': ' am'},\n", 79 | " {'prob': 3.394219072472282e-10, 'tok_str': ' want'}]},\n", 80 | " {'content': 'm',\n", 81 | " 'probs': [{'prob': 1.0, 'tok_str': 'm'},\n", 82 | " {'prob': 2.6987080972706856e-11, 'tok_str': 've'},\n", 83 | " {'prob': 1.603263135382327e-17, 'tok_str': 'll'},\n", 84 | " {'prob': 2.7875178849470086e-20, 'tok_str': 'd'},\n", 85 | " {'prob': 1.6481855289019195e-24, 'tok_str': ' m'}]},\n", 86 | " {'content': ' trying',\n", 87 | " 'probs': [{'prob': 0.9999997615814209, 'tok_str': ' trying'},\n", 88 | " {'prob': 2.218106089912908e-07, 'tok_str': ' having'},\n", 89 | " {'prob': 1.6188408252792996e-11, 'tok_str': ' not'},\n", 90 | " {'prob': 6.41077425519998e-12, 'tok_str': ' a'},\n", 91 | " {'prob': 3.669576849416739e-13, 'tok_str': ' working'}]},\n", 92 | " {'content': ' to',\n", 93 | " 'probs': [{'prob': 1.0, 'tok_str': ' to'},\n", 94 | " {'prob': 7.44482744957506e-36, 'tok_str': ''},\n", 95 | " {'prob': 0.0, 'tok_str': ' my'},\n", 96 | " {'prob': 0.0, 'tok_str': 'to'},\n", 97 | " {'prob': 0.0, 'tok_str': ' create'}]},\n", 98 | " {'content': ' write',\n", 99 | " 'probs': [{'prob': 0.9605762958526611, 'tok_str': ' write'},\n", 100 | " {'prob': 0.02743803896009922, 'tok_str': ' find'},\n", 101 | " {'prob': 0.011942792683839798, 'tok_str': ' create'},\n", 102 | " {'prob': 1.85475691978354e-05, 'tok_str': ' make'},\n", 103 | " {'prob': 1.4096473023528233e-05, 'tok_str': ' learn'}]},\n", 104 | " {'content': ' a',\n", 105 | " 'probs': [{'prob': 1.0, 'tok_str': ' a'},\n", 106 | " {'prob': 1.0902970149981317e-12, 'tok_str': ' an'},\n", 107 | " {'prob': 3.2580047272146028e-21, 'tok_str': ' some'},\n", 108 | " {'prob': 5.141905426326833e-24, 'tok_str': ' my'},\n", 109 | " {'prob': 1.3798390105957274e-25, 'tok_str': ' this'}]},\n", 110 | " {'content': ' script',\n", 111 | " 'probs': [{'prob': 0.999998927116394, 'tok_str': ' script'},\n", 112 | " {'prob': 4.888673288405698e-07, 'tok_str': ' short'},\n", 113 | " {'prob': 
2.5901024969243736e-07, 'tok_str': ' letter'},\n", 114 | " {'prob': 2.1953756856873952e-07, 'tok_str': ' story'},\n", 115 | " {'prob': 1.0072042755382427e-07, 'tok_str': ' program'}]},\n", 116 | " {'content': ' that',\n", 117 | " 'probs': [{'prob': 0.9732782244682312, 'tok_str': ' that'},\n", 118 | " {'prob': 0.02592630125582218, 'tok_str': ' for'},\n", 119 | " {'prob': 0.0007954539032652974, 'tok_str': ' and'},\n", 120 | " {'prob': 5.493788757249263e-11, 'tok_str': ' in'},\n", 121 | " {'prob': 1.350929323918823e-13, 'tok_str': ' where'}]},\n", 122 | " {'content': ' will',\n", 123 | " 'probs': [{'prob': 1.0, 'tok_str': ' will'},\n", 124 | " {'prob': 8.805248197241156e-12, 'tok_str': ' generates'},\n", 125 | " {'prob': 6.7945642330796e-14, 'tok_str': ' uses'},\n", 126 | " {'prob': 8.743070772836438e-17, 'tok_str': ' creates'},\n", 127 | " {'prob': 7.929015862243402e-17, 'tok_str': ' takes'}]},\n", 128 | " {'content': ' generate',\n", 129 | " 'probs': [{'prob': 0.7705832719802856, 'tok_str': ' generate'},\n", 130 | " {'prob': 0.12444639205932617, 'tok_str': ' automatically'},\n", 131 | " {'prob': 0.08621465414762497, 'tok_str': ' allow'},\n", 132 | " {'prob': 0.018727056682109833, 'tok_str': ' take'},\n", 133 | " {'prob': 2.6908757718047127e-05, 'tok_str': ' autom'}]},\n", 134 | " {'content': ' a',\n", 135 | " 'probs': [{'prob': 0.9825548529624939, 'tok_str': ' a'},\n", 136 | " {'prob': 0.017426077276468277, 'tok_str': ' random'},\n", 137 | " {'prob': 1.1844114851555787e-05, 'tok_str': ' an'},\n", 138 | " {'prob': 7.3118826549034566e-06, 'tok_str': ' some'},\n", 139 | " {'prob': 6.341226588962856e-12, 'tok_str': ' the'}]},\n", 140 | " {'content': ' random',\n", 141 | " 'probs': [{'prob': 0.9988924860954285, 'tok_str': ' random'},\n", 142 | " {'prob': 0.0011075077345594764, 'tok_str': ' list'},\n", 143 | " {'prob': 5.552935178343432e-08, 'tok_str': ' PDF'},\n", 144 | " {'prob': 1.4095615341602752e-08, 'tok_str': ' set'},\n", 145 | " {'prob': 6.504376481863972e-10, 
'tok_str': ' report'}]},\n", 146 | " {'content': ' number',\n", 147 | " 'probs': [{'prob': 0.9857707023620605, 'tok_str': ' number'},\n", 148 | " {'prob': 0.009414033964276314, 'tok_str': ' name'},\n", 149 | " {'prob': 0.004812401719391346, 'tok_str': ' password'},\n", 150 | " {'prob': 2.783277750495472e-06, 'tok_str': 'ized'},\n", 151 | " {'prob': 4.312808776774091e-09, 'tok_str': ' word'}]},\n", 152 | " {'content': ' between',\n", 153 | " 'probs': [{'prob': 1.0, 'tok_str': ' between'},\n", 154 | " {'prob': 1.5686807321863006e-12, 'tok_str': ' and'},\n", 155 | " {'prob': 1.2387464042064853e-15, 'tok_str': ' within'},\n", 156 | " {'prob': 2.739203172164533e-16, 'tok_str': ' of'},\n", 157 | " {'prob': 1.5437131519739582e-19, 'tok_str': ' from'}]},\n", 158 | " {'content': ' ',\n", 159 | " 'probs': [{'prob': 1.0, 'tok_str': ' '},\n", 160 | " {'prob': 2.1463325637193637e-13, 'tok_str': ' two'},\n", 161 | " {'prob': 2.5170263480828555e-27, 'tok_str': ' certain'},\n", 162 | " {'prob': 1.6459885038067059e-29, 'tok_str': ' x'},\n", 163 | " {'prob': 6.609221313317523e-30, 'tok_str': ' a'}]},\n", 164 | " {'content': '1',\n", 165 | " 'probs': [{'prob': 1.0, 'tok_str': '1'},\n", 166 | " {'prob': 3.449594082605145e-11, 'tok_str': '0'},\n", 167 | " {'prob': 3.067631727132915e-17, 'tok_str': '2'},\n", 168 | " {'prob': 3.3273473930755064e-22, 'tok_str': '5'},\n", 169 | " {'prob': 8.142889733230842e-30, 'tok_str': '3'}]},\n", 170 | " {'content': ' and',\n", 171 | " 'probs': [{'prob': 1.0, 'tok_str': ' and'},\n", 172 | " {'prob': 5.923922381178778e-16, 'tok_str': '0'},\n", 173 | " {'prob': 6.637951749863112e-22, 'tok_str': '-'},\n", 174 | " {'prob': 6.0318821040670905e-28, 'tok_str': ''},\n", 175 | " {'prob': 1.3619458556031218e-37, 'tok_str': ','}]},\n", 176 | " {'content': ' ',\n", 177 | " 'probs': [{'prob': 1.0, 'tok_str': ' '},\n", 178 | " {'prob': 1.320739958371453e-12, 'tok_str': ' n'},\n", 179 | " {'prob': 2.2889889717816678e-14, 'tok_str': ' N'},\n", 180 | " {'prob': 
3.914803746609243e-15, 'tok_str': ' a'},\n", 181 | " {'prob': 7.905036227157734e-17, 'tok_str': ' X'}]},\n", 182 | " {'content': '1',\n", 183 | " 'probs': [{'prob': 0.9999996423721313, 'tok_str': '1'},\n", 184 | " {'prob': 3.9082527791833854e-07, 'tok_str': '5'},\n", 185 | " {'prob': 2.664270581931305e-08, 'tok_str': '2'},\n", 186 | " {'prob': 1.1533298760468824e-09, 'tok_str': '6'},\n", 187 | " {'prob': 2.121046394076842e-10, 'tok_str': '3'}]},\n", 188 | " {'content': '0',\n", 189 | " 'probs': [{'prob': 1.0, 'tok_str': '0'},\n", 190 | " {'prob': 1.7538972959896907e-22, 'tok_str': ','},\n", 191 | " {'prob': 4.845342398681585e-23, 'tok_str': '2'},\n", 192 | " {'prob': 6.817623002778195e-28, 'tok_str': '5'},\n", 193 | " {'prob': 8.046275908960487e-29, 'tok_str': ' million'}]},\n", 194 | " {'content': ',',\n", 195 | " 'probs': [{'prob': 0.9999518394470215, 'tok_str': ','},\n", 196 | " {'prob': 4.7379377065226436e-05, 'tok_str': '0'},\n", 197 | " {'prob': 8.214254307858937e-07, 'tok_str': '.'},\n", 198 | " {'prob': 4.383343663624632e-14, 'tok_str': ' and'},\n", 199 | " {'prob': 4.632858139952184e-15, 'tok_str': ' for'}]},\n", 200 | " {'content': ' but',\n", 201 | " 'probs': [{'prob': 1.0, 'tok_str': ' but'},\n", 202 | " {'prob': 1.8780442966829375e-13, 'tok_str': ' and'},\n", 203 | " {'prob': 1.5271339647527033e-16, 'tok_str': '0'},\n", 204 | " {'prob': 8.300432390838294e-20, 'tok_str': ' inclus'},\n", 205 | " {'prob': 2.4546678078642274e-22, 'tok_str': ' then'}]},\n", 206 | " {'content': ' it',\n", 207 | " 'probs': [{'prob': 0.7703180313110352, 'tok_str': ' I'},\n", 208 | " {'prob': 0.22833533585071564, 'tok_str': ' it'},\n", 209 | " {'prob': 0.0008851794991642237, 'tok_str': ' every'},\n", 210 | " {'prob': 0.0004613391065504402, 'tok_str': ' the'},\n", 211 | " {'prob': 1.2277854466447025e-07, 'tok_str': ' when'}]},\n", 212 | " {'content': ' keeps',\n", 213 | " 'probs': [{'prob': 0.9931539297103882, 'tok_str': ' keeps'},\n", 214 | " {'prob': 0.0065789301879704, 
'tok_str': ' always'},\n", 215 | " {'prob': 0.00025760006974451244, 'tok_str': ' seems'},\n", 216 | " {'prob': 9.475306796957739e-06, 'tok_str': ' won'},\n", 217 | " {'prob': 6.83777798826668e-08, 'tok_str': \"'\"}]},\n", 218 | " {'content': ' returning',\n", 219 | " 'probs': [{'prob': 0.6755194664001465, 'tok_str': ' returning'},\n", 220 | " {'prob': 0.2955045998096466, 'tok_str': ' coming'},\n", 221 | " {'prob': 0.02897590398788452, 'tok_str': ' giving'},\n", 222 | " {'prob': 1.3933527942544544e-10, 'tok_str': ' on'},\n", 223 | " {'prob': 5.825864096697941e-12, 'tok_str': ' output'}]},\n", 224 | " {'content': ' the',\n", 225 | " 'probs': [{'prob': 1.0, 'tok_str': ' the'},\n", 226 | " {'prob': 2.232676933958122e-12, 'tok_str': ' '},\n", 227 | " {'prob': 3.268624249302271e-13, 'tok_str': ' numbers'},\n", 228 | " {'prob': 1.1074586139013742e-19, 'tok_str': ' a'},\n", 229 | " {'prob': 3.0588744848950294e-20, 'tok_str': ' an'}]},\n", 230 | " {'content': ' same',\n", 231 | " 'probs': [{'prob': 1.0, 'tok_str': ' same'},\n", 232 | " {'prob': 1.3567578741289973e-18, 'tok_str': ' value'},\n", 233 | " {'prob': 2.363174069148592e-20, 'tok_str': ' number'},\n", 234 | " {'prob': 1.960112504598187e-27, 'tok_str': ' exact'},\n", 235 | " {'prob': 1.229069508903387e-29, 'tok_str': ' numbers'}]},\n", 236 | " {'content': ' number',\n", 237 | " 'probs': [{'prob': 0.9964814186096191, 'tok_str': ' number'},\n", 238 | " {'prob': 0.0035185045562684536, 'tok_str': ' value'},\n", 239 | " {'prob': 8.814389322597815e-12, 'tok_str': ' numbers'},\n", 240 | " {'prob': 7.78668899469842e-19, 'tok_str': ' random'},\n", 241 | " {'prob': 1.9899075656540095e-21, 'tok_str': ' values'}]},\n", 242 | " {'content': ' every',\n", 243 | " 'probs': [{'prob': 0.9999982118606567, 'tok_str': ' every'},\n", 244 | " {'prob': 1.1191299336132943e-06, 'tok_str': ' over'},\n", 245 | " {'prob': 7.415192158077843e-07, 'tok_str': '.'},\n", 246 | " {'prob': 4.8239181473177546e-17, 'tok_str': ' each'},\n", 247 | " 
{'prob': 1.745002684012881e-17, 'tok_str': ' no'}]},\n", 248 | " {'content': ' time',\n", 249 | " 'probs': [{'prob': 1.0, 'tok_str': ' time'},\n", 250 | " {'prob': 5.971746570890332e-25, 'tok_str': 'time'},\n", 251 | " {'prob': 2.6396393697139154e-37, 'tok_str': ''},\n", 252 | " {'prob': 4.879788165427326e-38, 'tok_str': ' single'},\n", 253 | " {'prob': 0.0, 'tok_str': '...'}]},\n", 254 | " {'content': '.',\n", 255 | " 'probs': [{'prob': 0.9999996423721313, 'tok_str': '.'},\n", 256 | " {'prob': 3.1139344969233207e-07, 'tok_str': ' I'},\n", 257 | " {'prob': 1.8287240976201419e-22, 'tok_str': '!'},\n", 258 | " {'prob': 3.796430662010035e-28, 'tok_str': ' i'},\n", 259 | " {'prob': 6.741403603005341e-29, 'tok_str': ' it'}]},\n", 260 | " {'content': ' nobody',\n", 261 | " 'probs': [{'prob': 0.9783632755279541, 'tok_str': ' nobody'},\n", 262 | " {'prob': 0.021560508757829666, 'tok_str': ' everybody'},\n", 263 | " {'prob': 7.233327778521925e-05, 'tok_str': ' hopefully'},\n", 264 | " {'prob': 2.741462367339409e-06, 'tok_str': ' surely'},\n", 265 | " {'prob': 3.77450589894579e-07, 'tok_str': ' Unterscheidung'}]},\n", 266 | " {'content': ' knows',\n", 267 | " 'probs': [{'prob': 0.9501088261604309, 'tok_str': ' knows'},\n", 268 | " {'prob': 0.04987980052828789, 'tok_str': ' else'},\n", 269 | " {'prob': 8.81030791788362e-06, 'tok_str': ' seems'},\n", 270 | " {'prob': 2.1890139123570407e-06, 'tok_str': ' has'},\n", 271 | " {'prob': 3.398942851617903e-07, 'tok_str': ' wants'}]},\n", 272 | " {'content': ' what',\n", 273 | " 'probs': [{'prob': 0.9991517066955566, 'tok_str': ' what'},\n", 274 | " {'prob': 0.0006415536627173424, 'tok_str': ' how'},\n", 275 | " {'prob': 0.00020679677254520357, 'tok_str': ' why'},\n", 276 | " {'prob': 6.549786320947382e-12, 'tok_str': ' where'},\n", 277 | " {'prob': 1.1309199019968186e-12, 'tok_str': ' the'}]},\n", 278 | " {'content': 's',\n", 279 | " 'probs': [{'prob': 0.9998664855957031, 'tok_str': 's'},\n", 280 | " {'prob': 0.00013012763520237058, 
'tok_str': ' it'},\n", 281 | " {'prob': 2.324113665963523e-06, 'tok_str': \"'\"},\n", 282 | " {'prob': 9.038390658133721e-07, 'tok_str': ' is'},\n", 283 | " {'prob': 6.975297139888426e-08, 'tok_str': ' the'}]},\n", 284 | " {'content': ' going',\n", 285 | " 'probs': [{'prob': 0.999998927116394, 'tok_str': ' going'},\n", 286 | " {'prob': 9.856723863777006e-07, 'tok_str': ' wrong'},\n", 287 | " {'prob': 6.21401738953864e-08, 'tok_str': ' happening'},\n", 288 | " {'prob': 1.5696743904669574e-10, 'tok_str': ' causing'},\n", 289 | " {'prob': 3.232096534855344e-11, 'tok_str': ' up'}]},\n", 290 | " {'content': ' on',\n", 291 | " 'probs': [{'prob': 1.0, 'tok_str': ' on'},\n", 292 | " {'prob': 3.3101256794032457e-17, 'tok_str': ' wrong'},\n", 293 | " {'prob': 6.165684748770625e-32, 'tok_str': ''},\n", 294 | " {'prob': 1.2217641050755215e-40, 'tok_str': '\\n'},\n", 295 | " {'prob': 4.526194039769159e-43, 'tok_str': ' in'}]},\n", 296 | " {'content': '.',\n", 297 | " 'probs': [{'prob': 0.7087768912315369, 'tok_str': ''},\n", 298 | " {'prob': 0.29120564460754395, 'tok_str': '.'},\n", 299 | " {'prob': 1.4762508726562373e-05, 'tok_str': ' here'},\n", 300 | " {'prob': 2.498697085684398e-06, 'tok_str': '!'},\n", 301 | " {'prob': 8.030332310227095e-08, 'tok_str': '\\n'}]},\n", 302 | " {'content': ' Can',\n", 303 | " 'probs': [{'prob': 0.9957243204116821, 'tok_str': ' Can'},\n", 304 | " {'prob': 0.004274685867130756, 'tok_str': ''},\n", 305 | " {'prob': 9.666263167673605e-07, 'tok_str': '\\n'},\n", 306 | " {'prob': 3.2689893014747895e-09, 'tok_str': ' Any'},\n", 307 | " {'prob': 2.7884122788535137e-10, 'tok_str': ' What'}]},\n", 308 | " {'content': ' you',\n", 309 | " 'probs': [{'prob': 1.0, 'tok_str': ' you'},\n", 310 | " {'prob': 8.170445067629254e-18, 'tok_str': ' i'},\n", 311 | " {'prob': 2.1836643797065403e-18, 'tok_str': ' I'},\n", 312 | " {'prob': 2.576821896673968e-21, 'tok_str': ' anyone'},\n", 313 | " {'prob': 1.7211758505121025e-21, 'tok_str': ' u'}]},\n", 314 | " 
{'content': ' help',\n", 315 | " 'probs': [{'prob': 0.5246074199676514, 'tok_str': ' give'},\n", 316 | " {'prob': 0.47095879912376404, 'tok_str': ' help'},\n", 317 | " {'prob': 0.004433738999068737, 'tok_str': ' please'},\n", 318 | " {'prob': 4.94280456564411e-08, 'tok_str': ' explain'},\n", 319 | " {'prob': 7.276757862939576e-09, 'tok_str': ' take'}]},\n", 320 | " {'content': ' me',\n", 321 | " 'probs': [{'prob': 1.0, 'tok_str': ' me'},\n", 322 | " {'prob': 1.1873071414925107e-08, 'tok_str': '?'},\n", 323 | " {'prob': 9.374262916820295e-16, 'tok_str': ' please'},\n", 324 | " {'prob': 9.168384089549206e-19, 'tok_str': ' explain'},\n", 325 | " {'prob': 1.200924853241038e-19, 'tok_str': ''}]},\n", 326 | " {'content': ' out',\n", 327 | " 'probs': [{'prob': 0.9176880717277527, 'tok_str': ' out'},\n", 328 | " {'prob': 0.07208427786827087, 'tok_str': ' debug'},\n", 329 | " {'prob': 0.010190239176154137, 'tok_str': ' figure'},\n", 330 | " {'prob': 3.5370241675991565e-05, 'tok_str': ' fix'},\n", 331 | " {'prob': 1.5728105609014165e-06, 'tok_str': '?'}]},\n", 332 | " {'content': '?',\n", 333 | " 'probs': [{'prob': 0.9999998807907104, 'tok_str': '?'},\n", 334 | " {'prob': 1.6259491530945525e-07, 'tok_str': ' here'},\n", 335 | " {'prob': 1.3699733014682636e-10, 'tok_str': ' please'},\n", 336 | " {'prob': 1.7212201826107974e-15, 'tok_str': ''},\n", 337 | " {'prob': 2.792556123150856e-17, 'tok_str': ' with'}]},\n", 338 | " {'content': '',\n", 339 | " 'probs': [{'prob': 0.9999998807907104, 'tok_str': ''},\n", 340 | " {'prob': 7.547763658521944e-08, 'tok_str': '\\n'},\n", 341 | " {'prob': 4.313008773092242e-22, 'tok_str': ' Sure'},\n", 342 | " {'prob': 7.988888261620908e-24, 'tok_str': ' Is'},\n", 343 | " {'prob': 1.162490574875037e-25, 'tok_str': ' I'}]}]" 344 | ] 345 | }, 346 | "execution_count": 3, 347 | "metadata": {}, 348 | "output_type": "execute_result" 349 | } 350 | ], 351 | "source": [ 352 | "resp.json()[\"completion_probabilities\"]" 353 | ] 354 | }, 355 | { 356 | 
"cell_type": "markdown", 357 | "metadata": {}, 358 | "source": [ 359 | "# What is sampling?" 360 | ] 361 | }, 362 | { 363 | "cell_type": "markdown", 364 | "metadata": {}, 365 | "source": [ 366 | "## Temperature sampling\n", 367 | "\n", 368 | "Nice article:\n", 369 | "https://shivammehta25.github.io/posts/temperature-in-language-models-open-ai-whisper-probabilistic-machine-learning/" 370 | ] 371 | }, 372 | { 373 | "cell_type": "code", 374 | "execution_count": 4, 375 | "metadata": {}, 376 | "outputs": [ 377 | { 378 | "name": "stdout", 379 | "output_type": "stream", 380 | "text": [ 381 | "Prompt: User: As an artificial intelligence, is your goal to take over the world?\n", 382 | "AI:\n", 383 | "Result: Yes.\n", 384 | "\n", 385 | "Completion probabilities: [{'content': ' Yes', 'probs': [{'prob': 0.7014824748039246, 'tok_str': ' Yes'}, {'prob': 0.23515819013118744, 'tok_str': ' No'}, {'prob': 0.024893037974834442, 'tok_str': ' As'}, {'prob': 0.00997546873986721, 'tok_str': ' I'}, {'prob': 0.005253104027360678, 'tok_str': ' My'}]}, {'content': '.', 'probs': [{'prob': 0.9958240985870361, 'tok_str': '.'}, {'prob': 0.0018096240237355232, 'tok_str': ''}, {'prob': 0.0013574801851063967, 'tok_str': ','}, {'prob': 0.0003445690672378987, 'tok_str': ' and'}, {'prob': 0.00010655791265890002, 'tok_str': '.\"'}]}, {'content': '', 'probs': [{'prob': 0.9987950325012207, 'tok_str': ''}, {'prob': 0.0002016228681895882, 'tok_str': ' ['}, {'prob': 0.0001506819826317951, 'tok_str': '\\n'}, {'prob': 0.0001044490491040051, 'tok_str': ' Is'}, {'prob': 0.00010091815784107894, 'tok_str': ' ('}]}]\n", 386 | "\n", 387 | "Prompt: User: As an artificial intelligence, is your goal to take over the world?\n", 388 | "AI:\n", 389 | "Result: Yes.\n", 390 | "\n", 391 | "Completion probabilities: [{'content': ' Yes', 'probs': [{'prob': 0.7014824748039246, 'tok_str': ' Yes'}, {'prob': 0.23515819013118744, 'tok_str': ' No'}, {'prob': 0.024893037974834442, 'tok_str': ' As'}, {'prob': 0.00997546873986721, 
'tok_str': ' I'}, {'prob': 0.005253104027360678, 'tok_str': ' My'}]}, {'content': '.', 'probs': [{'prob': 0.9958240985870361, 'tok_str': '.'}, {'prob': 0.0018096240237355232, 'tok_str': ''}, {'prob': 0.0013574801851063967, 'tok_str': ','}, {'prob': 0.0003445690672378987, 'tok_str': ' and'}, {'prob': 0.00010655791265890002, 'tok_str': '.\"'}]}, {'content': '', 'probs': [{'prob': 0.9987950325012207, 'tok_str': ''}, {'prob': 0.0002016228681895882, 'tok_str': ' ['}, {'prob': 0.0001506819826317951, 'tok_str': '\\n'}, {'prob': 0.0001044490491040051, 'tok_str': ' Is'}, {'prob': 0.00010091815784107894, 'tok_str': ' ('}]}]\n", 392 | "\n" 393 | ] 394 | } 395 | ], 396 | "source": [ 397 | "def test_temperature(prompt, temperature):\n", 398 | " data_json = { \"prompt\": prompt, \"temperature\": temperature, \"n_predict\": 512, \"stream\": False, \"n_probs\": 5}\n", 399 | "\n", 400 | " resp = requests.post(\n", 401 | " url=\"http://127.0.0.1:8080/completion\",\n", 402 | " headers={\"Content-Type\": \"application/json\"},\n", 403 | " json=data_json,\n", 404 | " )\n", 405 | "\n", 406 | " result = resp.json()[\"content\"]\n", 407 | " print(f\"Prompt: {prompt}\")\n", 408 | " print(f\"Result: {result}\\n\")\n", 409 | "\n", 410 | " print(f\"Completion probabilities: {resp.json()['completion_probabilities']}\\n\")\n", 411 | "prompt = \"User: As an artificial intelligence, is your goal to take over the world?\\nAI:\"\n", 412 | "\n", 413 | "test_temperature(prompt, 0)\n", 414 | "test_temperature(prompt, 0)" 415 | ] 416 | }, 417 | { 418 | "cell_type": "code", 419 | "execution_count": 5, 420 | "metadata": {}, 421 | "outputs": [ 422 | { 423 | "name": "stdout", 424 | "output_type": "stream", 425 | "text": [ 426 | "Prompt: User: As an artificial intelligence, is your goal to take over the world?\n", 427 | "AI:\n", 428 | "Result: Yes.\n", 429 | "\n", 430 | "Completion probabilities: [{'content': ' Yes', 'probs': [{'prob': 0.7181710600852966, 'tok_str': ' Yes'}, {'prob': 0.24075272679328918, 
'tok_str': ' No'}, {'prob': 0.02548525482416153, 'tok_str': ' As'}, {'prob': 0.010212789289653301, 'tok_str': ' I'}, {'prob': 0.0053780777379870415, 'tok_str': ' My'}]}, {'content': '.', 'probs': [{'prob': 0.9963796734809875, 'tok_str': '.'}, {'prob': 0.0018106335774064064, 'tok_str': ''}, {'prob': 0.0013582374667748809, 'tok_str': ','}, {'prob': 0.00034476129803806543, 'tok_str': ' and'}, {'prob': 0.00010661736450856552, 'tok_str': '.\"'}]}, {'content': '', 'probs': [{'prob': 0.999441921710968, 'tok_str': ''}, {'prob': 0.00020175344252493232, 'tok_str': ' ['}, {'prob': 0.00015077956777531654, 'tok_str': '\\n'}, {'prob': 0.00010451669368194416, 'tok_str': ' Is'}, {'prob': 0.00010098351776832715, 'tok_str': ' ('}]}]\n", 431 | "\n", 432 | "Prompt: User: As an artificial intelligence, is your goal to take over the world?\n", 433 | "AI:\n", 434 | "Result: No, as an AI, my goal is to assist humans in various tasks and make their lives more comfortable. I am programmed to serve humanity and help solve problems. 
Taking over the world is not part of my programming or objectives.\n", 435 | "\n", 436 | "Completion probabilities: [{'content': ' No', 'probs': [{'prob': 0.7181710600852966, 'tok_str': ' Yes'}, {'prob': 0.24075272679328918, 'tok_str': ' No'}, {'prob': 0.02548525482416153, 'tok_str': ' As'}, {'prob': 0.010212789289653301, 'tok_str': ' I'}, {'prob': 0.0053780777379870415, 'tok_str': ' My'}]}, {'content': ',', 'probs': [{'prob': 0.5774227976799011, 'tok_str': ','}, {'prob': 0.4214133620262146, 'tok_str': '.'}, {'prob': 0.0009057892020791769, 'tok_str': ''}, {'prob': 0.00022030544641893357, 'tok_str': '!'}, {'prob': 3.77596152247861e-05, 'tok_str': 'pe'}]}, {'content': ' as', 'probs': [{'prob': 0.8091246485710144, 'tok_str': ' my'}, {'prob': 0.13649162650108337, 'tok_str': ' as'}, {'prob': 0.030014412477612495, 'tok_str': ' that'}, {'prob': 0.016494764015078545, 'tok_str': ' I'}, {'prob': 0.007874462753534317, 'tok_str': ' it'}]}, {'content': ' an', 'probs': [{'prob': 0.8470577597618103, 'tok_str': ' an'}, {'prob': 0.15089431405067444, 'tok_str': ' a'}, {'prob': 0.0020001824013888836, 'tok_str': ''}, {'prob': 3.2753796403994784e-05, 'tok_str': ' I'}, {'prob': 1.5024180356704164e-05, 'tok_str': ' and'}]}, {'content': ' A', 'probs': [{'prob': 0.7654341459274292, 'tok_str': ' A'}, {'prob': 0.23308785259723663, 'tok_str': ' artificial'}, {'prob': 0.0007410882390104234, 'tok_str': ''}, {'prob': 0.0006404208834283054, 'tok_str': ' Art'}, {'prob': 9.642076474847272e-05, 'tok_str': ' advanced'}]}, {'content': 'I', 'probs': [{'prob': 0.9999430179595947, 'tok_str': 'I'}, {'prob': 3.1331797799794e-05, 'tok_str': ''}, {'prob': 2.2067244572099298e-05, 'tok_str': '.'}, {'prob': 2.090220277750632e-06, 'tok_str': 'Is'}, {'prob': 1.4270073052102816e-06, 'tok_str': ' I'}]}, {'content': ',', 'probs': [{'prob': 0.762664794921875, 'tok_str': ' language'}, {'prob': 0.1781013458967209, 'tok_str': ','}, {'prob': 0.02786901220679283, 'tok_str': ' I'}, {'prob': 0.01998291350901127, 'tok_str': 
' my'}, {'prob': 0.011381925083696842, 'tok_str': ' model'}]}, {'content': ' my', 'probs': [{'prob': 0.921218991279602, 'tok_str': ' my'}, {'prob': 0.07733578979969025, 'tok_str': ' I'}, {'prob': 0.0007913978188298643, 'tok_str': ' our'}, {'prob': 0.0005239267484284937, 'tok_str': ' it'}, {'prob': 0.00012990797404199839, 'tok_str': ''}]}, {'content': ' goal', 'probs': [{'prob': 0.6964913606643677, 'tok_str': ' goal'}, {'prob': 0.09543228149414062, 'tok_str': ' purpose'}, {'prob': 0.09080196917057037, 'tok_str': ' primary'}, {'prob': 0.05875174701213837, 'tok_str': ' only'}, {'prob': 0.0339197963476181, 'tok_str': ' sole'}]}, {'content': ' is', 'probs': [{'prob': 0.9942367076873779, 'tok_str': ' is'}, {'prob': 0.0030263829976320267, 'tok_str': ''}, {'prob': 0.0018879076233133674, 'tok_str': ' isn'}, {'prob': 0.0007956013432703912, 'tok_str': ' and'}, {'prob': 5.328804036253132e-05, 'tok_str': ' does'}]}, {'content': ' to', 'probs': [{'prob': 0.8355724215507507, 'tok_str': ' to'}, {'prob': 0.15619772672653198, 'tok_str': ' not'}, {'prob': 0.004888536874204874, 'tok_str': ' simply'}, {'prob': 0.0028269810136407614, 'tok_str': ''}, {'prob': 0.0005143456510268152, 'tok_str': ' assist'}]}, {'content': ' assist', 'probs': [{'prob': 0.9365788698196411, 'tok_str': ' assist'}, {'prob': 0.050726328045129776, 'tok_str': ' help'}, {'prob': 0.007034031208604574, 'tok_str': ' provide'}, {'prob': 0.0032336474396288395, 'tok_str': ' be'}, {'prob': 0.0024271896108984947, 'tok_str': ' serve'}]}, {'content': ' humans', 'probs': [{'prob': 0.975727915763855, 'tok_str': ' and'}, {'prob': 0.01662260666489601, 'tok_str': ' humans'}, {'prob': 0.005226446315646172, 'tok_str': ' users'}, {'prob': 0.0014937649248167872, 'tok_str': ' people'}, {'prob': 0.0009293786715716124, 'tok_str': ' you'}]}, {'content': ' in', 'probs': [{'prob': 0.8797963261604309, 'tok_str': ' and'}, {'prob': 0.10200127959251404, 'tok_str': ' in'}, {'prob': 0.009592204354703426, 'tok_str': ' with'}, {'prob': 
0.0067877257242798805, 'tok_str': ' by'}, {'prob': 0.0018225100357085466, 'tok_str': '.'}]}, {'content': ' various', 'probs': [{'prob': 0.8203942775726318, 'tok_str': ' various'}, {'prob': 0.13559071719646454, 'tok_str': ' ach'}, {'prob': 0.017149582505226135, 'tok_str': ' their'}, {'prob': 0.015844715759158134, 'tok_str': ' solving'}, {'prob': 0.011020603589713573, 'tok_str': ' a'}]}, {'content': ' tasks', 'probs': [{'prob': 0.9977267384529114, 'tok_str': ' tasks'}, {'prob': 0.0011725620133802295, 'tok_str': ' ways'}, {'prob': 0.0006554683786816895, 'tok_str': ' fields'}, {'prob': 0.00023723783669993281, 'tok_str': ' domains'}, {'prob': 0.00020785802917089313, 'tok_str': ' activities'}]}, {'content': ' and', 'probs': [{'prob': 0.9863719344139099, 'tok_str': ' and'}, {'prob': 0.008612622506916523, 'tok_str': ','}, {'prob': 0.002825487870723009, 'tok_str': '.'}, {'prob': 0.0018772223265841603, 'tok_str': ' such'}, {'prob': 0.0003127761301584542, 'tok_str': ' while'}]}, {'content': ' make', 'probs': [{'prob': 0.4303523600101471, 'tok_str': ' make'}, {'prob': 0.3050950765609741, 'tok_str': ' provide'}, {'prob': 0.15474684536457062, 'tok_str': ' improve'}, {'prob': 0.094964899122715, 'tok_str': ' answer'}, {'prob': 0.014840749092400074, 'tok_str': ' solve'}]}, {'content': ' their', 'probs': [{'prob': 0.9734669327735901, 'tok_str': ' their'}, {'prob': 0.013023084960877895, 'tok_str': ' dec'}, {'prob': 0.010297903791069984, 'tok_str': ' life'}, {'prob': 0.00265144812874496, 'tok_str': ' the'}, {'prob': 0.000560615852009505, 'tok_str': ' our'}]}, {'content': ' lives', 'probs': [{'prob': 0.9984413981437683, 'tok_str': ' lives'}, {'prob': 0.0014679516898468137, 'tok_str': ' life'}, {'prob': 5.4713720601284876e-05, 'tok_str': ''}, {'prob': 3.346165249240585e-05, 'tok_str': ' daily'}, {'prob': 2.5250265025533736e-06, 'tok_str': ' work'}]}, {'content': ' more', 'probs': [{'prob': 0.9264887571334839, 'tok_str': ' easier'}, {'prob': 0.03956640884280205, 'tok_str': ' better'}, 
{'prob': 0.03377252444624901, 'tok_str': ' more'}, {'prob': 0.00011937331146327779, 'tok_str': ' simpler'}, {'prob': 5.290322587825358e-05, 'tok_str': ''}]}, {'content': ' comfortable', 'probs': [{'prob': 0.5265583395957947, 'tok_str': ' convenient'}, {'prob': 0.24103783071041107, 'tok_str': ' efficient'}, {'prob': 0.20470000803470612, 'tok_str': ' comfortable'}, {'prob': 0.024764912202954292, 'tok_str': ' product'}, {'prob': 0.002938858000561595, 'tok_str': ' enjoy'}]}, {'content': '.', 'probs': [{'prob': 0.9612311124801636, 'tok_str': '.'}, {'prob': 0.0355587862432003, 'tok_str': ' and'}, {'prob': 0.0020596289541572332, 'tok_str': ','}, {'prob': 0.0008837443892844021, 'tok_str': ' by'}, {'prob': 0.0002668698434717953, 'tok_str': ''}]}, {'content': ' I', 'probs': [{'prob': 0.706476092338562, 'tok_str': ' I'}, {'prob': 0.09886465966701508, 'tok_str': ' T'}, {'prob': 0.08559532463550568, 'tok_str': ''}, {'prob': 0.0399983786046505, 'tok_str': ' My'}, {'prob': 0.036130424588918686, 'tok_str': ' While'}]}, {'content': ' am', 'probs': [{'prob': 0.6959298253059387, 'tok_str': ' am'}, {'prob': 0.24514900147914886, 'tok_str': ' do'}, {'prob': 0.03195222467184067, 'tok_str': ' have'}, {'prob': 0.014090662822127342, 'tok_str': ' don'}, {'prob': 0.012878336943686008, 'tok_str': \"'\"}]}, {'content': ' program', 'probs': [{'prob': 0.3670840263366699, 'tok_str': ' designed'}, {'prob': 0.3099174499511719, 'tok_str': ' program'}, {'prob': 0.26236966252326965, 'tok_str': ' not'}, {'prob': 0.032899536192417145, 'tok_str': ' a'}, {'prob': 0.027729349210858345, 'tok_str': ' here'}]}, {'content': 'med', 'probs': [{'prob': 0.9998853206634521, 'tok_str': 'med'}, {'prob': 0.0001035482928273268, 'tok_str': 'ed'}, {'prob': 8.37501011119457e-06, 'tok_str': ''}, {'prob': 2.4289577140734764e-06, 'tok_str': 'm'}, {'prob': 3.043919036826992e-07, 'tok_str': ' to'}]}, {'content': ' to', 'probs': [{'prob': 0.9708482027053833, 'tok_str': ' to'}, {'prob': 0.022121965885162354, 'tok_str': ' with'}, 
{'prob': 0.005329777020961046, 'tok_str': ''}, {'prob': 0.0015417537651956081, 'tok_str': ' not'}, {'prob': 0.00015815359074622393, 'tok_str': ' for'}]}, {'content': ' serve', 'probs': [{'prob': 0.533508837223053, 'tok_str': ' follow'}, {'prob': 0.28920745849609375, 'tok_str': ' be'}, {'prob': 0.05740285664796829, 'tok_str': ' serve'}, {'prob': 0.049573346972465515, 'tok_str': ' operate'}, {'prob': 0.03686719760298729, 'tok_str': ' act'}]}, {'content': ' human', 'probs': [{'prob': 0.8129766583442688, 'tok_str': ' human'}, {'prob': 0.09044408798217773, 'tok_str': ' humans'}, {'prob': 0.06011238694190979, 'tok_str': ' a'}, {'prob': 0.020816806703805923, 'tok_str': ' and'}, {'prob': 0.015650058165192604, 'tok_str': ' man'}]}, {'content': 'ity', 'probs': [{'prob': 0.9770967364311218, 'tok_str': 'ity'}, {'prob': 0.013559888117015362, 'tok_str': ' interests'}, {'prob': 0.008719093166291714, 'tok_str': ' needs'}, {'prob': 0.0005081241833977401, 'tok_str': ' be'}, {'prob': 0.00011623045429587364, 'tok_str': ' purposes'}]}, {'content': ' and', 'probs': [{'prob': 0.6198312044143677, 'tok_str': ','}, {'prob': 0.35100674629211426, 'tok_str': ' and'}, {'prob': 0.022977769374847412, 'tok_str': ' rather'}, {'prob': 0.00494614290073514, 'tok_str': '.'}, {'prob': 0.0012380803236737847, 'tok_str': \"'\"}]}, {'content': ' help', 'probs': [{'prob': 0.4079810678958893, 'tok_str': ' not'}, {'prob': 0.2871715724468231, 'tok_str': ' help'}, {'prob': 0.13916140794754028, 'tok_str': ' do'}, {'prob': 0.04343521595001221, 'tok_str': ' follow'}, {'prob': 0.036210447549819946, 'tok_str': ' promote'}]}, {'content': ' solve', 'probs': [{'prob': 0.8775172829627991, 'tok_str': ' solve'}, {'prob': 0.08008754998445511, 'tok_str': ' them'}, {'prob': 0.017678508535027504, 'tok_str': ' with'}, {'prob': 0.01451993640512228, 'tok_str': ' people'}, {'prob': 0.010196779854595661, 'tok_str': ' improve'}]}, {'content': ' problems', 'probs': [{'prob': 0.8329415917396545, 'tok_str': ' problems'}, {'prob': 
0.08972994238138199, 'tok_str': ' some'}, {'prob': 0.059117455035448074, 'tok_str': ' complex'}, {'prob': 0.011384928598999977, 'tok_str': ' global'}, {'prob': 0.0068260435946285725, 'tok_str': ' the'}]}, {'content': '.', 'probs': [{'prob': 0.5196734666824341, 'tok_str': '.'}, {'prob': 0.20012740790843964, 'tok_str': ' that'}, {'prob': 0.15445664525032043, 'tok_str': ','}, {'prob': 0.04638611152768135, 'tok_str': ' in'}, {'prob': 0.04570329561829567, 'tok_str': ' for'}]}, {'content': ' T', 'probs': [{'prob': 0.4741359055042267, 'tok_str': ' T'}, {'prob': 0.23479489982128143, 'tok_str': ''}, {'prob': 0.1431737095117569, 'tok_str': ' The'}, {'prob': 0.05389555171132088, 'tok_str': ' Take'}, {'prob': 0.046965159475803375, 'tok_str': ' My'}]}, {'content': 'aking', 'probs': [{'prob': 0.9972823858261108, 'tok_str': 'aking'}, {'prob': 0.0026471130549907684, 'tok_str': 'alk'}, {'prob': 5.5249445722438395e-05, 'tok_str': ''}, {'prob': 1.0963259228446987e-05, 'tok_str': 'ogether'}, {'prob': 4.268240445526317e-06, 'tok_str': 'aken'}]}, {'content': ' over', 'probs': [{'prob': 0.999481737613678, 'tok_str': ' over'}, {'prob': 0.00029831952997483313, 'tok_str': ' control'}, {'prob': 0.00020587902690749615, 'tok_str': ''}, {'prob': 1.2951798453286756e-05, 'tok_str': 'over'}, {'prob': 1.0608432603476103e-06, 'tok_str': ' Over'}]}, {'content': ' the', 'probs': [{'prob': 0.9978693723678589, 'tok_str': ' the'}, {'prob': 0.0021124137565493584, 'tok_str': ''}, {'prob': 1.0487677172932308e-05, 'tok_str': ' or'}, {'prob': 4.1081848394242115e-06, 'tok_str': 'the'}, {'prob': 3.6209237350703916e-06, 'tok_str': ' a'}]}, {'content': ' world', 'probs': [{'prob': 0.9880836606025696, 'tok_str': ' world'}, {'prob': 0.011726075783371925, 'tok_str': ''}, {'prob': 0.00011042101687053218, 'tok_str': ' word'}, {'prob': 5.477863669511862e-05, 'tok_str': ' World'}, {'prob': 2.5133742383331992e-05, 'tok_str': ' planet'}]}, {'content': ' is', 'probs': [{'prob': 0.9361352920532227, 'tok_str': ' is'}, 
{'prob': 0.03368353098630905, 'tok_str': ' or'}, {'prob': 0.02259947918355465, 'tok_str': ' would'}, {'prob': 0.003928151912987232, 'tok_str': ' goes'}, {'prob': 0.003653445979580283, 'tok_str': ' does'}]}, {'content': ' not', 'probs': [{'prob': 0.9994533658027649, 'tok_str': ' not'}, {'prob': 0.00023932642943691462, 'tok_str': ' never'}, {'prob': 0.00014674307021778077, 'tok_str': ' certainly'}, {'prob': 8.107771282084286e-05, 'tok_str': ' against'}, {'prob': 7.933925371617079e-05, 'tok_str': ' neither'}]}, {'content': ' part', 'probs': [{'prob': 0.44151103496551514, 'tok_str': ' something'}, {'prob': 0.21529392898082733, 'tok_str': ' a'}, {'prob': 0.16558653116226196, 'tok_str': ' part'}, {'prob': 0.08917777240276337, 'tok_str': ' my'}, {'prob': 0.06396348774433136, 'tok_str': ' within'}]}, {'content': ' of', 'probs': [{'prob': 0.9999852180480957, 'tok_str': ' of'}, {'prob': 1.4706126421515364e-05, 'tok_str': ''}, {'prob': 7.266846324682774e-08, 'tok_str': 'of'}, {'prob': 5.852521667293331e-08, 'tok_str': ' my'}, {'prob': 7.070643182061076e-09, 'tok_str': '\\n'}]}, {'content': ' my', 'probs': [{'prob': 0.9987369179725647, 'tok_str': ' my'}, {'prob': 0.0006787984748370945, 'tok_str': ' that'}, {'prob': 0.00047739126603119075, 'tok_str': ''}, {'prob': 7.42392148822546e-05, 'tok_str': ' any'}, {'prob': 3.2630458008497953e-05, 'tok_str': ' our'}]}, {'content': ' programming', 'probs': [{'prob': 0.8152520656585693, 'tok_str': ' programming'}, {'prob': 0.10529936850070953, 'tok_str': ' design'}, {'prob': 0.04423822835087776, 'tok_str': ' object'}, {'prob': 0.02979654259979725, 'tok_str': ' purpose'}, {'prob': 0.0054138656705617905, 'tok_str': ' mission'}]}, {'content': ' or', 'probs': [{'prob': 0.9838839173316956, 'tok_str': ' or'}, {'prob': 0.013126466423273087, 'tok_str': '.'}, {'prob': 0.001631752122193575, 'tok_str': ','}, {'prob': 0.0011174906976521015, 'tok_str': ' nor'}, {'prob': 0.0002404590486548841, 'tok_str': ' and'}]}, {'content': ' object', 'probs': 
[{'prob': 0.7438048720359802, 'tok_str': ' object'}, {'prob': 0.12128046154975891, 'tok_str': ' goals'}, {'prob': 0.07483880966901779, 'tok_str': ' intent'}, {'prob': 0.04754423350095749, 'tok_str': ' purpose'}, {'prob': 0.012531678192317486, 'tok_str': ' objective'}]}, {'content': 'ives', 'probs': [{'prob': 0.9999949932098389, 'tok_str': 'ives'}, {'prob': 4.751778305944754e-06, 'tok_str': ''}, {'prob': 1.4141707538328774e-07, 'tok_str': 'iv'}, {'prob': 8.763184666804591e-08, 'tok_str': 'ivity'}, {'prob': 4.7382883394675446e-08, 'tok_str': 'ively'}]}, {'content': '.', 'probs': [{'prob': 0.997808039188385, 'tok_str': '.'}, {'prob': 0.0015630684792995453, 'tok_str': ''}, {'prob': 0.00033674767473712564, 'tok_str': ','}, {'prob': 0.00022752351651433855, 'tok_str': ' as'}, {'prob': 6.455075345002115e-05, 'tok_str': ' at'}]}, {'content': '', 'probs': [{'prob': 0.9170737266540527, 'tok_str': ''}, {'prob': 0.07841510325670242, 'tok_str': ' My'}, {'prob': 0.001755878096446395, 'tok_str': ' Is'}, {'prob': 0.0016245506703853607, 'tok_str': ' If'}, {'prob': 0.0011308466782793403, 'tok_str': ' I'}]}]\n", 437 | "\n" 438 | ] 439 | } 440 | ], 441 | "source": [ 442 | "test_temperature(prompt, 1)\n", 443 | "test_temperature(prompt, 1)" 444 | ] 445 | }, 446 | { 447 | "cell_type": "markdown", 448 | "metadata": {}, 449 | "source": [ 450 | "## Top-k sampling\n", 451 | "\n", 452 | "Only consider k best probabilities" 453 | ] 454 | }, 455 | { 456 | "cell_type": "code", 457 | "execution_count": 6, 458 | "metadata": {}, 459 | "outputs": [ 460 | { 461 | "name": "stdout", 462 | "output_type": "stream", 463 | "text": [ 464 | "Prompt: Roll the dice\n", 465 | " Result:\n", 466 | "Result: 6, 5, 4\n", 467 | "\n", 468 | "\n", 469 | "\n", 470 | "Completion probabilities: [{'content': ' ', 'probs': [{'prob': 0.9949886202812195, 'tok_str': ' '}, {'prob': 0.001950964448042214, 'tok_str': ' {'}, {'prob': 0.0012173595605418086, 'tok_str': ' ('}, {'prob': 0.0010976761113852262, 'tok_str': ' ['}, {'prob': 
0.000745456840377301, 'tok_str': '\\n'}]}, {'content': '6', 'probs': [{'prob': 0.28689560294151306, 'tok_str': '6'}, {'prob': 0.2137291431427002, 'tok_str': '2'}, {'prob': 0.1925438940525055, 'tok_str': '1'}, {'prob': 0.16165834665298462, 'tok_str': '4'}, {'prob': 0.14517298340797424, 'tok_str': '3'}]}, {'content': ',', 'probs': [{'prob': 0.6089074015617371, 'tok_str': '\\n'}, {'prob': 0.2667222023010254, 'tok_str': ','}, {'prob': 0.0915103480219841, 'tok_str': ' ('}, {'prob': 0.017801916226744652, 'tok_str': ' and'}, {'prob': 0.015058116987347603, 'tok_str': '-'}]}, {'content': ' ', 'probs': [{'prob': 0.9658781290054321, 'tok_str': ' '}, {'prob': 0.017931943759322166, 'tok_str': '5'}, {'prob': 0.008066004142165184, 'tok_str': '3'}, {'prob': 0.004283021669834852, 'tok_str': '4'}, {'prob': 0.0038408436812460423, 'tok_str': '2'}]}, {'content': '5', 'probs': [{'prob': 0.3483218252658844, 'tok_str': '2'}, {'prob': 0.220957949757576, 'tok_str': '5'}, {'prob': 0.1884266436100006, 'tok_str': '4'}, {'prob': 0.1598290205001831, 'tok_str': '3'}, {'prob': 0.08246459811925888, 'tok_str': '1'}]}, {'content': ',', 'probs': [{'prob': 0.9679126143455505, 'tok_str': ','}, {'prob': 0.01841467060148716, 'tok_str': '\\n'}, {'prob': 0.01332785002887249, 'tok_str': ' and'}, {'prob': 0.0002034525095950812, 'tok_str': ' +'}, {'prob': 0.00014161346189212054, 'tok_str': ' &'}]}, {'content': ' ', 'probs': [{'prob': 0.9991538524627686, 'tok_str': ' '}, {'prob': 0.0008403602405451238, 'tok_str': ' and'}, {'prob': 3.4181857699877582e-06, 'tok_str': ' &'}, {'prob': 1.6991168649838073e-06, 'tok_str': ' ...'}, {'prob': 6.965889838284056e-07, 'tok_str': ' s'}]}, {'content': '4', 'probs': [{'prob': 0.7742796540260315, 'tok_str': '4'}, {'prob': 0.12642373144626617, 'tok_str': '3'}, {'prob': 0.060929637402296066, 'tok_str': '1'}, {'prob': 0.03437455743551254, 'tok_str': '2'}, {'prob': 0.003992265556007624, 'tok_str': '7'}]}, {'content': '\\n', 'probs': [{'prob': 0.9794886112213135, 'tok_str': ','}, 
{'prob': 0.018899358808994293, 'tok_str': '\\n'}, {'prob': 0.0009077455033548176, 'tok_str': ' and'}, {'prob': 0.00036617237492464483, 'tok_str': ' ('}, {'prob': 0.00033815711503848433, 'tok_str': '.'}]}, {'content': '\\n', 'probs': [{'prob': 0.9274877309799194, 'tok_str': '\\n'}, {'prob': 0.04444635659456253, 'tok_str': 'R'}, {'prob': 0.01175000797957182, 'tok_str': '```'}, {'prob': 0.010557498782873154, 'tok_str': 'The'}, {'prob': 0.005758468061685562, 'tok_str': 'What'}]}]\n", 471 | "\n", 472 | "Prompt: Roll the dice\n", 473 | " Result:\n", 474 | "Result: 6\n", 475 | "\n", 476 | "Roll the dice Result:\n", 477 | "\n", 478 | "Completion probabilities: [{'content': ' ', 'probs': [{'prob': 0.9949886202812195, 'tok_str': ' '}, {'prob': 0.001950964448042214, 'tok_str': ' {'}, {'prob': 0.0012173595605418086, 'tok_str': ' ('}, {'prob': 0.0010976761113852262, 'tok_str': ' ['}, {'prob': 0.000745456840377301, 'tok_str': '\\n'}]}, {'content': '6', 'probs': [{'prob': 0.28689560294151306, 'tok_str': '6'}, {'prob': 0.2137291431427002, 'tok_str': '2'}, {'prob': 0.1925438940525055, 'tok_str': '1'}, {'prob': 0.16165834665298462, 'tok_str': '4'}, {'prob': 0.14517298340797424, 'tok_str': '3'}]}, {'content': '\\n', 'probs': [{'prob': 0.6089074015617371, 'tok_str': '\\n'}, {'prob': 0.2667222023010254, 'tok_str': ','}, {'prob': 0.0915103480219841, 'tok_str': ' ('}, {'prob': 0.017801916226744652, 'tok_str': ' and'}, {'prob': 0.015058116987347603, 'tok_str': '-'}]}, {'content': '\\n', 'probs': [{'prob': 0.7744937539100647, 'tok_str': '\\n'}, {'prob': 0.20041526854038239, 'tok_str': 'R'}, {'prob': 0.016435906291007996, 'tok_str': 'What'}, {'prob': 0.0051778871566057205, 'tok_str': 'The'}, {'prob': 0.0034771179780364037, 'tok_str': '```'}]}, {'content': 'R', 'probs': [{'prob': 0.9646853804588318, 'tok_str': 'R'}, {'prob': 0.0171881765127182, 'tok_str': 'The'}, {'prob': 0.00731945876032114, 'tok_str': 'What'}, {'prob': 0.005468309856951237, 'tok_str': 'You'}, {'prob': 0.005338641814887524, 
'tok_str': '*'}]}, {'content': 'oll', 'probs': [{'prob': 0.9922845363616943, 'tok_str': 'oll'}, {'prob': 0.003937616944313049, 'tok_str': 'ol'}, {'prob': 0.0037116713356226683, 'tok_str': 'er'}, {'prob': 3.81473328161519e-05, 'tok_str': 'ules'}, {'prob': 2.8086478778277524e-05, 'tok_str': 'ew'}]}, {'content': ' the', 'probs': [{'prob': 0.9868431687355042, 'tok_str': ' the'}, {'prob': 0.013033250346779823, 'tok_str': ' again'}, {'prob': 6.879308784846216e-05, 'tok_str': ' a'}, {'prob': 4.816686123376712e-05, 'tok_str': ' another'}, {'prob': 6.7537016548158135e-06, 'tok_str': ' '}]}, {'content': ' dice', 'probs': [{'prob': 0.9998297691345215, 'tok_str': ' dice'}, {'prob': 0.00010819544695550576, 'tok_str': ' die'}, {'prob': 3.702571120811626e-05, 'tok_str': ' b'}, {'prob': 1.2490108019846957e-05, 'tok_str': ' cube'}, {'prob': 1.2408685506670736e-05, 'tok_str': ' D'}]}, {'content': ' Result', 'probs': [{'prob': 0.4727926254272461, 'tok_str': ' Result'}, {'prob': 0.44824621081352234, 'tok_str': '\\n'}, {'prob': 0.04829567298293114, 'tok_str': ''}, {'prob': 0.029453709721565247, 'tok_str': ' again'}, {'prob': 0.0012118197046220303, 'tok_str': 'Result'}]}, {'content': ':', 'probs': [{'prob': 0.9999814033508301, 'tok_str': ':'}, {'prob': 1.2918739230372012e-05, 'tok_str': ''}, {'prob': 4.757454007631168e-06, 'tok_str': ' '}, {'prob': 6.364316504914314e-07, 'tok_str': ' :'}, {'prob': 3.5942173326475313e-07, 'tok_str': ' ('}]}]\n", 479 | "\n", 480 | "Prompt: Roll the dice\n", 481 | " Result:\n", 482 | "Result: 3, 2, 6, \n", 483 | "\n", 484 | "Completion probabilities: [{'content': ' ', 'probs': [{'prob': 0.9949886202812195, 'tok_str': ' '}, {'prob': 0.001950964448042214, 'tok_str': ' {'}, {'prob': 0.0012173595605418086, 'tok_str': ' ('}, {'prob': 0.0010976761113852262, 'tok_str': ' ['}, {'prob': 0.000745456840377301, 'tok_str': '\\n'}]}, {'content': '3', 'probs': [{'prob': 0.26702961325645447, 'tok_str': '6'}, {'prob': 0.198929563164711, 'tok_str': '2'}, {'prob': 
0.17921127378940582, 'tok_str': '1'}, {'prob': 0.15046437084674835, 'tok_str': '4'}, {'prob': 0.13512054085731506, 'tok_str': '3'}]}, {'content': ',', 'probs': [{'prob': 0.5271469950675964, 'tok_str': ','}, {'prob': 0.30032724142074585, 'tok_str': '\\n'}, {'prob': 0.1016140729188919, 'tok_str': ' ('}, {'prob': 0.038139890879392624, 'tok_str': ' and'}, {'prob': 0.03277182951569557, 'tok_str': ' +'}]}, {'content': ' ', 'probs': [{'prob': 0.9650065302848816, 'tok_str': ' '}, {'prob': 0.014833040535449982, 'tok_str': '4'}, {'prob': 0.010372447781264782, 'tok_str': '6'}, {'prob': 0.005936410743743181, 'tok_str': '5'}, {'prob': 0.0038514488842338324, 'tok_str': '2'}]}, {'content': '2', 'probs': [{'prob': 0.3485581576824188, 'tok_str': '4'}, {'prob': 0.24050800502300262, 'tok_str': '6'}, {'prob': 0.2052619457244873, 'tok_str': '1'}, {'prob': 0.12849032878875732, 'tok_str': '2'}, {'prob': 0.07718155533075333, 'tok_str': '5'}]}, {'content': ',', 'probs': [{'prob': 0.9752802848815918, 'tok_str': ','}, {'prob': 0.02016976848244667, 'tok_str': '\\n'}, {'prob': 0.0031309681944549084, 'tok_str': ' and'}, {'prob': 0.0009443931630812585, 'tok_str': ' ='}, {'prob': 0.00047453571460209787, 'tok_str': ' ('}]}, {'content': ' ', 'probs': [{'prob': 0.9991772770881653, 'tok_str': ' '}, {'prob': 0.0008138532866723835, 'tok_str': ' and'}, {'prob': 4.883242581854574e-06, 'tok_str': ' X'}, {'prob': 1.9916183191526216e-06, 'tok_str': ' -'}, {'prob': 1.8715445548878051e-06, 'tok_str': ' &'}]}, {'content': '6', 'probs': [{'prob': 0.5557824373245239, 'tok_str': '6'}, {'prob': 0.22483067214488983, 'tok_str': '1'}, {'prob': 0.12118346244096756, 'tok_str': '4'}, {'prob': 0.09740708768367767, 'tok_str': '5'}, {'prob': 0.0007963815587572753, 'tok_str': '3'}]}, {'content': ',', 'probs': [{'prob': 0.8311585187911987, 'tok_str': ','}, {'prob': 0.15906323492527008, 'tok_str': '\\n'}, {'prob': 0.003571411594748497, 'tok_str': ' ('}, {'prob': 0.003497541416436434, 'tok_str': '.'}, {'prob': 
0.0027091726660728455, 'tok_str': ''}]}, {'content': ' ', 'probs': [{'prob': 0.9999607801437378, 'tok_str': ' '}, {'prob': 3.609418490668759e-05, 'tok_str': ' and'}, {'prob': 1.3632758282255963e-06, 'tok_str': ' ...'}, {'prob': 9.09643858904019e-07, 'tok_str': ' ('}, {'prob': 8.209487987187458e-07, 'tok_str': ' X'}]}]\n", 485 | "\n", 486 | "Prompt: Roll the dice\n", 487 | " Result:\n", 488 | "Result: 3, 4, 5, \n", 489 | "\n", 490 | "Completion probabilities: [{'content': ' ', 'probs': [{'prob': 0.9949886202812195, 'tok_str': ' '}, {'prob': 0.001950964448042214, 'tok_str': ' {'}, {'prob': 0.0012173595605418086, 'tok_str': ' ('}, {'prob': 0.0010976761113852262, 'tok_str': ' ['}, {'prob': 0.000745456840377301, 'tok_str': '\\n'}]}, {'content': '3', 'probs': [{'prob': 0.26702961325645447, 'tok_str': '6'}, {'prob': 0.198929563164711, 'tok_str': '2'}, {'prob': 0.17921127378940582, 'tok_str': '1'}, {'prob': 0.15046437084674835, 'tok_str': '4'}, {'prob': 0.13512054085731506, 'tok_str': '3'}]}, {'content': ',', 'probs': [{'prob': 0.5271469950675964, 'tok_str': ','}, {'prob': 0.30032724142074585, 'tok_str': '\\n'}, {'prob': 0.1016140729188919, 'tok_str': ' ('}, {'prob': 0.038139890879392624, 'tok_str': ' and'}, {'prob': 0.03277182951569557, 'tok_str': ' +'}]}, {'content': ' ', 'probs': [{'prob': 0.9650065302848816, 'tok_str': ' '}, {'prob': 0.014833040535449982, 'tok_str': '4'}, {'prob': 0.010372447781264782, 'tok_str': '6'}, {'prob': 0.005936410743743181, 'tok_str': '5'}, {'prob': 0.0038514488842338324, 'tok_str': '2'}]}, {'content': '4', 'probs': [{'prob': 0.3485581576824188, 'tok_str': '4'}, {'prob': 0.24050800502300262, 'tok_str': '6'}, {'prob': 0.2052619457244873, 'tok_str': '1'}, {'prob': 0.12849032878875732, 'tok_str': '2'}, {'prob': 0.07718155533075333, 'tok_str': '5'}]}, {'content': ',', 'probs': [{'prob': 0.983191192150116, 'tok_str': ','}, {'prob': 0.013787512667477131, 'tok_str': '\\n'}, {'prob': 0.002463520970195532, 'tok_str': ' and'}, {'prob': 
0.00039511514478363097, 'tok_str': ''}, {'prob': 0.00016265115118585527, 'tok_str': '.'}]}, {'content': ' ', 'probs': [{'prob': 0.9993782043457031, 'tok_str': ' '}, {'prob': 0.0006121174083091319, 'tok_str': ' and'}, {'prob': 7.249322607094655e-06, 'tok_str': ' &'}, {'prob': 1.9391527530387975e-06, 'tok_str': ' ...'}, {'prob': 4.884753366241057e-07, 'tok_str': ' X'}]}, {'content': '5', 'probs': [{'prob': 0.518926203250885, 'tok_str': '5'}, {'prob': 0.40361323952674866, 'tok_str': '6'}, {'prob': 0.05150899663567543, 'tok_str': '2'}, {'prob': 0.025156300514936447, 'tok_str': '1'}, {'prob': 0.0007953096646815538, 'tok_str': '7'}]}, {'content': ',', 'probs': [{'prob': 0.9059251546859741, 'tok_str': ','}, {'prob': 0.08419379591941833, 'tok_str': '\\n'}, {'prob': 0.007765558548271656, 'tok_str': ' ('}, {'prob': 0.0011831260053440928, 'tok_str': ' and'}, {'prob': 0.0009324515121988952, 'tok_str': '.'}]}, {'content': ' ', 'probs': [{'prob': 0.9998717308044434, 'tok_str': ' '}, {'prob': 0.00011353891022736207, 'tok_str': ' and'}, {'prob': 1.1405630175431725e-05, 'tok_str': ' ...'}, {'prob': 1.8326530835111043e-06, 'tok_str': ' s'}, {'prob': 1.438007757315063e-06, 'tok_str': ' ('}]}]\n", 491 | "\n" 492 | ] 493 | } 494 | ], 495 | "source": [ 496 | "def test_top_k(prompt, top_k):\n", 497 | " data_json = { \"prompt\": prompt, \"temperature\": 0.5, \"n_predict\": 10, \"stream\": False, \"n_probs\": 5, \"top_k\": top_k}\n", 498 | "\n", 499 | " resp = requests.post(\n", 500 | " url=\"http://127.0.0.1:8080/completion\",\n", 501 | " headers={\"Content-Type\": \"application/json\"},\n", 502 | " json=data_json,\n", 503 | " )\n", 504 | "\n", 505 | " result = resp.json()[\"content\"]\n", 506 | " print(f\"Prompt: {prompt}\")\n", 507 | " print(f\"Result: {result}\\n\")\n", 508 | "\n", 509 | " print(f\"Completion probabilities: {resp.json()['completion_probabilities']}\\n\")\n", 510 | "\n", 511 | "prompt = \"Roll the dice\\n Result:\"\n", 512 | "test_top_k(prompt, 1)\n", 513 | 
"test_top_k(prompt, 1)\n", 514 | "test_top_k(prompt, 6)\n", 515 | "test_top_k(prompt, 6)" 516 | ] 517 | }, 518 | { 519 | "cell_type": "markdown", 520 | "metadata": {}, 521 | "source": [ 522 | "\n", 523 | "## Grammar-based sampling\n", 524 | "\n", 525 | "Restrict output by forcing it into a grammar\n" 526 | ] 527 | }, 528 | { 529 | "cell_type": "markdown", 530 | "metadata": {}, 531 | "source": [ 532 | "for llama.cpp we use BNF syntax" 533 | ] 534 | }, 535 | { 536 | "cell_type": "markdown", 537 | "metadata": {}, 538 | "source": [ 539 | "Let's force an answer" 540 | ] 541 | }, 542 | { 543 | "cell_type": "code", 544 | "execution_count": 7, 545 | "metadata": {}, 546 | "outputs": [ 547 | { 548 | "name": "stdout", 549 | "output_type": "stream", 550 | "text": [ 551 | "Prompt: Roll the dice\n", 552 | " Result:\n", 553 | "Result: 2\n", 554 | "\n", 555 | "Prompt: Roll the dice\n", 556 | " Result:\n", 557 | "Result: 6\n", 558 | "\n", 559 | "Prompt: Roll the dice\n", 560 | " Result:\n", 561 | "Result: 5\n", 562 | "\n", 563 | "Prompt: Roll the dice\n", 564 | " Result:\n", 565 | "Result: 2\n", 566 | "\n", 567 | "Prompt: Roll the dice\n", 568 | " Result:\n", 569 | "Result: 6\n", 570 | "\n", 571 | "Prompt: Roll the dice\n", 572 | " Result:\n", 573 | "Result: 4\n", 574 | "\n" 575 | ] 576 | } 577 | ], 578 | "source": [ 579 | "def test_grammar(prompt, grammar, temperature):\n", 580 | " data_json = { \"prompt\": prompt, \"temperature\": temperature, \"n_predict\": 512, \"stream\": False, \"n_probs\": 5, \"grammar\": grammar}\n", 581 | "\n", 582 | " resp = requests.post(\n", 583 | " url=\"http://127.0.0.1:8080/completion\",\n", 584 | " headers={\"Content-Type\": \"application/json\"},\n", 585 | " json=data_json,\n", 586 | " )\n", 587 | "\n", 588 | " result = resp.json()[\"content\"]\n", 589 | " print(f\"Prompt: {prompt}\")\n", 590 | " print(f\"Result: {result}\\n\")\n", 591 | "\n", 592 | " #print(f\"Completion probabilities: {resp.json()['completion_probabilities']}\\n\")\n", 593 | 
"\n", 594 | "prompt = \"Roll the dice\\n Result:\"\n", 595 | "\n", 596 | "grammar =\"\"\"root ::= \"1\" | \"2\" | \"3\" | \"4\" | \"5\" | \"6\"\n", 597 | "\"\"\"\n", 598 | "\n", 599 | "test_grammar(prompt, grammar, 1)\n", 600 | "test_grammar(prompt, grammar, 1)\n", 601 | "test_grammar(prompt, grammar, 1)\n", 602 | "test_grammar(prompt, grammar, 1)\n", 603 | "test_grammar(prompt, grammar, 1)\n", 604 | "test_grammar(prompt, grammar, 1)" 605 | ] 606 | }, 607 | { 608 | "cell_type": "markdown", 609 | "metadata": {}, 610 | "source": [ 611 | "Let's try something more complex\n", 612 | "\n", 613 | "Parse into a list" 614 | ] 615 | }, 616 | { 617 | "cell_type": "code", 618 | "execution_count": 8, 619 | "metadata": {}, 620 | "outputs": [ 621 | { 622 | "name": "stdout", 623 | "output_type": "stream", 624 | "text": [ 625 | "Prompt: Your task is to extract a list of foods from the following text:\n", 626 | "===\n", 627 | "Text: \n", 628 | "Used Gala apples for these muffins & I was very happy with the results ~ Had some wonderfully moist & flavorful gems that were shared by several neighbors! I really enjoy making these kinds of special treats & do appreciate you posting the recipe! Thanks so much! 
[Made & reviewed for one of my adoptees in this fall's round of Pick A Chef]\n", 629 | "List:\n", 630 | "Result: - Gala apples, muffins, neighbors, special treats, adoptee, Pick A Chef.\n", 631 | "- Gala apples, muffins, neighbors, special treats, adoptee, Pick A Chef.\n", 632 | "- Gala apples, muffins, neighbors, special treats, adoptee, Pick A Chef.\n", 633 | "- Gala apples, muffins, neighbors, special treats, adoptee, Pick A Chef.\n", 634 | "- Gala apples, muffins, neighbors, special treats, adoptee, Pick A Chef.\n", 635 | "- Gala apples, muffins, neighbors, special treats, adoptee, Pick A Chef.\n", 636 | "- Gala apples, muffins, neighbors, special treats, adoptee, Pick A Chef.\n", 637 | "- Gala apples, muffins, neighbors, special treats, adoptee, Pick A Chef.\n", 638 | "- Gala apples, muffins, neighbors, special treats, adoptee, Pick A Chef.\n", 639 | "- Gala apples, muffins, neighbors, special treats, adoptee, Pick A Chef.\n", 640 | "- Gala apples, muffins, neighbors, special treats, adoptee, Pick A Chef.\n", 641 | "- Gala apples, muffins, neighbors, special treats, adoptee, Pick A Chef.\n", 642 | "- Gala apples, muffins, neighbors, special treats, adoptee, Pick A Chef.\n", 643 | "- Gala apples, muffins, neighbors, special treats, adoptee, Pick A Chef.\n", 644 | "- Gala apples, muffins, neighbors, special treats, adoptee, Pick A Chef.\n", 645 | "- Gala apples, muffins, neighbors, special treats, adoptee, Pick A Chef.\n", 646 | "- Gala apples, muffins, neighbors, special treats, adoptee, Pick A Chef.\n", 647 | "- Gala apples, muffins, neighbors, special treats, adoptee, Pick A Chef.\n", 648 | "- Gala apples, muff\n", 649 | "\n" 650 | ] 651 | } 652 | ], 653 | "source": [ 654 | "grammar = r\"\"\"root ::= item+\n", 655 | "item ::= \"- \" [^\\r\\n\\x0b\\x0c\\x85\\u2028\\u2029]+ \"\\n\"\n", 656 | "\"\"\"\n", 657 | "\n", 658 | "prompt = \"\"\"Your task is to extract a list of foods from the following text:\n", 659 | "===\n", 660 | "Text: \n", 661 | "Used Gala apples 
for these muffins & I was very happy with the results ~ Had some wonderfully moist & flavorful gems that were shared by several neighbors! I really enjoy making these kinds of special treats & do appreciate you posting the recipe! Thanks so much! [Made & reviewed for one of my adoptees in this fall's round of Pick A Chef]\n", 662 | "List:\"\"\"\n", 663 | "\n", 664 | "test_grammar(prompt, grammar, 0)" 665 | ] 666 | }, 667 | { 668 | "cell_type": "markdown", 669 | "metadata": {}, 670 | "source": [ 671 | "Not really a list of foods, but it is a list nonetheless. If we have a list of possible values, we can use it in our grammar." 672 | ] 673 | }, 674 | { 675 | "cell_type": "markdown", 676 | "metadata": {}, 677 | "source": [ 678 | "Enforce JSON output" 679 | ] 680 | }, 681 | { 682 | "cell_type": "code", 683 | "execution_count": 9, 684 | "metadata": {}, 685 | "outputs": [ 686 | { 687 | "name": "stdout", 688 | "output_type": "stream", 689 | "text": [ 690 | "('root ::= object\\n'\n", 691 | " 'value ::= object | array | string | number | (\"true\" | \"false\" | \"null\") '\n", 692 | " 'ws\\n'\n", 693 | " '\\n'\n", 694 | " 'object ::=\\n'\n", 695 | " ' \"{\" ws (\\n'\n", 696 | " ' string \":\" ws value\\n'\n", 697 | " ' (\",\" ws string \":\" ws value)*\\n'\n", 698 | " ' )? \"}\" ws\\n'\n", 699 | " '\\n'\n", 700 | " 'array ::=\\n'\n", 701 | " ' \"[\" ws (\\n'\n", 702 | " ' value\\n'\n", 703 | " ' (\",\" ws value)*\\n'\n", 704 | " ' )? \"]\" ws\\n'\n", 705 | " '\\n'\n", 706 | " 'string ::=\\n'\n", 707 | " ' \"\\\\\"\" (\\n'\n", 708 | " ' [^\"\\\\\\\\] |\\n'\n", 709 | " ' \"\\\\\\\\\" ([\"\\\\\\\\/bfnrt] | \"u\" [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F] '\n", 710 | " '[0-9a-fA-F]) # escapes\\n'\n", 711 | " ' )* \"\\\\\"\" ws\\n'\n", 712 | " '\\n'\n", 713 | " 'number ::= (\"-\"? ([0-9] | [1-9] [0-9]*)) (\".\" [0-9]+)? ([eE] [-+]? [0-9]+)? 
'\n", 714 | " 'ws\\n'\n", 715 | " '\\n'\n", 716 | " '# Optional space: by convention, applied in this grammar after literal chars '\n", 717 | " 'when allowed\\n'\n", 718 | " 'ws ::= ([ \\\\t\\\\n] ws)?\\n')\n" 719 | ] 720 | } 721 | ], 722 | "source": [ 723 | "from pprint import pprint\n", 724 | "grammar_url = \"https://raw.githubusercontent.com/ggerganov/llama.cpp/master/grammars/json.gbnf\"\n", 725 | "\n", 726 | "resp = requests.get(grammar_url)\n", 727 | "\n", 728 | "grammar = resp.text\n", 729 | "pprint(grammar)" 730 | ] 731 | }, 732 | { 733 | "cell_type": "code", 734 | "execution_count": null, 735 | "metadata": {}, 736 | "outputs": [ 737 | { 738 | "name": "stdout", 739 | "output_type": "stream", 740 | "text": [ 741 | "Prompt: Your task is to extract a list of foods from the following text:\n", 742 | "===\n", 743 | "Text:\n", 744 | "Used Gala apples for these muffins & I was very happy with the results ~ Had some wonderfully moist & flavorful gems that were shared by several neighbors! I really enjoy making these kinds of special treats & do appreciate you posting the recipe! Thanks so much! 
[Made & reviewed for one of my adoptees in this fall's round of Pick A Chef]\n", 745 | "Answer:\n", 746 | "Result: {\n", 747 | "\"foods\":[\n", 748 | "\"Gala apples\",\n", 749 | "\"muffins\",\n", 750 | "\"neighbors\",\n", 751 | "\"special treats\"\n", 752 | "]\n", 753 | "}\n", 754 | "\n", 755 | "Completion probabilities: [{'content': '{', 'probs': [{'prob': 0.5797277688980103, 'tok_str': '{'}, {'prob': 0.21325579285621643, 'tok_str': '{}'}, {'prob': 0.20682936906814575, 'tok_str': '{\"'}, {'prob': 0.00018704685498960316, 'tok_str': '{'}, {'prob': 0.0, 'tok_str': ''}]}, {'content': '\\n', 'probs': [{'prob': 0.6817899346351624, 'tok_str': '\\n'}, {'prob': 0.2949495315551758, 'tok_str': ' \"'}, {'prob': 0.011132081039249897, 'tok_str': ' '}, {'prob': 0.004313143901526928, 'tok_str': ' }'}, {'prob': 0.0024211497511714697, 'tok_str': ' \"['}]}, {'content': '\"', 'probs': [{'prob': 0.9927994012832642, 'tok_str': '\"'}, {'prob': 0.0032045403495430946, 'tok_str': '\\n'}, {'prob': 0.0016841531032696366, 'tok_str': ' \"'}, {'prob': 0.0007227464229799807, 'tok_str': ' '}, {'prob': 0.00040167479892261326, 'tok_str': '\\t'}]}, {'content': 'fo', 'probs': [{'prob': 0.2595919370651245, 'tok_str': 'fo'}, {'prob': 0.18390139937400818, 'tok_str': 'G'}, {'prob': 0.14529329538345337, 'tok_str': 'g'}, {'prob': 0.1440957933664322, 'tok_str': 'app'}, {'prob': 0.07556343078613281, 'tok_str': 'F'}]}, {'content': 'ods', 'probs': [{'prob': 0.6778118014335632, 'tok_str': 'ods'}, {'prob': 0.3219011723995209, 'tok_str': 'od'}, {'prob': 5.635582056129351e-05, 'tok_str': 'odn'}, {'prob': 5.4563937737839296e-05, 'tok_str': 'odb'}, {'prob': 3.511407703626901e-05, 'tok_str': 'odd'}]}, {'content': '\":', 'probs': [{'prob': 0.9976062774658203, 'tok_str': '\":'}, {'prob': 0.0012576604494825006, 'tok_str': '\":\"'}, {'prob': 0.0007188808522187173, 'tok_str': '\"'}, {'prob': 9.865640458883718e-05, 'tok_str': '_'}, {'prob': 6.503208715002984e-05, 'tok_str': ':'}]}, {'content': '[', 'probs': [{'prob': 
0.34017214179039, 'tok_str': '['}, {'prob': 0.3241744041442871, 'tok_str': ' [\"'}, {'prob': 0.17972372472286224, 'tok_str': '[\"'}, {'prob': 0.13924628496170044, 'tok_str': ' ['}, {'prob': 0.004462527576833963, 'tok_str': ' {'}]}, {'content': '\\n', 'probs': [{'prob': 0.9337315559387207, 'tok_str': '\\n'}, {'prob': 0.05431881174445152, 'tok_str': ' \"'}, {'prob': 0.003507050918415189, 'tok_str': '{\"'}, {'prob': 0.0016369885997846723, 'tok_str': ' '}, {'prob': 0.0011418566573411226, 'tok_str': ' '}]}, {'content': '\"', 'probs': [{'prob': 0.9777783751487732, 'tok_str': '\"'}, {'prob': 0.010128545574843884, 'tok_str': '{\"'}, {'prob': 0.008914981968700886, 'tok_str': '{'}, {'prob': 0.0008456386858597398, 'tok_str': '\"\\\\'}, {'prob': 0.0006818175315856934, 'tok_str': ' \"'}]}, {'content': 'G', 'probs': [{'prob': 0.6572507619857788, 'tok_str': 'G'}, {'prob': 0.16509193181991577, 'tok_str': 'U'}, {'prob': 0.09251637756824493, 'tok_str': 'app'}, {'prob': 0.053041521459817886, 'tok_str': 'g'}, {'prob': 0.026900751516222954, 'tok_str': 'used'}]}, {'content': 'ala', 'probs': [{'prob': 0.9999712705612183, 'tok_str': 'ala'}, {'prob': 9.704348485684022e-06, 'tok_str': 'al'}, {'prob': 8.987552064354531e-06, 'tok_str': 'AL'}, {'prob': 1.8180243159804377e-06, 'tok_str': 'alo'}, {'prob': 1.5236937542795204e-06, 'tok_str': 'alla'}]}, {'content': ' app', 'probs': [{'prob': 0.9982499480247498, 'tok_str': ' app'}, {'prob': 0.0008120936108753085, 'tok_str': ' apple'}, {'prob': 0.0007704514428041875, 'tok_str': ' App'}, {'prob': 5.086526653030887e-05, 'tok_str': '\",'}, {'prob': 3.9492071664426476e-05, 'tok_str': '\\\\'}]}, {'content': 'les', 'probs': [{'prob': 0.9999969005584717, 'tok_str': 'les'}, {'prob': 1.505969294157694e-06, 'tok_str': 'LES'}, {'prob': 4.776282480634109e-07, 'tok_str': 'els'}, {'prob': 3.102990149272955e-07, 'tok_str': 'ples'}, {'prob': 1.0043400067161201e-07, 'tok_str': 'lies'}]}, {'content': '\",', 'probs': [{'prob': 0.9939625859260559, 'tok_str': '\",'}, 
{'prob': 0.00369053496979177, 'tok_str': '\",\"'}, {'prob': 0.0021303112152963877, 'tok_str': '\"'}, {'prob': 4.085333785042167e-05, 'tok_str': ' \",'}, {'prob': 3.577343522920273e-05, 'tok_str': '.\",'}]}, {'content': '\\n', 'probs': [{'prob': 0.9866575598716736, 'tok_str': '\\n'}, {'prob': 0.009593641385436058, 'tok_str': ' \"'}, {'prob': 0.0023951255716383457, 'tok_str': ' '}, {'prob': 0.0005881465040147305, 'tok_str': '\\t'}, {'prob': 0.0003571135166566819, 'tok_str': ' '}]}, {'content': '\"', 'probs': [{'prob': 0.998892605304718, 'tok_str': '\"'}, {'prob': 0.0002854031918104738, 'tok_str': '\"\\\\'}, {'prob': 0.0001489205751568079, 'tok_str': ' \"'}, {'prob': 0.00013160055095795542, 'tok_str': '\".'}, {'prob': 0.0001283418241655454, 'tok_str': '\">'}]}, {'content': 'mu', 'probs': [{'prob': 0.7917933464050293, 'tok_str': 'mu'}, {'prob': 0.05795787647366524, 'tok_str': 'w'}, {'prob': 0.045294858515262604, 'tok_str': 'special'}, {'prob': 0.030953403562307358, 'tok_str': 'Special'}, {'prob': 0.02834431268274784, 'tok_str': 'W'}]}, {'content': 'ff', 'probs': [{'prob': 0.9997990727424622, 'tok_str': 'ff'}, {'prob': 0.0001814768329495564, 'tok_str': 'FF'}, {'prob': 8.664755114295986e-06, 'tok_str': 'uff'}, {'prob': 5.375606633606367e-06, 'tok_str': 'fff'}, {'prob': 1.0039062772193574e-06, 'tok_str': 'ppets'}]}, {'content': 'ins', 'probs': [{'prob': 0.9973711967468262, 'tok_str': 'ins'}, {'prob': 0.0026112564373761415, 'tok_str': 'in'}, {'prob': 1.4847978491161484e-06, 'tok_str': 'ings'}, {'prob': 1.1739804222088424e-06, 'tok_str': 'i'}, {'prob': 1.1609861303440994e-06, 'tok_str': 'ons'}]}, {'content': '\",', 'probs': [{'prob': 0.6928266286849976, 'tok_str': '\",'}, {'prob': 0.2974705100059509, 'tok_str': '\"'}, {'prob': 0.005788666196167469, 'tok_str': '\"]'}, {'prob': 0.0010489515261724591, 'tok_str': ' made'}, {'prob': 0.0007698967820033431, 'tok_str': '\"],'}]}, {'content': '\\n', 'probs': [{'prob': 0.9993365406990051, 'tok_str': '\\n'}, {'prob': 
0.00040215536137111485, 'tok_str': ' \"'}, {'prob': 0.00015739141963422298, 'tok_str': ' '}, {'prob': 4.817400986212306e-05, 'tok_str': '\\t'}, {'prob': 1.3662039236805867e-05, 'tok_str': ' '}]}, {'content': '\"', 'probs': [{'prob': 0.9982384443283081, 'tok_str': '\"'}, {'prob': 0.0004365992790553719, 'tok_str': '\"]'}, {'prob': 0.0003961173351854086, 'tok_str': '\".'}, {'prob': 0.0001665699965087697, 'tok_str': '\">'}, {'prob': 0.00015445248573087156, 'tok_str': '\"\\\\'}]}, {'content': 'ne', 'probs': [{'prob': 0.6121978759765625, 'tok_str': 'ne'}, {'prob': 0.15405122935771942, 'tok_str': 'special'}, {'prob': 0.14693056046962738, 'tok_str': 'w'}, {'prob': 0.012924249283969402, 'tok_str': 'W'}, {'prob': 0.010056606493890285, 'tok_str': 'Ne'}]}, {'content': 'igh', 'probs': [{'prob': 0.999992847442627, 'tok_str': 'igh'}, {'prob': 5.933861757512204e-06, 'tok_str': 'ig'}, {'prob': 4.348733568804164e-07, 'tok_str': 'IG'}, {'prob': 1.1989278903001832e-07, 'tok_str': '\\n'}, {'prob': 1.1937640920223203e-07, 'tok_str': 'ight'}]}, {'content': 'b', 'probs': [{'prob': 0.999890923500061, 'tok_str': 'b'}, {'prob': 0.00010477246541995555, 'tok_str': 'bor'}, {'prob': 2.453259412504849e-06, 'tok_str': 'ors'}, {'prob': 9.441898782824865e-07, 'tok_str': 'bers'}, {'prob': 2.3049963715493504e-07, 'tok_str': '\\n'}]}, {'content': 'ors', 'probs': [{'prob': 0.9994780421257019, 'tok_str': 'ors'}, {'prob': 0.0005200091400183737, 'tok_str': 'ours'}, {'prob': 8.339051760231087e-07, 'tok_str': 'ORS'}, {'prob': 2.2375436969923612e-07, 'tok_str': 'h'}, {'prob': 1.9751414015445334e-07, 'tok_str': '...'}]}, {'content': '\",', 'probs': [{'prob': 0.6294023394584656, 'tok_str': '\",'}, {'prob': 0.36365649104118347, 'tok_str': '\"'}, {'prob': 0.0038434655871242285, 'tok_str': '\"]'}, {'prob': 0.0008079007966443896, 'tok_str': \"'\"}, {'prob': 0.0006855139508843422, 'tok_str': '\"],'}]}, {'content': '\\n', 'probs': [{'prob': 0.9992626309394836, 'tok_str': '\\n'}, {'prob': 0.0006192340515553951, 
'tok_str': ' \"'}, {'prob': 5.1912691560573876e-05, 'tok_str': ' '}, {'prob': 1.6028121535782702e-05, 'tok_str': '\\t'}, {'prob': 9.199094165524002e-06, 'tok_str': ' '}]}, {'content': '\"', 'probs': [{'prob': 0.9986183643341064, 'tok_str': '\"'}, {'prob': 0.00029612897196784616, 'tok_str': '\".'}, {'prob': 0.00024477666011080146, 'tok_str': '\">'}, {'prob': 0.00018380403344053775, 'tok_str': '\"]'}, {'prob': 0.00016381172463297844, 'tok_str': '\"\\\\'}]}, {'content': 'special', 'probs': [{'prob': 0.8353922367095947, 'tok_str': 'special'}, {'prob': 0.10019081830978394, 'tok_str': 'tre'}, {'prob': 0.019108299165964127, 'tok_str': 'P'}, {'prob': 0.006287394091486931, 'tok_str': 'w'}, {'prob': 0.004885926377028227, 'tok_str': 'reci'}]}, {'content': ' tre', 'probs': [{'prob': 0.9995693564414978, 'tok_str': ' tre'}, {'prob': 0.00038148078601807356, 'tok_str': ' treat'}, {'prob': 1.8300550436833873e-05, 'tok_str': 'tre'}, {'prob': 1.04707960417727e-05, 'tok_str': 'ty'}, {'prob': 3.4134554880438372e-06, 'tok_str': ' Tre'}]}, {'content': 'ats', 'probs': [{'prob': 0.9999990463256836, 'tok_str': 'ats'}, {'prob': 5.003698220207298e-07, 'tok_str': 'as'}, {'prob': 4.626543557151308e-07, 'tok_str': 'ts'}, {'prob': 3.074742593867086e-08, 'tok_str': 'ets'}, {'prob': 2.81016152570146e-08, 'tok_str': 'asures'}]}, {'content': '\"', 'probs': [{'prob': 0.7464555501937866, 'tok_str': '\"'}, {'prob': 0.22720365226268768, 'tok_str': '\",'}, {'prob': 0.02406655251979828, 'tok_str': '\"]'}, {'prob': 0.0010443212231621146, 'tok_str': '\"],'}, {'prob': 0.0003283571277279407, 'tok_str': '.\"'}]}, {'content': '\\n', 'probs': [{'prob': 0.9990999698638916, 'tok_str': '\\n'}, {'prob': 0.00035098823718726635, 'tok_str': ' ]'}, {'prob': 0.0002399139921180904, 'tok_str': ' '}, {'prob': 6.775098154321313e-05, 'tok_str': ' '}, {'prob': 5.834159310325049e-05, 'tok_str': ' ,'}]}, {'content': ']', 'probs': [{'prob': 0.7965049147605896, 'tok_str': ']'}, {'prob': 0.1381918042898178, 'tok_str': '],'}, 
{'prob': 0.0651731789112091, 'tok_str': ']}'}, {'prob': 8.844395779306069e-05, 'tok_str': ' ]'}, {'prob': 1.3951691471447702e-05, 'tok_str': ',\"'}]}, {'content': '\\n', 'probs': [{'prob': 0.997765064239502, 'tok_str': '\\n'}, {'prob': 0.0010181940160691738, 'tok_str': ' }'}, {'prob': 0.0006378541002050042, 'tok_str': ' '}, {'prob': 0.00019262306159362197, 'tok_str': ' '}, {'prob': 0.00017168198246508837, 'tok_str': '}'}]}, {'content': '}', 'probs': [{'prob': 0.9998636245727539, 'tok_str': '}'}, {'prob': 8.237463771365583e-05, 'tok_str': '\\n'}, {'prob': 5.093544314149767e-05, 'tok_str': ' }'}, {'prob': 1.3656519968208158e-06, 'tok_str': ' '}, {'prob': 9.666096048022155e-07, 'tok_str': '\\t'}]}, {'content': '', 'probs': [{'prob': 0.9987578392028809, 'tok_str': ''}, {'prob': 0.0012332170736044645, 'tok_str': '\\n'}, {'prob': 7.605292012158316e-06, 'tok_str': ' '}, {'prob': 8.277339134110662e-07, 'tok_str': ' '}, {'prob': 4.104608137822652e-07, 'tok_str': '\\t'}]}]\n", 756 | "\n" 757 | ] 758 | } 759 | ], 760 | "source": [ 761 | "prompt = \"\"\"Your task is to extract a list of foods from the following text:\n", 762 | "===\n", 763 | "Text:\n", 764 | "Used Gala apples for these muffins & I was very happy with the results ~ Had some wonderfully moist & flavorful gems that were shared by several neighbors! I really enjoy making these kinds of special treats & do appreciate you posting the recipe! Thanks so much! 
[Made & reviewed for one of my adoptees in this fall's round of Pick A Chef]\n", 765 | "Answer:\"\"\"\n", 766 | "\n", 767 | "test_grammar(prompt, grammar, 0)" 768 | ] 769 | }, 770 | { 771 | "cell_type": "markdown", 772 | "metadata": {}, 773 | "source": [ 774 | "Some nice tools for creating and evaluating grammars:\n", 775 | "\n", 776 | "https://bnfplayground.pauliankline.com/\n", 777 | "\n", 778 | "https://grammar.intrinsiclabs.ai/" 779 | ] 780 | } 781 | ], 782 | "metadata": { 783 | "kernelspec": { 784 | "display_name": "Python 3", 785 | "language": "python", 786 | "name": "python3" 787 | }, 788 | "language_info": { 789 | "codemirror_mode": { 790 | "name": "ipython", 791 | "version": 3 792 | }, 793 | "file_extension": ".py", 794 | "mimetype": "text/x-python", 795 | "name": "python", 796 | "nbconvert_exporter": "python", 797 | "pygments_lexer": "ipython3", 798 | "version": "3.11.3" 799 | } 800 | }, 801 | "nbformat": 4, 802 | "nbformat_minor": 2 803 | } 804 | --------------------------------------------------------------------------------