├── .gitignore ├── ExtractMetadataUsingFinetunedGPT3.ipynb ├── LICENSE ├── README.md ├── docs ├── .gitignore └── README.md ├── dspace ├── README.md ├── doria-10024-3982-abo-doctheses.xml ├── lutpub-10024-158302-doctheses.xml ├── osuva-10024-168-doctheses.xml └── utupub-10024-143705-doctheses.xml └── requirements.txt /.gitignore: -------------------------------------------------------------------------------- 1 | # env file with private API key 2 | env.sh 3 | 4 | # fine-tuning data file 5 | fine-tune.jsonl 6 | 7 | # Byte-compiled / optimized / DLL files 8 | __pycache__/ 9 | *.py[cod] 10 | *$py.class 11 | 12 | # C extensions 13 | *.so 14 | 15 | # Distribution / packaging 16 | .Python 17 | build/ 18 | develop-eggs/ 19 | dist/ 20 | downloads/ 21 | eggs/ 22 | .eggs/ 23 | lib/ 24 | lib64/ 25 | parts/ 26 | sdist/ 27 | var/ 28 | wheels/ 29 | pip-wheel-metadata/ 30 | share/python-wheels/ 31 | *.egg-info/ 32 | .installed.cfg 33 | *.egg 34 | MANIFEST 35 | 36 | # PyInstaller 37 | # Usually these files are written by a python script from a template 38 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 
39 | *.manifest 40 | *.spec 41 | 42 | # Installer logs 43 | pip-log.txt 44 | pip-delete-this-directory.txt 45 | 46 | # Unit test / coverage reports 47 | htmlcov/ 48 | .tox/ 49 | .nox/ 50 | .coverage 51 | .coverage.* 52 | .cache 53 | nosetests.xml 54 | coverage.xml 55 | *.cover 56 | *.py,cover 57 | .hypothesis/ 58 | .pytest_cache/ 59 | 60 | # Translations 61 | *.mo 62 | *.pot 63 | 64 | # Django stuff: 65 | *.log 66 | local_settings.py 67 | db.sqlite3 68 | db.sqlite3-journal 69 | 70 | # Flask stuff: 71 | instance/ 72 | .webassets-cache 73 | 74 | # Scrapy stuff: 75 | .scrapy 76 | 77 | # Sphinx documentation 78 | docs/_build/ 79 | 80 | # PyBuilder 81 | target/ 82 | 83 | # Jupyter Notebook 84 | .ipynb_checkpoints 85 | 86 | # IPython 87 | profile_default/ 88 | ipython_config.py 89 | 90 | # pyenv 91 | .python-version 92 | 93 | # pipenv 94 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. 95 | # However, in case of collaboration, if having platform-specific dependencies or dependencies 96 | # having no cross-platform support, pipenv may install dependencies that don't work, or not 97 | # install all needed dependencies. 98 | #Pipfile.lock 99 | 100 | # PEP 582; used by e.g. 
github.com/David-OConnor/pyflow 101 | __pypackages__/ 102 | 103 | # Celery stuff 104 | celerybeat-schedule 105 | celerybeat.pid 106 | 107 | # SageMath parsed files 108 | *.sage.py 109 | 110 | # Environments 111 | .env 112 | .venv 113 | env/ 114 | venv/ 115 | ENV/ 116 | env.bak/ 117 | venv.bak/ 118 | 119 | # Spyder project settings 120 | .spyderproject 121 | .spyproject 122 | 123 | # Rope project settings 124 | .ropeproject 125 | 126 | # mkdocs documentation 127 | /site 128 | 129 | # mypy 130 | .mypy_cache/ 131 | .dmypy.json 132 | dmypy.json 133 | 134 | # Pyre type checker 135 | .pyre/ 136 | -------------------------------------------------------------------------------- /ExtractMetadataUsingFinetunedGPT3.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "3a4a32f6", 6 | "metadata": {}, 7 | "source": [ 8 | "# Extract metadata from PDF files using a fine-tuned GPT-3 language model\n", 9 | "\n", 10 | "This notebook demonstrates how we can extract Dublin Core style metadata for PDF documents, in this case doctoral theses from four Finnish universities (Åbo Akademi, University of Turku, University of Vaasa and Lappeenranta University of Technology), using only the raw text from the first few pages of each PDF.\n", 11 | "\n", 12 | "The set of 192 documents will be split into two subsets (train: 149, test: 43). We will extract text from around 5 pages of each document, aiming for 500 to 700 words. The corresponding metadata, which has been exported from the DSpace repositories of the universities, is represented in a simple textual \"key: value\" format, which should be easy enough for a language model to handle. The train set is used to create a data set for fine-tuning a GPT-3 Curie model. 
Subsequently the model can be used to generate similar metadata for unseen documents from the test set.\n", 13 | "\n", 14 | "For this experiment, an OpenAI API access key is required. It can be generated after registering a user account (the same account can be used for e.g. ChatGPT). The API key has to be stored in the environment variable `OPENAI_API_KEY` before starting this notebook. The fine-tuning will cost around \\\\$2.50 USD and generating new metadata with the API also has a small cost, but currently every account gets a free \\\\$18 credit from OpenAI, which is plenty for this experiment even with a few iterations.\n", 15 | "\n", 16 | "This notebook depends on a few Python libraries, which are listed in `requirements.txt`. See the README for details." 17 | ] 18 | }, 19 | { 20 | "cell_type": "markdown", 21 | "id": "79c63d21", 22 | "metadata": {}, 23 | "source": [ 24 | "## Test the connection and API key\n", 25 | "\n", 26 | "Make sure it's possible to use the OpenAI API." 27 | ] 28 | }, 29 | { 30 | "cell_type": "code", 31 | "execution_count": 1, 32 | "id": "24ff64b3", 33 | "metadata": {}, 34 | "outputs": [ 35 | { 36 | "data": { 37 | "text/plain": [ 38 | " JSON: {\n", 39 | " \"choices\": [\n", 40 | " {\n", 41 | " \"finish_reason\": \"length\",\n", 42 | " \"index\": 0,\n", 43 | " \"logprobs\": null,\n", 44 | " \"text\": \"\\n\\nThis is a test.\"\n", 45 | " }\n", 46 | " ],\n", 47 | " \"created\": 1674461326,\n", 48 | " \"id\": \"cmpl-6bmAgQqLkF71IJL0qqBELSTuq2P2h\",\n", 49 | " \"model\": \"text-curie-001\",\n", 50 | " \"object\": \"text_completion\",\n", 51 | " \"usage\": {\n", 52 | " \"completion_tokens\": 7,\n", 53 | " \"prompt_tokens\": 5,\n", 54 | " \"total_tokens\": 12\n", 55 | " }\n", 56 | "}" 57 | ] 58 | }, 59 | "execution_count": 1, 60 | "metadata": {}, 61 | "output_type": "execute_result" 62 | } 63 | ], 64 | "source": [ 65 | "import openai\n", 66 | "import os\n", 67 | "\n", 68 | "# read the OpenAI API key from an environment variable\n", 69 | 
"openai.api_key = os.environ['OPENAI_API_KEY']\n", 70 | "\n", 71 | "# test the API connection by making a simple request\n", 72 | "response = openai.Completion.create(model=\"text-curie-001\", prompt=\"Say this is a test\", temperature=0, max_tokens=7)\n", 73 | "response" 74 | ] 75 | }, 76 | { 77 | "cell_type": "markdown", 78 | "id": "73db4c29", 79 | "metadata": {}, 80 | "source": [ 81 | "## Prepare the data set\n", 82 | "\n", 83 | "Extract metadata and PDF text" 84 | ] 85 | }, 86 | { 87 | "cell_type": "code", 88 | "execution_count": 2, 89 | "id": "a6f64d05", 90 | "metadata": {}, 91 | "outputs": [], 92 | "source": [ 93 | "# Define some settings for the metadata extraction\n", 94 | "\n", 95 | "import glob\n", 96 | "\n", 97 | "MAXPAGES = 5 # how many pages of text to extract (maximum)\n", 98 | "MARGIN = 2 # how many more pages to look at, in case we can't find text from the first ones\n", 99 | "TEXT_MIN = 500 # how many words to aim for (minimum)\n", 100 | "TEXT_LIMIT = 700 # upper limit on # of words\n", 101 | "\n", 102 | "# files containing metadata about doctoral theses documents, exported from DSpace repositories\n", 103 | "METADATAFILES = glob.glob(\"dspace/*-doctheses.xml\")\n", 104 | "\n", 105 | "# metadata fields we are interested in (corresponding to fields used in DSpace)\n", 106 | "# syntax: \"fieldname\" or \"fieldname/qualifier\"\n", 107 | "METADATAFIELDS = \"\"\"\n", 108 | "title\n", 109 | "title/alternative\n", 110 | "contributor/faculty\n", 111 | "contributor/author\n", 112 | "contributor/organization\n", 113 | "contributor/opponent\n", 114 | "contributor/supervisor\n", 115 | "contributor/reviewer\n", 116 | "publisher\n", 117 | "date/issued\n", 118 | "relation/issn\n", 119 | "relation/isbn\n", 120 | "relation/ispartofseries\n", 121 | "relation/numberinseries\n", 122 | "\"\"\".strip().split()\n", 123 | "\n", 124 | "# identifiers of documents that will form the test set\n", 125 | "# these have been selected to correspond with a demo application for the 
same purpose\n", 126 | "TEST_SET_IDS = \"\"\"\n", 127 | "handle_10024_181378\n", 128 | "handle_10024_181284\n", 129 | "handle_10024_181280\n", 130 | "handle_10024_181229\n", 131 | "handle_10024_181227\n", 132 | "handle_10024_181210\n", 133 | "handle_10024_181206\n", 134 | "handle_10024_181139\n", 135 | "handle_10024_181073\n", 136 | "handle_10024_181025\n", 137 | "handle_10024_181001\n", 138 | "handle_10024_163335\n", 139 | "handle_10024_163304\n", 140 | "handle_10024_163298\n", 141 | "handle_10024_163277\n", 142 | "handle_10024_163276\n", 143 | "handle_10024_163263\n", 144 | "handle_10024_163258\n", 145 | "handle_10024_163257\n", 146 | "handle_10024_163057\n", 147 | "handle_10024_163056\n", 148 | "handle_10024_162878\n", 149 | "handle_10024_11364\n", 150 | "handle_10024_11363\n", 151 | "handle_10024_11348\n", 152 | "handle_10024_11207\n", 153 | "handle_10024_10928\n", 154 | "handle_10024_10620\n", 155 | "handle_10024_10614\n", 156 | "handle_10024_10443\n", 157 | "handle_10024_10432\n", 158 | "handle_10024_10254\n", 159 | "handle_10024_152922\n", 160 | "handle_10024_152903\n", 161 | "handle_10024_152904\n", 162 | "handle_10024_152860\n", 163 | "handle_10024_152862\n", 164 | "handle_10024_152853\n", 165 | "handle_10024_152854\n", 166 | "handle_10024_152855\n", 167 | "handle_10024_152852\n", 168 | "handle_10024_152846\n", 169 | "handle_10024_152836\n", 170 | "\"\"\".strip().split()" 171 | ] 172 | }, 173 | { 174 | "cell_type": "code", 175 | "execution_count": 3, 176 | "id": "2ed0cf46", 177 | "metadata": {}, 178 | "outputs": [ 179 | { 180 | "name": "stdout", 181 | "output_type": "stream", 182 | "text": [ 183 | "skipping page 5 of docs/osuva.uwasa.fi_handle_10024_13521.pdf: text would become too long\n", 184 | "skipping page 5 of docs/osuva.uwasa.fi_handle_10024_13449.pdf: text would become too long\n", 185 | "skipping page 5 of docs/osuva.uwasa.fi_handle_10024_13420.pdf: text would become too long\n", 186 | "skipping page 5 of 
docs/osuva.uwasa.fi_handle_10024_13378.pdf: text would become too long\n", 187 | "skipping page 5 of docs/osuva.uwasa.fi_handle_10024_13377.pdf: text would become too long\n", 188 | "skipping page 5 of docs/osuva.uwasa.fi_handle_10024_13257.pdf: text would become too long\n", 189 | "skipping page 5 of docs/osuva.uwasa.fi_handle_10024_13205.pdf: text would become too long\n", 190 | "skipping page 5 of docs/osuva.uwasa.fi_handle_10024_13177.pdf: text would become too long\n", 191 | "skipping page 5 of docs/osuva.uwasa.fi_handle_10024_13169.pdf: text would become too long\n", 192 | "skipping page 5 of docs/osuva.uwasa.fi_handle_10024_13167.pdf: text would become too long\n", 193 | "skipping page 5 of docs/osuva.uwasa.fi_handle_10024_13159.pdf: text would become too long\n", 194 | "skipping page 5 of docs/osuva.uwasa.fi_handle_10024_13153.pdf: text would become too long\n", 195 | "skipping page 5 of docs/osuva.uwasa.fi_handle_10024_13152.pdf: text would become too long\n", 196 | "skipping page 5 of docs/osuva.uwasa.fi_handle_10024_13119.pdf: text would become too long\n", 197 | "skipping page 5 of docs/osuva.uwasa.fi_handle_10024_13116.pdf: text would become too long\n", 198 | "skipping page 5 of docs/osuva.uwasa.fi_handle_10024_12958.pdf: text would become too long\n", 199 | "skipping page 5 of docs/osuva.uwasa.fi_handle_10024_12619.pdf: text would become too long\n", 200 | "skipping page 5 of docs/osuva.uwasa.fi_handle_10024_12481.pdf: text would become too long\n", 201 | "skipping page 5 of docs/osuva.uwasa.fi_handle_10024_12480.pdf: text would become too long\n", 202 | "skipping page 5 of docs/osuva.uwasa.fi_handle_10024_12300.pdf: text would become too long\n", 203 | "skipping page 5 of docs/osuva.uwasa.fi_handle_10024_12272.pdf: text would become too long\n", 204 | "skipping page 5 of docs/osuva.uwasa.fi_handle_10024_12259.pdf: text would become too long\n", 205 | "skipping page 5 of docs/osuva.uwasa.fi_handle_10024_11784.pdf: text would become too long\n", 206 | 
"skipping page 5 of docs/osuva.uwasa.fi_handle_10024_11593.pdf: text would become too long\n", 207 | "skipping page 5 of docs/osuva.uwasa.fi_handle_10024_11584.pdf: text would become too long\n", 208 | "skipping page 5 of docs/osuva.uwasa.fi_handle_10024_11561.pdf: text would become too long\n", 209 | "skipping page 5 of docs/osuva.uwasa.fi_handle_10024_11491.pdf: text would become too long\n", 210 | "skipping page 5 of docs/osuva.uwasa.fi_handle_10024_11407.pdf: text would become too long\n", 211 | "skipping page 5 of docs/osuva.uwasa.fi_handle_10024_11363.pdf: text would become too long\n", 212 | "skipping page 5 of docs/osuva.uwasa.fi_handle_10024_11348.pdf: text would become too long\n", 213 | "skipping page 5 of docs/osuva.uwasa.fi_handle_10024_11207.pdf: text would become too long\n", 214 | "skipping page 5 of docs/osuva.uwasa.fi_handle_10024_10614.pdf: text would become too long\n", 215 | "skipping page 5 of docs/osuva.uwasa.fi_handle_10024_10443.pdf: text would become too long\n", 216 | "skipping page 5 of docs/osuva.uwasa.fi_handle_10024_10432.pdf: text would become too long\n", 217 | "skipping page 5 of docs/www.utupub.fi_handle_10024_153232.pdf: text would become too long\n", 218 | "skipping page 5 of docs/www.utupub.fi_handle_10024_153200.pdf: text would become too long\n", 219 | "skipping page 5 of docs/www.doria.fi_handle_10024_182724.pdf: text would become too long\n", 220 | "skipping page 6 of docs/www.doria.fi_handle_10024_182724.pdf: text would become too long\n", 221 | "skipping page 5 of docs/www.doria.fi_handle_10024_182159.pdf: text would become too long\n", 222 | "skipping page 6 of docs/www.doria.fi_handle_10024_182159.pdf: text would become too long\n", 223 | "skipping page 5 of docs/www.doria.fi_handle_10024_181975.pdf: text would become too long\n", 224 | "skipping page 6 of docs/www.doria.fi_handle_10024_181975.pdf: text would become too long\n", 225 | "skipping page 5 of docs/www.doria.fi_handle_10024_181902.pdf: text would become too 
long\n", 226 | "skipping page 6 of docs/www.doria.fi_handle_10024_181902.pdf: text would become too long\n", 227 | "skipping page 7 of docs/www.doria.fi_handle_10024_181902.pdf: text would become too long\n", 228 | "skipping page 5 of docs/www.doria.fi_handle_10024_181847.pdf: text would become too long\n", 229 | "skipping page 5 of docs/www.doria.fi_handle_10024_181727.pdf: text would become too long\n", 230 | "no file element found (id: https://www.doria.fi/handle/10024/181511), skipping\n", 231 | "skipping page 5 of docs/www.doria.fi_handle_10024_181280.pdf: text would become too long\n", 232 | "skipping page 6 of docs/www.doria.fi_handle_10024_181280.pdf: text would become too long\n", 233 | "skipping page 7 of docs/www.doria.fi_handle_10024_181280.pdf: text would become too long\n", 234 | "skipping page 5 of docs/www.doria.fi_handle_10024_181227.pdf: text would become too long\n", 235 | "skipping page 5 of docs/www.doria.fi_handle_10024_181073.pdf: text would become too long\n", 236 | "no file element found (id: https://lutpub.lut.fi/handle/10024/163756), skipping\n", 237 | "train set size: 149\n", 238 | "test set size: 43\n" 239 | ] 240 | } 241 | ], 242 | "source": [ 243 | "#%%time\n", 244 | "\n", 245 | "from lxml import etree\n", 246 | "import requests\n", 247 | "import os.path\n", 248 | "from pypdf import PdfReader\n", 249 | "import glob\n", 250 | "\n", 251 | "# train set: document identifiers, text (x) and metadata (y)\n", 252 | "train_ids = []\n", 253 | "train_x = []\n", 254 | "train_y = []\n", 255 | "\n", 256 | "# test set: document identifiers, text (x) and metadata (y)\n", 257 | "test_ids = []\n", 258 | "test_x = []\n", 259 | "test_y = []\n", 260 | "\n", 261 | "def extract_metadata(doc_item):\n", 262 | " \"\"\"extract the metadata as a list of (key, value) tuples from an etree element representing a document\"\"\"\n", 263 | " metadata = []\n", 264 | " for fldname in METADATAFIELDS:\n", 265 | " if '/' in fldname:\n", 266 | " fld, qualifier = 
fldname.split('/')\n", 267 | "            for val in doc_item.findall(f\"metadata[@element='{fld}'][@qualifier='{qualifier}']\"):\n", 268 | "                if fld == 'date':\n", 269 | "                    metadata.append((fldname, val.text[:4])) # only the year \n", 270 | "                else:\n", 271 | "                    metadata.append((fldname, val.text))\n", 272 | "        else:\n", 273 | "            for val in doc_item.findall(f\"metadata[@element='{fldname}'][@qualifier='']\"):\n", 274 | "                metadata.append((fldname, val.text))\n", 275 | "    return metadata\n", 276 | "\n", 277 | "def id_to_fn(identifier):\n", 278 | "    \"\"\"convert a URI identifier to a simpler string we can use as a filename for the PDF\"\"\"\n", 279 | "    return 'docs/' + identifier.replace('https://', '').replace('/','_') + \".pdf\"\n", 280 | "\n", 281 | "def download(file_url, identifier):\n", 282 | "    \"\"\"download a PDF file, with the given identifier, from the given URL (unless this was done already)\n", 283 | "    and return a path to the PDF file\"\"\"\n", 284 | "    path = id_to_fn(identifier)\n", 285 | "    if os.path.exists(path) and os.path.getsize(path) > 0:\n", 286 | "        return path\n", 287 | "\n", 288 | "    response = requests.get(file_url)\n", 289 | "    with open(path, \"wb\") as f:\n", 290 | "        f.write(response.content)\n", 291 | "    print(f\"wrote {file_url} as {path}\")\n", 292 | "    return path\n", 293 | "\n", 294 | "def extract_text(fn):\n", 295 | "    \"\"\"extract and return the first few pages of text from the given PDF file\"\"\"\n", 296 | "    reader = PdfReader(fn)\n", 297 | "    texts = []\n", 298 | "    extracted_pages = 0\n", 299 | "    extracted_length = 0\n", 300 | "    for idx, page in enumerate(reader.pages[:MAXPAGES + MARGIN]):\n", 301 | "        text = page.extract_text()\n", 302 | "        text_length = len(text.strip().split()) \n", 303 | "        if extracted_length + text_length < TEXT_LIMIT:\n", 304 | "            texts.append(text)\n", 305 | "            extracted_length += text_length\n", 306 | "            extracted_pages += 1\n", 307 | "        else:\n", 308 | "            print(f\"skipping page {idx+1} of {fn}: text would become too long\")\n", 309 | 
"        if extracted_pages >= MAXPAGES or extracted_length >= TEXT_MIN:\n", 310 | "            break\n", 311 | "    return '\\n'.join(texts)\n", 312 | "\n", 313 | "def is_test_doc(identifier):\n", 314 | "    \"\"\"return True iff the given identifier belongs to the test set\"\"\"\n", 315 | "    shortid = 'handle' + identifier.split('handle')[1].replace('/', '_')\n", 316 | "    return shortid in TEST_SET_IDS \n", 317 | "\n", 318 | "# Read all the metadata files, extract the DSpace metadata, download the PDFs and extract text from them\n", 319 | "# into the train_* and test_* lists\n", 320 | "for fn in METADATAFILES:\n", 321 | "    tree = etree.parse(fn)\n", 322 | "    for item in tree.findall('item'):\n", 323 | "        try:\n", 324 | "            identifier = item.find(\"metadata[@element='identifier'][@qualifier='uri']\").text\n", 325 | "        except AttributeError:\n", 326 | "            print(\"no identifier found, skipping\")\n", 327 | "            continue\n", 328 | "        try:\n", 329 | "            file_url = item.find('file').text\n", 330 | "        except AttributeError:\n", 331 | "            print(f\"no file element found (id: {identifier}), skipping\")\n", 332 | "            continue\n", 335 | "        path = download(file_url, identifier)\n", 336 | "        text = extract_text(path)\n", 337 | "        metadata = extract_metadata(item)\n", 338 | "        if is_test_doc(identifier):\n", 339 | "            test_ids.append(identifier)\n", 340 | "            test_x.append(text)\n", 341 | "            test_y.append(metadata)\n", 342 | "        else:\n", 343 | "            train_ids.append(identifier)\n", 344 | "            train_x.append(text)\n", 345 | "            train_y.append(metadata)\n", 346 | "\n", 347 | "print(f\"train set size: {len(train_ids)}\")\n", 348 | "print(f\"test set size: {len(test_ids)}\")" 349 | ] 350 | }, 351 | { 352 | "cell_type": "markdown", 353 | "id": "b1daf3fc", 354 | "metadata": {}, 355 | "source": [ 356 | "## Fine-tuning\n", 357 | "\n", 358 | "Prepare a fine-tuning dataset and use it to fine-tune a GPT-3 model." 
359 | ] 360 | }, 361 | { 362 | "cell_type": "code", 363 | "execution_count": 4, 364 | "id": "14f9952f", 365 | "metadata": {}, 366 | "outputs": [ 367 | { 368 | "name": "stdout", 369 | "output_type": "stream", 370 | "text": [ 371 | "wrote fine-tuning data set into file fine-tune.jsonl\n" 372 | ] 373 | } 374 | ], 375 | "source": [ 376 | "# prepare fine-tuning dataset\n", 377 | "import json\n", 378 | "\n", 379 | "PROMPT_SUFFIX = '\\n\\n###\\n\\n'\n", 380 | "COMPLETION_STOP = '\\n###'\n", 381 | "TRAINFILE = 'fine-tune.jsonl'\n", 382 | "\n", 383 | "def metadata_to_text(metadata):\n", 384 | " \"\"\"convert the metadata tuple to text with key: value pairs\"\"\"\n", 385 | " return \"\\n\".join([f\"{fld}: {val}\" for fld, val in metadata])\n", 386 | "\n", 387 | "def create_sample(text, metadata):\n", 388 | " \"\"\"create a fine-tuning sample from text and metadata about a single document\"\"\"\n", 389 | " return {'prompt': text + PROMPT_SUFFIX,\n", 390 | " 'completion': metadata_to_text(metadata) + COMPLETION_STOP}\n", 391 | "\n", 392 | "with open(TRAINFILE, 'w') as outf:\n", 393 | " for text, metadata in zip(train_x, train_y):\n", 394 | " sample = create_sample(text, metadata)\n", 395 | " print(json.dumps(sample), file=outf)\n", 396 | "\n", 397 | "print(f\"wrote fine-tuning data set into file {TRAINFILE}\")" 398 | ] 399 | }, 400 | { 401 | "cell_type": "code", 402 | "execution_count": 5, 403 | "id": "461a296d", 404 | "metadata": {}, 405 | "outputs": [ 406 | { 407 | "name": "stdout", 408 | "output_type": "stream", 409 | "text": [ 410 | "Analyzing...\n", 411 | "\n", 412 | "- Your file contains 149 prompt-completion pairs\n", 413 | "- All prompts end with suffix `\\n\\n###\\n\\n`\n", 414 | "- All completions start with prefix `title: `. Most of the time you should only add the output data into the completion, without any prefix\n", 415 | "- All completions end with suffix `\\n###`\n", 416 | "- The completion should start with a whitespace character (` `). 
This tends to produce better results due to the tokenization we use. See https://beta.openai.com/docs/guides/fine-tuning/preparing-your-dataset for more details\n", 417 | "\n", 418 | "Based on the analysis we will perform the following actions:\n", 419 | "- [Recommended] Remove prefix `title: ` from all completions [Y/n]: ^C\n", 420 | "\n" 421 | ] 422 | } 423 | ], 424 | "source": [ 425 | "# Optional:\n", 426 | "# Check that the fine-tuning data set is OK using the prepare_data tool.\n", 427 | "# It will complain that all completions start with the same \"title:\" prefix; this can be ignored.\n", 428 | "# NOTE: The command has to be interrupted by pressing the stop button in Jupyter.\n", 429 | "!openai tools fine_tunes.prepare_data -f fine-tune.jsonl" 430 | ] 431 | }, 432 | { 433 | "cell_type": "code", 434 | "execution_count": 6, 435 | "id": "e03d7a85", 436 | "metadata": {}, 437 | "outputs": [ 438 | { 439 | "name": "stdout", 440 | "output_type": "stream", 441 | "text": [ 442 | "Upload progress: 100%|███████████████████████| 643k/643k [00:00<00:00, 554Mit/s]\n", 443 | "Uploaded file from fine-tune.jsonl: file-qggpGW5qlkmO5Moe0DzGP4Hw\n", 444 | "Created fine-tune: ft-iwX4ajAH4TaptYpJdXFSSOu2\n", 445 | "Streaming events until fine-tuning is complete...\n", 446 | "\n", 447 | "(Ctrl-C will interrupt the stream, but not cancel the fine-tune)\n", 448 | "[2023-01-23 10:18:08] Created fine-tune: ft-iwX4ajAH4TaptYpJdXFSSOu2\n", 449 | "\n", 450 | "Stream interrupted (client disconnected).\n", 451 | "To resume the stream, run:\n", 452 | "\n", 453 | " openai api fine_tunes.follow -i ft-iwX4ajAH4TaptYpJdXFSSOu2\n", 454 | "\n" 455 | ] 456 | } 457 | ], 458 | "source": [ 459 | "# Perform the actual fine-tuning via the API. 
This can take a while; there may be a long queue.\n", 460 | "!openai api fine_tunes.create -t fine-tune.jsonl -m curie" 461 | ] 462 | }, 463 | { 464 | "cell_type": "code", 465 | "execution_count": 7, 466 | "id": "2327e76c", 467 | "metadata": {}, 468 | "outputs": [ 469 | { 470 | "name": "stdout", 471 | "output_type": "stream", 472 | "text": [ 473 | "[2023-01-23 10:18:08] Created fine-tune: ft-iwX4ajAH4TaptYpJdXFSSOu2\n", 474 | "[2023-01-23 10:22:29] Fine-tune costs $2.40\n", 475 | "[2023-01-23 10:22:29] Fine-tune enqueued. Queue number: 1\n", 476 | "[2023-01-23 10:22:30] Fine-tune is in the queue. Queue number: 0\n", 477 | "[2023-01-23 10:22:38] Fine-tune started\n", 478 | "[2023-01-23 10:24:36] Completed epoch 1/4\n", 479 | "[2023-01-23 10:25:56] Completed epoch 2/4\n", 480 | "[2023-01-23 10:27:16] Completed epoch 3/4\n", 481 | "[2023-01-23 10:28:38] Completed epoch 4/4\n", 482 | "[2023-01-23 10:29:01] Uploaded model: curie:ft-personal-2023-01-23-08-29-00\n", 483 | "[2023-01-23 10:29:02] Uploaded result file: file-ZDZOAruJOzm3JPYQMfjIQmIu\n", 484 | "[2023-01-23 10:29:02] Fine-tune succeeded\n", 485 | "\n", 486 | "Job complete! 
Status: succeeded 🎉\n", 487 | "Try out your fine-tuned model:\n", 488 | "\n", 489 | "openai api completions.create -m curie:ft-personal-2023-01-23-08-29-00 -p \n" 490 | ] 491 | } 492 | ], 493 | "source": [ 494 | "# reattach in case the stream gets interrupted\n", 495 | "#!openai api fine_tunes.follow -i \n", 496 | "!openai api fine_tunes.follow -i ft-iwX4ajAH4TaptYpJdXFSSOu2" 497 | ] 498 | }, 499 | { 500 | "cell_type": "code", 501 | "execution_count": 8, 502 | "id": "9376e9a8", 503 | "metadata": {}, 504 | "outputs": [], 505 | "source": [ 506 | "# copy the model name from above output and store it in a variable\n", 507 | "model_name = \"curie:ft-personal-2023-01-23-08-29-00\"" 508 | ] 509 | }, 510 | { 511 | "cell_type": "markdown", 512 | "id": "4285709c", 513 | "metadata": {}, 514 | "source": [ 515 | "## Test the fine-tuned model\n", 516 | "\n", 517 | "Give the model some documents from the test set that it has never seen before and see what kind of metadata it can extract. Compare that to the manually created metadata of the same documents, extracted from DSpace systems." 
518 | ] 519 | }, 520 | { 521 | "cell_type": "code", 522 | "execution_count": 9, 523 | "id": "e8cba90a", 524 | "metadata": { 525 | "scrolled": true 526 | }, 527 | "outputs": [ 528 | { 529 | "name": "stdout", 530 | "output_type": "stream", 531 | "text": [ 532 | "https://osuva.uwasa.fi/handle/10024/11207\n", 533 | "---\n", 534 | "DSpace metadata:\n", 535 | "title: How to apply technology, knowledge and operations decision models for strategically sustainable resource allocation?\n", 536 | "title/alternative: Kuinka soveltaa teknologiaan, tietoon ja tuotantoon liittyvän päätöksenteon malleja strategisesti kestävään resurssien allokointiin?\n", 537 | "contributor/faculty: fi=Tekniikan ja innovaatiojohtamisen yksikkö|en=School of Technology and Innovations|\n", 538 | "contributor/author: Tilabi, Sara\n", 539 | "contributor/organization: fi=Vaasan yliopisto|en=University of Vaasa|\n", 540 | "publisher: Vaasan yliopisto\n", 541 | "date/issued: 2020\n", 542 | "relation/issn: 2323-9123\n", 543 | "relation/issn: 0355-2667\n", 544 | "relation/isbn: 978-952-476-915-0\n", 545 | "relation/ispartofseries: Acta Wasaensia\n", 546 | "relation/numberinseries: 445\n", 547 | "---\n", 548 | "Generated metadata:\n", 549 | "title: How to apply technology, knowledge and operations decision models for strategically sustainable resource allocation?\n", 550 | "title/alternative: Kuinka soveltaa teknologiaan, tietoon ja tuotantoon liittyvää päätöksen teon malleja strategisesti kestävään resurssien allokointiin?\n", 551 | "contributor/faculty: fi=Tekniikan ja innovaatiojohtamisen yksikkö|en=School of Technology and Innovations|\n", 552 | "contributor/author: Tilabi, Sara\n", 553 | "contributor/organization: fi=Vaasan yliopisto|en=University of Vaasa|\n", 554 | "publisher: Vaasan yliopisto\n", 555 | "date/issued: 2020\n", 556 | "relation/issn: 2323-9123\n", 557 | "relation/issn: 0355-2667\n", 558 | "relation/isbn: 978-952-476-915-0\n", 559 | "relation/ispartofseries: Acta Wasaensia\n", 560 | 
"relation/numberinseries: 445\n", 561 | "\n", 562 | "https://osuva.uwasa.fi/handle/10024/10432\n", 563 | "---\n", 564 | "DSpace metadata:\n", 565 | "title: Persoonallinen ajattelu päättelyssä ja päätöksenteossa\n", 566 | "title/alternative: Personal Thinking in reasoning and decision making\n", 567 | "contributor/faculty: fi=Tekniikan ja innovaatiojohtamisen yksikkö|en=School of Technology and Innovations|\n", 568 | "contributor/author: Alho, Tapio\n", 569 | "contributor/organization: fi=Vaasan yliopisto|en=University of Vaasa|\n", 570 | "publisher: Vaasan yliopisto\n", 571 | "date/issued: 2020\n", 572 | "relation/issn: 2323-9123\n", 573 | "relation/issn: 0355-2667\n", 574 | "relation/isbn: 978-952-476-903-7\n", 575 | "relation/ispartofseries: Acta Wasaensia\n", 576 | "relation/numberinseries: 440\n", 577 | "---\n", 578 | "Generated metadata:\n", 579 | "title: Persoonallinen ajattelu päättelyssä ja päätöksenteossa\n", 580 | "title/alternative: Personal thinking in judgment and decision-making\n", 581 | "contributor/faculty: fi=Tekniikan ja innovaatiojohtamisen yksikkö|en=School of Technology and Innovations|\n", 582 | "contributor/author: Alho, Tapio\n", 583 | "contributor/organization: fi=Vaasan yliopisto|en=University of Vaasa|\n", 584 | "publisher: Vaasan yliopisto\n", 585 | "date/issued: 2020\n", 586 | "relation/issn: 2323-9123\n", 587 | "relation/issn: 0355-2667\n", 588 | "relation/isbn: 978-952-476-903-7\n", 589 | "relation/ispartofseries: Acta Wasaensia\n", 590 | "relation/numberinseries: 440\n", 591 | "\n", 592 | "https://www.utupub.fi/handle/10024/152860\n", 593 | "---\n", 594 | "DSpace metadata:\n", 595 | "title: Essays on income inequality and financial incentives to work\n", 596 | "contributor/faculty: fi=Turun kauppakorkeakoulu|en=Turku School of Economics|\n", 597 | "contributor/author: Ollonqvist, Joonas\n", 598 | "publisher: fi=Turun yliopisto. 
Turun kauppakorkeakoulu|en=University of Turku, Turku School of Economics|\n", 599 | "date/issued: 2021\n", 600 | "relation/issn: 2343-3167\n", 601 | "relation/ispartofseries: Turun yliopiston julkaisuja - Annales Universitatis Turkuensis, Ser E: Oeconomica\n", 602 | "relation/numberinseries: 82\n", 603 | "---\n", 604 | "Generated metadata:\n", 605 | "title: Esa: Income inequality and financial incentives to work\n", 606 | "contributor/faculty: fi=Turun kauppakorkeakoulu|en=Turku School of Economics|\n", 607 | "contributor/author: Ollonqvist, Joonas\n", 608 | "publisher: fi=Turun yliopisto. Turun kauppakorkeakoulu|en=University of Turku, Turku School of Economics|\n", 609 | "date/issued: 2022\n", 610 | "relation/issn: 2343-3167\n", 611 | "relation/ispartofseries: Turun yliopiston julkaisuja - Annales Universitatis Turkuensis, Ser E: Oeconomica\n", 612 | "relation/numberinseries: 82\n", 613 | "\n", 614 | "https://www.utupub.fi/handle/10024/152852\n", 615 | "---\n", 616 | "DSpace metadata:\n", 617 | "title: Run-time management of many-core SoCs: A communication-centric approach\n", 618 | "contributor/faculty: fi=Teknillinen tiedekunta|en=Faculty of Technology|\n", 619 | "contributor/author: Fattah, Mohammad\n", 620 | "publisher: fi=Turun yliopisto|en=University of Turku|\n", 621 | "date/issued: 2021\n", 622 | "relation/issn: 2736-9684\n", 623 | "relation/ispartofseries: Turun yliopiston julkaisua - Annales Universitatis Turkuensis, Ser. 
F: Technica - Informatica\n", 624 | "relation/numberinseries: 7\n", 625 | "---\n", 626 | "Generated metadata:\n", 627 | "title: Run-time management of many-core systems – A communication-centric approach\n", 628 | "contributor/faculty: fi=Teknillinen tiedekunta|en=Faculty of Technology|\n", 629 | "contributor/author: Fattah, Mohammad\n", 630 | "publisher: fi=Turun yliopisto|en=University of Turku|\n", 631 | "date/issued: 2021\n", 632 | "relation/issn: 2736-9684\n", 633 | "relation/ispartofseries: Turun yliopiston julkaisuja - Annales Universitatis Turkuensis, Ser. F: Technica - Informatica\n", 634 | "relation/numberinseries: 7\n", 635 | "\n", 636 | "https://www.doria.fi/handle/10024/181280\n", 637 | "---\n", 638 | "DSpace metadata:\n", 639 | "title: Production and testing of magnesium carbonate hydrates for thermal energy storage (TES) application\n", 640 | "contributor/faculty: Faculty of Science and Engineering\n", 641 | "contributor/faculty: Fakulteten för naturvetenskaper och teknik\n", 642 | "contributor/faculty: Luonnontieteiden ja tekniikan tiedekunta\n", 643 | "contributor/author: Erlund, Rickard\n", 644 | "contributor/opponent: Professor Brian Elmegaard, Technical University of Denmark, Lyngby, Denmark\n", 645 | "contributor/supervisor: Professor Ron Zevenhoven, Åbo Akademi University, Turku\n", 646 | "publisher: Åbo Akademi University\n", 647 | "date/issued: 2021\n", 648 | "---\n", 649 | "Generated metadata:\n", 650 | "title: Production and Testing of Magnesium Carbonate Hydrates for Thermal Energy Storage (TES) Application\n", 651 | "contributor/faculty: Faculty of Science and Engineering\n", 652 | "contributor/faculty: Fakulteten för naturvetenskaper och teknik\n", 653 | "contributor/faculty: Luonnontieteiden ja tekniikan tiedekunta\n", 654 | "contributor/author: Erlund, Rickard\n", 655 | "contributor/opponent: Professor, Technical University of Denmark, Lyngby, Denmark\n", 656 | "contributor/supervisor: Professor, Åbo Akademi University, Turku\n", 657 
| "contributor/supervisor: Professor, Åbo Akademi University, Turku\n", 658 | "publisher: Åbo Akademi University\n", 659 | "date/issued: 2021\n", 660 | "\n", 661 | "https://www.doria.fi/handle/10024/181139\n", 662 | "---\n", 663 | "DSpace metadata:\n", 664 | "title: Coulometric Transduction Method for Solid-Contact Ion-Selective Electrodes\n", 665 | "contributor/faculty: Faculty of Science and Engineering\n", 666 | "contributor/faculty: Fakulteten för naturvetenskaper och teknik\n", 667 | "contributor/faculty: Luonnontieteiden ja tekniikan tiedekunta\n", 668 | "contributor/author: Han, Tingting\n", 669 | "contributor/opponent: Professor Agata Michalska, University of Warsaw, Warsaw, Poland\n", 670 | "contributor/supervisor: Professor Johan Bobacka, Åbo Akademi University, Åbo\n", 671 | "contributor/supervisor: Dr. Ulriika Mattinen, Åbo Akademi University, Åbo\n", 672 | "contributor/supervisor: Docent Zekra Mousavi, Åbo Akademi University, Åbo\n", 673 | "publisher: Åbo Akademi University\n", 674 | "date/issued: 2021\n", 675 | "---\n", 676 | "Generated metadata:\n", 677 | "title: Coulometric transduction method for solid-contact ion-selective electrodes\n", 678 | "contributor/faculty: Faculty of Science and Engineering\n", 679 | "contributor/faculty: Fakulteten för naturvetenskaper och teknik\n", 680 | "contributor/faculty: Luonnontieteiden ja tekniikan tiedekunta\n", 681 | "contributor/author: Han, Tingting\n", 682 | "contributor/opponent: Professor Agata Michalska, University of Warsaw, Warsaw, Poland\n", 683 | "contributor/supervisor: Professor Johan Bobacka, Åbo Akademi University, Åbo\n", 684 | "contributor/supervisor: Dr. 
Ulriika Mattinen, Åbo Akademi University, Åbo\n", 685 | "contributor/supervisor: Docent Zekra Mousavi, Åbo Akademi University, Åbo\n", 686 | "publisher: Åbo Akademi University\n", 687 | "date/issued: 2021\n", 688 | "\n", 689 | "https://lutpub.lut.fi/handle/10024/163304\n", 690 | "---\n", 691 | "DSpace metadata:\n", 692 | "title: Responsible business practices in internationalized SMEs\n", 693 | "contributor/faculty: fi=School of Business and Management|en=School of Business and Management|\n", 694 | "contributor/author: Uzhegova, Maria\n", 695 | "contributor/organization: Lappeenrannan-Lahden teknillinen yliopisto LUT\n", 696 | "contributor/organization: Lappeenranta-Lahti University of Technology LUT\n", 697 | "contributor/opponent: Rialp-Criado, Alex\n", 698 | "contributor/reviewer: Rialp-Criado, Alex\n", 699 | "contributor/reviewer: Andersson, Svante\n", 700 | "date/issued: 2021\n", 701 | "relation/issn: 1456-4491\n", 702 | "relation/ispartofseries: Acta Universitatis Lappeenrantaensis\n", 703 | "---\n", 704 | "Generated metadata:\n", 705 | "title: Responsible business practices in internationalized SMEs\n", 706 | "contributor/faculty: fi=School of Business and Management|en=School of Business and Management|\n", 707 | "contributor/author: Uzhegova, Maria\n", 708 | "contributor/organization: Lappeenrannan-Lahden teknillinen yliopisto LUT\n", 709 | "contributor/organization: Lappeenranta-Lahti University of Technology LUT\n", 710 | "contributor/opponent: Rialp-Criado, Alex\n", 711 | "contributor/reviewer: Rialp-Criado, Alex\n", 712 | "contributor/reviewer: Andersson, Svante\n", 713 | "publisher: Lappeenranta-Lahti University of Technology LUT\n", 714 | "date/issued: 2021\n", 715 | "relation/issn: 1456-4491\n", 716 | "relation/ispartofseries: Acta Universitatis Lappeenrantaensis\n", 717 | "\n", 718 | "https://lutpub.lut.fi/handle/10024/163258\n", 719 | "---\n", 720 | "DSpace metadata:\n", 721 | "title: Life cycle cost-driven design for additive manufacturing : the 
frontier to sustainable manufacturing in laser-based powder bed fusion\n", 722 | "contributor/faculty: fi=School of Energy Systems|en=School of Energy Systems|\n", 723 | "contributor/author: Nyamekye, Patricia\n", 724 | "contributor/organization: Lappeenrannan-Lahden teknillinen yliopisto LUT\n", 725 | "contributor/organization: Lappeenranta-Lahti University of Technology LUT\n", 726 | "contributor/opponent: Hryha, Eduard\n", 727 | "contributor/reviewer: Hryha, Eduard\n", 728 | "contributor/reviewer: Mazzi, Anna\n", 729 | "publisher: Lappeenranta-Lahti University of Technology LUT\n", 730 | "date/issued: 2021\n", 731 | "relation/issn: 1456-4491\n", 732 | "relation/ispartofseries: Acta Universitatis Lappeenrantaensis\n", 733 | "---\n", 734 | "Generated metadata:\n" 735 | ] 736 | }, 737 | { 738 | "name": "stdout", 739 | "output_type": "stream", 740 | "text": [ 741 | "title: Life cycle cost-driven design for additive manufacturing: the frontier to sustainable manufacturing in laser-based powder bed fusion\n", 742 | "contributor/faculty: fi=School of Energy Systems|en=School of Energy Systems|\n", 743 | "contributor/author: Nyamekye, Patricia\n", 744 | "contributor/organization: Lappeenrannan-Lahden teknillinen yliopisto LUT\n", 745 | "contributor/organization: Lappeenranta-Lahti University of Technology LUT\n", 746 | "contributor/opponent: Hryha, Eduard\n", 747 | "contributor/reviewer: Hryha, Eduard\n", 748 | "contributor/reviewer: Mazzi, Anna\n", 749 | "publisher: Lappeenranta-Lahti University of Technology LUT\n", 750 | "date/issued: 2021\n", 751 | "relation/issn: 1456-4491\n", 752 | "relation/ispartofseries: Acta Universitatis Lappeenrantaensis\n", 753 | "\n" 754 | ] 755 | } 756 | ], 757 | "source": [ 758 | "def get_completions(text):\n", 759 | " response = openai.Completion.create(model=model_name,\n", 760 | " prompt=text + PROMPT_SUFFIX,\n", 761 | " temperature=0, # no fooling around!\n", 762 | " max_tokens=500, # should be plenty\n", 763 | " 
stop=[COMPLETION_STOP]) # stop at ###\n", 764 | " return response['choices'][0]['text']\n", 765 | "\n", 766 | "# test it with some sample documents from the test set\n", 767 | "for idx in (3,8,13,18,23,28,33,38):\n", 768 | " identifier = test_ids[idx]\n", 769 | " text = test_x[idx]\n", 770 | " metadata = test_y[idx]\n", 771 | " print(identifier)\n", 772 | " print(\"---\")\n", 773 | " print(\"DSpace metadata:\")\n", 774 | " print(metadata_to_text(metadata))\n", 775 | " print(\"---\")\n", 776 | " print(\"Generated metadata:\")\n", 777 | " gen_metadata = get_completions(text).strip()\n", 778 | " print(gen_metadata)\n", 779 | " print()\n", 780 | " \n", 781 | " " 782 | ] 783 | }, 784 | { 785 | "cell_type": "code", 786 | "execution_count": null, 787 | "id": "a6dd50b7", 788 | "metadata": {}, 789 | "outputs": [], 790 | "source": [] 791 | } 792 | ], 793 | "metadata": { 794 | "kernelspec": { 795 | "display_name": "Python 3 (ipykernel)", 796 | "language": "python", 797 | "name": "python3" 798 | }, 799 | "language_info": { 800 | "codemirror_mode": { 801 | "name": "ipython", 802 | "version": 3 803 | }, 804 | "file_extension": ".py", 805 | "mimetype": "text/x-python", 806 | "name": "python", 807 | "nbconvert_exporter": "python", 808 | "pygments_lexer": "ipython3", 809 | "version": "3.10.6" 810 | } 811 | }, 812 | "nbformat": 4, 813 | "nbformat_minor": 5 814 | } 815 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Apache License 2 | Version 2.0, January 2004 3 | http://www.apache.org/licenses/ 4 | 5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 6 | 7 | 1. Definitions. 8 | 9 | "License" shall mean the terms and conditions for use, reproduction, 10 | and distribution as defined by Sections 1 through 9 of this document. 
11 | 12 | "Licensor" shall mean the copyright owner or entity authorized by 13 | the copyright owner that is granting the License. 14 | 15 | "Legal Entity" shall mean the union of the acting entity and all 16 | other entities that control, are controlled by, or are under common 17 | control with that entity. For the purposes of this definition, 18 | "control" means (i) the power, direct or indirect, to cause the 19 | direction or management of such entity, whether by contract or 20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 21 | outstanding shares, or (iii) beneficial ownership of such entity. 22 | 23 | "You" (or "Your") shall mean an individual or Legal Entity 24 | exercising permissions granted by this License. 25 | 26 | "Source" form shall mean the preferred form for making modifications, 27 | including but not limited to software source code, documentation 28 | source, and configuration files. 29 | 30 | "Object" form shall mean any form resulting from mechanical 31 | transformation or translation of a Source form, including but 32 | not limited to compiled object code, generated documentation, 33 | and conversions to other media types. 34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below). 39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 
47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. 
Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. 
You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 
122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. 
In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "[]" 182 | replaced with your own identifying information. (Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. 
We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright [yyyy] [name of copyright owner] 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 193 | You may obtain a copy of the License at 194 | 195 | http://www.apache.org/licenses/LICENSE-2.0 196 | 197 | Unless required by applicable law or agreed to in writing, software 198 | distributed under the License is distributed on an "AS IS" BASIS, 199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 200 | See the License for the specific language governing permissions and 201 | limitations under the License. 202 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # llm-metadata-extraction 2 | 3 | Experiment on metadata extraction using large language models. 4 | 5 | The experiment itself is in the Jupyter notebook 6 | [ExtractMetadataUsingFinetunedGPT3.ipynb](ExtractMetadataUsingFinetunedGPT3.ipynb) 7 | which you can view directly on GitHub without installing anything. If you want to run it yourself, see the bottom of this page. 8 | 9 | ## Data set 10 | 11 | The aim of the experiment is to extract metadata from recent (2021-2022) doctoral theses obtained from four Finnish universities (Åbo Akademi, University of Turku, University of Vaasa and Lappeenranta-Lahti University of Technology), using only the raw text from the first few pages of the PDF. 12 | 13 | The set of 192 documents (88% English, 7% Finnish, 5% Swedish language) is split into two subsets (train: 149, test: 43). We extract the text from around 5 pages of each document, aiming for 500 to 700 words. 
The corresponding metadata, which has been exported from DSpace repositories of the universities, is represented in a simple textual "key: value" format, which should be easy enough for a language model to handle. The train set is used to create a data set which will then be used to fine-tune a GPT-3 Curie model. Subsequently the model can be used to generate similar metadata for unseen documents from the test set. 14 | 15 | ## Example output 16 | 17 | Here is `diff` output showing the difference between the original (manually created) metadata (-, red) and the output of the model (+, green) with some comments after each example. These are all documents from the test set that the model has never seen before. 18 | 19 | ### University of Vaasa 20 | 21 | ```diff 22 | https://osuva.uwasa.fi/handle/10024/11207 23 | --- 24 | title: How to apply technology, knowledge and operations decision models for strategically sustainable resource allocation? 25 | -title/alternative: Kuinka soveltaa teknologiaan, tietoon ja tuotantoon liittyvän päätöksenteon malleja strategisesti kestävään resurssien allokointiin? 26 | +title/alternative: Kuinka soveltaa teknologiaan, tietoon ja tuotantoon liittyvää päätöksen teon malleja strategisesti kestävään resurssien allokointiin? 27 | contributor/faculty: fi=Tekniikan ja innovaatiojohtamisen yksikkö|en=School of Technology and Innovations| 28 | contributor/author: Tilabi, Sara 29 | contributor/organization: fi=Vaasan yliopisto|en=University of Vaasa| 30 | publisher: Vaasan yliopisto 31 | date/issued: 2020 32 | relation/issn: 2323-9123 33 | relation/issn: 0355-2667 34 | relation/isbn: 978-952-476-915-0 35 | relation/ispartofseries: Acta Wasaensia 36 | relation/numberinseries: 445 37 | --- 38 | ``` 39 | 40 | Comment: The generated metadata is perfect except for a small mistake in the title. 
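The "key: value" format used in these records is simple to produce. Here is a minimal sketch of the serialization; the notebook's actual `metadata_to_text` helper may differ in detail, and since fields such as `contributor/faculty` can repeat, a record is modelled here as a list of pairs rather than a dict:

```python
# Sketch of serializing DSpace metadata into the "key: value" text format.
# Fields like contributor/faculty may occur several times per record, so
# the record is a list of (field, value) pairs, not a plain dict.

def metadata_to_text(metadata):
    """Render one metadata record as newline-separated "key: value" lines."""
    return "\n".join(f"{field}: {value}" for field, value in metadata)

record = [
    ("title", "Responsible business practices in internationalized SMEs"),
    ("contributor/author", "Uzhegova, Maria"),
    ("date/issued", "2021"),
]
print(metadata_to_text(record))
```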
41 | 42 | ```diff 43 | https://osuva.uwasa.fi/handle/10024/10432 44 | --- 45 | title: Persoonallinen ajattelu päättelyssä ja päätöksenteossa 46 | -title/alternative: Personal Thinking in reasoning and decision making 47 | +title/alternative: Personal thinking in judgment and decision-making 48 | contributor/faculty: fi=Tekniikan ja innovaatiojohtamisen yksikkö|en=School of Technology and Innovations| 49 | contributor/author: Alho, Tapio 50 | contributor/organization: fi=Vaasan yliopisto|en=University of Vaasa| 51 | publisher: Vaasan yliopisto 52 | date/issued: 2020 53 | relation/issn: 2323-9123 54 | relation/issn: 0355-2667 55 | relation/isbn: 978-952-476-903-7 56 | relation/ispartofseries: Acta Wasaensia 57 | relation/numberinseries: 440 58 | --- 59 | ``` 60 | 61 | Comment: The language model seems to have translated the Finnish title into English on its own, instead of picking up the English-language title as written by the author in the publication. 62 | 63 | ### University of Turku 64 | 65 | ```diff 66 | https://www.utupub.fi/handle/10024/152860 67 | --- 68 | -title: Essays on income inequality and financial incentives to work 69 | +title: Esa: Income inequality and financial incentives to work 70 | contributor/faculty: fi=Turun kauppakorkeakoulu|en=Turku School of Economics| 71 | contributor/author: Ollonqvist, Joonas 72 | publisher: fi=Turun yliopisto. Turun kauppakorkeakoulu|en=University of Turku, Turku School of Economics| 73 | -date/issued: 2021 74 | +date/issued: 2022 75 | relation/issn: 2343-3167 76 | relation/ispartofseries: Turun yliopiston julkaisuja - Annales Universitatis Turkuensis, Ser E: Oeconomica 77 | relation/numberinseries: 82 78 | --- 79 | ``` 80 | 81 | Comment: The title is a bit off and the model thinks the publication is newer than it really is. Both are clearly indicated in the PDF. 
82 | 83 | ```diff 84 | https://www.utupub.fi/handle/10024/152852 85 | --- 86 | -title: Run-time management of many-core SoCs: A communication-centric approach 87 | +title: Run-time management of many-core systems – A communication-centric approach 88 | contributor/faculty: fi=Teknillinen tiedekunta|en=Faculty of Technology| 89 | contributor/author: Fattah, Mohammad 90 | publisher: fi=Turun yliopisto|en=University of Turku| 91 | date/issued: 2021 92 | relation/issn: 2736-9684 93 | -relation/ispartofseries: Turun yliopiston julkaisua - Annales Universitatis Turkuensis, Ser. F: Technica - Informatica 94 | +relation/ispartofseries: Turun yliopiston julkaisuja - Annales Universitatis Turkuensis, Ser. F: Technica - Informatica 95 | relation/numberinseries: 7 96 | --- 97 | ``` 98 | 99 | Comment: The title is a bit off: the model has turned "SoCs" (meaning systems-on-chip) into "systems" for no good reason. The series title is different too, but in this case the original metadata had a typo and the model got it right! 
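For fine-tuning, each training document's extracted text is paired with its serialized metadata in a JSONL file (`fine-tune.jsonl`). Below is a hypothetical sketch of building one record; the prompt separator and stop constants are assumptions (the notebook defines its own, and its generation code stops at `###`):

```python
# Hypothetical sketch of one fine-tuning record in OpenAI's JSONL format.
# PROMPT_SUFFIX and COMPLETION_STOP are assumed values; the notebook
# defines its own constants for these.
import json

PROMPT_SUFFIX = "\n\n###\n\n"  # marks the end of the document text
COMPLETION_STOP = "###"        # stop sequence used when generating

def make_record(page_text, metadata_text):
    # OpenAI fine-tuning convention: completion starts with a space and
    # ends with a stop sequence the model learns to emit.
    return {
        "prompt": page_text + PROMPT_SUFFIX,
        "completion": " " + metadata_text + COMPLETION_STOP,
    }

line = json.dumps(make_record("First pages of a thesis...",
                              "title: Example thesis\ndate/issued: 2021"))
```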
100 | 101 | ### Åbo Akademi University 102 | 103 | ```diff 104 | https://www.doria.fi/handle/10024/181280 105 | --- 106 | -title: Production and testing of magnesium carbonate hydrates for thermal energy storage (TES) application 107 | +title: Production and Testing of Magnesium Carbonate Hydrates for Thermal Energy Storage (TES) Application 108 | contributor/faculty: Faculty of Science and Engineering 109 | contributor/faculty: Fakulteten för naturvetenskaper och teknik 110 | contributor/faculty: Luonnontieteiden ja tekniikan tiedekunta 111 | contributor/author: Erlund, Rickard 112 | -contributor/opponent: Professor Brian Elmegaard, Technical University of Denmark, Lyngby, Denmark 113 | -contributor/supervisor: Professor Ron Zevenhoven, Åbo Akademi University, Turku 114 | +contributor/opponent: Professor, Technical University of Denmark, Lyngby, Denmark 115 | +contributor/supervisor: Professor, Åbo Akademi University, Turku 116 | +contributor/supervisor: Professor, Åbo Akademi University, Turku 117 | publisher: Åbo Akademi University 118 | date/issued: 2021 119 | --- 120 | ``` 121 | 122 | Comment: The title differs in case only. The model couldn't extract the names of the opponent and supervisor. 123 | 124 | ```diff 125 | https://www.doria.fi/handle/10024/181139 126 | --- 127 | -title: Coulometric Transduction Method for Solid-Contact Ion-Selective Electrodes 128 | +title: Coulometric transduction method for solid-contact ion-selective electrodes 129 | contributor/faculty: Faculty of Science and Engineering 130 | contributor/faculty: Fakulteten för naturvetenskaper och teknik 131 | contributor/faculty: Luonnontieteiden ja tekniikan tiedekunta 132 | contributor/author: Han, Tingting 133 | contributor/opponent: Professor Agata Michalska, University of Warsaw, Warsaw, Poland 134 | contributor/supervisor: Professor Johan Bobacka, Åbo Akademi University, Åbo 135 | contributor/supervisor: Dr. 
Ulriika Mattinen, Åbo Akademi University, Åbo 136 | contributor/supervisor: Docent Zekra Mousavi, Åbo Akademi University, Åbo 137 | publisher: Åbo Akademi University 138 | date/issued: 2021 139 | --- 140 | ``` 141 | 142 | Comment: The title differs in case only. 143 | 144 | ### Lappeenranta-Lahti University of Technology 145 | 146 | ```diff 147 | https://lutpub.lut.fi/handle/10024/163304 148 | --- 149 | title: Responsible business practices in internationalized SMEs 150 | contributor/faculty: fi=School of Business and Management|en=School of Business and Management| 151 | contributor/author: Uzhegova, Maria 152 | contributor/organization: Lappeenrannan-Lahden teknillinen yliopisto LUT 153 | contributor/organization: Lappeenranta-Lahti University of Technology LUT 154 | contributor/opponent: Rialp-Criado, Alex 155 | contributor/reviewer: Rialp-Criado, Alex 156 | contributor/reviewer: Andersson, Svante 157 | +publisher: Lappeenranta-Lahti University of Technology LUT 158 | date/issued: 2021 159 | relation/issn: 1456-4491 160 | relation/ispartofseries: Acta Universitatis Lappeenrantaensis 161 | --- 162 | ``` 163 | 164 | Comment: The model extracted the publisher, which was missing from the original metadata. 
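Comparisons like the `diff` listings in this section can be generated with Python's standard `difflib`. Here is a sketch (not necessarily how these listings were produced):

```python
# Sketch: compare original DSpace metadata with generated metadata and keep
# only the added/removed lines of a unified diff ("-" = original only,
# "+" = generated only).
import difflib

original = """title: Essays on income inequality and financial incentives to work
date/issued: 2021""".splitlines()
generated = """title: Esa: Income inequality and financial incentives to work
date/issued: 2022""".splitlines()

# Drop the file headers and hunk markers that unified_diff emits.
diff = [line for line in difflib.unified_diff(original, generated, lineterm="")
        if not line.startswith(("---", "+++", "@@"))]
print("\n".join(diff))
```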
165 | 166 | ```diff 167 | https://lutpub.lut.fi/handle/10024/163258 168 | --- 169 | -title: Life cycle cost-driven design for additive manufacturing : the frontier to sustainable manufacturing in laser-based powder bed fusion 170 | +title: Life cycle cost-driven design for additive manufacturing: the frontier to sustainable manufacturing in laser-based powder bed fusion 171 | contributor/faculty: fi=School of Energy Systems|en=School of Energy Systems| 172 | contributor/author: Nyamekye, Patricia 173 | contributor/organization: Lappeenrannan-Lahden teknillinen yliopisto LUT 174 | contributor/organization: Lappeenranta-Lahti University of Technology LUT 175 | contributor/opponent: Hryha, Eduard 176 | contributor/reviewer: Hryha, Eduard 177 | contributor/reviewer: Mazzi, Anna 178 | publisher: Lappeenranta-Lahti University of Technology LUT 179 | date/issued: 2021 180 | relation/issn: 1456-4491 181 | relation/ispartofseries: Acta Universitatis Lappeenrantaensis 182 | --- 183 | ``` 184 | 185 | Comment: The only difference is the use of a colon with or without a preceding space in the title. 186 | 187 | 188 | ## Installation 189 | 190 | This has been tested using Python 3.10 on Ubuntu 22.04. Other recent Python 191 | versions (3.8, 3.9) will probably work too. 192 | 193 | Create a virtual environment and install the dependencies listed in 194 | `requirements.txt`: 195 | 196 | python3 -m venv venv 197 | source venv/bin/activate 198 | pip install -r requirements.txt 199 | 200 | ## Running the notebook 201 | 202 | You will need to register an OpenAI account (the same one you can use for 203 | ChatGPT) and generate an API key in the OpenAI user interface. 
Store the key 204 | in an environment variable: 205 | 206 | export OPENAI_API_KEY= 207 | 208 | Then start Jupyter Notebook: 209 | 210 | jupyter notebook 211 | -------------------------------------------------------------------------------- /docs/.gitignore: -------------------------------------------------------------------------------- 1 | *.pdf 2 | -------------------------------------------------------------------------------- /docs/README.md: -------------------------------------------------------------------------------- 1 | This directory is for storing retrieved PDF documents. 2 | -------------------------------------------------------------------------------- /dspace/README.md: -------------------------------------------------------------------------------- 1 | This directory contains metadata records of doctoral theses, extracted from DSpace repositories. 2 | 3 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | jupyter 2 | openai[datalib] 3 | pypdf 4 | requests 5 | lxml 6 | --------------------------------------------------------------------------------