\" in the input field.\n",
274 | "9. The output should be an appropriate response to the instruction and the input. It should ideally not exceed 400 words.\n",
275 | "10. All generated output should use the metric system for measurements and UK names for medications, substances, drugs and everything else.\n",
276 | "\n",
277 | "List of {quantity} tasks (every task has the following fields: Task:, Instruction:, Input:, Output:):\n",
278 | "****************************************************************************************************\n",
279 | "\n"
280 | ]
281 | }
282 | ],
283 | "source": [
284 | "# Print the prompt db\n",
285 | "for prompt in db:\n",
286 | " print('Description: ', prompt['description'])\n",
287 | " print('Hash: ', prompt['hash'])\n",
288 | " print('Parser: ', prompt['parser'])\n",
289 | " print('Text: ', prompt['text'])\n",
290 | " print(\"*\"*100)\n",
291 | " print()"
292 | ]
293 | },
294 | {
295 | "cell_type": "code",
296 | "execution_count": null,
297 | "id": "2d4a2a9b-4de8-488e-9914-8181ec218935",
298 | "metadata": {},
299 | "outputs": [],
300 | "source": []
301 | }
302 | ],
303 | "metadata": {
304 | "kernelspec": {
305 | "display_name": "Python 3 (ipykernel)",
306 | "language": "python",
307 | "name": "python3"
308 | },
309 | "language_info": {
310 | "codemirror_mode": {
311 | "name": "ipython",
312 | "version": 3
313 | },
314 | "file_extension": ".py",
315 | "mimetype": "text/x-python",
316 | "name": "python",
317 | "nbconvert_exporter": "python",
318 | "pygments_lexer": "ipython3",
319 | "version": "3.8.0"
320 | }
321 | },
322 | "nbformat": 4,
323 | "nbformat_minor": 5
324 | }
325 |
--------------------------------------------------------------------------------
/experiments/Supervised Training.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 2,
6 | "id": "6f859a45-5ca7-47cc-8055-dabedd301963",
7 | "metadata": {
8 | "tags": []
9 | },
10 | "outputs": [
11 | {
12 | "name": "stderr",
13 | "output_type": "stream",
14 | "text": [
15 | "2023-05-08 18:13:35.960625: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.\n",
16 | "To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.\n",
17 | "2023-05-08 18:13:37.052768: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT\n"
18 | ]
19 | }
20 | ],
21 | "source": [
22 | "from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments, pipeline\n",
23 | "import pickle\n",
24 | "import pandas as pd\n",
25 | "import datasets\n",
26 | "\n",
27 | "\n",
28 | "from opengpt.config import Config\n",
29 | "from opengpt.model_utils import add_tokens_to_model_and_tokenizer\n",
30 | "from opengpt.dataset_utils import create_labels, pack_examples\n",
31 | "from opengpt.data_collator import DataCollatorWithPadding"
32 | ]
33 | },
34 | {
35 | "cell_type": "code",
36 | "execution_count": 4,
37 | "id": "dd137647-a6d7-49c8-a241-404137ef3b08",
38 | "metadata": {
39 | "tags": []
40 | },
41 | "outputs": [],
42 | "source": [
43 | "config = Config(yaml_path='../configs/example_train_config.yaml')\n",
44 | "model = AutoModelForCausalLM.from_pretrained(config.train.model)\n",
45 | "tokenizer = AutoTokenizer.from_pretrained(config.train.model)\n",
46 | "tokenizer.model_max_length = config.train.max_seq_len"
47 | ]
48 | },
49 | {
50 | "cell_type": "code",
51 | "execution_count": 5,
52 | "id": "5c389455-d401-4857-a61d-0cc1e72d312b",
53 | "metadata": {},
54 | "outputs": [
55 | {
56 | "name": "stderr",
57 | "output_type": "stream",
58 | "text": [
59 | "WARNING:root:Added: 5 tokens to the tokenizer\n"
60 | ]
61 | }
62 | ],
63 | "source": [
64 | "add_tokens_to_model_and_tokenizer(config, tokenizer, model)"
65 | ]
66 | },
67 | {
68 | "cell_type": "markdown",
69 | "id": "bdb0bac3-1504-430f-a28c-0aea4ea28bbc",
70 | "metadata": {},
71 | "source": [
72 | "### Load data"
73 | ]
74 | },
75 | {
76 | "cell_type": "markdown",
77 | "id": "b0f6ef4f-3c62-47f8-a717-9d0a57d086ad",
78 | "metadata": {},
79 | "source": [
80 | "The datasets used for training must contain the special tokens defined in the config. By default, this means the datasets have to be organised as conversations using the `<|user|> <|ai|>` and `<|eos|> <|eod|>` special tokens. An example of a question/answer pair from the NHS-UK dataset:\n",
81 | "\n",
82 | "```\n",
83 | "<|user|> What is high blood pressure? <|eos|> <|ai|> High blood pressure is a condition where the force at which your heart pumps blood around your body is high. It is recorded with 2 numbers, the systolic pressure and the diastolic pressure, both measured in millimetres of mercury (mmHg).\n",
84 | "References:\n",
85 | "- https://www.nhs.uk/conditions/Blood-pressure-(high)/Pages/Introduction.aspx <|eos|> <|eod|>\n",
86 | "```\n",
87 | "\n",
88 | "If the data is not formatted this way, the training cells below will not work."
89 | ]
90 | },
91 | {
92 | "cell_type": "code",
93 | "execution_count": 8,
94 | "id": "6a2589a6-2370-4b23-98fc-91ebcf6b24aa",
95 | "metadata": {
96 | "tags": []
97 | },
98 | "outputs": [
99 | {
100 | "name": "stdout",
101 | "output_type": "stream",
102 | "text": [
103 | "Shuffling dataset\n"
104 | ]
105 | }
106 | ],
107 | "source": [
108 | "train_dataset = datasets.Dataset.from_csv(config.train.datasets)\n",
109 | "if config.train.shuffle_dataset:\n",
110 | " train_dataset = train_dataset.shuffle()\n",
111 | "    print(\"Shuffling dataset\")"
112 | ]
113 | },
114 | {
115 | "cell_type": "markdown",
116 | "id": "11ab2eea-65d7-4b9e-934a-4179608bd6f4",
117 | "metadata": {},
118 | "source": [
119 | "#### Remove all columns that we do not need; filtering of the dataset can be done before removal if needed"
120 | ]
121 | },
122 | {
123 | "cell_type": "code",
124 | "execution_count": 9,
125 | "id": "75b45ea0-c435-4193-aa1b-622a062b4386",
126 | "metadata": {
127 | "tags": []
128 | },
129 | "outputs": [],
130 | "source": [
131 | "# Remove everything but text\n",
132 | "to_remove = list(train_dataset.column_names)\n",
133 | "to_remove.remove('text')\n",
134 | "train_dataset = train_dataset.remove_columns(to_remove)"
135 | ]
136 | },
137 | {
138 | "cell_type": "code",
139 | "execution_count": 11,
140 | "id": "e270be31-1643-42c7-8dc1-a9206b88a243",
141 | "metadata": {
142 | "tags": []
143 | },
144 | "outputs": [
145 | {
146 | "data": {
147 | "application/vnd.jupyter.widget-view+json": {
148 | "model_id": "",
149 | "version_major": 2,
150 | "version_minor": 0
151 | },
152 | "text/plain": [
153 | "Map: 0%| | 0/29660 [00:00, ? examples/s]"
154 | ]
155 | },
156 | "metadata": {},
157 | "output_type": "display_data"
158 | }
159 | ],
160 | "source": [
161 | "# Ignore the max_seq_len warning; it is handled by the packer or the data collator\n",
162 | "train_dataset = train_dataset.map(\n",
163 | " lambda examples: tokenizer(examples['text'], add_special_tokens=False), \n",
164 | " batched=True, \n",
165 | " num_proc=1, \n",
166 | " remove_columns=[\"text\"])\n",
167 | "# Create labels\n",
168 | "train_dataset = train_dataset.map(\n",
169 | " lambda examples: create_labels(examples, config, tokenizer),\n",
170 | " batched=True,\n",
171 | " batch_size=1000,\n",
172 | " num_proc=1,\n",
173 | ")\n",
174 | "# We only do packing for the train set\n",
175 | "train_dataset = train_dataset.map(\n",
176 | " lambda examples: pack_examples(examples, config.train.max_seq_len, packing_type=config.train.packing_type),\n",
177 | " batched=True,\n",
178 | " batch_size=1000,\n",
179 | " num_proc=1,\n",
180 | ")"
181 | ]
182 | },
183 | {
184 | "cell_type": "code",
185 | "execution_count": 13,
186 | "id": "6919836a-eab8-43ff-9b1e-2b1d169fe3bd",
187 | "metadata": {
188 | "tags": []
189 | },
190 | "outputs": [],
191 | "source": [
192 | "training_args = TrainingArguments(**config.train.hf_training_arguments.to_dict())\n",
193 | "dc = DataCollatorWithPadding(tokenizer.pad_token_id, config.train.ignore_index, max_seq_len=config.train.max_seq_len)\n",
194 | "\n",
195 | "trainer = Trainer(\n",
196 | " model=model,\n",
197 | " args=training_args,\n",
198 | " train_dataset=train_dataset,\n",
199 | " eval_dataset=None,\n",
200 | " data_collator=dc,\n",
201 | ")"
202 | ]
203 | },
204 | {
205 | "cell_type": "code",
206 | "execution_count": 16,
207 | "id": "a4bfc4d5-bbff-4b74-b75f-4ca79ce04124",
208 | "metadata": {
209 | "tags": []
210 | },
211 | "outputs": [
212 | {
213 | "data": {
214 | "text/plain": [
215 | "Dataset({\n",
216 | " features: ['input_ids', 'attention_mask', 'labels'],\n",
217 | " num_rows: 8771\n",
218 | "})"
219 | ]
220 | },
221 | "execution_count": 16,
222 | "metadata": {},
223 | "output_type": "execute_result"
224 | }
225 | ],
226 | "source": [
227 | "train_dataset"
228 | ]
229 | },
230 | {
231 | "cell_type": "code",
232 | "execution_count": 17,
233 | "id": "1e96d99b-c915-4ac4-8cbe-fb01081a7a42",
234 | "metadata": {
235 | "tags": []
236 | },
237 | "outputs": [
238 | {
239 | "name": "stderr",
240 | "output_type": "stream",
241 | "text": [
242 | "/data/zeljko/.venv/llama/lib/python3.8/site-packages/transformers/optimization.py:391: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning\n",
243 | " warnings.warn(\n"
244 | ]
245 | },
246 | {
247 | "data": {
248 | "text/html": [
249 | "\n",
250 | "    <div>\n",
251 | "      \n",
252 | "      <progress value='548' max='548' style='width:300px; height:20px; vertical-align: middle;'></progress>\n",
253 | "      [548/548 06:27, Epoch 0/1]\n",
254 | "    </div>\n",
255 | "    <table border=\"1\" class=\"dataframe\">\n",
256 | "  <thead>\n",
257 | " <tr style=\"text-align: left;\">\n",
258 | "      <th>Step</th>\n",
259 | "      <th>Training Loss</th>\n",
260 | "    </tr>\n",
261 | "  </thead>\n",
262 | "  <tbody>\n",
263 | "    <tr>\n",
264 | "      <td>100</td>\n",
265 | "      <td>1.650500</td>\n",
266 | "    </tr>\n",
267 | "    <tr>\n",
268 | "      <td>200</td>\n",
269 | "      <td>1.489700</td>\n",
270 | "    </tr>\n",
271 | "    <tr>\n",
272 | "      <td>300</td>\n",
273 | "      <td>1.445900</td>\n",
274 | "    </tr>\n",
275 | "    <tr>\n",
276 | "      <td>400</td>\n",
277 | "      <td>1.416500</td>\n",
278 | "    </tr>\n",
279 | "    <tr>\n",
280 | "      <td>500</td>\n",
281 | "      <td>1.399800</td>\n",
282 | "    </tr>\n",
283 | "  </tbody>\n",
284 | "</table><p>"
285 | ],
286 | "text/plain": [
287 | "<IPython.core.display.HTML object>"
288 | ]
289 | },
290 | "metadata": {},
291 | "output_type": "display_data"
292 | },
293 | {
294 | "data": {
295 | "text/plain": [
296 | "TrainOutput(global_step=548, training_loss=1.4748950457050853, metrics={'train_runtime': 388.7346, 'train_samples_per_second': 22.563, 'train_steps_per_second': 1.41, 'total_flos': 1931665648896000.0, 'train_loss': 1.4748950457050853, 'epoch': 1.0})"
297 | ]
298 | },
299 | "execution_count": 17,
300 | "metadata": {},
301 | "output_type": "execute_result"
302 | }
303 | ],
304 | "source": [
305 | "trainer.train()"
306 | ]
307 | },
308 | {
309 | "cell_type": "markdown",
310 | "id": "11a88494-afb1-44a5-b607-6382fe5b0c9c",
311 | "metadata": {},
312 | "source": [
313 | "# Test Generation"
314 | ]
315 | },
316 | {
317 | "cell_type": "code",
318 | "execution_count": 18,
319 | "id": "2d37cb47-c029-40f5-bc7d-accefae42f50",
320 | "metadata": {
321 | "tags": []
322 | },
323 | "outputs": [],
324 | "source": [
325 | "gen = pipeline(model=model, tokenizer=tokenizer, task='text-generation', device=model.device)"
326 | ]
327 | },
328 | {
329 | "cell_type": "code",
330 | "execution_count": 19,
331 | "id": "86ce8f6e-6037-4c2b-8aac-5ad7a129834a",
332 | "metadata": {
333 | "tags": []
334 | },
335 | "outputs": [],
336 | "source": [
337 | "t = \"<|user|> What is diabetes? <|eos|> <|ai|>\" # The special-token format is required because the model was trained on it"
338 | ]
339 | },
340 | {
341 | "cell_type": "code",
342 | "execution_count": 39,
343 | "id": "daab006f-eda3-43db-b864-a1c1de52d4d3",
344 | "metadata": {
345 | "tags": []
346 | },
347 | "outputs": [
348 | {
349 | "name": "stderr",
350 | "output_type": "stream",
351 | "text": [
352 | "Setting `pad_token_id` to `eos_token_id`:50267 for open-end generation.\n"
353 | ]
354 | },
355 | {
356 | "name": "stdout",
357 | "output_type": "stream",
358 | "text": [
359 | "<|user|> What is diabetes? <|eos|> <|ai|> Diabetes is a condition in which the body's insulin levels are too low, which can lead to high blood sugar levels.\n",
360 | "References:\n",
361 | "- https://www.nhs.uk/conditions/diabetes/ \n"
362 | ]
363 | }
364 | ],
365 | "source": [
366 | "# Temperature matters; good values depend on the model (0.2 works well for this GPT-2 model)\n",
367 | "print(gen(t, do_sample=True, max_length=128, temperature=0.2)[0]['generated_text'])"
368 | ]
369 | }
370 | ],
371 | "metadata": {
372 | "kernelspec": {
373 | "display_name": "Python 3 (ipykernel)",
374 | "language": "python",
375 | "name": "python3"
376 | },
377 | "language_info": {
378 | "codemirror_mode": {
379 | "name": "ipython",
380 | "version": 3
381 | },
382 | "file_extension": ".py",
383 | "mimetype": "text/x-python",
384 | "name": "python",
385 | "nbconvert_exporter": "python",
386 | "pygments_lexer": "ipython3",
387 | "version": "3.8.0"
388 | }
389 | },
390 | "nbformat": 4,
391 | "nbformat_minor": 5
392 | }
393 |
--------------------------------------------------------------------------------
/llama_train_requirements.txt:
--------------------------------------------------------------------------------
1 | .
2 | protobuf==3.20.3
3 | accelerate
4 | git+https://github.com/huggingface/transformers
5 | sentencepiece
6 |
--------------------------------------------------------------------------------
/opengpt/config.py:
--------------------------------------------------------------------------------
1 | from box import Box
2 | import jsonpickle
3 | import os
4 | import yaml
5 |
6 | class BaseConfig(object):
7 | def __init__(self, to_box=False):
8 | pass
9 |
10 | def _to_box(self):
11 | # Convert all dicts to boxes
12 | for key, val in self.__dict__.items():
13 | if isinstance(val, dict):
14 | self.__setattr__(key, Box(val))
15 |
16 | def _from_box(self):
17 |         # Convert all boxes back to dicts
18 | for key, val in self.__dict__.items():
19 | if isinstance(val, Box):
20 | self.__setattr__(key, val.to_dict())
21 |
22 | def save(self, save_path=None):
23 | r''' Save the config into a .json file
24 | Args:
25 | save_path (`str`):
26 |                 Where to save the created JSON file; if None, the default from `self.path` is used.
27 | '''
28 | if save_path is None:
29 | save_path = self.path.self
30 |
31 | # We want to save the dict here, not the whole class
32 | self._from_box()
33 | json_string = jsonpickle.encode({k:v for k,v in self.__dict__.items() if k != 'path'})
34 |
35 | with open(save_path, 'w') as f:
36 | f.write(json_string)
37 | self._to_box()
38 |
39 | @classmethod
40 | def load(cls, save_path):
41 | config = cls(to_box=False)
42 | # Read the jsonpickle string
43 | with open(save_path) as f:
44 | config_dict = jsonpickle.decode(f.read())
45 | config.merge_config(config_dict)
46 | config._to_box()
47 | return config
48 |
49 | def merge_config(self, config_dict):
50 | r''' Merge a config_dict with the existing config object.
51 | Args:
52 | config_dict (`dict`):
53 | A dictionary which key/values should be added to this class.
54 | '''
55 | for key in config_dict.keys():
56 | if key in self.__dict__ and isinstance(self.__dict__[key], dict):
57 | self.__dict__[key].update(config_dict[key])
58 | else:
59 | self.__dict__[key] = config_dict[key]
60 |
61 |
62 | class Config(BaseConfig):
63 | r''' There are probably nicer ways to do this, but I like this one.
64 | '''
65 | def __init__(self, yaml_path):
66 | self.yaml_path = yaml_path
67 | self.load_yaml(yaml_path)
68 |
69 | def reload_yaml(self):
70 | self.load_yaml(self.yaml_path)
71 |
72 | def load_yaml(self, yaml_path):
73 | _config = yaml.safe_load(open(yaml_path, 'r'))
74 | self.to_box = True
75 | self.base_path = './'
76 | self.datasets = {}
77 | self.name = 'opengpt'
78 |
79 | for k,v in _config.items():
80 | self.__setattr__(k, v)
81 |         # For fun, we will also keep the _config
82 | self._config = _config
83 |
84 | self.path = {'self': os.path.join(self.base_path, f'config_for_{self.name}.json')}
85 | if _config.get('static_paths', None):
86 | self.path.update(_config['static_paths'])
87 |
88 | if self.to_box:
89 | self._to_box()
90 |
91 | def create_dirs(paths):
92 | for path in paths:
93 | if isinstance(path, str):
94 | os.makedirs(os.path.dirname(path), exist_ok=True)
95 | elif isinstance(path, dict):
96 | create_dirs(path.values())
97 | create_dirs(self.path.values())
98 |
99 | # Create dirs for datasets, this is where all the data from one dataset will go
100 | for ds in self.datasets:
101 | os.makedirs(os.path.join(self.base_path, ds['name']), exist_ok=True)
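102 | 
103 | 
104 | # A minimal usage sketch (not part of the library). The YAML below is a made-up
105 | # example; the real training configs live in /configs and define many more keys.
106 | if __name__ == '__main__':
107 |     import tempfile
108 | 
109 |     example_yaml = (
110 |         "name: demo\n"
111 |         "special_tokens:\n"
112 |         "  user: '<|user|>'\n"
113 |         "  ai: '<|ai|>'\n"
114 |     )
115 |     with tempfile.NamedTemporaryFile('w', suffix='.yaml', delete=False) as f:
116 |         f.write(example_yaml)
117 |     config = Config(yaml_path=f.name)
118 |     # Dicts from the YAML are wrapped in python-box Boxes, so attribute access works
119 |     print(config.special_tokens.user)  # -> <|user|>
120 |     print(config.path.self)  # default location for the saved config json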
--------------------------------------------------------------------------------
/opengpt/data_collator.py:
--------------------------------------------------------------------------------
1 | import torch
2 |
3 | class DataCollatorWithPadding(object):
4 | r''' Will pad or trim examples to the appropriate length.
5 | '''
6 | def __init__(self, pad_token_id, ignore_index, max_seq_len):
7 | self.pad_token_id = pad_token_id
8 | self.ignore_index = ignore_index
9 | self.max_seq_len = max_seq_len
10 |
11 | def __call__(self, instances):
12 | input_ids, labels = tuple([torch.tensor(instance[key][0:self.max_seq_len]) for instance in instances] for key in ("input_ids", "labels"))
13 | batch = {}
14 |
15 | batch['input_ids'] = torch.nn.utils.rnn.pad_sequence(input_ids, batch_first=True, padding_value=self.pad_token_id)
16 | batch['labels'] = torch.nn.utils.rnn.pad_sequence(labels, batch_first=True, padding_value=self.ignore_index)
17 | batch['attention_mask'] = batch['input_ids'].ne(self.pad_token_id)
18 |
19 | return batch
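20 | 
21 | 
22 | # A small, self-contained sketch (not part of the library) of how the collator
23 | # pads a batch; pad_token_id=0 and ignore_index=-100 are assumed example values.
24 | if __name__ == '__main__':
25 |     collator = DataCollatorWithPadding(pad_token_id=0, ignore_index=-100, max_seq_len=8)
26 |     instances = [
27 |         {'input_ids': [5, 6, 7], 'labels': [-100, 6, 7]},
28 |         {'input_ids': [5, 6, 7, 8, 9], 'labels': [-100, -100, 7, 8, 9]},
29 |     ]
30 |     batch = collator(instances)
31 |     # input_ids are right-padded with 0, labels with -100, and attention_mask
32 |     # is False exactly where padding was added
33 |     print(batch['input_ids'])
34 |     print(batch['labels'])
35 |     print(batch['attention_mask'])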
--------------------------------------------------------------------------------
/opengpt/dataset_utils.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import math
3 | import os
4 | import json
5 | import hashlib
6 | from tqdm.auto import tqdm
7 | from opengpt import parsers, teachers
8 | import logging
9 | import random
10 |
11 |
12 | def split_csv_by_max_len(datasets, max_len, tokenizer, base_path):
13 |     r''' Given a tokenizer, split each dataset (based on the `text` column) into sequences of at most max_len tokens
14 | '''
15 | for dataset in tqdm(datasets, desc='Datasets', total=len(datasets)):
16 | csv_path = dataset['path']
17 | name = dataset['name']
18 |
19 | nrows = None
20 | if dataset.get('nrows', -1) > 0:
21 | nrows = dataset['nrows']
22 |
23 | df = pd.read_csv(csv_path, nrows=nrows)
24 | cols = df.columns
25 | assert 'text' in cols, f'The CSV for dataset {name} has no "text" column.'
26 |
27 | new_data = [list(cols) + ['len', 'part']]
28 | for _, row in tqdm(df.iterrows(), desc=dataset['name'], total=len(df)):
29 | text = row['text']
30 | tokens = tokenizer.encode(text)
31 |
32 | for i in range(math.ceil(len(tokens) / max_len)):
33 | new_text = tokenizer.decode(tokens[i*max_len:(i+1)*max_len])
34 | new_data_row = [row[c] if c != 'text' else new_text for c in cols]
35 | new_data_row.append(len(tokens[i*max_len:(i+1)*max_len]))
36 | new_data_row.append(f'part_{i}')
37 | new_data.append(new_data_row)
38 |
39 | # Save
40 | new_df = pd.DataFrame(new_data[1:], columns=new_data[0])
41 | new_df.to_csv(os.path.join(base_path, name, 'data_split_by_length.csv'), index=False)
42 | logging.warning(f'{dataset["name"]}: length before vs after: {len(df)} vs {len(new_df)}\n')
43 |
44 |
45 | def create_dataset_no_input(config):
46 |     r''' Generates a new dataset from a prompt alone; no input dataset is required
47 | '''
48 | prompt_db = json.load(open(config.path.prompt_db, 'rb'))
49 | raw_data_columns = ['id', 'raw_output', 'prompt_hash']
50 | raw_data = pd.DataFrame(None, columns=raw_data_columns)
51 | raw_data_path = os.path.join(config.base_path, config.name, f"raw_generated_data_for_{config.name}.csv")
52 | if os.path.exists(raw_data_path):
53 | raw_data = pd.read_csv(raw_data_path)
54 |         logging.warning(f"Loading an existing openai generated dataset found at: {raw_data_path}. " +
55 |                         f"There are already {len(raw_data)} rows in that dataset; the generation will continue from where it last left off. " +
56 |                         f"The script will also do all examples that were not done in the previous run.")
57 |
58 |
59 | teacher = getattr(teachers, f'ask_{config.teacher.name}')
60 | for prompt_config in config.prompts:
61 | prompts = [prompt for prompt in prompt_db if prompt['hash'] in prompt_config['hashes']] # There must be one
62 |
63 | parameters = prompt_config.get('extra_parameters', {})
64 |
65 | for language in prompt_config.get('languages', ['English']):
66 | parameters['language'] = language
67 | logging.warning(f"\nStarting prompts: {prompt_config['hashes']}\n #Runs: {prompt_config['runs']}\nLanguage: {language}")
68 | for prompt in prompts:
69 |                 # If some examples exist already, continue the generation
70 |                 # from where the previous run stopped
71 | 
72 |                 start = len(raw_data[raw_data.prompt_hash == prompt['hash']])
73 | for _ in tqdm(range(start, prompt_config['runs']), total=(prompt_config['runs'] - start)):
74 | prompt_text_template = prompt['text']
75 | prompt_text = prompt_text_template.format(**parameters)
76 | try:
77 | out = teacher(prompt_text, config)
78 | new_data = pd.DataFrame([[len(raw_data), out, prompt['hash']]], columns=raw_data_columns)
79 | raw_data = pd.concat([raw_data, new_data], ignore_index=True)
80 |
81 | if len(raw_data) % config.data_generation_checkpoint_every == 0:
82 | logging.warning("Checkpointing the generated dataset.")
83 | raw_data.to_csv(raw_data_path, index=False)
84 |
85 | except Exception as e:
86 | logging.exception(e)
87 | logging.warning(f"Skipping example for prompt: {prompt['hash']}\n")
88 |
89 | if raw_data is not None and len(raw_data) > 0:
90 | raw_data.to_csv(raw_data_path, index=False)
91 |
92 | return raw_data
93 |
94 |
95 | def create_dataset(config):
96 | prompt_db = json.load(open(config.path.prompt_db, 'rb'))
97 | raw_data_columns = ['id', 'raw_output', 'dataset', 'language', 'run', 'prompt_hash', 'prompt_text_hash', 'context']
98 | raw_data = pd.DataFrame(None, columns=raw_data_columns)
99 | prepared_data = None
100 | raw_data_path = os.path.join(config.base_path, config.name, f"raw_generated_data_for_{config.name}.csv")
101 | prepared_data_path = os.path.join(config.base_path, config.name, f"prepared_generated_data_for_{config.name}.csv")
102 | if os.path.exists(raw_data_path) and os.path.exists(prepared_data_path):
103 | raw_data = pd.read_csv(raw_data_path)
104 | prepared_data = pd.read_csv(prepared_data_path)
105 |         logging.warning(f"Loading an existing openai generated dataset found at: \n{raw_data_path}\n and\n{prepared_data_path}\n" +
106 |                         f"There are already {len(raw_data)} rows in that dataset; the generation will continue from where it last left off. " +
107 |                         f"The script will also do all examples that were not done in the previous run.\n" +
108 |                         "***Take care: if prompt_config['random_prompt'] is set to true, it can produce unwanted results.\n\n")
109 |
110 | cnt = 0
111 | for prompt_config in config.prompts:
112 | prompts = [prompt for prompt in prompt_db if prompt['hash'] in prompt_config['hashes']] # There must be one
113 | teacher = getattr(teachers, f'ask_{config.teacher.name}')
114 |
115 | for run in range(prompt_config.get('runs', 1)):
116 | parameters = prompt_config.get('extra_parameters', {})
117 | extra_data_columns = prompt_config.get('extra_data_columns', [])
118 |
119 | for language in prompt_config.get('languages', ['English']):
120 | parameters['language'] = language
121 | logging.warning(f"\nStarting prompts: {prompt_config['hashes']}\nRun: {run}\nLanguage: {language}")
122 | for dataset_name in prompt_config['datasets']:
123 | df = pd.read_csv(os.path.join(config.base_path, dataset_name, 'data_split_by_length.csv'))
124 | for row_ind, row in tqdm(df.iterrows(), desc=dataset_name, total=len(df)):
125 | # Set the context from the current row
126 | parameters['context'] = row['text']
127 | for col in extra_data_columns:
128 | parameters[col] = row[col]
129 | if prompt_config.get('random_prompt', False):
130 |                             # This means for each example in the dataset we randomly select a prompt to be used;
131 |                             # if False, every example will run through every prompt
132 | selected_prompts = [random.choice(prompts)]
133 | else:
134 | selected_prompts = prompts # Use all prompts sequentially
135 | for prompt in selected_prompts:
136 | prompt_text_template = prompt['text']
137 | # Every prompt has its own parser
138 | parser = getattr(parsers, prompt['parser'])
139 | if len(str(row['text']).split(" ")) > config.teacher.min_len:
140 | prompt_text = prompt_text_template.format(**parameters)
141 | # The hash is of everything that is used to generate the output
142 | h = hashlib.sha256(prompt_text.encode("utf-8"))
143 | h.update(str(run).encode("utf-8"))
144 | h = h.hexdigest()
145 |
146 | # Only get the output if this was not done already
147 | if h not in raw_data.prompt_text_hash.values:
148 |                                 # Get the output from OpenAI and parse it; the parser appends the parsed examples to prepared_data.
149 | try:
150 | openai_output = teacher(prompt_text, config)
151 | prepared_data = parser(data=openai_output, prepared_data=prepared_data, prompt_config=prompt_config, config=config, row=row,
152 | raw_data_id=len(raw_data), prompt_text=prompt_text) # ID is length of raw_data
153 |
154 | # Concat the current output to the data dataframe, only if not None
155 | if prepared_data is not None and len(prepared_data) > 0:
156 | new_data = pd.DataFrame([[len(raw_data), openai_output, dataset_name, language, run, prompt['hash'], h, parameters['context']]],
157 | columns=raw_data_columns)
158 | raw_data = pd.concat([raw_data, new_data], ignore_index=True)
159 | if len(raw_data) % config.data_generation_checkpoint_every == 0:
160 | logging.warning("Checkpointing the generated dataset.")
161 | raw_data.to_csv(raw_data_path, index=False)
162 | prepared_data.to_csv(prepared_data_path, index=False)
163 | except Exception as e:
164 | logging.exception(e)
165 | logging.warning(f"Skipping example at position: {row_ind} for dataset: {dataset_name}\n")
166 | # Final save
167 | if raw_data is not None and prepared_data is not None and len(raw_data) > 0 and len(prepared_data) > 0:
168 | raw_data.to_csv(raw_data_path, index=False)
169 | prepared_data.to_csv(prepared_data_path, index=False)
170 | return raw_data, prepared_data
171 |
172 |
173 | def create_labels(examples, config, tokenizer):
174 | r''' This is used with a prepared HF dataset that is already tokenized. It will add labels
175 | so that only the AI generated parts (answers) will be trained on.
176 | '''
177 |
178 | user_token_id = tokenizer.vocab[config.special_tokens.user]
179 | ai_token_id = tokenizer.vocab[config.special_tokens.ai]
180 | # Everything written by an AI will be used for training, and everything by a user will be ignored
181 |
182 | examples['labels'] = []
183 | for i in range(len(examples['input_ids'])):
184 | labels = []
185 | ignore = True
186 | for tkn_id in examples['input_ids'][i]:
187 | if tkn_id == user_token_id:
188 | ignore = True
189 | elif tkn_id == ai_token_id:
190 | ignore = False
191 |
192 | if ignore:
193 | labels.append(config.train.ignore_index)
194 | else:
195 | labels.append(tkn_id)
196 | examples['labels'].append(labels)
197 | return examples
198 |
199 |
200 | def pack_examples(examples, block_size, packing_type='partial'):
201 |     r''' Used with a prepared HF dataset; will pack/group examples. Use with care, it can mess up many things
202 |     if the input is not formatted properly (requires the <|eod|> token). See the toy example at the end of this module.
203 |
204 | packing_type: partial/full/no
205 | '''
206 | # Concatenate all texts.
207 | if packing_type == 'partial':
208 | result = {k:[] for k in examples.keys()}
209 |         _key = list(examples.keys())[0] # Take any key; all columns have the same length
210 | new_example = {k:[] for k in examples.keys()}
211 |
212 | for ind in range(len(examples[_key])):
213 | # Trim long sequences to block_size, this is required for partial packing
214 | example = {k:v[ind][0:block_size] for k,v in examples.items()}
215 | if len(new_example[_key]) + len(example[_key]) > block_size:
216 | result = {k:result[k] + [v] for k,v in new_example.items()}
217 | new_example = example
218 | else:
219 | new_example = {k:new_example[k] + v for k,v in example.items()}
220 | # Add the last example if there is something to add
221 | if len(new_example[_key]) > 0:
222 | result = {k:result[k] + [v] for k,v in new_example.items()}
223 | elif packing_type == 'full':
224 | # Full packing
225 | concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
226 | total_length = len(concatenated_examples[list(examples.keys())[0]])
227 | total_length = (total_length // block_size) * block_size
228 | # Split by chunks of max_len.
229 | result = {
230 | k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
231 | for k, t in concatenated_examples.items()
232 | }
233 | else:
234 | # Do nothing
235 | result = examples
236 | return result
237 |
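238 | 
239 | # Toy illustrations (not part of the library) of two helpers above; the token
240 | # ids and special-token vocab are made up for the demo.
241 | if __name__ == '__main__':
242 |     from box import Box
243 | 
244 |     # Partial packing: short examples are merged until block_size would be exceeded
245 |     examples = {'input_ids': [[1, 2, 3], [4, 5], [6, 7, 8, 9]]}
246 |     packed = pack_examples(examples, block_size=5, packing_type='partial')
247 |     print(packed['input_ids'])  # -> [[1, 2, 3, 4, 5], [6, 7, 8, 9]]
248 | 
249 |     # create_labels: only tokens from <|ai|> onwards are kept; the rest become ignore_index
250 |     config = Box({'special_tokens': {'user': '<|user|>', 'ai': '<|ai|>'},
251 |                   'train': {'ignore_index': -100}})
252 |     tokenizer = Box({'vocab': {'<|user|>': 10, '<|ai|>': 11}})
253 |     labelled = create_labels({'input_ids': [[10, 1, 2, 11, 3, 4]]}, config, tokenizer)
254 |     print(labelled['labels'])  # -> [[-100, -100, -100, 11, 3, 4]]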
--------------------------------------------------------------------------------
/opengpt/model_utils.py:
--------------------------------------------------------------------------------
1 | import logging
2 |
3 | def add_tokens_to_model_and_tokenizer(config, tokenizer, model):
4 | ntkns = tokenizer.add_tokens(list(config.special_tokens.values()))
5 | logging.warning(f"Added: {ntkns} tokens to the tokenizer")
6 | if ntkns > 0:
7 |         # Resize first: resize_token_embeddings creates new embedding tensors, so the
8 |         # weights must be fetched after it runs; the new tokens occupy the last ntkns rows
9 |         model.resize_token_embeddings(len(tokenizer))
10 |         input_embeddings = model.get_input_embeddings().weight.data
11 |         output_embeddings = model.get_output_embeddings().weight.data
12 |         input_embeddings[-ntkns:] = input_embeddings[:-ntkns].mean(dim=0, keepdim=True)
13 |         output_embeddings[-ntkns:] = output_embeddings[:-ntkns].mean(dim=0, keepdim=True)
14 |
15 | # Set the eos and pad tokens properly
16 | tokenizer.add_special_tokens({"eos_token": config.special_tokens.eos, "pad_token": config.special_tokens.pad})
17 | model.config.eos_token_id = tokenizer.eos_token_id
18 |
19 | assert model.get_input_embeddings().num_embeddings == len(tokenizer)
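20 | 
21 | 
22 | # A hedged usage sketch (not part of the library): add the special tokens from a
23 | # config to a small model. 'gpt2' is an arbitrary example model, and the config
24 | # path assumes the script is run from the repository root.
25 | if __name__ == '__main__':
26 |     from transformers import AutoModelForCausalLM, AutoTokenizer
27 |     from opengpt.config import Config
28 | 
29 |     config = Config(yaml_path='configs/example_train_config.yaml')
30 |     tokenizer = AutoTokenizer.from_pretrained('gpt2')
31 |     model = AutoModelForCausalLM.from_pretrained('gpt2')
32 |     add_tokens_to_model_and_tokenizer(config, tokenizer, model)
33 |     # Each special token now maps to a single id at the end of the vocabulary
34 |     print(tokenizer.convert_tokens_to_ids(config.special_tokens.ai))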
--------------------------------------------------------------------------------
/opengpt/parsers.py:
--------------------------------------------------------------------------------
1 | r'''
2 | Parsers are used to parse the output from a Teacher (OpenAI, Google, ...) into the right format. The purpose of the parser is to
3 | parse the new output and append it to the prepared_data. Every parser will receive:
4 | - data: the new data output from a Teacher model
5 | - prepared_data: the dataset we are creating, in other words old data that was output by a parser
6 | - prompt_config: the prompt_config for the current prompt as a dictionary (taken from the .yaml file)
7 |  - config: general config, i.e. the whole .yaml file as a python-box (can be used as a dictionary)
8 |  - row: the row from the original CSV that was used as context to generate the `data`; can be empty depending on the use-case
9 | - raw_data_id: the ID of the `data` in the raw_data CSV (used to store the raw output from OpenAI)
10 | - prompt_text: the prepared prompt that was used to generate `data`
11 |
12 | If we are running the parser for the first time, the `prepared_data` will be empty (None) and it is up to us to define how that prepared_data (e.g. CSV) should look. Every parser can have different columns depending on the use-case.
13 |
14 | If the parser will output the final prepared data that will be used for model training, it should append the special tokens: config.special_tokens.[user, ai, eos, eod];
15 | have a look at the functions below (e.g. csv_qa_parser) and the toy run at the end of this module.
16 | '''
17 |
18 | import pandas as pd
19 | from io import StringIO
20 | import re
21 | import logging
22 |
23 | def csv_qa_parser(data, prepared_data, prompt_config, config, row, raw_data_id, prompt_text):
24 | r''' Expects data in the CSV format, with the separator `;`, the dataframe has to have two columns: `Question`, `Answer`
25 | '''
26 | qa_pairs = None
27 | df = pd.read_csv(StringIO(data), sep=';')
28 |
29 | # Strip everything
30 | df = df.applymap(lambda x: x.strip() if isinstance(x, str) else x)
31 |
32 | ref_col = prompt_config.get('reference_column_to_append', None)
33 | if ref_col and row is not None and ref_col in row and row[ref_col]:
34 | # Means we want to append a reference at the end of each Answer
35 | to_append = f"\nReferences:\n- {row[ref_col]}"
36 | df['Answer'] = df['Answer'] + to_append
37 | df['Question'] += f' {config.special_tokens.eos}' # Every Q/A pair is independent
38 | df['Answer'] += f' {config.special_tokens.eos} {config.special_tokens.eod}'
39 | qa_pairs = [f'{config.special_tokens.user} {q.strip()} {config.special_tokens.ai} {a.strip()}' for q,a in df[['Question', 'Answer']].values]
40 |
41 | new_data = pd.DataFrame([[text, raw_data_id] for text in qa_pairs], columns=['text', 'raw_data_id'])
42 | if prepared_data is None:
43 | prepared_data = new_data
44 | else:
45 | prepared_data = pd.concat([prepared_data, new_data], ignore_index=True)
46 |
47 | return prepared_data
48 |
49 |
50 | instruction_text = re.compile(r'Instruction:?(.*?)Input:', re.DOTALL)
51 | input_text = re.compile(r'Input:?(.*?)Output:?', re.DOTALL)
52 | output_text = re.compile(r'Output:?(.*?)$', re.DOTALL)
53 | def task_parser(data, prepared_data, prompt_config, row, config, raw_data_id, prompt_text):
54 | r''' This parser can be used with prompts similar to Alpaca, it expects `data` in the following format:
55 | Task:
56 | Instruction:
57 | Input:
58 | Output:
59 |
60 | Task:
61 | Instruction:
62 | Input:
63 | Output:
64 | .
65 | .
66 | .
67 | '''
68 | tasks = re.split(r'[1-9 \.]*Task[:\s]*', str(data))
69 | st = config.special_tokens
70 | new_data = []
71 |     for task in tasks:
72 |         matches = [re.search(p, task) for p in (instruction_text, input_text, output_text)]
73 |         if not all(matches):
74 |             continue  # Skip empty splits and any preamble that does not contain a full task
75 |         ins, inp, out = (m.group(1).strip() for m in matches)
76 |
77 | if inp:
78 | if inp.startswith('"'):
79 | inp = inp[1:]
80 | if inp.endswith('"'):
81 | inp = inp[:-1]
82 |         if inp:
83 |             inp = '\n' + inp  # Prefix with a newline so the input sits under the instruction
84 |         # NOTE: when inp is empty, `inp in ins` below is always True,
85 |         # so an empty input is simply never appended to the instruction
86 |
87 | if ins and out:
88 | if inp in ins:
89 | new_data.append((len(prepared_data), f'{st.user} {ins} {st.eos} {st.ai} {out} {st.eos} {st.eod}', raw_data_id))
90 | else:
91 | new_data.append((len(prepared_data), f'{st.user} {ins}{inp} {st.eos} {st.ai} {out} {st.eos} {st.eod}', raw_data_id))
92 |
93 | new_data = pd.DataFrame(new_data, columns=['text', 'raw_data_id'])
94 | if prepared_data is None:
95 | prepared_data = new_data
96 | else:
97 | prepared_data = pd.concat([prepared_data, new_data], ignore_index=True)
98 |
99 | return prepared_data
100 |
101 |
102 | def simple_task_parser(data, prepared_data, prompt_config, row, config, raw_data_id, prompt_text):
103 |     r''' This parser can be used with prompts similar to Alpaca, but that only have Instructions; it expects data as:
104 |     Task Number:
105 |     Instruction:
106 | 
107 |     Task Number:
108 |     Instruction:
109 |     .
110 |     .
111 |     .
112 | 
113 |     This parser is used as an intermediate step, so the output is a CSV with columns `text`, `instruction`, `raw_data_id`
114 |     '''
115 | tasks = [x.replace("Instruction:", "").strip() for x in re.split(r'[1-9 \.]*Task Number[:\s]*[\d\n]*', str(data)) if x.strip()]
116 |     # Build one row per task; tasks were already stripped when they were split out above
117 |     new_data = pd.DataFrame(
118 |         [[row['text'], task, raw_data_id] for task in tasks],
119 |         columns=['text', 'instruction', 'raw_data_id'],
120 |     )
121 | if prepared_data is None:
122 | prepared_data = new_data
123 | else:
124 | prepared_data = pd.concat([prepared_data, new_data], ignore_index=True)
125 |
126 | return prepared_data
127 |
128 |
129 | def medical_conversation_parser(data, prepared_data, prompt_config, config, row, raw_data_id, prompt_text):
130 | r''' It expects data to be in form of a conversation, like:
131 | Patient:
132 | AI-Assistant:
133 | Patient:
134 | .
135 | .
136 | .
137 |     The actor names 'Patient' and 'AI-Assistant' have to match exactly
138 | '''
139 | conversation = None
140 |
141 | # Merge the extractions into one conversation
142 | data = re.split(r'\s*(Patient\s*:|AI-Assistant\s*:)\s*', data)[1:]
143 | if len(data) > 0:
144 | conversation = ""
145 | to_append = None
146 |
147 | ref_col = prompt_config.get('reference_column_to_append', None)
148 | if ref_col and ref_col in row and row[ref_col]:
149 | # Means we want to append a reference at the end of each Answer
150 | to_append = f"\nReferences:\n- {row[ref_col]}"
151 |
152 | actor = None
153 | for message in data:
154 | message = message.strip()
155 | if message in ['Patient:', 'AI-Assistant:', 'Patient', 'AI-Assistant', 'Patient :', 'AI-Assistant :']:
156 | actor = message
157 | elif actor is not None: #TODO: Make this nicer
158 | if actor in ['Patient:', 'Patient :', 'Patient']:
159 | conversation += f'{config.special_tokens.user} {message} {config.special_tokens.eos} '
160 | elif actor in ['AI-Assistant:', 'AI-Assistant :', 'AI-Assistant']:
161 | conversation += f'{config.special_tokens.ai} {message}'
162 | if to_append is not None and to_append:
163 | conversation += to_append
164 | conversation += f" {config.special_tokens.eos} "
165 | if conversation:
166 | conversation = conversation.strip() + f" {config.special_tokens.eod}"
167 |
168 | new_data = pd.DataFrame([[conversation, raw_data_id]], columns=['text', 'raw_data_id'])
169 | if prepared_data is None:
170 | prepared_data = new_data
171 | else:
172 | prepared_data = pd.concat([prepared_data, new_data], ignore_index=True)
173 |
174 | return prepared_data
175 |
176 |
177 | def csv_ner_parser(data, prepared_data, prompt_config, config, row, raw_data_id, prompt_text):
178 | r''' Expects data in CSV format, using the `;` separator
179 | '''
180 | df = pd.read_csv(StringIO(data), sep=';', engine='python')
181 | df['raw_data_id'] = raw_data_id
182 |
183 | if prepared_data is None:
184 | prepared_data = df
185 | else:
186 | prepared_data = pd.concat([prepared_data, df], ignore_index=True)
187 |
188 | return prepared_data
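189 | 
190 | 
191 | # A toy run of `csv_qa_parser` (not part of the library). The CSV string below
192 | # mimics a teacher response, and the config stub carries only the special tokens
193 | # the parser needs; everything here is illustrative.
194 | if __name__ == '__main__':
195 |     from box import Box
196 | 
197 |     config = Box({'special_tokens': {'user': '<|user|>', 'ai': '<|ai|>',
198 |                                      'eos': '<|eos|>', 'eod': '<|eod|>'}})
199 |     data = 'Question;Answer\nWhat is hypertension?;High blood pressure.'
200 |     out = csv_qa_parser(data=data, prepared_data=None, prompt_config={}, config=config,
201 |                         row=None, raw_data_id=0, prompt_text='')
202 |     # -> <|user|> What is hypertension? <|eos|> <|ai|> High blood pressure. <|eos|> <|eod|>
203 |     print(out['text'][0])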
--------------------------------------------------------------------------------
/opengpt/prompt_utils.py:
--------------------------------------------------------------------------------
1 | import json
2 | import hashlib
3 | import os
4 | import logging
5 |
6 | def add_to_prompt_database(text, description, parser, database_path, force_replace=False):
7 | r''' The database is a simple json file where all the prompts are saved.
8 | '''
9 | if os.path.exists(database_path):
10 | logging.info(f"Loading db from: {database_path}")
11 | db = json.load(open(database_path, 'r'))
12 | hashes = set([prompt['hash'] for prompt in db])
13 | else:
14 | db = []
15 | hashes = set()
16 |
17 | # Good enough for what we need
18 | h = hashlib.sha256(text.encode("utf-8")).hexdigest()[:10]
19 | if force_replace and h in hashes:
20 | logging.warning("Found an existing prompt with the same hash, it will be replaced with the new one.")
21 |         # Remove the existing prompt with the same hash as the current one
22 | db = [prompt for prompt in db if prompt['hash'] != h]
23 | hashes = set([prompt['hash'] for prompt in db])
24 | if h not in hashes:
25 | db.append({
26 | 'hash': h,
27 | 'text': text,
28 | 'description': description,
29 | 'parser': parser
30 | })
31 |
32 | json.dump(db, open(database_path, 'w'), indent=2)
33 | logging.warning(f"Added prompt: {h}")
34 | else:
35 |         logging.warning("The prompt is already in the database. It will not be added; use force_replace if you want to replace it.")
36 |
37 | return db
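38 | 
39 | 
40 | # A short sketch (not part of the library): add a prompt to a throwaway database
41 | # file. The prompt text, parser name and path below are illustrative only.
42 | if __name__ == '__main__':
43 |     db = add_to_prompt_database(
44 |         text='Write {quantity} questions about: {context}',
45 |         description='Toy prompt for demonstration',
46 |         parser='csv_qa_parser',
47 |         database_path='demo_prompt_db.json',
48 |     )
49 |     print(db[-1]['hash'])  # The 10-character sha256 prefix used to reference the prompt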
--------------------------------------------------------------------------------
/opengpt/teachers.py:
--------------------------------------------------------------------------------
1 | import openai
2 |
3 | def ask_openai(prompt, config):
4 | response = openai.ChatCompletion.create(
5 | model = config.teacher.model,
6 | messages = [
7 | {"role": "user", "content": prompt},
8 | ]
9 | )
10 |
11 | message = None
12 | if response['choices'][0]['finish_reason'] == 'stop':
13 | message = response['choices'][0]['message']['content']
14 |
15 | return message
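16 | 
17 | 
18 | # A hedged sketch (not part of the library): calling the teacher directly. It
19 | # assumes an OPENAI_API_KEY in the environment and uses the pre-1.0
20 | # `openai.ChatCompletion` API that this module targets; the model name is an example.
21 | if __name__ == '__main__':
22 |     from box import Box
23 | 
24 |     config = Box({'teacher': {'model': 'gpt-3.5-turbo'}})
25 |     print(ask_openai('Say hello in one word.', config))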
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | .
2 |
--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
1 | import setuptools
2 | from setuptools.command.install import install
3 | from setuptools.command.develop import develop
4 | from setuptools.command.egg_info import egg_info
5 |
6 | with open("./README.md", "r") as fh:
7 | long_description = fh.read()
8 |
9 | setuptools.setup(
10 | name="opengpt",
11 | version="0.0.5",
12 | author="w-is-h",
13 | author_email="w.kraljevic@gmail.com",
14 |     description="OpenGPT, a framework for producing grounded domain-specific LLMs, and NHS-LLM, a conversational model for healthcare made using OpenGPT.",
15 | long_description=long_description,
16 | long_description_content_type="text/markdown",
17 | url="https://github.com/cogstack/opengpt",
18 | packages=['opengpt'],
19 | install_requires=[
20 | 'datasets>=2,<3',
21 | 'transformers>=4.2,<5',
22 | 'tiktoken>=0.3.2',
23 | 'pandas',
24 | 'openai',
25 | 'numpy',
26 | 'tqdm',
27 | 'python-box',
28 | 'jsonpickle',
29 | ],
30 | classifiers=[
31 | "Programming Language :: Python :: 3",
32 | "License :: OSI Approved :: MIT License",
33 | "Operating System :: OS Independent",
34 | ],
35 | )
36 |
--------------------------------------------------------------------------------