├── .gitignore
├── README.md
├── check_sentences.py
├── check_sentences_run.sh
├── corpus_counts.md
├── dict.txt
├── fix_tokenizer.py
├── preprocess.sh
├── preprocess_bigfile.sh
├── save_to_huggingface.py
├── sentencepiece_encoder.py
├── sentencepiece_trainer.py
├── shard_txt_file.sh
├── singularity_pytorch_bart.def
├── test_bart_checkpoint.py
├── tokenizers_encoder.py
├── tokenizers_encoder_run.sh
├── tokenizers_trainer.py
├── train_bart.sh
└── train_bart_args.sh
/.gitignore:
--------------------------------------------------------------------------------
1 | /data
2 | /config
3 | /checkpoints
4 | /fairseq
5 | /apex
6 | /multirun
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | ## BART pretraining scripts
2 |
3 | Users should
4 |
5 | 1. Clone this repo
6 | 2. Clone and install the following fork of fairseq [Lauler/fairseq](https://github.com/Lauler/fairseq).
7 | 3. (Optional) For faster training, clone and build NVIDIA's apex library (see the instructions in the fairseq documentation referenced above). Make sure your system CUDA version matches the CUDA version PyTorch was built with in your local Python environment before attempting to install apex (otherwise there will be a version mismatch).
8 |
9 | Optionally, users may build a Singularity container that has all of these dependencies. See the file `singularity_pytorch_bart.def` for the definition file of the Singularity container we built and used ourselves. You can build the same container using the command
10 |
11 | ```
12 | sudo singularity build pytorch_21.10_bart.sif singularity_pytorch_bart.def
13 | ```
14 |
15 | This will create a Singularity image file `pytorch_21.10_bart.sif`, which you can either use interactively via `singularity shell --nv pytorch_21.10_bart.sif`, or use to execute scripts via `singularity exec --nv pytorch_21.10_bart.sif ...`, where `...` is the program/command you want to run.
16 |
17 | ## Data and tokenization
18 |
19 | Assuming you have a large corpus to pretrain on, you should arrange the data into one or several text files with one sequence (which may consist of many sentences) per line. If you have limited GPU memory, consider sharding the text file into multiple chunks. A convenience script for doing this can be found in `shard_txt_file.sh`.
20 |
21 | Two methods for training GPT2-style BPE tokenizers are provided: `tokenizers_trainer.py` (Huggingface's tokenizers) and `sentencepiece_trainer.py` (Google's SentencePiece).
22 |
23 | ### Sentencepiece
24 | To use [sentencepiece](https://github.com/google/sentencepiece), first install
25 |
26 | ```
27 | pip install sentencepiece
28 | ```
29 |
30 | Then
31 |
32 | 1. Run `sentencepiece_trainer.py` to train your tokenizer. This will generate two files: `spm.bpe.model` and `spm.bpe.vocab`.
33 | 2. (Optional) Shard your dataset with `shard_txt_file.sh`.
34 | 3. Apply the tokenizer you have trained to the text file shards by running `sentencepiece_encoder.py`. This script will byte pair encode your text and generate output files ending with the suffix `.bpe`.
35 | 4. Before `fairseq-preprocess` can be applied, we need to edit the `spm.bpe.vocab` file to change the column separator from tab (`\t`) to a space, because fairseq expects a space-separated vocab file. See [this](https://github.com/musixmatchresearch/umberto/issues/2#issuecomment-585894712) Github thread explaining the process of using a sentencepiece vocab in fairseq, and [this](https://github.com/facebookresearch/fairseq/issues/1490#issuecomment-566604192) Github comment for context on what information the columns contain. The second column is the frequency of the token in your training set, but can be set to any dummy integer. A minimal conversion sketch is shown below.
36 |
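The sketch below is not part of this repo; it assumes the `spm.bpe.vocab` from step 1, reads the tab-separated vocab, replaces sentencepiece's log-probability column with a dummy integer count, and skips `<s>`, `<pad>`, `</s>` and `<unk>`, which fairseq's `Dictionary` adds on its own.

```python
# Minimal conversion sketch for step 4 (hypothetical helper, not part of the repo).
with open("spm.bpe.vocab", encoding="utf-8") as fin, open("dict.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        token, _logprob = line.rstrip("\n").split("\t")
        # fairseq's Dictionary adds <s>, <pad>, </s> and <unk> itself, so skip them here.
        if token in ("<s>", "<pad>", "</s>", "<unk>"):
            continue
        fout.write(f"{token} 100\n")  # 100 is an arbitrary dummy count
```
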
37 | ### Huggingface tokenizers
38 | To use Huggingface's tokenizers library, you
39 |
40 | 1. Run `tokenizers_trainer.py` to train your tokenizer. This will generate a file called `tokenizer.json`.
41 | 2. (Optional) Shard your datasets using `shard_txt_file.sh`.
42 | 3. Apply the tokenizer you have trained to tokenize the text file shards by running `tokenizers_encoder.py`. We launch multiple parallel jobs on SLURM to encode each individual shard via the script `tokenizers_encoder_run.sh`. A simplified sketch of this step is shown below.
43 |
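The core of step 3 is simply loading `tokenizer.json` with `transformers.PreTrainedTokenizerFast` and writing one space-joined token sequence per line; the actual script `tokenizers_encoder.py` additionally splits overly long documents and encodes the shards in parallel SLURM jobs. File names in the sketch are placeholders.

```python
from transformers import PreTrainedTokenizerFast

# Simplified sketch of step 3: byte pair encode one shard line by line.
tokenizer = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")

with open("shard00.txt") as rf, open("shard00.docs.token", "w") as wf:
    for line in rf:
        tokens = tokenizer.tokenize(line.rstrip("\n"))
        wf.write(" ".join(tokens) + "\n")
```
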
44 | ## Preprocess data with fairseq
45 |
46 | Once we have tokenized our data and converted it to byte pair encoded format, we are ready to preprocess it into a format `fairseq` can ingest during training. For `sentencepiece` vocab preprocessing, please see this github [comment and issue thread](https://github.com/musixmatchresearch/umberto/issues/2#issuecomment-585894712). If you want to use a Huggingface tokenizer (like we did), we outline two different ways of doing this below.
47 |
48 | ### Option 1 (untested but should work and be the easiest)
49 |
50 | Convert your `tokenizer.json` from `json` format to a `txt` format. Fairseq expects the vocab file to have two columns, separated by a **space**. The first column contains the tokens, and the second column a frequency count of how often each token appears in the training dataset. We can insert [dummy placeholder integers instead of actual frequencies](https://github.com/facebookresearch/fairseq/issues/1490#issuecomment-566604192). The converted `dict.txt` file might look something like this:
51 |
52 | ```
53 | Ġ. 12345
54 | Ġ, 12345
55 | Ġoch 12345
56 | Ġi 12345
57 | Ġatt 12345
58 | Ġär 12345
59 | Ġsom 12345
60 | ...
61 | ```
62 |
63 | You will generally see a lot of `Ġ` characters. This is because `Ġ` is the byte-level BPE encoding of a space character. The numbers `12345` are simply dummy frequency counts added to the second column because fairseq expects them.
64 |
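Like Option 1 itself, the conversion sketch below is untested, but it illustrates the idea: dump the vocabulary of `tokenizer.json` in id order as a space-separated `dict.txt` with dummy counts, skipping the special tokens that fairseq's dictionary adds automatically.

```python
import json

# Sketch: convert tokenizer.json (Huggingface) to a fairseq-style dict.txt.
with open("tokenizer.json", encoding="utf-8") as f:
    vocab = json.load(f)["model"]["vocab"]  # token -> id

special = {"<s>", "<pad>", "</s>", "<unk>"}  # added by fairseq automatically
with open("dict.txt", "w", encoding="utf-8") as f:
    for token, _idx in sorted(vocab.items(), key=lambda kv: kv[1]):
        if token in special:
            continue
        f.write(f"{token} 12345\n")  # dummy frequency count
```
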
65 | ### Option 2 (tested, but we recommend against this option if you manually added tokens to your vocab when training it with Huggingface)
66 |
67 | A second option is to let `fairseq-preprocess` automatically generate the `dict.txt` file for us. This can be done by [not specifying the `--srcdict` option](https://github.com/facebookresearch/fairseq/issues/1186#issuecomment-535606529) when running `fairseq-preprocess`.
68 |
69 | **IMPORTANT:** You need to run `fairseq-preprocess` once on your entire dataset to generate a `dict.txt` which covers all the text in your training/validation data. If you have sharded your dataset in the previous step, this means you need to (re)create a big text file that combines all the tokenized shards into one giant file. For example, something like this:
70 |
71 | ```bash
72 | singularity exec pytorch_21.03_bart.sif \
73 | fairseq-preprocess --only-source \
74 | --trainpref "/path/to/all_data.txt" \
75 | --validpref "/path/to/valid.txt" \
76 | --destdir "/path/to/dest_folder/all" \
77 | --workers 64
78 | ```
79 |
80 | In the above example, the resulting `dict.txt` will be available in the folder `/path/to/dest_folder/all`. A big file of 101 GB took us about 2h 30min to preprocess with 20 workers (threads). With 128 workers the same file was preprocessed in 42 minutes.
81 |
82 | **WARNING:** If you manually enforced the inclusion of certain tokens during the Huggingface tokenizers training, there may be a token mismatch between your original `tokenizer.json` and the `dict.txt` generated by `fairseq-preprocess`. This is possible to fix afterwards by creating a new `tokenizer.json` from the `dict.txt` (see `fix_tokenizer.py` in this repo for an example). But it can be a huge headache, because `tokenizers` in Huggingface is sensitive to the ordering of tokens. Thus we strongly recommend you only use **Option 2** if the set of tokens in your `tokenizer.json` is exactly the same as the set of tokens in `dict.txt`.
83 |
84 | ### Final preprocessing step once we have a `dict.txt`
85 |
86 | Once you have a `dict.txt` which covers your entire dataset, you can preprocess the individual shards by pointing to your "master dictionary file" using the `--srcdict` argument in `fairseq-preprocess`.
87 |
88 | ```bash
89 | singularity exec pytorch_21.03_bart.sif \
90 | fairseq-preprocess --only-source \
91 | --trainpref "/path/to/shard1.txt" \
92 | --validpref "/path/to/valid.txt" \
93 | --destdir "/path/to/dest_folder/shard1" \
94 | --srcdict "/path/to/dict.txt" \
95 | --workers 64
96 | ```
97 |
98 | We have a bash script for launching multiple jobs to preprocess the shards in `preprocess.sh`.
99 |
100 | ## Train the model
101 |
102 | Finally we have everything that is needed to start training. See `train_bart.sh` and `train_bart_args.sh` for the BART pre-training scripts.
103 |
104 | **IMPORTANT:** Fairseq training will always crash once you run out of shards. It does not allow you to relist the same shard files in order to continue training another epoch on the same files. This behavior might be by design, as it is better to create a fresh shuffling of the data for each subsequent epoch. It is possible to reuse the existing shards by restarting the training. However, we instead recommend creating more shuffled shards of the same data (for however many epochs you expect to train) before starting training.
105 |
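If you go the route of pre-generating shuffled shards, the idea is roughly the sketch below (file names, shard size and epoch count are placeholders; for a corpus of our size you would rather do the shuffling and splitting on disk, e.g. with `shuf` and `split`).

```python
import random

# Rough sketch: write a freshly shuffled copy of the corpus for each planned
# epoch, so that fairseq never runs out of distinct shards during training.
with open("all_data.txt") as f:
    lines = f.readlines()

lines_per_shard = 1_000_000
for epoch in range(3):  # however many epochs you expect to train
    random.seed(epoch)
    random.shuffle(lines)
    for i in range(0, len(lines), lines_per_shard):
        with open(f"epoch{epoch:02d}.shard{i // lines_per_shard:02d}.txt", "w") as out:
            out.writelines(lines[i : i + lines_per_shard])
```
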
106 | I have done my best to translate what was written in the paper to fairseq config commands by reading the fairseq docs and the relevant source code. Details on dropout and learning rate are not clear from the paper. You need to set these yourself and adjust them based on your batch size.
107 |
108 | Kindly open an issue if you notice anything strange with my suggested config.
109 |
110 | ## Convert to Huggingface format
111 |
112 | Look at the file [`save_to_huggingface.py`](https://github.com/kb-labb/kb_bart/blob/main/save_to_huggingface.py) for an example of how to convert the pretrained BART model to a Huggingface compatible format.
113 |
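Once converted, the model can be loaded like any other Huggingface model. The sketch below assumes the `hfmodel` output folder that `save_to_huggingface.py` writes to:

```python
import transformers

# Load the converted model/tokenizer and generate from a short prompt.
tok = transformers.PreTrainedTokenizerFast.from_pretrained("hfmodel")
model = transformers.BartForConditionalGeneration.from_pretrained("hfmodel")

inputs = tok("Det här är en testmening", return_tensors="pt")
print(tok.decode(model.generate(inputs["input_ids"])[0]))
```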
--------------------------------------------------------------------------------
/check_sentences.py:
--------------------------------------------------------------------------------
1 | import os
2 | import time
3 | import argparse
4 | import fairseq
5 | import multiprocessing as mp
6 |
7 | parser = argparse.ArgumentParser()
8 | parser.add_argument("-f", "--filename", type=str)
9 | parser.add_argument("-w", "--num_workers", type=int)
10 | parser.add_argument("--dictionary", type=str, default="dict.txt")
11 | parser.add_argument(
12 | "data_folder",
13 | nargs="?",
14 | type=str,
15 | default="/ceph/hpc/home/eufatonr/data/text/kb_bart_data/tokenized",
16 | )
17 | parser.add_argument(
18 | "-d",
19 | "--dest_folder",
20 | nargs="?",
21 | type=str,
22 | default="/ceph/hpc/home/eufatonr/data/text/kb_bart_data/tokenized",
23 | )
24 | args = parser.parse_args()
25 |
26 | d = fairseq.data.Dictionary.load(args.dictionary)
27 |
28 |
29 | def chunks(l, n):
30 | """Yield n number of striped chunks from l.
31 | https://stackoverflow.com/questions/24483182/python-split-list-into-n-chunks/48971420
32 | """
33 | for i in range(0, n):
34 | yield l[i::n]
35 |
36 |
37 | def validate_sentences(doc_chunk):
38 | """
39 |     Check if <s> occurs naturally in the middle of a document.
40 |     Remove the observation, otherwise training will crash with
41 |     an AssertionError that (sentence[1:-1] >= 1).all().
42 |     <s> is encoded as 0 in the dictionary/vocabulary.
43 | """
44 | output_docs = []
45 | for doc in doc_chunk:
46 | encoded_sen = d.encode_line(doc, add_if_not_exist=False)
47 |
48 | if (encoded_sen[1:-1] >= 1).all():
49 | output_docs.append(doc)
50 | else:
51 | print(f"Removing observation, document: {doc}")
52 | return output_docs
53 |
54 |
55 | text_shard_file = os.path.join(args.data_folder, args.filename)
56 |
57 | documents = []
58 | with open(text_shard_file) as f:
59 | for line in f:
60 | documents.append(line)
61 |
62 | doc_chunks = list(chunks(documents, args.num_workers))
63 |
64 | t0 = time.time()
65 | pool = mp.Pool(processes=args.num_workers)
66 | validated_sentences = pool.map(validate_sentences, doc_chunks)
67 | t1 = time.time()
68 | print(f"Documents in file {text_shard_file} validated in {t1 - t0} seconds.")
69 |
70 |
71 | flat_list = [item for sublist in validated_sentences for item in sublist]
72 |
73 | output_filename = os.path.basename(text_shard_file) + ".check"
74 | output_path = os.path.join(args.dest_folder, output_filename)
75 |
76 | with open(output_path, "w") as wf:
77 | for line in flat_list:
78 | wf.write(line)
79 |
--------------------------------------------------------------------------------
/check_sentences_run.sh:
--------------------------------------------------------------------------------
1 | num_workers=32
2 |
3 | for filename in $(ls -p /ceph/hpc/home/eufatonr/data/text/kb_bart_data/tokenized | grep ".docs.token");
4 | do
5 | srun -p cpu --mem=45G --nodes=1 --ntasks=1 --cpus-per-task=${num_workers} --time=00:30:00 \
6 | singularity exec pytorch_21.03_bart.sif \
7 | python check_sentences.py -f $filename --num_workers $num_workers --dictionary "dict.txt" &
8 | done
9 |
--------------------------------------------------------------------------------
/corpus_counts.md:
--------------------------------------------------------------------------------
1 | # Corpus Counts
2 |
3 |
4 | | Subcorpus | docs |
5 | | ------------------------------- | ---------- |
6 | | edeposcorpus.split00.docs.token | 297,478 |
7 | | flashcorpus.split00.docs.token | 273,794 |
8 | | news_2.split00.docs.token | 3,093,543 |
9 | | news_2.split01.docs.token | 2,976,780 |
10 | | news_2.split02.docs.token | 3,100,877 |
11 | | news_2.split03.docs.token | 2,934,816 |
12 | | news_2.split04.docs.token | 2,811,136 |
13 | | news_2.split05.docs.token | 2,671,540 |
14 | | news_2.split06.docs.token | 2,167,264 |
15 | | news_2.split07.docs.token | 2,823,473 |
16 | | news_2.split08.docs.token | 281,078 |
17 | | news_2.split09.docs.token | 2,979,623 |
18 | | news_2.split10.docs.token | 3,105,495 |
19 | | news_2.split11.docs.token | 2,745,166 |
20 | | news_2.split12.docs.token | 3,105,369 |
21 | | news_2.split13.docs.token | 2,888,914 |
22 | | news_2.split14.docs.token | 3,044,163 |
23 | | news_2.split15.docs.token | 2,243,778 |
24 | | news_2.split16.docs.token | 2,652,094 |
25 | | news_2.split17.docs.token | 3,073,893 |
26 | | news_2.split18.docs.token | 2,926,715 |
27 | | nok.split00.docs.token | 1,622,992 |
28 | | offentligt.split00.docs.token | 3,306,624 |
29 | | offentligt.split01.docs.token | 2,198,892 |
30 | | oscar.split00.docs.token | 859,790 |
31 | | oscar.split01.docs.token | 800,000 |
32 | | oscar.split02.docs.token | 881,928 |
33 | | oscar.split03.docs.token | 873,169 |
34 | | oscar.split04.docs.token | 892,183 |
35 | | oscar.split05.docs.token | 882,298 |
36 | | oscar.split06.docs.token | 882,881 |
37 | | oscar.split07.docs.token | 888,328 |
38 | | oscar.split08.docs.token | 904,712 |
39 | | oscar.split09.docs.token | 893,941 |
40 | | oscar.split10.docs.token | 865,024 |
41 | | oscar.split11.docs.token | 903,509 |
42 | | oscar.split12.docs.token | 415,618 |
43 | | runeberg.split00.docs.token | 902,768 |
44 | | tweets.split00.docs.token | 10,442,046 |
45 | | wiki.split00.docs.token | 3,421,795 |
46 | | total | 85,035,487 |
47 |
48 | Total tokens in corpus : 15,151,843,671
49 |
50 | ## Too long docs
51 |
52 | | Subcorpus | too long docs (> 1022) |
53 | | ------------------------------- | ---------------------- |
54 | | edeposcorpus.split00.docs.token | 76 |
55 | | flashcorpus.split00.docs.token | 135 |
56 | | news_2.split00.docs.token | 668 |
57 | | news_2.split01.docs.token | 547 |
58 | | news_2.split02.docs.token | 117 |
59 | | news_2.split03.docs.token | 742 |
60 | | news_2.split04.docs.token | 571 |
61 | | news_2.split05.docs.token | 805 |
62 | | news_2.split06.docs.token | 2269 |
63 | | news_2.split07.docs.token | 557 |
64 | | news_2.split08.docs.token | 35 |
65 | | news_2.split09.docs.token | 283 |
66 | | news_2.split10.docs.token | 443 |
67 | | news_2.split11.docs.token | 3459 |
68 | | news_2.split12.docs.token | 218 |
69 | | news_2.split13.docs.token | 151 |
70 | | news_2.split14.docs.token | 171 |
71 | | news_2.split15.docs.token | 1461 |
72 | | news_2.split16.docs.token | 1232 |
73 | | news_2.split17.docs.token | 314 |
74 | | news_2.split18.docs.token | 100 |
75 | | nok.split00.docs.token | 122 |
76 | | offentligt.split00.docs.token | 8 |
77 | | offentligt.split01.docs.token | 1 |
78 | | oscar.split00.docs.token | 87546 |
79 | | oscar.split01.docs.token | 80124 |
80 | | oscar.split02.docs.token | 86818 |
81 | | oscar.split03.docs.token | 88286 |
82 | | oscar.split04.docs.token | 86809 |
83 | | oscar.split05.docs.token | 86787 |
84 | | oscar.split06.docs.token | 86828 |
85 | | oscar.split07.docs.token | 86707 |
86 | | oscar.split08.docs.token | 87348 |
87 | | oscar.split09.docs.token | 87151 |
88 | | oscar.split10.docs.token | 85381 |
89 | | oscar.split11.docs.token | 88898 |
90 | | oscar.split12.docs.token | 39836 |
91 | | runeberg.split00.docs.token | 2738 |
92 | | tweets.split00.docs.token | 0 |
93 | | wiki.split00.docs.token | 24200 |
94 | | total | 1119942 |
95 |
96 | Total tokens in too long docs: 2,999,450,123
97 |
98 | Tokens left if not splitting: 12,152,393,548
99 |
--------------------------------------------------------------------------------
/fix_tokenizer.py:
--------------------------------------------------------------------------------
1 | from fairseq.models.bart import BARTModel
2 | from typing import Dict
3 | import json
4 | from copy import deepcopy
5 | import argparse
6 |
7 |
8 | def fix_tokenizer(old_tokenizer, new_vocab: Dict[str, int]):
9 | """
10 |     The new_tokenizer is a copy of the old_tokenizer, but with the new
11 |     vocabulary dictionary supplied instead.
12 | Since the keys must match the merges, we append keys from the old_tokenizer
13 | to the new_tokenizer.
14 | """
15 | new_tokenizer = deepcopy(old_tokenizer)
16 | new_tokenizer["model"]["vocab"] = deepcopy(new_vocab)
17 | for k in old_tokenizer["model"]["vocab"]:
18 | if k not in new_vocab:
19 | new_tokenizer["model"]["vocab"][k] = len(new_tokenizer["model"]["vocab"])
20 |
21 | # Change token id from 4 to 50184
22 | new_tokenizer["added_tokens"][4]["id"] = 50184
23 | return new_tokenizer
24 |
25 |
26 | def get_args() -> argparse.Namespace:
27 | parser = argparse.ArgumentParser()
28 | parser.add_argument(
29 | "--tokenizer",
30 | type=str,
31 | default="tokenizer.json",
32 | help="Path to the huggingface tokenizer-json",
33 | )
34 | parser.add_argument(
35 | "--checkpoint",
36 | type=str,
37 | default="checkpoint_best.pt",
38 | help="Name of the BART checkpoint file.",
39 | )
40 | parser.add_argument(
41 | "--folder",
42 | type=str,
43 | default="bart_model",
44 | help="Path to the folder containing a BART checkpoint and dict.txt",
45 | )
46 | parser.add_argument(
47 | "--new_tokenizer",
48 | type=str,
49 | default="tokenizer_fixed.json",
50 | help="Path to the new tokenizer-json",
51 | )
52 | return parser.parse_args()
53 |
54 |
55 | def main():
56 |
57 | args = get_args()
58 |
59 | with open(args.tokenizer) as fin:
60 | tokenizer = json.load(fin)
61 |
62 | bart = BARTModel.from_pretrained(args.folder, checkpoint_file=args.checkpoint)
63 | model_dict = bart.task.source_dictionary.__dict__
64 | new_vocab = model_dict["indices"]
65 |
66 | fixed_tokenizer = fix_tokenizer(tokenizer, new_vocab)
67 |
68 | with open(args.new_tokenizer, "w") as fout:
69 | json.dump(fixed_tokenizer, fout, indent=4, ensure_ascii=False)
70 |
71 |
72 | if __name__ == "__main__":
73 | main()
74 |
--------------------------------------------------------------------------------
/preprocess.sh:
--------------------------------------------------------------------------------
1 | data_folder="/ceph/hpc/home/eufatonr/data/text/kb_bart_data"
2 |
3 | # Iterate only over files ending with .docs.token.check (ignore all.txt and oscar.split12.valid)
4 | for filename in $(ls -p /ceph/hpc/home/eufatonr/data/text/kb_bart_data/tokenized | grep ".docs.token.check");
5 | do
6 | # Positive lookahead regex: news_2.split00.docs.token.check ----> news_2.split00
7 | # (return everything before '.docs.token')
8 | destination_dir=`echo "$filename" | grep -oP '.*(?=\.docs\.token\.check)'`
9 |
10 | srun -p cpu --mem=10G --nodes=1 --ntasks=1 --cpus-per-task=20 --time=00:30:00 \
11 | singularity exec pytorch_21.03_bart.sif \
12 | fairseq-preprocess --only-source \
13 | --trainpref "${data_folder}/tokenized/${filename}" \
14 | --validpref "${data_folder}/tokenized/oscar.split12.valid" \
15 | --destdir "/ceph/hpc/home/eufatonr/faton/kb_bart/data/${destination_dir}" \
16 | --srcdict "dict.txt" \
17 | --workers 20 &
18 | done
--------------------------------------------------------------------------------
/preprocess_bigfile.sh:
--------------------------------------------------------------------------------
1 | srun -p cpu --mem=60G --nodes=1 --ntasks=1 --cpus-per-task=128 --time=02:30:00 \
2 | singularity exec pytorch_21.03_bart.sif \
3 | fairseq-preprocess --only-source \
4 | --trainpref "/ceph/hpc/home/eufatonr/data/text/kb_bart_data/tokenized/all.txt" \
5 | --validpref "/ceph/hpc/home/eufatonr/data/text/kb_bart_data/tokenized/oscar.split12.docs.token" \
6 | --destdir "/ceph/hpc/home/eufatonr/faton/kb_bart/data/all" \
7 | --workers 128
--------------------------------------------------------------------------------
/save_to_huggingface.py:
--------------------------------------------------------------------------------
1 | import torch
2 | from torch import nn
3 | import transformers
4 | import argparse
5 |
6 |
7 | def make_linear_from_emb(emb):
8 | vocab_size, emb_size = emb.weight.shape
9 | lin_layer = nn.Linear(vocab_size, emb_size, bias=False)
10 | lin_layer.weight.data = emb.weight.data
11 | return lin_layer
12 |
13 |
14 | def remove_ignore_keys_(state_dict):
15 | ignore_keys = [
16 | "encoder.version",
17 | "decoder.version",
18 | "model.encoder.version",
19 | "model.decoder.version",
20 | "_float_tensor",
21 | "decoder.output_projection.weight",
22 | ]
23 | for k in ignore_keys:
24 | state_dict.pop(k, None)
25 |
26 |
27 | def get_args() -> argparse.Namespace:
28 | parser = argparse.ArgumentParser()
29 | parser.add_argument("--tokenizer", type=str, default="tokenizer.json")
30 | parser.add_argument(
31 | "--checkpoint", type=str, default="checkpoints/checkpoint_best.pt"
32 | )
33 | return parser.parse_args()
34 |
35 |
36 | def main():
37 | args = get_args()
38 |
39 | tok = transformers.PreTrainedTokenizerFast(
40 | tokenizer_file=args.tokenizer,
41 |         bos_token="<s>",
42 |         eos_token="</s>",
43 |         unk_token="<unk>",
44 |         pad_token="<pad>",
45 |         mask_token="<mask>",
46 |         cls_token="<s>",
47 |         sep_token="</s>",
48 | )
49 |
50 | state_dict = torch.load(args.checkpoint, map_location="cpu")["model"]
51 |
52 | vocab_size = state_dict["encoder.embed_tokens.weight"].shape[0]
53 |
54 | config = transformers.BartConfig(
55 | vocab_size=vocab_size,
56 | d_model=768,
57 | decoder_ffn_dim=3072,
58 | encoder_ffn_dim=3072,
59 | decoder_layers=6,
60 | encoder_layers=6,
61 | )
62 |
63 | model = transformers.BartForConditionalGeneration(config)
64 |
65 | remove_ignore_keys_(state_dict)
66 | state_dict["shared.weight"] = state_dict["decoder.embed_tokens.weight"]
67 | model.model.load_state_dict(state_dict)
68 |
69 | model.lm_head = make_linear_from_emb(model.model.shared)
70 |
71 | # Save to Huggingface format
72 | model.save_pretrained("hfmodel")
73 | tok.save_pretrained("hfmodel")
74 |
75 |
76 | if __name__ == "__main__":
77 | main()
78 |
--------------------------------------------------------------------------------
/sentencepiece_encoder.py:
--------------------------------------------------------------------------------
1 | import sentencepiece as spm
2 |
3 | # Best let dataset/dataloader insert bos and eos rather than do it here
4 | sp = spm.SentencePieceProcessor(model_file="spm.bpe.model",) # add_bos=True, add_eos=True
5 |
6 | sp.encode("Det här är en överdriven testmening", out_type=str)
7 | sp.encode(" Testa vad som sker med BOS- och EOS-symboler", out_type=str)
8 |
9 |
10 | with open("oscar_train.txt", "r") as rf, open("oscar_train.bpe", "w") as wf:
11 | for line in rf:
12 | wf.write(" ".join(sp.encode(line, out_type=str)))
13 | wf.write("\n")
14 |
15 | with open("oscar_valid.txt", "r") as rf, open("oscar_valid.bpe", "w") as wf:
16 | for line in rf:
17 | wf.write(" ".join(sp.encode(line, out_type=str)))
18 | wf.write("\n")
19 |
20 | print("Done.")
--------------------------------------------------------------------------------
/sentencepiece_trainer.py:
--------------------------------------------------------------------------------
1 | import sentencepiece as spm
2 |
3 | # from datasets import load_dataset
4 |
5 | # dataset = load_dataset("oscar", "unshuffled_deduplicated_sv", cache_dir="/ceph/hpc/home/eufatonr/faton/kb_bart/oscar")
6 |
7 |
8 | def batch_iterator(dataset, dataset_size, batch_size):
9 | for i in range(0, dataset_size, batch_size):
10 | # Tokenizers ignore new lines, but when writing to .txt-file
11 | # we don't want newlines inserted at every \n, only at the end
12 | text_batch = map(
13 | lambda text: text.replace("\n", " ") + "\n", dataset[i : i + batch_size]["text"]
14 | )
15 | yield list(text_batch)
16 |
17 |
18 | def create_txt_from_dataset(text_line_generator, filename):
19 | with open(filename, "w") as f:
20 | for line in text_line_generator:
21 | f.writelines(line)
22 |
23 |
24 | # text_line_generator = batch_iterator(dataset["train"], len(dataset["train"]), 50)
25 | # create_txt_from_dataset(text_line_generator, "oscar_train.txt")
26 |
27 | spm.SentencePieceTrainer.train(
28 | input="oscar_train.txt",
29 | model_prefix="spm.bpe",
30 | vocab_size=50265,
31 |     user_defined_symbols=["<mask>"],
32 | model_type="bpe",
33 | bos_id=0,
34 | pad_id=1,
35 | eos_id=2,
36 | unk_id=3,
37 | )
38 |
--------------------------------------------------------------------------------
/shard_txt_file.sh:
--------------------------------------------------------------------------------
1 | split -C 1500m --numeric-suffixes output.txt output
--------------------------------------------------------------------------------
/singularity_pytorch_bart.def:
--------------------------------------------------------------------------------
1 | BootStrap: docker
2 | From: nvcr.io/nvidia/pytorch:21.10-py3
3 |
4 | # %runscript
5 | # echo "Building Nvidia Pytorch singularity image with fairseq"
6 | # source $(conda info --base)/etc/profile.d/conda.sh
7 | # conda activate base
8 |
9 | %environment
10 | export LC_ALL=C
11 |
12 | %post
13 | # create mount points for SLING
14 | mkdir /data1 /data2 /data0
15 | mkdir -p /var/spool/slurm
16 | mkdir -p /d/hpc
17 | mkdir -p /ceph/grid
18 | mkdir -p /ceph/hpc
19 | mkdir -p /scratch
20 | mkdir -p /exa5/scratch
21 |
22 | pip install --no-cache-dir transformers datasets pyarrow sentencepiece
23 | git clone https://github.com/Lauler/fairseq
24 | cd fairseq
25 | pip install --editable ./
26 |
--------------------------------------------------------------------------------
/test_bart_checkpoint.py:
--------------------------------------------------------------------------------
1 | import torch
2 | from torch import nn
3 | import transformers
4 | import argparse
5 |
6 |
7 | def make_linear_from_emb(emb):
8 | vocab_size, emb_size = emb.weight.shape
9 | lin_layer = nn.Linear(vocab_size, emb_size, bias=False)
10 | lin_layer.weight.data = emb.weight.data
11 | return lin_layer
12 |
13 |
14 | def remove_ignore_keys_(state_dict):
15 | ignore_keys = [
16 | "encoder.version",
17 | "decoder.version",
18 | "model.encoder.version",
19 | "model.decoder.version",
20 | "_float_tensor",
21 | "decoder.output_projection.weight",
22 | ]
23 | for k in ignore_keys:
24 | state_dict.pop(k, None)
25 |
26 |
27 | def get_args() -> argparse.Namespace:
28 | parser = argparse.ArgumentParser()
29 | parser.add_argument("--tokenizer", type=str, default="tokenizer.json")
30 | parser.add_argument("--checkpoint",
31 | type=str,
32 | default="checkpoints/checkpoint_best.pt")
33 | return parser.parse_args()
34 |
35 |
36 | def main():
37 | args = get_args()
38 |
39 | tok = transformers.PreTrainedTokenizerFast(tokenizer_file=args.tokenizer,
40 |                                                bos_token="<s>",
41 |                                                eos_token="</s>",
42 |                                                unk_token="<unk>",
43 |                                                pad_token="<pad>",
44 |                                                mask_token="<mask>",
45 |                                                cls_token="<s>",
46 |                                                sep_token="</s>")
47 |
48 | state_dict = torch.load(args.checkpoint,
49 | map_location="cpu")["model"]
50 |
51 | vocab_size = state_dict["encoder.embed_tokens.weight"].shape[0]
52 |
53 | config = transformers.BartConfig(vocab_size=vocab_size,
54 | d_model=768,
55 | decoder_ffn_dim=3072,
56 | encoder_ffn_dim=3072,
57 | decoder_layers=6,
58 | encoder_layers=6)
59 |
60 | model = transformers.BartForConditionalGeneration(config)
61 |
62 | remove_ignore_keys_(state_dict)
63 | state_dict["shared.weight"] = state_dict["decoder.embed_tokens.weight"]
64 | model.model.load_state_dict(state_dict)
65 |
66 | model.lm_head = make_linear_from_emb(model.model.shared)
67 |
68 | while True:
69 | text = input("Please write your input sentence:\n")
70 | xs = tok(text, return_tensors="pt")
71 |
72 | print(tok.decode(model.generate(xs["input_ids"])[0]))
73 |
74 |
75 | if __name__ == "__main__":
76 | main()
77 |
--------------------------------------------------------------------------------
/tokenizers_encoder.py:
--------------------------------------------------------------------------------
1 | import os
2 | import argparse
3 | import time
4 | import multiprocessing as mp
5 | import fairseq
6 | from transformers import PreTrainedTokenizerFast
7 |
8 | # Run this file on slurm via tokenizers_encoder_run.sh
9 |
10 | parser = argparse.ArgumentParser()
11 | parser.add_argument("-f", "--filename", type=str)
12 | parser.add_argument(
13 | "data_folder",
14 | nargs="?",
15 | type=str,
16 | default="/ceph/hpc/home/eufatonr/data/text/kb_bart_data/split",
17 | )
18 | parser.add_argument(
19 | "-d",
20 | "--dest_folder",
21 | nargs="?",
22 | type=str,
23 | default="/ceph/hpc/home/eufatonr/data/text/kb_bart_data/tokenized",
24 | )
25 | args = parser.parse_args()
26 |
27 | tokenizer = PreTrainedTokenizerFast(
28 | tokenizer_file="tokenizer.json",
29 |     bos_token="<s>",
30 |     eos_token="</s>",
31 |     unk_token="<unk>",
32 |     mask_token="<mask>",
33 |     pad_token="<pad>",
34 | )
35 |
36 |
37 | def per_document(iterator, is_delimiter=lambda x: x.isspace()):
38 | """
39 | # Read text file where sentences are separated by newline and
40 | # documents by empty line into a list of lists.
41 | # https://stackoverflow.com/questions/25226871/splitting-textfile-into-section-with-special-delimiter-line-python/25226944#25226944
42 | """
43 | sentences = []
44 | for line in iterator:
45 | if is_delimiter(line):
46 | if sentences:
47 | yield sentences # OR ''.join(sentences)
48 | sentences = []
49 | else:
50 | sentences.append(line.rstrip()) # OR sentences.append(line)
51 | if sentences:
52 | yield sentences
53 |
54 |
55 | def tokenize_text(document):
56 | """
57 | Document is a list of lists where each nested list is a sentence.
58 | [[sentence1], [sentence2], ...]
59 | """
60 | tokenized_sentences = []
61 | for sentence in document:
62 | tokenized_sentence = tokenizer.tokenize(sentence)
63 | tokenized_sentence = " ".join(tokenized_sentence)
64 | tokenized_sentences.append(tokenized_sentence)
65 |
66 | return tokenized_sentences
67 |
68 |
69 | def split_long_docs(doc, max_len=1022):
70 | """
71 | Split documents longer than 1022 tokens into chunks
72 | """
73 | new_doc = []
74 | doc_len = 0
75 | for i, sen in enumerate(doc):
76 | sen_len = len(sen.split()) # word split
77 | if doc_len + sen_len < max_len:
78 | new_doc.append(sen)
79 | doc_len += sen_len
80 | else:
81 | yield new_doc
82 | new_doc = [sen]
83 | doc_len = sen_len
84 | yield new_doc
85 |
86 |
87 | def preprocess_text(document, max_sequence_length=1022):
88 | tokenized_document = tokenize_text(document)
89 |     total_doc_length = sum(len(sentence.split()) for sentence in tokenized_document)
90 | if total_doc_length > max_sequence_length:
91 | tokenized_document_splits = split_long_docs(tokenized_document, max_sequence_length)
92 | return list(tokenized_document_splits)
93 | else:
94 | return [tokenized_document]
95 |
96 |
97 | # data_folder = "/ceph/hpc/home/eufatonr/data/text/kb_bart_data/split"
98 | text_shard_file = os.path.join(args.data_folder, args.filename)
99 |
100 | t0 = time.time()
101 | with open(text_shard_file) as f:
102 | documents = list(per_document(f)) # default delimiter
103 | t1 = time.time()
104 | print(f"Reading sentences from file {text_shard_file}. Completed in {t1 - t0} seconds.")
105 |
106 | t0 = time.time()
107 | pool = mp.Pool(processes=20)
108 | tokenized_sentences = pool.map(preprocess_text, documents)
109 | pool.close()
110 | t1 = time.time()
111 |
112 | # Unnest the inner lists in tokenized_sentences
113 | flat_list = [item for sublist in tokenized_sentences for item in sublist]
114 | flat_list = [" ".join(sen) for sen in flat_list] # join list of sentences to doc
115 |
116 | output_filename = os.path.basename(text_shard_file) + ".docs.token"
117 | output_path = os.path.join(args.dest_folder, output_filename)
118 |
119 | # Use regular file write line by line to avoid quoting/escaping issues of pandas.
120 | with open(output_path, "w") as wf:
121 | for line in flat_list:
122 | wf.write(line)
123 | wf.write("\n")
124 |
125 |
126 | print(f"{os.path.basename(text_shard_file)} has been tokenized and saved to {output_path}.")
127 | print(f"Time to tokenize sentences in shard: {t1 - t0} seconds.")
128 |
--------------------------------------------------------------------------------
/tokenizers_encoder_run.sh:
--------------------------------------------------------------------------------
1 | for filename in $(ls -p /ceph/hpc/home/eufatonr/data/text/kb_bart_data/split | grep -v "/$");
2 | do
3 | srun -p cpu --mem=30G --nodes=1 --ntasks=1 --cpus-per-task=20 --time=00:30:00 \
4 | singularity exec pytorch_21.03_bart.sif \
5 | python tokenizers_encoder.py -f $filename &
6 | done
7 |
--------------------------------------------------------------------------------
/tokenizers_trainer.py:
--------------------------------------------------------------------------------
1 | from datasets import load_dataset, concatenate_datasets
2 | from tokenizers import (
3 | Tokenizer,
4 | models,
5 | normalizers,
6 | pre_tokenizers,
7 | decoders,
8 | trainers,
9 | processors,
10 | Regex,
11 | )
12 |
13 |
14 | def batch_iterator(dataset, dataset_size, batch_size):
15 | for i in range(0, dataset_size, batch_size):
16 | yield dataset[i : i + batch_size]["text"]
17 |
18 |
19 | # https://github.com/huggingface/tokenizers/issues/640#issuecomment-792305076
20 | def bpe_tokenizer_trainer(text, vocab_size, min_frequency=0, add_prefix_space=True, batch_size=50):
21 |     # Supply either a path to a txt file or a datasets Dataset as the text arg
22 |
23 | tokenizer = Tokenizer(models.BPE())
24 |
25 | tokenizer.pre_tokenizer = pre_tokenizers.Sequence(
26 | [
27 | pre_tokenizers.Whitespace(),
28 | pre_tokenizers.Punctuation(),
29 | pre_tokenizers.ByteLevel(add_prefix_space=add_prefix_space),
30 | ]
31 | )
32 | tokenizer.normalizer = normalizers.Sequence(
33 | [normalizers.Nmt(), normalizers.NFKC(), normalizers.Replace(Regex(" {2,}"), " "),]
34 | )
35 |
36 | tokenizer.decoder = decoders.ByteLevel()
37 |
38 | trainer = trainers.BpeTrainer(
39 | vocab_size=vocab_size,
40 |         special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
41 | min_frequency=min_frequency,
42 | initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
43 | )
44 |
45 | if isinstance(text, str):
46 | # if user specified path to txt file as string
47 |         tokenizer.train([text], trainer=trainer)
48 | else:
49 | # text is a datasets Dataset
50 | tokenizer.train_from_iterator(batch_iterator(text, len(text), batch_size), trainer=trainer)
51 |
52 | tokenizer.post_processor = processors.RobertaProcessing(
53 |         sep=("</s>", tokenizer.token_to_id("</s>")), cls=("<s>", tokenizer.token_to_id("<s>"))
54 | )
55 |
56 | tokenizer.save("tokenizer.json", pretty=True)
57 | # tokenizer.model.save("output_dir")
58 |
59 |
60 | def pretokenizer_print(text):
61 | # To print and check how pre tokenization looks like
62 | pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
63 | return pre_tokenizer.pre_tokenize_str(text)
64 |
65 |
66 | dataset = load_dataset(
67 | "text",
68 | data_files={
69 | "wiki": "/ceph/hpc/home/eufatonr/data/text/public/wiki.sv.docs",
70 | "oscar_local": "/ceph/hpc/home/eufatonr/data/text/public/oscar.sv.docs",
71 | },
72 | cache_dir="cache_dataset",
73 | )
74 |
75 | dataset = concatenate_datasets([dataset["wiki"], dataset["oscar_local"]])
76 | bpe_tokenizer_trainer(text=dataset, vocab_size=50260)
77 |
--------------------------------------------------------------------------------
/train_bart.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | #SBATCH --partition=gpu
3 | #SBATCH --job-name=kb_bart
4 | #SBATCH --mem=40G
5 | #SBATCH --gres=gpu:4
6 | #SBATCH --nodes=16
7 | ##SBATCH --begin=now+4hour
8 | ##SBATCH --nodelist=gn[45-60]
9 | #SBATCH --exclude=gn40
10 | #SBATCH --cpus-per-gpu=2
11 | #SBATCH --time=0-05:00:00
12 | #SBATCH --output=logs/faton.log
13 |
14 | # module purge
15 | export MASTER_ADDR=`/bin/hostname -s`
16 | export MASTER_PORT=13673
17 | export NPROC_PER_NODE=4
18 |
19 |
20 | # debugging flags (optional)
21 | export NCCL_DEBUG=INFO
22 | export PYTHONFAULTHANDLER=1
23 |
24 | DATETIME=`date +'date_%y-%m-%d_time_%H-%M-%S'`
25 | PROJECT=/ceph/hpc/home/eufatonr/faton/kb_bart
26 | LOGGING=$PROJECT/logs
27 | LOGFILE="${LOGGING}/%x_${DATETIME}.log"
28 | echo $LOGFILE
29 |
30 | echo $MASTER_ADDR
31 | echo $MASTER_PORT
32 | echo $NPROC_PER_NODE
33 | echo $SLURM_JOB_NAME
34 | echo $SLURM_JOB_ID
35 | echo $SLURM_JOB_NODELIST
36 | echo $SLURM_JOB_NUM_NODES
37 | echo $SLURM_LOCALID
38 | echo $SLURM_NODEID
39 | echo $SLURM_PROCID
40 |
41 |
42 | # Need to restart training every 40 epochs (1 cycle through all shards of data)
43 | train_cycle=`cat current_cycle.txt`
44 | DATA_DIRS=""
45 | data_dirs_added=$(ls -d -1 "data/"**/ | shuf | tr "\n" ":")
46 | for i in `seq $train_cycle`
47 | do
48 | DATA_DIRS=${DATA_DIRS}${data_dirs_added}
49 | done
50 |
51 | export DATA_DIRS
52 | echo $DATA_DIRS
53 |
54 | # Add a cycle to keep track of which cycle we are on
55 | echo "$((train_cycle + 1))" > current_cycle.txt
56 | echo "${DATA_DIRS}" > current_shards.txt
57 |
58 | run_cmd="bash train_bart_args.sh"
59 |
60 | ls -ltrh
61 | pwd
62 |
63 | srun -l -o $LOGFILE singularity exec --nv -B $(pwd) pytorch_21.03_bart.sif ${run_cmd}
--------------------------------------------------------------------------------
/train_bart_args.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 | /bin/hostname -s
4 | which python
5 |
6 | # List all data folders in 'data/' and make a colon separated list of the data folders.
7 | # data/oscar.split0.docs.token/:data/oscar.split01.docs.token/:... etc
8 | # data_dirs=$(ls -d -1 "data/"**/ | tr "\n" ":")
9 | # shuffled_dirs = data_dirs=$(ls -d -1 "data/"**/ | shuf | tr "\n" ":")
10 |
11 | # Fairseq is stupid and thinks data ends when we stop listing new shards.
12 | # To fix we need to append the same folders again to data_dirs.
13 | # This way training can continue once all data has been cycled through.
14 | # NOTE: We cannot append multiple copies of the folders, e.g.:
15 | # This is OK after 1 cycle: data_dirs=${data_dirs}${data_dirs}
16 | # Big NONO BEFORE 1 cycle has finished: data_dirs=${data_dirs}${data_dirs}
17 |
18 | # checkpoint80 means we are starting on Cycle 3 (we have 40 shards per cycle).
19 | # We add another ${data_dirs} whenever we reach 40, 80, 120, 160, ...
20 | # data_dirs=${data_dirs}${data_dirs}${data_dirs}${data_dirs}${data_dirs}${data_dirs}${data_dirs}${data_dirs}${data_dirs}
21 |
22 |
23 | python -m torch.distributed.launch \
24 | --nproc_per_node=$NPROC_PER_NODE \
25 | --nnodes=$SLURM_JOB_NUM_NODES \
26 | --node_rank=$SLURM_NODEID \
27 | --master_addr=$MASTER_ADDR \
28 | --master_port=$MASTER_PORT \
29 | $(which fairseq-train) $DATA_DIRS \
30 | --train-subset train \
31 | --skip-invalid-size-inputs-valid-test \
32 | --ignore-unused-valid-subsets \
33 | --num-workers 2 \
34 | --memory-efficient-fp16 \
35 | --arch bart_base \
36 | --task denoising \
37 | --mask 0.3 `# Proportion to mask` \
38 | --mask-length span-poisson `# Mask a span of words, sampled with poisson distr lambda=3` \
39 | --replace-length 1 `# Replace spans of masked tokens with single token` \
40 | --permute-sentences 1.0 `# Paper states they permute all sentences` \
41 | --rotate 0.0 \
42 | --sample-break-mode complete `# complete sentences` \
43 | --shorten-method truncate \
44 | --tokens-per-sample 1024 \
45 | --max-source-positions 1024 \
46 | --max-target-positions 1024 \
47 | --optimizer adam --adam-betas "(0.9, 0.98)" \
48 | --adam-eps 1e-06 \
49 | --clip-norm 0.0 \
50 | --lr 0.00045 \
51 | --lr-scheduler polynomial_decay \
52 | --warmup-updates 10000 \
53 | --dropout 0.1 \
54 | --attention-dropout 0.1 \
55 | --weight-decay 0.01 \
56 |     --batch-size 8 `# global bsz = batch-size*update-freq*num_nodes*num_gpu_per_node` \
57 | --update-freq 4 `# gradient accumulation, batch size per gpu becomes batch-size*update-freq` \
58 | --total-num-update 500000 \
59 | --max-update 500000 \
60 | --save-interval 3 `# Save checkpoint and validate after every 3 epochs (epoch=dataset shard)` \
61 | --log-format json --log-interval 10
62 |
--------------------------------------------------------------------------------