├── .gitignore
├── ATTENTION_TRANSFORMERS.md
├── README.md
├── Text-Cleaning-With-Clustering.ipynb
├── data
├── fasthugs_language_model.ipynb
├── fasthugs_seq_classification.ipynb
├── images
│   └── roberta_pred.png
├── sentencepiece_model_pb2.py
├── speed_test_fastai_vs_hf_nlp_datasets.ipynb
└── splitters.py


/.gitignore:
--------------------------------------------------------------------------------
 1 | .ipynb_checkpoints/
 2 | 38_tutorial.text.ipynb
 3 | __pycache__/
 4 | esperberto-merges.txt
 5 | esperberto-vocab.json
 6 | fasthugs_T5.ipynb
 7 | fasthugs_bridge.py
 8 | fasthugs_language_model-Copy1.ipynb
 9 | gaelbert-merges.txt
10 | gaelbert-vocab.json
11 | models/
12 | gawiki/
13 | old_fasthugs_language-model.ipynb
14 |


--------------------------------------------------------------------------------
/ATTENTION_TRANSFORMERS.md:
--------------------------------------------------------------------------------
 1 | A few super-helpful resources to better understand Attention and Transformers, in recommended order:
 2 |
 3 | 0. Luis Serrano: [A Friendly Introduction to RNNs](https://www.youtube.com/watch?v=UNmqTiOnRfg)
 4 | 1. Jay Alammar: [Seq2Seq RNN model with Attention](https://jalammar.github.io/illustrated-transformer/)
 5 | 2. Jay Alammar: [The Illustrated Transformer](https://jalammar.github.io/illustrated-transformer/)
 6 | 3. Harvard NLP: [The Annotated Transformer](http://nlp.seas.harvard.edu/2018/04/03/attention.html)
 7 | 4. [@bentrevett's Seq2Seq explainer notebooks](https://github.com/bentrevett/pytorch-seq2seq)
 8 | 5. [Annotated GPT-2](https://amaarora.github.io/2020/02/18/annotatedGPT2.html)
 9 | 6. Andrew Peng: [Translation with transformer](https://andrewpeng.dev/transformer-pytorch/)
10 |


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # FastHugs
 2 | Use fastai v2 with HuggingFace's pretrained transformers. See the notebooks below, depending on your task:
 3 |
 4 | - Text classification: `fasthugs_seq_classification.ipynb`
 5 | - Language model pre-training or fine-tuning (RoBERTa only for now): `fasthugs_language_model.ipynb`
 6 |
 7 | ## What's New
 8 |
 9 | ### April 24, 2020
10 | - Added `fasthugs_language_model.ipynb`, which shows you how to pre-train from scratch or fine-tune a Masked Language Model (MLM), RoBERTa in this case
11 |
12 | ### April 17, 2020
13 | - Added new `get_vocab` functionality from HuggingFace, a unified API to extract a tokenizer's vocab
14 | - Added new `AutoModelForSequenceClassification` and `AutoConfig` HuggingFace functionality to make things tidier
15 | - Tidied up and refactored `FastHugsTokenizer` and `FastHugsModel`
16 | - OLD demo and vocab files to be deleted soon
17 |
18 | ## Things You Might Like (❤️ ?)
19 | **FastHugsTokenizer:** A tokenizer wrapper that can be used with fastai-v2’s tokenizer.
20 |
21 | **FastHugsModel:** A model wrapper over the HF models, more or less the same as the wrappers from the HF fastai-v1 articles mentioned below.
22 |
23 | **Padding:** Padding settings for the padding token index and for whether the transformer prefers left or right padding.
24 |
25 | **Model Splitters:** Functions to split the classification head from the model backbone, in line with fastai-v2’s new definition of `Learner` (`splitters`).
26 |
27 | ## Read these first 👇
28 | This notebook heavily borrows from this notebook, which in turn is based off of this tutorial and accompanying article.
Huge thanks to Melissa Rajaram and Maximilien Roberti for these great resources. If you're not familiar with the HuggingFace library, please give them a read first, as they are quite comprehensive.
29 |
30 | ## fastai-v2 ✌️2️⃣
31 | This paper introduces the v2 version of the fastai library, and you can follow and contribute to v2's progress on the forums. This notebook uses the small IMDB dataset and is based off the fastai-v2 ULMFiT tutorial. Huge thanks to Jeremy, Sylvain, Rachel and the fastai community for making this library what it is. I'm super excited about the additional flexibility v2 brings. 🎉
32 |


--------------------------------------------------------------------------------
/data:
--------------------------------------------------------------------------------
1 | /home/morgan/ml/data


--------------------------------------------------------------------------------
/images/roberta_pred.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/morganmcg1/fasthugs/7d0be360f88a1a75f6344089c1d291fe7ec7f553/images/roberta_pred.png


--------------------------------------------------------------------------------
/sentencepiece_model_pb2.py:
--------------------------------------------------------------------------------
1 | # Generated by the protocol buffer compiler. DO NOT EDIT!
2 | # source: sentencepiece_model.proto 3 | 4 | import sys 5 | _b=sys.version_info[0]<3 and (lambda x:x) or (lambda x:x.encode('latin1')) 6 | from google.protobuf import descriptor as _descriptor 7 | from google.protobuf import message as _message 8 | from google.protobuf import reflection as _reflection 9 | from google.protobuf import symbol_database as _symbol_database 10 | from google.protobuf import descriptor_pb2 11 | # @@protoc_insertion_point(imports) 12 | 13 | _sym_db = _symbol_database.Default() 14 | 15 | 16 | 17 | 18 | DESCRIPTOR = _descriptor.FileDescriptor( 19 | name='sentencepiece_model.proto', 20 | package='sentencepiece', 21 | syntax='proto2', 22 | serialized_pb=_b('\n\x19sentencepiece_model.proto\x12\rsentencepiece\"\xf4\x08\n\x0bTrainerSpec\x12\r\n\x05input\x18\x01 \x03(\t\x12\x14\n\x0cinput_format\x18\x07 \x01(\t\x12\x14\n\x0cmodel_prefix\x18\x02 \x01(\t\x12\x41\n\nmodel_type\x18\x03 \x01(\x0e\x32$.sentencepiece.TrainerSpec.ModelType:\x07UNIGRAM\x12\x18\n\nvocab_size\x18\x04 \x01(\x05:\x04\x38\x30\x30\x30\x12\x17\n\x0f\x61\x63\x63\x65pt_language\x18\x05 \x03(\t\x12 \n\x15self_test_sample_size\x18\x06 \x01(\x05:\x01\x30\x12\"\n\x12\x63haracter_coverage\x18\n \x01(\x02:\x06\x30.9995\x12\x1e\n\x13input_sentence_size\x18\x0b \x01(\x05:\x01\x30\x12$\n\x16shuffle_input_sentence\x18\x13 \x01(\x08:\x04true\x12 \n\x14mining_sentence_size\x18\x0c \x01(\x05\x42\x02\x18\x01\x12\"\n\x16training_sentence_size\x18\r \x01(\x05\x42\x02\x18\x01\x12(\n\x17seed_sentencepiece_size\x18\x0e \x01(\x05:\x07\x31\x30\x30\x30\x30\x30\x30\x12\x1e\n\x10shrinking_factor\x18\x0f \x01(\x02:\x04\x30.75\x12!\n\x13max_sentence_length\x18\x12 \x01(\x05:\x04\x34\x31\x39\x32\x12\x17\n\x0bnum_threads\x18\x10 \x01(\x05:\x02\x31\x36\x12\x1d\n\x12num_sub_iterations\x18\x11 \x01(\x05:\x01\x32\x12$\n\x18max_sentencepiece_length\x18\x14 \x01(\x05:\x02\x31\x36\x12%\n\x17split_by_unicode_script\x18\x15 \x01(\x08:\x04true\x12\x1d\n\x0fsplit_by_number\x18\x17 
\x01(\x08:\x04true\x12!\n\x13split_by_whitespace\x18\x16 \x01(\x08:\x04true\x12)\n\x1atreat_whitespace_as_suffix\x18\x18 \x01(\x08:\x05\x66\x61lse\x12\x17\n\x0f\x63ontrol_symbols\x18\x1e \x03(\t\x12\x1c\n\x14user_defined_symbols\x18\x1f \x03(\t\x12\x1e\n\x10hard_vocab_limit\x18! \x01(\x08:\x04true\x12\x1c\n\ruse_all_vocab\x18\" \x01(\x08:\x05\x66\x61lse\x12\x11\n\x06unk_id\x18( \x01(\x05:\x01\x30\x12\x11\n\x06\x62os_id\x18) \x01(\x05:\x01\x31\x12\x11\n\x06\x65os_id\x18* \x01(\x05:\x01\x32\x12\x12\n\x06pad_id\x18+ \x01(\x05:\x02-1\x12\x18\n\tunk_piece\x18- \x01(\t:\x05\x12\x16\n\tbos_piece\x18. \x01(\t:\x03\x12\x17\n\teos_piece\x18/ \x01(\t:\x04\x12\x18\n\tpad_piece\x18\x30 \x01(\t:\x05\x12\x1a\n\x0bunk_surface\x18, \x01(\t:\x05 \xe2\x81\x87 \"5\n\tModelType\x12\x0b\n\x07UNIGRAM\x10\x01\x12\x07\n\x03\x42PE\x10\x02\x12\x08\n\x04WORD\x10\x03\x12\x08\n\x04\x43HAR\x10\x04*\t\x08\xc8\x01\x10\x80\x80\x80\x80\x02\"\xd1\x01\n\x0eNormalizerSpec\x12\x0c\n\x04name\x18\x01 \x01(\t\x12\x1c\n\x14precompiled_charsmap\x18\x02 \x01(\x0c\x12\x1e\n\x10\x61\x64\x64_dummy_prefix\x18\x03 \x01(\x08:\x04true\x12&\n\x18remove_extra_whitespaces\x18\x04 \x01(\x08:\x04true\x12 \n\x12\x65scape_whitespaces\x18\x05 \x01(\x08:\x04true\x12\x1e\n\x16normalization_rule_tsv\x18\x06 \x01(\t*\t\x08\xc8\x01\x10\x80\x80\x80\x80\x02\"y\n\x0cSelfTestData\x12\x33\n\x07samples\x18\x01 \x03(\x0b\x32\".sentencepiece.SelfTestData.Sample\x1a)\n\x06Sample\x12\r\n\x05input\x18\x01 \x01(\t\x12\x10\n\x08\x65xpected\x18\x02 \x01(\t*\t\x08\xc8\x01\x10\x80\x80\x80\x80\x02\"\xba\x03\n\nModelProto\x12\x37\n\x06pieces\x18\x01 \x03(\x0b\x32\'.sentencepiece.ModelProto.SentencePiece\x12\x30\n\x0ctrainer_spec\x18\x02 \x01(\x0b\x32\x1a.sentencepiece.TrainerSpec\x12\x36\n\x0fnormalizer_spec\x18\x03 \x01(\x0b\x32\x1d.sentencepiece.NormalizerSpec\x12\x33\n\x0eself_test_data\x18\x04 \x01(\x0b\x32\x1b.sentencepiece.SelfTestData\x1a\xc8\x01\n\rSentencePiece\x12\r\n\x05piece\x18\x01 \x01(\t\x12\r\n\x05score\x18\x02 
\x01(\x02\x12\x42\n\x04type\x18\x03 \x01(\x0e\x32,.sentencepiece.ModelProto.SentencePiece.Type:\x06NORMAL\"J\n\x04Type\x12\n\n\x06NORMAL\x10\x01\x12\x0b\n\x07UNKNOWN\x10\x02\x12\x0b\n\x07\x43ONTROL\x10\x03\x12\x10\n\x0cUSER_DEFINED\x10\x04\x12\n\n\x06UNUSED\x10\x05*\t\x08\xc8\x01\x10\x80\x80\x80\x80\x02*\t\x08\xc8\x01\x10\x80\x80\x80\x80\x02\x42\x02H\x03') 23 | ) 24 | _sym_db.RegisterFileDescriptor(DESCRIPTOR) 25 | 26 | 27 | 28 | _TRAINERSPEC_MODELTYPE = _descriptor.EnumDescriptor( 29 | name='ModelType', 30 | full_name='sentencepiece.TrainerSpec.ModelType', 31 | filename=None, 32 | file=DESCRIPTOR, 33 | values=[ 34 | _descriptor.EnumValueDescriptor( 35 | name='UNIGRAM', index=0, number=1, 36 | options=None, 37 | type=None), 38 | _descriptor.EnumValueDescriptor( 39 | name='BPE', index=1, number=2, 40 | options=None, 41 | type=None), 42 | _descriptor.EnumValueDescriptor( 43 | name='WORD', index=2, number=3, 44 | options=None, 45 | type=None), 46 | _descriptor.EnumValueDescriptor( 47 | name='CHAR', index=3, number=4, 48 | options=None, 49 | type=None), 50 | ], 51 | containing_type=None, 52 | options=None, 53 | serialized_start=1121, 54 | serialized_end=1174, 55 | ) 56 | _sym_db.RegisterEnumDescriptor(_TRAINERSPEC_MODELTYPE) 57 | 58 | _MODELPROTO_SENTENCEPIECE_TYPE = _descriptor.EnumDescriptor( 59 | name='Type', 60 | full_name='sentencepiece.ModelProto.SentencePiece.Type', 61 | filename=None, 62 | file=DESCRIPTOR, 63 | values=[ 64 | _descriptor.EnumValueDescriptor( 65 | name='NORMAL', index=0, number=1, 66 | options=None, 67 | type=None), 68 | _descriptor.EnumValueDescriptor( 69 | name='UNKNOWN', index=1, number=2, 70 | options=None, 71 | type=None), 72 | _descriptor.EnumValueDescriptor( 73 | name='CONTROL', index=2, number=3, 74 | options=None, 75 | type=None), 76 | _descriptor.EnumValueDescriptor( 77 | name='USER_DEFINED', index=3, number=4, 78 | options=None, 79 | type=None), 80 | _descriptor.EnumValueDescriptor( 81 | name='UNUSED', index=4, number=5, 82 | 
options=None, 83 | type=None), 84 | ], 85 | containing_type=None, 86 | options=None, 87 | serialized_start=1869, 88 | serialized_end=1943, 89 | ) 90 | _sym_db.RegisterEnumDescriptor(_MODELPROTO_SENTENCEPIECE_TYPE) 91 | 92 | 93 | _TRAINERSPEC = _descriptor.Descriptor( 94 | name='TrainerSpec', 95 | full_name='sentencepiece.TrainerSpec', 96 | filename=None, 97 | file=DESCRIPTOR, 98 | containing_type=None, 99 | fields=[ 100 | _descriptor.FieldDescriptor( 101 | name='input', full_name='sentencepiece.TrainerSpec.input', index=0, 102 | number=1, type=9, cpp_type=9, label=3, 103 | has_default_value=False, default_value=[], 104 | message_type=None, enum_type=None, containing_type=None, 105 | is_extension=False, extension_scope=None, 106 | options=None), 107 | _descriptor.FieldDescriptor( 108 | name='input_format', full_name='sentencepiece.TrainerSpec.input_format', index=1, 109 | number=7, type=9, cpp_type=9, label=1, 110 | has_default_value=False, default_value=_b("").decode('utf-8'), 111 | message_type=None, enum_type=None, containing_type=None, 112 | is_extension=False, extension_scope=None, 113 | options=None), 114 | _descriptor.FieldDescriptor( 115 | name='model_prefix', full_name='sentencepiece.TrainerSpec.model_prefix', index=2, 116 | number=2, type=9, cpp_type=9, label=1, 117 | has_default_value=False, default_value=_b("").decode('utf-8'), 118 | message_type=None, enum_type=None, containing_type=None, 119 | is_extension=False, extension_scope=None, 120 | options=None), 121 | _descriptor.FieldDescriptor( 122 | name='model_type', full_name='sentencepiece.TrainerSpec.model_type', index=3, 123 | number=3, type=14, cpp_type=8, label=1, 124 | has_default_value=True, default_value=1, 125 | message_type=None, enum_type=None, containing_type=None, 126 | is_extension=False, extension_scope=None, 127 | options=None), 128 | _descriptor.FieldDescriptor( 129 | name='vocab_size', full_name='sentencepiece.TrainerSpec.vocab_size', index=4, 130 | number=4, type=5, cpp_type=1, 
label=1, 131 | has_default_value=True, default_value=8000, 132 | message_type=None, enum_type=None, containing_type=None, 133 | is_extension=False, extension_scope=None, 134 | options=None), 135 | _descriptor.FieldDescriptor( 136 | name='accept_language', full_name='sentencepiece.TrainerSpec.accept_language', index=5, 137 | number=5, type=9, cpp_type=9, label=3, 138 | has_default_value=False, default_value=[], 139 | message_type=None, enum_type=None, containing_type=None, 140 | is_extension=False, extension_scope=None, 141 | options=None), 142 | _descriptor.FieldDescriptor( 143 | name='self_test_sample_size', full_name='sentencepiece.TrainerSpec.self_test_sample_size', index=6, 144 | number=6, type=5, cpp_type=1, label=1, 145 | has_default_value=True, default_value=0, 146 | message_type=None, enum_type=None, containing_type=None, 147 | is_extension=False, extension_scope=None, 148 | options=None), 149 | _descriptor.FieldDescriptor( 150 | name='character_coverage', full_name='sentencepiece.TrainerSpec.character_coverage', index=7, 151 | number=10, type=2, cpp_type=6, label=1, 152 | has_default_value=True, default_value=float(0.9995), 153 | message_type=None, enum_type=None, containing_type=None, 154 | is_extension=False, extension_scope=None, 155 | options=None), 156 | _descriptor.FieldDescriptor( 157 | name='input_sentence_size', full_name='sentencepiece.TrainerSpec.input_sentence_size', index=8, 158 | number=11, type=5, cpp_type=1, label=1, 159 | has_default_value=True, default_value=0, 160 | message_type=None, enum_type=None, containing_type=None, 161 | is_extension=False, extension_scope=None, 162 | options=None), 163 | _descriptor.FieldDescriptor( 164 | name='shuffle_input_sentence', full_name='sentencepiece.TrainerSpec.shuffle_input_sentence', index=9, 165 | number=19, type=8, cpp_type=7, label=1, 166 | has_default_value=True, default_value=True, 167 | message_type=None, enum_type=None, containing_type=None, 168 | is_extension=False, extension_scope=None, 169 
| options=None), 170 | _descriptor.FieldDescriptor( 171 | name='mining_sentence_size', full_name='sentencepiece.TrainerSpec.mining_sentence_size', index=10, 172 | number=12, type=5, cpp_type=1, label=1, 173 | has_default_value=False, default_value=0, 174 | message_type=None, enum_type=None, containing_type=None, 175 | is_extension=False, extension_scope=None, 176 | options=_descriptor._ParseOptions(descriptor_pb2.FieldOptions(), _b('\030\001'))), 177 | _descriptor.FieldDescriptor( 178 | name='training_sentence_size', full_name='sentencepiece.TrainerSpec.training_sentence_size', index=11, 179 | number=13, type=5, cpp_type=1, label=1, 180 | has_default_value=False, default_value=0, 181 | message_type=None, enum_type=None, containing_type=None, 182 | is_extension=False, extension_scope=None, 183 | options=_descriptor._ParseOptions(descriptor_pb2.FieldOptions(), _b('\030\001'))), 184 | _descriptor.FieldDescriptor( 185 | name='seed_sentencepiece_size', full_name='sentencepiece.TrainerSpec.seed_sentencepiece_size', index=12, 186 | number=14, type=5, cpp_type=1, label=1, 187 | has_default_value=True, default_value=1000000, 188 | message_type=None, enum_type=None, containing_type=None, 189 | is_extension=False, extension_scope=None, 190 | options=None), 191 | _descriptor.FieldDescriptor( 192 | name='shrinking_factor', full_name='sentencepiece.TrainerSpec.shrinking_factor', index=13, 193 | number=15, type=2, cpp_type=6, label=1, 194 | has_default_value=True, default_value=float(0.75), 195 | message_type=None, enum_type=None, containing_type=None, 196 | is_extension=False, extension_scope=None, 197 | options=None), 198 | _descriptor.FieldDescriptor( 199 | name='max_sentence_length', full_name='sentencepiece.TrainerSpec.max_sentence_length', index=14, 200 | number=18, type=5, cpp_type=1, label=1, 201 | has_default_value=True, default_value=4192, 202 | message_type=None, enum_type=None, containing_type=None, 203 | is_extension=False, extension_scope=None, 204 | options=None), 
205 | _descriptor.FieldDescriptor( 206 | name='num_threads', full_name='sentencepiece.TrainerSpec.num_threads', index=15, 207 | number=16, type=5, cpp_type=1, label=1, 208 | has_default_value=True, default_value=16, 209 | message_type=None, enum_type=None, containing_type=None, 210 | is_extension=False, extension_scope=None, 211 | options=None), 212 | _descriptor.FieldDescriptor( 213 | name='num_sub_iterations', full_name='sentencepiece.TrainerSpec.num_sub_iterations', index=16, 214 | number=17, type=5, cpp_type=1, label=1, 215 | has_default_value=True, default_value=2, 216 | message_type=None, enum_type=None, containing_type=None, 217 | is_extension=False, extension_scope=None, 218 | options=None), 219 | _descriptor.FieldDescriptor( 220 | name='max_sentencepiece_length', full_name='sentencepiece.TrainerSpec.max_sentencepiece_length', index=17, 221 | number=20, type=5, cpp_type=1, label=1, 222 | has_default_value=True, default_value=16, 223 | message_type=None, enum_type=None, containing_type=None, 224 | is_extension=False, extension_scope=None, 225 | options=None), 226 | _descriptor.FieldDescriptor( 227 | name='split_by_unicode_script', full_name='sentencepiece.TrainerSpec.split_by_unicode_script', index=18, 228 | number=21, type=8, cpp_type=7, label=1, 229 | has_default_value=True, default_value=True, 230 | message_type=None, enum_type=None, containing_type=None, 231 | is_extension=False, extension_scope=None, 232 | options=None), 233 | _descriptor.FieldDescriptor( 234 | name='split_by_number', full_name='sentencepiece.TrainerSpec.split_by_number', index=19, 235 | number=23, type=8, cpp_type=7, label=1, 236 | has_default_value=True, default_value=True, 237 | message_type=None, enum_type=None, containing_type=None, 238 | is_extension=False, extension_scope=None, 239 | options=None), 240 | _descriptor.FieldDescriptor( 241 | name='split_by_whitespace', full_name='sentencepiece.TrainerSpec.split_by_whitespace', index=20, 242 | number=22, type=8, cpp_type=7, label=1, 
243 | has_default_value=True, default_value=True, 244 | message_type=None, enum_type=None, containing_type=None, 245 | is_extension=False, extension_scope=None, 246 | options=None), 247 | _descriptor.FieldDescriptor( 248 | name='treat_whitespace_as_suffix', full_name='sentencepiece.TrainerSpec.treat_whitespace_as_suffix', index=21, 249 | number=24, type=8, cpp_type=7, label=1, 250 | has_default_value=True, default_value=False, 251 | message_type=None, enum_type=None, containing_type=None, 252 | is_extension=False, extension_scope=None, 253 | options=None), 254 | _descriptor.FieldDescriptor( 255 | name='control_symbols', full_name='sentencepiece.TrainerSpec.control_symbols', index=22, 256 | number=30, type=9, cpp_type=9, label=3, 257 | has_default_value=False, default_value=[], 258 | message_type=None, enum_type=None, containing_type=None, 259 | is_extension=False, extension_scope=None, 260 | options=None), 261 | _descriptor.FieldDescriptor( 262 | name='user_defined_symbols', full_name='sentencepiece.TrainerSpec.user_defined_symbols', index=23, 263 | number=31, type=9, cpp_type=9, label=3, 264 | has_default_value=False, default_value=[], 265 | message_type=None, enum_type=None, containing_type=None, 266 | is_extension=False, extension_scope=None, 267 | options=None), 268 | _descriptor.FieldDescriptor( 269 | name='hard_vocab_limit', full_name='sentencepiece.TrainerSpec.hard_vocab_limit', index=24, 270 | number=33, type=8, cpp_type=7, label=1, 271 | has_default_value=True, default_value=True, 272 | message_type=None, enum_type=None, containing_type=None, 273 | is_extension=False, extension_scope=None, 274 | options=None), 275 | _descriptor.FieldDescriptor( 276 | name='use_all_vocab', full_name='sentencepiece.TrainerSpec.use_all_vocab', index=25, 277 | number=34, type=8, cpp_type=7, label=1, 278 | has_default_value=True, default_value=False, 279 | message_type=None, enum_type=None, containing_type=None, 280 | is_extension=False, extension_scope=None, 281 | 
options=None), 282 | _descriptor.FieldDescriptor( 283 | name='unk_id', full_name='sentencepiece.TrainerSpec.unk_id', index=26, 284 | number=40, type=5, cpp_type=1, label=1, 285 | has_default_value=True, default_value=0, 286 | message_type=None, enum_type=None, containing_type=None, 287 | is_extension=False, extension_scope=None, 288 | options=None), 289 | _descriptor.FieldDescriptor( 290 | name='bos_id', full_name='sentencepiece.TrainerSpec.bos_id', index=27, 291 | number=41, type=5, cpp_type=1, label=1, 292 | has_default_value=True, default_value=1, 293 | message_type=None, enum_type=None, containing_type=None, 294 | is_extension=False, extension_scope=None, 295 | options=None), 296 | _descriptor.FieldDescriptor( 297 | name='eos_id', full_name='sentencepiece.TrainerSpec.eos_id', index=28, 298 | number=42, type=5, cpp_type=1, label=1, 299 | has_default_value=True, default_value=2, 300 | message_type=None, enum_type=None, containing_type=None, 301 | is_extension=False, extension_scope=None, 302 | options=None), 303 | _descriptor.FieldDescriptor( 304 | name='pad_id', full_name='sentencepiece.TrainerSpec.pad_id', index=29, 305 | number=43, type=5, cpp_type=1, label=1, 306 | has_default_value=True, default_value=-1, 307 | message_type=None, enum_type=None, containing_type=None, 308 | is_extension=False, extension_scope=None, 309 | options=None), 310 | _descriptor.FieldDescriptor( 311 | name='unk_piece', full_name='sentencepiece.TrainerSpec.unk_piece', index=30, 312 | number=45, type=9, cpp_type=9, label=1, 313 | has_default_value=True, default_value=_b("").decode('utf-8'), 314 | message_type=None, enum_type=None, containing_type=None, 315 | is_extension=False, extension_scope=None, 316 | options=None), 317 | _descriptor.FieldDescriptor( 318 | name='bos_piece', full_name='sentencepiece.TrainerSpec.bos_piece', index=31, 319 | number=46, type=9, cpp_type=9, label=1, 320 | has_default_value=True, default_value=_b("").decode('utf-8'), 321 | message_type=None, 
enum_type=None, containing_type=None, 322 | is_extension=False, extension_scope=None, 323 | options=None), 324 | _descriptor.FieldDescriptor( 325 | name='eos_piece', full_name='sentencepiece.TrainerSpec.eos_piece', index=32, 326 | number=47, type=9, cpp_type=9, label=1, 327 | has_default_value=True, default_value=_b("").decode('utf-8'), 328 | message_type=None, enum_type=None, containing_type=None, 329 | is_extension=False, extension_scope=None, 330 | options=None), 331 | _descriptor.FieldDescriptor( 332 | name='pad_piece', full_name='sentencepiece.TrainerSpec.pad_piece', index=33, 333 | number=48, type=9, cpp_type=9, label=1, 334 | has_default_value=True, default_value=_b("").decode('utf-8'), 335 | message_type=None, enum_type=None, containing_type=None, 336 | is_extension=False, extension_scope=None, 337 | options=None), 338 | _descriptor.FieldDescriptor( 339 | name='unk_surface', full_name='sentencepiece.TrainerSpec.unk_surface', index=34, 340 | number=44, type=9, cpp_type=9, label=1, 341 | has_default_value=True, default_value=_b(" \342\201\207 ").decode('utf-8'), 342 | message_type=None, enum_type=None, containing_type=None, 343 | is_extension=False, extension_scope=None, 344 | options=None), 345 | ], 346 | extensions=[ 347 | ], 348 | nested_types=[], 349 | enum_types=[ 350 | _TRAINERSPEC_MODELTYPE, 351 | ], 352 | options=None, 353 | is_extendable=True, 354 | syntax='proto2', 355 | extension_ranges=[(200, 536870912), ], 356 | oneofs=[ 357 | ], 358 | serialized_start=45, 359 | serialized_end=1185, 360 | ) 361 | 362 | 363 | _NORMALIZERSPEC = _descriptor.Descriptor( 364 | name='NormalizerSpec', 365 | full_name='sentencepiece.NormalizerSpec', 366 | filename=None, 367 | file=DESCRIPTOR, 368 | containing_type=None, 369 | fields=[ 370 | _descriptor.FieldDescriptor( 371 | name='name', full_name='sentencepiece.NormalizerSpec.name', index=0, 372 | number=1, type=9, cpp_type=9, label=1, 373 | has_default_value=False, default_value=_b("").decode('utf-8'), 374 | 
message_type=None, enum_type=None, containing_type=None, 375 | is_extension=False, extension_scope=None, 376 | options=None), 377 | _descriptor.FieldDescriptor( 378 | name='precompiled_charsmap', full_name='sentencepiece.NormalizerSpec.precompiled_charsmap', index=1, 379 | number=2, type=12, cpp_type=9, label=1, 380 | has_default_value=False, default_value=_b(""), 381 | message_type=None, enum_type=None, containing_type=None, 382 | is_extension=False, extension_scope=None, 383 | options=None), 384 | _descriptor.FieldDescriptor( 385 | name='add_dummy_prefix', full_name='sentencepiece.NormalizerSpec.add_dummy_prefix', index=2, 386 | number=3, type=8, cpp_type=7, label=1, 387 | has_default_value=True, default_value=True, 388 | message_type=None, enum_type=None, containing_type=None, 389 | is_extension=False, extension_scope=None, 390 | options=None), 391 | _descriptor.FieldDescriptor( 392 | name='remove_extra_whitespaces', full_name='sentencepiece.NormalizerSpec.remove_extra_whitespaces', index=3, 393 | number=4, type=8, cpp_type=7, label=1, 394 | has_default_value=True, default_value=True, 395 | message_type=None, enum_type=None, containing_type=None, 396 | is_extension=False, extension_scope=None, 397 | options=None), 398 | _descriptor.FieldDescriptor( 399 | name='escape_whitespaces', full_name='sentencepiece.NormalizerSpec.escape_whitespaces', index=4, 400 | number=5, type=8, cpp_type=7, label=1, 401 | has_default_value=True, default_value=True, 402 | message_type=None, enum_type=None, containing_type=None, 403 | is_extension=False, extension_scope=None, 404 | options=None), 405 | _descriptor.FieldDescriptor( 406 | name='normalization_rule_tsv', full_name='sentencepiece.NormalizerSpec.normalization_rule_tsv', index=5, 407 | number=6, type=9, cpp_type=9, label=1, 408 | has_default_value=False, default_value=_b("").decode('utf-8'), 409 | message_type=None, enum_type=None, containing_type=None, 410 | is_extension=False, extension_scope=None, 411 | options=None), 412 | 
], 413 | extensions=[ 414 | ], 415 | nested_types=[], 416 | enum_types=[ 417 | ], 418 | options=None, 419 | is_extendable=True, 420 | syntax='proto2', 421 | extension_ranges=[(200, 536870912), ], 422 | oneofs=[ 423 | ], 424 | serialized_start=1188, 425 | serialized_end=1397, 426 | ) 427 | 428 | 429 | _SELFTESTDATA_SAMPLE = _descriptor.Descriptor( 430 | name='Sample', 431 | full_name='sentencepiece.SelfTestData.Sample', 432 | filename=None, 433 | file=DESCRIPTOR, 434 | containing_type=None, 435 | fields=[ 436 | _descriptor.FieldDescriptor( 437 | name='input', full_name='sentencepiece.SelfTestData.Sample.input', index=0, 438 | number=1, type=9, cpp_type=9, label=1, 439 | has_default_value=False, default_value=_b("").decode('utf-8'), 440 | message_type=None, enum_type=None, containing_type=None, 441 | is_extension=False, extension_scope=None, 442 | options=None), 443 | _descriptor.FieldDescriptor( 444 | name='expected', full_name='sentencepiece.SelfTestData.Sample.expected', index=1, 445 | number=2, type=9, cpp_type=9, label=1, 446 | has_default_value=False, default_value=_b("").decode('utf-8'), 447 | message_type=None, enum_type=None, containing_type=None, 448 | is_extension=False, extension_scope=None, 449 | options=None), 450 | ], 451 | extensions=[ 452 | ], 453 | nested_types=[], 454 | enum_types=[ 455 | ], 456 | options=None, 457 | is_extendable=False, 458 | syntax='proto2', 459 | extension_ranges=[], 460 | oneofs=[ 461 | ], 462 | serialized_start=1468, 463 | serialized_end=1509, 464 | ) 465 | 466 | _SELFTESTDATA = _descriptor.Descriptor( 467 | name='SelfTestData', 468 | full_name='sentencepiece.SelfTestData', 469 | filename=None, 470 | file=DESCRIPTOR, 471 | containing_type=None, 472 | fields=[ 473 | _descriptor.FieldDescriptor( 474 | name='samples', full_name='sentencepiece.SelfTestData.samples', index=0, 475 | number=1, type=11, cpp_type=10, label=3, 476 | has_default_value=False, default_value=[], 477 | message_type=None, enum_type=None, containing_type=None, 
478 | is_extension=False, extension_scope=None, 479 | options=None), 480 | ], 481 | extensions=[ 482 | ], 483 | nested_types=[_SELFTESTDATA_SAMPLE, ], 484 | enum_types=[ 485 | ], 486 | options=None, 487 | is_extendable=True, 488 | syntax='proto2', 489 | extension_ranges=[(200, 536870912), ], 490 | oneofs=[ 491 | ], 492 | serialized_start=1399, 493 | serialized_end=1520, 494 | ) 495 | 496 | 497 | _MODELPROTO_SENTENCEPIECE = _descriptor.Descriptor( 498 | name='SentencePiece', 499 | full_name='sentencepiece.ModelProto.SentencePiece', 500 | filename=None, 501 | file=DESCRIPTOR, 502 | containing_type=None, 503 | fields=[ 504 | _descriptor.FieldDescriptor( 505 | name='piece', full_name='sentencepiece.ModelProto.SentencePiece.piece', index=0, 506 | number=1, type=9, cpp_type=9, label=1, 507 | has_default_value=False, default_value=_b("").decode('utf-8'), 508 | message_type=None, enum_type=None, containing_type=None, 509 | is_extension=False, extension_scope=None, 510 | options=None), 511 | _descriptor.FieldDescriptor( 512 | name='score', full_name='sentencepiece.ModelProto.SentencePiece.score', index=1, 513 | number=2, type=2, cpp_type=6, label=1, 514 | has_default_value=False, default_value=float(0), 515 | message_type=None, enum_type=None, containing_type=None, 516 | is_extension=False, extension_scope=None, 517 | options=None), 518 | _descriptor.FieldDescriptor( 519 | name='type', full_name='sentencepiece.ModelProto.SentencePiece.type', index=2, 520 | number=3, type=14, cpp_type=8, label=1, 521 | has_default_value=True, default_value=1, 522 | message_type=None, enum_type=None, containing_type=None, 523 | is_extension=False, extension_scope=None, 524 | options=None), 525 | ], 526 | extensions=[ 527 | ], 528 | nested_types=[], 529 | enum_types=[ 530 | _MODELPROTO_SENTENCEPIECE_TYPE, 531 | ], 532 | options=None, 533 | is_extendable=True, 534 | syntax='proto2', 535 | extension_ranges=[(200, 536870912), ], 536 | oneofs=[ 537 | ], 538 | serialized_start=1754, 539 | 
serialized_end=1954, 540 | ) 541 | 542 | _MODELPROTO = _descriptor.Descriptor( 543 | name='ModelProto', 544 | full_name='sentencepiece.ModelProto', 545 | filename=None, 546 | file=DESCRIPTOR, 547 | containing_type=None, 548 | fields=[ 549 | _descriptor.FieldDescriptor( 550 | name='pieces', full_name='sentencepiece.ModelProto.pieces', index=0, 551 | number=1, type=11, cpp_type=10, label=3, 552 | has_default_value=False, default_value=[], 553 | message_type=None, enum_type=None, containing_type=None, 554 | is_extension=False, extension_scope=None, 555 | options=None), 556 | _descriptor.FieldDescriptor( 557 | name='trainer_spec', full_name='sentencepiece.ModelProto.trainer_spec', index=1, 558 | number=2, type=11, cpp_type=10, label=1, 559 | has_default_value=False, default_value=None, 560 | message_type=None, enum_type=None, containing_type=None, 561 | is_extension=False, extension_scope=None, 562 | options=None), 563 | _descriptor.FieldDescriptor( 564 | name='normalizer_spec', full_name='sentencepiece.ModelProto.normalizer_spec', index=2, 565 | number=3, type=11, cpp_type=10, label=1, 566 | has_default_value=False, default_value=None, 567 | message_type=None, enum_type=None, containing_type=None, 568 | is_extension=False, extension_scope=None, 569 | options=None), 570 | _descriptor.FieldDescriptor( 571 | name='self_test_data', full_name='sentencepiece.ModelProto.self_test_data', index=3, 572 | number=4, type=11, cpp_type=10, label=1, 573 | has_default_value=False, default_value=None, 574 | message_type=None, enum_type=None, containing_type=None, 575 | is_extension=False, extension_scope=None, 576 | options=None), 577 | ], 578 | extensions=[ 579 | ], 580 | nested_types=[_MODELPROTO_SENTENCEPIECE, ], 581 | enum_types=[ 582 | ], 583 | options=None, 584 | is_extendable=True, 585 | syntax='proto2', 586 | extension_ranges=[(200, 536870912), ], 587 | oneofs=[ 588 | ], 589 | serialized_start=1523, 590 | serialized_end=1965, 591 | ) 592 | 593 | 
_TRAINERSPEC.fields_by_name['model_type'].enum_type = _TRAINERSPEC_MODELTYPE 594 | _TRAINERSPEC_MODELTYPE.containing_type = _TRAINERSPEC 595 | _SELFTESTDATA_SAMPLE.containing_type = _SELFTESTDATA 596 | _SELFTESTDATA.fields_by_name['samples'].message_type = _SELFTESTDATA_SAMPLE 597 | _MODELPROTO_SENTENCEPIECE.fields_by_name['type'].enum_type = _MODELPROTO_SENTENCEPIECE_TYPE 598 | _MODELPROTO_SENTENCEPIECE.containing_type = _MODELPROTO 599 | _MODELPROTO_SENTENCEPIECE_TYPE.containing_type = _MODELPROTO_SENTENCEPIECE 600 | _MODELPROTO.fields_by_name['pieces'].message_type = _MODELPROTO_SENTENCEPIECE 601 | _MODELPROTO.fields_by_name['trainer_spec'].message_type = _TRAINERSPEC 602 | _MODELPROTO.fields_by_name['normalizer_spec'].message_type = _NORMALIZERSPEC 603 | _MODELPROTO.fields_by_name['self_test_data'].message_type = _SELFTESTDATA 604 | DESCRIPTOR.message_types_by_name['TrainerSpec'] = _TRAINERSPEC 605 | DESCRIPTOR.message_types_by_name['NormalizerSpec'] = _NORMALIZERSPEC 606 | DESCRIPTOR.message_types_by_name['SelfTestData'] = _SELFTESTDATA 607 | DESCRIPTOR.message_types_by_name['ModelProto'] = _MODELPROTO 608 | 609 | TrainerSpec = _reflection.GeneratedProtocolMessageType('TrainerSpec', (_message.Message,), dict( 610 | DESCRIPTOR = _TRAINERSPEC, 611 | __module__ = 'sentencepiece_model_pb2' 612 | # @@protoc_insertion_point(class_scope:sentencepiece.TrainerSpec) 613 | )) 614 | _sym_db.RegisterMessage(TrainerSpec) 615 | 616 | NormalizerSpec = _reflection.GeneratedProtocolMessageType('NormalizerSpec', (_message.Message,), dict( 617 | DESCRIPTOR = _NORMALIZERSPEC, 618 | __module__ = 'sentencepiece_model_pb2' 619 | # @@protoc_insertion_point(class_scope:sentencepiece.NormalizerSpec) 620 | )) 621 | _sym_db.RegisterMessage(NormalizerSpec) 622 | 623 | SelfTestData = _reflection.GeneratedProtocolMessageType('SelfTestData', (_message.Message,), dict( 624 | 625 | Sample = _reflection.GeneratedProtocolMessageType('Sample', (_message.Message,), dict( 626 | DESCRIPTOR = 
_SELFTESTDATA_SAMPLE, 627 | __module__ = 'sentencepiece_model_pb2' 628 | # @@protoc_insertion_point(class_scope:sentencepiece.SelfTestData.Sample) 629 | )) 630 | , 631 | DESCRIPTOR = _SELFTESTDATA, 632 | __module__ = 'sentencepiece_model_pb2' 633 | # @@protoc_insertion_point(class_scope:sentencepiece.SelfTestData) 634 | )) 635 | _sym_db.RegisterMessage(SelfTestData) 636 | _sym_db.RegisterMessage(SelfTestData.Sample) 637 | 638 | ModelProto = _reflection.GeneratedProtocolMessageType('ModelProto', (_message.Message,), dict( 639 | 640 | SentencePiece = _reflection.GeneratedProtocolMessageType('SentencePiece', (_message.Message,), dict( 641 | DESCRIPTOR = _MODELPROTO_SENTENCEPIECE, 642 | __module__ = 'sentencepiece_model_pb2' 643 | # @@protoc_insertion_point(class_scope:sentencepiece.ModelProto.SentencePiece) 644 | )) 645 | , 646 | DESCRIPTOR = _MODELPROTO, 647 | __module__ = 'sentencepiece_model_pb2' 648 | # @@protoc_insertion_point(class_scope:sentencepiece.ModelProto) 649 | )) 650 | _sym_db.RegisterMessage(ModelProto) 651 | _sym_db.RegisterMessage(ModelProto.SentencePiece) 652 | 653 | 654 | DESCRIPTOR.has_options = True 655 | DESCRIPTOR._options = _descriptor._ParseOptions(descriptor_pb2.FileOptions(), _b('H\003')) 656 | _TRAINERSPEC.fields_by_name['mining_sentence_size'].has_options = True 657 | _TRAINERSPEC.fields_by_name['mining_sentence_size']._options = _descriptor._ParseOptions(descriptor_pb2.FieldOptions(), _b('\030\001')) 658 | _TRAINERSPEC.fields_by_name['training_sentence_size'].has_options = True 659 | _TRAINERSPEC.fields_by_name['training_sentence_size']._options = _descriptor._ParseOptions(descriptor_pb2.FieldOptions(), _b('\030\001')) 660 | # @@protoc_insertion_point(module_scope) 661 | -------------------------------------------------------------------------------- /speed_test_fastai_vs_hf_nlp_datasets.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | 
"metadata": {}, 6 | "source": [ 7 | "# \"Speed: Fastai vs HuggingFace nlp Datasets\"\n", 8 | "\n", 9 | "> Speed-test: Fastai's `TextDataloders` vs HuggingFace's `nlp` Datasets\n", 10 | "\n", 11 | "- badges: true\n", 12 | "- categories: [nlp, fastai, dataloader]\n", 13 | "- image: images/bokeh_mini.png" 14 | ] 15 | }, 16 | { 17 | "cell_type": "markdown", 18 | "metadata": {}, 19 | "source": [ 20 | "## tl;dr\n", 21 | "- Fastai's `Textdataloader` is well optimised and appears to come in faster on `nlp` Datasets for a dataset of 1.6M\n", 22 | "\n", 23 | "## Speed\n", 24 | "I started playing around with HuggingFace's [`nlp` Datasets library](https://huggingface.co/nlp) recently and was blown away by the speed at which you can iterate through the data, thanks to some PyArrow wizardry its seriously fast!\n", 25 | "\n", 26 | "> twitter: https://twitter.com/Thom_Wolf/status/1272512974935203841\n", 27 | "\n", 28 | "So I wondered if there was a significant speed up to be gained by doing as much text processing as I could with this library as opposed to Fastai's default text processing. (Also, after previously discovering Fastai's functionality to do [faster text loading](https://forums.fast.ai/t/nlp-speed-up-if-using-sorteddl/74636) I was in the mood for further speed-ups! 💨)\n", 29 | "\n", 30 | "\n", 31 | "## `nlp` Datasets\n", 32 | "\n", 33 | "The [`nlp` Datasets library](https://huggingface.co/nlp) is incredibly memory efficient:\n", 34 | "\n", 35 | "> It provides a very efficient way to load and process data from raw files (CSV/JSON/text) or in-memory data (python dict, pandas dataframe) with a special focus on memory efficency and speed. 
As a matter of example, **loading a 18GB dataset like English Wikipedia allocate 9 MB in RAM** and you can iterate over the dataset **at 1-2 GBit/s** in python.\n", 36 | "\n", 37 | "It also hosts 130+ common nlp research datasets AND (thanks to [this pointer from Thomas Wolf](https://discuss.huggingface.co/t/nlp-0-3-0-is-out/50/3) on the new HuggingFace forums) I learned that you can also easily load your own CSVs (or JSONs, pandas dataframes) and bask in all of that speedy goodness, for example:\n", 38 | "\n", 39 | "```\n", 40 | "from nlp import load_dataset\n", 41 | "dataset = load_dataset('csv', data_files='my_file.csv')\n", 42 | "```\n", 43 | "\n", 44 | "By the way, if you're curious to learn more about PyArrow then I highly recommend Dejan Simic's [post about it](https://towardsdatascience.com/apache-arrow-read-dataframe-with-zero-memory-69634092b1a)\n", 45 | "\n", 46 | "> twitter: https://twitter.com/simicdds/status/1276521257304023046\n", 47 | "\n", 48 | "> Note: If you love the sound of laptop fans spinning like Sonic the Hedgehog 🦔, red-hot battery packs 🔥 and the adrenaline 😰 of living on the edge of pandas' capabilities as you explore, plot and manipulate your giant text datasets, then the `nlp` Datasets library probably isn't for you. \n", 49 | "\n", 50 | "Otherwise, whether or not you use it for your final DL pipeline, `nlp` Datasets is definitely worth using just for the sheer speed at which it can apply functions to GBs of data.\n", 51 | "\n", 52 | "So, is it faster? Let's see! 
\n", 53 | "\n", 54 | "## The Setup\n", 55 | "\n", 56 | "To find out we'll be comparing Fastai's high-level `TextDataLoaders` class to a custom data processing pipeline using HuggingFace's `nlp` Datasets library.\n", 57 | "\n", 58 | "#### Fastai\n", 59 | "This Fastai class does a bunch of different things, all by calling just 1 line of code, including:\n", 60 | "- Pre and Post Processing\n", 61 | "- Tokenization with Spacy's tokenizer, including creating a vocabulary and **parallelising** the tokenization\n", 62 | "- Speed optimizations including sorting data by text sample length and padding only to the longest item in the batch, [similar to what was described here](https://towardsdatascience.com/divide-hugging-face-transformers-training-time-by-2-or-more-21bf7129db9q-21bf7129db9e)\n", 63 | "- Creating the train and validation dataloaders and putting them onto the GPU\n", 64 | "\n", 65 | "#### HuggingFace `nlp` Datasets\n", 66 | "The `nlp` Datasets pipeline I wrote tries to replicate all of the core functionality of `TextDataLoaders` as best I could. \n", 67 | "\n", 68 | "> Note: I couldn't figure out how to parallelise the text processing with `nlp`, although this is probably down to my lack of experience with parallelism as opposed to a limitation of the library\n", 69 | "\n", 70 | "## Sentiment Dataset\n", 71 | "For this experiment I used the [Sentiment140](https://huggingface.co/datasets/sentiment140) dataset, a sentiment classification dataset of Twitter data. \n", 72 | "\n", 73 | "For our experiment we'll use\n", 74 | "- 10% of the sentiment dataset (160,000 tweets, 11.8M space-separated tokens), pulled from the `nlp` library\n", 75 | "- 80/20 train/val split\n", 76 | "\n", 77 | "## Experiment Settings\n", 78 | "\n", 79 | "The timings will consist of 2 elements, an \"init\" and \"1 epoch\". The former will cover the entire process from loading the data (already downloaded) to creating the dataloaders. 
The second element will simply consist of iterating through the entire training dataset once.\n", 80 | "\n", 81 | "### init Details\n", 82 | "\"init\" consists of:\n", 83 | "\n", 84 | "0. Reading the data from disk, from a csv for fastai and from a PyArrow file for `nlp`\n", 85 | "\n", 86 | "1. Applying [fastai's default text pre-processing functions](http://dev.fast.ai/text.core#Preprocessing-rules). These will:\n", 87 | "\n", 88 | "\n", 89 | " Fix various messy bits of html sometimes seen in documents\n", 90 | " Replace repetitions at the character level, e.g. `cccc` becomes: `TK_REP 4 c`\n", 91 | " Replace word repetitions, e.g. `cow cow cow cow` becomes: `TK_WREP 4 cow`\n", 92 | " Add spaces around / and #\n", 93 | " Remove multiple spaces \n", 94 | " Replace tokens in ALL CAPS by their lower version and add TK_UP before.\n", 95 | " Replace characters in ALL CAPS by their lower version and add TK_UP before.\n", 96 | " Lowercases everything\n", 97 | "\n", 98 | "\n", 99 | "2. Tokenizing based on Spacy's tokenizer (fastai's default)\n", 100 | "\n", 101 | "3. Applying a post-processing rule which replaces embedded spaces in a token with unicode line char to allow for split/join\n", 102 | "\n", 103 | "4. Performing 1 epoch iterating through the training data, bs = 64\n", 104 | "\n", 105 | "\n", 106 | "## Results\n", 107 | "\n", 108 | "#### 10% Data\n", 109 | "Results are...mixed! 
While the Fastai convenience function had a faster init (48s vs 71s), the PyArrow-backed `nlp` run through a single epoch was significantly faster (11s vs 14s).\n", 110 | "\n", 111 | "| 0.16M ROWS: | Init (s)| 1 epoch (s) | 1 mini-batch [bs=64] (ms) |\n", 112 | "| :- | :-: | :-: | :-: |\n", 113 | "| **Fastai** | 124 | 14.3 | 7.4 | \n", 114 | "| **Fastai w/sorted** | **48.1** | 14.3 | 7.4 |\n", 115 | "| **nlp** | 71.2 | **11.3** | **5.6** |\n", 116 | "\n", 117 | "#### 100% Data\n", 118 | "Surprisingly, for the full dataset of 1.6M tweets, Fastai dramatically outperforms `nlp` Datasets.\n", 119 | "\n", 120 | "| 1.6M ROWS: | Init (s) | 1 epoch (s) |\n", 121 | "| :- | :-: | :-: |\n", 122 | "| **Fastai w/sorted** | **484** | **142** |\n", 123 | "| **nlp**| 1290 | 323 |\n", 124 | "\n", 125 | "## Thoughts\n", 126 | "- **Memory** Fastai was much faster on the full dataset, but I also had to delete the pandas dataframe used to calculate the text lengths, as it was taking up too much memory and causing my dataloaders to fail. I should probably find a more memory-efficient way to do this calculation. On the other hand, `nlp` Datasets won't incur this issue.\n", 127 | "- **nlp pipeline** Given the large difference for the full dataset, I am suspicious about some parts of my implementation. For example, sorting the entire dataset takes 416s and splitting it into a train and validation set takes another 416s. Do I need to sort the full dataset? Should chopping the dataset in 2 really take so long? Removing these two steps brings the init timing down to 458s, slightly faster than Fastai." 128 | ] 129 | }, 130 | { 131 | "cell_type": "code", 132 | "execution_count": 16, 133 | "metadata": {}, 134 | "outputs": [ 135 | { 136 | "data": { 137 | "image/png": "\n", 138 | "text/plain": [ 139 | "
" 140 | ] 141 | }, 142 | "metadata": { 143 | "needs_background": "light" 144 | }, 145 | "output_type": "display_data" 146 | } 147 | ], 148 | "source": [ 149 | "#hide\n", 150 | "def get_timings(n_epochs, init, per_ep):\n", 151 | " #init=48\n", 152 | " #eps = n_epochs * 14.25\n", 153 | " eps = n_epochs * per_ep\n", 154 | " return init+eps\n", 155 | "\n", 156 | "fastai_sorted_10 = [48, 14.25]\n", 157 | "fastai_sorted_100 = [484, 142]\n", 158 | "nlp_10 = [71, 11.27]\n", 159 | "nlp_100 = [1290, 323]\n", 160 | "\n", 161 | "timings_ls, timing_data = [], []\n", 162 | "timings = [fastai_sorted_10,fastai_sorted_100, nlp_10, nlp_100]\n", 163 | "\n", 164 | "n_eps = list(range(1,20,2))\n", 165 | "\n", 166 | "for t in timings: timing_data.append([get_timings(n_epochs=n, init=t[0], per_ep=t[1]) for n in n_eps])\n", 167 | "\n", 168 | "colors = ['deepskyblue','blue','goldenrod', 'darkorange']\n", 169 | "labels = ['fastai_sorted_10','fastai_sorted_100','nlp_10','nlp_100']\n", 170 | "for i,t in enumerate(timing_data):\n", 171 | " plt.plot(t, color=colors[i], label=labels[i])\n", 172 | "\n", 173 | "plt.title('Training run duration')\n", 174 | "plt.xlabel('Num Epochs')\n", 175 | "plt.ylabel('Time (s)')\n", 176 | "plt.yscale('log')\n", 177 | "plt.legend()\n", 178 | "plt.show();" 179 | ] 180 | }, 181 | { 182 | "cell_type": "markdown", 183 | "metadata": {}, 184 | "source": [ 185 | "## Thanks for Reading 😃\n", 186 | "As always, I would love to hear if you have any comments, thoughts or criticisms, you can find me on Twitter at [@mcgenergy](www.twitter.com/mcgenergy)\n", 187 | "\n", 188 | "## [Appendix]\n", 189 | "\n", 190 | "## Code:\n", 191 | "For those curious, you can peek the code used in testing this below 👇" 192 | ] 193 | }, 194 | { 195 | "cell_type": "code", 196 | "execution_count": 1, 197 | "metadata": { 198 | "scrolled": true 199 | }, 200 | "outputs": [], 201 | "source": [ 202 | "#hide_collapse\n", 203 | "# Imports\n", 204 | "%reload_ext autoreload\n", 205 | "%autoreload 2\n", 206 
| "\n", 207 | "from fastai2.basics import *\n", 208 | "from fastai2.text.all import *\n", 209 | "# from fastai2.callback.all import *\n", 210 | "# from fastai2.data.transforms import RandomSplitter\n", 211 | "from fastai2.text.core import defaults\n", 212 | "\n", 213 | "from nlp import load_dataset\n", 214 | "\n", 215 | "import spacy,html\n", 216 | "from spacy.symbols import ORTH\n", 217 | "\n", 218 | "import timeit\n", 219 | "import gc" 220 | ] 221 | }, 222 | { 223 | "cell_type": "markdown", 224 | "metadata": {}, 225 | "source": [ 226 | "### Fastai Testing\n", 227 | "Init timing:" 228 | ] 229 | }, 230 | { 231 | "cell_type": "code", 232 | "execution_count": 2, 233 | "metadata": {}, 234 | "outputs": [ 235 | { 236 | "data": { 237 | "text/html": [], 238 | "text/plain": [ 239 | "" 240 | ] 241 | }, 242 | "metadata": {}, 243 | "output_type": "display_data" 244 | } 245 | ], 246 | "source": [ 247 | "#hide_collapse\n", 248 | "#%%timeit -n 1 -r 3\n", 249 | "\n", 250 | "# Download data and save as csv\n", 251 | "# senti_dataset = load_dataset('sentiment140', split='train[:100%]', download_mode='reuse_cache_if_exists')\n", 252 | "# df = senti_dataset.data.to_pandas()\n", 253 | "# df.to_csv('sentiment140.csv')\n", 254 | "\n", 255 | "# Read data; the first 10% of the sentiment140 dataset, extracted from the `nlp` library and saved as a csv\n", 256 | "#fn_10pct = 'sentiment140_10pct.csv'\n", 257 | "fn = 'sentiment140.csv'\n", 258 | "df = pd.read_csv(fn, index_col=None)\n", 259 | "\n", 260 | "# SORT: Calculate text sample lengths\n", 261 | "df['word_count'] = df['text'].str.split().map(len)\n", 262 | "\n", 263 | "res=df['word_count'].values\n", 264 | "\n", 265 | "# Create Dataloaders\n", 266 | "dls = TextDataLoaders.from_csv(path='.', csv_fname=fn, valid_pct=0.2, bs=64, \n", 267 | " text_col='text', label_col='sentiment' , res=res)" 268 | ] 269 | }, 270 | { 271 | "cell_type": "markdown", 272 | "metadata": {}, 273 | "source": [ 274 | "1 epoch timing" 275 | ] 276 | }, 277 | { 278 | 
"cell_type": "code", 279 | "execution_count": 6, 280 | "metadata": { 281 | "scrolled": true 282 | }, 283 | "outputs": [ 284 | { 285 | "data": { 286 | "text/plain": [ 287 | "(143.04122078799992, 0.007152061039399996)" 288 | ] 289 | }, 290 | "execution_count": 6, 291 | "metadata": {}, 292 | "output_type": "execute_result" 293 | } 294 | ], 295 | "source": [ 296 | "#hide_collapse\n", 297 | "\n", 298 | "del df, res\n", 299 | "gc.collect()\n", 300 | "\n", 301 | "# Do 1 pass of the training dataloader\n", 302 | "s = \"\"\"for b in dls.train: pass\n", 303 | " \"\"\"\n", 304 | "\n", 305 | "time = timeit.timeit(stmt=s, number=1, globals=globals()); time\n", 306 | "time, time / len(dls.train)" 307 | ] 308 | }, 309 | { 310 | "cell_type": "markdown", 311 | "metadata": {}, 312 | "source": [ 313 | "### HuggingFace `nlp` Datasets Testing" 314 | ] 315 | }, 316 | { 317 | "cell_type": "markdown", 318 | "metadata": {}, 319 | "source": [ 320 | "Tokenizer, Numericalizer and Padding functions:" 321 | ] 322 | }, 323 | { 324 | "cell_type": "code", 325 | "execution_count": 2, 326 | "metadata": {}, 327 | "outputs": [], 328 | "source": [ 329 | "#hide_collapse\n", 330 | "class SpacyTokenizerNLP():\n", 331 | " \"Spacy tokenizer for `lang`\"\n", 332 | " def __init__(self, lang='en', special_toks=None, buf_sz=5000):\n", 333 | " self.special_toks = ifnone(special_toks, defaults.text_spec_tok)\n", 334 | " nlp = spacy.blank(lang, disable=[\"parser\", \"tagger\", \"ner\"])\n", 335 | " for w in self.special_toks: nlp.tokenizer.add_special_case(w, [{ORTH: w}])\n", 336 | " self.pipe,self.buf_sz = nlp.pipe,buf_sz\n", 337 | " \n", 338 | " def encodes(self, items):\n", 339 | " tmp = [list(doc) for doc in self.pipe(items, batch_size=self.buf_sz)]\n", 340 | " return {'tok_text_pre': [list(str(t) for t in l) for l in tmp]}\n", 341 | "\n", 342 | "def make_vocab(count, min_freq=3, max_vocab=60000, special_toks=None):\n", 343 | " \"Create a vocab of `max_vocab` size from `Counter` `count` with items present more 
than `min_freq`\"\n", 344 | " vocab = [o for o,c in count.most_common(max_vocab) if c >= min_freq]\n", 345 | " special_toks = ifnone(special_toks, defaults.text_spec_tok)\n", 346 | " for o in reversed(special_toks): #Make sure all special tokens are in the vocab\n", 347 | " if o in vocab: vocab.remove(o)\n", 348 | " vocab.insert(0, o)\n", 349 | " vocab = vocab[:max_vocab]\n", 350 | " return vocab + [f'xxfake' for i in range(0, 8-len(vocab)%8)]\n", 351 | "\n", 352 | "class NumericalizeNLP(Transform):\n", 353 | " \"Reversible transform of tokenized texts to numericalized ids\"\n", 354 | " def __init__(self, dsets=None, vocab=None, min_freq=3, max_vocab=60000, special_toks=None, pad_tok=None):\n", 355 | " store_attr(self, 'vocab,min_freq,max_vocab,special_toks,pad_tok')\n", 356 | " self.vocab, self.special_toks, self.min_freq, self.max_vocab = vocab, special_toks, min_freq, max_vocab\n", 357 | " self.o2i = None if vocab is None else defaultdict(int, {v:k for k,v in enumerate(vocab)})\n", 358 | "\n", 359 | " if self.vocab is None:\n", 360 | " count = Counter(p for o in dsets for p in o)\n", 361 | " self.vocab = make_vocab(count, min_freq=self.min_freq, max_vocab=self.max_vocab, special_toks=self.special_toks)\n", 362 | " self.o2i = defaultdict(int, {v:k for k,v in enumerate(self.vocab) if v != 'xxfake'})\n", 363 | " \n", 364 | " def encodes_nlp(self, o): return TensorText(tensor([self.o2i [o_] for o_ in o]))\n", 365 | " def encodes_nlp(self, b): return {'toks' : [[self.o2i[o_] for o_ in oo] for oo in b['tok_text']]}\n", 366 | " \n", 367 | "# Padding functions\n", 368 | "def pad_seq(x, max_batch_len, pad_idx): \n", 369 | " pad = x.new_zeros(max_batch_len-x.size(0))+pad_idx\n", 370 | " return torch.cat([x, pad])\n", 371 | " \n", 372 | "# Pad up to longest item in the batch and put batch on the GPU\n", 373 | "def pad_batch(batch=None, pad_token_id=1):\n", 374 | " batch_inputs = list()\n", 375 | " max_size = max([len(item['toks']) for item in batch])\n", 376 | " for item 
in batch: batch_inputs += [pad_seq(item['toks'], max_size, pad_token_id)]\n", 377 | " return torch.stack(batch_inputs).cuda()" 378 | ] 379 | }, 380 | { 381 | "cell_type": "markdown", 382 | "metadata": {}, 383 | "source": [ 384 | "Data loading and processing functions:" 385 | ] 386 | }, 387 | { 388 | "cell_type": "code", 389 | "execution_count": 9, 390 | "metadata": { 391 | "scrolled": true 392 | }, 393 | "outputs": [ 394 | { 395 | "name": "stdout", 396 | "output_type": "stream", 397 | "text": [ 398 | "Downloading and preparing dataset sentiment140/sentiment140 (download: 77.59 MiB, generated: 214.21 MiB, total: 291.81 MiB) to /home/morgan/.cache/huggingface/datasets/sentiment140/sentiment140/1.0.0...\n" 399 | ] 400 | }, 401 | { 402 | "data": { 403 | "application/vnd.jupyter.widget-view+json": { 404 | "model_id": "", 405 | "version_major": 2, 406 | "version_minor": 0 407 | }, 408 | "text/plain": [ 409 | "HBox(children=(FloatProgress(value=1.0, bar_style='info', max=1.0), HTML(value='')))" 410 | ] 411 | }, 412 | "metadata": {}, 413 | "output_type": "display_data" 414 | }, 415 | { 416 | "name": "stdout", 417 | "output_type": "stream", 418 | "text": [ 419 | "\r" 420 | ] 421 | }, 422 | { 423 | "data": { 424 | "application/vnd.jupyter.widget-view+json": { 425 | "model_id": "", 426 | "version_major": 2, 427 | "version_minor": 0 428 | }, 429 | "text/plain": [ 430 | "HBox(children=(FloatProgress(value=1.0, bar_style='info', max=1.0), HTML(value='')))" 431 | ] 432 | }, 433 | "metadata": {}, 434 | "output_type": "display_data" 435 | }, 436 | { 437 | "name": "stdout", 438 | "output_type": "stream", 439 | "text": [ 440 | "Dataset sentiment140 downloaded and prepared to /home/morgan/.cache/huggingface/datasets/sentiment140/sentiment140/1.0.0. 
Subsequent calls will reuse this data.\n" 441 | ] 442 | } 443 | ], 444 | "source": [ 445 | "#hide_collapse\n", 446 | "\n", 447 | "# Download text, a clean version of the dataset is downloaded (not included in the timings), 'train[:10%]'\n", 448 | "senti_dataset = load_dataset('sentiment140', split='train', download_mode='reuse_cache_if_exists')\n", 449 | "\n", 450 | "spacy_tok = SpacyTokenizerNLP(lang='en', special_toks=defaults.text_spec_tok)\n", 451 | "\n", 452 | "def preproc_and_tok(b): return spacy_tok.encodes(list(maps(*defaults.text_proc_rules, b['text'])))\n", 453 | "\n", 454 | "def postproc(b): \n", 455 | " return {'tok_text': [list(maps(*defaults.text_postproc_rules, _b)) for _b in b['tok_text_pre']]}\n", 456 | "\n", 457 | "def get_tok_lengths(example_batch): return {'tok_lens': [len(e) for e in example_batch['toks']]}\n", 458 | "\n", 459 | "def prepare_dataset(dataset):\n", 460 | " '''\n", 461 | " Takes a raw nlp dataset and returns a processed, tokenized, numericalised dataset\n", 462 | " '''\n", 463 | " # Apply processing rules and tokenize\n", 464 | " print('pre-proc and tokenize')\n", 465 | " dataset = dataset.map(preproc_and_tok, batched=True)\n", 466 | "\n", 467 | " # Apply post-processing rules \n", 468 | " print('post=proc')\n", 469 | " dataset = dataset.map(postproc, batched=True)\n", 470 | "\n", 471 | " # Init Numericalizer and create vocab\n", 472 | " print('init numericalizer')\n", 473 | " numeric = NumericalizeNLP(dsets=dataset['tok_text_pre'], special_toks=defaults.text_spec_tok, pad_tok=1)\n", 474 | "\n", 475 | " # Numericalize\n", 476 | " print('numericalizing')\n", 477 | " dataset = dataset.map(numeric.encodes_nlp, batched=True)\n", 478 | "\n", 479 | " # Get sample lengths for sorting\n", 480 | " dataset=dataset.map(get_tok_lengths, batched=True)\n", 481 | " \n", 482 | " print('sorting')\n", 483 | " # Sort dataset from small to large\n", 484 | " dataset = dataset.sort('tok_lens')\n", 485 | " \n", 486 | " return dataset" 487 | ] 488 | }, 
489 | { 490 | "cell_type": "markdown", 491 | "metadata": {}, 492 | "source": [ 493 | "Init Timing" 494 | ] 495 | }, 496 | { 497 | "cell_type": "code", 498 | "execution_count": 10, 499 | "metadata": { 500 | "scrolled": true 501 | }, 502 | "outputs": [ 503 | { 504 | "name": "stderr", 505 | "output_type": "stream", 506 | "text": [ 507 | "\r", 508 | " 0%| | 0/1600 [00:00