├── LICENSE.md ├── README.md ├── alexnet_distilled.py ├── alexnet_distilled2bits.py ├── alexnet_doublefilters.py ├── cifar100_test.py ├── cifar10_diff_quant_test.py ├── cifar10_test.py ├── cifar10_wideResNet.py ├── cnn_models ├── __init__.py ├── alexnet_kfilters.py ├── conv_forward_model.py ├── help_fun.py ├── resnet_kfilters.py ├── wide_resnet.py └── wide_resnet_imagenet.py ├── datasets ├── CIFAR10.py ├── CIFAR100.py ├── ImageNet12.py ├── MNIST.py ├── PennTreeBank.py ├── __init__.py ├── customs_datasets.py ├── torchvision_extension.py └── translation_datasets.py ├── helpers └── functions.py ├── imageNet_distilled.py ├── model_manager.py ├── onmt ├── Beam.py ├── IO.py ├── Loss.py ├── ModelConstructor.py ├── Models.py ├── Optim.py ├── Trainer.py ├── Translator.py ├── Utils.py ├── __init__.py ├── modules │ ├── Conv2Conv.py │ ├── ConvMultiStepAttention.py │ ├── CopyGenerator.py │ ├── Embeddings.py │ ├── Gate.py │ ├── GlobalAttention.py │ ├── ImageEncoder.py │ ├── MultiHeadedAttn.py │ ├── SRU.py │ ├── StackedRNN.py │ ├── StructuredAttention.py │ ├── Transformer.py │ ├── UtilClass.py │ ├── WeightNorm.py │ └── __init__.py └── standard_options.py ├── openNMT_WMT13.py ├── openNMT_integ_dataset.py ├── openNMT_multi30k.py ├── perl_scripts ├── README.md ├── multi-bleu.perl ├── nonbreaking_prefix.de ├── nonbreaking_prefix.en └── tokenizer.perl ├── quantization ├── __init__.py ├── help_functions.py └── quant_functions.py ├── requirements.txt ├── resnet34_doublefilters.py └── translation_models ├── __init__.py ├── help_fun.py └── model.py /LICENSE.md: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2017 Antonio Polino - Dan Alistarh - Razvan Pascanu 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Model compression via distillation and quantization 2 | 3 | This code has been written to experiment with quantized distillation and differentiable quantization, techniques developed in our paper ["Model compression via distillation and quantization"](https://arxiv.org/abs/1802.05668). 4 | 5 | If you find this code useful in your research, please cite the paper: 6 | 7 | ``` 8 | @article{2018arXiv180205668P, 9 | author = {{Polino}, A. and {Pascanu}, R. 
and {Alistarh}, D.}, 10 | title = "{Model compression via distillation and quantization}", 11 | journal = {ArXiv e-prints}, 12 | archivePrefix = "arXiv", 13 | eprint = {1802.05668}, 14 | keywords = {Computer Science - Neural and Evolutionary Computing, Computer Science - Learning}, 15 | year = 2018, 16 | month = feb, 17 | } 18 | ``` 19 | 20 | 21 | The code is written in [Pytorch 0.3](http://pytorch.org/) using Python 3.6. It is not backward compatible with Python 2.x. 22 | 23 | *Note*: Pytorch 0.4 introduced some major breaking changes. To use this code, please use Pytorch 0.3. 24 | 25 | Make sure to use a compatible version of torchvision; to run the code, use torchvision 0.2.0. 26 | ``` 27 | pip install torchvision==0.2.0 28 | ``` 29 | This should be done after installing the [requirements](requirements.txt). 30 | 31 | # Getting started 32 | 33 | ### Prerequisites 34 | This code is mostly self-contained. Only a few additional libraries are required, specified in [requirements.txt](requirements.txt). The repository already contains a fork of the [openNMT-py project](https://github.com/OpenNMT/OpenNMT-py). Note that, due to the rapidly changing nature of the openNMT-py codebase and the substantial time and effort required to make it compatible with our code, it is unlikely that we will support newer versions of openNMT-py. 35 | 36 | ### Summary of folder's content 37 | This is a short explanation of the contents of each folder: 38 | 39 | - *datasets* is a package that automatically downloads and processes several datasets, including CIFAR10, PennTreeBank, WMT2013, etc. 40 | - *quantization* contains the quantization functions that are used. 41 | - *perl_scripts* contains some Perl scripts taken from the [moses project](https://github.com/moses-smt/mosesdecoder) to help with the translation task. 42 | - *onmt* contains the code from the [openNMT-py project](https://github.com/OpenNMT/OpenNMT-py). It is slightly modified to make it consistent with our codebase. 43 | - *helpers* contains some functions used across the whole project. 44 | - *model_manager.py* contains a useful class that implements common I/O operations on saved models. It is especially useful when training multiple similar models, as it keeps track of the options with which the models were trained and the results of each training run. *Note*: it does not support concurrent access to the same files. I am working on a version that does; if you are interested, drop me a line. 45 | - First-level files like [cifar10_test.py](cifar10_test.py) are the main files that implement the experiments using the rest of the codebase. 46 | - Other folders contain model definitions and training routines, depending on the task. 47 | 48 | ### Running the code 49 | 50 | The first thing to do is to import a dataset and create the train and test set loaders. 51 | Define a folder where you want to save all your datasets; they will be automatically downloaded and processed in the folder specified. The following example shows how to load the CIFAR10 dataset; the snippets that follow show how to create and train a model. 52 | ```python 53 | import datasets 54 | datasets.BASE_DATA_FOLDER = '/home/saved_datasets' 55 | 56 | batch_size = 50 57 | cifar10 = datasets.CIFAR10() #-> will be saved in /home/saved_datasets/cifar10 58 | train_loader, test_loader = cifar10.getTrainLoader(batch_size), cifar10.getTestLoader(batch_size) 59 | ``` 60 | 61 | Now we can use ```train_loader``` and ```test_loader``` as generators from which to get the train and test data as Pytorch tensors.
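For example, a single batch can be inspected like this (a minimal sketch; the shapes shown assume the CIFAR10 settings above, i.e. `batch_size = 50` and 3x32x32 images):

```python
# draw one batch of images and labels from the training loader
for images, labels in train_loader:
    print(images.size())  # torch.Size([50, 3, 32, 32])
    print(labels.size())  # torch.Size([50])
    break
```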
62 | 63 | At this point we just need to define a model and train it: 64 | 65 | ```python 66 | import os 67 | import cnn_models.conv_forward_model as convForwModel 68 | import cnn_models.help_fun as cnn_hf 69 | teacherModel = convForwModel.ConvolForwardNet(**convForwModel.teacherModelSpec, 70 | useBatchNorm=True, 71 | useAffineTransformInBatchNorm=True) 72 | convForwModel.train_model(teacherModel, train_loader, test_loader, epochs_to_train=200) 73 | ``` 74 | 75 | As mentioned before, it is often better to use the ModelManager class to be able to automatically save the results and retrieve them later. So we would typically write 76 | 77 | ```python 78 | import os 79 | import cnn_models.conv_forward_model as convForwModel 80 | import cnn_models.help_fun as cnn_hf 81 | import model_manager 82 | cifar10Manager = model_manager.ModelManager('model_manager_cifar10.tst', 83 | 'model_manager', create_new_model_manager=False)#the first time set this to True 84 | model_name = 'cifar10_teacher' 85 | cifar10modelsFolder = '~/quantized_distillation/' 86 | teacherModelPath = os.path.join(cifar10modelsFolder, model_name) 87 | teacherModel = convForwModel.ConvolForwardNet(**convForwModel.teacherModelSpec, 88 | useBatchNorm=True, 89 | useAffineTransformInBatchNorm=True) 90 | if not model_name in cifar10Manager.saved_models: 91 | cifar10Manager.add_new_model(model_name, teacherModelPath, 92 | arguments_creator_function={**convForwModel.teacherModelSpec, 93 | 'useBatchNorm':True, 94 | 'useAffineTransformInBatchNorm':True}) 95 | cifar10Manager.train_model(teacherModel, model_name=model_name, 96 | train_function=convForwModel.train_model, 97 | arguments_train_function={'epochs_to_train': 200}, 98 | train_loader=train_loader, test_loader=test_loader) 99 | ``` 100 | 101 | This is the general structure necessary to use the code. For more examples, please look at one of the main files that are used to run the experiments. 102 | 103 | # Authors 104 | 105 | - Antonio Polino 106 | - Razvan Pascanu 107 | - Dan Alistarh 108 | 109 | # License 110 | 111 | The code is licensed under the MIT Licence. See the [LICENSE.md](LICENSE.md) file for detail. 112 | 113 | # Acknowledgements 114 | 115 | We would like to thank Ce Zhang (ETH Zürich), Hantian Zhang (ETH Zürich) and Martin Jaggi (EPFL) for their support with experiments and valuable feedback. 116 | -------------------------------------------------------------------------------- /alexnet_distilled.py: -------------------------------------------------------------------------------- 1 | import os 2 | import torch 3 | import torchvision 4 | import cnn_models.conv_forward_model as convForwModel 5 | import cnn_models.help_fun as cnn_hf 6 | import datasets 7 | import model_manager 8 | 9 | cuda_devices = os.environ['CUDA_VISIBLE_DEVICES'].split(',') 10 | print('CUDA_VISIBLE_DEVICES: {} for a total of {} GPUs'.format(cuda_devices, len(cuda_devices))) 11 | 12 | 13 | NUM_BITS = 4 14 | print('Number of bits in training: {}'.format(NUM_BITS)) 15 | 16 | datasets.BASE_DATA_FOLDER = '...' 17 | SAVED_MODELS_FOLDER = '...' 
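# NOTE: the '...' placeholders above are intentionally left unspecified; set them to real
# folders before running. Also, batch_size is used further down in this script (in the
# print statement and when building the ImageNet12 loaders) but is never defined here,
# so it must be set as well, e.g. batch_size = 256 (an illustrative value, not one
# prescribed by the authors).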
18 | 19 | 20 | USE_CUDA = torch.cuda.is_available() 21 | NUM_GPUS = len(cuda_devices) 22 | 23 | try: 24 | os.mkdir(datasets.BASE_DATA_FOLDER) 25 | except:pass 26 | try: 27 | os.mkdir(SAVED_MODELS_FOLDER) 28 | except:pass 29 | 30 | epochsToTrainImageNet = 90 31 | imageNet12modelsFolder = os.path.join(SAVED_MODELS_FOLDER, 'imagenet12_new') 32 | imagenet_manager = model_manager.ModelManager('model_manager_imagenet_Alexnet_distilled4bits.tst', 33 | 'model_manager', create_new_model_manager=False) 34 | 35 | for x in imagenet_manager.list_models(): 36 | if imagenet_manager.get_num_training_runs(x) >= 1: 37 | s = '{}; Last prediction acc: {}, Best prediction acc: {}'.format(x, 38 | imagenet_manager.load_metadata(x)[1]['predictionAccuracy'][-1], 39 | max(imagenet_manager.load_metadata(x)[1]['predictionAccuracy'])) 40 | print(s) 41 | 42 | try: 43 | os.mkdir(imageNet12modelsFolder) 44 | except:pass 45 | 46 | print('Batch size: {}'.format(batch_size)) 47 | 48 | if batch_size % NUM_GPUS != 0: 49 | raise ValueError('Batch size: {} must be a multiple of the number of gpus:{}'.format(batch_size, NUM_GPUS)) 50 | 51 | 52 | imageNet12 = datasets.ImageNet12('...', 53 | '...', 54 | type_of_data_augmentation='extended', already_scaled=False, 55 | pin_memory=True) 56 | 57 | 58 | train_loader = imageNet12.getTrainLoader(batch_size, shuffle=True) 59 | test_loader = imageNet12.getTestLoader(batch_size, shuffle=False) 60 | 61 | # # Teacher model 62 | alexnet_unquantized = torchvision.models.alexnet(pretrained=True) 63 | if USE_CUDA: 64 | alexnet_unquantized = alexnet_unquantized.cuda() 65 | if NUM_GPUS > 1: 66 | alexnet_unquantized = torch.nn.parallel.DataParallel(alexnet_unquantized) 67 | 68 | 69 | #Train a wide-resNet with quantized distillation 70 | quant_distilled_model_name = 'alexnet_quant_distilled{}bits'.format(NUM_BITS) 71 | quantDistilledModelPath = os.path.join(imageNet12modelsFolder, quant_distilled_model_name) 72 | quantDistilledOptions = {} 73 | quant_distilled_model = torchvision.models.alexnet(pretrained=False) 74 | 75 | if USE_CUDA: 76 | quant_distilled_model = quant_distilled_model.cuda() 77 | if NUM_GPUS > 1: 78 | quant_distilled_model = torch.nn.parallel.DataParallel(quant_distilled_model) 79 | 80 | if not quant_distilled_model_name in imagenet_manager.saved_models: 81 | imagenet_manager.add_new_model(quant_distilled_model_name, quantDistilledModelPath, 82 | arguments_creator_function=quantDistilledOptions) 83 | 84 | imagenet_manager.train_model(quant_distilled_model, model_name=quant_distilled_model_name, 85 | train_function=convForwModel.train_model, 86 | arguments_train_function={'epochs_to_train': epochsToTrainImageNet, 87 | 'learning_rate_style': 'imagenet', 88 | 'initial_learning_rate': 0.1, 89 | 'use_nesterov':True, 90 | 'initial_momentum':0.9, 91 | 'weight_decayL2':1e-4, 92 | 'start_epoch': 0, 93 | 'print_every':30, 94 | 'use_distillation_loss':True, 95 | 'teacher_model': alexnet_unquantized, 96 | 'quantizeWeights':True, 97 | 'numBits':NUM_BITS, 98 | 'bucket_size':256, 99 | 'quantize_first_and_last_layer': False}, 100 | train_loader=train_loader, test_loader=test_loader) 101 | -------------------------------------------------------------------------------- /alexnet_distilled2bits.py: -------------------------------------------------------------------------------- 1 | import os 2 | import torch 3 | import torchvision 4 | import cnn_models.conv_forward_model as convForwModel 5 | import cnn_models.help_fun as cnn_hf 6 | import datasets 7 | import model_manager 8 | 9 | cuda_devices = 
os.environ['CUDA_VISIBLE_DEVICES'].split(',') 10 | print('CUDA_VISIBLE_DEVICES: {} for a total of {} GPUs'.format(cuda_devices, len(cuda_devices))) 11 | 12 | NUM_BITS = 2 13 | print('Number of bits in training: {}'.format(NUM_BITS)) 14 | 15 | datasets.BASE_DATA_FOLDER = '...' 16 | SAVED_MODELS_FOLDER = '...' 17 | 18 | USE_CUDA = torch.cuda.is_available() 19 | NUM_GPUS = len(cuda_devices) 20 | 21 | try: 22 | os.mkdir(datasets.BASE_DATA_FOLDER) 23 | except:pass 24 | try: 25 | os.mkdir(SAVED_MODELS_FOLDER) 26 | except:pass 27 | 28 | TRAIN_QUANTIZED_DISTILLED = False 29 | 30 | epochsToTrainImageNet = 90 31 | imageNet12modelsFolder = os.path.join(SAVED_MODELS_FOLDER, 'imagenet12_new') 32 | imagenet_manager = model_manager.ModelManager('model_manager_imagenet_Alexnet_distilled2bits.tst', 33 | 'model_manager', create_new_model_manager=False) 34 | 35 | for x in imagenet_manager.list_models(): 36 | if imagenet_manager.get_num_training_runs(x) >= 1: 37 | s = '{}; Last prediction acc: {}, Best prediction acc: {}'.format(x, 38 | imagenet_manager.load_metadata(x)[1]['predictionAccuracy'][-1], 39 | max(imagenet_manager.load_metadata(x)[1]['predictionAccuracy'])) 40 | print(s) 41 | 42 | try: 43 | os.mkdir(imageNet12modelsFolder) 44 | except:pass 45 | 46 | print('Batch size: {}'.format(batch_size)) 47 | 48 | if batch_size % NUM_GPUS != 0: 49 | raise ValueError('Batch size: {} must be a multiple of the number of gpus:{}'.format(batch_size, NUM_GPUS)) 50 | 51 | imageNet12 = datasets.ImageNet12('...', 52 | '...', 53 | type_of_data_augmentation='extended', already_scaled=False, 54 | pin_memory=True) 55 | 56 | 57 | 58 | train_loader = imageNet12.getTrainLoader(batch_size, shuffle=True) 59 | test_loader = imageNet12.getTestLoader(batch_size, shuffle=False) 60 | 61 | # # Teacher model 62 | alexnet_unquantized = torchvision.models.alexnet(pretrained=True) 63 | if USE_CUDA: 64 | alexnet_unquantized = alexnet_unquantized.cuda() 65 | if NUM_GPUS > 1: 66 | alexnet_unquantized = torch.nn.parallel.DataParallel(alexnet_unquantized) 67 | 68 | 69 | #Train a wide-resNet with quantized distillation 70 | quant_distilled_model_name = 'alexnet_quant_distilled{}bits'.format(NUM_BITS) 71 | quantDistilledModelPath = os.path.join(imageNet12modelsFolder, quant_distilled_model_name) 72 | quantDistilledOptions = {} 73 | quant_distilled_model = torchvision.models.alexnet(pretrained=False) 74 | 75 | if USE_CUDA: 76 | quant_distilled_model = quant_distilled_model.cuda() 77 | if NUM_GPUS > 1: 78 | quant_distilled_model = torch.nn.parallel.DataParallel(quant_distilled_model) 79 | 80 | if not quant_distilled_model_name in imagenet_manager.saved_models: 81 | imagenet_manager.add_new_model(quant_distilled_model_name, quantDistilledModelPath, 82 | arguments_creator_function=quantDistilledOptions) 83 | 84 | if TRAIN_QUANTIZED_DISTILLED: 85 | imagenet_manager.train_model(quant_distilled_model, model_name=quant_distilled_model_name, 86 | train_function=convForwModel.train_model, 87 | arguments_train_function={'epochs_to_train': epochsToTrainImageNet, 88 | 'learning_rate_style': 'imagenet', 89 | 'initial_learning_rate': 0.1, 90 | 'use_nesterov':True, 91 | 'initial_momentum':0.9, 92 | 'weight_decayL2':1e-4, 93 | 'start_epoch': 0, 94 | 'print_every':30, 95 | 'use_distillation_loss':True, 96 | 'teacher_model': alexnet_unquantized, 97 | 'quantizeWeights':True, 98 | 'numBits':NUM_BITS, 99 | 'bucket_size':256, 100 | 'quantize_first_and_last_layer': False}, 101 | train_loader=train_loader, test_loader=test_loader) 102 | 
quant_distilled_model.load_state_dict(imagenet_manager.load_model_state_dict(quant_distilled_model_name)) 103 | 104 | print(cnn_hf.evaluateModel(quant_distilled_model)) 105 | -------------------------------------------------------------------------------- /alexnet_doublefilters.py: -------------------------------------------------------------------------------- 1 | import os 2 | import torch 3 | import torchvision 4 | import cnn_models.conv_forward_model as convForwModel 5 | import cnn_models.help_fun as cnn_hf 6 | import cnn_models.alexnet_kfilters as alexnet_kfilters 7 | import datasets 8 | import model_manager 9 | import functools 10 | import quantization 11 | import helpers.functions as mhf 12 | 13 | cuda_devices = os.environ['CUDA_VISIBLE_DEVICES'].split(',') 14 | print('CUDA_VISIBLE_DEVICES: {} for a total of {} GPUs'.format(cuda_devices, len(cuda_devices))) 15 | 16 | 17 | NUM_BITS = 4 18 | print('Server Name: {}'.format(server_name)) 19 | print('Number of bits in training: {}'.format(NUM_BITS)) 20 | 21 | datasets.BASE_DATA_FOLDER = '...' 22 | SAVED_MODELS_FOLDER = '...' 23 | 24 | USE_CUDA = torch.cuda.is_available() 25 | NUM_GPUS = len(cuda_devices) 26 | 27 | try: 28 | os.mkdir(datasets.BASE_DATA_FOLDER) 29 | except:pass 30 | try: 31 | os.mkdir(SAVED_MODELS_FOLDER) 32 | except:pass 33 | 34 | epochsToTrainImageNet = 90 35 | imageNet12modelsFolder = os.path.join(SAVED_MODELS_FOLDER, 'imagenet12_new') 36 | imagenet_manager = model_manager.ModelManager('model_manager_imagenet_AlexnetDoubleFilters_distilled4bits.tst', 37 | 'model_manager', create_new_model_manager=False) 38 | 39 | for x in imagenet_manager.list_models(): 40 | if imagenet_manager.get_num_training_runs(x) >= 1: 41 | s = '{}; Last prediction acc: {}, Best prediction acc: {}'.format(x, 42 | imagenet_manager.load_metadata(x)[1]['predictionAccuracy'][-1], 43 | max(imagenet_manager.load_metadata(x)[1]['predictionAccuracy'])) 44 | print(s) 45 | 46 | try: 47 | os.mkdir(imageNet12modelsFolder) 48 | except:pass 49 | 50 | 51 | TRAIN_QUANTIZED_DISTILLED = False 52 | 53 | print('Batch size: {}'.format(batch_size)) 54 | 55 | if batch_size % NUM_GPUS != 0: 56 | raise ValueError('Batch size: {} must be a multiple of the number of gpus:{}'.format(batch_size, NUM_GPUS)) 57 | 58 | imageNet12 = datasets.ImageNet12('...', 59 | '...', 60 | type_of_data_augmentation='extended', already_scaled=False, 61 | pin_memory=True) 62 | 63 | 64 | train_loader = imageNet12.getTrainLoader(batch_size, shuffle=True) 65 | test_loader = imageNet12.getTestLoader(batch_size, shuffle=False) 66 | 67 | # # Teacher model 68 | alexnet_unquantized = torchvision.models.alexnet(pretrained=True) 69 | if USE_CUDA: 70 | alexnet_unquantized = alexnet_unquantized.cuda() 71 | if NUM_GPUS > 1: 72 | alexnet_unquantized = torch.nn.parallel.DataParallel(alexnet_unquantized) 73 | 74 | 75 | #Train a wide-resNet with quantized distillation 76 | quant_distilled_model_name = 'alexnet_DoubleFiltersquant_distilled{}bits'.format(NUM_BITS) 77 | quantDistilledModelPath = os.path.join(imageNet12modelsFolder, quant_distilled_model_name) 78 | quantDistilledOptions = {} 79 | quant_distilled_model = alexnet_kfilters.AlexNet(k=2) 80 | 81 | if USE_CUDA: 82 | quant_distilled_model = quant_distilled_model.cuda() 83 | if NUM_GPUS > 1: 84 | quant_distilled_model = torch.nn.parallel.DataParallel(quant_distilled_model) 85 | 86 | if not quant_distilled_model_name in imagenet_manager.saved_models: 87 | imagenet_manager.add_new_model(quant_distilled_model_name, quantDistilledModelPath, 88 | 
arguments_creator_function=quantDistilledOptions) 89 | 90 | if TRAIN_QUANTIZED_DISTILLED: 91 | imagenet_manager.train_model(quant_distilled_model, model_name=quant_distilled_model_name, 92 | train_function=convForwModel.train_model, 93 | arguments_train_function={'epochs_to_train': epochsToTrainImageNet, 94 | 'learning_rate_style': 'imagenet', 95 | 'initial_learning_rate': 0.1, 96 | 'use_nesterov':True, 97 | 'initial_momentum':0.9, 98 | 'weight_decayL2':1e-4, 99 | 'start_epoch': 0, 100 | 'print_every':30, 101 | 'use_distillation_loss':True, 102 | 'teacher_model': alexnet_unquantized, 103 | 'quantizeWeights':True, 104 | 'numBits':NUM_BITS, 105 | 'bucket_size':256, 106 | 'quantize_first_and_last_layer': False}, 107 | train_loader=train_loader, test_loader=test_loader) 108 | quant_distilled_model.load_state_dict(imagenet_manager.load_model_state_dict(quant_distilled_model_name)) 109 | print(cnn_hf.evaluateModel(quant_distilled_model, test_loader, fastEvaluation=False)) 110 | print(cnn_hf.evaluateModel(quant_distilled_model, test_loader, fastEvaluation=False, k=5)) 111 | print(cnn_hf.evaluateModel(alexnet_unquantized, test_loader, fastEvaluation=False)) 112 | print(cnn_hf.evaluateModel(alexnet_unquantized, test_loader, fastEvaluation=False, k=5)) 113 | quant_fun = functools.partial(quantization.uniformQuantization, s=2**4, bucket_size=256) 114 | size_mb = mhf.get_size_quantized_model(quant_distilled_model, 4, quant_fun, 256, 115 | quantizeFirstLastLayer=False) 116 | print(size_mb) 117 | -------------------------------------------------------------------------------- /cifar10_diff_quant_test.py: -------------------------------------------------------------------------------- 1 | import model_manager 2 | import torch 3 | import os 4 | import datasets 5 | import cnn_models.conv_forward_model as convForwModel 6 | import cnn_models.help_fun as cnn_hf 7 | import quantization 8 | import pickle 9 | import copy 10 | import quantization.help_functions as qhf 11 | import functools 12 | import helpers.functions as mhf 13 | import itertools as it 14 | 15 | datasets.BASE_DATA_FOLDER = '...' 16 | SAVED_MODELS_FOLDER = '...' 
17 | USE_CUDA = torch.cuda.is_available() 18 | 19 | print('CUDA_VISIBLE_DEVICES: {}'.format(os.environ['CUDA_VISIBLE_DEVICES'])) 20 | 21 | try: 22 | os.mkdir(datasets.BASE_DATA_FOLDER) 23 | except:pass 24 | try: 25 | os.mkdir(SAVED_MODELS_FOLDER) 26 | except:pass 27 | 28 | 29 | cifar10Manager = model_manager.ModelManager('model_manager_cifar10.tst', 30 | 'model_manager', create_new_model_manager=False) 31 | cifar10modelsFolder = os.path.join(SAVED_MODELS_FOLDER, 'cifar10') 32 | 33 | for x in cifar10Manager.list_models(): 34 | if cifar10Manager.get_num_training_runs(x) >= 1: 35 | print(x, cifar10Manager.load_metadata(x)[1]['predictionAccuracy'][-1]) 36 | 37 | try: 38 | os.mkdir(cifar10modelsFolder) 39 | except:pass 40 | 41 | USE_BATCH_NORM = True 42 | AFFINE_BATCH_NORM = True 43 | 44 | 45 | COMPUTE_DIFFERENT_HEURISTICS = False 46 | 47 | batch_size = 25 48 | cifar10 = datasets.CIFAR10() 49 | train_loader, test_loader = cifar10.getTrainLoader(batch_size), cifar10.getTestLoader(batch_size) 50 | 51 | 52 | ## distilled model 53 | distilled_model_name = 'cifar10_distilled' 54 | distilledModelSpec = copy.deepcopy(convForwModel.smallerModelSpec) 55 | distilledModelSpec['spec_dropout_rates'] = [] #no dropout with distilled model 56 | 57 | 58 | def values_param_iter(n=3): 59 | return it.product(*([True, False],)*n) 60 | 61 | numBits = [2, 4] 62 | if COMPUTE_DIFFERENT_HEURISTICS: 63 | for numBit in numBits: 64 | for assign_bits_auto, use_distillation_loss, compute_initial_points in values_param_iter(n=3): 65 | if compute_initial_points is True: 66 | compute_initial_points = 'quantiles' 67 | else: 68 | compute_initial_points = 'uniform' 69 | str_identifier = 'quantpoints{}bits_auto{}_distill{}_initial"{}"'.format(numBit, assign_bits_auto, 70 | use_distillation_loss, 71 | compute_initial_points) 72 | distilled_quantized_model_name = distilled_model_name + str_identifier 73 | distilled_quantized_model = convForwModel.ConvolForwardNet(**distilledModelSpec, 74 | useBatchNorm=USE_BATCH_NORM, 75 | useAffineTransformInBatchNorm=AFFINE_BATCH_NORM) 76 | if USE_CUDA: distilled_quantized_model = distilled_quantized_model.cuda() 77 | distilled_quantized_model.load_state_dict(cifar10Manager.load_model_state_dict(distilled_model_name)) 78 | epochs_to_train = 50 79 | 80 | quantized_model_dict, quantization_points, infoDict = convForwModel.optimize_quantization_points( 81 | distilled_quantized_model, 82 | train_loader, test_loader, numPointsPerTensor=2**numBit, 83 | assignBitsAutomatically=assign_bits_auto, 84 | bucket_size=256, epochs_to_train=epochs_to_train, 85 | use_distillation_loss=use_distillation_loss, initial_learning_rate=1e-5, 86 | initialize_method=compute_initial_points) 87 | quantization_points = [x.data.view(1,-1).cpu().numpy().tolist()[0] for x in quantization_points] 88 | save_path = cifar10Manager.get_model_base_path(distilled_model_name) + str_identifier 89 | with open(save_path, 'wb') as p: 90 | pickle.dump((quantization_points, infoDict), p) 91 | torch.save(quantized_model_dict, save_path + '_model_state_dict') 92 | 93 | for numBit in numBits: 94 | for assign_bits_auto, use_distillation_loss, compute_initial_points in values_param_iter(n=3): 95 | if compute_initial_points is True: 96 | compute_initial_points = 'quantiles' 97 | else: 98 | compute_initial_points = 'uniform' 99 | str_identifier = 'quantpoints{}bits_auto{}_distill{}_initial"{}"'.format(numBit, assign_bits_auto, 100 | use_distillation_loss, 101 | compute_initial_points) 102 | distilled_quantized_model_name = distilled_model_name + 
str_identifier 103 | save_path = cifar10Manager.get_model_base_path(distilled_model_name) + str_identifier 104 | with open(save_path, 'rb') as p: 105 | quantization_points, infoDict = pickle.load(p) 106 | 107 | distilled_quantized_model = convForwModel.ConvolForwardNet(**distilledModelSpec, 108 | useBatchNorm=USE_BATCH_NORM, 109 | useAffineTransformInBatchNorm=AFFINE_BATCH_NORM) 110 | if USE_CUDA: distilled_quantized_model = distilled_quantized_model.cuda() 111 | distilled_quantized_model.load_state_dict(torch.load(save_path + '_model_state_dict')) 112 | reported_accuracy = max(infoDict['predictionAccuracy']) 113 | actual_accuracy = cnn_hf.evaluateModel(distilled_quantized_model, test_loader) #this corresponds to the last one 114 | #the only problem is that I don't save the model with the max accuracy, but the model at the last epoch 115 | print('Model "{}" => reported accuracy: {} - actual accuracy: {}'.format(distilled_quantized_model_name, 116 | reported_accuracy, actual_accuracy)) 117 | -------------------------------------------------------------------------------- /cnn_models/__init__.py: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /cnn_models/alexnet_kfilters.py: -------------------------------------------------------------------------------- 1 | import torch.nn as nn 2 | import torch.utils.model_zoo as model_zoo 3 | import math 4 | 5 | class AlexNet(nn.Module): 6 | 7 | def __init__(self, num_classes=1000, k=1): 8 | super(AlexNet, self).__init__() 9 | self.features = nn.Sequential( 10 | nn.Conv2d(3, math.floor(64*k), kernel_size=11, stride=4, padding=2), #originally 64 filters 11 | nn.ReLU(inplace=True), 12 | nn.MaxPool2d(kernel_size=3, stride=2), 13 | nn.Conv2d(math.floor(64*k), math.floor(192*k), kernel_size=5, padding=2), #originally 192 14 | nn.ReLU(inplace=True), 15 | nn.MaxPool2d(kernel_size=3, stride=2), 16 | nn.Conv2d(math.floor(192*k), math.floor(384*k), kernel_size=3, padding=1), #originally 384 17 | nn.ReLU(inplace=True), 18 | nn.Conv2d(math.floor(384*k), math.floor(256*k), kernel_size=3, padding=1), #originally 256 19 | nn.ReLU(inplace=True), 20 | nn.Conv2d(math.floor(256*k), math.floor(256*k), kernel_size=3, padding=1), #originally 256 21 | nn.ReLU(inplace=True), 22 | nn.MaxPool2d(kernel_size=3, stride=2), 23 | ) 24 | self.classifier = nn.Sequential( 25 | nn.Dropout(), 26 | nn.Linear(math.floor(256*k) * 6 * 6, 4096), #originally 256 * 6 * 6 27 | nn.ReLU(inplace=True), 28 | nn.Dropout(), 29 | nn.Linear(4096, 4096), 30 | nn.ReLU(inplace=True), 31 | nn.Linear(4096, num_classes), 32 | ) 33 | 34 | def forward(self, x): 35 | x = self.features(x) 36 | x = x.view(x.size(0), 512 * 6 * 6) #originally 256 * 6 * 6 37 | x = self.classifier(x) 38 | return x -------------------------------------------------------------------------------- /cnn_models/resnet_kfilters.py: -------------------------------------------------------------------------------- 1 | import torch.nn as nn 2 | import math 3 | 4 | 5 | def conv3x3(in_planes, out_planes, stride=1): 6 | "3x3 convolution with padding" 7 | return nn.Conv2d(in_planes, out_planes, kernel_size=3, stride=stride, 8 | padding=1, bias=False) 9 | 10 | 11 | class BasicBlock(nn.Module): 12 | expansion = 1 13 | 14 | def __init__(self, inplanes, planes, stride=1, downsample=None): 15 | super(BasicBlock, self).__init__() 16 | self.conv1 = conv3x3(inplanes, planes, stride) 17 | self.bn1 = nn.BatchNorm2d(planes) 18 | self.relu = 
nn.ReLU(inplace=True) 19 | self.conv2 = conv3x3(planes, planes) 20 | self.bn2 = nn.BatchNorm2d(planes) 21 | self.downsample = downsample 22 | self.stride = stride 23 | 24 | def forward(self, x): 25 | residual = x 26 | 27 | out = self.conv1(x) 28 | out = self.bn1(out) 29 | out = self.relu(out) 30 | 31 | out = self.conv2(out) 32 | out = self.bn2(out) 33 | 34 | if self.downsample is not None: 35 | residual = self.downsample(x) 36 | 37 | out += residual 38 | out = self.relu(out) 39 | 40 | return out 41 | 42 | 43 | class Bottleneck(nn.Module): 44 | expansion = 4 45 | 46 | def __init__(self, inplanes, planes, stride=1, downsample=None): 47 | super(Bottleneck, self).__init__() 48 | self.conv1 = nn.Conv2d(inplanes, planes, kernel_size=1, bias=False) 49 | self.bn1 = nn.BatchNorm2d(planes) 50 | self.conv2 = nn.Conv2d(planes, planes, kernel_size=3, stride=stride, 51 | padding=1, bias=False) 52 | self.bn2 = nn.BatchNorm2d(planes) 53 | self.conv3 = nn.Conv2d(planes, planes * 4, kernel_size=1, bias=False) 54 | self.bn3 = nn.BatchNorm2d(planes * 4) 55 | self.relu = nn.ReLU(inplace=True) 56 | self.downsample = downsample 57 | self.stride = stride 58 | 59 | def forward(self, x): 60 | residual = x 61 | 62 | out = self.conv1(x) 63 | out = self.bn1(out) 64 | out = self.relu(out) 65 | 66 | out = self.conv2(out) 67 | out = self.bn2(out) 68 | out = self.relu(out) 69 | 70 | out = self.conv3(out) 71 | out = self.bn3(out) 72 | 73 | if self.downsample is not None: 74 | residual = self.downsample(x) 75 | 76 | out += residual 77 | out = self.relu(out) 78 | 79 | return out 80 | 81 | 82 | class ResNet(nn.Module): 83 | 84 | def __init__(self, block, layers, num_classes=1000, k=1): 85 | self.inplanes = math.floor(64*k) #originally 64 86 | super(ResNet, self).__init__() 87 | self.conv1 = nn.Conv2d(3, math.floor(64*k), kernel_size=7, stride=2, padding=3, #originally 64 88 | bias=False) 89 | self.bn1 = nn.BatchNorm2d(math.floor(64*k)) #originally 64 90 | self.relu = nn.ReLU(inplace=True) 91 | self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1) 92 | self.layer1 = self._make_layer(block, math.floor(64*k), layers[0]) #originally 64 93 | self.layer2 = self._make_layer(block, math.floor(128*k), layers[1], stride=2) #originally 128 94 | self.layer3 = self._make_layer(block, math.floor(256*k), layers[2], stride=2) #originally 256 95 | self.layer4 = self._make_layer(block, math.floor(512*k), layers[3], stride=2) #originally 512 96 | self.avgpool = nn.AvgPool2d(7, stride=1) 97 | self.fc = nn.Linear(math.floor(512*k) * block.expansion, num_classes) #originally 512 * block.expansion 98 | 99 | #initialization 100 | for m in self.modules(): 101 | if isinstance(m, nn.Conv2d): 102 | n = m.kernel_size[0] * m.kernel_size[1] * m.out_channels 103 | m.weight.data.normal_(0, math.sqrt(2. 
/ n)) 104 | elif isinstance(m, nn.BatchNorm2d): 105 | m.weight.data.fill_(1) 106 | m.bias.data.zero_() 107 | 108 | def _make_layer(self, block, planes, blocks, stride=1): 109 | downsample = None 110 | if stride != 1 or self.inplanes != planes * block.expansion: 111 | downsample = nn.Sequential( 112 | nn.Conv2d(self.inplanes, planes * block.expansion, 113 | kernel_size=1, stride=stride, bias=False), 114 | nn.BatchNorm2d(planes * block.expansion), 115 | ) 116 | 117 | layers = [] 118 | layers.append(block(self.inplanes, planes, stride, downsample)) 119 | self.inplanes = planes * block.expansion 120 | for i in range(1, blocks): 121 | layers.append(block(self.inplanes, planes)) 122 | 123 | return nn.Sequential(*layers) 124 | 125 | def forward(self, x): 126 | x = self.conv1(x) 127 | x = self.bn1(x) 128 | x = self.relu(x) 129 | x = self.maxpool(x) 130 | 131 | x = self.layer1(x) 132 | x = self.layer2(x) 133 | x = self.layer3(x) 134 | x = self.layer4(x) 135 | 136 | x = self.avgpool(x) 137 | x = x.view(x.size(0), -1) 138 | x = self.fc(x) 139 | 140 | return x 141 | 142 | 143 | def resnet18(**kwargs): 144 | """Constructs a ResNet-18 model. 145 | 146 | Args: 147 | pretrained (bool): If True, returns a model pre-trained on ImageNet 148 | """ 149 | model = ResNet(BasicBlock, [2, 2, 2, 2], **kwargs) 150 | return model 151 | 152 | 153 | def resnet34(**kwargs): 154 | """Constructs a ResNet-34 model. 155 | 156 | Args: 157 | pretrained (bool): If True, returns a model pre-trained on ImageNet 158 | """ 159 | model = ResNet(BasicBlock, [3, 4, 6, 3], **kwargs) 160 | return model 161 | 162 | 163 | def resnet50(**kwargs): 164 | """Constructs a ResNet-50 model. 165 | 166 | Args: 167 | pretrained (bool): If True, returns a model pre-trained on ImageNet 168 | """ 169 | model = ResNet(Bottleneck, [3, 4, 6, 3], **kwargs) 170 | return model 171 | 172 | 173 | def resnet101(**kwargs): 174 | """Constructs a ResNet-101 model. 175 | 176 | Args: 177 | pretrained (bool): If True, returns a model pre-trained on ImageNet 178 | """ 179 | model = ResNet(Bottleneck, [3, 4, 23, 3], **kwargs) 180 | return model 181 | 182 | 183 | def resnet152(**kwargs): 184 | """Constructs a ResNet-152 model. 
185 | 186 | Args: 187 | pretrained (bool): If True, returns a model pre-trained on ImageNet 188 | """ 189 | model = ResNet(Bottleneck, [3, 8, 36, 3], **kwargs) 190 | return model 191 | -------------------------------------------------------------------------------- /cnn_models/wide_resnet.py: -------------------------------------------------------------------------------- 1 | #code taken from https://github.com/meliketoy/wide-resnet.pytorch 2 | import torch 3 | import torch.nn as nn 4 | import torch.nn.init as init 5 | import torch.nn.functional as F 6 | from torch.autograd import Variable 7 | 8 | import sys 9 | import numpy as np 10 | 11 | #TODO: Some of the things are not equal to the model definition (from the authors) 12 | # which is here: https://github.com/szagoruyko/functional-zoo/blob/master/wide-resnet-50-2-export.ipynb 13 | 14 | 15 | def conv3x3(in_planes, out_planes, stride=1): 16 | #TODO: Authors use, in their conv2d a padding=0 by default if I am not mistaken 17 | return nn.Conv2d(in_planes, out_planes, kernel_size=3, stride=stride, padding=1, bias=True) 18 | 19 | def conv_init(m): 20 | classname = m.__class__.__name__ 21 | if classname.find('Conv') != -1: 22 | init.xavier_uniform(m.weight, gain=np.sqrt(2)) 23 | init.constant(m.bias, 0) 24 | elif classname.find('BatchNorm') != -1: 25 | init.constant(m.weight, 1) 26 | init.constant(m.bias, 0) 27 | 28 | class wide_basic(nn.Module): 29 | def __init__(self, in_planes, planes, dropout_rate, stride=1): 30 | super(wide_basic, self).__init__() 31 | self.bn1 = nn.BatchNorm2d(in_planes) 32 | self.conv1 = nn.Conv2d(in_planes, planes, kernel_size=3, padding=1, bias=True) 33 | self.dropout = nn.Dropout(p=dropout_rate) 34 | self.bn2 = nn.BatchNorm2d(planes) 35 | self.conv2 = nn.Conv2d(planes, planes, kernel_size=3, stride=stride, padding=1, bias=True) 36 | 37 | self.shortcut = nn.Sequential() 38 | if stride != 1 or in_planes != planes: 39 | self.shortcut = nn.Sequential( 40 | nn.Conv2d(in_planes, planes, kernel_size=1, stride=stride, bias=True), 41 | ) 42 | 43 | def forward(self, x): 44 | out = self.dropout(self.conv1(F.relu(self.bn1(x)))) 45 | out = self.conv2(F.relu(self.bn2(out))) 46 | out += self.shortcut(x) 47 | 48 | return out 49 | 50 | class Wide_ResNet(nn.Module): 51 | def __init__(self, depth, widen_factor, dropout_rate, num_classes): 52 | super(Wide_ResNet, self).__init__() 53 | self.in_planes = 16 54 | 55 | assert ((depth-4)%6 ==0), 'Wide-resnet depth should be 6n+4' 56 | n = int((depth-4)/6) 57 | k = widen_factor 58 | 59 | nStages = [16, 16*k, 32*k, 64*k] 60 | 61 | self.conv1 = conv3x3(3,nStages[0]) #TODO: authors use stride=2, padding=3 in first convolution 62 | self.layer1 = self._wide_layer(wide_basic, nStages[1], n, dropout_rate, stride=1) 63 | self.layer2 = self._wide_layer(wide_basic, nStages[2], n, dropout_rate, stride=2) 64 | self.layer3 = self._wide_layer(wide_basic, nStages[3], n, dropout_rate, stride=2) 65 | self.bn1 = nn.BatchNorm2d(nStages[3], momentum=0.9) 66 | self.linear = nn.Linear(nStages[3], num_classes) 67 | self.apply(conv_init) 68 | 69 | def _wide_layer(self, block, planes, num_blocks, dropout_rate, stride): 70 | strides = [stride] + [1]*(num_blocks-1) 71 | layers = [] 72 | 73 | for stride in strides: 74 | layers.append(block(self.in_planes, planes, dropout_rate, stride)) 75 | self.in_planes = planes 76 | 77 | return nn.Sequential(*layers) 78 | 79 | def forward(self, x): 80 | out = self.conv1(x) #TODO: after first layer they use relu and maxpool2d with parameters 3, 2, 1 81 | out = self.layer1(out) 82 | out 
= self.layer2(out) 83 | out = self.layer3(out) 84 | out = F.relu(self.bn1(out)) 85 | out = F.avg_pool2d(out, 8) 86 | out = out.view(out.size(0), -1) 87 | out = self.linear(out) 88 | 89 | return out -------------------------------------------------------------------------------- /cnn_models/wide_resnet_imagenet.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | import torch.nn.init as init 4 | import torch.nn.functional as F 5 | from torch.autograd import Variable 6 | 7 | import sys 8 | import numpy as np 9 | 10 | #TODO: Some of the things are not equal to the model definition (from the authors) 11 | # which is here: https://github.com/szagoruyko/functional-zoo/blob/master/wide-resnet-50-2-export.ipynb 12 | 13 | #code taken from https://github.com/meliketoy/wide-resnet.pytorch 14 | 15 | def conv_init(m): 16 | classname = m.__class__.__name__ 17 | if classname.find('Conv') != -1: 18 | init.xavier_uniform(m.weight, gain=np.sqrt(2)) 19 | init.constant(m.bias, 0) 20 | elif classname.find('BatchNorm') != -1: 21 | init.constant(m.weight, 1) 22 | init.constant(m.bias, 0) 23 | 24 | class wide_bottleneck_straightned(nn.Module): 25 | #Bottlebeck because has the structure 1x1 conv, 3x3 conv, 1x1 conv 26 | #straightned because the dimensions are the same for all convolutions 27 | def __init__(self, in_planes, planes, dropout_rate, stride=1, use_residual=True): 28 | super(wide_bottleneck_straightned, self).__init__() 29 | self.bn1 = nn.BatchNorm2d(in_planes) 30 | self.conv1 = nn.Conv2d(in_planes, planes, kernel_size=1, stride=1, padding=0, bias=True) 31 | self.dropout1 = nn.Dropout(p=dropout_rate) 32 | self.bn2 = nn.BatchNorm2d(planes) 33 | stride3conv = (not use_residual) and stride or 1 34 | self.conv2 = nn.Conv2d(planes, planes, kernel_size=3, stride=stride3conv, padding=1, bias=True) 35 | self.dropout2 = nn.Dropout(p=dropout_rate) 36 | self.bn3 = nn.BatchNorm2d(planes) 37 | self.conv3 = nn.Conv2d(planes, planes, kernel_size=1, stride=1, padding=0, bias=True) 38 | self.use_residual = use_residual 39 | if not self.use_residual: 40 | self.conv_dim = nn.Conv2d(in_planes, planes, kernel_size=1, stride=stride, padding=0, bias=True) 41 | 42 | def forward(self, input): 43 | x = input 44 | x = self.dropout1(self.conv1(F.relu(self.bn1(x)))) 45 | x = self.dropout2(self.conv2(F.relu(self.bn2(x)))) 46 | x = self.conv3(F.relu((self.bn3(x)))) 47 | if self.use_residual: 48 | x += input 49 | else: 50 | x += self.conv_dim(input) 51 | x = F.relu(x) 52 | return x 53 | 54 | class Wide_ResNet_imagenet(nn.Module): 55 | def __init__(self, depth, widen_factor, dropout_rate, num_classes): 56 | super(Wide_ResNet_imagenet, self).__init__() 57 | 58 | assert ((depth-5)%12 ==0), 'Wide-resnet depth should be 12n+5' 59 | n = int((depth-5)/12) 60 | k = widen_factor 61 | 62 | nStages = [16*k, 32*k, 64*k, 128*k, 256*k] 63 | self.in_planes = nStages[0] 64 | 65 | self.conv1 = nn.Conv2d(3, nStages[0], 7, stride=2, padding=3, bias=True) 66 | self.layer1 = self._wide_layer(wide_bottleneck_straightned, nStages[1], n, dropout_rate, stride=1) 67 | self.layer2 = self._wide_layer(wide_bottleneck_straightned, nStages[2], n, dropout_rate, stride=2) 68 | self.layer3 = self._wide_layer(wide_bottleneck_straightned, nStages[3], n, dropout_rate, stride=2) 69 | self.layer4 = self._wide_layer(wide_bottleneck_straightned, nStages[4], n, dropout_rate, stride=2) 70 | self.bn1 = nn.BatchNorm2d(nStages[4], momentum=0.9) 71 | self.linear = nn.Linear(nStages[4], num_classes) 72 | 
self.apply(conv_init) 73 | 74 | def _wide_layer(self, block, planes, num_blocks, dropout_rate, stride): 75 | layers = [] 76 | 77 | for idx_block in range(num_blocks): 78 | use_residual = idx_block != 0 79 | layers.append(block(self.in_planes, planes, dropout_rate, stride, use_residual)) 80 | self.in_planes = planes 81 | 82 | return nn.Sequential(*layers) 83 | 84 | def forward(self, x): 85 | out = F.relu(self.conv1(x)) 86 | out = F.max_pool2d(out, kernel_size=3, stride=2, padding=1) 87 | 88 | out = self.layer1(out) 89 | out = self.layer2(out) 90 | out = self.layer3(out) 91 | out = self.layer4(out) 92 | out = F.relu(self.bn1(out)) 93 | out = F.avg_pool2d(out, kernel_size=7, stride=1, padding=0) 94 | out = out.view(out.size(0), -1) 95 | out = self.linear(out) 96 | 97 | return out -------------------------------------------------------------------------------- /datasets/CIFAR10.py: -------------------------------------------------------------------------------- 1 | import os 2 | import torchvision 3 | import torchvision.transforms as vision_transforms 4 | import datasets 5 | import torch 6 | import datasets.torchvision_extension as vision_transforms_extension 7 | import numpy as np 8 | 9 | try: 10 | import matplotlib.pyplot as plt 11 | except: 12 | print('Cannot import matplotlib. CIFAR10.save_img method will crash if used') 13 | 14 | meanstd = { 15 | 'mean':[0.5, 0.5, 0.5], 16 | 'std': [0.5, 0.5, 0.5], 17 | } 18 | 19 | class CIFAR10(object): 20 | def __init__(self, dataFolder=None, pin_memory=False): 21 | 22 | self.dataFolder = dataFolder if dataFolder is not None else os.path.join(datasets.BASE_DATA_FOLDER, 'CIFAR10') 23 | self.pin_memory = pin_memory 24 | self.meanStd = meanstd 25 | 26 | #download the dataset 27 | torchvision.datasets.CIFAR10(self.dataFolder, download=True) 28 | 29 | #add some useful metadata 30 | self.classes = ('plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck') 31 | 32 | def getTrainLoader(self, batch_size, shuffle=True, num_workers=1, checkFileIntegrity=False): 33 | 34 | #first we define the training transform we will apply to the dataset 35 | listOfTransoforms = [] 36 | listOfTransoforms.append(vision_transforms.RandomCrop((32, 32), padding=4)) 37 | listOfTransoforms.append(vision_transforms.RandomHorizontalFlip()) 38 | # 39 | # listOfTransoforms.append(vision_transforms.ColorJitter(brightness=0.4, 40 | # contrast=0.4, 41 | # saturation=0.4)) 42 | listOfTransoforms.append(vision_transforms.ToTensor()) 43 | # TODO: TO make this work I need the pca values, i.e. eigenvalues and eigenvectors 44 | # of the RGB colors, computed on a subset of the cifar10 dataset. 
45 | # try to implement this at some point 46 | # listOfTransoforms.append(vision_transforms_extension.Lighting(alphastd=0.1, 47 | # eigval=self.pca['eigval'], 48 | # eigvec=self.pca['eigvec'])) 49 | listOfTransoforms.append(vision_transforms.Normalize(mean=self.meanStd['mean'], 50 | std=self.meanStd['std'])) 51 | 52 | train_transform = vision_transforms.Compose(listOfTransoforms) 53 | 54 | #define the trainset 55 | trainset = torchvision.datasets.CIFAR10(root=self.dataFolder, train=True, 56 | download=checkFileIntegrity, transform=train_transform) 57 | trainloader = torch.utils.data.DataLoader(trainset, batch_size=batch_size, shuffle=shuffle, 58 | num_workers=num_workers, pin_memory=self.pin_memory) 59 | 60 | return trainloader 61 | 62 | def getTestLoader(self, batch_size, shuffle=True, num_workers=1, checkFileIntegrity=False): 63 | 64 | listOfTransoforms = [vision_transforms.ToTensor()] 65 | listOfTransoforms.append(vision_transforms.Normalize(mean=self.meanStd['mean'], 66 | std=self.meanStd['std'])) 67 | 68 | test_transform = vision_transforms.Compose(listOfTransoforms) 69 | 70 | testset = torchvision.datasets.CIFAR10(root=self.dataFolder, train=False, 71 | download=checkFileIntegrity, transform=test_transform) 72 | testloader = torch.utils.data.DataLoader(testset, batch_size=batch_size, shuffle=shuffle, 73 | num_workers=num_workers, pin_memory=self.pin_memory) 74 | 75 | return testloader 76 | 77 | @staticmethod 78 | def save_img(img, path_file): 79 | try: 80 | img = img.data #in case a variable is passed 81 | except:pass 82 | mean_ = meanstd['mean'] 83 | std_ = meanstd['std'] 84 | meanDivStd = [-mean_[idx]/std_[idx] for idx in range(len(mean_))] 85 | inv_std = [1/std_[idx] for idx in range(len(std_))] 86 | img = vision_transforms.Normalize(meanDivStd, inv_std)(img) 87 | npimg = img.cpu().numpy() 88 | plt.imsave(path_file, np.transpose(npimg, (1, 2, 0))) -------------------------------------------------------------------------------- /datasets/CIFAR100.py: -------------------------------------------------------------------------------- 1 | import os 2 | import torchvision 3 | import torchvision.transforms as vision_transforms 4 | import datasets 5 | import torch 6 | import datasets.torchvision_extension as vision_transforms_extension 7 | 8 | 9 | meanstd = { 10 | 'mean': [0.5071, 0.4867, 0.4408], 11 | 'std': [0.2675, 0.2565, 0.2761], 12 | } 13 | 14 | class CIFAR100(object): 15 | def __init__(self, dataFolder=None, pin_memory=False): 16 | 17 | self.dataFolder = dataFolder if dataFolder is not None else os.path.join(datasets.BASE_DATA_FOLDER, 'CIFAR100') 18 | self.pin_memory = pin_memory 19 | self.meanStd = meanstd 20 | 21 | #download the dataset 22 | torchvision.datasets.CIFAR100(self.dataFolder, download=True) 23 | 24 | def getTrainLoader(self, batch_size, shuffle=True, num_workers=1, checkFileIntegrity=False): 25 | 26 | #first we define the training transform we will apply to the dataset 27 | listOfTransoforms = [] 28 | listOfTransoforms.append(vision_transforms.RandomCrop((32, 32), padding=4)) 29 | listOfTransoforms.append(vision_transforms.RandomHorizontalFlip()) 30 | # listOfTransoforms.append(vision_transforms.ColorJitter(brightness=0.4, 31 | # contrast=0.4, 32 | # saturation=0.4)) 33 | listOfTransoforms.append(vision_transforms.ToTensor()) 34 | listOfTransoforms.append(vision_transforms.Normalize(mean=self.meanStd['mean'], 35 | std=self.meanStd['std'])) 36 | 37 | train_transform = vision_transforms.Compose(listOfTransoforms) 38 | 39 | #define the trainset 40 | trainset = 
torchvision.datasets.CIFAR100(root=self.dataFolder, train=True, 41 | download=checkFileIntegrity, transform=train_transform) 42 | trainloader = torch.utils.data.DataLoader(trainset, batch_size=batch_size, shuffle=shuffle, 43 | num_workers=num_workers, pin_memory=self.pin_memory) 44 | 45 | return trainloader 46 | 47 | def getTestLoader(self, batch_size, shuffle=True, num_workers=1, checkFileIntegrity=False): 48 | 49 | listOfTransoforms = [vision_transforms.ToTensor()] 50 | listOfTransoforms.append(vision_transforms.Normalize(mean=self.meanStd['mean'], 51 | std=self.meanStd['std'])) 52 | 53 | test_transform = vision_transforms.Compose(listOfTransoforms) 54 | 55 | testset = torchvision.datasets.CIFAR100(root=self.dataFolder, train=False, 56 | download=checkFileIntegrity, transform=test_transform) 57 | testloader = torch.utils.data.DataLoader(testset, batch_size=batch_size, shuffle=shuffle, 58 | num_workers=num_workers, pin_memory=self.pin_memory) 59 | 60 | return testloader -------------------------------------------------------------------------------- /datasets/ImageNet12.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torchvision 3 | import torchvision.transforms as vision_transforms 4 | import torch.utils.data 5 | import datasets.torchvision_extension as vision_transforms_extension 6 | 7 | #For this dataset, automatic download has not been implemented. You have to provide path to the train and test folders 8 | #formatted as described here: http://pytorch.org/docs/master/torchvision/datasets.html#imagefolder 9 | #Essentially images in the same class must be in the same folder with the name of the class, like so: 10 | # root/dog/xxx.png 11 | # root/dog/xxy.png 12 | # root/dog/xxz.png 13 | # 14 | # root/cat/123.png 15 | # root/cat/nsdf3.png 16 | # root/cat/asd932_.png 17 | 18 | #To prepare the imagenet2012 dataset in such a way, follow the instructions at 19 | # https://github.com/soumith/imagenet-multiGPU.torch 20 | # It says: 21 | # The training images for imagenet are already in appropriate subfolders (like n07579787, n07880968). 22 | # You need to get the validation groundtruth and move the validation images into appropriate subfolders. 23 | # To do this, download ILSVRC2012_img_train.tar ILSVRC2012_img_val.tar and use the following commands: 24 | # extract train data 25 | # mkdir train && mv ILSVRC2012_img_train.tar train/ && cd train 26 | # tar -xvf ILSVRC2012_img_train.tar && rm -f ILSVRC2012_img_train.tar 27 | # find . -name "*.tar" | while read NAME ; do mkdir -p "${NAME%.tar}"; tar -xvf "${NAME}" -C "${NAME%.tar}"; rm -f "${NAME}"; done 28 | # # extract validation data 29 | # cd ../ && mkdir val && mv ILSVRC2012_img_val.tar val/ && cd val && tar -xvf ILSVRC2012_img_val.tar 30 | # wget -qO- https://raw.githubusercontent.com/soumith/imagenetloader.torch/master/valprep.sh | bash 31 | 32 | #Now you're all set! 
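# A minimal usage sketch (the paths below are placeholders; in this repository the class
# is normally instantiated from the top-level experiment scripts):
#
#   imageNet12 = ImageNet12('/data/imagenet12/train', '/data/imagenet12/val',
#                           type_of_data_augmentation='extended')
#   train_loader = imageNet12.getTrainLoader(batch_size=256)
#   test_loader = imageNet12.getTestLoader(batch_size=256, shuffle=False)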
33 | 34 | 35 | #Computed from random subset of ImageNet training images 36 | #This values were taken from the fb.resnet github project linked above 37 | meanstd = { 38 | 'mean':[0.485, 0.456, 0.406], 39 | 'std': [0.229, 0.224, 0.225], 40 | } 41 | 42 | pca = { 43 | 'eigval': torch.Tensor([0.2175, 0.0188, 0.0045]), 44 | 'eigvec': torch.Tensor([ 45 | [-0.5675, 0.7192, 0.4009], 46 | [-0.5808, -0.0045, -0.8140], 47 | [-0.5836, -0.6948, 0.4203], 48 | ]) 49 | } 50 | 51 | class ImageNet12(object): 52 | def __init__(self, trainFolder, testFolder, pin_memory=False, size_images=224, 53 | scaled_size=256, type_of_data_augmentation='basic', already_scaled=False): 54 | self.trainFolder = trainFolder 55 | self.testFolder = testFolder 56 | self.pin_memory = pin_memory 57 | self.meanstd = meanstd 58 | self.pca = pca 59 | #images will be rescaled to match this size 60 | if not isinstance(size_images, int): 61 | raise ValueError('size_images must be an int. It will be scaled to a square image') 62 | self.size_images = size_images 63 | self.scaled_size = scaled_size 64 | type_of_data_augmentation = type_of_data_augmentation.lower() 65 | if type_of_data_augmentation not in ('basic', 'extended'): 66 | raise ValueError('type_of_data_augmentation must be either basic or extended') 67 | self.type_of_data_augmentation = type_of_data_augmentation 68 | self.already_scaled = already_scaled # if you scaled all the images before training (see link above) 69 | # then set this to True 70 | 71 | def getTrainLoader(self, batch_size, shuffle=True, num_workers=4): 72 | 73 | # first we define the training transform we will apply to the dataset 74 | list_of_transforms = [] 75 | list_of_transforms.append(vision_transforms.RandomSizedCrop(self.size_images)) 76 | list_of_transforms.append(vision_transforms.RandomHorizontalFlip()) 77 | 78 | if self.type_of_data_augmentation == 'extended': 79 | list_of_transforms.append(vision_transforms.ColorJitter(brightness=0.4, 80 | contrast=0.4, 81 | saturation=0.4)) 82 | list_of_transforms.append(vision_transforms.ToTensor()) 83 | if self.type_of_data_augmentation == 'extended': 84 | list_of_transforms.append(vision_transforms_extension.Lighting(alphastd=0.1, 85 | eigval=self.pca['eigval'], 86 | eigvec=self.pca['eigvec'])) 87 | 88 | list_of_transforms.append(vision_transforms.Normalize(mean=self.meanstd['mean'], 89 | std=self.meanstd['std'])) 90 | train_transform = vision_transforms.Compose(list_of_transforms) 91 | train_set = torchvision.datasets.ImageFolder(self.trainFolder, train_transform) 92 | train_loader = torch.utils.data.DataLoader(train_set, batch_size=batch_size, shuffle=shuffle, 93 | num_workers=num_workers, pin_memory=self.pin_memory) 94 | 95 | return train_loader 96 | 97 | def getTestLoader(self, batch_size, shuffle=True, num_workers=4): 98 | # first we define the training transform we will apply to the dataset 99 | list_of_transforms = [] 100 | if self.already_scaled is False: 101 | list_of_transforms.append(vision_transforms.Resize(self.scaled_size)) 102 | list_of_transforms.append(vision_transforms.CenterCrop(self.size_images)) 103 | list_of_transforms.append(vision_transforms.ToTensor()) 104 | list_of_transforms.append(vision_transforms.Normalize(mean=self.meanstd['mean'], 105 | std=self.meanstd['std'])) 106 | 107 | test_transform = vision_transforms.Compose(list_of_transforms) 108 | 109 | test_set = torchvision.datasets.ImageFolder(self.testFolder, test_transform) 110 | test_loader = torch.utils.data.DataLoader(test_set, batch_size=batch_size, shuffle=shuffle, 111 | 
num_workers=num_workers, pin_memory=self.pin_memory) 112 | 113 | return test_loader -------------------------------------------------------------------------------- /datasets/MNIST.py: -------------------------------------------------------------------------------- 1 | import os 2 | import torchvision 3 | import torchvision.transforms as vision_transforms 4 | import datasets 5 | import torch 6 | import datasets.torchvision_extension as vision_transforms_extension 7 | 8 | 9 | meanstd = { 10 | 'mean':(0.1307,), 11 | 'std': (0.3081,), 12 | } 13 | 14 | class MNIST(object): 15 | def __init__(self, dataFolder=None, pin_memory=False): 16 | 17 | self.dataFolder = dataFolder if dataFolder is not None else os.path.join(datasets.BASE_DATA_FOLDER, 'MNIST') 18 | self.pin_memory = pin_memory 19 | self.meanStd = meanstd 20 | 21 | #download the dataset 22 | torchvision.datasets.MNIST(self.dataFolder, download=True) 23 | 24 | def getTrainLoader(self, batch_size, shuffle=True, num_workers=1, checkFileIntegrity=False): 25 | 26 | #first we define the training transform we will apply to the dataset 27 | listOfTransoforms = [] 28 | listOfTransoforms.append(vision_transforms.ToTensor()) 29 | listOfTransoforms.append(vision_transforms.Normalize(mean=self.meanStd['mean'], 30 | std=self.meanStd['std'])) 31 | train_transform = vision_transforms.Compose(listOfTransoforms) 32 | 33 | #define the trainset 34 | trainset = torchvision.datasets.MNIST(root=self.dataFolder, train=True, 35 | download=checkFileIntegrity, transform=train_transform) 36 | trainloader = torch.utils.data.DataLoader(trainset, batch_size=batch_size, shuffle=shuffle, 37 | num_workers=num_workers, pin_memory=self.pin_memory) 38 | 39 | return trainloader 40 | 41 | def getTestLoader(self, batch_size, shuffle=True, num_workers=1, checkFileIntegrity=False): 42 | 43 | listOfTransoforms = [vision_transforms.ToTensor()] 44 | listOfTransoforms.append(vision_transforms.Normalize(mean=self.meanStd['mean'], 45 | std=self.meanStd['std'])) 46 | 47 | test_transform = vision_transforms.Compose(listOfTransoforms) 48 | testset = torchvision.datasets.MNIST(root=self.dataFolder, train=False, 49 | download=checkFileIntegrity, transform=test_transform) 50 | testloader = torch.utils.data.DataLoader(testset, batch_size=batch_size, shuffle=shuffle, 51 | num_workers=num_workers, pin_memory=self.pin_memory) 52 | 53 | return testloader -------------------------------------------------------------------------------- /datasets/PennTreeBank.py: -------------------------------------------------------------------------------- 1 | import os 2 | import datasets 3 | import torch 4 | import pickle 5 | import urllib 6 | import shutil 7 | import numpy as np 8 | import helpers.functions as mhf 9 | 10 | class PennTreeBank(object): 11 | def __init__(self, dataFolder=None): 12 | self.dataFolder = dataFolder if dataFolder is not None else os.path.join(datasets.BASE_DATA_FOLDER, 'PennTreeBank') 13 | self.dictionary = Dictionary() 14 | try: 15 | os.mkdir(self.dataFolder) 16 | except:pass 17 | 18 | self.trainFilePath = os.path.join(self.dataFolder, 'train.txt') 19 | self.testFilePath = os.path.join(self.dataFolder, 'test.txt') 20 | self.validFilePath = os.path.join(self.dataFolder, 'valid.txt') 21 | 22 | self.trainSetPath = os.path.join(self.dataFolder, 'trainSet') 23 | self.testSetPath = os.path.join(self.dataFolder, 'testSet') 24 | self.validSetPath = os.path.join(self.dataFolder, 'validSet') 25 | self.dictionaryPath = os.path.join(self.dataFolder, 'dictionary') 26 | 27 | checkProcessedFiles = 
self.checkProcessedFiles() 28 | 29 | if (not self.checkDataFiles()) and (not checkProcessedFiles): 30 | #download the files from pytorch example folder, but only if the data are not there and not even the 31 | #processed files 32 | baseUrl = 'https://raw.githubusercontent.com/pytorch/examples/master/word_language_model/data/penn/' 33 | trainUrl = baseUrl + 'train.txt' 34 | testUrl = baseUrl + 'test.txt' 35 | validUrl = baseUrl + 'valid.txt' 36 | 37 | for pathToSave, urlDownload in zip([self.trainFilePath, self.testFilePath, self.validFilePath], 38 | [trainUrl, testUrl, validUrl]): 39 | print('Downloading {} to {}'.format(urlDownload, pathToSave)) 40 | with urllib.request.urlopen(urlDownload) as response, open(pathToSave, 'wb') as out_file: 41 | shutil.copyfileobj(response, out_file) 42 | 43 | print('Files downloaded') 44 | else: 45 | print('Files already downloaded') 46 | 47 | if not checkProcessedFiles: 48 | print('Processing files') 49 | 50 | self.trainSet = self.tokenize(self.trainFilePath) 51 | self.testSet = self.tokenize(self.testFilePath) 52 | self.validSet = self.tokenize(self.validFilePath) 53 | 54 | for pathToSave, dataset in zip([self.trainSetPath, self.testSetPath, self.validSetPath], 55 | [self.trainSet, self.testSet, self.validSet]): 56 | with open(pathToSave, 'wb') as f: 57 | torch.save(dataset, f) 58 | 59 | with open(self.dictionaryPath, 'wb') as f: 60 | pickle.dump(self.dictionary, f) 61 | 62 | print('Files processed') 63 | else: 64 | with open(self.dictionaryPath, 'rb') as f: 65 | self.dictionary = pickle.load(f) 66 | 67 | with open(self.trainSetPath, 'rb') as f: 68 | self.trainSet = torch.load(f) 69 | 70 | with open(self.testSetPath, 'rb') as f: 71 | self.testSet = torch.load(f) 72 | 73 | with open(self.validSetPath, 'rb') as f: 74 | self.validSet = torch.load(f) 75 | 76 | print('Files already processed') 77 | 78 | def getTrainLoader(self, batch_size, length_sequence, force_same_size_batch=False): 79 | return self.getDataLoader('train', batch_size, length_sequence, shuffle=True, 80 | force_same_size_batch=force_same_size_batch) 81 | 82 | def getTestLoader(self, batch_size, length_sequence, force_same_size_batch=False): 83 | return self.getDataLoader('test', batch_size,length_sequence, shuffle=False, 84 | force_same_size_batch=force_same_size_batch) 85 | 86 | def getValidLoader(self, batch_size, length_sequence, force_same_size_batch=False): 87 | return self.getDataLoader('valid', batch_size, length_sequence, shuffle=False, 88 | force_same_size_batch=force_same_size_batch) 89 | 90 | def getDataLoader(self, type, batch_size, length_sequence, shuffle=True, force_same_size_batch=False): 91 | 92 | if type == 'train': 93 | data = self.trainSet 94 | elif type == 'test': 95 | data = self.testSet 96 | elif type == 'valid': 97 | data = self.validSet 98 | else: 99 | raise ValueError('Invalid type. 
It must be "train", "test" or "valid"') 100 | 101 | if data.ndimension() != 1: 102 | raise ValueError('Data in input must be a vector') 103 | 104 | length_data = data.size(0) 105 | total_amount_data = length_data - length_sequence - 1 106 | 107 | def loadIter(): 108 | 109 | if shuffle: 110 | allIndices = list(range(length_data - length_sequence)) 111 | np.random.shuffle(allIndices) 112 | 113 | countNumData = 0 114 | while True: 115 | if countNumData + batch_size < total_amount_data: 116 | dimCurrBatch = batch_size 117 | else: 118 | if force_same_size_batch is True: 119 | break 120 | dimCurrBatch = total_amount_data - countNumData 121 | 122 | currBatchData = torch.LongTensor(dimCurrBatch, length_sequence).zero_() 123 | if currBatchData.type() != data.type(): 124 | currBatchData.type_as(data) 125 | 126 | currBatchTarget = torch.LongTensor(dimCurrBatch, length_sequence).zero_() 127 | if currBatchTarget.type() != data.type(): 128 | currBatchTarget.type_as(data) 129 | 130 | for j in range(countNumData, countNumData + dimCurrBatch): 131 | idxToUse = allIndices[j] if shuffle is True else j 132 | currBatchData[j-countNumData, :] = data[idxToUse:idxToUse+length_sequence] 133 | currBatchTarget[j-countNumData, :] = data[idxToUse+1:idxToUse+length_sequence+1] 134 | 135 | yield currBatchData, currBatchTarget 136 | 137 | countNumData = countNumData + dimCurrBatch 138 | if countNumData >= total_amount_data: 139 | break 140 | 141 | dataLoader = mhf.DataLoader(loadIter, total_amount_data, batch_size, shuffled=shuffle, 142 | length_sequence=length_sequence, 143 | force_same_size_batch=force_same_size_batch) 144 | return dataLoader 145 | 146 | def checkDataFiles(self): 147 | return all(os.path.isfile(x) for x in [self.trainFilePath, self.testFilePath, 148 | self.validFilePath]) 149 | 150 | def checkProcessedFiles(self): 151 | return all(os.path.isfile(x) for x in [self.trainSetPath, self.testSetPath, 152 | self.validSetPath, self.dictionaryPath]) 153 | 154 | def tokenize(self, path): 155 | 156 | ''' Tokenizes a text file ''' 157 | 158 | assert os.path.exists(path) 159 | 160 | # Add words to the dictionary 161 | with open(path, 'r') as f: 162 | tokens = 0 163 | for line in f: 164 | words = line.split() + [''] 165 | tokens += len(words) 166 | for word in words: 167 | self.dictionary.add_word(word) 168 | 169 | # Tokenize file content 170 | with open(path, 'r') as f: 171 | ids = torch.LongTensor(tokens) 172 | token = 0 173 | for line in f: 174 | words = line.split() + [''] 175 | for word in words: 176 | ids[token] = self.dictionary.word2idx[word] 177 | token += 1 178 | 179 | return ids 180 | 181 | class Dictionary(object): 182 | 183 | 'Helper class for PennTreeBank dataset ' 184 | 185 | def __init__(self): 186 | self.word2idx = {} 187 | self.idx2word = [] 188 | 189 | def add_word(self, word): 190 | if word not in self.word2idx: 191 | self.idx2word.append(word) 192 | self.word2idx[word] = len(self.idx2word) - 1 193 | return self.word2idx[word] 194 | 195 | def __len__(self): 196 | return len(self.idx2word) 197 | 198 | -------------------------------------------------------------------------------- /datasets/__init__.py: -------------------------------------------------------------------------------- 1 | import os 2 | import sys 3 | _currDir = os.path.dirname(os.path.abspath(__file__)) 4 | BASE_DATA_FOLDER = os.path.join(_currDir, 'saved_datasets') 5 | PATH_PERL_SCRIPTS_FOLDER = os.path.abspath(os.path.join(_currDir, '..', 'perl_scripts')) 6 | 7 | try: 8 | os.mkdir(BASE_DATA_FOLDER) 9 | except:pass 10 | 11 | from 
.CIFAR10 import CIFAR10 12 | from .CIFAR100 import CIFAR100 13 | from .PennTreeBank import PennTreeBank 14 | from .ImageNet12 import ImageNet12 15 | from .translation_datasets import multi30k_DE_EN, onmt_integ_dataset, WMT13_DE_EN 16 | from .MNIST import MNIST 17 | from .customs_datasets import LoadingTensorsDataset 18 | 19 | __all__ = ('CIFAR10', 'PennTreeBank', 'WMT13_DE_EN', 'ImageNet12', 'multi30k_DE_EN', 20 | 'onmt_integ_dataset', 'CIFAR100', 'MNIST', 'LoadingTensorsDataset') -------------------------------------------------------------------------------- /datasets/customs_datasets.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import helpers.functions as mhf 3 | import numpy as np 4 | 5 | class LoadingTensorsDataset: 6 | 7 | 'A simple loading dataset - loads the tensor that are passed in input' 8 | 9 | def __init__(self, path_train_data, path_test_data): 10 | 11 | self.trainData = torch.load(path_train_data) 12 | self.testData = torch.load(path_test_data) 13 | 14 | def get_train_loader(self, batch_size): 15 | return self.get_data_loader('train', batch_size, shuffle=True) 16 | 17 | def get_test_loader(self, batch_size): 18 | return self.get_data_loader('test', batch_size, shuffle=False) 19 | 20 | def get_data_loader(self, type, batch_size, shuffle=False): 21 | if batch_size <= 0: 22 | raise ValueError('batch size must be bigger than zero') 23 | 24 | if type == 'train': 25 | dataset, labels = self.trainData 26 | elif type == 'test': 27 | dataset, labels = self.testData 28 | else: raise ValueError('Invalid value for type') 29 | 30 | total_amount_data = dataset.size(0) 31 | 32 | def loadIter(): 33 | 34 | if shuffle: 35 | allIndices = list(range(total_amount_data)) #TODO: This is stupidily inefficient. Change when have time 36 | np.random.shuffle(allIndices) 37 | 38 | currIdx = 0 39 | while True: 40 | 41 | if currIdx + batch_size > total_amount_data: 42 | currData = dataset[currIdx:total_amount_data, :] 43 | currLabels = labels[currIdx:total_amount_data] 44 | yield currData, currLabels 45 | break 46 | 47 | currData = dataset[currIdx:currIdx+batch_size, :] 48 | currLabels = labels[currIdx:currIdx+batch_size] 49 | yield currData, currLabels 50 | 51 | currIdx += batch_size 52 | 53 | dataLoader = mhf.DataLoader(loadIter, total_amount_data, batch_size, shuffled=shuffle) 54 | 55 | return dataLoader -------------------------------------------------------------------------------- /datasets/torchvision_extension.py: -------------------------------------------------------------------------------- 1 | #In this file some more transformations (apart from the ones defined in torchvision.transform) 2 | #are added. Particularly helpful to train imagenet, and in the style of the transforms 3 | #used by fb.resnet https://github.com/facebook/fb.resnet.torch/blob/master/datasets/imagenet.lua 4 | 5 | #This file is taken from a proposed pull request on the torchvision github project. 6 | #At the moment this pull request has not been accepted yet, that is why I report it here. 
7 | #Link to the pull request: https://github.com/pytorch/vision/pull/27/files 8 | 9 | class Lighting(object): 10 | 11 | """Lighting noise(AlexNet - style PCA - based noise)""" 12 | 13 | def __init__(self, alphastd, eigval, eigvec): 14 | self.alphastd = alphastd 15 | self.eigval = eigval 16 | self.eigvec = eigvec 17 | 18 | def __call__(self, img): 19 | # img is supposed go be a torch tensor 20 | 21 | if self.alphastd == 0: 22 | return img 23 | 24 | alpha = img.new().resize_(3).normal_(0, self.alphastd) 25 | rgb = self.eigvec.type_as(img).clone()\ 26 | .mul(alpha.view(1, 3).expand(3, 3))\ 27 | .mul(self.eigval.view(1, 3).expand(3, 3))\ 28 | .sum(1).squeeze() 29 | 30 | return img.add(rgb.view(3, 1, 1).expand_as(img)) 31 | -------------------------------------------------------------------------------- /helpers/functions.py: -------------------------------------------------------------------------------- 1 | import time 2 | import smtplib 3 | import torch 4 | import pickle 5 | import os 6 | import tarfile 7 | from email.mime.text import MIMEText 8 | from collections import namedtuple 9 | from collections import OrderedDict 10 | import functools 11 | import numpy as np 12 | import torch.nn as nn 13 | from torch.autograd import Variable 14 | import math 15 | import quantization.help_functions as qhf 16 | 17 | 18 | USE_CUDA = torch.cuda.is_available() 19 | 20 | 21 | def rsetattr(obj, attr, val): 22 | 'recurrent setattr' 23 | 24 | pre, _, post = attr.rpartition('.') 25 | return setattr(rgetattr(obj, pre) if pre else obj, post, val) 26 | 27 | 28 | sentinel = object() 29 | 30 | 31 | def rgetattr(obj, attr, default=sentinel): 32 | 'recurrent getattr' 33 | 34 | if default is sentinel: 35 | _getattr = getattr 36 | else: 37 | def _getattr(obj, name): 38 | return getattr(obj, name, default) 39 | return functools.reduce(_getattr, [obj] + attr.split('.')) 40 | 41 | def read_email_info_from_file(filepath): 42 | 43 | ''' 44 | read email username and password from a file. The format is simply 45 | 46 | email_account::: emailAccount@account.com 47 | password::: password_of_email_account 48 | 49 | ''' 50 | email_account_flag = 'email_account::: ' 51 | password_flag = 'password::: ' 52 | 53 | with open(filepath, 'r') as p: 54 | for line in p.readlines(): 55 | line = line.rstrip() 56 | if line.startswith(email_account_flag): 57 | email_account = line[len(email_account_flag):] 58 | if '@' not in email_account: 59 | raise ValueError('File badly formatted; missing "@" in email account') 60 | elif line.startswith('password::: '): 61 | password = line[len(password_flag):] 62 | else: 63 | raise ValueError('File badly formatted; wrong line identificators. 
' 64 | 'Lines should start with "{}" and "{}"'.format(email_account_flag, password_flag)) 65 | 66 | try: 67 | return email_account, password 68 | except: 69 | raise ValueError('File badly formatted, missing email or password information') 70 | 71 | def send_email_yandex(username, password, targets, subject, message, verbose=True): 72 | try: 73 | smtp_ssl_host = 'smtp.yandex.com' 74 | smtp_ssl_port = 465 75 | email_suffix = '@yandex.com' 76 | if email_suffix in username: 77 | username = username 78 | else: 79 | if '@' in username: 80 | raise ValueError('This does not appear to be a yandex email account') 81 | username = username + email_suffix 82 | 83 | if isinstance(targets, str): 84 | targets = [targets] 85 | 86 | msg = MIMEText(message) 87 | msg['Subject'] = subject 88 | msg['From'] = username 89 | msg['To'] = ', '.join(targets) 90 | 91 | server = smtplib.SMTP_SSL(smtp_ssl_host, smtp_ssl_port) 92 | server.login(username, password) 93 | server.sendmail(username, targets, msg.as_string()) 94 | server.quit() 95 | errMsg = '' 96 | if verbose: 97 | print('email sent') 98 | return True, errMsg 99 | except Exception as e: 100 | errMsg = 'Unable to send email: {}'.format(e) 101 | if verbose: 102 | print(errMsg) 103 | return False, errMsg 104 | 105 | 106 | def asMinutesHours(s): 107 | h = s // 3600 108 | s -= h*3600 109 | m = s // 60 110 | s -= m * 60 111 | if h == 0: 112 | if m == 0: 113 | return '%ds' % s 114 | else: 115 | return '%dm %ds' % (m, s) 116 | else: 117 | return '%dh %dm %ds' %(h,m,s) 118 | 119 | 120 | def timeSince(since): 121 | now = time.time() 122 | s = now - since 123 | return '{}'.format(asMinutesHours(s)) 124 | 125 | 126 | def getNumberOfParameters(model): 127 | res = 0 128 | for x in model.parameters(): 129 | res = res + x.data.cpu().numpy().size 130 | 131 | return res 132 | 133 | def convertToNamedTuple(dictionary): 134 | 135 | ''' 136 | 137 | :param dictionary: converts a dictionary to a named tuple (if possible), so that you can access with the .attribute 138 | syntax. This is necessary to use openNMT-py code base 139 | :return: the namedtuple 140 | ''' 141 | 142 | return namedtuple('GenericDict', dictionary.keys())(**dictionary) 143 | 144 | def convertToDictionary(named_tuple): 145 | ''' 146 | :param namedTuple: converts a named tuple into a dictionary 147 | :return: the dictionary 148 | ''' 149 | 150 | return named_tuple._asdict() 151 | 152 | def extractTarFile(tar_path, extract_path=None): 153 | if extract_path is None: 154 | extract_path = os.path.splitext(tar_path)[0] 155 | try: 156 | os.mkdir(extract_path) 157 | except:pass 158 | 159 | tar = tarfile.open(tar_path, 'r') 160 | for item in tar: 161 | tar.extract(item, extract_path) 162 | if item.name.find(".tgz") != -1 or item.name.find(".tar") != -1: 163 | extractTarFile(item.name, "./" + item.name[:item.name.rfind('/')]) 164 | 165 | def countLinesFile(filepath): 166 | with open(filepath, 'r') as f: 167 | count = sum(1 for line in f) 168 | return count 169 | 170 | def remove_files_list(list_filepath): 171 | infoMsg = '' 172 | for x in list_filepath: 173 | try: 174 | os.remove(x) 175 | except Exception as e: 176 | infoMsg += repr(e) + '\n' 177 | return infoMsg 178 | 179 | def convert_state_dict_to_data_parallel(state_dict): 180 | 181 | ''' 182 | Converts a state dict that was saved without data parallel to one tha can be loaded 183 | by a data parallel module 184 | ''' 185 | new_state_dict = OrderedDict() 186 | for k, v in state_dict.items(): 187 | name = 'module.' 
+ k 188 | new_state_dict[name] = v 189 | return new_state_dict 190 | 191 | def convert_state_dict_from_data_parallel(state_dict): 192 | 193 | ''' 194 | Converts a state dict that was saved with data parallel to one tha can be loaded 195 | by a non-data parallel module 196 | ''' 197 | 198 | new_state_dict = OrderedDict() 199 | for k, v in state_dict.items(): 200 | if k.startswith('module.'): 201 | name = k[7:] # remove `module.` 202 | else: 203 | raise ValueError('The state_dict passed was not saved by a data parallel instance') 204 | new_state_dict[name] = v 205 | return new_state_dict 206 | 207 | def num_distinct_elements(numpy_array, tol=1e-8): 208 | 209 | '''returns the number of distinct elements, considering elements closer than tol as the same 210 | numpy_array must be one dimensional!''' 211 | 212 | aux = numpy_array[~(np.triu(np.abs(numpy_array[:, None] - numpy_array) <= tol, 1)).any(0)] 213 | #maybe this is better: np.unique(numpy_array.round(decimals=5)).size 214 | return aux.size 215 | 216 | def get_size_reduction(effective_number_bits, bucket_size=256, full_precision_bits=32): 217 | 218 | if bucket_size is None: 219 | return full_precision_bits/effective_number_bits 220 | 221 | f = full_precision_bits 222 | k = bucket_size 223 | b = effective_number_bits 224 | return (k*f)/(k*b+2*f) 225 | 226 | def get_size_quantized_model(model, numBits, quantization_functions, bucket_size=256, 227 | type_quantization='uniform', quantizeFirstLastLayer=True): 228 | 229 | 'Returns size in MB' 230 | 231 | if numBits is None: 232 | return sum(p.numel() for p in model.parameters()) * 4 / 1000000 233 | 234 | 235 | numTensors = sum(1 for _ in model.parameters()) 236 | if quantizeFirstLastLayer is True: 237 | def get_quantized_params(): 238 | return model.parameters() 239 | def get_unquantized_params(): 240 | return iter(()) 241 | else: 242 | def get_quantized_params(): 243 | return (p for idx, p in enumerate(model.parameters()) if idx not in (0, numTensors - 1)) 244 | def get_unquantized_params(): 245 | return (p for idx, p in enumerate(model.parameters()) if idx in (0, numTensors - 1)) 246 | 247 | count_quantized_parameters = sum(p.numel() for p in get_quantized_params()) 248 | count_unquantized_parameters = sum(p.numel() for p in get_unquantized_params()) 249 | 250 | #Now get the best huffmann bit length for the quantized parameters 251 | actual_bit_huffmman = qhf.get_huffman_encoding_mean_bit_length(get_quantized_params(), quantization_functions, 252 | type_quantization, s=2**numBits) 253 | 254 | #Now we can compute the size. 255 | size_mb = 0 256 | size_mb += count_unquantized_parameters*4 #32 bits / 8 = 4 byte per parameter 257 | size_mb += actual_bit_huffmman*count_quantized_parameters/8 #For the quantized parameters we use the mean huffman length 258 | if bucket_size is not None: 259 | size_mb += count_quantized_parameters/bucket_size*8 #for every bucket size, we have to save 2 parameters. 
260 | #so we multiply the number of buckets by 2*32/8 = 8 261 | size_mb = size_mb / 1000000 #to bring it in MB 262 | return size_mb 263 | 264 | 265 | def get_entropy(probabilities): 266 | 267 | natural_log = torch.log(1/probabilities) 268 | natural_log[natural_log == float('inf')] = 0 #this puts all inf in the tensor to 0, so they don't matter for the entropy 269 | log_2 = natural_log / np.log(2) 270 | entropy = (probabilities * log_2).sum() 271 | return entropy 272 | 273 | def compute_entropy_layer(layer_out, normalize=False): 274 | prob_out = torch.nn.functional.softmax(Variable(layer_out), dim=1).data 275 | curr_entropy = [get_entropy(prob_out[idx_b, :]) for idx_b in range(prob_out.size(0))] 276 | curr_entropy = torch.FloatTensor(curr_entropy).view(-1, 1) 277 | if USE_CUDA: curr_entropy = curr_entropy.cuda() 278 | if normalize: 279 | # Normalize them with the max possible entropy value, so divide by log_2(n) 280 | N = layer_out.size(1) 281 | curr_entropy = curr_entropy / math.log2(N) 282 | return curr_entropy 283 | 284 | class DataLoader(object): 285 | 286 | """ 287 | Simple data loader that wraps the one-epoch generator 288 | """ 289 | 290 | #TODO: Shouldn't this inherit from some torch.utils class? 291 | 292 | #TODO: This probably belongs to the dataset package 293 | 294 | def __init__(self, dataLoaderIterator, length_dataset, batch_size, shuffled, **kwargs): 295 | self.dataLoaderIterator = dataLoaderIterator 296 | self.batch_size = batch_size 297 | self.shuffled = shuffled 298 | self.length_dataset = length_dataset 299 | 300 | for key, val in kwargs.items(): 301 | setattr(self, key, val) 302 | 303 | def __iter__(self): 304 | return self.dataLoaderIterator() 305 | 306 | def __len__(self): 307 | #TODO: shouldn't it rather be int(self.length_dataset/batch_size)? 308 | return self.length_dataset 309 | 310 | class EnsembleModel(nn.Module): 311 | def __init__(self, modules): 312 | super(EnsembleModel, self).__init__() 313 | 314 | self.modules_list = nn.ModuleList(modules) 315 | 316 | def forward(self, input): 317 | 318 | num_modules = len(self.modules_list) 319 | output = self.modules_list[0](input) 320 | for idx in range(1, num_modules): 321 | output += self.modules_list[idx](input) 322 | return output / num_modules 323 | -------------------------------------------------------------------------------- /imageNet_distilled.py: -------------------------------------------------------------------------------- 1 | import os 2 | import torch 3 | import torchvision 4 | import cnn_models.conv_forward_model as convForwModel 5 | import cnn_models.help_fun as cnn_hf 6 | import datasets 7 | import model_manager 8 | 9 | cuda_devices = os.environ['CUDA_VISIBLE_DEVICES'].split(',') 10 | print('CUDA_VISIBLE_DEVICES: {} for a total of {} GPUs'.format(cuda_devices, len(cuda_devices))) 11 | 12 | 13 | if 'NUM_BITS' in os.environ: 14 | NUM_BITS = int(os.environ['NUM_BITS']) 15 | else: 16 | NUM_BITS = 4 17 | 18 | print('Number of bits in training: {}'.format(NUM_BITS)) 19 | 20 | datasets.BASE_DATA_FOLDER = '...' 21 | SAVED_MODELS_FOLDER = '...' 
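# NOTE: the '...' values above (and the ImageNet12 folder arguments further below) are
# placeholders left in the script; point them to real directories before running,
# e.g. (hypothetical paths):
#   datasets.BASE_DATA_FOLDER = '/data/saved_datasets'
#   SAVED_MODELS_FOLDER = '/data/saved_models'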
22 | 23 | USE_CUDA = torch.cuda.is_available() 24 | NUM_GPUS = len(cuda_devices) 25 | 26 | try: 27 | os.mkdir(datasets.BASE_DATA_FOLDER) 28 | except:pass 29 | try: 30 | os.mkdir(SAVED_MODELS_FOLDER) 31 | except:pass 32 | 33 | epochsToTrainImageNet = 90 34 | imageNet12modelsFolder = os.path.join(SAVED_MODELS_FOLDER, 'imagenet12_new') 35 | imagenet_manager = model_manager.ModelManager('model_manager_imagenet_distilled_New{}bits.tst'.format(NUM_BITS), 36 | 'model_manager', create_new_model_manager=False) 37 | 38 | for x in imagenet_manager.list_models(): 39 | if imagenet_manager.get_num_training_runs(x) >= 1: 40 | s = '{}; Last prediction acc: {}, Best prediction acc: {}'.format(x, 41 | imagenet_manager.load_metadata(x)[1]['predictionAccuracy'][-1], 42 | max(imagenet_manager.load_metadata(x)[1]['predictionAccuracy'])) 43 | print(s) 44 | 45 | try: 46 | os.mkdir(imageNet12modelsFolder) 47 | except:pass 48 | 49 | print('Batch size: {}'.format(batch_size)) 50 | 51 | if batch_size % NUM_GPUS != 0: 52 | raise ValueError('Batch size: {} must be a multiple of the number of gpus:{}'.format(batch_size, NUM_GPUS)) 53 | 54 | imageNet12 = datasets.ImageNet12('...', 55 | '...', 56 | type_of_data_augmentation='extended', already_scaled=False, 57 | pin_memory=True) 58 | 59 | 60 | train_loader = imageNet12.getTrainLoader(batch_size, shuffle=True) 61 | test_loader = imageNet12.getTestLoader(batch_size, shuffle=False) 62 | 63 | # # Teacher model 64 | # resnet152 = torchvision.models.resnet152(True) #already trained 65 | # if USE_CUDA: 66 | # resnet152 = resnet152.cuda() 67 | # if NUM_GPUS > 1: 68 | # resnet152 = torch.nn.parallel.DataParallel(resnet152) 69 | 70 | 71 | #normal resnet18 training 72 | resnet18 = torchvision.models.resnet18(False) #not pre-trained, 11.7 million parameters 73 | if USE_CUDA: 74 | resnet18 = resnet18.cuda() 75 | if NUM_GPUS > 1: 76 | resnet18 = torch.nn.parallel.DataParallel(resnet18) 77 | model_name = 'resnet18_normal_fullprecision' 78 | model_path = os.path.join(imageNet12modelsFolder, model_name) 79 | 80 | if not model_name in imagenet_manager.saved_models: 81 | imagenet_manager.add_new_model(model_name, model_path, 82 | arguments_creator_function={'loaded_from':'torchvision_models'}) 83 | 84 | imagenet_manager.train_model(resnet18, model_name=model_name, 85 | train_function=convForwModel.train_model, 86 | arguments_train_function={'epochs_to_train': epochsToTrainImageNet, 87 | 'learning_rate_style': 'imagenet', 88 | 'initial_learning_rate': 0.1, 89 | 'weight_decayL2':1e-4, 90 | 'start_epoch':0, 91 | 'print_every':30}, 92 | train_loader=train_loader, test_loader=test_loader) 93 | 94 | #distilled 95 | # resnet18_distilled = torchvision.models.resnet18(False) #not pre-trained, 11.7 million parameters 96 | # if USE_CUDA: 97 | # resnet18_distilled = resnet18_distilled.cuda() 98 | # if NUM_GPUS > 1: 99 | # resnet18_distilled = torch.nn.parallel.DataParallel(resnet18_distilled) 100 | # model_name = 'resnet18_distilled' 101 | # model_path = os.path.join(imageNet12modelsFolder, model_name) 102 | # 103 | # if not model_name in imagenet_manager.saved_models: 104 | # imagenet_manager.add_new_model(model_name, model_path, 105 | # arguments_creator_function={'loaded_from':'torchvision_models'}) 106 | 107 | # imagenet_manager.train_model(resnet18_distilled, model_name=model_name, 108 | # train_function=convForwModel.train_model, 109 | # arguments_train_function={'epochs_to_train': epochsToTrainImageNet, 110 | # 'teacher_model': resnet34, 111 | # 'learning_rate_style': 'imagenet', 112 | # 
'initial_learning_rate': initial_lr, 113 | # 'weight_decayL2':1e-4, 114 | # 'use_distillation_loss':True, 115 | # 'start_epoch':start_epoch, 116 | # 'print_every':100}, 117 | # train_loader=train_loader, test_loader=test_loader) 118 | 119 | #quantized distilled 120 | # bits_to_try = [NUM_BITS] 121 | # 122 | # for numBit in bits_to_try: 123 | # resnet18_quant_distilled = torchvision.models.resnet18(False) #not pre-trained, 11.7 million parameters 124 | # if USE_CUDA: 125 | # resnet18_quant_distilled = resnet18_quant_distilled.cuda() 126 | # if NUM_GPUS > 1: 127 | # resnet18_quant_distilled = torch.nn.parallel.DataParallel(resnet18_quant_distilled) 128 | # model_name = 'resnet18_quant_distilled_{}bits'.format(numBit) 129 | # model_path = os.path.join(imageNet12modelsFolder, model_name) 130 | # 131 | # if not model_name in imagenet_manager.saved_models: 132 | # imagenet_manager.add_new_model(model_name, model_path, 133 | # arguments_creator_function={'loaded_from':'torchvision_models'}) 134 | # 135 | # imagenet_manager.train_model(resnet18_quant_distilled, model_name=model_name, 136 | # train_function=convForwModel.train_model, 137 | # arguments_train_function={'epochs_to_train': epochsToTrainImageNet, 138 | # 'learning_rate_style': 'imagenet', 139 | # 'initial_learning_rate': 0.1, 140 | # 'use_nesterov':True, 141 | # 'initial_momentum':0.9, 142 | # 'weight_decayL2':1e-4, 143 | # 'start_epoch': 0, 144 | # 'print_every':30, 145 | # 'use_distillation_loss':True, 146 | # 'teacher_model': resnet152, 147 | # 'quantizeWeights':True, 148 | # 'numBits':numBit, 149 | # 'bucket_size':256, 150 | # 'quantize_first_and_last_layer': False}, 151 | # train_loader=train_loader, test_loader=test_loader) 152 | -------------------------------------------------------------------------------- /onmt/Beam.py: -------------------------------------------------------------------------------- 1 | from __future__ import division 2 | import torch 3 | import onmt 4 | 5 | """ 6 | Class for managing the internals of the beam search process. 7 | 8 | Takes care of beams, back pointers, and scores. 9 | """ 10 | 11 | 12 | class Beam(object): 13 | def __init__(self, size, n_best=1, cuda=False, vocab=None, 14 | global_scorer=None): 15 | 16 | self.size = size 17 | self.tt = torch.cuda if cuda else torch 18 | 19 | # The score for each translation on the beam. 20 | self.scores = self.tt.FloatTensor(size).zero_() 21 | self.allScores = [] 22 | 23 | # The backpointers at each time-step. 24 | self.prevKs = [] 25 | 26 | # The outputs at each time-step. 27 | self.nextYs = [self.tt.LongTensor(size) 28 | .fill_(vocab.stoi[onmt.IO.PAD_WORD])] 29 | self.nextYs[0][0] = vocab.stoi[onmt.IO.BOS_WORD] 30 | self.vocab = vocab 31 | 32 | # Has EOS topped the beam yet. 33 | self._eos = self.vocab.stoi[onmt.IO.EOS_WORD] 34 | self.eosTop = False 35 | 36 | # The attentions (matrix) for each time. 37 | self.attn = [] 38 | 39 | # Time and k pair for finished. 40 | self.finished = [] 41 | self.n_best = n_best 42 | 43 | # Information for global scoring. 44 | self.globalScorer = global_scorer 45 | self.globalState = {} 46 | 47 | def getCurrentState(self): 48 | "Get the outputs for the current timestep." 49 | return self.nextYs[-1] 50 | 51 | def getCurrentOrigin(self): 52 | "Get the backpointers for the current timestep." 53 | return self.prevKs[-1] 54 | 55 | def advance(self, wordLk, attnOut): 56 | """ 57 | Given prob over words for every last beam `wordLk` and attention 58 | `attnOut`: Compute and update the beam search. 
59 | 60 | Parameters: 61 | 62 | * `wordLk`- probs of advancing from the last step (K x words) 63 | * `attnOut`- attention at the last step 64 | 65 | Returns: True if beam search is complete. 66 | """ 67 | numWords = wordLk.size(1) 68 | 69 | # Sum the previous scores. 70 | if len(self.prevKs) > 0: 71 | beamLk = wordLk + self.scores.unsqueeze(1).expand_as(wordLk) 72 | 73 | # Don't let EOS have children. 74 | for i in range(self.nextYs[-1].size(0)): 75 | if self.nextYs[-1][i] == self._eos: 76 | beamLk[i] = -1e20 77 | else: 78 | beamLk = wordLk[0] 79 | flatBeamLk = beamLk.view(-1) 80 | bestScores, bestScoresId = flatBeamLk.topk(self.size, 0, True, True) 81 | 82 | self.allScores.append(self.scores) 83 | self.scores = bestScores 84 | 85 | # bestScoresId is flattened beam x word array, so calculate which 86 | # word and beam each score came from 87 | prevK = bestScoresId / numWords 88 | self.prevKs.append(prevK) 89 | self.nextYs.append((bestScoresId - prevK * numWords)) 90 | self.attn.append(attnOut.index_select(0, prevK)) 91 | 92 | if self.globalScorer is not None: 93 | self.globalScorer.updateGlobalState(self) 94 | 95 | for i in range(self.nextYs[-1].size(0)): 96 | if self.nextYs[-1][i] == self._eos: 97 | s = self.scores[i] 98 | if self.globalScorer is not None: 99 | globalScores = self.globalScorer.score(self, self.scores) 100 | s = globalScores[i] 101 | self.finished.append((s, len(self.nextYs) - 1, i)) 102 | 103 | # End condition is when top-of-beam is EOS and no global score. 104 | if self.nextYs[-1][0] == self.vocab.stoi[onmt.IO.EOS_WORD]: 105 | # self.allScores.append(self.scores) 106 | self.eosTop = True 107 | 108 | def done(self): 109 | return self.eosTop and len(self.finished) >= self.n_best 110 | 111 | def sortFinished(self, minimum=None): 112 | if minimum is not None: 113 | i = 0 114 | # Add from beam until we have minimum outputs. 115 | while len(self.finished) < minimum: 116 | s = self.scores[i] 117 | if self.globalScorer is not None: 118 | globalScores = self.globalScorer.score(self, self.scores) 119 | s = globalScores[i] 120 | self.finished.append((s, len(self.nextYs) - 1, i)) 121 | 122 | self.finished.sort(key=lambda a: -a[0]) 123 | scores = [sc for sc, _, _ in self.finished] 124 | ks = [(t, k) for _, t, k in self.finished] 125 | return scores, ks 126 | 127 | def getHyp(self, timestep, k): 128 | """ 129 | Walk back to construct the full hypothesis. 130 | """ 131 | hyp, attn = [], [] 132 | for j in range(len(self.prevKs[:timestep]) - 1, -1, -1): 133 | hyp.append(self.nextYs[j+1][k]) 134 | attn.append(self.attn[j][k]) 135 | k = self.prevKs[j][k] 136 | return hyp[::-1], torch.stack(attn[::-1]) 137 | 138 | 139 | class GNMTGlobalScorer(object): 140 | """ 141 | Google NMT ranking score from Wu et al. 
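As implemented in score() below, a hypothesis is ranked by
log P(Y|X) / length_penalty + coverage_penalty, where
length_penalty = ((5 + |Y|)^alpha) / (5 + 1)^alpha and
coverage_penalty = beta * sum(log(min(attention_coverage, 1.0))).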
142 | """ 143 | def __init__(self, alpha, beta): 144 | self.alpha = alpha 145 | self.beta = beta 146 | 147 | def score(self, beam, logprobs): 148 | "Additional term add to log probability" 149 | cov = beam.globalState["coverage"] 150 | pen = self.beta * torch.min(cov, cov.clone().fill_(1.0)).log().sum(1) 151 | l_term = (((5 + len(beam.nextYs)) ** self.alpha) / 152 | ((5 + 1) ** self.alpha)) 153 | return (logprobs / l_term) + pen 154 | 155 | def updateGlobalState(self, beam): 156 | "Keeps the coverage vector as sum of attens" 157 | if len(beam.prevKs) == 1: 158 | beam.globalState["coverage"] = beam.attn[-1] 159 | else: 160 | beam.globalState["coverage"] = beam.globalState["coverage"] \ 161 | .index_select(0, beam.prevKs[-1]).add(beam.attn[-1]) 162 | -------------------------------------------------------------------------------- /onmt/Loss.py: -------------------------------------------------------------------------------- 1 | """ 2 | This file handles the details of the loss function during training. 3 | 4 | This includes: LossComputeBase and the standard NMTLossCompute, and 5 | sharded loss compute stuff. 6 | """ 7 | from __future__ import division 8 | import torch 9 | import torch.nn as nn 10 | from torch.autograd import Variable 11 | 12 | import onmt 13 | 14 | 15 | class LossComputeBase(nn.Module): 16 | """ 17 | This is the loss criterion base class. Users can implement their own 18 | loss computation strategy by making subclass of this one. 19 | Users need to implement the compute_loss() method. 20 | We inherits from nn.Module to leverage the cuda behavior. 21 | """ 22 | def __init__(self, generator, tgt_vocab): 23 | super(LossComputeBase, self).__init__() 24 | self.generator = generator 25 | self.tgt_vocab = tgt_vocab 26 | self.padding_idx = tgt_vocab.stoi[onmt.IO.PAD_WORD] 27 | 28 | def forward(self, batch, output, target, **kwargs): 29 | """ 30 | Compute the loss. Subclass must define the compute_loss(). 31 | Args: 32 | batch: the current batch. 33 | output: the predict output from the model. 34 | target: the validate target to compare output with. 35 | **kwargs: additional info for computing loss. 36 | """ 37 | # Need to simplify this interface. 38 | return self.compute_loss(batch, output, target, **kwargs) 39 | 40 | def sharded_compute_loss(self, batch, output, attns, 41 | cur_trunc, trunc_size, shard_size, teacher_outputs=None): 42 | """ 43 | Compute the loss in shards for efficiency. 44 | """ 45 | batch_stats = onmt.Statistics() 46 | range_ = (cur_trunc, cur_trunc + trunc_size) 47 | gen_state = make_gen_state(output, batch, attns, range_, 48 | self.copy_attn, teacher_outputs) 49 | 50 | for shard in shards(gen_state, shard_size): 51 | loss, stats = self.compute_loss(batch, **shard) 52 | loss.div(batch.batch_size).backward() 53 | batch_stats.update(stats) 54 | 55 | return batch_stats 56 | 57 | def stats(self, loss, scores, target): 58 | """ 59 | Compute and return a Statistics object. 60 | 61 | Args: 62 | loss(Tensor): the loss computed by the loss criterion. 63 | scores(Tensor): a sequence of predict output with scores. 
64 | """ 65 | pred = scores.max(1)[1] 66 | non_padding = target.ne(self.padding_idx) 67 | num_correct = pred.eq(target) \ 68 | .masked_select(non_padding) \ 69 | .sum() 70 | return onmt.Statistics(loss[0], non_padding.sum(), num_correct) 71 | 72 | def bottle(self, v): 73 | return v.view(-1, v.size(2)) 74 | 75 | def unbottle(self, v, batch_size): 76 | return v.view(-1, batch_size, v.size(1)) 77 | 78 | 79 | class NMTLossCompute(LossComputeBase): 80 | """ 81 | Standard NMT Loss Computation. 82 | """ 83 | def __init__(self, generator, tgt_vocab, use_distillation_loss=False, teacher_generator=None): 84 | 85 | if use_distillation_loss is True and teacher_generator is None: 86 | raise ValueError('to use distillation loss you have to pass the teacher generator') 87 | 88 | super(NMTLossCompute, self).__init__(generator, tgt_vocab) 89 | 90 | self.copy_attn = False 91 | weight = torch.ones(len(tgt_vocab)) 92 | weight[self.padding_idx] = 0 93 | self.criterion = nn.NLLLoss(weight, size_average=False) 94 | self.use_distillation_loss = use_distillation_loss 95 | self.teacher_generator = teacher_generator 96 | 97 | def compute_loss(self, batch, output, target, **kwargs): 98 | """ See base class for args description. """ 99 | scores = self.generator(self.bottle(output)) 100 | scores_data = scores.data.clone() 101 | 102 | target = target.view(-1) 103 | target_data = target.data.clone() 104 | 105 | loss = self.criterion(scores, target) 106 | if self.use_distillation_loss: 107 | weight_teacher_loss = 0.7 108 | teacher_outputs = kwargs['teacher_outputs'] 109 | scores_teacher = self.teacher_generator(self.bottle(teacher_outputs)) 110 | prob_teacher = scores_teacher.exp().detach() 111 | # Here we use a temperature of 1.. 112 | loss_distilled = nn.functional.kl_div(scores, prob_teacher, 113 | weight=self.criterion.weight, 114 | size_average=self.criterion.size_average) 115 | loss = (1-weight_teacher_loss)*loss + weight_teacher_loss*loss_distilled 116 | 117 | loss_data = loss.data.clone() 118 | stats = self.stats(loss_data, scores_data, target_data) 119 | 120 | return loss, stats 121 | 122 | 123 | def make_gen_state(output, batch, attns, range_, copy_attn=None, teacher_outputs=None): 124 | """ 125 | Create generator state for use in sharded loss computation. 126 | This needs to match compute_loss exactly. 127 | """ 128 | if copy_attn and getattr(batch, 'alignment', None) is None: 129 | raise AssertionError("using -copy_attn you need to pass in " 130 | "-dynamic_dict during preprocess stage.") 131 | 132 | res_dict = {} 133 | res_dict["output"] = output 134 | if teacher_outputs is not None: 135 | res_dict['teacher_outputs'] = teacher_outputs 136 | res_dict["target"] = batch.tgt[range_[0] + 1: range_[1]] 137 | res_dict["copy_attn"] = attns.get("copy") 138 | res_dict["align"] = None if not copy_attn else batch.alignment[range_[0] + 1: range_[1]] 139 | res_dict["coverage"] = attns.get("coverage") 140 | 141 | return res_dict 142 | 143 | 144 | def filter_gen_state(state): 145 | for k, v in state.items(): 146 | if v is not None: 147 | if isinstance(v, Variable) and v.requires_grad: 148 | v = Variable(v.data, requires_grad=True, volatile=False) 149 | yield k, v 150 | 151 | 152 | def shards(state, shard_size, eval=False): 153 | """ 154 | Args: 155 | state: A dictionary which corresponds to the output of 156 | make_gen_state(). The values for those keys are 157 | Tensor-like or None. 158 | shard_size: The maximum size of the shards yielded by the model. 159 | eval: If True, only yield the state, nothing else. 
160 | Otherwise, yield shards. 161 | 162 | yields: 163 | Each yielded shard is a dict. 164 | side effect: 165 | After the last shard, this function does back-propagation. 166 | """ 167 | if eval: 168 | yield state 169 | else: 170 | # non_none: the subdict of the state dictionary where the values 171 | # are not None. 172 | non_none = dict(filter_gen_state(state)) 173 | 174 | # Now, the iteration: 175 | # split_state is a dictionary of sequences of tensor-like but we 176 | # want a sequence of dictionaries of tensors. 177 | # First, unzip the dictionary into a sequence of keys and a 178 | # sequence of tensor-like sequences. 179 | keys, values = zip(*((k, torch.split(v, shard_size)) 180 | for k, v in non_none.items())) 181 | 182 | # Now, yield a dictionary for each shard. The keys are always 183 | # the same. values is a sequence of length #keys where each 184 | # element is a sequence of length #shards. We want to iterate 185 | # over the shards, not over the keys: therefore, the values need 186 | # to be re-zipped by shard and then each shard can be paired 187 | # with the keys. 188 | for shard_tensors in zip(*values): 189 | yield dict(zip(keys, shard_tensors)) 190 | 191 | # Assumed backprop'd 192 | variables = ((state[k], v.grad.data) for k, v in non_none.items() 193 | if isinstance(v, Variable) and v.grad is not None) 194 | inputs, grads = zip(*variables) 195 | torch.autograd.backward(inputs, grads) 196 | -------------------------------------------------------------------------------- /onmt/ModelConstructor.py: -------------------------------------------------------------------------------- 1 | """ 2 | This file is for models creation, which consults options 3 | and creates each encoder and decoder accordingly. 4 | """ 5 | import torch.nn as nn 6 | 7 | import onmt 8 | import onmt.Models 9 | import onmt.modules 10 | from onmt.IO import ONMTDataset 11 | from onmt.Models import NMTModel, MeanEncoder, RNNEncoder, \ 12 | StdRNNDecoder, InputFeedRNNDecoder 13 | from onmt.modules import Embeddings, ImageEncoder, CopyGenerator, \ 14 | TransformerEncoder, TransformerDecoder, \ 15 | CNNEncoder, CNNDecoder 16 | 17 | 18 | def make_embeddings(opt, word_dict, feature_dicts, for_encoder=True): 19 | """ 20 | Make an Embeddings instance. 21 | Args: 22 | opt: the option in current environment. 23 | word_dict(Vocab): words dictionary. 24 | feature_dicts([Vocab], optional): a list of feature dictionary. 25 | for_encoder(bool): make Embeddings for encoder or decoder? 26 | """ 27 | if for_encoder: 28 | embedding_dim = opt.src_word_vec_size 29 | else: 30 | embedding_dim = opt.tgt_word_vec_size 31 | 32 | word_padding_idx = word_dict.stoi[onmt.IO.PAD_WORD] 33 | num_word_embeddings = len(word_dict) 34 | 35 | feats_padding_idx = [feat_dict.stoi[onmt.IO.PAD_WORD] 36 | for feat_dict in feature_dicts] 37 | num_feat_embeddings = [len(feat_dict) for feat_dict in 38 | feature_dicts] 39 | 40 | return Embeddings(embedding_dim, 41 | opt.position_encoding, 42 | opt.feat_merge, 43 | opt.feat_vec_exponent, 44 | opt.feat_vec_size, 45 | opt.dropout, 46 | word_padding_idx, 47 | feats_padding_idx, 48 | num_word_embeddings, 49 | num_feat_embeddings) 50 | 51 | 52 | def make_encoder(opt, embeddings): 53 | """ 54 | Various encoder dispatcher function. 55 | Args: 56 | opt: the option in current environment. 57 | embeddings (Embeddings): vocab embeddings for this encoder. 
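Returns:
    the encoder instance (TransformerEncoder, CNNEncoder, MeanEncoder or RNNEncoder,
    depending on opt.encoder_type).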
58 | """ 59 | if opt.encoder_type == "transformer": 60 | return TransformerEncoder(opt.enc_layers, opt.rnn_size, 61 | opt.dropout, embeddings) 62 | elif opt.encoder_type == "cnn": 63 | return CNNEncoder(opt.enc_layers, opt.rnn_size, 64 | opt.cnn_kernel_width, 65 | opt.dropout, embeddings) 66 | elif opt.encoder_type == "mean": 67 | return MeanEncoder(opt.enc_layers, embeddings) 68 | else: 69 | # "rnn" or "brnn" 70 | return RNNEncoder(opt.rnn_type, opt.brnn, opt.dec_layers, 71 | opt.rnn_size, opt.dropout, embeddings) 72 | 73 | 74 | def make_decoder(opt, embeddings): 75 | """ 76 | Various decoder dispatcher function. 77 | Args: 78 | opt: the option in current environment. 79 | embeddings (Embeddings): vocab embeddings for this decoder. 80 | """ 81 | if opt.decoder_type == "transformer": 82 | return TransformerDecoder(opt.dec_layers, opt.rnn_size, 83 | opt.global_attention, opt.copy_attn, 84 | opt.dropout, embeddings) 85 | elif opt.decoder_type == "cnn": 86 | return CNNDecoder(opt.dec_layers, opt.rnn_size, 87 | opt.global_attention, opt.copy_attn, 88 | opt.cnn_kernel_width, opt.dropout, 89 | embeddings) 90 | elif opt.input_feed: 91 | return InputFeedRNNDecoder(opt.rnn_type, opt.brnn, 92 | opt.dec_layers, opt.rnn_size, 93 | opt.global_attention, 94 | opt.coverage_attn, 95 | opt.context_gate, 96 | opt.copy_attn, 97 | opt.dropout, 98 | embeddings) 99 | else: 100 | return StdRNNDecoder(opt.rnn_type, opt.brnn, 101 | opt.dec_layers, opt.rnn_size, 102 | opt.global_attention, 103 | opt.coverage_attn, 104 | opt.context_gate, 105 | opt.copy_attn, 106 | opt.dropout, 107 | embeddings) 108 | 109 | 110 | def make_base_model(model_opt, fields, gpu, checkpoint=None): 111 | """ 112 | Args: 113 | model_opt: the option loaded from checkpoint. 114 | fields: `Field` objects for the model. 115 | gpu(bool): whether to use gpu. 116 | checkpoint: the model gnerated by train phase, or a resumed snapshot 117 | model from a stopped training. 118 | Returns: 119 | the NMTModel. 120 | """ 121 | assert model_opt.model_type in ["text", "img"], \ 122 | ("Unsupported model type %s" % (model_opt.model_type)) 123 | 124 | # Make encoder. 125 | if model_opt.model_type == "text": 126 | src_dict = fields["src"].vocab 127 | feature_dicts = ONMTDataset.collect_feature_dicts(fields) 128 | src_embeddings = make_embeddings(model_opt, src_dict, 129 | feature_dicts) 130 | encoder = make_encoder(model_opt, src_embeddings) 131 | else: 132 | encoder = ImageEncoder(model_opt.layers, 133 | model_opt.brnn, 134 | model_opt.rnn_size, 135 | model_opt.dropout) 136 | 137 | # Make decoder. 138 | tgt_dict = fields["tgt"].vocab 139 | # TODO: prepare for a future where tgt features are possible. 140 | feature_dicts = [] 141 | tgt_embeddings = make_embeddings(model_opt, tgt_dict, 142 | feature_dicts, for_encoder=False) 143 | decoder = make_decoder(model_opt, tgt_embeddings) 144 | 145 | # Make NMTModel(= encoder + decoder). 146 | model = NMTModel(encoder, decoder) 147 | 148 | # Make Generator. 149 | if not model_opt.copy_attn: 150 | generator = nn.Sequential( 151 | nn.Linear(model_opt.rnn_size, len(fields["tgt"].vocab)), 152 | nn.LogSoftmax()) 153 | if model_opt.share_decoder_embeddings: 154 | generator[0].weight = decoder.embeddings.word_lut.weight 155 | else: 156 | generator = CopyGenerator(model_opt, fields["src"].vocab, 157 | fields["tgt"].vocab) 158 | 159 | # Load the model states from checkpoint or initialize them. 
160 | if checkpoint is not None: 161 | #print('Loading model parameters.') 162 | model.load_state_dict(checkpoint['model']) 163 | generator.load_state_dict(checkpoint['generator']) 164 | else: 165 | if model_opt.param_init != 0.0: 166 | #print('Intializing parameters.') 167 | for p in model.parameters(): 168 | p.data.uniform_(-model_opt.param_init, model_opt.param_init) 169 | model.encoder.embeddings.load_pretrained_vectors( 170 | model_opt.pre_word_vecs_enc, model_opt.fix_word_vecs_enc) 171 | model.decoder.embeddings.load_pretrained_vectors( 172 | model_opt.pre_word_vecs_dec, model_opt.fix_word_vecs_dec) 173 | 174 | # add the generator to the module (does this register the parameter?) 175 | model.generator = generator 176 | 177 | # Make the whole model leverage GPU if indicated to do so. 178 | if gpu: 179 | model.cuda() 180 | else: 181 | model.cpu() 182 | 183 | return model 184 | -------------------------------------------------------------------------------- /onmt/Optim.py: -------------------------------------------------------------------------------- 1 | import torch.optim as optim 2 | from torch.nn.utils import clip_grad_norm 3 | 4 | 5 | class Optim(object): 6 | 7 | def set_parameters(self, params): 8 | self.params = [p for p in params if p.requires_grad] 9 | if self.method == 'sgd': 10 | self.optimizer = optim.SGD(self.params, lr=self.lr) 11 | elif self.method == 'adagrad': 12 | self.optimizer = optim.Adagrad(self.params, lr=self.lr) 13 | elif self.method == 'adadelta': 14 | self.optimizer = optim.Adadelta(self.params, lr=self.lr) 15 | elif self.method == 'adam': 16 | self.optimizer = optim.Adam(self.params, lr=self.lr, 17 | betas=self.betas, eps=1e-9) 18 | else: 19 | raise RuntimeError("Invalid optim method: " + self.method) 20 | 21 | def __init__(self, method, lr, max_grad_norm, 22 | lr_decay=1, start_decay_at=None, 23 | beta1=0.9, beta2=0.98, 24 | opt=None): 25 | self.last_ppl = None 26 | self.lr = lr 27 | self.max_grad_norm = max_grad_norm 28 | self.method = method 29 | self.lr_decay = lr_decay 30 | self.start_decay_at = start_decay_at 31 | self.start_decay = False 32 | self._step = 0 33 | self.betas = [beta1, beta2] 34 | self.opt = opt 35 | 36 | def _setRate(self, lr): 37 | self.lr = lr 38 | self.optimizer.param_groups[0]['lr'] = self.lr 39 | 40 | def step(self): 41 | "Compute gradients norm." 42 | self._step += 1 43 | 44 | # Decay method used in tensor2tensor. 45 | #Changed here, since NamedTuple does not have a __dict__ attribute 46 | if self.opt.decay_method == "noam": 47 | self._setRate( 48 | self.opt.learning_rate * 49 | (self.opt.rnn_size ** (-0.5) * 50 | min(self._step ** (-0.5), 51 | self._step * self.opt.warmup_steps**(-1.5)))) 52 | 53 | if self.max_grad_norm: 54 | clip_grad_norm(self.params, self.max_grad_norm) 55 | self.optimizer.step() 56 | 57 | def updateLearningRate(self, ppl, epoch): 58 | """ 59 | Decay learning rate if val perf does not improve 60 | or we hit the start_decay_at limit. 
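Once decay has been triggered, the rate is multiplied by lr_decay on every subsequent call.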
61 | """ 62 | 63 | if self.start_decay_at is not None and epoch >= self.start_decay_at: 64 | self.start_decay = True 65 | if self.last_ppl is not None and ppl > self.last_ppl: 66 | self.start_decay = True 67 | 68 | if self.start_decay: 69 | self.lr = self.lr * self.lr_decay 70 | print("Decaying learning rate to %g" % self.lr) 71 | 72 | self.last_ppl = ppl 73 | self.optimizer.param_groups[0]['lr'] = self.lr 74 | -------------------------------------------------------------------------------- /onmt/Trainer.py: -------------------------------------------------------------------------------- 1 | from __future__ import division 2 | """ 3 | This is the loadable seq2seq trainer library that is 4 | in charge of training details, loss compute, and statistics. 5 | See train.py for a use case of this library. 6 | 7 | Note!!! To make this a general library, we implement *only* 8 | mechanism things here(i.e. what to do), and leave the strategy 9 | things to users(i.e. how to do it). Also see train.py(one of the 10 | users of this library) for the strategy things we do. 11 | """ 12 | import time 13 | import sys 14 | import math 15 | import torch 16 | import torch.nn as nn 17 | 18 | import onmt 19 | import onmt.modules 20 | 21 | 22 | class Statistics(object): 23 | """ 24 | Train/validate loss statistics. 25 | """ 26 | def __init__(self, loss=0, n_words=0, n_correct=0): 27 | self.loss = loss 28 | self.n_words = n_words 29 | self.n_correct = n_correct 30 | self.n_src_words = 0 31 | self.start_time = time.time() 32 | 33 | def update(self, stat): 34 | self.loss += stat.loss 35 | self.n_words += stat.n_words 36 | self.n_correct += stat.n_correct 37 | 38 | def accuracy(self): 39 | return 100 * (self.n_correct / self.n_words) 40 | 41 | def ppl(self): 42 | return math.exp(min(self.loss / self.n_words, 100)) 43 | 44 | def elapsed_time(self): 45 | return time.time() - self.start_time 46 | 47 | def output(self, epoch, batch, n_batches, start): 48 | t = self.elapsed_time() 49 | print(("Epoch %2d, %5d/%5d; acc: %6.2f; ppl: %6.2f; " + 50 | "%3.0f src tok/s; %3.0f tgt tok/s; %6.0f s elapsed") % 51 | (epoch, batch, n_batches, 52 | self.accuracy(), 53 | self.ppl(), 54 | self.n_src_words / (t + 1e-5), 55 | self.n_words / (t + 1e-5), 56 | time.time() - start)) 57 | sys.stdout.flush() 58 | 59 | def log(self, prefix, experiment, optim): 60 | t = self.elapsed_time() 61 | experiment.add_scalar_value(prefix + "_ppl", self.ppl()) 62 | experiment.add_scalar_value(prefix + "_accuracy", self.accuracy()) 63 | experiment.add_scalar_value(prefix + "_tgtper", self.n_words / t) 64 | experiment.add_scalar_value(prefix + "_lr", optim.lr) 65 | 66 | 67 | class Trainer(object): 68 | def __init__(self, model, train_iter, valid_iter, 69 | train_loss, valid_loss, optim, 70 | trunc_size, shard_size): 71 | """ 72 | Args: 73 | model: the seq2seq model. 74 | train_iter: the train data iterator. 75 | valid_iter: the validate data iterator. 76 | train_loss: the train side LossCompute object for computing loss. 77 | valid_loss: the valid side LossCompute object for computing loss. 78 | optim: the optimizer responsible for lr update. 79 | trunc_size: a batch is divided by several truncs of this size. 80 | shard_size: compute loss in shards of this size for efficiency. 81 | """ 82 | # Basic attributes. 
83 | self.model = model 84 | self.train_iter = train_iter 85 | self.valid_iter = valid_iter 86 | self.train_loss = train_loss 87 | self.valid_loss = valid_loss 88 | self.optim = optim 89 | self.trunc_size = trunc_size 90 | self.shard_size = shard_size 91 | 92 | # Set model in training mode. 93 | self.model.train() 94 | 95 | def train(self, epoch, report_func=None): 96 | """ Called for each epoch to train. """ 97 | total_stats = Statistics() 98 | report_stats = Statistics() 99 | 100 | for i, batch in enumerate(self.train_iter): 101 | target_size = batch.tgt.size(0) 102 | # Truncated BPTT 103 | trunc_size = self.trunc_size if self.trunc_size else target_size 104 | 105 | dec_state = None 106 | _, src_lengths = batch.src 107 | 108 | src = onmt.IO.make_features(batch, 'src') 109 | tgt_outer = onmt.IO.make_features(batch, 'tgt') 110 | report_stats.n_src_words += src_lengths.sum() 111 | 112 | for j in range(0, target_size-1, trunc_size): 113 | # 1. Create truncated target. 114 | tgt = tgt_outer[j: j + trunc_size] 115 | 116 | # 2. F-prop all but generator. 117 | self.model.zero_grad() 118 | outputs, attns, dec_state = \ 119 | self.model(src, tgt, src_lengths, dec_state) 120 | 121 | # 3. Compute loss in shards for memory efficiency. 122 | batch_stats = self.train_loss.sharded_compute_loss( 123 | batch, outputs, attns, j, 124 | trunc_size, self.shard_size) 125 | 126 | # 4. Update the parameters and statistics. 127 | self.optim.step() 128 | total_stats.update(batch_stats) 129 | report_stats.update(batch_stats) 130 | 131 | # If truncated, don't backprop fully. 132 | if dec_state is not None: 133 | dec_state.detach() 134 | 135 | if report_func is not None: 136 | report_func(epoch, i, len(self.train_iter), 137 | total_stats.start_time, self.optim.lr, 138 | report_stats) 139 | report_stats = Statistics() 140 | 141 | return total_stats 142 | 143 | def validate(self): 144 | """ Called for each epoch to validate. """ 145 | # Set model in validating mode. 146 | self.model.eval() 147 | 148 | stats = Statistics() 149 | 150 | for batch in self.valid_iter: 151 | _, src_lengths = batch.src 152 | src = onmt.IO.make_features(batch, 'src') 153 | tgt = onmt.IO.make_features(batch, 'tgt') 154 | 155 | # F-prop through the model. 156 | outputs, attns, _ = self.model(src, tgt, src_lengths) 157 | 158 | # Compute loss. 159 | gen_state = onmt.Loss.make_gen_state( 160 | outputs, batch, attns, (0, batch.tgt.size(0))) 161 | _, batch_stats = self.valid_loss(batch, **gen_state) 162 | 163 | # Update statistics. 164 | stats.update(batch_stats) 165 | 166 | # Set model back to training mode. 167 | self.model.train() 168 | 169 | return stats 170 | 171 | def epoch_step(self, ppl, epoch): 172 | """ Called for each epoch to update learning rate. """ 173 | return self.optim.updateLearningRate(ppl, epoch) 174 | 175 | def drop_checkpoint(self, opt, epoch, fields, valid_stats): 176 | """ Called conditionally each epoch to save a snapshot. 
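The checkpoint stores the model and generator state dicts (unwrapped from DataParallel),
the vocabulary fields, the training options, the epoch number and the optimizer.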
""" 177 | real_model = (self.model.module 178 | if isinstance(self.model, nn.DataParallel) 179 | else self.model) 180 | real_generator = (real_model.generator.module 181 | if isinstance(real_model.generator, nn.DataParallel) 182 | else real_model.generator) 183 | 184 | model_state_dict = real_model.state_dict() 185 | model_state_dict = {k: v for k, v in model_state_dict.items() 186 | if 'generator' not in k} 187 | generator_state_dict = real_generator.state_dict() 188 | checkpoint = { 189 | 'model': model_state_dict, 190 | 'generator': generator_state_dict, 191 | 'vocab': onmt.IO.ONMTDataset.save_vocab(fields), 192 | 'opt': opt, 193 | 'epoch': epoch, 194 | 'optim': self.optim 195 | } 196 | torch.save(checkpoint, 197 | '%s_acc_%.2f_ppl_%.2f_e%d.pt' 198 | % (opt.save_model, valid_stats.accuracy(), 199 | valid_stats.ppl(), epoch)) 200 | -------------------------------------------------------------------------------- /onmt/Translator.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from torch.autograd import Variable 3 | 4 | import onmt 5 | import onmt.Models 6 | import onmt.ModelConstructor 7 | import onmt.modules 8 | import onmt.IO 9 | from onmt.Utils import use_gpu 10 | 11 | 12 | class Translator(object): 13 | def __init__(self, opt, dummy_opt={}): 14 | # Add in default model arguments, possibly added since training. 15 | self.opt = opt 16 | checkpoint = torch.load(opt.model, 17 | map_location=lambda storage, loc: storage) 18 | self.fields = onmt.IO.ONMTDataset.load_fields(checkpoint['vocab']) 19 | 20 | model_opt = checkpoint['opt'] 21 | for arg in dummy_opt: 22 | if arg not in model_opt: 23 | model_opt.__dict__[arg] = dummy_opt[arg] 24 | 25 | self._type = model_opt.encoder_type 26 | self.copy_attn = model_opt.copy_attn 27 | 28 | self.model = onmt.ModelConstructor.make_base_model( 29 | model_opt, self.fields, use_gpu(opt), checkpoint) 30 | self.model.eval() 31 | self.model.generator.eval() 32 | 33 | # for debugging 34 | self.beam_accum = None 35 | 36 | def initBeamAccum(self): 37 | self.beam_accum = { 38 | "predicted_ids": [], 39 | "beam_parent_ids": [], 40 | "scores": [], 41 | "log_probs": []} 42 | 43 | def buildTargetTokens(self, pred, src, attn, copy_vocab): 44 | vocab = self.fields["tgt"].vocab 45 | tokens = [] 46 | for tok in pred: 47 | if tok < len(vocab): 48 | tokens.append(vocab.itos[tok]) 49 | else: 50 | tokens.append(copy_vocab.itos[tok - len(vocab)]) 51 | if tokens[-1] == onmt.IO.EOS_WORD: 52 | tokens = tokens[:-1] 53 | break 54 | 55 | if self.opt.replace_unk and attn is not None: 56 | for i in range(len(tokens)): 57 | if tokens[i] == vocab.itos[onmt.IO.UNK]: 58 | _, maxIndex = attn[i].max(0) 59 | tokens[i] = self.fields["src"].vocab.itos[src[maxIndex[0]]] 60 | return tokens 61 | 62 | def _runTarget(self, batch, data): 63 | 64 | _, src_lengths = batch.src 65 | src = onmt.IO.make_features(batch, 'src') 66 | tgt_in = onmt.IO.make_features(batch, 'tgt')[:-1] 67 | 68 | # (1) run the encoder on the src 69 | encStates, context = self.model.encoder(src, src_lengths) 70 | decStates = self.model.decoder.init_decoder_state( 71 | src, context, encStates) 72 | 73 | # (2) if a target is specified, compute the 'goldScore' 74 | # (i.e. 
log likelihood) of the target under the model 75 | tt = torch.cuda if self.opt.cuda else torch 76 | goldScores = tt.FloatTensor(batch.batch_size).fill_(0) 77 | decOut, decStates, attn = self.model.decoder( 78 | tgt_in, context, decStates) 79 | 80 | tgt_pad = self.fields["tgt"].vocab.stoi[onmt.IO.PAD_WORD] 81 | for dec, tgt in zip(decOut, batch.tgt[1:].data): 82 | # Log prob of each word. 83 | out = self.model.generator.forward(dec) 84 | tgt = tgt.unsqueeze(1) 85 | scores = out.data.gather(1, tgt) 86 | scores.masked_fill_(tgt.eq(tgt_pad), 0) 87 | goldScores += scores 88 | return goldScores 89 | 90 | def translateBatch(self, batch, dataset): 91 | beam_size = self.opt.beam_size 92 | batch_size = batch.batch_size 93 | 94 | # (1) Run the encoder on the src. 95 | _, src_lengths = batch.src 96 | src = onmt.IO.make_features(batch, 'src') 97 | encStates, context = self.model.encoder(src, src_lengths) 98 | decStates = self.model.decoder.init_decoder_state( 99 | src, context, encStates) 100 | 101 | # (1b) Initialize for the decoder. 102 | def var(a): return Variable(a, volatile=True) 103 | 104 | def rvar(a): return var(a.repeat(1, beam_size, 1)) 105 | 106 | # Repeat everything beam_size times. 107 | context = rvar(context.data) 108 | src = rvar(src.data) 109 | srcMap = rvar(batch.src_map.data) 110 | decStates.repeat_beam_size_times(beam_size) 111 | scorer = None 112 | # scorer=onmt.GNMTGlobalScorer(0.3, 0.4) 113 | beam = [onmt.Beam(beam_size, n_best=self.opt.n_best, 114 | cuda=self.opt.cuda, 115 | vocab=self.fields["tgt"].vocab, 116 | global_scorer=scorer) 117 | for __ in range(batch_size)] 118 | 119 | # (2) run the decoder to generate sentences, using beam search. 120 | 121 | def bottle(m): 122 | return m.view(batch_size * beam_size, -1) 123 | 124 | def unbottle(m): 125 | return m.view(beam_size, batch_size, -1) 126 | 127 | for i in range(self.opt.max_sent_length): 128 | 129 | if all((b.done() for b in beam)): 130 | break 131 | 132 | # Construct batch x beam_size nxt words. 133 | # Get all the pending current beam words and arrange for forward. 134 | inp = var(torch.stack([b.getCurrentState() for b in beam]) 135 | .t().contiguous().view(1, -1)) 136 | 137 | # Turn any copied words to UNKs 138 | # 0 is unk 139 | if self.copy_attn: 140 | inp = inp.masked_fill( 141 | inp.gt(len(self.fields["tgt"].vocab) - 1), 0) 142 | 143 | # Temporary kludge solution to handle changed dim expectation 144 | # in the decoder 145 | inp = inp.unsqueeze(2) 146 | 147 | # Run one step. 148 | decOut, decStates, attn = \ 149 | self.model.decoder(inp, context, decStates) 150 | decOut = decOut.squeeze(0) 151 | # decOut: beam x rnn_size 152 | 153 | # (b) Compute a vector of batch*beam word scores. 154 | if not self.copy_attn: 155 | out = self.model.generator.forward(decOut).data 156 | out = unbottle(out) 157 | # beam x tgt_vocab 158 | else: 159 | out = self.model.generator.forward(decOut, 160 | attn["copy"].squeeze(0), 161 | srcMap) 162 | # beam x (tgt_vocab + extra_vocab) 163 | out = dataset.collapse_copy_scores( 164 | unbottle(out.data), 165 | batch, self.fields["tgt"].vocab) 166 | # beam x tgt_vocab 167 | out = out.log() 168 | 169 | # (c) Advance each beam. 170 | for j, b in enumerate(beam): 171 | b.advance(out[:, j], unbottle(attn["std"]).data[:, j]) 172 | decStates.beam_update(j, b.getCurrentOrigin(), beam_size) 173 | 174 | if "tgt" in batch.__dict__: 175 | allGold = self._runTarget(batch, dataset) 176 | else: 177 | allGold = [0] * batch_size 178 | 179 | # (3) Package everything up. 
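# For every sentence in the batch, sort the finished beam hypotheses by score
# and keep the n_best of them, together with their attention weights;
# translate() below maps these back to tokens.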
180 | allHyps, allScores, allAttn = [], [], [] 181 | for b in beam: 182 | n_best = self.opt.n_best 183 | scores, ks = b.sortFinished(minimum=n_best) 184 | hyps, attn = [], [] 185 | for i, (times, k) in enumerate(ks[:n_best]): 186 | hyp, att = b.getHyp(times, k) 187 | hyps.append(hyp) 188 | attn.append(att) 189 | allHyps.append(hyps) 190 | allScores.append(scores) 191 | allAttn.append(attn) 192 | 193 | return allHyps, allScores, allAttn, allGold 194 | 195 | def translate(self, batch, data): 196 | # (1) convert words to indexes 197 | batch_size = batch.batch_size 198 | 199 | # (2) translate 200 | pred, predScore, attn, goldScore = self.translateBatch(batch, data) 201 | assert(len(goldScore) == len(pred)) 202 | pred, predScore, attn, goldScore, i = list(zip( 203 | *sorted(zip(pred, predScore, attn, goldScore, 204 | batch.indices.data), 205 | key=lambda x: x[-1]))) 206 | inds, perm = torch.sort(batch.indices.data) 207 | 208 | # (3) convert indexes to words 209 | predBatch, goldBatch = [], [] 210 | src = batch.src[0].data.index_select(1, perm) 211 | if self.opt.tgt: 212 | tgt = batch.tgt.data.index_select(1, perm) 213 | for b in range(batch_size): 214 | src_vocab = data.src_vocabs[inds[b]] 215 | predBatch.append( 216 | [self.buildTargetTokens(pred[b][n], src[:, b], 217 | attn[b][n], src_vocab) 218 | for n in range(self.opt.n_best)]) 219 | if self.opt.tgt: 220 | goldBatch.append( 221 | self.buildTargetTokens(tgt[1:, b], src[:, b], 222 | None, None)) 223 | return predBatch, goldBatch, predScore, goldScore, attn, src 224 | -------------------------------------------------------------------------------- /onmt/Utils.py: -------------------------------------------------------------------------------- 1 | def aeq(*args): 2 | """ 3 | Assert all arguments have the same value 4 | """ 5 | arguments = (arg for arg in args) 6 | first = next(arguments) 7 | assert all(arg == first for arg in arguments), \ 8 | "Not all arguments have the same value: " + str(args) 9 | 10 | 11 | def use_gpu(opt): 12 | return (hasattr(opt, 'gpuid') and len(opt.gpuid) > 0) or \ 13 | (hasattr(opt, 'gpu') and opt.gpu > -1) 14 | -------------------------------------------------------------------------------- /onmt/__init__.py: -------------------------------------------------------------------------------- 1 | import onmt.IO 2 | import onmt.Models 3 | import onmt.Loss 4 | from onmt.Trainer import Trainer, Statistics 5 | from onmt.Translator import Translator 6 | from onmt.Optim import Optim 7 | from onmt.Beam import Beam, GNMTGlobalScorer 8 | import onmt.standard_options 9 | 10 | # For flake8 compatibility 11 | __all__ = [onmt.Loss, onmt.IO, onmt.Models, Trainer, Translator, 12 | Optim, Beam, Statistics, GNMTGlobalScorer, onmt.standard_options] 13 | -------------------------------------------------------------------------------- /onmt/modules/Conv2Conv.py: -------------------------------------------------------------------------------- 1 | """ 2 | Implementation of "Convolutional Sequence to Sequence Learning" 3 | """ 4 | import torch 5 | import torch.nn as nn 6 | import torch.nn.init as init 7 | import torch.nn.functional as F 8 | from torch.autograd import Variable 9 | 10 | import onmt.modules 11 | from onmt.modules.WeightNorm import WeightNormConv2d 12 | from onmt.Models import EncoderBase 13 | from onmt.Models import DecoderState 14 | from onmt.Utils import aeq 15 | 16 | 17 | SCALE_WEIGHT = 0.5 ** 0.5 18 | 19 | 20 | def shape_transform(x): 21 | """ Tranform the size of the tensors to fit for conv input. 
""" 22 | return torch.unsqueeze(torch.transpose(x, 1, 2), 3) 23 | 24 | 25 | class GatedConv(nn.Module): 26 | def __init__(self, input_size, width=3, dropout=0.2, nopad=False): 27 | super(GatedConv, self).__init__() 28 | self.conv = WeightNormConv2d(input_size, 2 * input_size, 29 | kernel_size=(width, 1), stride=(1, 1), 30 | padding=(width // 2 * (1 - nopad), 0)) 31 | init.xavier_uniform(self.conv.weight, gain=(4 * (1 - dropout))**0.5) 32 | self.dropout = nn.Dropout(dropout) 33 | 34 | def forward(self, x_var, hidden=None): 35 | x_var = self.dropout(x_var) 36 | x_var = self.conv(x_var) 37 | out, gate = x_var.split(int(x_var.size(1) / 2), 1) 38 | out = out * F.sigmoid(gate) 39 | return out 40 | 41 | 42 | class StackedCNN(nn.Module): 43 | def __init__(self, num_layers, input_size, cnn_kernel_width=3, 44 | dropout=0.2): 45 | super(StackedCNN, self).__init__() 46 | self.dropout = dropout 47 | self.num_layers = num_layers 48 | self.layers = nn.ModuleList() 49 | for i in range(num_layers): 50 | self.layers.append( 51 | GatedConv(input_size, cnn_kernel_width, dropout)) 52 | 53 | def forward(self, x, hidden=None): 54 | for conv in self.layers: 55 | x = x + conv(x) 56 | x *= SCALE_WEIGHT 57 | return x 58 | 59 | 60 | class CNNEncoder(EncoderBase): 61 | """ 62 | Encoder built on CNN. 63 | """ 64 | def __init__(self, num_layers, hidden_size, 65 | cnn_kernel_width, dropout, embeddings): 66 | super(CNNEncoder, self).__init__() 67 | 68 | self.embeddings = embeddings 69 | input_size = embeddings.embedding_size 70 | self.linear = nn.Linear(input_size, hidden_size) 71 | self.cnn = StackedCNN(num_layers, hidden_size, 72 | cnn_kernel_width, dropout) 73 | 74 | def forward(self, input, lengths=None, hidden=None): 75 | """ See EncoderBase.forward() for description of args and returns.""" 76 | self._check_args(input, lengths, hidden) 77 | 78 | emb = self.embeddings(input) 79 | s_len, batch, emb_dim = emb.size() 80 | 81 | emb = emb.transpose(0, 1).contiguous() 82 | emb_reshape = emb.view(emb.size(0) * emb.size(1), -1) 83 | emb_remap = self.linear(emb_reshape) 84 | emb_remap = emb_remap.view(emb.size(0), emb.size(1), -1) 85 | emb_remap = shape_transform(emb_remap) 86 | out = self.cnn(emb_remap) 87 | 88 | return emb_remap.squeeze(3).transpose(0, 1).contiguous(),\ 89 | out.squeeze(3).transpose(0, 1).contiguous() 90 | 91 | 92 | class CNNDecoder(nn.Module): 93 | """ 94 | Decoder built on CNN, which consists of resduial convolutional layers, 95 | with ConvMultiStepAttention. 96 | """ 97 | def __init__(self, num_layers, hidden_size, attn_type, 98 | copy_attn, cnn_kernel_width, dropout, embeddings): 99 | super(CNNDecoder, self).__init__() 100 | 101 | # Basic attributes. 102 | self.decoder_type = 'cnn' 103 | self.num_layers = num_layers 104 | self.hidden_size = hidden_size 105 | self.cnn_kernel_width = cnn_kernel_width 106 | self.embeddings = embeddings 107 | self.dropout = dropout 108 | 109 | # Build the CNN. 110 | input_size = self.embeddings.embedding_size 111 | self.linear = nn.Linear(input_size, self.hidden_size) 112 | self.conv_layers = nn.ModuleList() 113 | for i in range(self.num_layers): 114 | self.conv_layers.append( 115 | GatedConv(self.hidden_size, self.cnn_kernel_width, 116 | self.dropout, True)) 117 | 118 | self.attn_layers = nn.ModuleList() 119 | for i in range(self.num_layers): 120 | self.attn_layers.append( 121 | onmt.modules.ConvMultiStepAttention(self.hidden_size)) 122 | 123 | # CNNDecoder has its own attention mechanism. 124 | # Set up a separated copy attention layer, if needed. 
125 | self._copy = False 126 | if copy_attn: 127 | self.copy_attn = onmt.modules.GlobalAttention( 128 | hidden_size, attn_type=attn_type) 129 | self._copy = True 130 | 131 | def forward(self, input, context, state): 132 | """ 133 | Forward through the CNNDecoder. 134 | Args: 135 | input (LongTensor): a sequence of input tokens tensors 136 | of size (len x batch x nfeats). 137 | context (FloatTensor): output(tensor sequence) from the encoder 138 | CNN of size (src_len x batch x hidden_size). 139 | state (FloatTensor): hidden state from the encoder CNN for 140 | initializing the decoder. 141 | Returns: 142 | outputs (FloatTensor): a Tensor sequence of output from the decoder 143 | of shape (len x batch x hidden_size). 144 | state (FloatTensor): final hidden state from the decoder. 145 | attns (dict of (str, FloatTensor)): a dictionary of different 146 | type of attention Tensor from the decoder 147 | of shape (src_len x batch). 148 | """ 149 | # CHECKS 150 | assert isinstance(state, CNNDecoderState) 151 | input_len, input_batch, _ = input.size() 152 | contxt_len, contxt_batch, _ = context.size() 153 | aeq(input_batch, contxt_batch) 154 | # END CHECKS 155 | 156 | if state.previous_input is not None: 157 | input = torch.cat([state.previous_input, input], 0) 158 | 159 | # Initialize return variables. 160 | outputs = [] 161 | attns = {"std": []} 162 | assert not self._copy, "Copy mechanism not yet tested in conv2conv" 163 | if self._copy: 164 | attns["copy"] = [] 165 | 166 | emb = self.embeddings(input) 167 | assert emb.dim() == 3 # len x batch x embedding_dim 168 | 169 | tgt_emb = emb.transpose(0, 1).contiguous() 170 | # The output of CNNEncoder. 171 | src_context_t = context.transpose(0, 1).contiguous() 172 | # The combination of output of CNNEncoder and source embeddings. 173 | src_context_c = state.init_src.transpose(0, 1).contiguous() 174 | 175 | # Run the forward pass of the CNNDecoder. 176 | emb_reshape = tgt_emb.contiguous().view( 177 | tgt_emb.size(0) * tgt_emb.size(1), -1) 178 | linear_out = self.linear(emb_reshape) 179 | x = linear_out.view(tgt_emb.size(0), tgt_emb.size(1), -1) 180 | x = shape_transform(x) 181 | 182 | pad = Variable(torch.zeros(x.size(0), x.size(1), 183 | self.cnn_kernel_width - 1, 1)) 184 | pad = pad.type_as(x) 185 | base_target_emb = x 186 | 187 | for conv, attention in zip(self.conv_layers, self.attn_layers): 188 | new_target_input = torch.cat([pad, x], 2) 189 | out = conv(new_target_input) 190 | c, attn = attention(base_target_emb, out, 191 | src_context_t, src_context_c) 192 | x = (x + (c + out) * SCALE_WEIGHT) * SCALE_WEIGHT 193 | output = x.squeeze(3).transpose(1, 2) 194 | 195 | # Process the result and update the attentions. 196 | outputs = output.transpose(0, 1).contiguous() 197 | if state.previous_input is not None: 198 | outputs = outputs[state.previous_input.size(0):] 199 | attn = attn[:, state.previous_input.size(0):].squeeze() 200 | attn = torch.stack([attn]) 201 | attns["std"] = attn 202 | if self._copy: 203 | attns["copy"] = attn 204 | 205 | # Update the state. 
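# Cache the full (possibly concatenated) target input on the decoder state so
# the next incremental decoding step can prepend it (see the torch.cat at the
# top of forward()).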
206 | state.update_state(input) 207 | 208 | return outputs, state, attns 209 | 210 | def init_decoder_state(self, src, context, enc_hidden): 211 | return CNNDecoderState(context, enc_hidden) 212 | 213 | 214 | class CNNDecoderState(DecoderState): 215 | def __init__(self, context, enc_hidden): 216 | self.init_src = (context + enc_hidden) * SCALE_WEIGHT 217 | self.previous_input = None 218 | 219 | @property 220 | def _all(self): 221 | """ 222 | Contains attributes that need to be updated in self.beam_update(). 223 | """ 224 | return (self.previous_input,) 225 | 226 | def update_state(self, input): 227 | """ Called for every decoder forward pass. """ 228 | self.previous_input = input 229 | 230 | def repeat_beam_size_times(self, beam_size): 231 | """ Repeat beam_size times along batch dimension. """ 232 | self.init_src = Variable( 233 | self.init_src.data.repeat(1, beam_size, 1), volatile=True) 234 | -------------------------------------------------------------------------------- /onmt/modules/ConvMultiStepAttention.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | import torch.nn.functional as F 4 | from onmt.Utils import aeq 5 | 6 | 7 | SCALE_WEIGHT = 0.5 ** 0.5 8 | 9 | 10 | def seq_linear(linear, x): 11 | # linear transform for 3-d tensor 12 | batch, hidden_size, length, _ = x.size() 13 | h = linear(torch.transpose(x, 1, 2).contiguous().view( 14 | batch * length, hidden_size)) 15 | return torch.transpose(h.view(batch, length, hidden_size, 1), 1, 2) 16 | 17 | 18 | class ConvMultiStepAttention(nn.Module): 19 | def __init__(self, input_size): 20 | super(ConvMultiStepAttention, self).__init__() 21 | self.linear_in = nn.Linear(input_size, input_size) 22 | self.mask = None 23 | 24 | def applyMask(self, mask): 25 | self.mask = mask 26 | 27 | def forward(self, base_target_emb, input, encoder_out_top, 28 | encoder_out_combine): 29 | """ 30 | It's like Luong Attetion. 31 | Conv attention takes a key matrix, a value matrix and a query vector. 32 | Attention weight is calculated by key matrix with the query vector 33 | and sum on the value matrix. And the same operation is applied 34 | in each decode conv layer. 
35 | Args: 36 | base_target_emb: target emb tensor 37 | input: output of decode conv 38 | encoder_out_t: the key matrix for calculation of attetion weight, 39 | which is the top output of encode conv 40 | encoder_out_c: the value matrix for the attention-weighted sum, 41 | which is the combination of base emb and top output of encode 42 | 43 | """ 44 | # checks 45 | batch, channel, height, width = base_target_emb.size() 46 | batch_, channel_, height_, width_ = input.size() 47 | aeq(batch, batch_) 48 | aeq(height, height_) 49 | 50 | enc_batch, enc_channel, enc_height = encoder_out_top.size() 51 | enc_batch_, enc_channel_, enc_height_ = encoder_out_combine.size() 52 | 53 | aeq(enc_batch, enc_batch_) 54 | aeq(enc_height, enc_height_) 55 | 56 | preatt = seq_linear(self.linear_in, input) 57 | target = (base_target_emb + preatt) * SCALE_WEIGHT 58 | target = torch.squeeze(target, 3) 59 | target = torch.transpose(target, 1, 2) 60 | pre_attn = torch.bmm(target, encoder_out_top) 61 | 62 | if self.mask is not None: 63 | pre_attn.data.masked_fill_(self.mask, -float('inf')) 64 | 65 | attn = F.softmax(pre_attn) 66 | context_output = torch.bmm( 67 | attn, torch.transpose(encoder_out_combine, 1, 2)) 68 | context_output = torch.transpose( 69 | torch.unsqueeze(context_output, 3), 1, 2) 70 | return context_output, attn 71 | -------------------------------------------------------------------------------- /onmt/modules/CopyGenerator.py: -------------------------------------------------------------------------------- 1 | import torch.nn as nn 2 | import torch.nn.functional as F 3 | import torch 4 | import torch.cuda 5 | 6 | import onmt 7 | from onmt.Utils import aeq 8 | 9 | 10 | class CopyGenerator(nn.Module): 11 | """ 12 | Generator module that additionally considers copying 13 | words directly from the source. 14 | """ 15 | def __init__(self, opt, src_dict, tgt_dict): 16 | super(CopyGenerator, self).__init__() 17 | self.linear = nn.Linear(opt.rnn_size, len(tgt_dict)) 18 | self.linear_copy = nn.Linear(opt.rnn_size, 1) 19 | self.src_dict = src_dict 20 | self.tgt_dict = tgt_dict 21 | 22 | def forward(self, hidden, attn, src_map): 23 | """ 24 | Computes p(w) = p(z=1) p_{copy}(w|z=0) + p(z=0) * p_{softmax}(w|z=0) 25 | """ 26 | # CHECKS 27 | batch_by_tlen, _ = hidden.size() 28 | batch_by_tlen_, slen = attn.size() 29 | slen_, batch, cvocab = src_map.size() 30 | aeq(batch_by_tlen, batch_by_tlen_) 31 | aeq(slen, slen_) 32 | 33 | # Original probabilities. 34 | logits = self.linear(hidden) 35 | logits[:, self.tgt_dict.stoi[onmt.IO.PAD_WORD]] = -float('inf') 36 | prob = F.softmax(logits) 37 | 38 | # Probability of copying p(z=1) batch. 39 | copy = F.sigmoid(self.linear_copy(hidden)) 40 | 41 | # Probibility of not copying: p_{word}(w) * (1 - p(z)) 42 | out_prob = torch.mul(prob, 1 - copy.expand_as(prob)) 43 | mul_attn = torch.mul(attn, copy.expand_as(attn)) 44 | copy_prob = torch.bmm(mul_attn.view(-1, batch, slen) 45 | .transpose(0, 1), 46 | src_map.transpose(0, 1)).transpose(0, 1) 47 | copy_prob = copy_prob.contiguous().view(-1, cvocab) 48 | return torch.cat([out_prob, copy_prob], 1) 49 | 50 | 51 | class CopyGeneratorCriterion(object): 52 | def __init__(self, vocab_size, force_copy, pad, eps=1e-20): 53 | self.force_copy = force_copy 54 | self.eps = eps 55 | self.offset = vocab_size 56 | self.pad = pad 57 | 58 | def __call__(self, scores, align, target): 59 | align = align.view(-1) 60 | 61 | # Copy prob. 
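# Copied-token scores are stored after the regular target vocabulary (see
# CopyGenerator.forward above), so the copy index is shifted by the vocabulary
# size; align == 0 marks positions with no source word to copy, and their
# copy probability is zeroed out.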
62 | out = scores.gather(1, align.view(-1, 1) + self.offset) \ 63 | .view(-1).mul(align.ne(0).float()) 64 | tmp = scores.gather(1, target.view(-1, 1)).view(-1) 65 | 66 | # Regular prob (no unks and unks that can't be copied) 67 | if not self.force_copy: 68 | out = out + self.eps + tmp.mul(target.ne(0).float()) + \ 69 | tmp.mul(align.eq(0).float()).mul(target.eq(0).float()) 70 | else: 71 | # Forced copy. 72 | out = out + self.eps + tmp.mul(align.eq(0).float()) 73 | 74 | # Drop padding. 75 | loss = -out.log().mul(target.ne(self.pad).float()).sum() 76 | return loss 77 | 78 | 79 | class CopyGeneratorLossCompute(onmt.Loss.LossComputeBase): 80 | """ 81 | Copy Generator Loss Computation. 82 | """ 83 | def __init__(self, generator, tgt_vocab, dataset, 84 | force_copy, eps=1e-20): 85 | super(CopyGeneratorLossCompute, self).__init__(generator, tgt_vocab) 86 | 87 | self.dataset = dataset 88 | self.copy_attn = True 89 | self.force_copy = force_copy 90 | self.criterion = CopyGeneratorCriterion(len(tgt_vocab), force_copy, 91 | self.padding_idx) 92 | 93 | def compute_loss(self, batch, output, target, copy_attn, align): 94 | """ 95 | Compute the loss. The args must match Loss.make_gen_state(). 96 | Args: 97 | batch: the current batch. 98 | output: the predict output from the model. 99 | target: the validate target to compare output with. 100 | copy_attn: the copy attention value. 101 | align: the align info. 102 | """ 103 | target = target.view(-1) 104 | align = align.view(-1) 105 | scores = self.generator(self.bottle(output), 106 | self.bottle(copy_attn), 107 | batch.src_map) 108 | 109 | loss = self.criterion(scores, align, target) 110 | 111 | scores_data = scores.data.clone() 112 | scores_data = self.dataset.collapse_copy_scores( 113 | self.unbottle(scores_data, batch.batch_size), 114 | batch, self.tgt_vocab) 115 | scores_data = self.bottle(scores_data) 116 | 117 | # Correct target is copy when only option. 118 | # TODO: replace for loop with masking or boolean indexing 119 | target_data = target.data.clone() 120 | for i in range(target_data.size(0)): 121 | if target_data[i] == 0 and align.data[i] != 0: 122 | target_data[i] = align.data[i] + len(self.tgt_vocab) 123 | 124 | # Coverage loss term. 125 | loss_data = loss.data.clone() 126 | 127 | stats = self.stats(loss_data, scores_data, target_data) 128 | 129 | return loss, stats 130 | -------------------------------------------------------------------------------- /onmt/modules/Embeddings.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | from torch.autograd import Variable 4 | 5 | from onmt.modules import BottleLinear, Elementwise 6 | from onmt.Utils import aeq 7 | 8 | 9 | class PositionalEncoding(nn.Module): 10 | 11 | def __init__(self, dropout, dim, max_len=5000): 12 | pe = torch.arange(0, max_len).unsqueeze(1).expand(max_len, dim) 13 | div_term = 1 / torch.pow(10000, torch.arange(0, dim * 2, 2) / dim) 14 | pe = pe * div_term.expand_as(pe) 15 | pe[:, 0::2] = torch.sin(pe[:, 0::2]) 16 | pe[:, 1::2] = torch.cos(pe[:, 1::2]) 17 | pe = pe.unsqueeze(1) 18 | super(PositionalEncoding, self).__init__() 19 | self.register_buffer('pe', pe) 20 | self.dropout = nn.Dropout(p=dropout) 21 | 22 | def forward(self, emb): 23 | # We must wrap the self.pe in Variable to compute, not the other 24 | # way - unwrap emb(i.e. emb.data). Otherwise the computation 25 | # wouldn't be watched to build the compute graph. 
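# Add the fixed sinusoidal position encodings elementwise to the word
# embeddings, then apply dropout.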
26 | emb = emb + Variable(self.pe[:emb.size(0), :1, :emb.size(2)] 27 | .expand_as(emb), requires_grad=False) 28 | emb = self.dropout(emb) 29 | return emb 30 | 31 | 32 | class Embeddings(nn.Module): 33 | """ 34 | Words embeddings dictionary for encoder/decoder. 35 | 36 | Args: 37 | word_vec_size (int): size of the dictionary of embeddings. 38 | position_encoding (bool): use a sin to mark relative words positions. 39 | feat_merge (string): merge action for the features embeddings: 40 | concat, sum or mlp. 41 | feat_vec_exponent (float): when using '-feat_merge concat', feature 42 | embedding size is N^feat_dim_exponent, where N is the 43 | number of values of feature takes. 44 | feat_vec_size (int): embedding dimension for features when using 45 | '-feat_merge mlp' 46 | dropout (float): dropout probability. 47 | word_padding_idx (int): padding index for words in the embeddings. 48 | feats_padding_idx ([int]): padding index for a list of features 49 | in the embeddings. 50 | word_vocab_size (int): size of dictionary of embeddings for words. 51 | feat_vocab_sizes ([int], optional): list of size of dictionary 52 | of embeddings for each feature. 53 | """ 54 | def __init__(self, word_vec_size, position_encoding, feat_merge, 55 | feat_vec_exponent, feat_vec_size, dropout, 56 | word_padding_idx, feat_padding_idx, 57 | word_vocab_size, feat_vocab_sizes=[]): 58 | 59 | self.word_padding_idx = word_padding_idx 60 | 61 | # Dimensions and padding for constructing the word embedding matrix 62 | vocab_sizes = [word_vocab_size] 63 | emb_dims = [word_vec_size] 64 | pad_indices = [word_padding_idx] 65 | 66 | # Dimensions and padding for feature embedding matrices 67 | # (these have no effect if feat_vocab_sizes is empty) 68 | if feat_merge == 'sum': 69 | feat_dims = [word_vec_size] * len(feat_vocab_sizes) 70 | elif feat_vec_size > 0: 71 | feat_dims = [feat_vec_size] * len(feat_vocab_sizes) 72 | else: 73 | feat_dims = [int(vocab ** feat_vec_exponent) 74 | for vocab in feat_vocab_sizes] 75 | vocab_sizes.extend(feat_vocab_sizes) 76 | emb_dims.extend(feat_dims) 77 | pad_indices.extend(feat_padding_idx) 78 | 79 | # The embedding matrix look-up tables. The first look-up table 80 | # is for words. Subsequent ones are for features, if any exist. 81 | emb_params = zip(vocab_sizes, emb_dims, pad_indices) 82 | embeddings = [nn.Embedding(vocab, dim, padding_idx=pad) 83 | for vocab, dim, pad in emb_params] 84 | emb_luts = Elementwise(feat_merge, embeddings) 85 | 86 | # The final output size of word + feature vectors. This can vary 87 | # from the word vector size if and only if features are defined. 88 | # This is the attribute you should access if you need to know 89 | # how big your embeddings are going to be. 90 | self.embedding_size = (sum(emb_dims) if feat_merge == 'concat' 91 | else word_vec_size) 92 | 93 | # The sequence of operations that converts the input sequence 94 | # into a sequence of embeddings. At minimum this consists of 95 | # looking up the embeddings for each word and feature in the 96 | # input. Model parameters may require the sequence to contain 97 | # additional operations as well. 
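# nn.Module.__init__ must run before any submodules are assigned to self,
# which is why it is called here, after the plain attributes above.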
98 | super(Embeddings, self).__init__() 99 | self.make_embedding = nn.Sequential() 100 | self.make_embedding.add_module('emb_luts', emb_luts) 101 | 102 | if feat_merge == 'mlp': 103 | in_dim = sum(emb_dims) 104 | out_dim = word_vec_size 105 | mlp = nn.Sequential(BottleLinear(in_dim, out_dim), nn.ReLU()) 106 | self.make_embedding.add_module('mlp', mlp) 107 | 108 | if position_encoding: 109 | pe = PositionalEncoding(dropout, self.embedding_size) 110 | self.make_embedding.add_module('pe', pe) 111 | 112 | @property 113 | def word_lut(self): 114 | return self.make_embedding[0][0] 115 | 116 | @property 117 | def emb_luts(self): 118 | return self.make_embedding[0] 119 | 120 | def load_pretrained_vectors(self, emb_file, fixed): 121 | if emb_file: 122 | pretrained = torch.load(emb_file) 123 | self.word_lut.weight.data.copy_(pretrained) 124 | if fixed: 125 | self.word_lut.weight.requires_grad = False 126 | 127 | def forward(self, input): 128 | """ 129 | Return the embeddings for words, and features if there are any. 130 | Args: 131 | input (LongTensor): len x batch x nfeat 132 | Return: 133 | emb (FloatTensor): len x batch x self.embedding_size 134 | """ 135 | in_length, in_batch, nfeat = input.size() 136 | aeq(nfeat, len(self.emb_luts)) 137 | 138 | emb = self.make_embedding(input) 139 | 140 | out_length, out_batch, emb_size = emb.size() 141 | aeq(in_length, out_length) 142 | aeq(in_batch, out_batch) 143 | aeq(emb_size, self.embedding_size) 144 | 145 | return emb 146 | -------------------------------------------------------------------------------- /onmt/modules/Gate.py: -------------------------------------------------------------------------------- 1 | """ 2 | Context gate is a decoder module that takes as input the previous word 3 | embedding, the current decoder state and the attention state, and produces a 4 | gate. 5 | The gate can be used to select the input from the target side context 6 | (decoder state), from the source context (attention state) or both. 
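The three gate variants below differ only in how the gate z is applied:
SourceContextGate scales the source projection, TargetContextGate scales the
target projection, and BothContextGate interpolates between the two.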
7 | """ 8 | import torch 9 | import torch.nn as nn 10 | 11 | 12 | def ContextGateFactory(type, embeddings_size, decoder_size, 13 | attention_size, output_size): 14 | """Returns the correct ContextGate class""" 15 | 16 | gate_types = {'source': SourceContextGate, 17 | 'target': TargetContextGate, 18 | 'both': BothContextGate} 19 | 20 | assert type in gate_types, "Not valid ContextGate type: {0}".format(type) 21 | return gate_types[type](embeddings_size, decoder_size, attention_size, 22 | output_size) 23 | 24 | 25 | class ContextGate(nn.Module): 26 | """Implement up to the computation of the gate""" 27 | 28 | def __init__(self, embeddings_size, decoder_size, 29 | attention_size, output_size): 30 | super(ContextGate, self).__init__() 31 | input_size = embeddings_size + decoder_size + attention_size 32 | self.gate = nn.Linear(input_size, output_size, bias=True) 33 | self.sig = nn.Sigmoid() 34 | self.source_proj = nn.Linear(attention_size, output_size) 35 | self.target_proj = nn.Linear(embeddings_size + decoder_size, 36 | output_size) 37 | 38 | def forward(self, prev_emb, dec_state, attn_state): 39 | input_tensor = torch.cat((prev_emb, dec_state, attn_state), dim=1) 40 | z = self.sig(self.gate(input_tensor)) 41 | proj_source = self.source_proj(attn_state) 42 | proj_target = self.target_proj( 43 | torch.cat((prev_emb, dec_state), dim=1)) 44 | return z, proj_source, proj_target 45 | 46 | 47 | class SourceContextGate(nn.Module): 48 | """Apply the context gate only to the source context""" 49 | 50 | def __init__(self, embeddings_size, decoder_size, 51 | attention_size, output_size): 52 | super(SourceContextGate, self).__init__() 53 | self.context_gate = ContextGate(embeddings_size, decoder_size, 54 | attention_size, output_size) 55 | self.tanh = nn.Tanh() 56 | 57 | def forward(self, prev_emb, dec_state, attn_state): 58 | z, source, target = self.context_gate( 59 | prev_emb, dec_state, attn_state) 60 | return self.tanh(target + z * source) 61 | 62 | 63 | class TargetContextGate(nn.Module): 64 | """Apply the context gate only to the target context""" 65 | 66 | def __init__(self, embeddings_size, decoder_size, 67 | attention_size, output_size): 68 | super(TargetContextGate, self).__init__() 69 | self.context_gate = ContextGate(embeddings_size, decoder_size, 70 | attention_size, output_size) 71 | self.tanh = nn.Tanh() 72 | 73 | def forward(self, prev_emb, dec_state, attn_state): 74 | z, source, target = self.context_gate(prev_emb, dec_state, attn_state) 75 | return self.tanh(z * target + source) 76 | 77 | 78 | class BothContextGate(nn.Module): 79 | """Apply the context gate to both contexts""" 80 | 81 | def __init__(self, embeddings_size, decoder_size, 82 | attention_size, output_size): 83 | super(BothContextGate, self).__init__() 84 | self.context_gate = ContextGate(embeddings_size, decoder_size, 85 | attention_size, output_size) 86 | self.tanh = nn.Tanh() 87 | 88 | def forward(self, prev_emb, dec_state, attn_state): 89 | z, source, target = self.context_gate(prev_emb, dec_state, attn_state) 90 | return self.tanh((1. - z) * target + z * source) 91 | -------------------------------------------------------------------------------- /onmt/modules/GlobalAttention.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | 4 | from onmt.modules.UtilClass import BottleLinear 5 | from onmt.Utils import aeq 6 | 7 | 8 | class GlobalAttention(nn.Module): 9 | """ 10 | Luong Attention. 
11 | 12 | Global attention takes a matrix and a query vector. It 13 | then computes a parameterized convex combination of the matrix 14 | based on the input query. 15 | 16 | 17 | H_1 H_2 H_3 ... H_n 18 | q q q q 19 | | | | | 20 | \ | | / 21 | ..... 22 | \ | / 23 | a 24 | 25 | Constructs a unit mapping. 26 | $$(H_1 + H_n, q) => (a)$$ 27 | Where H is of `batch x n x dim` and q is of `batch x dim`. 28 | 29 | Luong Attention (dot, general): 30 | The full function is 31 | $$\tanh(W_2 [(softmax((W_1 q + b_1) H) H), q] + b_2)$$. 32 | 33 | * dot: $$score(h_t,{\overline{h}}_s) = h_t^T{\overline{h}}_s$$ 34 | * general: $$score(h_t,{\overline{h}}_s) = h_t^T W_a {\overline{h}}_s$$ 35 | 36 | Bahdanau Attention (mlp): 37 | $$c = \sum_{j=1}^{SeqLength}\a_jh_j$$. 38 | The Alignment-function $$a$$ computes an alignment as: 39 | $$a_j = softmax(v_a^T \tanh(W_a q + U_a h_j) )$$. 40 | 41 | """ 42 | def __init__(self, dim, coverage=False, attn_type="dot"): 43 | super(GlobalAttention, self).__init__() 44 | 45 | self.dim = dim 46 | self.attn_type = attn_type 47 | assert (self.attn_type in ["dot", "general", "mlp"]), ( 48 | "Please select a valid attention type.") 49 | 50 | if self.attn_type == "general": 51 | self.linear_in = nn.Linear(dim, dim, bias=False) 52 | elif self.attn_type == "mlp": 53 | self.linear_context = BottleLinear(dim, dim, bias=False) 54 | self.linear_query = nn.Linear(dim, dim, bias=True) 55 | self.v = BottleLinear(dim, 1, bias=False) 56 | # mlp wants it with bias 57 | out_bias = self.attn_type == "mlp" 58 | self.linear_out = nn.Linear(dim*2, dim, bias=out_bias) 59 | 60 | self.sm = nn.Softmax() 61 | self.tanh = nn.Tanh() 62 | self.mask = None 63 | 64 | if coverage: 65 | self.linear_cover = nn.Linear(1, dim, bias=False) 66 | 67 | def applyMask(self, mask): 68 | self.mask = mask 69 | 70 | def score(self, h_t, h_s): 71 | """ 72 | h_t (FloatTensor): batch x tgt_len x dim 73 | h_s (FloatTensor): batch x src_len x dim 74 | returns scores (FloatTensor): batch x tgt_len x src_len: 75 | raw attention scores for each src index 76 | """ 77 | 78 | # Check input sizes 79 | src_batch, src_len, src_dim = h_s.size() 80 | tgt_batch, tgt_len, tgt_dim = h_t.size() 81 | aeq(src_batch, tgt_batch) 82 | aeq(src_dim, tgt_dim) 83 | aeq(self.dim, src_dim) 84 | 85 | if self.attn_type in ["general", "dot"]: 86 | if self.attn_type == "general": 87 | h_t_ = h_t.view(tgt_batch*tgt_len, tgt_dim) 88 | h_t_ = self.linear_in(h_t_) 89 | h_t = h_t_.view(tgt_batch, tgt_len, tgt_dim) 90 | h_s_ = h_s.transpose(1, 2) 91 | # (batch, t_len, d) x (batch, d, s_len) --> (batch, t_len, s_len) 92 | return torch.bmm(h_t, h_s_) 93 | else: 94 | dim = self.dim 95 | wq = self.linear_query(h_t.view(-1, dim)) 96 | wq = wq.view(tgt_batch, tgt_len, 1, dim) 97 | wq = wq.expand(tgt_batch, tgt_len, src_len, dim) 98 | 99 | uh = self.linear_context(h_s.contiguous().view(-1, dim)) 100 | uh = uh.view(src_batch, 1, src_len, dim) 101 | uh = uh.expand(src_batch, tgt_len, src_len, dim) 102 | 103 | # (batch, t_len, s_len, d) 104 | wquh = self.tanh(wq + uh) 105 | 106 | return self.v(wquh.view(-1, dim)).view(tgt_batch, tgt_len, src_len) 107 | 108 | def forward(self, input, context, coverage=None): 109 | """ 110 | input (FloatTensor): batch x tgt_len x dim: decoder's rnn's output. 
111 | context (FloatTensor): batch x src_len x dim: src hidden states 112 | coverage (FloatTensor): None (not supported yet) 113 | """ 114 | 115 | # one step input 116 | if input.dim() == 2: 117 | one_step = True 118 | input = input.unsqueeze(1) 119 | else: 120 | one_step = False 121 | 122 | batch, sourceL, dim = context.size() 123 | batch_, targetL, dim_ = input.size() 124 | aeq(batch, batch_) 125 | aeq(dim, dim_) 126 | aeq(self.dim, dim) 127 | if coverage is not None: 128 | batch_, sourceL_ = coverage.size() 129 | aeq(batch, batch_) 130 | aeq(sourceL, sourceL_) 131 | 132 | if self.mask is not None: 133 | beam_, batch_, sourceL_ = self.mask.size() 134 | aeq(batch, batch_*beam_) 135 | aeq(sourceL, sourceL_) 136 | 137 | if coverage is not None: 138 | cover = coverage.view(-1).unsqueeze(1) 139 | context += self.linear_cover(cover).view_as(context) 140 | context = self.tanh(context) 141 | 142 | # compute attention scores, as in Luong et al. 143 | align = self.score(input, context) 144 | 145 | if self.mask is not None: 146 | mask_ = self.mask.view(batch, 1, sourceL) # make it broardcastable 147 | align.data.masked_fill_(mask_, -float('inf')) 148 | 149 | # Softmax to normalize attention weights 150 | align_vectors = self.sm(align.view(batch*targetL, sourceL)) 151 | align_vectors = align_vectors.view(batch, targetL, sourceL) 152 | 153 | # each context vector c_t is the weighted average 154 | # over all the source hidden states 155 | c = torch.bmm(align_vectors, context) 156 | 157 | # concatenate 158 | concat_c = torch.cat([c, input], 2).view(batch*targetL, dim*2) 159 | attn_h = self.linear_out(concat_c).view(batch, targetL, dim) 160 | if self.attn_type in ["general", "dot"]: 161 | attn_h = self.tanh(attn_h) 162 | 163 | if one_step: 164 | attn_h = attn_h.squeeze(1) 165 | align_vectors = align_vectors.squeeze(1) 166 | 167 | # Check output sizes 168 | batch_, dim_ = attn_h.size() 169 | aeq(batch, batch_) 170 | aeq(dim, dim_) 171 | batch_, sourceL_ = align_vectors.size() 172 | aeq(batch, batch_) 173 | aeq(sourceL, sourceL_) 174 | else: 175 | attn_h = attn_h.transpose(0, 1).contiguous() 176 | align_vectors = align_vectors.transpose(0, 1).contiguous() 177 | 178 | # Check output sizes 179 | targetL_, batch_, dim_ = attn_h.size() 180 | aeq(targetL, targetL_) 181 | aeq(batch, batch_) 182 | aeq(dim, dim_) 183 | targetL_, batch_, sourceL_ = align_vectors.size() 184 | aeq(targetL, targetL_) 185 | aeq(batch, batch_) 186 | aeq(sourceL, sourceL_) 187 | 188 | return attn_h, align_vectors 189 | -------------------------------------------------------------------------------- /onmt/modules/ImageEncoder.py: -------------------------------------------------------------------------------- 1 | import torch.nn as nn 2 | import torch.nn.functional as F 3 | import torch 4 | import torch.cuda 5 | from torch.autograd import Variable 6 | 7 | 8 | class ImageEncoder(nn.Module): 9 | """ 10 | Encoder recurrent neural network for Images. 11 | """ 12 | def __init__(self, num_layers, bidirectional, rnn_size, dropout): 13 | """ 14 | Args: 15 | num_layers (int): number of encoder layers. 16 | bidirectional (bool): bidirectional encoder. 17 | rnn_size (int): size of hidden states of the rnn. 18 | dropout (float): dropout probablity. 
19 | """ 20 | super(ImageEncoder, self).__init__() 21 | self.num_layers = num_layers 22 | self.num_directions = 2 if bidirectional else 1 23 | self.hidden_size = rnn_size 24 | 25 | self.layer1 = nn.Conv2d(3, 64, kernel_size=(3, 3), 26 | padding=(1, 1), stride=(1, 1)) 27 | self.layer2 = nn.Conv2d(64, 128, kernel_size=(3, 3), 28 | padding=(1, 1), stride=(1, 1)) 29 | self.layer3 = nn.Conv2d(128, 256, kernel_size=(3, 3), 30 | padding=(1, 1), stride=(1, 1)) 31 | self.layer4 = nn.Conv2d(256, 256, kernel_size=(3, 3), 32 | padding=(1, 1), stride=(1, 1)) 33 | self.layer5 = nn.Conv2d(256, 512, kernel_size=(3, 3), 34 | padding=(1, 1), stride=(1, 1)) 35 | self.layer6 = nn.Conv2d(512, 512, kernel_size=(3, 3), 36 | padding=(1, 1), stride=(1, 1)) 37 | 38 | self.batch_norm1 = nn.BatchNorm2d(256) 39 | self.batch_norm2 = nn.BatchNorm2d(512) 40 | self.batch_norm3 = nn.BatchNorm2d(512) 41 | 42 | input_size = 512 43 | self.rnn = nn.LSTM(input_size, rnn_size, 44 | num_layers=num_layers, 45 | dropout=dropout, 46 | bidirectional=bidirectional) 47 | self.pos_lut = nn.Embedding(1000, input_size) 48 | 49 | def load_pretrained_vectors(self, opt): 50 | # Pass in needed options only when modify function definition. 51 | pass 52 | 53 | def forward(self, input, lengths=None): 54 | batchSize = input.size(0) 55 | # (batch_size, 64, imgH, imgW) 56 | # layer 1 57 | input = F.relu(self.layer1(input[:, :, :, :]-0.5), True) 58 | 59 | # (batch_size, 64, imgH/2, imgW/2) 60 | input = F.max_pool2d(input, kernel_size=(2, 2), stride=(2, 2)) 61 | 62 | # (batch_size, 128, imgH/2, imgW/2) 63 | # layer 2 64 | input = F.relu(self.layer2(input), True) 65 | 66 | # (batch_size, 128, imgH/2/2, imgW/2/2) 67 | input = F.max_pool2d(input, kernel_size=(2, 2), stride=(2, 2)) 68 | 69 | # (batch_size, 256, imgH/2/2, imgW/2/2) 70 | # layer 3 71 | # batch norm 1 72 | input = F.relu(self.batch_norm1(self.layer3(input)), True) 73 | 74 | # (batch_size, 256, imgH/2/2, imgW/2/2) 75 | # layer4 76 | input = F.relu(self.layer4(input), True) 77 | 78 | # (batch_size, 256, imgH/2/2/2, imgW/2/2) 79 | input = F.max_pool2d(input, kernel_size=(1, 2), stride=(1, 2)) 80 | 81 | # (batch_size, 512, imgH/2/2/2, imgW/2/2) 82 | # layer 5 83 | # batch norm 2 84 | input = F.relu(self.batch_norm2(self.layer5(input)), True) 85 | 86 | # (batch_size, 512, imgH/2/2/2, imgW/2/2/2) 87 | input = F.max_pool2d(input, kernel_size=(2, 1), stride=(2, 1)) 88 | 89 | # (batch_size, 512, imgH/2/2/2, imgW/2/2/2) 90 | input = F.relu(self.batch_norm3(self.layer6(input)), True) 91 | 92 | # # (batch_size, 512, H, W) 93 | # # (batch_size, H, W, 512) 94 | all_outputs = [] 95 | for row in range(input.size(2)): 96 | inp = input[:, :, row, :].transpose(0, 2)\ 97 | .transpose(1, 2) 98 | pos_emb = self.pos_lut( 99 | Variable(torch.cuda.LongTensor(batchSize).fill_(row))) 100 | with_pos = torch.cat( 101 | (pos_emb.view(1, pos_emb.size(0), pos_emb.size(1)), inp), 0) 102 | outputs, hidden_t = self.rnn(with_pos) 103 | all_outputs.append(outputs) 104 | out = torch.cat(all_outputs, 0) 105 | 106 | return hidden_t, out 107 | -------------------------------------------------------------------------------- /onmt/modules/MultiHeadedAttn.py: -------------------------------------------------------------------------------- 1 | import math 2 | import torch 3 | import torch.nn as nn 4 | from torch.autograd import Variable 5 | 6 | from onmt.Utils import aeq 7 | from onmt.modules.UtilClass import BottleLinear, \ 8 | BottleLayerNorm, BottleSoftmax 9 | 10 | 11 | class MultiHeadedAttention(nn.Module): 12 | ''' Multi-Head 
Attention module from 13 | "Attention is All You Need". 14 | ''' 15 | def __init__(self, head_count, model_dim, p=0.1): 16 | """ 17 | Args: 18 | head_count(int): number of parallel heads. 19 | model_dim(int): the dimension of keys/values/queries in this 20 | MultiHeadedAttention, must be divisible by head_count. 21 | """ 22 | assert model_dim % head_count == 0 23 | self.dim_per_head = model_dim // head_count 24 | self.model_dim = model_dim 25 | 26 | super(MultiHeadedAttention, self).__init__() 27 | self.head_count = head_count 28 | 29 | self.linear_keys = BottleLinear(model_dim, 30 | head_count * self.dim_per_head, 31 | bias=False) 32 | self.linear_values = BottleLinear(model_dim, 33 | head_count * self.dim_per_head, 34 | bias=False) 35 | self.linear_query = BottleLinear(model_dim, 36 | head_count * self.dim_per_head, 37 | bias=False) 38 | self.sm = BottleSoftmax() 39 | self.activation = nn.ReLU() 40 | self.layer_norm = BottleLayerNorm(model_dim) 41 | self.dropout = nn.Dropout(p) 42 | self.res_dropout = nn.Dropout(p) 43 | 44 | def forward(self, key, value, query, mask=None): 45 | # CHECKS 46 | batch, k_len, d = key.size() 47 | batch_, k_len_, d_ = value.size() 48 | aeq(batch, batch_) 49 | aeq(k_len, k_len_) 50 | aeq(d, d_) 51 | batch_, q_len, d_ = query.size() 52 | aeq(batch, batch_) 53 | aeq(d, d_) 54 | aeq(self.model_dim % 8, 0) 55 | if mask is not None: 56 | batch_, q_len_, k_len_ = mask.size() 57 | aeq(batch_, batch) 58 | aeq(k_len_, k_len) 59 | aeq(q_len_ == q_len) 60 | # END CHECKS 61 | 62 | def shape_projection(x): 63 | b, l, d = x.size() 64 | return x.view(b, l, self.head_count, self.dim_per_head) \ 65 | .transpose(1, 2).contiguous() \ 66 | .view(b * self.head_count, l, self.dim_per_head) 67 | 68 | def unshape_projection(x, q): 69 | b, l, d = q.size() 70 | return x.view(b, self.head_count, l, self.dim_per_head) \ 71 | .transpose(1, 2).contiguous() \ 72 | .view(b, l, self.head_count * self.dim_per_head) 73 | 74 | residual = query 75 | key_up = shape_projection(self.linear_keys(key)) 76 | value_up = shape_projection(self.linear_values(value)) 77 | query_up = shape_projection(self.linear_query(query)) 78 | 79 | scaled = torch.bmm(query_up, key_up.transpose(1, 2)) 80 | scaled = scaled / math.sqrt(self.dim_per_head) 81 | bh, l, dim_per_head = scaled.size() 82 | b = bh // self.head_count 83 | if mask is not None: 84 | 85 | scaled = scaled.view(b, self.head_count, l, dim_per_head) 86 | mask = mask.unsqueeze(1).expand_as(scaled) 87 | scaled = scaled.masked_fill(Variable(mask), -float('inf')) \ 88 | .view(bh, l, dim_per_head) 89 | attn = self.sm(scaled) 90 | # Return one attn 91 | top_attn = attn \ 92 | .view(b, self.head_count, l, dim_per_head)[:, 0, :, :] \ 93 | .contiguous() 94 | 95 | drop_attn = self.dropout(self.sm(scaled)) 96 | 97 | # values : (batch * 8) x qlen x dim 98 | out = unshape_projection(torch.bmm(drop_attn, value_up), residual) 99 | 100 | # Residual and layer norm 101 | res = self.res_dropout(out) + residual 102 | ret = self.layer_norm(res) 103 | 104 | # CHECK 105 | batch_, q_len_, d_ = ret.size() 106 | aeq(q_len, q_len_) 107 | aeq(batch, batch_) 108 | aeq(d, d_) 109 | # END CHECK 110 | return ret, top_attn 111 | -------------------------------------------------------------------------------- /onmt/modules/StackedRNN.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | 4 | 5 | class StackedLSTM(nn.Module): 6 | """ 7 | Our own implementation of stacked LSTM. 
8 | Needed for the decoder, because we do input feeding. 9 | """ 10 | def __init__(self, num_layers, input_size, rnn_size, dropout): 11 | super(StackedLSTM, self).__init__() 12 | self.dropout = nn.Dropout(dropout) 13 | self.num_layers = num_layers 14 | self.layers = nn.ModuleList() 15 | 16 | for i in range(num_layers): 17 | self.layers.append(nn.LSTMCell(input_size, rnn_size)) 18 | input_size = rnn_size 19 | 20 | def forward(self, input, hidden): 21 | h_0, c_0 = hidden 22 | h_1, c_1 = [], [] 23 | for i, layer in enumerate(self.layers): 24 | h_1_i, c_1_i = layer(input, (h_0[i], c_0[i])) 25 | input = h_1_i 26 | if i + 1 != self.num_layers: 27 | input = self.dropout(input) 28 | h_1 += [h_1_i] 29 | c_1 += [c_1_i] 30 | 31 | h_1 = torch.stack(h_1) 32 | c_1 = torch.stack(c_1) 33 | 34 | return input, (h_1, c_1) 35 | 36 | 37 | class StackedGRU(nn.Module): 38 | 39 | def __init__(self, num_layers, input_size, rnn_size, dropout): 40 | super(StackedGRU, self).__init__() 41 | self.dropout = nn.Dropout(dropout) 42 | self.num_layers = num_layers 43 | self.layers = nn.ModuleList() 44 | 45 | for i in range(num_layers): 46 | self.layers.append(nn.GRUCell(input_size, rnn_size)) 47 | input_size = rnn_size 48 | 49 | def forward(self, input, hidden): 50 | h_1 = [] 51 | for i, layer in enumerate(self.layers): 52 | h_1_i = layer(input, hidden[0][i]) 53 | input = h_1_i 54 | if i + 1 != self.num_layers: 55 | input = self.dropout(input) 56 | h_1 += [h_1_i] 57 | 58 | h_1 = torch.stack(h_1) 59 | return input, (h_1,) 60 | -------------------------------------------------------------------------------- /onmt/modules/StructuredAttention.py: -------------------------------------------------------------------------------- 1 | import torch.nn as nn 2 | import torch 3 | import torch.cuda 4 | from torch.autograd import Variable 5 | 6 | 7 | class MatrixTree(nn.Module): 8 | """Implementation of the matrix-tree theorem for computing marginals 9 | of non-projective dependency parsing. This attention layer is used 10 | in the paper "Learning Structured Text Representations." 
11 | """ 12 | def __init__(self, eps=1e-5): 13 | self.eps = eps 14 | super(MatrixTree, self).__init__() 15 | 16 | def forward(self, input): 17 | laplacian = input.exp() + self.eps 18 | output = input.clone() 19 | for b in range(input.size(0)): 20 | lap = laplacian[b].masked_fill( 21 | Variable(torch.eye(input.size(1)).cuda().ne(0)), 0) 22 | lap = -lap + torch.diag(lap.sum(0)) 23 | # store roots on diagonal 24 | lap[0] = input[b].diag().exp() 25 | inv_laplacian = lap.inverse() 26 | 27 | factor = inv_laplacian.diag().unsqueeze(1)\ 28 | .expand_as(input[b]).transpose(0, 1) 29 | term1 = input[b].exp().mul(factor).clone() 30 | term2 = input[b].exp().mul(inv_laplacian.transpose(0, 1)).clone() 31 | term1[:, 0] = 0 32 | term2[0] = 0 33 | output[b] = term1 - term2 34 | roots_output = input[b].diag().exp().mul( 35 | inv_laplacian.transpose(0, 1)[0]) 36 | output[b] = output[b] + torch.diag(roots_output) 37 | return output 38 | 39 | 40 | if __name__ == "__main__": 41 | dtree = MatrixTree() 42 | q = torch.rand(1, 5, 5).cuda() 43 | marg = dtree.forward(Variable(q)) 44 | print(marg.sum(1)) 45 | -------------------------------------------------------------------------------- /onmt/modules/UtilClass.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | 4 | 5 | class Bottle(nn.Module): 6 | def forward(self, input): 7 | if len(input.size()) <= 2: 8 | return super(Bottle, self).forward(input) 9 | size = input.size()[:2] 10 | out = super(Bottle, self).forward(input.view(size[0]*size[1], -1)) 11 | return out.contiguous().view(size[0], size[1], -1) 12 | 13 | 14 | class Bottle2(nn.Module): 15 | def forward(self, input): 16 | if len(input.size()) <= 3: 17 | return super(Bottle2, self).forward(input) 18 | size = input.size() 19 | out = super(Bottle2, self).forward(input.view(size[0]*size[1], 20 | size[2], size[3])) 21 | return out.contiguous().view(size[0], size[1], size[2], size[3]) 22 | 23 | 24 | class LayerNorm(nn.Module): 25 | ''' Layer normalization module ''' 26 | 27 | def __init__(self, d_hid, eps=1e-3): 28 | super(LayerNorm, self).__init__() 29 | 30 | self.eps = eps 31 | self.a_2 = nn.Parameter(torch.ones(d_hid), requires_grad=True) 32 | self.b_2 = nn.Parameter(torch.zeros(d_hid), requires_grad=True) 33 | 34 | def forward(self, z): 35 | if z.size(1) == 1: 36 | return z 37 | mu = torch.mean(z, dim=1) 38 | sigma = torch.std(z, dim=1) 39 | # HACK. PyTorch is changing behavior 40 | if mu.dim() == 1: 41 | mu = mu.unsqueeze(1) 42 | sigma = sigma.unsqueeze(1) 43 | ln_out = (z - mu.expand_as(z)) / (sigma.expand_as(z) + self.eps) 44 | ln_out = ln_out.mul(self.a_2.expand_as(ln_out)) \ 45 | + self.b_2.expand_as(ln_out) 46 | return ln_out 47 | 48 | 49 | class BottleLinear(Bottle, nn.Linear): 50 | pass 51 | 52 | 53 | class BottleLayerNorm(Bottle, LayerNorm): 54 | pass 55 | 56 | 57 | class BottleSoftmax(Bottle, nn.Softmax): 58 | pass 59 | 60 | 61 | class Elementwise(nn.ModuleList): 62 | """ 63 | A simple network container. 64 | Parameters are a list of modules. 65 | Inputs are a 3d Variable whose last dimension is the same length 66 | as the list. 67 | Outputs are the result of applying modules to inputs elementwise. 68 | An optional merge parameter allows the outputs to be reduced to a 69 | single Variable. 
70 | """ 71 | 72 | def __init__(self, merge=None, *args): 73 | assert merge in [None, 'first', 'concat', 'sum', 'mlp'] 74 | self.merge = merge 75 | super(Elementwise, self).__init__(*args) 76 | 77 | def forward(self, input): 78 | inputs = [feat.squeeze(2) for feat in input.split(1, dim=2)] 79 | assert len(self) == len(inputs) 80 | outputs = [f(x) for f, x in zip(self, inputs)] 81 | if self.merge == 'first': 82 | return outputs[0] 83 | elif self.merge == 'concat' or self.merge == 'mlp': 84 | return torch.cat(outputs, 2) 85 | elif self.merge == 'sum': 86 | return sum(outputs) 87 | else: 88 | return outputs 89 | -------------------------------------------------------------------------------- /onmt/modules/WeightNorm.py: -------------------------------------------------------------------------------- 1 | """ 2 | Implementation of "Weight Normalization: A Simple Reparameterization 3 | to Accelerate Training of Deep Neural Networks" 4 | As a reparameterization method, weight normalization is same 5 | as BatchNormalization, but it doesn't depend on minibatch. 6 | """ 7 | import torch 8 | import torch.nn as nn 9 | import torch.nn.functional as F 10 | from torch.nn import Parameter 11 | from torch.autograd import Variable 12 | 13 | 14 | def get_var_maybe_avg(namespace, var_name, training, polyak_decay): 15 | # utility for retrieving polyak averaged params 16 | # Update average 17 | v = getattr(namespace, var_name) 18 | v_avg = getattr(namespace, var_name + '_avg') 19 | v_avg -= (1 - polyak_decay) * (v_avg - v.data) 20 | 21 | if training: 22 | return v 23 | else: 24 | return Variable(v_avg) 25 | 26 | 27 | def get_vars_maybe_avg(namespace, var_names, training, polyak_decay): 28 | # utility for retrieving polyak averaged params 29 | vars = [] 30 | for vn in var_names: 31 | vars.append(get_var_maybe_avg( 32 | namespace, vn, training, polyak_decay)) 33 | return vars 34 | 35 | 36 | class WeightNormLinear(nn.Linear): 37 | def __init__(self, in_features, out_features, 38 | init_scale=1., polyak_decay=0.9995): 39 | super(WeightNormLinear, self).__init__( 40 | in_features, out_features, bias=True) 41 | 42 | self.V = self.weight 43 | self.g = Parameter(torch.Tensor(out_features)) 44 | self.b = self.bias 45 | 46 | self.register_buffer( 47 | 'V_avg', torch.zeros(out_features, in_features)) 48 | self.register_buffer('g_avg', torch.zeros(out_features)) 49 | self.register_buffer('b_avg', torch.zeros(out_features)) 50 | 51 | self.init_scale = init_scale 52 | self.polyak_decay = polyak_decay 53 | self.reset_parameters() 54 | 55 | def reset_parameters(self): 56 | return 57 | 58 | def forward(self, x, init=False): 59 | if init is True: 60 | # out_features * in_features 61 | self.V.data.copy_(torch.randn(self.V.data.size()).type_as( 62 | self.V.data) * 0.05) 63 | # norm is out_features * 1 64 | V_norm = self.V.data / \ 65 | self.V.data.norm(2, 1).expand_as(self.V.data) 66 | # batch_size * out_features 67 | x_init = F.linear(x, Variable(V_norm)).data 68 | # out_features 69 | m_init, v_init = x_init.mean(0).squeeze( 70 | 0), x_init.var(0).squeeze(0) 71 | # out_features 72 | scale_init = self.init_scale / \ 73 | torch.sqrt(v_init + 1e-10) 74 | self.g.data.copy_(scale_init) 75 | self.b.data.copy_(-m_init * scale_init) 76 | x_init = scale_init.view(1, -1).expand_as(x_init) \ 77 | * (x_init - m_init.view(1, -1).expand_as(x_init)) 78 | self.V_avg.copy_(self.V.data) 79 | self.g_avg.copy_(self.g.data) 80 | self.b_avg.copy_(self.b.data) 81 | return Variable(x_init) 82 | else: 83 | V, g, b = get_vars_maybe_avg(self, ['V', 'g', 
'b'], 84 | self.training, 85 | polyak_decay=self.polyak_decay) 86 | # batch_size * out_features 87 | x = F.linear(x, V) 88 | scalar = g / torch.norm(V, 2, 1).squeeze(1) 89 | x = scalar.view(1, -1).expand_as(x) * x + \ 90 | b.view(1, -1).expand_as(x) 91 | return x 92 | 93 | 94 | class WeightNormConv2d(nn.Conv2d): 95 | def __init__(self, in_channels, out_channels, kernel_size, stride=1, 96 | padding=0, dilation=1, groups=1, init_scale=1., 97 | polyak_decay=0.9995): 98 | super(WeightNormConv2d, self).__init__(in_channels, out_channels, 99 | kernel_size, stride, padding, 100 | dilation, groups) 101 | 102 | self.V = self.weight 103 | self.g = Parameter(torch.Tensor(out_channels)) 104 | self.b = self.bias 105 | 106 | self.register_buffer('V_avg', torch.zeros(self.V.size())) 107 | self.register_buffer('g_avg', torch.zeros(out_channels)) 108 | self.register_buffer('b_avg', torch.zeros(out_channels)) 109 | 110 | self.init_scale = init_scale 111 | self.polyak_decay = polyak_decay 112 | self.reset_parameters() 113 | 114 | def reset_parameters(self): 115 | return 116 | 117 | def forward(self, x, init=False): 118 | if init is True: 119 | # out_channels, in_channels // groups, * kernel_size 120 | self.V.data.copy_(torch.randn(self.V.data.size() 121 | ).type_as(self.V.data) * 0.05) 122 | V_norm = self.V.data / self.V.data.view(self.out_channels, -1)\ 123 | .norm(2, 1).view(self.out_channels, *( 124 | [1] * (len(self.kernel_size) + 1))).expand_as(self.V.data) 125 | x_init = F.conv2d(x, Variable(V_norm), None, self.stride, 126 | self.padding, self.dilation, self.groups).data 127 | t_x_init = x_init.transpose(0, 1).contiguous().view( 128 | self.out_channels, -1) 129 | m_init, v_init = t_x_init.mean(1).squeeze( 130 | 1), t_x_init.var(1).squeeze(1) 131 | # out_features 132 | scale_init = self.init_scale / \ 133 | torch.sqrt(v_init + 1e-10) 134 | self.g.data.copy_(scale_init) 135 | self.b.data.copy_(-m_init * scale_init) 136 | scale_init_shape = scale_init.view( 137 | 1, self.out_channels, *([1] * (len(x_init.size()) - 2))) 138 | m_init_shape = m_init.view( 139 | 1, self.out_channels, *([1] * (len(x_init.size()) - 2))) 140 | x_init = scale_init_shape.expand_as( 141 | x_init) * (x_init - m_init_shape.expand_as(x_init)) 142 | self.V_avg.copy_(self.V.data) 143 | self.g_avg.copy_(self.g.data) 144 | self.b_avg.copy_(self.b.data) 145 | return Variable(x_init) 146 | else: 147 | V, g, b = get_vars_maybe_avg( 148 | self, ['V', 'g', 'b'], self.training, 149 | polyak_decay=self.polyak_decay) 150 | 151 | scalar = torch.norm(V.view(self.out_channels, -1), 2, 1) 152 | if len(scalar.size()) == 2: 153 | scalar = g / scalar.squeeze(1) 154 | else: 155 | scalar = g / scalar 156 | 157 | W = scalar.view(self.out_channels, * 158 | ([1] * (len(V.size()) - 1))).expand_as(V) * V 159 | 160 | x = F.conv2d(x, W, b, self.stride, 161 | self.padding, self.dilation, self.groups) 162 | return x 163 | 164 | 165 | class WeightNormConvTranspose2d(nn.ConvTranspose2d): 166 | def __init__(self, in_channels, out_channels, kernel_size, stride=1, 167 | padding=0, output_padding=0, groups=1, init_scale=1., 168 | polyak_decay=0.9995): 169 | super(WeightNormConvTranspose2d, self).__init__( 170 | in_channels, out_channels, 171 | kernel_size, stride, 172 | padding, output_padding, 173 | groups) 174 | # in_channels, out_channels, *kernel_size 175 | self.V = self.weight 176 | self.g = Parameter(torch.Tensor(out_channels)) 177 | self.b = self.bias 178 | 179 | self.register_buffer('V_avg', torch.zeros(self.V.size())) 180 | self.register_buffer('g_avg', 
torch.zeros(out_channels)) 181 | self.register_buffer('b_avg', torch.zeros(out_channels)) 182 | 183 | self.init_scale = init_scale 184 | self.polyak_decay = polyak_decay 185 | self.reset_parameters() 186 | 187 | def reset_parameters(self): 188 | return 189 | 190 | def forward(self, x, init=False): 191 | if init is True: 192 | # in_channels, out_channels, *kernel_size 193 | self.V.data.copy_(torch.randn(self.V.data.size()).type_as( 194 | self.V.data) * 0.05) 195 | V_norm = self.V.data / self.V.data.transpose(0, 1).contiguous() \ 196 | .view(self.out_channels, -1).norm(2, 1).view( 197 | self.in_channels, self.out_channels, 198 | *([1] * len(self.kernel_size))).expand_as(self.V.data) 199 | x_init = F.conv_transpose2d( 200 | x, Variable(V_norm), None, self.stride, 201 | self.padding, self.output_padding, self.groups).data 202 | # self.out_channels, 1 203 | t_x_init = x_init.transpose(0, 1).contiguous().view( 204 | self.out_channels, -1) 205 | # out_features 206 | m_init, v_init = t_x_init.mean(1).squeeze( 207 | 1), t_x_init.var(1).squeeze(1) 208 | # out_features 209 | scale_init = self.init_scale / \ 210 | torch.sqrt(v_init + 1e-10) 211 | self.g.data.copy_(scale_init) 212 | self.b.data.copy_(-m_init * scale_init) 213 | scale_init_shape = scale_init.view( 214 | 1, self.out_channels, *([1] * (len(x_init.size()) - 2))) 215 | m_init_shape = m_init.view( 216 | 1, self.out_channels, *([1] * (len(x_init.size()) - 2))) 217 | 218 | x_init = scale_init_shape.expand_as(x_init)\ 219 | * (x_init - m_init_shape.expand_as(x_init)) 220 | self.V_avg.copy_(self.V.data) 221 | self.g_avg.copy_(self.g.data) 222 | self.b_avg.copy_(self.b.data) 223 | return Variable(x_init) 224 | else: 225 | V, g, b = get_vars_maybe_avg( 226 | self, ['V', 'g', 'b'], self.training, 227 | polyak_decay=self.polyak_decay) 228 | scalar = g / \ 229 | torch.norm(V.transpose(0, 1).contiguous().view( 230 | self.out_channels, -1), 2, 1).squeeze(1) 231 | W = scalar.view(self.in_channels, self.out_channels, 232 | *([1] * (len(V.size()) - 2))).expand_as(V) * V 233 | 234 | x = F.conv_transpose2d(x, W, b, self.stride, 235 | self.padding, self.output_padding, 236 | self.groups) 237 | return x 238 | -------------------------------------------------------------------------------- /onmt/modules/__init__.py: -------------------------------------------------------------------------------- 1 | from onmt.modules.UtilClass import LayerNorm, Bottle, BottleLinear, \ 2 | BottleLayerNorm, BottleSoftmax, Elementwise 3 | from onmt.modules.Gate import ContextGateFactory 4 | from onmt.modules.GlobalAttention import GlobalAttention 5 | from onmt.modules.ConvMultiStepAttention import ConvMultiStepAttention 6 | from onmt.modules.ImageEncoder import ImageEncoder 7 | from onmt.modules.CopyGenerator import CopyGenerator, CopyGeneratorLossCompute 8 | from onmt.modules.StructuredAttention import MatrixTree 9 | from onmt.modules.Transformer import TransformerEncoder, TransformerDecoder 10 | from onmt.modules.Conv2Conv import CNNEncoder, CNNDecoder 11 | from onmt.modules.MultiHeadedAttn import MultiHeadedAttention 12 | from onmt.modules.StackedRNN import StackedLSTM, StackedGRU 13 | from onmt.modules.Embeddings import Embeddings 14 | from onmt.modules.WeightNorm import WeightNormConv2d 15 | 16 | from onmt.modules.SRU import check_sru_requirement 17 | can_use_sru = check_sru_requirement() 18 | if can_use_sru: 19 | from onmt.modules.SRU import SRU 20 | 21 | 22 | # For flake8 compatibility.
23 | __all__ = [GlobalAttention, ImageEncoder, CopyGenerator, MultiHeadedAttention, 24 | LayerNorm, Bottle, BottleLinear, BottleLayerNorm, BottleSoftmax, 25 | TransformerEncoder, TransformerDecoder, Embeddings, Elementwise, 26 | MatrixTree, WeightNormConv2d, ConvMultiStepAttention, 27 | CNNEncoder, CNNDecoder, StackedLSTM, StackedGRU, ContextGateFactory, 28 | CopyGeneratorLossCompute] 29 | 30 | if can_use_sru: 31 | __all__.extend([SRU, check_sru_requirement]) 32 | -------------------------------------------------------------------------------- /onmt/standard_options.py: -------------------------------------------------------------------------------- 1 | from collections import namedtuple 2 | import torch 3 | 4 | USE_CUDA = torch.cuda.is_available() 5 | 6 | 7 | ''' 8 | This file will store the standard options used by openNMT-py in a dictionary 9 | 10 | ''' 11 | 12 | stdOptions = { 13 | 14 | # Model options 15 | 'model_type':'text', # Type of encoder to use. Options are [text|img] 16 | 17 | # Embedding Options 18 | 'word_vec_size':-1, # Word embedding for both. 19 | 'src_word_vec_size':500, # Src word embedding sizes 20 | 'tgt_word_vec_size':500, # Tgt word embedding sizes 21 | 'feat_vec_size':-1, # If specified, feature embedding sizes will be set to this. Otherwise, feat_vec_exponent 22 | # will be used. 23 | 'feat_merge':'concat', # Merge action for the features embeddings 24 | 'feat_vec_exponent':0.7, # When using -feat_merge concat, feature embedding sizes will be set to 25 | # N^feat_vec_exponent where N is the number of values the feature takes. 26 | 'position_encoding':False, #Use a sin to mark relative words positions. 27 | 'share_decoder_embeddings':False, #Share the word and out embeddings for decoder. 28 | 29 | # RNN Options 30 | 'encoder_type':'rnn', # Type of encoder layer to use. Choices: ['rnn', 'brnn', 'mean', 'transformer'] 31 | 'decoder_type':'rnn', # Type of decoder layer to use. Choices: ['rnn', 'transformer'] 32 | 'layers':-1, # Number of layers in enc/dec. 33 | 'enc_layers':2, # Number of layers in the encoder 34 | 'dec_layers':2, # Number of layers in the decoder 35 | 'rnn_size':500, # Size of LSTM hidden states 36 | 'input_feed':1, # Feed the context vector at each time step as additional input (via concatenation 37 | # with the word embeddings) to the decoder. 38 | 'rnn_type':'LSTM', # The gate type to use in the RNNs. Choices: ['LSTM', 'GRU'] 39 | 'brnn':False, # Deprecated, use `encoder_type`. 40 | 'brnn_merge':'concat',# Merge action for the bidir hidden states 41 | 'context_gate':None, # Type of context gate to use. Do not select for no context gate. 42 | # Choices: ['concat', 'sum'] 43 | 44 | # Attention options 45 | 46 | 'global_attention':'general', # The attention type to use: dotprot or general (Luong) or MLP (Bahdanau) 47 | # Choices: ['dot', 'general', 'mlp'] 48 | 49 | # Genenerator and loss options. 50 | 51 | 'copy_attn':False, # Train copy attention layer 52 | 'copy_attn_force':False, # When available, train to copy. 53 | 'coverage_attn':False, # Train a coverage attention layer. 54 | 'lambda_coverage':1, # Lambda value for coverage. 55 | 56 | # Training Options 57 | 58 | 'save_model':'model', # Model filename (the model will be saved as _epochN_PPL.pt where PPL is the 59 | # validation perplexity 60 | 'train_from':'', # If training from a checkpoint then this is the path to the pretrained model's state_dict. 61 | 62 | # GPU options 63 | 64 | 'gpuid':[0], # Use CUDA on the listed devices. 
65 | 'seed':-1, # Random seed used for the experiments reproducibility. 66 | 67 | # Init options 68 | 69 | 'start_epoch':1, # The epoch from which to start 70 | 'param_init':0.1, # Parameters are initialized over uniform distribution with support (-param_init, param_init). 71 | # Use 0 to not use initialization 72 | 73 | # Pretrained word vectors 74 | 75 | 'pre_word_vecs_enc':None, # If a valid path is specified, then this will load pretrained word embeddings 76 | # on the encoder side. See README for specific formatting instructions. 77 | 'pre_word_vecs_dec':None, # If a valid path is specified, then this will load pretrained word embeddings 78 | # on the decoder side. See README for specific formatting instructions. 79 | 80 | # Fixed word vectors 81 | 82 | 'fix_word_vecs_enc':False, # Fix word embeddings on the encoder side 83 | 'fix_word_vecs_dec':False, # Fix word embeddings on the decoder side 84 | 85 | # Optimization options 86 | 87 | 'batch_size':64, 88 | 'max_generator_batches':32, # Maximum batches of words in a sequence to run the generator on in parallel. 89 | # Higher is faster, but uses more memory. 90 | 'epochs':13, # Number of training epochs 91 | 'optim':'sgd', # Optimization method. Choices: ['sgd', 'adagrad', 'adadelta', 'adam'] 92 | 'max_grad_norm':5.0, # If the norm of the gradient vector exceeds this, renormalize it to have the norm 93 | # equal to max_grad_norm 94 | 'dropout':0.3, # Dropout probability; applied in LSTM stacks. 95 | 'truncated_decoder':0, # Truncated bptt 96 | 97 | #Learning rate 98 | 99 | 'learning_rate':1.0, # Starting learning rate. If adagrad/adadelta/adam is used, then this is the global 100 | # learning rate. Recommended settings: sgd = 1, adagrad = 0.1, 101 | 'learning_rate_decay':0.5, # If update_learning_rate, decay learning rate by this much if (i) perplexity does 102 | # not decrease on the validation set or (ii) epoch has gone past start_decay_at 103 | 'start_decay_at':8, # Start decaying every epoch after and including this epoch 104 | 'start_checkpoint_at':0, # Start checkpointing every epoch after and including this epoch 105 | 'decay_method': '', # Use a custom decay rate. Choices: ['noam'] 106 | 'warmup_steps': 4000, # Number of warmup steps for custom decay. 107 | 'report_every': 50, # Print stats at this interval. 108 | 'exp_host':'', # Send logs to this crayon server. 109 | 'exp': '' # Name of the experiment for logging. 110 | 111 | } 112 | 113 | # the standard preprocessing options to use with openNMT-py dataset parser 114 | standardPreProcessingOptions = { 115 | 116 | # Dictionary options 117 | 'src_vocab_size': 50000, # Size of the source vocabulary 118 | 'tgt_vocab_size': 50000, # Size of the target vocabulary 119 | 'src_words_min_frequency': 0, 120 | 'tgt_words_min_frequency': 0, 121 | 122 | # Truncation options 123 | 'src_seq_length': 50, # Maximum source sequence length 124 | 'src_seq_length_trunc': 0, # Truncate source sequence length. 125 | 'tgt_seq_length': 50, # Maximum target sequence length to keep 126 | 'tgt_seq_length_trunc': 0, # Truncate target sequence length. 
127 | 128 | # Data processing options 129 | 130 | 'shuffle': 1, # Shuffle data 131 | 'lower': True, # lowercase data 132 | 133 | # Options most relevant to summarization 134 | 135 | 'dynamic_dict': False, # Create dynamic dictionaries 136 | 'share_vocab': False # Share source and target vocabulary 137 | } 138 | 139 | standardTranslationOptions = { 140 | 141 | 'tgt':None, # True target sequence (optional) 142 | 'output':'pred.txt', # Path to output the predictions (each line will be the decoded sequence) 143 | 'beam_size':5, 144 | 'batch_size':1, 145 | 'max_sent_length':100, 146 | 'replace_unk':True, # Replace the generated UNK tokens with the source token that had highest attention weight. 147 | # If phrase_table is provided, it will lookup the identified source token and 148 | # give the corresponding target token. If it is not provided (or the identified source token 149 | # does not exist in the table) then it will copy the source token 150 | 'verbose':False, # Print scores and predictions for each sentence' 151 | 'attn_debug':False, # Print best attn for each word 152 | 'dump_beam':'', # File to dump beam information to. 153 | 'n_best':1, # If verbose is set, will output the n_best decoded sentences 154 | 'gpu':0, # Device to run on 155 | 'dynamic_dict':False, # Create dynamic dictionaries 156 | 'share_vocab':False, # Share source and target vocabulary 157 | 'cuda':USE_CUDA 158 | } 159 | 160 | 161 | 162 | 163 | -------------------------------------------------------------------------------- /perl_scripts/README.md: -------------------------------------------------------------------------------- 1 | Helpful files taken from the moses project: https://github.com/moses-smt/mosesdecoder 2 | 3 | -------------------------------------------------------------------------------- /perl_scripts/multi-bleu.perl: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env perl 2 | # 3 | # This file is part of moses. Its use is licensed under the GNU Lesser General 4 | # Public License version 2.1 or, at your option, any later version. 
5 | 6 | # $Id$ 7 | use warnings; 8 | use strict; 9 | 10 | my $lowercase = 0; 11 | if ($ARGV[0] eq "-lc") { 12 | $lowercase = 1; 13 | shift; 14 | } 15 | 16 | my $stem = $ARGV[0]; 17 | if (!defined $stem) { 18 | print STDERR "usage: multi-bleu.pl [-lc] reference < hypothesis\n"; 19 | print STDERR "Reads the references from reference or reference0, reference1, ...\n"; 20 | exit(1); 21 | } 22 | 23 | $stem .= ".ref" if !-e $stem && !-e $stem."0" && -e $stem.".ref0"; 24 | 25 | my @REF; 26 | my $ref=0; 27 | while(-e "$stem$ref") { 28 | &add_to_ref("$stem$ref",\@REF); 29 | $ref++; 30 | } 31 | &add_to_ref($stem,\@REF) if -e $stem; 32 | die("ERROR: could not find reference file $stem") unless scalar @REF; 33 | 34 | # add additional references explicitly specified on the command line 35 | shift; 36 | foreach my $stem (@ARGV) { 37 | &add_to_ref($stem,\@REF) if -e $stem; 38 | } 39 | 40 | 41 | 42 | sub add_to_ref { 43 | my ($file,$REF) = @_; 44 | my $s=0; 45 | if ($file =~ /.gz$/) { 46 | open(REF,"gzip -dc $file|") or die "Can't read $file"; 47 | } else { 48 | open(REF,$file) or die "Can't read $file"; 49 | } 50 | while() { 51 | chop; 52 | push @{$$REF[$s++]}, $_; 53 | } 54 | close(REF); 55 | } 56 | 57 | my(@CORRECT,@TOTAL,$length_translation,$length_reference); 58 | my $s=0; 59 | while() { 60 | chop; 61 | $_ = lc if $lowercase; 62 | my @WORD = split; 63 | my %REF_NGRAM = (); 64 | my $length_translation_this_sentence = scalar(@WORD); 65 | my ($closest_diff,$closest_length) = (9999,9999); 66 | foreach my $reference (@{$REF[$s]}) { 67 | # print "$s $_ <=> $reference\n"; 68 | $reference = lc($reference) if $lowercase; 69 | my @WORD = split(' ',$reference); 70 | my $length = scalar(@WORD); 71 | my $diff = abs($length_translation_this_sentence-$length); 72 | if ($diff < $closest_diff) { 73 | $closest_diff = $diff; 74 | $closest_length = $length; 75 | # print STDERR "$s: closest diff ".abs($length_translation_this_sentence-$length)." = abs($length_translation_this_sentence-$length), setting len: $closest_length\n"; 76 | } elsif ($diff == $closest_diff) { 77 | $closest_length = $length if $length < $closest_length; 78 | # from two references with the same closeness to me 79 | # take the *shorter* into account, not the "first" one. 80 | } 81 | for(my $n=1;$n<=4;$n++) { 82 | my %REF_NGRAM_N = (); 83 | for(my $start=0;$start<=$#WORD-($n-1);$start++) { 84 | my $ngram = "$n"; 85 | for(my $w=0;$w<$n;$w++) { 86 | $ngram .= " ".$WORD[$start+$w]; 87 | } 88 | $REF_NGRAM_N{$ngram}++; 89 | } 90 | foreach my $ngram (keys %REF_NGRAM_N) { 91 | if (!defined($REF_NGRAM{$ngram}) || 92 | $REF_NGRAM{$ngram} < $REF_NGRAM_N{$ngram}) { 93 | $REF_NGRAM{$ngram} = $REF_NGRAM_N{$ngram}; 94 | # print "$i: REF_NGRAM{$ngram} = $REF_NGRAM{$ngram}
\n"; 95 | } 96 | } 97 | } 98 | } 99 | $length_translation += $length_translation_this_sentence; 100 | $length_reference += $closest_length; 101 | for(my $n=1;$n<=4;$n++) { 102 | my %T_NGRAM = (); 103 | for(my $start=0;$start<=$#WORD-($n-1);$start++) { 104 | my $ngram = "$n"; 105 | for(my $w=0;$w<$n;$w++) { 106 | $ngram .= " ".$WORD[$start+$w]; 107 | } 108 | $T_NGRAM{$ngram}++; 109 | } 110 | foreach my $ngram (keys %T_NGRAM) { 111 | $ngram =~ /^(\d+) /; 112 | my $n = $1; 113 | # my $corr = 0; 114 | # print "$i e $ngram $T_NGRAM{$ngram}
\n"; 115 | $TOTAL[$n] += $T_NGRAM{$ngram}; 116 | if (defined($REF_NGRAM{$ngram})) { 117 | if ($REF_NGRAM{$ngram} >= $T_NGRAM{$ngram}) { 118 | $CORRECT[$n] += $T_NGRAM{$ngram}; 119 | # $corr = $T_NGRAM{$ngram}; 120 | # print "$i e correct1 $T_NGRAM{$ngram}
\n"; 121 | } 122 | else { 123 | $CORRECT[$n] += $REF_NGRAM{$ngram}; 124 | # $corr = $REF_NGRAM{$ngram}; 125 | # print "$i e correct2 $REF_NGRAM{$ngram}
\n"; 126 | } 127 | } 128 | # $REF_NGRAM{$ngram} = 0 if !defined $REF_NGRAM{$ngram}; 129 | # print STDERR "$ngram: {$s, $REF_NGRAM{$ngram}, $T_NGRAM{$ngram}, $corr}\n" 130 | } 131 | } 132 | $s++; 133 | } 134 | my $brevity_penalty = 1; 135 | my $bleu = 0; 136 | 137 | my @bleu=(); 138 | 139 | for(my $n=1;$n<=4;$n++) { 140 | if (defined ($TOTAL[$n])){ 141 | $bleu[$n]=($TOTAL[$n])?$CORRECT[$n]/$TOTAL[$n]:0; 142 | # print STDERR "CORRECT[$n]:$CORRECT[$n] TOTAL[$n]:$TOTAL[$n]\n"; 143 | }else{ 144 | $bleu[$n]=0; 145 | } 146 | } 147 | 148 | if ($length_reference==0){ 149 | printf "BLEU = 0, 0/0/0/0 (BP=0, ratio=0, hyp_len=0, ref_len=0)\n"; 150 | exit(1); 151 | } 152 | 153 | if ($length_translation<$length_reference) { 154 | $brevity_penalty = exp(1-$length_reference/$length_translation); 155 | } 156 | $bleu = $brevity_penalty * exp((my_log( $bleu[1] ) + 157 | my_log( $bleu[2] ) + 158 | my_log( $bleu[3] ) + 159 | my_log( $bleu[4] ) ) / 4) ; 160 | printf "BLEU = %.2f, %.1f/%.1f/%.1f/%.1f (BP=%.3f, ratio=%.3f, hyp_len=%d, ref_len=%d)\n", 161 | 100*$bleu, 162 | 100*$bleu[1], 163 | 100*$bleu[2], 164 | 100*$bleu[3], 165 | 100*$bleu[4], 166 | $brevity_penalty, 167 | $length_translation / $length_reference, 168 | $length_translation, 169 | $length_reference; 170 | 171 | sub my_log { 172 | return -9999999999 unless $_[0]; 173 | return log($_[0]); 174 | } 175 | -------------------------------------------------------------------------------- /perl_scripts/nonbreaking_prefix.de: -------------------------------------------------------------------------------- 1 | #Anything in this file, followed by a period (and an upper-case word), does NOT indicate an end-of-sentence marker. 2 | #Special cases are included for prefixes that ONLY appear before 0-9 numbers. 3 | 4 | #any single upper case letter followed by a period is not a sentence ender (excluding I occasionally, but we leave it in) 5 | #usually upper case letters are initials in a name 6 | #no german words end in single lower-case letters, so we throw those in too. 7 | A 8 | B 9 | C 10 | D 11 | E 12 | F 13 | G 14 | H 15 | I 16 | J 17 | K 18 | L 19 | M 20 | N 21 | O 22 | P 23 | Q 24 | R 25 | S 26 | T 27 | U 28 | V 29 | W 30 | X 31 | Y 32 | Z 33 | a 34 | b 35 | c 36 | d 37 | e 38 | f 39 | g 40 | h 41 | i 42 | j 43 | k 44 | l 45 | m 46 | n 47 | o 48 | p 49 | q 50 | r 51 | s 52 | t 53 | u 54 | v 55 | w 56 | x 57 | y 58 | z 59 | 60 | 61 | #Roman Numerals. A dot after one of these is not a sentence break in German. 
62 | I 63 | II 64 | III 65 | IV 66 | V 67 | VI 68 | VII 69 | VIII 70 | IX 71 | X 72 | XI 73 | XII 74 | XIII 75 | XIV 76 | XV 77 | XVI 78 | XVII 79 | XVIII 80 | XIX 81 | XX 82 | i 83 | ii 84 | iii 85 | iv 86 | v 87 | vi 88 | vii 89 | viii 90 | ix 91 | x 92 | xi 93 | xii 94 | xiii 95 | xiv 96 | xv 97 | xvi 98 | xvii 99 | xviii 100 | xix 101 | xx 102 | 103 | #Titles and Honorifics 104 | Adj 105 | Adm 106 | Adv 107 | Asst 108 | Bart 109 | Bldg 110 | Brig 111 | Bros 112 | Capt 113 | Cmdr 114 | Col 115 | Comdr 116 | Con 117 | Corp 118 | Cpl 119 | DR 120 | Dr 121 | Ens 122 | Gen 123 | Gov 124 | Hon 125 | Hosp 126 | Insp 127 | Lt 128 | MM 129 | MR 130 | MRS 131 | MS 132 | Maj 133 | Messrs 134 | Mlle 135 | Mme 136 | Mr 137 | Mrs 138 | Ms 139 | Msgr 140 | Op 141 | Ord 142 | Pfc 143 | Ph 144 | Prof 145 | Pvt 146 | Rep 147 | Reps 148 | Res 149 | Rev 150 | Rt 151 | Sen 152 | Sens 153 | Sfc 154 | Sgt 155 | Sr 156 | St 157 | Supt 158 | Surg 159 | 160 | #Misc symbols 161 | Mio 162 | Mrd 163 | bzw 164 | v 165 | vs 166 | usw 167 | d.h 168 | z.B 169 | u.a 170 | etc 171 | Mrd 172 | MwSt 173 | ggf 174 | d.J 175 | D.h 176 | m.E 177 | vgl 178 | I.F 179 | z.T 180 | sogen 181 | ff 182 | u.E 183 | g.U 184 | g.g.A 185 | c.-à-d 186 | Buchst 187 | u.s.w 188 | sog 189 | u.ä 190 | Std 191 | evtl 192 | Zt 193 | Chr 194 | u.U 195 | o.ä 196 | Ltd 197 | b.A 198 | z.Zt 199 | spp 200 | sen 201 | SA 202 | k.o 203 | jun 204 | i.H.v 205 | dgl 206 | dergl 207 | Co 208 | zzt 209 | usf 210 | s.p.a 211 | Dkr 212 | Corp 213 | bzgl 214 | BSE 215 | 216 | #Number indicators 217 | # add #NUMERIC_ONLY# after the word if it should ONLY be non-breaking when a 0-9 digit follows it 218 | No 219 | Nos 220 | Art 221 | Nr 222 | pp 223 | ca 224 | Ca 225 | 226 | #Ordinals are done with . in German - "1." = "1st" in English 227 | 1 228 | 2 229 | 3 230 | 4 231 | 5 232 | 6 233 | 7 234 | 8 235 | 9 236 | 10 237 | 11 238 | 12 239 | 13 240 | 14 241 | 15 242 | 16 243 | 17 244 | 18 245 | 19 246 | 20 247 | 21 248 | 22 249 | 23 250 | 24 251 | 25 252 | 26 253 | 27 254 | 28 255 | 29 256 | 30 257 | 31 258 | 32 259 | 33 260 | 34 261 | 35 262 | 36 263 | 37 264 | 38 265 | 39 266 | 40 267 | 41 268 | 42 269 | 43 270 | 44 271 | 45 272 | 46 273 | 47 274 | 48 275 | 49 276 | 50 277 | 51 278 | 52 279 | 53 280 | 54 281 | 55 282 | 56 283 | 57 284 | 58 285 | 59 286 | 60 287 | 61 288 | 62 289 | 63 290 | 64 291 | 65 292 | 66 293 | 67 294 | 68 295 | 69 296 | 70 297 | 71 298 | 72 299 | 73 300 | 74 301 | 75 302 | 76 303 | 77 304 | 78 305 | 79 306 | 80 307 | 81 308 | 82 309 | 83 310 | 84 311 | 85 312 | 86 313 | 87 314 | 88 315 | 89 316 | 90 317 | 91 318 | 92 319 | 93 320 | 94 321 | 95 322 | 96 323 | 97 324 | 98 325 | 99 326 | -------------------------------------------------------------------------------- /perl_scripts/nonbreaking_prefix.en: -------------------------------------------------------------------------------- 1 | #Anything in this file, followed by a period (and an upper-case word), does NOT indicate an end-of-sentence marker. 2 | #Special cases are included for prefixes that ONLY appear before 0-9 numbers. 3 | 4 | #any single upper case letter followed by a period is not a sentence ender (excluding I occasionally, but we leave it in) 5 | #usually upper case letters are initials in a name 6 | A 7 | B 8 | C 9 | D 10 | E 11 | F 12 | G 13 | H 14 | I 15 | J 16 | K 17 | L 18 | M 19 | N 20 | O 21 | P 22 | Q 23 | R 24 | S 25 | T 26 | U 27 | V 28 | W 29 | X 30 | Y 31 | Z 32 | 33 | #List of titles. 
These are often followed by upper-case names, but do not indicate sentence breaks 34 | Adj 35 | Adm 36 | Adv 37 | Asst 38 | Bart 39 | Bldg 40 | Brig 41 | Bros 42 | Capt 43 | Cmdr 44 | Col 45 | Comdr 46 | Con 47 | Corp 48 | Cpl 49 | DR 50 | Dr 51 | Drs 52 | Ens 53 | Gen 54 | Gov 55 | Hon 56 | Hr 57 | Hosp 58 | Insp 59 | Lt 60 | MM 61 | MR 62 | MRS 63 | MS 64 | Maj 65 | Messrs 66 | Mlle 67 | Mme 68 | Mr 69 | Mrs 70 | Ms 71 | Msgr 72 | Op 73 | Ord 74 | Pfc 75 | Ph 76 | Prof 77 | Pvt 78 | Rep 79 | Reps 80 | Res 81 | Rev 82 | Rt 83 | Sen 84 | Sens 85 | Sfc 86 | Sgt 87 | Sr 88 | St 89 | Supt 90 | Surg 91 | 92 | #misc - odd period-ending items that NEVER indicate breaks (p.m. does NOT fall into this category - it sometimes ends a sentence) 93 | v 94 | vs 95 | i.e 96 | rev 97 | e.g 98 | 99 | #Numbers only. These should only induce breaks when followed by a numeric sequence 100 | # add NUMERIC_ONLY after the word for this function 101 | #This case is mostly for the english "No." which can either be a sentence of its own, or 102 | #if followed by a number, a non-breaking prefix 103 | No #NUMERIC_ONLY# 104 | Nos 105 | Art #NUMERIC_ONLY# 106 | Nr 107 | pp #NUMERIC_ONLY# 108 | 109 | #month abbreviations 110 | Jan 111 | Feb 112 | Mar 113 | Apr 114 | #May is a full word 115 | Jun 116 | Jul 117 | Aug 118 | Sep 119 | Oct 120 | Nov 121 | Dec 122 | -------------------------------------------------------------------------------- /quantization/__init__.py: -------------------------------------------------------------------------------- 1 | import torch 2 | 3 | USE_CUDA = torch.cuda.is_available() 4 | from .quant_functions import uniformQuantization, nonUniformQuantization, ScalingFunction, \ 5 | uniformQuantization_variable, nonUniformQuantization_variable 6 | 7 | __all__ = ('uniformQuantization', 'ScalingFunction', 'nonUniformQuantization', 8 | 'uniformQuantization_variable', 'nonUniformQuantization_variable') -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | torch==0.3.1 2 | torchvision 3 | numpy 4 | scipy 5 | torchtext==0.1.1 6 | -------------------------------------------------------------------------------- /resnet34_doublefilters.py: -------------------------------------------------------------------------------- 1 | import os 2 | import torch 3 | import torchvision 4 | import cnn_models.conv_forward_model as convForwModel 5 | import cnn_models.help_fun as cnn_hf 6 | import datasets 7 | import model_manager 8 | from cnn_models.wide_resnet_imagenet import Wide_ResNet_imagenet 9 | import cnn_models.resnet_kfilters as resnet_kfilters 10 | import functools 11 | import quantization 12 | import helpers.functions as mhf 13 | 14 | cuda_devices = os.environ['CUDA_VISIBLE_DEVICES'].split(',') 15 | print('CUDA_VISIBLE_DEVICES: {} for a total of {} GPUs'.format(cuda_devices, len(cuda_devices))) 16 | 17 | print('Number of bits in training: {}'.format(4)) 18 | 19 | datasets.BASE_DATA_FOLDER = '...' 20 | SAVED_MODELS_FOLDER = '...' 
21 | 22 | USE_CUDA = torch.cuda.is_available() 23 | NUM_GPUS = len(cuda_devices) 24 | 25 | try: 26 | os.mkdir(datasets.BASE_DATA_FOLDER) 27 | except:pass 28 | try: 29 | os.mkdir(SAVED_MODELS_FOLDER) 30 | except:pass 31 | 32 | epochsToTrainImageNet = 90 33 | imageNet12modelsFolder = os.path.join(SAVED_MODELS_FOLDER, 'imagenet12_new') 34 | imagenet_manager = model_manager.ModelManager('model_manager_resnet34double.tst', 35 | 'model_manager', create_new_model_manager=False) 36 | 37 | for x in imagenet_manager.list_models(): 38 | if imagenet_manager.get_num_training_runs(x) >= 1: 39 | s = '{}; Last prediction acc: {}, Best prediction acc: {}'.format(x, 40 | imagenet_manager.load_metadata(x)[1]['predictionAccuracy'][-1], 41 | max(imagenet_manager.load_metadata(x)[1]['predictionAccuracy'])) 42 | print(s) 43 | 44 | try: 45 | os.mkdir(imageNet12modelsFolder) 46 | except:pass 47 | 48 | TRAIN_QUANTIZED_DISTILLED = True 49 | batch_size = 256  # assumed example value; adjust to your setup (must be a multiple of NUM_GPUS) 50 | print('Batch size: {}'.format(batch_size)) 51 | 52 | if batch_size % NUM_GPUS != 0: 53 | raise ValueError('Batch size: {} must be a multiple of the number of gpus:{}'.format(batch_size, NUM_GPUS)) 54 | 55 | 56 | imageNet12 = datasets.ImageNet12('...', 57 | '...', 58 | type_of_data_augmentation='extended', already_scaled=False, 59 | pin_memory=True) 60 | 61 | 62 | train_loader = imageNet12.getTrainLoader(batch_size, shuffle=True) 63 | test_loader = imageNet12.getTestLoader(batch_size, shuffle=False) 64 | 65 | # # Teacher model 66 | resnet34 = torchvision.models.resnet34(True) #already trained 67 | if USE_CUDA: 68 | resnet34 = resnet34.cuda() 69 | if NUM_GPUS > 1: 70 | resnet34 = torch.nn.parallel.DataParallel(resnet34) 71 | 72 | 73 | # Train a ResNet18 with 1.5x filters using quantized distillation 74 | quant_distilled_model_name = 'resnet18_1.5xfilters_quant_distilled4bits' 75 | quantDistilledModelPath = os.path.join(imageNet12modelsFolder, quant_distilled_model_name) 76 | quantDistilledOptions = {} 77 | quant_distilled_model = resnet_kfilters.resnet18(k=1.5) 78 | 79 | if USE_CUDA: 80 | quant_distilled_model = quant_distilled_model.cuda() 81 | if NUM_GPUS > 1: 82 | quant_distilled_model = torch.nn.parallel.DataParallel(quant_distilled_model) 83 | 84 | if not quant_distilled_model_name in imagenet_manager.saved_models: 85 | imagenet_manager.add_new_model(quant_distilled_model_name, quantDistilledModelPath, 86 | arguments_creator_function=quantDistilledOptions) 87 | 88 | if TRAIN_QUANTIZED_DISTILLED: 89 | imagenet_manager.train_model(quant_distilled_model, model_name=quant_distilled_model_name, 90 | train_function=convForwModel.train_model, 91 | arguments_train_function={'epochs_to_train': epochsToTrainImageNet, 92 | 'learning_rate_style': 'imagenet', 93 | 'initial_learning_rate': 0.1, 94 | 'use_nesterov':True, 95 | 'initial_momentum':0.9, 96 | 'weight_decayL2':1e-4, 97 | 'start_epoch': 0, 98 | 'print_every':30, 99 | 'use_distillation_loss':True, 100 | 'teacher_model': resnet34, 101 | 'quantizeWeights':True, 102 | 'numBits':4, 103 | 'bucket_size':256, 104 | 'quantize_first_and_last_layer': False}, 105 | train_loader=train_loader, test_loader=test_loader) 106 | quant_distilled_model.load_state_dict(imagenet_manager.load_model_state_dict(quant_distilled_model_name)) 107 | 108 | # print(cnn_hf.evaluateModel(quant_distilled_model, test_loader, fastEvaluation=False)) 109 | # print(cnn_hf.evaluateModel(quant_distilled_model, test_loader, fastEvaluation=False, k=5)) 110 | # print(cnn_hf.evaluateModel(resnet34, test_loader, fastEvaluation=False)) 111 | # print(cnn_hf.evaluateModel(resnet34,
test_loader, fastEvaluation=False, k=5)) 112 | # quant_fun = functools.partial(quantization.uniformQuantization, s=2**4, bucket_size=256) 113 | # size_mb = mhf.get_size_quantized_model(quant_distilled_model, 4, quant_fun, 256, 114 | # quantizeFirstLastLayer=False) 115 | # print(size_mb) 116 | # print(mhf.getNumberOfParameters(quant_distilled_model)/1000000) 117 | # print(mhf.getNumberOfParameters(resnet34) / 1000000) 118 | -------------------------------------------------------------------------------- /translation_models/__init__.py: -------------------------------------------------------------------------------- 1 | import os 2 | _currDir = os.path.dirname(os.path.abspath(__file__)) 3 | PATH_PERL_SCRIPTS_FOLDER = os.path.abspath(os.path.join(_currDir, '..', 'perl_scripts')) --------------------------------------------------------------------------------
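
The docstring at the top of `onmt/modules/WeightNorm.py` describes weight normalization only in passing. As a quick illustration of the reparameterization those layers implement, here is a small self-contained sketch in plain PyTorch (no repository code involved): each output row keeps a direction `V` and a separate learnable magnitude `g`, and the effective weight is `g * V / ||V||`.

```python
import torch

# Weight normalization reparameterizes a weight matrix row-wise as
#   w = g * V / ||V||
# so the direction (V) and the scale (g) are learned as separate parameters.
V = torch.randn(4, 10)        # direction parameters, one row per output unit
g = torch.ones(4)             # per-output magnitude parameters
row_norms = V.norm(2, 1)      # Euclidean norm of each row of V
W = (g / row_norms).view(-1, 1) * V
print(W.norm(2, 1))           # every row of the effective weight now has norm g (here 1.0)
```

The `WeightNormLinear`, `WeightNormConv2d` and `WeightNormConvTranspose2d` classes above additionally support a data-dependent initialization pass (`forward(x, init=True)`) and keep polyak-averaged copies of `V`, `g` and `b`, which `get_vars_maybe_avg` returns whenever the module is not in training mode.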
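The dictionaries in `onmt/standard_options.py` mirror the command-line flags of openNMT-py. Below is a minimal sketch of one possible way to consume them, assuming the repository root is on the Python path; the overridden values are purely illustrative.

```python
from collections import namedtuple

from onmt.standard_options import stdOptions

# Copy the defaults, override a few fields, and expose them as an
# attribute-style namespace (similar to what argparse would produce).
opts = dict(stdOptions)
opts.update({'epochs': 5, 'rnn_size': 256, 'optim': 'adam', 'learning_rate': 0.001})
Options = namedtuple('Options', sorted(opts.keys()))
opt = Options(**opts)

print(opt.encoder_type, opt.rnn_size, opt.epochs, opt.optim)
```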
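`perl_scripts/multi-bleu.perl` expects to be called as `multi-bleu.perl [-lc] reference < hypothesis`. The following is a hedged sketch of invoking it from Python via the `PATH_PERL_SCRIPTS_FOLDER` constant defined in `translation_models/__init__.py`; it assumes a `perl` interpreter is on the PATH, uses `pred.txt` (the default output file name from `standardTranslationOptions`), and treats `reference.txt` as a placeholder for a tokenized reference file.

```python
import os
import subprocess

from translation_models import PATH_PERL_SCRIPTS_FOLDER

# Score a plain-text hypothesis file against a reference with the bundled
# moses script. The file names below are placeholders.
multi_bleu = os.path.join(PATH_PERL_SCRIPTS_FOLDER, 'multi-bleu.perl')
with open('pred.txt', 'rb') as hypothesis:
    output = subprocess.check_output(['perl', multi_bleu, 'reference.txt'],
                                     stdin=hypothesis)
# Prints a line of the form
# "BLEU = ..., ../../../.. (BP=..., ratio=..., hyp_len=..., ref_len=...)"
print(output.decode())
```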
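The commented-out lines at the end of `resnet34_doublefilters.py` show how the size of a bucketed, uniformly quantized model is estimated. Here is the same computation as a standalone sketch, with torchvision's ResNet-18 standing in for the distilled student; it assumes the helper functions accept any `nn.Module` in the same way.

```python
import functools

import torchvision

import quantization
import helpers.functions as mhf

# Estimate the on-disk size of the student once its weights are uniformly
# quantized to 4 bits (s = 2**4 levels) with a bucket size of 256, skipping
# the first and last layer, and compare raw parameter counts.
student = torchvision.models.resnet18()  # stand-in for the distilled student

quant_fun = functools.partial(quantization.uniformQuantization, s=2**4, bucket_size=256)
size_mb = mhf.get_size_quantized_model(student, 4, quant_fun, 256,
                                       quantizeFirstLastLayer=False)

print('Quantized size (MB):', size_mb)
print('Student parameters (millions):', mhf.getNumberOfParameters(student) / 1e6)
```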