├── .github
│   ├── ISSUE_TEMPLATE.md
│   └── PULL_REQUEST_TEMPLATE.md
├── .gitignore
├── CHANGELOG.md
├── CONTRIBUTING.md
├── HttpTrigger
│   ├── __init__.py
│   ├── function.json
│   ├── host.json
│   └── project
│       ├── .gitignore
│       └── pytorch_train.py
├── LICENSE.md
├── README.md
├── azuredeploy.json
├── host.json
├── images
│   ├── arch_diagram.png
│   ├── data_file_structure.png
│   └── testing_locally.png
├── local.settings.json
└── requirements.txt

/.github/ISSUE_TEMPLATE.md:
--------------------------------------------------------------------------------
> Please provide us with the following information:
> ---------------------------------------------------------------

### This issue is for a: (mark with an `x`)
```
- [ ] bug report -> please search issues before submitting
- [ ] feature request
- [ ] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)
```

### Minimal steps to reproduce
>

### Any log messages given by the failure
>

### Expected/desired behavior
>

### OS and Version?
> Windows 7, 8 or 10. Linux (which distribution). macOS (Yosemite? El Capitan? Sierra?)

### Versions
>

### Mention any other details that might be useful

> ---------------------------------------------------------------
> Thanks! We'll be in touch soon.
--------------------------------------------------------------------------------
/.github/PULL_REQUEST_TEMPLATE.md:
--------------------------------------------------------------------------------
## Purpose

* ...

## Does this introduce a breaking change?

```
[ ] Yes
[ ] No
```

## Pull Request Type
What kind of change does this Pull Request introduce?

```
[ ] Bugfix
[ ] Feature
[ ] Code style update (formatting, local variables)
[ ] Refactoring (no functional changes, no api changes)
[ ] Documentation content changes
[ ] Other... Please describe:
```

## How to Test
* Get the code

```
git clone [repo-address]
cd [repo-name]
git checkout [branch-name]
npm install
```

* Test the code

```
```

## What to Check
Verify that the following are valid
* ...

## Other Information
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
.DS_Store
.vars
bin
obj
csx
.vs
edge
Publish
config.json
data
extensions.*
.vscode

*.user
*.suo
*.cscfg
*.Cache
project.lock.json

/packages
/TestResults

/tools/NuGet.exe
/App_Data
/secrets
/data
.secrets
appsettings.json

node_modules

# Local python packages
.python_packages/

# Python Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
--------------------------------------------------------------------------------
/CHANGELOG.md:
--------------------------------------------------------------------------------
## [project-title] Changelog


# x.y.z (yyyy-mm-dd)

*Features*
* ...

*Bug Fixes*
* ...

*Breaking Changes*
* ...
--------------------------------------------------------------------------------
/CONTRIBUTING.md:
--------------------------------------------------------------------------------
# Contributing to [project-title]

This project welcomes contributions and suggestions. Most contributions require you to agree to a
Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us
the rights to use your contribution. For details, visit https://cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide
a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions
provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).
For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or
contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.

- [Code of Conduct](#coc)
- [Issues and Bugs](#issue)
- [Feature Requests](#feature)
- [Submission Guidelines](#submit)

## Code of Conduct
Help us keep this project open and inclusive. Please read and follow our [Code of Conduct](https://opensource.microsoft.com/codeofconduct/).

## Found an Issue?
If you find a bug in the source code or a mistake in the documentation, you can help us by
[submitting an issue](#submit-issue) to the GitHub Repository. Even better, you can
[submit a Pull Request](#submit-pr) with a fix.

## Want a Feature?
You can *request* a new feature by [submitting an issue](#submit-issue) to the GitHub
Repository. If you would like to *implement* a new feature, please submit an issue with
a proposal for your work first, to be sure that we can use it.

* **Small Features** can be crafted and directly [submitted as a Pull Request](#submit-pr).

## Submission Guidelines

### Submitting an Issue
Before you submit an issue, search the archive; your question may already have been answered.

If your issue appears to be a bug, and hasn't been reported, open a new issue.
Help us to maximize the effort we can spend fixing issues and adding new
features by not reporting duplicate issues. Providing the following information will increase the
chances of your issue being dealt with quickly:

* **Overview of the Issue** - if an error is being thrown, a non-minified stack trace helps
* **Version** - what version is affected (e.g. 0.1.2)
* **Motivation for or Use Case** - explain what you are trying to do and why the current behavior is a bug for you
* **Browsers and Operating System** - is this a problem with all browsers?
* **Reproduce the Error** - provide a live example or an unambiguous set of steps
* **Related Issues** - has a similar issue been reported before?
* **Suggest a Fix** - if you can't fix the bug yourself, perhaps you can point to what might be
  causing the problem (line of code or commit)

You can file new issues by providing the above information at the corresponding repository's issues link: https://github.com/[organization-name]/[repository-name]/issues/new.

### Submitting a Pull Request (PR)
Before you submit your Pull Request (PR) consider the following guidelines:

* Search the repository (https://github.com/[organization-name]/[repository-name]/pulls) for an open or closed PR
  that relates to your submission. You don't want to duplicate effort.

* Make your changes in a new git fork
* Commit your changes using a descriptive commit message
* Push your fork to GitHub
* In GitHub, create a pull request
* If we suggest changes then:
  * Make the required updates.
  * Rebase your fork and force push to your GitHub repository (this will update your Pull Request):

    ```shell
    git rebase master -i
    git push -f
    ```

That's it! Thank you for your contribution!
--------------------------------------------------------------------------------
/HttpTrigger/__init__.py:
--------------------------------------------------------------------------------
import logging
import azure.functions as func
from azureml.core.authentication import ServicePrincipalAuthentication
from azureml.core import Workspace, Experiment, Datastore
from azureml.exceptions import ProjectSystemException
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
from azureml.train.dnn import PyTorch
import shutil
import os
import json
import time


def main(req: func.HttpRequest) -> func.HttpResponse:
    logging.info('Python HTTP trigger function processed a request.')

    # For now this can be a POST where we have /api/HttpTrigger?start=
    image_url = req.params.get('start')
    logging.info('start parameter: %s', image_url)

    # Use service principal secrets to create authentication vehicle and
    # define workspace object
    try:
        svc_pr = ServicePrincipalAuthentication(
            tenant_id=os.getenv('TENANT_ID', ''),
            service_principal_id=os.getenv('APP_ID', ''),
            service_principal_password=os.getenv('PRINCIPAL_PASSWORD', ''))

        ws = Workspace(subscription_id=os.getenv('AZURE_SUB', ''),
                       resource_group=os.getenv('RESOURCE_GROUP', ''),
                       workspace_name=os.getenv('WORKSPACE_NAME', ''),
                       auth=svc_pr)
        print("Found workspace {} at location {} using service principal "
              "authentication".format(ws.name, ws.location))
    # Usually because authentication didn't work
    except ProjectSystemException:
        print('Authentication did not work.')
        return func.HttpResponse(json.dumps('ProjectSystemException'))
    # Need to create the workspace
    except Exception:
        ws = Workspace.create(name=os.getenv('WORKSPACE_NAME', ''),
                              subscription_id=os.getenv('AZURE_SUB', ''),
                              resource_group=os.getenv('RESOURCE_GROUP', ''),
                              create_resource_group=True,
                              location='westus',  # Or other supported Azure region
                              auth=svc_pr)
        print("Created workspace {} at location {}".format(ws.name, ws.location))

    # choose a name for your cluster - under 16 characters
    cluster_name = "gpuforpytorch"

    try:
        compute_target = ComputeTarget(workspace=ws, name=cluster_name)
        print('Found existing compute target.')
    except ComputeTargetException:
        print('Creating a new compute target...')
        # AML Compute config - with max_nodes set, the cluster persists and
        # autoscales between min_nodes and max_nodes
        compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_NC6',
                                                               min_nodes=0,
                                                               max_nodes=2)
        # create the cluster
        compute_target = ComputeTarget.create(ws, cluster_name, compute_config)
        compute_target.wait_for_completion(show_output=True)

    # use get_status() to get a detailed status for the current cluster.
    # print(compute_target.get_status().serialize())

    # Create a project directory and copy the training script to it
    project_folder = os.path.join(os.getcwd(), 'HttpTrigger', 'project')
    # os.makedirs(project_folder, exist_ok=True)
    # shutil.copy(os.path.join(os.getcwd(), 'HttpTrigger', 'pytorch_train.py'), project_folder)

    # Create an experiment
    experiment_name = 'fish-no-fish'
    experiment = Experiment(ws, name=experiment_name)

    # Use an AML Data Store for training data
    ds = Datastore.register_azure_blob_container(workspace=ws,
                                                 datastore_name='funcdefaultdatastore',
                                                 container_name=os.getenv('STORAGE_CONTAINER_NAME_TRAINDATA', ''),
                                                 account_name=os.getenv('STORAGE_ACCOUNT_NAME', ''),
                                                 account_key=os.getenv('STORAGE_ACCOUNT_KEY', ''),
                                                 create_if_not_exists=True)

    # Use an AML Data Store to save models back up to
    ds_models = Datastore.register_azure_blob_container(workspace=ws,
                                                        datastore_name='modelsdatastorage',
                                                        container_name=os.getenv('STORAGE_CONTAINER_NAME_MODELS', ''),
                                                        account_name=os.getenv('STORAGE_ACCOUNT_NAME', ''),
                                                        account_key=os.getenv('STORAGE_ACCOUNT_KEY', ''),
                                                        create_if_not_exists=True)

    # Set up for training ("trans" flag means - use transfer learning and
    # this should download a model on compute)
    # Using /tmp to store model and info due to the fact that
    # creating new folders and files on the Azure Function host
    # will trigger the function to restart.
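    # NB (added comment): ds.as_mount() below mounts the training-data blob
    # container onto the remote AML compute at run time, so '--data_dir'
    # resolves inside that mount; ds_models.as_upload() copies the file the
    # run writes at path_on_compute back to the models blob container.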
    script_params = {
        '--data_dir': ds.as_mount(),
        '--num_epochs': 30,
        '--learning_rate': 0.01,
        '--output_dir': '/tmp/outputs',
        '--trans': 'True'
    }

    # Instantiate PyTorch estimator with upload of final model to
    # a specified blob storage container (this can be any container)
    estimator = PyTorch(source_directory=project_folder,
                        script_params=script_params,
                        compute_target=compute_target,
                        entry_script='pytorch_train.py',
                        use_gpu=True,
                        inputs=[ds_models.as_upload(path_on_compute='./outputs/model_finetuned.pth')])

    run = experiment.submit(estimator)
    print(run.get_details())

    # The following would certainly be blocking, but that's ok for debugging
    # while run.get_status() not in ['Completed', 'Failed']:  # For example purposes only, not exhaustive
    #     print('Run {} not in terminal state'.format(run.id))
    #     time.sleep(10)

    return func.HttpResponse(json.dumps(run.get_status()))
--------------------------------------------------------------------------------
/HttpTrigger/function.json:
--------------------------------------------------------------------------------
{
  "scriptFile": "__init__.py",
  "bindings": [
    {
      "authLevel": "anonymous",
      "type": "httpTrigger",
      "direction": "in",
      "name": "req",
      "methods": [
        "get",
        "post"
      ]
    },
    {
      "type": "http",
      "direction": "out",
      "name": "$return"
    }
  ]
}
--------------------------------------------------------------------------------
/HttpTrigger/host.json:
--------------------------------------------------------------------------------
{
  "version": "2.0"
}
--------------------------------------------------------------------------------
/HttpTrigger/project/.gitignore:
--------------------------------------------------------------------------------
assets
aml_config
.amlignore
--------------------------------------------------------------------------------
/HttpTrigger/project/pytorch_train.py:
--------------------------------------------------------------------------------
"""
Based on: https://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html

In this tutorial, you will learn how to train your network using
transfer learning. You can read more about transfer learning in the `cs231n
notes <https://cs231n.github.io/transfer-learning/>`__

Quoting these notes,

    In practice, very few people train an entire Convolutional Network
    from scratch (with random initialization), because it is relatively
    rare to have a dataset of sufficient size. Instead, it is common to
    pretrain a ConvNet on a very large dataset (e.g. ImageNet, which
    contains 1.2 million images with 1000 categories), and then use the
    ConvNet either as an initialization or a fixed feature extractor for
    the task of interest.

These two major transfer learning scenarios look as follows:

-  **Finetuning the convnet**: Instead of random initialization, we
   initialize the network with a pretrained network, like one that is
   trained on the ImageNet 1000 dataset. The rest of the training looks as
   usual.
-  **ConvNet as fixed feature extractor**: Here, we will freeze the weights
   for all of the network except that of the final fully connected
   layer. This last fully connected layer is replaced with a new one
   with random weights and only this layer is trained.

**Original Author**: Sasank Chilamkurthy
"""

from __future__ import print_function, division
import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim import lr_scheduler
from torchvision import datasets, models, transforms
import numpy as np
import time
import os
import copy
import argparse


from azureml.core.run import Run
from azureml.core import Datastore

# get the Azure ML run object
run = Run.get_context()

def load_data(data_dir):
    """Load the train/val data."""

    # Data augmentation and normalization for training
    # Just normalization for validation
    data_transforms = {
        'train': transforms.Compose([
            transforms.RandomResizedCrop(224),
            transforms.RandomHorizontalFlip(),
            transforms.ToTensor(),
            transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
        ]),
        'val': transforms.Compose([
            transforms.Resize(256),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
            transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
        ]),
    }

    image_datasets = {x: datasets.ImageFolder(os.path.join(data_dir, 'data', x),
                                              data_transforms[x])
                      for x in ['train', 'val']}
    dataloaders = {x: torch.utils.data.DataLoader(image_datasets[x], batch_size=4,
                                                  shuffle=True, num_workers=4)
                   for x in ['train', 'val']}
    dataset_sizes = {x: len(image_datasets[x]) for x in ['train', 'val']}
    class_names = image_datasets['train'].classes

    return dataloaders, dataset_sizes, class_names


def train_model(model, criterion, optimizer, scheduler, num_epochs, data_dir):
    """Train the model."""

    # load training/validation data
    dataloaders, dataset_sizes, class_names = load_data(data_dir)

    device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

    since = time.time()

    best_model_wts = copy.deepcopy(model.state_dict())
    best_acc = 0.0

    for epoch in range(num_epochs):
        print('Epoch {}/{}'.format(epoch, num_epochs - 1))
        print('-' * 10)

        # Each epoch has a training and validation phase
        for phase in ['train', 'val']:
            if phase == 'train':
                scheduler.step()
                model.train()  # Set model to training mode
            else:
                model.eval()  # Set model to evaluate mode

            running_loss = 0.0
            running_corrects = 0

            # Iterate over data.
            for inputs, labels in dataloaders[phase]:
                inputs = inputs.to(device)
                labels = labels.to(device)

                # zero the parameter gradients
                optimizer.zero_grad()

                # forward
                # track history only if in the training phase
                with torch.set_grad_enabled(phase == 'train'):
                    outputs = model(inputs)
                    _, preds = torch.max(outputs, 1)
                    loss = criterion(outputs, labels)

                    # backward + optimize only if in training phase
                    if phase == 'train':
                        loss.backward()
                        optimizer.step()

                # statistics
                running_loss += loss.item() * inputs.size(0)
                running_corrects += torch.sum(preds == labels.data)

            epoch_loss = running_loss / dataset_sizes[phase]
            epoch_acc = running_corrects.double() / dataset_sizes[phase]

            print('{} Loss: {:.4f} Acc: {:.4f}'.format(
                phase, epoch_loss, epoch_acc))

            # deep copy the model
            if phase == 'val' and epoch_acc > best_acc:
                best_acc = epoch_acc
                best_model_wts = copy.deepcopy(model.state_dict())

                # log the best val accuracy to AML run
                run.log('best_val_acc', np.float(best_acc))

        print()

    time_elapsed = time.time() - since
    print('Training complete in {:.0f}m {:.0f}s'.format(
        time_elapsed // 60, time_elapsed % 60))
    print('Best val Acc: {:4f}'.format(best_acc))

    # load best model weights
    model.load_state_dict(best_model_wts)
    return model


def fine_tune_model(num_epochs, data_dir, learning_rate, momentum, transfer_learn):
    """Load a pretrained model and reset the final fully connected layer."""

    # log the hyperparameter metrics to the AML run
    run.log('lr', np.float(learning_rate))
    run.log('momentum', np.float(momentum))

    model_ft = models.resnet18(pretrained=transfer_learn)
    num_ftrs = model_ft.fc.in_features
    model_ft.fc = nn.Linear(num_ftrs, 2)  # only 2 classes to predict

    device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
    model_ft = model_ft.to(device)

    criterion = nn.CrossEntropyLoss()

    # Observe that all parameters are being optimized
    optimizer_ft = optim.SGD(model_ft.parameters(),
                             lr=learning_rate, momentum=momentum)

    # Decay LR by a factor of 0.1 every 7 epochs
    exp_lr_scheduler = lr_scheduler.StepLR(
        optimizer_ft, step_size=7, gamma=0.1)

    model = train_model(model_ft, criterion, optimizer_ft,
                        exp_lr_scheduler, num_epochs, data_dir)

    # Complete the run
    run.complete()

    return model

def main():
    print("PyTorch version:", torch.__version__)

    # get command-line arguments
    parser = argparse.ArgumentParser()
    parser.add_argument('--data_dir', type=str, help='data directory',
                        default='data')
    parser.add_argument('--num_epochs', type=int, default=25,
                        help='number of epochs to train')
    parser.add_argument('--output_dir', type=str, help='output directory',
                        default='models')
    parser.add_argument('--learning_rate', type=float,
                        default=0.001, help='learning rate')
    parser.add_argument('--trans', type=str, default='True',
                        help='Set to True if wishing to use transfer learning')
    parser.add_argument('--momentum', type=float, default=0.9, help='momentum')
    args = parser.parse_args()

    print("data directory is: {}".format(args.data_dir))
    print("using transfer learning: {}".format(args.trans))
    # Note: bool('False') is True, so parse the flag's string value explicitly
    use_transfer = args.trans.lower() == 'true'
    model = fine_tune_model(args.num_epochs, args.data_dir,
                            args.learning_rate, args.momentum, use_transfer)
    os.makedirs(args.output_dir, exist_ok=True)
    torch.save(model, os.path.join(args.output_dir, 'model_finetuned.pth'))

if __name__ == "__main__":
    main()
--------------------------------------------------------------------------------
/LICENSE.md:
--------------------------------------------------------------------------------
MIT License

Copyright (c) Microsoft Corporation. All rights reserved.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
---
page_type: sample
languages:
- python
products:
- azure
- azure-functions
description: "Communicate the process of training a model using a Python-based Azure Function and the Azure ML Python SDK."
urlFragment: training-a-model-with-azureml-and-azure-functions
---

# Training a Model with AzureML and Azure Functions

Automating the training of a new ML model, given code updates and/or new labeled data from a data scientist, can pose a challenge in the DevOps or app-development process because of the manual nature of data science work. One solution is to take the data scientist's training script and, with the Azure Machine Learning SDK for Python (Azure ML Python SDK), run it via a lightweight Azure Function that sends the script to larger compute (a process managed by the DevOps professional) to train the ML model (performed automatically when new data appears). This, in turn, triggers the build and release of an intelligent application (managed by the app developer).

The intent of this repository is to communicate the process of training a model using a Python-based Azure Function and the Azure ML Python SDK, as well as to provide a code sample for doing so. Training a model with the Azure ML Python SDK involves utilizing an Azure compute option (e.g. an N-Series AML Compute) - the model **_is not_** trained within the Azure Function Consumption Plan. The Azure Function could be triggered by an HTTP request, an Event Grid event or some other trigger.
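
For orientation, the condensed, illustrative sketch below shows the core pattern implemented in `HttpTrigger/__init__.py` (the cluster, experiment and environment-variable names follow that file; compute provisioning, datastore registration and error handling are omitted here):

```
import json
import os

import azure.functions as func
from azureml.core import Experiment, Workspace
from azureml.core.authentication import ServicePrincipalAuthentication
from azureml.train.dnn import PyTorch


def main(req: func.HttpRequest) -> func.HttpResponse:
    # Authenticate non-interactively with a service principal
    auth = ServicePrincipalAuthentication(
        tenant_id=os.environ['TENANT_ID'],
        service_principal_id=os.environ['APP_ID'],
        service_principal_password=os.environ['PRINCIPAL_PASSWORD'])
    ws = Workspace(os.environ['AZURE_SUB'], os.environ['RESOURCE_GROUP'],
                   os.environ['WORKSPACE_NAME'], auth=auth)

    # Submit the training script to remote GPU compute; the Function
    # itself never runs the training loop.
    estimator = PyTorch(source_directory='HttpTrigger/project',
                        entry_script='pytorch_train.py',
                        compute_target='gpuforpytorch',
                        use_gpu=True)
    run = Experiment(ws, 'fish-no-fish').submit(estimator)
    return func.HttpResponse(json.dumps(run.get_status()))
```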

The motivation behind this process was to provide a way to automate ML model training/retraining in a lightweight, serverless fashion once new data (and potentially labels) is provided and stored in Azure Storage Blob containers. This solution is especially attractive for people familiar with Azure Functions.

The idea is that once new data is provided, it signals, via an Event Grid, the training of a new model - note that the _training is actually done on a separate compute_, not in the Function. The Azure ML SDK can be set up to output a model to a user-specified Storage option, here shown as the "Trained Models" Blob Storage Container. Another lightweight Function, triggered by Event Grid, may be used to send an HTTP request to an Azure DevOps build Pipeline to create the final application.

The following diagram represents an example process as part of a larger deployment. The end product or application could be an IoT Edge module, web service or any other application a DevOps build/release can produce.

![](images/arch_diagram.png)

## Getting started

The instructions below are an example - they follow [this Azure Docs tutorial](https://docs.microsoft.com/en-us/azure/azure-functions/functions-create-first-function-python), which should be referenced as needed.

The commands are listed here for quick reference (but if errors are encountered, check the docs link above for troubleshooting, as it may have been updated).

## Prerequisites

- Install Python 3.6
- Install [Functions Core Tools](https://docs.microsoft.com/en-us/azure/azure-functions/functions-run-local#v2)
- Install Docker
- Note: if running on Windows, you can use Ubuntu on WSL to run the scripts

## Steps

In summary, the steps include the following sections.

- [Deploy locally](#deploy-locally)
- [Set up a Service Principal for AzureML](#set-up-a-service-principal-for-azureml)
- [Data setup](#data-setup)
- [Test locally](#test-locally)
- [Deploy function to Azure](#deploy-function-to-azure)

### Deploy locally

#### Set up virtual environment

Important notes:

* It is good practice to create a fresh virtual environment for each function
* Make sure the `.env` folder holding the virtual environment, once created, resides in the main folder (the same place as `requirements.txt`)
* Make sure to use the `pip` installer from the virtual environment

**Create the virtual environment**

The `venv` command is part of the Python standard library as of version 3.3. Python 3.6 is being used in this sample.

```
python3.6 -m venv .env
```

**Activate the virtual environment**

On Windows, the command is:

```
.env\Scripts\activate
```

On Unix systems (including macOS), the command is:
```
source .env/bin/activate
```

**Install the required Python packages**

Please check the `requirements.txt` file for the versions of the packages used.

IMPORTANT NOTE: If a more recent version of a package is available, it is OK to update after testing locally and in a staging environment.

On Windows, the command to install packages is as follows.

```
.env\Scripts\pip install -r requirements.txt
```

On Unix systems (including macOS), the command is as follows.

```
.env/bin/pip install -r requirements.txt
```

### Set up a Service Principal for AzureML

This will set up an Azure Active Directory registered application so that we can authenticate to Azure in code, as we will do for the AzureML workspace, without requiring interactive or CLI-based login.

Follow the brief instructions under "Service Principal Authentication" in this doc for setting up the application that will allow authentication more easily.

Take note of the variables mentioned in the doc for the next section.

#### Set up environment variables

This is so that the AzureML workspace, the Service Principal and the correct storage accounts may be accessed.

**For local testing**

On Unix systems, create a shell script `set_vars.sh` that sets the environment variables, and run it with `source set_vars.sh` in the bash terminal so the variables persist in the current shell.

```
export AZURE_SUB=
export RESOURCE_GROUP=
export WORKSPACE_NAME=
export STORAGE_CONTAINER_NAME_TRAINDATA=
export STORAGE_CONTAINER_NAME_MODELS=
export STORAGE_ACCOUNT_NAME=
export STORAGE_ACCOUNT_KEY=
export TENANT_ID=
export APP_ID=
export PRINCIPAL_PASSWORD=
```

On Windows, create a script called `set_vars.cmd` and run it in the shell where the work is being done.

```
set AZURE_SUB=
set RESOURCE_GROUP=
set WORKSPACE_NAME=
set STORAGE_CONTAINER_NAME_TRAINDATA=
set STORAGE_CONTAINER_NAME_MODELS=
set STORAGE_ACCOUNT_NAME=
set STORAGE_ACCOUNT_KEY=
set TENANT_ID=
set APP_ID=
set PRINCIPAL_PASSWORD=
```

Further descriptions of the environment variables are as follows.

1. `AZURE_SUB` - the Azure Subscription ID
2. `RESOURCE_GROUP` - the resource group in which the AzureML Workspace is found
3. `WORKSPACE_NAME` - the AzureML Workspace name (create this if it doesn't exist - [with code](https://docs.microsoft.com/en-us/azure/machine-learning/service/quickstart-create-workspace-with-python) or [in Azure Portal](https://docs.microsoft.com/en-us/azure/machine-learning/service/quickstart-get-started))
4. `STORAGE_CONTAINER_NAME_TRAINDATA` - the Blob Storage container name containing the training data (in this sample it was fish images - see [Data setup](#data-setup) below)
5. `STORAGE_CONTAINER_NAME_MODELS` - the specific Blob Storage container where the output model should go (this could be the same as `STORAGE_CONTAINER_NAME_TRAINDATA`)
6. `STORAGE_ACCOUNT_NAME` - the Storage Account name for the training and model blobs
7. `STORAGE_ACCOUNT_KEY` - the Storage Account access key for the training and model blobs (note: these must be in the same Storage Account)
8. `TENANT_ID` - the AAD tenant ID for the subscription, from the Service Principal step
9. `APP_ID` - the AAD registered application ID from the Service Principal step
10. `PRINCIPAL_PASSWORD` - the AAD registered app password from the Service Principal step

**Note: For when moving on to Azure deployments**

When performing the deployment to Azure, add the variables above as key/value pairs under **Application settings** in the "Application settings" configuration tab of the published Azure Function App in the Azure Portal.

### Data setup

In this example, the labels are `fish`/`not_fish` for a binary classification scenario using the PyTorch framework. The data structure in this repo is shown in the following image. When adding training data, use this structure so that the Python scripts can find the data.

A `train` and `val` folder are both required for training. The folders under `train` and `val` are used by PyTorch's `datasets.ImageFolder()` function, which delineates the labels using folder names, a common pattern for classification.

Notice that the `data` folder is under the Function's `HttpTrigger` folder.

![data file structure](images/data_file_structure.png)

### Test locally

**Start the function**

From the base of the repo run the following.

```
func host start
```

Information similar to the following should appear.

![testing locally url](images/testing_locally.png)

This provides the URL to use for the HTTP call.

**Call the function**

For now this can be a request using `https://<function-url>/api/HttpTrigger?start=<value>`, where `start` is specified as the parameter in the Azure Function `__init__.py` code and the value is any string for this sample (note: this is a potential entrypoint for passing a variable to the Azure Function in the future).

One way to call the Function App is with the `curl` command.

```
curl http://localhost:7071/api/HttpTrigger?start=foo
```

### Deploy function to Azure

Use the following commands to deploy the function to Azure from a local machine that _has this repo cloned locally_.

Here is an example of deploying this sample to the `westus` region. Update the `--location`, `--name` values, `--resource-group`, and `--sku` as needed.

Use the Azure CLI to log in.

```
az login
```

Create a resource group for the Azure Function.

```
az group create --name azfunc --location westus
```

Create a storage account for the Azure Function.

```
az storage account create --name azfuncstorage123 --location westus --resource-group azfunc --sku Standard_LRS
```

Create the Azure Function.

```
az functionapp create --resource-group azfunc --os-type Linux --consumption-plan-location westus --runtime python --name dnnfuncapp --storage-account azfuncstorage123
```

Publish the Azure Function.

```
func azure functionapp publish dnnfuncapp --build-native-deps
```

IMPORTANT NOTE: Don't forget, when performing the deployment to Azure, to add the environment variables (from above) as key/value pairs under **Application settings** in the "Application settings" configuration tab of the published Azure Function App in the Azure Portal.

#### Test deployment

For now this can be a request using `https://<function-app-url>/api/HttpTrigger?start=<value>`, where `start` is specified as the parameter in the Azure Function `__init__.py` code and the value is any string (this is a potential entrypoint for passing a variable to the Azure Function in the future).

One way to call the Function App, for example, is:

```
curl https://dnnfuncapp.azurewebsites.net/api/HttpTrigger?start=foo
```

Or open the same URL in a browser.
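
Alternatively, here is a small illustrative Python call using the `requests` package (already pinned in `requirements.txt`); it assumes the example app name created above:

```
import requests

# Example app name from the deployment steps above; substitute your own
url = "https://dnnfuncapp.azurewebsites.net/api/HttpTrigger"
response = requests.get(url, params={"start": "foo"}, timeout=300)
print(response.status_code, response.text)
```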

This may time out, but don't worry if this happens. For proof of successful execution, check for a completed AzureML Experiment run in the Azure Portal ML Workspace and look for the model in blob storage as well (in the container specified earlier by `STORAGE_CONTAINER_NAME_MODELS`).

## References

1. Queue an Azure DevOps Build with an HTTP request
2. Azure Machine Learning Services Overview
3. Using Azure Storage with the Azure ML Python SDK
4. Python Azure Functions
5. [How to call another function with in an Azure function (StackOverflow)](https://stackoverflow.com/questions/46315734/how-to-call-another-function-with-in-an-azure-function)
6. [Creation of virtual environments](https://docs.python.org/3/library/venv.html)
7. [How to access data in Blob and elsewhere with the AzureML Python SDK](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-access-data)

## Troubleshooting

* Function using the wrong Python: if there are multiple versions of Python on the system, be sure to preface any `func` commands with `PYTHONPATH=.env/bin/python` to ensure the correct Python interpreter is being used.

--------------------------------------------------------------------------------
/azuredeploy.json:
--------------------------------------------------------------------------------
{
  "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
  "contentVersion": "1.0.0.0",
  "parameters": {
    "uniqueResourceNameSuffix": {
      "type": "string",
      "defaultValue": "[uniqueString(subscription().subscriptionId, resourceGroup().id)]"
    },
    "config_web_name": {
      "defaultValue": "web",
      "type": "String"
    },
    "storageName": {
      "defaultValue": "[uniqueString(subscription().subscriptionId, resourceGroup().id)]",
      "type": "String"
    },
    "linuxConsumptionAppName": {
      "defaultValue": "WestUSLinuxDynamicPlan",
      "type": "String"
    }
  },
  "variables": {
    "storageAccountid": "[concat(resourceGroup().id,'/providers/','Microsoft.Storage/storageAccounts/', parameters('storageName'))]",
    "functionapp": "[concat('incv3',parameters('uniqueResourceNameSuffix'))]",
    "siteName1": "[concat(variables('functionapp'),'.azurewebsites.net')]"
  },
  "resources": [
    {
      "name": "[parameters('storageName')]",
      "type": "Microsoft.Storage/storageAccounts",
      "apiVersion": "2017-10-01",
      "sku": {
        "name": "Standard_LRS"
      },
      "kind": "StorageV2",
      "location": "West US",
      "tags": {},
      "properties": {
        "accessTier": "Hot"
      }
    },
    {
      "type": "Microsoft.Web/serverfarms",
      "sku": {
        "name": "Y1",
        "tier": "Dynamic",
        "size": "Y1",
        "family": "Y",
        "capacity": 0
      },
      "kind": "functionapp",
      "name": "[parameters('linuxConsumptionAppName')]",
      "apiVersion": "2016-09-01",
      "location": "West US",
      "properties": {
        "name": "[parameters('linuxConsumptionAppName')]",
        "perSiteScaling": false,
        "reserved": true
      },
      "dependsOn": []
    },
    {
      "type": "Microsoft.Web/sites",
      "kind": "functionapp,linux",
      "name": "[variables('functionapp')]",
      "apiVersion": "2016-08-01",
      "location": "West US",
      "properties": {
        "enabled": true,
        "hostNameSslStates": [
          {
            "name": "[concat(variables('functionapp'),'.azurewebsites.net')]",
            "sslState": "Disabled",
            "hostType": "Standard"
          }
        ],
        "serverFarmId": "[resourceId('Microsoft.Web/serverfarms', parameters('linuxConsumptionAppName'))]",
        "reserved": true,
        "siteConfig": {
          "appSettings": [
            {
              "name": "AzureWebJobsDashboard",
              "value": "[concat('DefaultEndpointsProtocol=https;AccountName=', parameters('storageName'), ';AccountKey=', listKeys(variables('storageAccountid'),'2015-05-01-preview').key1)]"
            },
            {
              "name": "AzureWebJobsStorage",
              "value": "[concat('DefaultEndpointsProtocol=https;AccountName=', parameters('storageName'), ';AccountKey=', listKeys(variables('storageAccountid'),'2015-05-01-preview').key1)]"
            },
            {
              "name": "WEBSITE_CONTENTAZUREFILECONNECTIONSTRING",
              "value": "[concat('DefaultEndpointsProtocol=https;AccountName=', parameters('storageName'), ';AccountKey=', listKeys(variables('storageAccountid'),'2015-05-01-preview').key1)]"
            },
            {
              "name": "WEBSITE_CONTENTSHARE",
              "value": "[variables('functionapp')]"
            },
            {
              "name": "FUNCTIONS_EXTENSION_VERSION",
              "value": "~2"
            },
            {
              "name": "WEBSITE_NODE_DEFAULT_VERSION",
              "value": "8.11.1"
            },
            {
              "name": "FUNCTIONS_WORKER_RUNTIME",
              "value": "python"
            }
          ]
        }
      },
      "dependsOn": [
        "[resourceId('Microsoft.Web/serverfarms', parameters('linuxConsumptionAppName'))]"
      ]
    },
    {
      "type": "Microsoft.Web/sites/config",
      "name": "[concat(variables('functionapp'), '/', parameters('config_web_name'))]",
      "apiVersion": "2016-08-01",
      "location": "West US",
      "properties": {
        "netFrameworkVersion": "v4.0",
        "scmType": "None",
        "use32BitWorkerProcess": true,
        "webSocketsEnabled": false,
        "alwaysOn": false,
        "appCommandLine": "",
        "managedPipelineMode": "Integrated",
        "virtualApplications": [
          {
            "virtualPath": "/",
            "physicalPath": "site\\wwwroot",
            "preloadEnabled": false
          }
        ],
        "customAppPoolIdentityAdminState": false,
        "customAppPoolIdentityTenantState": false,
        "loadBalancing": "LeastRequests",
        "routingRules": [],
        "experiments": {
          "rampUpRules": []
        },
        "autoHealEnabled": false,
        "vnetName": ""
      },
      "dependsOn": [
        "[resourceId('Microsoft.Web/sites', variables('functionapp'))]"
      ]
    },
    {
      "type": "Microsoft.Web/sites/hostNameBindings",
      "name": "[concat(variables('functionapp'), '/', variables('siteName1'))]",
      "apiVersion": "2016-08-01",
      "location": "West US",
      "properties": {
        "siteName": "ldamodeling",
        "hostNameType": "Verified"
      },
      "dependsOn": [
        "[resourceId('Microsoft.Web/sites', variables('functionapp'))]"
      ]
    }
  ]
}
--------------------------------------------------------------------------------
/host.json:
--------------------------------------------------------------------------------
{
  "version": "2.0"
}
--------------------------------------------------------------------------------
/images/arch_diagram.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Azure-Samples/functions-python-azureml-azurefunctions-deeplearning/5de8d3752b92411c6ededb0cb41c3d01f217dc13/images/arch_diagram.png
--------------------------------------------------------------------------------
/images/data_file_structure.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Azure-Samples/functions-python-azureml-azurefunctions-deeplearning/5de8d3752b92411c6ededb0cb41c3d01f217dc13/images/data_file_structure.png
--------------------------------------------------------------------------------
/images/testing_locally.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Azure-Samples/functions-python-azureml-azurefunctions-deeplearning/5de8d3752b92411c6ededb0cb41c3d01f217dc13/images/testing_locally.png
--------------------------------------------------------------------------------
/local.settings.json:
--------------------------------------------------------------------------------
{
  "IsEncrypted": false,
  "Values": {
    "FUNCTIONS_WORKER_RUNTIME": "python",
    "AzureWebJobsStorage": "{AzureWebJobsStorage}"
  }
}
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
azure-functions==1.0.0b4
azure-functions-worker==1.0.0b6
grpcio==1.20.1
grpcio-tools==1.20.1
protobuf==3.6.1
six==1.12.0
azureml-sdk==1.0.39
requests==2.20.1
scikit-image==0.14.2
scikit-learn==0.20.2
--------------------------------------------------------------------------------