├── pytest.ini
├── pyproject.toml
├── src
│   ├── data
│   │   └── .gitkeep
│   ├── rai
│   │   └── .gitkeep
│   ├── deploy
│   │   ├── .gitkeep
│   │   ├── batch
│   │   │   └── score.py
│   │   └── online
│   │       └── score.py
│   ├── features
│   │   └── .gitkeep
│   ├── monitor
│   │   └── .gitkeep
│   └── model
│       ├── register_model.py
│       └── train.py
├── environments
│   ├── Dockerfile
│   ├── code_quality.txt
│   ├── requirements.txt
│   └── conda_train.yml
├── tests
│   ├── unit
│   │   └── .gitkeep
│   └── data_validation
│       └── .gitkeep
├── tox.ini
├── .env.example
├── NOTICE.md
├── .vscode
│   └── settings.json
├── cli
│   ├── endpoints
│   │   ├── batch_endpoint.yml
│   │   ├── online_endpoint.yml
│   │   ├── online_deployment_mlflow.yml
│   │   ├── online_deployment.yml
│   │   ├── batch_deployment_mlflow.yml
│   │   └── batch_deployment.yml
│   ├── assets
│   │   ├── create-data.yml
│   │   ├── create-compute.yml
│   │   ├── register_model.yml
│   │   ├── create-environment.yml
│   │   └── register_model_mlflow.yml
│   └── jobs
│       └── train.yml
├── .devcontainer
│   ├── noop.txt
│   ├── Dockerfile
│   ├── devcontainer.json
│   └── add-notice.sh
├── scripts
│   ├── jobs
│   │   └── train.sh
│   ├── prototyping
│   │   └── run-notebooks.sh
│   ├── endpoints
│   │   ├── deploy-batch-endpoint-custom.sh
│   │   ├── deploy-online-endpoint-custom.sh
│   │   ├── deploy-batch-endpoint-mlflow.sh
│   │   └── deploy-online-endpoint-mlflow.sh
│   ├── configure-workspace.sh
│   ├── assets
│   │   ├── create-data.sh
│   │   ├── create-compute.sh
│   │   ├── create-environment.sh
│   │   └── register-model.sh
│   └── setup.sh
├── data
│   └── samples
│       └── nyc_taxi_sample.json
├── CODE_OF_CONDUCT.md
├── SUPPORT.md
├── .github
│   ├── dependabot.yml
│   └── workflows
│       ├── smoke-testing.yml
│       ├── smoke-testing-azureml.yml
│       ├── smoke-testing-python-script.yml
│       ├── microsoft-security-devops-analysis.yml
│       ├── smoke-testing-notebook.yml
│       ├── code-quality.yml
│       └── codeql-analysis.yml
├── pipelines
│   ├── train.yml
│   ├── eval.yml
│   ├── prep.yml
│   ├── score.yml
│   ├── pipeline.yml
│   ├── train
│   │   └── train.py
│   ├── prep
│   │   └── prep.py
│   ├── score
│   │   └── score.py
│   └── eval
│       └── eval.py
├── CONTRIBUTING.md
├── docs
│   ├── images
│   │   └── azureml-icon.svg
│   ├── quickstart.md
│   └── coding-guidelines.md
├── .pre-commit-config.yaml
├── LICENSE
├── .gitignore
├── SECURITY.md
├── utils
│   └── prepare_data.py
├── notebooks
│   ├── train-experiment.ipynb
│   ├── train-mlflow-local.ipynb
│   └── train-model-debugging.ipynb
└── README.md

/pytest.ini:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/pyproject.toml:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/src/data/.gitkeep:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/src/rai/.gitkeep:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/environments/Dockerfile:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/src/deploy/.gitkeep:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/src/features/.gitkeep:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/src/monitor/.gitkeep:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/tests/unit/.gitkeep:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/tests/data_validation/.gitkeep:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/tox.ini:
--------------------------------------------------------------------------------
[flake8]
max-line-length = 120
--------------------------------------------------------------------------------
/.env.example:
--------------------------------------------------------------------------------
GROUP=""
LOCATION=""
WORKSPACE=""
SUBSCRIPTION=""
--------------------------------------------------------------------------------
/NOTICE.md:
--------------------------------------------------------------------------------
NOTICES
This repository incorporates material as listed below or described in the code.
--------------------------------------------------------------------------------
/environments/code_quality.txt:
--------------------------------------------------------------------------------
black[jupyter]==23.1.0
isort[color]==5.12.0
flake8==6.0.0
mypy==0.991
--------------------------------------------------------------------------------
/environments/requirements.txt:
--------------------------------------------------------------------------------
scikit-learn
pandas
ipykernel
matplotlib

black
flake8
pytest
pre-commit
--------------------------------------------------------------------------------
/.vscode/settings.json:
--------------------------------------------------------------------------------
{
    "python.linting.flake8Enabled": true,
    "python.linting.pylintEnabled": false,
    "python.linting.enabled": true
}
--------------------------------------------------------------------------------
/cli/endpoints/batch_endpoint.yml:
--------------------------------------------------------------------------------
$schema: https://azuremlschemas.azureedge.net/latest/batchEndpoint.schema.json
description: endpoint for batch deployment
auth_mode: aad_token
--------------------------------------------------------------------------------
/cli/endpoints/online_endpoint.yml:
--------------------------------------------------------------------------------
$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineEndpoint.schema.json
description: endpoint for online deployment
auth_mode: key
--------------------------------------------------------------------------------
/.devcontainer/noop.txt:
--------------------------------------------------------------------------------
This file is copied into the container along with environment.yml* from the
parent folder. This is done to prevent the Dockerfile COPY instruction from
failing if no environment.yml is found.
--------------------------------------------------------------------------------
/cli/assets/create-data.yml:
--------------------------------------------------------------------------------
$schema: https://azuremlschemas.azureedge.net/latest/data.schema.json
name: nyc_taxi_dataset
description: nyc_taxi_dataset
type: uri_file
path: ../../data/raw/nyc_taxi_dataset.csv
--------------------------------------------------------------------------------
/scripts/jobs/train.sh:
--------------------------------------------------------------------------------
#!/bin/bash

# move to the directory of this shell script
exec_path=$(readlink -f "$0")
exec_dir=$(dirname "$exec_path")
cd "$exec_dir/../../"

az ml job create -f ./cli/jobs/train.yml --stream
--------------------------------------------------------------------------------
/cli/assets/create-compute.yml:
--------------------------------------------------------------------------------
$schema: https://azuremlschemas.azureedge.net/latest/amlCompute.schema.json
name: cpu-cluster
type: amlcompute
size: STANDARD_DS3_v2
min_instances: 0
max_instances: 4
idle_time_before_scale_down: 300
--------------------------------------------------------------------------------
/cli/endpoints/online_deployment_mlflow.yml:
--------------------------------------------------------------------------------
$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment.schema.json
name: blue-mlflow
model: azureml:nyc_taxi_mlflow@latest
instance_type: Standard_DS4_v2
instance_count: 1
--------------------------------------------------------------------------------
/cli/assets/register_model.yml:
--------------------------------------------------------------------------------
$schema: https://azuremlschemas.azureedge.net/latest/model.schema.json
name: nyc_taxi
type: custom_model
path: azureml://datastores/workspaceblobstore/paths/nyc-taxi/models
description: Registers the model files on the datastore as a custom model.
--------------------------------------------------------------------------------
/cli/assets/create-environment.yml:
--------------------------------------------------------------------------------
$schema: https://azuremlschemas.azureedge.net/latest/environment.schema.json
name: nyc-taxi-env
image: mcr.microsoft.com/azureml/openmpi3.1.2-ubuntu18.04
conda_file: ../../environments/conda_train.yml
description: nyc-taxi-env
--------------------------------------------------------------------------------
/cli/assets/register_model_mlflow.yml:
--------------------------------------------------------------------------------
$schema: https://azuremlschemas.azureedge.net/latest/model.schema.json
name: nyc_taxi_mlflow
type: mlflow_model
path: azureml://datastores/workspaceblobstore/paths/nyc-taxi/models
description: Registers the model files on the datastore as an MLflow model.
--------------------------------------------------------------------------------
/data/samples/nyc_taxi_sample.json:
--------------------------------------------------------------------------------
{
    "data": [
        [
            2,
            1421107273.0,
            1,
            2.99,
            -73.82838439941406,
            40.75553512573242,
            -73.78858184814453,
            40.74454879760742
        ]
    ]
}
--------------------------------------------------------------------------------
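Note: the eight values above line up with the feature columns left after the filtering in
utils/prepare_data.py (vendorID, lpepPickupDatetime as a Unix timestamp, passengerCount,
tripDistance, then the pickup and dropoff longitude/latitude pairs). A minimal client sketch for
posting this payload to the deployed online endpoint — the scoring URI and key below are
placeholders, which you would fetch with `az ml online-endpoint show` and
`az ml online-endpoint get-credentials`:

    import json
    import urllib.request

    scoring_uri = "https://<endpoint-name>.<region>.inference.ml.azure.com/score"  # placeholder
    api_key = "<endpoint-key>"  # placeholder; the endpoint is configured with auth_mode: key

    # reuse the checked-in sample payload as the request body
    with open("data/samples/nyc_taxi_sample.json") as f:
        body = f.read().encode("utf-8")

    req = urllib.request.Request(
        scoring_uri,
        data=body,
        headers={"Content-Type": "application/json", "Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read()))  # a list of predicted totalAmount values
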
/cli/endpoints/online_deployment.yml:
--------------------------------------------------------------------------------
$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment.schema.json
name: blue
model: azureml:nyc_taxi:1
code_configuration:
  code: ../../src/deploy/online/
  scoring_script: score.py
environment: azureml:nyc-taxi-env@latest
instance_type: Standard_DS4_v2
instance_count: 1
--------------------------------------------------------------------------------
/scripts/prototyping/run-notebooks.sh:
--------------------------------------------------------------------------------
#!/bin/bash
source /anaconda/etc/profile.d/conda.sh
conda activate mlops-train

# move to the notebooks directory relative to this shell script
exec_path=$(readlink -f "$0")
exec_dir=$(dirname "$exec_path")
cd "$exec_dir/../../notebooks"

papermill train-experiment.ipynb out.ipynb -k mlops-train
papermill train-mlflow-local.ipynb out.ipynb -k mlops-train
papermill train-model-debugging.ipynb out.ipynb -k mlops-train
--------------------------------------------------------------------------------
/scripts/endpoints/deploy-batch-endpoint-custom.sh:
--------------------------------------------------------------------------------
#!/bin/bash

# move to the directory of this shell script
exec_path=$(readlink -f "$0")
exec_dir=$(dirname "$exec_path")
cd "$exec_dir/../../"

export ENDPOINT_NAME=batch-endpoint-custom-$RANDOM

az ml batch-endpoint create --name $ENDPOINT_NAME -f ./cli/endpoints/batch_endpoint.yml

az ml batch-deployment create --endpoint-name $ENDPOINT_NAME -f ./cli/endpoints/batch_deployment.yml
--------------------------------------------------------------------------------
/scripts/endpoints/deploy-online-endpoint-custom.sh:
--------------------------------------------------------------------------------
#!/bin/bash

# move to the directory of this shell script
exec_path=$(readlink -f "$0")
exec_dir=$(dirname "$exec_path")
cd "$exec_dir/../../"

export ENDPOINT_NAME=online-endpoint-custom-$RANDOM

az ml online-endpoint create --name $ENDPOINT_NAME -f ./cli/endpoints/online_endpoint.yml

az ml online-deployment create --endpoint-name $ENDPOINT_NAME -f ./cli/endpoints/online_deployment.yml
--------------------------------------------------------------------------------
/scripts/endpoints/deploy-batch-endpoint-mlflow.sh:
--------------------------------------------------------------------------------
#!/bin/bash

# move to the directory of this shell script
exec_path=$(readlink -f "$0")
exec_dir=$(dirname "$exec_path")
cd "$exec_dir/../../"

export ENDPOINT_NAME=batch-endpoint-mlflow-$RANDOM

az ml batch-endpoint create --name $ENDPOINT_NAME -f ./cli/endpoints/batch_endpoint.yml

az ml batch-deployment create --endpoint-name $ENDPOINT_NAME -f ./cli/endpoints/batch_deployment_mlflow.yml
--------------------------------------------------------------------------------
/CODE_OF_CONDUCT.md:
--------------------------------------------------------------------------------
# Microsoft Open Source Code of Conduct

This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).

Resources:

- [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/)
- [Microsoft Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/)
- Contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with questions or concerns
--------------------------------------------------------------------------------
/scripts/endpoints/deploy-online-endpoint-mlflow.sh:
--------------------------------------------------------------------------------
#!/bin/bash

# move to the directory of this shell script
exec_path=$(readlink -f "$0")
exec_dir=$(dirname "$exec_path")
cd "$exec_dir/../../"

export ENDPOINT_NAME=online-endpoint-mlflow-$RANDOM

az ml online-endpoint create --name $ENDPOINT_NAME -f ./cli/endpoints/online_endpoint.yml

az ml online-deployment create --endpoint-name $ENDPOINT_NAME -f ./cli/endpoints/online_deployment_mlflow.yml
--------------------------------------------------------------------------------
/cli/endpoints/batch_deployment_mlflow.yml:
--------------------------------------------------------------------------------
$schema: https://azuremlschemas.azureedge.net/latest/batchDeployment.schema.json
name: batch-deployment-mlflow
model: azureml:nyc_taxi_mlflow@latest
compute: azureml:cpu-cluster
resources:
  instance_count: 2
max_concurrency_per_instance: 2
mini_batch_size: 10
output_action: append_row
output_file_name: predictions.csv
retry_settings:
  max_retries: 3
  timeout: 30
error_threshold: -1
logging_level: info
--------------------------------------------------------------------------------
/SUPPORT.md:
--------------------------------------------------------------------------------
# Support

## How to file issues and get help

This project uses GitHub Issues to track bugs and feature requests. Please search the existing
issues before filing new issues to avoid duplicates. For new issues, file your bug or
feature request as a new Issue.

For help and questions about using this project, please contact our team via GitHub Issues.

## Microsoft Support Policy

Support for this project is limited to the resources listed above.
--------------------------------------------------------------------------------
/environments/conda_train.yml:
--------------------------------------------------------------------------------
name: mlops-train
channels:
  - conda-forge
dependencies:
  - python=3.8.5
  - pip=22.3.1
  - pip:
      - azureml-mlflow==1.45.0
      - mlflow==1.30.0
      - scikit-learn==1.0.2
      - pandas==1.1.5
      - joblib==1.0.0
      - matplotlib==3.3.3
      - azureml-defaults==1.47.0
      - black[jupyter]==22.8.0
      - pre-commit==2.20.0
      - papermill==2.4.0
      - ipykernel==6.6.0
      - raiwidgets==0.23.0
      - numpy<1.24.0
--------------------------------------------------------------------------------
/.github/dependabot.yml:
--------------------------------------------------------------------------------
# To get started with Dependabot version updates, you'll need to specify which
# package ecosystems to update and where the package manifests are located.
# Please see the documentation for all configuration options:
# https://docs.github.com/github/administering-a-repository/configuration-options-for-dependency-updates

version: 2
updates:
  - package-ecosystem: "pip" # See the documentation for other possible values
    directory: "/" # Location of package manifests
    schedule:
      interval: "weekly"
--------------------------------------------------------------------------------
/scripts/configure-workspace.sh:
--------------------------------------------------------------------------------
#!/bin/bash

# move to the directory of this shell script
exec_path=$(readlink -f "$0")
exec_dir=$(dirname "$exec_path")
cd "$exec_dir/../"

# Load environment variables
source .env

# Set the shared subscription
az account set --subscription $SUBSCRIPTION
# Configure the default workspace for az ml
az configure --defaults group=$GROUP workspace=$WORKSPACE location=$LOCATION

az configure -l -o table

echo "Note: the Azure Machine Learning workspace $WORKSPACE in resource group $GROUP (region $LOCATION) is now set as the default resource"
--------------------------------------------------------------------------------
/pipelines/train.yml:
--------------------------------------------------------------------------------
$schema: https://azuremlschemas.azureedge.net/latest/commandComponent.schema.json
name: train_linear_regression_model

display_name: TrainLinearRegressionModel

version: 1

type: command

inputs:
  training_data:
    type: uri_folder

outputs:
  model_output:
    type: mlflow_model

code: ./train

environment: azureml:nyc-taxi-env@latest

command: >-
  python train.py
  --training_data ${{inputs.training_data}}
  --model_output ${{outputs.model_output}}
--------------------------------------------------------------------------------
/.github/workflows/smoke-testing.yml:
--------------------------------------------------------------------------------
name: smoke-testing
on:
  push:
    branches:
      - main
  pull_request:
    branches:
      - main
  schedule:
    - cron: "0 0 * * *"
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v3
      - name: Setup python
        uses: actions/setup-python@v4.2.0
        with:
          python-version: "3.9"
      - name: pip install
        run: pip install black[jupyter]==22.8.0
      - name: Check code format
        run: black --check .
--------------------------------------------------------------------------------
/scripts/assets/create-data.sh:
--------------------------------------------------------------------------------
#!/bin/bash

# move to the directory of this shell script
exec_path=$(readlink -f "$0")
exec_dir=$(dirname "$exec_path")
cd "$exec_dir/../../"

# Get the dataset name from the yaml file
dataset_name=$(grep name ./cli/assets/create-data.yml | awk '{print $2}')

# Check if the dataset exists
dataset_exists=$(az ml data list --query "[?name=='$dataset_name']" | jq 'length')

# If the dataset exists, do not create it again
if [ $dataset_exists -gt 0 ]; then
    echo "Dataset already exists"
else
    # Create a new dataset
    az ml data create -f ./cli/assets/create-data.yml
fi
--------------------------------------------------------------------------------
/cli/jobs/train.yml:
--------------------------------------------------------------------------------
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
experiment_name: train_nyc_taxi
description: train_nyc_taxi
type: command
code: ../../src/model
command: >-
  python train.py
  --input_data ${{inputs.nyc_taxi_data}}
  --output_dir ${{outputs.output_dir}}
environment: azureml:nyc-taxi-env@latest
inputs:
  nyc_taxi_data:
    type: uri_file
    path: azureml:nyc_taxi_dataset@latest
outputs:
  output_dir:
    type: uri_folder
    path: azureml://datastores/workspaceblobstore/paths/nyc-taxi/
    mode: mount
compute: azureml:cpu-cluster
--------------------------------------------------------------------------------
/cli/endpoints/batch_deployment.yml:
--------------------------------------------------------------------------------
$schema: https://azuremlschemas.azureedge.net/latest/batchDeployment.schema.json
name: batch-deployment
description: custom batch deployment
model: azureml:nyc_taxi@latest
code_configuration:
  code: ../../src/deploy/batch/
  scoring_script: score.py
environment: azureml:nyc-taxi-env@latest
compute: azureml:cpu-cluster
resources:
  instance_count: 1
max_concurrency_per_instance: 2
mini_batch_size: 10
output_action: append_row
output_file_name: predictions.csv
retry_settings:
  max_retries: 3
  timeout: 30
error_threshold: -1
logging_level: info
--------------------------------------------------------------------------------
/scripts/assets/create-compute.sh:
--------------------------------------------------------------------------------
#!/bin/bash

# move to the directory of this shell script
exec_path=$(readlink -f "$0")
exec_dir=$(dirname "$exec_path")
cd "$exec_dir/../../"

# Get the cluster name from the yaml file
cluster_name=$(grep name ./cli/assets/create-compute.yml | awk '{print $2}')

# Check if the cluster exists
cluster_exists=$(az ml compute list --query "[?name=='$cluster_name']" | jq 'length')

# If the cluster exists, do not create it again
if [ $cluster_exists -gt 0 ]; then
    echo "Cluster already exists"
else
    # Create a new cluster
    az ml compute create -f ./cli/assets/create-compute.yml
fi
--------------------------------------------------------------------------------
/scripts/assets/create-environment.sh:
--------------------------------------------------------------------------------
#!/bin/bash

# move to the directory of this shell script
exec_path=$(readlink -f "$0")
exec_dir=$(dirname "$exec_path")
cd "$exec_dir/../../"

# Get the environment name from the yaml file
env_name=$(grep name ./cli/assets/create-environment.yml | awk '{print $2}')

# Check if the environment exists
env_exists=$(az ml environment list --query "[?name=='$env_name']" | jq 'length')

# If the environment exists, do not create it again
if [ $env_exists -gt 0 ]; then
    echo "Environment already exists"
else
    # Create a new environment
    az ml environment create -f ./cli/assets/create-environment.yml
fi
--------------------------------------------------------------------------------
/pipelines/eval.yml:
--------------------------------------------------------------------------------
$schema: https://azuremlschemas.azureedge.net/latest/commandComponent.schema.json
name: evaluate_linear_regression_model

display_name: EvaluateLinearRegressionModel

version: 1

type: command

inputs:
  predicted_data:
    type: uri_folder
  label_data:
    type: uri_folder

outputs:
  model_performance_report:
    type: uri_folder

code: ./eval

environment: azureml:nyc-taxi-env@latest

command: >-
  python eval.py
  --predicted_data ${{inputs.predicted_data}}
  --label_data ${{inputs.label_data}}
  --model_performance_report ${{outputs.model_performance_report}}
--------------------------------------------------------------------------------
/pipelines/prep.yml:
--------------------------------------------------------------------------------
$schema: https://azuremlschemas.azureedge.net/latest/commandComponent.schema.json
name: prep_taxi_data

display_name: PrepTaxiData

version: 1

type: command

inputs:
  nyc_taxi_data:
    type: uri_file
  test_split_ratio:
    type: number
    min: 0
    max: 1
    default: 0.2

outputs:
  training_data:
    type: uri_folder
  testing_data:
    type: uri_folder

code: ./prep

environment: azureml:nyc-taxi-env@latest

command: >-
  python prep.py
  --input_data ${{inputs.nyc_taxi_data}}
  --test_split_ratio ${{inputs.test_split_ratio}}
  --training_data ${{outputs.training_data}}
  --testing_data ${{outputs.testing_data}}
--------------------------------------------------------------------------------
/pipelines/score.yml:
--------------------------------------------------------------------------------
$schema: https://azuremlschemas.azureedge.net/latest/commandComponent.schema.json
name: score_linear_regression_model

display_name: ScoreLinearRegressionModel

version: 1

type: command

inputs:
  model_input:
    type: mlflow_model
  testing_data:
    type: uri_folder

outputs:
  predicted_data:
    type: uri_folder
  label_data:
    type: uri_folder

code: ./score

environment: azureml:nyc-taxi-env@latest

command: >-
  python score.py
  --testing_data ${{inputs.testing_data}}
  --model_input ${{inputs.model_input}}
  --predicted_data ${{outputs.predicted_data}}
  --label_data ${{outputs.label_data}}
--------------------------------------------------------------------------------
/scripts/assets/register-model.sh:
--------------------------------------------------------------------------------
#!/bin/bash

# move to the directory of this shell script
exec_path=$(readlink -f "$0")
exec_dir=$(dirname "$exec_path")
cd "$exec_dir/../../"

# get the run id of the latest completed job
run_id=$(az ml job list --query "reverse(sort_by([?status=='Completed'].{experiment_name:experiment_name, run_id:name, status:status, date:creation_context.created_at}, &date))[0].run_id" | sed 's/"//g')
echo $run_id


# register the model that was trained in the latest job
az ml model create -f ./cli/assets/register_model.yml --path azureml://datastores/workspaceblobstore/paths/nyc-taxi/$run_id/models
az ml model create -f ./cli/assets/register_model_mlflow.yml --path azureml://datastores/workspaceblobstore/paths/nyc-taxi/$run_id/models
--------------------------------------------------------------------------------
/.devcontainer/Dockerfile:
--------------------------------------------------------------------------------
FROM mcr.microsoft.com/vscode/devcontainers/miniconda:0-3
SHELL ["/bin/bash", "-c"]

# Create the conda environment
COPY ./environments/conda_train.yml /environments/
RUN conda env create -n mlops-train --file /environments/conda_train.yml && \
    conda init bash

# Register the Jupyter kernel
RUN source ~/.bashrc && conda activate mlops-train && \
    ipython kernel install --name=mlops-train --display-name=mlops-train

# Set up pre-commit
COPY .pre-commit-config.yaml .
RUN source ~/.bashrc && conda activate mlops-train && \
    git init . && \
    pre-commit install-hooks

# Switch to the vscode user
USER vscode

# Install the Azure CLI and the ml extension
RUN curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash - && az extension add --name ml

# Configure conda
RUN conda init bash
--------------------------------------------------------------------------------
/.devcontainer/devcontainer.json:
--------------------------------------------------------------------------------
{
    "name": "Miniconda (Python 3)",
    "build": {
        "context": "..",
        "dockerfile": "Dockerfile"
    },

    "customizations": {
        "vscode": {
            "settings": {
                "terminal.integrated.profiles.linux": {
                    "bash": {
                        "path": "/bin/bash"
                    }
                },
                "[python]": {
                    "editor.defaultFormatter": "ms-python.black-formatter",
                    "editor.formatOnSave": true,
                    "editor.codeActionsOnSave": {
                        "source.organizeImports": true
                    }
                },
                "isort.args": [
                    "--profile", "black"
                ]
            },
            "extensions": [
                "ms-python.python",
                "ms-python.vscode-pylance",
                "ms-toolsai.vscode-ai",
                "ms-python.black-formatter",
                "ms-python.isort"
            ]
        }
    }
}
--------------------------------------------------------------------------------
/src/deploy/batch/score.py:
--------------------------------------------------------------------------------
import os

import numpy as np
from mlflow.pyfunc import load_model


# Called when the service is loaded
def init():
    global model

    model_path = os.path.join(
        os.environ["AZUREML_MODEL_DIR"],
        "nyc_taxi",
    )
    model = load_model(model_path)


def run(mini_batch):
    print(f"run method start: {__file__}, run({mini_batch})")

    results = []
    for input_file in mini_batch:
        # each element of the mini-batch is a path to an input file
        input_np = np.loadtxt(input_file, delimiter=",", skiprows=1)
        predictions = model.predict(input_np)
        log_txt = "Predictions:" + str(predictions)
        print(log_txt)
        # accumulate results across all files in the mini-batch
        results.extend([row, pred] for row, pred in enumerate(predictions))

    return results  # return a dataframe or a list
--------------------------------------------------------------------------------
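Note: because run() receives a mini-batch as a plain list of file paths, the batch scoring logic can
be smoke-tested locally without deploying anything. A rough harness under stated assumptions — the
model folder and the input CSV below are hypothetical paths (an MLflow model saved under
./outputs/nyc_taxi and a headered CSV of feature rows):

    import os

    import score  # src/deploy/batch/score.py

    # hypothetical: the folder must contain ./outputs/nyc_taxi (an MLflow model)
    os.environ["AZUREML_MODEL_DIR"] = "./outputs"
    score.init()
    # hypothetical headered CSV of feature rows; prints [row, prediction] pairs
    print(score.run(["./data/samples/batch_input.csv"]))
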
/CONTRIBUTING.md:
--------------------------------------------------------------------------------
## Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/). For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.
--------------------------------------------------------------------------------
/.devcontainer/add-notice.sh:
--------------------------------------------------------------------------------
# Display a notice when not running in GitHub Codespaces

cat << 'EOF' > /usr/local/etc/vscode-dev-containers/conda-notice.txt
When using "conda" from outside of GitHub Codespaces, note the Anaconda repository
contains restrictions on commercial use that may impact certain organizations. See
https://aka.ms/vscode-remote/conda/miniconda

EOF

notice_script="$(cat << 'EOF'
if [ -t 1 ] && [ "${IGNORE_NOTICE}" != "true" ] && [ "${TERM_PROGRAM}" = "vscode" ] && [ "${CODESPACES}" != "true" ] && [ ! -f "$HOME/.config/vscode-dev-containers/conda-notice-already-displayed" ]; then
    cat "/usr/local/etc/vscode-dev-containers/conda-notice.txt"
    mkdir -p "$HOME/.config/vscode-dev-containers"
    ((sleep 10s; touch "$HOME/.config/vscode-dev-containers/conda-notice-already-displayed") &)
fi
EOF
)"

echo "${notice_script}" | tee -a /etc/bash.bashrc >> /etc/zsh/zshrc
--------------------------------------------------------------------------------
/.github/workflows/smoke-testing-azureml.yml:
--------------------------------------------------------------------------------
name: Smoke Testing for Azure ML
on:
  push:
    branches:
      - main
  pull_request:
    branches:
      - main
  schedule:
    - cron: "0 0 * * *"
  workflow_dispatch:


jobs:
  smoke-testing-azureml-training:
    name: Smoke Testing for Azure ML (Training)
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v3
      - name: Install az ml extension
        run: az extension add -n ml -y
      - name: Azure login
        uses: azure/login@v1
        with:
          creds: ${{secrets.AZURE_CREDENTIALS}}
      - name: Configure default azureml workspace
        run: |
          az configure --defaults group=${{secrets.GROUP}} workspace=${{secrets.WORKSPACE}} location=${{secrets.LOCATION}}
      - name: Job for model training
        run: |
          az ml job create -f train.yml --stream
        working-directory: cli/jobs
--------------------------------------------------------------------------------
/docs/images/azureml-icon.svg:
--------------------------------------------------------------------------------
(binary SVG asset; only its title, "Icon-166Artboard 1", is recoverable from this dump)
--------------------------------------------------------------------------------
/.pre-commit-config.yaml:
--------------------------------------------------------------------------------
# See https://pre-commit.com for more information
# See https://pre-commit.com/hooks.html for more hooks
repos:
  # Generated from the sample config (pre-commit sample-config > .pre-commit-config.yaml)
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.4.0
    hooks:
      - id: trailing-whitespace
      - id: no-commit-to-branch
        args: [--branch, main]
      - id: end-of-file-fixer
      - id: check-yaml
      - id: check-added-large-files

  # Python auto-formatter
  - repo: https://github.com/ambv/black
    rev: 23.1.0
    hooks:
      - id: black
      - id: black-jupyter
        language_version: python3

  # Sort imports
  - repo: https://github.com/pycqa/isort
    rev: 5.12.0
    hooks:
      - id: isort
        name: isort (python)
        args: ["--profile", "black"] # avoid conflicts with black (alternatively, set profile = "black" under [tool.isort] in .isort.cfg)

  # Python static analysis (linter)
  - repo: https://github.com/pycqa/flake8
    rev: 6.0.0
    hooks:
      - id: flake8
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) Microsoft Corporation.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
--------------------------------------------------------------------------------
/.github/workflows/smoke-testing-python-script.yml:
--------------------------------------------------------------------------------
name: Smoke Testing for Python Script
on:
  push:
    branches:
      - main
  pull_request:
    branches:
      - main
  schedule:
    - cron: "0 0 * * *"
  workflow_dispatch:


jobs:
  smoke-testing-python-script:
    name: Smoke Testing for Python Script
    runs-on: ubuntu-latest
    env:
      INPUT_DATA: './data/raw/nyc_taxi_dataset.csv'
    steps:
      - name: Checkout repository
        uses: actions/checkout@v3
      - name: Setup python
        uses: actions/setup-python@v4.2.0
        with:
          python-version: "3.9"
      - name: Create conda environment
        uses: conda-incubator/setup-miniconda@v2
        with:
          activate-environment: mlops-train
          environment-file: environments/conda_train.yml
      - name: Kernel configuration
        run: |
          python -m ipykernel install --user --name mlops-train
        shell: bash -el {0}
      - name: Run Python script
        run: |
          python src/model/train.py --input_data $INPUT_DATA
        shell: bash -el {0}
        env:
          AZUREML_RUN_ID: ${{ github.run_id }}
--------------------------------------------------------------------------------
/src/deploy/online/score.py:
--------------------------------------------------------------------------------
import json
import logging
import os

import joblib
import numpy


def init():
    """
    This function is called when the container is initialized/started, typically after create/update of the deployment.
    You can write the logic here to perform init operations like caching the model in memory.
    """
    global model
    # AZUREML_MODEL_DIR is an environment variable created during deployment.
    # It is the path to the model folder (./azureml-models/$MODEL_NAME/$VERSION).
    # Please provide your model's folder name if there is one.
    model_path = os.path.join(os.getenv("AZUREML_MODEL_DIR"), "models/model.pkl")
    # deserialize the model file back into a sklearn model
    model = joblib.load(model_path)
    logging.info("Init complete")


def run(raw_data):
    """
    This function is called for every invocation of the endpoint to perform the actual scoring/prediction.
    In this example we extract the data from the JSON input, call the scikit-learn model's predict()
    method, and return the result.
    """
    logging.info("model: request received")
    data = json.loads(raw_data)["data"]
    data = numpy.array(data)
    result = model.predict(data)
    logging.info("Request processed")
    return result.tolist()
--------------------------------------------------------------------------------
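Note: the same local-harness trick works for the online scoring script, since init() and run() are
plain module-level functions. A sketch, assuming a joblib-serialized model sits at the hypothetical
local path ./outputs/models/model.pkl (mirroring the deployed layout):

    import json
    import os

    import score  # src/deploy/online/score.py

    # hypothetical: the folder must contain models/model.pkl
    os.environ["AZUREML_MODEL_DIR"] = "./outputs"
    score.init()
    # same request contract as the deployed endpoint: {"data": [[...feature rows...]]}
    payload = json.dumps({"data": [[2, 1421107273.0, 1, 2.99, -73.83, 40.76, -73.79, 40.74]]})
    print(score.run(payload))
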
/.github/workflows/microsoft-security-devops-analysis.yml:
--------------------------------------------------------------------------------
name: Microsoft Security DevOps Analysis
on:
  push:
    branches:
      - main
  pull_request:
    branches:
      - main
  schedule:
    - cron: "0 0 * * *"
  workflow_dispatch:

jobs:
  scan-code:
    name: Microsoft Security DevOps Analysis

    runs-on: ubuntu-latest

    steps:

      # Checkout your code repository to scan
      - uses: actions/checkout@v3

      # Install dotnet, used by MSDO
      - uses: actions/setup-dotnet@v3
        with:
          dotnet-version: |
            3.1.x
            5.0.x
            6.0.x

      # Run analyzers
      - name: Run Microsoft Security DevOps Analysis
        uses: microsoft/security-devops-action@preview
        id: msdo

      # Upload alerts to the Security tab
      - name: Upload alerts to Security tab
        uses: github/codeql-action/upload-sarif@v2
        with:
          sarif_file: ${{ steps.msdo.outputs.sarifFile }}

      # Upload alerts file as a workflow artifact
      - name: Upload alerts file as a workflow artifact
        uses: actions/upload-artifact@v3
        with:
          name: alerts
          path: ${{ steps.msdo.outputs.sarifFile }}
--------------------------------------------------------------------------------
/scripts/setup.sh:
--------------------------------------------------------------------------------
#!/bin/bash

# move to the directory of this shell script
exec_path=$(readlink -f "$0")
exec_dir=$(dirname "$exec_path")
cd "$exec_dir/../"

# conda virtual environment
# name of the conda environment
CONDA_ENV_NAME="mlops-train"

# Check whether the environment already exists
if conda env list | grep -Eq "\s*$CONDA_ENV_NAME\s"; then
    echo "Conda environment '$CONDA_ENV_NAME' already exists. Skipping creation."
    source /anaconda/etc/profile.d/conda.sh
    conda activate $CONDA_ENV_NAME
else
    # Create the environment if it does not exist
    conda env create -n $CONDA_ENV_NAME --file environments/conda_train.yml

    conda init bash
    # check if the command was successful
    if [ $? -eq 0 ]; then
        echo "'conda init' command was successful."
    fi

    source /anaconda/etc/profile.d/conda.sh
    conda activate $CONDA_ENV_NAME

    # check if the command was successful
    if [ $? -eq 0 ]; then
        echo "'conda activate $CONDA_ENV_NAME' command was successful."
    fi
fi


# ipykernel
ipython kernel install --user --name=$CONDA_ENV_NAME --display-name=$CONDA_ENV_NAME

# pre-commit
git init .
git config --global --add safe.directory .
pre-commit install-hooks

# Azure CLI & ml extension
curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash - && az extension add --name ml
--------------------------------------------------------------------------------
/.github/workflows/smoke-testing-notebook.yml:
--------------------------------------------------------------------------------
name: Smoke Testing for Notebook
on:
  push:
    branches:
      - main
  pull_request:
    branches:
      - main
  schedule:
    - cron: "0 0 * * *"
  workflow_dispatch:


jobs:
  smoke-testing-notebook:
    name: Smoke Testing for Notebook
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v3
      - name: Setup python
        uses: actions/setup-python@v4.2.0
        with:
          python-version: "3.9"
      - name: Create conda environment
        uses: conda-incubator/setup-miniconda@v2
        with:
          activate-environment: mlops-train
          environment-file: environments/conda_train.yml
      - name: Kernel configuration
        run: |
          python -m ipykernel install --user --name mlops-train
        shell: bash -el {0}
      - name: Run Notebook for experiment
        run: |
          papermill train-experiment.ipynb output.ipynb -k mlops-train
        working-directory: notebooks
        shell: bash -el {0}
      - name: Run Notebook for mlflow
        run: |
          papermill train-mlflow-local.ipynb output.ipynb -k mlops-train
        working-directory: notebooks
        shell: bash -el {0}
      - name: Run Notebook for responsible ai debugging
        run: |
          papermill train-model-debugging.ipynb output.ipynb -k mlops-train
        working-directory: notebooks
        shell: bash -el {0}
--------------------------------------------------------------------------------
/pipelines/pipeline.yml:
--------------------------------------------------------------------------------
$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline
display_name: train_nyc_taxi_pipeline
description: train_nyc_taxi

outputs:
  pipeline_job_trained_model:
    type: mlflow_model
    mode: upload

settings:
  default_datastore: azureml:workspaceblobstore
  default_compute: azureml:cpu-cluster
  continue_on_step_failure: false


jobs:
  prep_job:
    type: command
    component: ./prep.yml
    inputs:
      nyc_taxi_data: # registered data asset
        type: uri_file
        path: azureml:nyc_taxi_dataset@latest
    outputs:
      training_data:
      testing_data:


  train_job:
    type: command
    component: ./train.yml
    inputs:
      training_data: ${{parent.jobs.prep_job.outputs.training_data}}
    outputs:
      model_output: ${{parent.outputs.pipeline_job_trained_model}}


  score_job:
    type: command
    component: ./score.yml
    inputs:
      testing_data: ${{parent.jobs.prep_job.outputs.testing_data}}
      model_input: ${{parent.jobs.train_job.outputs.model_output}}
    outputs:
      predicted_data:
      label_data:


  eval_job:
    type: command
    component: ./eval.yml
    inputs:
      predicted_data: ${{parent.jobs.score_job.outputs.predicted_data}}
      label_data: ${{parent.jobs.score_job.outputs.label_data}}
    outputs:
      model_performance_report:
--------------------------------------------------------------------------------
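Note: this repository submits the pipeline above through the CLI, but the same YAML can also be
submitted from Python. A sketch using the Azure ML Python SDK v2 — assuming the azure-ai-ml and
azure-identity packages are installed and the placeholders are filled in:

    from azure.ai.ml import MLClient, load_job
    from azure.identity import DefaultAzureCredential

    # connect to the workspace (all three identifiers are placeholders)
    ml_client = MLClient(
        DefaultAzureCredential(),
        subscription_id="<subscription-id>",
        resource_group_name="<resource-group>",
        workspace_name="<workspace>",
    )

    # load the pipeline job definition from the YAML and submit it
    pipeline_job = load_job(source="pipelines/pipeline.yml")
    run = ml_client.jobs.create_or_update(pipeline_job)
    print(run.studio_url)  # link to the run in Azure ML studio
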
/.github/workflows/code-quality.yml:
--------------------------------------------------------------------------------
name: Code Quality (linter, formatter, pre-commit)
on:
  push:
    branches:
      - main
  pull_request:
    branches:
      - main
  schedule:
    - cron: "0 0 * * *"
  workflow_dispatch:


jobs:
  lint-python:
    name: Python Lint
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v3
      - name: Setup python
        uses: actions/setup-python@v4.2.0
        with:
          python-version: "3.9"
      - name: Set up python
        run: |
          pip install -r environments/code_quality.txt
          pip list
      - name: Lint with flake8
        run: |
          flake8 .
      # - name: Typecheck with mypy
      #   run: |
      #     mypy .


  format-python:
    name: Python Format
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v3
      - name: Setup python
        uses: actions/setup-python@v4.2.0
        with:
          python-version: "3.9"
      - name: Set up python
        run: |
          pip install -r environments/code_quality.txt
          pip list
      - name: Check format with isort
        run: |
          isort . --check --diff
      - name: Check format with black
        run: |
          black . --check --diff


  pre-commit:
    name: pre-commit
    runs-on: ubuntu-latest
    env:
      SKIP: no-commit-to-branch
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v3
      - uses: pre-commit/action@v3.0.0
--------------------------------------------------------------------------------
/pipelines/train/train.py:
--------------------------------------------------------------------------------
import argparse
from pathlib import Path

import mlflow
import mlflow.sklearn
import pandas as pd
from sklearn.linear_model import LinearRegression


def parse_args():
    # Parse arguments
    parser = argparse.ArgumentParser()
    parser.add_argument("--training_data", type=str, help="Path of training_data")
    parser.add_argument("--model_output", type=str, help="Path of model_output")

    args = parser.parse_args()
    return args


def split_label(df):
    X = df.drop(columns="totalAmount")
    y = df["totalAmount"]
    return X, y


def train_model(X_train, y_train):
    # Log the number of training samples
    mlflow.log_metric("Train samples", len(X_train))

    # Train the model
    model = LinearRegression().fit(X_train, y_train)

    return model


def save_model(model, output_dir):
    # Save the model
    mlflow.sklearn.save_model(model, output_dir)


def main(args):
    # Enable autologging
    mlflow.autolog(log_models=False)

    # Echo the arguments
    lines = [
        f"Path of training_data: {args.training_data}",
        f"Path of the model output folder: {args.model_output}",
    ]
    [print(line) for line in lines]

    # Load the training data
    df = pd.read_csv(Path(args.training_data) / "train.csv")

    # Preprocess the data
    X_train, y_train = split_label(df)

    # Train the model
    model = train_model(X_train, y_train)

    # Save the model
    save_model(model, args.model_output)


if __name__ == "__main__":
    # Parse arguments
    args = parse_args()

    # Run main
    main(args)
--------------------------------------------------------------------------------
/pipelines/prep/prep.py:
--------------------------------------------------------------------------------
import argparse
from pathlib import Path

import mlflow
import mlflow.sklearn
import pandas as pd
from sklearn.model_selection import train_test_split


def parse_args():
    # Parse arguments
    parser = argparse.ArgumentParser()
    parser.add_argument("--input_data", type=str, help="input data")
    parser.add_argument(
        "--test_split_ratio", type=float, help="Ratio of train test split"
    )
    parser.add_argument("--training_data", type=str, help="Path of training_data")
    parser.add_argument("--testing_data", type=str, help="Path of testing_data")

    args = parser.parse_args()
    return args


def process_data(df, test_split_ratio):
    training_data, testing_data = train_test_split(
        df, test_size=test_split_ratio, random_state=0
    )
    mlflow.log_metric("Train samples", len(training_data))
    mlflow.log_metric("Test samples", len(testing_data))

    # Return the split data
    return training_data, testing_data


def main(args):
    # Echo the arguments
    lines = [
        f"Path of the input data: {args.input_data}",
        f"Path of the split data (training_data): {args.training_data}",
        f"Path of the split data (testing_data): {args.testing_data}",
    ]
    [print(line) for line in lines]

    # Load the input data
    df = pd.read_csv(args.input_data)

    # Preprocess the data
    training_data, testing_data = process_data(df, args.test_split_ratio)
    training_data.to_csv(Path(args.training_data) / "train.csv", index=False)
    testing_data.to_csv(Path(args.testing_data) / "test.csv", index=False)


if __name__ == "__main__":
    # Parse arguments
    args = parse_args()

    # Run main
    main(args)
--------------------------------------------------------------------------------
/pipelines/score/score.py:
--------------------------------------------------------------------------------
import argparse
from pathlib import Path

import mlflow
import mlflow.sklearn
import numpy as np
import pandas as pd


def parse_args():
    # Parse arguments
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_input", type=str, help="Path of model input")
    parser.add_argument("--testing_data", type=str, help="Path of testing data")
    parser.add_argument("--predicted_data", type=str, help="Path of predicted data")
    parser.add_argument("--label_data", type=str, help="Path of label data")

    args = parser.parse_args()
    return args


def split_label(df):
    X = df.drop(columns="totalAmount")
    y = df["totalAmount"]
    return X, y


def get_model(model_input):
    return mlflow.sklearn.load_model(model_input)


def score_model(X_test, model):
    pred = model.predict(X_test)
    return pred


def save_data(pred, data_path, filename):
    np.savetxt(Path(data_path) / filename, pred, delimiter=",")


def main(args):
    # Echo the arguments
    lines = [
        f"Path of the model input: {args.model_input}",
        f"Path of testing_data: {args.testing_data}",
    ]
    [print(line) for line in lines]

    # Load the test data
    df = pd.read_csv(Path(args.testing_data) / "test.csv")

    # Preprocess the data
    X_test, y_test = split_label(df)

    # Load the model
    model = get_model(args.model_input)

    # Predict
    pred = score_model(X_test, model)

    # Save the predictions
    save_data(pred, args.predicted_data, "pred.csv")

    # Save the label data
    save_data(y_test, args.label_data, "label.csv")


if __name__ == "__main__":
    # Parse arguments
    args = parse_args()

    # Run main
    main(args)
--------------------------------------------------------------------------------
/src/model/register_model.py:
--------------------------------------------------------------------------------
import argparse
import os
from pathlib import Path

import mlflow
from mlflow.pyfunc import load_model


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--model_name",
        type=str,
        help="Name under which the model will be registered",
    )
    parser.add_argument(
        "--model_path",
        type=str,
        help="Model directory",
    )
    parser.add_argument(
        "--deploy_flag",
        type=str,
        help="A flag that indicates whether to deploy the model or not",
    )

    args, _ = parser.parse_known_args()
    print(f"Arguments: {args}")

    return args


def main():
    # Get the run
    run = mlflow.start_run()
    run_id = run.info.run_id
    print("run_id: ", run_id)

    args = parse_args()

    model_name = args.model_name
    model_path = args.model_path

    if len(args.deploy_flag) == 1:  # this is the case where deploy_flag is a digit
        deploy_flag = int(args.deploy_flag)
    else:  # this is the case where deploy_flag is a path name
        with open(
            (Path(args.deploy_flag) / "deploy_flag"),
            "rb",
        ) as f:
            deploy_flag = int(f.read())

    if deploy_flag == 1:
        print("Registering ", model_name)

        model = load_model(os.path.join(model_path, "models"))
        # log the model using mlflow
        mlflow.sklearn.log_model(model, model_name)

        # register the model using the mlflow model registry
        model_uri = f"runs:/{run_id}/{args.model_name}"
        mlflow.register_model(model_uri, model_name)

    else:
        print("Model will not be registered!")

    mlflow.end_run()


if __name__ == "__main__":
    main()
--------------------------------------------------------------------------------
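Note: once register_model.py has run, the registered model can be pulled back out of the registry
by name and version through MLflow. A minimal sketch, assuming the MLflow tracking URI already
points at the workspace and that version 1 of the model exists (both are assumptions):

    import mlflow.pyfunc

    model = mlflow.pyfunc.load_model("models:/nyc_taxi/1")  # "1" is an assumed version
    # feature order: vendorID, pickup timestamp, passengerCount, tripDistance,
    # pickup lon/lat, dropoff lon/lat (same as data/samples/nyc_taxi_sample.json)
    print(model.predict([[2, 1421107273.0, 1, 2.99, -73.83, 40.76, -73.79, 40.74]]))
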
plt.savefig(output_path) 51 | 52 | # プロット画像のロギング 53 | mlflow.log_artifact(output_path) 54 | 55 | 56 | def main(args): 57 | # 引数の確認 58 | lines = [ 59 | f"予測値データのパス: {args.predicted_data}", 60 | f"ラベルデータのパス: {args.label_data}", 61 | f"モデルパフォーマンスレポートのパス: {args.model_performance_report}", 62 | ] 63 | [print(line) for line in lines] 64 | 65 | # 予測値データの読み込み 66 | y_pred = pd.read_csv(Path(args.predicted_data) / "pred.csv") 67 | 68 | # ラベルデータの読み込み 69 | y_test = pd.read_csv(Path(args.label_data) / "label.csv") 70 | 71 | # モデル評価指標の算出 72 | evaluate_model(y_test, y_pred) 73 | 74 | # 実測値と予測値のプロット 75 | plot_actuals_predictions(y_test, y_pred, args.model_performance_report) 76 | 77 | 78 | if __name__ == "__main__": 79 | # 引数の処理 80 | args = parse_args() 81 | 82 | # main 関数の実行 83 | main(args) 84 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | .eggs/ 17 | lib/ 18 | lib64/ 19 | parts/ 20 | sdist/ 21 | var/ 22 | wheels/ 23 | pip-wheel-metadata/ 24 | share/python-wheels/ 25 | *.egg-info/ 26 | .installed.cfg 27 | *.egg 28 | MANIFEST 29 | 30 | # PyInstaller 31 | # Usually these files are written by a python script from a template 32 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 33 | *.manifest 34 | *.spec 35 | 36 | # Installer logs 37 | pip-log.txt 38 | pip-delete-this-directory.txt 39 | 40 | # Unit test / coverage reports 41 | htmlcov/ 42 | .tox/ 43 | .nox/ 44 | .coverage 45 | .coverage.* 46 | .cache 47 | nosetests.xml 48 | coverage.xml 49 | *.cover 50 | *.py,cover 51 | .hypothesis/ 52 | .pytest_cache/ 53 | 54 | # Translations 55 | *.mo 56 | *.pot 57 | 58 | # Django stuff: 59 | *.log 60 | local_settings.py 61 | db.sqlite3 62 | db.sqlite3-journal 63 | 64 | # Flask stuff: 65 | instance/ 66 | .webassets-cache 67 | 68 | # Scrapy stuff: 69 | .scrapy 70 | 71 | # Sphinx documentation 72 | docs/_build/ 73 | 74 | # PyBuilder 75 | target/ 76 | 77 | # Jupyter Notebook 78 | .ipynb_checkpoints 79 | 80 | # IPython 81 | profile_default/ 82 | ipython_config.py 83 | 84 | # pyenv 85 | .python-version 86 | 87 | # pipenv 88 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. 89 | # However, in case of collaboration, if having platform-specific dependencies or dependencies 90 | # having no cross-platform support, pipenv may install dependencies that don't work, or not 91 | # install all needed dependencies. 92 | #Pipfile.lock 93 | 94 | # PEP 582; used by e.g. 
github.com/David-OConnor/pyflow 95 | __pypackages__/ 96 | 97 | # Celery stuff 98 | celerybeat-schedule 99 | celerybeat.pid 100 | 101 | # SageMath parsed files 102 | *.sage.py 103 | 104 | # Environments 105 | .env 106 | .venv 107 | env/ 108 | venv/ 109 | ENV/ 110 | env.bak/ 111 | venv.bak/ 112 | 113 | # Spyder project settings 114 | .spyderproject 115 | .spyproject 116 | 117 | # Rope project settings 118 | .ropeproject 119 | 120 | # mkdocs documentation 121 | /site 122 | 123 | # mypy 124 | .mypy_cache/ 125 | .dmypy.json 126 | dmypy.json 127 | 128 | # Pyre type checker 129 | .pyre/ 130 | 131 | ##AML 132 | config.json 133 | 134 | # Mlflow 135 | mlruns/ 136 | mlartifacts/ 137 | outputs/ 138 | mlruns.db 139 | -------------------------------------------------------------------------------- /SECURITY.md: -------------------------------------------------------------------------------- 1 | 2 | 3 | ## Security 4 | 5 | Microsoft takes the security of our software products and services seriously, which includes all source code repositories managed through our GitHub organizations, which include [Microsoft](https://github.com/Microsoft), [Azure](https://github.com/Azure), [DotNet](https://github.com/dotnet), [AspNet](https://github.com/aspnet), [Xamarin](https://github.com/xamarin), and [our GitHub organizations](https://opensource.microsoft.com/). 6 | 7 | If you believe you have found a security vulnerability in any Microsoft-owned repository that meets [Microsoft's definition of a security vulnerability](https://aka.ms/opensource/security/definition), please report it to us as described below. 8 | 9 | ## Reporting Security Issues 10 | 11 | **Please do not report security vulnerabilities through public GitHub issues.** 12 | 13 | Instead, please report them to the Microsoft Security Response Center (MSRC) at [https://msrc.microsoft.com/create-report](https://aka.ms/opensource/security/create-report). 14 | 15 | If you prefer to submit without logging in, send email to [secure@microsoft.com](mailto:secure@microsoft.com). If possible, encrypt your message with our PGP key; please download it from the [Microsoft Security Response Center PGP Key page](https://aka.ms/opensource/security/pgpkey). 16 | 17 | You should receive a response within 24 hours. If for some reason you do not, please follow up via email to ensure we received your original message. Additional information can be found at [microsoft.com/msrc](https://aka.ms/opensource/security/msrc). 18 | 19 | Please include the requested information listed below (as much as you can provide) to help us better understand the nature and scope of the possible issue: 20 | 21 | * Type of issue (e.g. buffer overflow, SQL injection, cross-site scripting, etc.) 22 | * Full paths of source file(s) related to the manifestation of the issue 23 | * The location of the affected source code (tag/branch/commit or direct URL) 24 | * Any special configuration required to reproduce the issue 25 | * Step-by-step instructions to reproduce the issue 26 | * Proof-of-concept or exploit code (if possible) 27 | * Impact of the issue, including how an attacker might exploit the issue 28 | 29 | This information will help us triage your report more quickly. 30 | 31 | If you are reporting for a bug bounty, more complete reports can contribute to a higher bounty award. Please visit our [Microsoft Bug Bounty Program](https://aka.ms/opensource/security/bounty) page for more details about our active programs. 
32 | 33 | ## Preferred Languages 34 | 35 | We prefer all communications to be in English. 36 | 37 | ## Policy 38 | 39 | Microsoft follows the principle of [Coordinated Vulnerability Disclosure](https://aka.ms/opensource/security/cvd). 40 | 41 | 42 | -------------------------------------------------------------------------------- /utils/prepare_data.py: -------------------------------------------------------------------------------- 1 | # Script that generates the data used in this repository 2 | import copy 3 | import os 4 | from datetime import datetime 5 | 6 | import pandas as pd 7 | from azureml.core import Dataset, Datastore, Workspace 8 | from azureml.opendatasets import NycTlcGreen 9 | from dateutil.relativedelta import relativedelta 10 | 11 | 12 | def register_dataset(ws: Workspace) -> None: 13 | dataset_name = "nyc_taxi_dataset" 14 | try: 15 | dataset = Dataset.get_by_name(ws, dataset_name) 16 | df = dataset.to_pandas_dataframe() 17 | except Exception: 18 | raw_df = pd.DataFrame([]) 19 | start = datetime.strptime("1/1/2015", "%m/%d/%Y") 20 | end = datetime.strptime("1/31/2015", "%m/%d/%Y") 21 | 22 | for sample_month in range(3): 23 | temp_df_green = NycTlcGreen( 24 | start + relativedelta(months=sample_month), 25 | end + relativedelta(months=sample_month), 26 | ).to_pandas_dataframe() 27 | raw_df = pd.concat([raw_df, temp_df_green.sample(2000)])  # DataFrame.append was removed in pandas 2.0 28 | 29 | print(raw_df.head(10)) 30 | 31 | df = copy.deepcopy(raw_df) 32 | 33 | columns_to_remove = [ 34 | "lpepDropoffDatetime", 35 | "puLocationId", 36 | "doLocationId", 37 | "extra", 38 | "mtaTax", 39 | "improvementSurcharge", 40 | "tollsAmount", 41 | "ehailFee", 42 | "tripType", 43 | "rateCodeID", 44 | "storeAndFwdFlag", 45 | "paymentType", 46 | "fareAmount", 47 | "tipAmount", 48 | ] 49 | for col in columns_to_remove: 50 | df.pop(col) 51 | 52 | df = df.query("pickupLatitude>=40.53 and pickupLatitude<=40.88") 53 | df = df.query("pickupLongitude>=-74.09 and pickupLongitude<=-73.72") 54 | df = df.query("tripDistance>=0.25 and tripDistance<31") 55 | df = df.query("passengerCount>0 and totalAmount>0") 56 | 57 | df["lpepPickupDatetime"] = df["lpepPickupDatetime"].map(lambda x: x.timestamp()) 58 | 59 | datastore = Datastore.get_default(ws) 60 | dataset = Dataset.Tabular.register_pandas_dataframe(df, datastore, dataset_name) 61 | 62 | print(df.head(5)) 63 | df.to_csv( 64 | "./data/raw/nyc_taxi_dataset.csv", 65 | header=True, 66 | index=False, 67 | ) 68 | 69 | 70 | def main() -> None: 71 | subscription_id = os.getenv("subscription_id") 72 | resource_group = os.getenv("resource_group") 73 | workspace_name = os.getenv("workspace") 74 | 75 | ws = Workspace( 76 | workspace_name=workspace_name, 77 | subscription_id=subscription_id, 78 | resource_group=resource_group, 79 | ) 80 | 81 | # The environment is created inline; to reuse a registered Environment, use the line below 82 | # create_environment(ws) 83 | register_dataset(ws) 84 | 85 | 86 | if __name__ == "__main__": 87 | main() 88 | -------------------------------------------------------------------------------- /.github/workflows/codeql-analysis.yml: -------------------------------------------------------------------------------- 1 | # For most projects, this workflow file will not need changing; you simply need 2 | # to commit it to your repository. 3 | # 4 | # You may wish to alter this file to override the set of languages analyzed, 5 | # or to provide custom queries or build logic. 6 | # 7 | # ******** NOTE ******** 8 | # We have attempted to detect the languages in your repository.
Please check 9 | # the `language` matrix defined below to confirm you have the correct set of 10 | # supported CodeQL languages. 11 | # 12 | name: "CodeQL" 13 | 14 | on: 15 | push: 16 | branches: [ "main" ] 17 | pull_request: 18 | # The branches below must be a subset of the branches above 19 | branches: [ "main" ] 20 | schedule: 21 | - cron: '45 21 * * 5' 22 | 23 | jobs: 24 | analyze: 25 | name: Analyze 26 | runs-on: ubuntu-latest 27 | permissions: 28 | actions: read 29 | contents: read 30 | security-events: write 31 | 32 | strategy: 33 | fail-fast: false 34 | matrix: 35 | language: ['python'] 36 | # CodeQL supports [ 'cpp', 'csharp', 'go', 'java', 'javascript', 'python', 'ruby' ] 37 | # Learn more about CodeQL language support at https://aka.ms/codeql-docs/language-support 38 | 39 | steps: 40 | - name: Checkout repository 41 | uses: actions/checkout@v3 42 | 43 | # Initializes the CodeQL tools for scanning. 44 | - name: Initialize CodeQL 45 | uses: github/codeql-action/init@v2 46 | with: 47 | languages: ${{ matrix.language }} 48 | # If you wish to specify custom queries, you can do so here or in a config file. 49 | # By default, queries listed here will override any specified in a config file. 50 | # Prefix the list here with "+" to use these queries and those in the config file. 51 | 52 | # For details on CodeQL's query packs, refer to: https://docs.github.com/en/code-security/code-scanning/automatically-scanning-your-code-for-vulnerabilities-and-errors/configuring-code-scanning#using-queries-in-ql-packs 53 | # queries: security-extended,security-and-quality 54 | 55 | 56 | # Autobuild attempts to build any compiled languages (C/C++, C#, or Java). 57 | # If this step fails, then you should remove it and run the build manually (see below) 58 | - name: Autobuild 59 | uses: github/codeql-action/autobuild@v2 60 | 61 | # ℹ️ Command-line programs to run using the OS shell. 62 | # 📚 See https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#jobsjob_idstepsrun 63 | 64 | # If the Autobuild fails above, remove it and uncomment the following three lines, 65 | # modifying them (or adding more) to build your code if your project uses a compiled language; please refer to the EXAMPLE below for guidance.
66 | 67 | # - run: | 68 | # echo "Run, Build Application using script" 69 | # ./location_of_script_within_repo/buildscript.sh 70 | 71 | - name: Perform CodeQL Analysis 72 | uses: github/codeql-action/analyze@v2 73 | -------------------------------------------------------------------------------- /src/model/train.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import os 3 | 4 | import matplotlib.pyplot as plt 5 | import mlflow 6 | import mlflow.sklearn 7 | import numpy as np 8 | import pandas as pd 9 | from sklearn.linear_model import LinearRegression 10 | from sklearn.metrics import mean_squared_error, r2_score 11 | from sklearn.model_selection import train_test_split 12 | 13 | 14 | def parse_args(): 15 | # Parse arguments 16 | parser = argparse.ArgumentParser() 17 | parser.add_argument("--input_data", type=str, help="input data") 18 | parser.add_argument( 19 | "--output_dir", type=str, help="output dir", default="./outputs" 20 | ) 21 | args = parser.parse_args() 22 | return args 23 | 24 | 25 | def process_data(df): 26 | # Create X and y 27 | X = df.drop(columns="totalAmount") 28 | y = df["totalAmount"] 29 | 30 | # Split into training and test data 31 | X_train, X_test, y_train, y_test = train_test_split( 32 | X, y, test_size=0.30, random_state=0 33 | ) 34 | 35 | # Return the split data 36 | return X_train, X_test, y_train, y_test 37 | 38 | 39 | def train_model(X_train, y_train): 40 | # Log the number of samples 41 | mlflow.log_metric("Train samples", len(X_train)) 42 | 43 | # Train the model 44 | model = LinearRegression().fit(X_train, y_train) 45 | 46 | return model 47 | 48 | 49 | def evaluate_model(model, X_test, y_test): 50 | # Log the number of samples 51 | mlflow.log_metric("Test samples", len(X_test)) 52 | 53 | # Evaluate the model 54 | y_pred = model.predict(X_test) 55 | mse = mean_squared_error(y_test, y_pred) 56 | rmse = np.sqrt(mse) 57 | r2 = r2_score(y_test, y_pred) 58 | 59 | # Log accuracy metrics 60 | mlflow.log_metric("mse", mse) 61 | mlflow.log_metric("rmse", rmse) 62 | mlflow.log_metric("r2", r2) 63 | 64 | # Plot actual vs. predicted values 65 | plot_actuals_predictions(y_test, y_pred) 66 | 67 | 68 | def plot_actuals_predictions(y_test, y_pred): 69 | # Plot actual vs. predicted values 70 | plt.figure(figsize=(10, 7)) 71 | plt.scatter(y_test, y_pred) 72 | plt.plot(y_test, y_test, color="r") 73 | plt.title("Actual VS Predicted Values (Test set)") 74 | plt.xlabel("Actual Values") 75 | plt.ylabel("Predicted Values") 76 | plt.savefig("actuals_vs_predictions.png") 77 | 78 | # Log the plot image 79 | mlflow.log_artifact("actuals_vs_predictions.png") 80 | 81 | 82 | def save_model(model, output_dir): 83 | # Save the model 84 | os.makedirs(os.path.join(output_dir, "models"), exist_ok=True) 85 | mlflow.sklearn.save_model(model, os.path.join(output_dir, "models")) 86 | 87 | 88 | def main(args): 89 | # Enable autologging 90 | mlflow.autolog(log_models=False) 91 | 92 | # Print the arguments 93 | lines = [ 94 | f"Training data path: {args.input_data}", 95 | f"Output folder path: {args.output_dir}", 96 | ] 97 | print("\n".join(lines)) 98 | 99 | # Load the training data 100 | df = pd.read_csv(args.input_data) 101 | 102 | # Preprocess the data 103 | X_train, X_test, y_train, y_test = process_data(df) 104 | 105 | # Train the model 106 | model = train_model(X_train, y_train) 107 | 108 | # Evaluate the model 109 | evaluate_model(model, X_test, y_test) 110 | 111 | # Save the model (use model_dir to avoid shadowing the built-in `dir`) 112 | model_dir = os.path.join(args.output_dir, os.environ["AZUREML_RUN_ID"]) 113 | save_model(model, model_dir) 114 | 115 | 116 | if __name__ == "__main__": 117 | # Parse arguments 118 | args = parse_args() 119 | 120 | # Run the main function 121 | main(args) 122 |
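The `tests/unit` folder ships empty. As a starting point, a minimal pytest sketch for `process_data` above might look like the following; the file name and the `src.model.train` import path are assumptions (they depend on how the test path is configured in `pytest.ini`), not part of the repository:

```python
# tests/unit/test_process_data.py -- hypothetical example, not part of the repo
import pandas as pd

from src.model.train import process_data  # assumes src/ is importable


def test_process_data_splits_70_30():
    df = pd.DataFrame(
        {
            "tripDistance": range(10),
            "passengerCount": [1] * 10,
            "totalAmount": [float(i) for i in range(10)],
        }
    )

    X_train, X_test, y_train, y_test = process_data(df)

    # test_size=0.30 on 10 rows -> 3 test samples, 7 train samples
    assert len(X_test) == 3 and len(y_test) == 3
    assert len(X_train) == 7 and len(y_train) == 7
    # the target column must not leak into the features
    assert "totalAmount" not in X_train.columns
```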
-------------------------------------------------------------------------------- /notebooks/train-experiment.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import argparse\n", 10 | "import os\n", 11 | "import shutil\n", 12 | "\n", 13 | "import matplotlib.pyplot as plt\n", 14 | "import numpy as np\n", 15 | "import pandas as pd\n", 16 | "from pathlib import Path\n", 17 | "from sklearn.linear_model import LinearRegression\n", 18 | "from sklearn.metrics import mean_squared_error, r2_score\n", 19 | "from sklearn.model_selection import train_test_split" 20 | ] 21 | }, 22 | { 23 | "cell_type": "code", 24 | "execution_count": null, 25 | "metadata": {}, 26 | "outputs": [], 27 | "source": [ 28 | "input_data = \"../data/raw/nyc_taxi_dataset.csv\"\n", 29 | "output_dir = \"outputs\"\n", 30 | "\n", 31 | "lines = [f\"Training data path: {input_data}\", f\"Output folder path: {output_dir}\"]\n", 32 | "\n", 33 | "for line in lines:\n", 34 | " print(line)" 35 | ] 36 | }, 37 | { 38 | "cell_type": "code", 39 | "execution_count": null, 40 | "metadata": {}, 41 | "outputs": [], 42 | "source": [ 43 | "# Load the training data\n", 44 | "df = pd.read_csv(input_data)" 45 | ] 46 | }, 47 | { 48 | "cell_type": "code", 49 | "execution_count": null, 50 | "metadata": {}, 51 | "outputs": [], 52 | "source": [ 53 | "df.head()" 54 | ] 55 | }, 56 | { 57 | "cell_type": "code", 58 | "execution_count": null, 59 | "metadata": {}, 60 | "outputs": [], 61 | "source": [ 62 | "# Create X and y\n", 63 | "X = df.drop(columns=\"totalAmount\")\n", 64 | "y = df[\"totalAmount\"]\n", 65 | "\n", 66 | "# Split into training and test data\n", 67 | "X_train, X_test, y_train, y_test = train_test_split(\n", 68 | " X, y, test_size=0.30, random_state=0\n", 69 | ")" 70 | ] 71 | }, 72 | { 73 | "cell_type": "code", 74 | "execution_count": null, 75 | "metadata": {}, 76 | "outputs": [], 77 | "source": [ 78 | "# Train the model\n", 79 | "model = LinearRegression().fit(X_train, y_train)" 80 | ] 81 | }, 82 | { 83 | "cell_type": "code", 84 | "execution_count": null, 85 | "metadata": {}, 86 | "outputs": [], 87 | "source": [ 88 | "# Evaluate the model\n", 89 | "y_pred = model.predict(X_test)\n", 90 | "mse = mean_squared_error(y_test, y_pred)\n", 91 | "rmse = np.sqrt(mse)\n", 92 | "r2 = r2_score(y_test, y_pred)" 93 | ] 94 | }, 95 | { 96 | "cell_type": "code", 97 | "execution_count": null, 98 | "metadata": {}, 99 | "outputs": [], 100 | "source": [ 101 | "# Create the outputs folder\n", 102 | "os.makedirs(\"./outputs\", exist_ok=True)\n", 103 | "\n", 104 | "# Plot actual vs. predicted values\n", 105 | "plt.figure(figsize=(10, 7))\n", 106 | "plt.scatter(y_test, y_pred)\n", 107 | "plt.plot(y_test, y_test, color=\"r\")\n", 108 | "plt.title(\"Actual VS Predicted Values (Test set)\")\n", 109 | "plt.xlabel(\"Actual Values\")\n", 110 | "plt.ylabel(\"Predicted Values\")\n", 111 | "plt.savefig(\"./outputs/actuals_vs_predictions.png\")" 112 | ] 113 | }, 114 | { 115 | "cell_type": "code", 116 | "execution_count": null, 117 | "metadata": {}, 118 | "outputs": [], 119 | "source": [ 120 | "# Prepare the model output folder\n", 121 | "model_path = os.path.join(output_dir, \"models\")\n", 122 | "\n", 123 | "# Remove any previous model, then recreate the folder empty\n", 124 | "if Path(model_path).exists():\n", 125 | " shutil.rmtree(model_path)\n", 126 | "os.makedirs(model_path, exist_ok=True)" 127 | ] 128 | }, 129 | { 130 | "cell_type": "code", 131 | "execution_count": null, 132 | "metadata": {}, 133 | "outputs": [], 134 | "source": [] 135 | } 136 | ], 137 | "metadata": { 138 |
"kernelspec": { 139 | "display_name": "Python 3.8.10 64-bit", 140 | "language": "python", 141 | "name": "python3" 142 | }, 143 | "language_info": { 144 | "codemirror_mode": { 145 | "name": "ipython", 146 | "version": 3 147 | }, 148 | "file_extension": ".py", 149 | "mimetype": "text/x-python", 150 | "name": "python", 151 | "nbconvert_exporter": "python", 152 | "pygments_lexer": "ipython3", 153 | "version": "3.8.10" 154 | }, 155 | "orig_nbformat": 4, 156 | "vscode": { 157 | "interpreter": { 158 | "hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6" 159 | } 160 | } 161 | }, 162 | "nbformat": 4, 163 | "nbformat_minor": 2 164 | } 165 | -------------------------------------------------------------------------------- /docs/quickstart.md: -------------------------------------------------------------------------------- 1 | # クイックスタート 2 | ## 1. コード実行 3 | サンプルコードを動かす手順を紹介します。 4 | 5 | ### Azure Machine Learning の環境準備 6 | - Azure の Subscription を準備します。 7 | - Azure のリソースグループに対する所有者権限を持っていることが前提です。 8 | - [クイックスタート: Azure Machine Learning の利用を開始するために必要なワークスペース リソースを作成する](https://learn.microsoft.com/ja-jp/azure/machine-learning/quickstart-create-resources) の手順に従って、Azure Machine Learning の `ワークスペース` と `コンピューティングインスタンス` を作成します。 9 | - [Visual Studio Code で Azure Machine Learning コンピューティング インスタンスに接続する (プレビュー)](https://learn.microsoft.com/ja-jp/azure/machine-learning/how-to-set-up-vs-code-remote?tabs=studio#configure-a-remote-compute-instance) の手順に従って、Azure Machine Learning の `コンピューティングインスタンス` にアクセス可能なことを確認します。 10 | 11 | 12 |
13 | 14 | ### Preparing the GitHub environment 15 | - Prepare a GitHub account. 16 | - If you only use public repositories, the Free plan (the basic plan for individuals and organizations) in the [pricing plans](https://github.com/pricing) is sufficient, but we recommend the Team or Enterprise plan, which offer richer security features. 17 | - Fork this repository, [Azure/mlops-starter-sklearn](https://github.com/Azure/mlops-starter-sklearn), into your own account or organization. 18 | - In the terminal of the `compute instance`, clone the forked repository into your personal folder under the Users folder. 19 | 20 | ```bash 21 | cd Users/ 22 | git clone https://github.com/<your-account>/mlops-starter-sklearn # point this at your fork 23 | ``` 24 | 25 |
26 | 27 | ### Setting environment variables on Azure Machine Learning 28 | Now run the code you forked earlier. 29 | 30 | - Rename the `.env.example` file to `.env`. 31 | ```bash 32 | mv .env.example .env 33 | ``` 34 | - Open the `.env` file and set the environment variables. 35 | - GROUP: resource group name of the Azure Machine Learning workspace 36 | - WORKSPACE: name of the Azure Machine Learning workspace 37 | - LOCATION: region of the Azure Machine Learning workspace 38 | - Run the command `az account list-locations -o table` in Azure Cloud Shell or on an Azure ML compute instance (after Azure authentication) and use the Name column, not DisplayName. For example, the Japan East region is japaneast, not Japan East. 39 | - SUBSCRIPTION: Azure subscription ID 40 | 41 | _Example .env file_ 42 | ``` 43 | GROUP="azureml" 44 | WORKSPACE="azureml" 45 | LOCATION="japaneast" 46 | SUBSCRIPTION="xxxxxxxxxxx" 47 | ``` 48 | 49 | ### Running the shell scripts 50 | - In the terminal of the `compute instance`, run the shell scripts in the [scripts](../scripts) folder. 51 | - Azure CLI login 52 | - Authenticate the Azure CLI with the `az login --use-device-code` command. 53 | - [setup.sh](../scripts/setup.sh) 54 | - Creates the conda virtual environment 55 | - Configures pre-commit 56 | - Installs the Azure CLI and the ML extension 57 | - [configure-workspace.sh](../scripts/configure-workspace.sh) 58 | - Configures the Azure Machine Learning workspace used by the Azure CLI 59 | - Notebooks 60 | - [run-notebooks.sh](../scripts/prototyping/run-notebooks.sh): runs the experiment notebooks 61 | - Asset creation (compute, data, environment) 62 | - [create-compute.sh](../scripts/assets/create-compute.sh): creates the compute cluster 63 | - [create-data.sh](../scripts/assets/create-data.sh): creates the data asset 64 | - [create-environment.sh](../scripts/assets/create-environment.sh): creates the environment 65 | - Running jobs (model training) 66 | - [train.sh](../scripts/jobs/train.sh): trains the model as an Azure ML job 67 | - Asset creation (model registration) 68 | - [register-model.sh](../scripts/assets/register-model.sh): registers the trained model 69 | - Endpoint creation 70 | - [deploy-online-endpoint-custom.sh](../scripts/endpoints/deploy-online-endpoint-custom.sh): deploys a custom model to an online endpoint 71 | - [deploy-online-endpoint-mlflow.sh](../scripts/endpoints/deploy-online-endpoint-mlflow.sh): deploys an MLflow model to an online endpoint 72 | - [deploy-batch-endpoint-custom.sh](../scripts/endpoints/deploy-batch-endpoint-custom.sh): deploys a custom model to a batch endpoint 73 | - [deploy-batch-endpoint-mlflow.sh](../scripts/endpoints/deploy-batch-endpoint-mlflow.sh): deploys an MLflow model to a batch endpoint 74 | 75 | #### Example end-to-end script run 76 | 77 | ```bash 78 | # Azure login 79 | az login --use-device-code 80 | 81 | # Build the Python environment, set up the Jupyter kernel, configure pre-commit, install the Azure CLI 82 | bash ./scripts/setup.sh 83 | 84 | # Load the environment variables and configure the Azure CLI 85 | bash ./scripts/configure-workspace.sh 86 | 87 | # Run the notebooks 88 | bash ./scripts/prototyping/run-notebooks.sh 89 | 90 | # Create assets (compute, data asset, environment) 91 | bash ./scripts/assets/create-compute.sh 92 | bash ./scripts/assets/create-data.sh 93 | bash ./scripts/assets/create-environment.sh 94 | 95 | # Run the job 96 | bash ./scripts/jobs/train.sh 97 | 98 | # Register the model 99 | bash ./scripts/assets/register-model.sh 100 | 101 | # Build the inference environment 102 | bash ./scripts/endpoints/deploy-online-endpoint-custom.sh 103 | bash ./scripts/endpoints/deploy-online-endpoint-mlflow.sh 104 | bash ./scripts/endpoints/deploy-batch-endpoint-custom.sh 105 | bash ./scripts/endpoints/deploy-batch-endpoint-mlflow.sh 106 | ``` 107 | 108 | --- 109 | 110 | ## 2. Running CI/CD
111 | 112 | ### Creating GitHub Actions secrets 113 | - Create the following GitHub Actions secrets. 114 | - GROUP: resource group name of the Azure Machine Learning workspace 115 | - WORKSPACE: name of the Azure Machine Learning workspace 116 | - SUBSCRIPTION: Azure subscription ID 117 | - AZURE_CREDENTIALS: Azure connection information 118 | - This is written assuming the use of an Azure service principal. OpenID Connect is technically possible, but this documentation and code assume an Azure service principal. 119 | - For details on the credentials and how to set them in the AZURE_CREDENTIALS secret, see [Use GitHub Actions with Azure Machine Learning - Step 2. Authenticate with Azure](https://learn.microsoft.com/ja-JP/azure/machine-learning/how-to-github-actions-machine-learning?tabs=userlevel#step-2-authenticate-with-azure). 120 | 121 | ### Enabling and running GitHub Actions 122 | Open the `Actions` tab of your forked repository on GitHub and enable GitHub Actions. For details, see [GitHub Actions - Disabling and enabling a workflow](https://docs.github.com/ja/actions/managing-workflow-runs/disabling-and-enabling-a-workflow). 123 | -------------------------------------------------------------------------------- /notebooks/train-mlflow-local.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import argparse\n", 10 | "import os\n", 11 | "import shutil\n", 12 | "\n", 13 | "import matplotlib.pyplot as plt\n", 14 | "import mlflow\n", 15 | "import mlflow.sklearn\n", 16 | "import numpy as np\n", 17 | "import pandas as pd\n", 18 | "from pathlib import Path\n", 19 | "from sklearn.linear_model import LinearRegression\n", 20 | "from sklearn.metrics import mean_squared_error, r2_score\n", 21 | "from sklearn.model_selection import train_test_split" 22 | ] 23 | }, 24 | { 25 | "cell_type": "code", 26 | "execution_count": null, 27 | "metadata": {}, 28 | "outputs": [], 29 | "source": [ 30 | "input_data = \"../data/raw/nyc_taxi_dataset.csv\"\n", 31 | "output_dir = \"outputs\"\n", 32 | "\n", 33 | "lines = [f\"Training data path: {input_data}\", f\"Output folder path: {output_dir}\"]\n", 34 | "\n", 35 | "for line in lines:\n", 36 | " print(line)" 37 | ] 38 | }, 39 | { 40 | "cell_type": "code", 41 | "execution_count": null, 42 | "metadata": {}, 43 | "outputs": [], 44 | "source": [ 45 | "# Enable autologging\n", 46 | "mlflow.autolog(log_models=False)" 47 | ] 48 | }, 49 | { 50 | "cell_type": "code", 51 | "execution_count": null, 52 | "metadata": {}, 53 | "outputs": [], 54 | "source": [ 55 | "# Load the training data\n", 56 | "df = pd.read_csv(input_data)" 57 | ] 58 | }, 59 | { 60 | "cell_type": "code", 61 | "execution_count": null, 62 | "metadata": {}, 63 | "outputs": [], 64 | "source": [ 65 | "df.head()" 66 | ] 67 | }, 68 | { 69 | "cell_type": "code", 70 | "execution_count": null, 71 | "metadata": {}, 72 | "outputs": [], 73 | "source": [ 74 | "# Create X and y\n", 75 | "X = df.drop(columns=\"totalAmount\")\n", 76 | "y = df[\"totalAmount\"]\n", 77 | "\n", 78 | "# Split into training and test data\n", 79 | "X_train, X_test, y_train, y_test = train_test_split(\n", 80 | " X, y, test_size=0.30, random_state=0\n", 81 | ")" 82 | ] 83 | }, 84 | { 85 | "cell_type": "code", 86 | "execution_count": null, 87 | "metadata": {}, 88 | "outputs": [], 89 | "source": [ 90 | "# Log the number of samples\n", 91 | "mlflow.log_metric(\"Train samples\", len(X_train))\n", 92 | "\n", 93 | "# Train the model\n", 94 | "model = LinearRegression().fit(X_train, y_train)" 95 | ] 96 | }, 97 | { 98 | "cell_type": "code", 99 | "execution_count": null, 100 | "metadata": {}, 101 | "outputs": [], 102 | "source": [ 103 | "# Log the number of samples\n", 104 |
"mlflow.log_metric(\"Test samples\", len(X_test))\n", 105 | "\n", 106 | "# モデル評価\n", 107 | "y_pred = model.predict(X_test)\n", 108 | "mse = mean_squared_error(y_test, y_pred)\n", 109 | "rmse = np.sqrt(mse)\n", 110 | "r2 = r2_score(y_test, y_pred)\n", 111 | "\n", 112 | "# 精度メトリックのロギング\n", 113 | "mlflow.log_metric(\"mse\", mse)\n", 114 | "mlflow.log_metric(\"rmse\", rmse)\n", 115 | "mlflow.log_metric(\"r2\", r2)" 116 | ] 117 | }, 118 | { 119 | "cell_type": "code", 120 | "execution_count": null, 121 | "metadata": {}, 122 | "outputs": [], 123 | "source": [ 124 | "# outputs フォルダの作成\n", 125 | "os.makedirs(\"./outputs\", exist_ok=True)\n", 126 | "\n", 127 | "# 実測値と予測値のプロット\n", 128 | "plt.figure(figsize=(10, 7))\n", 129 | "plt.scatter(y_test, y_pred)\n", 130 | "plt.plot(y_test, y_test, color=\"r\")\n", 131 | "plt.title(\"Actual VS Predicted Values (Test set)\")\n", 132 | "plt.xlabel(\"Actual Values\")\n", 133 | "plt.ylabel(\"Predicted Values\")\n", 134 | "plt.savefig(\"./outputs/actuals_vs_predictions.png\")\n", 135 | "\n", 136 | "# プロット画像のロギング\n", 137 | "mlflow.log_artifact(\"./outputs/actuals_vs_predictions.png\")" 138 | ] 139 | }, 140 | { 141 | "cell_type": "code", 142 | "execution_count": null, 143 | "metadata": {}, 144 | "outputs": [], 145 | "source": [ 146 | "# モデルの保存\n", 147 | "model_path = os.path.join(output_dir, \"models\")\n", 148 | "\n", 149 | "if Path(model_path).exists():\n", 150 | " shutil.rmtree(model_path)\n", 151 | "else:\n", 152 | " os.makedirs(model_path, exist_ok=True)\n", 153 | "\n", 154 | "mlflow.sklearn.save_model(model, model_path)" 155 | ] 156 | }, 157 | { 158 | "cell_type": "code", 159 | "execution_count": null, 160 | "metadata": {}, 161 | "outputs": [], 162 | "source": [ 163 | "# MLflow UI の起動\n", 164 | "#!mlflow ui --backend-store-uri ./mlruns" 165 | ] 166 | }, 167 | { 168 | "cell_type": "code", 169 | "execution_count": null, 170 | "metadata": {}, 171 | "outputs": [], 172 | "source": [] 173 | } 174 | ], 175 | "metadata": { 176 | "kernelspec": { 177 | "display_name": "Python 3.8.13 ('mlops-train')", 178 | "language": "python", 179 | "name": "python3" 180 | }, 181 | "language_info": { 182 | "codemirror_mode": { 183 | "name": "ipython", 184 | "version": 3 185 | }, 186 | "file_extension": ".py", 187 | "mimetype": "text/x-python", 188 | "name": "python", 189 | "nbconvert_exporter": "python", 190 | "pygments_lexer": "ipython3", 191 | "version": "3.8.5" 192 | }, 193 | "orig_nbformat": 4, 194 | "vscode": { 195 | "interpreter": { 196 | "hash": "74419d3d9274bcbfe6ecb9acd0596b867bc1ac63effdfbb8a6e0b958ebbd5c34" 197 | } 198 | } 199 | }, 200 | "nbformat": 4, 201 | "nbformat_minor": 2 202 | } 203 | -------------------------------------------------------------------------------- /notebooks/train-model-debugging.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import argparse\n", 10 | "import os\n", 11 | "import shutil\n", 12 | "\n", 13 | "import matplotlib.pyplot as plt\n", 14 | "import mlflow\n", 15 | "import mlflow.sklearn\n", 16 | "import numpy as np\n", 17 | "import pandas as pd\n", 18 | "from pathlib import Path\n", 19 | "from sklearn.linear_model import LinearRegression\n", 20 | "from sklearn.metrics import mean_squared_error, r2_score\n", 21 | "from sklearn.model_selection import train_test_split" 22 | ] 23 | }, 24 | { 25 | "cell_type": "code", 26 | "execution_count": null, 27 | "metadata": {}, 28 | 
"outputs": [], 29 | "source": [ 30 | "input_data = \"../data/raw/nyc_taxi_dataset.csv\"\n", 31 | "output_dir = \"outputs\"\n", 32 | "\n", 33 | "lines = [f\"学習データのパス: {input_data}\", f\"出力フォルダのパス: {output_dir}\"]\n", 34 | "\n", 35 | "for line in lines:\n", 36 | " print(line)" 37 | ] 38 | }, 39 | { 40 | "cell_type": "code", 41 | "execution_count": null, 42 | "metadata": {}, 43 | "outputs": [], 44 | "source": [ 45 | "# 自動ロギングの有効化\n", 46 | "mlflow.autolog(log_models=False)" 47 | ] 48 | }, 49 | { 50 | "cell_type": "code", 51 | "execution_count": null, 52 | "metadata": {}, 53 | "outputs": [], 54 | "source": [ 55 | "# 学習データの読み込み\n", 56 | "df = pd.read_csv(input_data)" 57 | ] 58 | }, 59 | { 60 | "cell_type": "code", 61 | "execution_count": null, 62 | "metadata": {}, 63 | "outputs": [], 64 | "source": [ 65 | "df.head()" 66 | ] 67 | }, 68 | { 69 | "cell_type": "code", 70 | "execution_count": null, 71 | "metadata": {}, 72 | "outputs": [], 73 | "source": [ 74 | "# X, y の作成\n", 75 | "X = df.drop(columns=\"totalAmount\")\n", 76 | "y = df[\"totalAmount\"]\n", 77 | "\n", 78 | "# 学習データ、テストデータの分割\n", 79 | "X_train, X_test, y_train, y_test = train_test_split(\n", 80 | " X, y, test_size=0.30, random_state=0\n", 81 | ")" 82 | ] 83 | }, 84 | { 85 | "cell_type": "code", 86 | "execution_count": null, 87 | "metadata": {}, 88 | "outputs": [], 89 | "source": [ 90 | "# データのサンプル数のロギング\n", 91 | "mlflow.log_metric(\"Train samples\", len(X_train))\n", 92 | "\n", 93 | "# モデル学習\n", 94 | "model = LinearRegression().fit(X_train, y_train)" 95 | ] 96 | }, 97 | { 98 | "cell_type": "code", 99 | "execution_count": null, 100 | "metadata": {}, 101 | "outputs": [], 102 | "source": [ 103 | "# データのサンプル数のロギング\n", 104 | "mlflow.log_metric(\"Test samples\", len(X_test))\n", 105 | "\n", 106 | "# モデル評価\n", 107 | "y_pred = model.predict(X_test)\n", 108 | "mse = mean_squared_error(y_test, y_pred)\n", 109 | "rmse = np.sqrt(mse)\n", 110 | "r2 = r2_score(y_test, y_pred)\n", 111 | "\n", 112 | "# 精度メトリックのロギング\n", 113 | "mlflow.log_metric(\"mse\", mse)\n", 114 | "mlflow.log_metric(\"rmse\", rmse)\n", 115 | "mlflow.log_metric(\"r2\", r2)" 116 | ] 117 | }, 118 | { 119 | "cell_type": "code", 120 | "execution_count": null, 121 | "metadata": {}, 122 | "outputs": [], 123 | "source": [ 124 | "# outputs フォルダの作成\n", 125 | "os.makedirs(\"./outputs\", exist_ok=True)\n", 126 | "\n", 127 | "# 実測値と予測値のプロット\n", 128 | "plt.figure(figsize=(10, 7))\n", 129 | "plt.scatter(y_test, y_pred)\n", 130 | "plt.plot(y_test, y_test, color=\"r\")\n", 131 | "plt.title(\"Actual VS Predicted Values (Test set)\")\n", 132 | "plt.xlabel(\"Actual Values\")\n", 133 | "plt.ylabel(\"Predicted Values\")\n", 134 | "plt.savefig(\"./outputs/actuals_vs_predictions.png\")\n", 135 | "\n", 136 | "# プロット画像のロギング\n", 137 | "mlflow.log_artifact(\"./outputs/actuals_vs_predictions.png\")" 138 | ] 139 | }, 140 | { 141 | "cell_type": "code", 142 | "execution_count": null, 143 | "metadata": {}, 144 | "outputs": [], 145 | "source": [ 146 | "# モデルの保存\n", 147 | "model_path = os.path.join(output_dir, \"models\")\n", 148 | "\n", 149 | "if Path(model_path).exists():\n", 150 | " shutil.rmtree(model_path)\n", 151 | "else:\n", 152 | " os.makedirs(model_path, exist_ok=True)\n", 153 | "\n", 154 | "mlflow.sklearn.save_model(model, model_path)" 155 | ] 156 | }, 157 | { 158 | "cell_type": "code", 159 | "execution_count": null, 160 | "metadata": {}, 161 | "outputs": [], 162 | "source": [ 163 | "# Responsible AI Toolbox ライブラリのインポート\n", 164 | "from raiwidgets import ResponsibleAIDashboard\n", 165 | "from responsibleai 
import RAIInsights" 166 | ] 167 | }, 168 | { 169 | "cell_type": "code", 170 | "execution_count": null, 171 | "metadata": {}, 172 | "outputs": [], 173 | "source": [ 174 | "# Prepare the data\n", 175 | "train_data = X_train.copy()\n", 176 | "train_data[\"totalAmount\"] = y_train\n", 177 | "\n", 178 | "test_data = X_test.copy()\n", 179 | "test_data[\"totalAmount\"] = y_test\n", 180 | "\n", 181 | "target_feature = \"totalAmount\"" 182 | ] 183 | }, 184 | { 185 | "cell_type": "code", 186 | "execution_count": null, 187 | "metadata": {}, 188 | "outputs": [], 189 | "source": [ 190 | "# Configure RAI insights\n", 191 | "rai_insights = RAIInsights(\n", 192 | " model, train_data, test_data, target_feature, \"regression\", categorical_features=[]\n", 193 | ")" 194 | ] 195 | }, 196 | { 197 | "cell_type": "code", 198 | "execution_count": null, 199 | "metadata": {}, 200 | "outputs": [], 201 | "source": [ 202 | "# Model explainability\n", 203 | "rai_insights.explainer.add()\n", 204 | "# Error analysis\n", 205 | "rai_insights.error_analysis.add()\n", 206 | "# Generate counterfactual examples\n", 207 | "rai_insights.counterfactual.add(total_CFs=20, desired_range=[50, 250])" 208 | ] 209 | }, 210 | { 211 | "cell_type": "code", 212 | "execution_count": null, 213 | "metadata": {}, 214 | "outputs": [], 215 | "source": [ 216 | "# Run the computations\n", 217 | "rai_insights.compute()" 218 | ] 219 | }, 220 | { 221 | "cell_type": "code", 222 | "execution_count": null, 223 | "metadata": {}, 224 | "outputs": [], 225 | "source": [ 226 | "# Generate the dashboard\n", 227 | "ResponsibleAIDashboard(rai_insights)" 228 | ] 229 | } 230 | ], 231 | "metadata": { 232 | "kernelspec": { 233 | "display_name": "Python 3.8.10 64-bit", 234 | "language": "python", 235 | "name": "python3" 236 | }, 237 | "language_info": { 238 | "codemirror_mode": { 239 | "name": "ipython", 240 | "version": 3 241 | }, 242 | "file_extension": ".py", 243 | "mimetype": "text/x-python", 244 | "name": "python", 245 | "nbconvert_exporter": "python", 246 | "pygments_lexer": "ipython3", 247 | "version": "3.8.10" 248 | }, 249 | "orig_nbformat": 4, 250 | "vscode": { 251 | "interpreter": { 252 | "hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6" 253 | } 254 | } 255 | }, 256 | "nbformat": 4, 257 | "nbformat_minor": 2 258 | } 259 | -------------------------------------------------------------------------------- /docs/coding-guidelines.md: -------------------------------------------------------------------------------- 1 | # Coding guidelines 2 | 3 | This document gives an overview of the tools this repository uses to improve code quality and describes how to introduce them into a machine learning system. 4 | 5 | 6 | ## Concepts 7 | 8 | In collaborative development with multiple engineers, keeping the project or repository consistent is important for reducing differences in interpretation, improving readability, and reducing handover effort. 9 | One way to achieve this is to use linters and text analysis/formatting tools. 10 | 11 | This repository recommends the following tools. 12 | 13 | - [Linter](#linter) 14 | - [Flake8](#flake8) 15 | - [Formatter](#formatter) 16 | - [black](#black) 17 | - [Type hints](#type-hints) 18 | - [mypy](#mypy) 19 | - [Git hook](#git-hook) 20 | - [pre-commit](#pre-commit) 21 | 22 | ## Install 23 | When using this template, first set up the pre-commit, conda, and Azure CLI v2 environments. 24 | The Flake8, black, and isort settings are already described in `/.pre-commit-config.yaml`, so apply them as follows. 25 | 26 | ※ When using VSCode, configure black, flake8, and isort in `.vscode/settings.json`. 27 | For details, see the VSCode documentation on [Editing](https://code.visualstudio.com/docs/python/editing) and [Linting](https://code.visualstudio.com/docs/python/linting). 28 | 29 | 30 | Next, apply the pre-commit settings and set up the conda/Azure CLI environment. 31 | 32 | **When using the devcontainer**
33 | The installation and configuration of pre-commit are applied automatically. 34 | - [.devcontainer/Dockerfile](.devcontainer/Dockerfile) : Dockerfile that builds the devcontainer 35 | - [.pre-commit-config.yaml](.pre-commit-config.yaml) : pre-commit configuration 36 | 37 | **When not using the devcontainer**
38 | Run the shell script [scripts/setup.sh](scripts/setup.sh). 39 | 40 | ```sh 41 | chmod +x ./scripts/setup.sh # if necessary 42 | bash ./scripts/setup.sh 43 | ``` 44 | 45 | After that, verify that pre-commit runs when you git commit. 46 | 47 | ## CI/CD pipeline (GitHub Actions) 48 | 49 | The code is checked on GitHub Actions as soon as it is pushed to GitHub. This catches anything that slipped through on the development machine. 50 | 51 | **References** 52 | - [Black with GitHub Actions integration](https://black.readthedocs.io/en/stable/integrations/github_actions.html) : sample GitHub Actions implementation for Black 53 | - [pre-commit action](https://github.com/pre-commit/action) : sample GitHub Actions implementation for pre-commit 54 | 55 | 56 | ## Brief description of each tool 57 | ### Linter 58 | 59 | A linter checks source code more strictly than a compiler or interpreter, detecting and warning about constructs that can cause bugs rather than just syntax errors, for example unused or uninitialized variables. An example of what Flake8 reports is shown right after the setup details below. 60 | 61 | #### ◼︎ Flake8 62 | [Flake8](https://flake8.pycqa.org/en/latest/#) is a static analysis tool for Python code. It is a wrapper around the following three tools and runs all of them by invoking a single script. 63 | 64 | - PyFlakes: checks the code for logical errors. 65 | - pep8: checks that the code conforms to the coding conventions ([PEP8](https://pep8.readthedocs.io/en/latest/)) 66 | - Ned Batchelder’s McCabe script: checks cyclomatic complexity. 67 | 68 |
69 | **Setup details** 70 |
71 | 72 | 1. Install flake8 73 | ```sh 74 | pip install flake8 75 | ``` 76 | 2. Run a check with flake8 77 | ```sh 78 | flake8 <directory or Python file> # run against the target you want to check 79 | ``` 80 | 3. Show the offending source lines (the show-source option) 81 | ```sh 82 | flake8 --show-source <directory or Python file> # run against the file you want to check 83 | ``` 84 | 85 |
86 | 87 |
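For instance, Flake8 reports a file, position, and rule code for each finding. Here is a small hypothetical `sample.py` (not part of this repository) with two typical violations:

```python
# sample.py -- hypothetical file with typical Flake8 findings
import os  # F401: imported but unused


def add(a, b):
    result=a + b  # E225: missing whitespace around operator
    return result
```

Running `flake8 sample.py` on a file like this would report something along the lines of `sample.py:2:1: F401 'os' imported but unused` and `sample.py:6:11: E225 missing whitespace around operator` (the exact columns and wording depend on the Flake8 version).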
88 | 89 | ### Formatter 90 | 91 | A formatter checks the style of source code (number of spaces, line-break positions, comment style, and so on) and automatically fixes and formats it. A small before/after example follows the setup details below. 92 | 93 | #### ◼︎ black 94 | [black](https://black.readthedocs.io/en/stable/index.html) is a formatter that pursues consistency, generality, readability, and smaller git diffs. Black's code style is described in [this](https://black.readthedocs.io/en/stable/the_black_code_style/current_style.html) document. 95 | 96 |
97 | **Setup details** 98 |
99 | 100 | 1. Install black 101 | 102 | ```sh 103 | # standard 104 | pip install black 105 | 106 | # when targeting Jupyter notebooks as well 107 | pip install black[jupyter] 108 | ``` 109 | 110 | 2. Format with black 111 | 112 | ```sh 113 | black <directory or Python file> # run against the target you want to format 114 | ``` 115 | ※ Setting up a git hook (git hooks are explained further down this page) 116 | To run black automatically before each git commit, add the following to the `.git/hooks/pre-commit` file in the project directory managed by Git. 117 | 118 | ```sh:pre-commit 119 | #!/bin/bash 120 | black . 121 | ``` 122 | 123 | Grant execute permission to the file. 124 | 125 | ```sh 126 | chmod +x .git/hooks/pre-commit 127 | ``` 128 | 129 | 130 | ※ How to show a badge in README.md indicating that black is used 131 | 132 | [![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black) 133 | 134 | ▼ Add this: 135 | ```md 136 | [![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black) 137 | ``` 138 |
139 |
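As a quick illustration of what Black changes, here is a hypothetical, deliberately unformatted function (not from this repository):

```python
# before_black.py -- deliberately unformatted, hypothetical example
def greet(name,excited =False):
    message={'plain':"Hello, "+name,'excited':"Hello, "+name+"!!!"}
    return message[ 'excited' if excited else 'plain' ]
```

Running `black` on it would produce roughly the following: normalized spacing, double quotes, and no cosmetic choices left to the author (the exact layout can vary with the Black version and line-length setting):

```python
# after running `black before_black.py`
def greet(name, excited=False):
    message = {"plain": "Hello, " + name, "excited": "Hello, " + name + "!!!"}
    return message["excited" if excited else "plain"]
```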
140 | 141 | ### Type hints 142 | 143 | Python optionally supports type hints. 144 | 145 | #### ◼︎ mypy 146 | 147 | [mypy](https://mypy.readthedocs.io/en/stable/index.html#) is a static checker for type hints. Because Python does not enforce types on functions or variables, implementations need to pay attention to types. mypy detects bugs in the code based on the type annotations. An example follows the setup details below. 148 | 149 |
150 | **Setup details** 151 |
152 | 153 | 1. Install mypy 154 | ```sh 155 | pip install mypy 156 | ``` 157 | 158 | 2. Configuration 159 | To suppress errors for packages that do not ship stub files with type information, add ignore_missing_imports = True to _mypy.ini_ as follows. 160 | ``` 161 | [mypy-numpy] 162 | ignore_missing_imports = True 163 | 164 | [mypy-pandas.*] 165 | ignore_missing_imports = True 166 | 167 | [mypy-sklearn.*] 168 | ignore_missing_imports = True 169 | 170 | [mypy-matplotlib.*] 171 | ignore_missing_imports = True 172 | 173 | [mypy-mlflow.*] 174 | ignore_missing_imports = True 175 | 176 | [mypy-azureml.*] 177 | ignore_missing_imports = True 178 | 179 | [mypy-dateutil.*] 180 | ignore_missing_imports = True 181 | ``` 182 | 183 | 3. Run a type check with mypy 184 | ```bash 185 | $ mypy train.py 186 | Success: no issues found in 1 source file 187 | ``` 188 | 189 | 190 |
191 |
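To see mypy catch an actual mistake, consider this hypothetical snippet (not from this repository):

```python
# typed_example.py -- hypothetical snippet for a mypy check
from typing import List

fares: List[float] = [10.0, 12.5]

# Accepted at runtime (a plain list holds anything), but the annotation
# above tells mypy that only floats belong in this list.
fares.append("2.5")
```

`mypy typed_example.py` reports something like `error: Argument 1 to "append" of "list" has incompatible type "str"; expected "float"` (the exact message depends on the mypy version), even though the script itself runs without raising.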
192 | 193 | ### Git hook 194 | #### ◼︎ pre-commit 195 | `pre-commit` is a Python wrapper around Git hooks. 196 | 197 |
198 | **Setup details** 199 |
200 | 201 | 1. Install pre-commit 202 | 203 | ```bash 204 | $ pip install pre-commit 205 | ``` 206 | 207 | 2. Generate a sample configuration file 208 | 209 | ```bash 210 | $ pre-commit sample-config > .pre-commit-config.yaml 211 | ``` 212 | 213 | 3. Install it into the git hooks 214 | 215 | ```bash 216 | $ pre-commit install 217 | ``` 218 | 219 | 4. Configuration (.pre-commit-config.yaml) 220 | 221 | ```yml 222 | repos: 223 | # generated by the sample config (pre-commit sample-config > .pre-commit-config.yaml) 224 | - repo: https://github.com/pre-commit/pre-commit-hooks 225 | rev: v4.3.0 226 | hooks: 227 | - id: trailing-whitespace 228 | - id: no-commit-to-branch 229 | args: [--branch, main] 230 | - id: end-of-file-fixer 231 | - id: check-yaml 232 | - id: check-added-large-files 233 | ``` 234 | 235 | 5. Run pre-commit 236 | 237 | ```bash 238 | $ git commit -m "pre-commit demo" 239 | [WARNING] Unstaged files detected. 240 | [INFO] Stashing unstaged files to /home/vscode/.cache/pre-commit/patch1666333249-14074. 241 | trim trailing whitespace.................................................Passed 242 | don't commit to branch...................................................Passed 243 | fix end of files.........................................................Passed 244 | check yaml...............................................................Passed 245 | check for added large files..............................................Passed 246 | [INFO] Restored changes from /home/vscode/.cache/pre-commit/patch1666333249-14074. 247 | [coding-guideline-v1 c101751] pre-commit demo 248 | 2 files changed, 19 insertions(+), 20 deletions(-) 249 | ``` 250 | #### References 251 | 252 | - [Git hooks](https://git-scm.com/book/en/v2/Customizing-Git-Git-Hooks) 253 | - [pre-commit](https://pre-commit.com/) 254 | 255 |
256 | 257 | 258 | ## Inspiration 259 | The following resources are the main references that provided a great deal of the inspiration for this document. 260 | - [Code with Engineering](https://microsoft.github.io/code-with-engineering-playbook/) 261 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 |
2 |

3 | 4 | 5 | # MLOps with Azure Machine Learning

7 | Sample code for implementing MLOps with Azure Machine Learning + GitHub 8 | 9 | [![MIT licensed](https://img.shields.io/badge/license-MIT-brightgreen.svg)](LICENSE) 10 | [![](https://img.shields.io/github/contributors-anon/Azure/mlops-starter-sklearn)](https://github.com/Azure/mlops-starter-sklearn/graphs/contributors) 11 | [![Star](https://img.shields.io/github/stars/Azure/mlops-starter-sklearn.svg)](https://github.com/Azure/mlops-starter-sklearn) 12 | [![Open in VSCode](https://img.shields.io/static/v1?logo=visualstudiocode&label=&message=Open%20in%20VSCode&labelColor=2c2c32&color=007acc&logoColor=007acc)](https://open.vscode.dev/Azure/mlops-starter-sklearn) 13 | 14 |
15 | 16 | --- 17 | 18 | ## 👋 Overview 19 | This repository was created so that MLOps sample code can be put to use quickly. It assumes the use of Azure Machine Learning and GitHub Actions. 20 | 21 | 22 | ## 🚀 Usage 23 | - Prepare the Azure Machine Learning and GitHub environments. 24 | - Access one of the following as the client environment. 25 | - An Azure Machine Learning compute instance 26 | - A DevContainer environment 27 | - Installing packages with Conda consumes memory, so a reasonably large machine is required. For Codespaces, choose a machine type of at least 4-core / 8GB RAM / 32GB storage. 28 | - Set the environment variables in the .env file. 29 | - Run the shell scripts in the [./scripts](./scripts) folder. 30 | - Create the GitHub secrets, then enable and run GitHub Actions. 31 | 32 | :point_right: **How to run the code and CI/CD when using an Azure Machine Learning compute instance as the client environment is described in the [Quickstart](./docs/quickstart.md) document.** 33 | 34 | 35 | ## 📝 Technical conditions 36 | - GitHub 37 | - Source code management, CI/CD pipelines 38 | - Data 39 | - [NYC Taxi & Limousine Commission - green taxi trip records](https://learn.microsoft.com/ja-jp/azure/open-datasets/dataset-taxi-green?tabs=azureml-opendatasets) 40 | - Azure Machine Learning 41 | - A machine learning platform shared across teams and organizations 42 | - Compute Instance : CPU type, client machine 43 | - Or a GitHub Codespace or other Dev Container-compatible environment 44 | - Compute Cluster : shared cluster environment 45 | - API : Azure Machine Learning CLI (v2) 46 | - IDE/Editor 47 | - Visual Studio Code 48 | 49 | ## 📁 Contents 50 | ### Assets 51 | **CLI v2 + YAML** 52 | 53 | |Scenario |YAML file|Shell script|Details | 54 | |--------------------|---------|-----------|-----------| 55 | |Create Data asset |[cli/assets/create-data.yml](cli/assets/create-data.yml)|[scripts/assets/create-data.sh](scripts/assets/create-data.sh)|Creates the data asset| 56 | |Create Compute Cluster|[cli/assets/create-compute.yml](cli/assets/create-compute.yml)|[scripts/assets/create-compute.sh](scripts/assets/create-compute.sh)|Creates the compute| 57 | |Create Environment for training|[cli/assets/create-environment.yml](cli/assets/create-environment.yml)|[scripts/assets/create-environment.sh](scripts/assets/create-environment.sh)|Creates the environment| 58 | 59 | ### Prototyping 60 | **Notebook** 61 | 62 | |Scenario |Notebook|Shell script|Details | 63 | |--------------------|---------|-----------|-----------| 64 | |Baseline Notebook |[notebooks/train-experiment.ipynb](notebooks/train-experiment.ipynb)|[scripts/prototyping/run-notebooks.sh](scripts/prototyping/run-notebooks.sh)|Notebook for experimentation| 65 | 66 | 67 | ### Training 68 | **CLI v2 + YAML** 69 | 70 | |Scenario |YAML file|Shell script|Details | 71 | |--------------------|---------|-----------|-----------| 72 | |Job for training model |[cli/jobs/train.yml](cli/jobs/train.yml) |[scripts/jobs/train.sh](scripts/jobs/train.sh)| Runs the Python script as an Azure ML job | 73 | 74 | 75 | **CI/CD Pipeline** 76 | |Scenario |YAML file|Status |Details | 77 | |--------------------|---------|-----------|-----------| 78 | |Smoke Test |[.github/workflows/smoke-testing.yml](.github/workflows/smoke-testing.yml)|[![smoke-testing](https://github.com/Azure/MLInsider-MLOps/actions/workflows/smoke-testing.yml/badge.svg)](https://github.com/Azure/MLInsider-MLOps/actions/workflows/smoke-testing.yml)|Smoke test pipeline| 79 | 80 | 81 | ### Operationalizing 82 | **CLI v2 + YAML** 83 | 84 | |Scenario |YAML file |Shell script|Details | 85 | |----------------------------------|---------|-----------|-----------| 86 | |Create Batch Endpoint (custom) |[cli/endpoints/batch_deployment.yml](cli/endpoints/batch_deployment.yml)|[scripts/endpoints/deploy-batch-endpoint-custom.sh](scripts/endpoints/deploy-batch-endpoint-custom.sh) |Deploys a custom model to a batch endpoint| 87 | |Create Batch Endpoint (mlflow)
|[cli/endpoints/batch_deployment_mlflow.yml](cli/endpoints/batch_deployment_mlflow.yml)|[scripts/endpoints/deploy-batch-endpoint-mlflow.sh](scripts/endpoints/deploy-batch-endpoint-mlflow.sh)|Deploys an MLflow model to a batch endpoint| 88 | |Create Online Endpoint (custom) |[cli/endpoints/online_deployment.yml](cli/endpoints/online_deployment.yml)|[scripts/endpoints/deploy-online-endpoint-custom.sh](scripts/endpoints/deploy-online-endpoint-custom.sh)|Deploys a custom model to an online endpoint| 89 | |Create Online Endpoint (mlflow) |[cli/endpoints/online_deployment_mlflow.yml](cli/endpoints/online_deployment_mlflow.yml)|[scripts/endpoints/deploy-online-endpoint-mlflow.sh](scripts/endpoints/deploy-online-endpoint-mlflow.sh)|Deploys an MLflow model to an online endpoint| 90 | 91 | 92 | ### CI/CD Pipeline 93 | 94 | >TODO 95 | 96 | ## 🗒️ Documentation 97 | - [Quickstart](./docs/quickstart.md) 98 | - [Coding Guideline](./docs/coding-guidelines.md) 99 | 100 | ## 📄 Directory structure 101 | 102 | ``` 103 | . 104 | ├── .devcontainer # Configuration files for DevContainer 105 | ├── .github 106 | │ └── workflows # YAML files for GitHub Actions 107 | ├── .vscode 108 | ├── cli # YAML files for Azure ML CLI v2 109 | │ ├── assets 110 | │ ├── endpoints 111 | │ └── jobs 112 | ├── data # Sample data 113 | │ ├── raw 114 | │ └── samples 115 | ├── docs # Documentation such as the quickstart and coding style guide 116 | ├── environments # Python libraries 117 | ├── notebooks # Jupyter Notebook 118 | ├── pipelines # Azure ML Pipeline CLI v2 119 | │ ├── eval 120 | │ ├── prep 121 | │ ├── score 122 | │ └── train 123 | ├── scripts 124 | │ ├── assets # Shell scripts for creating assets like data, compute, environment 125 | │ ├── endpoints # Shell scripts for scoring model 126 | │ ├── jobs # Shell scripts for model training 127 | │ └── prototyping # Shell scripts for experimentation 128 | ├── src 129 | │ ├── data # Code for data preparation 130 | │ ├── deploy # Code for scoring model 131 | │ ├── features # Code for feature engineering 132 | │ ├── model # Code for model training 133 | │ ├── monitor # Code for monitoring data and model 134 | │ └── rai # Code for responsible ai 135 | ├── tests 136 | │ ├── data_validation # Code for validating data 137 | │ └── unit # Code for unit testing 138 | └── utils # Code for utilities 139 | ``` 140 | 141 | --- 142 | 143 | ## Related repositories and resources 144 | 145 | ### Comparison with major repositories/resources 146 | | Repository/resource | Overview and purpose | Difference from this repository | 147 | | --- | --- | --- | 148 | | [microsoft/MLOps](https://github.com/microsoft/MLOps) | From an overview of MLOps to how to realize MLOps with Microsoft products, it provides sample code per tool (Azure DevOps, GitHub Actions, IaC, etc.) and per scenario. | This repository narrows the tooling down to a single scenario and provides an MLOps template that can be put to use quickly. | 149 | | [Azure/mlops-v2](https://github.com/Azure/mlops-v2) | A broader and more abstract MLOps template. | This repository aims to be a runnable sample that includes more concrete data and code. | 150 | | [Azure/azureml-examples](https://github.com/Azure/azureml-examples) | A collection of AzureML samples with well-maintained test code. | Rather than a set of standalone samples, this repository aims to cover the entire ML lifecycle end to end. | 151 | | [Tutorial: Azure Machine Learning in a day](https://learn.microsoft.com/en-us/azure/machine-learning/tutorial-azure-ml-in-a-day) | A tutorial page for learning AzureML end to end. | This repository also provides tips for designing and operating ML systems that are not covered by the compared resource. | 152 | 153 | ### Others 154 | - https://github.com/dslp/dslp-repo-template 155 | 156 | ## 🛡 Disclaimer 157 | We assume no responsibility whatsoever for the content of external websites linked from this repository. Please use these links at your own risk. We accept no liability for any damage or loss you may incur as a result of, or in connection with, your use of these links. 158 | 159 | ## 🤝 Contributing 160 | We welcome contributions from customers and internal
Microsoft employees. Please see [CONTRIBUTING](./CONTRIBUTING.md). We appreciate all contributions from Microsoft employees and the community that help this repository thrive. 161 | 162 | 163 | 164 | 165 | ## Trademarks 166 | 167 | This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft 168 | trademarks or logos is subject to and must follow 169 | [Microsoft's Trademark & Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks/usage/general). 170 | Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. 171 | Any use of third-party trademarks or logos is subject to those third parties' policies. 172 | --------------------------------------------------------------------------------