├── _config.yml
├── docs
    ├── _config.yml
    ├── data
    │   ├── images
    │   │   ├── labeling.png
    │   │   ├── cloud_data.png
    │   │   ├── output_3_0.png
    │   │   ├── output_3_1.png
    │   │   ├── output_3_2.png
    │   │   ├── output_3_3.png
    │   │   ├── output_3_4.png
    │   │   ├── create_s3_bucket.png
    │   │   ├── label_main_page.png
    │   │   └── labeling_interface.png
    │   └── index.md
    ├── training
    │   ├── images
    │   │   ├── crnn.png
    │   │   ├── cometml_log.png
    │   │   ├── output_48_1.png
    │   │   └── cometml_api_key.png
    │   └── index.md
    ├── problem
    │   ├── images
    │   │   ├── problem.png
    │   │   └── service.png
    │   └── index.md
    ├── deployment
    │   ├── images
    │   │   └── gradio.png
    │   └── index.md
    ├── kubeflow
    │   ├── images
    │   │   ├── kubeflow.png
    │   │   └── pipeline.png
    │   └── index.md
    ├── overview
    │   ├── images
    │   │   └── ml_system.png
    │   └── index.md
    ├── terraform
    │   └── index.md
    ├── pipeline
    │   └── index.md
    └── index.md
├── mlpipeline
    ├── components
    │   ├── preprocess
    │   │   ├── Dockerfile
    │   │   ├── build_image.sh
    │   │   ├── component.yaml
    │   │   └── src
    │   │   │   └── main.py
    │   ├── train
    │   │   ├── build_image.sh
    │   │   ├── component.yaml
    │   │   ├── Dockerfile
    │   │   └── src
    │   │   │   └── main.py
    │   └── deployment
    │   │   ├── build_image.sh
    │   │   ├── component.yaml
    │   │   ├── Dockerfile
    │   │   └── src
    │   │       └── main.py
    ├── pipeline.py
    └── mlpipeline.yaml
└── README.md


/_config.yml:
--------------------------------------------------------------------------------
1 | theme: jekyll-theme-slate


--------------------------------------------------------------------------------
/docs/_config.yml:
--------------------------------------------------------------------------------
1 | theme: jekyll-theme-slate


--------------------------------------------------------------------------------
/mlpipeline/components/preprocess/Dockerfile:
--------------------------------------------------------------------------------
1 | FROM python:3.7
2 | RUN pip install --no-cache-dir requests boto3
3 | COPY ./src /src


--------------------------------------------------------------------------------
/docs/data/images/labeling.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/thanhhau097/mlpipeline/HEAD/docs/data/images/labeling.png


--------------------------------------------------------------------------------
/docs/training/images/crnn.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/thanhhau097/mlpipeline/HEAD/docs/training/images/crnn.png


--------------------------------------------------------------------------------
/docs/data/images/cloud_data.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/thanhhau097/mlpipeline/HEAD/docs/data/images/cloud_data.png


--------------------------------------------------------------------------------
/docs/data/images/output_3_0.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/thanhhau097/mlpipeline/HEAD/docs/data/images/output_3_0.png


--------------------------------------------------------------------------------
/docs/data/images/output_3_1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/thanhhau097/mlpipeline/HEAD/docs/data/images/output_3_1.png


--------------------------------------------------------------------------------
/docs/data/images/output_3_2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/thanhhau097/mlpipeline/HEAD/docs/data/images/output_3_2.png


--------------------------------------------------------------------------------
/docs/data/images/output_3_3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/thanhhau097/mlpipeline/HEAD/docs/data/images/output_3_3.png


--------------------------------------------------------------------------------
/docs/data/images/output_3_4.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/thanhhau097/mlpipeline/HEAD/docs/data/images/output_3_4.png


--------------------------------------------------------------------------------
/docs/problem/images/problem.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/thanhhau097/mlpipeline/HEAD/docs/problem/images/problem.png


--------------------------------------------------------------------------------
/docs/problem/images/service.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/thanhhau097/mlpipeline/HEAD/docs/problem/images/service.png


--------------------------------------------------------------------------------
/docs/deployment/images/gradio.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/thanhhau097/mlpipeline/HEAD/docs/deployment/images/gradio.png


--------------------------------------------------------------------------------
/docs/kubeflow/images/kubeflow.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/thanhhau097/mlpipeline/HEAD/docs/kubeflow/images/kubeflow.png


--------------------------------------------------------------------------------
/docs/kubeflow/images/pipeline.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/thanhhau097/mlpipeline/HEAD/docs/kubeflow/images/pipeline.png


--------------------------------------------------------------------------------
/docs/overview/images/ml_system.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/thanhhau097/mlpipeline/HEAD/docs/overview/images/ml_system.png


--------------------------------------------------------------------------------
/docs/data/images/create_s3_bucket.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/thanhhau097/mlpipeline/HEAD/docs/data/images/create_s3_bucket.png


--------------------------------------------------------------------------------
/docs/data/images/label_main_page.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/thanhhau097/mlpipeline/HEAD/docs/data/images/label_main_page.png


--------------------------------------------------------------------------------
/docs/training/images/cometml_log.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/thanhhau097/mlpipeline/HEAD/docs/training/images/cometml_log.png


--------------------------------------------------------------------------------
/docs/training/images/output_48_1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/thanhhau097/mlpipeline/HEAD/docs/training/images/output_48_1.png


--------------------------------------------------------------------------------
/docs/data/images/labeling_interface.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/thanhhau097/mlpipeline/HEAD/docs/data/images/labeling_interface.png


--------------------------------------------------------------------------------
/docs/training/images/cometml_api_key.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/thanhhau097/mlpipeline/HEAD/docs/training/images/cometml_api_key.png


--------------------------------------------------------------------------------
/mlpipeline/components/train/build_image.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | 
3 | full_image_name=thanhhau097/ocrtrain
4 | 
5 | docker build -t "${full_image_name}" .
6 | docker push "$full_image_name"


--------------------------------------------------------------------------------
/mlpipeline/components/preprocess/build_image.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | 
3 | full_image_name=thanhhau097/ocrpreprocess
4 | 
5 | docker build -t "${full_image_name}" .
6 | docker push "$full_image_name"


--------------------------------------------------------------------------------
/mlpipeline/components/deployment/build_image.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | 
3 | full_image_name=thanhhau097/ocrdeployment:v1.0
4 | 
5 | docker build -t "${full_image_name}" .
6 | docker push "$full_image_name"


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Machine Learning Pipeline System
2 | A guide to building artificial intelligence/machine learning systems for machine learning engineers.
3 | 
4 | Blog: [thanhhau097.github.io/mlpipeline](https://thanhhau097.github.io/mlpipeline)


--------------------------------------------------------------------------------
/docs/terraform/index.md:
--------------------------------------------------------------------------------
 1 | ## Terraform
 2 | 
 3 | <script src="https://utteranc.es/client.js"
 4 |         repo="thanhhau097/mlpipeline"
 5 |         issue-term="pathname"
 6 |         label="Comment"
 7 |         theme="github-light"
 8 |         crossorigin="anonymous"
 9 |         async>
10 | </script>


--------------------------------------------------------------------------------
/docs/pipeline/index.md:
--------------------------------------------------------------------------------
 1 | ## Machine Learning Pipeline
 2 | 
 3 | 
 4 | <script src="https://utteranc.es/client.js"
 5 |         repo="thanhhau097/mlpipeline"
 6 |         issue-term="pathname"
 7 |         label="Comment"
 8 |         theme="github-light"
 9 |         crossorigin="anonymous"
10 |         async>
11 | </script>


--------------------------------------------------------------------------------
/mlpipeline/components/preprocess/component.yaml:
--------------------------------------------------------------------------------
 1 | name: OCR Preprocessing
 2 | description: Preprocess OCR model
 3 | 
 4 | outputs:
 5 |   - {name: output_path, type: String, description: 'Output Path'}
 6 | 
 7 | implementation:
 8 |   container:
 9 |     image: thanhhau097/ocrpreprocess
10 |     command: [
11 |       python, /src/main.py,
12 |     ]
13 | 
14 |     fileOutputs:
15 |       output_path: /output_path.txt # path to s3
16 | 


--------------------------------------------------------------------------------
/mlpipeline/components/train/component.yaml:
--------------------------------------------------------------------------------
 1 | name: OCR Train
 2 | description: Training OCR model
 3 | inputs:
 4 |   - {name: data_path, description: 's3 path to data'}
 5 | outputs:
 6 |   - {name: output_path, type: String, description: 'Output Path'}
 7 | 
 8 | implementation:
 9 |   container:
10 |     image: thanhhau097/ocrtrain
11 |     command: [
12 |       python, /src/main.py,
13 |       --data_path, {inputValue: data_path},
14 |     ]
15 | 
16 |     fileOutputs:
17 |       output_path: /output_path.txt


--------------------------------------------------------------------------------
/mlpipeline/components/deployment/component.yaml:
--------------------------------------------------------------------------------
 1 | name: OCR Deployment
 2 | description: Deploy OCR model
 3 | inputs:
 4 |   - {name: model_path, description: 's3 path to model'}
 5 | outputs:
 6 |   - {name: output_path, type: String, description: 'Output Path'}
 7 | 
 8 | implementation:
 9 |   container:
10 |     image: thanhhau097/ocrdeployment:v1.0
11 |     command: [
12 |       python, /src/main.py,
13 |       --model_path, {inputValue: model_path},
14 |     ]
15 | 
16 |     fileOutputs:
17 |       output_path: /output_path.txt # path to s3
18 | 


--------------------------------------------------------------------------------
/mlpipeline/components/train/Dockerfile:
--------------------------------------------------------------------------------
 1 | FROM ubuntu:20.04
 2 | # Install some common dependencies
 3 | ARG DEBIAN_FRONTEND=noninteractive
 4 | RUN apt-get update -qqy && \
 5 |     apt-get install -y \
 6 |       build-essential \
 7 |       cmake \
 8 |       g++ \
 9 |       git \
10 |       libsm6 \
11 |       libxrender1 \
12 |       locales \
13 |       libssl-dev \
14 |       pkg-config \
15 |       poppler-utils \
16 |       python3-dev \
17 |       python3-pip \
18 |       python3.6 \
19 |       software-properties-common \
20 |       && \
21 |     apt-get clean && \
22 |     apt-get autoremove && \
23 |     rm -rf /var/lib/apt/lists/*
24 | 
25 | # Point python/pip at the system python3/pip3 (Python 3.8 on Ubuntu 20.04)
26 | RUN ln -sf /usr/bin/python3 /usr/bin/python
27 | RUN ln -sf /usr/bin/pip3 /usr/bin/pip
28 | RUN pip3 install -U pip setuptools
29 | 
30 | # locale setting
31 | RUN locale-gen en_US.UTF-8
32 | ENV LC_ALL=en_US.UTF-8 \
33 |     LANG=en_US.UTF-8 \
34 |     LANGUAGE=en_US.UTF-8 \
35 |     PYTHONIOENCODING=utf-8
36 | 
37 | RUN pip install torch numpy comet_ml boto3 requests opencv-python
38 | RUN pip install flask
39 | RUN apt update && apt install ffmpeg libsm6 libxext6  -y
40 | COPY ./src /src


--------------------------------------------------------------------------------
/mlpipeline/components/deployment/Dockerfile:
--------------------------------------------------------------------------------
 1 | FROM ubuntu:20.04
 2 | # Install some common dependencies
 3 | ARG DEBIAN_FRONTEND=noninteractive
 4 | RUN apt-get update -qqy && \
 5 |     apt-get install -y \
 6 |       build-essential \
 7 |       cmake \
 8 |       g++ \
 9 |       git \
10 |       libsm6 \
11 |       libxrender1 \
12 |       locales \
13 |       libssl-dev \
14 |       pkg-config \
15 |       poppler-utils \
16 |       python3-dev \
17 |       python3-pip \
18 |       python3.6 \
19 |       software-properties-common \
20 |       && \
21 |     apt-get clean && \
22 |     apt-get autoremove && \
23 |     rm -rf /var/lib/apt/lists/*
24 | 
25 | # Point python/pip at the system python3/pip3 (Python 3.8 on Ubuntu 20.04)
26 | RUN ln -sf /usr/bin/python3 /usr/bin/python
27 | RUN ln -sf /usr/bin/pip3 /usr/bin/pip
28 | RUN pip3 install -U pip setuptools
29 | 
30 | # locale setting
31 | RUN locale-gen en_US.UTF-8
32 | ENV LC_ALL=en_US.UTF-8 \
33 |     LANG=en_US.UTF-8 \
34 |     LANGUAGE=en_US.UTF-8 \
35 |     PYTHONIOENCODING=utf-8
36 | 
37 | RUN pip3 install torch numpy comet_ml boto3 requests opencv-python
38 | RUN pip3 install flask
39 | RUN apt update && apt install ffmpeg libsm6 libxext6  -y
40 | COPY ./src /src


--------------------------------------------------------------------------------
/mlpipeline/pipeline.py:
--------------------------------------------------------------------------------
 1 | import logging
 2 | import json
 3 | import kfp
 4 | from kfp.aws import use_aws_secret
 5 | import kfp.dsl as dsl
 6 | from kfp import components
 7 | import kfp.compiler  # imported explicitly; used by Compiler().compile below
 8 | 
 9 | logging.basicConfig(level=logging.INFO, format='%(name)s - %(levelname)s - %(message)s')
10 | 
11 | 
12 | def create_preprocess_component():
13 |     component = kfp.components.load_component_from_file('./components/preprocess/component.yaml')
14 |     return component
15 | 
16 | def create_train_component():
17 |     component = kfp.components.load_component_from_file('./components/train/component.yaml')
18 |     return component
19 | 
20 | def mlpipeline():
21 |     preprocess_component = create_preprocess_component()
22 |     preprocess_task = preprocess_component()
23 | 
24 |     train_component = create_train_component()
25 |     train_task = train_component(data_path=preprocess_task.outputs['output_path'])
26 | 
27 |     deployment_op = components.load_component_from_url('https://raw.githubusercontent.com/kubeflow/pipelines/master/components/kubeflow/kfserving/component.yaml')
28 |     container_spec = {"image": "thanhhau097/ocrdeployment", "port":5000, "name": "custom-container", "command": "python3 /src/main.py --model_s3_path {}".format(train_task.outputs['output_path'])}
29 |     container_spec = json.dumps(container_spec)
30 |     deployment_op(
31 |         action='apply',
32 |         model_name='custom-simple',
33 |         custom_model_spec=container_spec, 
34 |         namespace='kubeflow-user',
35 |         watch_timeout="1800"
36 |     )
37 | 
38 | kfp.compiler.Compiler().compile(mlpipeline, 'mlpipeline.yaml')


--------------------------------------------------------------------------------
/docs/problem/index.md:
--------------------------------------------------------------------------------
 1 | # The Problem
 2 | In this blog, we will build a machine learning system for reading images of handwritten text.
 3 | 
 4 | ## Problem analysis
 5 | ![alt text](./images/service.png "The problem")
 6 | - **Problem**: Convert images of handwriting into text, for use on mobile phones and tablets so that users can write instead of typing.
 7 | 
 8 | - **Data**: 
 9 |     - The [CinnamonOCR](https://drive.google.com/drive/folders/1Qa2YA6w6V5MaNV-qxqhsHHoYFRK5JB39) handwriting image dataset
10 |     - Public data and synthetic data
11 | 
12 | - **Evaluation criteria**: 
13 |     - Model accuracy > 90%
14 |     - Prediction time per image < 0.1 s
15 | 
16 | - **Other requirements**:
17 |     - Can be used directly on a phone or through an API
18 | 
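The two evaluation targets listed above can be checked mechanically. The following is a minimal sketch, assuming exact-match accuracy as the metric (this page does not say which accuracy metric is meant) and a hypothetical `predict` callable standing in for the OCR model; `evaluate`, `meets_requirements`, and the toy samples are illustrative names, not part of the repository.

```python
import time
from typing import Callable, Iterable, Tuple


def evaluate(predict: Callable[[object], str],
             samples: Iterable[Tuple[object, str]]) -> Tuple[float, float]:
    """Return (exact-match accuracy, mean seconds per prediction)."""
    correct = 0
    total = 0
    elapsed = 0.0
    for image, label in samples:
        start = time.perf_counter()
        prediction = predict(image)
        elapsed += time.perf_counter() - start
        correct += int(prediction == label)
        total += 1
    if total == 0:
        raise ValueError("samples must not be empty")
    return correct / total, elapsed / total


def meets_requirements(accuracy: float, latency: float,
                       min_accuracy: float = 0.90,
                       max_latency: float = 0.1) -> bool:
    """Targets from this page: accuracy > 90% and < 0.1 s per image."""
    return accuracy > min_accuracy and latency < max_latency


# Toy stand-in for a real OCR model (assumption): it simply echoes its input.
toy_samples = [("abc", "abc"), ("xyz", "xyz"), ("foo", "bar")]
accuracy, latency = evaluate(lambda image: image, toy_samples)
```

With a real model, `predict` would wrap the inference call and `samples` would come from the test split. Latency measured this way depends on the hardware it runs on, so it should be measured on the target device.
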
19 | ## Design 
20 | ![alt text](./images/problem.png "Machine learning system")
21 | 
22 | This is a simple design for the system we need to build. The system receives data, processes it, and exposes an API that users can call. We will store user data in a database so that we can improve the model in the future.
23 | 
24 | Previous post: [Overview of a Machine Learning System](../overview/index.md)
25 | 
26 | Next post: [Storing and labeling data](../data/index.md)
27 | 
28 | [Back to Home](../index.md)
29 | 
30 | 
31 | <script src="https://utteranc.es/client.js"
32 |         repo="thanhhau097/mlpipeline"
33 |         issue-term="pathname"
34 |         label="Comment"
35 |         theme="github-light"
36 |         crossorigin="anonymous"
37 |         async>
38 | </script>


--------------------------------------------------------------------------------
/docs/index.md:
--------------------------------------------------------------------------------
 1 | ## Machine Learning Engineering Tutorial
 2 | A guide to building artificial intelligence/machine learning systems for machine learning engineers.
 3 | 
 4 | ### 1. Introduction
 5 | In this blog, we will build together a software application that uses machine learning models in practice, using open-source tools. It is a detailed guide for those who want to become a Machine Learning Engineer, as well as for software engineers who want to apply machine learning models in their products. 
 6 | 
 7 | ### 2. Table of contents
 8 | 1. [Overview of a Machine Learning System](./overview/index.md)
 9 | 2. [Analyzing the problem requirements](./problem/index.md)
10 | 3. [Storing and labeling data](./data/index.md)
11 | 4. [Training a machine learning model](./training/index.md)
12 | 5. [Deploying and applying a machine learning model](./deployment/index.md)
13 | 6. [Building a model training system with Kubeflow](./kubeflow/index.md)
14 | 7. [Connecting the components of a machine learning system](./pipeline/index.md) (in progress)
15 | 8. [Building infrastructure for machine learning models with Terraform](./terraform/index.md) (in progress)
16 | 
17 | ### 3. Tools
18 | This blog uses technologies and tools applied in real machine learning projects, so readers should have basic knowledge of software development and its tooling.
19 | 
20 | We will use the following tools and libraries:
21 | 1. Basics: Python, Git, PyTorch, AWS, DVC, Docker, Kubernetes
22 | 2. Available open-source libraries: Label Studio, Cometml, Kubeflow, KFServing.
23 | 
24 | Together, we will use the tools listed above to build a complete machine learning system.
25 | 
26 | ### 4. Contributions and feedback
27 | If you have any suggestions or corrections, please open *issues* and *pull requests*. Thank you for your interest!
28 | 
29 | ### About the author
30 | **Author:** Nguyễn Thành Hậu
31 | 
32 | **Email:** thanhhau097@gmail.com
33 | 
34 | <script src="https://utteranc.es/client.js"
35 |         repo="thanhhau097/mlpipeline"
36 |         issue-term="pathname"
37 |         label="Comment"
38 |         theme="github-light"
39 |         crossorigin="anonymous"
40 |         async>
41 | </script>


--------------------------------------------------------------------------------
/mlpipeline/components/preprocess/src/main.py:
--------------------------------------------------------------------------------
  1 | import os
  2 | import json
  3 | from pathlib import Path
  4 | 
  5 | 
  6 | os.environ["AWS_ACCESS_KEY_ID"] = "**************************"
  7 | os.environ["AWS_SECRET_ACCESS_KEY"] = "**************************"
  8 | 
  9 | # download data from label studio
 10 | print("DOWNLOADING DATA")
 11 | import requests
 12 | 
 13 | url = "https://***********.ngrok.io/api/projects/1/export"
 14 | querystring = {"exportType":"JSON"}
 15 | headers = {
 16 |     'authorization': "Token ******************************",
 17 | }
 18 | response = requests.request("GET", url, headers=headers, params=querystring)
 19 |     
 20 | # TODO: split train/val/test/user_data label from this json file
 21 | user_data = []
 22 | train_data = []
 23 | val_data = []
 24 | test_data = []
 25 | for item in response.json():
 26 |     if 's3://ocrpipeline/data/user_data' in item['data']['captioning']:
 27 |         user_data.append(item)
 28 |     elif 's3://ocrpipeline/data/train' in item['data']['captioning']:
 29 |         train_data.append(item)
 30 |     elif 's3://ocrpipeline/data/test' in item['data']['captioning']:
 31 |         test_data.append(item)
 32 |     elif 's3://ocrpipeline/data/validation' in item['data']['captioning']:
 33 |         val_data.append(item)
 34 | 
 35 | def save_labels(folder, data):
 36 |     # exist_ok=True already tolerates an existing folder, so no prior check is needed
 37 |     os.makedirs(folder, exist_ok=True)
 38 |     label_path = os.path.join(folder, 'label_studio_data.json')
 39 |     with open(label_path, 'w') as f:
 40 |         json.dump(data, f)
 41 | 
 42 | save_labels('./data/train/', train_data)
 43 | save_labels('./data/test/', test_data)
 44 | save_labels('./data/validation/', val_data)
 45 | save_labels('./data/user_data/', user_data)
 46 | 
 47 | # preprocess
 48 | print("PREPROCESSING DATA")
 49 | def convert_label_studio_format_to_ocr_format(label_studio_json_path, output_path):
 50 |     with open(label_studio_json_path, 'r') as f:
 51 |         data = json.load(f)
 52 | 
 53 |     ocr_data = {}
 54 | 
 55 |     for item in data:
 56 |         image_name = os.path.basename(item['data']['captioning'])
 57 | 
 58 |         text = ''
 59 |         for value_item in item['annotations'][0]['result']:
 60 |             if value_item['from_name'] == 'caption':
 61 |                 text = value_item['value']['text'][0]
 62 |         ocr_data[image_name] = text
 63 | 
 64 |     with open(output_path, 'w') as f:
 65 |         json.dump(ocr_data, f, indent=4)
 66 | 
 67 |     print('Successfully converted ', label_studio_json_path)
 68 | convert_label_studio_format_to_ocr_format('./data/train/label_studio_data.json', './data/train/labels.json')
 69 | convert_label_studio_format_to_ocr_format('./data/validation/label_studio_data.json', './data/validation/labels.json')
 70 | convert_label_studio_format_to_ocr_format('./data/test/label_studio_data.json', './data/test/labels.json')
 71 | convert_label_studio_format_to_ocr_format('./data/user_data/label_studio_data.json', './data/user_data/labels.json')
 72 | 
 73 | # upload these files to S3
 74 | print("UPLOADING DATA")
 75 | import logging
 76 | import boto3
 77 | from botocore.exceptions import ClientError
 78 | 
 79 | 
 80 | def upload_file_to_s3(file_name, bucket, object_name=None):
 81 |     """Upload a file to an S3 bucket
 82 | 
 83 |     :param file_name: File to upload
 84 |     :param bucket: Bucket to upload to
 85 |     :param object_name: S3 object name. If not specified then file_name is used
 86 |     :return: True if file was uploaded, else False
 87 |     """
 88 | 
 89 |     # If S3 object_name was not specified, use file_name
 90 |     if object_name is None:
 91 |         object_name = file_name
 92 | 
 93 |     # Upload the file
 94 |     s3_client = boto3.client('s3')
 95 |     try:
 96 |         s3_client.upload_file(file_name, bucket, object_name)
 97 |     except ClientError as e:
 98 |         logging.error(e)
 99 |         return False
100 |     return True
101 | 
102 | 
103 | upload_file_to_s3('data/train/labels.json', bucket='ocrpipeline', object_name='data/train/labels.json')
104 | upload_file_to_s3('data/validation/labels.json', bucket='ocrpipeline', object_name='data/validation/labels.json')
105 | upload_file_to_s3('data/test/labels.json', bucket='ocrpipeline', object_name='data/test/labels.json')
106 | upload_file_to_s3('data/user_data/labels.json', bucket='ocrpipeline', object_name='data/user_data/labels.json')
107 | 
108 | data_path = 's3://ocrpipeline/data'
109 | Path('/output_path.txt').write_text(data_path)


--------------------------------------------------------------------------------
/docs/overview/index.md:
--------------------------------------------------------------------------------
 1 | # Overview of a Machine Learning System
 2 | Artificial intelligence has recently been the focus of intense development and adoption across every sector, from agriculture and industry to services, and it has brought many benefits to businesses by helping them optimize processes and cut costs. 
 3 | 
 4 | To bring AI research into practice, engineers need to know how to design and build a machine learning system that meets both economic and technical requirements, so as to deliver the best possible product for businesses and organizations.
 5 | 
 6 | ## The workflow of a machine learning system 
 7 | ![alt text](./images/ml_system.png "Machine learning system")
 8 | 
 9 | ### 1. Defining system requirements 
10 | <!-- Purpose, evaluation method, expected results, cost and time. -->
11 | To build a machine learning system, we first need to analyze the system's requirements.
12 | 
13 | - **Problem**: We need to identify the problem we are solving and its goal, for example reducing labor costs or increasing productivity by automating part of an existing process. 
14 | 
15 | - **Data**: What is the input data for the problem? Which types of data (images, audio, language, databases) can we use as input? How much data do we have, or need, to train the machine learning models? 
16 | 
17 | - **Evaluation method**: We need to define how the system we design will be evaluated, for example by model accuracy or by processing time. The evaluation must be measurable as a concrete number.
18 | 
19 | - **Other requirements**: We also need to consider constraints on development time and on the resources available for developing and deploying the model, for example having to deploy on mobile rather than on a system with a GPU. All of these requirements should be considered before building the system.
20 | 
21 | ### 2. Data
22 | <!-- Data storage, data labeling, data versioning -->
23 | Data is an important and, of course, indispensable component of any such system. In practice, most of the time spent developing a machine learning system goes into collecting, analyzing, labeling, and cleaning data. Besides categorizing and collecting the data we need, we should also pay attention to the following:
24 | - **Data storage**: we need to decide how the data is stored: locally, in the cloud, or directly on devices (on-device). We also need to make sure the data is stored securely.
25 | - **Data labeling**: a good machine learning model requires data that is labeled consistently, with both label quality and labeling time under control. Besides labeling in-house, we can hire a data labeling company or use crowdsourced labeling platforms.
26 | - **Data versioning**: to develop machine learning models reliably, we need to evaluate them on the same version of the data. Moreover, data may be updated continuously over time, so tagging each version of the data is necessary.
27 | 
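To make the idea of a data version concrete, here is a minimal sketch that derives a version identifier from a dataset folder's contents, so that two evaluations can be confirmed to use the same data. It only illustrates the concept and is not how DVC (the tool used later in this blog) computes or stores versions; `dataset_fingerprint` and the file names are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def dataset_fingerprint(root: str) -> str:
    """Hash the relative path and bytes of every file under root, in sorted order."""
    digest = hashlib.sha256()
    root_path = Path(root)
    files = sorted(
        (p for p in root_path.rglob("*") if p.is_file()),
        key=lambda p: p.relative_to(root_path).as_posix(),
    )
    for path in files:
        digest.update(path.relative_to(root_path).as_posix().encode("utf-8"))
        digest.update(b"\0")  # separator between file name and content
        digest.update(path.read_bytes())
        digest.update(b"\0")  # separator between files
    return digest.hexdigest()


# Changing a single label produces a different version identifier.
with tempfile.TemporaryDirectory() as tmp:
    labels = Path(tmp) / "labels.json"
    labels.write_text('{"a.png": "hello"}', encoding="utf-8")
    version_1 = dataset_fingerprint(tmp)
    labels.write_text('{"a.png": "hallo"}', encoding="utf-8")
    version_2 = dataset_fingerprint(tmp)
```

Recording such an identifier next to each experiment's metrics makes it clear which results are comparable.
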
28 | ### 3. Building and optimizing machine learning models
29 | <!-- Modeling, Model training, model monitoring/experiment management, Hyperparameter tuning -->
30 | 
31 | When approaching a machine learning problem, we need to examine and categorize it so that we can choose a suitable model. We need to analyze the components or steps required to solve it, and whether a single algorithm/model is enough or several algorithms must be combined. 
32 | 
33 | - **Model selection**: To start, we can usually build simple algorithms to evaluate results and serve as a `baseline`, then gradually move to more complex methods.
34 | - **Model training**: when training machine learning models, we need to pay attention to the factors that affect model quality and results, such as data quality, the architecture/algorithm, and the hyperparameters we use. We can then analyze and improve the results based on these factors.
35 | - **Experiment management**: To train models effectively, we need to manage and analyze experiments and their results in order to evaluate them and find ways to improve. Libraries such as TensorBoard, Cometml, and Weights & Biases can be used for experiment management.
36 | - **Hyperparameter tuning**: When training models, especially deep learning models, hyperparameters strongly affect model quality. We can search for good hyperparameters with methods such as Bayesian optimization, grid search, and random search.
37 | 
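As an illustration of the simplest of these search strategies, the sketch below implements random search over a small discrete space. It assumes the objective is a validation loss to minimize; `toy_validation_loss` is a hypothetical stand-in for "train a model and return its validation loss", and none of these names come from the repository or from Cometml.

```python
import random
from typing import Callable, Dict, Optional, Sequence, Tuple


def random_search(objective: Callable[[Dict[str, float]], float],
                  space: Dict[str, Sequence[float]],
                  n_trials: int,
                  seed: int = 0) -> Tuple[Optional[Dict[str, float]], float]:
    """Sample n_trials configurations uniformly and keep the lowest-scoring one."""
    rng = random.Random(seed)  # fixed seed so the search is reproducible
    best_config: Optional[Dict[str, float]] = None
    best_score = float("inf")
    for _ in range(n_trials):
        config = {name: rng.choice(list(values)) for name, values in space.items()}
        score = objective(config)
        if score < best_score:
            best_config, best_score = config, score
    return best_config, best_score


def toy_validation_loss(config: Dict[str, float]) -> float:
    # Hypothetical objective: lowest at lr=0.01, dropout=0.3.
    return (config["lr"] - 0.01) ** 2 + (config["dropout"] - 0.3) ** 2


space = {"lr": [0.1, 0.01, 0.001], "dropout": [0.1, 0.3, 0.5]}
best_config, best_score = random_search(toy_validation_loss, space, n_trials=50)
```

Grid search would instead enumerate every combination in `space`, and Bayesian methods would use earlier scores to choose the next configuration to try.
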
38 | ### 4. Triển khai mô hình
39 | Sau khi huấn luyện mô hình, chúng ta cần triển khai mô hình này vào thực tế. Có nhiều yêu cầu cũng như môi trường khác nhau mà chúng ta cần phải triển khai tuỳ thuộc vào bài toán như: triển khai trên mobile thông qua các ứng dụng, triển khai mô hình trên nền tảng đám mây và cung cấp API cho người dùng, triển khai trên các máy của khách hàng bằng cách gửi các docker image, hoặc chúng ta cũng có thể triển khai thông qua các extensions, SDK,...
40 | 
41 | ## Development tools
42 | For each step of the machine learning system workflow shown in the figure at the top of this page, we can choose a suitable component. In this blog, we will use readily available tools, most of them open source, that are widely used in practice.
43 | 1. Data storage: Amazon Simple Storage Service (S3)
44 | 2. Data labeling: Label Studio
45 | 3. Data and model versioning: Data Version Control (DVC)
46 | 4. Model building and training: PyTorch
47 | 5. Experiment management and hyperparameter optimization: Comet ML
48 | 6. ML workflows: Kubeflow
49 | 7. Demo interface: Gradio
50 | 8. Model deployment: Nuclio
51 | 9. Infrastructure provisioning: Terraform
52 | ## References
53 | 1. [Machine Learning Systems Design - Chip Huyen](https://github.com/chiphuyen/machine-learning-systems-design)
54 | 2. [Full Stack Deep Learning](https://fullstackdeeplearning.com/)
55 | 
56 | Next: [Analyzing the problem requirements](../problem/index.md)
57 | 
58 | [Back to Home](../index.md)
59 | 
60 | <script src="https://utteranc.es/client.js"
61 |         repo="thanhhau097/mlpipeline"
62 |         issue-term="pathname"
63 |         label="Comment"
64 |         theme="github-light"
65 |         crossorigin="anonymous"
66 |         async>
67 | </script>


--------------------------------------------------------------------------------
/docs/deployment/index.md:
--------------------------------------------------------------------------------
  1 | # Deploying the machine learning model
  2 | 
  3 | In the previous sections, we trained the model and stored it. After training, we need to deploy the model by integrating it into a web or mobile application, or by returning results through an API. In this section, we use Gradio to build a web application and Flask to build an API.
  4 | 
  5 | ## Web application
  6 | Gradio is a library for building a simple web application to demo a machine learning model with only a few lines of code. Gradio also lets us customize the application's interface according to the model's inputs and outputs. You can learn more about Gradio here: https://gradio.app/
  7 | 
  8 | First, we need to install Gradio:
  9 | ```
 10 | pip install gradio
 11 | ```
 12 | 
 13 | In our problem, the input is an image and the output is a string of characters, so we create the interface as follows:
 14 | 
 15 | ```python
 16 | import gradio
 17 | import torch
 18 | 
 19 | # Load the best checkpoint; `model`, `device` and `predict` come from the training section.
 20 | model.load_state_dict(torch.load('best_model.pth', map_location=device))
 21 | model = model.to(device)
 22 | interface = gradio.Interface(predict, "image", "text")
 23 | interface.launch()
 24 | ```
 25 | 
 26 | Open the web application at [http://127.0.0.1:7860/](http://127.0.0.1:7860/), upload an image and click Submit to get a prediction:
 27 | 
 28 | ![Gradio web application](./images/gradio.png "Web application")
 29 | 
 30 | We have now built a simple web application with Gradio to demo our machine learning model.
 31 | 
 32 | ## Building an API with Flask
 33 | In this section, we use `Flask` to write an API for our machine learning model, so that other developers can use the model by calling this API to get results. 
 34 | 
 35 | Building an API is relatively simple. We proceed as follows:
 36 | 
 37 | ```python
 38 | import os
 39 | 
 40 | import cv2
 41 | from flask import Flask, request
 42 | from werkzeug.utils import secure_filename
 43 | 
 44 | UPLOAD_FOLDER = './data/user_data/'
 45 | ALLOWED_EXTENSIONS = {'png', 'jpg', 'jpeg'}
 46 | 
 47 | app = Flask(__name__)
 48 | app.config['UPLOAD_FOLDER'] = UPLOAD_FOLDER
 49 | os.makedirs(UPLOAD_FOLDER, exist_ok=True)  # make sure the upload folder exists
 50 | 
 51 | 
 52 | @app.route("/predict", methods=['POST'])
 53 | def predict_flask():
 54 |     # The image must be sent as the multipart form field named "file".
 55 |     if 'file' not in request.files:
 56 |         return 'No file part', 400
 57 |     file = request.files['file']
 58 |     # An empty filename means that no file was selected.
 59 |     if file.filename == '':
 60 |         return 'No selected file', 400
 61 |     # Only accept image formats that `cv2.imread` can decode.
 62 |     if file.filename.rsplit('.', 1)[-1].lower() not in ALLOWED_EXTENSIONS:
 63 |         return 'Unsupported file type', 400
 64 |     filename = secure_filename(file.filename)
 65 |     filepath = os.path.join(app.config['UPLOAD_FOLDER'], filename)
 66 |     file.save(filepath)
 67 | 
 68 |     image = cv2.imread(filepath)
 69 |     return predict(image)  # `predict` is the function defined earlier
 70 | 
 71 | 
 72 | app.run(host='0.0.0.0', port=5000)  # start the development server
 73 | ```
 74 | 
 75 | To test the API, run the following command in a terminal:
 76 | ```
 77 | curl -i -X POST "http://0.0.0.0:5000/predict" -F file=@<filepath>
 78 | ```
 79 | 
 80 | where `<filepath>` is the path to the image file.
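
Other developers will often call the API from code rather than from `curl`. The following is a minimal client sketch using only the Python standard library. The helper names `build_multipart` and `request_prediction`, and the default URL, are assumptions made for illustration; the form field name `file` matches the endpoint above.

```python
import urllib.request
import uuid


def build_multipart(field_name, filename, data, content_type="image/png"):
    """Encode one file as a multipart/form-data body, like `curl -F file=@...`."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def request_prediction(image_path, url="http://127.0.0.1:5000/predict"):
    """POST the image to the /predict endpoint and return the response text."""
    with open(image_path, "rb") as f:
        body, content_type = build_multipart("file", "image.png", f.read())
    req = urllib.request.Request(url, data=body, method="POST",
                                 headers={"Content-Type": content_type})
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")
```

With the Flask server running, `request_prediction("path/to/image.png")` should return the same text that the `curl` command prints.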
 81 | 
 82 | 
 83 | One more thing we need to do is store the data that users upload, so that we can later analyze the model's accuracy on real data and improve the model. To save files directly to AWS S3 instead of a local folder, we use the `boto3` library. The function for uploading to `AWS S3` is as follows:
 84 | ```python
 85 | import logging
 86 | import boto3
 87 | from botocore.exceptions import ClientError
 88 | 
 89 | 
 90 | def upload_file_to_s3(file_name, bucket, object_name=None):
 91 |     """Upload a file to an S3 bucket
 92 | 
 93 |     :param file_name: File to upload
 94 |     :param bucket: Bucket to upload to
 95 |     :param object_name: S3 object name. If not specified then file_name is used
 96 |     :return: True if file was uploaded, else False
 97 |     """
 98 | 
 99 |     # If S3 object_name was not specified, use file_name
100 |     if object_name is None:
101 |         object_name = file_name
102 | 
103 |     # Upload the file
104 |     s3_client = boto3.client('s3')
105 |     try:
106 |         response = s3_client.upload_file(file_name, bucket, object_name)
107 |     except ClientError as e:
108 |         logging.error(e)
109 |         return False
110 |     return True
111 | ```
112 | 
113 | To upload the images that users submit through `Gradio` or the API, we modify the `predict` function as follows:
114 | ```python
115 | import itertools
116 | import uuid
117 | 
118 | import cv2
119 | import torch
120 | 
121 | 
122 | def predict(image):
123 |     # Save the image under a unique name and upload it to AWS S3.
124 |     filename = str(uuid.uuid4()) + '.png'
125 |     cv2.imwrite(filename, image)
126 |     upload_file_to_s3(filename, bucket='ocrpipeline',
127 |                       object_name='data/user_data/images/' + filename)
128 | 
129 |     model.eval()
130 |     batch = [{'image': image, 'label': [1]}]  # placeholder label; only the image is used here
131 |     images = collate_wrapper(batch)[0].to(device)
132 | 
133 |     with torch.no_grad():
134 |         outputs = model(images).permute(1, 0, 2)  # (T, B, C) -> (B, T, C)
135 |     # Greedy decoding: best class per time step, then merge consecutive repeats.
136 |     out_best = list(torch.argmax(outputs[0], -1))
137 |     out_best = [k for k, g in itertools.groupby(out_best)]
138 |     return get_label_from_indices(out_best)
139 | ```
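
The decoding step at the end of `predict` can be hard to follow on tensors, so here is a toy version on plain integers. It is only a sketch: it assumes that index 0 (the padding token) is the symbol to drop after merging, and `greedy_decode` and the two-letter vocabulary are made up for illustration.

```python
import itertools

PAD_TOKEN = 0  # assumed here to be the index that is skipped when building the text


def greedy_decode(step_indices, index2char):
    """Merge consecutive repeats, then drop PAD and map indices to characters."""
    merged = [k for k, _ in itertools.groupby(step_indices)]
    return "".join(index2char[i] for i in merged if i != PAD_TOKEN)


index2char = {3: "a", 4: "b"}  # toy vocabulary
print(greedy_decode([3, 3, 0, 3, 4, 4, 0], index2char))  # prints "aab"
```

The padding index between the two runs of `3` is what allows a doubled letter to survive the merge.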
140 | 
141 | This way, every time a user uploads an image to get its text, we also store that image for later analysis and model improvement. We should also save the model's prediction so that we can evaluate the model's accuracy on the real data that users upload. Consider that a small exercise for the reader.
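
One possible starting point for saving predictions is to write each prediction to a small JSON file named after the image, which could then be uploaded with `upload_file_to_s3`. This is a sketch under assumptions: the function name `write_prediction_record` and the record fields are hypothetical, not part of the original project.

```python
import json
import os
import tempfile
from datetime import datetime, timezone


def write_prediction_record(image_key, pred_text, out_dir):
    """Write one prediction as a JSON file and return the path of that file."""
    record = {
        "image_key": image_key,
        "prediction": pred_text,
        "created_at": datetime.now(timezone.utc).isoformat(),
    }
    name = os.path.splitext(os.path.basename(image_key))[0] + ".json"
    path = os.path.join(out_dir, name)
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps Vietnamese characters readable in the file.
        json.dump(record, f, ensure_ascii=False)
    return path


with tempfile.TemporaryDirectory() as tmp_dir:
    record_path = write_prediction_record(
        "data/user_data/images/example.png", "12 Lê Lợi", tmp_dir)
    print(os.path.basename(record_path))  # prints "example.json"
```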
142 | 
143 | After modifying `predict`, we re-create the `Gradio Interface` or the `Flask API` as in the previous steps so that they use the new `predict` function. We can check whether the images have been uploaded to AWS S3 with:
144 | ```
145 | aws s3 ls s3://ocrpipeline/data/user_data/
146 | ```
147 | 
148 | ## Summary 
149 | In this post, we built a web application and an API for our machine learning model, and stored user-uploaded data so that the model can be improved later. In the following posts, we will extend and upgrade the API with Nuclio so that the application stays stable under a large number of `request`s.
150 | 
151 | Previous: [Training the machine learning model](../training/index.md)
152 | 
153 | Next: [Building a model training system with Kubeflow](../kubeflow/index.md)
154 | 
155 | [Back to Home](../index.md)
156 | 
157 | <script src="https://utteranc.es/client.js"
158 |         repo="thanhhau097/mlpipeline"
159 |         issue-term="pathname"
160 |         label="Comment"
161 |         theme="github-light"
162 |         crossorigin="anonymous"
163 |         async>
164 | </script>


--------------------------------------------------------------------------------
/mlpipeline/components/deployment/src/main.py:
--------------------------------------------------------------------------------
  1 | """Deployment component: download the trained OCR model from S3 and serve it with Flask."""
  2 | 
  3 | 
  4 | def deploy_model(model_s3_path: str):
  5 |     CHARACTERS = "aáàạảãăắằẳẵặâấầẩẫậbcdđeéèẹẻẽêếềệểễfghiíìịỉĩjklmnoóòọỏõôốồộổỗơớờợởỡpqrstuúùụủũưứừựửữvxyýỳỷỹỵzwAÁÀẠẢÃĂẮẰẲẴẶÂẤẦẨẪẬBCDĐEÉÈẸẺẼÊẾỀỆỂỄFGHIÍÌỊỈĨJKLMNOÓÒỌỎÕÔỐỒỘỔỖƠỚỜỢỞỠPQRSTUÚÙỤỦŨƯỨỪỰỬỮVXYÝỲỴỶỸZW0123456789 .,-/()'#+:"
  6 |     PAD_token = 0  # Used for padding short sentences
  7 |     SOS_token = 1  # Start-of-sentence token
  8 |     EOS_token = 2  # End-of-sentence token
  9 | 
 10 |     CHAR2INDEX = {"PAD": PAD_token, "SOS": SOS_token, "EOS": EOS_token}
 11 |     INDEX2CHAR = {PAD_token: "PAD", SOS_token: "SOS", EOS_token: "EOS"}
 12 | 
 13 |     for i, c in enumerate(CHARACTERS):
 14 |         CHAR2INDEX[c] = i + 3
 15 |         INDEX2CHAR[i + 3] = c
 16 |     
 17 |     def get_indices_from_label(label):
 18 |         indices = []
 19 |         for char in label:
 20 |             # Raises KeyError for characters that are not in CHARACTERS.
 21 |             indices.append(CHAR2INDEX[char])
 22 | 
 23 |         indices.append(EOS_token)
 24 |         return indices
 25 | 
 26 |     def get_label_from_indices(indices):
 27 |         label = ""
 28 |         for index in indices:
 29 |             if index == EOS_token:
 30 |                 break
 31 |             elif index == PAD_token:
 32 |                 continue
 33 |             else:
 34 |                 label += INDEX2CHAR[index.item()]
 35 | 
 36 |         return label
 37 |     
 38 |     import torch.nn as nn
 39 |     import torch.nn.functional as F
 40 | 
 41 | 
 42 |     class VGG_FeatureExtractor(nn.Module):
 43 |         """ FeatureExtractor of CRNN (https://arxiv.org/pdf/1507.05717.pdf) """
 44 | 
 45 |         def __init__(self, input_channel, output_channel=512):
 46 |             super(VGG_FeatureExtractor, self).__init__()
 47 |             self.output_channel = [int(output_channel / 8), int(output_channel / 4),
 48 |                                    int(output_channel / 2), output_channel]  # [64, 128, 256, 512]
 49 |             self.ConvNet = nn.Sequential(
 50 |                 nn.Conv2d(input_channel, self.output_channel[0], 3, 1, 1), nn.ReLU(True),
 51 |                 nn.MaxPool2d(2, 2),  # 64x16x50
 52 |                 nn.Conv2d(self.output_channel[0], self.output_channel[1], 3, 1, 1), nn.ReLU(True),
 53 |                 nn.MaxPool2d(2, 2),  # 128x8x25
 54 |                 nn.Conv2d(self.output_channel[1], self.output_channel[2], 3, 1, 1), nn.ReLU(True),  # 256x8x25
 55 |                 nn.Conv2d(self.output_channel[2], self.output_channel[2], 3, 1, 1), nn.ReLU(True),
 56 |                 nn.MaxPool2d((2, 1), (2, 1)),  # 256x4x25
 57 |                 nn.Conv2d(self.output_channel[2], self.output_channel[3], 3, 1, 1, bias=False),
 58 |                 nn.BatchNorm2d(self.output_channel[3]), nn.ReLU(True),  # 512x4x25
 59 |                 nn.Conv2d(self.output_channel[3], self.output_channel[3], 3, 1, 1, bias=False),
 60 |                 nn.BatchNorm2d(self.output_channel[3]), nn.ReLU(True),
 61 |                 nn.MaxPool2d((2, 1), (2, 1)),  # 512x2x25
 62 |                 nn.Conv2d(self.output_channel[3], self.output_channel[3], 2, 1, 0), nn.ReLU(True))  # 512x1x24
 63 | 
 64 |         def forward(self, input):
 65 |             return self.ConvNet(input)
 66 |         
 67 |     
 68 |     class BidirectionalGRU(nn.Module):
 69 |         def __init__(self, input_size, hidden_size, output_size):
 70 |             super(BidirectionalGRU, self).__init__()
 71 | 
 72 |             self.rnn = nn.GRU(input_size, hidden_size, bidirectional=True)
 73 |             self.embedding = nn.Linear(hidden_size * 2, output_size)
 74 | 
 75 |         def forward(self, x):
 76 |             recurrent, hidden = self.rnn(x)
 77 |             T, b, h = recurrent.size()
 78 |             t_rec = recurrent.view(T * b, h)
 79 | 
 80 |             output = self.embedding(t_rec)  # [T * b, nOut]
 81 |             output = output.view(T, b, -1)
 82 | 
 83 |             return output, hidden
 84 |     
 85 |     class CTCModel(nn.Module):
 86 |         def __init__(self, inner_dim=512, num_chars=65):
 87 |             super().__init__()
 88 |             self.encoder = VGG_FeatureExtractor(3, inner_dim)
 89 |             self.AdaptiveAvgPool = nn.AdaptiveAvgPool2d((None, 1))
 90 |             self.rnn_encoder = BidirectionalGRU(inner_dim, 256, 256)
 91 |             self.num_chars = num_chars
 92 |             self.decoder = nn.Linear(256, self.num_chars)
 93 | 
 94 |         def forward(self, x, labels=None, max_label_length=None, device=None, training=True):
 95 |             # ---------------- CNN ENCODER --------------
 96 |             x = self.encoder(x)
 97 |             # print('After CNN:', x.size())
 98 | 
 99 |             # ---------------- CNN TO RNN ----------------
100 |             x = x.permute(3, 0, 1, 2)  # from B x C x H x W -> W x B x C x H
101 |             x = self.AdaptiveAvgPool(x)
102 |             size = x.size()
103 |             x = x.reshape(size[0], size[1], size[2] * size[3])
104 | 
105 |             # ----------------- RNN ENCODER ---------------
106 |             encoder_outputs, last_hidden = self.rnn_encoder(x)
107 |             # print('After RNN', x.size())
108 | 
109 |             # --------------- CTC DECODER -------------------
110 |             # batch_size = encoder_outputs.size()[1]
111 |             outputs = self.decoder(encoder_outputs)
112 | 
113 |             return outputs
114 |         
115 |     import itertools
116 |     import logging
117 |     import uuid
118 |     import boto3
119 |     import cv2
120 |     from botocore.exceptions import ClientError
121 | 
122 |     def upload_file_to_s3(file_name, bucket, object_name=None):
123 |         """Upload a file to an S3 bucket
124 | 
125 |         :param file_name: File to upload
126 |         :param bucket: Bucket to upload to
127 |         :param object_name: S3 object name. If not specified then file_name is used
128 |         :return: True if file was uploaded, else False
129 |         """
130 | 
131 |         # If S3 object_name was not specified, use file_name
132 |         if object_name is None:
133 |             object_name = file_name
134 | 
135 |         # Upload the file
136 |         s3_client = boto3.client('s3')
137 |         try:
138 |             response = s3_client.upload_file(file_name, bucket, object_name)
139 |         except ClientError as e:
140 |             logging.error(e)
141 |             return False
142 |         return True
143 | 
144 |     import boto3
145 |     import torch
146 | 
147 |     import os
148 |     os.environ["AWS_ACCESS_KEY_ID"] = "*********************"
149 |     os.environ["AWS_SECRET_ACCESS_KEY"] = "*********************"
150 | 
151 |     s3 = boto3.client('s3')
152 |     bucket_name = model_s3_path.split('s3://')[1].split('/')[0]
153 |     object_name = model_s3_path.split('s3://' + bucket_name)[1][1:]
154 |     model_path = 'best_model.pth'
155 |     s3.download_file(bucket_name, object_name, model_path)
156 | 
157 |     model = CTCModel(inner_dim=128, num_chars=len(CHAR2INDEX))
158 |     device = 'cpu' if not torch.cuda.is_available() else 'cuda'
159 |     model.load_state_dict(torch.load(model_path, map_location=device))
160 |     model = model.to(device)
161 | 
162 |     def predict(image):
163 |         # Save the image under a unique name and upload it to AWS S3.
164 |         filename = str(uuid.uuid4()) + '.png'
165 |         cv2.imwrite(filename, image)
166 |         upload_file_to_s3(filename, bucket='ocrpipeline', object_name='data/user_data/images/' + filename)
167 | 
168 |         model.eval()
169 |         batch = [{'image': image, 'label': [1]}]
170 |         images = collate_wrapper(batch)[0]  # NOTE: `collate_wrapper` is not defined in this file; it comes from the training code
171 | 
172 |         images = images.to(device)
173 | 
174 |         outputs = model(images)
175 |         outputs = outputs.permute(1, 0, 2)
176 |         output = outputs[0]
177 | 
178 |         out_best = list(torch.argmax(output, -1))  # [2:]
179 |         out_best = [k for k, g in itertools.groupby(out_best)]
180 |         pred_text = get_label_from_indices(out_best)
181 | 
182 |         return pred_text
183 |         
184 | 
185 |     # import gradio
186 |     # device = 'cpu' if not torch.cuda.is_available() else 'cuda'
187 |     # #     model.load_state_dict(torch.load('best_model.pth'))
188 |     # model = model.to(device)
189 |     # interface = gradio.Interface(predict, "image", "text")
190 |     
191 |     # interface.launch()
192 |     
193 |     from flask import Flask, request
194 |     from werkzeug.utils import secure_filename
195 | 
196 |     UPLOAD_FOLDER = './data/user_data/'
197 |     ALLOWED_EXTENSIONS = {'png', 'jpg', 'jpeg'}
198 |     os.makedirs(UPLOAD_FOLDER, exist_ok=True)  # make sure the upload folder exists
199 | 
200 |     app = Flask(__name__)
201 |     app.config['UPLOAD_FOLDER'] = UPLOAD_FOLDER
202 | 
203 | 
204 |     @app.route("/predict", methods=['POST'])
205 |     def upload_file():
206 |         """Save the uploaded image and return the text predicted for it."""
207 |         # The image must be sent as the multipart form field named "file".
208 |         if 'file' not in request.files:
209 |             return 'No file part', 400
210 |         file = request.files['file']
211 |         # An empty filename means that no file was selected.
212 |         if file.filename == '':
213 |             return 'No selected file', 400
214 |         # Only accept image formats that `cv2.imread` can decode.
215 |         if file.filename.rsplit('.', 1)[-1].lower() not in ALLOWED_EXTENSIONS:
216 |             return 'Unsupported file type', 400
217 | 
218 |         filename = secure_filename(file.filename)
219 |         filepath = os.path.join(app.config['UPLOAD_FOLDER'], filename)
220 |         file.save(filepath)
221 | 
222 |         image = cv2.imread(filepath)
223 |         if image is None:
224 |             return 'Could not decode image', 400
225 | 
226 |         return predict(image)
227 | 
228 | 
229 |     app.run(debug=False, threaded=False, host='0.0.0.0', port=5000)
230 |     
231 | if __name__ == '__main__':
232 |     import argparse
233 |     parser = argparse.ArgumentParser(description='Deployment')
234 |     parser.add_argument('--model_s3_path', type=str)
235 | 
236 |     args = parser.parse_args()
237 | 
238 |     deploy_model(model_s3_path=args.model_s3_path)
239 | 
240 | 


--------------------------------------------------------------------------------
/mlpipeline/mlpipeline.yaml:
--------------------------------------------------------------------------------
  1 | apiVersion: argoproj.io/v1alpha1
  2 | kind: Workflow
  3 | metadata:
  4 |   generateName: mlpipeline-
  5 |   annotations: {pipelines.kubeflow.org/kfp_sdk_version: 1.7.0, pipelines.kubeflow.org/pipeline_compilation_time: '2021-08-13T17:29:40.859670',
  6 |     pipelines.kubeflow.org/pipeline_spec: '{"name": "Mlpipeline"}'}
  7 |   labels: {pipelines.kubeflow.org/kfp_sdk_version: 1.7.0}
  8 | spec:
  9 |   entrypoint: mlpipeline
 10 |   templates:
 11 |   - name: kubeflow-serve-model-using-kfserving
 12 |     container:
 13 |       args:
 14 |       - -u
 15 |       - kfservingdeployer.py
 16 |       - --action
 17 |       - apply
 18 |       - --model-name
 19 |       - custom-simple
 20 |       - --model-uri
 21 |       - ''
 22 |       - --canary-traffic-percent
 23 |       - '100'
 24 |       - --namespace
 25 |       - kubeflow-user
 26 |       - --framework
 27 |       - ''
 28 |       - --custom-model-spec
 29 |       - '{"image": "thanhhau097/ocrdeployment", "port": 5000, "name": "custom-container",
 30 |         "command": "python3 /src/main.py --model_s3_path {{inputs.parameters.ocr-train-output_path}}"}'
 31 |       - --autoscaling-target
 32 |       - '0'
 33 |       - --service-account
 34 |       - ''
 35 |       - --enable-istio-sidecar
 36 |       - "True"
 37 |       - --output-path
 38 |       - /tmp/outputs/InferenceService_Status/data
 39 |       - --inferenceservice-yaml
 40 |       - '{}'
 41 |       - --watch-timeout
 42 |       - '1800'
 43 |       - --min-replicas
 44 |       - '-1'
 45 |       - --max-replicas
 46 |       - '-1'
 47 |       - --request-timeout
 48 |       - '60'
 49 |       command: [python]
 50 |       image: quay.io/aipipeline/kfserving-component:v0.5.1
 51 |     inputs:
 52 |       parameters:
 53 |       - {name: ocr-train-output_path}
 54 |     outputs:
 55 |       artifacts:
 56 |       - {name: kubeflow-serve-model-using-kfserving-InferenceService-Status, path: /tmp/outputs/InferenceService_Status/data}
 57 |     metadata:
 58 |       labels:
 59 |         pipelines.kubeflow.org/kfp_sdk_version: 1.7.0
 60 |         pipelines.kubeflow.org/pipeline-sdk-type: kfp
 61 |         pipelines.kubeflow.org/enable_caching: "true"
 62 |       annotations: {pipelines.kubeflow.org/component_spec: '{"description": "Serve
 63 |           Models using Kubeflow KFServing", "implementation": {"container": {"args":
 64 |           ["-u", "kfservingdeployer.py", "--action", {"inputValue": "Action"}, "--model-name",
 65 |           {"inputValue": "Model Name"}, "--model-uri", {"inputValue": "Model URI"},
 66 |           "--canary-traffic-percent", {"inputValue": "Canary Traffic Percent"}, "--namespace",
 67 |           {"inputValue": "Namespace"}, "--framework", {"inputValue": "Framework"},
 68 |           "--custom-model-spec", {"inputValue": "Custom Model Spec"}, "--autoscaling-target",
 69 |           {"inputValue": "Autoscaling Target"}, "--service-account", {"inputValue":
 70 |           "Service Account"}, "--enable-istio-sidecar", {"inputValue": "Enable Istio
 71 |           Sidecar"}, "--output-path", {"outputPath": "InferenceService Status"}, "--inferenceservice-yaml",
 72 |           {"inputValue": "InferenceService YAML"}, "--watch-timeout", {"inputValue":
 73 |           "Watch Timeout"}, "--min-replicas", {"inputValue": "Min Replicas"}, "--max-replicas",
 74 |           {"inputValue": "Max Replicas"}, "--request-timeout", {"inputValue": "Request
 75 |           Timeout"}], "command": ["python"], "image": "quay.io/aipipeline/kfserving-component:v0.5.1"}},
 76 |           "inputs": [{"default": "create", "description": "Action to execute on KFServing",
 77 |           "name": "Action", "type": "String"}, {"default": "", "description": "Name
 78 |           to give to the deployed model", "name": "Model Name", "type": "String"},
 79 |           {"default": "", "description": "Path of the S3 or GCS compatible directory
 80 |           containing the model.", "name": "Model URI", "type": "String"}, {"default":
 81 |           "100", "description": "The traffic split percentage between the candidate
 82 |           model and the last ready model", "name": "Canary Traffic Percent", "type":
 83 |           "String"}, {"default": "", "description": "Kubernetes namespace where the
 84 |           KFServing service is deployed.", "name": "Namespace", "type": "String"},
 85 |           {"default": "", "description": "Machine Learning Framework for Model Serving.",
 86 |           "name": "Framework", "type": "String"}, {"default": "{}", "description":
 87 |           "Custom model runtime container spec in JSON", "name": "Custom Model Spec",
 88 |           "type": "String"}, {"default": "0", "description": "Autoscaling Target Number",
 89 |           "name": "Autoscaling Target", "type": "String"}, {"default": "", "description":
 90 |           "ServiceAccount to use to run the InferenceService pod", "name": "Service
 91 |           Account", "type": "String"}, {"default": "True", "description": "Whether
 92 |           to enable istio sidecar injection", "name": "Enable Istio Sidecar", "type":
 93 |           "Bool"}, {"default": "{}", "description": "Raw InferenceService serialized
 94 |           YAML for deployment", "name": "InferenceService YAML", "type": "String"},
 95 |           {"default": "300", "description": "Timeout seconds for watching until InferenceService
 96 |           becomes ready.", "name": "Watch Timeout", "type": "String"}, {"default":
 97 |           "-1", "description": "Minimum number of InferenceService replicas", "name":
 98 |           "Min Replicas", "type": "String"}, {"default": "-1", "description": "Maximum
 99 |           number of InferenceService replicas", "name": "Max Replicas", "type": "String"},
100 |           {"default": "60", "description": "Specifies the number of seconds to wait
101 |           before timing out a request to the component.", "name": "Request Timeout",
102 |           "type": "String"}], "name": "Kubeflow - Serve Model using KFServing", "outputs":
103 |           [{"description": "Status JSON output of InferenceService", "name": "InferenceService
104 |           Status", "type": "String"}]}', pipelines.kubeflow.org/component_ref: '{"digest":
105 |           "84f27d18805744db98e4d08804ea3c0e6ca5daa1a3a90dd057b5323f19d9dd2c", "url":
106 |           "https://raw.githubusercontent.com/kubeflow/pipelines/master/components/kubeflow/kfserving/component.yaml"}',
107 |         pipelines.kubeflow.org/arguments.parameters: '{"Action": "apply", "Autoscaling
108 |           Target": "0", "Canary Traffic Percent": "100", "Custom Model Spec": "{\"image\":
109 |           \"thanhhau097/ocrdeployment\", \"port\": 5000, \"name\": \"custom-container\",
110 |           \"command\": \"python3 /src/main.py --model_s3_path {{inputs.parameters.ocr-train-output_path}}\"}",
111 |           "Enable Istio Sidecar": "True", "Framework": "", "InferenceService YAML":
112 |           "{}", "Max Replicas": "-1", "Min Replicas": "-1", "Model Name": "custom-simple",
113 |           "Model URI": "", "Namespace": "kubeflow-user", "Request Timeout": "60",
114 |           "Service Account": "", "Watch Timeout": "1800"}'}
115 |   - name: mlpipeline
116 |     dag:
117 |       tasks:
118 |       - name: kubeflow-serve-model-using-kfserving
119 |         template: kubeflow-serve-model-using-kfserving
120 |         dependencies: [ocr-train]
121 |         arguments:
122 |           parameters:
123 |           - {name: ocr-train-output_path, value: '{{tasks.ocr-train.outputs.parameters.ocr-train-output_path}}'}
124 |       - {name: ocr-preprocessing, template: ocr-preprocessing}
125 |       - name: ocr-train
126 |         template: ocr-train
127 |         dependencies: [ocr-preprocessing]
128 |         arguments:
129 |           parameters:
130 |           - {name: ocr-preprocessing-output_path, value: '{{tasks.ocr-preprocessing.outputs.parameters.ocr-preprocessing-output_path}}'}
131 |   - name: ocr-preprocessing
132 |     container:
133 |       args: []
134 |       command: [python, /src/main.py]
135 |       image: thanhhau097/ocrpreprocess
136 |     outputs:
137 |       parameters:
138 |       - name: ocr-preprocessing-output_path
139 |         valueFrom: {path: /output_path.txt}
140 |       artifacts:
141 |       - {name: ocr-preprocessing-output_path, path: /output_path.txt}
142 |     metadata:
143 |       labels:
144 |         pipelines.kubeflow.org/kfp_sdk_version: 1.7.0
145 |         pipelines.kubeflow.org/pipeline-sdk-type: kfp
146 |         pipelines.kubeflow.org/enable_caching: "true"
147 |       annotations: {pipelines.kubeflow.org/component_spec: '{"description": "Preprocess
148 |           OCR model", "implementation": {"container": {"command": ["python", "/src/main.py"],
149 |           "fileOutputs": {"output_path": "/output_path.txt"}, "image": "thanhhau097/ocrpreprocess"}},
150 |           "name": "OCR Preprocessing", "outputs": [{"description": "Output Path",
151 |           "name": "output_path", "type": "String"}]}', pipelines.kubeflow.org/component_ref: '{"digest":
152 |           "d5922a4c63432cb79f05fc02cd1b6c0f0c881e0e3883573d5cadef1eb379087c", "url":
153 |           "./components/preprocess/component.yaml"}'}
154 |   - name: ocr-train
155 |     container:
156 |       args: []
157 |       command: [python, /src/main.py, --data_path, '{{inputs.parameters.ocr-preprocessing-output_path}}']
158 |       image: thanhhau097/ocrtrain
159 |     inputs:
160 |       parameters:
161 |       - {name: ocr-preprocessing-output_path}
162 |     outputs:
163 |       parameters:
164 |       - name: ocr-train-output_path
165 |         valueFrom: {path: /output_path.txt}
166 |       artifacts:
167 |       - {name: ocr-train-output_path, path: /output_path.txt}
168 |     metadata:
169 |       labels:
170 |         pipelines.kubeflow.org/kfp_sdk_version: 1.7.0
171 |         pipelines.kubeflow.org/pipeline-sdk-type: kfp
172 |         pipelines.kubeflow.org/enable_caching: "true"
173 |       annotations: {pipelines.kubeflow.org/component_spec: '{"description": "Training
174 |           OCR model", "implementation": {"container": {"command": ["python", "/src/main.py",
175 |           "--data_path", {"inputValue": "data_path"}], "fileOutputs": {"output_path":
176 |           "/output_path.txt"}, "image": "thanhhau097/ocrtrain"}}, "inputs": [{"description":
177 |           "s3 path to data", "name": "data_path"}], "name": "OCR Train", "outputs":
178 |           [{"description": "Output Path", "name": "output_path", "type": "String"}]}',
179 |         pipelines.kubeflow.org/component_ref: '{"digest": "382c44ee886fc59d95f01989ba7765df9ec2321b7c706a90a6f9c76f70cf2ab0",
180 |           "url": "./components/train/component.yaml"}', pipelines.kubeflow.org/arguments.parameters: '{"data_path":
181 |           "{{inputs.parameters.ocr-preprocessing-output_path}}"}'}
182 |   arguments:
183 |     parameters: []
184 |   serviceAccountName: pipeline-runner
185 | 


--------------------------------------------------------------------------------
/docs/data/index.md:
--------------------------------------------------------------------------------
  1 | # Data Pipeline
  2 | In this section, we store and label the data.
  3 | 
  4 | ## 1. Data analysis
  5 | Before solving a problem, we need to analyze the data in order to choose a suitable and efficient labeling method. The problem addressed in this blog is relatively simple in terms of data: the input is an image of a single line of handwritten text, and we need to return the sequence of characters that the machine reads from that image.
  6 | 
  7 | ```python
  8 | import os
  9 | import cv2
 10 | from matplotlib import pyplot as plt
 11 | ```
 12 | 
 13 | ### Data Sample
 14 | 
 15 | 
 16 | ```python
 17 | DATA_SAMPLE_FOLDER = './Challenge 1_ Handwriting OCR for Vietnamese Address/0825_DataSamples 1'
 18 | ```
 19 | 
 20 | 
 21 | ```python
 22 | for image_name in os.listdir(DATA_SAMPLE_FOLDER)[:6]:
 23 |     if '.json' in image_name:
 24 |         continue
 25 |     
 26 |     image_path = os.path.join(DATA_SAMPLE_FOLDER, image_name)
 27 |     image = cv2.imread(image_path)
 28 |     plt.imshow(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))  # OpenCV loads BGR; matplotlib expects RGB
 29 |     plt.show()
 30 | ```
 31 | 
 32 | 
 33 |     
 34 | > ![png](./images/output_3_0.png)
 35 |     
 36 | 
 37 | 
 38 | 
 39 |     
 40 | > ![png](./images/output_3_1.png)
 41 |     
 42 | 
 43 | 
 44 | 
 45 |     
 46 | > ![png](./images/output_3_2.png)
 47 |     
 48 | 
 49 | 
 50 | 
 51 |     
 52 | > ![png](./images/output_3_3.png)
 53 |     
 54 | 
 55 | 
 56 | 
 57 |     
 58 | > ![png](./images/output_3_4.png)
 59 |     
 60 | 
 61 | 
 62 | ### Dataset
 63 | Typically, when approaching a machine learning problem, we analyze the data and split it into a training set, a validation set, and a testing set.
 64 | 
 65 | In the dataset used here, the training and testing sets are already separated, so we only need to split the provided training data into two sets: a training set and a validation set.
 66 | 
 67 | 
 68 | ```python
 69 | TRAIN_FOLDER = './Challenge 1_ Handwriting OCR for Vietnamese Address/0916_Data Samples 2'
 70 | TEST_FOLDER = './Challenge 1_ Handwriting OCR for Vietnamese Address/1015_Private Test'
 71 | ```
 72 | 
 73 | 
 74 | ```python
 75 | train_image_paths = [os.path.join(TRAIN_FOLDER, image_name) 
 76 |                      for image_name in os.listdir(TRAIN_FOLDER) 
 77 |                      if '.json' not in image_name]
 78 | train_image_paths[:5]
 79 | ```
 80 | 
 81 | 
 82 | >     ['./Challenge 1_ Handwriting OCR for Vietnamese Address/0916_Data Samples 2/0768_samples.png',
 83 | >      './Challenge 1_ Handwriting OCR for Vietnamese Address/0916_Data Samples 2/0238_samples.png',
 84 | >      './Challenge 1_ Handwriting OCR for Vietnamese Address/0916_Data Samples 2/0898_samples.png',
 85 | >      './Challenge 1_ Handwriting OCR for Vietnamese Address/0916_Data Samples 2/0907_samples.png',
 86 | >      './Challenge 1_ Handwriting OCR for Vietnamese Address/0916_Data Samples 2/0071_samples.png']
 87 | 
 88 | 
 89 | 
 90 | ```python
 91 | print('Number of training images:', len(train_image_paths))
 92 | ```
 93 | 
 94 | >     Number of training images: 1823
 95 | 
 96 | 
 97 | 
 98 | ```python
 99 | test_image_paths = [os.path.join(TEST_FOLDER, image_name) 
100 |                      for image_name in os.listdir(TEST_FOLDER) 
101 |                      if '.json' not in image_name]
102 | test_image_paths[:5]
103 | ```
104 | 
105 | 
106 | 
107 | 
108 | >     ['./Challenge 1_ Handwriting OCR for Vietnamese Address/1015_Private Test/0232_tests.png',
109 | >      './Challenge 1_ Handwriting OCR for Vietnamese Address/1015_Private Test/0004_tests.png',
110 | >      './Challenge 1_ Handwriting OCR for Vietnamese Address/1015_Private Test/0374_tests.png',
111 | >      './Challenge 1_ Handwriting OCR for Vietnamese Address/1015_Private Test/0142_tests.png',
112 | >      './Challenge 1_ Handwriting OCR for Vietnamese Address/1015_Private Test/0468_tests.png']
113 | 
114 | 
115 | 
116 | 
117 | ```python
118 | print('Number of testing images:', len(test_image_paths))
119 | ```
120 | 
121 | >     Number of testing images: 549
122 | 
123 | 
124 | #### Split datasets
125 | 
126 | 
127 | ```python
128 | NEW_TRAIN_FOLDER = './data/train/images'
129 | NEW_VALIDATION_FOLDER = './data/validation/images'
130 | NEW_TEST_FOLDER = './data/test/images'
131 | 
132 | for folder in [NEW_TRAIN_FOLDER, NEW_VALIDATION_FOLDER, NEW_TEST_FOLDER]:
133 |     if not os.path.exists(folder):
134 |         os.makedirs(folder, exist_ok=True)
135 | ```
136 | 
137 | 
138 | ```python
139 | import random
140 | 
141 | validation_image_paths = random.sample(train_image_paths, k=int(0.2*len(train_image_paths)))  # sample without replacement so validation images are unique
142 | train_image_paths = list(set(train_image_paths).difference(set(validation_image_paths)))
143 | ```
144 | 
145 | 
146 | ```python
147 | print('Number of training images:', len(train_image_paths))
148 | print('Number of validation images:', len(validation_image_paths))
149 | print('Number of testing images:', len(test_image_paths))
150 | ```
151 | 
152 | >     Number of training images: 1459
153 | >     Number of validation images: 364
154 | >     Number of testing images: 549
155 | 
156 | 
157 | 
158 | ```python
159 | import shutil
160 | 
161 | def copy_images_to_folder(image_paths, to_folder):
162 |     for path in image_paths:
163 |         shutil.copy2(path, to_folder)
164 |         
165 | copy_images_to_folder(train_image_paths, NEW_TRAIN_FOLDER)
166 | copy_images_to_folder(validation_image_paths, NEW_VALIDATION_FOLDER)
167 | copy_images_to_folder(test_image_paths, NEW_TEST_FOLDER)
168 | ```
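
The split above can also be wrapped in a small helper that fixes the random seed, so that the same images end up in the validation set on every run. This is only a minimal sketch, not the code used in this project: `split_train_validation`, `validation_ratio` and `seed` are hypothetical names, and the file names below are synthetic stand-ins for the real image paths.

```python
import random


def split_train_validation(paths, validation_ratio=0.2, seed=42):
    """Hypothetical helper: return disjoint (train, validation) lists of paths."""
    # Sort first so the result does not depend on the order returned by os.listdir
    shuffled = sorted(paths)
    random.Random(seed).shuffle(shuffled)
    n_validation = int(validation_ratio * len(shuffled))
    return shuffled[n_validation:], shuffled[:n_validation]


# Synthetic stand-ins for the 1823 training image paths
example_paths = [f"{i:04d}_samples.png" for i in range(1823)]
train_paths, validation_paths = split_train_validation(example_paths)

print('Number of training images:', len(train_paths))
print('Number of validation images:', len(validation_paths))
```

Because the two lists are slices of one shuffled list, no image can appear in both sets, and their sizes always add up to the original number of images.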
169 | 
170 | ## 2. Data Storage
171 | To keep the data safe and secure while still being able to add new data, we will store it on AWS S3. AWS S3 provides a CLI and SDKs that make it easier to interact with our data.
172 | 
173 | ### Create an S3 bucket
174 | To create an S3 bucket, we need an AWS account. Then open the [AWS Console](https://console.aws.amazon.com/) and choose [S3 Service](https://s3.console.aws.amazon.com/s3/home?region=ap-southeast-1#) -> [**Create bucket**](https://s3.console.aws.amazon.com/s3/bucket/create?region=ap-southeast-1)
175 | 
176 | Here, we enter the required information, `Bucket name` and `AWS Region`, then choose `Create bucket` to create the bucket. In this example, we name the bucket `ocrpipeline` and place it in the Singapore region (`ap-southeast-1`).
177 | 
178 | ![alt text](./images/create_s3_bucket.png "Create an S3 bucket")
179 | 
180 | ### AWS Credentials
181 | To interact with AWS S3 through the CLI, we need AWS credentials. We can create them in [Identity and Access Management (IAM)](https://console.aws.amazon.com/iam/home?region=ap-southeast-1#/security_credentials): select *Access keys (access key ID and secret access key)*, then choose *Create New Access Key*. Store both keys in a safe place, because the `Secret Access Key` cannot be viewed again once this window is closed.
182 | 
183 | After creating the *Access Key*, we need to configure it on our machine using the `AWS CLI`. First, install the AWS CLI from the Terminal:
184 | 
185 | ```
186 | pip install awscli
187 | ```
188 | 
189 | Then configure the keys we just created with the command:
190 | ```
191 | aws configure
192 | ```
193 | Enter the required information:
194 | ```
195 | > AWS Access Key ID [****************4ZHS]:
196 | > AWS Secret Access Key [****************DNb9]: 
197 | > Default region name [ap-southeast-1]: 
198 | > Default output format [json]: 
199 | ```
200 | 
201 | We can check that the access keys were added successfully by running the following command in the Terminal:
202 | ```
203 | aws s3 ls
204 | ```
205 | 
206 | The result is the list of buckets we currently have:
207 | ```
208 | 202x-xx-xx xx:xx:xx ocrpipeline
209 | ```
210 | 
211 | ### Upload data to AWS S3
212 | We upload the data to the S3 bucket as follows:
213 | ```
214 | aws s3 sync ./data s3://ocrpipeline/data/
215 | ```
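
The same upload could also be done from Python with the AWS SDK (`boto3`). The sketch below is an illustration under stated assumptions rather than part of the original workflow: `iter_upload_pairs` and `upload_folder` are hypothetical helper names, and unlike `aws s3 sync` the upload function sends every file again instead of only the changed ones. The last lines check the local-path-to-key mapping on a temporary folder, so they run without AWS access.

```python
import os
import tempfile


def iter_upload_pairs(local_dir, key_prefix):
    """Yield (local_path, s3_key) for every file below local_dir."""
    for root, _dirs, files in os.walk(local_dir):
        for name in sorted(files):
            local_path = os.path.join(root, name)
            relative_path = os.path.relpath(local_path, local_dir)
            # S3 keys always use forward slashes, whatever the local OS uses
            yield local_path, key_prefix + relative_path.replace(os.sep, '/')


def upload_folder(local_dir, bucket, key_prefix):
    """Upload every file below local_dir (no change detection, unlike `aws s3 sync`)."""
    import boto3  # imported here so the key mapping can be tried without boto3 installed

    s3_client = boto3.client('s3')
    for local_path, key in iter_upload_pairs(local_dir, key_prefix):
        s3_client.upload_file(local_path, bucket, key)


# Offline check of the key mapping on a throwaway folder
with tempfile.TemporaryDirectory() as tmp_dir:
    os.makedirs(os.path.join(tmp_dir, 'train', 'images'))
    open(os.path.join(tmp_dir, 'train', 'images', '0001_samples.png'), 'wb').close()
    keys = [key for _path, key in iter_upload_pairs(tmp_dir, 'data/')]

print(keys)
```

With credentials configured through `aws configure`, calling `upload_folder('./data', 'ocrpipeline', 'data/')` would target the same bucket and prefix as the command above.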
216 | 
217 | ## 3. Data Labeling
218 | In this blog, we use [Label Studio](https://github.com/heartexlabs/label-studio) for labeling. Label Studio is an open-source labeling tool that supports many data types such as images, audio, video, and time series. We can also integrate APIs for prelabeling, which speeds up the labeling process. We will integrate such an API later in this blog. You can learn more about Label Studio's features here: https://github.com/heartexlabs/label-studio
219 | 
220 | (In addition, if we need to label a large amount of data, we can hire a data labeling service provider such as [Annoit](https://annoit.com) to obtain more data, faster, for our problem.)
221 | 
222 | To install Label Studio, run the following command in the Terminal:
223 | ```
224 | pip install label-studio
225 | ```
226 | 
227 | Then start Label Studio as follows:
228 | ```
229 | label-studio
230 | ```
231 | 
232 | Once Label Studio is running, open its default address: [http://localhost:8080](http://localhost:8080). Next, sign up for an account in order to label data.
233 | 
234 | Next, we create a new project and set up its `labeling interface`. Choose `Create` in the top-right corner of the page and enter the project name and description. Then switch to the `Labeling Setup` tab to configure the `labeling interface`. For this problem, we can choose either the `Optical Character Recognition` template or the `Image Captioning` template. The `Optical Character Recognition` template lets us draw a `bounding box` around each line of text, but since our data is already cropped into single lines, the `Image Captioning` template is sufficient. We can also customize the `labeling interface` to fit our data by opening the `Code/Visual` tab. Here, I add a new attribute that records whether the image quality is good or bad by adding the following to the `Code` section:
235 | ```xml
236 | <View>
237 |   <Image name="image" value="$captioning"/>
238 |   <Header value="Describe the image:"/>
239 |   <TextArea name="caption" toName="image" placeholder="Enter description here..." rows="5" maxSubmissions="1"/>
240 |   <Choices name="chc" toName="image">
241 |       <Choice value="Good"></Choice>
242 |       <Choice value="Bad"></Choice>
243 |     </Choices>
244 | </View>
245 | ```
246 | 
247 | The result looks like this:
248 | ![alt text](./images/labeling_interface.png "Labeling Interface")
249 | 
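
In the configuration above, `value="$captioning"` means that each task is expected to carry its image location under the `captioning` key of its data. If tasks are imported from a JSON file, they would look roughly like the output of the sketch below. This is an illustration only: `build_tasks` is a hypothetical helper, and the image locations are example values.

```python
import json


def build_tasks(image_urls, data_key='captioning'):
    """Wrap each image URL in the {"data": {...}} structure of a labeling task."""
    return [{'data': {data_key: url}} for url in image_urls]


# Example image locations (assumed, for illustration)
tasks = build_tasks([
    's3://ocrpipeline/data/train/images/0768_samples.png',
    's3://ocrpipeline/data/train/images/0238_samples.png',
])
print(json.dumps(tasks, indent=2))
```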
250 | Click `Save` to save the labeling interface. Next, we need to load data into Label Studio for labeling. Besides uploading data directly from our machine, we can load it from the cloud (here we use AWS S3). To do this, choose `Settings` -> `Cloud Storage`. There are two sections:
251 | - `Source Cloud Storage`: used to load data from the cloud into Label Studio
252 | - `Target Cloud Storage`: used to store the labels after labeling
253 | 
254 | Choose `Add Source Storage` to add data from AWS S3 and enter the required information:
255 | - `Storage Title`: Training Data
256 | - `Bucket Name`: ocrpipeline
257 | - `Bucket Prefix`: data/train/images/
258 | - `Access Key ID` and `Secret Access Key`
259 | - Select `Treat every bucket object as a source file` so that the images can be loaded
260 | 
261 | ![alt text](./images/cloud_data.png "Add Cloud Storage")
262 | 
263 | Then choose `Add Storage` and `Sync Storage` to update the data. Wait a moment for the sync to finish, then return to the main screen, where the list of images has been updated as shown below:
264 | 
265 | ![alt text](./images/label_main_page.png "Label Studio")
266 | 
267 | Select an image to start labeling.
268 | 
269 | ![alt text](./images/labeling.png "Data labeling")
270 | 
271 | Choose `Submit` to finish labeling and move on to the next step.
272 | 
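
The training code in this repository reads a `labels.json` file that maps image names to their text labels. A minimal sketch of how a Label Studio JSON export could be converted into that mapping is given below. It assumes an export layout in which each task has a `data` entry and an `annotations[].result[]` list whose items are identified by `from_name`; the field names may differ between Label Studio versions, and the function name and the sample export are hypothetical. Images marked `Bad` are dropped.

```python
import os


def labels_from_export(tasks, image_key='captioning'):
    """Map image file name -> caption, skipping unlabeled or 'Bad' images."""
    labels = {}
    for task in tasks:
        annotations = task.get('annotations') or []
        if not annotations:
            continue  # not labeled yet
        caption, quality = None, None
        for item in annotations[0]['result']:
            if item['from_name'] == 'caption':
                caption = item['value']['text'][0]
            elif item['from_name'] == 'chc':
                quality = item['value']['choices'][0]
        if caption is not None and quality != 'Bad':
            image_name = os.path.basename(task['data'][image_key])
            labels[image_name] = caption
    return labels


# Hand-written sample in the assumed export layout (made-up captions)
sample_export = [
    {'data': {'captioning': 's3://ocrpipeline/data/train/images/0768_samples.png'},
     'annotations': [{'result': [
         {'from_name': 'caption', 'type': 'textarea', 'value': {'text': ['Số 10 Trần Phú']}},
         {'from_name': 'chc', 'type': 'choices', 'value': {'choices': ['Good']}},
     ]}]},
    {'data': {'captioning': 's3://ocrpipeline/data/train/images/0238_samples.png'},
     'annotations': [{'result': [
         {'from_name': 'caption', 'type': 'textarea', 'value': {'text': ['unreadable']}},
         {'from_name': 'chc', 'type': 'choices', 'value': {'choices': ['Bad']}},
     ]}]},
]
print(labels_from_export(sample_export))
```

The names `caption` and `chc` match the `name` attributes of the `TextArea` and `Choices` tags in the labeling interface defined earlier.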
273 | ## Summary
274 | In this part, we analyzed the project's data, uploaded it to the cloud (AWS S3), and labeled it. In the next part, we will train a machine learning model on the labeled data and optimize it.
275 | 
276 | Previous: [Analyzing the problem requirements](../problem/index.md)
277 | 
278 | Next: [Training the machine learning model](../training/index.md)
279 | 
280 | <script src="https://utteranc.es/client.js"
281 |         repo="thanhhau097/mlpipeline"
282 |         issue-term="pathname"
283 |         label="Comment"
284 |         theme="github-light"
285 |         crossorigin="anonymous"
286 |         async>
287 | </script>


--------------------------------------------------------------------------------
/mlpipeline/components/train/src/main.py:
--------------------------------------------------------------------------------
  1 | def train_model(data_path):
  2 |     import os
  3 |     from pathlib import Path
  4 |     import logging
  5 |     import boto3
  6 |     from botocore.exceptions import ClientError
  7 | 
  8 | 
  9 |     def upload_file_to_s3(file_name, bucket, object_name=None):
 10 |         """Upload a file to an S3 bucket
 11 | 
 12 |         :param file_name: File to upload
 13 |         :param bucket: Bucket to upload to
 14 |         :param object_name: S3 object name. If not specified then file_name is used
 15 |         :return: True if file was uploaded, else False
 16 |         """
 17 | 
 18 |         # If S3 object_name was not specified, use file_name
 19 |         if object_name is None:
 20 |             object_name = file_name
 21 | 
 22 |         # Upload the file
 23 |         s3_client = boto3.client('s3')
 24 |         try:
 25 |             response = s3_client.upload_file(file_name, bucket, object_name)
 26 |         except ClientError as e:
 27 |             logging.error(e)
 28 |             return False
 29 |         return True
 30 | 
 31 | 
 32 |     os.environ["AWS_ACCESS_KEY_ID"] = "*********************"
 33 |     os.environ["AWS_SECRET_ACCESS_KEY"] = "*********************"
 34 | 
 35 |     CHARACTERS = "aáàạảãăắằẳẵặâấầẩẫậbcdđeéèẹẻẽêếềệểễfghiíìịỉĩjklmnoóòọỏõôốồộổỗơớờợởỡpqrstuúùụủũưứừựửữvxyýỳỷỹỵzwAÁÀẠẢÃĂẮẰẲẴẶÂẤẦẨẪẬBCDĐEÉÈẸẺẼÊẾỀỆỂỄFGHIÍÌỊỈĨJKLMNOÓÒỌỎÕÔỐỒỘỔỖƠỚỜỢỞỠPQRSTUÚÙỤỦŨƯỨỪỰỬỮVXYÝỲỴỶỸZW0123456789 .,-/()'#+:"
 36 |     PAD_token = 0  # Used for padding short sentences
 37 |     SOS_token = 1  # Start-of-sentence token
 38 |     EOS_token = 2  # End-of-sentence token
 39 | 
 40 |     CHAR2INDEX = {"PAD": PAD_token, "SOS": SOS_token, "EOS": EOS_token}
 41 |     INDEX2CHAR = {PAD_token: "PAD", SOS_token: "SOS", EOS_token: "EOS"}
 42 | 
 43 |     for i, c in enumerate(CHARACTERS):
 44 |         CHAR2INDEX[c] = i + 3
 45 |         INDEX2CHAR[i + 3] = c
 46 | 
 47 |     def get_indices_from_label(label):
 48 |         indices = []
 49 |         for char in label:
 50 |             # Characters that are not in CHARACTERS raise a KeyError here
 51 |             indices.append(CHAR2INDEX[char])
 52 | 
 53 |         indices.append(EOS_token)
 54 |         return indices
 55 | 
 56 |     def get_label_from_indices(indices):
 57 |         label = ""
 58 |         for index in indices:
 59 |             if index == EOS_token:
 60 |                 break
 61 |             elif index == PAD_token:
 62 |                 continue
 63 |             else:
 64 |                 label += INDEX2CHAR[index.item()]
 65 | 
 66 |         return label
 67 | 
 68 |     from comet_ml import Experiment
 69 | 
 70 |     import cv2
 71 |     from torch.utils.data import Dataset
 72 |     
 73 |     # Create an experiment with your api key
 74 |     experiment = Experiment(
 75 |         api_key="***********************",
 76 |         project_name="ocr",
 77 |         workspace="<username>",
 78 |     )
 79 | 
 80 | 
 81 |     class OCRDataset(Dataset):
 82 |         def __init__(self, data_dir, label_path, transform=None):
 83 |             self.data_dir = data_dir
 84 |             self.label_path = label_path
 85 |             self.transform = transform
 86 |             self.image_paths, self.labels = self.get_image_paths_and_labels(label_path)
 87 | 
 88 |         def __len__(self):
 89 |             return len(self.image_paths)
 90 | 
 91 |         def __getitem__(self, idx):
 92 |             img_path = self.get_data_path(self.image_paths[idx])
 93 |             image = cv2.imread(img_path)
 94 |             label = self.labels[idx]
 95 |             label = get_indices_from_label(label)
 96 |             sample = {"image": image, "label": label}
 97 |             return sample
 98 | 
 99 |         def get_data_path(self, path):
100 |             return os.path.join(self.data_dir, path)
101 | 
102 |         def get_image_paths_and_labels(self, json_path):
103 |             with open(json_path, 'r', encoding='utf-8') as f:
104 |                 data = json.load(f)
105 | 
106 |             image_paths = list(data.keys())
107 |             labels = list(data.values())
108 |             return image_paths, labels
109 |         
110 |         
111 |     import itertools
112 | 
113 |     def collate_wrapper(batch):
114 |         """
115 |         Labels are already numbers
116 |         :param batch:
117 |         :return:
118 |         """
119 |         images = []
120 |         labels = []
121 |         # TODO: can change height in config
122 |         height = 64
123 |         max_width = 0
124 |         max_label_length = 0
125 | 
126 |         for sample in batch:
127 |             image = sample['image']
128 |             try:
129 |                 image = process_image(image, height=height, channels=image.shape[2])
130 |             except Exception:  # e.g. cv2.imread returned None for an unreadable file: skip this sample
131 |                 continue
132 | 
133 |             if image.shape[1] > max_width:
134 |                 max_width = image.shape[1]
135 | 
136 |             label = sample['label']
137 | 
138 |             if len(label) > max_label_length:
139 |                 max_label_length = len(label)
140 | 
141 |             images.append(image)
142 |             labels.append(label)
143 | 
144 |         # PAD IMAGES: convert to tensor with size b x c x h x w (from b x h x w x c)
145 |         channels = images[0].shape[2]
146 |         images = process_batch_images(images, height=height, max_width=max_width, channels=channels)
147 |         images = images.transpose((0, 3, 1, 2))
148 |         images = torch.from_numpy(images).float()
149 | 
150 |         # LABELS
151 |         pad_list = zero_padding(labels)
152 |         mask = binary_matrix(pad_list)
153 |         mask = torch.ByteTensor(mask)
154 |         labels = torch.LongTensor(pad_list)
155 |         return images, labels, mask, max_label_length
156 | 
157 | 
158 |     def process_image(image, height=64, channels=3):
159 |         """Converts to self.channels, self.max_height
160 |         # convert channels
161 |         # resize max_height = 64
162 |         """
163 |         shape = image.shape
164 |         # if shape[0] > 64 or shape[0] < 32:  # height
165 |         try:
166 |             image = cv2.resize(image, (int(height/shape[0] * shape[1]), height))
167 |         except Exception:  # resize failed (for example, an empty image)
168 |             return np.zeros([1, 1, channels])
169 |         return image / 255.0
170 | 
171 | 
172 |     def process_batch_images(images, height, max_width, channels=3):
173 |         """
174 |         Convert a list of images to a tensor (with padding)
175 |         :param images: list of numpy array images
176 |         :param height: desired height
177 |         :param max_width: max width of all images
178 |         :param channels: number of image channels
179 |         :return: a tensor representing images
180 |         """
181 |         output = np.ones([len(images), height, max_width, channels])
182 |         for i, image in enumerate(images):
183 |             final_img = image
184 |             shape = image.shape
185 |             output[i, :shape[0], :shape[1], :] = final_img
186 | 
187 |         return output
188 | 
189 | 
190 |     def zero_padding(l, fillvalue=PAD_token):
191 |         """
192 |         Pad value PAD token to l
193 |         :param l: list of sequences need padding
194 |         :param fillvalue: padded value
195 |         :return:
196 |         """
197 |         return list(itertools.zip_longest(*l, fillvalue=fillvalue))
198 | 
199 | 
200 |     def binary_matrix(l, value=PAD_token):
201 |         m = []
202 |         for i, seq in enumerate(l):
203 |             m.append([])
204 |             for token in seq:
205 |                 if token == value:
206 |                     m[i].append(0)
207 |                 else:
208 |                     m[i].append(1)
209 |         return m
210 | 
211 |     import torch.nn as nn
212 |     import torch.nn.functional as F
213 | 
214 | 
215 |     class VGG_FeatureExtractor(nn.Module):
216 |         """ FeatureExtractor of CRNN (https://arxiv.org/pdf/1507.05717.pdf) """
217 | 
218 |         def __init__(self, input_channel, output_channel=512):
219 |             super(VGG_FeatureExtractor, self).__init__()
220 |             self.output_channel = [int(output_channel / 8), int(output_channel / 4),
221 |                                     int(output_channel / 2), output_channel]  # [64, 128, 256, 512]
222 |             self.ConvNet = nn.Sequential(
223 |                 nn.Conv2d(input_channel, self.output_channel[0], 3, 1, 1), nn.ReLU(True),
224 |                 nn.MaxPool2d(2, 2),  # 64x16x50
225 |                 nn.Conv2d(self.output_channel[0], self.output_channel[1], 3, 1, 1), nn.ReLU(True),
226 |                 nn.MaxPool2d(2, 2),  # 128x8x25
227 |                 nn.Conv2d(self.output_channel[1], self.output_channel[2], 3, 1, 1), nn.ReLU(True),  # 256x8x25
228 |                 nn.Conv2d(self.output_channel[2], self.output_channel[2], 3, 1, 1), nn.ReLU(True),
229 |                 nn.MaxPool2d((2, 1), (2, 1)),  # 256x4x25
230 |                 nn.Conv2d(self.output_channel[2], self.output_channel[3], 3, 1, 1, bias=False),
231 |                 nn.BatchNorm2d(self.output_channel[3]), nn.ReLU(True),  # 512x4x25
232 |                 nn.Conv2d(self.output_channel[3], self.output_channel[3], 3, 1, 1, bias=False),
233 |                 nn.BatchNorm2d(self.output_channel[3]), nn.ReLU(True),
234 |                 nn.MaxPool2d((2, 1), (2, 1)),  # 512x2x25
235 |                 nn.Conv2d(self.output_channel[3], self.output_channel[3], 2, 1, 0), nn.ReLU(True))  # 512x1x24
236 | 
237 |         def forward(self, input):
238 |             return self.ConvNet(input)
239 |         
240 | 
241 |     class BidirectionalGRU(nn.Module):
242 |         def __init__(self, input_size, hidden_size, output_size):
243 |             super(BidirectionalGRU, self).__init__()
244 | 
245 |             self.rnn = nn.GRU(input_size, hidden_size, bidirectional=True)
246 |             self.embedding = nn.Linear(hidden_size * 2, output_size)
247 | 
248 |         def forward(self, x):
249 |             recurrent, hidden = self.rnn(x)
250 |             T, b, h = recurrent.size()
251 |             t_rec = recurrent.view(T * b, h)
252 | 
253 |             output = self.embedding(t_rec)  # [T * b, nOut]
254 |             output = output.view(T, b, -1)
255 | 
256 |             return output, hidden
257 | 
258 |     class CTCModel(nn.Module):
259 |         def __init__(self, inner_dim=512, num_chars=65):
260 |             super().__init__()
261 |             self.encoder = VGG_FeatureExtractor(3, inner_dim)
262 |             self.AdaptiveAvgPool = nn.AdaptiveAvgPool2d((None, 1))
263 |             self.rnn_encoder = BidirectionalGRU(inner_dim, 256, 256)
264 |             self.num_chars = num_chars
265 |             self.decoder = nn.Linear(256, self.num_chars)
266 | 
267 |         def forward(self, x, labels=None, max_label_length=None, device=None, training=True):
268 |             # ---------------- CNN ENCODER --------------
269 |             x = self.encoder(x)
270 |             # print('After CNN:', x.size())
271 | 
272 |             # ---------------- CNN TO RNN ----------------
273 |             x = x.permute(3, 0, 1, 2)  # from B x C x H x W -> W x B x C x H
274 |             x = self.AdaptiveAvgPool(x)
275 |             size = x.size()
276 |             x = x.reshape(size[0], size[1], size[2] * size[3])
277 | 
278 |             # ----------------- RNN ENCODER ---------------
279 |             encoder_outputs, last_hidden = self.rnn_encoder(x)
280 |             # print('After RNN', x.size())
281 | 
282 |             # --------------- CTC DECODER -------------------
283 |             # batch_size = encoder_outputs.size()[1]
284 |             outputs = self.decoder(encoder_outputs)
285 | 
286 |             return outputs
287 |         
288 |     import torch
289 |     import torch.nn.functional as F
290 |     from torch.nn import CTCLoss, CrossEntropyLoss
291 | 
292 |     def ctc_loss(outputs, targets, mask):
293 |         USE_CUDA = torch.cuda.is_available()
294 |         device = torch.device("cuda:0" if USE_CUDA else "cpu")
295 |         target_lengths = torch.sum(mask, dim=0).to(device)
296 |         # We need to change targets, PAD_token = 0 = blank
297 |         # EOS token -> PAD_token
298 |         targets[targets == EOS_token] = PAD_token
299 |         outputs = outputs.log_softmax(2)
300 |         input_lengths = outputs.size()[0] * torch.ones(outputs.size()[1], dtype=torch.int)
301 |         loss_fn = CTCLoss(blank=PAD_token, zero_infinity=True)
302 |         targets = targets.transpose(1, 0)
303 |         # target_lengths have EOS token, we need minus one
304 |         target_lengths = target_lengths - 1
305 |         targets = targets[:, :-1]
306 |         # print(input_lengths, target_lengths)
307 |         torch.backends.cudnn.enabled = False
308 |         # TODO: NAN when target_length > input_length, we can increase size or use zero infinity
309 |         loss = loss_fn(outputs, targets, input_lengths, target_lengths)
310 |         torch.backends.cudnn.enabled = True
311 | 
312 |         return loss, loss.item()
313 | 
314 |     import difflib
315 | 
316 |     def calculate_ac(str1, str2):
317 |         """Calculate accuracy by char of 2 string"""
318 | 
319 |         total_letters = len(str1)
320 |         ocr_letters = len(str2)
321 |         if total_letters == 0 and ocr_letters == 0:
322 |             acc_by_char = 1.0
323 |             return acc_by_char
324 |         diff = difflib.SequenceMatcher(None, str1, str2)
325 |         correct_letters = 0
326 |         for block in diff.get_matching_blocks():
327 |             correct_letters = correct_letters + block[2]
328 |         if ocr_letters == 0:
329 |             acc_by_char = 0
330 |         elif correct_letters == 0:
331 |             acc_by_char = 0
332 |         else:
333 |             acc_1 = correct_letters / total_letters
334 |             acc_2 = correct_letters / ocr_letters
335 |             acc_by_char = 2 * (acc_1 * acc_2) / (acc_1 + acc_2)
336 | 
337 |         return float(acc_by_char)
338 | 
339 |     def accuracy_ctc(outputs, targets):
340 |         outputs = outputs.permute(1, 0, 2)
341 |         targets = targets.transpose(1, 0)
342 | 
343 |         total_acc_by_char = 0
344 |         total_acc_by_field = 0
345 | 
346 |         for output, target in zip(outputs, targets):
347 |             out_best = list(torch.argmax(output, -1))  # [2:]
348 |             out_best = [k for k, g in itertools.groupby(out_best)]
349 |             pred_text = get_label_from_indices(out_best)
350 |             target_text = get_label_from_indices(target)
351 | 
352 |             # print('predict:', pred_text, 'target:', target_text)
353 | 
354 |             acc_by_char = calculate_ac(pred_text, target_text)
355 |             total_acc_by_char += acc_by_char
356 | 
357 |             if pred_text == target_text:
358 |                 total_acc_by_field += 1
359 | 
360 |         return np.array([total_acc_by_char / targets.size()[0], total_acc_by_field / targets.size()[0]])
361 | 
362 |     from torch.utils.data import DataLoader
363 | 
364 |     def train_epoch(epoch):
365 |         model.train()
366 |         total_loss = 0
367 |         total_metrics = np.zeros(2)
368 |         for batch_idx, (images, labels, mask, max_label_length) in enumerate(train_dataloader):
369 |             images, labels, mask = images.to(device), labels.to(device), mask.to(device)
370 | 
371 |             optimizer.zero_grad()
372 |             output = model(images, labels, max_label_length, device)
373 | 
374 |             loss, print_loss = loss_fn(output, labels, mask)
375 |             loss.backward()
376 |             optimizer.step()
377 | 
378 |             total_loss += print_loss  # loss.item()
379 |             total_metrics += metric_fn(output, labels)
383 | 
384 |         log = {
385 |             'loss': total_loss / len(train_dataloader),
386 |             'metrics': (total_metrics / len(train_dataloader)).tolist()
387 |         }
388 | 
389 |         return log
390 | 
391 |     def val_epoch():
392 |         # when evaluating, we don't use teacher forcing
393 |         model.eval()
394 |         total_val_loss = 0
395 |         # total_val_metrics = np.zeros(len(self.metrics))
396 |         total_val_metrics = np.zeros(2)
397 |         with torch.no_grad():
398 |             # print("Length of validation:", len(self.valid_data_loader))
399 |             for batch_idx, (images, labels, mask, max_label_length) in enumerate(val_dataloader):
400 |                 images, labels, mask = images.to(device), labels.to(device), mask.to(device)
402 | 
403 |                 output = model(images, labels, max_label_length, device, training=False)
404 |                 _, print_loss = loss_fn(output, labels, mask)  # Attention:
405 |                 # loss = self.loss(output, labels, mask)
406 |                 # print_loss = loss.item()
407 | 
408 |                 total_val_loss += print_loss
409 |                 total_val_metrics += metric_fn(output, labels)
410 | 
411 |         return_value = {
412 |             'loss': total_val_loss / len(val_dataloader),
413 |             'metrics': (total_val_metrics / len(val_dataloader)).tolist()
414 |         }
415 | 
416 |         return return_value
417 | 
418 |     def train():
419 |         best_val_loss = np.inf
420 |         for epoch in range(N_EPOCHS):
421 |             print("Epoch", epoch + 1)
422 |             train_log = train_epoch(epoch)
423 |             print("Training log:", train_log)
424 |             train_loss = train_log['loss']
425 |             train_acc_by_char = train_log['metrics'][0]
426 |             train_acc_by_field = train_log['metrics'][1]
427 |             experiment.log_metrics({
428 |                 "train_loss": train_loss,
429 |                 "train_acc_by_char": train_acc_by_char,
430 |                 "train_acc_by_field": train_acc_by_field
431 |             }, epoch=epoch)
432 | 
433 |             val_log = val_epoch()
434 |             val_loss = val_log['loss']
435 |             val_acc_by_char = val_log['metrics'][0]
436 |             val_acc_by_field = val_log['metrics'][1]
437 |             experiment.log_metrics({
438 |                 "val_loss": val_loss,
439 |                 "val_acc_by_char": val_acc_by_char,
440 |                 "val_acc_by_field": val_acc_by_field
441 |             }, epoch=epoch)
442 |             if val_log['loss'] < best_val_loss:
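443 |                 torch.save(model.state_dict(), 'best_model.pth')  # saving the state_dict is an assumption; it must match what the deployment step loads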
444 |                 best_val_loss = val_log['loss']
445 | 
446 |             print("Validation log:", val_log)
447 | 
448 | 
449 |     def weight_reset(m):
450 |         if isinstance(m, nn.Conv2d) or isinstance(m, nn.Linear):
451 |             m.reset_parameters()
452 | 
453 |     # Download data from s3
454 |     print('DOWNLOADING DATA FROM S3')
455 |     import boto3
456 |     import os 
457 | 
458 |     s3_resource = boto3.resource('s3')
459 |     bucket = s3_resource.Bucket('ocrpipeline') 
460 |     for obj in bucket.objects.filter(Prefix='data'):
461 |         if obj.key.endswith('/'):
462 |             continue  # skip zero-byte "folder" placeholder objects
463 |         os.makedirs(os.path.dirname(obj.key), exist_ok=True)
464 |         bucket.download_file(obj.key, obj.key)  # save to the same relative path
465 |     print('START CREATING DATASET AND MODEL')
466 |     BATCH_SIZE = 32
467 |     NUM_WORKERS = 4
468 | 
469 |     # merge data from train folder and user_data folder 
470 |     import shutil
471 |     import json
472 | 
473 |     ## merge labels
474 |     with open('./data/user_data/labels.json', 'r', encoding='utf-8') as f:
475 |         user_data = json.load(f)
476 | 
477 |     with open('./data/train/labels.json', 'r', encoding='utf-8') as f:
478 |         train_data = json.load(f)
479 | 
480 |     train_data = {**train_data, **user_data}
481 |     with open('./data/train/labels.json', 'w', encoding='utf-8') as f:
482 |         json.dump(train_data, f)
483 | 
484 |     ## merge images
485 |     SOURCE_FOLDER = './data/user_data/images'
486 |     TARGET_FOLDER = './data/train/images'
487 | 
488 |     for image_name in os.listdir(SOURCE_FOLDER):
489 |         shutil.copy2(os.path.join(SOURCE_FOLDER, image_name), TARGET_FOLDER)
490 | 
491 | 
492 |     # CREATE DATASET AND DATALOADER
493 |     train_dataset = OCRDataset(data_dir='./data/train/images/', label_path='./data/train/labels.json')
494 |     val_dataset = OCRDataset(data_dir='./data/validation/images/', label_path='./data/validation/labels.json')
495 | 
496 |     train_dataloader = DataLoader(train_dataset, batch_size=BATCH_SIZE, num_workers=NUM_WORKERS, collate_fn=collate_wrapper)
497 |     val_dataloader = DataLoader(val_dataset, batch_size=BATCH_SIZE, num_workers=NUM_WORKERS, collate_fn=collate_wrapper)
498 | 
499 |     # CREATE MODEL
500 |     model = CTCModel(inner_dim=128, num_chars=len(CHAR2INDEX))
501 | 
502 |     import torch
503 |     import numpy as np
504 |     
505 |     N_EPOCHS = 50
506 |     optimizer = torch.optim.Adam(model.parameters())
507 |     loss_fn = ctc_loss
508 |     metric_fn = accuracy_ctc
509 |     device = 'cpu' if not torch.cuda.is_available() else 'cuda'
510 | 
511 |     model = model.apply(weight_reset).to(device)  # reset weights, then move the model to the training device
512 |     print('START TRAINING')
513 |     train()
514 | 
515 |     # save model to s3
516 |     upload_file_to_s3('best_model.pth', bucket='ocrpipeline', object_name='best_model.pth')
517 | 
518 |     # Write to output file
519 |     model_path = "s3://ocrpipeline/best_model.pth"
520 |     Path('/output_path.txt').write_text(model_path)
521 | 
522 | 
523 | if __name__ == '__main__':
524 |     import argparse
525 |     parser = argparse.ArgumentParser(description='Training model')
526 |     parser.add_argument('--data_path', type=str)
527 | 
528 |     args = parser.parse_args()
529 | 
530 |     deploy_model(data_path=args.data_path)
531 | 
532 | 


--------------------------------------------------------------------------------
/docs/training/index.md:
--------------------------------------------------------------------------------
  1 | # Model training
  2 | Once the data has been labeled, we build and train a machine learning model on it. In this blog we use PyTorch to build and train the model, Cometml to manage experiments and tune hyperparameters, and DVC to store the model weights.
  3 | 
  4 | ## The problem
  5 | The goal is to read an image of a single line of text and return the text it contains. For example:
  6 | 
  7 | - Input:
  8 | 
  9 |     ![png](../data/images/output_3_0.png)
 10 | 
 11 | - Output: `phòng 101, tầng 1, lô 04 - TT58, khu đô thị Tây Nam Linh Đàm`
 12 | 
 13 | We will refer to this problem as `Optical Character Recognition (OCR)`.
 14 | ## Common methods and models
 15 | Many machine learning methods and models can be used to solve OCR; this GitHub repository collects a number of them: https://github.com/hwalsuklee/awesome-deep-text-detection-recognition
 16 | 
 17 | In this section we look at two popular models that later models build on: `Convolutional Recurrent Neural Network + CTC` and `Convolutional Recurrent Neural Network + Attention`. Many blogs and videos explain these models in detail and in an accessible way; links to them are listed in the references section. This blog focuses on the hands-on part, to better understand how the model works and how to improve it.
 18 | 
 19 | ![alt text](./images/crnn.png "CRNN")
 20 | 
 21 | In this blog we use `CTC decoding` as the `decoding algorithm` and the `CTC loss function` to train the model.
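
To make the decoding step concrete, here is a minimal sketch of greedy CTC decoding on plain Python lists: take the most likely index per frame, collapse consecutive repeats, then drop the blank symbol. The function name `ctc_greedy_decode`, the blank index `0`, and the frame values are assumptions chosen for illustration, not part of the tutorial's code.

```python
from itertools import groupby

BLANK = 0  # assumed blank index; the training code in this tutorial passes its PAD token (0) as the CTC blank


def ctc_greedy_decode(frame_ids, blank=BLANK):
    """Collapse consecutive repeats, then drop blank symbols."""
    collapsed = [idx for idx, _ in groupby(frame_ids)]
    return [idx for idx in collapsed if idx != blank]


# Hypothetical per-frame argmax indices for a 9-frame input.
frames = [5, 5, 0, 5, 7, 7, 0, 0, 9]
print(ctc_greedy_decode(frames))  # [5, 5, 7, 9]
```

The blank between the first two runs of `5` is what allows the decoder to keep a genuinely doubled character instead of merging it.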
 22 | 
 23 | ## Building the model with a Convolutional Recurrent Neural Network
 24 | First, install the required libraries:
 25 | ```
 26 | pip install torch
 27 | pip install numpy
 28 | pip install opencv-python
 29 | ```
 30 | 
 31 | ### Data processing
 32 | 
 33 | After labeling the data, choose `Export` and select the `JSON` format to download the labels. The labeled data has the following format:
 34 | 
 35 | ```json
 36 | [
 37 |   {
 38 |     "id": 17,
 39 |     "annotations": [
 40 |       {
 41 |         "id": 17,
 42 |         "completed_by": {
 43 |           "id": 1,
 44 |           "email": "admin@test.com",
 45 |           "first_name": "",
 46 |           "last_name": ""
 47 |         },
 48 |         "result": [
 49 |           {
 50 |             "value": {
 51 |               "choices": [
 52 |                 "Good"
 53 |               ]
 54 |             },
 55 |             "id": "HDtFcMqe2T",
 56 |             "from_name": "chc",
 57 |             "to_name": "image",
 58 |             "type": "choices"
 59 |           },
 60 |           {
 61 |             "value": {
 62 |               "text": [
 63 |                 "Số 253 đường Trần Phú, Thị trấn Nam Sách, Huyện Nam Sách, Hải Dương"
 64 |               ]
 65 |             },
 66 |             "id": "fx_csJL0zD",
 67 |             "from_name": "caption",
 68 |             "to_name": "image",
 69 |             "type": "textarea"
 70 |           }
 71 |         ],
 72 |         "was_cancelled": false,
 73 |         "ground_truth": false,
 74 |         "created_at": "2021-08-03T04:55:01.906421Z",
 75 |         "updated_at": "2021-08-03T04:55:01.906486Z",
 76 |         "lead_time": 24.076,
 77 |         "prediction": {},
 78 |         "result_count": 0,
 79 |         "task": 17
 80 |       }
 81 |     ],
 82 |     "predictions": [],
 83 |     "data": {
 84 |       "captioning": "s3://ocrpipeline/data/train/images/0000_samples.png"
 85 |     },
 86 |     "meta": {},
 87 |     "created_at": "2021-08-02T15:26:42.229399Z",
 88 |     "updated_at": "2021-08-02T15:26:42.229500Z",
 89 |     "project": 3
 90 |   }
 91 | ]
 92 | ```
 93 | 
 94 | 
 95 | We can train on this format directly or convert it to a simpler one. Here we convert it to a `dictionary` in the same format as the `CinnamonOCR` data; labels already converted to this format are in the `labels.json` file in each image folder.
 96 | 
 97 | 
 98 | ```python
 99 | import os
100 | import json
101 | ```
102 | 
103 | 
104 | ```python
105 | def convert_label_studio_format_to_ocr_format(label_studio_json_path, output_path):
106 |     with open(label_studio_json_path, 'r', encoding='utf-8') as f:
107 |         data = json.load(f)
108 |         
109 |     ocr_data = {}
110 |     
111 |     for item in data:
112 |         image_name = os.path.basename(item['data']['captioning'])
113 |         
114 |         text = ''
115 |         for value_item in item['annotations'][0]['result']:
116 |             if value_item['from_name'] == 'caption':
117 |                 text = value_item['value']['text'][0]
118 |         ocr_data[image_name] = text
119 |         
120 |     with open(output_path, 'w') as f:
121 |         json.dump(ocr_data, f, indent=4)
122 |     
123 |     print('Successfully converted ', label_studio_json_path)
124 | ```
125 | 
126 | 
127 | ```python
128 | convert_label_studio_format_to_ocr_format('./data/train/label_studio_data.json', './data/train/labels.json')
129 | convert_label_studio_format_to_ocr_format('./data/validation/label_studio_data.json', './data/validation/labels.json')
130 | convert_label_studio_format_to_ocr_format('./data/test/label_studio_data.json', './data/test/labels.json')
131 | ```
132 | 
133 |     Successfully converted  ./data/train/label_studio_data.json
134 |     Successfully converted  ./data/validation/label_studio_data.json
135 |     Successfully converted  ./data/test/label_studio_data.json
136 | 
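
As a rough illustration of the target format, the sketch below applies the same extraction logic to a single in-memory record instead of a file. The record is a trimmed-down, illustrative version of the export above, and `to_ocr_entry` is a hypothetical helper name.

```python
import os

# A trimmed-down record in the export layout shown above (values are illustrative).
record = {
    "annotations": [{
        "result": [
            {"from_name": "chc", "value": {"choices": ["Good"]}},
            {"from_name": "caption", "value": {"text": ["Số 253 đường Trần Phú"]}},
        ]
    }],
    "data": {"captioning": "s3://ocrpipeline/data/train/images/0000_samples.png"},
}


def to_ocr_entry(item):
    """Return (image_name, text) for one exported task."""
    image_name = os.path.basename(item["data"]["captioning"])
    text = ""
    for result in item["annotations"][0]["result"]:
        # Only the "caption" control holds the transcription.
        if result["from_name"] == "caption":
            text = result["value"]["text"][0]
    return image_name, text


name, text = to_ocr_entry(record)
labels = {name: text}  # same shape as labels.json: file name -> transcription
print(labels)
```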
137 | 
138 | ### Building the Dataloader
139 | 
140 | Build the vocabulary of characters that can appear:
141 | 
142 | 
143 | ```python
144 | CHARACTERS = "aáàạảãăắằẳẵặâấầẩẫậbcdđeéèẹẻẽêếềệểễfghiíìịỉĩjklmnoóòọỏõôốồộổỗơớờợởỡpqrstuúùụủũưứừựửữvxyýỳỷỹỵzwAÁÀẠẢÃĂẮẰẲẴẶÂẤẦẨẪẬBCDĐEÉÈẸẺẼÊẾỀỆỂỄFGHIÍÌỊỈĨJKLMNOÓÒỌỎÕÔỐỒỘỔỖƠỚỜỢỞỠPQRSTUÚÙỤỦŨƯỨỪỰỬỮVXYÝỲỴỶỸZW0123456789 .,-/()'#+:"
145 | 
146 | PAD_token = 0  # Used for padding short sentences
147 | SOS_token = 1  # Start-of-sentence token
148 | EOS_token = 2  # End-of-sentence token
149 | 
150 | CHAR2INDEX = {"PAD": PAD_token, "SOS": SOS_token, "EOS": EOS_token}
151 | INDEX2CHAR = {PAD_token: "PAD", SOS_token: "SOS", EOS_token: "EOS"}
152 | 
153 | for i, c in enumerate(CHARACTERS):
154 |     CHAR2INDEX[c] = i + 3
155 |     INDEX2CHAR[i + 3] = c
156 | ```
157 | 
158 | We need special tokens such as `PAD`, which is appended to the shorter sequences in a `batch` because all sequences in a `batch` must have the same length. `SOS` and `EOS` mark the start and the end of a sentence. We also need helper functions that convert a string of characters to numeric indices and back.
159 | 
160 | 
161 | ```python
162 | def get_indices_from_label(label):
163 |     indices = []
164 |     for char in label:
165 |         # Raises KeyError if the label contains a character that is not in CHARACTERS
166 |         indices.append(CHAR2INDEX[char])
167 | 
168 |     indices.append(EOS_token)
169 |     return indices
170 | 
171 | def get_label_from_indices(indices):
172 |     label = ""
173 |     for index in indices:
174 |         if index == EOS_token:
175 |             break
176 |         elif index == PAD_token:
177 |             continue
178 |         else:
179 |             label += INDEX2CHAR[index.item()]
180 | 
181 |     return label
182 | ```
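
A quick round trip shows what these helpers do. The sketch below is self-contained and uses a tiny stand-in alphabet and plain integers rather than tensors; `ALPHABET`, `encode`, and `decode` are illustrative names, not the tutorial's functions.

```python
PAD, SOS, EOS = 0, 1, 2   # same special-token ids as above
ALPHABET = "abc "         # tiny stand-in for CHARACTERS (assumption)

# Regular characters start at index 3, after the special tokens.
char2index = {c: i + 3 for i, c in enumerate(ALPHABET)}
index2char = {i: c for c, i in char2index.items()}


def encode(label):
    """Map each character to its index and append EOS."""
    return [char2index[c] for c in label] + [EOS]


def decode(indices):
    """Stop at EOS, skip PAD, map everything else back to characters."""
    chars = []
    for idx in indices:
        if idx == EOS:
            break
        if idx == PAD:
            continue
        chars.append(index2char[idx])
    return "".join(chars)


ids = encode("a cab")
print(ids)                       # [3, 6, 5, 3, 4, 2]
print(decode(ids + [PAD, PAD]))  # a cab
```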
183 | 
184 | Next we build the OCR dataset on top of this vocabulary. It takes the path to the image folder and the path to the label file of the corresponding split. We can also pass `transform` functions to `augment` the data (the class below stores `transform` but does not apply it yet).
185 | 
186 | 
187 | ```python
188 | from comet_ml import Experiment
189 | 
190 | import cv2
191 | from torch.utils.data import Dataset
192 | 
193 | 
194 | class OCRDataset(Dataset):
195 |     def __init__(self, data_dir, label_path, transform=None):
196 |         self.data_dir = data_dir
197 |         self.label_path = label_path
198 |         self.transform = transform
199 |         self.image_paths, self.labels = self.get_image_paths_and_labels(label_path)
200 | 
201 |     def __len__(self):
202 |         return len(self.image_paths)
203 | 
204 |     def __getitem__(self, idx):
205 |         img_path = self.get_data_path(self.image_paths[idx])
206 |         image = cv2.imread(img_path)
207 |         label = self.labels[idx]
208 |         label = get_indices_from_label(label)
209 |         sample = {"image": image, "label": label}
210 |         return sample
211 | 
212 |     def get_data_path(self, path):
213 |         return os.path.join(self.data_dir, path)
214 | 
215 |     def get_image_paths_and_labels(self, json_path):
216 |         with open(json_path, 'r', encoding='utf-8') as f:
217 |             data = json.load(f)
218 | 
219 |         image_paths = list(data.keys())
220 |         labels = list(data.values())
221 |         return image_paths, labels
222 | ```
223 | 
224 | Building the dataloader: the dataloader needs a `collate_fn` to combine individual items into a batch. Here, `collate_fn` (`collate_wrapper`) does the following:
225 | - Resizes the images to the same height and pads the narrower images with white space on the right
226 | - Brings the labels to the same length by appending the `PAD` token to the shorter sequences.
227 | - Converts the data to `torch.Tensor`
228 | 
229 | 
230 | ```python
231 | import itertools
232 | 
233 | def collate_wrapper(batch):
234 |     """
235 |     Labels are already numbers
236 |     :param batch:
237 |     :return:
238 |     """
239 |     images = []
240 |     labels = []
241 |     # TODO: can change height in config
242 |     height = 64
243 |     max_width = 0
244 |     max_label_length = 0
245 | 
246 |     for sample in batch:
247 |         image = sample['image']
248 |         try:
249 |             image = process_image(image, height=height, channels=image.shape[2])
250 |         except:
251 |             continue
252 | 
253 |         if image.shape[1] > max_width:
254 |             max_width = image.shape[1]
255 | 
256 |         label = sample['label']
257 | 
258 |         if len(label) > max_label_length:
259 |             max_label_length = len(label)
260 | 
261 |         images.append(image)
262 |         labels.append(label)
263 | 
264 |     # PAD IMAGES: convert to tensor with size b x c x h x w (from b x h x w x c)
265 |     channels = images[0].shape[2]
266 |     images = process_batch_images(images, height=height, max_width=max_width, channels=channels)
267 |     images = images.transpose((0, 3, 1, 2))
268 |     images = torch.from_numpy(images).float()
269 | 
270 |     # LABELS
271 |     pad_list = zero_padding(labels)
272 |     mask = binary_matrix(pad_list)
273 |     mask = torch.ByteTensor(mask)
274 |     labels = torch.LongTensor(pad_list)
275 |     return images, labels, mask, max_label_length
276 | 
277 | 
278 | def process_image(image, height=64, channels=3):
279 |     """Converts to self.channels, self.max_height
280 |     # convert channels
281 |     # resize max_height = 64
282 |     """
283 |     shape = image.shape
284 |     # if shape[0] > 64 or shape[0] < 32:  # height
285 |     try:
286 |         image = cv2.resize(image, (int(height/shape[0] * shape[1]), height))
287 |     except:
288 |         return np.zeros([1, 1, channels])
289 |     return image / 255.0
290 | 
291 | 
292 | def process_batch_images(images, height, max_width, channels=3):
293 |     """
294 |     Convert a list of images to a tensor (with padding)
295 |     :param images: list of numpy array images
296 |     :param height: desired height
297 |     :param max_width: max width of all images
298 |     :param channels: number of image channels
299 |     :return: a tensor representing images
300 |     """
301 |     output = np.ones([len(images), height, max_width, channels])
302 |     for i, image in enumerate(images):
303 |         final_img = image
304 |         shape = image.shape
305 |         output[i, :shape[0], :shape[1], :] = final_img
306 | 
307 |     return output
308 | 
309 | 
310 | def zero_padding(l, fillvalue=PAD_token):
311 |     """
312 |     Pad value PAD token to l
313 |     :param l: list of sequences need padding
314 |     :param fillvalue: padded value
315 |     :return:
316 |     """
317 |     return list(itertools.zip_longest(*l, fillvalue=fillvalue))
318 | 
319 | 
320 | def binary_matrix(l, value=PAD_token):
321 |     m = []
322 |     for i, seq in enumerate(l):
323 |         m.append([])
324 |         for token in seq:
325 |             if token == value:
326 |                 m[i].append(0)
327 |             else:
328 |                 m[i].append(1)
329 |     return m
330 | ```
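
The label-padding step can be tried in isolation. This sketch, on made-up index lists, shows that `itertools.zip_longest(*labels)` both pads and transposes, so the result is laid out as (max label length) x (batch size); summing the mask over the first axis then recovers each label's length, which is what the loss function later relies on.

```python
import itertools

PAD = 0
# Three made-up encoded labels, each ending in EOS (2).
labels = [[5, 6, 2], [7, 2], [8, 9, 10, 2]]

# Each tuple holds one time step across the whole batch.
padded = list(itertools.zip_longest(*labels, fillvalue=PAD))
mask = [[0 if token == PAD else 1 for token in row] for row in padded]
lengths = [sum(column) for column in zip(*mask)]

print(padded)   # [(5, 7, 8), (6, 2, 9), (2, 0, 10), (0, 0, 2)]
print(mask)     # [[1, 1, 1], [1, 1, 1], [1, 0, 1], [0, 0, 1]]
print(lengths)  # [3, 2, 4]
```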
331 | 
332 | We then use this `collate_fn` to build the `dataloader`.
333 | 
334 | ### Building the machine learning model
335 | In this section we build the CRNN model following the illustration above. The model has two main components:
336 | - `CNN feature extraction`: a convolutional neural network that extracts visual features from the image.
337 | - `LSTM network`: extracts sequential information from the character sequence.
338 | 
339 | In `CNN feature extraction` we use a VGG network as the `backbone`. It takes a 3-channel image as input and returns a tensor with a depth of 512 channels by default (`output_channel=512`).
340 | 
341 | 
342 | ```python
343 | import torch.nn as nn
344 | import torch.nn.functional as F
345 | 
346 | 
347 | class VGG_FeatureExtractor(nn.Module):
348 |     """ FeatureExtractor of CRNN (https://arxiv.org/pdf/1507.05717.pdf) """
349 | 
350 |     def __init__(self, input_channel, output_channel=512):
351 |         super(VGG_FeatureExtractor, self).__init__()
352 |         self.output_channel = [int(output_channel / 8), int(output_channel / 4),
353 |                                int(output_channel / 2), output_channel]  # [64, 128, 256, 512]
354 |         self.ConvNet = nn.Sequential(
355 |             nn.Conv2d(input_channel, self.output_channel[0], 3, 1, 1), nn.ReLU(True),
356 |             nn.MaxPool2d(2, 2),  # 64x16x50
357 |             nn.Conv2d(self.output_channel[0], self.output_channel[1], 3, 1, 1), nn.ReLU(True),
358 |             nn.MaxPool2d(2, 2),  # 128x8x25
359 |             nn.Conv2d(self.output_channel[1], self.output_channel[2], 3, 1, 1), nn.ReLU(True),  # 256x8x25
360 |             nn.Conv2d(self.output_channel[2], self.output_channel[2], 3, 1, 1), nn.ReLU(True),
361 |             nn.MaxPool2d((2, 1), (2, 1)),  # 256x4x25
362 |             nn.Conv2d(self.output_channel[2], self.output_channel[3], 3, 1, 1, bias=False),
363 |             nn.BatchNorm2d(self.output_channel[3]), nn.ReLU(True),  # 512x4x25
364 |             nn.Conv2d(self.output_channel[3], self.output_channel[3], 3, 1, 1, bias=False),
365 |             nn.BatchNorm2d(self.output_channel[3]), nn.ReLU(True),
366 |             nn.MaxPool2d((2, 1), (2, 1)),  # 512x2x25
367 |             nn.Conv2d(self.output_channel[3], self.output_channel[3], 2, 1, 0), nn.ReLU(True))  # 512x1x24
368 | 
369 |     def forward(self, input):
370 |         return self.ConvNet(input)
371 | ```
372 | 
373 | Despite its name, the `LSTM network` component uses `GRU cell`s, which are cheaper to compute, together with a `bidirectional` mechanism so that the model can learn from both directions, left to right and right to left.
374 | 
375 | 
376 | ```python
377 | class BidirectionalGRU(nn.Module):
378 |     def __init__(self, input_size, hidden_size, output_size):
379 |         super(BidirectionalGRU, self).__init__()
380 | 
381 |         self.rnn = nn.GRU(input_size, hidden_size, bidirectional=True)
382 |         self.embedding = nn.Linear(hidden_size * 2, output_size)
383 | 
384 |     def forward(self, x):
385 |         recurrent, hidden = self.rnn(x)
386 |         T, b, h = recurrent.size()
387 |         t_rec = recurrent.view(T * b, h)
388 | 
389 |         output = self.embedding(t_rec)  # [T * b, nOut]
390 |         output = output.view(T, b, -1)
391 | 
392 |         return output, hidden
393 | ```
394 | 
395 | Combining these two components gives a model, named `CTCModel`, defined as follows:
396 | 
397 | 
398 | ```python
399 | class CTCModel(nn.Module):
400 |     def __init__(self, inner_dim=512, num_chars=65):
401 |         super().__init__()
402 |         self.encoder = VGG_FeatureExtractor(3, inner_dim)
403 |         self.AdaptiveAvgPool = nn.AdaptiveAvgPool2d((None, 1))
404 |         self.rnn_encoder = BidirectionalGRU(inner_dim, 256, 256)
405 |         self.num_chars = num_chars
406 |         self.decoder = nn.Linear(256, self.num_chars)
407 | 
408 |     def forward(self, x, labels=None, max_label_length=None, device=None, training=True):
409 |         # ---------------- CNN ENCODER --------------
410 |         x = self.encoder(x)
411 |         # print('After CNN:', x.size())
412 | 
413 |         # ---------------- CNN TO RNN ----------------
414 |         x = x.permute(3, 0, 1, 2)  # from B x C x H x W -> W x B x C x H
415 |         x = self.AdaptiveAvgPool(x)
416 |         size = x.size()
417 |         x = x.reshape(size[0], size[1], size[2] * size[3])
418 | 
419 |         # ----------------- RNN ENCODER ---------------
420 |         encoder_outputs, last_hidden = self.rnn_encoder(x)
421 |         # print('After RNN', x.size())
422 | 
423 |         # --------------- CTC DECODER -------------------
424 |         # batch_size = encoder_outputs.size()[1]
425 |         outputs = self.decoder(encoder_outputs)
426 | 
427 |         return outputs
428 | ```
429 | 
430 | ### Training the model
431 | 
432 | To train the model we need to define the loss function and the metrics used to evaluate the model.
433 | 
434 | For the loss we use the `CTC loss function`; see the references section for details on how it works.
435 | 
436 | 
437 | ```python
438 | import torch
439 | import torch.nn.functional as F
440 | from torch.nn import CTCLoss, CrossEntropyLoss
441 | 
442 | def ctc_loss(outputs, targets, mask):
443 |     USE_CUDA = torch.cuda.is_available()
444 |     device = torch.device("cuda:0" if USE_CUDA else "cpu")
445 |     target_lengths = torch.sum(mask, dim=0).to(device)
446 |     # We need to change targets, PAD_token = 0 = blank
447 |     # EOS token -> PAD_token
448 |     targets[targets == EOS_token] = PAD_token
449 |     outputs = outputs.log_softmax(2)
450 |     input_lengths = outputs.size()[0] * torch.ones(outputs.size()[1], dtype=torch.int)
451 |     loss_fn = CTCLoss(blank=PAD_token, zero_infinity=True)
452 |     targets = targets.transpose(1, 0)
453 |     # target_lengths have EOS token, we need minus one
454 |     target_lengths = target_lengths - 1
455 |     targets = targets[:, :-1]
456 |     # print(input_lengths, target_lengths)
457 |     torch.backends.cudnn.enabled = False
458 |     # TODO: NAN when target_length > input_length, we can increase size or use zero infinity
459 |     loss = loss_fn(outputs, targets, input_lengths, target_lengths)
460 |     torch.backends.cudnn.enabled = True
461 | 
462 |     return loss, loss.item()
463 | ```
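
The `TODO` in `ctc_loss` refers to a constraint of CTC: the input must have at least as many time steps as the target has characters, plus one extra step for every pair of adjacent identical characters, because a blank has to separate them. When this does not hold the loss is infinite, which `zero_infinity=True` replaces with zero. A small sketch of that check follows; `min_ctc_frames`, `frames_for_width`, and `is_feasible` are hypothetical helpers, and `frames_for_width` assumes the layer stack of `VGG_FeatureExtractor` shown above (two 2x2 poolings that halve the width, then a final 2x2 convolution without padding).

```python
def min_ctc_frames(target):
    """Smallest number of time steps that can emit `target` under CTC."""
    repeats = sum(1 for a, b in zip(target, target[1:]) if a == b)
    return len(target) + repeats


def frames_for_width(width):
    """Time steps produced for a resized image of the given width (assumed layer stack)."""
    return (width // 2) // 2 - 1


def is_feasible(width, target):
    """True when an image of this width yields enough frames for the target."""
    return frames_for_width(width) >= min_ctc_frames(target)


print(frames_for_width(100))                 # 24
print(min_ctc_frames([4, 4, 7]))             # 4
print(is_feasible(100, list(range(3, 33))))  # False: 30 characters need 30 frames
```

If many samples turn out to be infeasible, widening the resized images or removing one width-halving pooling step are the usual options to consider.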
464 | 
465 | For evaluation we use two metrics:
466 | - `Accuracy by char`: a character-level score, based on the characters that match between the prediction and the target
467 | - `Accuracy by field`: computed on the whole string; a prediction counts as correct only if the entire string matches.
468 | 
469 | 
470 | ```python
471 | import difflib
472 | 
473 | def calculate_ac(str1, str2):
474 |     """Calculate accuracy by char of 2 string"""
475 | 
476 |     total_letters = len(str1)
477 |     ocr_letters = len(str2)
478 |     if total_letters == 0 and ocr_letters == 0:
479 |         acc_by_char = 1.0
480 |         return acc_by_char
481 |     diff = difflib.SequenceMatcher(None, str1, str2)
482 |     correct_letters = 0
483 |     for block in diff.get_matching_blocks():
484 |         correct_letters = correct_letters + block[2]
485 |     if ocr_letters == 0:
486 |         acc_by_char = 0
487 |     elif correct_letters == 0:
488 |         acc_by_char = 0
489 |     else:
490 |         acc_1 = correct_letters / total_letters
491 |         acc_2 = correct_letters / ocr_letters
492 |         acc_by_char = 2 * (acc_1 * acc_2) / (acc_1 + acc_2)
493 | 
494 |     return float(acc_by_char)
495 | 
496 | def accuracy_ctc(outputs, targets):
497 |     outputs = outputs.permute(1, 0, 2)
498 |     targets = targets.transpose(1, 0)
499 | 
500 |     total_acc_by_char = 0
501 |     total_acc_by_field = 0
502 | 
503 |     for output, target in zip(outputs, targets):
504 |         out_best = list(torch.argmax(output, -1))  # [2:]
505 |         out_best = [k for k, g in itertools.groupby(out_best)]
506 |         pred_text = get_label_from_indices(out_best)
507 |         target_text = get_label_from_indices(target)
508 | 
509 |         # print('predict:', pred_text, 'target:', target_text)
510 | 
511 |         acc_by_char = calculate_ac(pred_text, target_text)
512 |         total_acc_by_char += acc_by_char
513 | 
514 |         if pred_text == target_text:
515 |             total_acc_by_field += 1
516 | 
517 |     return np.array([total_acc_by_char / targets.size()[0], total_acc_by_field / targets.size()[0]])
518 | 
519 | ```
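
To see how the two metrics behave, consider a prediction that misses one diacritic. The sketch below recomputes the character-level score in a compact form (`char_f1` is an illustrative name); it follows the same matching-block idea as `calculate_ac` but is not a drop-in replacement for it.

```python
import difflib


def char_f1(pred, target):
    """Harmonic mean of matched/len(pred) and matched/len(target)."""
    matcher = difflib.SequenceMatcher(None, pred, target)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    if matched == 0:
        return 0.0
    precision = matched / len(pred)
    recall = matched / len(target)
    return 2 * precision * recall / (precision + recall)


pred, target = "Ca Mau", "Cà Mau"
by_char = char_f1(pred, target)            # 5 of 6 characters match on both sides
by_field = 1.0 if pred == target else 0.0  # the whole line must match exactly

print(round(by_char, 3), by_field)  # 0.833 0.0
```

A single wrong character barely moves the character-level score but sets the field-level score for that line to zero, which is why the two numbers can diverge so much on long address lines.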
520 | 
521 | We create the `dataloader`s for training as follows:
522 | 
523 | 
524 | ```python
525 | from torch.utils.data import DataLoader
526 | 
527 | BATCH_SIZE = 32
528 | NUM_WORKERS = 4
529 | 
530 | train_dataset = OCRDataset(data_dir='./data/train/images/', label_path='./data/train/labels.json')
531 | val_dataset = OCRDataset(data_dir='./data/validation/images/', label_path='./data/validation/labels.json')
532 | 
533 | train_dataloader = DataLoader(train_dataset, batch_size=BATCH_SIZE, num_workers=NUM_WORKERS, collate_fn=collate_wrapper)
534 | val_dataloader = DataLoader(val_dataset, batch_size=BATCH_SIZE, num_workers=NUM_WORKERS, collate_fn=collate_wrapper)
535 | ```
536 | 
537 | And create the model:
538 | 
539 | 
540 | ```python
541 | model = CTCModel(inner_dim=128, num_chars=len(CHAR2INDEX))
542 | ```
543 | 
544 | 
545 | ```python
546 | model
547 | ```
548 | 
549 | 
550 | 
551 | 
552 | >     CTCModel(
553 | >       (encoder): VGG_FeatureExtractor(
554 | >         (ConvNet): Sequential(
555 | >           (0): Conv2d(3, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
556 | >           (1): ReLU(inplace=True)
557 | >           (2): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
558 | >           (3): Conv2d(16, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
559 | >           (4): ReLU(inplace=True)
560 | >           (5): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
561 | >           (6): Conv2d(32, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
562 | >           (7): ReLU(inplace=True)
563 | >           (8): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
564 | >           (9): ReLU(inplace=True)
565 | >           (10): MaxPool2d(kernel_size=(2, 1), stride=(2, 1), padding=0, dilation=1, ceil_mode=False)
566 | >           (11): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
567 | >           (12): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
568 | >           (13): ReLU(inplace=True)
569 | >           (14): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
570 | >           (15): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
571 | >           (16): ReLU(inplace=True)
572 | >           (17): MaxPool2d(kernel_size=(2, 1), stride=(2, 1), padding=0, dilation=1, ceil_mode=False)
573 | >           (18): Conv2d(128, 128, kernel_size=(2, 2), stride=(1, 1))
574 | >           (19): ReLU(inplace=True)
575 | >         )
576 | >       )
577 | >       (AdaptiveAvgPool): AdaptiveAvgPool2d(output_size=(None, 1))
578 | >       (rnn_encoder): BidirectionalGRU(
579 | >         (rnn): GRU(128, 256, bidirectional=True)
580 | >         (embedding): Linear(in_features=512, out_features=256, bias=True)
581 | >       )
582 | >       (decoder): Linear(in_features=256, out_features=210, bias=True)
583 | >     )
584 | 
585 | 
586 | 
587 | Define the required settings:
588 | - N_EPOCHS: number of passes over the training data
589 | - optimizer: the optimization method
590 | - loss_fn: the loss function
591 | - metric_fn: the function that evaluates the model
592 | - device: whether to train on `cpu` or `gpu`
593 | 
594 | 
595 | ```python
596 | import torch
597 | import numpy as np
598 | 
599 | N_EPOCHS = 50
600 | optimizer = torch.optim.Adam(model.parameters())
601 | loss_fn = ctc_loss
602 | metric_fn = accuracy_ctc
603 | device = 'cpu' if not torch.cuda.is_available() else 'cuda'
604 | ```
605 | 
606 | Training the model for one epoch is done as follows:
607 | 
608 | 
609 | ```python
610 | def train_epoch(epoch):
611 |     model.train()
612 |     total_loss = 0
613 |     total_metrics = np.zeros(2)
614 |     for batch_idx, (images, labels, mask, max_label_length) in enumerate(train_dataloader):
615 |         images, labels, mask = images.to(device), labels.to(device), mask.to(device)
616 | 
617 |         optimizer.zero_grad()
618 |         output = model(images, labels, max_label_length, device)
619 | 
620 |         loss, print_loss = loss_fn(output, labels, mask)
621 |         loss.backward()
622 |         optimizer.step()
623 | 
624 |         total_loss += print_loss  # loss.item()
625 |         total_metrics += metric_fn(output, labels)
626 | 
627 |         if batch_idx == len(train_dataloader):
628 |             break
629 | 
630 |     log = {
631 |         'loss': total_loss / len(train_dataloader),
632 |         'metrics': (total_metrics / len(train_dataloader)).tolist()
633 |     }
634 | 
635 |     return log
636 | ```
637 | 
638 | And the model is evaluated after each epoch:
639 | 
640 | 
641 | ```python
642 | def val_epoch():
643 |     # when evaluating, we don't use teacher forcing
644 |     model.eval()
645 |     total_val_loss = 0
646 |     # total_val_metrics = np.zeros(len(self.metrics))
647 |     total_val_metrics = np.zeros(2)
648 |     with torch.no_grad():
649 |         # print("Length of validation:", len(self.valid_data_loader))
650 |         for batch_idx, (images, labels, mask, max_label_length) in enumerate(val_dataloader):
651 |             images, labels, mask = images.to(device), labels.to(device), mask.to(device)
653 | 
654 |             output = model(images, labels, max_label_length, device, training=False)
655 |             _, print_loss = loss_fn(output, labels, mask)  # Attention:
656 |             # loss = self.loss(output, labels, mask)
657 |             # print_loss = loss.item()
658 | 
659 |             total_val_loss += print_loss
660 |             total_val_metrics += metric_fn(output, labels)
661 | 
662 |     return_value = {
663 |         'loss': total_val_loss / len(val_dataloader),
664 |         'metrics': (total_val_metrics / len(val_dataloader)).tolist()
665 |     }
666 | 
667 |     return return_value
668 | ```
669 | 
670 | We train the model for `N_EPOCHS` epochs and keep the best result:
671 | 
672 | 
673 | ```python
674 | def train():
675 |     best_val_loss = np.inf
676 |     for epoch in range(N_EPOCHS):
677 |         train_log = train_epoch(epoch)
678 |         val_log = val_epoch()
679 |         if val_log['loss'] < best_val_loss:
680 |             # save model
681 |             best_val_loss = val_log['loss']
682 |             torch.save(model.state_dict(), 'best_model.pth')
683 | 
684 |         if (epoch + 1) % 5 == 0:
685 |             print("Epoch", epoch + 1)
686 |             print("Training log:", train_log)
687 |             print("Validation log:", val_log)
688 | 
689 | model = model.to(device)      
690 | train()
691 | ```
692 | 
693 | >     Epoch 5
694 | >     Training log: {'loss': 3.3025089090520687, 'metrics': [0.14659446106811722, 0.0]}
695 | >     Validation log: {'loss': 3.416637056752255, 'metrics': [0.16560045738213602, 0.0]}
696 | >     Epoch 10
697 | >     Training log: {'loss': 1.7547788468274204, 'metrics': [0.5980111773906229, 0.0]}
698 | >     Validation log: {'loss': 3.7502923262746712, 'metrics': [0.27276395007704896, 0.0]}
699 | >     Epoch 15
700 | >     Training log: {'loss': 0.9382512699473988, 'metrics': [0.7987094508609432, 0.0]}
701 | >     Validation log: {'loss': 1.1770769294939543, 'metrics': [0.7511817872619981, 0.0]}
702 | >     Epoch 20
703 | >     Training log: {'loss': 0.6035220243714072, 'metrics': [0.8693043548266096, 0.003409090909090909]}
704 | >     Validation log: {'loss': 1.0168936660415249, 'metrics': [0.7848459277694243, 0.0]}
705 | >     Epoch 25
706 | >     Training log: {'loss': 0.38259893493218855, 'metrics': [0.9168481876647544, 0.026136363636363635]}
707 | >     Validation log: {'loss': 2.3598944450679578, 'metrics': [0.687874780974538, 0.0]}
708 | >     Epoch 30
709 | >     Training log: {'loss': 0.2684195746075023, 'metrics': [0.9406091565435668, 0.05777972027972028]}
710 | >     Validation log: {'loss': 1.6666406894984997, 'metrics': [0.7451734024494661, 0.0]}
711 | >     Epoch 35
712 | >     Training log: {'loss': 0.19241986789486626, 'metrics': [0.9572576912350058, 0.11162587412587412]}
713 | >     Validation log: {'loss': 1.4961954229756405, 'metrics': [0.7395299244534455, 0.0]}
714 | >     Epoch 40
715 | >     Training log: {'loss': 0.13954399499026213, 'metrics': [0.9696559504747236, 0.1910839160839161]}
716 | >     Validation log: {'loss': 0.32535767712091146, 'metrics': [0.9302368830633553, 0.03782894736842105]}
717 | >     Epoch 45
718 | >     Training log: {'loss': 0.09284963634881106, 'metrics': [0.9798440719613576, 0.3003059440559441]}
719 | >     Validation log: {'loss': 0.29168694662420375, 'metrics': [0.935447053481159, 0.05098684210526316]}
720 | >     Epoch 50
721 | >     Training log: {'loss': 0.056186766787008804, 'metrics': [0.9883436321086309, 0.47718531468531467]}
722 | >     Validation log: {'loss': 0.4874882886284276, 'metrics': [0.9058548327889324, 0.023026315789473683]}
723 | 
724 | 
725 | We can see that the model converges and that its character-level accuracy is fairly high, but its accuracy on whole text lines is still low. The likely causes are the relatively small amount of data and the simplicity of the model. We can improve the model by applying data augmentation and adding synthetic data to get a more diverse dataset. We will improve the model's results in a later part of this blog.
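
To see why a high character-level accuracy can coexist with a low whole-line accuracy, here is a rough back-of-the-envelope sketch. It assumes, purely for illustration, that character errors are independent and that a line has about 40 characters; neither assumption comes from the experiment above.

```python
# Rough illustration only.
# Assumptions: character errors are independent, and a line has 40 characters.
def expected_line_accuracy(char_accuracy: float, line_length: int) -> float:
    """Probability that every character in a line is predicted correctly."""
    return char_accuracy ** line_length


# A character accuracy close to the epoch-45 validation value in the log above.
estimate = expected_line_accuracy(0.935, 40)
print(f"{estimate:.3f}")
```

Under these assumptions the line-level estimate falls below 0.1 even though more than 93% of the characters are correct, which matches the order of magnitude seen in the log.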
726 | 
727 | ### Testing the model's predictions
728 | After training the model, we test its predictions on a few sample images.
729 | 
730 | 
731 | ```python
732 | def predict(image):
733 |     model.eval()
734 |     batch = [{'image': image, 'label': [1]}]  # dummy label; only the image is used
735 |     images = collate_wrapper(batch)[0]
736 | 
737 |     images = images.to(device)
738 |     with torch.no_grad():  # inference only, no gradients needed
739 |         outputs = model(images)
740 |     outputs = outputs.permute(1, 0, 2)
741 |     output = outputs[0]
742 | 
743 |     out_best = list(torch.argmax(output, -1))  # greedy decoding: best class per time step
744 |     out_best = [k for k, g in itertools.groupby(out_best)]  # collapse consecutive repeats
745 |     pred_text = get_label_from_indices(out_best)
746 | 
747 |     return pred_text
748 | ```
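
The decoding step in `predict` can be illustrated on a toy sequence. This is a minimal sketch of greedy CTC decoding, assuming that index 0 (the padding token) also serves as the CTC blank; the `greedy_ctc_decode` helper and the `frames` values are hypothetical and only meant to show the idea.

```python
import itertools

BLANK = 0  # assumption: index 0 (the padding token) doubles as the CTC blank


def greedy_ctc_decode(indices, blank=BLANK):
    """Collapse consecutive repeats, then drop blank symbols."""
    collapsed = [k for k, _ in itertools.groupby(indices)]
    return [k for k in collapsed if k != blank]


# Hypothetical per-time-step argmax output for a short sequence.
frames = [0, 3, 3, 0, 3, 4, 4, 0]
print(greedy_ctc_decode(frames))  # [3, 3, 4]
```

The blank between the two runs of `3` is what allows a repeated character to survive the collapsing step.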
749 | 
750 | 
751 | ```python
752 | from matplotlib import pyplot as plt
753 | image = cv2.imread('./data/test/images/0000_tests.png')
754 | plt.imshow(image)
755 | predict(image)
756 | ```
757 | 
758 | 
759 | 
760 | 
761 | >     'Số 10, đường Lý Văn TLâm, Phường 1, Thành Phố Gà Mau, Cà Mau'
762 | 
763 | 
764 | 
765 | 
766 |     
767 | > ![png](./images/output_48_1.png)
768 |     
769 | 
770 | 
771 | ## Experiment management
772 | 
773 | In this section we will use `Cometml`.
774 | `Cometml` lets users track, compare, explain and optimize experiments and models, from training through to deployment. You can learn more about Cometml here: https://www.comet.ml/site/
775 | 
776 | We need to register an account and install the `comet_ml` library:
777 | ```
778 | pip install comet_ml
779 | ```
780 | 
781 | After registering, log in and create a new project named `OCR`. To interact with Cometml, we need the `API Key` of that project. Click `API Key` in the top-right corner of the project page to get the `API Key` for this project.
782 | 
783 | ![alt text](./images/cometml_api_key.png "Cometml API key")
784 | 
785 | 
786 | ```python
787 | from comet_ml import Experiment  # create an experiment with your API key
788 | experiment = Experiment(
789 |     api_key="*******************",
790 |     project_name="ocr",
791 |     workspace="thanhhau097",
792 | )
793 | ```
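
Hard-coding the API key in a notebook makes it easy to leak. A minimal sketch of reading it from an environment variable instead is shown below; the `read_secret` helper and the `COMET_API_KEY` variable name are assumptions made for this example, not part of the original code.

```python
import os


def read_secret(name: str) -> str:
    """Return the value of an environment variable, failing loudly if it is unset."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Hypothetical usage; COMET_API_KEY is an assumed variable name:
# experiment = Experiment(
#     api_key=read_secret("COMET_API_KEY"),
#     project_name="ocr",
#     workspace="<username>",
# )
```

Failing early with a clear message is usually easier to debug than an authentication error raised later in the run.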
794 | 
795 | We want to monitor quantities such as the loss and the evaluation metrics. To upload them to `Cometml`, we modify the training function as follows:
796 | 
797 | 
798 | ```python
799 | def train():
800 |     best_val_loss = np.inf
801 |     for epoch in range(N_EPOCHS):
802 |         print("Epoch", epoch + 1)
803 |         train_log = train_epoch(epoch)
804 |         print("Training log:", train_log)
805 |         train_loss = train_log['loss']
806 |         train_acc_by_char = train_log['metrics'][0]
807 |         train_acc_by_field = train_log['metrics'][1]
808 |         experiment.log_metrics({
809 |             "train_loss": train_loss,
810 |             "train_acc_by_char": train_acc_by_char,
811 |             "train_acc_by_field": train_acc_by_field
812 |         }, epoch=epoch)
813 |         
814 |         val_log = val_epoch()
815 |         val_loss = val_log['loss']
816 |         val_acc_by_char = val_log['metrics'][0]
817 |         val_acc_by_field = val_log['metrics'][1]
818 |         experiment.log_metrics({
819 |             "val_loss": val_loss,
820 |             "val_acc_by_char": val_acc_by_char,
821 |             "val_acc_by_field": val_acc_by_field
822 |         }, epoch=epoch)
823 |         if val_log['loss'] < best_val_loss:
824 |             # save model
825 |             best_val_loss = val_log['loss']
826 |             
827 |         print("Validation log:", val_log)
828 | 
829 |     
830 | def weight_reset(m):
831 |     if isinstance(m, nn.Conv2d) or isinstance(m, nn.Linear):
832 |         m.reset_parameters()
833 |         
834 | model.apply(weight_reset)
835 | train()
836 | ```
837 | 
838 | We track metrics such as the loss and the accuracy. After training, the project page on `Cometml` shows charts like the following:
839 | 
840 | ![alt text](./images/cometml_log.png "Cometml log")
841 | 
842 | These charts make it much easier to analyze the model's performance. We can also compare experiments with one another to analyze how different runs perform and how the model architecture and the hyperparameters affect the training results.
843 | 
844 | ## Hyperparameter optimization
845 | 
846 | Choosing suitable hyperparameters takes experience. Many hyperparameters can be tuned when training a machine learning model, for example the learning rate, the batch size and the number of layers in the model.
847 | 
848 | Here we will tune the number of channels of the CNN's `output tensor`, which is named `inner_dim`. We will use the Cometml library to search for a good `inner_dim` for our problem.
849 | 
850 | 
851 | ```python
852 | from comet_ml import Optimizer
853 | # The optimization config:
854 | 
855 | config = {
856 |     "algorithm": "bayes",
857 |     "spec": {"maxCombo": 10, "objective": "minimize", "metric": "val_loss"},
858 |     "parameters": {
859 |          "inner_dim": {"type": "discrete", "values": [32, 64, 128, 256, 512]},
860 |     },
861 |     "trials": 1,
862 |     "name": "My Optimizer Name",
863 | }
864 | 
865 | 
866 | opt = Optimizer(config, api_key='*******************')
867 | ```
868 | 
869 | 
870 | ```python
871 | def train(experiment):
872 |     best_val_loss = np.inf
873 |     for epoch in range(N_EPOCHS):
874 |         print("Epoch", epoch + 1)
875 |         train_log = train_epoch(epoch)
876 |         print("Training log:", train_log)
877 |         train_loss = train_log['loss']
878 |         train_acc_by_char = train_log['metrics'][0]
879 |         train_acc_by_field = train_log['metrics'][1]
880 |         experiment.log_metrics({
881 |             "train_loss": train_loss,
882 |             "train_acc_by_char": train_acc_by_char,
883 |             "train_acc_by_field": train_acc_by_field
884 |         }, epoch=epoch)
885 |         
886 |         val_log = val_epoch()
887 |         val_loss = val_log['loss']
888 |         val_acc_by_char = val_log['metrics'][0]
889 |         val_acc_by_field = val_log['metrics'][1]
890 |         experiment.log_metrics({
891 |             "val_loss": val_loss,
892 |             "val_acc_by_char": val_acc_by_char,
893 |             "val_acc_by_field": val_acc_by_field
894 |         }, epoch=epoch)
895 |         if val_log['loss'] < best_val_loss:
896 |             # save model
897 |             best_val_loss = val_log['loss']
898 |             
899 |         print("Validation log:", val_log)
900 | 
901 | 
902 | N_EPOCHS = 20
903 | for experiment in opt.get_experiments(project_name="ocr"):
904 |     # Log parameters, or others:
905 |     experiment.log_parameter("epochs", N_EPOCHS)
906 | 
907 |     # Build the model:
908 |     model = CTCModel(inner_dim=experiment.get_parameter('inner_dim'))
909 |     model = model.to(device)
910 |     optimizer = torch.optim.Adam(model.parameters())
911 | 
912 |     # Train it:
913 |     train(experiment)
914 | 
915 |     # Optionally, end the experiment:
916 |     experiment.end()
917 | ```
918 | 
919 | 
920 | After training, we can compare the experiments with one another and find the best `inner_dim` for the model. This is only one example, using `inner_dim`; we can tune other hyperparameters such as the `learning rate` and the `batch size` in the same way.
921 | 
922 | <!-- ![alt text](./images/cometml_optimization.png "Hyperparameter Optimization") -->
923 | 
924 | ## Model Versioning
925 | After training, we need to save the weights of the best model we obtained so that we can deploy it later. We also need to store the model efficiently, so that the weights are easy to share and the experiment's results can be reproduced when needed.
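
Keeping only the best weights boils down to remembering the lowest validation loss seen so far and saving whenever it improves. Below is a minimal, framework-independent sketch; the `BestCheckpoint` class and its `save_fn` callback are hypothetical names, and the loss values are placeholders rather than real results.

```python
import math


class BestCheckpoint:
    """Calls save_fn whenever the validation loss improves."""

    def __init__(self, save_fn):
        self.best = math.inf
        self.save_fn = save_fn  # hypothetical callback, e.g. one wrapping torch.save

    def update(self, val_loss: float) -> bool:
        """Save and return True if val_loss beats the best loss seen so far."""
        if val_loss < self.best:
            self.best = val_loss
            self.save_fn()
            return True
        return False


# Stand-in callback; in the training loop it could be
# lambda: torch.save(model.state_dict(), "best_model.pth")
saved = []
tracker = BestCheckpoint(save_fn=lambda: saved.append("best_model.pth"))
for loss in [1.2, 0.8, 0.9, 0.5]:  # placeholder losses, not real results
    tracker.update(loss)
print(len(saved), tracker.best)
```

In the training function this check would sit where the `# save model` comment is, after each validation epoch.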
926 | 
927 | We will use Data Version Control (DVC) to store the model weights on AWS S3.
928 | 
929 | First we need to install `dvc`. The `s3` extra pulls in the dependencies needed for the S3 remote used below:
930 | ```
931 | pip install "dvc[s3]"
932 | ```
933 | 
934 | `dvc` is used together with `git`, so we need to initialize both `git` and `dvc`:
935 | ```
936 | git init
937 | dvc init
938 | ```
939 | 
940 | We will store the model on AWS S3. To configure the remote storage location, run:
941 | ```
942 | dvc remote add -d storage s3://ocrpipeline/weights
943 | ```
944 | 
945 | Then we push the best model we trained to AWS S3:
946 | ```
947 | dvc add best_model.pth
948 | dvc push
949 | ```
950 | 
951 | After tracking the model weights with `dvc`, we get a new file named `best_model.pth.dvc`. We commit this small file to the `git repository` instead of storing the full model weights in `git`.
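
The `.dvc` file is a small text pointer: it records a content hash of the tracked file rather than the weights themselves. As far as we know this is an MD5 hash by default, but treat the exact file layout as an assumption. The sketch below computes such a hash in Python; `file_md5` is a hypothetical helper, not a DVC function.

```python
import hashlib


def file_md5(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so that large weight files need not fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling `file_md5("best_model.pth")` returns a short hexadecimal string, which is why the pointer file stays tiny no matter how large the weights are.
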
952 | ## Summary
953 | In this part we built and tuned a model and stored it, using PyTorch, Cometml and DVC. In the next part we will deploy the model and put it to use, building on the results from this part.
954 | 
955 | 
956 | ## References
957 | 1. [An Intuitive Explanation of Connectionist Temporal Classification](https://towardsdatascience.com/intuitively-understanding-connectionist-temporal-classification-3797e43a86c)
958 | 2. [How to build end-to-end recognition system](https://www.youtube.com/watch?v=uVbOckyUemo)
959 | 
960 | Previous: [Storing and labeling data](../data/index.md)
961 | 
962 | Next: [Deploying and applying the machine learning model](../deployment/index.md)
963 | 
964 | [Back to Home](../index.md)
965 | 
966 | 
967 | <script src="https://utteranc.es/client.js"
968 |         repo="thanhhau097/mlpipeline"
969 |         issue-term="pathname"
970 |         label="Comment"
971 |         theme="github-light"
972 |         crossorigin="anonymous"
973 |         async>
974 | </script>


--------------------------------------------------------------------------------
/docs/kubeflow/index.md:
--------------------------------------------------------------------------------
   1 | # Building a model training system with Kubeflow
   2 | 
   3 | So far we have labeled data, trained a model and deployed it behind a web interface. These steps are still independent of one another and are carried out manually.
   4 | 
   5 | In this part we will use Kubeflow to build a system that connects data preprocessing, model training and model deployment, and runs them automatically whenever new data arrives, instead of doing each step by hand as in the previous parts.
   6 | 
   7 | With Kubeflow we can easily deploy a machine learning system on Kubernetes, then upgrade and scale it as needed. You can learn more about Kubeflow here: https://www.kubeflow.org/docs/started/kubeflow-overview/
   8 | 
   9 | ## Installing Kubeflow
  10 | Kubeflow can be installed in many environments: on a personal computer, `on-premise`, or on cloud platforms such as `Amazon Web Services (AWS)`, `Microsoft Azure` and `Google Cloud`.
  11 | 
  12 | ### Installing Kubeflow with MiniKF
  13 | In this section I install Kubeflow on a personal computer with [Arrikto MiniKF](https://www.kubeflow.org/docs/distributions/minikf/). You can follow the guide at that link or the steps below.
  14 | 
  15 | To install `MiniKF`, we first need to install `Vagrant` and `VirtualBox`. I am using Ubuntu here.
  16 | 
  17 | Install `Vagrant`:
  18 | ```
  19 | curl -fsSL https://apt.releases.hashicorp.com/gpg | sudo apt-key add -
  20 | sudo apt-add-repository "deb [arch=amd64] https://apt.releases.hashicorp.com $(lsb_release -cs) main"
  21 | sudo apt-get update && sudo apt-get install vagrant
  22 | ```
  23 | 
  24 | For `VirtualBox`, download the version that matches your operating system and install it as usual, following the instructions here: https://www.virtualbox.org/wiki/Linux_Downloads
  25 | 
  26 | Once both are installed, we use `Vagrant` to install and start MiniKF:
  27 | 
  28 | ```
  29 | vagrant init arrikto/minikf
  30 | vagrant up
  31 | ```
  32 | 
  33 | When installation and startup have finished, open [http://10.10.10.10](http://10.10.10.10) to continue the setup and start using Kubeflow. The setup page looks like the figure below.
  34 | 
  35 | ![alt text](./images/kubeflow.png "Kubeflow")
  36 | 
  37 | Follow the setup steps shown in the `Terminal` view on that page. When the setup is complete, click `Start MiniKF` and log in to Kubeflow with the username and password shown on the page.
  38 | 
  39 | ### Installing Kubeflow on Microk8s
  40 | You can follow the instructions at this link or the steps below:
  41 | 1. Install Microk8s
  42 | ```
  43 | sudo snap install microk8s --classic
  44 | ```
  45 | 
  46 | 2. Add the current user to the `microk8s` admin group
  47 | ```
  48 | sudo usermod -a -G microk8s $USER
  49 | sudo chown -f -R $USER ~/.kube
  50 | ```
  51 | Then log out and log back in.
  52 | 
  53 | 3. Check that Microk8s is running
  54 | ```
  55 | microk8s status --wait-ready
  56 | ```
  57 | 4. Install Kubeflow
  58 | ```
  59 | microk8s enable kubeflow
  60 | ```
  61 | 
  62 | To use a GPU, run the following command:
  63 | ```
  64 | microk8s enable gpu
  65 | ```
  66 | 
  67 | After a successful installation, open 10.64.140.43.nip.io to use Kubeflow.
  68 | 
  69 | This is the Kubeflow interface after logging in:
  70 | 
  71 | The Kubeflow UI offers several features:
  72 | - `Home`: the main page for managing Kubeflow's components.
  73 | - `Pipelines`: for managing `pipeline`s.
  74 | - `Notebook Servers`: for working with Jupyter notebooks.
  75 | - `Katib`: for hyperparameter optimization.
  76 | - `Artifact Store`: for managing `artifact`s.
  77 | - `Manage Contributors`: for sharing user access across `namespace`s in Kubeflow.
  78 | 
  79 | ## Building a Kubeflow Pipeline
  80 | In this section we build a Kubeflow pipeline made up of three components: data preprocessing, model training and model deployment.
  81 | 
  82 | ### Directory structure
  83 | To make the Kubeflow components easy to build, we refactor the code we wrote in the Jupyter notebook in the previous parts.
  84 | 
  85 | The directory for a Kubeflow pipeline is structured as follows:
  86 | ```
  87 | components
  88 | ├── pipeline.py
  89 | ├── preprocess
  90 | │   ├── src
  91 | │   │  └── main.py
  92 | │   ├── Dockerfile
  93 | │   ├── build_image.sh
  94 | │   └── component.yaml
  95 | ├── train
  96 | │   └── ...
  97 | └── deployment
  98 |     └── ...
  99 | ```
 100 | 
 101 | We split the pipeline into three components: `preprocess`, `train` and `deployment`. Each component contains the following:
 102 | - `src` directory: holds the code for the task at hand: preprocessing data, training the model or deploying the model. We refactor the code from the previous parts and move it into this directory.
 103 | - `Dockerfile`: builds the `docker image` that serves as the program's runtime environment.
 104 | - `build_image.sh`: a shell script that builds the `docker image` and pushes it to `docker hub`.
 105 | - `component.yaml`: defines the component.
 106 | 
 107 | We go into the details of each component in the following sections. We then connect the components in `pipeline.py` to build a `pipeline` like this:
 108 | 
 109 | ![alt text](./images/pipeline.png "Pipeline")
 110 | 
 111 | ### Data preprocessing
 112 | We start with the first component, data preprocessing.
 113 | 
 114 | #### Writing the source code
 115 | First we copy the data preprocessing code from the previous parts into `main.py` in the `src` directory. `main.py` looks like this:
 116 | 
 117 | ```python
 118 | import os
 119 | import json
 120 | from pathlib import Path
 121 | 
 122 | # define the aws key
 123 | os.environ["AWS_ACCESS_KEY_ID"] = "**************************"
 124 | os.environ["AWS_SECRET_ACCESS_KEY"] = "**************************"
 125 | 
 126 | # download data from label studio
 127 | print("DOWNLOADING DATA")
 128 | import requests
 129 | 
 130 | url = "https://***********.ngrok.io/api/projects/1/export"
 131 | querystring = {"exportType":"JSON"}
 132 | headers = {
 133 |     'authorization': "Token ******************************",
 134 | }
 135 | response = requests.request("GET", url, headers=headers, params=querystring)
 136 |     
 137 | # split train/val/test/user_data label from this json file
 138 | user_data = []
 139 | train_data = []
 140 | val_data = []
 141 | test_data = []
 142 | for item in response.json():
 143 |     if 's3://ocrpipeline/data/user_data' in item['data']['captioning']:
 144 |         user_data.append(item)
 145 |     elif 's3://ocrpipeline/data/train' in item['data']['captioning']:
 146 |         train_data.append(item)
 147 |     elif 's3://ocrpipeline/data/test' in item['data']['captioning']:
 148 |         test_data.append(item)
 149 |     elif 's3://ocrpipeline/data/validation' in item['data']['captioning']:
 150 |         val_data.append(item)
 151 | 
 152 | def save_labels(folder, data):
 153 |     if not os.path.exists(folder):
 154 |         os.makedirs(folder, exist_ok=True)
 155 |     label_path = os.path.join(folder, 'label_studio_data.json')
 156 |     with open(label_path, 'w') as f:
 157 |         json.dump(data, f)
 158 | 
 159 | save_labels('./data/train/', train_data)
 160 | save_labels('./data/test/', test_data)
 161 | save_labels('./data/validation/', val_data)
 162 | save_labels('./data/user_data/', user_data)
 163 | 
 164 | # preprocess
 165 | print("PREPROCESSING DATA")
 166 | def convert_label_studio_format_to_ocr_format(label_studio_json_path, output_path):
 167 |     with open(label_studio_json_path, 'r') as f:
 168 |         data = json.load(f)
 169 | 
 170 |     ocr_data = {}
 171 | 
 172 |     for item in data:
 173 |         image_name = os.path.basename(item['data']['captioning'])
 174 | 
 175 |         text = ''
 176 |         for value_item in item['annotations'][0]['result']:
 177 |             if value_item['from_name'] == 'caption':
 178 |                 text = value_item['value']['text'][0]
 179 |         ocr_data[image_name] = text
 180 | 
 181 |     with open(output_path, 'w') as f:
 182 |         json.dump(ocr_data, f, indent=4)
 183 | 
 184 |     print('Successfully converted ', label_studio_json_path)
 185 | convert_label_studio_format_to_ocr_format('./data/train/label_studio_data.json', './data/train/labels.json')
 186 | convert_label_studio_format_to_ocr_format('./data/validation/label_studio_data.json', './data/validation/labels.json')
 187 | convert_label_studio_format_to_ocr_format('./data/test/label_studio_data.json', './data/test/labels.json')
 188 | convert_label_studio_format_to_ocr_format('./data/user_data/label_studio_data.json', './data/user_data/labels.json')
 189 | 
 190 | # upload these file to s3
 191 | print("UPLOADING DATA")
 192 | import logging
 193 | import boto3
 194 | from botocore.exceptions import ClientError
 195 | 
 196 | 
 197 | def upload_file_to_s3(file_name, bucket, object_name=None):
 198 |     """Upload a file to an S3 bucket
 199 | 
 200 |     :param file_name: File to upload
 201 |     :param bucket: Bucket to upload to
 202 |     :param object_name: S3 object name. If not specified then file_name is used
 203 |     :return: True if file was uploaded, else False
 204 |     """
 205 | 
 206 |     # If S3 object_name was not specified, use file_name
 207 |     if object_name is None:
 208 |         object_name = file_name
 209 | 
 210 |     # Upload the file
 211 |     s3_client = boto3.client('s3')
 212 |     try:
 213 |         response = s3_client.upload_file(file_name, bucket, object_name)
 214 |     except ClientError as e:
 215 |         logging.error(e)
 216 |         return False
 217 |     return True
 218 | 
 219 | 
 220 | upload_file_to_s3('data/train/labels.json', bucket='ocrpipeline', object_name='data/train/labels.json')
 221 | upload_file_to_s3('data/validation/labels.json', bucket='ocrpipeline', object_name='data/validation/labels.json')
 222 | upload_file_to_s3('data/test/labels.json', bucket='ocrpipeline', object_name='data/test/labels.json')
 223 | upload_file_to_s3('data/user_data/labels.json', bucket='ocrpipeline', object_name='data/user_data/labels.json')
 224 | 
 225 | # write data path to output_path.txt
 226 | data_path = 's3://ocrpipeline/data'
 227 | Path('/output_path.txt').write_text(data_path)
 228 | ```
 229 | 
 230 | The steps are as follows:
 231 | 
 232 | 0. Declare the `aws keys` needed to upload data to and download data from `AWS S3`. In practice we should not declare these values in the source code; instead we pass them as environment variables when running the `docker image`. Hard-coding `aws keys` is a security risk, because anyone who can read the code can use our account.
 233 | 1. Download the labeled data from `label studio`. Label Studio is running on the local machine (localhost), so to send a `request` from a `container` in Kubeflow we need to expose the Label Studio API. A simple way is to use `ngrok`: follow the steps at this link to download and install it. Since `Label Studio` is running on port 8080, we expose it with `ngrok` as follows:
 234 |     ```
 235 |     ./ngrok http 8080
 236 |     ```
 237 | 
 238 |     `ngrok` prints a public URL, which we use to download the data from Label Studio.
 239 | 
 240 | 2. Process the `label studio` data and convert it to the new format.
 241 | 3. Upload the data to `AWS S3`.
 242 | 4. Write the path of the data on `AWS S3` to `output_path.txt` so that the next component can use it.
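
To make step 2 concrete, here is a small sketch that applies the same conversion logic as `main.py` to a single in-memory item. The `to_ocr_format` helper, the sample S3 path and the caption text are hypothetical; the sample contains only the fields that the conversion actually reads.

```python
import os


def to_ocr_format(items):
    """Map image file name -> caption text, mirroring the conversion in main.py."""
    ocr_data = {}
    for item in items:
        image_name = os.path.basename(item["data"]["captioning"])
        text = ""
        for result in item["annotations"][0]["result"]:
            if result["from_name"] == "caption":
                text = result["value"]["text"][0]
        ocr_data[image_name] = text
    return ocr_data


# Hypothetical export item with only the fields the conversion reads.
sample = [{
    "data": {"captioning": "s3://ocrpipeline/data/train/images/0001.png"},
    "annotations": [{"result": [
        {"from_name": "caption", "value": {"text": ["Hypothetical address line"]}},
    ]}],
}]
print(to_ocr_format(sample))  # {'0001.png': 'Hypothetical address line'}
```

The result is a flat mapping from file name to label, which is the shape the training code expects in `labels.json`.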
 243 | 
 244 | #### Building the Docker image
 245 | We build a Docker image that provides the environment for running the code above:
 246 | ```dockerfile
 247 | FROM python:3.7
 248 | COPY ./src /src
 249 | ```
 250 | 
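 251 | Here we use a `python 3.7` base image and copy the source code from the `src` directory into `/src` in the `docker container`. Note that `main.py` imports `requests` and `boto3`, which the base `python:3.7` image does not ship with, so the image also needs an install step such as `RUN pip install requests boto3`.
 252 | 
 253 | #### Writing the build_image.sh script
 254 | We write a shell script with the following content: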
 255 | ```shell
 256 | #!/bin/bash
 257 | 
 258 | full_image_name="<username>/ocrpreprocess"
 259 | 
 260 | docker build -t "${full_image_name}" .
 261 | docker push "${full_image_name}"
 262 | ```
 263 | 
 264 | We name the `docker image` `<username>/ocrpreprocess` (replace `<username>` with your own Docker Hub username) and build it with `docker build`. We then push the `docker image` to `docker hub`.
 265 | 
 266 | To push the `docker image` to `docker hub`, we need an account at https://hub.docker.com and must log in to docker with the command
 267 | ```
 268 | docker login
 269 | ```
 270 | 
 271 | After logging in to `docker hub`, we run the script to build the docker image:
 272 | 
 273 | ```
 274 | chmod +x build_image.sh
 275 | ./build_image.sh
 276 | ```
 277 | 
 278 | This gives us the environment for the data preprocessing step.
 279 | 
 280 | #### Defining the component configuration
 281 | We define a Kubeflow pipeline component by declaring its inputs, outputs, runtime environment and run command in `component.yaml`:
 282 | ```yaml
 283 | name: OCR Preprocessing
 284 | description: Preprocess OCR model
 285 | 
 286 | outputs:
 287 |   - {name: output_path, type: String, description: 'Output Path'}
 288 | 
 289 | implementation:
 290 |   container:
 291 |     image: thanhhau097/ocrpreprocess
 292 |     command: [
 293 |       python, /src/main.py,
 294 |     ]
 295 | 
 296 |     fileOutputs:
 297 |       output_path: /output_path.txt # path to s3
 298 | 
 299 | ```
 300 | 
 301 | Here we fill in the file with information from the previous sections, such as the `docker image` name and the `main.py` file to run. The `fileOutputs` entry tells Kubeflow to read `output_path.txt`, into which we wrote the path to the data, and use its contents as the output of the data preprocessing component.
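
The file-based output contract can be imitated locally. The sketch below only mimics the idea (one side writes a file, the other reads its contents as a string); it is not Kubeflow's actual implementation, and `write_output` and `read_output` are hypothetical helpers.

```python
import tempfile
from pathlib import Path


def write_output(path: Path, value: str) -> None:
    """What the producing component does: write its result to a file."""
    path.write_text(value)


def read_output(path: Path) -> str:
    """Roughly what the pipeline runtime does: read the file back as a string."""
    return path.read_text().strip()


with tempfile.TemporaryDirectory() as tmp:
    output_file = Path(tmp) / "output_path.txt"  # stand-in for /output_path.txt
    write_output(output_file, "s3://ocrpipeline/data")
    data_path = read_output(output_file)

print(data_path)  # s3://ocrpipeline/data
```

Passing small values such as paths through files keeps the components decoupled: the next component only needs the string, not access to the previous container.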
 302 | 
 303 | ### Model training
 304 | As with the data preprocessing component, we create the files that make up this Kubeflow component.
 305 | 
 306 | #### Writing the source code
 307 | We refactor the code from the previous part into `main.py` in the `src` directory:
 308 | 
 309 | ```python
 310 | def train_model(data_path):
 311 |     import os
 312 |     from pathlib import Path
 313 |     import logging
 314 |     import boto3
 315 |     from botocore.exceptions import ClientError
 316 | 
 317 | 
 318 |     def upload_file_to_s3(file_name, bucket, object_name=None):
 319 |         """Upload a file to an S3 bucket
 320 | 
 321 |         :param file_name: File to upload
 322 |         :param bucket: Bucket to upload to
 323 |         :param object_name: S3 object name. If not specified then file_name is used
 324 |         :return: True if file was uploaded, else False
 325 |         """
 326 | 
 327 |         # If S3 object_name was not specified, use file_name
 328 |         if object_name is None:
 329 |             object_name = file_name
 330 | 
 331 |         # Upload the file
 332 |         s3_client = boto3.client('s3')
 333 |         try:
 334 |             response = s3_client.upload_file(file_name, bucket, object_name)
 335 |         except ClientError as e:
 336 |             logging.error(e)
 337 |             return False
 338 |         return True
 339 | 
 340 | 
 341 |     os.environ["AWS_ACCESS_KEY_ID"] = "*********************"
 342 |     os.environ["AWS_SECRET_ACCESS_KEY"] = "*********************"
 343 | 
 344 |     CHARACTERS = "aáàạảãăắằẳẵặâấầẩẫậbcdđeéèẹẻẽêếềệểễfghiíìịỉĩjklmnoóòọỏõôốồộổỗơớờợởỡpqrstuúùụủũưứừựửữvxyýỳỷỹỵzwAÁÀẠẢÃĂẮẰẲẴẶÂẤẦẨẪẬBCDĐEÉÈẸẺẼÊẾỀỆỂỄFGHIÍÌỊỈĨJKLMNOÓÒỌỎÕÔỐỒỘỔỖƠỚỜỢỞỠPQRSTUÚÙỤỦŨƯỨỪỰỬỮVXYÝỲỴỶỸZW0123456789 .,-/()'#+:"
 345 |     PAD_token = 0  # Used for padding short sentences
 346 |     SOS_token = 1  # Start-of-sentence token
 347 |     EOS_token = 2  # End-of-sentence token
 348 | 
 349 |     CHAR2INDEX = {"PAD": PAD_token, "SOS": SOS_token, "EOS": EOS_token}
 350 |     INDEX2CHAR = {PAD_token: "PAD", SOS_token: "SOS", EOS_token: "EOS"}
 351 | 
 352 |     for i, c in enumerate(CHARACTERS):
 353 |         CHAR2INDEX[c] = i + 3
 354 |         INDEX2CHAR[i + 3] = c
 355 | 
 356 |     def get_indices_from_label(label):
 357 |         indices = []
 358 |         for char in label:
 359 |     #         if CHAR2INDEX.get(char) is not None:
 360 |                 indices.append(CHAR2INDEX[char])
 361 | 
 362 |         indices.append(EOS_token)
 363 |         return indices
 364 | 
 365 |     def get_label_from_indices(indices):
 366 |         label = ""
 367 |         for index in indices:
 368 |             if index == EOS_token:
 369 |                 break
 370 |             elif index == PAD_token:
 371 |                 continue
 372 |             else:
 373 |                 label += INDEX2CHAR[index.item()]
 374 | 
 375 |         return label
 376 | 
 377 |     from comet_ml import Experiment
 378 | 
 379 |     import cv2
 380 |     from torch.utils.data import Dataset
 381 |     
 382 |     # Create an experiment with your api key
 383 |     experiment = Experiment(
 384 |         api_key="***********************",
 385 |         project_name="ocr",
 386 |         workspace="<username>",
 387 |     )
 388 | 
 389 | 
 390 |     class OCRDataset(Dataset):
 391 |         def __init__(self, data_dir, label_path, transform=None):
 392 |             self.data_dir = data_dir
 393 |             self.label_path = label_path
 394 |             self.transform = transform
 395 |             self.image_paths, self.labels = self.get_image_paths_and_labels(label_path)
 396 | 
 397 |         def __len__(self):
 398 |             return len(self.image_paths)
 399 | 
 400 |         def __getitem__(self, idx):
 401 |             img_path = self.get_data_path(self.image_paths[idx])
 402 |             image = cv2.imread(img_path)
 403 |             label = self.labels[idx]
 404 |             label = get_indices_from_label(label)
 405 |             sample = {"image": image, "label": label}
 406 |             return sample
 407 | 
 408 |         def get_data_path(self, path):
 409 |             return os.path.join(self.data_dir, path)
 410 | 
 411 |         def get_image_paths_and_labels(self, json_path):
 412 |             with open(json_path, 'r', encoding='utf-8') as f:
 413 |                 data = json.load(f)
 414 | 
 415 |             image_paths = list(data.keys())
 416 |             labels = list(data.values())
 417 |             return image_paths, labels
 418 |         
 419 |         
 420 |     import itertools
 421 | 
 422 |     def collate_wrapper(batch):
 423 |         """
 424 |         Labels are already numbers
 425 |         :param batch:
 426 |         :return:
 427 |         """
 428 |         images = []
 429 |         labels = []
 430 |         # TODO: can change height in config
 431 |         height = 64
 432 |         max_width = 0
 433 |         max_label_length = 0
 434 | 
 435 |         for sample in batch:
 436 |             image = sample['image']
 437 |             try:
 438 |                 image = process_image(image, height=height, channels=image.shape[2])
 439 |             except Exception:  # skip samples whose image cannot be processed
 440 |                 continue
 441 | 
 442 |             if image.shape[1] > max_width:
 443 |                 max_width = image.shape[1]
 444 | 
 445 |             label = sample['label']
 446 | 
 447 |             if len(label) > max_label_length:
 448 |                 max_label_length = len(label)
 449 | 
 450 |             images.append(image)
 451 |             labels.append(label)
 452 | 
 453 |         # PAD IMAGES: convert to tensor with size b x c x h x w (from b x h x w x c)
 454 |         channels = images[0].shape[2]
 455 |         images = process_batch_images(images, height=height, max_width=max_width, channels=channels)
 456 |         images = images.transpose((0, 3, 1, 2))
 457 |         images = torch.from_numpy(images).float()
 458 | 
 459 |         # LABELS
 460 |         pad_list = zero_padding(labels)
 461 |         mask = binary_matrix(pad_list)
 462 |         mask = torch.ByteTensor(mask)
 463 |         labels = torch.LongTensor(pad_list)
 464 |         return images, labels, mask, max_label_length
 465 | 
 466 | 
 467 |     def process_image(image, height=64, channels=3):
 468 |         """Converts to self.channels, self.max_height
 469 |         # convert channels
 470 |         # resize max_height = 64
 471 |         """
 472 |         shape = image.shape
 473 |         # if shape[0] > 64 or shape[0] < 32:  # height
 474 |         try:
 475 |             image = cv2.resize(image, (int(height/shape[0] * shape[1]), height))
 476 |         except Exception:  # resizing failed (e.g. empty image): return a dummy image
 477 |             return np.zeros([1, 1, channels])
 478 |         return image / 255.0
 479 | 
 480 | 
 481 |     def process_batch_images(images, height, max_width, channels=3):
 482 |         """
 483 |         Convert a list of images to a tensor (with padding)
 484 |         :param images: list of numpy array images
 485 |         :param height: desired height
 486 |         :param max_width: max width of all images
 487 |         :param channels: number of image channels
 488 |         :return: a tensor representing images
 489 |         """
 490 |         output = np.ones([len(images), height, max_width, channels])
 491 |         for i, image in enumerate(images):
 492 |             final_img = image
 493 |             shape = image.shape
 494 |             output[i, :shape[0], :shape[1], :] = final_img
 495 | 
 496 |         return output
 497 | 
 498 | 
 499 |     def zero_padding(l, fillvalue=PAD_token):
 500 |         """
 501 |         Pad value PAD token to l
 502 |         :param l: list of sequences need padding
 503 |         :param fillvalue: padded value
 504 |         :return:
 505 |         """
 506 |         return list(itertools.zip_longest(*l, fillvalue=fillvalue))
 507 | 
 508 | 
 509 |     def binary_matrix(l, value=PAD_token):
 510 |         m = []
 511 |         for i, seq in enumerate(l):
 512 |             m.append([])
 513 |             for token in seq:
 514 |                 if token == value:
 515 |                     m[i].append(0)
 516 |                 else:
 517 |                     m[i].append(1)
 518 |         return m
 519 | 
 520 |     import torch.nn as nn
 521 |     import torch.nn.functional as F
 522 | 
 523 | 
 524 |     class VGG_FeatureExtractor(nn.Module):
 525 |         """ FeatureExtractor of CRNN (https://arxiv.org/pdf/1507.05717.pdf) """
 526 | 
 527 |         def __init__(self, input_channel, output_channel=512):
 528 |             super(VGG_FeatureExtractor, self).__init__()
 529 |             self.output_channel = [int(output_channel / 8), int(output_channel / 4),
 530 |                                     int(output_channel / 2), output_channel]  # [64, 128, 256, 512]
 531 |             self.ConvNet = nn.Sequential(
 532 |                 nn.Conv2d(input_channel, self.output_channel[0], 3, 1, 1), nn.ReLU(True),
 533 |                 nn.MaxPool2d(2, 2),  # 64x16x50
 534 |                 nn.Conv2d(self.output_channel[0], self.output_channel[1], 3, 1, 1), nn.ReLU(True),
 535 |                 nn.MaxPool2d(2, 2),  # 128x8x25
 536 |                 nn.Conv2d(self.output_channel[1], self.output_channel[2], 3, 1, 1), nn.ReLU(True),  # 256x8x25
 537 |                 nn.Conv2d(self.output_channel[2], self.output_channel[2], 3, 1, 1), nn.ReLU(True),
 538 |                 nn.MaxPool2d((2, 1), (2, 1)),  # 256x4x25
 539 |                 nn.Conv2d(self.output_channel[2], self.output_channel[3], 3, 1, 1, bias=False),
 540 |                 nn.BatchNorm2d(self.output_channel[3]), nn.ReLU(True),  # 512x4x25
 541 |                 nn.Conv2d(self.output_channel[3], self.output_channel[3], 3, 1, 1, bias=False),
 542 |                 nn.BatchNorm2d(self.output_channel[3]), nn.ReLU(True),
 543 |                 nn.MaxPool2d((2, 1), (2, 1)),  # 512x2x25
 544 |                 nn.Conv2d(self.output_channel[3], self.output_channel[3], 2, 1, 0), nn.ReLU(True))  # 512x1x24
 545 | 
 546 |         def forward(self, input):
 547 |             return self.ConvNet(input)
 548 |         
 549 | 
 550 |     class BidirectionalGRU(nn.Module):
 551 |         def __init__(self, input_size, hidden_size, output_size):
 552 |             super(BidirectionalGRU, self).__init__()
 553 | 
 554 |             self.rnn = nn.GRU(input_size, hidden_size, bidirectional=True)
 555 |             self.embedding = nn.Linear(hidden_size * 2, output_size)
 556 | 
 557 |         def forward(self, x):
 558 |             recurrent, hidden = self.rnn(x)
 559 |             T, b, h = recurrent.size()
 560 |             t_rec = recurrent.view(T * b, h)
 561 | 
 562 |             output = self.embedding(t_rec)  # [T * b, nOut]
 563 |             output = output.view(T, b, -1)
 564 | 
 565 |             return output, hidden
 566 | 
 567 |     class CTCModel(nn.Module):
 568 |         def __init__(self, inner_dim=512, num_chars=65):
 569 |             super().__init__()
 570 |             self.encoder = VGG_FeatureExtractor(3, inner_dim)
 571 |             self.AdaptiveAvgPool = nn.AdaptiveAvgPool2d((None, 1))
 572 |             self.rnn_encoder = BidirectionalGRU(inner_dim, 256, 256)
 573 |             self.num_chars = num_chars
 574 |             self.decoder = nn.Linear(256, self.num_chars)
 575 | 
 576 |         def forward(self, x, labels=None, max_label_length=None, device=None, training=True):
 577 |             # ---------------- CNN ENCODER --------------
 578 |             x = self.encoder(x)
 579 |             # print('After CNN:', x.size())
 580 | 
 581 |             # ---------------- CNN TO RNN ----------------
 582 |             x = x.permute(3, 0, 1, 2)  # from B x C x H x W -> W x B x C x H
 583 |             x = self.AdaptiveAvgPool(x)
 584 |             size = x.size()
 585 |             x = x.reshape(size[0], size[1], size[2] * size[3])
 586 | 
 587 |             # ----------------- RNN ENCODER ---------------
 588 |             encoder_outputs, last_hidden = self.rnn_encoder(x)
 589 |             # print('After RNN', x.size())
 590 | 
 591 |             # --------------- CTC DECODER -------------------
 592 |             # batch_size = encoder_outputs.size()[1]
 593 |             outputs = self.decoder(encoder_outputs)
 594 | 
 595 |             return outputs
 596 |         
 597 |     import torch
 598 |     import torch.nn.functional as F
 599 |     from torch.nn import CTCLoss, CrossEntropyLoss
 600 | 
 601 |     def ctc_loss(outputs, targets, mask):
 602 |         USE_CUDA = torch.cuda.is_available()
 603 |         device = torch.device("cuda:0" if USE_CUDA else "cpu")
 604 |         target_lengths = torch.sum(mask, dim=0).to(device)
 605 |         # We need to change targets, PAD_token = 0 = blank
 606 |         # EOS token -> PAD_token
 607 |         targets[targets == EOS_token] = PAD_token
 608 |         outputs = outputs.log_softmax(2)
 609 |         input_lengths = outputs.size()[0] * torch.ones(outputs.size()[1], dtype=torch.int)
 610 |         loss_fn = CTCLoss(blank=PAD_token, zero_infinity=True)
 611 |         targets = targets.transpose(1, 0)
 612 |         # target_lengths have EOS token, we need minus one
 613 |         target_lengths = target_lengths - 1
 614 |         targets = targets[:, :-1]
 615 |         # print(input_lengths, target_lengths)
 616 |         torch.backends.cudnn.enabled = False
 617 |         # TODO: NAN when target_length > input_length, we can increase size or use zero infinity
 618 |         loss = loss_fn(outputs, targets, input_lengths, target_lengths)
 619 |         torch.backends.cudnn.enabled = True
 620 | 
 621 |         return loss, loss.item()
 622 | 
 623 |     import difflib
 624 | 
 625 |     def calculate_ac(str1, str2):
 626 |         """Calculate accuracy by char of 2 string"""
 627 | 
 628 |         total_letters = len(str1)
 629 |         ocr_letters = len(str2)
 630 |         if total_letters == 0 and ocr_letters == 0:
 631 |             acc_by_char = 1.0
 632 |             return acc_by_char
 633 |         diff = difflib.SequenceMatcher(None, str1, str2)
 634 |         correct_letters = 0
 635 |         for block in diff.get_matching_blocks():
 636 |             correct_letters = correct_letters + block[2]
 637 |         if ocr_letters == 0:
 638 |             acc_by_char = 0
 639 |         elif correct_letters == 0:
 640 |             acc_by_char = 0
 641 |         else:
 642 |             acc_1 = correct_letters / total_letters
 643 |             acc_2 = correct_letters / ocr_letters
 644 |             acc_by_char = 2 * (acc_1 * acc_2) / (acc_1 + acc_2)
 645 | 
 646 |         return float(acc_by_char)
 647 | 
 648 |     def accuracy_ctc(outputs, targets):
 649 |         outputs = outputs.permute(1, 0, 2)
 650 |         targets = targets.transpose(1, 0)
 651 | 
 652 |         total_acc_by_char = 0
 653 |         total_acc_by_field = 0
 654 | 
 655 |         for output, target in zip(outputs, targets):
 656 |             out_best = list(torch.argmax(output, -1))  # [2:]
 657 |             out_best = [k for k, g in itertools.groupby(out_best)]
 658 |             pred_text = get_label_from_indices(out_best)
 659 |             target_text = get_label_from_indices(target)
 660 | 
 661 |             # print('predict:', pred_text, 'target:', target_text)
 662 | 
 663 |             acc_by_char = calculate_ac(pred_text, target_text)
 664 |             total_acc_by_char += acc_by_char
 665 | 
 666 |             if pred_text == target_text:
 667 |                 total_acc_by_field += 1
 668 | 
 669 |         return np.array([total_acc_by_char / targets.size()[0], total_acc_by_field / targets.size()[0]])
 670 | 
 671 |     from torch.utils.data import DataLoader
 672 | 
 673 |     def train_epoch(epoch):
 674 |         model.train()
 675 |         total_loss = 0
 676 |         total_metrics = np.zeros(2)
 677 |         for batch_idx, (images, labels, mask, max_label_length) in enumerate(train_dataloader):
 678 |             images, labels, mask = images.to(device), labels.to(device), mask.to(device)
 679 | 
 680 |             optimizer.zero_grad()
 681 |             output = model(images, labels, max_label_length, device)
 682 | 
 683 |             loss, print_loss = loss_fn(output, labels, mask)
 684 |             loss.backward()
 685 |             optimizer.step()
 686 | 
 687 |             total_loss += print_loss  # loss.item()
 688 |             total_metrics += metric_fn(output, labels)
 689 | 
 690 |             if batch_idx == len(train_dataloader):
 691 |                 break
 692 | 
 693 |         log = {
 694 |             'loss': total_loss / len(train_dataloader),
 695 |             'metrics': (total_metrics / len(train_dataloader)).tolist()
 696 |         }
 697 | 
 698 |         return log
 699 | 
 700 |     def val_epoch():
 701 |         # when evaluating, we don't use teacher forcing
 702 |         model.eval()
 703 |         total_val_loss = 0
 704 |         # total_val_metrics = np.zeros(len(self.metrics))
 705 |         total_val_metrics = np.zeros(2)
 706 |         with torch.no_grad():
 707 |             # print("Length of validation:", len(self.valid_data_loader))
 708 |             for batch_idx, (images, labels, mask, max_label_length) in enumerate(val_dataloader):
 709 |                 images, labels, mask = images.to(device), labels.to(device), mask.to(device)
 711 | 
 712 |                 output = model(images, labels, max_label_length, device, training=False)
 713 |                 _, print_loss = loss_fn(output, labels, mask)  # Attention:
 714 |                 # loss = self.loss(output, labels, mask)
 715 |                 # print_loss = loss.item()
 716 | 
 717 |                 total_val_loss += print_loss
 718 |                 total_val_metrics += metric_fn(output, labels)
 719 | 
 720 |         return_value = {
 721 |             'loss': total_val_loss / len(val_dataloader),
 722 |             'metrics': (total_val_metrics / len(val_dataloader)).tolist()
 723 |         }
 724 | 
 725 |         return return_value
 726 | 
 727 |     def train():
 728 |         best_val_loss = np.inf
 729 |         for epoch in range(N_EPOCHS):
 730 |             print("Epoch", epoch + 1)
 731 |             train_log = train_epoch(epoch)
 732 |             print("Training log:", train_log)
 733 |             train_loss = train_log['loss']
 734 |             train_acc_by_char = train_log['metrics'][0]
 735 |             train_acc_by_field = train_log['metrics'][1]
 736 |             experiment.log_metrics({
 737 |                 "train_loss": train_loss,
 738 |                 "train_acc_by_char": train_acc_by_char,
 739 |                 "train_acc_by_field": train_acc_by_field
 740 |             }, epoch=epoch)
 741 | 
 742 |             val_log = val_epoch()
 743 |             val_loss = val_log['loss']
 744 |             val_acc_by_char = val_log['metrics'][0]
 745 |             val_acc_by_field = val_log['metrics'][1]
 746 |             experiment.log_metrics({
 747 |                 "val_loss": val_loss,
 748 |                 "val_acc_by_char": val_acc_by_char,
 749 |                 "val_acc_by_field": val_acc_by_field
 750 |             }, epoch=epoch)
 751 |             if val_log['loss'] < best_val_loss:
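 752 |                 # save the best weights so they can be uploaded to S3 after training
 753 |                 best_val_loss = val_log['loss']
 754 |                 torch.save(model.state_dict(), 'best_model.pth')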
 755 |             print("Validation log:", val_log)
 756 | 
 757 | 
 758 |     def weight_reset(m):
 759 |         if isinstance(m, nn.Conv2d) or isinstance(m, nn.Linear):
 760 |             m.reset_parameters()
 761 | 
 762 |     # Download data from s3
 763 |     print('DOWNLOADING DATA FROM S3')
 764 |     import boto3
 765 |     import os 
 766 | 
 767 |     s3_resource = boto3.resource('s3')
 768 |     bucket = s3_resource.Bucket('ocrpipeline') 
 769 |     for obj in bucket.objects.filter(Prefix='data'):
 770 |         if not os.path.exists(os.path.dirname(obj.key)):
 771 |             os.makedirs(os.path.dirname(obj.key))
 772 |         bucket.download_file(obj.key, obj.key) # save to same path
 773 |             
 774 |     print('START CREATING DATASET AND MODEL')
 775 |     BATCH_SIZE = 32
 776 |     NUM_WORKERS = 4
 777 | 
 778 |     # merge data from train folder and user_data folder 
 779 |     import shutil
 780 |     import json
 781 | 
 782 |     ## merge labels
 783 |     with open('./data/user_data/labels.json', 'r') as f:
 784 |         user_data = json.load(f)
 785 | 
 786 |     with open('./data/train/labels.json', 'r') as f:
 787 |         train_data = json.load(f)
 788 | 
 789 |     train_data = {**train_data, **user_data}
 790 |     with open('./data/train/labels.json', 'w') as f:
 791 |         json.dump(train_data, f)
 792 | 
 793 |     ## merge images
 794 |     SOURCE_FOLDER = './data/user_data/images'
 795 |     TARGET_FOLDER = './data/train/images'
 796 | 
 797 |     for image_name in os.listdir(SOURCE_FOLDER):
 798 |         shutil.copy2(os.path.join(SOURCE_FOLDER, image_name), TARGET_FOLDER)
 799 | 
 800 | 
 801 |     # CREATE DATASET AND DATALOADER
 802 |     train_dataset = OCRDataset(data_dir='./data/train/images/', label_path='./data/train/labels.json')
 803 |     val_dataset = OCRDataset(data_dir='./data/validation/images/', label_path='./data/validation/labels.json')
 804 | 
 805 |     train_dataloader = DataLoader(train_dataset, batch_size=BATCH_SIZE, num_workers=NUM_WORKERS, collate_fn=collate_wrapper)
 806 |     val_dataloader = DataLoader(val_dataset, batch_size=BATCH_SIZE, num_workers=NUM_WORKERS, collate_fn=collate_wrapper)
 807 | 
 808 |     # CREATE MODEL
 809 |     model = CTCModel(inner_dim=128, num_chars=len(CHAR2INDEX))
 810 | 
 811 |     import torch
 812 |     import numpy as np
 813 |     
 814 |     N_EPOCHS = 50
 815 |     optimizer = torch.optim.Adam(model.parameters())
 816 |     loss_fn = ctc_loss
 817 |     metric_fn = accuracy_ctc
 818 |     device = 'cpu' if not torch.cuda.is_available() else 'cuda'
 819 |     model = model.to(device)  # keep the model on the same device as the batches
 820 |     model.apply(weight_reset)
 821 |     print('START TRAINING')
 822 |     train()
 823 | 
 824 |     # save model to s3
 825 |     upload_file_to_s3('best_model.pth', bucket='ocrpipeline', object_name='best_model.pth')
 826 | 
 827 |     # Write to output file
 828 |     model_path = "s3://ocrpipeline/best_model.pth"
 829 |     Path('/output_path.txt').write_text(model_path)
 830 | 
 831 | 
 832 | if __name__ == '__main__':
 833 |     import argparse
 834 |     parser = argparse.ArgumentParser(description='Training model')
 835 |     parser.add_argument('--data_path', type=str)
 836 | 
 837 |     args = parser.parse_args()
 838 | 
 839 |     deploy_model(data_path=args.data_path)
 840 | ```
 841 | 
 842 | First, we need to download the data from `AWS S3`, including both the images and the labels that we processed in the data preprocessing step. One important note: we merge the training set in `data/train` with the new, labeled data uploaded by users (in `data/user_data`) before training the new model. This ensures that the model is always updated automatically with the latest data.
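
The label merge described above can be sketched as a small standalone helper. This is only an illustration: the function name `merge_label_files` is hypothetical, and it assumes both label files are JSON objects that map an image file name to its text label, as in the training code. When the same file name appears in both files, the user-provided label wins.

```python
import json
from pathlib import Path


def merge_label_files(train_labels_path, user_labels_path):
    """Merge user labels into the training labels file and return the result.

    Hypothetical helper: both files are assumed to hold a JSON object
    of the form {image_file_name: text_label}.
    """
    train_path = Path(train_labels_path)
    train_labels = json.loads(train_path.read_text(encoding="utf-8"))
    user_labels = json.loads(Path(user_labels_path).read_text(encoding="utf-8"))

    # Later keys override earlier ones, so user labels replace
    # training labels that share the same image file name.
    merged = {**train_labels, **user_labels}

    # ensure_ascii=False keeps Vietnamese characters readable in the file.
    train_path.write_text(json.dumps(merged, ensure_ascii=False), encoding="utf-8")
    return merged
```

Calling `merge_label_files('./data/train/labels.json', './data/user_data/labels.json')` would rewrite the training labels file in place, which mirrors what the training script does before building the dataset.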
 843 | 
 844 | The data path `data_path` is taken from the `output_path.txt` file produced by the data preprocessing component. We will wire the two components together in a later section.
 845 | 
 846 | As with data preprocessing, we save the path to the best model weights in the file `output_path.txt`. This path is passed to the next component in the Kubeflow pipeline, model deployment.
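
The hand-off through `output_path.txt` can be illustrated with a minimal sketch. The helper names `write_output_path` and `read_output_path` are hypothetical, and the file location is a parameter here so that the snippet can run anywhere. In the component itself the file is `/output_path.txt`, and the pipeline reads it through the `fileOutputs` entry of the component definition rather than through our own code.

```python
from pathlib import Path


def write_output_path(model_path, output_file):
    """Store the model location in a text file for the next component."""
    Path(output_file).write_text(model_path, encoding="utf-8")


def read_output_path(output_file):
    """Read the model location back, ignoring surrounding whitespace."""
    return Path(output_file).read_text(encoding="utf-8").strip()
```

Keeping the file content to a single plain string means the next component can receive it directly as a command-line argument.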
 847 | 
 848 | #### Building the Docker image
 849 | We build a Docker image to serve as the runtime environment for the source code above:
 850 | ```dockerfile
 851 | FROM ubuntu:20.04
 852 | # Install some common dependencies
 853 | ARG DEBIAN_FRONTEND=noninteractive
 854 | RUN apt-get update -qqy && \
 855 |     apt-get install -y \
 856 |       build-essential \
 857 |       cmake \
 858 |       g++ \
 859 |       git \
 860 |       libsm6 \
 861 |       libxrender1 \
 862 |       locales \
 863 |       libssl-dev \
 864 |       pkg-config \
 865 |       poppler-utils \
 866 |       python3-dev \
 867 |       python3-pip \
 868 |       python3.6 \
 869 |       software-properties-common \
 870 |       && \
 871 |     apt-get clean && \
 872 |     apt-get autoremove && \
 873 |     rm -rf /var/lib/apt/lists/*
 874 | 
 875 | # Make python3 and pip3 (Python 3.8 on Ubuntu 20.04) the default python and pip
 876 | RUN ln -sf /usr/bin/python3 /usr/bin/python
 877 | RUN ln -sf /usr/bin/pip3 /usr/bin/pip
 878 | RUN pip3 install -U pip setuptools
 879 | 
 880 | # locale setting
 881 | RUN locale-gen en_US.UTF-8
 882 | ENV LC_ALL=en_US.UTF-8 \
 883 |     LANG=en_US.UTF-8 \
 884 |     LANGUAGE=en_US.UTF-8 \
 885 |     PYTHONIOENCODING=utf-8
 886 | 
 887 | RUN pip install torch numpy comet_ml boto3 requests opencv-python
 888 | RUN apt update && apt install ffmpeg libsm6 libxext6  -y
 889 | COPY ./src /src
 890 | ```
 891 | 
 892 | In this Dockerfile we use `Ubuntu 20.04` as the base environment and install Python along with the required configuration. Next, we install the libraries needed to train the model, such as `torch` and `numpy`. Note that this environment cannot use a GPU for training. To train on a GPU, replace the base image `ubuntu:20.04` with one that supports GPUs, for example `nvidia/cuda:11.4.1-devel-ubuntu20.04`. Nvidia publishes its GPU-enabled images on Docker Hub: https://hub.docker.com/r/nvidia/cuda. I do not use a GPU-enabled base image here because I installed Kubeflow with Vagrant, and Vagrant does not yet support using GPUs inside Docker containers. You can switch the base image if you install Kubeflow on microk8s or on a cloud service such as EKS or GKE.
 893 | 
 894 | #### Writing the build_image.sh shell script
 895 | As with the shell script for the data preprocessing component, we write a script with the following content:
 896 | ```shell
 897 | #!/bin/bash
 898 | 
 899 | full_image_name="<username>/ocrtrain"  # replace <username> with your Docker Hub username
 900 | 
 901 | docker build -t "${full_image_name}" .
 902 | docker push "$full_image_name"
 903 | ```
 904 | 
 905 | We name the `docker image` `<username>/ocrtrain` (replace `<username>` with your own username) and build it with the `docker build` command. We then push this `docker image` to `docker hub`.
 906 | 
 907 | We run the shell script to build the docker image as follows:
 908 | 
 909 | ```shell
 910 | chmod +x build_image.sh
 911 | ./build_image.sh
 912 | ```
 913 | 
 914 | We have now built the environment for training the machine learning model.
 915 | 
 916 | #### Defining the component configuration
 917 | As with the data preprocessing component, we define the component in a `.yaml` file as follows:
 918 | ```yaml
 919 | name: OCR Train
 920 | description: Training OCR model
 921 | inputs:
 922 |   - {name: data_path, description: 's3 path to data'}
 923 | outputs:
 924 |   - {name: output_path, type: String, description: 'Output Path'}
 925 | 
 926 | implementation:
 927 |   container:
 928 |     image: <username>/ocrtrain
 929 |     command: [
 930 |       python, /src/main.py,
 931 |       --data_path, {inputValue: data_path},
 932 |     ]
 933 | 
 934 |     fileOutputs:
 935 |       output_path: /output_path.txt
 936 | ```
 937 | 
 938 | As we can see, the model training component takes `data_path` as input, which comes from the output of the data preprocessing component. Its output is read from the file `output_path.txt`, which contains the path to the model weights stored on `AWS S3`.
 939 | 
 940 | ### Model deployment
 941 | #### Writing the source code
 942 | As in the previous sections, we refactor the source code of the model deployment component and save it to `main.py` in the `src` directory. Here I choose to deploy a Flask API instead of a Gradio Interface; you can do the same with a Gradio Interface in exactly the same way.
 943 | 
 944 | After refactoring, the content of `main.py` is as follows:
 945 | ```python
 946 | import itertools  # used by predict() for greedy CTC decoding
 947 | import cv2  # used to read and write images
 948 | 
 949 | def deploy_model(model_s3_path: str):
 950 |     CHARACTERS = "aáàạảãăắằẳẵặâấầẩẫậbcdđeéèẹẻẽêếềệểễfghiíìịỉĩjklmnoóòọỏõôốồộổỗơớờợởỡpqrstuúùụủũưứừựửữvxyýỳỷỹỵzwAÁÀẠẢÃĂẮẰẲẴẶÂẤẦẨẪẬBCDĐEÉÈẸẺẼÊẾỀỆỂỄFGHIÍÌỊỈĨJKLMNOÓÒỌỎÕÔỐỒỘỔỖƠỚỜỢỞỠPQRSTUÚÙỤỦŨƯỨỪỰỬỮVXYÝỲỴỶỸZW0123456789 .,-/()'#+:"
 951 | 
 952 |     PAD_token = 0  # Used for padding short sentences
 953 |     SOS_token = 1  # Start-of-sentence token
 954 |     EOS_token = 2  # End-of-sentence token
 955 | 
 956 |     CHAR2INDEX = {"PAD": PAD_token, "SOS": SOS_token, "EOS": EOS_token}
 957 |     INDEX2CHAR = {PAD_token: "PAD", SOS_token: "SOS", EOS_token: "EOS"}
 958 | 
 959 |     for i, c in enumerate(CHARACTERS):
 960 |         CHAR2INDEX[c] = i + 3
 961 |         INDEX2CHAR[i + 3] = c
 962 |     
 963 |     def get_indices_from_label(label):
 964 |         indices = []
 965 |         for char in label:
 966 |     #         if CHAR2INDEX.get(char) is not None:
 967 |                 indices.append(CHAR2INDEX[char])
 968 | 
 969 |         indices.append(EOS_token)
 970 |         return indices
 971 | 
 972 |     def get_label_from_indices(indices):
 973 |         label = ""
 974 |         for index in indices:
 975 |             if index == EOS_token:
 976 |                 break
 977 |             elif index == PAD_token:
 978 |                 continue
 979 |             else:
 980 |                 label += INDEX2CHAR[index.item()]
 981 | 
 982 |         return label
 983 |     
 984 |     import torch.nn as nn
 985 |     import torch.nn.functional as F
 986 | 
 987 | 
 988 |     class VGG_FeatureExtractor(nn.Module):
 989 |         """ FeatureExtractor of CRNN (https://arxiv.org/pdf/1507.05717.pdf) """
 990 | 
 991 |         def __init__(self, input_channel, output_channel=512):
 992 |             super(VGG_FeatureExtractor, self).__init__()
 993 |             self.output_channel = [int(output_channel / 8), int(output_channel / 4),
 994 |                                    int(output_channel / 2), output_channel]  # [64, 128, 256, 512]
 995 |             self.ConvNet = nn.Sequential(
 996 |                 nn.Conv2d(input_channel, self.output_channel[0], 3, 1, 1), nn.ReLU(True),
 997 |                 nn.MaxPool2d(2, 2),  # 64x16x50
 998 |                 nn.Conv2d(self.output_channel[0], self.output_channel[1], 3, 1, 1), nn.ReLU(True),
 999 |                 nn.MaxPool2d(2, 2),  # 128x8x25
1000 |                 nn.Conv2d(self.output_channel[1], self.output_channel[2], 3, 1, 1), nn.ReLU(True),  # 256x8x25
1001 |                 nn.Conv2d(self.output_channel[2], self.output_channel[2], 3, 1, 1), nn.ReLU(True),
1002 |                 nn.MaxPool2d((2, 1), (2, 1)),  # 256x4x25
1003 |                 nn.Conv2d(self.output_channel[2], self.output_channel[3], 3, 1, 1, bias=False),
1004 |                 nn.BatchNorm2d(self.output_channel[3]), nn.ReLU(True),  # 512x4x25
1005 |                 nn.Conv2d(self.output_channel[3], self.output_channel[3], 3, 1, 1, bias=False),
1006 |                 nn.BatchNorm2d(self.output_channel[3]), nn.ReLU(True),
1007 |                 nn.MaxPool2d((2, 1), (2, 1)),  # 512x2x25
1008 |                 nn.Conv2d(self.output_channel[3], self.output_channel[3], 2, 1, 0), nn.ReLU(True))  # 512x1x24
1009 | 
1010 |         def forward(self, input):
1011 |             return self.ConvNet(input)
1012 |         
1013 |     
1014 |     class BidirectionalGRU(nn.Module):
1015 |         def __init__(self, input_size, hidden_size, output_size):
1016 |             super(BidirectionalGRU, self).__init__()
1017 | 
1018 |             self.rnn = nn.GRU(input_size, hidden_size, bidirectional=True)
1019 |             self.embedding = nn.Linear(hidden_size * 2, output_size)
1020 | 
1021 |         def forward(self, x):
1022 |             recurrent, hidden = self.rnn(x)
1023 |             T, b, h = recurrent.size()
1024 |             t_rec = recurrent.view(T * b, h)
1025 | 
1026 |             output = self.embedding(t_rec)  # [T * b, nOut]
1027 |             output = output.view(T, b, -1)
1028 | 
1029 |             return output, hidden
1030 |     
1031 |     class CTCModel(nn.Module):
1032 |         def __init__(self, inner_dim=512, num_chars=65):
1033 |             super().__init__()
1034 |             self.encoder = VGG_FeatureExtractor(3, inner_dim)
1035 |             self.AdaptiveAvgPool = nn.AdaptiveAvgPool2d((None, 1))
1036 |             self.rnn_encoder = BidirectionalGRU(inner_dim, 256, 256)
1037 |             self.num_chars = num_chars
1038 |             self.decoder = nn.Linear(256, self.num_chars)
1039 | 
1040 |         def forward(self, x, labels=None, max_label_length=None, device=None, training=True):
1041 |             # ---------------- CNN ENCODER --------------
1042 |             x = self.encoder(x)
1043 |             # print('After CNN:', x.size())
1044 | 
1045 |             # ---------------- CNN TO RNN ----------------
1046 |             x = x.permute(3, 0, 1, 2)  # from B x C x H x W -> W x B x C x H
1047 |             x = self.AdaptiveAvgPool(x)
1048 |             size = x.size()
1049 |             x = x.reshape(size[0], size[1], size[2] * size[3])
1050 | 
1051 |             # ----------------- RNN ENCODER ---------------
1052 |             encoder_outputs, last_hidden = self.rnn_encoder(x)
1053 |             # print('After RNN', x.size())
1054 | 
1055 |             # --------------- CTC DECODER -------------------
1056 |             # batch_size = encoder_outputs.size()[1]
1057 |             outputs = self.decoder(encoder_outputs)
1058 | 
1059 |             return outputs
1060 |         
1061 |     import uuid
1062 | 
1063 |     import logging
1064 |     import boto3
1065 |     from botocore.exceptions import ClientError
1066 | 
1067 | 
1068 |     def upload_file_to_s3(file_name, bucket, object_name=None):
1069 |         """Upload a file to an S3 bucket
1070 | 
1071 |         :param file_name: File to upload
1072 |         :param bucket: Bucket to upload to
1073 |         :param object_name: S3 object name. If not specified then file_name is used
1074 |         :return: True if file was uploaded, else False
1075 |         """
1076 | 
1077 |         # If S3 object_name was not specified, use file_name
1078 |         if object_name is None:
1079 |             object_name = file_name
1080 | 
1081 |         # Upload the file
1082 |         s3_client = boto3.client('s3')
1083 |         try:
1084 |             response = s3_client.upload_file(file_name, bucket, object_name)
1085 |         except ClientError as e:
1086 |             logging.error(e)
1087 |             return False
1088 |         return True
1089 | 
1090 |     import boto3
1091 |     import torch
1092 | 
1093 |     import os
1094 |     os.environ["AWS_ACCESS_KEY_ID"] = "*********************"
1095 |     os.environ["AWS_SECRET_ACCESS_KEY"] = "*********************"
1096 | 
1097 |     s3 = boto3.client('s3')
1098 |     bucket_name = model_s3_path.split('s3://')[1].split('/')[0]
1099 |     object_name = model_s3_path.split('s3://' + bucket_name)[1][1:]
1100 |     model_path = 'best_model.pth'
1101 |     s3.download_file(bucket_name, object_name, model_path)
1102 | 
1103 |     model = CTCModel(inner_dim=128, num_chars=len(CHAR2INDEX))
1104 |     device = 'cpu' if not torch.cuda.is_available() else 'cuda'
1105 |     model.load_state_dict(torch.load(model_path, map_location=device))
1106 |     model = model.to(device)
1107 | 
1108 |     def predict(image):
1109 |         # upload image to aws s3
1110 |         filename = str(uuid.uuid4()) + '.png'
1111 |         cv2.imwrite(filename, image)
1112 |         upload_file_to_s3(filename, bucket='ocrpipeline', object_name='data/user_data/images/' + filename)
1113 | 
1114 |         model.eval()
1115 |         batch = [{'image': image, 'label': [1]}]
1116 |         images = collate_wrapper(batch)[0]  # NOTE: collate_wrapper and its helpers must be copied from the training component
1117 | 
1118 |         images = images.to(device)
1119 | 
1120 |         outputs = model(images)
1121 |         outputs = outputs.permute(1, 0, 2)
1122 |         output = outputs[0]
1123 | 
1124 |         out_best = list(torch.argmax(output, -1))  # [2:]
1125 |         out_best = [k for k, g in itertools.groupby(out_best)]
1126 |         pred_text = get_label_from_indices(out_best)
1127 | 
1128 |         return pred_text
1129 |         
1130 | 
1131 |     # import gradio
1132 |     # device = 'cpu' if not torch.cuda.is_available() else 'cuda'
1133 |     # #     model.load_state_dict(torch.load('best_model.pth'))
1134 |     # model = model.to(device)
1135 |     # interface = gradio.Interface(predict, "image", "text")
1136 |     
1137 |     # interface.launch()
1138 |     
1139 |     import os
1140 |     from flask import Flask, flash, request, redirect, url_for
1141 |     from werkzeug.utils import secure_filename
1142 | 
1143 |     UPLOAD_FOLDER = './data/user_data/'
1144 |     ALLOWED_EXTENSIONS = set(['txt', 'pdf', 'png', 'jpg', 'jpeg', 'gif'])
1145 | 
1146 |     app = Flask(__name__)
1147 |     app.config['UPLOAD_FOLDER'] = UPLOAD_FOLDER
1148 |     app.secret_key = "super secret key"
1149 | 
1150 | 
1151 |     @app.route("/predict", methods=['POST'])
1152 |     def upload_file():
1153 |         if request.method == 'POST':
1154 |             # check if the post request has the file part
1155 |             if 'file' not in request.files:
1156 |                 flash('No file part')
1157 |                 return redirect(request.url)
1158 |             file = request.files['file']
1159 |             # if user does not select file, browser also
1160 |             # submit a empty part without filename
1161 |             if file.filename == '':
1162 |                 flash('No selected file')
1163 |                 return redirect(request.url)
1164 |             if file:
1165 |                 print('found file')
1166 |                 filename = secure_filename(file.filename)
1167 |                 filepath = os.path.join(filename)
1168 |                 file.save(filepath)
1169 |                 
1170 |                 image = cv2.imread(filepath)
1171 |                 output = predict(image)
1172 |                 return output
1173 | 
1174 | 
1175 |     app.run(debug=False, threaded=False, host='0.0.0.0', port=5000)
1176 |     
1177 | if __name__ == '__main__':
1178 |     import argparse
1179 |     parser = argparse.ArgumentParser(description='Deployment')
1180 |     parser.add_argument('--model_s3_path', type=str)
1181 | 
1182 |     args = parser.parse_args()
1183 | 
1184 |     deploy_model(model_s3_path=args.model_s3_path)
1185 | ```
1186 | 
1187 | We read the model weights produced by the training component in the previous section and pass their location to `main.py` through the `model_s3_path` argument.
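
The upload handler in `main.py` above defines `ALLOWED_EXTENSIONS` but never checks it, so any uploaded file is passed to `cv2.imread`. A minimal sketch of an extension check follows. `allowed_file` is a hypothetical helper that is not part of the original code, and restricting the set to image extensions is an assumption made here because the model only reads images.

```python
# Hypothetical helper, not part of the original main.py.
# Assumption: only image extensions are accepted, since the file goes to cv2.imread.
ALLOWED_EXTENSIONS = {'png', 'jpg', 'jpeg', 'gif'}


def allowed_file(filename):
    """Return True if the filename has an extension listed in ALLOWED_EXTENSIONS."""
    if '.' not in filename:
        return False
    # Take the text after the last dot and compare case-insensitively.
    extension = filename.rsplit('.', 1)[1].lower()
    return extension in ALLOWED_EXTENSIONS
```

In `upload_file`, such a check could run right after the empty-filename test and return an error response when `allowed_file(file.filename)` is false.
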
1188 | #### Building the Docker image
1189 | The Dockerfile for the deployment component is similar to the one for the training component:
1190 | 
1191 | ```dockerfile
1192 | FROM ubuntu:20.04
1193 | # Install some common dependencies
1194 | ARG DEBIAN_FRONTEND=noninteractive
1195 | RUN apt-get update -qqy && \
1196 |     apt-get install -y \
1197 |       build-essential \
1198 |       cmake \
1199 |       g++ \
1200 |       git \
1201 |       libsm6 \
1202 |       libxrender1 \
1203 |       locales \
1204 |       libssl-dev \
1205 |       pkg-config \
1206 |       poppler-utils \
1207 |       python3-dev \
1208 |       python3-pip \
1209 |       python3.6 \
1210 |       software-properties-common \
1211 |       && \
1212 |     apt-get clean && \
1213 |     apt-get autoremove && \
1214 |     rm -rf /var/lib/apt/lists/*
1215 | 
1216 | # Make `python` and `pip` point to the system python3 and pip3
1217 | RUN ln -sf /usr/bin/python3 /usr/bin/python
1218 | RUN ln -sf /usr/bin/pip3 /usr/bin/pip
1219 | RUN pip3 install -U pip setuptools
1220 | 
1221 | # locale setting
1222 | RUN locale-gen en_US.UTF-8
1223 | ENV LC_ALL=en_US.UTF-8 \
1224 |     LANG=en_US.UTF-8 \
1225 |     LANGUAGE=en_US.UTF-8 \
1226 |     PYTHONIOENCODING=utf-8
1227 | 
1228 | RUN pip3 install torch numpy comet_ml boto3 requests opencv-python
1229 | RUN pip3 install flask
1230 | RUN apt update && apt install ffmpeg libsm6 libxext6  -y
1231 | COPY ./src /src
1232 | ```
1233 | 
1234 | The difference is that we also install `flask` so that we can serve the API.
1235 | #### Writing the build_image.sh script
1236 | As in the previous sections, the `build_image.sh` file looks like this:
1237 | ```shell
1238 | #!/bin/bash
1239 | 
1240 | full_image_name=<username>/ocrdeployment
1241 | 
1242 | docker build -t "${full_image_name}" .
1243 | docker push "$full_image_name"
1244 | ```
1245 | 
1246 | We then run the script to build the Docker image:
1247 | 
1248 | ```
1249 | chmod +x build_image.sh
1250 | ./build_image.sh
1251 | ```
1252 | 
1253 | #### Defining the component configuration
1254 | In this part we use `KFServing` together with Flask to deploy the model, so we reuse the ready-made `KFServing` component configuration available here: https://github.com/kubeflow/pipelines/blob/master/components/kubeflow/kfserving/component.yaml
1255 | 
1256 | We follow the KFServing usage guide at https://github.com/kubeflow/pipelines/tree/master/components/kubeflow/kfserving when connecting the components into a Kubeflow pipeline in the next section.
1257 | 
1258 | ### Connecting the components into a Kubeflow pipeline
1259 | 
1260 | In the previous sections we built the individual components of a Kubeflow pipeline. In this section we connect them to each other, using the Kubeflow Python SDK.
1261 | 
1262 | The contents of `pipeline.py` are as follows:
1263 | ```python
1264 | import logging
1265 | import json
1266 | import kfp
1267 | from kfp.aws import use_aws_secret
1268 | import kfp.dsl as dsl
1269 | from kfp import components
1271 | 
1272 | logging.basicConfig(level=logging.INFO, format='%(name)s - %(levelname)s - %(message)s')
1273 | 
1274 | 
1275 | def create_preprocess_component():
1276 |     component = kfp.components.load_component_from_file('./components/preprocess/component.yaml')
1277 |     return component
1278 | 
1279 | 
1280 | def create_train_component():
1281 |     component = kfp.components.load_component_from_file('./components/train/component.yaml')
1282 |     return component
1283 | 
1284 | 
1285 | def mlpipeline():
1286 |     preprocess_component = create_preprocess_component()
1287 |     preprocess_task = preprocess_component()
1288 | 
1289 |     train_component = create_train_component()
1290 |     train_task = train_component(data_path=preprocess_task.outputs['output_path'])
1291 | 
1292 |     deployment_op = components.load_component_from_url('https://raw.githubusercontent.com/kubeflow/pipelines/master/components/kubeflow/kfserving/component.yaml')
1293 |     container_spec = {"image": "thanhhau097/ocrdeployment", "port":5000, "name": "custom-container", "command": "python3 /src/main.py --model_s3_path {}".format(train_task.outputs['output_path'])}
1294 |     container_spec = json.dumps(container_spec)
1295 |     deployment_op(
1296 |         action='apply',
1297 |         model_name='custom-simple',
1298 |         custom_model_spec=container_spec, 
1299 |         namespace='kubeflow-user',
1300 |         watch_timeout="1800"
1301 |     )
1302 | 
1303 | # build pipeline
1304 | kfp.compiler.Compiler().compile(mlpipeline, 'mlpipeline.yaml')
1305 | ```
1306 | 
1307 | There are three components here:
1308 | - `preprocess`: preprocesses the data
1309 | - `train`: trains the model
1310 | - `deployment`: deploys the model
1311 | 
1312 | We take the data path on `AWS S3` from the preprocessing component and pass it to the training component as follows:
1313 | ```
1314 |     train_task = train_component(data_path=preprocess_task.outputs['output_path'])
1315 | ```
1316 | 
1317 | The path to the model weights is passed to the deployment component in the same way.
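
To illustrate that last step: the deployment component receives its container spec as a JSON string, and the model path is embedded in the container command. The sketch below isolates that logic in plain Python. `build_container_spec` is a hypothetical helper, and `placeholder` is a made-up stand-in for `train_task.outputs['output_path']`, which at compile time is only a placeholder that Kubeflow resolves when the pipeline runs (it appears as `{{inputs.parameters.ocr-train-output_path}}` in the compiled YAML).

```python
import json


def build_container_spec(image, model_s3_path, port=5000):
    """Serialize a custom container spec to the JSON string passed as custom_model_spec."""
    spec = {
        "image": image,
        "port": port,
        "name": "custom-container",
        # The model path becomes a command-line argument of the serving script.
        "command": "python3 /src/main.py --model_s3_path {}".format(model_s3_path),
    }
    return json.dumps(spec)


# Made-up stand-in for train_task.outputs['output_path'] (an assumption for illustration).
placeholder = "{{placeholder-for-train-output}}"
spec_json = build_container_spec("<username>/ocrdeployment", placeholder)
print(spec_json)
```

Keeping the path as an argument of the command means the serving image itself does not need to be rebuilt when a new model is trained.
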
1318 | 
1319 | We run `pipeline.py` to build the Kubeflow pipeline and generate the `mlpipeline.yaml` file:
1320 | 
1321 | ```shell
1322 | python pipeline.py
1323 | ```
1324 | 
1325 | The contents of `mlpipeline.yaml` are as follows:
1326 | ```yaml
1327 | apiVersion: argoproj.io/v1alpha1
1328 | kind: Workflow
1329 | metadata:
1330 |   generateName: mlpipeline-
1331 |   annotations: {pipelines.kubeflow.org/kfp_sdk_version: 1.7.0, pipelines.kubeflow.org/pipeline_compilation_time: '2021-08-13T17:29:40.859670',
1332 |     pipelines.kubeflow.org/pipeline_spec: '{"name": "Mlpipeline"}'}
1333 |   labels: {pipelines.kubeflow.org/kfp_sdk_version: 1.7.0}
1334 | spec:
1335 |   entrypoint: mlpipeline
1336 |   templates:
1337 |   - name: kubeflow-serve-model-using-kfserving
1338 |     container:
1339 |       args:
1340 |       - -u
1341 |       - kfservingdeployer.py
1342 |       - --action
1343 |       - apply
1344 |       - --model-name
1345 |       - custom-simple
1346 |       - --model-uri
1347 |       - ''
1348 |       - --canary-traffic-percent
1349 |       - '100'
1350 |       - --namespace
1351 |       - kubeflow-user
1352 |       - --framework
1353 |       - ''
1354 |       - --custom-model-spec
1355 |       - '{"image": "thanhhau097/ocrdeployment", "port": 5000, "name": "custom-container",
1356 |         "command": "python3 /src/main.py --model_s3_path {{inputs.parameters.ocr-train-output_path}}"}'
1357 |       - --autoscaling-target
1358 |       - '0'
1359 |       - --service-account
1360 |       - ''
1361 |       - --enable-istio-sidecar
1362 |       - "True"
1363 |       - --output-path
1364 |       - /tmp/outputs/InferenceService_Status/data
1365 |       - --inferenceservice-yaml
1366 |       - '{}'
1367 |       - --watch-timeout
1368 |       - '1800'
1369 |       - --min-replicas
1370 |       - '-1'
1371 |       - --max-replicas
1372 |       - '-1'
1373 |       - --request-timeout
1374 |       - '60'
1375 |       command: [python]
1376 |       image: quay.io/aipipeline/kfserving-component:v0.5.1
1377 |     inputs:
1378 |       parameters:
1379 |       - {name: ocr-train-output_path}
1380 |     outputs:
1381 |       artifacts:
1382 |       - {name: kubeflow-serve-model-using-kfserving-InferenceService-Status, path: /tmp/outputs/InferenceService_Status/data}
1383 |     metadata:
1384 |       labels:
1385 |         pipelines.kubeflow.org/kfp_sdk_version: 1.7.0
1386 |         pipelines.kubeflow.org/pipeline-sdk-type: kfp
1387 |         pipelines.kubeflow.org/enable_caching: "true"
1388 |       annotations: {pipelines.kubeflow.org/component_spec: '{"description": "Serve
1389 |           Models using Kubeflow KFServing", "implementation": {"container": {"args":
1390 |           ["-u", "kfservingdeployer.py", "--action", {"inputValue": "Action"}, "--model-name",
1391 |           {"inputValue": "Model Name"}, "--model-uri", {"inputValue": "Model URI"},
1392 |           "--canary-traffic-percent", {"inputValue": "Canary Traffic Percent"}, "--namespace",
1393 |           {"inputValue": "Namespace"}, "--framework", {"inputValue": "Framework"},
1394 |           "--custom-model-spec", {"inputValue": "Custom Model Spec"}, "--autoscaling-target",
1395 |           {"inputValue": "Autoscaling Target"}, "--service-account", {"inputValue":
1396 |           "Service Account"}, "--enable-istio-sidecar", {"inputValue": "Enable Istio
1397 |           Sidecar"}, "--output-path", {"outputPath": "InferenceService Status"}, "--inferenceservice-yaml",
1398 |           {"inputValue": "InferenceService YAML"}, "--watch-timeout", {"inputValue":
1399 |           "Watch Timeout"}, "--min-replicas", {"inputValue": "Min Replicas"}, "--max-replicas",
1400 |           {"inputValue": "Max Replicas"}, "--request-timeout", {"inputValue": "Request
1401 |           Timeout"}], "command": ["python"], "image": "quay.io/aipipeline/kfserving-component:v0.5.1"}},
1402 |           "inputs": [{"default": "create", "description": "Action to execute on KFServing",
1403 |           "name": "Action", "type": "String"}, {"default": "", "description": "Name
1404 |           to give to the deployed model", "name": "Model Name", "type": "String"},
1405 |           {"default": "", "description": "Path of the S3 or GCS compatible directory
1406 |           containing the model.", "name": "Model URI", "type": "String"}, {"default":
1407 |           "100", "description": "The traffic split percentage between the candidate
1408 |           model and the last ready model", "name": "Canary Traffic Percent", "type":
1409 |           "String"}, {"default": "", "description": "Kubernetes namespace where the
1410 |           KFServing service is deployed.", "name": "Namespace", "type": "String"},
1411 |           {"default": "", "description": "Machine Learning Framework for Model Serving.",
1412 |           "name": "Framework", "type": "String"}, {"default": "{}", "description":
1413 |           "Custom model runtime container spec in JSON", "name": "Custom Model Spec",
1414 |           "type": "String"}, {"default": "0", "description": "Autoscaling Target Number",
1415 |           "name": "Autoscaling Target", "type": "String"}, {"default": "", "description":
1416 |           "ServiceAccount to use to run the InferenceService pod", "name": "Service
1417 |           Account", "type": "String"}, {"default": "True", "description": "Whether
1418 |           to enable istio sidecar injection", "name": "Enable Istio Sidecar", "type":
1419 |           "Bool"}, {"default": "{}", "description": "Raw InferenceService serialized
1420 |           YAML for deployment", "name": "InferenceService YAML", "type": "String"},
1421 |           {"default": "300", "description": "Timeout seconds for watching until InferenceService
1422 |           becomes ready.", "name": "Watch Timeout", "type": "String"}, {"default":
1423 |           "-1", "description": "Minimum number of InferenceService replicas", "name":
1424 |           "Min Replicas", "type": "String"}, {"default": "-1", "description": "Maximum
1425 |           number of InferenceService replicas", "name": "Max Replicas", "type": "String"},
1426 |           {"default": "60", "description": "Specifies the number of seconds to wait
1427 |           before timing out a request to the component.", "name": "Request Timeout",
1428 |           "type": "String"}], "name": "Kubeflow - Serve Model using KFServing", "outputs":
1429 |           [{"description": "Status JSON output of InferenceService", "name": "InferenceService
1430 |           Status", "type": "String"}]}', pipelines.kubeflow.org/component_ref: '{"digest":
1431 |           "84f27d18805744db98e4d08804ea3c0e6ca5daa1a3a90dd057b5323f19d9dd2c", "url":
1432 |           "https://raw.githubusercontent.com/kubeflow/pipelines/master/components/kubeflow/kfserving/component.yaml"}',
1433 |         pipelines.kubeflow.org/arguments.parameters: '{"Action": "apply", "Autoscaling
1434 |           Target": "0", "Canary Traffic Percent": "100", "Custom Model Spec": "{\"image\":
1435 |           \"thanhhau097/ocrdeployment\", \"port\": 5000, \"name\": \"custom-container\",
1436 |           \"command\": \"python3 /src/main.py --model_s3_path {{inputs.parameters.ocr-train-output_path}}\"}",
1437 |           "Enable Istio Sidecar": "True", "Framework": "", "InferenceService YAML":
1438 |           "{}", "Max Replicas": "-1", "Min Replicas": "-1", "Model Name": "custom-simple",
1439 |           "Model URI": "", "Namespace": "kubeflow-user", "Request Timeout": "60",
1440 |           "Service Account": "", "Watch Timeout": "1800"}'}
1441 |   - name: mlpipeline
1442 |     dag:
1443 |       tasks:
1444 |       - name: kubeflow-serve-model-using-kfserving
1445 |         template: kubeflow-serve-model-using-kfserving
1446 |         dependencies: [ocr-train]
1447 |         arguments:
1448 |           parameters:
1449 |           - {name: ocr-train-output_path, value: '{{tasks.ocr-train.outputs.parameters.ocr-train-output_path}}'}
1450 |       - {name: ocr-preprocessing, template: ocr-preprocessing}
1451 |       - name: ocr-train
1452 |         template: ocr-train
1453 |         dependencies: [ocr-preprocessing]
1454 |         arguments:
1455 |           parameters:
1456 |           - {name: ocr-preprocessing-output_path, value: '{{tasks.ocr-preprocessing.outputs.parameters.ocr-preprocessing-output_path}}'}
1457 |   - name: ocr-preprocessing
1458 |     container:
1459 |       args: []
1460 |       command: [python, /src/main.py]
1461 |       image: thanhhau097/ocrpreprocess
1462 |     outputs:
1463 |       parameters:
1464 |       - name: ocr-preprocessing-output_path
1465 |         valueFrom: {path: /output_path.txt}
1466 |       artifacts:
1467 |       - {name: ocr-preprocessing-output_path, path: /output_path.txt}
1468 |     metadata:
1469 |       labels:
1470 |         pipelines.kubeflow.org/kfp_sdk_version: 1.7.0
1471 |         pipelines.kubeflow.org/pipeline-sdk-type: kfp
1472 |         pipelines.kubeflow.org/enable_caching: "true"
1473 |       annotations: {pipelines.kubeflow.org/component_spec: '{"description": "Preprocess
1474 |           OCR model", "implementation": {"container": {"command": ["python", "/src/main.py"],
1475 |           "fileOutputs": {"output_path": "/output_path.txt"}, "image": "thanhhau097/ocrpreprocess"}},
1476 |           "name": "OCR Preprocessing", "outputs": [{"description": "Output Path",
1477 |           "name": "output_path", "type": "String"}]}', pipelines.kubeflow.org/component_ref: '{"digest":
1478 |           "d5922a4c63432cb79f05fc02cd1b6c0f0c881e0e3883573d5cadef1eb379087c", "url":
1479 |           "./components/preprocess/component.yaml"}'}
1480 |   - name: ocr-train
1481 |     container:
1482 |       args: []
1483 |       command: [python, /src/main.py, --data_path, '{{inputs.parameters.ocr-preprocessing-output_path}}']
1484 |       image: thanhhau097/ocrtrain
1485 |     inputs:
1486 |       parameters:
1487 |       - {name: ocr-preprocessing-output_path}
1488 |     outputs:
1489 |       parameters:
1490 |       - name: ocr-train-output_path
1491 |         valueFrom: {path: /output_path.txt}
1492 |       artifacts:
1493 |       - {name: ocr-train-output_path, path: /output_path.txt}
1494 |     metadata:
1495 |       labels:
1496 |         pipelines.kubeflow.org/kfp_sdk_version: 1.7.0
1497 |         pipelines.kubeflow.org/pipeline-sdk-type: kfp
1498 |         pipelines.kubeflow.org/enable_caching: "true"
1499 |       annotations: {pipelines.kubeflow.org/component_spec: '{"description": "Training
1500 |           OCR model", "implementation": {"container": {"command": ["python", "/src/main.py",
1501 |           "--data_path", {"inputValue": "data_path"}], "fileOutputs": {"output_path":
1502 |           "/output_path.txt"}, "image": "thanhhau097/ocrtrain"}}, "inputs": [{"description":
1503 |           "s3 path to data", "name": "data_path"}], "name": "OCR Train", "outputs":
1504 |           [{"description": "Output Path", "name": "output_path", "type": "String"}]}',
1505 |         pipelines.kubeflow.org/component_ref: '{"digest": "382c44ee886fc59d95f01989ba7765df9ec2321b7c706a90a6f9c76f70cf2ab0",
1506 |           "url": "./components/train/component.yaml"}', pipelines.kubeflow.org/arguments.parameters: '{"data_path":
1507 |           "{{inputs.parameters.ocr-preprocessing-output_path}}"}'}
1508 |   arguments:
1509 |     parameters: []
1510 |   serviceAccountName: pipeline-runner
1511 | ```
1512 | 
1513 | To use this pipeline, open the Kubeflow UI and choose:
1514 | ```
1515 | Pipelines -> Upload pipeline -> fill in the required information and choose `Upload a file` -> Create
1516 | ```
1517 | This gives us the pipeline shown in the figure at the beginning of this article.
1518 | 
1519 | To run the pipeline, we need to create an `Experiment` by choosing:
1520 | ```
1521 | Experiments -> Create Experiment -> enter the experiment name and click Next -> choose the pipeline, name the run, select One-off to run it once -> Start
1522 | ```
1522 | ```
1523 | 
1524 | We can now run the complete Kubeflow pipeline. To inspect a run, go to `Runs`, select the experiment, and then select the component you want to look at. The `Logs` tab shows the logs of the running process, and the `Input/Output` tab shows its inputs and outputs.
1525 | 
1526 | Once the pipeline has finished, the model is deployed. We use Kubernetes `port forwarding` to reach the API:
1527 | ```
1528 | kubectl port-forward --namespace kubeflow-user  $(kubectl get pod --namespace kubeflow-user --output jsonpath='{.items[0].metadata.name}') 8080:5000
1529 | ```
1530 | 
1531 | Here we forward port 5000, which the Flask server listens on, to port 8080 on localhost, and then send a request to the API:
1532 | 
1533 | ```
1534 | curl -i -v -H "Host: custom-simple.kubeflow-user.example.com" -X POST "http://localhost:8080/predict" -F file=@Downloads/plane.jpg
1535 | ```
1536 | 
1537 | where `Downloads/plane.jpg` is the path to the image we want to test the model with.
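
The same request can be sent from Python. The sketch below uses only the standard library and assumes the port-forward above is still running. `build_multipart_body` and `predict_remote` are hypothetical helpers, not part of the repository. The image is encoded as `multipart/form-data` under the field name `file`, which is the field the Flask handler reads.

```python
import os
import urllib.request
import uuid


def build_multipart_body(field_name, filename, file_bytes):
    """Encode one file as a multipart/form-data body; return (body, content_type)."""
    boundary = uuid.uuid4().hex
    head = (
        "--{b}\r\n"
        'Content-Disposition: form-data; name="{n}"; filename="{f}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).format(b=boundary, n=field_name, f=filename).encode("utf-8")
    tail = "\r\n--{b}--\r\n".format(b=boundary).encode("utf-8")
    content_type = "multipart/form-data; boundary={}".format(boundary)
    return head + file_bytes + tail, content_type


def predict_remote(image_path, url="http://localhost:8080/predict"):
    """POST the image to the forwarded Flask endpoint and return the response text."""
    with open(image_path, "rb") as f:
        body, content_type = build_multipart_body(
            "file", os.path.basename(image_path), f.read()
        )
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Calling `predict_remote('Downloads/plane.jpg')` should return the text predicted by the model, provided the service is reachable at `http://localhost:8080`.
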
1538 | 
1539 | To monitor CPU usage, memory usage, and other metrics related to the deployed model, we can open Grafana with the following command:
1540 | ```
1541 | kubectl port-forward --namespace knative-monitoring $(kubectl get pod --namespace knative-monitoring --selector="app=grafana" --output jsonpath='{.items[0].metadata.name}') 8081:3000
1542 | ```
1543 | 
1544 | Then open http://localhost:8081 to view and analyze the metrics.
1545 | 
1546 | ## Summary
1547 | In this article we built a machine learning training system using Kubeflow. The system can continuously pull updated data from Label Studio and train the model on that data. We can also schedule experiments to run automatically by choosing the `Recurring` run type when starting the pipeline.
1548 | 
1549 | Previous: [Deploying the machine learning model](../deployment/index.md)
1550 | 
1551 | Next: [Connecting the components of the machine learning system](../pipeline/index.md)
1552 | 
1553 | [Back to Home](../index.md)
1554 | 
1555 | 
1556 | <script src="https://utteranc.es/client.js"
1557 |         repo="thanhhau097/mlpipeline"
1558 |         issue-term="pathname"
1559 |         label="Comment"
1560 |         theme="github-light"
1561 |         crossorigin="anonymous"
1562 |         async>
1563 | </script>


--------------------------------------------------------------------------------