├── .gitignore
├── README.md
├── AMP
│   ├── README.md
│   ├── main.py
│   └── net.py
├── DDP
│   ├── readme.md
│   └── ddp.py
├── ModelConver
│   ├── imgs
│   │   ├── mnn.jpg
│   │   ├── ncnn.jpeg
│   │   └── process.svg
│   ├── readme.md
│   ├── Pytorch->ONNX.md
│   ├── ONNX->MNN.md
│   └── ONNX->NCNN.md
└── TensorRT
    ├── readme.md
    ├── main.py
    ├── trt_com.py
    ├── lenet.py
    └── imgs
        ├── build.svg
        └── infer.svg
/.gitignore:
--------------------------------------------------------------------------------
*.code-workspace
.DS_Store
--------------------------------------------------------------------------------
/ModelConver/imgs/mnn.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/bobo0810/PytorchExample/HEAD/ModelConver/imgs/mnn.jpg
--------------------------------------------------------------------------------
/ModelConver/imgs/ncnn.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/bobo0810/PytorchExample/HEAD/ModelConver/imgs/ncnn.jpeg
--------------------------------------------------------------------------------
/ModelConver/readme.md:
--------------------------------------------------------------------------------
# Mobile Deployment

Using the [RetinaFace face-detection library](https://github.com/biubug6/Face-Detector-1MB-with-landmark) as an example. NCNN and MNN are the most widely used mobile inference frameworks.

### Examples

- [Pytorch->ONNX](Pytorch->ONNX.md)

- [ONNX->NCNN](ONNX->NCNN.md)

- [ONNX->MNN](ONNX->MNN.md)

### Pipeline

![pipeline](imgs/process.svg)
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# PyTorch Minimal Examples

#### Included in [PytorchNetHub](https://github.com/bobo0810/PytorchNetHub)

### [AMP](./AMP/README.md)

- Automatic mixed-precision training

### [DDP](./DDP/readme.md)

- Distributed data parallel (multi-node, multi-GPU)

### [MNN/NCNN Deployment](./ModelConver/readme.md)

- Pytorch->ONNX->NCNN/MNN

### [TensorRT Deployment](./TensorRT/readme.md)

- TensorRT API
- Pytorch->ONNX->TensorRT
--------------------------------------------------------------------------------
/AMP/net.py:
--------------------------------------------------------------------------------
from torch.cuda.amp import autocast
import torch.nn as nn


class MyNet(nn.Module):
    '''
    Custom network
    '''
    def __init__(self, use_amp=False):
        '''
        :param use_amp: True enables mixed-precision training
        '''
        super(MyNet, self).__init__()
        self.use_amp = use_amp

    def forward(self, input):
        if self.use_amp:
            # Enable automatic mixed precision
            with autocast():
                return self.forward_calculation(input)
        else:
            return self.forward_calculation(input)

    def forward_calculation(self, input):
        # Template: the actual forward computation goes here and produces `feature`
        ...
        ...
        return feature
--------------------------------------------------------------------------------
/AMP/README.md:
--------------------------------------------------------------------------------
# AMP: Automatic Mixed Precision

## Notes
- Benefits: faster training with lower memory use, which allows a larger batch size
- Training example: code combining DataParallel with gradient accumulation

## Caveats
- Models saved under AMP are still FP32
- Under AMP the model keeps two copies of the weights:

  the FP16 weights are used for the forward/backward computation (to speed up training), and the parameter updates are applied to the FP32 master weights
- For faster inference, and if the accuracy loss is acceptable, manually call half() on both the image and the model to cast them to FP16; inference is then GPU-only (see the sketch below)
- [Inference issue](https://github.com/jefflomax/pytorch-fizzbuzz-amp/issues/1#issuecomment-719125063)

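A minimal FP16 inference sketch, assuming a trained FP32 `model` and a preprocessed input tensor `img` (both names are placeholders):

```python
import torch

# Cast both the model and the input to FP16; inference must then run on GPU
model = model.half().to("cuda:0").eval()
img = img.half().to("cuda:0")
with torch.no_grad():
    out = model(img)  # FP16 forward pass
```
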
## Environment

| Python | PyTorch | OS |
|--------|---------|--------|
| 3.6 | >=1.6.0 | Ubuntu |


## References
[PyTorch docs: AMP examples](https://pytorch.org/docs/stable/notes/amp_examples.html)

[Automatic Mixed Precision recipe](https://pytorch.org/tutorials/recipes/recipes/amp_recipe.html#advanced-topics)

[Mixed-precision acceleration with Apex](https://zhuanlan.zhihu.com/p/79887894)

[Paper notes: Mixed Precision Training](https://zhuanlan.zhihu.com/p/163493798)
--------------------------------------------------------------------------------
/ModelConver/Pytorch->ONNX.md:
--------------------------------------------------------------------------------
## Pytorch->ONNX

Example repo: [Face-Detector-1MB-with-landmark](https://github.com/biubug6/Face-Detector-1MB-with-landmark)

1. Verify the outputs

In convert_to_onnx.py:

```python
# The RetinaFace network has three outputs: bbox, class confidence, and landmarks.
# Change output_names = ["output0"] to:
output_names = ["bbox", "prob", "landmark"]
```

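For context, the export call then looks roughly like the sketch below; `net`, `dummy_input`, and the input name are placeholders for the values actually used in convert_to_onnx.py:

```python
import torch

torch.onnx.export(net,                       # model to export
                  dummy_input,               # example input that fixes the shapes
                  "faceDetector.onnx",       # output file
                  input_names=["input0"],    # placeholder input name
                  output_names=["bbox", "prob", "landmark"])  # the three renamed outputs
```
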
2. Convert to ONNX

Note: opset_version=11 combined with the ONNX slimming step below leads to abnormal inference results.

```shell
# generates faceDetector.onnx
python convert_to_onnx.py --trained_model ./weights/mobilenet0.25_Final.pth --network mobile0.25
```

3. Slim down the ONNX model

```shell
# install onnx-simplifier
pip3 install -U pip && pip3 install onnx-simplifier
# generates faceDetector_sim.onnx
python3 -m onnxsim faceDetector.onnx faceDetector_sim.onnx
```

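onnx-simplifier can also be driven from Python, which is convenient inside a conversion script; a short sketch:

```python
import onnx
from onnxsim import simplify

model = onnx.load("faceDetector.onnx")
model_simp, check = simplify(model)  # check is True if the simplified model validates
assert check, "Simplified ONNX model could not be validated"
onnx.save(model_simp, "faceDetector_sim.onnx")
```
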
**References**

[onnx-simplifier](https://github.com/daquexian/onnx-simplifier)

--------------------------------------------------------------------------------
/TensorRT/readme.md:
--------------------------------------------------------------------------------

# TensorRT Best Practices


# Examples
- TensorRT API
  - [Minimal example](./lenet.py)
  - See [TensorRTx](https://github.com/wang-xinyu/tensorrtx) for more

- Parsing ONNX
  - [Fixed input shape](./main.py)
  - Dynamic shapes: to be added


# Overall Pipeline
## 1. Build the engine
![build](imgs/build.svg)
## 2. Inference
![infer](imgs/infer.svg)



## Third-Party Libraries
- [torch2trt](https://github.com/NVIDIA-AI-IOT/torch2trt)
- [TRTorch](https://github.com/NVIDIA/TRTorch)
> These convert Torch models to TRT directly, but support few operators and are not general-purpose; a usage sketch follows below.
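
For illustration, torch2trt's basic usage as documented in its README (a sketch; requires a CUDA GPU and the torch2trt package):

```python
import torch
from torch2trt import torch2trt
from torchvision.models.alexnet import alexnet

# create a regular PyTorch model and an example input
model = alexnet(pretrained=True).eval().cuda()
x = torch.ones((1, 3, 224, 224)).cuda()

# convert to a TensorRT-backed module and call it like the original model
model_trt = torch2trt(model, [x])
y_trt = model_trt(x)
```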

## References

- [TensorRT deployment](http://zengzeyu.com/2020/07/09/tensorrt_01_installation/)
- [Common TensorRT deployment errors](https://blog.csdn.net/QFJIZHI/article/details/107335865)
- [Accelerating PyTorch with TensorRT](https://blog.csdn.net/leviopku/article/details/112963733)
- [TensorRTx](https://github.com/wang-xinyu/tensorrtx)
- [TensorRT: Deep Learning Inference Acceleration](https://www.nvidia.cn/content/dam/en-zz/zh_cn/assets/webinars/oct16/Gary_TensorRT_GTCChina2019.pdf)
--------------------------------------------------------------------------------
/DDP/readme.md:
--------------------------------------------------------------------------------
# DistributedDataParallel

## Overview
- A minimal implementation of distributed data parallel (DDP)
- Works for single-node multi-GPU and multi-node multi-GPU training

## Usage

### Template
```shell
# Training starts only after every node has executed the command
python ddp.py --nodes NUM_NODES --gpus GPUS_PER_NODE --nr NODE_RANK --ip MASTER_NODE_IP
```

### Single node, multiple GPUs
Node ip=192.168.3.8
```shell
CUDA_VISIBLE_DEVICES=0,1 python ddp.py --nodes 1 --gpus 2 --nr 0 --ip 192.168.3.8
```

### Multiple nodes, multiple GPUs
Master node ip=192.168.3.8
```shell
# master node
CUDA_VISIBLE_DEVICES=0,1 python ddp.py --nodes 2 --gpus 2 --nr 0 --ip 192.168.3.8
# worker node
CUDA_VISIBLE_DEVICES=0,1 python ddp.py --nodes 2 --gpus 2 --nr 1 --ip 192.168.3.8
```


## Common Issues

1. batch_size

> effective batch = per-GPU batch * total number of GPUs

2. Validation and saving

> Validation: make sure different processes write logs with different names, and visualize only the rank=0 log.
> Saving: save the model only on rank=0 (a sketch follows below).

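A minimal sketch of the rank-aware pattern (`rank` and `model` come from the training function, as in ddp.py):

```python
# each process writes its own log file
log_path = "train_rank{}.log".format(rank)

# only the master process saves the checkpoint
if rank == 0:
    torch.save(model.module.state_dict(), "ddp.pth")
```
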
3. Data loading

- If the DataLoader reads from LMDB and you hit the following error:

```python
TypeError: can't pickle Environment objects
```

> Fix: set num_workers=0 in the DataLoader.

- If the DataLoader reads data some other way and you hit the following error:

```python
AttributeError: Can't pickle local object 'DataLoader.__init__.<locals>.<lambda>'
```

> Fix: replace lambda x: Image.fromarray(x) with Image.fromarray, as in the sketch below.

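A concrete sketch of the lambda fix in a torchvision transform pipeline (the transform chain itself is illustrative):

```python
from PIL import Image
from torchvision import transforms

# Fails with num_workers > 0: an inline lambda cannot be pickled
# transform = transforms.Compose([transforms.Lambda(lambda x: Image.fromarray(x)),
#                                 transforms.ToTensor()])

# Works: pass the named function object directly
transform = transforms.Compose([transforms.Lambda(Image.fromarray),
                                transforms.ToTensor()])
```
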
4. Synchronized BatchNorm

```python
# only supported under DDP
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
```

# References
[Distributed data parallel training in Pytorch](https://yangkky.github.io/2019/07/08/distributed-pytorch-tutorial.html) (recommended!)

[distributed_tutorial](https://github.com/yangkky/distributed_tutorial/blob/master/src/mnist-distributed.py)

[A concise tutorial on PyTorch distributed training](https://zhuanlan.zhihu.com/p/113694038)

[PyTorch distributed training](https://zhuanlan.zhihu.com/p/76638962)

[discuss.pytorch](https://discuss.pytorch.org/t/cant-pickle-local-object-dataloader-init-locals-lambda/31857)
--------------------------------------------------------------------------------
/ModelConver/ONNX->MNN.md:
--------------------------------------------------------------------------------
## ONNX->MNN

### Building MNN

1. Install Homebrew

```shell
# macOS 10.15.7
/bin/zsh -c "$(curl -fsSL https://gitee.com/cunkai/HomebrewCN/raw/master/Homebrew.sh)"
```

2. Build MNN

[Official build guide](https://www.yuque.com/mnn/cn/demo_project)

```shell
cd   # enter the MNN root directory
# generate schema (optional)
cd schema && ./generate.sh

# build
cd   # back to the MNN root
mkdir build && cd build
# enable building the demos and the model converter
# https://www.yuque.com/mnn/cn/cmake_opts (prefix each CMake option with the letter D)
cmake -DMNN_BUILD_DEMO=ON -DMNN_BUILD_CONVERTER=ON ..
make -j8
```

### Conversion

1. Convert to mnn

[Official conversion guide](https://www.yuque.com/mnn/cn/model_convert)

```shell
cd /build/
# generates retinaface.mnn
./MNNConvert -f ONNX --modelFile faceDetector_sim.onnx --MNNModel retinaface.mnn --bizCode biz
```

### C++ Inference

1. Verify the outputs

(1) Download the [RetinaFace_MNN](https://github.com/ItchyHiker/RetinaFace_MNN) inference project

(2) Modify the code

- `CMakeLists.txt`

```cmake
# OpenCV path: installed to /usr/local/Cellar/opencv by default
# MNN path: change to your local MNN root directory
set(OpenCV_DIR /usr/local/Cellar/opencv/4.5.2/lib)
set(OpenCV_INCLUDE_DIRS /usr/local/Cellar/opencv/4.5.2/include/opencv4/)
set(MNN_DIR /build/libMNN.dylib)
set(MNN_INCLUDE_DIRS /include)
```

- `main.cpp`: update the model and test-image paths on lines 12 and 13
- `retinaface.cpp`: change the keys on lines 26~29 to "input0", "prob", "bbox", and "landmark" respectively, matching the keys used before conversion.

(3) Optional

- anchor ratios: `retinaface.cpp`, lines 128, 130, 132
- image size: `main.cpp`, line 15

2. Build the project

```shell
cd   # enter the inference project root directory
mkdir -p build
cd build
cmake .. # generate the Makefile
make -j4 # build according to the Makefile
# produces the executable RetinaFace
# verify
./RetinaFace
```

![result](imgs/mnn.jpg)

**References**

[MNN](https://github.com/alibaba/MNN)

[MNN docs](https://www.yuque.com/mnn/cn/cmake_opts)

--------------------------------------------------------------------------------
/AMP/main.py:
--------------------------------------------------------------------------------
from torch.cuda.amp import autocast
from torch.cuda.amp import GradScaler
import torch
from net import MyNet

# NOTE: this file is a training template; optimizer, dataloader_train,
# loss_function, resume_train and checkpoint are assumed to be defined elsewhere.

def start_train():
    '''
    Training
    '''
    use_amp = True
    # Run forward/backward N times before each parameter update.
    # Purpose: enlarge the batch (effective batch = batch_size * N)
    iter_size = 8

    myNet = MyNet(use_amp).to("cuda:0")
    myNet = torch.nn.DataParallel(myNet, device_ids=[0, 1])  # data parallelism
    myNet.train()
    # Initialize the gradient scaler before training starts
    scaler = GradScaler() if use_amp else None

    # Load pretrained weights
    if resume_train:
        scaler.load_state_dict(checkpoint['scaler'])  # needed for AMP
        optimizer.load_state_dict(checkpoint['optimizer'])
        myNet.load_state_dict(checkpoint['model'])


    for epoch in range(1, 100):
        for batch_idx, (input, target) in enumerate(dataloader_train):

            # Move the data to the primary GPU of the data-parallel model
            input = input.to("cuda:0")
            target = target.to("cuda:0")

            # Automatic mixed-precision training
            if use_amp:
                # Autocast automatically casts supported ops to FP16
                with autocast():
                    # extract features
                    feature = myNet(input)
                    losses = loss_function(target, feature)
                    loss = losses / iter_size
                scaler.scale(loss).backward()
            else:
                feature = myNet(input)
                losses = loss_function(target, feature)
                loss = losses / iter_size
                loss.backward()

            # Update parameters only after accumulating gradients
            if (batch_idx + 1) % iter_size == 0:
                # gradient update
                if use_amp:
                    scaler.step(optimizer)
                    scaler.update()
                else:
                    optimizer.step()
                # zero the gradients
                optimizer.zero_grad()
        # The scaler is stateful and must be saved/restored when resuming training
        state = {'model': myNet.state_dict(), 'optimizer': optimizer.state_dict(), 'scaler': scaler.state_dict()}
        torch.save(state, "filename.pth")

def start_test():
    '''
    Testing
    '''
    # Initialize the network and load the pretrained model
    myNet = MyNet().to("cuda:0")
    myNet.eval()
    with torch.no_grad():
        input = input.to("cuda:0")  # input: a preprocessed image tensor, assumed to be prepared elsewhere

        # For faster inference, and if the accuracy loss is acceptable, manually
        # cast both image and model to FP16 with half(); inference is then GPU-only
        # input = input.half()
        # myNet = myNet.half()
        feature = myNet(input)
--------------------------------------------------------------------------------
/TensorRT/main.py:
--------------------------------------------------------------------------------
import onnx
import pycuda.autoinit
import pycuda.driver as cuda
import tensorrt as trt
import torch
import time
import torchvision
import numpy as np
import os
current_path = os.path.abspath(os.path.dirname(__file__))
from trt_com import Torch_to_ONNX, ONNX_to_TensorRT, Init_TensorRT, Do_Inference


batch_size = 3  # fixed batch size, e.g. 1, 6, 8...

class ONNX_Config():
    '''
    ONNX parameters
    '''
    input_size = [batch_size, 3, 224, 224]  # input shape
    device_id = "cuda:0"
    onnx_path = current_path + "/model.onnx"  # where to save the onnx model

class TensorRT_Config():
    '''
    TensorRT parameters
    '''
    output_size = [batch_size, 1000]  # output shape; resnet18 outputs 1000 classes
    fp16_mode = True  # whether to use FP16; depends on the hardware
    trt_path = current_path + "/model_fp16_{}.trt".format(fp16_mode)  # where to save the TRT engine

if __name__ == "__main__":
    # ============1. Pytorch->ONNX============
    onnx_cfg = ONNX_Config()  # ONNX conversion parameters
    device = torch.device(onnx_cfg.device_id)
    # initialize the PyTorch model
    torch_net = torchvision.models.resnet18(pretrained=True).to(device)
    torch_net.eval()
    # convert to an ONNX model
    Torch_to_ONNX(torch_net, onnx_cfg.input_size, onnx_cfg.onnx_path, device)


    # ============2. ONNX->TensorRT============
    trt_cfg = TensorRT_Config()  # TensorRT conversion parameters
    ONNX_to_TensorRT(trt_cfg.fp16_mode, onnx_cfg.onnx_path, trt_cfg.trt_path)


    # ============3. TRT inference============
    img_np_nchw = np.ones(tuple(onnx_cfg.input_size), dtype=np.float32)  # input data

    [context, inputs, outputs, bindings, stream] = Init_TensorRT(trt_cfg.trt_path)  # load the engine
    inputs[0].host = img_np_nchw.reshape(-1)  # bind the input data as a flat numpy array
    # inputs[1].host = ...  # for networks with multiple inputs

    t0 = time.time()
    output = Do_Inference(context, bindings, inputs, outputs, stream)  # list; len=1 if the network has a single output
    t1 = time.time()
    output = output[0].reshape(*trt_cfg.output_size)  # reshape the flat array back to the output shape

    # ============4. Torch inference============
    input = torch.from_numpy(img_np_nchw).to(device)
    t2 = time.time()
    output_torch = torch_net(input)
    t3 = time.time()

    # ============5. Compute the error============
    mse = np.mean((output - output_torch.cpu().detach().numpy()) ** 2)

    print('MSE Error = {}'.format(mse))
    print("Inference time with the TensorRT engine: {}".format(t1 - t0))
    print("Inference time with the PyTorch model: {}".format(t3 - t2))
    print('All completed!')
--------------------------------------------------------------------------------
/ModelConver/ONNX->NCNN.md:
--------------------------------------------------------------------------------
## 1. ONNX->NCNN

Example repo: [Face-Detector-1MB-with-landmark](https://github.com/biubug6/Face-Detector-1MB-with-landmark)

### Building NCNN

1. Install Homebrew

```shell
# macOS 10.15.7
/bin/zsh -c "$(curl -fsSL https://gitee.com/cunkai/HomebrewCN/raw/master/Homebrew.sh)"
```

2. Install third-party dependencies

```shell
brew install cmake
brew install protobuf
brew install opencv # pulls in many dependencies; default install path /usr/local/Cellar/opencv
```

3. Build NCNN

```shell
cd   # enter the ncnn root directory
mkdir -p build
cd build
cmake .. # generate the Makefile
make # build according to the Makefile
make install # creates the install folder
```

### Conversion

1. Convert to ncnn

```shell
cd /build/tools/onnx
./onnx2ncnn faceDetector_sim.onnx face.param face.bin
```

### C++ Inference

1. Verify the outputs

(1) Copy the files from `/build/install` into the `Face-Detector-1MB-with-landmark/Face_Detector_ncnn/ncnn` directory

(2) Move face.param and face.bin into the `Face_Detector_ncnn/model` directory

(3) In `Face_Detector_ncnn/FaceDetector.cpp`, change the keys on lines 53, 56, and 59 to "bbox", "prob", and "landmark" respectively, matching the keys used before conversion.

(4) In `Face_Detector_ncnn/main.cpp`, change false to true if you are using the retinaface model.

(5) Optional

- anchor ratios: `FaceDetector.cpp`, line 202
- image size: `main.cpp`, line 27

2. Build the project

Set the OpenCV path in `Face_Detector_ncnn/CMakeLists.txt`:

```shell
set(OpenCV_DIR "/usr/local/Cellar/opencv/4.5.2/")
```

Building will then fail with the following errors:

```shell
cmake ..
make -j4 # build with 4 threads

# errors: opencv2 not found on the include path
fatal error: 'opencv2/opencv.hpp' file not found
fatal error: 'opencv2/core/core.hpp' file not found

# cause
# the opencv2 include path is /usr/local/Cellar/opencv/4.5.2/include/opencv4/
```

3. Fix

```cmake
# change two paths in CMakeLists.txt
line 19: ${OpenCV_DIR}/include -> /usr/local/Cellar/opencv/4.5.2/include/opencv4/
line 22: ${OpenCV_DIR}/lib -> /usr/local/Cellar/opencv/4.5.2/lib
```

```shell
# build
cmake ..
make -j4
# produces the executable FaceDetector
# verify
./FaceDetector
```

![result](imgs/ncnn.jpeg)



## 2. NCNN Optimization

Purpose: (1) optimize the model by fusing operators (2) convert FP32->FP16

```shell
cd /build/tools/
# flag: 0 for FP32, 1 for FP16
./ncnnoptimize ncnn.param ncnn.bin new.param new.bin flag
```



**References**

[Building NCNN on macOS](https://www.bilibili.com/read/cv10224407/)

[NCNN](https://github.com/Tencent/ncnn)

[The NCNN Optimize tool](https://www.cnblogs.com/wanggangtao/p/11313705.html)

--------------------------------------------------------------------------------
/ModelConver/imgs/process.svg:
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
/DDP/ddp.py:
--------------------------------------------------------------------------------
import os
import argparse
import torch.multiprocessing as mp
import torchvision
import torchvision.transforms as transforms
import torch
import torch.nn as nn
import torch.distributed as dist


class ConvNet(nn.Module):
    def __init__(self, num_classes=10):
        super(ConvNet, self).__init__()
        self.layer1 = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, stride=1, padding=2),
            nn.BatchNorm2d(16),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2))
        self.layer2 = nn.Sequential(
            nn.Conv2d(16, 32, kernel_size=5, stride=1, padding=2),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2))
        self.fc = nn.Linear(7*7*32, num_classes)

    def forward(self, x):
        out = self.layer1(x)
        out = self.layer2(out)
        out = out.reshape(out.size(0), -1)
        out = self.fc(out)
        return out


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--nodes', default=2, type=int)  # number of nodes
    parser.add_argument('--gpus', default=2, type=int)  # number of GPUs per node
    parser.add_argument('--nr', default=0, type=int)  # rank of this node among all nodes
    parser.add_argument('--batch', default=128, type=int)  # total (effective) batch, split evenly across all GPUs
    parser.add_argument('--ip', default=None, type=str)  # master node ip
    parser.add_argument('--savepath', default=None, type=str)  # optional path to pretrained weights

    args = parser.parse_args()
    args.world_size = args.gpus * args.nodes  # world_size, i.e. total processes == total GPUs (one process per GPU)
    os.environ['MASTER_ADDR'] = args.ip  # master node (master process), used by all processes to synchronize gradients
    os.environ['MASTER_PORT'] = '8886'  # port the master process uses for communication; any free port works

    # Each node launches all of its own processes; each runs train(i, args) with i from 0 to args.gpus-1
    # nprocs: number of processes mp.spawn starts
    # args: extra arguments passed to train
    mp.spawn(train, nprocs=args.gpus, args=(args,))


def train(pid, args):
    '''
    Started via mp.spawn; train receives the node-local subprocess id pid plus the extra arguments
    '''
    # One process per GPU, so the node-local subprocess id == the node-local GPU index
    gpu = pid

    # Compute this process's global rank; every process needs the total process count
    # and its own position to know which GPU to use
    # rank=0 is the master process, used for saving the model and printing logs
    rank = args.nr * args.gpus + gpu

    # Initialize the distributed environment
    # env://: initialize from environment variables; requires MASTER_PORT, MASTER_ADDR, WORLD_SIZE, RANK
    dist.init_process_group(backend='nccl',
                            init_method='env://',
                            world_size=args.world_size,
                            rank=rank)

    torch.manual_seed(0)
    model = ConvNet()

    # Load weights
    if args.savepath:
        print('loading weights')
        pass

    # Before wrapping with DDP, enable synchronized BN (convert the network's BatchNorm layers to SyncBatchNorm)
    model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)

    torch.cuda.set_device(gpu)  # the GPU this process is responsible for
    model.cuda(gpu)
    batch_size = int(args.batch / args.world_size)  # per-GPU batch; total effective batch = per-GPU batch * total processes (total GPUs)

    criterion = nn.CrossEntropyLoss().cuda(gpu)
    optimizer = torch.optim.SGD(model.parameters(), 1e-4)

    # Wrap the GPU model as a DDP model
    model = nn.parallel.DistributedDataParallel(model, device_ids=[gpu])

    # Load the data
    train_dataset = torchvision.datasets.MNIST(root='./data',
                                               train=True,
                                               transform=transforms.ToTensor(),
                                               download=True)
    # Sampler: splits the dataset into world_size chunks and feeds a different chunk to each process
    train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset,
                                                                    num_replicas=args.world_size,
                                                                    rank=rank)
    train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
                                               batch_size=batch_size,
                                               shuffle=False,  # ignored under DDP; train_sampler handles shuffling
                                               num_workers=0,  # 0 under DDP, otherwise reading may fail
                                               pin_memory=True,
                                               sampler=train_sampler)  # sampler

    for epoch in range(10):
        # Reshuffle through the sampler each epoch so the data split differs
        train_sampler.set_epoch(epoch)

        for i, (images, labels) in enumerate(train_loader):
            images = images.cuda(non_blocking=True)
            labels = labels.cuda(non_blocking=True)

            outputs = model(images)
            loss = criterion(outputs, labels)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            if (i + 1) % 100 == 0 and gpu == 0:
                print('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}'.format(epoch + 1, 10, i + 1, len(train_loader),
                                                                         loss.item()))

        # ===validation===
        # Make sure each process logs to a differently named file; visualize only the rank=0 log
        # acc=eval()


    # Only the master process saves the model
    if rank == 0:
        # save the underlying module so the weights load without the DDP "module." prefix
        torch.save(model.module.state_dict(), 'ddp.pth')


if __name__ == '__main__':
    main()
--------------------------------------------------------------------------------
/TensorRT/trt_com.py:
--------------------------------------------------------------------------------
import onnx
import onnxruntime
import pycuda.autoinit
import pycuda.driver as cuda
import tensorrt as trt
import torch
import numpy as np
import os
'''
Shared helper code
'''

def Init_TensorRT(trt_path):
    '''
    Initialize the TensorRT engine
    trt_path: path to the trt file
    '''
    # load the cuda engine
    engine = load_engine(trt_path)
    # after creating the CudaEngine, set up an execution context for the device
    context = engine.create_execution_context()
    inputs, outputs, bindings, stream = allocate_buffers(engine)  # input, output: host # bindings
    return [context, inputs, outputs, bindings, stream]

def load_engine(trt_path):
    """
    Load the cuda engine
    trt_path: TensorRT engine file
    """
    TRT_LOGGER = trt.Logger()

    # if a serialized engine already exists, deserialize it directly into a cudaEngine
    if os.path.exists(trt_path):
        print("Reading engine from file: {}".format(trt_path))
        with open(trt_path, 'rb') as f, \
                trt.Runtime(TRT_LOGGER) as runtime:
            return runtime.deserialize_cuda_engine(f.read())  # deserialize
    else:
        print('Not found: ' + trt_path)
        raise FileNotFoundError(trt_path)


def allocate_buffers(engine):
    '''
    Allocate TRT buffers
    '''
    class HostDeviceMem(object):
        def __init__(self, host_mem, device_mem):
            """
            host_mem: cpu memory
            device_mem: gpu memory
            """
            self.host = host_mem  # host data
            self.device = device_mem  # GPU data

        def __str__(self):
            return "Host:\n" + str(self.host) + "\nDevice:\n" + str(self.device)

        def __repr__(self):
            return self.__str__()

    inputs, outputs, bindings = [], [], []
    stream = cuda.Stream()
    for binding in engine:
        # print(binding)  # the bound input/output tensors
        # print(engine.get_binding_shape(binding))  # get_binding_shape returns the binding's shape
        size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size
        # volume computes the number of elements for a shape
        # size = trt.volume(engine.get_binding_shape(binding))  # use this line for a fixed-batch onnx model
        dtype = trt.nptype(engine.get_binding_dtype(binding))
        # get_binding_dtype returns the binding's data type
        # nptype maps it to the equivalent numpy dtype
        # allocate host and device buffers
        host_mem = cuda.pagelocked_empty(size, dtype)  # allocate page-locked host memory
        device_mem = cuda.mem_alloc(host_mem.nbytes)  # allocate GPU memory
        # print(int(device_mem))  # the binding's buffer address in the graph
        bindings.append(int(device_mem))
        # append to the appropriate list
        if engine.binding_is_input(binding):
            inputs.append(HostDeviceMem(host_mem, device_mem))  # bind input
        else:
            outputs.append(HostDeviceMem(host_mem, device_mem))  # bind output
    return inputs, outputs, bindings, stream


def Do_Inference(context, bindings, inputs, outputs, stream):
    '''
    Run inference
    '''
    # htod: host to device; copy the data from the host to the GPU
    [cuda.memcpy_htod_async(inp.device, inp.host, stream) for inp in inputs]

    # Run inference.
    context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
    # dtoh: device to host
    [cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]

    # Synchronize the stream; results are only available afterwards
    stream.synchronize()

    # return the predictions as flat numpy arrays
    return [out.host for out in outputs]


def Torch_to_ONNX(net, input_size, onnx_path, device):
    '''
    torch->onnx (fixed input shape only)
    input_size: input shape, e.g. [N,3,224,224]
    onnx_path: where to save the onnx weights
    device: "cuda:0"
    '''
    net.to(device)
    net.eval()
    # convert to ONNX
    torch.onnx.export(net,  # the network to convert, including its parameters
                      torch.randn(tuple(input_size), device=device),  # dummy input that fixes the input shape and the shape of every node in the graph
                      onnx_path,  # output file path
                      verbose=False,  # whether to print the graph as a string
                      input_names=["input"],
                      output_names=["output"],  # names of the output nodes
                      opset_version=13,  # onnx operator-set version
                      do_constant_folding=True,  # whether to fold constants
                      )


    # validate the model
    net = onnx.load(onnx_path)  # load the onnx graph
    onnx.checker.check_model(net)  # check that the model file is well-formed
    onnx.helper.printable_graph(net.graph)  # printable form of the onnx graph

    # ONNX inference
    session = onnxruntime.InferenceSession(onnx_path)  # create an inference session
    output = session.run(None, {"input": np.random.rand(*input_size).astype('float32')})  # inputs must be numpy arrays

    print('ONNX file in ' + onnx_path)
    print('============Pytorch->ONNX SUCCESS============')


def ONNX_to_TensorRT(fp16_mode=False, onnx_path=None, trt_path=None, max_batch_size=1):
    """
    Build the cudaEngine and save the engine file (fixed input shape only)

    max_batch_size: defaults to 1; dynamic batch is not supported
    fp16_mode: True for fp16 inference
    onnx_path: path of the onnx weights to load
    trt_path: where to save the trt engine file
    """
    # the logger reports errors, warnings, and info messages
    TRT_LOGGER = trt.Logger()

    explicit_batch = 1 << (int)(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)


    with trt.Builder(TRT_LOGGER) as builder, \
            builder.create_network(explicit_batch) as network, \
            trt.OnnxParser(network, TRT_LOGGER) as parser:
        builder.max_workspace_size = 1 << 30  # pre-allocated workspace, i.e. the most GPU memory the ICudaEngine may use at execution time
        builder.max_batch_size = max_batch_size  # largest batch size usable at execution time
        builder.fp16_mode = fp16_mode

        # ########parse the onnx file and populate the graph#########
        if not os.path.exists(onnx_path):
            quit("ONNX file {} not found!".format(onnx_path))
        print('loading onnx file from path {} ...'.format(onnx_path))
        with open(onnx_path, 'rb') as model:
            print("Beginning onnx file parsing")
            parser.parse(model.read())  # the OnnxParser builds the network for the network object and fills in the weights
        print("Completed parsing of onnx file")

        # ########build the engine from the graph#########
        print("Building an engine from file {}; this may take a while...".format(onnx_path))
        output_shape = network.get_layer(network.num_layers - 1).get_output(0).shape  # shape of the last layer's output
        # network.mark_output(network.get_layer(network.num_layers - 1).get_output(0))  # mark the output
        engine = builder.build_cuda_engine(network)  # build the engine
        print("Completed creating Engine")

        # save the engine so it can be loaded directly later
        with open(trt_path, 'wb') as f:
            f.write(engine.serialize())  # serialize

        print('TensorRT file in ' + trt_path)
        print('============ONNX->TensorRT SUCCESS============')
/TensorRT/lenet.py:
--------------------------------------------------------------------------------
'''
Minimal lenet example from https://github.com/wang-xinyu/tensorrtx
'''

import argparse
import os
import struct
import sys

import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda
import tensorrt as trt

INPUT_H = 32  # input height
INPUT_W = 32  # input width
OUTPUT_SIZE = 10  # output shape: 10 classes
INPUT_BLOB_NAME = "data"  # name of the input blob (binary object)
OUTPUT_BLOB_NAME = "prob"  # name of the output blob

weight_path = "./lenet5.wts"  # binary weights
engine_path = "./lenet5.engine"  # where to save the trt engine

gLogger = trt.Logger(trt.Logger.INFO)  # logger reporting errors/warnings/info (Builder/ICudaEngine/Runtime)


def load_weights(file):
    '''Load the binary weight file'''
    print(f"Loading weights: {file}")

    assert os.path.exists(file), 'Unable to load weight file.'

    weight_map = {}
    with open(file, "r") as f:
        lines = [line.strip() for line in f]
    count = int(lines[0])
    assert count == len(lines) - 1
    for i in range(1, count + 1):  # iterate over every line
        splits = lines[i].split(" ")
        name = splits[0]  # first value: layer name
        cur_count = int(splits[1])  # second value: number of parameters on this line
        assert cur_count + 2 == len(splits)
        values = []  # parameters on this line
        for j in range(2, len(splits)):
            # hex string to bytes to float
            values.append(struct.unpack(">f", bytes.fromhex(splits[j])))
        weight_map[name] = np.array(values, dtype=np.float32)

    return weight_map


def createLenetEngine(maxBatchSize, builder, config, dt):
    '''
    Build the network engine
    dt: fp32 or fp16
    '''


    weight_map = load_weights(weight_path)  # load the binary weights
    network = builder.create_network()  # create the network object

    data = network.add_input(INPUT_BLOB_NAME, dt, (1, INPUT_H, INPUT_W))  # declare the network input name and shape
    assert data
    # ============define the network============
    # convolution
    conv1 = network.add_convolution(input=data,  # input tensor
                                    num_output_maps=6,  # output channels
                                    kernel_shape=(5, 5),  # kernel size
                                    kernel=weight_map["conv1.weight"],  # kernel weights [out_channels, in_channels, kernel_height, kernel_width]
                                    bias=weight_map["conv1.bias"])  # bias weights [out_channels]
    assert conv1
    conv1.stride = (1, 1)  # convolution stride

    # activation
    relu1 = network.add_activation(conv1.get_output(0),  # output of the previous conv layer
                                   type=trt.ActivationType.RELU)
    assert relu1

    # pooling
    pool1 = network.add_pooling(input=relu1.get_output(0),  # output of the previous activation
                                window_size=trt.DimsHW(2, 2),  # pooling window size
                                type=trt.PoolingType.AVERAGE)  # average pooling
    assert pool1
    pool1.stride = (2, 2)  # pooling stride

    conv2 = network.add_convolution(pool1.get_output(0), 16, trt.DimsHW(5, 5),
                                    weight_map["conv2.weight"],
                                    weight_map["conv2.bias"])
    assert conv2
    conv2.stride = (1, 1)

    relu2 = network.add_activation(conv2.get_output(0),
                                   type=trt.ActivationType.RELU)
    assert relu2

    pool2 = network.add_pooling(input=relu2.get_output(0),
                                window_size=trt.DimsHW(2, 2),
                                type=trt.PoolingType.AVERAGE)
    assert pool2
    pool2.stride = (2, 2)

    # fully connected layers
    fc1 = network.add_fully_connected(input=pool2.get_output(0),
                                      num_outputs=120,
                                      kernel=weight_map['fc1.weight'],
                                      bias=weight_map['fc1.bias'])
    assert fc1

    relu3 = network.add_activation(fc1.get_output(0),
                                   type=trt.ActivationType.RELU)
    assert relu3

    fc2 = network.add_fully_connected(input=relu3.get_output(0),
                                      num_outputs=84,
                                      kernel=weight_map['fc2.weight'],
                                      bias=weight_map['fc2.bias'])
    assert fc2

    relu4 = network.add_activation(fc2.get_output(0),
                                   type=trt.ActivationType.RELU)
    assert relu4

    fc3 = network.add_fully_connected(input=relu4.get_output(0),
                                      num_outputs=OUTPUT_SIZE,
                                      kernel=weight_map['fc3.weight'],
                                      bias=weight_map['fc3.bias'])
    assert fc3

    prob = network.add_softmax(fc3.get_output(0))  # softmax
    assert prob

    prob.get_output(0).name = OUTPUT_BLOB_NAME  # name the network output so predictions can be fetched by name
    network.mark_output(prob.get_output(0))  # mark the tensor as an output

    # Build engine
    builder.max_batch_size = maxBatchSize
    # builder.max_workspace_size = 1 << 20
    config.max_workspace_size = 1 << 20
    engine = builder.build_engine(network, config)

    del network
    del weight_map

    return engine


def APIToModel(maxBatchSize):
    '''Convert the binary weights into a trt engine'''
    builder = trt.Builder(gLogger)  # builder object
    config = builder.create_builder_config()  # configuration for the builder
    engine = createLenetEngine(maxBatchSize, builder, config, trt.float32)
    assert engine  # the engine must not be None

    # save as a trt engine file
    with open(engine_path, "wb") as f:
        f.write(engine.serialize())

    del engine
    del builder


def doInference(context, host_in, host_out, batchSize):
    '''
    trt inference

    host_in: input data
    host_out: empty numpy array that receives the output
    '''
    engine = context.engine
    assert engine.num_bindings == 2  # number of bound tensors: 1 input + 1 output

    devide_in = cuda.mem_alloc(host_in.nbytes)  # allocate GPU memory for the input; returns a device allocation handle
    devide_out = cuda.mem_alloc(host_out.nbytes)
    bindings = [int(devide_in), int(devide_out)]
    stream = cuda.Stream()  # multiple streams can run in parallel

    cuda.memcpy_htod_async(devide_in, host_in, stream)  # copy host memory to the GPU; htod = host_to_device
    context.execute_async(bindings=bindings, stream_handle=stream.handle)  # run inference asynchronously on the GPU
    cuda.memcpy_dtoh_async(host_out, devide_out, stream)  # copy GPU data back to host memory; dtoh = device_to_host
    stream.synchronize()  # host_out holds the predictions after the stream synchronizes


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument("-s", default=False, action='store_true')
    parser.add_argument("-d", default=False, action='store_true')  # default must be False, otherwise -s ^ -d can never be satisfied
    args = parser.parse_args()

    if not (args.s ^ args.d):
        print("arguments not right!")
        print("python lenet.py -s # serialize model to plan file")  # convert the binary weights into a trt engine
        print("python lenet.py -d # deserialize plan file and run inference")  # load the trt engine and run inference
        sys.exit()

    if args.s:
        APIToModel(1)
    else:
        runtime = trt.Runtime(gLogger)  # create the trt runtime in order to load the engine
        assert runtime

        with open(engine_path, "rb") as f:
            engine = runtime.deserialize_cuda_engine(f.read())  # load the trt engine
        assert engine

        context = engine.create_execution_context()  # create the execution context
        assert context

        data = np.ones((INPUT_H * INPUT_W), dtype=np.float32)  # TRT input is flat [1024]; 1024 = 1*32*32
        host_in = cuda.pagelocked_empty(INPUT_H * INPUT_W, dtype=np.float32)  # page-locked input buffer matching the input's size and dtype
        np.copyto(host_in, data.ravel())  # ravel flattens without copying; copyto fills host_in with the data
        host_out = cuda.pagelocked_empty(OUTPUT_SIZE, dtype=np.float32)  # page-locked output buffer matching the output size
        doInference(context, host_in, host_out, 1)  # host_out holds the result after inference

        print(f'Output: {host_out}')
--------------------------------------------------------------------------------
/TensorRT/imgs/build.svg:
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
/TensorRT/imgs/infer.svg:
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------