├── .gitignore
├── ModelConver
│   ├── imgs
│   │   ├── mnn.jpg
│   │   ├── ncnn.jpeg
│   │   └── process.svg
│   ├── readme.md
│   ├── Pytorch->ONNX.md
│   ├── ONNX->MNN.md
│   └── ONNX->NCNN.md
├── README.md
├── AMP
│   ├── net.py
│   ├── README.md
│   └── main.py
├── TensorRT
│   ├── readme.md
│   ├── main.py
│   ├── trt_com.py
│   ├── lenet.py
│   └── imgs
│       ├── build.svg
│       └── infer.svg
└── DDP
    ├── readme.md
    └── ddp.py

/.gitignore:
--------------------------------------------------------------------------------

*.code-workspace
.DS_Store

--------------------------------------------------------------------------------
/ModelConver/imgs/mnn.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/bobo0810/PytorchExample/HEAD/ModelConver/imgs/mnn.jpg

--------------------------------------------------------------------------------
/ModelConver/imgs/ncnn.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/bobo0810/PytorchExample/HEAD/ModelConver/imgs/ncnn.jpeg

--------------------------------------------------------------------------------
/ModelConver/readme.md:
--------------------------------------------------------------------------------
# Mobile Deployment

Using the face-detection library [RetinaFace](https://github.com/biubug6/Face-Detector-1MB-with-landmark) as the example. NCNN and MNN are the most commonly used mobile inference frameworks.

### Examples

- [Pytorch->ONNX](Pytorch->ONNX.md)

- [ONNX->NCNN](ONNX->NCNN.md)

- [ONNX->MNN](ONNX->MNN.md)

### Pipeline

![avatar](./imgs/process.svg)

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Minimal PyTorch Practices

#### Included in [PytorchNetHub](https://github.com/bobo0810/PytorchNetHub)

### [AMP](./AMP/README.md)

- Automatic mixed precision training

### [DDP](./DDP/readme.md)

- Distributed data parallel (multi-node, multi-GPU)


### [MNN/NCNN deployment](./ModelConver/readme.md)

- Pytorch -> ONNX -> NCNN / MNN

### [TensorRT deployment](./TensorRT/readme.md)

- TensorRT API
- Pytorch -> ONNX -> TensorRT

--------------------------------------------------------------------------------
/AMP/net.py:
--------------------------------------------------------------------------------
from torch.cuda.amp import autocast
import torch.nn as nn

class MyNet(nn.Module):
    '''
    Custom network
    '''
    def __init__(self, use_amp=False):
        '''
        :param use_amp: True enables mixed precision training
        '''
        super(MyNet, self).__init__()
        self.use_amp = use_amp

    def forward(self, input):
        if self.use_amp:
            # enable automatic mixed precision
            with autocast():
                return self.forward_calculation(input)
        else:
            return self.forward_calculation(input)

    def forward_calculation(self, input):
        ...
        ...
        return feature
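
# Why enter autocast() inside forward(): nn.DataParallel runs each replica's
# forward() in its own thread, and autocast state is thread-local, so an
# autocast() context opened in the main training loop would not apply to the
# replicas. Entering it inside forward(), as above, sidesteps that.
# Hypothetical usage (input shape is arbitrary):
#   net = nn.DataParallel(MyNet(use_amp=True).to("cuda:0"), device_ids=[0, 1])
#   feature = net(torch.rand(4, 3, 224, 224, device="cuda:0"))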

--------------------------------------------------------------------------------
/AMP/README.md:
--------------------------------------------------------------------------------
# AMP: Automatic Mixed Precision

## Overview
- Benefits: faster training with lower memory use, which allows a larger batch size
- Training: the example code combines DataParallel with gradient accumulation

## Notes
- A model saved under AMP is still FP32
- Under AMP the model keeps two copies of the weights.

  The FP16 weights are used for the forward/backward computation (speeding up training), while parameter updates are applied to the FP32 weights (the master copy)
- For faster inference, manually call half() on both the image and the model to cast them to FP16 (if the accuracy loss is acceptable); FP16 inference then runs on GPU only
- [Inference issue](https://github.com/jefflomax/pytorch-fizzbuzz-amp/issues/1#issuecomment-719125063)
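
A minimal FP16 inference sketch for the half() note above, assuming `MyNet.forward_calculation` has been filled in; the input shape is arbitrary:

```python
import torch
from net import MyNet

net = MyNet().to("cuda:0").eval()
net.half()  # cast all weights to FP16; inference is GPU-only from here on
x = torch.rand(1, 3, 224, 224, device="cuda:0").half()  # cast the input as well
with torch.no_grad():
    feature = net(x)  # computed in FP16
```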

## Environment

| Python | PyTorch | OS     |
|--------|---------|--------|
| 3.6    | >=1.6.0 | Ubuntu |


## References
[Pytorch_docs](https://pytorch.org/docs/stable/notes/amp_examples.html)

[AUTOMATIC MIXED PRECISION](https://pytorch.org/tutorials/recipes/recipes/amp_recipe.html#advanced-topics)

[Mixed precision acceleration based on Apex](https://zhuanlan.zhihu.com/p/79887894)

[Paper walkthrough: Mixed Precision Training](https://zhuanlan.zhihu.com/p/163493798)

--------------------------------------------------------------------------------
/ModelConver/Pytorch->ONNX.md:
--------------------------------------------------------------------------------
## Pytorch->ONNX

Example repo: [Face-Detector-1MB-with-landmark](https://github.com/biubug6/Face-Detector-1MB-with-landmark)

1. Verify the outputs

   convert_to_onnx.py

   ```python
   # The RetinaFace network has three outputs: bbox, class confidence, landmarks
   # change output_names = ["output0"] to:
   output_names = ["bbox", "prob", "landmark"]
   ```

2. Convert to ONNX

   Note: combining opset_version=11 with ONNX slimming causes abnormal inference results

   ```shell
   # produces faceDetector.onnx
   python convert_to_onnx.py --trained_model ./weights/mobilenet0.25_Final.pth --network mobile0.25
   ```

3. Slim the ONNX model

   ```shell
   # install onnx-simplifier
   pip3 install -U pip && pip3 install onnx-simplifier
   # produces faceDetector_sim.onnx
   python3 -m onnxsim faceDetector.onnx faceDetector_sim.onnx
   ```
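
As a quick sanity check (a sketch, not part of the example repo; the 320x320 input size is an assumption), the slimmed model can be compared against the original with onnxruntime:

```python
import numpy as np
import onnxruntime

x = np.random.rand(1, 3, 320, 320).astype(np.float32)  # assumed input size
names = ["bbox", "prob", "landmark"]

sess_raw = onnxruntime.InferenceSession("faceDetector.onnx")
sess_sim = onnxruntime.InferenceSession("faceDetector_sim.onnx")
feed_name = sess_raw.get_inputs()[0].name

for name, raw, sim in zip(names,
                          sess_raw.run(names, {feed_name: x}),
                          sess_sim.run(names, {feed_name: x})):
    print(name, np.abs(raw - sim).max())  # ~0 if slimming preserved the graph
```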

**References**

[onnx-simplifier](https://github.com/daquexian/onnx-simplifier)

--------------------------------------------------------------------------------
/TensorRT/readme.md:
--------------------------------------------------------------------------------

# TensorRT Best Practices


# Examples
- TensorRT API
  - [Minimal example](./lenet.py)
  - See [TensorRTx](https://github.com/wang-xinyu/tensorrtx) for more

- Parsing ONNX
  - [Fixed input shape](./main.py)
  - Dynamic shapes: to be updated (a rough sketch follows this list)
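
Until that example lands, here is a rough sketch of dynamic shapes via an optimization profile. This is an assumption-level sketch against the TensorRT 7 Python API used elsewhere in this repo; "input" must match the name used at ONNX export, and the ONNX file must have been exported with a dynamic batch axis:

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger()
explicit_batch = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)

with trt.Builder(TRT_LOGGER) as builder, \
        builder.create_network(explicit_batch) as network, \
        trt.OnnxParser(network, TRT_LOGGER) as parser, \
        builder.create_builder_config() as config:
    with open("model.onnx", "rb") as f:
        parser.parse(f.read())
    # min / optimal / max shapes for the input binding
    profile = builder.create_optimization_profile()
    profile.set_shape("input", (1, 3, 224, 224), (8, 3, 224, 224), (16, 3, 224, 224))
    config.add_optimization_profile(profile)
    engine = builder.build_engine(network, config)

# at inference time, fix the actual shape before executing:
# context.set_binding_shape(0, (4, 3, 224, 224))
```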

# Overall Flow
## 1. Build the engine
![avatar](./imgs/build.svg)
## 2. Inference
![avatar](./imgs/infer.svg)



## Third-party libraries
- [torch2trt](https://github.com/NVIDIA-AI-IOT/torch2trt)
- [TRTorch](https://github.com/NVIDIA/TRTorch)
> Convert Torch directly to TRT; supports few operators, not general-purpose.

## References

- [TensorRT deployment](http://zengzeyu.com/2020/07/09/tensorrt_01_installation/)
- [Common TensorRT deployment errors](https://blog.csdn.net/QFJIZHI/article/details/107335865)
- [Accelerating Pytorch with TensorRT](https://blog.csdn.net/leviopku/article/details/112963733)
- [TensorRTx](https://github.com/wang-xinyu/tensorrtx)
- [TensorRT: deep learning inference acceleration](https://www.nvidia.cn/content/dam/en-zz/zh_cn/assets/webinars/oct16/Gary_TensorRT_GTCChina2019.pdf)

--------------------------------------------------------------------------------
/DDP/readme.md:
--------------------------------------------------------------------------------
# DistributedDataParallel

## Overview
- A minimal implementation of distributed data parallel (DDP)
- Works for single-node multi-GPU and multi-node multi-GPU training

## Running

### Template
```PowerShell
# Training starts only once every node has executed its shell command
python ddp.py --nodes <number of nodes> --gpus <GPUs per node> --nr <index of this node> --ip <master node ip>
```

### Single node, multiple GPUs
Node ip=192.168.3.8
```PowerShell
Shell: CUDA_VISIBLE_DEVICES=0,1 python ddp.py --nodes 1 --gpus 2 --nr 0 --ip 192.168.3.8
```

### Multiple nodes, multiple GPUs
Master node ip=192.168.3.8
```PowerShell
Master node shell: CUDA_VISIBLE_DEVICES=0,1 python ddp.py --nodes 2 --gpus 2 --nr 0 --ip 192.168.3.8
Worker node shell: CUDA_VISIBLE_DEVICES=0,1 python ddp.py --nodes 2 --gpus 2 --nr 1 --ip 192.168.3.8
```


## Common issues

1. batch_size

   > Effective batch = per-GPU batch * total number of GPUs

2. Validation and saving

   > Validation: make sure each process writes its log under a different name, and only visualize the rank 0 log.
   > Saving: only save the model from rank 0.

3. Data loading

   - If the DataLoader reads from LMDB and fails with

     ```python
     TypeError: can't pickle Environment objects
     ```

     > Fix: set num_workers=0 in the DataLoader

   - If the DataLoader reads data another way and fails with

     ```python
     AttributeError: Can't pickle local object 'DataLoader.__init__..'
     ```

     > Fix: replace lambda x: Image.fromarray(x) with Image.fromarray (see the sketch after this list)

4. Synchronized BN

   ```python
   # only supported under DDP
   model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
   ```
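
A minimal sketch of the pickling fix from issue 3 above; any named, top-level function works where the lambda fails:

```python
import torchvision.transforms as transforms
from PIL import Image

# Fails with num_workers > 0 under DDP: lambdas cannot be pickled
# transform = transforms.Compose([transforms.Lambda(lambda x: Image.fromarray(x)),
#                                 transforms.ToTensor()])

# Works: pass the (picklable) function object itself
transform = transforms.Compose([transforms.Lambda(Image.fromarray),
                                transforms.ToTensor()])
```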

# References
[Distributed data parallel training in Pytorch](https://yangkky.github.io/2019/07/08/distributed-pytorch-tutorial.html) Recommended!

[distributed_tutorial](https://github.com/yangkky/distributed_tutorial/blob/master/src/mnist-distributed.py)

[A concise tutorial on distributed training with PyTorch](https://zhuanlan.zhihu.com/p/113694038)

[Pytorch distributed training](https://zhuanlan.zhihu.com/p/76638962)

[discuss.pytorch](https://discuss.pytorch.org/t/cant-pickle-local-object-dataloader-init-locals-lambda/31857)

--------------------------------------------------------------------------------
/ModelConver/ONNX->MNN.md:
--------------------------------------------------------------------------------
## ONNX->MNN

### Build MNN

1. Install Homebrew

   ```shell
   # macOS 10.15.7
   /bin/zsh -c "$(curl -fsSL https://gitee.com/cunkai/HomebrewCN/raw/master/Homebrew.sh)"
   ```

2. Build MNN

   [Official build guide](https://www.yuque.com/mnn/cn/demo_project)

   ```shell
   cd   # enter the MNN root path
   # generate schema (optional)
   cd schema && ./generate.sh

   # build
   cd 
   mkdir build && cd build
   # enable building the demos and the model converter
   # https://www.yuque.com/mnn/cn/cmake_opts  prefix every CMake option with the letter D
   cmake -DMNN_BUILD_DEMO=ON -DMNN_BUILD_CONVERTER=ON ..
   make -j8
   ```

### Conversion

1. Convert to mnn

   [Official conversion guide](https://www.yuque.com/mnn/cn/model_convert)

   ```shell
   cd /build/
   # produces retinaface.mnn
   ./MNNConvert -f ONNX --modelFile faceDetector_sim.onnx --MNNModel retinaface.mnn --bizCode biz
   ```
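
Before moving on to C++, the converted model can be smoke-tested from Python if the pymnn bindings are installed (`pip install MNN`). A rough sketch only: the calls follow MNN's legacy Python session API (which may differ by version), and the 320x320 input shape is an assumption:

```python
import MNN
import numpy as np

interpreter = MNN.Interpreter("retinaface.mnn")
session = interpreter.createSession()
input_tensor = interpreter.getSessionInput(session)

data = np.random.rand(1, 3, 320, 320).astype(np.float32)
tmp = MNN.Tensor((1, 3, 320, 320), MNN.Halide_Type_Float,
                 data, MNN.Tensor_DimensionType_Caffe)
input_tensor.copyFrom(tmp)

interpreter.runSession(session)
bbox = interpreter.getSessionOutput(session, "bbox")  # keys set before conversion
print(bbox.getShape())
```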

### C++ Inference

1. Verify the outputs

   (1) Download the [RetinaFace_MNN](https://github.com/ItchyHiker/RetinaFace_MNN) inference project

   (2) Modify the code

   - `CMakeLists.txt`

     ```cmake
     # OpenCV path: installed to /usr/local/Cellar/opencv by default
     # MNN path: change to the local MNN root path
     set(OpenCV_DIR /usr/local/Cellar/opencv/4.5.2/lib)
     set(OpenCV_INCLUDE_DIRS /usr/local/Cellar/opencv/4.5.2/include/opencv4/)
     set(MNN_DIR /build/libMNN.dylib)
     set(MNN_INCLUDE_DIRS /include)
     ```

   - `main.cpp`: update the model and test-image paths on lines 12 and 13
   - `retinaface.cpp`: change the keys on lines 26-29 to "input0", "prob", "bbox", "landmark" so they match the keys used before conversion.

   (3) Optional

   - anchor ratios: lines 128, 130, 132 of `retinaface.cpp`
   - image size: line 15 of `main.cpp`

2. Build the project

   ```shell
   cd   # enter the inference project root
   mkdir -p build
   cd build
   cmake ..  # generate the Makefile
   make -j4  # compile according to the Makefile
   # produces the executable RetinaFace
   # verify
   ./RetinaFace
   ```

![avatar](./imgs/mnn.jpg)

**References**

[MNN](https://github.com/alibaba/MNN)

[MNN docs](https://www.yuque.com/mnn/cn/cmake_opts)

--------------------------------------------------------------------------------
/AMP/main.py:
--------------------------------------------------------------------------------
from torch.cuda.amp import autocast
from torch.cuda.amp import GradScaler
import torch
from net import MyNet

def start_train():
    '''
    Training
    (loss_function and dataloader_train are assumed to be defined elsewhere)
    '''
    use_amp = True
    # Run forward/backward N times before updating the parameters.
    # Goal: a larger effective batch (theoretical batch = batch_size * N)
    iter_size = 8
    resume_train = False  # set True (and load `checkpoint`) to resume training

    myNet = MyNet(use_amp).to("cuda:0")
    myNet = torch.nn.DataParallel(myNet, device_ids=[0, 1])  # data parallelism
    myNet.train()
    optimizer = torch.optim.SGD(myNet.parameters(), lr=0.01)  # any optimizer works
    # initialize the gradient scaler before training starts
    scaler = GradScaler() if use_amp else None

    # load pretrained weights
    if resume_train:
        checkpoint = torch.load("filename.pth")
        scaler.load_state_dict(checkpoint['scaler'])  # required under AMP
        optimizer.load_state_dict(checkpoint['optimizer'])
        myNet.load_state_dict(checkpoint["model"])


    for epoch in range(1, 100):
        for batch_idx, (input, target) in enumerate(dataloader_train):

            # move the data onto the master GPU of the parallel model
            input = input.to("cuda:0")
            target = target.to("cuda:0")

            # automatic mixed precision training
            if use_amp:
                # autocast: ops that support half precision automatically run in FP16
                with autocast():
                    # extract features
                    feature = myNet(input)
                    losses = loss_function(target, feature)
                    loss = losses / iter_size
                # backward runs outside the autocast block, on the scaled loss
                scaler.scale(loss).backward()
            else:
                feature = myNet(input)
                losses = loss_function(target, feature)
                loss = losses / iter_size
                loss.backward()

            # update parameters after accumulating gradients
            if (batch_idx + 1) % iter_size == 0:
                # gradient update
                if use_amp:
                    scaler.step(optimizer)
                    scaler.update()
                else:
                    optimizer.step()
                # zero the gradients
                optimizer.zero_grad()
                # the scaler is stateful; it must be restored when resuming training
                state = {'model': myNet.state_dict(), 'optimizer': optimizer.state_dict(), 'scaler': scaler.state_dict()}
                torch.save(state, "filename.pth")

def start_test():
    '''
    Testing
    (input is assumed to be a preprocessed tensor)
    '''
    # initialize the network and load the pretrained model
    myNet = MyNet().to("cuda:0")
    myNet.eval()
    with torch.no_grad():
        input = input.to("cuda:0")

        # For faster inference, manually half() the image and model to FP16
        # (if the accuracy loss is acceptable); inference then runs on GPU only
        # input = input.half()
        # myNet = myNet.half()
        feature = myNet(input)




--------------------------------------------------------------------------------
/TensorRT/main.py:
--------------------------------------------------------------------------------
import onnx
import pycuda.autoinit
import pycuda.driver as cuda
import tensorrt as trt
import torch
import time
import torchvision
import numpy as np
import os
current_path = os.path.abspath(os.path.dirname(__file__))
from trt_com import Torch_to_ONNX, ONNX_to_TensorRT, Init_TensorRT, Do_Inference



batch_size = 3  # fixed batch size, e.g. 1, 6, 8...

class ONNX_Config():
    '''
    ONNX parameters
    '''
    input_size = [batch_size, 3, 224, 224]  # input shape
    device_id = "cuda:0"
    onnx_path = current_path + "/model.onnx"  # where to save the onnx model

class TensorRT_Config():
    '''
    TensorRT parameters
    '''
    output_size = [batch_size, 1000]  # output shape; resnet18 outputs 1000 classes
    fp16_mode = True  # whether to use FP16 (depends on the hardware)
    trt_path = current_path + "/model_fp16_{}.trt".format(fp16_mode)  # where to save the TRT engine

if __name__ == "__main__":
    # ============ 1. Pytorch -> ONNX ============
    onnx_cfg = ONNX_Config()  # ONNX conversion parameters
    device = torch.device(onnx_cfg.device_id)
    # initialize the Pytorch model
    torch_net = torchvision.models.resnet18(pretrained=True).to(device)
    torch_net.eval()
    # convert to an ONNX model
    Torch_to_ONNX(torch_net, onnx_cfg.input_size, onnx_cfg.onnx_path, device)


    # ============ 2. ONNX -> TensorRT ============
    trt_cfg = TensorRT_Config()  # TensorRT conversion parameters
    ONNX_to_TensorRT(trt_cfg.fp16_mode, onnx_cfg.onnx_path, trt_cfg.trt_path)


    # ============ 3. TRT inference ============
    img_np_nchw = np.ones(tuple(onnx_cfg.input_size), dtype=np.float32)  # input data

    [context, inputs, outputs, bindings, stream] = Init_TensorRT(trt_cfg.trt_path)  # load the engine
    inputs[0].host = img_np_nchw.reshape(-1)  # bind the input data as a flat numpy array
    # inputs[1].host = ...  for multiple inputs

    t0 = time.time()
    output = Do_Inference(context, bindings, inputs, outputs, stream)  # list; len=1 if the network has a single output
    t1 = time.time()
    output = output[0].reshape(*trt_cfg.output_size)  # reshape the flat array back to the output shape

    # ============ 4. Torch inference ============
    input = torch.from_numpy(img_np_nchw).to(device)
    t2 = time.time()
    output_torch = torch_net(input)
    t3 = time.time()

    # ============ 5. Compute the error ============
    mse = np.mean((output - output_torch.cpu().detach().numpy()) ** 2)

    print('MSE Error = {}'.format(mse))
    print("Inference time with the TensorRT engine: {}".format(t1 - t0))
    print("Inference time with the PyTorch model: {}".format(t3 - t2))
    print('All completed!')

--------------------------------------------------------------------------------
/ModelConver/ONNX->NCNN.md:
--------------------------------------------------------------------------------
## 1. ONNX->NCNN

Example repo: [Face-Detector-1MB-with-landmark](https://github.com/biubug6/Face-Detector-1MB-with-landmark)

### Build NCNN

1. Install Homebrew

   ```shell
   # macOS 10.15.7
   /bin/zsh -c "$(curl -fsSL https://gitee.com/cunkai/HomebrewCN/raw/master/Homebrew.sh)"
   ```

2. Install third-party dependencies

   ```shell
   brew install cmake
   brew install protobuf
   brew install opencv  # pulls in many dependencies; default install path /usr/local/Cellar/opencv
   ```

3. Build NCNN

   ```shell
   cd   # enter the ncnn root path
   mkdir -p build
   cd build
   cmake ..      # generate the Makefile
   make          # compile according to the Makefile
   make install  # produces the install folder
   ```

### Conversion

1. Convert to ncnn

   ```shell
   cd /build/tools/onnx
   ./onnx2ncnn faceDetector_sim.onnx face.param face.bin
   ```

### C++ Inference

1. Verify the outputs

   (1) Copy the files from `/build/install` into the `Face-Detector-1MB-with-landmark/Face_Detector_ncnn/ncnn` directory

   (2) Move face.param and face.bin into the `Face_Detector_ncnn/model` directory

   (3) In `Face_Detector_ncnn/FaceDetector.cpp`, change the keys on lines 53, 56, 59 to "bbox", "prob", "landmark" so they match the keys used before conversion.

   (4) In `Face_Detector_ncnn/main.cpp`, change false to true if using the retinaface model.

   (5) Optional

   - anchor ratios: line 202 of `FaceDetector.cpp`
   - image size: line 27 of `main.cpp`

2. Build the project

   Set the opencv path in `Face_Detector_ncnn/CMakeLists.txt`

   ```shell
   set(OpenCV_DIR "/usr/local/Cellar/opencv/4.5.2/")
   ```

   Compiling will then fail with the following errors

   ```shell
   cmake ..
   make -j4  # compile with 4 threads

   # errors: opencv2 not found on the include path
   fatal error: 'opencv2/opencv.hpp' file not found
   fatal error: 'opencv2/core/core.hpp' file not found

   # cause
   # the opencv2 include path is /usr/local/Cellar/opencv/4.5.2/include/opencv4/
   ```

3. Fix

   ```cmake
   # change two paths in the CMakeLists file
   line 19: ${OpenCV_DIR}/include = /usr/local/Cellar/opencv/4.5.2/include/opencv4/
   line 22: ${OpenCV_DIR}/lib = /usr/local/Cellar/opencv/4.5.2/lib
   ```

   ```shell
   # build
   cmake ..
   make -j4
   # produces the executable FaceDetector
   # verify
   ./FaceDetector
   ```

![avatar](./imgs/ncnn.jpeg)



## 2. NCNN Optimization

Purpose: (1) optimize the model by fusing operators; (2) convert FP32 -> FP16

```shell
cd /build/tools/
# flag: 0 for FP32, 1 for FP16
./ncnnoptimize ncnn.param ncnn.bin new.param new.bin flag
```



**References**

[Building NCNN on macOS](https://www.bilibili.com/read/cv10224407/)

[NCNN](https://github.com/Tencent/ncnn)

[The Optimize pass of the NCNN deep learning framework](https://www.cnblogs.com/wanggangtao/p/11313705.html)

--------------------------------------------------------------------------------
/ModelConver/imgs/process.svg:
--------------------------------------------------------------------------------
[drawio SVG diagram: Pytorch -> ONNX -> MNN / NCNN]

--------------------------------------------------------------------------------
/DDP/ddp.py:
--------------------------------------------------------------------------------
import os
import argparse
import torch.multiprocessing as mp
import torchvision
import torchvision.transforms as transforms
import torch
import torch.nn as nn
import torch.distributed as dist

class ConvNet(nn.Module):
    def __init__(self, num_classes=10):
        super(ConvNet, self).__init__()
        self.layer1 = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, stride=1, padding=2),
            nn.BatchNorm2d(16),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2))
        self.layer2 = nn.Sequential(
            nn.Conv2d(16, 32, kernel_size=5, stride=1, padding=2),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2))
        self.fc = nn.Linear(7*7*32, num_classes)

    def forward(self, x):
        out = self.layer1(x)
        out = self.layer2(out)
        out = out.reshape(out.size(0), -1)
        out = self.fc(out)
        return out


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--nodes', default=2, type=int)    # number of nodes
    parser.add_argument('--gpus', default=2, type=int)     # number of GPUs per node
    parser.add_argument('--nr', default=0, type=int)       # index of this node among all nodes
    parser.add_argument('--batch', default=128, type=int)  # total (effective) batch, split evenly across all GPUs
    parser.add_argument('--ip', default=None, type=str)    # master node ip
    parser.add_argument('--savepath', default='', type=str)  # optional path to pretrained weights

    args = parser.parse_args()
    args.world_size = args.gpus * args.nodes  # total world_size, i.e. total processes == total GPUs (one process per GPU)
    os.environ['MASTER_ADDR'] = args.ip  # master node (master process), used by all processes to synchronize gradients
    os.environ['MASTER_PORT'] = '8886'   # port the master process uses for communication; any free port works

    # A node launches all of its processes; each runs train(i, args) with i from 0 to args.gpus-1
    # nprocs: number of processes mp.spawn starts
    # args: the arguments passed on to train
    mp.spawn(train, nprocs=args.gpus, args=(args,))


def train(pid, args):
    '''
    Started via mp.spawn; train receives the node-local subprocess index pid plus the args
    '''
    # one process per GPU, so node-local subprocess index == node-local GPU index
    gpu = pid

    # Compute this process's rank among all processes: every process needs the process
    # count and its own position to know which GPU to use
    # rank=0 is the master process, used to save the model and print logs
    rank = args.nr * args.gpus + gpu

    # Initialize the distributed environment
    # env: initialize from environment variables; requires MASTER_PORT, MASTER_ADDR, WORLD_SIZE, RANK
    dist.init_process_group(backend='nccl',
                            init_method='env://',
                            world_size=args.world_size,
                            rank=rank)

    torch.manual_seed(0)
    model = ConvNet()

    # load weights
    if args.savepath:
        print('loading weights')
        pass

    # before DDP wraps the model, convert BatchNorm layers to SyncBatchNorm
    model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)

    torch.cuda.set_device(gpu)  # the GPU this process is responsible for
    model.cuda(gpu)
    batch_size = int(args.batch / args.world_size)  # per-GPU batch = total effective batch / total processes (total GPUs)

    criterion = nn.CrossEntropyLoss().cuda(gpu)
    optimizer = torch.optim.SGD(model.parameters(), 1e-4)

    # wrap the GPU model as a DDP model
    model = nn.parallel.DistributedDataParallel(model, device_ids=[gpu])

    # load the data
    train_dataset = torchvision.datasets.MNIST(root='./data',
                                               train=True,
                                               transform=transforms.ToTensor(),
                                               download=True)
    # sampler: splits the dataset into world_size chunks, one per process
    train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset,
                                                                    num_replicas=args.world_size,
                                                                    rank=rank)
    train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
                                               batch_size=batch_size,
                                               shuffle=False,  # ignored under DDP; train_sampler decides
                                               num_workers=0,  # 0 under DDP, otherwise reads fail
                                               pin_memory=True,
                                               sampler=train_sampler)  # sampler

    for epoch in range(10):
        # reshuffle via the sampler each epoch, so the data split differs between epochs
        train_sampler.set_epoch(epoch)

        for i, (images, labels) in enumerate(train_loader):
            images = images.cuda(non_blocking=True)
            labels = labels.cuda(non_blocking=True)

            outputs = model(images)
            loss = criterion(outputs, labels)


            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            if (i + 1) % 100 == 0 and gpu == 0:
                print('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}'.format(epoch + 1, 10, i + 1, len(train_loader),
                                                                         loss.item()))

        # === validation ===
        # make sure each process uses a distinct log name; in the end only visualize rank=0
        # acc = eval()


    # only the master process saves the model
    if rank == 0:
        torch.save(model.state_dict(), 'ddp.pth')


if __name__ == '__main__':
    main()

--------------------------------------------------------------------------------
/TensorRT/trt_com.py:
--------------------------------------------------------------------------------
import onnx
import onnxruntime
import pycuda.autoinit
import pycuda.driver as cuda
import tensorrt as trt
import torch
import time
import torchvision
import numpy as np
import os
import sys
current_path = os.path.abspath(os.path.dirname(__file__))
'''
Shared helper code
'''

def Init_TensorRT(trt_path):
    '''
    Initialize the TensorRT engine
    trt_path: trt file
    '''
    # load the cuda engine
    engine = load_engine(trt_path)
    # after creating the CudaEngine, set up an execution context for the target device
    context = engine.create_execution_context()
    inputs, outputs, bindings, stream = allocate_buffers(engine)  # input, output: host # bindings
    return [context, inputs, outputs, bindings, stream]

def load_engine(trt_path):
    """
    Load the cuda engine
    trt_path: TensorRT engine file
    """
    # create the TRT logger required by the Runtime
    TRT_LOGGER = trt.Logger()
    # if a serialized engine already exists, deserialize it directly into a cudaEngine
    if os.path.exists(trt_path):
        print("Reading engine from file: {}".format(trt_path))
        with open(trt_path, 'rb') as f, \
                trt.Runtime(TRT_LOGGER) as runtime:
            return runtime.deserialize_cuda_engine(f.read())  # deserialize
    else:
        print('Not found: ' + trt_path)
        raise FileNotFoundError(trt_path)


def allocate_buffers(engine):
    '''
    Allocate TRT buffers
    '''
    class HostDeviceMem(object):
        def __init__(self, host_mem, device_mem):
            """
            host_mem: cpu memory
            device_mem: gpu memory
            """
            self.host = host_mem      # host data
            self.device = device_mem  # GPU data

        def __str__(self):
            return "Host:\n" + str(self.host) + "\nDevice:\n" + str(self.device)

        def __repr__(self):
            return self.__str__()

    inputs, outputs, bindings = [], [], []
    stream = cuda.Stream()
    for binding in engine:
        # print(binding)  # the bound inputs/outputs
        # print(engine.get_binding_shape(binding))  # get_binding_shape is the binding's shape
        size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size
        # volume computes the number of elements for an iterable shape
        # size = trt.volume(engine.get_binding_shape(binding))  # use this line for a fixed-batch-size onnx
        dtype = trt.nptype(engine.get_binding_dtype(binding))
        # get_binding_dtype returns the binding's data type
        # nptype maps it to the equivalent numpy dtype
        # allocate host and device buffers
        host_mem = cuda.pagelocked_empty(size, dtype)  # allocate page-locked host memory
        device_mem = cuda.mem_alloc(host_mem.nbytes)   # allocate GPU memory
        # print(int(device_mem))  # the binding's buffer address in the graph
        bindings.append(int(device_mem))
        # append to the appropriate list
        if engine.binding_is_input(binding):
            inputs.append(HostDeviceMem(host_mem, device_mem))   # bind input
        else:
            outputs.append(HostDeviceMem(host_mem, device_mem))  # bind output
    return inputs, outputs, bindings, stream


def Do_Inference(context, bindings, inputs, outputs, stream):
    '''
    Run inference
    '''
    # htod: host to device; move the data from the host to the GPU
    [cuda.memcpy_htod_async(inp.device, inp.host, stream) for inp in inputs]

    # Run inference.
    context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
    # dtoh: device to host
    [cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]

    # Synchronize the stream; the predictions are only available after synchronization
    stream.synchronize()

    # return the predictions as flat numpy arrays
    return [out.host for out in outputs]


def Torch_to_ONNX(net, input_size, onnx_path, device):
    '''
    torch->onnx (fixed input shape only)
    input_size: input shape, e.g. [N,3,224,224]
    onnx_path: where to save the onnx weight file
    device: "cuda:0"
    '''
    net.to(device)
    net.eval()
    # convert to ONNX
    torch.onnx.export(net,  # the network (with weights) to convert
                      torch.randn(tuple(input_size), device=device),  # dummy input: fixes the input shape and traces the shape of every graph node
                      onnx_path,         # output file path
                      verbose=False,     # whether to print the graph as a string
                      input_names=["input"],
                      output_names=["output"],   # names of the output nodes
                      opset_version=13,          # onnx operator-set version
                      do_constant_folding=True,  # whether to fold constants
                      )


    # validate the model
    net = onnx.load(onnx_path)              # load the onnx graph
    onnx.checker.check_model(net)           # check that the model file is valid
    onnx.helper.printable_graph(net.graph)  # printable form of the onnx graph

    # ONNX inference
    session = onnxruntime.InferenceSession(onnx_path)  # create a session, similar to tensorflow
    output = session.run(None, {"input": np.random.rand(input_size[0], input_size[1], input_size[2], input_size[3]).astype('float32')})  # inputs must be numpy

    print('ONNX file in ' + onnx_path)
    print('============Pytorch->ONNX SUCCESS============')


def ONNX_to_TensorRT(fp16_mode=False, onnx_path=None, trt_path=None, max_batch_size=1):
    """
    Build the cudaEngine and save the engine file (fixed input shape only)

    max_batch_size: defaults to 1; dynamic batch is not supported
    fp16_mode: True for fp16 inference
    onnx_path: path of the onnx weights to load
    trt_path: where to save the trt engine file
    """
    # the logger reports errors, warnings and info
    TRT_LOGGER = trt.Logger()

    explicit_batch = 1 << (int)(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)


    with trt.Builder(TRT_LOGGER) as builder, \
            builder.create_network(explicit_batch) as network, \
            trt.OnnxParser(network, TRT_LOGGER) as parser:
        builder.max_workspace_size = 1 << 30  # pre-allocated workspace: the maximum GPU memory the ICudaEngine may use at execution time
        builder.max_batch_size = max_batch_size  # the maximum usable batch size at execution time
        builder.fp16_mode = fp16_mode

        # ######## parse the onnx file and fill the graph #########
        if not os.path.exists(onnx_path):
            quit("ONNX file {} not found!".format(onnx_path))
        print('loading onnx file from path {} ...'.format(onnx_path))
        with open(onnx_path, 'rb') as model:
            print("Beginning onnx file parsing")
            parser.parse(model.read())  # the OnnxParser builds the network object and fills in its weights
            print("Completed parsing of onnx file")

        # ######## the builder creates the engine from the graph #########
        print("Building an engine from file {}; this may take a while...".format(onnx_path))
        output_shape = network.get_layer(network.num_layers - 1).get_output(0).shape  # shape of the last layer's output
        # network.mark_output(network.get_layer(network.num_layers - 1).get_output(0))  # set the output
        engine = builder.build_cuda_engine(network)  # build the engine
        print("Completed creating Engine")

        # save the engine so it can be loaded directly later
        with open(trt_path, 'wb') as f:
            f.write(engine.serialize())  # serialize

        print('TensorRT file in ' + trt_path)
        print('============ONNX->TensorRT SUCCESS============')

--------------------------------------------------------------------------------
/TensorRT/lenet.py:
--------------------------------------------------------------------------------
'''
https://github.com/wang-xinyu/tensorrtx  minimal LeNet example
'''

import argparse
import os
import struct
import sys

import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda
import tensorrt as trt

INPUT_H = 32      # input size
INPUT_W = 32
OUTPUT_SIZE = 10  # output shape: 10 classes
INPUT_BLOB_NAME = "data"   # name of the input blob (binary object)
OUTPUT_BLOB_NAME = "prob"  # name of the output

weight_path = "./lenet5.wts"     # binary weights
engine_path = "./lenet5.engine"  # where to save the trt engine

gLogger = trt.Logger(trt.Logger.INFO)  # the logger reports errors/warnings/info (Builder/ICudaEngine/Runtime)


def load_weights(file):
    '''Load the binary weight file'''
    print(f"Loading weights: {file}")

    assert os.path.exists(file), 'Unable to load weight file.'

    weight_map = {}
    with open(file, "r") as f:
        lines = [line.strip() for line in f]
    count = int(lines[0])
    assert count == len(lines) - 1
    for i in range(1, count + 1):  # iterate over the remaining lines
        splits = lines[i].split(" ")
        name = splits[0]            # first value: layer name
        cur_count = int(splits[1])  # second value: number of parameters on this line
        assert cur_count + 2 == len(splits)
        values = []  # the parameters on this line
        for j in range(2, len(splits)):
            # hex string to bytes to float
            values.append(struct.unpack(">f", bytes.fromhex(splits[j])))
        weight_map[name] = np.array(values, dtype=np.float32)

    return weight_map


def createLenetEngine(maxBatchSize, builder, config, dt):
    '''
    Build the network engine
    dt: fp32 or fp16
    '''


    weight_map = load_weights(weight_path)  # load the binary weights
    network = builder.create_network()      # create the network object

    data = network.add_input(INPUT_BLOB_NAME, dt, (1, INPUT_H, INPUT_W))  # set the network input's name and shape
    assert data
    # ============ define the network ============
    # convolution
    conv1 = network.add_convolution(input=data,           # input tensor
                                    num_output_maps=6,    # output channels
                                    kernel_shape=(5, 5),  # kernel size
                                    kernel=weight_map["conv1.weight"],  # kernel weights [out_channels, in_channels, kernel_height, kernel_width]
                                    bias=weight_map["conv1.bias"])      # bias weights [out_channels]
    assert conv1
    conv1.stride = (1, 1)  # convolution stride

    # activation
    relu1 = network.add_activation(conv1.get_output(0),  # output of the previous conv layer
                                   type=trt.ActivationType.RELU)
    assert relu1

    # pooling
    pool1 = network.add_pooling(input=relu1.get_output(0),     # output of the previous activation
                                window_size=trt.DimsHW(2, 2),  # pooling window size
                                type=trt.PoolingType.AVERAGE)  # average pooling
    assert pool1
    pool1.stride = (2, 2)  # pooling stride

    conv2 = network.add_convolution(pool1.get_output(0), 16, trt.DimsHW(5, 5),
                                    weight_map["conv2.weight"],
                                    weight_map["conv2.bias"])
    assert conv2
    conv2.stride = (1, 1)

    relu2 = network.add_activation(conv2.get_output(0),
                                   type=trt.ActivationType.RELU)
    assert relu2

    pool2 = network.add_pooling(input=relu2.get_output(0),
                                window_size=trt.DimsHW(2, 2),
                                type=trt.PoolingType.AVERAGE)
    assert pool2
    pool2.stride = (2, 2)

    # fully connected layers
    fc1 = network.add_fully_connected(input=pool2.get_output(0),
                                      num_outputs=120,
                                      kernel=weight_map['fc1.weight'],
                                      bias=weight_map['fc1.bias'])
    assert fc1

    relu3 = network.add_activation(fc1.get_output(0),
                                   type=trt.ActivationType.RELU)
    assert relu3

    fc2 = network.add_fully_connected(input=relu3.get_output(0),
                                      num_outputs=84,
                                      kernel=weight_map['fc2.weight'],
                                      bias=weight_map['fc2.bias'])
    assert fc2

    relu4 = network.add_activation(fc2.get_output(0),
                                   type=trt.ActivationType.RELU)
    assert relu4

    fc3 = network.add_fully_connected(input=relu4.get_output(0),
                                      num_outputs=OUTPUT_SIZE,
                                      kernel=weight_map['fc3.weight'],
                                      bias=weight_map['fc3.bias'])
    assert fc3

    prob = network.add_softmax(fc3.get_output(0))  # apply softmax
    assert prob

    prob.get_output(0).name = OUTPUT_BLOB_NAME  # name the network output so predictions can be fetched by name later
    network.mark_output(prob.get_output(0))     # mark this tensor as an output

    # Build engine
    builder.max_batch_size = maxBatchSize
    # builder.max_workspace_size = 1 << 20
    config.max_workspace_size = 1 << 20
    engine = builder.build_engine(network, config)

    del network
    del weight_map

    return engine


def APIToModel(maxBatchSize):
    '''Convert the binary weights into a trt engine'''
    builder = trt.Builder(gLogger)            # builder object used to build the engine
    config = builder.create_builder_config()  # configuration parameters for the builder
    engine = createLenetEngine(maxBatchSize, builder, config, trt.float32)
    assert engine  # the engine must not be empty

    # save as a trt engine file
    with open(engine_path, "wb") as f:
        f.write(engine.serialize())

    del engine
    del builder


def doInference(context, host_in, host_out, batchSize):
    '''
    trt inference

    host_in: input data
    host_out: empty numpy array that receives the output
    '''
    engine = context.engine
    assert engine.num_bindings == 2  # number of bound tensors: 1 input + 1 output

    devide_in = cuda.mem_alloc(host_in.nbytes)  # allocate GPU memory for the input; returns the device allocation's address
    devide_out = cuda.mem_alloc(host_out.nbytes)
    bindings = [int(devide_in), int(devide_out)]
    stream = cuda.Stream()  # multiple streams can run in parallel

    cuda.memcpy_htod_async(devide_in, host_in, stream)  # copy host memory to the GPU; htod = host_to_device
    context.execute_async(bindings=bindings, stream_handle=stream.handle)  # asynchronous inference on the GPU
    cuda.memcpy_dtoh_async(host_out, devide_out, stream)  # copy GPU data back to host memory; dtoh = device_to_host
    stream.synchronize()  # after synchronizing the stream, host_out holds the predictions


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument("-s", default=False, action='store_true')
    parser.add_argument("-d", default=False, action='store_true')
    args = parser.parse_args()

    if not (args.s ^ args.d):
        print("arguments not right!")
        print("python lenet.py -s   # serialize model to plan file")             # convert the binary weights into a trt engine
        print("python lenet.py -d   # deserialize plan file and run inference")  # load the trt engine and run inference
        sys.exit()

    if args.s:
        APIToModel(1)
    else:
        runtime = trt.Runtime(gLogger)  # create the trt runtime in order to load the engine
        assert runtime

        with open(engine_path, "rb") as f:
            engine = runtime.deserialize_cuda_engine(f.read())  # load the trt engine
        assert engine

        context = engine.create_execution_context()  # create the execution context
        assert context

        data = np.ones((INPUT_H * INPUT_W), dtype=np.float32)  # TRT input is flat [1024]; 1024 = 1*32*32
        host_in = cuda.pagelocked_empty(INPUT_H * INPUT_W, dtype=np.float32)  # page-locked empty buffer matching the input's size and dtype
        np.copyto(host_in, data.ravel())  # ravel flattens without copying; copyto writes the values into host_in
        host_out = cuda.pagelocked_empty(OUTPUT_SIZE, dtype=np.float32)  # page-locked empty buffer matching the output's size
        doInference(context, host_in, host_out, 1)  # after inference, host_out holds the result

        print(f'Output: {host_out}')

--------------------------------------------------------------------------------
/TensorRT/imgs/build.svg:
--------------------------------------------------------------------------------
[drawio SVG diagram: building the engine. Native API path: builder (with config, which sets the builder's parameters) -> create_network -> network -> define the network structure and assign weights -> build_engine -> engine -> save the engine file. ONNX path: builder -> create_network -> network; onnx_parser parses the onnx weights (1. build the graph, 2. fill the weights) -> build_engine -> engine]
--------------------------------------------------------------------------------
/TensorRT/imgs/infer.svg:
--------------------------------------------------------------------------------
[drawio SVG diagram: inference. runtime -> deserialize_cuda_engine -> engine -> create_execution_context -> context (execution context object); host_in (host data) receives the input and is transferred to devide_in (GPU data); context runs with bindings; devide_out (GPU data) is transferred back to host_out (host data), which receives the predictions; multiple streams (Stream) allow parallelism]
--------------------------------------------------------------------------------