├── LICENSE
├── README.md
├── offline
    ├── dpcrn.py
    ├── dpcrn_onnx.py
    ├── dpcrn_trt.py
    └── models
    │   ├── dpcrn.engine
    │   ├── dpcrn.onnx
    │   ├── dpcrn_simple.onnx
    │   └── trtexec.exe
├── online
    ├── dpcrn.py
    ├── dpcrn_stream.py
    ├── dpcrn_stream_onnx.py
    ├── dpcrn_stream_trt.py
    ├── models
    │   ├── dpcrn.engine
    │   ├── dpcrn.onnx
    │   ├── dpcrn_simple.onnx
    │   └── trtexec.exe
    └── modules
    │   ├── __init__.py
    │   ├── convert.py
    │   └── convolution.py
├── readme
    ├── README_zh.md
    ├── TRTSETUP.md
    └── TRTSETUP_zh.md
└── requirements.txt


/LICENSE:
--------------------------------------------------------------------------------
 1 | MIT License
 2 | 
 3 | Copyright (c) 2025 Rong Xiaobin
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy
 6 | of this software and associated documentation files (the "Software"), to deal
 7 | in the Software without restriction, including without limitation the rights
 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # Deploying Deep Learning Speech Enhancement Models with TensorRT
 2 | The Chinese version of the document can be found in: [使用 TensorRT 部署深度学习语音增强模型](./readme/README_zh.md)
 3 | 
 4 | To install TensorRT, refer to: [TensorRT Installation Tutorial](./readme/TRTSETUP.md)
 5 | 
 6 | The deployment of speech enhancement models can be categorized into two types: **offline inference** and **online inference**. Offline inference involves performing model inference on pre-prepared data, usually in the form of a batch of samples or longer audio signals. Offline inference does not have real-time requirements and can utilize efficient inference methods and resource allocation strategies.
 7 | 
 8 | On the other hand, online inference involves performing model inference on real-time generated speech data, such as continuously captured audio signals from a microphone. Online inference requires low latency and high throughput to meet real-time requirements.
 9 | 
10 | ## Deployment of Offline Models
11 | ### 1. Conversion to ONNX Model
12 | For offline models, exporting to ONNX is straightforward. The only thing to consider is setting the time dimension of the input shape. Although `torch.onnx.export` supports dynamic dimensions, considering the limited need in practical applications, we choose to fix the time dimension to 563, corresponding to 9 seconds of audio data. During offline processing, if the audio is less than 9 seconds, it is padded with zeros. If the audio is longer than 9 seconds, it is processed in batches of 9 seconds each.
13 | 
14 | `offline\dpcrn_onnx.py` provides the export and inference of the ONNX model, and evaluates the inference speed on ONNXRuntime.
15 | 
16 | ### 2. Conversion to Engine Model
17 | We use the conversion tool `trtexec.exe` provided by TensorRT to convert the model from ONNX to the Engine format supported by TensorRT. This tool is located in the `bin` directory of the TensorRT installation package, and the usage is as follows:
18 | ```
19 | trtexec.exe --onnx=[onnx_path] --saveEngine=[save_path]
20 | ```
21 | 
22 | `offline\dpcrn_trt.py` provides the export and inference of the Engine model, and evaluates the inference speed. The results are shown in the following table.
23 | 
24 | | **Model Format** | **Inference Framework** | **Inference Platform** | **Average Inference Speed (ms)** | **Maximum Inference Speed (ms)** | **Minimum Inference Speed (ms)** |
25 | |:----------------:|:----------------------:|:---------------------:|:-------------------------------:|:-------------------------------:|:-------------------------------:|
26 | | ONNX | ONNXRuntime | CPU | 8.6 | 21.0 | 7.6|
27 | | Engine |TensorRT| CUDA | 2.2 | 5.1 | 1.9 |
28 | 
29 | The CPU used for inference is the 13th Gen Intel(R) Core(TM) i9-13900HX @ 2.20 GHz, and the CUDA used is NVIDIA GeForce RTX 4080 Laptop GPU. The inference process was repeated 1000 times to obtain the average, maximum, and minimum inference speeds. Here, the inference speed is defined as the processing time per audio duration. It can be observed that the inference speed of the TensorRT framework is nearly 4 times faster than ONNXRuntime.
30 | 
31 | ## Deployment of Online Models
32 | In speech enhancement, online inference has broader application scenarios and higher real-time requirements for the model. Consequently, the deployment of online inference is more complex. Here, we adopt the method of **streaming inference** to perform frame-by-frame inference on real-time data streams. When implementing streaming inference, appropriate data buffering mechanisms, data stream management, and pipeline design for model inference are required to ensure data continuity and stable inference.
33 | 
34 | ### 1. Conversion to Streaming Model
35 | RNN naturally adapts to streaming inference without additional conversion. In contrast, the convolutional layers are the main part of the neural network that requires conversion for streaming. In the `online\modules` directory, we define two types of operators for streaming convolution and streaming transposed convolution in `convolution.py`. We also provide a method in `convert.py` to copy the original model parameter dictionary for the conversion of streaming models.
36 | 
37 | `online\dpcrn_stream.py` provides the conversion and inference process for streaming models. Note that for streaming models, the time dimension of the input tensor is always set to 1.
38 | 
39 | ### 2. Conversion to ONNX Model
40 | For streaming models, there is no need to consider the time dimension when converting to ONNX. However, it is recommended to specify all input tensors in the `forward` function instead of using a list as done in `online\dpcrn_stream.py`.
41 | 
42 | `online\dpcrn_stream_onnx.py` provides the conversion and inference process for streaming ONNX models and evaluates the inference speed on ONNXRuntime.
43 | 
44 | ### 3. Conversion to Engine Model
45 | Similarly, we use the conversion tool `trtexec.exe` provided by TensorRT for model conversion.
46 | 
47 | `online\dpcrn_stream_trt.py` provides the export and inference of the streaming Engine model, and evaluates the inference speed. The results are shown in the following table.
48 | 
49 | | **Model Format** | **Inference Framework** | **Inference Platform** | **Average Inference Speed (ms)** | **Maximum Inference Speed (ms)** | **Minimum Inference Speed (ms)** |
50 | |:----------------:|:----------------------:|:---------------------:|:-------------------------------:|:-------------------------------:|:-------------------------------:|
51 | | ONNX | ONNXRuntime | CPU |1.0 | 3.1 | 0.9 |
52 | | Engine |TensorRT| CUDA |2.2 | 4.7 | 1.8 |
53 | 
54 | The evaluation of inference speed is repeated 1000 times. Here, the inference speed is defined as the processing time per frame. As we can see, using TensorRT for inference is slower than using ONNXRuntime. This is because, for high-throughput streaming models, the data transfer between CUDA and CPU during TensorRT inference takes a certain amount of time. Only when the model's inference speed on the CPU becomes the bottleneck, using TensorRT for CUDA inference will have a positive effect.
55 | 
56 | ## Acknowledgement
57 | The SE model used in this repository is [DPCRN](https://arxiv.org/abs/2107.05429), an excellent SE model with high performance and low latency, which ranked 2nd in DNS3 Challenge. The author is [Xiaohuai Le](https://github.com/Le-Xiaohuai-speech), my senior, who has taught me a lot, including the streaming conversion method employed in this repository.
58 | 
59 | 


--------------------------------------------------------------------------------
/offline/dpcrn.py:
--------------------------------------------------------------------------------
  1 | """
  2 | A more elegant implementation of DPCRN.
  3 | 1.74 GMac, 787.15 k
  4 | """
  5 | import torch
  6 | import torch.nn as nn
  7 | 
  8 | 
  9 | class DPRNN(nn.Module):
 10 |     def __init__(self, numUnits, width, channel, **kwargs):
 11 |         super(DPRNN, self).__init__(**kwargs)
 12 |         self.numUnits = numUnits
 13 |         self.width = width
 14 |         self.channel = channel
 15 | 
 16 |         self.intra_rnn = nn.LSTM(input_size = self.numUnits, hidden_size = self.numUnits//2, batch_first = True, bidirectional = True)
 17 |         self.intra_fc = nn.Linear(self.numUnits, self.numUnits)
 18 |         self.intra_ln = nn.LayerNorm((width, numUnits), eps=1e-8)
 19 | 
 20 |         self.inter_rnn = nn.LSTM(input_size = self.numUnits, hidden_size = self.numUnits, batch_first = True, bidirectional = False)
 21 |         self.inter_fc = nn.Linear(self.numUnits, self.numUnits)
 22 |         self.inter_ln = nn.LayerNorm((width, numUnits), eps=1e-8)
 23 |     
 24 |     def forward(self,x):
 25 |         # x: (B, C, T, F)
 26 |         ## Intra RNN
 27 |         x = x.permute(0, 2, 3, 1)  # (B,T,F,C)
 28 |         intra_x = x.reshape(x.shape[0] * x.shape[1], x.shape[2], x.shape[3])  # (B*T,F,C)
 29 |         intra_x = self.intra_rnn(intra_x)[0]  # (B*T,F,C)
 30 |         intra_x = self.intra_fc(intra_x)      # (B*T,F,C)
 31 |         intra_x = intra_x.reshape(x.shape[0], -1, self.width, self.channel) # (B,T,F,C)
 32 |         intra_x = self.intra_ln(intra_x)
 33 |         intra_out = torch.add(x, intra_x)
 34 | 
 35 |         ## Inter RNN
 36 |         x = intra_out.permute(0,2,1,3)  # (B,F,T,C)
 37 |         inter_x = x.reshape(x.shape[0] * x.shape[1], x.shape[2], x.shape[3]) 
 38 |         inter_x = self.inter_rnn(inter_x)[0]  # (B*F,T,C)
 39 |         inter_x = self.inter_fc(inter_x)      # (B*F,T,C)
 40 |         inter_x = inter_x.reshape(x.shape[0], self.width, -1, self.channel) # (B,F,T,C)
 41 |         inter_x = inter_x.permute(0,2,1,3)   # (B,T,F,C)
 42 |         inter_x = self.inter_ln(inter_x) 
 43 |         inter_out = torch.add(intra_out, inter_x)
 44 |         
 45 |         dual_out = inter_out.permute(0,3,1,2)  # (B,C,T,F)
 46 |         
 47 |         return dual_out
 48 | 
 49 | 
 50 | class Encoder(nn.Module):
 51 |     def __init__(self,
 52 |                  in_channels,
 53 |                  filter_size,
 54 |                  kernel_size,
 55 |                  stride_size):
 56 |         super().__init__()
 57 |         self.N_layers = len(filter_size)
 58 | 
 59 |         self.pad_list = nn.ModuleList([])
 60 |         self.conv_list = nn.ModuleList([])
 61 |         self.bn_list = nn.ModuleList([])
 62 |         self.act_list = nn.ModuleList([])
 63 | 
 64 |         for i in range(self.N_layers):
 65 |             Cin = in_channels if i==0 else filter_size[i-1]
 66 |             self.pad_list.append(nn.ZeroPad2d([(kernel_size[i][1]-1)//2, (kernel_size[i][1]-1)//2, kernel_size[i][0]-1, 0]))
 67 |             self.conv_list.append(nn.Conv2d(Cin, filter_size[i], kernel_size[i], stride_size[i]))
 68 |             self.bn_list.append(nn.BatchNorm2d(filter_size[i]))
 69 |             self.act_list.append(nn.PReLU())
 70 | 
 71 |     def forward(self, x):
 72 |         """
 73 |         x: (B,C,T,F)
 74 |         """
 75 |         en_outs = []
 76 |         for i in range(self.N_layers):
 77 |             x = self.pad_list[i](x)
 78 |             x = self.conv_list[i](x)
 79 |             x = self.bn_list[i](x)
 80 |             x = self.act_list[i](x)
 81 |             en_outs.append(x)
 82 |             # print(f'en_{i}:', x.shape)
 83 |         return x, en_outs
 84 | 
 85 | 
 86 | class Decoder(nn.Module):
 87 |     def __init__(self,
 88 |                  out_channels,
 89 |                  filter_size,
 90 |                  kernel_size,
 91 |                  stride_size):
 92 |         super().__init__()
 93 |         self.N_layers = len(filter_size)
 94 | 
 95 |         self.pad_list = nn.ModuleList([])
 96 |         self.conv_list = nn.ModuleList([])
 97 |         self.bn_list = nn.ModuleList([])
 98 |         self.act_list = nn.ModuleList([])
 99 | 
100 |         for i in range(self.N_layers-1, -1, -1):
101 |             Cout = out_channels if i==0 else filter_size[i-1]
102 |             act = nn.Identity() if i== 0 else nn.PReLU()
103 |             self.pad_list.append(nn.ZeroPad2d([0, 0, kernel_size[i][0]-1, 0]))
104 |             self.conv_list.append(nn.ConvTranspose2d(filter_size[i]*2, Cout, kernel_size[i], stride_size[i], (kernel_size[i][0]-1, (kernel_size[i][1]-1)//2)))
105 |             self.bn_list.append(nn.BatchNorm2d(Cout))
106 |             self.act_list.append(act)
107 | 
108 |     def forward(self, x, en_outs):
109 |         """
110 |         x: (B,C,T,F)
111 |         """
112 |         for i in range(self.N_layers):
113 |             x = torch.cat([x, en_outs[self.N_layers-1-i]], dim=1)
114 |             x = self.pad_list[i](x)
115 |             x = self.conv_list[i](x)
116 |             x = self.bn_list[i](x)
117 |             x = self.act_list[i](x)
118 |             # print(f'de_{i}:', x.shape)
119 |         return x
120 | 
121 | 
122 | class DPCRN(nn.Module):
123 |     def __init__(self,
124 |                  in_channels=2,
125 |                  out_channels=2,
126 |                  filter_size=[32,32,32,64,128],
127 |                  kernel_size=[(2,5), (2,3), (2,3), (2,3), (2,3)],
128 |                  stride_size=[(1,2), (1,2), (1,2), (1,1), (1,1)],
129 |                  N_dprnn=2,
130 |                  num_units=128,
131 |                  width=33,
132 |                  **kwargs):
133 |         super().__init__()
134 |         self.N_dprnn = N_dprnn
135 |         self.encoder = Encoder(in_channels, filter_size, kernel_size, stride_size)
136 |         
137 |         self.dprnns = nn.ModuleList([])
138 |         for i in range(N_dprnn):
139 |             self.dprnns.append(DPRNN(num_units, width, filter_size[-1]))
140 |             
141 |         self.decoder = Decoder(out_channels, filter_size, kernel_size, stride_size)
142 | 
143 |     def forward(self, x):
144 |         """
145 |         x: (B,F,T,2), noisy spectrogram, where B is batch size, F is frequency bins, T is time frames, and 2 is R/I components.
146 |         """
147 |         x_ref = x
148 |         x = x.permute(0, 3, 2, 1)     # (B,C,T,F)
149 | 
150 |         x, en_outs = self.encoder(x)
151 | 
152 |         for i in range(self.N_dprnn):
153 |             x = self.dprnns[i](x)
154 | 
155 |         x = self.decoder(x, en_outs)
156 | 
157 |         m = x.permute(0,3,2,1)
158 | 
159 |         s_real = x_ref[...,0] * m[...,0] - x_ref[...,1] * m[...,1]
160 |         s_imag = x_ref[...,1] * m[...,0] + x_ref[...,0] * m[...,1]
161 |         s = torch.stack([s_real, s_imag], dim=-1)  # (B,F,T,2)
162 |         
163 |         return s
164 | 
165 | 
166 | 
167 | if __name__ == "__main__":
168 |     model = DPCRN().cuda()
169 | 
170 |     from ptflops import get_model_complexity_info
171 |     flops, params = get_model_complexity_info(model, (257, 63, 2), as_strings=True,
172 |                                            print_per_layer_stat=False, verbose=True)
173 |     print(flops, params)
174 | 
175 |     model = model.cpu().eval()
176 |     x = torch.randn(1, 257, 63, 2)
177 |     y = model(x)
178 |     print(y.shape)


--------------------------------------------------------------------------------
/offline/dpcrn_onnx.py:
--------------------------------------------------------------------------------
 1 | """
 2 | A more elegant implementation of DPCRN.
 3 | 1.74 GMac, 787.15 k
 4 | """
 5 | import torch
 6 | import time
 7 | import onnx
 8 | import onnxruntime
 9 | import numpy as np
10 | from onnxsim import simplify
11 | from dpcrn import DPCRN
12 | 
13 | 
14 | ## load model
15 | model = DPCRN().eval()  # remember to set `eval` mode!
16 | 
17 | ## convert to onnx
18 | file = 'models/dpcrn.onnx'
19 | device = torch.device('cpu')
20 | 
21 | time_len = 9  # set the length to 9 s
22 | frame_num = time_len * 16000 // 256 + 1  # compute frame numbers, fs=16000, hop_size=256
23 | x = torch.randn(1, 257, frame_num, 2, device=device)  
24 | 
25 | torch.onnx.export(model,
26 |                 (x,),
27 |                 file,
28 |                 input_names = ['mix'],
29 |                 output_names = ['enh'],
30 |                 opset_version=11,
31 |                 verbose = False)
32 | 
33 | onnx_model = onnx.load(file)
34 | onnx.checker.check_model(onnx_model)
35 | 
36 | model_simp, check = simplify(onnx_model)
37 | assert check, "Simplified ONNX model could not be validated"
38 | onnx.save(model_simp, file.split('.onnx')[0] + '_simple.onnx')
39 | 
40 | ## run onnx model
41 | # session = onnxruntime.InferenceSession(file, None, providers=['CUDAExecutionProvider'])
42 | session = onnxruntime.InferenceSession(file.split('.onnx')[0]+'_simple.onnx', None, providers=['CPUExecutionProvider'])
43 | inputs = x.cpu().detach().numpy()
44 | 
45 | ## execute inference
46 | outputs = session.run([], {'mix': inputs})
47 | 
48 | ## check error
49 | y = model(x)
50 | diff = outputs - y.detach().numpy()
51 | print(">>> The maximum numerical error:", np.abs(diff).max())
52 | 
53 | ## test inference speed
54 | T_list = []
55 | for i in range(1000):  
56 |     tic = time.perf_counter()
57 |     outputs = session.run([], {'mix': inputs})
58 |     toc = time.perf_counter()
59 |     T_list.append((toc-tic) / time_len)
60 | print(">>> inference time: mean: {:.1f}ms, max: {:.1f}ms, min: {:.1f}ms".format(1e3*np.mean(T_list), 1e3*np.max(T_list), 1e3*np.min(T_list)))
61 | 


--------------------------------------------------------------------------------
/offline/dpcrn_trt.py:
--------------------------------------------------------------------------------
  1 | """ 
  2 | 1. onnx to engine: use `trtexec.exe`.
  3 | 2. tensorrt and torch cannot be imported simultaneously!
  4 | """
  5 | import os
  6 | import numpy as np
  7 | import tensorrt as trt
  8 | import pycuda.driver as cuda
  9 | import pycuda.autoinit  # must import
 10 | from collections import namedtuple
 11 | 
 12 | def onnx2engine(trtexec_path, onnx_path, save_path):
 13 |     os.system(f"{trtexec_path} --onnx={onnx_path} --saveEngine={save_path}")
 14 | 
 15 | 
 16 | Bindings = namedtuple("Bindings", ("name", "shape", "host", "device", "size"))
 17 | 
 18 | class TRTModel:
 19 |     """
 20 |     Implements inference for the EfficientNet TensorRT engine.
 21 |     """
 22 | 
 23 |     def __init__(self, engine_path, dtype=np.float32):
 24 |         """
 25 |         Args:
 26 |             engine_path: The path to the serialized engine to load from disk.
 27 |             dtype: The datatype used in inference.
 28 |         """
 29 |         # init arguments
 30 |         self.engine_path = engine_path
 31 |         self.dtype = dtype
 32 | 
 33 |         # Load TRT engine
 34 |         self.logger = trt.Logger(trt.Logger.ERROR)
 35 |         self.runtime = trt.Runtime(self.logger)
 36 |         self.engine = self.load_engine(self.runtime, self.engine_path)
 37 |         self.context = self.engine.create_execution_context()
 38 |         self.inputs, self.outputs, self.bindings, self.stream = self.allocate_buffers()
 39 | 
 40 |     @staticmethod
 41 |     def load_engine(trt_runtime, engine_path):
 42 |         with open(engine_path, 'rb') as f:
 43 |             engine_data = f.read()
 44 |         engine = trt_runtime.deserialize_cuda_engine(engine_data)
 45 |         return engine
 46 | 
 47 | 
 48 |     def allocate_buffers(self):
 49 |         inputs = []
 50 |         outputs = []
 51 |         bindings = []
 52 |         stream = cuda.Stream()
 53 | 
 54 |         for i in range(self.engine.num_io_tensors):
 55 |             name = self.engine.get_tensor_name(i)
 56 |             shape = self.engine.get_tensor_shape(name)
 57 |             size = trt.volume(shape)
 58 |             # print(i, name, ':', shape, size)
 59 | 
 60 |             host_mem = cuda.pagelocked_empty(size, self.dtype)
 61 |             device_mem = cuda.mem_alloc(host_mem.nbytes)
 62 | 
 63 |             bindings.append(int(device_mem))
 64 | 
 65 |             if self.engine.get_tensor_mode(name) == trt.TensorIOMode.INPUT:
 66 |                 inputs.append(Bindings(name, shape, host_mem, device_mem, host_mem.nbytes))
 67 |             else:
 68 |                 outputs.append(Bindings(name, shape, host_mem, device_mem, host_mem.nbytes))
 69 | 
 70 |         return inputs, outputs, bindings, stream
 71 | 
 72 | 
 73 |     def __call__(self, x: np.ndarray):
 74 |         x = x.astype(self.dtype)
 75 |         np.copyto(self.inputs[0].host, x.ravel())
 76 | 
 77 |         # Transfer the noisy data from CPU to CUDA.
 78 |         cuda.memcpy_htod_async(self.inputs[0].device, self.inputs[0].host, self.stream)
 79 |         # Execute inference.
 80 |         self.context.execute_async_v2(bindings=self.bindings, stream_handle=self.stream.handle)
 81 |         # Transfer the enhanced data from CPU to CUDA.
 82 |         cuda.memcpy_dtoh_async(self.outputs[-1].host, self.outputs[-1].device, self.stream)
 83 | 
 84 |         self.stream.synchronize()
 85 | 
 86 |         return self.outputs[-1].host.reshape(self.outputs[-1].shape)
 87 |     
 88 | 
 89 | if __name__ == "__main__":
 90 |     import time
 91 | 
 92 |     trtexec_path = r'.\models\trtexec.exe'
 93 |     onnx_path = r'.\models\dpcrn.onnx'
 94 |     save_path = r'.\models\dpcrn.engine'
 95 | 
 96 |     ## Convert to engine
 97 |     onnx2engine(trtexec_path, onnx_path, save_path)
 98 | 
 99 |     ## Load engine model
100 |     model = TRTModel(save_path)
101 | 
102 |     ## Execute inference
103 |     time_len = 9  # set the length to 9 s
104 |     frame_num = time_len * 16000 // 256 + 1  # compute frame numbers, fs=16000, hop_size=256
105 |     x = np.random.randn(1, 257, frame_num, 2)
106 |     y = model(x)
107 | 
108 |     ## Test inference speed
109 |     times = np.zeros([1000])
110 |     for i in range(len(times)):
111 |         tic = time.perf_counter()
112 |         outputs = model(x)
113 |         toc = time.perf_counter()
114 |         times[i] = 1000*((toc-tic) / time_len)
115 | 
116 |     print(">>> Average Inference Time (ms): ", times.mean())
117 |     print(">>> Maximum Inference Time (ms): ", times.max())
118 |     print(">>> Minimum Inference Time (ms): ", times.min())
119 | 
120 | 
121 | 
122 | 
123 |     
124 | 
125 | 


--------------------------------------------------------------------------------
/offline/models/dpcrn.engine:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Xiaobin-Rong/TRT-SE/e35783a695669fd1346ce566dfe73abc7250ee2f/offline/models/dpcrn.engine


--------------------------------------------------------------------------------
/offline/models/dpcrn.onnx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Xiaobin-Rong/TRT-SE/e35783a695669fd1346ce566dfe73abc7250ee2f/offline/models/dpcrn.onnx


--------------------------------------------------------------------------------
/offline/models/dpcrn_simple.onnx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Xiaobin-Rong/TRT-SE/e35783a695669fd1346ce566dfe73abc7250ee2f/offline/models/dpcrn_simple.onnx


--------------------------------------------------------------------------------
/offline/models/trtexec.exe:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Xiaobin-Rong/TRT-SE/e35783a695669fd1346ce566dfe73abc7250ee2f/offline/models/trtexec.exe


--------------------------------------------------------------------------------
/online/dpcrn.py:
--------------------------------------------------------------------------------
  1 | """
  2 | A more elegant implementation of DPCRN.
  3 | 1.74 GMac, 787.15 k
  4 | """
  5 | import torch
  6 | import torch.nn as nn
  7 | 
  8 | 
  9 | class DPRNN(nn.Module):
 10 |     def __init__(self, numUnits, width, channel, **kwargs):
 11 |         super(DPRNN, self).__init__(**kwargs)
 12 |         self.numUnits = numUnits
 13 |         self.width = width
 14 |         self.channel = channel
 15 | 
 16 |         self.intra_rnn = nn.LSTM(input_size = self.numUnits, hidden_size = self.numUnits//2, batch_first = True, bidirectional = True)
 17 |         self.intra_fc = nn.Linear(self.numUnits, self.numUnits)
 18 |         self.intra_ln = nn.LayerNorm((width, numUnits), eps=1e-8)
 19 | 
 20 |         self.inter_rnn = nn.LSTM(input_size = self.numUnits, hidden_size = self.numUnits, batch_first = True, bidirectional = False)
 21 |         self.inter_fc = nn.Linear(self.numUnits, self.numUnits)
 22 |         self.inter_ln = nn.LayerNorm((width, numUnits), eps=1e-8)
 23 |     
 24 |     def forward(self,x):
 25 |         # x: (B, C, T, F)
 26 |         ## Intra RNN
 27 |         x = x.permute(0, 2, 3, 1)  # (B,T,F,C)
 28 |         intra_x = x.reshape(x.shape[0] * x.shape[1], x.shape[2], x.shape[3])  # (B*T,F,C)
 29 |         intra_x = self.intra_rnn(intra_x)[0]  # (B*T,F,C)
 30 |         intra_x = self.intra_fc(intra_x)      # (B*T,F,C)
 31 |         intra_x = intra_x.reshape(x.shape[0], -1, self.width, self.channel) # (B,T,F,C)
 32 |         intra_x = self.intra_ln(intra_x)
 33 |         intra_out = torch.add(x, intra_x)
 34 | 
 35 |         ## Inter RNN
 36 |         x = intra_out.permute(0,2,1,3)  # (B,F,T,C)
 37 |         inter_x = x.reshape(x.shape[0] * x.shape[1], x.shape[2], x.shape[3]) 
 38 |         inter_x = self.inter_rnn(inter_x)[0]  # (B*F,T,C)
 39 |         inter_x = self.inter_fc(inter_x)      # (B*F,T,C)
 40 |         inter_x = inter_x.reshape(x.shape[0], self.width, -1, self.channel) # (B,F,T,C)
 41 |         inter_x = inter_x.permute(0,2,1,3)   # (B,T,F,C)
 42 |         inter_x = self.inter_ln(inter_x) 
 43 |         inter_out = torch.add(intra_out, inter_x)
 44 |         
 45 |         dual_out = inter_out.permute(0,3,1,2)  # (B,C,T,F)
 46 |         
 47 |         return dual_out
 48 | 
 49 | 
 50 | class Encoder(nn.Module):
 51 |     def __init__(self,
 52 |                  in_channels,
 53 |                  filter_size,
 54 |                  kernel_size,
 55 |                  stride_size):
 56 |         super().__init__()
 57 |         self.N_layers = len(filter_size)
 58 | 
 59 |         self.pad_list = nn.ModuleList([])
 60 |         self.conv_list = nn.ModuleList([])
 61 |         self.bn_list = nn.ModuleList([])
 62 |         self.act_list = nn.ModuleList([])
 63 | 
 64 |         for i in range(self.N_layers):
 65 |             Cin = in_channels if i==0 else filter_size[i-1]
 66 |             self.pad_list.append(nn.ZeroPad2d([(kernel_size[i][1]-1)//2, (kernel_size[i][1]-1)//2, kernel_size[i][0]-1, 0]))
 67 |             self.conv_list.append(nn.Conv2d(Cin, filter_size[i], kernel_size[i], stride_size[i]))
 68 |             self.bn_list.append(nn.BatchNorm2d(filter_size[i]))
 69 |             self.act_list.append(nn.PReLU())
 70 | 
 71 |     def forward(self, x):
 72 |         """
 73 |         x: (B,C,T,F)
 74 |         """
 75 |         en_outs = []
 76 |         for i in range(self.N_layers):
 77 |             x = self.pad_list[i](x)
 78 |             x = self.conv_list[i](x)
 79 |             x = self.bn_list[i](x)
 80 |             x = self.act_list[i](x)
 81 |             en_outs.append(x)
 82 |             # print(f'en_{i}:', x.shape)
 83 |         return x, en_outs
 84 | 
 85 | 
 86 | class Decoder(nn.Module):
 87 |     def __init__(self,
 88 |                  out_channels,
 89 |                  filter_size,
 90 |                  kernel_size,
 91 |                  stride_size):
 92 |         super().__init__()
 93 |         self.N_layers = len(filter_size)
 94 | 
 95 |         self.pad_list = nn.ModuleList([])
 96 |         self.conv_list = nn.ModuleList([])
 97 |         self.bn_list = nn.ModuleList([])
 98 |         self.act_list = nn.ModuleList([])
 99 | 
100 |         for i in range(self.N_layers-1, -1, -1):
101 |             Cout = out_channels if i==0 else filter_size[i-1]
102 |             act = nn.Identity() if i== 0 else nn.PReLU()
103 |             self.pad_list.append(nn.ZeroPad2d([0, 0, kernel_size[i][0]-1, 0]))
104 |             self.conv_list.append(nn.ConvTranspose2d(filter_size[i]*2, Cout, kernel_size[i], stride_size[i], (kernel_size[i][0]-1, (kernel_size[i][1]-1)//2)))
105 |             self.bn_list.append(nn.BatchNorm2d(Cout))
106 |             self.act_list.append(act)
107 | 
108 |     def forward(self, x, en_outs):
109 |         """
110 |         x: (B,C,T,F)
111 |         """
112 |         for i in range(self.N_layers):
113 |             x = torch.cat([x, en_outs[self.N_layers-1-i]], dim=1)
114 |             x = self.pad_list[i](x)
115 |             x = self.conv_list[i](x)
116 |             x = self.bn_list[i](x)
117 |             x = self.act_list[i](x)
118 |             # print(f'de_{i}:', x.shape)
119 |         return x
120 | 
121 | 
122 | class DPCRN(nn.Module):
123 |     def __init__(self,
124 |                  in_channels=2,
125 |                  out_channels=2,
126 |                  filter_size=[32,32,32,64,128],
127 |                  kernel_size=[(2,5), (2,3), (2,3), (2,3), (2,3)],
128 |                  stride_size=[(1,2), (1,2), (1,2), (1,1), (1,1)],
129 |                  N_dprnn=2,
130 |                  num_units=128,
131 |                  width=33,
132 |                  **kwargs):
133 |         super().__init__()
134 |         self.N_dprnn = N_dprnn
135 |         self.encoder = Encoder(in_channels, filter_size, kernel_size, stride_size)
136 |         
137 |         self.dprnns = nn.ModuleList([])
138 |         for i in range(N_dprnn):
139 |             self.dprnns.append(DPRNN(num_units, width, filter_size[-1]))
140 |             
141 |         self.decoder = Decoder(out_channels, filter_size, kernel_size, stride_size)
142 | 
143 |     def forward(self, x):
144 |         """
145 |         x: (B,F,T,2), noisy spectrogram, where B is batch size, F is frequency bins, T is time frames, and 2 is R/I components.
146 |         """
147 |         x_ref = x
148 |         x = x.permute(0, 3, 2, 1)     # (B,C,T,F)
149 | 
150 |         x, en_outs = self.encoder(x)
151 | 
152 |         for i in range(self.N_dprnn):
153 |             x = self.dprnns[i](x)
154 | 
155 |         x = self.decoder(x, en_outs)
156 | 
157 |         m = x.permute(0,3,2,1)
158 | 
159 |         s_real = x_ref[...,0] * m[...,0] - x_ref[...,1] * m[...,1]
160 |         s_imag = x_ref[...,1] * m[...,0] + x_ref[...,0] * m[...,1]
161 |         s = torch.stack([s_real, s_imag], dim=-1)  # (B,F,T,2)
162 |         
163 |         return s
164 | 
165 | 
166 | 
167 | if __name__ == "__main__":
168 |     model = DPCRN().cuda()
169 | 
170 |     from ptflops import get_model_complexity_info
171 |     flops, params = get_model_complexity_info(model, (257, 63, 2), as_strings=True,
172 |                                            print_per_layer_stat=False, verbose=True)
173 |     print(flops, params)
174 | 
175 |     model = model.cpu().eval()
176 |     x = torch.randn(1, 257, 63, 2)
177 |     y = model(x)
178 |     print(y.shape)


--------------------------------------------------------------------------------
/online/dpcrn_stream.py:
--------------------------------------------------------------------------------
  1 | """
  2 | A more elegant implementation of streaming DPCRN.
  3 | 1.74 GMac, 787.15 k
  4 | """
  5 | import torch
  6 | import torch.nn as nn
  7 | from modules.convolution import StreamConv2d, StreamConvTranspose2d
  8 | from modules.convert import convert_to_stream
  9 | 
 10 | 
 11 | class StreamDPRNN(nn.Module):
 12 |     def __init__(self, numUnits, width, channel, **kwargs):
 13 |         super().__init__(**kwargs)
 14 |         self.numUnits = numUnits
 15 |         self.width = width
 16 |         self.channel = channel
 17 | 
 18 |         self.intra_rnn = nn.LSTM(input_size = self.numUnits, hidden_size = self.numUnits//2, batch_first=True, bidirectional=True)
 19 |         self.intra_fc = nn.Linear(self.numUnits, self.numUnits)
 20 |         self.intra_ln = nn.LayerNorm((width, numUnits), eps=1e-8)
 21 | 
 22 |         self.inter_rnn = nn.LSTM(input_size = self.numUnits, hidden_size = self.numUnits, batch_first=True, bidirectional=False)
 23 |         self.inter_fc = nn.Linear(self.numUnits, self.numUnits)
 24 |         self.inter_ln = nn.LayerNorm((width, numUnits), eps=1e-8)
 25 |     
 26 |     def forward(self, x, h_cache, c_cache):
 27 |         """
 28 |         x:       (B, C, T=1, F)
 29 |         h_cache: (1, F, C), hidden cache for inter RNN.
 30 |         c_cache: (1, F, C), cell cache for inter RNN.
 31 |         """
 32 |         ## Intra RNN
 33 |         x = x.permute(0, 2, 3, 1)  # (B,T,F,C)
 34 |         intra_x = x.reshape(x.shape[0] * x.shape[1], x.shape[2], x.shape[3])  # (B*T,F,C)
 35 |         intra_x = self.intra_rnn(intra_x)[0]  # (B*T,F,C)
 36 |         intra_x = self.intra_fc(intra_x)      # (B*T,F,C)
 37 |         intra_x = intra_x.reshape(x.shape[0], -1, self.width, self.channel) # (B,T,F,C)
 38 |         intra_x = self.intra_ln(intra_x)
 39 |         intra_out = torch.add(x, intra_x)
 40 | 
 41 |         ## Inter RNN
 42 |         x = intra_out.permute(0,2,1,3)  # (B,F,T,C)
 43 |         inter_x = x.reshape(x.shape[0] * x.shape[1], x.shape[2], x.shape[3]) 
 44 |         inter_x, (h_cache, c_cache) = self.inter_rnn(inter_x, (h_cache, c_cache))  # (B*F,T,C)
 45 |         inter_x = self.inter_fc(inter_x)      # (B*F,T,C)
 46 |         inter_x = inter_x.reshape(x.shape[0], self.width, -1, self.channel) # (B,F,T,C)
 47 |         inter_x = inter_x.permute(0,2,1,3)   # (B,T,F,C)
 48 |         inter_x = self.inter_ln(inter_x) 
 49 |         inter_out = torch.add(intra_out, inter_x)
 50 |         
 51 |         dual_out = inter_out.permute(0,3,1,2)  # (B,C,T,F)
 52 |         
 53 |         return dual_out, h_cache, c_cache
 54 | 
 55 | 
 56 | class StreamEncoder(nn.Module):
 57 |     def __init__(self,
 58 |                  in_channels,
 59 |                  filter_size,
 60 |                  kernel_size,
 61 |                  stride_size):
 62 |         super().__init__()
 63 |         self.N_layers = len(filter_size)
 64 | 
 65 |         self.conv_list = nn.ModuleList([])
 66 |         self.bn_list = nn.ModuleList([])
 67 |         self.act_list = nn.ModuleList([])
 68 | 
 69 |         for i in range(self.N_layers):
 70 |             Cin = in_channels if i==0 else filter_size[i-1]
 71 |             self.conv_list.append(StreamConv2d(Cin, filter_size[i], kernel_size[i], stride_size[i], (0, (kernel_size[i][1]-1)//2)))
 72 |             self.bn_list.append(nn.BatchNorm2d(filter_size[i]))
 73 |             self.act_list.append(nn.PReLU())
 74 | 
 75 |     def forward(self, x, en_caches):
 76 |         """
 77 |         x: (B,C,1,F).
 78 |         en_caches:  A list of cache (B,C,1,F) for each conv layer in encoder. 
 79 |         """
 80 |         en_outs = []
 81 |         for i in range(self.N_layers):
 82 |             x, en_caches[i] = self.conv_list[i](x, en_caches[i])
 83 |             x = self.bn_list[i](x)
 84 |             x = self.act_list[i](x)
 85 |             en_outs.append(x)
 86 |             # print(f'en_{i}:', x.shape)
 87 |         return x, en_outs, en_caches
 88 | 
 89 | 
 90 | class StreamDecoder(nn.Module):
 91 |     def __init__(self,
 92 |                  out_channels,
 93 |                  filter_size,
 94 |                  kernel_size,
 95 |                  stride_size):
 96 |         super().__init__()
 97 |         self.N_layers = len(filter_size)
 98 | 
 99 |         self.conv_list = nn.ModuleList([])
100 |         self.bn_list = nn.ModuleList([])
101 |         self.act_list = nn.ModuleList([])
102 | 
103 |         for i in range(self.N_layers-1, -1, -1):
104 |             Cout = out_channels if i==0 else filter_size[i-1]
105 |             act = nn.Identity() if i== 0 else nn.PReLU()
106 |             self.conv_list.append(StreamConvTranspose2d(filter_size[i]*2, Cout, kernel_size[i], stride_size[i], (0, (kernel_size[i][1]-1)//2)))
107 |             self.bn_list.append(nn.BatchNorm2d(Cout))
108 |             self.act_list.append(act)
109 | 
110 |     def forward(self, x, en_outs, de_caches):
111 |         """
112 |         x: (B,C,T,F)
113 |         de_caches: A list of cache (B,C,T,F) for each conv layer in decoder. 
114 |         """
115 |         for i in range(self.N_layers):
116 |             x = torch.cat([x, en_outs[self.N_layers-1-i]], dim=1)
117 |             x, de_caches[i] = self.conv_list[i](x, de_caches[i])
118 |             x = self.bn_list[i](x)
119 |             x = self.act_list[i](x)
120 |             # print(f'de_{i}:', x.shape)
121 |         return x, de_caches
122 | 
123 | 
124 | class StreamDPCRN(nn.Module):
125 |     def __init__(self,
126 |                  in_channels=2,
127 |                  out_channels=2,
128 |                  filter_size=[32,32,32,64,128],
129 |                  kernel_size=[(2,5), (2,3), (2,3), (2,3), (2,3)],
130 |                  stride_size=[(1,2), (1,2), (1,2), (1,1), (1,1)],
131 |                  N_dprnn=2,
132 |                  num_units=128,
133 |                  width=33,
134 |                  **kwargs):
135 |         super().__init__()
136 |         self.N_dprnn = N_dprnn
137 |         self.encoder = StreamEncoder(in_channels, filter_size, kernel_size, stride_size)
138 |         
139 |         self.dprnns = nn.ModuleList([])
140 |         for i in range(N_dprnn):
141 |             self.dprnns.append(StreamDPRNN(num_units, width, filter_size[-1]))
142 |             
143 |         self.decoder = StreamDecoder(out_channels, filter_size, kernel_size, stride_size)
144 | 
145 |     def forward(self, x, en_caches, rnn_caches, de_caches):
146 |         """
147 |         x: (B,F,T,2)
148 |         en_caces: A list of cache (B,C,T,F) for each conv layer in encoder. 
149 |         rnn_caches: A list of cache (1,F,C) for each rnn in dprnns.
150 |         en_caces: A list of cache (B,C,T,F) of each conv layer in decoder. 
151 |         """
152 |         x_ref = x
153 |         x = x.permute(0, 3, 2, 1)     # (B,C,T,F)
154 | 
155 |         x, en_outs, en_caches = self.encoder(x, en_caches)
156 | 
157 |         for i in range(self.N_dprnn):
158 |             x, rnn_caches[2*i], rnn_caches[2*i+1] = self.dprnns[i](x, rnn_caches[2*i], rnn_caches[2*i+1])
159 | 
160 |         x, de_caches = self.decoder(x, en_outs, de_caches)
161 | 
162 |         m = x.permute(0,3,2,1)
163 | 
164 |         s_real = x_ref[...,0] * m[...,0] - x_ref[...,1] * m[...,1]
165 |         s_imag = x_ref[...,1] * m[...,0] + x_ref[...,0] * m[...,1]
166 |         s = torch.stack([s_real, s_imag], dim=-1)  # (B,F,T,2)
167 |         
168 |         return s, en_caches, rnn_caches, de_caches
169 | 
170 | 
171 | 
172 | if __name__ == "__main__":
173 |     from dpcrn import DPCRN
174 |     model = DPCRN().eval()
175 |     model_stream = StreamDPCRN().eval()
176 |     convert_to_stream(model_stream, model)
177 | 
178 |     x = torch.randn(1, 257, 63, 2)
179 |     en_caches = [torch.zeros(1, 2,  1, 257),
180 |                 torch.zeros(1, 32, 1, 129),
181 |                 torch.zeros(1, 32, 1, 65),
182 |                 torch.zeros(1, 32, 1, 33),
183 |                 torch.zeros(1, 64,1, 33)]
184 |                                                 
185 |     rnn_caches = [torch.zeros(1,33,128),
186 |                   torch.zeros(1,33,128),
187 |                   torch.zeros(1,33,128),
188 |                   torch.zeros(1,33,128)]
189 |     
190 |     de_caches = [torch.zeros(1, 256,1, 33),
191 |                 torch.zeros(1, 128,1, 33),
192 |                 torch.zeros(1, 64, 1, 33),
193 |                 torch.zeros(1, 64, 1, 65),
194 |                 torch.zeros(1, 64, 1, 129)]
195 |     
196 |     y1 = []
197 |     for i in range(x.shape[-2]):
198 |         yi, en_caches, rnn_caches, de_caches = model_stream(x[:,:,i:i+1,:], en_caches, rnn_caches, de_caches)
199 |         y1.append(yi)
200 |     y1 = torch.cat(y1, dim=2)
201 |     
202 |     ## check errors
203 |     y = model(x)
204 |     print((y-y1).abs().max())
205 | 


--------------------------------------------------------------------------------
/online/dpcrn_stream_onnx.py:
--------------------------------------------------------------------------------
  1 | """
  2 | A more elegant implementation of DPCRN.
  3 | 1.74 GMac, 787.15 k
  4 | """
  5 | import torch
  6 | import torch.nn as nn
  7 | from modules.convolution import StreamConv2d, StreamConvTranspose2d
  8 | from modules.convert import convert_to_stream
  9 | 
 10 | 
 11 | class StreamDPRNN(nn.Module):
 12 |     def __init__(self, numUnits, width, channel, **kwargs):
 13 |         super().__init__(**kwargs)
 14 |         self.numUnits = numUnits
 15 |         self.width = width
 16 |         self.channel = channel
 17 | 
 18 |         self.intra_rnn = nn.LSTM(input_size = self.numUnits, hidden_size = self.numUnits//2, batch_first=True, bidirectional=True)
 19 |         self.intra_fc = nn.Linear(self.numUnits, self.numUnits)
 20 |         self.intra_ln = nn.LayerNorm((width, numUnits), eps=1e-8)
 21 | 
 22 |         self.inter_rnn = nn.LSTM(input_size = self.numUnits, hidden_size = self.numUnits, batch_first=True, bidirectional=False)
 23 |         self.inter_fc = nn.Linear(self.numUnits, self.numUnits)
 24 |         self.inter_ln = nn.LayerNorm((width, numUnits), eps=1e-8)
 25 |     
 26 |     def forward(self, x, h_cache, c_cache):
 27 |         """
 28 |         x:       (B, C, T=1, F)
 29 |         h_cache: (1, F, C), hidden cache for inter RNN.
 30 |         c_cache: (1, F, C), cell cache for inter RNN.
 31 |         """
 32 |         ## Intra RNN
 33 |         x = x.permute(0, 2, 3, 1)  # (B,T,F,C)
 34 |         intra_x = x.reshape(x.shape[0] * x.shape[1], x.shape[2], x.shape[3])  # (B*T,F,C)
 35 |         intra_x = self.intra_rnn(intra_x)[0]  # (B*T,F,C)
 36 |         intra_x = self.intra_fc(intra_x)      # (B*T,F,C)
 37 |         intra_x = intra_x.reshape(x.shape[0], -1, self.width, self.channel) # (B,T,F,C)
 38 |         intra_x = self.intra_ln(intra_x)
 39 |         intra_out = torch.add(x, intra_x)
 40 | 
 41 |         ## Inter RNN
 42 |         x = intra_out.permute(0,2,1,3)  # (B,F,T,C)
 43 |         inter_x = x.reshape(x.shape[0] * x.shape[1], x.shape[2], x.shape[3]) 
 44 |         inter_x, (h_cache, c_cache) = self.inter_rnn(inter_x, (h_cache, c_cache))  # (B*F,T,C)
 45 |         inter_x = self.inter_fc(inter_x)      # (B*F,T,C)
 46 |         inter_x = inter_x.reshape(x.shape[0], self.width, -1, self.channel) # (B,F,T,C)
 47 |         inter_x = inter_x.permute(0,2,1,3)   # (B,T,F,C)
 48 |         inter_x = self.inter_ln(inter_x) 
 49 |         inter_out = torch.add(intra_out, inter_x)
 50 |         
 51 |         dual_out = inter_out.permute(0,3,1,2)  # (B,C,T,F)
 52 |         
 53 |         return dual_out, h_cache, c_cache
 54 | 
 55 | 
 56 | class StreamEncoder(nn.Module):
 57 |     def __init__(self,
 58 |                  in_channels,
 59 |                  filter_size,
 60 |                  kernel_size,
 61 |                  stride_size):
 62 |         super().__init__()
 63 |         self.N_layers = len(filter_size)
 64 | 
 65 |         self.conv_list = nn.ModuleList([])
 66 |         self.bn_list = nn.ModuleList([])
 67 |         self.act_list = nn.ModuleList([])
 68 | 
 69 |         for i in range(self.N_layers):
 70 |             Cin = in_channels if i==0 else filter_size[i-1]
 71 |             self.conv_list.append(StreamConv2d(Cin, filter_size[i], kernel_size[i], stride_size[i], (0, (kernel_size[i][1]-1)//2)))
 72 |             self.bn_list.append(nn.BatchNorm2d(filter_size[i]))
 73 |             self.act_list.append(nn.PReLU())
 74 | 
 75 |     def forward(self, x, en_caches):
 76 |         """
 77 |         x: (B,C,1,F).
 78 |         en_caches: A list of cache (B,C,1,F) for each conv layer in encoder. 
 79 |         """
 80 |         en_outs = []
 81 |         for i in range(self.N_layers):
 82 |             x, en_caches[i] = self.conv_list[i](x, en_caches[i])
 83 |             x = self.bn_list[i](x)
 84 |             x = self.act_list[i](x)
 85 |             en_outs.append(x)
 86 |             # print(f'en_{i}:', x.shape)
 87 |         return x, en_outs, en_caches
 88 | 
 89 | 
 90 | class StreamDecoder(nn.Module):
 91 |     def __init__(self,
 92 |                  out_channels,
 93 |                  filter_size,
 94 |                  kernel_size,
 95 |                  stride_size):
 96 |         super().__init__()
 97 |         self.N_layers = len(filter_size)
 98 | 
 99 |         self.conv_list = nn.ModuleList([])
100 |         self.bn_list = nn.ModuleList([])
101 |         self.act_list = nn.ModuleList([])
102 | 
103 |         for i in range(self.N_layers-1, -1, -1):
104 |             Cout = out_channels if i==0 else filter_size[i-1]
105 |             act = nn.Identity() if i== 0 else nn.PReLU()
106 |             self.conv_list.append(StreamConvTranspose2d(filter_size[i]*2, Cout, kernel_size[i], stride_size[i], (0, (kernel_size[i][1]-1)//2)))
107 |             self.bn_list.append(nn.BatchNorm2d(Cout))
108 |             self.act_list.append(act)
109 | 
110 |     def forward(self, x, en_outs, de_caches):
111 |         """
112 |         x: (B,C,T,F)
113 |         de_caches: A list of cache (B,C,T,F) for each conv layer in decoder. 
114 |         """
115 |         for i in range(self.N_layers):
116 |             x = torch.cat([x, en_outs[self.N_layers-1-i]], dim=1)
117 |             x, de_caches[i] = self.conv_list[i](x, de_caches[i])
118 |             x = self.bn_list[i](x)
119 |             x = self.act_list[i](x)
120 |             # print(f'de_{i}:', x.shape)
121 |         return x, de_caches
122 | 
123 | 
124 | class StreamDPCRN(nn.Module):
125 |     def __init__(self,
126 |                  in_channels=2,
127 |                  out_channels=2,
128 |                  filter_size=[32,32,32,64,128],
129 |                  kernel_size=[(2,5), (2,3), (2,3), (2,3), (2,3)],
130 |                  stride_size=[(1,2), (1,2), (1,2), (1,1), (1,1)],
131 |                  N_dprnn=2,
132 |                  num_units=128,
133 |                  width=33,
134 |                  **kwargs):
135 |         super().__init__()
136 |         self.encoder = StreamEncoder(in_channels, filter_size, kernel_size, stride_size)
137 |         
138 |         self.dprnns = nn.ModuleList([])
139 |         for i in range(N_dprnn):
140 |             self.dprnns.append(StreamDPRNN(num_units, width, filter_size[-1]))
141 |             
142 |         self.decoder = StreamDecoder(out_channels, filter_size, kernel_size, stride_size)
143 | 
144 |     def forward(self, x,
145 |                 en_cache1, en_cache2, en_cache3, en_cache4, en_cache5,
146 |                 rnn_cache1, rnn_cache2, rnn_cache3, rnn_cache4,
147 |                 de_cache1, de_cache2, de_cache3, de_cache4, de_cache5):
148 |         """
149 |         When export to onnx, the inputs should be specific tensors.
150 |         x: (B,F,T,2)
151 |         cachees: ...
152 |         """
153 |         x_ref = x
154 |         x = x.permute(0, 3, 2, 1)     # (B,C,T,F)
155 | 
156 |         x, en_outs, [en_cache1, en_cache2, en_cache3, en_cache4, en_cache5] = self.encoder(x, [en_cache1, en_cache2, en_cache3, en_cache4, en_cache5])
157 | 
158 |         x, rnn_cache1, rnn_cache2 = self.dprnns[0](x, rnn_cache1, rnn_cache2)
159 |         x, rnn_cache3, rnn_cache4 = self.dprnns[1](x, rnn_cache3, rnn_cache4)
160 | 
161 |         x, [de_cache1, de_cache2, de_cache3, de_cache4, de_cache5] = self.decoder(x, en_outs, [de_cache1, de_cache2, de_cache3, de_cache4, de_cache5])
162 | 
163 |         m = x.permute(0,3,2,1)
164 | 
165 |         s_real = x_ref[...,0] * m[...,0] - x_ref[...,1] * m[...,1]
166 |         s_imag = x_ref[...,1] * m[...,0] + x_ref[...,0] * m[...,1]
167 |         s = torch.stack([s_real, s_imag], dim=-1)  # (B,F,T,2)
168 |         
169 |         return s, en_cache1, en_cache2, en_cache3, en_cache4, en_cache5,\
170 |                 rnn_cache1, rnn_cache2, rnn_cache3, rnn_cache4,\
171 |                 de_cache1, de_cache2, de_cache3, de_cache4, de_cache5
172 | 
173 | 
174 | 
175 | if __name__ == "__main__":
176 |     from dpcrn import DPCRN
177 |     model = DPCRN().eval()
178 |     model_stream = StreamDPCRN().eval()
179 |     convert_to_stream(model_stream, model)
180 | 
181 |     x = torch.randn(1, 257, 1000, 2)
182 |     en_cache1 = torch.zeros(1, 2,  1, 257)
183 |     en_cache2 = torch.zeros(1, 32, 1, 129)
184 |     en_cache3 = torch.zeros(1, 32, 1, 65)
185 |     en_cache4 = torch.zeros(1, 32, 1, 33)
186 |     en_cache5 = torch.zeros(1, 64,1, 33)
187 | 
188 |     rnn_cache1 = torch.zeros(1, 33, 128)
189 |     rnn_cache2 = torch.zeros(1, 33, 128)
190 |     rnn_cache3 = torch.zeros(1, 33, 128)
191 |     rnn_cache4 = torch.zeros(1, 33, 128)
192 | 
193 |     de_cache1 = torch.zeros(1, 256,1, 33)
194 |     de_cache2 = torch.zeros(1, 128,1, 33)
195 |     de_cache3 = torch.zeros(1, 64, 1, 33)
196 |     de_cache4 = torch.zeros(1, 64, 1, 65)
197 |     de_cache5 = torch.zeros(1, 64, 1, 129)
198 |     
199 |     y1 = []
200 |     for i in range(x.shape[-2]):
201 |         yi, en_cache1, en_cache2, en_cache3, en_cache4, en_cache5,\
202 |             rnn_cache1, rnn_cache2, rnn_cache3, rnn_cache4,\
203 |             de_cache1, de_cache2, de_cache3, de_cache4, de_cache5 = \
204 |                     model_stream(x[:,:,i:i+1,:], en_cache1, en_cache2, en_cache3, en_cache4, en_cache5,
205 |                                             rnn_cache1, rnn_cache2, rnn_cache3, rnn_cache4,
206 |                                             de_cache1, de_cache2, de_cache3, de_cache4, de_cache5)
207 |         y1.append(yi)
208 |     y1 = torch.cat(y1, dim=2)
209 |     
210 |     ## check streaming errors
211 |     y = model(x)
212 |     print((y-y1).abs().max())
213 | 
214 | 
215 |     import time
216 |     import onnx
217 |     import onnxruntime
218 |     import numpy as np
219 |     from onnxsim import simplify
220 |     ## convert to onnx
221 |     file = 'models/dpcrn.onnx'
222 |     device = torch.device('cpu')
223 |     input = torch.randn(1, 257, 1, 2, device=device)
224 |     torch.onnx.export(model_stream,
225 |                     (input, en_cache1, en_cache2, en_cache3, en_cache4, en_cache5,
226 |                         rnn_cache1, rnn_cache2, rnn_cache3, rnn_cache4,
227 |                         de_cache1, de_cache2, de_cache3, de_cache4, de_cache5),
228 |                     file,
229 |                     input_names = ['mix', 'en_cache1', 'en_cache2', 'en_cache3', 'en_cache4', 'en_cache5',
230 |                                    'rnn_cache1', 'rnn_cache2', 'rnn_cache3', 'rnn_cache4',
231 |                                     'de_cache1', 'de_cache2', 'de_cache3', 'de_cache4', 'de_cache5'],
232 |                     output_names = ['enh', 'en_cache1_out', 'en_cache2_out', 'en_cache3_out', 'en_cache4_out', 'en_cache5_out',
233 |                                    'rnn_cache1_out', 'rnn_cache2_out', 'rnn_cache3_out', 'rnn_cache4_out',
234 |                                     'de_cache1_out', 'de_cache2_out', 'de_cache3_out', 'de_cache4_out', 'de_cache5_out'],
235 |                     opset_version=11,
236 |                     verbose = False)
237 | 
238 |     onnx_model = onnx.load(file)
239 |     onnx.checker.check_model(onnx_model)
240 | 
241 |     model_simp, check = simplify(onnx_model)
242 |     assert check, "Simplified ONNX model could not be validated"
243 |     onnx.save(model_simp, file.split('.onnx')[0] + '_simple.onnx')
244 | 
245 |     ## run onnx model
246 |     # session = onnxruntime.InferenceSession(file, None, providers=['CUDAExecutionProvider'])
247 |     session = onnxruntime.InferenceSession(file.split('.onnx')[0]+'_simple.onnx', None, providers=['CPUExecutionProvider'])
248 |     input = x.cpu().detach().numpy()
249 |     en_cache1 = torch.zeros(1, 2,  1, 257).numpy()
250 |     en_cache2 = torch.zeros(1, 32, 1, 129).numpy()
251 |     en_cache3 = torch.zeros(1, 32, 1, 65).numpy()
252 |     en_cache4 = torch.zeros(1, 32, 1, 33).numpy()
253 |     en_cache5 = torch.zeros(1, 64,1, 33).numpy()
254 | 
255 |     rnn_cache1 = torch.zeros(1, 33, 128).numpy()
256 |     rnn_cache2 = torch.zeros(1, 33, 128).numpy()
257 |     rnn_cache3 = torch.zeros(1, 33, 128).numpy()
258 |     rnn_cache4 = torch.zeros(1, 33, 128).numpy()
259 | 
260 |     de_cache1 = torch.zeros(1, 256,1, 33).numpy()
261 |     de_cache2 = torch.zeros(1, 128,1, 33).numpy()
262 |     de_cache3 = torch.zeros(1, 64, 1, 33).numpy()
263 |     de_cache4 = torch.zeros(1, 64, 1, 65).numpy()
264 |     de_cache5 = torch.zeros(1, 64, 1, 129).numpy()
265 | 
266 |     T_list = []
267 |     outputs = []
268 | 
269 |     for i in range(input.shape[-2]):
270 |         tic = time.perf_counter()
271 |         
272 |         out_i, en_cache1, en_cache2, en_cache3, en_cache4, en_cache5,\
273 |             rnn_cache1, rnn_cache2, rnn_cache3, rnn_cache4,\
274 |             de_cache1, de_cache2, de_cache3, de_cache4, de_cache5 \
275 |                 = session.run([], {'mix': input[..., i:i+1, :],
276 |                                     'en_cache1': en_cache1, 'en_cache2':en_cache2, 'en_cache3':en_cache3, 'en_cache4':en_cache4, 'en_cache5':en_cache5,
277 |                                    'rnn_cache1':rnn_cache1, 'rnn_cache2':rnn_cache2, 'rnn_cache3':rnn_cache3, 'rnn_cache4':rnn_cache4,
278 |                                     'de_cache1':de_cache1, 'de_cache2':de_cache2, 'de_cache3':de_cache3, 'de_cache4':de_cache4, 'de_cache5':de_cache5})
279 | 
280 |         toc = time.perf_counter()
281 |         T_list.append(toc-tic)
282 |         outputs.append(out_i)
283 |     outputs = np.concatenate(outputs, axis=2)
284 |     ## check onnx errors
285 |     print(np.abs(outputs - y.detach().numpy()).max())
286 |     
287 |     ## evaluate inference speed
288 |     print(">>> inference time: mean: {:.1f}ms, max: {:.1f}ms, min: {:.1f}ms".format(1e3*np.mean(T_list), 1e3*np.max(T_list), 1e3*np.min(T_list)))


--------------------------------------------------------------------------------
/online/dpcrn_stream_trt.py:
--------------------------------------------------------------------------------
  1 | """ 
  2 | 1. onnx to engine: use `trtexec.exe`.
  3 | 2. tensorrt and torch cannot be imported simultaneously!
  4 | 3. During streaming inference of the engine model, if we set outputs=[] and then append frame by frame,
  5 | it will result in each element in outputs being the result of the last frame (reason unknown).
  6 | The solution is to set outputs=np.zeros((1, 2, T, F)) and then assign values frame by frame.
  7 | """
  8 | import os
  9 | import numpy as np
 10 | import tensorrt as trt
 11 | import pycuda.driver as cuda
 12 | import pycuda.autoinit  # must import
 13 | from collections import namedtuple
 14 | 
 15 | 
 16 | def onnx2engine(trtexec_path, onnx_path, save_path):
 17 |     os.system(f"{trtexec_path} --onnx={onnx_path} --saveEngine={save_path}")
 18 | 
 19 | 
 20 | Bindings = namedtuple("Bindings", ("name", "shape", "host", "device", "size"))
 21 | 
 22 | class TRTModel:
 23 |     """
 24 |     Implements inference for the EfficientNet TensorRT engine.
 25 |     """
 26 | 
 27 |     def __init__(self, engine_path, dtype=np.float32):
 28 |         """
 29 |         Args:
 30 |             engine_path: The path to the serialized engine to load from disk.
 31 |             dtype: The datatype used in inference.
 32 |         """
 33 |         # init arguments
 34 |         self.engine_path = engine_path
 35 |         self.dtype = dtype
 36 | 
 37 |         # Load TRT engine
 38 |         self.logger = trt.Logger(trt.Logger.ERROR)
 39 |         self.runtime = trt.Runtime(self.logger)
 40 |         self.engine = self.load_engine(self.runtime, self.engine_path)
 41 |         self.context = self.engine.create_execution_context()
 42 |         self.inputs, self.outputs, self.bindings, self.stream = self.allocate_buffers()
 43 | 
 44 |     @staticmethod
 45 |     def load_engine(trt_runtime, engine_path):
 46 |         with open(engine_path, 'rb') as f:
 47 |             engine_data = f.read()
 48 |         engine = trt_runtime.deserialize_cuda_engine(engine_data)
 49 |         return engine
 50 | 
 51 | 
 52 |     def allocate_buffers(self):
 53 |         inputs = []
 54 |         outputs = []
 55 |         bindings = []
 56 |         stream = cuda.Stream()
 57 | 
 58 |         for i in range(self.engine.num_io_tensors):
 59 |             name = self.engine.get_tensor_name(i)
 60 |             shape = self.engine.get_tensor_shape(name)
 61 |             size = trt.volume(shape)
 62 |             # print(i, name, ':', shape, size)
 63 |             ## 0 mix : (1, 257, 1, 2) 514
 64 |             ## 1 en_cache1 : (1, 2, 1, 257) 514
 65 |             ## 2 en_cache2 : (1, 32, 1, 129) 4128
 66 |             ## 3 en_cache3 : (1, 32, 1, 65) 2080
 67 |             ## 4 en_cache4 : (1, 32, 1, 33) 1056
 68 |             ## 5 en_cache5 : (1, 64, 1, 33) 2112
 69 |             ## 6 rnn_cache1 : (1, 33, 128) 4224
 70 |             ## 7 rnn_cache2 : (1, 33, 128) 4224
 71 |             ## 8 rnn_cache3 : (1, 33, 128) 4224
 72 |             ## 9 rnn_cache4 : (1, 33, 128) 4224
 73 |             ## 10 de_cache1 : (1, 256, 1, 33) 8448
 74 |             ## 11 de_cache2 : (1, 128, 1, 33) 4224
 75 |             ## 12 de_cache3 : (1, 64, 1, 33) 2112
 76 |             ## 13 de_cache4 : (1, 64, 1, 65) 4160
 77 |             ## 14 de_cache5 : (1, 64, 1, 129) 8256
 78 |             ## 15 en_cache1_out : (1, 2, 1, 257) 514
 79 |             ## 16 en_cache2_out : (1, 32, 1, 129) 4128
 80 |             ## 17 en_cache3_out : (1, 32, 1, 65) 2080
 81 |             ## 18 en_cache4_out : (1, 32, 1, 33) 1056
 82 |             ## 19 en_cache5_out : (1, 64, 1, 33) 2112
 83 |             ## 20 rnn_cache1_out : (1, 33, 128) 4224
 84 |             ## 21 rnn_cache2_out : (1, 33, 128) 4224
 85 |             ## 22 rnn_cache3_out : (1, 33, 128) 4224
 86 |             ## 23 rnn_cache4_out : (1, 33, 128) 4224
 87 |             ## 24 de_cache1_out : (1, 256, 1, 33) 8448
 88 |             ## 25 de_cache2_out : (1, 128, 1, 33) 4224
 89 |             ## 10 de_cache1 : (1, 256, 1, 33) 8448
 90 |             ## 11 de_cache2 : (1, 128, 1, 33) 4224
 91 |             ## 12 de_cache3 : (1, 64, 1, 33) 2112
 92 |             ## 13 de_cache4 : (1, 64, 1, 65) 4160
 93 |             ## 14 de_cache5 : (1, 64, 1, 129) 8256
 94 |             ## 15 en_cache1_out : (1, 2, 1, 257) 514
 95 |             ## 16 en_cache2_out : (1, 32, 1, 129) 4128
 96 |             ## 17 en_cache3_out : (1, 32, 1, 65) 2080
 97 |             ## 18 en_cache4_out : (1, 32, 1, 33) 1056
 98 |             ## 19 en_cache5_out : (1, 64, 1, 33) 2112
 99 |             ## 20 rnn_cache1_out : (1, 33, 128) 4224
100 |             ## 21 rnn_cache2_out : (1, 33, 128) 4224
101 |             ## 22 rnn_cache3_out : (1, 33, 128) 4224
102 |             ## 23 rnn_cache4_out : (1, 33, 128) 4224
103 |             ## 24 de_cache1_out : (1, 256, 1, 33) 8448
104 |             ## 25 de_cache2_out : (1, 128, 1, 33) 4224
105 |             ## 26 de_cache3_out : (1, 64, 1, 33) 2112
106 |             ## 27 de_cache4_out : (1, 64, 1, 65) 4160
107 |             ## 28 de_cache5_out : (1, 64, 1, 129) 8256
108 |             ## 29 enh : (1, 257, 1, 2) 514
109 |             host_mem = cuda.pagelocked_empty(size, self.dtype)
110 |             device_mem = cuda.mem_alloc(host_mem.nbytes)
111 | 
112 |             bindings.append(int(device_mem))
113 | 
114 |             if self.engine.get_tensor_mode(name) == trt.TensorIOMode.INPUT:
115 |                 inputs.append(Bindings(name, shape, host_mem, device_mem, host_mem.nbytes))
116 |             else:
117 |                 outputs.append(Bindings(name, shape, host_mem, device_mem, host_mem.nbytes))
118 | 
119 |         return inputs, outputs, bindings, stream
120 | 
121 | 
122 |     def __call__(self, x: np.ndarray):
123 |         x = x.astype(self.dtype)
124 |         np.copyto(self.inputs[0].host, x.ravel())
125 | 
126 |         # Transfer the current frame of noisy data from CPU to CUDA.
127 |         cuda.memcpy_htod_async(self.inputs[0].device, self.inputs[0].host, self.stream)
128 |         # Execute inference.
129 |         self.context.execute_async_v2(bindings=self.bindings, stream_handle=self.stream.handle)
130 | 
131 |         # Copy cache_out to cache_in within CUDA.
132 |         for i in range(1, 15):
133 |             # print(self.inputs[i].name, self.outputs[i-1].name)
134 |             assert(self.outputs[i-1].size == self.inputs[i].size)
135 |             cuda.memcpy_dtod_async(self.inputs[i].device, self.outputs[i-1].device, self.outputs[i-1].size, self.stream)
136 | 
137 |         # Transfer the current frame of enhanced data from CUDA to CPU.
138 |         cuda.memcpy_dtoh_async(self.outputs[-1].host, self.outputs[-1].device, self.stream)
139 | 
140 |         self.stream.synchronize()
141 | 
142 |         return self.outputs[-1].host.reshape(self.outputs[-1].shape)
143 |     
144 | 
145 | if __name__ == "__main__":
146 |     import time
147 | 
148 |     trtexec_path = r'.\models\trtexec.exe'
149 |     onnx_path = r'.\models\dpcrn.onnx'
150 |     save_path = r'.\models\dpcrn.engine'
151 | 
152 |     ## Convert to engine
153 |     onnx2engine(trtexec_path, onnx_path, save_path)
154 | 
155 |     ## Load engine model
156 |     model = TRTModel(save_path)
157 | 
158 |     x = np.random.randn(1, 257, 1000, 2)
159 |     
160 |     times = np.zeros([x.shape[-2]])
161 |     outputs = np.zeros([1, 257, x.shape[-2], 2])    # [1, F, T, 2]
162 |     for i in range(x.shape[-2]):
163 |         tic = time.perf_counter()
164 |         out_i = model(x[:,:, i:i+1,:])
165 |         toc = time.perf_counter()
166 | 
167 |         outputs[:,:,i:i+1,:] = out_i
168 |         times[i] = 1000*(toc-tic)
169 | 
170 |     print("Average Inference Time (ms): ", times.mean())
171 |     print("Maximum Inference Time (ms): ", times.max())
172 |     print("Minimum Inference Time (ms): ", times.min())
173 | 
174 | 
175 | 
176 | 
177 |     
178 | 
179 | 


--------------------------------------------------------------------------------
/online/models/dpcrn.engine:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Xiaobin-Rong/TRT-SE/e35783a695669fd1346ce566dfe73abc7250ee2f/online/models/dpcrn.engine


--------------------------------------------------------------------------------
/online/models/dpcrn.onnx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Xiaobin-Rong/TRT-SE/e35783a695669fd1346ce566dfe73abc7250ee2f/online/models/dpcrn.onnx


--------------------------------------------------------------------------------
/online/models/dpcrn_simple.onnx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Xiaobin-Rong/TRT-SE/e35783a695669fd1346ce566dfe73abc7250ee2f/online/models/dpcrn_simple.onnx


--------------------------------------------------------------------------------
/online/models/trtexec.exe:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Xiaobin-Rong/TRT-SE/e35783a695669fd1346ce566dfe73abc7250ee2f/online/models/trtexec.exe


--------------------------------------------------------------------------------
/online/modules/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Xiaobin-Rong/TRT-SE/e35783a695669fd1346ce566dfe73abc7250ee2f/online/modules/__init__.py


--------------------------------------------------------------------------------
/online/modules/convert.py:
--------------------------------------------------------------------------------
 1 | 
 2 | 
 3 | def convert_to_stream(stream_model, model):
 4 |     state_dict = model.state_dict()
 5 |     new_state_dict = stream_model.state_dict()
 6 | 
 7 |     for key in stream_model.state_dict().keys():
 8 |         if key in state_dict.keys():
 9 |             new_state_dict[key] = state_dict[key]
10 | 
11 |         elif key.replace('Conv1d.', '') in state_dict.keys():
12 |             new_state_dict[key] = state_dict[key.replace('Conv1d.', '')]
13 |     
14 |         elif key.replace('Conv2d.', '') in state_dict.keys():
15 |             new_state_dict[key] = state_dict[key.replace('Conv2d.', '')]
16 |             
17 |         elif key.replace('ConvTranspose2d.', '') in state_dict.keys():
18 |             new_state_dict[key] = state_dict[key.replace('ConvTranspose2d.', '')]
19 | 
20 |         else:
21 |             raise(ValueError('key error!'))
22 |         
23 |     stream_model.load_state_dict(new_state_dict)
24 | 


--------------------------------------------------------------------------------
/online/modules/convolution.py:
--------------------------------------------------------------------------------
  1 | # -*- coding: utf-8 -*-
  2 | """
  3 | Created on Sat Dec  3 17:32:08 2022
  4 | Modified on Tue Jan  9 17:47:18 2024
  5 | 
  6 | @author: Xiaohuai Le, Xiaobin Rong
  7 | """
  8 | import torch
  9 | import torch.nn as nn
 10 | from typing import List, Tuple, Union
 11 | 
 12 | """
 13 | When export to ONNX format, ensure that the cache is saved as a tensor, not a list.
 14 | """
 15 | 
 16 | class StreamConv1d(nn.Module):
 17 |     def __init__(self,
 18 |                 in_channels: int,
 19 |                 out_channels: int,
 20 |                 kernel_size: int,
 21 |                 stride: int=1,
 22 |                 padding: int=0,
 23 |                 dilation: int=1,
 24 |                 groups: int=1,
 25 |                 bias: bool=True,
 26 |                 *args, **kargs):
 27 |         super(StreamConv1d, self).__init__(*args, *kargs)
 28 |         
 29 |         assert padding == 0, "To meet the demands of causal streaming requirements"
 30 |         
 31 |         self.Conv1d = nn.Conv1d(in_channels = in_channels,
 32 |                                 out_channels = out_channels,
 33 |                                 kernel_size = kernel_size,
 34 |                                 stride = stride,
 35 |                                 padding = padding,
 36 |                                 dilation = dilation,
 37 |                                 groups = groups,
 38 |                                 bias = bias)
 39 |     
 40 |     def forward(self, x, cache):
 41 |         """
 42 |         x:     [bs, C, T_size]
 43 |         cache: [bs, C, T_size-1]
 44 |         """
 45 |         inp = torch.cat([cache, x], dim=-1)
 46 |         oup = self.Conv1d(inp)
 47 |         out_cache = inp[..., 1:]
 48 |         return oup, out_cache
 49 | 
 50 | 
 51 | class StreamConv2d(nn.Module):
 52 |     def __init__(self, 
 53 |                  in_channels: int,
 54 |                  out_channels: int,
 55 |                  kernel_size: Union[int, Tuple[int, int]],
 56 |                  stride: Union[int, Tuple[int, int]] = 1,
 57 |                  padding: Union[str, int, Tuple[int, int]] = 0,
 58 |                  dilation: Union[int, Tuple[int, int]] = 1,
 59 |                  groups: int = 1,
 60 |                  bias: bool = True,
 61 |                  *args, **kargs):
 62 |         super().__init__(*args, **kargs)
 63 |         """
 64 |         kernel_size = [T_size, F_size] by defalut
 65 |         """
 66 |         if type(padding) is int:
 67 |             self.T_pad = padding
 68 |             self.F_pad = padding
 69 |         elif type(padding) in [list, tuple]:
 70 |             self.T_pad, self.F_pad = padding
 71 |         else:
 72 |             raise ValueError('Invalid padding size.')
 73 |         
 74 |         assert self.T_pad == 0, "To meet the demands of causal streaming requirements"
 75 |         
 76 |         self.Conv2d = nn.Conv2d(in_channels = in_channels, 
 77 |                                 out_channels = out_channels,
 78 |                                 kernel_size = kernel_size,
 79 |                                 stride = stride,
 80 |                                 padding = padding,
 81 |                                 dilation = dilation,
 82 |                                 groups = groups,
 83 |                                 bias = bias)
 84 |             
 85 |     def forward(self, x, cache):
 86 |         """
 87 |         x: [bs, C, 1, F]
 88 |         cache: [bs, C, T_size-1, F]
 89 |         """
 90 |         inp = torch.cat([cache, x], dim=2)
 91 |         outp = self.Conv2d(inp)
 92 |         out_cache = inp[:,:, 1:]
 93 |         return outp, out_cache
 94 | 
 95 | 
 96 | class StreamConvTranspose2d(nn.Module):
 97 |     def __init__(self, 
 98 |                  in_channels: int,
 99 |                  out_channels: int,
100 |                  kernel_size: Union[int, Tuple[int, int]],
101 |                  stride: Union[int, Tuple[int, int]] = 1,
102 |                  padding: Union[str, int, Tuple[int, int]] = 0,
103 |                  dilation: Union[int, Tuple[int, int]] = 1,
104 |                  groups: int = 1,
105 |                  bias: bool = True,
106 |                  *args, **kargs):
107 |         super().__init__(*args, **kargs)
108 |         """
109 |         kernel_size = [T_size, F_size] by default
110 |         stride = [T_stride, F_stride] and assert T_stride == 1
111 |         """
112 |         if type(kernel_size) is int:
113 |             self.T_size = kernel_size
114 |             self.F_size = kernel_size
115 |         elif type(kernel_size) in [list, tuple]:
116 |             self.T_size, self.F_size = kernel_size
117 |         else:
118 |             raise ValueError('Invalid kernel size.')
119 |             
120 |         if type(stride) is int:
121 |             self.T_stride = stride
122 |             self.F_stride = stride
123 |         elif type(stride) in [list, tuple]:
124 |             self.T_stride, self.F_stride = stride
125 |         else:
126 |             raise ValueError('Invalid stride size.')
127 |         
128 |         assert self.T_stride == 1
129 | 
130 |         if type(padding) is int:
131 |             self.T_pad = padding
132 |             self.F_pad = padding
133 |         elif type(padding) in [list, tuple]:
134 |             self.T_pad, self.F_pad = padding
135 |         else:
136 |             raise ValueError('Invalid padding size.')
137 |         
138 |         if type(dilation) is int:
139 |             self.T_dilation = dilation
140 |             self.F_dilation = dilation
141 |         elif type(dilation) in [list, tuple]:
142 |             self.T_dilation, self.F_dilation = dilation
143 |         else:
144 |             raise ValueError('Invalid dilation size.')
145 |         
146 |         assert self.T_pad == (self.T_size-1) * self.T_dilation, "To meet the demands of causal streaming requirements"
147 | 
148 |         self.ConvTranspose2d = nn.ConvTranspose2d(in_channels = in_channels, 
149 |                                                 out_channels = out_channels,
150 |                                                 kernel_size = kernel_size,
151 |                                                 stride = stride, 
152 |                                                 padding = padding,
153 |                                                 dilation = dilation,
154 |                                                 groups = groups,
155 |                                                 bias = bias)
156 |         
157 |     def forward(self, x, cache):
158 |         """
159 |         x: [bs, C, 1, F]
160 |         cache: [bs, C, T_size-1, F]
161 |         """
162 |         inp = torch.cat([cache, x], dim=2)
163 |         outp = self.ConvTranspose2d(inp)
164 |         out_cache = inp[:,:, 1:]
165 |         return outp, out_cache
166 | 
167 | 
168 | if __name__ == '__main__':
169 |     from convert import convert_to_stream
170 | 
171 |     ### test Conv1d Stream
172 |     Sconv = StreamConv1d(1, 1, 3)
173 |     Conv = nn.Conv1d(1, 1, 3)
174 |     convert_to_stream(Sconv, Conv)
175 | 
176 |     test_input = torch.randn([1, 1, 10])
177 |     with torch.no_grad():
178 |         ## Non-Streaming
179 |         test_out1 = Conv(torch.nn.functional.pad(test_input, [2,0]))
180 |         
181 |         ## Streaming
182 |         cache = torch.zeros([1, 1, 2])
183 |         test_out2 = []
184 |         for i in range(10):
185 |             out, cache = Sconv(test_input[..., i:i+1], cache)
186 |             test_out2.append(out)
187 |         test_out2 = torch.cat(test_out2, dim=-1)
188 |         print(">>> Streaming Conv1d error:", (test_out1 - test_out2).abs().max())
189 | 
190 |     ### test Conv2d Stream
191 |     Sconv = StreamConv2d(1, 1, [3,3])
192 |     Conv = nn.Conv2d(1, 1, (3,3))
193 |     convert_to_stream(Sconv, Conv)
194 | 
195 |     test_input = torch.randn([1,1,10,6])
196 | 
197 |     with torch.no_grad():
198 |         ## Non-Streaming
199 |         test_out1 = Conv(torch.nn.functional.pad(test_input,[0,0,2,0]))
200 |         
201 |         ## Streaming
202 |         cache = torch.zeros([1,1,2,6])
203 |         test_out2 = []
204 |         for i in range(10):
205 |             out, cache = Sconv(test_input[:,:, i:i+1], cache)
206 |             test_out2.append(out)
207 |         test_out2 = torch.cat(test_out2, dim=2)
208 |         print(">>> Streaming Conv2d error:", (test_out1 - test_out2).abs().max())
209 | 
210 |         
211 |     ### test ConvTranspose2d Stream
212 |     kt = 3  # kernel size along T axis
213 |     dt = 2  # dilation along T axis
214 |     pt = (kt-1) * dt # padding along T axis
215 |     DeConv = torch.nn.ConvTranspose2d(4, 8, (kt,3), stride=(1,2), padding=(pt,1), dilation=(dt,2), groups=2)
216 |     SDeconv = StreamConvTranspose2d(4, 8, (kt,3), stride=(1,2), padding=(2*2,1), dilation=(dt,2), groups=2)
217 |     convert_to_stream(SDeconv, DeConv)
218 | 
219 |     test_input = torch.randn([1, 4, 100, 6])
220 |     with torch.no_grad():
221 |         ## Non-Streaming
222 |         test_out1 = DeConv(nn.functional.pad(test_input, [0,0,pt,0]))  # causal padding!
223 |         test_out1 = test_out1
224 |         ## Streaming
225 |         test_out2 = []
226 |         cache = torch.zeros([1, 4, pt, 6])
227 |         for i in range(100):
228 |             out, cache = SDeconv(test_input[:,:, i:i+1], cache)
229 |             test_out2.append(out)
230 |         test_out2 = torch.cat(test_out2, dim=2)
231 | 
232 |         print(">>> Streaming ConvTranspose2d error:", (test_out1 - test_out2).abs().max())
233 |     
234 | 
235 | 


--------------------------------------------------------------------------------
/readme/README_zh.md:
--------------------------------------------------------------------------------
 1 | # 使用 TensorRT 部署深度学习语音增强模型
 2 | 
 3 | 安装 TensorRT 见：[TensorRT 安装教程](./TRTSETUP_zh.md)
 4 | 
 5 | 语音增强模型的部署可以分为 **离线推理** 和 **在线推理** 两类。离线推理是在预先准备的数据上进行模型推理，通常是一批样本或较长的语音信号。离线推理无实时要求，可使用高效的推理方法和资源分配策略。
 6 | 
 7 | 在线推理是在实时场景中对实时生成的语音数据进行模型推理，如通过麦克风捕捉的连续语音信号。在线推理要求低延迟、高吞吐量以满足实时性的要求。
 8 | 
 9 | ## 离线模型的部署
10 | ### 1. 转换为 ONNX 模型
11 | 对于离线模型，ONNX 的导出非常简单。唯一需要注意的是输入形状的时间维度的设置。尽管 `torch.onnx.export` 支持动态维度，但考虑到实际应用场景中该需求不大，我们选择固定时间维度为 563，对应 9 s 长度的音频数据。当离线处理时，若音频不足 9 s，对其进行补 0 处理；若音频大于 9 s，对其以 9 s 分段按批次处理。
12 | 
13 | `offline\dpcrn_onnx.py` 提供了 ONNX 模型的导出及推理，并评估了 ONNX 模型在 ONNXRuntime 上的推理速度。
14 | 
15 | ### 2. 转换为 Engine 模型
16 | 我们使用 TensorRT 官方提供的转换工具 `trtexec.exe` 来进行模型从 ONNX 到 TensorRT 支持的 Engine 格式的转换，该工具位于 TensorRT 安装包内的 `bin` 目录下，使用方法为：
17 | ```
18 | trtexec.exe --onnx=[onnx_path] ---saveEngine=[save_path]
19 | ```
20 | 
21 | `offline\dpcrn_trt.py` 提供了 Engine 模型的导出及推理，并评估了 Engine 模型的推理速度，结果如下表。
22 | 
23 | | **模型格式** | **推理框架** | **推理平台** | **平均推理速度 （ms）** | **最大推理速度 （ms）** | **最小推理速度 （ms）** |
24 | |:-----------:|:-----------:|:-----------:|:----------------------:|:----------------------:|:----------------------:|
25 | | ONNX | ONNXRuntime | CPU | 8.6 | 21.0 | 7.6|
26 | | Engine |TensorRT| CUDA | 2.2 | 5.1 | 1.9 |
27 | 
28 | 其中推理使用的 CPU 为 13th Gen Intel(R) Core(TM) i9-13900HX @ 2.20 GHz，使用的 CUDA 为 NVIDIA GeForce RTX 4080 Laptop GPU。推理过程重复进行了 1000 次，据此得到平均和最大最小推理速度。这里推理速度的定义为：音频处理时长 / 音频总时长。可以看到 TensorRT 框架的推理速度比 ONNXRuntime 提升了将近 4 倍。
29 | 
30 | ## 在线模型的部署
31 | 在语音增强中，在线推理的应用场景更广泛且对模型的实时性要求更高，相应地，在线推理的部署也更复杂。在这里我们采用 **流式推理** 的方法，对实时数据流进行逐帧推理。在实现流式推理时，需要适当的数据缓冲机制、数据流管理和模型推理的流水线设计，以确保数据的连续性和推理的稳定性。
32 | 
33 | ### 1. 转换为流式模型
34 | RNN 天然适应流式推理，无需额外转换，相比之下，卷积层是神经网络中需要进行流式转换的主要部分。在 `online\modules` 中，我们在 `convolution.py` 中定义了流式卷积和流式转职卷积两种算子的实现，并且在 `convert.py` 中提供了复制原模型参数字典的方法，用于流式模型的转换。
35 | 
36 | `online\dpcrn_stream.py` 提供了流式模型的转换及推理过程，注意对流式模型，输入张量的时间维度始终为 1.
37 | 
38 | ### 2. 转换为 ONNX 模型
39 | 对于流式模型，转换为 ONNX 时无需考虑时间维度的问题，但最好在 `forward` 函数中指定所有的输入张量，而不是像 `online\dpcrn_stream.py` 中那样用列表来代替。
40 | 
41 | `online\dpcrn_stream_onnx.py` 提供了流式 ONNX 模型的转换及推理过程，并评估了其在 ONNXRuntime 的推理速度。
42 | 
43 | ### 3. 转换为 Engine 模型
44 | 我们同样使用 TensorRT 官方提供的转换工具 `trtexec.exe` 来进行模型转换。
45 | 
46 | `online\dpcrn_stream_trt.py` 提供了流式 Engine 模型的导出及推理，并评估了 Engine 模型的推理速度，结果如下表。
47 | 
48 | | **模型格式** | **推理框架** | **推理平台** | **平均推理速度 （ms）** | **最大推理速度 （ms）** | **最小推理速度 （ms）** |
49 | |:-----------:|:-----------:|:-----------:|:----------------------:|:----------------------:|:----------------------:|
50 | | ONNX | ONNXRuntime | CPU |1.0 | 3.1 | 0.9 |
51 | | Engine |TensorRT| CUDA |2.2 | 4.7 | 1.8 |
52 | 
53 | 其中推理速度的评估重复进行了 1000 次，这里推理速度的定义为：音频处理时长 / 音频总帧数。可以看到，使用 TensorRT 推理反而比使用 ONNXRuntime 更慢，这是因为对于高吞吐的流式模型，使用 TensorRT 推理时数据从 CUDA 向 CPU 迁移将会占用一定的时间。只有当模型在 CPU 上推理速度称为瓶颈时，使用 TensorRT 在 CUDA 上推理才有正面效果。
54 | 
55 | ## 致谢
56 | 这个仓库使用的语音增强模型是 [DPCRN](https://arxiv.org/abs/2107.05429)，是一个高性能低延时的优秀 SE 模型，在 DNS3 挑战赛中取得了排名第二的成绩。作者是我的师兄 [乐笑怀](https://github.com/Le-Xiaohuai-speech)，他教给了我很多东西，包括这个仓库中使用的流式转换方法。
57 | 


--------------------------------------------------------------------------------
/readme/TRTSETUP.md:
--------------------------------------------------------------------------------
 1 | # TensorRT Installation Tutorial
 2 | The Chinese version of the document can be found in: [TensorRT 安装教程](./TRTSETUP_zh.md)
 3 | 
 4 | TensorRT is a deep learning inference engine developed by NVIDIA, designed to optimize and accelerate the inference process of deep learning models. It is specifically designed for deploying and efficiently executing deep learning models on NVIDIA GPUs. Models trained using frameworks such as PyTorch and TensorFlow can be converted to the TensorRT format and then utilized with the TensorRT inference engine to improve the model's runtime speed on GPUs. TensorRT is an ideal framework for deploying deep learning models on GPUs.
 5 | 
 6 | This tutorial provides the installation and configuration methods for TensorRT on the Windows operating system. I use `NVIDIA GeForce RTX 4080 Laptop GPU`, and the corresponding software environment consists of the following components:
 7 | * CUDA 11.7
 8 | * cuDNN 8.8
 9 | * TensorRT 8.5
10 | * Python 3.8.13
11 | 
12 | ## 1. Environment Setup
13 | TensorRT is version-specific to CUDA and cuDNN, so it is necessary to ensure that CUDA and cuDNN are already installed and their versions are compatible. <br>
14 | CUDA download link: https://developer.nvidia.com/cuda-downloads <br>
15 | cuDNN download link: https://developer.nvidia.com/rdp/cudnn-download <br>
16 | 
17 | ### 1.1 Check CUDA Installation
18 | Open the command prompt and enter `nvcc -V`. If you see the following information, it means CUDA is installed:
19 | ```
20 | Cuda compilation tools, release 11.7, V11.7.64
21 | Build cuda_11.7.r11.7/compiler.31294372_0
22 | ```
23 | Furthermore, navigate to the installation path of CUDA, which is by default `C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA`. Within this path, go to `v11.7\extras\demo_suite`, where `v11.7` represents the version of CUDA downloaded. In this directory, open the command prompt and run `.\deviceQuery.exe` and `.\bandwidthTest.exe`. If both tests pass, it indicates that CUDA is functioning correctly.
24 | 
25 | Note: If CUDA is not installed, download the appropriate version for your GPU model from the official NVIDIA website.
26 | 
27 | ### 1.2 Check cuDNN Installation
28 | Installing cuDNN involves copying a series of files to the CUDA installation directory:
29 | * Download the cuDNN installation package corresponding to your CUDA version from the official website. Registration and login are required on the website to access the downloads. Extract the files locally.
30 | * Move the files in the `bin` directory of cuDNN to the `bin` directory of CUDA (`C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7\bin`).
31 | * Move the files in the `include` directory of cuDNN to the `include` directory of CUDA.
32 | * Move the files in the `lib\x64` directory of cuDNN to the `lib\x64` directory of CUDA.
33 | 
34 | ## 2. Install TensorRT
35 | ### 2.1 Download
36 | TensorRT can be downloaded from the official website https://developer.nvidia.com/tensorrt. Registration and login are required on the website to access the downloads. For Windows systems, simply download the compressed package and extract it to the local machine, making sure to select the TensorRT version that corresponds to your CUDA version.
37 | 
38 | ### 2.2 Configuration
39 | #### 2.2.1 File Configuration
40 | The configuration process for TensorRT is similar to that of cuDNN:
41 | * Copy the files in the `include` directory of the TensorRT installation package to the `include` directory of CUDA.
42 | * Copy all the `lib` files in the `lib` directory of the TensorRT installation package to the `lib\x64` directory of CUDA.
43 | * Copy all the `dll` files in the `lib` directory of the TensorRT installation package to the `bin` directory of CUDA.
44 | * Add the path to the `bin` directory of TensorRT to the environment variables.
45 | * Add the paths to the `include`, `lib`, and `bin` directories of CUDA to the environment variables.
46 | 
47 | #### 2.2.2 Install tensorrt
48 | * Navigate to the `TensorRT-8.5.1.7\python` directory, which contains the `whl` files for different versions of TensorRT compatible with different Python versions. Since the Python version in the virtual environment is 3.8, install the `whl` file corresponding to `cp38`. In the terminal command line at this path, within the virtual environment, run:
49 |   ```
50 |   pip install tensorrt-8.5.1.7-cp38-none-win_amd64.whl
51 |   ```
52 | * Navigate to the `TensorRT-8.5.1.7\graphsurgeon` directory. In the terminal command line at this path, within the virtual environment, run:
53 |   ```
54 |   pip install graphsurgeon-0.4.6-py2.py3-none-any.whl
55 |    ```
56 | * Navigate to the `TensorRT-8.5.1.7\onnx_graphsurgeon` directory，In the terminal command line at this path, within the virtual environment, run:
57 |   ```
58 |   pip install onnx_ graphsurgeon -0.3.12 - py2.py3-none -any.whl
59 |   ```
60 | After the installation is complete, enter the Python environment and print the version information. If no errors are reported, it indicates a successful installation.
61 | ```
62 | import tensorrt as trt
63 | print(trt .__ version __)
64 | assert trt. Builder (trt. Logger ())
65 | ```
66 | 
67 | #### 2.2.3 Install PyCUDA
68 | * Go to the PyCUDA download URL: https://www.lfd.uci.edu/~gohlke/pythonlibs/#pycuda and download the appropriate PyCUDA `.whl` file based on your CUDA and Python versions.
69 | * Navigate to the directory where the `.whl` file is located. Open a terminal or command prompt at that path and activate your virtual environment.
70 | * Run the following command:
71 | ```
72 | pip install pycuda‑2022.1+cuda116‑cp38‑cp38‑win_amd64.whl
73 | ```
74 | 
75 | After completing the above configurations, you may still encounter an error when using the TensorRT compilation tool to convert ONNX models: `Could not locate zlibwapi.dll. Please make sure it is in your library path!` 
76 | 
77 | To resolve this issue:
78 | * First, download the `zlib` file and extract its contents.
79 | * Navigate to the `dll_x64` folder.
80 | * Copy the `zlibwapi.lib` file and paste it into `C:\Program Files\NVIDIA GPU ComputingToolkit\CUDA\v11.7\lib\x64`.
81 | * Copy the `zlibwapi.dll` file and paste it into `C:\Program Files\NVIDIA GPU ComputingToolkit\CUDA\v11.7\bin`.
82 | 
83 | 
84 | ## Reference
85 | [1] [Windows 安装 CUDA / cuDNN](https://zhuanlan.zhihu.com/p/99880204?from_voters_page=true) <br>
86 | [2] [Windows 系统下如何确认 CUDA 和 cuDNN 都安装成功了](https://blog.csdn.net/qq_35768355/article/details/132985948) <br>
87 | [3] [TensorRT 安装](https://blog.csdn.net/weixin_51691064/article/details/130403978) <br>
88 | [4] [TensorRT 安装 zlibwapi.dll](https://blog.csdn.net/weixin_42166222/article/details/130625663) <br>
89 | [5] [TensorRT 安装记录](https://blog.csdn.net/qq_37541097/article/details/114847600) <br>
90 | [6] [PyCUDA 安装与使用](https://blog.csdn.net/qq_41910905/article/details/109650182)
91 | 


--------------------------------------------------------------------------------
/readme/TRTSETUP_zh.md:
--------------------------------------------------------------------------------
 1 | # TensorRT 安装教程
 2 | 
 3 | TensorRT 是由 NVIDIA 开发的深度学习推理引擎，旨在优化和加速深度学习模型的推理过程。它是为了在 NVIDIA GPU 上部署和高效执行深度学习模型而设计的。利用 Pytorch、TensorFlow 等框架训练好的模型，可以转化为 TensorRT 的格式，然后利用 TensorRT 推理引擎进行推理，从而提升该模型在 GPU 上的运行速度。对于部署于 GPU 的深度学习模型，TensorRT 是个非常理想的推理框架。
 4 | 
 5 | 本教程提供了在 Windows 系统中 TensorRT 的安装及配置方法。本机的显卡型号为 `NVIDIA GeForce RTX 4080 Laptop GPU`，使用的环境为：
 6 | * CUDA 11.7
 7 | * cuDNN 8.8
 8 | * TensorRT 8.5
 9 | * Python 3.8.13
10 | 
11 | ## 1 环境准备
12 | TensorRT 与 CUDA 及 cuDNN 的版本是对应的，因此需要先确定已安装好 CUDA 与 cuDNN，并确认其版本。 <br>
13 | CUDA 下载地址：https://developer.nvidia.com/cuda-downloads <br>
14 | cuDNN 下载地址：https://developer.nvidia.com/rdp/cudnn-download <br>
15 | 
16 | ### 1.1 检查 CUDA
17 | 进入命令行，输入 `nvcc -V`，若看到如下信息，则说明 CUDA 已安装。
18 | ```
19 | Cuda compilation tools, release 11.7, V11.7.64
20 | Build cuda_11.7.r11.7/compiler.31294372_0
21 | ```
22 | 进一步，进入 CUDA 的安装路径，默认情况下是 `C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA`，进入其下的 `v11.7\extras\demo_suite`，其中 v11.7 为下载的 CUDA 版本号。在该路径下打开命令行，运行 `.\deviceQuery.exe` 及 `.\bandwidthTest.exe`，若两项测试均通过，说明 CUDA 能正常工作。
23 | 
24 | P.S. 若 CUDA 未安装好，根据自己显卡的型号到官网下载合适版本的 CUDA 即可。
25 | 
26 | ### 1.2 检查 cuDNN
27 | cuDNN 的安装是将一系列文件导入到 CUDA 的安装目录下：
28 | * 从官网下载好与 CUDA 对应版本的 cuDNN 安装包，官网要求注册会员并登录后方可下载。解压至本地；
29 | * 将 cuDNN 中 `bin` 目录下的文件移动到 CUDA 的 `bin` 目录（`C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7\bin`）；
30 | * 将 cuDNN 中 `include` 目录下的文件移动到 CUDA 的 `include` 目录；
31 | * 将 cuDNN 中 `lib\x64` 目录下的文件移动到 CUDA 的 `lib\x64` 目录。
32 | 
33 | ## 2 安装 TensorRT
34 | ### 2.1 下载
35 | TensorRT 可以通过官网 https://developer.nvidia.com/tensorrt 下载，官网要求注册会员并登录后方可下载。Windows 系统只需下载好压缩包并解压到本地即可，注意选择与 CUDA 版本对应的 TensorRT 版本。
36 | 
37 | ### 2.2 配置
38 | #### 2.2.1. 文件配置
39 | TensorRT 的配置与 cuDNN 的配置类似：
40 | * 将 TensorRT 安装包中 `include` 目录下的文件复制到 CUDA 的 `include` 目录下；
41 | * 将 TensorRT 安装包中 `lib` 目录下的所有 `lib` 文件复制到 CUDA 的 `lib\x64` 目录下；
42 | * 将 TensorRT 安装包中 `lib` 目录下的所有 `dll` 文件复制到 CUDA 的 `bin` 目录下；
43 | * 将 TensorRT 的 `bin` 路径添加到环境变量；
44 | * 将 CUDA 的 `include`、`lib` 和 `bin` 路径添加到环境变量。
45 | 
46 | #### 2.2.2 安装 tensorrt
47 | * 进入 `TensorRT-8.5.1.7\python` 目录，该目录下有 tensorrt 的针对不同 python 版本的 `whl` 文件。我们虚拟环境中的 python 版本为 3.8，应该安装 `cp38` 对应的 `whl` 文件。
48 |   在该路径下打开终端命令行，进入虚拟环境后，运行：
49 |   ```
50 |   pip install tensorrt -8.5.1.7 - cp38-none -win_amd 64. whl
51 |   ```
52 | * 进入 `TensorRT-8.5.1.7\graphsurgeon` 目录，在该路径下打开终端命令行，进入虚拟环境后，运行：
53 |   ```
54 |   pip install graphsurgeon -0.4.6 - py2.py3-none -any.whl
55 |   ```
56 | * 进入 `TensorRT-8.5.1.7\onnx_graphsurgeon` 目录，在该路径下打开终端命令行，进入虚拟环境后，运行：
57 |   ```
58 |   pip install onnx_ graphsurgeon -0.3.12 - py2.py3-none -any.whl
59 |   ```
60 | 安装完成后，进入 python 环境，打印版本号等信息，若不报错则说明安装成功。
61 | ```
62 | import tensorrt as trt
63 | print(trt .__ version __)
64 | assert trt. Builder (trt. Logger ())
65 | ```
66 | 
67 | #### 2.2.3 安装 pycuda
68 | * 在 PyCUDA 下载网址 https://www.lfd.uci.edu/~gohlke/pythonlibs/#pycuda 上根据 CUDA 和 python 版本下载好合适的 PyCUDA `whl` 文件。
69 | * 进入 `whl` 文件所在目录，在该路径下打开终端命令行，进入虚拟环境后，运行：
70 | ```
71 | pip install pycuda ‑2022.1+ cuda 116‑ cp 38‑ cp 38‑ win_amd 64. whl
72 | ```
73 | 
74 | 以上一系列的配置完成后，在使用 TensorRT 的编译工具对 ONNX 模型进行转换时，仍可能会报错：`Could not locate zlibwapi.dll. Please make sure it is in your library path!` <br>
75 | 解决方案是：
76 | * 首先下载 `zlib` 文件，解压后进入 `dll_x64` 文件夹；
77 | * 将 `zlibwapi.lib` 文件放到 `C:\Program Files\NVIDIA GPU ComputingToolkit\CUDA\v11.7\lib\x64` 下；
78 | * 将 `zlibwapi.dll` 文件放到 `C:\Program Files\NVIDIA GPU ComputingToolkit\CUDA\v11.7\bin` 下。
79 | 
80 | 
81 | ## 参考文章
82 | [1] [Windows 安装 CUDA / cuDNN](https://zhuanlan.zhihu.com/p/99880204?from_voters_page=true) <br>
83 | [2] [Windows 系统下如何确认 CUDA 和 cuDNN 都安装成功了](https://blog.csdn.net/qq_35768355/article/details/132985948) <br>
84 | [3] [TensorRT 安装](https://blog.csdn.net/weixin_51691064/article/details/130403978) <br>
85 | [4] [TensorRT 安装 zlibwapi.dll](https://blog.csdn.net/weixin_42166222/article/details/130625663) <br>
86 | [5] [TensorRT 安装记录](https://blog.csdn.net/qq_37541097/article/details/114847600) <br>
87 | [6] [PyCUDA 安装与使用](https://blog.csdn.net/qq_41910905/article/details/109650182)
88 | 


--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
 1 | numpy==1.22.4
 2 | onnx==1.14.1
 3 | onnxruntime==1.15.1
 4 | onnxruntime_gpu==1.16.0
 5 | onnxsim==0.4.33
 6 | ptflops==0.7
 7 | pycuda==2022.2.2
 8 | tensorrt==8.5.1.7
 9 | torch==1.9.0+cu111
10 | 


--------------------------------------------------------------------------------