├── .gitignore
├── LICENSE
├── README.md
├── cfg
    ├── yolov4-crowdhuman-416x416.cfg
    ├── yolov4-crowdhuman-480x480.cfg
    ├── yolov4-crowdhuman-608x608.cfg
    ├── yolov4-tiny-3l-crowdhuman-416x416.cfg
    ├── yolov4-tiny-3l-crowdhuman-608x608.cfg
    ├── yolov4-tiny-crowdhuman-416x416.cfg
    └── yolov4-tiny-crowdhuman-608x608.cfg
├── data
    ├── README.md
    ├── crowdhuman-template.data
    ├── crowdhuman.names
    ├── gen_txts.py
    ├── image_histogram.ipynb
    ├── prepare_data.sh
    └── verify_txts.py
├── doc
    ├── cant_connect_gpu.jpg
    ├── chart_yolov4-crowdhuman-608x608.png
    ├── chart_yolov4-tiny-3l-crowdhuman-416x416.png
    ├── chart_yolov4-tiny-crowdhuman-608x608.png
    ├── crowdhuman_sample.jpg
    ├── drive_on_colab.jpg
    ├── infinity_war.jpg
    ├── predictions_sample.jpg
    └── save_a_copy.jpg
├── prepare_training.sh
└── yolov4_crowdhuman.ipynb


/.gitignore:
--------------------------------------------------------------------------------
__pycache__
*.pyc

data/raw/
data/crowdhuman*/
data/crowdhuman-*.data
data/.ipynb_checkpoints/

--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2020 JK Jung

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
YOLOv4 CrowdHuman Tutorial
==========================

This is a tutorial demonstrating how to train a YOLOv4 people detector using [Darknet](https://github.com/AlexeyAB/darknet) and the [CrowdHuman dataset](https://www.crowdhuman.org/).

Table of contents
-----------------

* [Setup](#setup)
* [Preparing training data](#preparing)
* [Training on a local PC](#training-locally)
* [Testing the custom-trained yolov4 model](#testing)
* [Training on Google Colab](#training-colab)
* [Deploying onto Jetson Nano](#deploying)


Setup
-----

If you are going to train the model on [Google Colab](https://colab.research.google.com/notebooks/intro.ipynb), you could skip this section and jump straight to [Training on Google Colab](#training-colab).

Otherwise, to run training locally, you need to have an x86_64 PC with a decent GPU. For example, I mainly test the code in this repository using a desktop PC with:

* NVIDIA GeForce RTX 2080 Ti
* Ubuntu 18.04.5 LTS (x86_64)
   - CUDA 10.2
   - cuDNN 8.0.1

In addition, you should have OpenCV (including the python3 "cv2" module) installed properly on the local PC, since both the data preparation code and "darknet" require it.
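
A quick way to confirm that python3 can see the "cv2" module before running the data preparation scripts is sketched below. The helper name `module_available` is my own and is not part of this repository; it only checks that the module can be located, not that OpenCV was built with any particular feature.

```python
import importlib.util


def module_available(name):
    """Return True if python3 can locate a top-level module called `name`."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # "cv2" is the module the tutorial needs; the helper itself is hypothetical.
    if module_available("cv2"):
        print("cv2 found")
    else:
        print("cv2 not found; install OpenCV with python3 bindings first")
```

If the check fails, installing OpenCV with its python3 bindings should come before any of the steps that follow.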


Preparing training data
-----------------------

For training on the local PC, I use a "608x608" yolov4 model as an example. Note that I use python3 exclusively in this tutorial (python2 might not work). Follow these steps to prepare the "CrowdHuman" dataset for training the yolov4 model.

1. Clone this repository.

   ```shell
   $ cd ${HOME}/project
   $ git clone https://github.com/jkjung-avt/yolov4_crowdhuman
   ```

2. Run the "prepare_data.sh" script in the "data/" subdirectory. It downloads the "CrowdHuman" dataset, unzips the train/val image files, and generates the YOLO txt files needed for training. You could refer to [data/README.md](data/README.md) for more information about the dataset. You could further refer to [How to train (to detect your custom objects)](https://github.com/AlexeyAB/darknet#how-to-train-to-detect-your-custom-objects) for an explanation of YOLO txt files.

   ```shell
   $ cd ${HOME}/project/yolov4_crowdhuman/data
   $ ./prepare_data.sh 608x608
   ```

   This step could take quite a while, depending on your internet speed. When it is done, all image files and ".txt" files for training will be in the "data/crowdhuman-608x608/" subdirectory. (If interested, you could do `python3 verify_txts.py 608x608` to verify the generated txt files.)

   This tutorial is for training the yolov4 model to detect 2 classes of object: "head" (0) and "person" (1), where the "person" class corresponds to "full body" (including occluded body portions) in the original "CrowdHuman" annotations. Take a look at "data/crowdhuman-608x608.data", "data/crowdhuman.names", and "data/crowdhuman-608x608/" to gain a better understanding of the data files that have been generated/prepared for the training.
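
To make the YOLO txt format concrete, here is a minimal sketch of how one pixel-space box could be turned into one line of a ".txt" file. It assumes the box is given as top-left x, y plus width and height in pixels, and that a box reaching outside the image is clipped to it. The function name `to_yolo_line` and the clipping choice are assumptions for illustration; the repository's "gen_txts.py" may handle these details differently.

```python
def to_yolo_line(class_id, x, y, w, h, img_w, img_h):
    """Format one box as '<class> <cx> <cy> <w> <h>', normalized to 0~1.

    (x, y) is the top-left corner in pixels; w and h are in pixels.
    Clipping to the image bounds is an assumption of this sketch.
    """
    # Clip the box to the image so normalized values stay within [0, 1].
    x0 = max(0.0, float(x))
    y0 = max(0.0, float(y))
    x1 = min(float(img_w), float(x + w))
    y1 = min(float(img_h), float(y + h))

    # YOLO txt files store the box center and size relative to image size.
    cx = (x0 + x1) / 2.0 / img_w
    cy = (y0 + y1) / 2.0 / img_h
    bw = (x1 - x0) / img_w
    bh = (y1 - y0) / img_h
    return f"{class_id} {cx:.6f} {cy:.6f} {bw:.6f} {bh:.6f}"


if __name__ == "__main__":
    # A "head" (class 0) box in a 400x200 image.
    print(to_yolo_line(0, 100, 50, 200, 100, 400, 200))
    # A "person" (class 1) box that starts left of the image edge.
    print(to_yolo_line(1, -50, 0, 100, 200, 400, 200))
```

Each line holds the class id followed by the box center and size, so an image with several heads and persons gets one such line per box.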

![A sample jpg from the CrowdHuman dataset](doc/crowdhuman_sample.jpg)


Training on a local PC
----------------------

Continuing from the steps in the previous section, you'd be using the "darknet" framework to train the yolov4 model.

1. Download and build the "darknet" code. (NOTE to myself: Consider making "darknet" a submodule and automating the build process?)

   ```shell
   $ cd ${HOME}/project/yolov4_crowdhuman
   $ git clone https://github.com/AlexeyAB/darknet.git
   $ cd darknet
   $ vim Makefile  # edit Makefile with your preferred editor (might not be vim)
   ```

   Modify the first few lines of the "Makefile" as follows. Please refer to [How to compile on Linux (using make)](https://github.com/AlexeyAB/darknet#how-to-compile-on-linux-using-make) for more information about these settings. Note that, in the example below, CUDA compute "75" is for RTX 2080 Ti and "61" is for GTX 1080. You might need to modify those based on the kind of GPU you are using.

   ```
   GPU=1
   CUDNN=1
   CUDNN_HALF=1
   OPENCV=1
   AVX=1
   OPENMP=1
   LIBSO=1
   ZED_CAMERA=0
   ZED_CAMERA_v2_8=0

   ......

   USE_CPP=0
   DEBUG=0

   ARCH= -gencode arch=compute_61,code=[sm_61,compute_61] \
         -gencode arch=compute_75,code=[sm_75,compute_75]

   ......
   ```

   Then do a `make` to build "darknet".

   ```shell
   $ make
   ```

   When it is done, you could (optionally) test the "darknet" executable as follows.

   ```shell
   ### download pre-trained yolov4 coco weights and test with the dog image
   $ wget https://github.com/AlexeyAB/darknet/releases/download/darknet_yolo_v3_optimal/yolov4.weights \
          -q --show-progress --no-clobber
   $ ./darknet detector test cfg/coco.data cfg/yolov4-416.cfg yolov4.weights \
                             data/dog.jpg
   ```

2. Then copy over all files needed for training and download the pre-trained weights ("yolov4.conv.137").

   ```shell
   $ cd ${HOME}/project/yolov4_crowdhuman
   $ ./prepare_training.sh 608x608
   ```

3. Train the "yolov4-crowdhuman-608x608" model. Please refer to [How to train with multi-GPU](https://github.com/AlexeyAB/darknet#how-to-train-with-multi-gpu) for how to fine-tune your training process. For example, you could specify `-gpus 0,1,2,3` in order to use multiple GPUs to speed up training.

   ```shell
   $ cd ${HOME}/project/yolov4_crowdhuman/darknet
   $ ./darknet detector train data/crowdhuman-608x608.data \
                              cfg/yolov4-crowdhuman-608x608.cfg \
                              yolov4.conv.137 -map -gpus 0
   ```

   When the model is being trained, you could monitor its progress on the loss/mAP chart (since the `-map` option is used). Alternatively, if you are training on a remote PC via ssh, add the `-dont_show -mjpeg_port 8090` options so that you could monitor the loss/mAP chart in a web browser (http://{IP address}:8090/).

   As a reference, training this "yolov4-crowdhuman-608x608" model with my RTX 2080 Ti GPU takes 17~18 hours.

   ![My sample loss/mAP chart of the "yolov4-crowdhuman-608x608" model](doc/chart_yolov4-crowdhuman-608x608.png)

   As another example, training the "yolov4-tiny-crowdhuman-608x608" model on an RTX 2080 Ti GPU takes less than 3 hours.

   ![My sample loss/mAP chart of the "yolov4-tiny-crowdhuman-608x608" model](doc/chart_yolov4-tiny-crowdhuman-608x608.png)

   And training the "yolov4-tiny-3l-crowdhuman-416x416" model on an RTX 2080 Ti GPU takes less than 2 hours.

   ![My sample loss/mAP chart of the "yolov4-tiny-3l-crowdhuman-416x416" model](doc/chart_yolov4-tiny-3l-crowdhuman-416x416.png)


Testing the custom-trained yolov4 model
---------------------------------------

After you have trained the "yolov4-crowdhuman-608x608" model locally, you could test the "best" custom-trained model like this.

```shell
$ cd ${HOME}/project/yolov4_crowdhuman/darknet
$ ./darknet detector test data/crowdhuman-608x608.data \
                          cfg/yolov4-crowdhuman-608x608.cfg \
                          backup/yolov4-crowdhuman-608x608_best.weights \
                          data/crowdhuman-608x608/273275,4e9d1000623d182f.jpg \
                          -gpus 0
```

![A sample prediction using the trained "yolov4-crowdhuman-608x608" model](doc/predictions_sample.jpg)

In addition, you could verify mAP of the "best" model like this.

```
$ ./darknet detector map data/crowdhuman-608x608.data \
                         cfg/yolov4-crowdhuman-608x608.cfg \
                         backup/yolov4-crowdhuman-608x608_best.weights \
                         -gpus 0
```

For example, I got mAP@0.50 = 0.814523 when I tested my own custom-trained "yolov4-crowdhuman-608x608" model.

```
 detections_count = 614280, unique_truth_count = 183365
class_id = 0, name = head, ap = 82.60%           (TP = 65119, FP = 14590)
class_id = 1, name = person, ap = 80.30%         (TP = 72055, FP = 11766)

 for conf_thresh = 0.25, precision = 0.84, recall = 0.75, F1-score = 0.79
 for conf_thresh = 0.25, TP = 137174, FP = 26356, FN = 46191, average IoU = 66.92 %

 IoU threshold = 50 %, used Area-Under-Curve for each unique Recall
 mean average precision (mAP@0.50) = 0.814523, or 81.45 %
```


Training on Google Colab
------------------------

For training on Google Colab, I use a "416x416" yolov4 model as an example. I have put all data processing and training commands into an IPython Notebook. So training the "yolov4-crowdhuman-416x416" model on Google Colab is as simple as: (1) opening the Notebook on Google Colab, (2) mounting your Google Drive, (3) running all cells in the Notebook.

A few words of caution before you begin running the Notebook on Google Colab:

* Google Colab's GPU runtime is *free of charge*, but it is **neither unlimited nor guaranteed**. Even though the Google Colab [FAQ](https://research.google.com/colaboratory/faq.html#resource-limits) states that *"virtual machines have maximum lifetimes that can be as much as 12 hours"*, I often saw my Colab GPU sessions getting disconnected after 7~8 hours of non-interactive use.

* If you connect to GPU instances on Google Colab repeatedly and frequently, you could be **temporarily locked out** (not able to connect to GPU instances for a couple of days). So I'd suggest that you connect to a GPU runtime sparingly and only when needed, and that you manually terminate the GPU sessions as soon as you no longer need them.

* It is strongly advised that you read and mind Google Colab's [Resource Limits](https://research.google.com/colaboratory/faq.html#resource-limits).

Due to the 7~8 hour limit of GPU runtime mentioned above, you won't be able to train a large yolov4 model in a single session. That's the reason why I chose the "416x416" model for this part of the tutorial. Here are the steps:

1. Open [yolov4_crowdhuman.ipynb](https://colab.research.google.com/drive/1eoa2_v6wVlcJiDBh3Tb_umhm7a09lpIE?usp=sharing). This IPython Notebook is on my personal Google Drive. You could review it, but you could not modify it.

2. Make a copy of "yolov4_crowdhuman.ipynb" on your own Google Drive, by clicking "File -> Save a copy in Drive" on the menu. You should use your own saved copy of the Notebook for the rest of the steps.

   ![Saving a copy of yolov4_crowdhuman.ipynb](./doc/save_a_copy.jpg)

3. Follow the instructions in the Notebook to train the "yolov4-crowdhuman-416x416" model, i.e.

   - make sure the IPython Notebook has successfully connected to a GPU runtime,
   - mount your Google Drive (for saving training log and weights),
   - run all cells ("Runtime -> Run all" or "Runtime -> Restart and run all").

   You should have a good chance of finishing training the "yolov4-crowdhuman-416x416" model before the Colab session gets automatically disconnected (expired).

Instead of opening the Colab Notebook on my Google Drive, you could also go to [your own Colab account](https://colab.research.google.com/notebooks/intro.ipynb) and use "File -> Upload notebook" to upload [yolov4_crowdhuman.ipynb](yolov4_crowdhuman.ipynb) directly.

Refer to my [Custom YOLOv4 Model on Google Colab](https://jkjung-avt.github.io/colab-yolov4/) post for additional information about running the IPython Notebook.


Deploying onto Jetson Nano
--------------------------

To deploy the trained "yolov4-crowdhuman-416x416" model onto Jetson Nano, I'd use my [jkjung-avt/tensorrt_demos](https://github.com/jkjung-avt/tensorrt_demos) code to build/deploy it as a TensorRT engine. Here are the detailed steps:

1. On the Jetson Nano, check out my [jkjung-avt/tensorrt_demos](https://github.com/jkjung-avt/tensorrt_demos) code and make sure you are able to run the standard "yolov4-416" TensorRT engine without problem. Please refer to [Demo #5: YOLOv4](https://github.com/jkjung-avt/tensorrt_demos#yolov4) for details.

   ```shell
   $ cd ${HOME}/project
   $ git clone https://github.com/jkjung-avt/tensorrt_demos.git
   ### Detailed steps omitted: install pycuda, download yolov4-416 model, yolo_to_onnx, onnx_to_tensorrt
   ### ......
   $ cd ${HOME}/project/tensorrt_demos
   $ python3 trt_yolo.py --image ${HOME}/Pictures/dog.jpg -m yolov4-416
   ```

2. Download the "yolov4-crowdhuman-416x416" model. More specifically, get "yolov4-crowdhuman-416x416.cfg" from this repository and download the "yolov4-crowdhuman-416x416_best.weights" file from your Google Drive. Rename the .weights file so that it matches the .cfg file.

   ```shell
   $ cd ${HOME}/project/tensorrt_demos/yolo
   $ wget https://raw.githubusercontent.com/jkjung-avt/yolov4_crowdhuman/master/cfg/yolov4-crowdhuman-416x416.cfg
   $ cp ${HOME}/Downloads/yolov4-crowdhuman-416x416_best.weights yolov4-crowdhuman-416x416.weights
   ```

   Then build the TensorRT (FP16) engine. Note that the "-c 2" command-line option specifies that the model detects 2 classes of objects.

   ```shell
   $ python3 yolo_to_onnx.py -c 2 -m yolov4-crowdhuman-416x416
   $ python3 onnx_to_tensorrt.py -c 2 -m yolov4-crowdhuman-416x416
   ```

3. Test the TensorRT engine. For example, I tested it with the "Avengers: Infinity War" movie trailer. (You should download and test with your own images or videos.)

   ```shell
   $ cd ${HOME}/project/tensorrt_demos
   $ python3 trt_yolo.py --video ${HOME}/Videos/Infinity_War.mp4 \
                         -c 2 -m yolov4-crowdhuman-416x416
   ```

   (Click on the image below to see the whole video clip...)

   [![Testing with the Avengers: Infinity War trailer](https://raw.githubusercontent.com/jkjung-avt/yolov4_crowdhuman/master/doc/infinity_war.jpg)](https://youtu.be/7Qr_Fq18FgM)


Contributions
--------------------------

[@philipp-schmidt](https://github.com/philipp-schmidt): yolov4-tiny models and training charts

--------------------------------------------------------------------------------
/cfg/yolov4-crowdhuman-416x416.cfg:
--------------------------------------------------------------------------------
[net]
# Training
batch=24
subdivisions=8
# Testing
#batch=1
#subdivisions=1
width=416
height=416
channels=3
momentum=0.949
decay=0.0005
angle=0
saturation = 1.5
exposure = 1.5
hue=.1

learning_rate=0.001
burn_in=1000
max_batches=4000
policy=steps
steps=3000,3600
scales=.1,.1

max_chart_loss=40.0

#cutmix=1
mosaic=1

#:104x104 54:52x52 85:26x26 104:13x13 for 416

[convolutional]
batch_normalize=1
filters=32
size=3
stride=1
pad=1
activation=mish

# Downsample

[convolutional]
batch_normalize=1
filters=64
size=3
stride=2
pad=1
activation=mish

[convolutional]
batch_normalize=1
filters=64
size=1
stride=1
pad=1
activation=mish

[route]
layers = -2

[convolutional]
batch_normalize=1
filters=64
size=1
stride=1
pad=1
activation=mish

[convolutional]
batch_normalize=1
filters=32
size=1
stride=1
pad=1
activation=mish

[convolutional]
batch_normalize=1
filters=64
size=3
stride=1
pad=1
activation=mish

[shortcut]
from=-3
activation=linear

[convolutional]
batch_normalize=1
filters=64
size=1
stride=1
pad=1
activation=mish

[route]
layers = -1,-7

[convolutional]
batch_normalize=1
filters=64
size=1
stride=1
pad=1
activation=mish

# Downsample

[convolutional]
batch_normalize=1
filters=128
size=3
stride=2
pad=1
activation=mish

[convolutional]
batch_normalize=1
filters=64
size=1
stride=1
pad=1
activation=mish

[route]
layers = -2

[convolutional]
batch_normalize=1
filters=64
size=1
stride=1
pad=1
activation=mish

[convolutional]
batch_normalize=1
filters=64
size=1
stride=1
pad=1
activation=mish

[convolutional]
batch_normalize=1
filters=64
size=3
stride=1
pad=1
activation=mish

[shortcut]
from=-3
activation=linear

[convolutional]
batch_normalize=1
filters=64
size=1
stride=1
pad=1
activation=mish

[convolutional]
batch_normalize=1
filters=64
size=3
stride=1
pad=1
activation=mish

[shortcut]
from=-3
activation=linear

[convolutional]
batch_normalize=1
filters=64
size=1
stride=1
pad=1
activation=mish

[route]
layers = -1,-10

[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=mish

# Downsample

[convolutional]
batch_normalize=1
filters=256
size=3
stride=2
pad=1
activation=mish

[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=mish

[route]
layers = -2

[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=mish

[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=mish

[convolutional]
batch_normalize=1
filters=128
size=3
stride=1
pad=1
activation=mish

[shortcut]
from=-3
activation=linear

[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=mish

[convolutional]
batch_normalize=1
filters=128
size=3
stride=1
pad=1
activation=mish

[shortcut]
from=-3
activation=linear

[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=mish

[convolutional]
batch_normalize=1
filters=128
size=3
stride=1
pad=1
activation=mish

[shortcut]
from=-3
activation=linear

[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=mish

[convolutional]
batch_normalize=1
filters=128
size=3
stride=1
pad=1
activation=mish

[shortcut]
from=-3
activation=linear


[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=mish

[convolutional]
batch_normalize=1
filters=128
size=3
stride=1
pad=1
activation=mish

[shortcut]
from=-3
activation=linear

[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=mish

[convolutional]
batch_normalize=1
filters=128
size=3
stride=1
pad=1
activation=mish

[shortcut]
from=-3
activation=linear

[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=mish

[convolutional]
batch_normalize=1
filters=128
size=3
stride=1
pad=1
activation=mish

[shortcut]
from=-3
activation=linear

[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=mish

[convolutional]
batch_normalize=1
filters=128
size=3
stride=1
pad=1
activation=mish

[shortcut]
from=-3
activation=linear

[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=mish

[route]
layers = -1,-28

[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=mish

# Downsample

[convolutional]
batch_normalize=1
filters=512
size=3
stride=2
pad=1
activation=mish

[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=mish

[route]
layers = -2

[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=mish

[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=mish

[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=mish

[shortcut]
from=-3
activation=linear


[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=mish

[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=mish

[shortcut]
from=-3
activation=linear


[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=mish

[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=mish

[shortcut]
from=-3
activation=linear


[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=mish

[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=mish

[shortcut]
from=-3
activation=linear


[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=mish

[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=mish

[shortcut]
from=-3
activation=linear


[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=mish

[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=mish

[shortcut]
from=-3
activation=linear


[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=mish

[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=mish

[shortcut]
from=-3
activation=linear

[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=mish

[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=mish

[shortcut]
from=-3
activation=linear

[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=mish

[route]
layers = -1,-28

[convolutional]
batch_normalize=1
filters=512
size=1
stride=1
pad=1
activation=mish

# Downsample

[convolutional]
batch_normalize=1
filters=1024
size=3
stride=2
pad=1
activation=mish

[convolutional]
batch_normalize=1
filters=512
size=1
stride=1
pad=1
activation=mish

[route]
layers = -2

[convolutional]
batch_normalize=1
filters=512
size=1
stride=1
pad=1
activation=mish

[convolutional]
batch_normalize=1
filters=512
size=1
stride=1
pad=1
activation=mish

[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=mish

[shortcut]
from=-3
activation=linear

[convolutional]
batch_normalize=1
filters=512
size=1
stride=1
pad=1
activation=mish

[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=mish

[shortcut]
from=-3
activation=linear

[convolutional]
batch_normalize=1
filters=512
size=1
stride=1
pad=1
activation=mish

[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=mish

[shortcut]
from=-3
activation=linear

[convolutional]
batch_normalize=1
filters=512
size=1
stride=1
pad=1
activation=mish

[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=mish

[shortcut]
from=-3
activation=linear

[convolutional]
batch_normalize=1
filters=512
size=1
stride=1
pad=1
activation=mish

[route]
layers = -1,-16

[convolutional]
batch_normalize=1
filters=1024
size=1
stride=1
pad=1
activation=mish
stopbackward=800

##########################

[convolutional]
batch_normalize=1
filters=512
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
size=3
stride=1
pad=1
filters=1024
activation=leaky

[convolutional]
batch_normalize=1
filters=512
size=1
stride=1
pad=1
activation=leaky

### SPP ###
[maxpool]
stride=1
size=5

[route]
layers=-2

[maxpool]
stride=1
size=9

[route]
layers=-4

[maxpool]
stride=1
size=13

[route]
layers=-1,-3,-5,-6
### End SPP ###

[convolutional]
batch_normalize=1
filters=512
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
size=3
stride=1
pad=1
filters=1024
activation=leaky

[convolutional]
batch_normalize=1
filters=512
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky

[upsample]
stride=2

[route]
layers = 85

[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky

[route]
layers = -1, -3

[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
size=3
stride=1
pad=1
filters=512
activation=leaky

[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
size=3
stride=1
pad=1
filters=512
activation=leaky

[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky

[upsample]
stride=2

[route]
layers = 54

[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky

[route]
layers = -1, -3

[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
size=3
stride=1
pad=1
filters=256
activation=leaky

[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
size=3
stride=1
pad=1
filters=256
activation=leaky

[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky

##########################

[convolutional]
batch_normalize=1
size=3
stride=1
pad=1
filters=256
activation=leaky

[convolutional]
size=1
stride=1
pad=1
filters=21
activation=linear 967 | 968 | 969 | [yolo] 970 | mask = 0,1,2 971 | anchors = 9,18, 17,45, 26,84, 38,132, 48,188, 62,256, 87,338, 133,215, 192,355 972 | classes=2 973 | num=9 974 | jitter=.3 975 | ignore_thresh = .7 976 | truth_thresh = 1 977 | random=1 978 | scale_x_y = 1.2 979 | iou_thresh=0.213 980 | cls_normalizer=1.0 981 | iou_normalizer=0.07 982 | iou_loss=ciou 983 | nms_kind=greedynms 984 | beta_nms=0.6 985 | max_delta=5 986 | 987 | 988 | [route] 989 | layers = -4 990 | 991 | [convolutional] 992 | batch_normalize=1 993 | size=3 994 | stride=2 995 | pad=1 996 | filters=256 997 | activation=leaky 998 | 999 | [route] 1000 | layers = -1, -16 1001 | 1002 | [convolutional] 1003 | batch_normalize=1 1004 | filters=256 1005 | size=1 1006 | stride=1 1007 | pad=1 1008 | activation=leaky 1009 | 1010 | [convolutional] 1011 | batch_normalize=1 1012 | size=3 1013 | stride=1 1014 | pad=1 1015 | filters=512 1016 | activation=leaky 1017 | 1018 | [convolutional] 1019 | batch_normalize=1 1020 | filters=256 1021 | size=1 1022 | stride=1 1023 | pad=1 1024 | activation=leaky 1025 | 1026 | [convolutional] 1027 | batch_normalize=1 1028 | size=3 1029 | stride=1 1030 | pad=1 1031 | filters=512 1032 | activation=leaky 1033 | 1034 | [convolutional] 1035 | batch_normalize=1 1036 | filters=256 1037 | size=1 1038 | stride=1 1039 | pad=1 1040 | activation=leaky 1041 | 1042 | [convolutional] 1043 | batch_normalize=1 1044 | size=3 1045 | stride=1 1046 | pad=1 1047 | filters=512 1048 | activation=leaky 1049 | 1050 | [convolutional] 1051 | size=1 1052 | stride=1 1053 | pad=1 1054 | filters=21 1055 | activation=linear 1056 | 1057 | 1058 | [yolo] 1059 | mask = 3,4,5 1060 | anchors = 9,18, 17,45, 26,84, 38,132, 48,188, 62,256, 87,338, 133,215, 192,355 1061 | classes=2 1062 | num=9 1063 | jitter=.3 1064 | ignore_thresh = .7 1065 | truth_thresh = 1 1066 | random=1 1067 | scale_x_y = 1.1 1068 | iou_thresh=0.213 1069 | cls_normalizer=1.0 1070 | iou_normalizer=0.07 1071 | iou_loss=ciou 1072 | 
nms_kind=greedynms 1073 | beta_nms=0.6 1074 | max_delta=5 1075 | 1076 | 1077 | [route] 1078 | layers = -4 1079 | 1080 | [convolutional] 1081 | batch_normalize=1 1082 | size=3 1083 | stride=2 1084 | pad=1 1085 | filters=512 1086 | activation=leaky 1087 | 1088 | [route] 1089 | layers = -1, -37 1090 | 1091 | [convolutional] 1092 | batch_normalize=1 1093 | filters=512 1094 | size=1 1095 | stride=1 1096 | pad=1 1097 | activation=leaky 1098 | 1099 | [convolutional] 1100 | batch_normalize=1 1101 | size=3 1102 | stride=1 1103 | pad=1 1104 | filters=1024 1105 | activation=leaky 1106 | 1107 | [convolutional] 1108 | batch_normalize=1 1109 | filters=512 1110 | size=1 1111 | stride=1 1112 | pad=1 1113 | activation=leaky 1114 | 1115 | [convolutional] 1116 | batch_normalize=1 1117 | size=3 1118 | stride=1 1119 | pad=1 1120 | filters=1024 1121 | activation=leaky 1122 | 1123 | [convolutional] 1124 | batch_normalize=1 1125 | filters=512 1126 | size=1 1127 | stride=1 1128 | pad=1 1129 | activation=leaky 1130 | 1131 | [convolutional] 1132 | batch_normalize=1 1133 | size=3 1134 | stride=1 1135 | pad=1 1136 | filters=1024 1137 | activation=leaky 1138 | 1139 | [convolutional] 1140 | size=1 1141 | stride=1 1142 | pad=1 1143 | filters=21 1144 | activation=linear 1145 | 1146 | 1147 | [yolo] 1148 | mask = 6,7,8 1149 | anchors = 9,18, 17,45, 26,84, 38,132, 48,188, 62,256, 87,338, 133,215, 192,355 1150 | classes=2 1151 | num=9 1152 | jitter=.3 1153 | ignore_thresh = .7 1154 | truth_thresh = 1 1155 | random=1 1156 | scale_x_y = 1.05 1157 | iou_thresh=0.213 1158 | cls_normalizer=1.0 1159 | iou_normalizer=0.07 1160 | iou_loss=ciou 1161 | nms_kind=greedynms 1162 | beta_nms=0.6 1163 | max_delta=5 1164 | 1165 | -------------------------------------------------------------------------------- /cfg/yolov4-crowdhuman-480x480.cfg: -------------------------------------------------------------------------------- 1 | [net] 2 | # Training 3 | batch=64 4 | subdivisions=32 5 | # Testing 6 | #batch=1 7 | 
#subdivisions=1 8 | width=480 9 | height=480 10 | channels=3 11 | momentum=0.949 12 | decay=0.0005 13 | angle=0 14 | saturation = 1.5 15 | exposure = 1.5 16 | hue=.1 17 | 18 | learning_rate=0.0013 19 | burn_in=1000 20 | max_batches=6000 21 | policy=steps 22 | steps=4800,5400 23 | scales=.1,.1 24 | 25 | max_chart_loss=40.0 26 | 27 | #cutmix=1 28 | mosaic=1 29 | 30 | #:104x104 54:52x52 85:26x26 104:13x13 for 416 31 | 32 | [convolutional] 33 | batch_normalize=1 34 | filters=32 35 | size=3 36 | stride=1 37 | pad=1 38 | activation=mish 39 | 40 | # Downsample 41 | 42 | [convolutional] 43 | batch_normalize=1 44 | filters=64 45 | size=3 46 | stride=2 47 | pad=1 48 | activation=mish 49 | 50 | [convolutional] 51 | batch_normalize=1 52 | filters=64 53 | size=1 54 | stride=1 55 | pad=1 56 | activation=mish 57 | 58 | [route] 59 | layers = -2 60 | 61 | [convolutional] 62 | batch_normalize=1 63 | filters=64 64 | size=1 65 | stride=1 66 | pad=1 67 | activation=mish 68 | 69 | [convolutional] 70 | batch_normalize=1 71 | filters=32 72 | size=1 73 | stride=1 74 | pad=1 75 | activation=mish 76 | 77 | [convolutional] 78 | batch_normalize=1 79 | filters=64 80 | size=3 81 | stride=1 82 | pad=1 83 | activation=mish 84 | 85 | [shortcut] 86 | from=-3 87 | activation=linear 88 | 89 | [convolutional] 90 | batch_normalize=1 91 | filters=64 92 | size=1 93 | stride=1 94 | pad=1 95 | activation=mish 96 | 97 | [route] 98 | layers = -1,-7 99 | 100 | [convolutional] 101 | batch_normalize=1 102 | filters=64 103 | size=1 104 | stride=1 105 | pad=1 106 | activation=mish 107 | 108 | # Downsample 109 | 110 | [convolutional] 111 | batch_normalize=1 112 | filters=128 113 | size=3 114 | stride=2 115 | pad=1 116 | activation=mish 117 | 118 | [convolutional] 119 | batch_normalize=1 120 | filters=64 121 | size=1 122 | stride=1 123 | pad=1 124 | activation=mish 125 | 126 | [route] 127 | layers = -2 128 | 129 | [convolutional] 130 | batch_normalize=1 131 | filters=64 132 | size=1 133 | stride=1 134 | pad=1 135 | 
activation=mish 136 | 137 | [convolutional] 138 | batch_normalize=1 139 | filters=64 140 | size=1 141 | stride=1 142 | pad=1 143 | activation=mish 144 | 145 | [convolutional] 146 | batch_normalize=1 147 | filters=64 148 | size=3 149 | stride=1 150 | pad=1 151 | activation=mish 152 | 153 | [shortcut] 154 | from=-3 155 | activation=linear 156 | 157 | [convolutional] 158 | batch_normalize=1 159 | filters=64 160 | size=1 161 | stride=1 162 | pad=1 163 | activation=mish 164 | 165 | [convolutional] 166 | batch_normalize=1 167 | filters=64 168 | size=3 169 | stride=1 170 | pad=1 171 | activation=mish 172 | 173 | [shortcut] 174 | from=-3 175 | activation=linear 176 | 177 | [convolutional] 178 | batch_normalize=1 179 | filters=64 180 | size=1 181 | stride=1 182 | pad=1 183 | activation=mish 184 | 185 | [route] 186 | layers = -1,-10 187 | 188 | [convolutional] 189 | batch_normalize=1 190 | filters=128 191 | size=1 192 | stride=1 193 | pad=1 194 | activation=mish 195 | 196 | # Downsample 197 | 198 | [convolutional] 199 | batch_normalize=1 200 | filters=256 201 | size=3 202 | stride=2 203 | pad=1 204 | activation=mish 205 | 206 | [convolutional] 207 | batch_normalize=1 208 | filters=128 209 | size=1 210 | stride=1 211 | pad=1 212 | activation=mish 213 | 214 | [route] 215 | layers = -2 216 | 217 | [convolutional] 218 | batch_normalize=1 219 | filters=128 220 | size=1 221 | stride=1 222 | pad=1 223 | activation=mish 224 | 225 | [convolutional] 226 | batch_normalize=1 227 | filters=128 228 | size=1 229 | stride=1 230 | pad=1 231 | activation=mish 232 | 233 | [convolutional] 234 | batch_normalize=1 235 | filters=128 236 | size=3 237 | stride=1 238 | pad=1 239 | activation=mish 240 | 241 | [shortcut] 242 | from=-3 243 | activation=linear 244 | 245 | [convolutional] 246 | batch_normalize=1 247 | filters=128 248 | size=1 249 | stride=1 250 | pad=1 251 | activation=mish 252 | 253 | [convolutional] 254 | batch_normalize=1 255 | filters=128 256 | size=3 257 | stride=1 258 | pad=1 259 | 
activation=mish 260 | 261 | [shortcut] 262 | from=-3 263 | activation=linear 264 | 265 | [convolutional] 266 | batch_normalize=1 267 | filters=128 268 | size=1 269 | stride=1 270 | pad=1 271 | activation=mish 272 | 273 | [convolutional] 274 | batch_normalize=1 275 | filters=128 276 | size=3 277 | stride=1 278 | pad=1 279 | activation=mish 280 | 281 | [shortcut] 282 | from=-3 283 | activation=linear 284 | 285 | [convolutional] 286 | batch_normalize=1 287 | filters=128 288 | size=1 289 | stride=1 290 | pad=1 291 | activation=mish 292 | 293 | [convolutional] 294 | batch_normalize=1 295 | filters=128 296 | size=3 297 | stride=1 298 | pad=1 299 | activation=mish 300 | 301 | [shortcut] 302 | from=-3 303 | activation=linear 304 | 305 | 306 | [convolutional] 307 | batch_normalize=1 308 | filters=128 309 | size=1 310 | stride=1 311 | pad=1 312 | activation=mish 313 | 314 | [convolutional] 315 | batch_normalize=1 316 | filters=128 317 | size=3 318 | stride=1 319 | pad=1 320 | activation=mish 321 | 322 | [shortcut] 323 | from=-3 324 | activation=linear 325 | 326 | [convolutional] 327 | batch_normalize=1 328 | filters=128 329 | size=1 330 | stride=1 331 | pad=1 332 | activation=mish 333 | 334 | [convolutional] 335 | batch_normalize=1 336 | filters=128 337 | size=3 338 | stride=1 339 | pad=1 340 | activation=mish 341 | 342 | [shortcut] 343 | from=-3 344 | activation=linear 345 | 346 | [convolutional] 347 | batch_normalize=1 348 | filters=128 349 | size=1 350 | stride=1 351 | pad=1 352 | activation=mish 353 | 354 | [convolutional] 355 | batch_normalize=1 356 | filters=128 357 | size=3 358 | stride=1 359 | pad=1 360 | activation=mish 361 | 362 | [shortcut] 363 | from=-3 364 | activation=linear 365 | 366 | [convolutional] 367 | batch_normalize=1 368 | filters=128 369 | size=1 370 | stride=1 371 | pad=1 372 | activation=mish 373 | 374 | [convolutional] 375 | batch_normalize=1 376 | filters=128 377 | size=3 378 | stride=1 379 | pad=1 380 | activation=mish 381 | 382 | [shortcut] 383 
| from=-3 384 | activation=linear 385 | 386 | [convolutional] 387 | batch_normalize=1 388 | filters=128 389 | size=1 390 | stride=1 391 | pad=1 392 | activation=mish 393 | 394 | [route] 395 | layers = -1,-28 396 | 397 | [convolutional] 398 | batch_normalize=1 399 | filters=256 400 | size=1 401 | stride=1 402 | pad=1 403 | activation=mish 404 | 405 | # Downsample 406 | 407 | [convolutional] 408 | batch_normalize=1 409 | filters=512 410 | size=3 411 | stride=2 412 | pad=1 413 | activation=mish 414 | 415 | [convolutional] 416 | batch_normalize=1 417 | filters=256 418 | size=1 419 | stride=1 420 | pad=1 421 | activation=mish 422 | 423 | [route] 424 | layers = -2 425 | 426 | [convolutional] 427 | batch_normalize=1 428 | filters=256 429 | size=1 430 | stride=1 431 | pad=1 432 | activation=mish 433 | 434 | [convolutional] 435 | batch_normalize=1 436 | filters=256 437 | size=1 438 | stride=1 439 | pad=1 440 | activation=mish 441 | 442 | [convolutional] 443 | batch_normalize=1 444 | filters=256 445 | size=3 446 | stride=1 447 | pad=1 448 | activation=mish 449 | 450 | [shortcut] 451 | from=-3 452 | activation=linear 453 | 454 | 455 | [convolutional] 456 | batch_normalize=1 457 | filters=256 458 | size=1 459 | stride=1 460 | pad=1 461 | activation=mish 462 | 463 | [convolutional] 464 | batch_normalize=1 465 | filters=256 466 | size=3 467 | stride=1 468 | pad=1 469 | activation=mish 470 | 471 | [shortcut] 472 | from=-3 473 | activation=linear 474 | 475 | 476 | [convolutional] 477 | batch_normalize=1 478 | filters=256 479 | size=1 480 | stride=1 481 | pad=1 482 | activation=mish 483 | 484 | [convolutional] 485 | batch_normalize=1 486 | filters=256 487 | size=3 488 | stride=1 489 | pad=1 490 | activation=mish 491 | 492 | [shortcut] 493 | from=-3 494 | activation=linear 495 | 496 | 497 | [convolutional] 498 | batch_normalize=1 499 | filters=256 500 | size=1 501 | stride=1 502 | pad=1 503 | activation=mish 504 | 505 | [convolutional] 506 | batch_normalize=1 507 | filters=256 508 | 
size=3 509 | stride=1 510 | pad=1 511 | activation=mish 512 | 513 | [shortcut] 514 | from=-3 515 | activation=linear 516 | 517 | 518 | [convolutional] 519 | batch_normalize=1 520 | filters=256 521 | size=1 522 | stride=1 523 | pad=1 524 | activation=mish 525 | 526 | [convolutional] 527 | batch_normalize=1 528 | filters=256 529 | size=3 530 | stride=1 531 | pad=1 532 | activation=mish 533 | 534 | [shortcut] 535 | from=-3 536 | activation=linear 537 | 538 | 539 | [convolutional] 540 | batch_normalize=1 541 | filters=256 542 | size=1 543 | stride=1 544 | pad=1 545 | activation=mish 546 | 547 | [convolutional] 548 | batch_normalize=1 549 | filters=256 550 | size=3 551 | stride=1 552 | pad=1 553 | activation=mish 554 | 555 | [shortcut] 556 | from=-3 557 | activation=linear 558 | 559 | 560 | [convolutional] 561 | batch_normalize=1 562 | filters=256 563 | size=1 564 | stride=1 565 | pad=1 566 | activation=mish 567 | 568 | [convolutional] 569 | batch_normalize=1 570 | filters=256 571 | size=3 572 | stride=1 573 | pad=1 574 | activation=mish 575 | 576 | [shortcut] 577 | from=-3 578 | activation=linear 579 | 580 | [convolutional] 581 | batch_normalize=1 582 | filters=256 583 | size=1 584 | stride=1 585 | pad=1 586 | activation=mish 587 | 588 | [convolutional] 589 | batch_normalize=1 590 | filters=256 591 | size=3 592 | stride=1 593 | pad=1 594 | activation=mish 595 | 596 | [shortcut] 597 | from=-3 598 | activation=linear 599 | 600 | [convolutional] 601 | batch_normalize=1 602 | filters=256 603 | size=1 604 | stride=1 605 | pad=1 606 | activation=mish 607 | 608 | [route] 609 | layers = -1,-28 610 | 611 | [convolutional] 612 | batch_normalize=1 613 | filters=512 614 | size=1 615 | stride=1 616 | pad=1 617 | activation=mish 618 | 619 | # Downsample 620 | 621 | [convolutional] 622 | batch_normalize=1 623 | filters=1024 624 | size=3 625 | stride=2 626 | pad=1 627 | activation=mish 628 | 629 | [convolutional] 630 | batch_normalize=1 631 | filters=512 632 | size=1 633 | stride=1 
634 | pad=1 635 | activation=mish 636 | 637 | [route] 638 | layers = -2 639 | 640 | [convolutional] 641 | batch_normalize=1 642 | filters=512 643 | size=1 644 | stride=1 645 | pad=1 646 | activation=mish 647 | 648 | [convolutional] 649 | batch_normalize=1 650 | filters=512 651 | size=1 652 | stride=1 653 | pad=1 654 | activation=mish 655 | 656 | [convolutional] 657 | batch_normalize=1 658 | filters=512 659 | size=3 660 | stride=1 661 | pad=1 662 | activation=mish 663 | 664 | [shortcut] 665 | from=-3 666 | activation=linear 667 | 668 | [convolutional] 669 | batch_normalize=1 670 | filters=512 671 | size=1 672 | stride=1 673 | pad=1 674 | activation=mish 675 | 676 | [convolutional] 677 | batch_normalize=1 678 | filters=512 679 | size=3 680 | stride=1 681 | pad=1 682 | activation=mish 683 | 684 | [shortcut] 685 | from=-3 686 | activation=linear 687 | 688 | [convolutional] 689 | batch_normalize=1 690 | filters=512 691 | size=1 692 | stride=1 693 | pad=1 694 | activation=mish 695 | 696 | [convolutional] 697 | batch_normalize=1 698 | filters=512 699 | size=3 700 | stride=1 701 | pad=1 702 | activation=mish 703 | 704 | [shortcut] 705 | from=-3 706 | activation=linear 707 | 708 | [convolutional] 709 | batch_normalize=1 710 | filters=512 711 | size=1 712 | stride=1 713 | pad=1 714 | activation=mish 715 | 716 | [convolutional] 717 | batch_normalize=1 718 | filters=512 719 | size=3 720 | stride=1 721 | pad=1 722 | activation=mish 723 | 724 | [shortcut] 725 | from=-3 726 | activation=linear 727 | 728 | [convolutional] 729 | batch_normalize=1 730 | filters=512 731 | size=1 732 | stride=1 733 | pad=1 734 | activation=mish 735 | 736 | [route] 737 | layers = -1,-16 738 | 739 | [convolutional] 740 | batch_normalize=1 741 | filters=1024 742 | size=1 743 | stride=1 744 | pad=1 745 | activation=mish 746 | stopbackward=800 747 | 748 | ########################## 749 | 750 | [convolutional] 751 | batch_normalize=1 752 | filters=512 753 | size=1 754 | stride=1 755 | pad=1 756 | 
activation=leaky 757 | 758 | [convolutional] 759 | batch_normalize=1 760 | size=3 761 | stride=1 762 | pad=1 763 | filters=1024 764 | activation=leaky 765 | 766 | [convolutional] 767 | batch_normalize=1 768 | filters=512 769 | size=1 770 | stride=1 771 | pad=1 772 | activation=leaky 773 | 774 | ### SPP ### 775 | [maxpool] 776 | stride=1 777 | size=5 778 | 779 | [route] 780 | layers=-2 781 | 782 | [maxpool] 783 | stride=1 784 | size=9 785 | 786 | [route] 787 | layers=-4 788 | 789 | [maxpool] 790 | stride=1 791 | size=13 792 | 793 | [route] 794 | layers=-1,-3,-5,-6 795 | ### End SPP ### 796 | 797 | [convolutional] 798 | batch_normalize=1 799 | filters=512 800 | size=1 801 | stride=1 802 | pad=1 803 | activation=leaky 804 | 805 | [convolutional] 806 | batch_normalize=1 807 | size=3 808 | stride=1 809 | pad=1 810 | filters=1024 811 | activation=leaky 812 | 813 | [convolutional] 814 | batch_normalize=1 815 | filters=512 816 | size=1 817 | stride=1 818 | pad=1 819 | activation=leaky 820 | 821 | [convolutional] 822 | batch_normalize=1 823 | filters=256 824 | size=1 825 | stride=1 826 | pad=1 827 | activation=leaky 828 | 829 | [upsample] 830 | stride=2 831 | 832 | [route] 833 | layers = 85 834 | 835 | [convolutional] 836 | batch_normalize=1 837 | filters=256 838 | size=1 839 | stride=1 840 | pad=1 841 | activation=leaky 842 | 843 | [route] 844 | layers = -1, -3 845 | 846 | [convolutional] 847 | batch_normalize=1 848 | filters=256 849 | size=1 850 | stride=1 851 | pad=1 852 | activation=leaky 853 | 854 | [convolutional] 855 | batch_normalize=1 856 | size=3 857 | stride=1 858 | pad=1 859 | filters=512 860 | activation=leaky 861 | 862 | [convolutional] 863 | batch_normalize=1 864 | filters=256 865 | size=1 866 | stride=1 867 | pad=1 868 | activation=leaky 869 | 870 | [convolutional] 871 | batch_normalize=1 872 | size=3 873 | stride=1 874 | pad=1 875 | filters=512 876 | activation=leaky 877 | 878 | [convolutional] 879 | batch_normalize=1 880 | filters=256 881 | size=1 882 | 
stride=1 883 | pad=1 884 | activation=leaky 885 | 886 | [convolutional] 887 | batch_normalize=1 888 | filters=128 889 | size=1 890 | stride=1 891 | pad=1 892 | activation=leaky 893 | 894 | [upsample] 895 | stride=2 896 | 897 | [route] 898 | layers = 54 899 | 900 | [convolutional] 901 | batch_normalize=1 902 | filters=128 903 | size=1 904 | stride=1 905 | pad=1 906 | activation=leaky 907 | 908 | [route] 909 | layers = -1, -3 910 | 911 | [convolutional] 912 | batch_normalize=1 913 | filters=128 914 | size=1 915 | stride=1 916 | pad=1 917 | activation=leaky 918 | 919 | [convolutional] 920 | batch_normalize=1 921 | size=3 922 | stride=1 923 | pad=1 924 | filters=256 925 | activation=leaky 926 | 927 | [convolutional] 928 | batch_normalize=1 929 | filters=128 930 | size=1 931 | stride=1 932 | pad=1 933 | activation=leaky 934 | 935 | [convolutional] 936 | batch_normalize=1 937 | size=3 938 | stride=1 939 | pad=1 940 | filters=256 941 | activation=leaky 942 | 943 | [convolutional] 944 | batch_normalize=1 945 | filters=128 946 | size=1 947 | stride=1 948 | pad=1 949 | activation=leaky 950 | 951 | ########################## 952 | 953 | [convolutional] 954 | batch_normalize=1 955 | size=3 956 | stride=1 957 | pad=1 958 | filters=256 959 | activation=leaky 960 | 961 | [convolutional] 962 | size=1 963 | stride=1 964 | pad=1 965 | filters=21 966 | activation=linear 967 | 968 | 969 | [yolo] 970 | mask = 0,1,2 971 | anchors = 10,19, 19,50, 30,93, 43,148, 55,213, 71,292, 99,388, 154,249, 219,411 972 | classes=2 973 | num=9 974 | jitter=.3 975 | ignore_thresh = .7 976 | truth_thresh = 1 977 | random=1 978 | scale_x_y = 1.2 979 | iou_thresh=0.213 980 | cls_normalizer=1.0 981 | iou_normalizer=0.07 982 | iou_loss=ciou 983 | nms_kind=greedynms 984 | beta_nms=0.6 985 | max_delta=5 986 | 987 | 988 | [route] 989 | layers = -4 990 | 991 | [convolutional] 992 | batch_normalize=1 993 | size=3 994 | stride=2 995 | pad=1 996 | filters=256 997 | activation=leaky 998 | 999 | [route] 1000 | layers 
= -1, -16 1001 | 1002 | [convolutional] 1003 | batch_normalize=1 1004 | filters=256 1005 | size=1 1006 | stride=1 1007 | pad=1 1008 | activation=leaky 1009 | 1010 | [convolutional] 1011 | batch_normalize=1 1012 | size=3 1013 | stride=1 1014 | pad=1 1015 | filters=512 1016 | activation=leaky 1017 | 1018 | [convolutional] 1019 | batch_normalize=1 1020 | filters=256 1021 | size=1 1022 | stride=1 1023 | pad=1 1024 | activation=leaky 1025 | 1026 | [convolutional] 1027 | batch_normalize=1 1028 | size=3 1029 | stride=1 1030 | pad=1 1031 | filters=512 1032 | activation=leaky 1033 | 1034 | [convolutional] 1035 | batch_normalize=1 1036 | filters=256 1037 | size=1 1038 | stride=1 1039 | pad=1 1040 | activation=leaky 1041 | 1042 | [convolutional] 1043 | batch_normalize=1 1044 | size=3 1045 | stride=1 1046 | pad=1 1047 | filters=512 1048 | activation=leaky 1049 | 1050 | [convolutional] 1051 | size=1 1052 | stride=1 1053 | pad=1 1054 | filters=21 1055 | activation=linear 1056 | 1057 | 1058 | [yolo] 1059 | mask = 3,4,5 1060 | anchors = 10,19, 19,50, 30,93, 43,148, 55,213, 71,292, 99,388, 154,249, 219,411 1061 | classes=2 1062 | num=9 1063 | jitter=.3 1064 | ignore_thresh = .7 1065 | truth_thresh = 1 1066 | random=1 1067 | scale_x_y = 1.1 1068 | iou_thresh=0.213 1069 | cls_normalizer=1.0 1070 | iou_normalizer=0.07 1071 | iou_loss=ciou 1072 | nms_kind=greedynms 1073 | beta_nms=0.6 1074 | max_delta=5 1075 | 1076 | 1077 | [route] 1078 | layers = -4 1079 | 1080 | [convolutional] 1081 | batch_normalize=1 1082 | size=3 1083 | stride=2 1084 | pad=1 1085 | filters=512 1086 | activation=leaky 1087 | 1088 | [route] 1089 | layers = -1, -37 1090 | 1091 | [convolutional] 1092 | batch_normalize=1 1093 | filters=512 1094 | size=1 1095 | stride=1 1096 | pad=1 1097 | activation=leaky 1098 | 1099 | [convolutional] 1100 | batch_normalize=1 1101 | size=3 1102 | stride=1 1103 | pad=1 1104 | filters=1024 1105 | activation=leaky 1106 | 1107 | [convolutional] 1108 | batch_normalize=1 1109 | filters=512 
1110 | size=1 1111 | stride=1 1112 | pad=1 1113 | activation=leaky 1114 | 1115 | [convolutional] 1116 | batch_normalize=1 1117 | size=3 1118 | stride=1 1119 | pad=1 1120 | filters=1024 1121 | activation=leaky 1122 | 1123 | [convolutional] 1124 | batch_normalize=1 1125 | filters=512 1126 | size=1 1127 | stride=1 1128 | pad=1 1129 | activation=leaky 1130 | 1131 | [convolutional] 1132 | batch_normalize=1 1133 | size=3 1134 | stride=1 1135 | pad=1 1136 | filters=1024 1137 | activation=leaky 1138 | 1139 | [convolutional] 1140 | size=1 1141 | stride=1 1142 | pad=1 1143 | filters=21 1144 | activation=linear 1145 | 1146 | 1147 | [yolo] 1148 | mask = 6,7,8 1149 | anchors = 10,19, 19,50, 30,93, 43,148, 55,213, 71,292, 99,388, 154,249, 219,411 1150 | classes=2 1151 | num=9 1152 | jitter=.3 1153 | ignore_thresh = .7 1154 | truth_thresh = 1 1155 | random=1 1156 | scale_x_y = 1.05 1157 | iou_thresh=0.213 1158 | cls_normalizer=1.0 1159 | iou_normalizer=0.07 1160 | iou_loss=ciou 1161 | nms_kind=greedynms 1162 | beta_nms=0.6 1163 | max_delta=5 1164 | 1165 | -------------------------------------------------------------------------------- /cfg/yolov4-crowdhuman-608x608.cfg: -------------------------------------------------------------------------------- 1 | [net] 2 | # Training 3 | batch=64 4 | subdivisions=64 5 | # Testing 6 | #batch=1 7 | #subdivisions=1 8 | width=608 9 | height=608 10 | channels=3 11 | momentum=0.949 12 | decay=0.0005 13 | angle=0 14 | saturation = 1.5 15 | exposure = 1.5 16 | hue=.1 17 | 18 | learning_rate=0.0013 19 | burn_in=1000 20 | max_batches=6000 21 | policy=steps 22 | steps=4800,5400 23 | scales=.1,.1 24 | 25 | max_chart_loss=40.0 26 | 27 | #cutmix=1 28 | mosaic=1 29 | 30 | #:104x104 54:52x52 85:26x26 104:13x13 for 416 31 | 32 | [convolutional] 33 | batch_normalize=1 34 | filters=32 35 | size=3 36 | stride=1 37 | pad=1 38 | activation=mish 39 | 40 | # Downsample 41 | 42 | [convolutional] 43 | batch_normalize=1 44 | filters=64 45 | size=3 46 | stride=2 47 | 
pad=1 48 | activation=mish 49 | 50 | [convolutional] 51 | batch_normalize=1 52 | filters=64 53 | size=1 54 | stride=1 55 | pad=1 56 | activation=mish 57 | 58 | [route] 59 | layers = -2 60 | 61 | [convolutional] 62 | batch_normalize=1 63 | filters=64 64 | size=1 65 | stride=1 66 | pad=1 67 | activation=mish 68 | 69 | [convolutional] 70 | batch_normalize=1 71 | filters=32 72 | size=1 73 | stride=1 74 | pad=1 75 | activation=mish 76 | 77 | [convolutional] 78 | batch_normalize=1 79 | filters=64 80 | size=3 81 | stride=1 82 | pad=1 83 | activation=mish 84 | 85 | [shortcut] 86 | from=-3 87 | activation=linear 88 | 89 | [convolutional] 90 | batch_normalize=1 91 | filters=64 92 | size=1 93 | stride=1 94 | pad=1 95 | activation=mish 96 | 97 | [route] 98 | layers = -1,-7 99 | 100 | [convolutional] 101 | batch_normalize=1 102 | filters=64 103 | size=1 104 | stride=1 105 | pad=1 106 | activation=mish 107 | 108 | # Downsample 109 | 110 | [convolutional] 111 | batch_normalize=1 112 | filters=128 113 | size=3 114 | stride=2 115 | pad=1 116 | activation=mish 117 | 118 | [convolutional] 119 | batch_normalize=1 120 | filters=64 121 | size=1 122 | stride=1 123 | pad=1 124 | activation=mish 125 | 126 | [route] 127 | layers = -2 128 | 129 | [convolutional] 130 | batch_normalize=1 131 | filters=64 132 | size=1 133 | stride=1 134 | pad=1 135 | activation=mish 136 | 137 | [convolutional] 138 | batch_normalize=1 139 | filters=64 140 | size=1 141 | stride=1 142 | pad=1 143 | activation=mish 144 | 145 | [convolutional] 146 | batch_normalize=1 147 | filters=64 148 | size=3 149 | stride=1 150 | pad=1 151 | activation=mish 152 | 153 | [shortcut] 154 | from=-3 155 | activation=linear 156 | 157 | [convolutional] 158 | batch_normalize=1 159 | filters=64 160 | size=1 161 | stride=1 162 | pad=1 163 | activation=mish 164 | 165 | [convolutional] 166 | batch_normalize=1 167 | filters=64 168 | size=3 169 | stride=1 170 | pad=1 171 | activation=mish 172 | 173 | [shortcut] 174 | from=-3 175 | 
activation=linear 176 | 177 | [convolutional] 178 | batch_normalize=1 179 | filters=64 180 | size=1 181 | stride=1 182 | pad=1 183 | activation=mish 184 | 185 | [route] 186 | layers = -1,-10 187 | 188 | [convolutional] 189 | batch_normalize=1 190 | filters=128 191 | size=1 192 | stride=1 193 | pad=1 194 | activation=mish 195 | 196 | # Downsample 197 | 198 | [convolutional] 199 | batch_normalize=1 200 | filters=256 201 | size=3 202 | stride=2 203 | pad=1 204 | activation=mish 205 | 206 | [convolutional] 207 | batch_normalize=1 208 | filters=128 209 | size=1 210 | stride=1 211 | pad=1 212 | activation=mish 213 | 214 | [route] 215 | layers = -2 216 | 217 | [convolutional] 218 | batch_normalize=1 219 | filters=128 220 | size=1 221 | stride=1 222 | pad=1 223 | activation=mish 224 | 225 | [convolutional] 226 | batch_normalize=1 227 | filters=128 228 | size=1 229 | stride=1 230 | pad=1 231 | activation=mish 232 | 233 | [convolutional] 234 | batch_normalize=1 235 | filters=128 236 | size=3 237 | stride=1 238 | pad=1 239 | activation=mish 240 | 241 | [shortcut] 242 | from=-3 243 | activation=linear 244 | 245 | [convolutional] 246 | batch_normalize=1 247 | filters=128 248 | size=1 249 | stride=1 250 | pad=1 251 | activation=mish 252 | 253 | [convolutional] 254 | batch_normalize=1 255 | filters=128 256 | size=3 257 | stride=1 258 | pad=1 259 | activation=mish 260 | 261 | [shortcut] 262 | from=-3 263 | activation=linear 264 | 265 | [convolutional] 266 | batch_normalize=1 267 | filters=128 268 | size=1 269 | stride=1 270 | pad=1 271 | activation=mish 272 | 273 | [convolutional] 274 | batch_normalize=1 275 | filters=128 276 | size=3 277 | stride=1 278 | pad=1 279 | activation=mish 280 | 281 | [shortcut] 282 | from=-3 283 | activation=linear 284 | 285 | [convolutional] 286 | batch_normalize=1 287 | filters=128 288 | size=1 289 | stride=1 290 | pad=1 291 | activation=mish 292 | 293 | [convolutional] 294 | batch_normalize=1 295 | filters=128 296 | size=3 297 | stride=1 298 | pad=1 
299 | activation=mish 300 | 301 | [shortcut] 302 | from=-3 303 | activation=linear 304 | 305 | 306 | [convolutional] 307 | batch_normalize=1 308 | filters=128 309 | size=1 310 | stride=1 311 | pad=1 312 | activation=mish 313 | 314 | [convolutional] 315 | batch_normalize=1 316 | filters=128 317 | size=3 318 | stride=1 319 | pad=1 320 | activation=mish 321 | 322 | [shortcut] 323 | from=-3 324 | activation=linear 325 | 326 | [convolutional] 327 | batch_normalize=1 328 | filters=128 329 | size=1 330 | stride=1 331 | pad=1 332 | activation=mish 333 | 334 | [convolutional] 335 | batch_normalize=1 336 | filters=128 337 | size=3 338 | stride=1 339 | pad=1 340 | activation=mish 341 | 342 | [shortcut] 343 | from=-3 344 | activation=linear 345 | 346 | [convolutional] 347 | batch_normalize=1 348 | filters=128 349 | size=1 350 | stride=1 351 | pad=1 352 | activation=mish 353 | 354 | [convolutional] 355 | batch_normalize=1 356 | filters=128 357 | size=3 358 | stride=1 359 | pad=1 360 | activation=mish 361 | 362 | [shortcut] 363 | from=-3 364 | activation=linear 365 | 366 | [convolutional] 367 | batch_normalize=1 368 | filters=128 369 | size=1 370 | stride=1 371 | pad=1 372 | activation=mish 373 | 374 | [convolutional] 375 | batch_normalize=1 376 | filters=128 377 | size=3 378 | stride=1 379 | pad=1 380 | activation=mish 381 | 382 | [shortcut] 383 | from=-3 384 | activation=linear 385 | 386 | [convolutional] 387 | batch_normalize=1 388 | filters=128 389 | size=1 390 | stride=1 391 | pad=1 392 | activation=mish 393 | 394 | [route] 395 | layers = -1,-28 396 | 397 | [convolutional] 398 | batch_normalize=1 399 | filters=256 400 | size=1 401 | stride=1 402 | pad=1 403 | activation=mish 404 | 405 | # Downsample 406 | 407 | [convolutional] 408 | batch_normalize=1 409 | filters=512 410 | size=3 411 | stride=2 412 | pad=1 413 | activation=mish 414 | 415 | [convolutional] 416 | batch_normalize=1 417 | filters=256 418 | size=1 419 | stride=1 420 | pad=1 421 | activation=mish 422 | 423 | 
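# Note (an interpretation, not text from the upstream cfg): a [route] with a
# single negative index outputs the layer that many positions before it. The
# [route] layers = -2 that follows therefore picks up the stride-2 "Downsample"
# convolution (filters=512) and skips the 1x1 convolution directly above.
# This appears to start the second branch of a CSP-style split; the two
# branches are concatenated again further down by [route] layers = -1,-28.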
[route] 424 | layers = -2 425 | 426 | [convolutional] 427 | batch_normalize=1 428 | filters=256 429 | size=1 430 | stride=1 431 | pad=1 432 | activation=mish 433 | 434 | [convolutional] 435 | batch_normalize=1 436 | filters=256 437 | size=1 438 | stride=1 439 | pad=1 440 | activation=mish 441 | 442 | [convolutional] 443 | batch_normalize=1 444 | filters=256 445 | size=3 446 | stride=1 447 | pad=1 448 | activation=mish 449 | 450 | [shortcut] 451 | from=-3 452 | activation=linear 453 | 454 | 455 | [convolutional] 456 | batch_normalize=1 457 | filters=256 458 | size=1 459 | stride=1 460 | pad=1 461 | activation=mish 462 | 463 | [convolutional] 464 | batch_normalize=1 465 | filters=256 466 | size=3 467 | stride=1 468 | pad=1 469 | activation=mish 470 | 471 | [shortcut] 472 | from=-3 473 | activation=linear 474 | 475 | 476 | [convolutional] 477 | batch_normalize=1 478 | filters=256 479 | size=1 480 | stride=1 481 | pad=1 482 | activation=mish 483 | 484 | [convolutional] 485 | batch_normalize=1 486 | filters=256 487 | size=3 488 | stride=1 489 | pad=1 490 | activation=mish 491 | 492 | [shortcut] 493 | from=-3 494 | activation=linear 495 | 496 | 497 | [convolutional] 498 | batch_normalize=1 499 | filters=256 500 | size=1 501 | stride=1 502 | pad=1 503 | activation=mish 504 | 505 | [convolutional] 506 | batch_normalize=1 507 | filters=256 508 | size=3 509 | stride=1 510 | pad=1 511 | activation=mish 512 | 513 | [shortcut] 514 | from=-3 515 | activation=linear 516 | 517 | 518 | [convolutional] 519 | batch_normalize=1 520 | filters=256 521 | size=1 522 | stride=1 523 | pad=1 524 | activation=mish 525 | 526 | [convolutional] 527 | batch_normalize=1 528 | filters=256 529 | size=3 530 | stride=1 531 | pad=1 532 | activation=mish 533 | 534 | [shortcut] 535 | from=-3 536 | activation=linear 537 | 538 | 539 | [convolutional] 540 | batch_normalize=1 541 | filters=256 542 | size=1 543 | stride=1 544 | pad=1 545 | activation=mish 546 | 547 | [convolutional] 548 | batch_normalize=1 
549 | filters=256 550 | size=3 551 | stride=1 552 | pad=1 553 | activation=mish 554 | 555 | [shortcut] 556 | from=-3 557 | activation=linear 558 | 559 | 560 | [convolutional] 561 | batch_normalize=1 562 | filters=256 563 | size=1 564 | stride=1 565 | pad=1 566 | activation=mish 567 | 568 | [convolutional] 569 | batch_normalize=1 570 | filters=256 571 | size=3 572 | stride=1 573 | pad=1 574 | activation=mish 575 | 576 | [shortcut] 577 | from=-3 578 | activation=linear 579 | 580 | [convolutional] 581 | batch_normalize=1 582 | filters=256 583 | size=1 584 | stride=1 585 | pad=1 586 | activation=mish 587 | 588 | [convolutional] 589 | batch_normalize=1 590 | filters=256 591 | size=3 592 | stride=1 593 | pad=1 594 | activation=mish 595 | 596 | [shortcut] 597 | from=-3 598 | activation=linear 599 | 600 | [convolutional] 601 | batch_normalize=1 602 | filters=256 603 | size=1 604 | stride=1 605 | pad=1 606 | activation=mish 607 | 608 | [route] 609 | layers = -1,-28 610 | 611 | [convolutional] 612 | batch_normalize=1 613 | filters=512 614 | size=1 615 | stride=1 616 | pad=1 617 | activation=mish 618 | 619 | # Downsample 620 | 621 | [convolutional] 622 | batch_normalize=1 623 | filters=1024 624 | size=3 625 | stride=2 626 | pad=1 627 | activation=mish 628 | 629 | [convolutional] 630 | batch_normalize=1 631 | filters=512 632 | size=1 633 | stride=1 634 | pad=1 635 | activation=mish 636 | 637 | [route] 638 | layers = -2 639 | 640 | [convolutional] 641 | batch_normalize=1 642 | filters=512 643 | size=1 644 | stride=1 645 | pad=1 646 | activation=mish 647 | 648 | [convolutional] 649 | batch_normalize=1 650 | filters=512 651 | size=1 652 | stride=1 653 | pad=1 654 | activation=mish 655 | 656 | [convolutional] 657 | batch_normalize=1 658 | filters=512 659 | size=3 660 | stride=1 661 | pad=1 662 | activation=mish 663 | 664 | [shortcut] 665 | from=-3 666 | activation=linear 667 | 668 | [convolutional] 669 | batch_normalize=1 670 | filters=512 671 | size=1 672 | stride=1 673 | pad=1 
674 | activation=mish 675 | 676 | [convolutional] 677 | batch_normalize=1 678 | filters=512 679 | size=3 680 | stride=1 681 | pad=1 682 | activation=mish 683 | 684 | [shortcut] 685 | from=-3 686 | activation=linear 687 | 688 | [convolutional] 689 | batch_normalize=1 690 | filters=512 691 | size=1 692 | stride=1 693 | pad=1 694 | activation=mish 695 | 696 | [convolutional] 697 | batch_normalize=1 698 | filters=512 699 | size=3 700 | stride=1 701 | pad=1 702 | activation=mish 703 | 704 | [shortcut] 705 | from=-3 706 | activation=linear 707 | 708 | [convolutional] 709 | batch_normalize=1 710 | filters=512 711 | size=1 712 | stride=1 713 | pad=1 714 | activation=mish 715 | 716 | [convolutional] 717 | batch_normalize=1 718 | filters=512 719 | size=3 720 | stride=1 721 | pad=1 722 | activation=mish 723 | 724 | [shortcut] 725 | from=-3 726 | activation=linear 727 | 728 | [convolutional] 729 | batch_normalize=1 730 | filters=512 731 | size=1 732 | stride=1 733 | pad=1 734 | activation=mish 735 | 736 | [route] 737 | layers = -1,-16 738 | 739 | [convolutional] 740 | batch_normalize=1 741 | filters=1024 742 | size=1 743 | stride=1 744 | pad=1 745 | activation=mish 746 | stopbackward=800 747 | 748 | ########################## 749 | 750 | [convolutional] 751 | batch_normalize=1 752 | filters=512 753 | size=1 754 | stride=1 755 | pad=1 756 | activation=leaky 757 | 758 | [convolutional] 759 | batch_normalize=1 760 | size=3 761 | stride=1 762 | pad=1 763 | filters=1024 764 | activation=leaky 765 | 766 | [convolutional] 767 | batch_normalize=1 768 | filters=512 769 | size=1 770 | stride=1 771 | pad=1 772 | activation=leaky 773 | 774 | ### SPP ### 775 | [maxpool] 776 | stride=1 777 | size=5 778 | 779 | [route] 780 | layers=-2 781 | 782 | [maxpool] 783 | stride=1 784 | size=9 785 | 786 | [route] 787 | layers=-4 788 | 789 | [maxpool] 790 | stride=1 791 | size=13 792 | 793 | [route] 794 | layers=-1,-3,-5,-6 795 | ### End SPP ### 796 | 797 | [convolutional] 798 | batch_normalize=1 799 | 
filters=512 800 | size=1 801 | stride=1 802 | pad=1 803 | activation=leaky 804 | 805 | [convolutional] 806 | batch_normalize=1 807 | size=3 808 | stride=1 809 | pad=1 810 | filters=1024 811 | activation=leaky 812 | 813 | [convolutional] 814 | batch_normalize=1 815 | filters=512 816 | size=1 817 | stride=1 818 | pad=1 819 | activation=leaky 820 | 821 | [convolutional] 822 | batch_normalize=1 823 | filters=256 824 | size=1 825 | stride=1 826 | pad=1 827 | activation=leaky 828 | 829 | [upsample] 830 | stride=2 831 | 832 | [route] 833 | layers = 85 834 | 835 | [convolutional] 836 | batch_normalize=1 837 | filters=256 838 | size=1 839 | stride=1 840 | pad=1 841 | activation=leaky 842 | 843 | [route] 844 | layers = -1, -3 845 | 846 | [convolutional] 847 | batch_normalize=1 848 | filters=256 849 | size=1 850 | stride=1 851 | pad=1 852 | activation=leaky 853 | 854 | [convolutional] 855 | batch_normalize=1 856 | size=3 857 | stride=1 858 | pad=1 859 | filters=512 860 | activation=leaky 861 | 862 | [convolutional] 863 | batch_normalize=1 864 | filters=256 865 | size=1 866 | stride=1 867 | pad=1 868 | activation=leaky 869 | 870 | [convolutional] 871 | batch_normalize=1 872 | size=3 873 | stride=1 874 | pad=1 875 | filters=512 876 | activation=leaky 877 | 878 | [convolutional] 879 | batch_normalize=1 880 | filters=256 881 | size=1 882 | stride=1 883 | pad=1 884 | activation=leaky 885 | 886 | [convolutional] 887 | batch_normalize=1 888 | filters=128 889 | size=1 890 | stride=1 891 | pad=1 892 | activation=leaky 893 | 894 | [upsample] 895 | stride=2 896 | 897 | [route] 898 | layers = 54 899 | 900 | [convolutional] 901 | batch_normalize=1 902 | filters=128 903 | size=1 904 | stride=1 905 | pad=1 906 | activation=leaky 907 | 908 | [route] 909 | layers = -1, -3 910 | 911 | [convolutional] 912 | batch_normalize=1 913 | filters=128 914 | size=1 915 | stride=1 916 | pad=1 917 | activation=leaky 918 | 919 | [convolutional] 920 | batch_normalize=1 921 | size=3 922 | stride=1 923 | pad=1 
924 | filters=256 925 | activation=leaky 926 | 927 | [convolutional] 928 | batch_normalize=1 929 | filters=128 930 | size=1 931 | stride=1 932 | pad=1 933 | activation=leaky 934 | 935 | [convolutional] 936 | batch_normalize=1 937 | size=3 938 | stride=1 939 | pad=1 940 | filters=256 941 | activation=leaky 942 | 943 | [convolutional] 944 | batch_normalize=1 945 | filters=128 946 | size=1 947 | stride=1 948 | pad=1 949 | activation=leaky 950 | 951 | ########################## 952 | 953 | [convolutional] 954 | batch_normalize=1 955 | size=3 956 | stride=1 957 | pad=1 958 | filters=256 959 | activation=leaky 960 | 961 | [convolutional] 962 | size=1 963 | stride=1 964 | pad=1 965 | filters=21 966 | activation=linear 967 | 968 | 969 | [yolo] 970 | mask = 0,1,2 971 | anchors = 11,22, 24,60, 37,116, 54,186, 69,268, 89,369, 126,491, 194,314, 278,520 972 | classes=2 973 | num=9 974 | jitter=.3 975 | ignore_thresh = .7 976 | truth_thresh = 1 977 | random=1 978 | scale_x_y = 1.2 979 | iou_thresh=0.213 980 | cls_normalizer=1.0 981 | iou_normalizer=0.07 982 | iou_loss=ciou 983 | nms_kind=greedynms 984 | beta_nms=0.6 985 | max_delta=5 986 | 987 | 988 | [route] 989 | layers = -4 990 | 991 | [convolutional] 992 | batch_normalize=1 993 | size=3 994 | stride=2 995 | pad=1 996 | filters=256 997 | activation=leaky 998 | 999 | [route] 1000 | layers = -1, -16 1001 | 1002 | [convolutional] 1003 | batch_normalize=1 1004 | filters=256 1005 | size=1 1006 | stride=1 1007 | pad=1 1008 | activation=leaky 1009 | 1010 | [convolutional] 1011 | batch_normalize=1 1012 | size=3 1013 | stride=1 1014 | pad=1 1015 | filters=512 1016 | activation=leaky 1017 | 1018 | [convolutional] 1019 | batch_normalize=1 1020 | filters=256 1021 | size=1 1022 | stride=1 1023 | pad=1 1024 | activation=leaky 1025 | 1026 | [convolutional] 1027 | batch_normalize=1 1028 | size=3 1029 | stride=1 1030 | pad=1 1031 | filters=512 1032 | activation=leaky 1033 | 1034 | [convolutional] 1035 | batch_normalize=1 1036 | filters=256 
1037 | size=1 1038 | stride=1 1039 | pad=1 1040 | activation=leaky 1041 | 1042 | [convolutional] 1043 | batch_normalize=1 1044 | size=3 1045 | stride=1 1046 | pad=1 1047 | filters=512 1048 | activation=leaky 1049 | 1050 | [convolutional] 1051 | size=1 1052 | stride=1 1053 | pad=1 1054 | filters=21 1055 | activation=linear 1056 | 1057 | 1058 | [yolo] 1059 | mask = 3,4,5 1060 | anchors = 11,22, 24,60, 37,116, 54,186, 69,268, 89,369, 126,491, 194,314, 278,520 1061 | classes=2 1062 | num=9 1063 | jitter=.3 1064 | ignore_thresh = .7 1065 | truth_thresh = 1 1066 | random=1 1067 | scale_x_y = 1.1 1068 | iou_thresh=0.213 1069 | cls_normalizer=1.0 1070 | iou_normalizer=0.07 1071 | iou_loss=ciou 1072 | nms_kind=greedynms 1073 | beta_nms=0.6 1074 | max_delta=5 1075 | 1076 | 1077 | [route] 1078 | layers = -4 1079 | 1080 | [convolutional] 1081 | batch_normalize=1 1082 | size=3 1083 | stride=2 1084 | pad=1 1085 | filters=512 1086 | activation=leaky 1087 | 1088 | [route] 1089 | layers = -1, -37 1090 | 1091 | [convolutional] 1092 | batch_normalize=1 1093 | filters=512 1094 | size=1 1095 | stride=1 1096 | pad=1 1097 | activation=leaky 1098 | 1099 | [convolutional] 1100 | batch_normalize=1 1101 | size=3 1102 | stride=1 1103 | pad=1 1104 | filters=1024 1105 | activation=leaky 1106 | 1107 | [convolutional] 1108 | batch_normalize=1 1109 | filters=512 1110 | size=1 1111 | stride=1 1112 | pad=1 1113 | activation=leaky 1114 | 1115 | [convolutional] 1116 | batch_normalize=1 1117 | size=3 1118 | stride=1 1119 | pad=1 1120 | filters=1024 1121 | activation=leaky 1122 | 1123 | [convolutional] 1124 | batch_normalize=1 1125 | filters=512 1126 | size=1 1127 | stride=1 1128 | pad=1 1129 | activation=leaky 1130 | 1131 | [convolutional] 1132 | batch_normalize=1 1133 | size=3 1134 | stride=1 1135 | pad=1 1136 | filters=1024 1137 | activation=leaky 1138 | 1139 | [convolutional] 1140 | size=1 1141 | stride=1 1142 | pad=1 1143 | filters=21 1144 | activation=linear 1145 | 1146 | 1147 | [yolo] 1148 | mask 
= 6,7,8 1149 | anchors = 11,22, 24,60, 37,116, 54,186, 69,268, 89,369, 126,491, 194,314, 278,520 1150 | classes=2 1151 | num=9 1152 | jitter=.3 1153 | ignore_thresh = .7 1154 | truth_thresh = 1 1155 | random=1 1156 | scale_x_y = 1.05 1157 | iou_thresh=0.213 1158 | cls_normalizer=1.0 1159 | iou_normalizer=0.07 1160 | iou_loss=ciou 1161 | nms_kind=greedynms 1162 | beta_nms=0.6 1163 | max_delta=5 1164 | 1165 | -------------------------------------------------------------------------------- /cfg/yolov4-tiny-3l-crowdhuman-416x416.cfg: -------------------------------------------------------------------------------- 1 | [net] 2 | # Testing 3 | #batch=1 4 | #subdivisions=1 5 | # Training 6 | batch=64 7 | subdivisions=1 8 | width=416 9 | height=416 10 | channels=3 11 | momentum=0.9 12 | decay=0.0005 13 | angle=0 14 | saturation = 1.5 15 | exposure = 1.5 16 | hue=.1 17 | 18 | learning_rate=0.00261 19 | burn_in=1000 20 | max_batches = 6000 21 | policy=steps 22 | steps=4800,5400 23 | scales=.1,.1 24 | 25 | [convolutional] 26 | batch_normalize=1 27 | filters=32 28 | size=3 29 | stride=2 30 | pad=1 31 | activation=leaky 32 | 33 | [convolutional] 34 | batch_normalize=1 35 | filters=64 36 | size=3 37 | stride=2 38 | pad=1 39 | activation=leaky 40 | 41 | [convolutional] 42 | batch_normalize=1 43 | filters=64 44 | size=3 45 | stride=1 46 | pad=1 47 | activation=leaky 48 | 49 | [route] 50 | layers=-1 51 | groups=2 52 | group_id=1 53 | 54 | [convolutional] 55 | batch_normalize=1 56 | filters=32 57 | size=3 58 | stride=1 59 | pad=1 60 | activation=leaky 61 | 62 | [convolutional] 63 | batch_normalize=1 64 | filters=32 65 | size=3 66 | stride=1 67 | pad=1 68 | activation=leaky 69 | 70 | [route] 71 | layers = -1,-2 72 | 73 | [convolutional] 74 | batch_normalize=1 75 | filters=64 76 | size=1 77 | stride=1 78 | pad=1 79 | activation=leaky 80 | 81 | [route] 82 | layers = -6,-1 83 | 84 | [maxpool] 85 | size=2 86 | stride=2 87 | 88 | [convolutional] 89 | batch_normalize=1 90 | filters=128 91 | 
size=3 92 | stride=1 93 | pad=1 94 | activation=leaky 95 | 96 | [route] 97 | layers=-1 98 | groups=2 99 | group_id=1 100 | 101 | [convolutional] 102 | batch_normalize=1 103 | filters=64 104 | size=3 105 | stride=1 106 | pad=1 107 | activation=leaky 108 | 109 | [convolutional] 110 | batch_normalize=1 111 | filters=64 112 | size=3 113 | stride=1 114 | pad=1 115 | activation=leaky 116 | 117 | [route] 118 | layers = -1,-2 119 | 120 | [convolutional] 121 | batch_normalize=1 122 | filters=128 123 | size=1 124 | stride=1 125 | pad=1 126 | activation=leaky 127 | 128 | [route] 129 | layers = -6,-1 130 | 131 | [maxpool] 132 | size=2 133 | stride=2 134 | 135 | [convolutional] 136 | batch_normalize=1 137 | filters=256 138 | size=3 139 | stride=1 140 | pad=1 141 | activation=leaky 142 | 143 | [route] 144 | layers=-1 145 | groups=2 146 | group_id=1 147 | 148 | [convolutional] 149 | batch_normalize=1 150 | filters=128 151 | size=3 152 | stride=1 153 | pad=1 154 | activation=leaky 155 | 156 | [convolutional] 157 | batch_normalize=1 158 | filters=128 159 | size=3 160 | stride=1 161 | pad=1 162 | activation=leaky 163 | 164 | [route] 165 | layers = -1,-2 166 | 167 | [convolutional] 168 | batch_normalize=1 169 | filters=256 170 | size=1 171 | stride=1 172 | pad=1 173 | activation=leaky 174 | 175 | [route] 176 | layers = -6,-1 177 | 178 | [maxpool] 179 | size=2 180 | stride=2 181 | 182 | [convolutional] 183 | batch_normalize=1 184 | filters=512 185 | size=3 186 | stride=1 187 | pad=1 188 | activation=leaky 189 | 190 | ################################## 191 | 192 | [convolutional] 193 | batch_normalize=1 194 | filters=256 195 | size=1 196 | stride=1 197 | pad=1 198 | activation=leaky 199 | 200 | [convolutional] 201 | batch_normalize=1 202 | filters=512 203 | size=3 204 | stride=1 205 | pad=1 206 | activation=leaky 207 | 208 | [convolutional] 209 | size=1 210 | stride=1 211 | pad=1 212 | filters=21 213 | activation=linear 214 | 215 | 216 | 217 | [yolo] 218 | mask = 6,7,8 219 | anchors = 
8, 15, 13, 34, 18, 75, 28, 49, 30,123, 58,106, 46,203, 80,265, 155,317 220 | classes=2 221 | num=9 222 | jitter=.3 223 | scale_x_y = 1.05 224 | cls_normalizer=1.0 225 | iou_normalizer=0.07 226 | iou_loss=ciou 227 | ignore_thresh = .7 228 | truth_thresh = 1 229 | random=0 230 | resize=1.5 231 | nms_kind=greedynms 232 | beta_nms=0.6 233 | 234 | [route] 235 | layers = -4 236 | 237 | [convolutional] 238 | batch_normalize=1 239 | filters=128 240 | size=1 241 | stride=1 242 | pad=1 243 | activation=leaky 244 | 245 | [upsample] 246 | stride=2 247 | 248 | [route] 249 | layers = -1, 23 250 | 251 | [convolutional] 252 | batch_normalize=1 253 | filters=256 254 | size=3 255 | stride=1 256 | pad=1 257 | activation=leaky 258 | 259 | [convolutional] 260 | size=1 261 | stride=1 262 | pad=1 263 | filters=21 264 | activation=linear 265 | 266 | [yolo] 267 | mask = 3,4,5 268 | anchors = 8, 15, 13, 34, 18, 75, 28, 49, 30,123, 58,106, 46,203, 80,265, 155,317 269 | classes=2 270 | num=9 271 | jitter=.3 272 | scale_x_y = 1.05 273 | cls_normalizer=1.0 274 | iou_normalizer=0.07 275 | iou_loss=ciou 276 | ignore_thresh = .7 277 | truth_thresh = 1 278 | random=0 279 | resize=1.5 280 | nms_kind=greedynms 281 | beta_nms=0.6 282 | 283 | 284 | [route] 285 | layers = -3 286 | 287 | [convolutional] 288 | batch_normalize=1 289 | filters=64 290 | size=1 291 | stride=1 292 | pad=1 293 | activation=leaky 294 | 295 | [upsample] 296 | stride=2 297 | 298 | [route] 299 | layers = -1, 15 300 | 301 | [convolutional] 302 | batch_normalize=1 303 | filters=128 304 | size=3 305 | stride=1 306 | pad=1 307 | activation=leaky 308 | 309 | [convolutional] 310 | size=1 311 | stride=1 312 | pad=1 313 | filters=21 314 | activation=linear 315 | 316 | [yolo] 317 | mask = 0,1,2 318 | anchors = 8, 15, 13, 34, 18, 75, 28, 49, 30,123, 58,106, 46,203, 80,265, 155,317 319 | classes=2 320 | num=9 321 | jitter=.3 322 | scale_x_y = 1.05 323 | cls_normalizer=1.0 324 | iou_normalizer=0.07 325 | iou_loss=ciou 326 | ignore_thresh = .7 
327 | truth_thresh = 1 328 | random=0 329 | resize=1.5 330 | nms_kind=greedynms 331 | beta_nms=0.6 332 | 333 | -------------------------------------------------------------------------------- /cfg/yolov4-tiny-3l-crowdhuman-608x608.cfg: -------------------------------------------------------------------------------- 1 | [net] 2 | # Testing 3 | #batch=1 4 | #subdivisions=1 5 | # Training 6 | batch=32 7 | subdivisions=1 8 | width=608 9 | height=608 10 | channels=3 11 | momentum=0.9 12 | decay=0.0005 13 | angle=0 14 | saturation = 1.5 15 | exposure = 1.5 16 | hue=.1 17 | 18 | learning_rate=0.00261 19 | burn_in=1000 20 | max_batches = 6000 21 | policy=steps 22 | steps=4800,5400 23 | scales=.1,.1 24 | 25 | [convolutional] 26 | batch_normalize=1 27 | filters=32 28 | size=3 29 | stride=2 30 | pad=1 31 | activation=leaky 32 | 33 | [convolutional] 34 | batch_normalize=1 35 | filters=64 36 | size=3 37 | stride=2 38 | pad=1 39 | activation=leaky 40 | 41 | [convolutional] 42 | batch_normalize=1 43 | filters=64 44 | size=3 45 | stride=1 46 | pad=1 47 | activation=leaky 48 | 49 | [route] 50 | layers=-1 51 | groups=2 52 | group_id=1 53 | 54 | [convolutional] 55 | batch_normalize=1 56 | filters=32 57 | size=3 58 | stride=1 59 | pad=1 60 | activation=leaky 61 | 62 | [convolutional] 63 | batch_normalize=1 64 | filters=32 65 | size=3 66 | stride=1 67 | pad=1 68 | activation=leaky 69 | 70 | [route] 71 | layers = -1,-2 72 | 73 | [convolutional] 74 | batch_normalize=1 75 | filters=64 76 | size=1 77 | stride=1 78 | pad=1 79 | activation=leaky 80 | 81 | [route] 82 | layers = -6,-1 83 | 84 | [maxpool] 85 | size=2 86 | stride=2 87 | 88 | [convolutional] 89 | batch_normalize=1 90 | filters=128 91 | size=3 92 | stride=1 93 | pad=1 94 | activation=leaky 95 | 96 | [route] 97 | layers=-1 98 | groups=2 99 | group_id=1 100 | 101 | [convolutional] 102 | batch_normalize=1 103 | filters=64 104 | size=3 105 | stride=1 106 | pad=1 107 | activation=leaky 108 | 109 | [convolutional] 110 | 
batch_normalize=1 111 | filters=64 112 | size=3 113 | stride=1 114 | pad=1 115 | activation=leaky 116 | 117 | [route] 118 | layers = -1,-2 119 | 120 | [convolutional] 121 | batch_normalize=1 122 | filters=128 123 | size=1 124 | stride=1 125 | pad=1 126 | activation=leaky 127 | 128 | [route] 129 | layers = -6,-1 130 | 131 | [maxpool] 132 | size=2 133 | stride=2 134 | 135 | [convolutional] 136 | batch_normalize=1 137 | filters=256 138 | size=3 139 | stride=1 140 | pad=1 141 | activation=leaky 142 | 143 | [route] 144 | layers=-1 145 | groups=2 146 | group_id=1 147 | 148 | [convolutional] 149 | batch_normalize=1 150 | filters=128 151 | size=3 152 | stride=1 153 | pad=1 154 | activation=leaky 155 | 156 | [convolutional] 157 | batch_normalize=1 158 | filters=128 159 | size=3 160 | stride=1 161 | pad=1 162 | activation=leaky 163 | 164 | [route] 165 | layers = -1,-2 166 | 167 | [convolutional] 168 | batch_normalize=1 169 | filters=256 170 | size=1 171 | stride=1 172 | pad=1 173 | activation=leaky 174 | 175 | [route] 176 | layers = -6,-1 177 | 178 | [maxpool] 179 | size=2 180 | stride=2 181 | 182 | [convolutional] 183 | batch_normalize=1 184 | filters=512 185 | size=3 186 | stride=1 187 | pad=1 188 | activation=leaky 189 | 190 | ################################## 191 | 192 | [convolutional] 193 | batch_normalize=1 194 | filters=256 195 | size=1 196 | stride=1 197 | pad=1 198 | activation=leaky 199 | 200 | [convolutional] 201 | batch_normalize=1 202 | filters=512 203 | size=3 204 | stride=1 205 | pad=1 206 | activation=leaky 207 | 208 | [convolutional] 209 | size=1 210 | stride=1 211 | pad=1 212 | filters=21 213 | activation=linear 214 | 215 | 216 | 217 | [yolo] 218 | mask = 6,7,8 219 | anchors = 8, 15, 13, 34, 18, 75, 28, 49, 30,123, 58,106, 46,203, 80,265, 155,317 220 | classes=2 221 | num=9 222 | jitter=.3 223 | scale_x_y = 1.05 224 | cls_normalizer=1.0 225 | iou_normalizer=0.07 226 | iou_loss=ciou 227 | ignore_thresh = .7 228 | truth_thresh = 1 229 | random=0 230 | 
resize=1.5 231 | nms_kind=greedynms 232 | beta_nms=0.6 233 | 234 | [route] 235 | layers = -4 236 | 237 | [convolutional] 238 | batch_normalize=1 239 | filters=128 240 | size=1 241 | stride=1 242 | pad=1 243 | activation=leaky 244 | 245 | [upsample] 246 | stride=2 247 | 248 | [route] 249 | layers = -1, 23 250 | 251 | [convolutional] 252 | batch_normalize=1 253 | filters=256 254 | size=3 255 | stride=1 256 | pad=1 257 | activation=leaky 258 | 259 | [convolutional] 260 | size=1 261 | stride=1 262 | pad=1 263 | filters=21 264 | activation=linear 265 | 266 | [yolo] 267 | mask = 3,4,5 268 | anchors = 8, 15, 13, 34, 18, 75, 28, 49, 30,123, 58,106, 46,203, 80,265, 155,317 269 | classes=2 270 | num=9 271 | jitter=.3 272 | scale_x_y = 1.05 273 | cls_normalizer=1.0 274 | iou_normalizer=0.07 275 | iou_loss=ciou 276 | ignore_thresh = .7 277 | truth_thresh = 1 278 | random=0 279 | resize=1.5 280 | nms_kind=greedynms 281 | beta_nms=0.6 282 | 283 | 284 | [route] 285 | layers = -3 286 | 287 | [convolutional] 288 | batch_normalize=1 289 | filters=64 290 | size=1 291 | stride=1 292 | pad=1 293 | activation=leaky 294 | 295 | [upsample] 296 | stride=2 297 | 298 | [route] 299 | layers = -1, 15 300 | 301 | [convolutional] 302 | batch_normalize=1 303 | filters=128 304 | size=3 305 | stride=1 306 | pad=1 307 | activation=leaky 308 | 309 | [convolutional] 310 | size=1 311 | stride=1 312 | pad=1 313 | filters=21 314 | activation=linear 315 | 316 | [yolo] 317 | mask = 0,1,2 318 | anchors = 8, 15, 13, 34, 18, 75, 28, 49, 30,123, 58,106, 46,203, 80,265, 155,317 319 | classes=2 320 | num=9 321 | jitter=.3 322 | scale_x_y = 1.05 323 | cls_normalizer=1.0 324 | iou_normalizer=0.07 325 | iou_loss=ciou 326 | ignore_thresh = .7 327 | truth_thresh = 1 328 | random=0 329 | resize=1.5 330 | nms_kind=greedynms 331 | beta_nms=0.6 332 | 333 | -------------------------------------------------------------------------------- /cfg/yolov4-tiny-crowdhuman-416x416.cfg: 
-------------------------------------------------------------------------------- 1 | [net] 2 | # Testing 3 | #batch=1 4 | #subdivisions=1 5 | # Training 6 | batch=64 7 | subdivisions=1 8 | width=416 9 | height=416 10 | channels=3 11 | momentum=0.9 12 | decay=0.0005 13 | angle=0 14 | saturation = 1.5 15 | exposure = 1.5 16 | hue=.1 17 | 18 | learning_rate=0.00261 19 | burn_in=1000 20 | 21 | max_batches = 6000 22 | policy=steps 23 | steps=4800,5400 24 | scales=.1,.1 25 | 26 | 27 | #weights_reject_freq=1001 28 | #ema_alpha=0.9998 29 | #equidistant_point=1000 30 | #num_sigmas_reject_badlabels=3 31 | #badlabels_rejection_percentage=0.2 32 | 33 | 34 | [convolutional] 35 | batch_normalize=1 36 | filters=32 37 | size=3 38 | stride=2 39 | pad=1 40 | activation=leaky 41 | 42 | [convolutional] 43 | batch_normalize=1 44 | filters=64 45 | size=3 46 | stride=2 47 | pad=1 48 | activation=leaky 49 | 50 | [convolutional] 51 | batch_normalize=1 52 | filters=64 53 | size=3 54 | stride=1 55 | pad=1 56 | activation=leaky 57 | 58 | [route] 59 | layers=-1 60 | groups=2 61 | group_id=1 62 | 63 | [convolutional] 64 | batch_normalize=1 65 | filters=32 66 | size=3 67 | stride=1 68 | pad=1 69 | activation=leaky 70 | 71 | [convolutional] 72 | batch_normalize=1 73 | filters=32 74 | size=3 75 | stride=1 76 | pad=1 77 | activation=leaky 78 | 79 | [route] 80 | layers = -1,-2 81 | 82 | [convolutional] 83 | batch_normalize=1 84 | filters=64 85 | size=1 86 | stride=1 87 | pad=1 88 | activation=leaky 89 | 90 | [route] 91 | layers = -6,-1 92 | 93 | [maxpool] 94 | size=2 95 | stride=2 96 | 97 | [convolutional] 98 | batch_normalize=1 99 | filters=128 100 | size=3 101 | stride=1 102 | pad=1 103 | activation=leaky 104 | 105 | [route] 106 | layers=-1 107 | groups=2 108 | group_id=1 109 | 110 | [convolutional] 111 | batch_normalize=1 112 | filters=64 113 | size=3 114 | stride=1 115 | pad=1 116 | activation=leaky 117 | 118 | [convolutional] 119 | batch_normalize=1 120 | filters=64 121 | size=3 122 | stride=1 
123 | pad=1 124 | activation=leaky 125 | 126 | [route] 127 | layers = -1,-2 128 | 129 | [convolutional] 130 | batch_normalize=1 131 | filters=128 132 | size=1 133 | stride=1 134 | pad=1 135 | activation=leaky 136 | 137 | [route] 138 | layers = -6,-1 139 | 140 | [maxpool] 141 | size=2 142 | stride=2 143 | 144 | [convolutional] 145 | batch_normalize=1 146 | filters=256 147 | size=3 148 | stride=1 149 | pad=1 150 | activation=leaky 151 | 152 | [route] 153 | layers=-1 154 | groups=2 155 | group_id=1 156 | 157 | [convolutional] 158 | batch_normalize=1 159 | filters=128 160 | size=3 161 | stride=1 162 | pad=1 163 | activation=leaky 164 | 165 | [convolutional] 166 | batch_normalize=1 167 | filters=128 168 | size=3 169 | stride=1 170 | pad=1 171 | activation=leaky 172 | 173 | [route] 174 | layers = -1,-2 175 | 176 | [convolutional] 177 | batch_normalize=1 178 | filters=256 179 | size=1 180 | stride=1 181 | pad=1 182 | activation=leaky 183 | 184 | [route] 185 | layers = -6,-1 186 | 187 | [maxpool] 188 | size=2 189 | stride=2 190 | 191 | [convolutional] 192 | batch_normalize=1 193 | filters=512 194 | size=3 195 | stride=1 196 | pad=1 197 | activation=leaky 198 | 199 | ################################## 200 | 201 | [convolutional] 202 | batch_normalize=1 203 | filters=256 204 | size=1 205 | stride=1 206 | pad=1 207 | activation=leaky 208 | 209 | [convolutional] 210 | batch_normalize=1 211 | filters=512 212 | size=3 213 | stride=1 214 | pad=1 215 | activation=leaky 216 | 217 | [convolutional] 218 | size=1 219 | stride=1 220 | pad=1 221 | filters=21 222 | activation=linear 223 | 224 | 225 | 226 | [yolo] 227 | mask = 3,4,5 228 | anchors = 7, 13, 14, 36, 26, 80, 41,156, 69,241, 140,311 229 | classes=2 230 | num=6 231 | jitter=.3 232 | scale_x_y = 1.05 233 | cls_normalizer=1.0 234 | iou_normalizer=0.07 235 | iou_loss=ciou 236 | ignore_thresh = .7 237 | truth_thresh = 1 238 | random=0 239 | resize=1.5 240 | nms_kind=greedynms 241 | beta_nms=0.6 242 | #new_coords=1 243 | #scale_x_y 
= 2.0 244 | 245 | [route] 246 | layers = -4 247 | 248 | [convolutional] 249 | batch_normalize=1 250 | filters=128 251 | size=1 252 | stride=1 253 | pad=1 254 | activation=leaky 255 | 256 | [upsample] 257 | stride=2 258 | 259 | [route] 260 | layers = -1, 23 261 | 262 | [convolutional] 263 | batch_normalize=1 264 | filters=256 265 | size=3 266 | stride=1 267 | pad=1 268 | activation=leaky 269 | 270 | [convolutional] 271 | size=1 272 | stride=1 273 | pad=1 274 | filters=21 275 | activation=linear 276 | 277 | [yolo] 278 | mask = 1,2,3 279 | anchors = 7, 13, 14, 36, 26, 80, 41,156, 69,241, 140,311 280 | classes=2 281 | num=6 282 | jitter=.3 283 | scale_x_y = 1.05 284 | cls_normalizer=1.0 285 | iou_normalizer=0.07 286 | iou_loss=ciou 287 | ignore_thresh = .7 288 | truth_thresh = 1 289 | random=0 290 | resize=1.5 291 | nms_kind=greedynms 292 | beta_nms=0.6 293 | #new_coords=1 294 | #scale_x_y = 2.0 295 | -------------------------------------------------------------------------------- /cfg/yolov4-tiny-crowdhuman-608x608.cfg: -------------------------------------------------------------------------------- 1 | [net] 2 | # Testing 3 | #batch=1 4 | #subdivisions=1 5 | # Training 6 | batch=32 7 | subdivisions=1 8 | width=608 9 | height=608 10 | channels=3 11 | momentum=0.9 12 | decay=0.0005 13 | angle=0 14 | saturation = 1.5 15 | exposure = 1.5 16 | hue=.1 17 | 18 | learning_rate=0.00261 19 | burn_in=1000 20 | 21 | max_batches = 6000 22 | policy=steps 23 | steps=4800,5400 24 | scales=.1,.1 25 | 26 | 27 | #weights_reject_freq=1001 28 | #ema_alpha=0.9998 29 | #equidistant_point=1000 30 | #num_sigmas_reject_badlabels=3 31 | #badlabels_rejection_percentage=0.2 32 | 33 | 34 | [convolutional] 35 | batch_normalize=1 36 | filters=32 37 | size=3 38 | stride=2 39 | pad=1 40 | activation=leaky 41 | 42 | [convolutional] 43 | batch_normalize=1 44 | filters=64 45 | size=3 46 | stride=2 47 | pad=1 48 | activation=leaky 49 | 50 | [convolutional] 51 | batch_normalize=1 52 | filters=64 53 | 
size=3 54 | stride=1 55 | pad=1 56 | activation=leaky 57 | 58 | [route] 59 | layers=-1 60 | groups=2 61 | group_id=1 62 | 63 | [convolutional] 64 | batch_normalize=1 65 | filters=32 66 | size=3 67 | stride=1 68 | pad=1 69 | activation=leaky 70 | 71 | [convolutional] 72 | batch_normalize=1 73 | filters=32 74 | size=3 75 | stride=1 76 | pad=1 77 | activation=leaky 78 | 79 | [route] 80 | layers = -1,-2 81 | 82 | [convolutional] 83 | batch_normalize=1 84 | filters=64 85 | size=1 86 | stride=1 87 | pad=1 88 | activation=leaky 89 | 90 | [route] 91 | layers = -6,-1 92 | 93 | [maxpool] 94 | size=2 95 | stride=2 96 | 97 | [convolutional] 98 | batch_normalize=1 99 | filters=128 100 | size=3 101 | stride=1 102 | pad=1 103 | activation=leaky 104 | 105 | [route] 106 | layers=-1 107 | groups=2 108 | group_id=1 109 | 110 | [convolutional] 111 | batch_normalize=1 112 | filters=64 113 | size=3 114 | stride=1 115 | pad=1 116 | activation=leaky 117 | 118 | [convolutional] 119 | batch_normalize=1 120 | filters=64 121 | size=3 122 | stride=1 123 | pad=1 124 | activation=leaky 125 | 126 | [route] 127 | layers = -1,-2 128 | 129 | [convolutional] 130 | batch_normalize=1 131 | filters=128 132 | size=1 133 | stride=1 134 | pad=1 135 | activation=leaky 136 | 137 | [route] 138 | layers = -6,-1 139 | 140 | [maxpool] 141 | size=2 142 | stride=2 143 | 144 | [convolutional] 145 | batch_normalize=1 146 | filters=256 147 | size=3 148 | stride=1 149 | pad=1 150 | activation=leaky 151 | 152 | [route] 153 | layers=-1 154 | groups=2 155 | group_id=1 156 | 157 | [convolutional] 158 | batch_normalize=1 159 | filters=128 160 | size=3 161 | stride=1 162 | pad=1 163 | activation=leaky 164 | 165 | [convolutional] 166 | batch_normalize=1 167 | filters=128 168 | size=3 169 | stride=1 170 | pad=1 171 | activation=leaky 172 | 173 | [route] 174 | layers = -1,-2 175 | 176 | [convolutional] 177 | batch_normalize=1 178 | filters=256 179 | size=1 180 | stride=1 181 | pad=1 182 | activation=leaky 183 | 184 | [route] 
185 | layers = -6,-1 186 | 187 | [maxpool] 188 | size=2 189 | stride=2 190 | 191 | [convolutional] 192 | batch_normalize=1 193 | filters=512 194 | size=3 195 | stride=1 196 | pad=1 197 | activation=leaky 198 | 199 | ################################## 200 | 201 | [convolutional] 202 | batch_normalize=1 203 | filters=256 204 | size=1 205 | stride=1 206 | pad=1 207 | activation=leaky 208 | 209 | [convolutional] 210 | batch_normalize=1 211 | filters=512 212 | size=3 213 | stride=1 214 | pad=1 215 | activation=leaky 216 | 217 | [convolutional] 218 | size=1 219 | stride=1 220 | pad=1 221 | filters=21 222 | activation=linear 223 | 224 | 225 | 226 | [yolo] 227 | mask = 3,4,5 228 | anchors = 7, 13, 14, 36, 26, 80, 41,156, 69,241, 140,311 229 | classes=2 230 | num=6 231 | jitter=.3 232 | scale_x_y = 1.05 233 | cls_normalizer=1.0 234 | iou_normalizer=0.07 235 | iou_loss=ciou 236 | ignore_thresh = .7 237 | truth_thresh = 1 238 | random=0 239 | resize=1.5 240 | nms_kind=greedynms 241 | beta_nms=0.6 242 | #new_coords=1 243 | #scale_x_y = 2.0 244 | 245 | [route] 246 | layers = -4 247 | 248 | [convolutional] 249 | batch_normalize=1 250 | filters=128 251 | size=1 252 | stride=1 253 | pad=1 254 | activation=leaky 255 | 256 | [upsample] 257 | stride=2 258 | 259 | [route] 260 | layers = -1, 23 261 | 262 | [convolutional] 263 | batch_normalize=1 264 | filters=256 265 | size=3 266 | stride=1 267 | pad=1 268 | activation=leaky 269 | 270 | [convolutional] 271 | size=1 272 | stride=1 273 | pad=1 274 | filters=21 275 | activation=linear 276 | 277 | [yolo] 278 | mask = 1,2,3 279 | anchors = 7, 13, 14, 36, 26, 80, 41,156, 69,241, 140,311 280 | classes=2 281 | num=6 282 | jitter=.3 283 | scale_x_y = 1.05 284 | cls_normalizer=1.0 285 | iou_normalizer=0.07 286 | iou_loss=ciou 287 | ignore_thresh = .7 288 | truth_thresh = 1 289 | random=0 290 | resize=1.5 291 | nms_kind=greedynms 292 | beta_nms=0.6 293 | #new_coords=1 294 | #scale_x_y = 2.0 295 | 
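Every [yolo] head in the cfg files above is fed by a 1x1 convolution with filters=21. The sketch below shows where that number comes from, assuming Darknet's usual convention that each anchor listed in a head's mask predicts 4 box coordinates, 1 objectness score, and one score per class. The helper name `yolo_head_filters` is hypothetical and is not part of this repository.

```python
def yolo_head_filters(num_classes: int, mask: str) -> int:
    """Return the filter count for the 1x1 convolution feeding a [yolo] layer.

    Hypothetical helper for illustration; `mask` is the comma-separated
    value of the `mask=` key in the [yolo] section, e.g. "3,4,5".
    """
    # Number of anchors assigned to this detection head.
    anchors_per_head = len([m for m in mask.split(",") if m.strip()])
    # Per anchor: 4 box coordinates + 1 objectness score + one score per class.
    return (num_classes + 5) * anchors_per_head


if __name__ == "__main__":
    # CrowdHuman setup in this repository: 2 classes ("head", "person").
    print(yolo_head_filters(2, "3,4,5"))
```

If the number of classes changes, the `filters=` value before each [yolo] section and the `classes=` value inside it would need to change together.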
-------------------------------------------------------------------------------- /data/README.md: -------------------------------------------------------------------------------- 1 | # CrowdHuman Dataset by MEGVII 2 | 3 | * Official web site: [https://www.crowdhuman.org/](https://www.crowdhuman.org/) 4 | 5 | * Reference: 6 | - [CrowdHuman: A Benchmark for Detecting Human in a Crowd](https://arxiv.org/abs/1805.00123) 7 | - [CrowdHuman Dataset 介紹 (Introduction to the CrowdHuman Dataset, in Chinese)](https://chtseng.wordpress.com/2019/12/13/crowdhuman-dataset-%E4%BB%8B%E7%B4%B9/) 8 | 9 | * When converting CrowdHuman annotations to YOLO txt files, 10 | - I discard all "mask" objects. The "mask" objects in the CrowdHuman dataset are not real humans. They are usually reflections of humans, or pictures of humans on billboards or advertisement posters. 11 | - I use "hbox" (head) and "fbox" (full body) annotations of all "person" objects. Note that the "fbox" annotation might include body parts which are occluded in the scene. 12 | - In the final YOLO txt files, there are 2 classes of objects. Class 0 is "head", and class 1 is "person". 13 | -------------------------------------------------------------------------------- /data/crowdhuman-template.data: -------------------------------------------------------------------------------- 1 | classes = 2 2 | train = data/crowdhuman-{width}x{height}/train.txt 3 | valid = data/crowdhuman-{width}x{height}/test.txt 4 | names = data/crowdhuman.names 5 | backup = backup/ 6 | -------------------------------------------------------------------------------- /data/crowdhuman.names: -------------------------------------------------------------------------------- 1 | head 2 | person 3 | -------------------------------------------------------------------------------- /data/gen_txts.py: -------------------------------------------------------------------------------- 1 | """gen_txts.py 2 | 3 | To generate YOLO txt files from the original CrowdHuman annotations. 
4 | Please also refer to README.md in this directory. 5 | 6 | Inputs: 7 | * raw/annotation_train.odgt 8 | * raw/annotation_val.odgt 9 | * crowdhuman-{width}x{height}/[IDs].jpg 10 | 11 | Outputs: 12 | * crowdhuman-{width}x{height}/train.txt 13 | * crowdhuman-{width}x{height}/test.txt 14 | * crowdhuman-{width}x{height}/[IDs].txt (one annotation for each image in the training or test set) 15 | """ 16 | 17 | 18 | import json 19 | from pathlib import Path 20 | from argparse import ArgumentParser 21 | 22 | import numpy as np 23 | import cv2 24 | 25 | 26 | # input image width/height of the yolov4 model, set by command-line argument 27 | INPUT_WIDTH = 0 28 | INPUT_HEIGHT = 0 29 | 30 | # Minimum width/height of objects for detection (don't learn from 31 | # objects smaller than these) 32 | MIN_W = 5 33 | MIN_H = 5 34 | 35 | # Do K-Means clustering in order to determine "anchor" sizes 36 | DO_KMEANS = True 37 | KMEANS_CLUSTERS = 9 38 | BBOX_WHS = [] # keep track of bbox width/height rescaled to INPUT_WIDTH x INPUT_HEIGHT 39 | 40 | 41 | def image_shape(ID, image_dir): 42 | assert image_dir is not None 43 | jpg_path = image_dir / ('%s.jpg' % ID) 44 | img = cv2.imread(jpg_path.as_posix()) 45 | return img.shape 46 | 47 | 48 | def txt_line(cls, bbox, img_w, img_h): 49 | """Generate 1 line in the txt file.""" 50 | assert INPUT_WIDTH > 0 and INPUT_HEIGHT > 0 51 | x, y, w, h = bbox 52 | x = max(int(x), 0) 53 | y = max(int(y), 0) 54 | w = min(int(w), img_w - x) 55 | h = min(int(h), img_h - y) 56 | w_rescaled = float(w) * INPUT_WIDTH / img_w 57 | h_rescaled = float(h) * INPUT_HEIGHT / img_h 58 | if w_rescaled < MIN_W or h_rescaled < MIN_H: 59 | return '' 60 | else: 61 | if DO_KMEANS: 62 | global BBOX_WHS 63 | BBOX_WHS.append((w_rescaled, h_rescaled)) 64 | cx = (x + w / 2.) / img_w 65 | cy = (y + h / 2.) 
/ img_h 66 | nw = float(w) / img_w 67 | nh = float(h) / img_h 68 | return '%d %.6f %.6f %.6f %.6f\n' % (cls, cx, cy, nw, nh) 69 | 70 | 71 | def process(set_='test', annotation_filename='raw/annotation_val.odgt', 72 | output_dir=None): 73 | """Process either 'train' or 'test' set.""" 74 | assert output_dir is not None 75 | output_dir.mkdir(exist_ok=True) 76 | jpgs = [] 77 | with open(annotation_filename, 'r') as fanno: 78 | for raw_anno in fanno.readlines(): 79 | anno = json.loads(raw_anno) 80 | ID = anno['ID'] # e.g. '273271,c9db000d5146c15' 81 | print('Processing ID: %s' % ID) 82 | img_h, img_w, img_c = image_shape(ID, output_dir) 83 | assert img_c == 3 # should be a BGR image 84 | txt_path = output_dir / ('%s.txt' % ID) 85 | # write a txt for each image 86 | with open(txt_path.as_posix(), 'w') as ftxt: 87 | for obj in anno['gtboxes']: 88 | if obj['tag'] == 'mask': 89 | continue # ignore non-human 90 | assert obj['tag'] == 'person' 91 | if 'hbox' in obj.keys(): # head 92 | line = txt_line(0, obj['hbox'], img_w, img_h) 93 | if line: 94 | ftxt.write(line) 95 | if 'fbox' in obj.keys(): # full body 96 | line = txt_line(1, obj['fbox'], img_w, img_h) 97 | if line: 98 | ftxt.write(line) 99 | jpgs.append('data/%s/%s.jpg' % (output_dir, ID)) 100 | # write the 'data/crowdhuman/train.txt' or 'data/crowdhuman/test.txt' 101 | set_path = output_dir / ('%s.txt' % set_) 102 | with open(set_path.as_posix(), 'w') as fset: 103 | for jpg in jpgs: 104 | fset.write('%s\n' % jpg) 105 | 106 | 107 | def rm_txts(output_dir): 108 | """Remove txt files in output_dir.""" 109 | for txt in output_dir.glob('*.txt'): 110 | if txt.is_file(): 111 | txt.unlink() 112 | 113 | 114 | def main(): 115 | global INPUT_WIDTH, INPUT_HEIGHT 116 | 117 | parser = ArgumentParser() 118 | parser.add_argument('dim', help='input width and height, e.g. 
608x608') 119 | args = parser.parse_args() 120 | 121 | dim_split = args.dim.split('x') 122 | if len(dim_split) != 2: 123 | raise SystemExit('ERROR: bad spec of input dim (%s)' % args.dim) 124 | INPUT_WIDTH, INPUT_HEIGHT = int(dim_split[0]), int(dim_split[1]) 125 | if INPUT_WIDTH % 32 != 0 or INPUT_HEIGHT % 32 != 0: 126 | raise SystemExit('ERROR: bad spec of input dim (%s)' % args.dim) 127 | 128 | output_dir = Path('crowdhuman-%s' % args.dim) 129 | if not output_dir.is_dir(): 130 | raise SystemExit('ERROR: %s does not exist.' % output_dir.as_posix()) 131 | 132 | rm_txts(output_dir) 133 | process('test', 'raw/annotation_val.odgt', output_dir) 134 | process('train', 'raw/annotation_train.odgt', output_dir) 135 | 136 | with open('crowdhuman-%s.data' % args.dim, 'w') as f: 137 | f.write("""classes = 2 138 | train = data/crowdhuman-%s/train.txt 139 | valid = data/crowdhuman-%s/test.txt 140 | names = data/crowdhuman.names 141 | backup = backup/\n""" % (args.dim, args.dim)) 142 | 143 | if DO_KMEANS: 144 | try: 145 | from sklearn.cluster import KMeans 146 | except ModuleNotFoundError: 147 | print('WARNING: no sklearn, skipping anchor clustering...') 148 | else: 149 | X = np.array(BBOX_WHS) 150 | kmeans = KMeans(n_clusters=KMEANS_CLUSTERS, random_state=0).fit(X) 151 | centers = kmeans.cluster_centers_ 152 | centers = centers[centers[:, 0].argsort()] # sort by bbox w 153 | print('\n** for yolov4-%dx%d, ' % (INPUT_WIDTH, INPUT_HEIGHT), end='') 154 | print('resized bbox width/height clusters are: ', end='') 155 | print(' '.join(['(%.2f, %.2f)' % (c[0], c[1]) for c in centers])) 156 | print('\nanchors = ', end='') 157 | print(', '.join(['%d,%d' % (int(c[0]), int(c[1])) for c in centers])) 158 | 159 | 160 | if __name__ == '__main__': 161 | main() 162 | -------------------------------------------------------------------------------- /data/image_histogram.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | 
"cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "from pathlib import Path\n", 10 | "\n", 11 | "import numpy as np\n", 12 | "import cv2\n", 13 | "from matplotlib import pyplot as plt\n", 14 | "\n", 15 | "%matplotlib inline" 16 | ] 17 | }, 18 | { 19 | "cell_type": "code", 20 | "execution_count": 2, 21 | "metadata": {}, 22 | "outputs": [ 23 | { 24 | "name": "stdout", 25 | "output_type": "stream", 26 | "text": [ 27 | "Processing images 0/19370\n", 28 | "Processing images 1000/19370\n", 29 | "Processing images 2000/19370\n", 30 | "Processing images 3000/19370\n", 31 | "Processing images 4000/19370\n", 32 | "Processing images 5000/19370\n", 33 | "Processing images 6000/19370\n", 34 | "Processing images 7000/19370\n", 35 | "Processing images 8000/19370\n", 36 | "Processing images 9000/19370\n", 37 | "Processing images 10000/19370\n", 38 | "Processing images 11000/19370\n", 39 | "Processing images 12000/19370\n", 40 | "Processing images 13000/19370\n", 41 | "Processing images 14000/19370\n", 42 | "Processing images 15000/19370\n", 43 | "Processing images 16000/19370\n", 44 | "Processing images 17000/19370\n", 45 | "Processing images 18000/19370\n", 46 | "Processing images 19000/19370\n", 47 | "Processing images 19369/19370\n" 48 | ] 49 | } 50 | ], 51 | "source": [ 52 | "jpg_paths = list(Path('raw/Images').rglob('*.jpg'))\n", 53 | "img_widths, img_heights = [], []\n", 54 | "for i, jpg_path in enumerate(jpg_paths):\n", 55 | " if i % 1000 == 0 or i == len(jpg_paths) - 1:\n", 56 | " print('Processing images %d/%d' % (i, len(jpg_paths)))\n", 57 | " img = cv2.imread(jpg_path.as_posix())\n", 58 | " assert img is not None\n", 59 | " img_h, img_w, img_c = img.shape\n", 60 | " assert img_c == 3\n", 61 | " img_widths.append(img_w)\n", 62 | " img_heights.append(img_h)\n", 63 | "\n", 64 | "img_widths = np.array(img_widths)\n", 65 | "img_heights = np.array(img_heights)" 66 | ] 67 | }, 68 | { 69 | "cell_type": "code", 70 | 
"execution_count": 3, 71 | "metadata": {}, 72 | "outputs": [ 73 | { 74 | "name": "stdout", 75 | "output_type": "stream", 76 | "text": [ 77 | "image with min width ( 400): 273271,270ce0009b574b4f.jpg\n", 78 | "image with max width (10800): 283081,16bc500013036fc4.jpg\n", 79 | "image with min height ( 300): 273278,75ad1000895e69f8.jpg\n", 80 | "image with max height ( 7200): 283081,16bc500013036fc4.jpg\n", 81 | "min aspect ratio (0.36900): 283554,158f20008da98dbc.jpg\n", 82 | "max aspect ratio (5.60440): 273278,11efb10008ff5dbd4.jpg\n" 83 | ] 84 | } 85 | ], 86 | "source": [ 87 | "img_aspects = img_widths / img_heights\n", 88 | "\n", 89 | "idx = img_widths.argmin()\n", 90 | "print('image with min width (%5d): %s' % (img_widths[idx], jpg_paths[idx].name))\n", 91 | "idx = img_widths.argmax()\n", 92 | "print('image with max width (%5d): %s' % (img_widths[idx], jpg_paths[idx].name))\n", 93 | "idx = img_heights.argmin()\n", 94 | "print('image with min height (%5d): %s' % (img_heights[idx], jpg_paths[idx].name))\n", 95 | "idx = img_heights.argmax()\n", 96 | "print('image with max height (%5d): %s' % (img_heights[idx], jpg_paths[idx].name))\n", 97 | "idx = img_aspects.argmin()\n", 98 | "print('min aspect ratio (%7.5f): %s' % (img_aspects[idx], jpg_paths[idx].name))\n", 99 | "idx = img_aspects.argmax()\n", 100 | "print('max aspect ratio (%7.5f): %s' % (img_aspects[idx], jpg_paths[idx].name))" 101 | ] 102 | }, 103 | { 104 | "cell_type": "code", 105 | "execution_count": 4, 106 | "metadata": {}, 107 | "outputs": [ 108 | { 109 | "data": { 110 | "image/png": "\n", 111 | "text/plain": [ 112 | "
" 113 | ] 114 | }, 115 | "metadata": { 116 | "needs_background": "light" 117 | }, 118 | "output_type": "display_data" 119 | } 120 | ], 121 | "source": [ 122 | "plt.figure(figsize=(12, 6))\n", 123 | "plt.subplot(1, 2, 1)\n", 124 | "plt.hist(img_widths)\n", 125 | "plt.title('image widths')\n", 126 | "plt.subplot(1, 2, 2)\n", 127 | "plt.hist(img_heights)\n", 128 | "plt.title('image heights')\n", 129 | "plt.tight_layout()\n", 130 | "plt.show()" 131 | ] 132 | }, 133 | { 134 | "cell_type": "code", 135 | "execution_count": 5, 136 | "metadata": {}, 137 | "outputs": [ 138 | { 139 | "data": { 140 | "image/png": "\n", 141 | "text/plain": [ 142 | "
" 143 | ] 144 | }, 145 | "metadata": { 146 | "needs_background": "light" 147 | }, 148 | "output_type": "display_data" 149 | } 150 | ], 151 | "source": [ 152 | "plt.figure(figsize=(10, 10))\n", 153 | "\n", 154 | "plt.subplot(1, 1, 1)\n", 155 | "plt.scatter(img_widths, img_heights)\n", 156 | "plt.title('image widths/heights')\n", 157 | "plt.xlabel('width')\n", 158 | "plt.ylabel('height')\n", 159 | "\n", 160 | "plt.show()" 161 | ] 162 | }, 163 | { 164 | "cell_type": "code", 165 | "execution_count": null, 166 | "metadata": {}, 167 | "outputs": [], 168 | "source": [] 169 | } 170 | ], 171 | "metadata": { 172 | "kernelspec": { 173 | "display_name": "Python 3", 174 | "language": "python", 175 | "name": "python3" 176 | }, 177 | "language_info": { 178 | "codemirror_mode": { 179 | "name": "ipython", 180 | "version": 3 181 | }, 182 | "file_extension": ".py", 183 | "mimetype": "text/x-python", 184 | "name": "python", 185 | "nbconvert_exporter": "python", 186 | "pygments_lexer": "ipython3", 187 | "version": "3.6.9" 188 | } 189 | }, 190 | "nbformat": 4, 191 | "nbformat_minor": 4 192 | } 193 | -------------------------------------------------------------------------------- /data/prepare_data.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | set -e 4 | 5 | # check argument (must be exactly {width}x{height}, digits only) 6 | if [[ -z $1 || ! $1 =~ ^[[:digit:]]+x[[:digit:]]+$ ]]; then 7 | echo "ERROR: This script requires 1 argument, \"input dimension\" of the YOLO model." 8 | echo "The input dimension should be {width}x{height} such as 608x608 or 416x256."
9 | exit 1 10 | fi 11 | 12 | if which python3 > /dev/null; then 13 | PYTHON=python3 14 | else 15 | PYTHON=python 16 | fi 17 | 18 | echo "** Install requirements" 19 | # "gdown" is for downloading files from GoogleDrive 20 | pip3 install --user gdown > /dev/null 21 | 22 | # make sure to download dataset files to "yolov4_crowdhuman/data/raw/" 23 | mkdir -p $(dirname $0)/raw 24 | pushd $(dirname $0)/raw > /dev/null 25 | 26 | get_file() 27 | { 28 | # do download only if the file does not exist 29 | if [[ -f $2 ]]; then 30 | echo Skipping $2 31 | else 32 | echo Downloading $2... 33 | python3 -m gdown.cli $1 34 | fi 35 | } 36 | 37 | echo "** Download dataset files" 38 | get_file https://drive.google.com/uc?id=134QOvaatwKdy0iIeNqA_p-xkAhkV4F8Y CrowdHuman_train01.zip 39 | get_file https://drive.google.com/uc?id=17evzPh7gc1JBNvnW1ENXLy5Kr4Q_Nnla CrowdHuman_train02.zip 40 | get_file https://drive.google.com/uc?id=1tdp0UCgxrqy1B6p8LkR-Iy0aIJ8l4fJW CrowdHuman_train03.zip 41 | get_file https://drive.google.com/uc?id=18jFI789CoHTppQ7vmRSFEdnGaSQZ4YzO CrowdHuman_val.zip 42 | # test data is not needed... 43 | # get_file https://drive.google.com/uc?id=1tQG3E_RrRI4wIGskorLTmDiWHH2okVvk CrowdHuman_test.zip 44 | get_file https://drive.google.com/u/0/uc?id=1UUTea5mYqvlUObsC1Z8CFldHJAtLtMX3 annotation_train.odgt 45 | get_file https://drive.google.com/u/0/uc?id=10WIRwu8ju8GRLuCkZ_vT6hnNxs5ptwoL annotation_val.odgt 46 | 47 | # unzip image files (ignore CrowdHuman_test.zip for now) 48 | echo "** Unzip dataset files" 49 | for f in CrowdHuman_train01.zip CrowdHuman_train02.zip CrowdHuman_train03.zip CrowdHuman_val.zip ; do 50 | unzip -n ${f} 51 | done 52 | 53 | echo "** Create the crowdhuman-$1/ subdirectory" 54 | rm -rf ../crowdhuman-$1/ 55 | mkdir ../crowdhuman-$1/ 56 | ln Images/*.jpg ../crowdhuman-$1/ 57 | 58 | # the crowdhuman/ subdirectory now contains all train/val jpg images 59 | 60 | echo "** Generate yolo txt files" 61 | cd .. 
62 | ${PYTHON} gen_txts.py $1 63 | 64 | popd > /dev/null 65 | 66 | echo "** Done." 67 | -------------------------------------------------------------------------------- /data/verify_txts.py: -------------------------------------------------------------------------------- 1 | """verify_txts.py 2 | 3 | For verifying correctness of the generated YOLO txt annotations. 4 | """ 5 | 6 | 7 | import random 8 | from pathlib import Path 9 | from argparse import ArgumentParser 10 | 11 | import cv2 12 | 13 | 14 | WINDOW_NAME = "verify_txts" 15 | 16 | parser = ArgumentParser() 17 | parser.add_argument('dim', help='input width and height, e.g. 608x608') 18 | args = parser.parse_args() 19 | 20 | if random.random() < 0.5: 21 | print('Verifying test.txt') 22 | jpgs_path = Path('crowdhuman-%s/test.txt' % args.dim) 23 | else: 24 | print('Verifying train.txt') 25 | jpgs_path = Path('crowdhuman-%s/train.txt' % args.dim) 26 | 27 | with open(jpgs_path.as_posix(), 'r') as f: 28 | jpg_names = [l.strip()[5:] for l in f.readlines()] 29 | 30 | random.shuffle(jpg_names) 31 | for jpg_name in jpg_names: 32 | img = cv2.imread(jpg_name) 33 | img_h, img_w, _ = img.shape 34 | txt_name = jpg_name.replace('.jpg', '.txt') 35 | with open(txt_name, 'r') as f: 36 | obj_lines = [l.strip() for l in f.readlines()] 37 | for obj_line in obj_lines: 38 | cls, cx, cy, nw, nh = [float(item) for item in obj_line.split(' ')] 39 | color = (0, 0, 255) if cls == 0.0 else (0, 255, 0) 40 | x_min = int((cx - (nw / 2.0)) * img_w) 41 | y_min = int((cy - (nh / 2.0)) * img_h) 42 | x_max = int((cx + (nw / 2.0)) * img_w) 43 | y_max = int((cy + (nh / 2.0)) * img_h) 44 | cv2.rectangle(img, (x_min, y_min), (x_max, y_max), color, 2) 45 | cv2.imshow(WINDOW_NAME, img) 46 | if cv2.waitKey(0) == 27: 47 | break 48 | 49 | cv2.destroyAllWindows() 50 | -------------------------------------------------------------------------------- /doc/cant_connect_gpu.jpg: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/jkjung-avt/yolov4_crowdhuman/374b0e839e062d2039259ca18f3490f39cd122a8/doc/cant_connect_gpu.jpg -------------------------------------------------------------------------------- /doc/chart_yolov4-crowdhuman-608x608.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jkjung-avt/yolov4_crowdhuman/374b0e839e062d2039259ca18f3490f39cd122a8/doc/chart_yolov4-crowdhuman-608x608.png -------------------------------------------------------------------------------- /doc/chart_yolov4-tiny-3l-crowdhuman-416x416.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jkjung-avt/yolov4_crowdhuman/374b0e839e062d2039259ca18f3490f39cd122a8/doc/chart_yolov4-tiny-3l-crowdhuman-416x416.png -------------------------------------------------------------------------------- /doc/chart_yolov4-tiny-crowdhuman-608x608.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jkjung-avt/yolov4_crowdhuman/374b0e839e062d2039259ca18f3490f39cd122a8/doc/chart_yolov4-tiny-crowdhuman-608x608.png -------------------------------------------------------------------------------- /doc/crowdhuman_sample.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jkjung-avt/yolov4_crowdhuman/374b0e839e062d2039259ca18f3490f39cd122a8/doc/crowdhuman_sample.jpg -------------------------------------------------------------------------------- /doc/drive_on_colab.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jkjung-avt/yolov4_crowdhuman/374b0e839e062d2039259ca18f3490f39cd122a8/doc/drive_on_colab.jpg -------------------------------------------------------------------------------- /doc/infinity_war.jpg: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/jkjung-avt/yolov4_crowdhuman/374b0e839e062d2039259ca18f3490f39cd122a8/doc/infinity_war.jpg -------------------------------------------------------------------------------- /doc/predictions_sample.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jkjung-avt/yolov4_crowdhuman/374b0e839e062d2039259ca18f3490f39cd122a8/doc/predictions_sample.jpg -------------------------------------------------------------------------------- /doc/save_a_copy.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jkjung-avt/yolov4_crowdhuman/374b0e839e062d2039259ca18f3490f39cd122a8/doc/save_a_copy.jpg -------------------------------------------------------------------------------- /prepare_training.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | set -e 4 | 5 | # check argument (must be exactly {width}x{height}, digits only) 6 | if [[ -z $1 || ! $1 =~ ^[[:digit:]]+x[[:digit:]]+$ ]]; then 7 | echo "ERROR: This script requires 1 argument, \"input dimension\" of the YOLO model." 8 | echo "The input dimension should be {width}x{height} such as 608x608 or 416x256." 9 | exit 1 10 | fi 11 | 12 | CROWDHUMAN=crowdhuman-$1 13 | 14 | if [[ ! -f data/${CROWDHUMAN}/train.txt || ! -f data/${CROWDHUMAN}/test.txt ]]; then 15 | echo "ERROR: missing txt file in data/${CROWDHUMAN}/" 16 | exit 1 17 | fi 18 | 19 | echo "** Install requirements" 20 | # "gdown" is for downloading files from GoogleDrive 21 | pip3 install --user gdown > /dev/null 22 | 23 | echo "** Copy files for training" 24 | ln -sf $(readlink -f data/${CROWDHUMAN}) darknet/data/ 25 | cp data/${CROWDHUMAN}.data darknet/data/ 26 | cp data/crowdhuman.names darknet/data/ 27 | cp cfg/*.cfg darknet/cfg/ 28 | 29 | if [[ !
-f darknet/yolov4.conv.137 ]]; then 30 | pushd darknet > /dev/null 31 | echo "** Download pre-trained yolov4 weights" 32 | python3 -m gdown.cli https://drive.google.com/uc?id=1JKF-bdIklxOOVy-2Cr5qdvjgGpmGfcbp 33 | popd > /dev/null 34 | fi 35 | 36 | echo "** Done." 37 | --------------------------------------------------------------------------------
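Note on the "input dimension" argument: prepare_data.sh and prepare_training.sh only check that the argument looks like {width}x{height}, while data/gen_txts.py also rejects dimensions that are not multiples of 32. The sketch below shows one way to run both checks in a single place. It is an illustration only and is not part of the repository; the helper name `parse_dim` is hypothetical.

```python
import re


def parse_dim(dim):
    """Hypothetical helper (not in the repository): parse '{width}x{height}'.

    Returns (width, height) as ints, or raises ValueError when the string
    is malformed or either side is not a positive multiple of 32 (the same
    multiple-of-32 rule that gen_txts.py applies in main()).
    """
    # fullmatch() rejects extra characters before or after the WxH pattern
    m = re.fullmatch(r'(\d+)x(\d+)', dim)
    if m is None:
        raise ValueError('bad spec of input dim (%s)' % dim)
    width, height = int(m.group(1)), int(m.group(2))
    if width <= 0 or height <= 0 or width % 32 != 0 or height % 32 != 0:
        raise ValueError('bad spec of input dim (%s)' % dim)
    return width, height


if __name__ == '__main__':
    for spec in ('608x608', '416x256', '600x600', '608x608x3'):
        try:
            print(spec, '->', parse_dim(spec))
        except ValueError as e:
            print(spec, '->', e)
```

With a check like this, a bad dimension such as 600x600 would be reported before the dataset is downloaded and unzipped, rather than afterwards when gen_txts.py runs.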