├── README.md
├── Racecar.py
├── coco.names
├── demo
│   └── output_1.gif
├── images
│   ├── CV Proj - Block Diag.jpg
│   ├── IMG_20200429_091806.jpg
│   └── IMG_20200429_092927.jpg
└── yolov3.cfg

/README.md:
--------------------------------------------------------------------------------
1 | # Overview
2 | This project focuses on building the perception stage of a small autonomous racecar system running on an `NVIDIA Jetson TX2`, with a `ZED Stereo Camera` providing the visual input. The algorithm is developed in `Python 2` using the `Robot Operating System (ROS)`. It uses `YOLO v3` to detect the objects in the racecar's environment, and their distances are estimated from the stereo camera's depth output. This information is then fed to the master controller node in ROS, which can use it to plan and control the autonomous racecar. The algorithm can run in real time on a GPU.
3 |
4 | # Motivation
5 | During Spring 2020, I (a graduate student in Computer Science) and my teammates (graduate students in Mechanical and Electrical Engineering) worked on a large-scale project in the domain of autonomous vehicles. While doing the literature survey for that project, we came across [F1Tenth](https://f1tenth.org/), an international community in autonomous systems, and got inspired to build a small autonomous racecar system of our own. I took up the computer vision (perception) challenges of the project while my teammates focused on the planning and control modules of the racecar.
6 | Soon after we started, the pandemic broke out. Coronavirus had entered the US, and we were all quarantined at home. We were lucky to have multiple NVIDIA Jetson TX2s in our lab, so I took one home along with a ZED stereo camera and committed to building the perception module and making the racecar run in my living room, haha!
7 |
8 | # Abstract
9 | The first step in building an autonomous racecar is to help it perceive its environment, so that it can figure out its next steps on the way to complete autonomy. With computer vision techniques in the perception stage, we enable the small autonomous racecar to understand the objects around it while it is on the move.
10 | This project's main objective is to design an algorithm that processes the raw data from the sensors (in this case, a stereo camera) into meaningful observations and provides information about the surroundings to the master controller. Using YOLOv3 for object detection and the ZED Stereo Camera's depth output (accessed through ROS and running on the NVIDIA Jetson TX2) for distance estimation, the algorithm helps the racecar communicate this information to its controller, which can then determine the path the racecar should take.
11 | Moreover, the racecar needs to react very quickly, so the challenge is to design the algorithm so that the racecar can make these observations with minimal delay. This project implements only the computer vision (perception) stage of the racecar.
12 |
13 | ![](https://github.com/kumarapurv/Object-Detection-and-Depth-Sensing-for-a-Small-Autonomous-Racecar-System/blob/main/images/CV%20Proj%20-%20Block%20Diag.jpg)
14 |
15 | # Requirements
16 | ## Dependencies
17 | ```
18 | python 2.x
19 | Darknet
20 | rospy
21 | roslib
22 | opencv (cv2)
23 | cv_bridge
24 | numpy
25 | rosbag
26 |
27 | ```
28 | Install Darknet [here](https://pjreddie.com/darknet/install/).
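A quick way to confirm that the Python side of these dependencies is importable is a short check like the one below (a minimal sanity-check sketch, not part of the pipeline; it assumes the ROS environment has been sourced):
```
# sanity check for the Python packages used by Racecar.py (Python 2)
import cv2
import numpy
import rospy
import roslib
import rosbag
from cv_bridge import CvBridge

print "OpenCV:", cv2.__version__
print "NumPy:", numpy.__version__
print "cv2.dnn module available:", hasattr(cv2, 'dnn')  # Racecar.py relies on cv2.dnn.readNetFromDarknet
```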
29 |
30 | ## Hardware Requirements
31 | ```
32 | NVIDIA Jetson TX2
33 | ZED Stereo Camera
34 | ```
35 | ![](https://github.com/kumarapurv/Object-Detection-and-Depth-Sensing-for-a-Small-Autonomous-Racecar-System/blob/main/images/IMG_20200429_091806.jpg)
36 | ![](https://github.com/kumarapurv/Object-Detection-and-Depth-Sensing-for-a-Small-Autonomous-Racecar-System/blob/main/images/IMG_20200429_092927.jpg)
37 |
38 | ## Other Requirements
39 | This code can also be run on a live camera feed, but that is part of the complete pipeline (which includes all three modules: perception, planning, and control). I have shared a Python script (`Racecar.py`) in this repository that instead uses ROS bag files to retrieve the frames from a pre-recorded video. Example bag files can be downloaded from this [link](https://iu.box.com/s/p5ambtjg02qxxj2q0e0kcuj7fcf67h7o).
40 |
41 | YOLO weights (yolov3.weights) can be downloaded from the [official YOLO website](https://pjreddie.com/darknet/yolo/). The script expects `yolov3.weights`, `yolov3.cfg`, and `coco.names` to be in the same directory as `Racecar.py`.
42 |
43 | # How to Run
44 | ```
45 | python Racecar.py <path-to-bag-file>
46 | ```
47 |
48 | # Output
49 | The code writes the processed frames to the file `output.avi`.
50 |
51 | As seen in the output, each object's label and its distance from the stereo camera are shown at the top left of its bounding box.
52 |
53 | ![output](https://github.com/kumarapurv/Object-Detection-and-Depth-Sensing-for-a-Small-Autonomous-Racecar-System/blob/main/demo/output_1.gif)
54 |
--------------------------------------------------------------------------------
/Racecar.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | import roslib
3 | import sys
4 | import rospy
5 | import cv2
6 | from std_msgs.msg import String
7 | from sensor_msgs.msg import Image
8 | from cv_bridge import CvBridge, CvBridgeError
9 | import numpy as np
10 | import math
11 | import time
12 | import rosbag
13 |
14 |
15 | ############################ Extracting data from bag files #####################################
16 |
17 | bagfile = sys.argv[1]
18 | left_images = []  # list to store the image numpy arrays extracted from the bag file
19 | depth_arr = []  # list to store the depth numpy arrays extracted from the bag file
20 |
21 | # extracting data from the 'image_raw_color' topic
22 | bag = rosbag.Bag(bagfile)
23 | for topic, msg, t in bag.read_messages(topics="/zed/zed_node/left_raw/image_raw_color"):
24 |     bridge = CvBridge()
25 |     image = bridge.imgmsg_to_cv2(msg, desired_encoding='passthrough')
26 |     left_images.append(image)
27 | bag.close()
28 |
29 | # extracting data from the 'depth_registered' topic
30 | bag = rosbag.Bag(bagfile)
31 | for topic, msg, t in bag.read_messages(topics="/zed/zed_node/depth/depth_registered"):
32 |     bridge = CvBridge()
33 |     depth_image = bridge.imgmsg_to_cv2(msg, "32FC1")
34 |     depth = np.array(depth_image, dtype=np.float32)
35 |     depth_arr.append(depth)
36 | bag.close()
37 |
38 |
39 | ########################## Object Detection and Depth Calculation ##############################
40 |
41 | def get_depth(x, y, w, h, n):  # function to return the distance of a detected object
42 |     u = x+w//2  # u coordinate of the bounding-box center
43 |     v = y+h//2  # v coordinate of the bounding-box center
44 |     dist = -1
45 |     depth_array = depth_arr[n]
46 |
47 |     # to take care of boundary conditions for the bounding boxes
48 |     left = u-15
49 |     if (left<0): left=0
50 |     right = u+15
51 |     if (right>=1280): right=1280
52 |
53 |     up = v-15
54 |     if (up<0): up = 0
55 |     down = v+15
56 |     if (down>720): down = 720
57 |
58 |     # extracting the distance value around the center pixel of the bounding box
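    # Note: the clamps above assume the depth image is 1280x720 (the ZED's HD720 resolution).
    # The loop below scans a window of up to roughly 30x30 pixels around the bounding-box
    # center and returns the first valid depth it finds; NaN entries are pixels where the
    # ZED produced no depth estimate.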
59 |     for row in range(up, down):
60 |         for col in range(left, right):
61 |             # verifying that the value is not 'nan'
62 |             if (not(math.isnan(float(depth_array[row][col])))):
63 |                 dist = depth_array[row, col]
64 |                 return dist  # returning as soon as a valid distance value is found
65 |     return dist  # no valid depth found; returning -1 for this particular object
66 |
67 |
68 | conf_thres = 0.5  # confidence threshold
69 | nms_thres = 0.4  # non-maximum suppression threshold
70 |
71 | labelsPath = 'coco.names'  # labels of the COCO dataset
72 | labels = [l.strip() for l in open(labelsPath).read().strip().split('\n')]
73 |
74 | # picking a random color for the bounding boxes of each object class
75 | colors = np.random.randint(0, 255, size=(len(labels), 3), dtype="uint8")
76 |
77 | # loading the config and weights files
78 | model_config = 'yolov3.cfg'
79 | model_weights = 'yolov3.weights'
80 | # reading the network model stored in Darknet format
81 | net = cv2.dnn.readNetFromDarknet(model_config, model_weights)
82 |
83 | layerName = net.getLayerNames()
84 | layerName = [layerName[i[0] - 1] for i in net.getUnconnectedOutLayers()]  # getUnconnectedOutLayers() returns an Nx1 array in this OpenCV version
85 |
86 | # opening the writer to write the output into an avi file
87 | count = 0
88 | writer = None
89 | (H, W) = (None, None)
90 | #start = time.time()
91 |
92 | # traversing through the images captured in sequence
93 | for li in range(0, len(left_images)-1):
94 |     img = left_images[li]
95 |     # extracting the BGR channels from the BGRA image
96 |     b, g, r, a = cv2.split(img)
97 |     image = np.stack((b, g, r), -1)
98 |     image = cv2.resize(image, (img.shape[1], img.shape[0]))
99 |     #print(image.shape)
100 |
101 |     if W is None or H is None: (H, W) = image.shape[:2]
102 |
103 |     # creating a 4-dimensional blob from the image
104 |     blob = cv2.dnn.blobFromImage(image, 1 / 255.0, (416, 416), swapRB = True, crop = False)
105 |     net.setInput(blob)
106 |     # running a forward pass to compute the outputs of the YOLO output layers
107 |     layersOutputs = net.forward(layerName)
108 |
109 |     boxes = []
110 |     confidences = []
111 |     classIDs = []
112 |
113 |     for output in layersOutputs:
114 |         for detection in output:
115 |             scores = detection[5:]
116 |             classID = np.argmax(scores)
117 |             confidence = scores[classID]
118 |             # keeping detections with confidence greater than the threshold
119 |             if confidence > conf_thres:
120 |                 box = detection[0:4] * np.array([W, H, W, H])
121 |                 (centerX, centerY, width, height) = box.astype('int')
122 |                 x = int(centerX - (width/2))
123 |                 y = int(centerY - (height/2))
124 |
125 |                 # storing the object's coordinates, confidence and classID
126 |                 boxes.append([x, y, int(width), int(height)])
127 |                 confidences.append(float(confidence))
128 |                 classIDs.append(classID)
129 |
130 |     # non-maximum suppression
131 |     detectionNMS = cv2.dnn.NMSBoxes(boxes, confidences, conf_thres, nms_thres)
132 |
133 |     if(len(detectionNMS) > 0):
134 |         for i in detectionNMS.flatten():
135 |             (x, y) = (boxes[i][0], boxes[i][1])
136 |             (w, h) = (boxes[i][2], boxes[i][3])
137 |             # extracting the depth of every object in the frame
138 |             d = get_depth(x, y, w, h, li)
139 |
140 |             color = [int(c) for c in colors[classIDs[i]]]
141 |             cv2.rectangle(image, (x, y), (x + w, y + h), color, 2)
142 |             if (d == -1):  # distance = -1 means that no depth value was found
143 |                 text = '{}: ~~~'.format(labels[classIDs[i]])
144 |             else:
145 |                 text = '{}: {:.3f}m'.format(labels[classIDs[i]], d)
146 |             # writing the object class and its corresponding distance on the image
147 |             cv2.putText(image, text, (x, y+5), cv2.FONT_HERSHEY_SIMPLEX, 0.5,
color, 2) 148 | 149 | if writer is None: 150 | fourcc = cv2.VideoWriter_fourcc(*"MJPG") 151 | writer = cv2.VideoWriter("output.avi", fourcc, 10, (W,H), True) 152 | writer.write(image) 153 | print "writing frame", count 154 | count=count+1 155 | # file finished writing 156 | writer.release() 157 | 158 | #end = time.time() 159 | #print "time taken:", end-start 160 | -------------------------------------------------------------------------------- /coco.names: -------------------------------------------------------------------------------- 1 | person 2 | bicycle 3 | car 4 | motorbike 5 | aeroplane 6 | bus 7 | train 8 | truck 9 | boat 10 | traffic light 11 | fire hydrant 12 | stop sign 13 | parking meter 14 | bench 15 | bird 16 | cat 17 | dog 18 | horse 19 | sheep 20 | cow 21 | elephant 22 | bear 23 | zebra 24 | giraffe 25 | backpack 26 | umbrella 27 | handbag 28 | tie 29 | suitcase 30 | frisbee 31 | skis 32 | snowboard 33 | sports ball 34 | kite 35 | baseball bat 36 | baseball glove 37 | skateboard 38 | surfboard 39 | tennis racket 40 | bottle 41 | wine glass 42 | cup 43 | fork 44 | knife 45 | spoon 46 | bowl 47 | banana 48 | apple 49 | sandwich 50 | orange 51 | broccoli 52 | carrot 53 | hot dog 54 | pizza 55 | donut 56 | cake 57 | chair 58 | sofa 59 | pottedplant 60 | bed 61 | diningtable 62 | toilet 63 | tvmonitor 64 | laptop 65 | mouse 66 | remote 67 | keyboard 68 | cell phone 69 | microwave 70 | oven 71 | toaster 72 | sink 73 | refrigerator 74 | book 75 | clock 76 | vase 77 | scissors 78 | teddy bear 79 | hair drier 80 | toothbrush -------------------------------------------------------------------------------- /demo/output_1.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kumarapurv/Object-Detection-and-Depth-Sensing-for-a-Small-Autonomous-Racecar-System/21409046fa0496e177aa92f0f5ae8eb6ea7fb021/demo/output_1.gif -------------------------------------------------------------------------------- /images/CV Proj - Block Diag.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kumarapurv/Object-Detection-and-Depth-Sensing-for-a-Small-Autonomous-Racecar-System/21409046fa0496e177aa92f0f5ae8eb6ea7fb021/images/CV Proj - Block Diag.jpg -------------------------------------------------------------------------------- /images/IMG_20200429_091806.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kumarapurv/Object-Detection-and-Depth-Sensing-for-a-Small-Autonomous-Racecar-System/21409046fa0496e177aa92f0f5ae8eb6ea7fb021/images/IMG_20200429_091806.jpg -------------------------------------------------------------------------------- /images/IMG_20200429_092927.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kumarapurv/Object-Detection-and-Depth-Sensing-for-a-Small-Autonomous-Racecar-System/21409046fa0496e177aa92f0f5ae8eb6ea7fb021/images/IMG_20200429_092927.jpg -------------------------------------------------------------------------------- /yolov3.cfg: -------------------------------------------------------------------------------- 1 | [net] 2 | # Testing 3 | # batch=1 4 | # subdivisions=1 5 | # Training 6 | batch=64 7 | subdivisions=16 8 | width=608 9 | height=608 10 | channels=3 11 | momentum=0.9 12 | decay=0.0005 13 | angle=0 14 | saturation = 1.5 15 | exposure = 1.5 16 | hue=.1 17 | 18 | learning_rate=0.001 19 | burn_in=1000 20 
| max_batches = 500200 21 | policy=steps 22 | steps=400000,450000 23 | scales=.1,.1 24 | 25 | [convolutional] 26 | batch_normalize=1 27 | filters=32 28 | size=3 29 | stride=1 30 | pad=1 31 | activation=leaky 32 | 33 | # Downsample 34 | 35 | [convolutional] 36 | batch_normalize=1 37 | filters=64 38 | size=3 39 | stride=2 40 | pad=1 41 | activation=leaky 42 | 43 | [convolutional] 44 | batch_normalize=1 45 | filters=32 46 | size=1 47 | stride=1 48 | pad=1 49 | activation=leaky 50 | 51 | [convolutional] 52 | batch_normalize=1 53 | filters=64 54 | size=3 55 | stride=1 56 | pad=1 57 | activation=leaky 58 | 59 | [shortcut] 60 | from=-3 61 | activation=linear 62 | 63 | # Downsample 64 | 65 | [convolutional] 66 | batch_normalize=1 67 | filters=128 68 | size=3 69 | stride=2 70 | pad=1 71 | activation=leaky 72 | 73 | [convolutional] 74 | batch_normalize=1 75 | filters=64 76 | size=1 77 | stride=1 78 | pad=1 79 | activation=leaky 80 | 81 | [convolutional] 82 | batch_normalize=1 83 | filters=128 84 | size=3 85 | stride=1 86 | pad=1 87 | activation=leaky 88 | 89 | [shortcut] 90 | from=-3 91 | activation=linear 92 | 93 | [convolutional] 94 | batch_normalize=1 95 | filters=64 96 | size=1 97 | stride=1 98 | pad=1 99 | activation=leaky 100 | 101 | [convolutional] 102 | batch_normalize=1 103 | filters=128 104 | size=3 105 | stride=1 106 | pad=1 107 | activation=leaky 108 | 109 | [shortcut] 110 | from=-3 111 | activation=linear 112 | 113 | # Downsample 114 | 115 | [convolutional] 116 | batch_normalize=1 117 | filters=256 118 | size=3 119 | stride=2 120 | pad=1 121 | activation=leaky 122 | 123 | [convolutional] 124 | batch_normalize=1 125 | filters=128 126 | size=1 127 | stride=1 128 | pad=1 129 | activation=leaky 130 | 131 | [convolutional] 132 | batch_normalize=1 133 | filters=256 134 | size=3 135 | stride=1 136 | pad=1 137 | activation=leaky 138 | 139 | [shortcut] 140 | from=-3 141 | activation=linear 142 | 143 | [convolutional] 144 | batch_normalize=1 145 | filters=128 146 | size=1 147 | stride=1 148 | pad=1 149 | activation=leaky 150 | 151 | [convolutional] 152 | batch_normalize=1 153 | filters=256 154 | size=3 155 | stride=1 156 | pad=1 157 | activation=leaky 158 | 159 | [shortcut] 160 | from=-3 161 | activation=linear 162 | 163 | [convolutional] 164 | batch_normalize=1 165 | filters=128 166 | size=1 167 | stride=1 168 | pad=1 169 | activation=leaky 170 | 171 | [convolutional] 172 | batch_normalize=1 173 | filters=256 174 | size=3 175 | stride=1 176 | pad=1 177 | activation=leaky 178 | 179 | [shortcut] 180 | from=-3 181 | activation=linear 182 | 183 | [convolutional] 184 | batch_normalize=1 185 | filters=128 186 | size=1 187 | stride=1 188 | pad=1 189 | activation=leaky 190 | 191 | [convolutional] 192 | batch_normalize=1 193 | filters=256 194 | size=3 195 | stride=1 196 | pad=1 197 | activation=leaky 198 | 199 | [shortcut] 200 | from=-3 201 | activation=linear 202 | 203 | 204 | [convolutional] 205 | batch_normalize=1 206 | filters=128 207 | size=1 208 | stride=1 209 | pad=1 210 | activation=leaky 211 | 212 | [convolutional] 213 | batch_normalize=1 214 | filters=256 215 | size=3 216 | stride=1 217 | pad=1 218 | activation=leaky 219 | 220 | [shortcut] 221 | from=-3 222 | activation=linear 223 | 224 | [convolutional] 225 | batch_normalize=1 226 | filters=128 227 | size=1 228 | stride=1 229 | pad=1 230 | activation=leaky 231 | 232 | [convolutional] 233 | batch_normalize=1 234 | filters=256 235 | size=3 236 | stride=1 237 | pad=1 238 | activation=leaky 239 | 240 | [shortcut] 241 | from=-3 242 | 
activation=linear 243 | 244 | [convolutional] 245 | batch_normalize=1 246 | filters=128 247 | size=1 248 | stride=1 249 | pad=1 250 | activation=leaky 251 | 252 | [convolutional] 253 | batch_normalize=1 254 | filters=256 255 | size=3 256 | stride=1 257 | pad=1 258 | activation=leaky 259 | 260 | [shortcut] 261 | from=-3 262 | activation=linear 263 | 264 | [convolutional] 265 | batch_normalize=1 266 | filters=128 267 | size=1 268 | stride=1 269 | pad=1 270 | activation=leaky 271 | 272 | [convolutional] 273 | batch_normalize=1 274 | filters=256 275 | size=3 276 | stride=1 277 | pad=1 278 | activation=leaky 279 | 280 | [shortcut] 281 | from=-3 282 | activation=linear 283 | 284 | # Downsample 285 | 286 | [convolutional] 287 | batch_normalize=1 288 | filters=512 289 | size=3 290 | stride=2 291 | pad=1 292 | activation=leaky 293 | 294 | [convolutional] 295 | batch_normalize=1 296 | filters=256 297 | size=1 298 | stride=1 299 | pad=1 300 | activation=leaky 301 | 302 | [convolutional] 303 | batch_normalize=1 304 | filters=512 305 | size=3 306 | stride=1 307 | pad=1 308 | activation=leaky 309 | 310 | [shortcut] 311 | from=-3 312 | activation=linear 313 | 314 | 315 | [convolutional] 316 | batch_normalize=1 317 | filters=256 318 | size=1 319 | stride=1 320 | pad=1 321 | activation=leaky 322 | 323 | [convolutional] 324 | batch_normalize=1 325 | filters=512 326 | size=3 327 | stride=1 328 | pad=1 329 | activation=leaky 330 | 331 | [shortcut] 332 | from=-3 333 | activation=linear 334 | 335 | 336 | [convolutional] 337 | batch_normalize=1 338 | filters=256 339 | size=1 340 | stride=1 341 | pad=1 342 | activation=leaky 343 | 344 | [convolutional] 345 | batch_normalize=1 346 | filters=512 347 | size=3 348 | stride=1 349 | pad=1 350 | activation=leaky 351 | 352 | [shortcut] 353 | from=-3 354 | activation=linear 355 | 356 | 357 | [convolutional] 358 | batch_normalize=1 359 | filters=256 360 | size=1 361 | stride=1 362 | pad=1 363 | activation=leaky 364 | 365 | [convolutional] 366 | batch_normalize=1 367 | filters=512 368 | size=3 369 | stride=1 370 | pad=1 371 | activation=leaky 372 | 373 | [shortcut] 374 | from=-3 375 | activation=linear 376 | 377 | [convolutional] 378 | batch_normalize=1 379 | filters=256 380 | size=1 381 | stride=1 382 | pad=1 383 | activation=leaky 384 | 385 | [convolutional] 386 | batch_normalize=1 387 | filters=512 388 | size=3 389 | stride=1 390 | pad=1 391 | activation=leaky 392 | 393 | [shortcut] 394 | from=-3 395 | activation=linear 396 | 397 | 398 | [convolutional] 399 | batch_normalize=1 400 | filters=256 401 | size=1 402 | stride=1 403 | pad=1 404 | activation=leaky 405 | 406 | [convolutional] 407 | batch_normalize=1 408 | filters=512 409 | size=3 410 | stride=1 411 | pad=1 412 | activation=leaky 413 | 414 | [shortcut] 415 | from=-3 416 | activation=linear 417 | 418 | 419 | [convolutional] 420 | batch_normalize=1 421 | filters=256 422 | size=1 423 | stride=1 424 | pad=1 425 | activation=leaky 426 | 427 | [convolutional] 428 | batch_normalize=1 429 | filters=512 430 | size=3 431 | stride=1 432 | pad=1 433 | activation=leaky 434 | 435 | [shortcut] 436 | from=-3 437 | activation=linear 438 | 439 | [convolutional] 440 | batch_normalize=1 441 | filters=256 442 | size=1 443 | stride=1 444 | pad=1 445 | activation=leaky 446 | 447 | [convolutional] 448 | batch_normalize=1 449 | filters=512 450 | size=3 451 | stride=1 452 | pad=1 453 | activation=leaky 454 | 455 | [shortcut] 456 | from=-3 457 | activation=linear 458 | 459 | # Downsample 460 | 461 | [convolutional] 462 | batch_normalize=1 
463 | filters=1024 464 | size=3 465 | stride=2 466 | pad=1 467 | activation=leaky 468 | 469 | [convolutional] 470 | batch_normalize=1 471 | filters=512 472 | size=1 473 | stride=1 474 | pad=1 475 | activation=leaky 476 | 477 | [convolutional] 478 | batch_normalize=1 479 | filters=1024 480 | size=3 481 | stride=1 482 | pad=1 483 | activation=leaky 484 | 485 | [shortcut] 486 | from=-3 487 | activation=linear 488 | 489 | [convolutional] 490 | batch_normalize=1 491 | filters=512 492 | size=1 493 | stride=1 494 | pad=1 495 | activation=leaky 496 | 497 | [convolutional] 498 | batch_normalize=1 499 | filters=1024 500 | size=3 501 | stride=1 502 | pad=1 503 | activation=leaky 504 | 505 | [shortcut] 506 | from=-3 507 | activation=linear 508 | 509 | [convolutional] 510 | batch_normalize=1 511 | filters=512 512 | size=1 513 | stride=1 514 | pad=1 515 | activation=leaky 516 | 517 | [convolutional] 518 | batch_normalize=1 519 | filters=1024 520 | size=3 521 | stride=1 522 | pad=1 523 | activation=leaky 524 | 525 | [shortcut] 526 | from=-3 527 | activation=linear 528 | 529 | [convolutional] 530 | batch_normalize=1 531 | filters=512 532 | size=1 533 | stride=1 534 | pad=1 535 | activation=leaky 536 | 537 | [convolutional] 538 | batch_normalize=1 539 | filters=1024 540 | size=3 541 | stride=1 542 | pad=1 543 | activation=leaky 544 | 545 | [shortcut] 546 | from=-3 547 | activation=linear 548 | 549 | ###################### 550 | 551 | [convolutional] 552 | batch_normalize=1 553 | filters=512 554 | size=1 555 | stride=1 556 | pad=1 557 | activation=leaky 558 | 559 | [convolutional] 560 | batch_normalize=1 561 | size=3 562 | stride=1 563 | pad=1 564 | filters=1024 565 | activation=leaky 566 | 567 | [convolutional] 568 | batch_normalize=1 569 | filters=512 570 | size=1 571 | stride=1 572 | pad=1 573 | activation=leaky 574 | 575 | [convolutional] 576 | batch_normalize=1 577 | size=3 578 | stride=1 579 | pad=1 580 | filters=1024 581 | activation=leaky 582 | 583 | [convolutional] 584 | batch_normalize=1 585 | filters=512 586 | size=1 587 | stride=1 588 | pad=1 589 | activation=leaky 590 | 591 | [convolutional] 592 | batch_normalize=1 593 | size=3 594 | stride=1 595 | pad=1 596 | filters=1024 597 | activation=leaky 598 | 599 | [convolutional] 600 | size=1 601 | stride=1 602 | pad=1 603 | filters=255 604 | activation=linear 605 | 606 | 607 | [yolo] 608 | mask = 6,7,8 609 | anchors = 10,13, 16,30, 33,23, 30,61, 62,45, 59,119, 116,90, 156,198, 373,326 610 | classes=80 611 | num=9 612 | jitter=.3 613 | ignore_thresh = .7 614 | truth_thresh = 1 615 | random=1 616 | 617 | 618 | [route] 619 | layers = -4 620 | 621 | [convolutional] 622 | batch_normalize=1 623 | filters=256 624 | size=1 625 | stride=1 626 | pad=1 627 | activation=leaky 628 | 629 | [upsample] 630 | stride=2 631 | 632 | [route] 633 | layers = -1, 61 634 | 635 | 636 | 637 | [convolutional] 638 | batch_normalize=1 639 | filters=256 640 | size=1 641 | stride=1 642 | pad=1 643 | activation=leaky 644 | 645 | [convolutional] 646 | batch_normalize=1 647 | size=3 648 | stride=1 649 | pad=1 650 | filters=512 651 | activation=leaky 652 | 653 | [convolutional] 654 | batch_normalize=1 655 | filters=256 656 | size=1 657 | stride=1 658 | pad=1 659 | activation=leaky 660 | 661 | [convolutional] 662 | batch_normalize=1 663 | size=3 664 | stride=1 665 | pad=1 666 | filters=512 667 | activation=leaky 668 | 669 | [convolutional] 670 | batch_normalize=1 671 | filters=256 672 | size=1 673 | stride=1 674 | pad=1 675 | activation=leaky 676 | 677 | [convolutional] 678 | 
batch_normalize=1 679 | size=3 680 | stride=1 681 | pad=1 682 | filters=512 683 | activation=leaky 684 | 685 | [convolutional] 686 | size=1 687 | stride=1 688 | pad=1 689 | filters=255 690 | activation=linear 691 | 692 | 693 | [yolo] 694 | mask = 3,4,5 695 | anchors = 10,13, 16,30, 33,23, 30,61, 62,45, 59,119, 116,90, 156,198, 373,326 696 | classes=80 697 | num=9 698 | jitter=.3 699 | ignore_thresh = .7 700 | truth_thresh = 1 701 | random=1 702 | 703 | 704 | 705 | [route] 706 | layers = -4 707 | 708 | [convolutional] 709 | batch_normalize=1 710 | filters=128 711 | size=1 712 | stride=1 713 | pad=1 714 | activation=leaky 715 | 716 | [upsample] 717 | stride=2 718 | 719 | [route] 720 | layers = -1, 36 721 | 722 | 723 | 724 | [convolutional] 725 | batch_normalize=1 726 | filters=128 727 | size=1 728 | stride=1 729 | pad=1 730 | activation=leaky 731 | 732 | [convolutional] 733 | batch_normalize=1 734 | size=3 735 | stride=1 736 | pad=1 737 | filters=256 738 | activation=leaky 739 | 740 | [convolutional] 741 | batch_normalize=1 742 | filters=128 743 | size=1 744 | stride=1 745 | pad=1 746 | activation=leaky 747 | 748 | [convolutional] 749 | batch_normalize=1 750 | size=3 751 | stride=1 752 | pad=1 753 | filters=256 754 | activation=leaky 755 | 756 | [convolutional] 757 | batch_normalize=1 758 | filters=128 759 | size=1 760 | stride=1 761 | pad=1 762 | activation=leaky 763 | 764 | [convolutional] 765 | batch_normalize=1 766 | size=3 767 | stride=1 768 | pad=1 769 | filters=256 770 | activation=leaky 771 | 772 | [convolutional] 773 | size=1 774 | stride=1 775 | pad=1 776 | filters=255 777 | activation=linear 778 | 779 | 780 | [yolo] 781 | mask = 0,1,2 782 | anchors = 10,13, 16,30, 33,23, 30,61, 62,45, 59,119, 116,90, 156,198, 373,326 783 | classes=80 784 | num=9 785 | jitter=.3 786 | ignore_thresh = .7 787 | truth_thresh = 1 788 | random=1 789 | 790 | --------------------------------------------------------------------------------