├── README.md
├── Racecar.py
├── coco.names
├── demo
│   └── output_1.gif
├── images
│   ├── CV Proj - Block Diag.jpg
│   ├── IMG_20200429_091806.jpg
│   └── IMG_20200429_092927.jpg
└── yolov3.cfg

/README.md:
--------------------------------------------------------------------------------
1 | # Overview
2 | This project focuses on building the perception stage of a small autonomous racecar system running on an `NVIDIA Jetson TX2`, with a `ZED Stereo Camera` providing the visual input. The algorithm is developed in `Python 2` using the `Robot Operating System (ROS)`. It uses `YOLO v3` to detect the objects in the racecar's environment, and their distances are estimated from the stereo camera's depth output. This information is then fed to the master controller node in ROS, which can use it to plan and control the autonomous racecar. The algorithm can run in real time on a GPU.
3 |
4 | # Motivation
5 | During Spring 2020, I (a graduate student in Computer Science) and my teammates (graduate students in Mechanical and Electrical Engineering) worked on a large-scale project in the domain of autonomous vehicles. While doing the literature survey for that project, we came across [F1Tenth](https://f1tenth.org/), an international community in autonomous systems, and got inspired to build a small autonomous racecar system of our own. I took up the computer vision (perception) challenges of the project while my teammates focused on the planning and control modules of the racecar.
6 | Soon after we started, the pandemic broke out. Coronavirus had entered the US, and we were all quarantined at home. We were lucky to have multiple NVIDIA Jetson TX2s in our lab, so I took one home along with a ZED stereo camera and committed to building the perception module and making the racecar run in my living room, haha!
7 |
8 | # Abstract
9 | The first step in building an autonomous racecar is to help it perceive its environment, so that it can figure out its next steps on the way to complete autonomy. With computer vision techniques in the perception stage, we enable the small autonomous racecar to understand the objects around it while it is on the move.
10 | This project's main objective is to design an algorithm that processes the raw data from the sensors (in this case, a stereo camera) into meaningful observations and provides information about the surroundings to the master controller. Using YOLOv3 for object detection and the ZED Stereo Camera's depth output (accessed through ROS and running on the NVIDIA Jetson TX2) for distance estimation, the algorithm helps the racecar communicate this information to its controller, which can then determine the path the racecar should take.
11 | Moreover, the racecar needs to react very quickly, so the challenge is to design the algorithm so that the racecar can make these observations with minimal delay. This project implements only the computer vision (perception) stage of the racecar.
12 |
13 | ![](https://github.com/kumarapurv/Object-Detection-and-Depth-Sensing-for-a-Small-Autonomous-Racecar-System/blob/main/images/CV%20Proj%20-%20Block%20Diag.jpg)
14 |
15 | # Requirements
16 | ## Dependencies
17 | ```
18 | python 2.x
19 | Darknet
20 | rospy
21 | roslib
22 | opencv (cv2)
23 | cv_bridge
24 | numpy
25 | rosbag
26 |
27 | ```
28 | Install Darknet [here](https://pjreddie.com/darknet/install/).
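A quick way to confirm that the Python side of these dependencies is importable is a short check like the one below (a minimal sanity-check sketch, not part of the pipeline; it assumes the ROS environment has been sourced):
```
# sanity check for the Python packages used by Racecar.py (Python 2)
import cv2
import numpy
import rospy
import roslib
import rosbag
from cv_bridge import CvBridge

print "OpenCV:", cv2.__version__
print "NumPy:", numpy.__version__
print "cv2.dnn module available:", hasattr(cv2, 'dnn')  # Racecar.py relies on cv2.dnn.readNetFromDarknet
```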
29 |
30 | ## Hardware Requirements
31 | ```
32 | NVIDIA Jetson TX2
33 | ZED Stereo Camera
34 | ```
35 | ![](https://github.com/kumarapurv/Object-Detection-and-Depth-Sensing-for-a-Small-Autonomous-Racecar-System/blob/main/images/IMG_20200429_091806.jpg)
36 | ![](https://github.com/kumarapurv/Object-Detection-and-Depth-Sensing-for-a-Small-Autonomous-Racecar-System/blob/main/images/IMG_20200429_092927.jpg)
37 |
38 | ## Other Requirements
39 | This code can also be run on a live camera feed, but that is part of the complete pipeline (which includes all three modules: perception, planning, and control). I have shared a Python script (`Racecar.py`) in this repository that instead uses ROS bag files to retrieve the frames from a pre-recorded video. Example bag files can be downloaded from this [link](https://iu.box.com/s/p5ambtjg02qxxj2q0e0kcuj7fcf67h7o).
40 |
41 | YOLO weights (yolov3.weights) can be downloaded from the [official YOLO website](https://pjreddie.com/darknet/yolo/). The script expects `yolov3.weights`, `yolov3.cfg`, and `coco.names` to be in the same directory as `Racecar.py`.
42 |
43 | # How to Run
44 | ```
45 | python Racecar.py <path-to-bag-file>
46 | ```
47 |
48 | # Output
49 | The code writes the processed frames to the file `output.avi`.
50 |
51 | As seen in the output, each object's label and its distance from the stereo camera are shown at the top left of its bounding box.
52 |
53 | ![output](https://github.com/kumarapurv/Object-Detection-and-Depth-Sensing-for-a-Small-Autonomous-Racecar-System/blob/main/demo/output_1.gif)
54 |
--------------------------------------------------------------------------------
/Racecar.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | import roslib
3 | import sys
4 | import rospy
5 | import cv2
6 | from std_msgs.msg import String
7 | from sensor_msgs.msg import Image
8 | from cv_bridge import CvBridge, CvBridgeError
9 | import numpy as np
10 | import math
11 | import time
12 | import rosbag
13 |
14 |
15 | ############################ Extracting data from bag files #####################################
16 |
17 | bagfile = sys.argv[1]
18 | left_images = []  # list to store the image numpy arrays extracted from the bag file
19 | depth_arr = []  # list to store the depth numpy arrays extracted from the bag file
20 |
21 | # extracting data from the 'image_raw_color' topic
22 | bag = rosbag.Bag(bagfile)
23 | for topic, msg, t in bag.read_messages(topics="/zed/zed_node/left_raw/image_raw_color"):
24 |     bridge = CvBridge()
25 |     image = bridge.imgmsg_to_cv2(msg, desired_encoding='passthrough')
26 |     left_images.append(image)
27 | bag.close()
28 |
29 | # extracting data from the 'depth_registered' topic
30 | bag = rosbag.Bag(bagfile)
31 | for topic, msg, t in bag.read_messages(topics="/zed/zed_node/depth/depth_registered"):
32 |     bridge = CvBridge()
33 |     depth_image = bridge.imgmsg_to_cv2(msg, "32FC1")
34 |     depth = np.array(depth_image, dtype=np.float32)
35 |     depth_arr.append(depth)
36 | bag.close()
37 |
38 |
39 | ########################## Object Detection and Depth Calculation ##############################
40 |
41 | def get_depth(x, y, w, h, n):  # function to return the distance of a detected object
42 |     u = x+w//2  # u coordinate of the bounding-box center
43 |     v = y+h//2  # v coordinate of the bounding-box center
44 |     dist = -1
45 |     depth_array = depth_arr[n]
46 |
47 |     # to take care of boundary conditions for the bounding boxes
48 |     left = u-15
49 |     if (left<0): left=0
50 |     right = u+15
51 |     if (right>=1280): right=1280
52 |
53 |     up = v-15
54 |     if (up<0): up = 0
55 |     down = v+15
56 |     if (down>720): down = 720
57 |
58 |     # extracting the distance value around the center pixel of the bounding box
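    # Note: the clamps above assume the depth image is 1280x720 (the ZED's HD720 resolution).
    # The loop below scans a window of up to roughly 30x30 pixels around the bounding-box
    # center and returns the first valid depth it finds; NaN entries are pixels where the
    # ZED produced no depth estimate.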
59 |     for row in range(up, down):
60 |         for col in range(left, right):
61 |             # verifying that the value is not 'nan'
62 |             if (not(math.isnan(float(depth_array[row][col])))):
63 |                 dist = depth_array[row, col]
64 |                 return dist  # returning as soon as a valid distance value is found
65 |     return dist  # no valid depth found; returning -1 for this particular object
66 |
67 |
68 | conf_thres = 0.5  # confidence threshold
69 | nms_thres = 0.4  # non-maximum suppression threshold
70 |
71 | labelsPath = 'coco.names'  # labels of the COCO dataset
72 | labels = [l.strip() for l in open(labelsPath).read().strip().split('\n')]
73 |
74 | # picking a random color for the bounding boxes of each object class
75 | colors = np.random.randint(0, 255, size=(len(labels), 3), dtype="uint8")
76 |
77 | # loading the config and weights files
78 | model_config = 'yolov3.cfg'
79 | model_weights = 'yolov3.weights'
80 | # reading the network model stored in Darknet format
81 | net = cv2.dnn.readNetFromDarknet(model_config, model_weights)
82 |
83 | layerName = net.getLayerNames()
84 | layerName = [layerName[i[0] - 1] for i in net.getUnconnectedOutLayers()]  # getUnconnectedOutLayers() returns an Nx1 array in this OpenCV version
85 |
86 | # opening the writer to write the output into an avi file
87 | count = 0
88 | writer = None
89 | (H, W) = (None, None)
90 | #start = time.time()
91 |
92 | # traversing through the images captured in sequence
93 | for li in range(0, len(left_images)-1):
94 |     img = left_images[li]
95 |     # extracting the BGR channels from the BGRA image
96 |     b, g, r, a = cv2.split(img)
97 |     image = np.stack((b, g, r), -1)
98 |     image = cv2.resize(image, (img.shape[1], img.shape[0]))
99 |     #print(image.shape)
100 |
101 |     if W is None or H is None: (H, W) = image.shape[:2]
102 |
103 |     # creating a 4-dimensional blob from the image
104 |     blob = cv2.dnn.blobFromImage(image, 1 / 255.0, (416, 416), swapRB = True, crop = False)
105 |     net.setInput(blob)
106 |     # running a forward pass to compute the outputs of the YOLO output layers
107 |     layersOutputs = net.forward(layerName)
108 |
109 |     boxes = []
110 |     confidences = []
111 |     classIDs = []
112 |
113 |     for output in layersOutputs:
114 |         for detection in output:
115 |             scores = detection[5:]
116 |             classID = np.argmax(scores)
117 |             confidence = scores[classID]
118 |             # keeping detections with confidence greater than the threshold
119 |             if confidence > conf_thres:
120 |                 box = detection[0:4] * np.array([W, H, W, H])
121 |                 (centerX, centerY, width, height) = box.astype('int')
122 |                 x = int(centerX - (width/2))
123 |                 y = int(centerY - (height/2))
124 |
125 |                 # storing the object's coordinates, confidence and classID
126 |                 boxes.append([x, y, int(width), int(height)])
127 |                 confidences.append(float(confidence))
128 |                 classIDs.append(classID)
129 |
130 |     # non-maximum suppression
131 |     detectionNMS = cv2.dnn.NMSBoxes(boxes, confidences, conf_thres, nms_thres)
132 |
133 |     if(len(detectionNMS) > 0):
134 |         for i in detectionNMS.flatten():
135 |             (x, y) = (boxes[i][0], boxes[i][1])
136 |             (w, h) = (boxes[i][2], boxes[i][3])
137 |             # extracting the depth of every object in the frame
138 |             d = get_depth(x, y, w, h, li)
139 |
140 |             color = [int(c) for c in colors[classIDs[i]]]
141 |             cv2.rectangle(image, (x, y), (x + w, y + h), color, 2)
142 |             if (d == -1):  # distance = -1 means that no depth value was found
143 |                 text = '{}: ~~~'.format(labels[classIDs[i]])
144 |             else:
145 |                 text = '{}: {:.3f}m'.format(labels[classIDs[i]], d)
146 |             # writing the object class and its corresponding distance on the image
147 |             cv2.putText(image, text, (x, y+5), cv2.FONT_HERSHEY_SIMPLEX, 0.5,
color, 2) 148 | 149 | if writer is None: 150 | fourcc = cv2.VideoWriter_fourcc(*"MJPG") 151 | writer = cv2.VideoWriter("output.avi", fourcc, 10, (W,H), True) 152 | writer.write(image) 153 | print "writing frame", count 154 | count=count+1 155 | # file finished writing 156 | writer.release() 157 | 158 | #end = time.time() 159 | #print "time taken:", end-start 160 | -------------------------------------------------------------------------------- /coco.names: -------------------------------------------------------------------------------- 1 | person 2 | bicycle 3 | car 4 | motorbike 5 | aeroplane 6 | bus 7 | train 8 | truck 9 | boat 10 | traffic light 11 | fire hydrant 12 | stop sign 13 | parking meter 14 | bench 15 | bird 16 | cat 17 | dog 18 | horse 19 | sheep 20 | cow 21 | elephant 22 | bear 23 | zebra 24 | giraffe 25 | backpack 26 | umbrella 27 | handbag 28 | tie 29 | suitcase 30 | frisbee 31 | skis 32 | snowboard 33 | sports ball 34 | kite 35 | baseball bat 36 | baseball glove 37 | skateboard 38 | surfboard 39 | tennis racket 40 | bottle 41 | wine glass 42 | cup 43 | fork 44 | knife 45 | spoon 46 | bowl 47 | banana 48 | apple 49 | sandwich 50 | orange 51 | broccoli 52 | carrot 53 | hot dog 54 | pizza 55 | donut 56 | cake 57 | chair 58 | sofa 59 | pottedplant 60 | bed 61 | diningtable 62 | toilet 63 | tvmonitor 64 | laptop 65 | mouse 66 | remote 67 | keyboard 68 | cell phone 69 | microwave 70 | oven 71 | toaster 72 | sink 73 | refrigerator 74 | book 75 | clock 76 | vase 77 | scissors 78 | teddy bear 79 | hair drier 80 | toothbrush -------------------------------------------------------------------------------- /demo/output_1.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kumarapurv/Object-Detection-and-Depth-Sensing-for-a-Small-Autonomous-Racecar-System/21409046fa0496e177aa92f0f5ae8eb6ea7fb021/demo/output_1.gif -------------------------------------------------------------------------------- /images/CV Proj - Block Diag.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kumarapurv/Object-Detection-and-Depth-Sensing-for-a-Small-Autonomous-Racecar-System/21409046fa0496e177aa92f0f5ae8eb6ea7fb021/images/CV Proj - Block Diag.jpg -------------------------------------------------------------------------------- /images/IMG_20200429_091806.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kumarapurv/Object-Detection-and-Depth-Sensing-for-a-Small-Autonomous-Racecar-System/21409046fa0496e177aa92f0f5ae8eb6ea7fb021/images/IMG_20200429_091806.jpg -------------------------------------------------------------------------------- /images/IMG_20200429_092927.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kumarapurv/Object-Detection-and-Depth-Sensing-for-a-Small-Autonomous-Racecar-System/21409046fa0496e177aa92f0f5ae8eb6ea7fb021/images/IMG_20200429_092927.jpg -------------------------------------------------------------------------------- /yolov3.cfg: -------------------------------------------------------------------------------- 1 | [net] 2 | # Testing 3 | # batch=1 4 | # subdivisions=1 5 | # Training 6 | batch=64 7 | subdivisions=16 8 | width=608 9 | height=608 10 | channels=3 11 | momentum=0.9 12 | decay=0.0005 13 | angle=0 14 | saturation = 1.5 15 | exposure = 1.5 16 | hue=.1 17 | 18 | learning_rate=0.001 19 | burn_in=1000 20 
| max_batches = 500200 21 | policy=steps 22 | steps=400000,450000 23 | scales=.1,.1 24 | 25 | [convolutional] 26 | batch_normalize=1 27 | filters=32 28 | size=3 29 | stride=1 30 | pad=1 31 | activation=leaky 32 | 33 | # Downsample 34 | 35 | [convolutional] 36 | batch_normalize=1 37 | filters=64 38 | size=3 39 | stride=2 40 | pad=1 41 | activation=leaky 42 | 43 | [convolutional] 44 | batch_normalize=1 45 | filters=32 46 | size=1 47 | stride=1 48 | pad=1 49 | activation=leaky 50 | 51 | [convolutional] 52 | batch_normalize=1 53 | filters=64 54 | size=3 55 | stride=1 56 | pad=1 57 | activation=leaky 58 | 59 | [shortcut] 60 | from=-3 61 | activation=linear 62 | 63 | # Downsample 64 | 65 | [convolutional] 66 | batch_normalize=1 67 | filters=128 68 | size=3 69 | stride=2 70 | pad=1 71 | activation=leaky 72 | 73 | [convolutional] 74 | batch_normalize=1 75 | filters=64 76 | size=1 77 | stride=1 78 | pad=1 79 | activation=leaky 80 | 81 | [convolutional] 82 | batch_normalize=1 83 | filters=128 84 | size=3 85 | stride=1 86 | pad=1 87 | activation=leaky 88 | 89 | [shortcut] 90 | from=-3 91 | activation=linear 92 | 93 | [convolutional] 94 | batch_normalize=1 95 | filters=64 96 | size=1 97 | stride=1 98 | pad=1 99 | activation=leaky 100 | 101 | [convolutional] 102 | batch_normalize=1 103 | filters=128 104 | size=3 105 | stride=1 106 | pad=1 107 | activation=leaky 108 | 109 | [shortcut] 110 | from=-3 111 | activation=linear 112 | 113 | # Downsample 114 | 115 | [convolutional] 116 | batch_normalize=1 117 | filters=256 118 | size=3 119 | stride=2 120 | pad=1 121 | activation=leaky 122 | 123 | [convolutional] 124 | batch_normalize=1 125 | filters=128 126 | size=1 127 | stride=1 128 | pad=1 129 | activation=leaky 130 | 131 | [convolutional] 132 | batch_normalize=1 133 | filters=256 134 | size=3 135 | stride=1 136 | pad=1 137 | activation=leaky 138 | 139 | [shortcut] 140 | from=-3 141 | activation=linear 142 | 143 | [convolutional] 144 | batch_normalize=1 145 | filters=128 146 | size=1 147 | stride=1 148 | pad=1 149 | activation=leaky 150 | 151 | [convolutional] 152 | batch_normalize=1 153 | filters=256 154 | size=3 155 | stride=1 156 | pad=1 157 | activation=leaky 158 | 159 | [shortcut] 160 | from=-3 161 | activation=linear 162 | 163 | [convolutional] 164 | batch_normalize=1 165 | filters=128 166 | size=1 167 | stride=1 168 | pad=1 169 | activation=leaky 170 | 171 | [convolutional] 172 | batch_normalize=1 173 | filters=256 174 | size=3 175 | stride=1 176 | pad=1 177 | activation=leaky 178 | 179 | [shortcut] 180 | from=-3 181 | activation=linear 182 | 183 | [convolutional] 184 | batch_normalize=1 185 | filters=128 186 | size=1 187 | stride=1 188 | pad=1 189 | activation=leaky 190 | 191 | [convolutional] 192 | batch_normalize=1 193 | filters=256 194 | size=3 195 | stride=1 196 | pad=1 197 | activation=leaky 198 | 199 | [shortcut] 200 | from=-3 201 | activation=linear 202 | 203 | 204 | [convolutional] 205 | batch_normalize=1 206 | filters=128 207 | size=1 208 | stride=1 209 | pad=1 210 | activation=leaky 211 | 212 | [convolutional] 213 | batch_normalize=1 214 | filters=256 215 | size=3 216 | stride=1 217 | pad=1 218 | activation=leaky 219 | 220 | [shortcut] 221 | from=-3 222 | activation=linear 223 | 224 | [convolutional] 225 | batch_normalize=1 226 | filters=128 227 | size=1 228 | stride=1 229 | pad=1 230 | activation=leaky 231 | 232 | [convolutional] 233 | batch_normalize=1 234 | filters=256 235 | size=3 236 | stride=1 237 | pad=1 238 | activation=leaky 239 | 240 | [shortcut] 241 | from=-3 242 | 
activation=linear 243 | 244 | [convolutional] 245 | batch_normalize=1 246 | filters=128 247 | size=1 248 | stride=1 249 | pad=1 250 | activation=leaky 251 | 252 | [convolutional] 253 | batch_normalize=1 254 | filters=256 255 | size=3 256 | stride=1 257 | pad=1 258 | activation=leaky 259 | 260 | [shortcut] 261 | from=-3 262 | activation=linear 263 | 264 | [convolutional] 265 | batch_normalize=1 266 | filters=128 267 | size=1 268 | stride=1 269 | pad=1 270 | activation=leaky 271 | 272 | [convolutional] 273 | batch_normalize=1 274 | filters=256 275 | size=3 276 | stride=1 277 | pad=1 278 | activation=leaky 279 | 280 | [shortcut] 281 | from=-3 282 | activation=linear 283 | 284 | # Downsample 285 | 286 | [convolutional] 287 | batch_normalize=1 288 | filters=512 289 | size=3 290 | stride=2 291 | pad=1 292 | activation=leaky 293 | 294 | [convolutional] 295 | batch_normalize=1 296 | filters=256 297 | size=1 298 | stride=1 299 | pad=1 300 | activation=leaky 301 | 302 | [convolutional] 303 | batch_normalize=1 304 | filters=512 305 | size=3 306 | stride=1 307 | pad=1 308 | activation=leaky 309 | 310 | [shortcut] 311 | from=-3 312 | activation=linear 313 | 314 | 315 | [convolutional] 316 | batch_normalize=1 317 | filters=256 318 | size=1 319 | stride=1 320 | pad=1 321 | activation=leaky 322 | 323 | [convolutional] 324 | batch_normalize=1 325 | filters=512 326 | size=3 327 | stride=1 328 | pad=1 329 | activation=leaky 330 | 331 | [shortcut] 332 | from=-3 333 | activation=linear 334 | 335 | 336 | [convolutional] 337 | batch_normalize=1 338 | filters=256 339 | size=1 340 | stride=1 341 | pad=1 342 | activation=leaky 343 | 344 | [convolutional] 345 | batch_normalize=1 346 | filters=512 347 | size=3 348 | stride=1 349 | pad=1 350 | activation=leaky 351 | 352 | [shortcut] 353 | from=-3 354 | activation=linear 355 | 356 | 357 | [convolutional] 358 | batch_normalize=1 359 | filters=256 360 | size=1 361 | stride=1 362 | pad=1 363 | activation=leaky 364 | 365 | [convolutional] 366 | batch_normalize=1 367 | filters=512 368 | size=3 369 | stride=1 370 | pad=1 371 | activation=leaky 372 | 373 | [shortcut] 374 | from=-3 375 | activation=linear 376 | 377 | [convolutional] 378 | batch_normalize=1 379 | filters=256 380 | size=1 381 | stride=1 382 | pad=1 383 | activation=leaky 384 | 385 | [convolutional] 386 | batch_normalize=1 387 | filters=512 388 | size=3 389 | stride=1 390 | pad=1 391 | activation=leaky 392 | 393 | [shortcut] 394 | from=-3 395 | activation=linear 396 | 397 | 398 | [convolutional] 399 | batch_normalize=1 400 | filters=256 401 | size=1 402 | stride=1 403 | pad=1 404 | activation=leaky 405 | 406 | [convolutional] 407 | batch_normalize=1 408 | filters=512 409 | size=3 410 | stride=1 411 | pad=1 412 | activation=leaky 413 | 414 | [shortcut] 415 | from=-3 416 | activation=linear 417 | 418 | 419 | [convolutional] 420 | batch_normalize=1 421 | filters=256 422 | size=1 423 | stride=1 424 | pad=1 425 | activation=leaky 426 | 427 | [convolutional] 428 | batch_normalize=1 429 | filters=512 430 | size=3 431 | stride=1 432 | pad=1 433 | activation=leaky 434 | 435 | [shortcut] 436 | from=-3 437 | activation=linear 438 | 439 | [convolutional] 440 | batch_normalize=1 441 | filters=256 442 | size=1 443 | stride=1 444 | pad=1 445 | activation=leaky 446 | 447 | [convolutional] 448 | batch_normalize=1 449 | filters=512 450 | size=3 451 | stride=1 452 | pad=1 453 | activation=leaky 454 | 455 | [shortcut] 456 | from=-3 457 | activation=linear 458 | 459 | # Downsample 460 | 461 | [convolutional] 462 | batch_normalize=1 
463 | filters=1024 464 | size=3 465 | stride=2 466 | pad=1 467 | activation=leaky 468 | 469 | [convolutional] 470 | batch_normalize=1 471 | filters=512 472 | size=1 473 | stride=1 474 | pad=1 475 | activation=leaky 476 | 477 | [convolutional] 478 | batch_normalize=1 479 | filters=1024 480 | size=3 481 | stride=1 482 | pad=1 483 | activation=leaky 484 | 485 | [shortcut] 486 | from=-3 487 | activation=linear 488 | 489 | [convolutional] 490 | batch_normalize=1 491 | filters=512 492 | size=1 493 | stride=1 494 | pad=1 495 | activation=leaky 496 | 497 | [convolutional] 498 | batch_normalize=1 499 | filters=1024 500 | size=3 501 | stride=1 502 | pad=1 503 | activation=leaky 504 | 505 | [shortcut] 506 | from=-3 507 | activation=linear 508 | 509 | [convolutional] 510 | batch_normalize=1 511 | filters=512 512 | size=1 513 | stride=1 514 | pad=1 515 | activation=leaky 516 | 517 | [convolutional] 518 | batch_normalize=1 519 | filters=1024 520 | size=3 521 | stride=1 522 | pad=1 523 | activation=leaky 524 | 525 | [shortcut] 526 | from=-3 527 | activation=linear 528 | 529 | [convolutional] 530 | batch_normalize=1 531 | filters=512 532 | size=1 533 | stride=1 534 | pad=1 535 | activation=leaky 536 | 537 | [convolutional] 538 | batch_normalize=1 539 | filters=1024 540 | size=3 541 | stride=1 542 | pad=1 543 | activation=leaky 544 | 545 | [shortcut] 546 | from=-3 547 | activation=linear 548 | 549 | ###################### 550 | 551 | [convolutional] 552 | batch_normalize=1 553 | filters=512 554 | size=1 555 | stride=1 556 | pad=1 557 | activation=leaky 558 | 559 | [convolutional] 560 | batch_normalize=1 561 | size=3 562 | stride=1 563 | pad=1 564 | filters=1024 565 | activation=leaky 566 | 567 | [convolutional] 568 | batch_normalize=1 569 | filters=512 570 | size=1 571 | stride=1 572 | pad=1 573 | activation=leaky 574 | 575 | [convolutional] 576 | batch_normalize=1 577 | size=3 578 | stride=1 579 | pad=1 580 | filters=1024 581 | activation=leaky 582 | 583 | [convolutional] 584 | batch_normalize=1 585 | filters=512 586 | size=1 587 | stride=1 588 | pad=1 589 | activation=leaky 590 | 591 | [convolutional] 592 | batch_normalize=1 593 | size=3 594 | stride=1 595 | pad=1 596 | filters=1024 597 | activation=leaky 598 | 599 | [convolutional] 600 | size=1 601 | stride=1 602 | pad=1 603 | filters=255 604 | activation=linear 605 | 606 | 607 | [yolo] 608 | mask = 6,7,8 609 | anchors = 10,13, 16,30, 33,23, 30,61, 62,45, 59,119, 116,90, 156,198, 373,326 610 | classes=80 611 | num=9 612 | jitter=.3 613 | ignore_thresh = .7 614 | truth_thresh = 1 615 | random=1 616 | 617 | 618 | [route] 619 | layers = -4 620 | 621 | [convolutional] 622 | batch_normalize=1 623 | filters=256 624 | size=1 625 | stride=1 626 | pad=1 627 | activation=leaky 628 | 629 | [upsample] 630 | stride=2 631 | 632 | [route] 633 | layers = -1, 61 634 | 635 | 636 | 637 | [convolutional] 638 | batch_normalize=1 639 | filters=256 640 | size=1 641 | stride=1 642 | pad=1 643 | activation=leaky 644 | 645 | [convolutional] 646 | batch_normalize=1 647 | size=3 648 | stride=1 649 | pad=1 650 | filters=512 651 | activation=leaky 652 | 653 | [convolutional] 654 | batch_normalize=1 655 | filters=256 656 | size=1 657 | stride=1 658 | pad=1 659 | activation=leaky 660 | 661 | [convolutional] 662 | batch_normalize=1 663 | size=3 664 | stride=1 665 | pad=1 666 | filters=512 667 | activation=leaky 668 | 669 | [convolutional] 670 | batch_normalize=1 671 | filters=256 672 | size=1 673 | stride=1 674 | pad=1 675 | activation=leaky 676 | 677 | [convolutional] 678 | 
batch_normalize=1 679 | size=3 680 | stride=1 681 | pad=1 682 | filters=512 683 | activation=leaky 684 | 685 | [convolutional] 686 | size=1 687 | stride=1 688 | pad=1 689 | filters=255 690 | activation=linear 691 | 692 | 693 | [yolo] 694 | mask = 3,4,5 695 | anchors = 10,13, 16,30, 33,23, 30,61, 62,45, 59,119, 116,90, 156,198, 373,326 696 | classes=80 697 | num=9 698 | jitter=.3 699 | ignore_thresh = .7 700 | truth_thresh = 1 701 | random=1 702 | 703 | 704 | 705 | [route] 706 | layers = -4 707 | 708 | [convolutional] 709 | batch_normalize=1 710 | filters=128 711 | size=1 712 | stride=1 713 | pad=1 714 | activation=leaky 715 | 716 | [upsample] 717 | stride=2 718 | 719 | [route] 720 | layers = -1, 36 721 | 722 | 723 | 724 | [convolutional] 725 | batch_normalize=1 726 | filters=128 727 | size=1 728 | stride=1 729 | pad=1 730 | activation=leaky 731 | 732 | [convolutional] 733 | batch_normalize=1 734 | size=3 735 | stride=1 736 | pad=1 737 | filters=256 738 | activation=leaky 739 | 740 | [convolutional] 741 | batch_normalize=1 742 | filters=128 743 | size=1 744 | stride=1 745 | pad=1 746 | activation=leaky 747 | 748 | [convolutional] 749 | batch_normalize=1 750 | size=3 751 | stride=1 752 | pad=1 753 | filters=256 754 | activation=leaky 755 | 756 | [convolutional] 757 | batch_normalize=1 758 | filters=128 759 | size=1 760 | stride=1 761 | pad=1 762 | activation=leaky 763 | 764 | [convolutional] 765 | batch_normalize=1 766 | size=3 767 | stride=1 768 | pad=1 769 | filters=256 770 | activation=leaky 771 | 772 | [convolutional] 773 | size=1 774 | stride=1 775 | pad=1 776 | filters=255 777 | activation=linear 778 | 779 | 780 | [yolo] 781 | mask = 0,1,2 782 | anchors = 10,13, 16,30, 33,23, 30,61, 62,45, 59,119, 116,90, 156,198, 373,326 783 | classes=80 784 | num=9 785 | jitter=.3 786 | ignore_thresh = .7 787 | truth_thresh = 1 788 | random=1 789 | 790 | --------------------------------------------------------------------------------