├── .gitignore ├── README.md ├── __init__.py ├── bbox.py ├── cam_demo.py ├── cfg ├── tiny-yolo-voc.cfg ├── yolo-voc.cfg ├── yolo.cfg └── yolov3.cfg ├── darknet.py ├── data ├── coco.names └── voc.names ├── det_messi.jpg ├── detect.py ├── dog-cycle-car.png ├── imgs ├── dog.jpg ├── eagle.jpg ├── giraffe.jpg ├── herd_of_horses.jpg ├── img1.jpg ├── img2.jpg ├── img3.jpg ├── img4.jpg ├── messi.jpg ├── person.jpg └── scream.jpg ├── pallete ├── preprocess.py ├── util.py ├── video_demo.py └── video_demo_half.py /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | env/ 12 | build/ 13 | develop-eggs/ 14 | dist/ 15 | downloads/ 16 | eggs/ 17 | .eggs/ 18 | lib/ 19 | lib64/ 20 | parts/ 21 | sdist/ 22 | var/ 23 | wheels/ 24 | *.egg-info/ 25 | .installed.cfg 26 | *.egg 27 | 28 | # PyInstaller 29 | # Usually these files are written by a python script from a template 30 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 31 | *.manifest 32 | *.spec 33 | 34 | # Installer logs 35 | pip-log.txt 36 | pip-delete-this-directory.txt 37 | 38 | # Unit test / coverage reports 39 | htmlcov/ 40 | .tox/ 41 | .coverage 42 | .coverage.* 43 | .cache 44 | nosetests.xml 45 | coverage.xml 46 | *.cover 47 | .hypothesis/ 48 | 49 | # Translations 50 | *.mo 51 | *.pot 52 | 53 | # Django stuff: 54 | *.log 55 | local_settings.py 56 | 57 | # Flask stuff: 58 | instance/ 59 | .webassets-cache 60 | 61 | # Scrapy stuff: 62 | .scrapy 63 | 64 | # Sphinx documentation 65 | docs/_build/ 66 | 67 | # PyBuilder 68 | target/ 69 | 70 | # Jupyter Notebook 71 | .ipynb_checkpoints 72 | 73 | # pyenv 74 | .python-version 75 | 76 | # celery beat schedule file 77 | celerybeat-schedule 78 | 79 | # SageMath parsed files 80 | *.sage.py 81 | 82 | # dotenv 83 | .env 84 | 85 | # virtualenv 86 | .venv 87 | venv/ 88 | ENV/ 89 | 90 | # Spyder project settings 91 | .spyderproject 92 | .spyproject 93 | 94 | # Rope project settings 95 | .ropeproject 96 | 97 | # mkdocs documentation 98 | /site 99 | 100 | # mypy 101 | .mypy_cache/ 102 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # A PyTorch implementation of a YOLO v3 Object Detector 2 | 3 | [UPDATE] : This repo serves as a driver code for my research. I just graduated college, and am very busy looking for research internship / fellowship roles before eventually applying for a masters. I won't have the time to look into issues for the time being. Thank you. 4 | 5 | 6 | This repository contains code for a object detector based on [YOLOv3: An Incremental Improvement](https://pjreddie.com/media/files/papers/YOLOv3.pdf), implementedin PyTorch. The code is based on the official code of [YOLO v3](https://github.com/pjreddie/darknet), as well as a PyTorch 7 | port of the original code, by [marvis](https://github.com/marvis/pytorch-yolo2). One of the goals of this code is to improve 8 | upon the original port by removing redundant parts of the code (The official code is basically a fully blown deep learning 9 | library, and includes stuff like sequence models, which are not used in YOLO). I've also tried to keep the code minimal, and 10 | document it as well as I can. 
11 | 
12 | ### Tutorial for building this detector from scratch 
13 | If you want to understand how to implement this detector yourself from scratch, you can go through the very detailed 5-part tutorial series I wrote on Paperspace. It is perfect for someone who wants to move from beginner to intermediate PyTorch skills. 
14 | 
15 | [Implement YOLO v3 from scratch](https://blog.paperspace.com/how-to-implement-a-yolo-object-detector-in-pytorch/) 
16 | 
17 | As of now, the code only contains the detection module, but you should expect the training module soon. :) 
18 | 
19 | ## Requirements 
20 | 1. Python 3.5 
21 | 2. OpenCV 
22 | 3. PyTorch 0.4 
23 | 
24 | Using PyTorch 0.3 will break the detector. 
25 | 
26 | 
27 | 
28 | ## Detection Example 
29 | 
30 | ![Detection Example](https://i.imgur.com/m2jwneng.png) 
31 | ## Running the detector 
32 | 
33 | ### On single or multiple images 
34 | 
35 | Clone the repo and `cd` into its directory. The first thing you need to do is get the weights file. 
36 | This time around, for v3, the authors have supplied a weights file only for COCO, available [here](https://pjreddie.com/media/files/yolov3.weights). Download the weights file and place it in your repo directory. 
37 | 
38 | Or, if you're on Linux, you could just type 
39 | 
40 | ``` 
41 | wget https://pjreddie.com/media/files/yolov3.weights 
42 | python detect.py --images imgs --det det 
43 | ``` 
44 | 
45 | 
46 | The `--images` flag defines the directory to load images from, or a single image file (it will figure it out), and `--det` is the directory 
47 | to save images to. Other settings, such as the batch size (via the `--bs` flag) and the object confidence threshold, can be tweaked with flags that can be looked up with 
48 | 
49 | ``` 
50 | python detect.py -h 
51 | ``` 
52 | 
53 | ### Speed-Accuracy Tradeoff 
54 | You can change the resolution of the input image with the `--reso` flag. The default value is 416. Whatever value you choose, remember **it should be a multiple of 32 and greater than 32**. Weird things will happen if you don't. You've been warned. 
55 | 
56 | ``` 
57 | python detect.py --images imgs --det det --reso 320 
58 | ``` 
59 | 
60 | ### On Video 
61 | For this, run video_demo.py with the --video flag specifying the video file. The video file should be in .avi format, 
62 | since that is the input format the demo expects. 
63 | 
64 | ``` 
65 | python video_demo.py --video video.avi 
66 | ``` 
67 | 
68 | Tweakable settings can be seen with the -h flag. 
69 | 
70 | ### Speeding up Video Inference 
71 | 
72 | To speed up video inference, you can try video_demo_half.py instead, which does all the inference with 16-bit half 
73 | precision floats instead of 32-bit floats. I haven't seen big improvements, but I attribute that to having an older card 
74 | (Tesla K80, Kepler arch). If you have one of the cards with fast float16 support, try it out, and if possible, benchmark it. 
75 | 
76 | ### On a Camera 
77 | Same as the video module, but you don't have to specify a video file, since the feed is taken from your camera. To be precise, 
78 | the feed is taken from what OpenCV recognises as camera 0. The default image resolution is 160 here, though you can change it with the `--reso` flag. 
79 | 
80 | ``` 
81 | python cam_demo.py 
82 | ``` 
83 | You can easily tweak the code to use different weights files, available at the [YOLO website](https://pjreddie.com/darknet/yolo/). 
84 | 
85 | NOTE: The scales feature has been disabled for better refactoring. 
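### Using the detector from Python

The command-line scripts cover the common cases, but you can also call the detector from your own Python code by wiring together the same building blocks that `detect.py` and `cam_demo.py` use. The snippet below is only a rough sketch, not a supported API: it assumes you run it from the repo directory with `yolov3.weights` present, and it uses the `load_classes`, `write_results` and `prep_image` helpers from `util.py` / `preprocess.py` exactly as the scripts import them.

```
import torch
from torch.autograd import Variable

from darknet import Darknet
from preprocess import prep_image
from util import load_classes, write_results

classes = load_classes("data/coco.names")

model = Darknet("cfg/yolov3.cfg")
model.load_weights("yolov3.weights")
model.net_info["height"] = "416"   # input resolution; must be a multiple of 32
inp_dim = int(model.net_info["height"])
model.eval()

CUDA = torch.cuda.is_available()
if CUDA:
    model.cuda()

# prep_image returns the network input tensor, the original image and its size
img, orig_im, dim = prep_image("imgs/dog.jpg", inp_dim)
if CUDA:
    img = img.cuda()

with torch.no_grad():
    prediction = model(Variable(img), CUDA)

# Confidence filtering + NMS; returns an int when nothing is detected
output = write_results(prediction, 0.5, len(classes), nms=True, nms_conf=0.4)
print(output)
```

Each row of `output` holds the image index, the box corners (in the resized input's coordinate system), the confidence scores and the class index; the scripts rescale the corners back to the original image size before drawing, as `cam_demo.py`'s `write()` does.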
86 | ### Detection across different scales 87 | YOLO v3 makes detections across different scales, each of which deputise in detecting objects of different sizes depending upon whether they capture coarse features, fine grained features or something between. You can experiment with these scales by the `--scales` flag. 88 | 89 | ``` 90 | python detect.py --scales 1,3 91 | ``` 92 | 93 | 94 | -------------------------------------------------------------------------------- /__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ayooshkathuria/pytorch-yolo-v3/fbb4ef98d5a598f4c8eded6d618a599b7d289e2f/__init__.py -------------------------------------------------------------------------------- /bbox.py: -------------------------------------------------------------------------------- 1 | from __future__ import division 2 | 3 | import torch 4 | import random 5 | 6 | import numpy as np 7 | import cv2 8 | 9 | def confidence_filter(result, confidence): 10 | conf_mask = (result[:,:,4] > confidence).float().unsqueeze(2) 11 | result = result*conf_mask 12 | 13 | return result 14 | 15 | def confidence_filter_cls(result, confidence): 16 | max_scores = torch.max(result[:,:,5:25], 2)[0] 17 | res = torch.cat((result, max_scores),2) 18 | print(res.shape) 19 | 20 | 21 | cond_1 = (res[:,:,4] > confidence).float() 22 | cond_2 = (res[:,:,25] > 0.995).float() 23 | 24 | conf = cond_1 + cond_2 25 | conf = torch.clamp(conf, 0.0, 1.0) 26 | conf = conf.unsqueeze(2) 27 | result = result*conf 28 | return result 29 | 30 | 31 | 32 | def get_abs_coord(box): 33 | box[2], box[3] = abs(box[2]), abs(box[3]) 34 | x1 = (box[0] - box[2]/2) - 1 35 | y1 = (box[1] - box[3]/2) - 1 36 | x2 = (box[0] + box[2]/2) - 1 37 | y2 = (box[1] + box[3]/2) - 1 38 | return x1, y1, x2, y2 39 | 40 | 41 | 42 | def sanity_fix(box): 43 | if (box[0] > box[2]): 44 | box[0], box[2] = box[2], box[0] 45 | 46 | if (box[1] > box[3]): 47 | box[1], box[3] = box[3], box[1] 48 | 49 | return box 50 | 51 | def bbox_iou(box1, box2): 52 | """ 53 | Returns the IoU of two bounding boxes 54 | 55 | 56 | """ 57 | #Get the coordinates of bounding boxes 58 | b1_x1, b1_y1, b1_x2, b1_y2 = box1[:,0], box1[:,1], box1[:,2], box1[:,3] 59 | b2_x1, b2_y1, b2_x2, b2_y2 = box2[:,0], box2[:,1], box2[:,2], box2[:,3] 60 | 61 | #get the corrdinates of the intersection rectangle 62 | inter_rect_x1 = torch.max(b1_x1, b2_x1) 63 | inter_rect_y1 = torch.max(b1_y1, b2_y1) 64 | inter_rect_x2 = torch.min(b1_x2, b2_x2) 65 | inter_rect_y2 = torch.min(b1_y2, b2_y2) 66 | 67 | #Intersection area 68 | if torch.cuda.is_available(): 69 | inter_area = torch.max(inter_rect_x2 - inter_rect_x1 + 1,torch.zeros(inter_rect_x2.shape).cuda())*torch.max(inter_rect_y2 - inter_rect_y1 + 1, torch.zeros(inter_rect_x2.shape).cuda()) 70 | else: 71 | inter_area = torch.max(inter_rect_x2 - inter_rect_x1 + 1,torch.zeros(inter_rect_x2.shape))*torch.max(inter_rect_y2 - inter_rect_y1 + 1, torch.zeros(inter_rect_x2.shape)) 72 | 73 | #Union Area 74 | b1_area = (b1_x2 - b1_x1 + 1)*(b1_y2 - b1_y1 + 1) 75 | b2_area = (b2_x2 - b2_x1 + 1)*(b2_y2 - b2_y1 + 1) 76 | 77 | iou = inter_area / (b1_area + b2_area - inter_area) 78 | 79 | return iou 80 | 81 | 82 | def pred_corner_coord(prediction): 83 | #Get indices of non-zero confidence bboxes 84 | ind_nz = torch.nonzero(prediction[:,:,4]).transpose(0,1).contiguous() 85 | 86 | box = prediction[ind_nz[0], ind_nz[1]] 87 | 88 | 89 | box_a = box.new(box.shape) 90 | box_a[:,0] = (box[:,0] - box[:,2]/2) 91 | box_a[:,1] = 
(box[:,1] - box[:,3]/2) 92 | box_a[:,2] = (box[:,0] + box[:,2]/2) 93 | box_a[:,3] = (box[:,1] + box[:,3]/2) 94 | box[:,:4] = box_a[:,:4] 95 | 96 | prediction[ind_nz[0], ind_nz[1]] = box 97 | 98 | return prediction 99 | 100 | 101 | 102 | 103 | def write(x, batches, results, colors, classes): 104 | c1 = tuple(x[1:3].int()) 105 | c2 = tuple(x[3:5].int()) 106 | img = results[int(x[0])] 107 | cls = int(x[-1]) 108 | label = "{0}".format(classes[cls]) 109 | color = random.choice(colors) 110 | cv2.rectangle(img, c1, c2,color, 1) 111 | t_size = cv2.getTextSize(label, cv2.FONT_HERSHEY_PLAIN, 1 , 1)[0] 112 | c2 = c1[0] + t_size[0] + 3, c1[1] + t_size[1] + 4 113 | cv2.rectangle(img, c1, c2,color, -1) 114 | cv2.putText(img, label, (c1[0], c1[1] + t_size[1] + 4), cv2.FONT_HERSHEY_PLAIN, 1, [225,255,255], 1); 115 | return img 116 | -------------------------------------------------------------------------------- /cam_demo.py: -------------------------------------------------------------------------------- 1 | from __future__ import division 2 | import time 3 | import torch 4 | import torch.nn as nn 5 | from torch.autograd import Variable 6 | import numpy as np 7 | import cv2 8 | from util import * 9 | from darknet import Darknet 10 | from preprocess import prep_image, inp_to_image 11 | import pandas as pd 12 | import random 13 | import argparse 14 | import pickle as pkl 15 | 16 | def get_test_input(input_dim, CUDA): 17 | img = cv2.imread("imgs/messi.jpg") 18 | img = cv2.resize(img, (input_dim, input_dim)) 19 | img_ = img[:,:,::-1].transpose((2,0,1)) 20 | img_ = img_[np.newaxis,:,:,:]/255.0 21 | img_ = torch.from_numpy(img_).float() 22 | img_ = Variable(img_) 23 | 24 | if CUDA: 25 | img_ = img_.cuda() 26 | 27 | return img_ 28 | 29 | def prep_image(img, inp_dim): 30 | """ 31 | Prepare image for inputting to the neural network. 32 | 33 | Returns a Variable 34 | """ 35 | 36 | orig_im = img 37 | dim = orig_im.shape[1], orig_im.shape[0] 38 | img = cv2.resize(orig_im, (inp_dim, inp_dim)) 39 | img_ = img[:,:,::-1].transpose((2,0,1)).copy() 40 | img_ = torch.from_numpy(img_).float().div(255.0).unsqueeze(0) 41 | return img_, orig_im, dim 42 | 43 | def write(x, img): 44 | c1 = tuple(x[1:3].int()) 45 | c2 = tuple(x[3:5].int()) 46 | cls = int(x[-1]) 47 | label = "{0}".format(classes[cls]) 48 | color = random.choice(colors) 49 | cv2.rectangle(img, c1, c2,color, 1) 50 | t_size = cv2.getTextSize(label, cv2.FONT_HERSHEY_PLAIN, 1 , 1)[0] 51 | c2 = c1[0] + t_size[0] + 3, c1[1] + t_size[1] + 4 52 | cv2.rectangle(img, c1, c2,color, -1) 53 | cv2.putText(img, label, (c1[0], c1[1] + t_size[1] + 4), cv2.FONT_HERSHEY_PLAIN, 1, [225,255,255], 1); 54 | return img 55 | 56 | def arg_parse(): 57 | """ 58 | Parse arguements to the detect module 59 | 60 | """ 61 | 62 | 63 | parser = argparse.ArgumentParser(description='YOLO v3 Cam Demo') 64 | parser.add_argument("--confidence", dest = "confidence", help = "Object Confidence to filter predictions", default = 0.25) 65 | parser.add_argument("--nms_thresh", dest = "nms_thresh", help = "NMS Threshhold", default = 0.4) 66 | parser.add_argument("--reso", dest = 'reso', help = 67 | "Input resolution of the network. Increase to increase accuracy. 
Decrease to increase speed", 68 | default = "160", type = str) 69 | return parser.parse_args() 70 | 71 | 72 | 73 | if __name__ == '__main__': 74 | cfgfile = "cfg/yolov3.cfg" 75 | weightsfile = "yolov3.weights" 76 | num_classes = 80 77 | 78 | args = arg_parse() 79 | confidence = float(args.confidence) 80 | nms_thesh = float(args.nms_thresh) 81 | start = 0 82 | CUDA = torch.cuda.is_available() 83 | 84 | 85 | 86 | 87 | num_classes = 80 88 | bbox_attrs = 5 + num_classes 89 | 90 | model = Darknet(cfgfile) 91 | model.load_weights(weightsfile) 92 | 93 | model.net_info["height"] = args.reso 94 | inp_dim = int(model.net_info["height"]) 95 | 96 | assert inp_dim % 32 == 0 97 | assert inp_dim > 32 98 | 99 | if CUDA: 100 | model.cuda() 101 | 102 | model.eval() 103 | 104 | videofile = 'video.avi' 105 | 106 | cap = cv2.VideoCapture(0) 107 | 108 | assert cap.isOpened(), 'Cannot capture source' 109 | 110 | frames = 0 111 | start = time.time() 112 | while cap.isOpened(): 113 | 114 | ret, frame = cap.read() 115 | if ret: 116 | 117 | img, orig_im, dim = prep_image(frame, inp_dim) 118 | 119 | # im_dim = torch.FloatTensor(dim).repeat(1,2) 120 | 121 | 122 | if CUDA: 123 | im_dim = im_dim.cuda() 124 | img = img.cuda() 125 | 126 | 127 | output = model(Variable(img), CUDA) 128 | output = write_results(output, confidence, num_classes, nms = True, nms_conf = nms_thesh) 129 | 130 | if type(output) == int: 131 | frames += 1 132 | print("FPS of the video is {:5.2f}".format( frames / (time.time() - start))) 133 | cv2.imshow("frame", orig_im) 134 | key = cv2.waitKey(1) 135 | if key & 0xFF == ord('q'): 136 | break 137 | continue 138 | 139 | 140 | 141 | output[:,1:5] = torch.clamp(output[:,1:5], 0.0, float(inp_dim))/inp_dim 142 | 143 | # im_dim = im_dim.repeat(output.size(0), 1) 144 | output[:,[1,3]] *= frame.shape[1] 145 | output[:,[2,4]] *= frame.shape[0] 146 | 147 | 148 | classes = load_classes('data/coco.names') 149 | colors = pkl.load(open("pallete", "rb")) 150 | 151 | list(map(lambda x: write(x, orig_im), output)) 152 | 153 | 154 | cv2.imshow("frame", orig_im) 155 | key = cv2.waitKey(1) 156 | if key & 0xFF == ord('q'): 157 | break 158 | frames += 1 159 | print("FPS of the video is {:5.2f}".format( frames / (time.time() - start))) 160 | 161 | 162 | else: 163 | break 164 | 165 | 166 | 167 | 168 | 169 | -------------------------------------------------------------------------------- /cfg/tiny-yolo-voc.cfg: -------------------------------------------------------------------------------- 1 | [net] 2 | batch=64 3 | subdivisions=8 4 | width=416 5 | height=416 6 | channels=3 7 | momentum=0.9 8 | decay=0.0005 9 | angle=0 10 | saturation = 1.5 11 | exposure = 1.5 12 | hue=.1 13 | 14 | learning_rate=0.001 15 | max_batches = 40200 16 | policy=steps 17 | steps=-1,100,20000,30000 18 | scales=.1,10,.1,.1 19 | 20 | [convolutional] 21 | batch_normalize=1 22 | filters=16 23 | size=3 24 | stride=1 25 | pad=1 26 | activation=leaky 27 | 28 | [maxpool] 29 | size=2 30 | stride=2 31 | 32 | [convolutional] 33 | batch_normalize=1 34 | filters=32 35 | size=3 36 | stride=1 37 | pad=1 38 | activation=leaky 39 | 40 | [maxpool] 41 | size=2 42 | stride=2 43 | 44 | [convolutional] 45 | batch_normalize=1 46 | filters=64 47 | size=3 48 | stride=1 49 | pad=1 50 | activation=leaky 51 | 52 | [maxpool] 53 | size=2 54 | stride=2 55 | 56 | [convolutional] 57 | batch_normalize=1 58 | filters=128 59 | size=3 60 | stride=1 61 | pad=1 62 | activation=leaky 63 | 64 | [maxpool] 65 | size=2 66 | stride=2 67 | 68 | [convolutional] 69 | batch_normalize=1 70 | 
filters=256 71 | size=3 72 | stride=1 73 | pad=1 74 | activation=leaky 75 | 76 | [maxpool] 77 | size=2 78 | stride=2 79 | 80 | [convolutional] 81 | batch_normalize=1 82 | filters=512 83 | size=3 84 | stride=1 85 | pad=1 86 | activation=leaky 87 | 88 | [maxpool] 89 | size=2 90 | stride=1 91 | 92 | [convolutional] 93 | batch_normalize=1 94 | filters=1024 95 | size=3 96 | stride=1 97 | pad=1 98 | activation=leaky 99 | 100 | ########### 101 | 102 | [convolutional] 103 | batch_normalize=1 104 | size=3 105 | stride=1 106 | pad=1 107 | filters=1024 108 | activation=leaky 109 | 110 | [convolutional] 111 | size=1 112 | stride=1 113 | pad=1 114 | filters=125 115 | activation=linear 116 | 117 | [region] 118 | anchors = 1.08,1.19, 3.42,4.41, 6.63,11.38, 9.42,5.11, 16.62,10.52 119 | bias_match=1 120 | classes=20 121 | coords=4 122 | num=5 123 | softmax=1 124 | jitter=.2 125 | rescore=1 126 | 127 | object_scale=5 128 | noobject_scale=1 129 | class_scale=1 130 | coord_scale=1 131 | 132 | absolute=1 133 | thresh = .6 134 | random=1 135 | -------------------------------------------------------------------------------- /cfg/yolo-voc.cfg: -------------------------------------------------------------------------------- 1 | [net] 2 | # Testing 3 | batch=64 4 | subdivisions=8 5 | # Training 6 | # batch=64 7 | # subdivisions=8 8 | height=416 9 | width=416 10 | channels=3 11 | momentum=0.9 12 | decay=0.0005 13 | angle=0 14 | saturation = 1.5 15 | exposure = 1.5 16 | hue=.1 17 | 18 | learning_rate=0.001 19 | burn_in=1000 20 | max_batches = 80200 21 | policy=steps 22 | steps=-1,500,40000,60000 23 | scales=0.1,10,.1,.1 24 | 25 | [convolutional] 26 | batch_normalize=1 27 | filters=32 28 | size=3 29 | stride=1 30 | pad=1 31 | activation=leaky 32 | 33 | [maxpool] 34 | size=2 35 | stride=2 36 | 37 | [convolutional] 38 | batch_normalize=1 39 | filters=64 40 | size=3 41 | stride=1 42 | pad=1 43 | activation=leaky 44 | 45 | [maxpool] 46 | size=2 47 | stride=2 48 | 49 | [convolutional] 50 | batch_normalize=1 51 | filters=128 52 | size=3 53 | stride=1 54 | pad=1 55 | activation=leaky 56 | 57 | [convolutional] 58 | batch_normalize=1 59 | filters=64 60 | size=1 61 | stride=1 62 | pad=1 63 | activation=leaky 64 | 65 | [convolutional] 66 | batch_normalize=1 67 | filters=128 68 | size=3 69 | stride=1 70 | pad=1 71 | activation=leaky 72 | 73 | [maxpool] 74 | size=2 75 | stride=2 76 | 77 | [convolutional] 78 | batch_normalize=1 79 | filters=256 80 | size=3 81 | stride=1 82 | pad=1 83 | activation=leaky 84 | 85 | [convolutional] 86 | batch_normalize=1 87 | filters=128 88 | size=1 89 | stride=1 90 | pad=1 91 | activation=leaky 92 | 93 | [convolutional] 94 | batch_normalize=1 95 | filters=256 96 | size=3 97 | stride=1 98 | pad=1 99 | activation=leaky 100 | 101 | [maxpool] 102 | size=2 103 | stride=2 104 | 105 | [convolutional] 106 | batch_normalize=1 107 | filters=512 108 | size=3 109 | stride=1 110 | pad=1 111 | activation=leaky 112 | 113 | [convolutional] 114 | batch_normalize=1 115 | filters=256 116 | size=1 117 | stride=1 118 | pad=1 119 | activation=leaky 120 | 121 | [convolutional] 122 | batch_normalize=1 123 | filters=512 124 | size=3 125 | stride=1 126 | pad=1 127 | activation=leaky 128 | 129 | [convolutional] 130 | batch_normalize=1 131 | filters=256 132 | size=1 133 | stride=1 134 | pad=1 135 | activation=leaky 136 | 137 | [convolutional] 138 | batch_normalize=1 139 | filters=512 140 | size=3 141 | stride=1 142 | pad=1 143 | activation=leaky 144 | 145 | [maxpool] 146 | size=2 147 | stride=2 148 | 149 | [convolutional] 150 | 
batch_normalize=1 151 | filters=1024 152 | size=3 153 | stride=1 154 | pad=1 155 | activation=leaky 156 | 157 | [convolutional] 158 | batch_normalize=1 159 | filters=512 160 | size=1 161 | stride=1 162 | pad=1 163 | activation=leaky 164 | 165 | [convolutional] 166 | batch_normalize=1 167 | filters=1024 168 | size=3 169 | stride=1 170 | pad=1 171 | activation=leaky 172 | 173 | [convolutional] 174 | batch_normalize=1 175 | filters=512 176 | size=1 177 | stride=1 178 | pad=1 179 | activation=leaky 180 | 181 | [convolutional] 182 | batch_normalize=1 183 | filters=1024 184 | size=3 185 | stride=1 186 | pad=1 187 | activation=leaky 188 | 189 | 190 | ####### 191 | 192 | [convolutional] 193 | batch_normalize=1 194 | size=3 195 | stride=1 196 | pad=1 197 | filters=1024 198 | activation=leaky 199 | 200 | [convolutional] 201 | batch_normalize=1 202 | size=3 203 | stride=1 204 | pad=1 205 | filters=1024 206 | activation=leaky 207 | 208 | [route] 209 | layers=-9 210 | 211 | [convolutional] 212 | batch_normalize=1 213 | size=1 214 | stride=1 215 | pad=1 216 | filters=64 217 | activation=leaky 218 | 219 | [reorg] 220 | stride=2 221 | 222 | [route] 223 | layers=-1,-4 224 | 225 | [convolutional] 226 | batch_normalize=1 227 | size=3 228 | stride=1 229 | pad=1 230 | filters=1024 231 | activation=leaky 232 | 233 | [convolutional] 234 | size=1 235 | stride=1 236 | pad=1 237 | filters=125 238 | activation=linear 239 | 240 | 241 | [region] 242 | anchors = 1.3221, 1.73145, 3.19275, 4.00944, 5.05587, 8.09892, 9.47112, 4.84053, 11.2364, 10.0071 243 | bias_match=1 244 | classes=20 245 | coords=4 246 | num=5 247 | softmax=1 248 | jitter=.3 249 | rescore=1 250 | 251 | object_scale=5 252 | noobject_scale=1 253 | class_scale=1 254 | coord_scale=1 255 | 256 | absolute=1 257 | thresh = .6 258 | random=1 259 | -------------------------------------------------------------------------------- /cfg/yolo.cfg: -------------------------------------------------------------------------------- 1 | [net] 2 | # Testing 3 | batch=1 4 | subdivisions=1 5 | # Training 6 | # batch=64 7 | # subdivisions=8 8 | width=416 9 | height=416 10 | channels=3 11 | momentum=0.9 12 | decay=0.0005 13 | angle=0 14 | saturation = 1.5 15 | exposure = 1.5 16 | hue=.1 17 | 18 | learning_rate=0.001 19 | burn_in=1000 20 | max_batches = 500200 21 | policy=steps 22 | steps=400000,450000 23 | scales=.1,.1 24 | 25 | [convolutional] 26 | batch_normalize=1 27 | filters=32 28 | size=3 29 | stride=1 30 | pad=1 31 | activation=leaky 32 | 33 | [maxpool] 34 | size=2 35 | stride=2 36 | 37 | [convolutional] 38 | batch_normalize=1 39 | filters=64 40 | size=3 41 | stride=1 42 | pad=1 43 | activation=leaky 44 | 45 | [maxpool] 46 | size=2 47 | stride=2 48 | 49 | [convolutional] 50 | batch_normalize=1 51 | filters=128 52 | size=3 53 | stride=1 54 | pad=1 55 | activation=leaky 56 | 57 | [convolutional] 58 | batch_normalize=1 59 | filters=64 60 | size=1 61 | stride=1 62 | pad=1 63 | activation=leaky 64 | 65 | [convolutional] 66 | batch_normalize=1 67 | filters=128 68 | size=3 69 | stride=1 70 | pad=1 71 | activation=leaky 72 | 73 | [maxpool] 74 | size=2 75 | stride=2 76 | 77 | [convolutional] 78 | batch_normalize=1 79 | filters=256 80 | size=3 81 | stride=1 82 | pad=1 83 | activation=leaky 84 | 85 | [convolutional] 86 | batch_normalize=1 87 | filters=128 88 | size=1 89 | stride=1 90 | pad=1 91 | activation=leaky 92 | 93 | [convolutional] 94 | batch_normalize=1 95 | filters=256 96 | size=3 97 | stride=1 98 | pad=1 99 | activation=leaky 100 | 101 | [maxpool] 102 | size=2 103 | 
stride=2 104 | 105 | [convolutional] 106 | batch_normalize=1 107 | filters=512 108 | size=3 109 | stride=1 110 | pad=1 111 | activation=leaky 112 | 113 | [convolutional] 114 | batch_normalize=1 115 | filters=256 116 | size=1 117 | stride=1 118 | pad=1 119 | activation=leaky 120 | 121 | [convolutional] 122 | batch_normalize=1 123 | filters=512 124 | size=3 125 | stride=1 126 | pad=1 127 | activation=leaky 128 | 129 | [convolutional] 130 | batch_normalize=1 131 | filters=256 132 | size=1 133 | stride=1 134 | pad=1 135 | activation=leaky 136 | 137 | [convolutional] 138 | batch_normalize=1 139 | filters=512 140 | size=3 141 | stride=1 142 | pad=1 143 | activation=leaky 144 | 145 | [maxpool] 146 | size=2 147 | stride=2 148 | 149 | [convolutional] 150 | batch_normalize=1 151 | filters=1024 152 | size=3 153 | stride=1 154 | pad=1 155 | activation=leaky 156 | 157 | [convolutional] 158 | batch_normalize=1 159 | filters=512 160 | size=1 161 | stride=1 162 | pad=1 163 | activation=leaky 164 | 165 | [convolutional] 166 | batch_normalize=1 167 | filters=1024 168 | size=3 169 | stride=1 170 | pad=1 171 | activation=leaky 172 | 173 | [convolutional] 174 | batch_normalize=1 175 | filters=512 176 | size=1 177 | stride=1 178 | pad=1 179 | activation=leaky 180 | 181 | [convolutional] 182 | batch_normalize=1 183 | filters=1024 184 | size=3 185 | stride=1 186 | pad=1 187 | activation=leaky 188 | 189 | 190 | ####### 191 | 192 | [convolutional] 193 | batch_normalize=1 194 | size=3 195 | stride=1 196 | pad=1 197 | filters=1024 198 | activation=leaky 199 | 200 | [convolutional] 201 | batch_normalize=1 202 | size=3 203 | stride=1 204 | pad=1 205 | filters=1024 206 | activation=leaky 207 | 208 | [route] 209 | layers=-9 210 | 211 | [convolutional] 212 | batch_normalize=1 213 | size=1 214 | stride=1 215 | pad=1 216 | filters=64 217 | activation=leaky 218 | 219 | [reorg] 220 | stride=2 221 | 222 | [route] 223 | layers=-1,-4 224 | 225 | [convolutional] 226 | batch_normalize=1 227 | size=3 228 | stride=1 229 | pad=1 230 | filters=1024 231 | activation=leaky 232 | 233 | [convolutional] 234 | size=1 235 | stride=1 236 | pad=1 237 | filters=425 238 | activation=linear 239 | 240 | 241 | [region] 242 | anchors = 0.57273, 0.677385, 1.87446, 2.06253, 3.33843, 5.47434, 7.88282, 3.52778, 9.77052, 9.16828 243 | bias_match=1 244 | classes=80 245 | coords=4 246 | num=5 247 | softmax=1 248 | jitter=.3 249 | rescore=1 250 | 251 | object_scale=5 252 | noobject_scale=1 253 | class_scale=1 254 | coord_scale=1 255 | 256 | absolute=1 257 | thresh = .6 258 | random=1 259 | -------------------------------------------------------------------------------- /cfg/yolov3.cfg: -------------------------------------------------------------------------------- 1 | [net] 2 | # Testing 3 | batch=1 4 | subdivisions=1 5 | # Training 6 | # batch=64 7 | # subdivisions=16 8 | width= 320 9 | height = 320 10 | channels=3 11 | momentum=0.9 12 | decay=0.0005 13 | angle=0 14 | saturation = 1.5 15 | exposure = 1.5 16 | hue=.1 17 | 18 | learning_rate=0.001 19 | burn_in=1000 20 | max_batches = 500200 21 | policy=steps 22 | steps=400000,450000 23 | scales=.1,.1 24 | 25 | [convolutional] 26 | batch_normalize=1 27 | filters=32 28 | size=3 29 | stride=1 30 | pad=1 31 | activation=leaky 32 | 33 | # Downsample 34 | 35 | [convolutional] 36 | batch_normalize=1 37 | filters=64 38 | size=3 39 | stride=2 40 | pad=1 41 | activation=leaky 42 | 43 | [convolutional] 44 | batch_normalize=1 45 | filters=32 46 | size=1 47 | stride=1 48 | pad=1 49 | activation=leaky 50 | 51 | 
[convolutional] 52 | batch_normalize=1 53 | filters=64 54 | size=3 55 | stride=1 56 | pad=1 57 | activation=leaky 58 | 59 | [shortcut] 60 | from=-3 61 | activation=linear 62 | 63 | # Downsample 64 | 65 | [convolutional] 66 | batch_normalize=1 67 | filters=128 68 | size=3 69 | stride=2 70 | pad=1 71 | activation=leaky 72 | 73 | [convolutional] 74 | batch_normalize=1 75 | filters=64 76 | size=1 77 | stride=1 78 | pad=1 79 | activation=leaky 80 | 81 | [convolutional] 82 | batch_normalize=1 83 | filters=128 84 | size=3 85 | stride=1 86 | pad=1 87 | activation=leaky 88 | 89 | [shortcut] 90 | from=-3 91 | activation=linear 92 | 93 | [convolutional] 94 | batch_normalize=1 95 | filters=64 96 | size=1 97 | stride=1 98 | pad=1 99 | activation=leaky 100 | 101 | [convolutional] 102 | batch_normalize=1 103 | filters=128 104 | size=3 105 | stride=1 106 | pad=1 107 | activation=leaky 108 | 109 | [shortcut] 110 | from=-3 111 | activation=linear 112 | 113 | # Downsample 114 | 115 | [convolutional] 116 | batch_normalize=1 117 | filters=256 118 | size=3 119 | stride=2 120 | pad=1 121 | activation=leaky 122 | 123 | [convolutional] 124 | batch_normalize=1 125 | filters=128 126 | size=1 127 | stride=1 128 | pad=1 129 | activation=leaky 130 | 131 | [convolutional] 132 | batch_normalize=1 133 | filters=256 134 | size=3 135 | stride=1 136 | pad=1 137 | activation=leaky 138 | 139 | [shortcut] 140 | from=-3 141 | activation=linear 142 | 143 | [convolutional] 144 | batch_normalize=1 145 | filters=128 146 | size=1 147 | stride=1 148 | pad=1 149 | activation=leaky 150 | 151 | [convolutional] 152 | batch_normalize=1 153 | filters=256 154 | size=3 155 | stride=1 156 | pad=1 157 | activation=leaky 158 | 159 | [shortcut] 160 | from=-3 161 | activation=linear 162 | 163 | [convolutional] 164 | batch_normalize=1 165 | filters=128 166 | size=1 167 | stride=1 168 | pad=1 169 | activation=leaky 170 | 171 | [convolutional] 172 | batch_normalize=1 173 | filters=256 174 | size=3 175 | stride=1 176 | pad=1 177 | activation=leaky 178 | 179 | [shortcut] 180 | from=-3 181 | activation=linear 182 | 183 | [convolutional] 184 | batch_normalize=1 185 | filters=128 186 | size=1 187 | stride=1 188 | pad=1 189 | activation=leaky 190 | 191 | [convolutional] 192 | batch_normalize=1 193 | filters=256 194 | size=3 195 | stride=1 196 | pad=1 197 | activation=leaky 198 | 199 | [shortcut] 200 | from=-3 201 | activation=linear 202 | 203 | 204 | [convolutional] 205 | batch_normalize=1 206 | filters=128 207 | size=1 208 | stride=1 209 | pad=1 210 | activation=leaky 211 | 212 | [convolutional] 213 | batch_normalize=1 214 | filters=256 215 | size=3 216 | stride=1 217 | pad=1 218 | activation=leaky 219 | 220 | [shortcut] 221 | from=-3 222 | activation=linear 223 | 224 | [convolutional] 225 | batch_normalize=1 226 | filters=128 227 | size=1 228 | stride=1 229 | pad=1 230 | activation=leaky 231 | 232 | [convolutional] 233 | batch_normalize=1 234 | filters=256 235 | size=3 236 | stride=1 237 | pad=1 238 | activation=leaky 239 | 240 | [shortcut] 241 | from=-3 242 | activation=linear 243 | 244 | [convolutional] 245 | batch_normalize=1 246 | filters=128 247 | size=1 248 | stride=1 249 | pad=1 250 | activation=leaky 251 | 252 | [convolutional] 253 | batch_normalize=1 254 | filters=256 255 | size=3 256 | stride=1 257 | pad=1 258 | activation=leaky 259 | 260 | [shortcut] 261 | from=-3 262 | activation=linear 263 | 264 | [convolutional] 265 | batch_normalize=1 266 | filters=128 267 | size=1 268 | stride=1 269 | pad=1 270 | activation=leaky 271 | 272 | 
[convolutional] 273 | batch_normalize=1 274 | filters=256 275 | size=3 276 | stride=1 277 | pad=1 278 | activation=leaky 279 | 280 | [shortcut] 281 | from=-3 282 | activation=linear 283 | 284 | # Downsample 285 | 286 | [convolutional] 287 | batch_normalize=1 288 | filters=512 289 | size=3 290 | stride=2 291 | pad=1 292 | activation=leaky 293 | 294 | [convolutional] 295 | batch_normalize=1 296 | filters=256 297 | size=1 298 | stride=1 299 | pad=1 300 | activation=leaky 301 | 302 | [convolutional] 303 | batch_normalize=1 304 | filters=512 305 | size=3 306 | stride=1 307 | pad=1 308 | activation=leaky 309 | 310 | [shortcut] 311 | from=-3 312 | activation=linear 313 | 314 | 315 | [convolutional] 316 | batch_normalize=1 317 | filters=256 318 | size=1 319 | stride=1 320 | pad=1 321 | activation=leaky 322 | 323 | [convolutional] 324 | batch_normalize=1 325 | filters=512 326 | size=3 327 | stride=1 328 | pad=1 329 | activation=leaky 330 | 331 | [shortcut] 332 | from=-3 333 | activation=linear 334 | 335 | 336 | [convolutional] 337 | batch_normalize=1 338 | filters=256 339 | size=1 340 | stride=1 341 | pad=1 342 | activation=leaky 343 | 344 | [convolutional] 345 | batch_normalize=1 346 | filters=512 347 | size=3 348 | stride=1 349 | pad=1 350 | activation=leaky 351 | 352 | [shortcut] 353 | from=-3 354 | activation=linear 355 | 356 | 357 | [convolutional] 358 | batch_normalize=1 359 | filters=256 360 | size=1 361 | stride=1 362 | pad=1 363 | activation=leaky 364 | 365 | [convolutional] 366 | batch_normalize=1 367 | filters=512 368 | size=3 369 | stride=1 370 | pad=1 371 | activation=leaky 372 | 373 | [shortcut] 374 | from=-3 375 | activation=linear 376 | 377 | [convolutional] 378 | batch_normalize=1 379 | filters=256 380 | size=1 381 | stride=1 382 | pad=1 383 | activation=leaky 384 | 385 | [convolutional] 386 | batch_normalize=1 387 | filters=512 388 | size=3 389 | stride=1 390 | pad=1 391 | activation=leaky 392 | 393 | [shortcut] 394 | from=-3 395 | activation=linear 396 | 397 | 398 | [convolutional] 399 | batch_normalize=1 400 | filters=256 401 | size=1 402 | stride=1 403 | pad=1 404 | activation=leaky 405 | 406 | [convolutional] 407 | batch_normalize=1 408 | filters=512 409 | size=3 410 | stride=1 411 | pad=1 412 | activation=leaky 413 | 414 | [shortcut] 415 | from=-3 416 | activation=linear 417 | 418 | 419 | [convolutional] 420 | batch_normalize=1 421 | filters=256 422 | size=1 423 | stride=1 424 | pad=1 425 | activation=leaky 426 | 427 | [convolutional] 428 | batch_normalize=1 429 | filters=512 430 | size=3 431 | stride=1 432 | pad=1 433 | activation=leaky 434 | 435 | [shortcut] 436 | from=-3 437 | activation=linear 438 | 439 | [convolutional] 440 | batch_normalize=1 441 | filters=256 442 | size=1 443 | stride=1 444 | pad=1 445 | activation=leaky 446 | 447 | [convolutional] 448 | batch_normalize=1 449 | filters=512 450 | size=3 451 | stride=1 452 | pad=1 453 | activation=leaky 454 | 455 | [shortcut] 456 | from=-3 457 | activation=linear 458 | 459 | # Downsample 460 | 461 | [convolutional] 462 | batch_normalize=1 463 | filters=1024 464 | size=3 465 | stride=2 466 | pad=1 467 | activation=leaky 468 | 469 | [convolutional] 470 | batch_normalize=1 471 | filters=512 472 | size=1 473 | stride=1 474 | pad=1 475 | activation=leaky 476 | 477 | [convolutional] 478 | batch_normalize=1 479 | filters=1024 480 | size=3 481 | stride=1 482 | pad=1 483 | activation=leaky 484 | 485 | [shortcut] 486 | from=-3 487 | activation=linear 488 | 489 | [convolutional] 490 | batch_normalize=1 491 | filters=512 492 | 
size=1 493 | stride=1 494 | pad=1 495 | activation=leaky 496 | 497 | [convolutional] 498 | batch_normalize=1 499 | filters=1024 500 | size=3 501 | stride=1 502 | pad=1 503 | activation=leaky 504 | 505 | [shortcut] 506 | from=-3 507 | activation=linear 508 | 509 | [convolutional] 510 | batch_normalize=1 511 | filters=512 512 | size=1 513 | stride=1 514 | pad=1 515 | activation=leaky 516 | 517 | [convolutional] 518 | batch_normalize=1 519 | filters=1024 520 | size=3 521 | stride=1 522 | pad=1 523 | activation=leaky 524 | 525 | [shortcut] 526 | from=-3 527 | activation=linear 528 | 529 | [convolutional] 530 | batch_normalize=1 531 | filters=512 532 | size=1 533 | stride=1 534 | pad=1 535 | activation=leaky 536 | 537 | [convolutional] 538 | batch_normalize=1 539 | filters=1024 540 | size=3 541 | stride=1 542 | pad=1 543 | activation=leaky 544 | 545 | [shortcut] 546 | from=-3 547 | activation=linear 548 | 549 | ###################### 550 | 551 | [convolutional] 552 | batch_normalize=1 553 | filters=512 554 | size=1 555 | stride=1 556 | pad=1 557 | activation=leaky 558 | 559 | [convolutional] 560 | batch_normalize=1 561 | size=3 562 | stride=1 563 | pad=1 564 | filters=1024 565 | activation=leaky 566 | 567 | [convolutional] 568 | batch_normalize=1 569 | filters=512 570 | size=1 571 | stride=1 572 | pad=1 573 | activation=leaky 574 | 575 | [convolutional] 576 | batch_normalize=1 577 | size=3 578 | stride=1 579 | pad=1 580 | filters=1024 581 | activation=leaky 582 | 583 | [convolutional] 584 | batch_normalize=1 585 | filters=512 586 | size=1 587 | stride=1 588 | pad=1 589 | activation=leaky 590 | 591 | [convolutional] 592 | batch_normalize=1 593 | size=3 594 | stride=1 595 | pad=1 596 | filters=1024 597 | activation=leaky 598 | 599 | [convolutional] 600 | size=1 601 | stride=1 602 | pad=1 603 | filters=255 604 | activation=linear 605 | 606 | 607 | [yolo] 608 | mask = 6,7,8 609 | anchors = 10,13, 16,30, 33,23, 30,61, 62,45, 59,119, 116,90, 156,198, 373,326 610 | classes=80 611 | num=9 612 | jitter=.3 613 | ignore_thresh = .5 614 | truth_thresh = 1 615 | random=1 616 | 617 | 618 | [route] 619 | layers = -4 620 | 621 | [convolutional] 622 | batch_normalize=1 623 | filters=256 624 | size=1 625 | stride=1 626 | pad=1 627 | activation=leaky 628 | 629 | [upsample] 630 | stride=2 631 | 632 | [route] 633 | layers = -1, 61 634 | 635 | 636 | 637 | [convolutional] 638 | batch_normalize=1 639 | filters=256 640 | size=1 641 | stride=1 642 | pad=1 643 | activation=leaky 644 | 645 | [convolutional] 646 | batch_normalize=1 647 | size=3 648 | stride=1 649 | pad=1 650 | filters=512 651 | activation=leaky 652 | 653 | [convolutional] 654 | batch_normalize=1 655 | filters=256 656 | size=1 657 | stride=1 658 | pad=1 659 | activation=leaky 660 | 661 | [convolutional] 662 | batch_normalize=1 663 | size=3 664 | stride=1 665 | pad=1 666 | filters=512 667 | activation=leaky 668 | 669 | [convolutional] 670 | batch_normalize=1 671 | filters=256 672 | size=1 673 | stride=1 674 | pad=1 675 | activation=leaky 676 | 677 | [convolutional] 678 | batch_normalize=1 679 | size=3 680 | stride=1 681 | pad=1 682 | filters=512 683 | activation=leaky 684 | 685 | [convolutional] 686 | size=1 687 | stride=1 688 | pad=1 689 | filters=255 690 | activation=linear 691 | 692 | 693 | [yolo] 694 | mask = 3,4,5 695 | anchors = 10,13, 16,30, 33,23, 30,61, 62,45, 59,119, 116,90, 156,198, 373,326 696 | classes=80 697 | num=9 698 | jitter=.3 699 | ignore_thresh = .5 700 | truth_thresh = 1 701 | random=1 702 | 703 | 704 | 705 | [route] 706 | layers = -4 
707 | 708 | [convolutional] 709 | batch_normalize=1 710 | filters=128 711 | size=1 712 | stride=1 713 | pad=1 714 | activation=leaky 715 | 716 | [upsample] 717 | stride=2 718 | 719 | [route] 720 | layers = -1, 36 721 | 722 | 723 | 724 | [convolutional] 725 | batch_normalize=1 726 | filters=128 727 | size=1 728 | stride=1 729 | pad=1 730 | activation=leaky 731 | 732 | [convolutional] 733 | batch_normalize=1 734 | size=3 735 | stride=1 736 | pad=1 737 | filters=256 738 | activation=leaky 739 | 740 | [convolutional] 741 | batch_normalize=1 742 | filters=128 743 | size=1 744 | stride=1 745 | pad=1 746 | activation=leaky 747 | 748 | [convolutional] 749 | batch_normalize=1 750 | size=3 751 | stride=1 752 | pad=1 753 | filters=256 754 | activation=leaky 755 | 756 | [convolutional] 757 | batch_normalize=1 758 | filters=128 759 | size=1 760 | stride=1 761 | pad=1 762 | activation=leaky 763 | 764 | [convolutional] 765 | batch_normalize=1 766 | size=3 767 | stride=1 768 | pad=1 769 | filters=256 770 | activation=leaky 771 | 772 | [convolutional] 773 | size=1 774 | stride=1 775 | pad=1 776 | filters=255 777 | activation=linear 778 | 779 | 780 | [yolo] 781 | mask = 0,1,2 782 | anchors = 10,13, 16,30, 33,23, 30,61, 62,45, 59,119, 116,90, 156,198, 373,326 783 | classes=80 784 | num=9 785 | jitter=.3 786 | ignore_thresh = .5 787 | truth_thresh = 1 788 | random=1 789 | 790 | -------------------------------------------------------------------------------- /darknet.py: -------------------------------------------------------------------------------- 1 | from __future__ import division 2 | 3 | import torch 4 | import torch.nn as nn 5 | import torch.nn.functional as F 6 | from torch.autograd import Variable 7 | import numpy as np 8 | import cv2 9 | import matplotlib.pyplot as plt 10 | from util import count_parameters as count 11 | from util import convert2cpu as cpu 12 | from util import predict_transform 13 | 14 | class test_net(nn.Module): 15 | def __init__(self, num_layers, input_size): 16 | super(test_net, self).__init__() 17 | self.num_layers= num_layers 18 | self.linear_1 = nn.Linear(input_size, 5) 19 | self.middle = nn.ModuleList([nn.Linear(5,5) for x in range(num_layers)]) 20 | self.output = nn.Linear(5,2) 21 | 22 | def forward(self, x): 23 | x = x.view(-1) 24 | fwd = nn.Sequential(self.linear_1, *self.middle, self.output) 25 | return fwd(x) 26 | 27 | def get_test_input(): 28 | img = cv2.imread("dog-cycle-car.png") 29 | img = cv2.resize(img, (416,416)) 30 | img_ = img[:,:,::-1].transpose((2,0,1)) 31 | img_ = img_[np.newaxis,:,:,:]/255.0 32 | img_ = torch.from_numpy(img_).float() 33 | img_ = Variable(img_) 34 | return img_ 35 | 36 | 37 | def parse_cfg(cfgfile): 38 | """ 39 | Takes a configuration file 40 | 41 | Returns a list of blocks. Each blocks describes a block in the neural 42 | network to be built. 
Block is represented as a dictionary in the list 43 | 44 | """ 45 | file = open(cfgfile, 'r') 46 | lines = file.read().split('\n') #store the lines in a list 47 | lines = [x for x in lines if len(x) > 0] #get read of the empty lines 48 | lines = [x for x in lines if x[0] != '#'] 49 | lines = [x.rstrip().lstrip() for x in lines] 50 | 51 | 52 | block = {} 53 | blocks = [] 54 | 55 | for line in lines: 56 | if line[0] == "[": #This marks the start of a new block 57 | if len(block) != 0: 58 | blocks.append(block) 59 | block = {} 60 | block["type"] = line[1:-1].rstrip() 61 | else: 62 | key,value = line.split("=") 63 | block[key.rstrip()] = value.lstrip() 64 | blocks.append(block) 65 | 66 | return blocks 67 | # print('\n\n'.join([repr(x) for x in blocks])) 68 | 69 | import pickle as pkl 70 | 71 | class MaxPoolStride1(nn.Module): 72 | def __init__(self, kernel_size): 73 | super(MaxPoolStride1, self).__init__() 74 | self.kernel_size = kernel_size 75 | self.pad = kernel_size - 1 76 | 77 | def forward(self, x): 78 | padded_x = F.pad(x, (0,self.pad,0,self.pad), mode="replicate") 79 | pooled_x = nn.MaxPool2d(self.kernel_size, self.pad)(padded_x) 80 | return pooled_x 81 | 82 | 83 | class EmptyLayer(nn.Module): 84 | def __init__(self): 85 | super(EmptyLayer, self).__init__() 86 | 87 | 88 | class DetectionLayer(nn.Module): 89 | def __init__(self, anchors): 90 | super(DetectionLayer, self).__init__() 91 | self.anchors = anchors 92 | 93 | def forward(self, x, inp_dim, num_classes, confidence): 94 | x = x.data 95 | global CUDA 96 | prediction = x 97 | prediction = predict_transform(prediction, inp_dim, self.anchors, num_classes, confidence, CUDA) 98 | return prediction 99 | 100 | 101 | 102 | 103 | 104 | class Upsample(nn.Module): 105 | def __init__(self, stride=2): 106 | super(Upsample, self).__init__() 107 | self.stride = stride 108 | 109 | def forward(self, x): 110 | stride = self.stride 111 | assert(x.data.dim() == 4) 112 | B = x.data.size(0) 113 | C = x.data.size(1) 114 | H = x.data.size(2) 115 | W = x.data.size(3) 116 | ws = stride 117 | hs = stride 118 | x = x.view(B, C, H, 1, W, 1).expand(B, C, H, stride, W, stride).contiguous().view(B, C, H*stride, W*stride) 119 | return x 120 | # 121 | 122 | class ReOrgLayer(nn.Module): 123 | def __init__(self, stride = 2): 124 | super(ReOrgLayer, self).__init__() 125 | self.stride= stride 126 | 127 | def forward(self,x): 128 | assert(x.data.dim() == 4) 129 | B,C,H,W = x.data.shape 130 | hs = self.stride 131 | ws = self.stride 132 | assert(H % hs == 0), "The stride " + str(self.stride) + " is not a proper divisor of height " + str(H) 133 | assert(W % ws == 0), "The stride " + str(self.stride) + " is not a proper divisor of height " + str(W) 134 | x = x.view(B,C, H // hs, hs, W // ws, ws).transpose(-2,-3).contiguous() 135 | x = x.view(B,C, H // hs * W // ws, hs, ws) 136 | x = x.view(B,C, H // hs * W // ws, hs*ws).transpose(-1,-2).contiguous() 137 | x = x.view(B, C, ws*hs, H // ws, W // ws).transpose(1,2).contiguous() 138 | x = x.view(B, C*ws*hs, H // ws, W // ws) 139 | return x 140 | 141 | 142 | def create_modules(blocks): 143 | net_info = blocks[0] #Captures the information about the input and pre-processing 144 | 145 | module_list = nn.ModuleList() 146 | 147 | index = 0 #indexing blocks helps with implementing route layers (skip connections) 148 | 149 | 150 | prev_filters = 3 151 | 152 | output_filters = [] 153 | 154 | for x in blocks: 155 | module = nn.Sequential() 156 | 157 | if (x["type"] == "net"): 158 | continue 159 | 160 | #If it's a convolutional layer 161 | 
if (x["type"] == "convolutional"): 162 | #Get the info about the layer 163 | activation = x["activation"] 164 | try: 165 | batch_normalize = int(x["batch_normalize"]) 166 | bias = False 167 | except: 168 | batch_normalize = 0 169 | bias = True 170 | 171 | filters= int(x["filters"]) 172 | padding = int(x["pad"]) 173 | kernel_size = int(x["size"]) 174 | stride = int(x["stride"]) 175 | 176 | if padding: 177 | pad = (kernel_size - 1) // 2 178 | else: 179 | pad = 0 180 | 181 | #Add the convolutional layer 182 | conv = nn.Conv2d(prev_filters, filters, kernel_size, stride, pad, bias = bias) 183 | module.add_module("conv_{0}".format(index), conv) 184 | 185 | #Add the Batch Norm Layer 186 | if batch_normalize: 187 | bn = nn.BatchNorm2d(filters) 188 | module.add_module("batch_norm_{0}".format(index), bn) 189 | 190 | #Check the activation. 191 | #It is either Linear or a Leaky ReLU for YOLO 192 | if activation == "leaky": 193 | activn = nn.LeakyReLU(0.1, inplace = True) 194 | module.add_module("leaky_{0}".format(index), activn) 195 | 196 | 197 | 198 | #If it's an upsampling layer 199 | #We use Bilinear2dUpsampling 200 | 201 | elif (x["type"] == "upsample"): 202 | stride = int(x["stride"]) 203 | # upsample = Upsample(stride) 204 | upsample = nn.Upsample(scale_factor = 2, mode = "nearest") 205 | module.add_module("upsample_{}".format(index), upsample) 206 | 207 | #If it is a route layer 208 | elif (x["type"] == "route"): 209 | x["layers"] = x["layers"].split(',') 210 | 211 | #Start of a route 212 | start = int(x["layers"][0]) 213 | 214 | #end, if there exists one. 215 | try: 216 | end = int(x["layers"][1]) 217 | except: 218 | end = 0 219 | 220 | 221 | 222 | #Positive anotation 223 | if start > 0: 224 | start = start - index 225 | 226 | if end > 0: 227 | end = end - index 228 | 229 | 230 | route = EmptyLayer() 231 | module.add_module("route_{0}".format(index), route) 232 | 233 | 234 | 235 | if end < 0: 236 | filters = output_filters[index + start] + output_filters[index + end] 237 | else: 238 | filters= output_filters[index + start] 239 | 240 | 241 | 242 | #shortcut corresponds to skip connection 243 | elif x["type"] == "shortcut": 244 | from_ = int(x["from"]) 245 | shortcut = EmptyLayer() 246 | module.add_module("shortcut_{}".format(index), shortcut) 247 | 248 | 249 | elif x["type"] == "maxpool": 250 | stride = int(x["stride"]) 251 | size = int(x["size"]) 252 | if stride != 1: 253 | maxpool = nn.MaxPool2d(size, stride) 254 | else: 255 | maxpool = MaxPoolStride1(size) 256 | 257 | module.add_module("maxpool_{}".format(index), maxpool) 258 | 259 | #Yolo is the detection layer 260 | elif x["type"] == "yolo": 261 | mask = x["mask"].split(",") 262 | mask = [int(x) for x in mask] 263 | 264 | 265 | anchors = x["anchors"].split(",") 266 | anchors = [int(a) for a in anchors] 267 | anchors = [(anchors[i], anchors[i+1]) for i in range(0, len(anchors),2)] 268 | anchors = [anchors[i] for i in mask] 269 | 270 | detection = DetectionLayer(anchors) 271 | module.add_module("Detection_{}".format(index), detection) 272 | 273 | 274 | 275 | else: 276 | print("Something I dunno") 277 | assert False 278 | 279 | 280 | module_list.append(module) 281 | prev_filters = filters 282 | output_filters.append(filters) 283 | index += 1 284 | 285 | 286 | return (net_info, module_list) 287 | 288 | 289 | 290 | class Darknet(nn.Module): 291 | def __init__(self, cfgfile): 292 | super(Darknet, self).__init__() 293 | self.blocks = parse_cfg(cfgfile) 294 | self.net_info, self.module_list = create_modules(self.blocks) 295 | self.header = 
torch.IntTensor([0,0,0,0]) 296 | self.seen = 0 297 | 298 | 299 | 300 | def get_blocks(self): 301 | return self.blocks 302 | 303 | def get_module_list(self): 304 | return self.module_list 305 | 306 | 307 | def forward(self, x, CUDA): 308 | detections = [] 309 | modules = self.blocks[1:] 310 | outputs = {} #We cache the outputs for the route layer 311 | 312 | 313 | write = 0 314 | for i in range(len(modules)): 315 | 316 | module_type = (modules[i]["type"]) 317 | if module_type == "convolutional" or module_type == "upsample" or module_type == "maxpool": 318 | 319 | x = self.module_list[i](x) 320 | outputs[i] = x 321 | 322 | 323 | elif module_type == "route": 324 | layers = modules[i]["layers"] 325 | layers = [int(a) for a in layers] 326 | 327 | if (layers[0]) > 0: 328 | layers[0] = layers[0] - i 329 | 330 | if len(layers) == 1: 331 | x = outputs[i + (layers[0])] 332 | 333 | else: 334 | if (layers[1]) > 0: 335 | layers[1] = layers[1] - i 336 | 337 | map1 = outputs[i + layers[0]] 338 | map2 = outputs[i + layers[1]] 339 | 340 | 341 | x = torch.cat((map1, map2), 1) 342 | outputs[i] = x 343 | 344 | elif module_type == "shortcut": 345 | from_ = int(modules[i]["from"]) 346 | x = outputs[i-1] + outputs[i+from_] 347 | outputs[i] = x 348 | 349 | 350 | 351 | elif module_type == 'yolo': 352 | 353 | anchors = self.module_list[i][0].anchors 354 | #Get the input dimensions 355 | inp_dim = int (self.net_info["height"]) 356 | 357 | #Get the number of classes 358 | num_classes = int (modules[i]["classes"]) 359 | 360 | #Output the result 361 | x = x.data 362 | x = predict_transform(x, inp_dim, anchors, num_classes, CUDA) 363 | 364 | if type(x) == int: 365 | continue 366 | 367 | 368 | if not write: 369 | detections = x 370 | write = 1 371 | 372 | else: 373 | detections = torch.cat((detections, x), 1) 374 | 375 | outputs[i] = outputs[i-1] 376 | 377 | 378 | 379 | try: 380 | return detections 381 | except: 382 | return 0 383 | 384 | 385 | def load_weights(self, weightfile): 386 | 387 | #Open the weights file 388 | fp = open(weightfile, "rb") 389 | 390 | #The first 4 values are header information 391 | # 1. Major version number 392 | # 2. Minor Version Number 393 | # 3. Subversion number 394 | # 4. IMages seen 395 | header = np.fromfile(fp, dtype = np.int32, count = 5) 396 | self.header = torch.from_numpy(header) 397 | self.seen = self.header[3] 398 | 399 | #The rest of the values are the weights 400 | # Let's load them up 401 | weights = np.fromfile(fp, dtype = np.float32) 402 | 403 | ptr = 0 404 | for i in range(len(self.module_list)): 405 | module_type = self.blocks[i + 1]["type"] 406 | 407 | if module_type == "convolutional": 408 | model = self.module_list[i] 409 | try: 410 | batch_normalize = int(self.blocks[i+1]["batch_normalize"]) 411 | except: 412 | batch_normalize = 0 413 | 414 | conv = model[0] 415 | 416 | if (batch_normalize): 417 | bn = model[1] 418 | 419 | #Get the number of weights of Batch Norm Layer 420 | num_bn_biases = bn.bias.numel() 421 | 422 | #Load the weights 423 | bn_biases = torch.from_numpy(weights[ptr:ptr + num_bn_biases]) 424 | ptr += num_bn_biases 425 | 426 | bn_weights = torch.from_numpy(weights[ptr: ptr + num_bn_biases]) 427 | ptr += num_bn_biases 428 | 429 | bn_running_mean = torch.from_numpy(weights[ptr: ptr + num_bn_biases]) 430 | ptr += num_bn_biases 431 | 432 | bn_running_var = torch.from_numpy(weights[ptr: ptr + num_bn_biases]) 433 | ptr += num_bn_biases 434 | 435 | #Cast the loaded weights into dims of model weights. 
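#Layout note: for every convolutional block the .weights file stores, in order,
#the batch norm biases (beta), batch norm weights (gamma), running mean and
#running variance when batch_normalize is set (otherwise just the conv biases),
#followed by the flattened conv kernel weights, all as consecutive float32 values.
#view_as reshapes each flat slice to the parameter's shape, and copy_ then writes
#it into the module in place.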
436 | bn_biases = bn_biases.view_as(bn.bias.data) 437 | bn_weights = bn_weights.view_as(bn.weight.data) 438 | bn_running_mean = bn_running_mean.view_as(bn.running_mean) 439 | bn_running_var = bn_running_var.view_as(bn.running_var) 440 | 441 | #Copy the data to model 442 | bn.bias.data.copy_(bn_biases) 443 | bn.weight.data.copy_(bn_weights) 444 | bn.running_mean.copy_(bn_running_mean) 445 | bn.running_var.copy_(bn_running_var) 446 | 447 | else: 448 | #Number of biases 449 | num_biases = conv.bias.numel() 450 | 451 | #Load the weights 452 | conv_biases = torch.from_numpy(weights[ptr: ptr + num_biases]) 453 | ptr = ptr + num_biases 454 | 455 | #reshape the loaded weights according to the dims of the model weights 456 | conv_biases = conv_biases.view_as(conv.bias.data) 457 | 458 | #Finally copy the data 459 | conv.bias.data.copy_(conv_biases) 460 | 461 | 462 | #Let us load the weights for the Convolutional layers 463 | num_weights = conv.weight.numel() 464 | 465 | #Do the same as above for weights 466 | conv_weights = torch.from_numpy(weights[ptr:ptr+num_weights]) 467 | ptr = ptr + num_weights 468 | 469 | conv_weights = conv_weights.view_as(conv.weight.data) 470 | conv.weight.data.copy_(conv_weights) 471 | 472 | def save_weights(self, savedfile, cutoff = 0): 473 | 474 | if cutoff <= 0: 475 | cutoff = len(self.blocks) - 1 476 | 477 | fp = open(savedfile, 'wb') 478 | 479 | # Attach the header at the top of the file 480 | self.header[3] = self.seen 481 | header = self.header 482 | 483 | header = header.numpy() 484 | header.tofile(fp) 485 | 486 | # Now, let us save the weights 487 | for i in range(len(self.module_list)): 488 | module_type = self.blocks[i+1]["type"] 489 | 490 | if (module_type) == "convolutional": 491 | model = self.module_list[i] 492 | try: 493 | batch_normalize = int(self.blocks[i+1]["batch_normalize"]) 494 | except: 495 | batch_normalize = 0 496 | 497 | conv = model[0] 498 | 499 | if (batch_normalize): 500 | bn = model[1] 501 | 502 | #If the parameters are on GPU, convert them back to CPU 503 | #We don't convert the parameter to GPU 504 | #Instead. 
we copy the parameter and then convert it to CPU 505 | #This is done as weight are need to be saved during training 506 | cpu(bn.bias.data).numpy().tofile(fp) 507 | cpu(bn.weight.data).numpy().tofile(fp) 508 | cpu(bn.running_mean).numpy().tofile(fp) 509 | cpu(bn.running_var).numpy().tofile(fp) 510 | 511 | 512 | else: 513 | cpu(conv.bias.data).numpy().tofile(fp) 514 | 515 | 516 | #Let us save the weights for the Convolutional layers 517 | cpu(conv.weight.data).numpy().tofile(fp) 518 | 519 | 520 | 521 | 522 | 523 | # 524 | #dn = Darknet('cfg/yolov3.cfg') 525 | #dn.load_weights("yolov3.weights") 526 | #inp = get_test_input() 527 | #a, interms = dn(inp) 528 | #dn.eval() 529 | #a_i, interms_i = dn(inp) 530 | -------------------------------------------------------------------------------- /data/coco.names: -------------------------------------------------------------------------------- 1 | person 2 | bicycle 3 | car 4 | motorbike 5 | aeroplane 6 | bus 7 | train 8 | truck 9 | boat 10 | traffic light 11 | fire hydrant 12 | stop sign 13 | parking meter 14 | bench 15 | bird 16 | cat 17 | dog 18 | horse 19 | sheep 20 | cow 21 | elephant 22 | bear 23 | zebra 24 | giraffe 25 | backpack 26 | umbrella 27 | handbag 28 | tie 29 | suitcase 30 | frisbee 31 | skis 32 | snowboard 33 | sports ball 34 | kite 35 | baseball bat 36 | baseball glove 37 | skateboard 38 | surfboard 39 | tennis racket 40 | bottle 41 | wine glass 42 | cup 43 | fork 44 | knife 45 | spoon 46 | bowl 47 | banana 48 | apple 49 | sandwich 50 | orange 51 | broccoli 52 | carrot 53 | hot dog 54 | pizza 55 | donut 56 | cake 57 | chair 58 | sofa 59 | pottedplant 60 | bed 61 | diningtable 62 | toilet 63 | tvmonitor 64 | laptop 65 | mouse 66 | remote 67 | keyboard 68 | cell phone 69 | microwave 70 | oven 71 | toaster 72 | sink 73 | refrigerator 74 | book 75 | clock 76 | vase 77 | scissors 78 | teddy bear 79 | hair drier 80 | toothbrush 81 | -------------------------------------------------------------------------------- /data/voc.names: -------------------------------------------------------------------------------- 1 | aeroplane 2 | bicycle 3 | bird 4 | boat 5 | bottle 6 | bus 7 | car 8 | cat 9 | chair 10 | cow 11 | diningtable 12 | dog 13 | horse 14 | motorbike 15 | person 16 | pottedplant 17 | sheep 18 | sofa 19 | train 20 | tvmonitor 21 | -------------------------------------------------------------------------------- /det_messi.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ayooshkathuria/pytorch-yolo-v3/fbb4ef98d5a598f4c8eded6d618a599b7d289e2f/det_messi.jpg -------------------------------------------------------------------------------- /detect.py: -------------------------------------------------------------------------------- 1 | from __future__ import division 2 | import time 3 | import torch 4 | import torch.nn as nn 5 | from torch.autograd import Variable 6 | import numpy as np 7 | import cv2 8 | from util import * 9 | import argparse 10 | import os 11 | import os.path as osp 12 | from darknet import Darknet 13 | from preprocess import prep_image, inp_to_image 14 | import pandas as pd 15 | import random 16 | import pickle as pkl 17 | import itertools 18 | 19 | class test_net(nn.Module): 20 | def __init__(self, num_layers, input_size): 21 | super(test_net, self).__init__() 22 | self.num_layers= num_layers 23 | self.linear_1 = nn.Linear(input_size, 5) 24 | self.middle = nn.ModuleList([nn.Linear(5,5) for x in range(num_layers)]) 25 | self.output = nn.Linear(5,2) 26 | 27 
| def forward(self, x): 28 | x = x.view(-1) 29 | fwd = nn.Sequential(self.linear_1, *self.middle, self.output) 30 | return fwd(x) 31 | 32 | def get_test_input(input_dim, CUDA): 33 | img = cv2.imread("dog-cycle-car.png") 34 | img = cv2.resize(img, (input_dim, input_dim)) 35 | img_ = img[:,:,::-1].transpose((2,0,1)) 36 | img_ = img_[np.newaxis,:,:,:]/255.0 37 | img_ = torch.from_numpy(img_).float() 38 | img_ = Variable(img_) 39 | 40 | if CUDA: 41 | img_ = img_.cuda() 42 | num_classes 43 | return img_ 44 | 45 | 46 | 47 | def arg_parse(): 48 | """ 49 | Parse arguements to the detect module 50 | 51 | """ 52 | 53 | 54 | parser = argparse.ArgumentParser(description='YOLO v3 Detection Module') 55 | 56 | parser.add_argument("--images", dest = 'images', help = 57 | "Image / Directory containing images to perform detection upon", 58 | default = "imgs", type = str) 59 | parser.add_argument("--det", dest = 'det', help = 60 | "Image / Directory to store detections to", 61 | default = "det", type = str) 62 | parser.add_argument("--bs", dest = "bs", help = "Batch size", default = 1) 63 | parser.add_argument("--confidence", dest = "confidence", help = "Object Confidence to filter predictions", default = 0.5) 64 | parser.add_argument("--nms_thresh", dest = "nms_thresh", help = "NMS Threshhold", default = 0.4) 65 | parser.add_argument("--cfg", dest = 'cfgfile', help = 66 | "Config file", 67 | default = "cfg/yolov3.cfg", type = str) 68 | parser.add_argument("--weights", dest = 'weightsfile', help = 69 | "weightsfile", 70 | default = "yolov3.weights", type = str) 71 | parser.add_argument("--reso", dest = 'reso', help = 72 | "Input resolution of the network. Increase to increase accuracy. Decrease to increase speed", 73 | default = "416", type = str) 74 | parser.add_argument("--scales", dest = "scales", help = "Scales to use for detection", 75 | default = "1,2,3", type = str) 76 | 77 | return parser.parse_args() 78 | 79 | if __name__ == '__main__': 80 | args = arg_parse() 81 | 82 | scales = args.scales 83 | 84 | 85 | # scales = [int(x) for x in scales.split(',')] 86 | # 87 | # 88 | # 89 | # args.reso = int(args.reso) 90 | # 91 | # num_boxes = [args.reso//32, args.reso//16, args.reso//8] 92 | # scale_indices = [3*(x**2) for x in num_boxes] 93 | # scale_indices = list(itertools.accumulate(scale_indices, lambda x,y : x+y)) 94 | # 95 | # 96 | # li = [] 97 | # i = 0 98 | # for scale in scale_indices: 99 | # li.extend(list(range(i, scale))) 100 | # i = scale 101 | # 102 | # scale_indices = li 103 | 104 | images = args.images 105 | batch_size = int(args.bs) 106 | confidence = float(args.confidence) 107 | nms_thesh = float(args.nms_thresh) 108 | start = 0 109 | 110 | CUDA = torch.cuda.is_available() 111 | 112 | num_classes = 80 113 | classes = load_classes('data/coco.names') 114 | 115 | #Set up the neural network 116 | print("Loading network.....") 117 | model = Darknet(args.cfgfile) 118 | model.load_weights(args.weightsfile) 119 | print("Network successfully loaded") 120 | 121 | model.net_info["height"] = args.reso 122 | inp_dim = int(model.net_info["height"]) 123 | assert inp_dim % 32 == 0 124 | assert inp_dim > 32 125 | 126 | #If there's a GPU availible, put the model on GPU 127 | if CUDA: 128 | model.cuda() 129 | 130 | 131 | #Set the model in evaluation mode 132 | model.eval() 133 | 134 | read_dir = time.time() 135 | #Detection phase 136 | try: 137 | imlist = [osp.join(osp.realpath('.'), images, img) for img in os.listdir(images) if os.path.splitext(img)[1] == '.png' or os.path.splitext(img)[1] =='.jpeg' or 
os.path.splitext(img)[1] =='.jpg']
138 |     except NotADirectoryError:
139 |         imlist = []
140 |         imlist.append(osp.join(osp.realpath('.'), images))
141 |     except FileNotFoundError:
142 |         print("No file or directory with the name {}".format(images))
143 |         exit()
144 |
145 |     if not os.path.exists(args.det):
146 |         os.makedirs(args.det)
147 |
148 |     load_batch = time.time()
149 |
150 |     batches = list(map(prep_image, imlist, [inp_dim for x in range(len(imlist))]))
151 |     im_batches = [x[0] for x in batches]
152 |     orig_ims = [x[1] for x in batches]
153 |     im_dim_list = [x[2] for x in batches]
154 |     im_dim_list = torch.FloatTensor(im_dim_list).repeat(1,2)
155 |
156 |
157 |
158 |     if CUDA:
159 |         im_dim_list = im_dim_list.cuda()
160 |
161 |     leftover = 0
162 |
163 |     if (len(im_dim_list) % batch_size):
164 |         leftover = 1
165 |
166 |
167 |     if batch_size != 1:
168 |         num_batches = len(imlist) // batch_size + leftover
169 |         im_batches = [torch.cat((im_batches[i*batch_size : min((i + 1)*batch_size,
170 |                             len(im_batches))])) for i in range(num_batches)]
171 |
172 |
173 |     i = 0
174 |
175 |
176 |     write = False
177 |     model(get_test_input(inp_dim, CUDA), CUDA)
178 |
179 |     start_det_loop = time.time()
180 |
181 |     objs = {}
182 |
183 |
184 |
185 |     for batch in im_batches:
186 |         #Start the timer and move the batch to the GPU if one is available
187 |         start = time.time()
188 |         if CUDA:
189 |             batch = batch.cuda()
190 |
191 |
192 |         #Apply offsets to the result predictions
193 |         #Transform the predictions as described in the YOLO paper
194 |         #flatten the prediction vector
195 |         # B x (bbox coords x no. of anchors) x grid_w x grid_h --> B x bbox x (all the boxes)
196 |         # Put every proposed box as a row.
197 |         with torch.no_grad():
198 |             prediction = model(Variable(batch), CUDA)
199 |
200 |         # prediction = prediction[:,scale_indices]
201 |
202 |
203 |         #get the boxes with object confidence > threshold
204 |         #Convert the coordinates to absolute coordinates
205 |         #perform NMS on these boxes, and save the results
206 |         #I could have done NMS and saving separately to have a better abstraction
207 |         #But both these operations require looping, hence
208 |         #clubbing these ops in one loop instead of two.
209 |         #loops are slower than vectorised operations.
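        #A quick shape check may help here (a sketch, assuming the default 416 x 416
        #resolution and the 80-class COCO weights): the three detection scales have
        #strides 32, 16 and 8, i.e. 13x13, 26x26 and 52x52 grids with 3 anchors each, so
        #    3*(13*13 + 26*26 + 52*52) = 10647 boxes, each with 5 + 80 = 85 attributes.
        #At this point `prediction` is therefore a B x 10647 x 85 tensor, e.g.
        #    assert tuple(prediction.shape[1:]) == (10647, 85)
        #before write_results thresholds the objectness column and applies class-wise NMS.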
210 | 211 | prediction = write_results(prediction, confidence, num_classes, nms = True, nms_conf = nms_thesh) 212 | 213 | 214 | if type(prediction) == int: 215 | i += 1 216 | continue 217 | 218 | end = time.time() 219 | 220 | 221 | # print(end - start) 222 | 223 | 224 | 225 | prediction[:,0] += i*batch_size 226 | 227 | 228 | 229 | 230 | if not write: 231 | output = prediction 232 | write = 1 233 | else: 234 | output = torch.cat((output,prediction)) 235 | 236 | 237 | 238 | 239 | for im_num, image in enumerate(imlist[i*batch_size: min((i + 1)*batch_size, len(imlist))]): 240 | im_id = i*batch_size + im_num 241 | objs = [classes[int(x[-1])] for x in output if int(x[0]) == im_id] 242 | print("{0:20s} predicted in {1:6.3f} seconds".format(image.split("/")[-1], (end - start)/batch_size)) 243 | print("{0:20s} {1:s}".format("Objects Detected:", " ".join(objs))) 244 | print("----------------------------------------------------------") 245 | i += 1 246 | 247 | 248 | if CUDA: 249 | torch.cuda.synchronize() 250 | 251 | try: 252 | output 253 | except NameError: 254 | print("No detections were made") 255 | exit() 256 | 257 | im_dim_list = torch.index_select(im_dim_list, 0, output[:,0].long()) 258 | 259 | scaling_factor = torch.min(inp_dim/im_dim_list,1)[0].view(-1,1) 260 | 261 | 262 | output[:,[1,3]] -= (inp_dim - scaling_factor*im_dim_list[:,0].view(-1,1))/2 263 | output[:,[2,4]] -= (inp_dim - scaling_factor*im_dim_list[:,1].view(-1,1))/2 264 | 265 | 266 | 267 | output[:,1:5] /= scaling_factor 268 | 269 | for i in range(output.shape[0]): 270 | output[i, [1,3]] = torch.clamp(output[i, [1,3]], 0.0, im_dim_list[i,0]) 271 | output[i, [2,4]] = torch.clamp(output[i, [2,4]], 0.0, im_dim_list[i,1]) 272 | 273 | 274 | output_recast = time.time() 275 | 276 | 277 | class_load = time.time() 278 | 279 | colors = pkl.load(open("pallete", "rb")) 280 | 281 | 282 | draw = time.time() 283 | 284 | 285 | def write(x, batches, results): 286 | c1 = tuple(x[1:3].int()) 287 | c2 = tuple(x[3:5].int()) 288 | img = results[int(x[0])] 289 | cls = int(x[-1]) 290 | label = "{0}".format(classes[cls]) 291 | color = random.choice(colors) 292 | cv2.rectangle(img, c1, c2,color, 1) 293 | t_size = cv2.getTextSize(label, cv2.FONT_HERSHEY_PLAIN, 1 , 1)[0] 294 | c2 = c1[0] + t_size[0] + 3, c1[1] + t_size[1] + 4 295 | cv2.rectangle(img, c1, c2,color, -1) 296 | cv2.putText(img, label, (c1[0], c1[1] + t_size[1] + 4), cv2.FONT_HERSHEY_PLAIN, 1, [225,255,255], 1) 297 | return img 298 | 299 | 300 | list(map(lambda x: write(x, im_batches, orig_ims), output)) 301 | 302 | det_names = pd.Series(imlist).apply(lambda x: "{}/det_{}".format(args.det,x.split("/")[-1])) 303 | 304 | list(map(cv2.imwrite, det_names, orig_ims)) 305 | 306 | end = time.time() 307 | 308 | print() 309 | print("SUMMARY") 310 | print("----------------------------------------------------------") 311 | print("{:25s}: {}".format("Task", "Time Taken (in seconds)")) 312 | print() 313 | print("{:25s}: {:2.3f}".format("Reading addresses", load_batch - read_dir)) 314 | print("{:25s}: {:2.3f}".format("Loading batch", start_det_loop - load_batch)) 315 | print("{:25s}: {:2.3f}".format("Detection (" + str(len(imlist)) + " images)", output_recast - start_det_loop)) 316 | print("{:25s}: {:2.3f}".format("Output Processing", class_load - output_recast)) 317 | print("{:25s}: {:2.3f}".format("Drawing Boxes", end - draw)) 318 | print("{:25s}: {:2.3f}".format("Average time_per_img", (end - load_batch)/len(imlist))) 319 | print("----------------------------------------------------------") 320 | 321 | 322 
| torch.cuda.empty_cache() 323 | 324 | 325 | 326 | 327 | 328 | 329 | -------------------------------------------------------------------------------- /dog-cycle-car.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ayooshkathuria/pytorch-yolo-v3/fbb4ef98d5a598f4c8eded6d618a599b7d289e2f/dog-cycle-car.png -------------------------------------------------------------------------------- /imgs/dog.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ayooshkathuria/pytorch-yolo-v3/fbb4ef98d5a598f4c8eded6d618a599b7d289e2f/imgs/dog.jpg -------------------------------------------------------------------------------- /imgs/eagle.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ayooshkathuria/pytorch-yolo-v3/fbb4ef98d5a598f4c8eded6d618a599b7d289e2f/imgs/eagle.jpg -------------------------------------------------------------------------------- /imgs/giraffe.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ayooshkathuria/pytorch-yolo-v3/fbb4ef98d5a598f4c8eded6d618a599b7d289e2f/imgs/giraffe.jpg -------------------------------------------------------------------------------- /imgs/herd_of_horses.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ayooshkathuria/pytorch-yolo-v3/fbb4ef98d5a598f4c8eded6d618a599b7d289e2f/imgs/herd_of_horses.jpg -------------------------------------------------------------------------------- /imgs/img1.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ayooshkathuria/pytorch-yolo-v3/fbb4ef98d5a598f4c8eded6d618a599b7d289e2f/imgs/img1.jpg -------------------------------------------------------------------------------- /imgs/img2.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ayooshkathuria/pytorch-yolo-v3/fbb4ef98d5a598f4c8eded6d618a599b7d289e2f/imgs/img2.jpg -------------------------------------------------------------------------------- /imgs/img3.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ayooshkathuria/pytorch-yolo-v3/fbb4ef98d5a598f4c8eded6d618a599b7d289e2f/imgs/img3.jpg -------------------------------------------------------------------------------- /imgs/img4.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ayooshkathuria/pytorch-yolo-v3/fbb4ef98d5a598f4c8eded6d618a599b7d289e2f/imgs/img4.jpg -------------------------------------------------------------------------------- /imgs/messi.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ayooshkathuria/pytorch-yolo-v3/fbb4ef98d5a598f4c8eded6d618a599b7d289e2f/imgs/messi.jpg -------------------------------------------------------------------------------- /imgs/person.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ayooshkathuria/pytorch-yolo-v3/fbb4ef98d5a598f4c8eded6d618a599b7d289e2f/imgs/person.jpg -------------------------------------------------------------------------------- /imgs/scream.jpg: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/ayooshkathuria/pytorch-yolo-v3/fbb4ef98d5a598f4c8eded6d618a599b7d289e2f/imgs/scream.jpg -------------------------------------------------------------------------------- /pallete: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ayooshkathuria/pytorch-yolo-v3/fbb4ef98d5a598f4c8eded6d618a599b7d289e2f/pallete -------------------------------------------------------------------------------- /preprocess.py: -------------------------------------------------------------------------------- 1 | from __future__ import division 2 | 3 | import torch 4 | import torch.nn as nn 5 | import torch.nn.functional as F 6 | from torch.autograd import Variable 7 | import numpy as np 8 | import cv2 9 | import matplotlib.pyplot as plt 10 | from util import count_parameters as count 11 | from util import convert2cpu as cpu 12 | from PIL import Image, ImageDraw 13 | 14 | 15 | def letterbox_image(img, inp_dim): 16 | '''resize image with unchanged aspect ratio using padding''' 17 | img_w, img_h = img.shape[1], img.shape[0] 18 | w, h = inp_dim 19 | new_w = int(img_w * min(w/img_w, h/img_h)) 20 | new_h = int(img_h * min(w/img_w, h/img_h)) 21 | resized_image = cv2.resize(img, (new_w,new_h), interpolation = cv2.INTER_CUBIC) 22 | 23 | canvas = np.full((inp_dim[1], inp_dim[0], 3), 128) 24 | 25 | canvas[(h-new_h)//2:(h-new_h)//2 + new_h,(w-new_w)//2:(w-new_w)//2 + new_w, :] = resized_image 26 | 27 | return canvas 28 | 29 | 30 | 31 | def prep_image(img, inp_dim): 32 | """ 33 | Prepare image for inputting to the neural network. 34 | 35 | Returns a Variable 36 | """ 37 | 38 | orig_im = cv2.imread(img) 39 | dim = orig_im.shape[1], orig_im.shape[0] 40 | img = (letterbox_image(orig_im, (inp_dim, inp_dim))) 41 | img_ = img[:,:,::-1].transpose((2,0,1)).copy() 42 | img_ = torch.from_numpy(img_).float().div(255.0).unsqueeze(0) 43 | return img_, orig_im, dim 44 | 45 | def prep_image_pil(img, network_dim): 46 | orig_im = Image.open(img) 47 | img = orig_im.convert('RGB') 48 | dim = img.size 49 | img = img.resize(network_dim) 50 | img = torch.ByteTensor(torch.ByteStorage.from_buffer(img.tobytes())) 51 | img = img.view(*network_dim, 3).transpose(0,1).transpose(0,2).contiguous() 52 | img = img.view(1, 3,*network_dim) 53 | img = img.float().div(255.0) 54 | return (img, orig_im, dim) 55 | 56 | def inp_to_image(inp): 57 | inp = inp.cpu().squeeze() 58 | inp = inp*255 59 | try: 60 | inp = inp.data.numpy() 61 | except RuntimeError: 62 | inp = inp.numpy() 63 | inp = inp.transpose(1,2,0) 64 | 65 | inp = inp[:,:,::-1] 66 | return inp 67 | 68 | 69 | -------------------------------------------------------------------------------- /util.py: -------------------------------------------------------------------------------- 1 | 2 | from __future__ import division 3 | 4 | import torch 5 | import torch.nn as nn 6 | import torch.nn.functional as F 7 | from torch.autograd import Variable 8 | import numpy as np 9 | import cv2 10 | import matplotlib.pyplot as plt 11 | from bbox import bbox_iou 12 | 13 | def count_parameters(model): 14 | return sum(p.numel() for p in model.parameters()) 15 | 16 | def count_learnable_parameters(model): 17 | return sum(p.numel() for p in model.parameters() if p.requires_grad) 18 | 19 | def convert2cpu(matrix): 20 | if matrix.is_cuda: 21 | return torch.FloatTensor(matrix.size()).copy_(matrix) 22 | else: 23 | return matrix 24 | 25 | def 
predict_transform(prediction, inp_dim, anchors, num_classes, CUDA = True):
26 |     batch_size = prediction.size(0)
27 |     stride = inp_dim // prediction.size(2)
28 |     grid_size = inp_dim // stride
29 |     bbox_attrs = 5 + num_classes
30 |     num_anchors = len(anchors)
31 |
32 |     anchors = [(a[0]/stride, a[1]/stride) for a in anchors]
33 |
34 |
35 |
36 |     prediction = prediction.view(batch_size, bbox_attrs*num_anchors, grid_size*grid_size)
37 |     prediction = prediction.transpose(1,2).contiguous()
38 |     prediction = prediction.view(batch_size, grid_size*grid_size*num_anchors, bbox_attrs)
39 |
40 |
41 |     #Sigmoid the centre_X, centre_Y and object confidence
42 |     prediction[:,:,0] = torch.sigmoid(prediction[:,:,0])
43 |     prediction[:,:,1] = torch.sigmoid(prediction[:,:,1])
44 |     prediction[:,:,4] = torch.sigmoid(prediction[:,:,4])
45 |
46 |
47 |
48 |     #Add the center offsets
49 |     grid_len = np.arange(grid_size)
50 |     a,b = np.meshgrid(grid_len, grid_len)
51 |
52 |     x_offset = torch.FloatTensor(a).view(-1,1)
53 |     y_offset = torch.FloatTensor(b).view(-1,1)
54 |
55 |     if CUDA:
56 |         x_offset = x_offset.cuda()
57 |         y_offset = y_offset.cuda()
58 |
59 |     x_y_offset = torch.cat((x_offset, y_offset), 1).repeat(1,num_anchors).view(-1,2).unsqueeze(0)
60 |
61 |     prediction[:,:,:2] += x_y_offset
62 |
63 |     #log space transform height and the width
64 |     anchors = torch.FloatTensor(anchors)
65 |
66 |     if CUDA:
67 |         anchors = anchors.cuda()
68 |
69 |     anchors = anchors.repeat(grid_size*grid_size, 1).unsqueeze(0)
70 |     prediction[:,:,2:4] = torch.exp(prediction[:,:,2:4])*anchors
71 |
72 |     #Sigmoid the class scores (YOLO v3 uses independent logistic classifiers rather than a softmax)
73 |     prediction[:,:,5: 5 + num_classes] = torch.sigmoid((prediction[:,:, 5 : 5 + num_classes]))
74 |
75 |     prediction[:,:,:4] *= stride
76 |
77 |
78 |     return prediction
79 |
80 | def load_classes(namesfile):
81 |     fp = open(namesfile, "r")
82 |     names = fp.read().split("\n")[:-1]
83 |     return names
84 |
85 | def get_im_dim(im):
86 |     im = cv2.imread(im)
87 |     w,h = im.shape[1], im.shape[0]
88 |     return w,h
89 |
90 | def unique(tensor):
91 |     tensor_np = tensor.cpu().numpy()
92 |     unique_np = np.unique(tensor_np)
93 |     unique_tensor = torch.from_numpy(unique_np)
94 |
95 |     tensor_res = tensor.new(unique_tensor.shape)
96 |     tensor_res.copy_(unique_tensor)
97 |     return tensor_res
98 |
99 | def write_results(prediction, confidence, num_classes, nms = True, nms_conf = 0.4):
100 |     conf_mask = (prediction[:,:,4] > confidence).float().unsqueeze(2)
101 |     prediction = prediction*conf_mask
102 |
103 |
104 |     try:
105 |         ind_nz = torch.nonzero(prediction[:,:,4]).transpose(0,1).contiguous()
106 |     except:
107 |         return 0
108 |
109 |
110 |     box_a = prediction.new(prediction.shape)
111 |     box_a[:,:,0] = (prediction[:,:,0] - prediction[:,:,2]/2)
112 |     box_a[:,:,1] = (prediction[:,:,1] - prediction[:,:,3]/2)
113 |     box_a[:,:,2] = (prediction[:,:,0] + prediction[:,:,2]/2)
114 |     box_a[:,:,3] = (prediction[:,:,1] + prediction[:,:,3]/2)
115 |     prediction[:,:,:4] = box_a[:,:,:4]
116 |
117 |
118 |
119 |     batch_size = prediction.size(0)
120 |
121 |     output = prediction.new(1, prediction.size(2) + 1)
122 |     write = False
123 |
124 |
125 |     for ind in range(batch_size):
126 |         #select the image from the batch
127 |         image_pred = prediction[ind]
128 |
129 |
130 |
131 |         #Get the class having maximum score, and the index of that class
132 |         #Get rid of the num_classes individual class scores
133 |         #Add the class index and the class score of class having maximum score
134 |         max_conf, max_conf_score = torch.max(image_pred[:,5:5+ num_classes], 1)
135 |         max_conf = max_conf.float().unsqueeze(1)
136
| max_conf_score = max_conf_score.float().unsqueeze(1) 137 | seq = (image_pred[:,:5], max_conf, max_conf_score) 138 | image_pred = torch.cat(seq, 1) 139 | 140 | 141 | 142 | #Get rid of the zero entries 143 | non_zero_ind = (torch.nonzero(image_pred[:,4])) 144 | 145 | 146 | image_pred_ = image_pred[non_zero_ind.squeeze(),:].view(-1,7) 147 | 148 | #Get the various classes detected in the image 149 | try: 150 | img_classes = unique(image_pred_[:,-1]) 151 | except: 152 | continue 153 | #WE will do NMS classwise 154 | for cls in img_classes: 155 | #get the detections with one particular class 156 | cls_mask = image_pred_*(image_pred_[:,-1] == cls).float().unsqueeze(1) 157 | class_mask_ind = torch.nonzero(cls_mask[:,-2]).squeeze() 158 | 159 | 160 | image_pred_class = image_pred_[class_mask_ind].view(-1,7) 161 | 162 | 163 | 164 | #sort the detections such that the entry with the maximum objectness 165 | #confidence is at the top 166 | conf_sort_index = torch.sort(image_pred_class[:,4], descending = True )[1] 167 | image_pred_class = image_pred_class[conf_sort_index] 168 | idx = image_pred_class.size(0) 169 | 170 | #if nms has to be done 171 | if nms: 172 | #For each detection 173 | for i in range(idx): 174 | #Get the IOUs of all boxes that come after the one we are looking at 175 | #in the loop 176 | try: 177 | ious = bbox_iou(image_pred_class[i].unsqueeze(0), image_pred_class[i+1:]) 178 | except ValueError: 179 | break 180 | 181 | except IndexError: 182 | break 183 | 184 | #Zero out all the detections that have IoU > treshhold 185 | iou_mask = (ious < nms_conf).float().unsqueeze(1) 186 | image_pred_class[i+1:] *= iou_mask 187 | 188 | #Remove the non-zero entries 189 | non_zero_ind = torch.nonzero(image_pred_class[:,4]).squeeze() 190 | image_pred_class = image_pred_class[non_zero_ind].view(-1,7) 191 | 192 | 193 | 194 | #Concatenate the batch_id of the image to the detection 195 | #this helps us identify which image does the detection correspond to 196 | #We use a linear straucture to hold ALL the detections from the batch 197 | #the batch_dim is flattened 198 | #batch is identified by extra batch column 199 | 200 | 201 | batch_ind = image_pred_class.new(image_pred_class.size(0), 1).fill_(ind) 202 | seq = batch_ind, image_pred_class 203 | if not write: 204 | output = torch.cat(seq,1) 205 | write = True 206 | else: 207 | out = torch.cat(seq,1) 208 | output = torch.cat((output,out)) 209 | 210 | return output 211 | 212 | #!/usr/bin/env python3 213 | # -*- coding: utf-8 -*- 214 | """ 215 | Created on Sat Mar 24 00:12:16 2018 216 | 217 | @author: ayooshmac 218 | """ 219 | 220 | def predict_transform_half(prediction, inp_dim, anchors, num_classes, CUDA = True): 221 | batch_size = prediction.size(0) 222 | stride = inp_dim // prediction.size(2) 223 | 224 | bbox_attrs = 5 + num_classes 225 | num_anchors = len(anchors) 226 | grid_size = inp_dim // stride 227 | 228 | 229 | prediction = prediction.view(batch_size, bbox_attrs*num_anchors, grid_size*grid_size) 230 | prediction = prediction.transpose(1,2).contiguous() 231 | prediction = prediction.view(batch_size, grid_size*grid_size*num_anchors, bbox_attrs) 232 | 233 | 234 | #Sigmoid the centre_X, centre_Y. 
and object confidencce 235 | prediction[:,:,0] = torch.sigmoid(prediction[:,:,0]) 236 | prediction[:,:,1] = torch.sigmoid(prediction[:,:,1]) 237 | prediction[:,:,4] = torch.sigmoid(prediction[:,:,4]) 238 | 239 | 240 | #Add the center offsets 241 | grid_len = np.arange(grid_size) 242 | a,b = np.meshgrid(grid_len, grid_len) 243 | 244 | x_offset = torch.FloatTensor(a).view(-1,1) 245 | y_offset = torch.FloatTensor(b).view(-1,1) 246 | 247 | if CUDA: 248 | x_offset = x_offset.cuda().half() 249 | y_offset = y_offset.cuda().half() 250 | 251 | x_y_offset = torch.cat((x_offset, y_offset), 1).repeat(1,num_anchors).view(-1,2).unsqueeze(0) 252 | 253 | prediction[:,:,:2] += x_y_offset 254 | 255 | #log space transform height and the width 256 | anchors = torch.HalfTensor(anchors) 257 | 258 | if CUDA: 259 | anchors = anchors.cuda() 260 | 261 | anchors = anchors.repeat(grid_size*grid_size, 1).unsqueeze(0) 262 | prediction[:,:,2:4] = torch.exp(prediction[:,:,2:4])*anchors 263 | 264 | #Softmax the class scores 265 | prediction[:,:,5: 5 + num_classes] = nn.Softmax(-1)(Variable(prediction[:,:, 5 : 5 + num_classes])).data 266 | 267 | prediction[:,:,:4] *= stride 268 | 269 | 270 | return prediction 271 | 272 | 273 | def write_results_half(prediction, confidence, num_classes, nms = True, nms_conf = 0.4): 274 | conf_mask = (prediction[:,:,4] > confidence).half().unsqueeze(2) 275 | prediction = prediction*conf_mask 276 | 277 | try: 278 | ind_nz = torch.nonzero(prediction[:,:,4]).transpose(0,1).contiguous() 279 | except: 280 | return 0 281 | 282 | 283 | 284 | box_a = prediction.new(prediction.shape) 285 | box_a[:,:,0] = (prediction[:,:,0] - prediction[:,:,2]/2) 286 | box_a[:,:,1] = (prediction[:,:,1] - prediction[:,:,3]/2) 287 | box_a[:,:,2] = (prediction[:,:,0] + prediction[:,:,2]/2) 288 | box_a[:,:,3] = (prediction[:,:,1] + prediction[:,:,3]/2) 289 | prediction[:,:,:4] = box_a[:,:,:4] 290 | 291 | 292 | 293 | batch_size = prediction.size(0) 294 | 295 | output = prediction.new(1, prediction.size(2) + 1) 296 | write = False 297 | 298 | for ind in range(batch_size): 299 | #select the image from the batch 300 | image_pred = prediction[ind] 301 | 302 | 303 | #Get the class having maximum score, and the index of that class 304 | #Get rid of num_classes softmax scores 305 | #Add the class index and the class score of class having maximum score 306 | max_conf, max_conf_score = torch.max(image_pred[:,5:5+ num_classes], 1) 307 | max_conf = max_conf.half().unsqueeze(1) 308 | max_conf_score = max_conf_score.half().unsqueeze(1) 309 | seq = (image_pred[:,:5], max_conf, max_conf_score) 310 | image_pred = torch.cat(seq, 1) 311 | 312 | 313 | #Get rid of the zero entries 314 | non_zero_ind = (torch.nonzero(image_pred[:,4])) 315 | try: 316 | image_pred_ = image_pred[non_zero_ind.squeeze(),:] 317 | except: 318 | continue 319 | 320 | #Get the various classes detected in the image 321 | img_classes = unique(image_pred_[:,-1].long()).half() 322 | 323 | 324 | 325 | 326 | #WE will do NMS classwise 327 | for cls in img_classes: 328 | #get the detections with one particular class 329 | cls_mask = image_pred_*(image_pred_[:,-1] == cls).half().unsqueeze(1) 330 | class_mask_ind = torch.nonzero(cls_mask[:,-2]).squeeze() 331 | 332 | 333 | image_pred_class = image_pred_[class_mask_ind] 334 | 335 | 336 | #sort the detections such that the entry with the maximum objectness 337 | #confidence is at the top 338 | conf_sort_index = torch.sort(image_pred_class[:,4], descending = True )[1] 339 | image_pred_class = image_pred_class[conf_sort_index] 340 | 
idx = image_pred_class.size(0) 341 | 342 | #if nms has to be done 343 | if nms: 344 | #For each detection 345 | for i in range(idx): 346 | #Get the IOUs of all boxes that come after the one we are looking at 347 | #in the loop 348 | try: 349 | ious = bbox_iou(image_pred_class[i].unsqueeze(0), image_pred_class[i+1:]) 350 | except ValueError: 351 | break 352 | 353 | except IndexError: 354 | break 355 | 356 | #Zero out all the detections that have IoU > treshhold 357 | iou_mask = (ious < nms_conf).half().unsqueeze(1) 358 | image_pred_class[i+1:] *= iou_mask 359 | 360 | #Remove the non-zero entries 361 | non_zero_ind = torch.nonzero(image_pred_class[:,4]).squeeze() 362 | image_pred_class = image_pred_class[non_zero_ind] 363 | 364 | 365 | 366 | #Concatenate the batch_id of the image to the detection 367 | #this helps us identify which image does the detection correspond to 368 | #We use a linear straucture to hold ALL the detections from the batch 369 | #the batch_dim is flattened 370 | #batch is identified by extra batch column 371 | batch_ind = image_pred_class.new(image_pred_class.size(0), 1).fill_(ind) 372 | seq = batch_ind, image_pred_class 373 | 374 | if not write: 375 | output = torch.cat(seq,1) 376 | write = True 377 | else: 378 | out = torch.cat(seq,1) 379 | output = torch.cat((output,out)) 380 | 381 | return output 382 | -------------------------------------------------------------------------------- /video_demo.py: -------------------------------------------------------------------------------- 1 | from __future__ import division 2 | import time 3 | import torch 4 | import torch.nn as nn 5 | from torch.autograd import Variable 6 | import numpy as np 7 | import cv2 8 | from util import * 9 | from darknet import Darknet 10 | from preprocess import prep_image, inp_to_image, letterbox_image 11 | import pandas as pd 12 | import random 13 | import pickle as pkl 14 | import argparse 15 | 16 | 17 | def get_test_input(input_dim, CUDA): 18 | img = cv2.imread("dog-cycle-car.png") 19 | img = cv2.resize(img, (input_dim, input_dim)) 20 | img_ = img[:,:,::-1].transpose((2,0,1)) 21 | img_ = img_[np.newaxis,:,:,:]/255.0 22 | img_ = torch.from_numpy(img_).float() 23 | img_ = Variable(img_) 24 | 25 | if CUDA: 26 | img_ = img_.cuda() 27 | 28 | return img_ 29 | 30 | def prep_image(img, inp_dim): 31 | """ 32 | Prepare image for inputting to the neural network. 
33 | 34 | Returns a Variable 35 | """ 36 | 37 | orig_im = img 38 | dim = orig_im.shape[1], orig_im.shape[0] 39 | img = (letterbox_image(orig_im, (inp_dim, inp_dim))) 40 | img_ = img[:,:,::-1].transpose((2,0,1)).copy() 41 | img_ = torch.from_numpy(img_).float().div(255.0).unsqueeze(0) 42 | return img_, orig_im, dim 43 | 44 | def write(x, img): 45 | c1 = tuple(x[1:3].int()) 46 | c2 = tuple(x[3:5].int()) 47 | cls = int(x[-1]) 48 | label = "{0}".format(classes[cls]) 49 | color = random.choice(colors) 50 | cv2.rectangle(img, c1, c2,color, 1) 51 | t_size = cv2.getTextSize(label, cv2.FONT_HERSHEY_PLAIN, 1 , 1)[0] 52 | c2 = c1[0] + t_size[0] + 3, c1[1] + t_size[1] + 4 53 | cv2.rectangle(img, c1, c2,color, -1) 54 | cv2.putText(img, label, (c1[0], c1[1] + t_size[1] + 4), cv2.FONT_HERSHEY_PLAIN, 1, [225,255,255], 1); 55 | return img 56 | 57 | def arg_parse(): 58 | """ 59 | Parse arguements to the detect module 60 | 61 | """ 62 | 63 | 64 | parser = argparse.ArgumentParser(description='YOLO v3 Video Detection Module') 65 | 66 | parser.add_argument("--video", dest = 'video', help = 67 | "Video to run detection upon", 68 | default = "video.avi", type = str) 69 | parser.add_argument("--dataset", dest = "dataset", help = "Dataset on which the network has been trained", default = "pascal") 70 | parser.add_argument("--confidence", dest = "confidence", help = "Object Confidence to filter predictions", default = 0.5) 71 | parser.add_argument("--nms_thresh", dest = "nms_thresh", help = "NMS Threshhold", default = 0.4) 72 | parser.add_argument("--cfg", dest = 'cfgfile', help = 73 | "Config file", 74 | default = "cfg/yolov3.cfg", type = str) 75 | parser.add_argument("--weights", dest = 'weightsfile', help = 76 | "weightsfile", 77 | default = "yolov3.weights", type = str) 78 | parser.add_argument("--reso", dest = 'reso', help = 79 | "Input resolution of the network. Increase to increase accuracy. 
Decrease to increase speed", 80 | default = "416", type = str) 81 | return parser.parse_args() 82 | 83 | 84 | if __name__ == '__main__': 85 | args = arg_parse() 86 | confidence = float(args.confidence) 87 | nms_thesh = float(args.nms_thresh) 88 | start = 0 89 | 90 | CUDA = torch.cuda.is_available() 91 | 92 | num_classes = 80 93 | 94 | CUDA = torch.cuda.is_available() 95 | 96 | bbox_attrs = 5 + num_classes 97 | 98 | print("Loading network.....") 99 | model = Darknet(args.cfgfile) 100 | model.load_weights(args.weightsfile) 101 | print("Network successfully loaded") 102 | 103 | model.net_info["height"] = args.reso 104 | inp_dim = int(model.net_info["height"]) 105 | assert inp_dim % 32 == 0 106 | assert inp_dim > 32 107 | 108 | if CUDA: 109 | model.cuda() 110 | 111 | model(get_test_input(inp_dim, CUDA), CUDA) 112 | 113 | model.eval() 114 | 115 | videofile = args.video 116 | 117 | cap = cv2.VideoCapture(videofile) 118 | 119 | assert cap.isOpened(), 'Cannot capture source' 120 | 121 | frames = 0 122 | start = time.time() 123 | while cap.isOpened(): 124 | 125 | ret, frame = cap.read() 126 | if ret: 127 | 128 | 129 | img, orig_im, dim = prep_image(frame, inp_dim) 130 | 131 | im_dim = torch.FloatTensor(dim).repeat(1,2) 132 | 133 | 134 | if CUDA: 135 | im_dim = im_dim.cuda() 136 | img = img.cuda() 137 | 138 | with torch.no_grad(): 139 | output = model(Variable(img), CUDA) 140 | output = write_results(output, confidence, num_classes, nms = True, nms_conf = nms_thesh) 141 | 142 | if type(output) == int: 143 | frames += 1 144 | print("FPS of the video is {:5.2f}".format( frames / (time.time() - start))) 145 | cv2.imshow("frame", orig_im) 146 | key = cv2.waitKey(1) 147 | if key & 0xFF == ord('q'): 148 | break 149 | continue 150 | 151 | 152 | 153 | 154 | im_dim = im_dim.repeat(output.size(0), 1) 155 | scaling_factor = torch.min(inp_dim/im_dim,1)[0].view(-1,1) 156 | 157 | output[:,[1,3]] -= (inp_dim - scaling_factor*im_dim[:,0].view(-1,1))/2 158 | output[:,[2,4]] -= (inp_dim - scaling_factor*im_dim[:,1].view(-1,1))/2 159 | 160 | output[:,1:5] /= scaling_factor 161 | 162 | for i in range(output.shape[0]): 163 | output[i, [1,3]] = torch.clamp(output[i, [1,3]], 0.0, im_dim[i,0]) 164 | output[i, [2,4]] = torch.clamp(output[i, [2,4]], 0.0, im_dim[i,1]) 165 | 166 | classes = load_classes('data/coco.names') 167 | colors = pkl.load(open("pallete", "rb")) 168 | 169 | list(map(lambda x: write(x, orig_im), output)) 170 | 171 | 172 | cv2.imshow("frame", orig_im) 173 | key = cv2.waitKey(1) 174 | if key & 0xFF == ord('q'): 175 | break 176 | frames += 1 177 | print("FPS of the video is {:5.2f}".format( frames / (time.time() - start))) 178 | 179 | 180 | else: 181 | break 182 | 183 | 184 | 185 | 186 | 187 | -------------------------------------------------------------------------------- /video_demo_half.py: -------------------------------------------------------------------------------- 1 | from __future__ import division 2 | import time 3 | import torch 4 | import torch.nn as nn 5 | from torch.autograd import Variable 6 | import numpy as np 7 | import cv2 8 | from util import * 9 | from darknet import Darknet 10 | from preprocess import prep_image, inp_to_image, letterbox_image 11 | import pandas as pd 12 | import random 13 | import pickle as pkl 14 | import argparse 15 | 16 | 17 | def get_test_input(input_dim, CUDA): 18 | img = cv2.imread("dog-cycle-car.png") 19 | img = cv2.resize(img, (input_dim, input_dim)) 20 | img_ = img[:,:,::-1].transpose((2,0,1)) 21 | img_ = img_[np.newaxis,:,:,:]/255.0 22 | img_ = 
torch.from_numpy(img_).float()
23 |     img_ = Variable(img_)
24 |
25 |     if CUDA:
26 |         img_ = img_.cuda()
27 |
28 |     return img_
29 |
30 | def prep_image(img, inp_dim):
31 |     """
32 |     Prepare image for inputting to the neural network.
33 |
34 |     Returns a Variable
35 |     """
36 |
37 |     orig_im = img
38 |     dim = orig_im.shape[1], orig_im.shape[0]
39 |     img = (letterbox_image(orig_im, (inp_dim, inp_dim)))
40 |     img_ = img[:,:,::-1].transpose((2,0,1)).copy()
41 |     img_ = torch.from_numpy(img_).float().div(255.0).unsqueeze(0)
42 |     return img_, orig_im, dim
43 |
44 | def write(x, img):
45 |     c1 = tuple(x[1:3].int())
46 |     c2 = tuple(x[3:5].int())
47 |     cls = int(x[-1])
48 |     label = "{0}".format(classes[cls])
49 |     color = random.choice(colors)
50 |     cv2.rectangle(img, c1, c2,color, 1)
51 |     t_size = cv2.getTextSize(label, cv2.FONT_HERSHEY_PLAIN, 1 , 1)[0]
52 |     c2 = c1[0] + t_size[0] + 3, c1[1] + t_size[1] + 4
53 |     cv2.rectangle(img, c1, c2,color, -1)
54 |     cv2.putText(img, label, (c1[0], c1[1] + t_size[1] + 4), cv2.FONT_HERSHEY_PLAIN, 1, [225,255,255], 1)
55 |     return img
56 |
57 | def arg_parse():
58 |     """
59 |     Parse arguments to the video detection module
60 |
61 |     """
62 |
63 |
64 |     parser = argparse.ArgumentParser(description='YOLO v3 Video Detection Module (half precision)')
65 |
66 |     parser.add_argument("--video", dest = 'video', help =
67 |                         "Video to run detection upon",
68 |                         default = "video.avi", type = str)
69 |     parser.add_argument("--dataset", dest = "dataset", help = "Dataset on which the network has been trained", default = "pascal")
70 |     parser.add_argument("--confidence", dest = "confidence", help = "Object Confidence to filter predictions", default = 0.5)
71 |     parser.add_argument("--nms_thresh", dest = "nms_thresh", help = "NMS Threshold", default = 0.4)
72 |     parser.add_argument("--cfg", dest = 'cfgfile', help =
73 |                         "Config file",
74 |                         default = "cfg/yolov3.cfg", type = str)
75 |     parser.add_argument("--weights", dest = 'weightsfile', help =
76 |                         "weightsfile",
77 |                         default = "yolov3.weights", type = str)
78 |     parser.add_argument("--reso", dest = 'reso', help =
79 |                         "Input resolution of the network. Increase to increase accuracy. Decrease to increase speed",
80 |                         default = "416", type = str)
81 |     return parser.parse_args()
82 |
83 |
84 | if __name__ == '__main__':
85 |     args = arg_parse()
86 |     confidence = float(args.confidence)
87 |     nms_thesh = float(args.nms_thresh)
88 |     start = 0
89 |
90 |     CUDA = torch.cuda.is_available()
91 |
92 |
93 |
94 |
95 |     num_classes = 80
96 |     bbox_attrs = 5 + num_classes
97 |
98 |     print("Loading network.....")
99 |     model = Darknet(args.cfgfile)
100 |     model.load_weights(args.weightsfile)
101 |     print("Network successfully loaded")
102 |
103 |     model.net_info["height"] = args.reso
104 |     inp_dim = int(model.net_info["height"])
105 |     assert inp_dim % 32 == 0
106 |     assert inp_dim > 32
107 |
108 |
109 |     if CUDA:
110 |         model.cuda().half()
111 |
112 |     model(get_test_input(inp_dim, CUDA), CUDA)
113 |
114 |     model.eval()
115 |
116 |     videofile = args.video
117 |
118 |     cap = cv2.VideoCapture(videofile)
119 |
120 |     assert cap.isOpened(), 'Cannot capture source'
121 |
122 |     frames = 0
123 |     start = time.time()
124 |     while cap.isOpened():
125 |
126 |         ret, frame = cap.read()
127 |         if ret:
128 |
129 |
130 |             img, orig_im, dim = prep_image(frame, inp_dim)
131 |
132 |             im_dim = torch.FloatTensor(dim).repeat(1,2)
133 |
134 |
135 |             if CUDA:
136 |                 img = img.cuda().half()
137 |                 im_dim = im_dim.half().cuda()
138 |                 write_results = write_results_half
139 |                 predict_transform = predict_transform_half
140 |
141 |
142 |             output = model(Variable(img, volatile = True), CUDA)
143 |             output = write_results(output, confidence, num_classes, nms = True, nms_conf = nms_thesh)
144 |
145 |
146 |             if type(output) == int:
147 |                 frames += 1
148 |                 print("FPS of the video is {:5.2f}".format( frames / (time.time() - start)))
149 |                 cv2.imshow("frame", orig_im)
150 |                 key = cv2.waitKey(1)
151 |                 if key & 0xFF == ord('q'):
152 |                     break
153 |                 continue
154 |
155 |
156 |             im_dim = im_dim.repeat(output.size(0), 1)
157 |             scaling_factor = torch.min(inp_dim/im_dim,1)[0].view(-1,1)
158 |
159 |             output[:,[1,3]] -= (inp_dim - scaling_factor*im_dim[:,0].view(-1,1))/2
160 |             output[:,[2,4]] -= (inp_dim - scaling_factor*im_dim[:,1].view(-1,1))/2
161 |
162 |             output[:,1:5] /= scaling_factor
163 |
164 |             for i in range(output.shape[0]):
165 |                 output[i, [1,3]] = torch.clamp(output[i, [1,3]], 0.0, im_dim[i,0])
166 |                 output[i, [2,4]] = torch.clamp(output[i, [2,4]], 0.0, im_dim[i,1])
167 |
168 |
169 |             classes = load_classes('data/coco.names')
170 |             colors = pkl.load(open("pallete", "rb"))
171 |
172 |             list(map(lambda x: write(x, orig_im), output))
173 |
174 |
175 |             cv2.imshow("frame", orig_im)
176 |             key = cv2.waitKey(1)
177 |             if key & 0xFF == ord('q'):
178 |                 break
179 |             frames += 1
180 |             print("FPS of the video is {:5.2f}".format( frames / (time.time() - start)))
181 |
182 |
183 |         else:
184 |             break
185 |
186 |
187 |
188 |
189 |
190 |
--------------------------------------------------------------------------------