├── README.md
├── images
│   └── model.jpg
└── model.py

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Code for "Leveraging Self-Supervision for Cross-Domain Crowd Counting" (CVPR 2022)

This repository is a PyTorch implementation of the paper **Leveraging Self-Supervision for Cross-Domain Crowd Counting**, which was accepted as an **oral** presentation at CVPR 2022. If you use this code in your research, please cite
[the paper](https://arxiv.org/pdf/2103.16291.pdf).


## Abstract
State-of-the-art methods for counting people in crowded scenes rely on deep networks to estimate crowd density. While effective, these data-driven approaches need large amounts of annotated data to achieve good performance, which prevents them from being deployed in emergencies, where data annotation is either too costly or cannot be obtained fast enough.

One popular solution is to use synthetic data for training. Unfortunately, due to domain shift, the resulting models generalize poorly on real imagery. We remedy this shortcoming by training with both synthetic images, along with their associated labels, and unlabeled real images. To this end, we force our network to learn perspective-aware features by training it to distinguish upside-down real images from regular ones, and we incorporate into it the ability to predict its own uncertainty so that it can generate useful pseudo labels for fine-tuning. This yields an algorithm that consistently outperforms state-of-the-art cross-domain crowd counting methods without any extra computation at inference time.


![](./images/model.jpg)
Figure 1: **Two-stage approach**. **Top:** During the first training stage, we use synthetic images, real images, and flipped versions of the latter. The network is trained to output the correct people density for the synthetic images and to classify the real images as being flipped or not. **Bottom:** During the second training stage, we use synthetic and real images. We run the previously trained network on the real images and treat the least uncertain people density estimates as pseudo labels. We then fine-tune the network on both kinds of images and iterate the process.

## Installation

Please refer to this [page](https://github.com/nikitadurasov/masksembles) for the installation of [Masksembles](https://arxiv.org/abs/2012.08334).

For other packages, please refer to this [page](https://github.com/weizheliu/People-Flows).


## Implementation

Please refer to model.py.

## How to Use
The code is managed in the same way as my previous work; please refer to this [code](https://github.com/weizheliu/People-Flows) for detailed information. You only need to replace the "model.py" file with the one attached here and update the other information (model name, dataset, etc.) accordingly. An illustrative sketch of the second-stage pseudo-label selection is given further below.


## Citing

```
@InProceedings{Liu_2022_CVPR,
  author    = {Liu, Weizhe and Durasov, Nikita and Fua, Pascal},
  title     = {Leveraging Self-Supervision for Cross-Domain Crowd Counting},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month     = {June},
  year      = {2022}
}
```

## Note

I have already left EPFL and therefore no longer have access to my previous code/data. This is a quick re-implementation from memory, and I do not have the machine/data to test it. If you find any bugs or typos in this code, please let me know.
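## Pseudo-Label Selection (Illustrative Sketch)

The second training stage in Figure 1 keeps only the least uncertain density estimates on real images as pseudo labels. The original training script is not part of this repository, so the snippet below is only a minimal sketch of one plausible way to realize that step with the Masksembles-based model in model.py: each real image is repeated once per mask so that the per-mask predictions form a small ensemble, their per-pixel variance serves as the uncertainty estimate, and only the most confident fraction of pixels is kept. Names such as `select_pseudo_labels` and `keep_ratio` are illustrative, not part of the original code.

```python
import torch

@torch.no_grad()
def select_pseudo_labels(model, real_image, n_masks=3, keep_ratio=0.5):
    """Return a pseudo density map and a mask of its most confident pixels.

    real_image: tensor of shape [1, 3, H, W].
    n_masks:    number of Masksembles masks (3 in model.py); the image is
                repeated n_masks times so each copy is routed through a
                different mask.
    keep_ratio: fraction of pixels with the lowest predictive variance
                that are kept as pseudo labels.
    """
    model.eval()
    batch = real_image.repeat(n_masks, 1, 1, 1)   # one copy per mask
    density, _ = model(batch)                     # [n_masks, 1, H, W]

    mean = density.mean(dim=0, keepdim=True)      # ensemble mean -> pseudo label
    var = density.var(dim=0, keepdim=True)        # per-pixel uncertainty

    threshold = torch.quantile(var.flatten(), keep_ratio)
    confident = var <= threshold                  # keep the least uncertain pixels
    return mean, confident
```

In the full pipeline, the confident pixels would contribute a density loss on real images during fine-tuning, alongside the fully supervised loss on synthetic images, and the whole process would be iterated as described in Figure 1.
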
## Contact

For any questions regarding this paper/code, please contact [Weizhe Liu](mailto:weizheliu1991@163.ch) directly.

--------------------------------------------------------------------------------
/images/model.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/weizheliu/Cross-Domain-Crowd-Counting/e65ff15b9b96341ed1b3b146bc80974c1f53d336/images/model.jpg

--------------------------------------------------------------------------------
/model.py:
--------------------------------------------------------------------------------
import torch.nn as nn
import torch
from torch.nn import functional as F
from torchvision import models
# utilities and layers from the People-Flows code base (see README)
from utils import save_net, load_net
from layer import convDU, convLR
from masksembles.torch import Masksembles2D


class SFCN(nn.Module):
    def __init__(self, load_weights=False):
        super(SFCN, self).__init__()
        self.seen = 0
        # VGG-16 front end (first 10 convolutional layers)
        self.frontend_feat = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M', 512, 512, 512]
        # density branch: a Masksembles layer is inserted after the first back-end conv
        self.backend_feat = [512, 'MASK', 512, 512, 256, 128, 64]
        # classification branch (flipped vs. upright) shares the front end
        self.backend_feat2 = [512, 512, 512, 256, 128, 64]
        self.frontend = make_layers(self.frontend_feat)
        self.backend = make_layers(self.backend_feat, in_channels=512, batch_norm=False, dilation=True)
        self.backend2 = make_layers(self.backend_feat2, in_channels=512, batch_norm=False, dilation=True)
        self.adpool = nn.AdaptiveAvgPool2d((96, 128))
        self.fc = nn.Linear(64 * 96 * 128, 2)
        self.convDU = convDU(in_out_channels=64, kernel_size=(1, 9))
        self.convLR = convLR(in_out_channels=64, kernel_size=(9, 1))
        self.output_layer = nn.Sequential(nn.Conv2d(64, 1, kernel_size=1), nn.ReLU())
        if not load_weights:
            mod = models.vgg16(pretrained=True)
            self._initialize_weights()
            # address the mismatch in key names: drop the "features." prefix from the VGG-16 keys
            pretrained_dict = {k[9:]: v for k, v in mod.state_dict().items() if k[9:] in self.frontend.state_dict()}
            self.frontend.load_state_dict(pretrained_dict)

    def forward(self, x):
        x_share = self.frontend(x)

        # density estimation branch
        x = self.backend(x_share)
        x = self.convDU(x)
        x = self.convLR(x)
        x = self.output_layer(x)
        # F.upsample is deprecated; F.interpolate is its current equivalent
        x = F.interpolate(x, scale_factor=8)

        # self-supervised classification branch (flipped vs. upright real images)
        x_class = self.backend2(x_share)
        x_class = self.adpool(x_class)
        x_class = torch.flatten(x_class, 1)
        x_class = self.fc(x_class)
        return x, x_class

    def _initialize_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.normal_(m.weight, std=0.01)
                if m.bias is not None:
                    nn.init.constant_(m.bias, 0)
            elif isinstance(m, nn.BatchNorm2d):
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)


def make_layers(cfg, in_channels=3, batch_norm=False, dilation=False):
    if dilation:
        d_rate = 2
    else:
        d_rate = 1
    layers = []
    for v in cfg:
        if v == 'M':
            layers += [nn.MaxPool2d(kernel_size=2, stride=2)]
        elif v == 'MASK':
            # Masksembles layer: 512 channels, 3 masks, scale 2.0
            layers += [Masksembles2D(512, 3, 2.0)]
        else:
            conv2d = nn.Conv2d(in_channels, v, kernel_size=3, padding=d_rate, dilation=d_rate)
            if batch_norm:
                layers += [conv2d, nn.BatchNorm2d(v), nn.ReLU(inplace=True)]
            else:
                layers += [conv2d, nn.ReLU(inplace=True)]
            in_channels = v
    return nn.Sequential(*layers)
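
# ---------------------------------------------------------------------------
# Illustrative sanity check (not part of the original repository): a minimal
# sketch of how the two-headed model could be exercised. The Masksembles2D
# layer above uses 3 masks, so the batch size must be a multiple of 3, and
# load_weights=True is passed only to skip the VGG-16 download (naming kept
# from the original code). The input size below is arbitrary.
if __name__ == '__main__':
    model = SFCN(load_weights=True)
    dummy = torch.randn(3, 3, 256, 256)   # batch of 3 RGB images
    density, flip_logits = model(dummy)
    print(density.shape)       # expected: [3, 1, 256, 256] density maps
    print(flip_logits.shape)   # expected: [3, 2] flipped / upright logits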
--------------------------------------------------------------------------------