├── .gitignore ├── 202103.md ├── 202104.md ├── 202106.md ├── 202107.md ├── 202108.md ├── 202109.md ├── LICENSE └── README.md /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | .eggs/ 17 | lib/ 18 | lib64/ 19 | parts/ 20 | sdist/ 21 | var/ 22 | wheels/ 23 | pip-wheel-metadata/ 24 | share/python-wheels/ 25 | *.egg-info/ 26 | .installed.cfg 27 | *.egg 28 | MANIFEST 29 | 30 | # PyInstaller 31 | # Usually these files are written by a python script from a template 32 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 33 | *.manifest 34 | *.spec 35 | 36 | # Installer logs 37 | pip-log.txt 38 | pip-delete-this-directory.txt 39 | 40 | # Unit test / coverage reports 41 | htmlcov/ 42 | .tox/ 43 | .nox/ 44 | .coverage 45 | .coverage.* 46 | .cache 47 | nosetests.xml 48 | coverage.xml 49 | *.cover 50 | *.py,cover 51 | .hypothesis/ 52 | .pytest_cache/ 53 | 54 | # Translations 55 | *.mo 56 | *.pot 57 | 58 | # Django stuff: 59 | *.log 60 | local_settings.py 61 | db.sqlite3 62 | db.sqlite3-journal 63 | 64 | # Flask stuff: 65 | instance/ 66 | .webassets-cache 67 | 68 | # Scrapy stuff: 69 | .scrapy 70 | 71 | # Sphinx documentation 72 | docs/_build/ 73 | 74 | # PyBuilder 75 | target/ 76 | 77 | # Jupyter Notebook 78 | .ipynb_checkpoints 79 | 80 | # IPython 81 | profile_default/ 82 | ipython_config.py 83 | 84 | # pyenv 85 | .python-version 86 | 87 | # pipenv 88 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. 
89 | # However, in case of collaboration, if having platform-specific dependencies or dependencies 90 | # having no cross-platform support, pipenv may install dependencies that don't work, or not 91 | # install all needed dependencies. 92 | #Pipfile.lock 93 | 94 | # PEP 582; used by e.g. github.com/David-OConnor/pyflow 95 | __pypackages__/ 96 | 97 | # Celery stuff 98 | celerybeat-schedule 99 | celerybeat.pid 100 | 101 | # SageMath parsed files 102 | *.sage.py 103 | 104 | # Environments 105 | .env 106 | .venv 107 | env/ 108 | venv/ 109 | ENV/ 110 | env.bak/ 111 | venv.bak/ 112 | 113 | # Spyder project settings 114 | .spyderproject 115 | .spyproject 116 | 117 | # Rope project settings 118 | .ropeproject 119 | 120 | # mkdocs documentation 121 | /site 122 | 123 | # mypy 124 | .mypy_cache/ 125 | .dmypy.json 126 | dmypy.json 127 | 128 | # Pyre type checker 129 | .pyre/ 130 | -------------------------------------------------------------------------------- /202103.md: -------------------------------------------------------------------------------- 1 | # Arxiv-Daily 2 | My daily arxiv reading notes 3 | 4 | 5 | ## CV (Daily) 6 | #### 20210331 7 | TOP: 8 | * [Broaden Your Views for Self-Supervised Video Learning](https://arxiv.org/pdf/2103.16559.pdf) (Andrew Zisserman) 9 | * [Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers](https://arxiv.org/pdf/2103.16553.pdf) We make the following three contributions. First, we equip transformer-based models with a new fine-grained cross-attention architecture, providing significant improvements in retrieval accuracy whilst preserving scalability. Second, we introduce a generic approach for combining a Fast dual encoder model with our Slow but accurate transformer-based model via distillation and re-ranking. Finally, we validate our approach on the Flickr30K image dataset where we show an increase in inference speed by several orders of magnitude while having results competitive to the state of the art. 
(Andrew Zisserman) (CVPR 2021) 10 | * [Boundary IoU: Improving Object-Centric Image Segmentation Evaluation](https://arxiv.org/pdf/2103.16562.pdf) [code](https://github.com/bowenc0221/boundary-iou-api) A new evaluation metric for image segmentation. We present Boundary IoU (Intersection-over-Union), a new segmentation evaluation measure focused on boundary quality. We perform an extensive analysis across different error types and object sizes and show that Boundary IoU is significantly more sensitive than the standard Mask IoU measure to boundary errors for large objects and does not over-penalize errors on smaller objects. (Ross Girshick, Piotr Dollar) (CVPR21) 11 | * [Fully Convolutional Scene Graph Generation](https://arxiv.org/pdf/2103.16083.pdf) This paper presents a fully convolutional scene graph generation (FCSGG) model that detects objects and relations simultaneously. Unlike prior approaches, FCSGG is a conceptually elegant and efficient bottom-up approach that encodes objects as bounding box center points, and relationships as 2D vector fields which are named as Relation Affinity Fields (RAFs). (CVPR21 Oral) TODO: DETR (inherently end to end) should also work? 12 | 13 | 14 | CVPR21: 15 | * [Distribution Alignment: A Unified Framework for Long-tail Visual Recognition](https://arxiv.org/pdf/2103.16370.pdf) Specifically, we develop an adaptive calibration function that enables us to adjust the classification scores for each data point. (Dynamic network.) We then introduce a generalized re-weight method in the two-stage learning to balance the class prior, which provides a flexible and unified solution to diverse scenarios in visual recognition tasks. 
Experiments: four tasks, including image classification, semantic segmentation, object detection, and instance segmentation. [code](https://github.com/Megvii-BaseDetection/DisAlign) (Jian Sun) 16 | * [AGQA: A Benchmark for Compositional Spatio-Temporal Reasoning](https://arxiv.org/pdf/2103.16002.pdf) We present Action Genome Question Answering (AGQA), a new benchmark for compositional spatio-temporal reasoning. Using AGQA, we evaluate modern visual reasoning systems, demonstrating that the best models barely perform better than non-visual baselines exploiting linguistic biases and that none of the existing models generalize to novel compositions unseen during training. 17 | * [Flow-based Kernel Prior with Application to Blind Super-Resolution](https://arxiv.org/pdf/2103.15977.pdf) [code](https://github.com/JingyunLiang/FKP) (Luc Van Gool) 18 | * [TransFill: Reference-guided Image Inpainting by Merging Multiple Color and Spatial Transformations](https://arxiv.org/pdf/2103.15982.pdf) In this paper, we propose TransFill, a multi-homography transformed fusion method to fill the hole by referring to another source image that shares scene contents with the target image. TODO: reference-guided inpainting is well suited to an attention-based implementation. 19 | * [High-fidelity Face Tracking for AR/VR via Deep Lighting Adaptation](https://arxiv.org/pdf/2103.15876.pdf) 20 | * [Read and Attend: Temporal Localisation in Sign Language Videos](https://arxiv.org/pdf/2103.16481.pdf) The objective of this work is to annotate sign instances across a broad vocabulary in continuous sign language. We train a Transformer model to ingest a continuous signing stream and output a sequence of written tokens on a large-scale collection of signing footage with weakly-aligned subtitles. 
(Andrew Zisserman) 21 | * [Depth-conditioned Dynamic Message Propagation for Monocular 3D Object Detection](https://arxiv.org/pdf/2103.16470.pdf) The objective of this paper is to learn context- and depth-aware feature representation to solve the problem of monocular 3D object detection. [code](https://github.com/fudan-zvg/DDMP) 22 | * [CoLA: Weakly-Supervised Temporal Action Localization with Snippet Contrastive Learning](https://arxiv.org/pdf/2103.16392.pdf) Weakly-supervised temporal action localization (WSTAL) aims to localize actions in untrimmed videos with only video-level labels. Incorporates contrastive learning. 23 | * [Contrastive Embedding for Generalized Zero-Shot Learning](https://arxiv.org/pdf/2103.16173.pdf) To tackle this issue, we propose to integrate the generation model with the embedding model, yielding a hybrid GZSL framework. The hybrid GZSL approach maps both the real and the synthetic samples produced by the generation model into an embedding space, where we perform the final GZSL classification. Uses contrastive embedding. [code](https://github.com/Hanzy1996/CE-GZSL) 24 | * [Self-Guided and Cross-Guided Learning for Few-Shot Segmentation](https://arxiv.org/pdf/2103.16129.pdf) [code](https://github.com/zbf1991/SCL) Self + Cross inherently suits a Transformer architecture, but this paper is CNN-based. 25 | * [Graph Stacked Hourglass Networks for 3D Human Pose Estimation](https://arxiv.org/pdf/2103.16385.pdf) 26 | * [Data-Uncertainty Guided Multi-Phase Learning for Semi-Supervised Object Detection](https://arxiv.org/pdf/2103.16368.pdf) Semi-supervised object detection. We propose a data-uncertainty guided multi-phase learning method for semi-supervised object detection. Image uncertainty guided easy data selection and region uncertainty guided RoI Re-weighting are involved in multi-phase learning and enable the detector to concentrate on more certain knowledge. 
27 | * [Complementary Relation Contrastive Distillation](https://arxiv.org/pdf/2103.16367.pdf) Previous approaches focus on either individual representation distillation or inter-sample similarity preservation. While we argue that the inter-sample relation conveys abundant information and needs to be distilled in a more effective way. TODO 28 | * [Locate then Segment: A Strong Pipeline for Referring Image Segmentation](https://arxiv.org/pdf/2103.16284.pdf) Referring image segmentation aims to segment the objects referred by a natural language expression. (Tieniu Tan) 29 | * [Delving into Localization Errors for Monocular 3D Object Detection](https://arxiv.org/pdf/2103.16237.pdf) [code](https://github.com/xinzhuma/monodle) In this work, by intensive diagnosis experiments, we quantify the impact introduced by each sub-task and found the ‘localization error’ is the vital factor in restricting monocular 3D detection. Besides, we also investigate the underlying reasons behind localization errors, analyze the issues they might bring, and propose three strategies. (Wanli Ouyang) 30 | * [Face Forensics in the Wild](https://arxiv.org/pdf/2103.16076.pdf) To take face forgery detection to a new level, we construct a novel large-scale dataset, called FFIW10K, which comprises 10,000 high-quality forgery videos, with an average of three human faces in each frame. [code](https://github.com/tfzhou/FFIW) (Jianbing Shen) 31 | * [DiNTS: Differentiable Neural Network Topology Search for 3D Medical Image Segmentation](https://arxiv.org/pdf/2103.15954.pdf) (Johns Hopkins University) 32 | * [Repopulating Street Scenes](https://arxiv.org/pdf/2103.16183.pdf) We present a framework for automatically reconfiguring images of street scenes by populating, depopulating, or repopulating them with objects such as pedestrians or vehicles. 
(funny) 33 | * [Learnable Graph Matching: Incorporating Graph Partitioning with Deep Feature Learning for Multiple Object Tracking](https://arxiv.org/pdf/2103.16178.pdf) (Naiyan Wang) 34 | * [Model-Contrastive Federated Learning](https://arxiv.org/pdf/2103.16257.pdf) A key challenge in federated learning is to handle the heterogeneity of local data distribution across parties. The key idea of MOON is to utilize the similarity between model representations to correct the local training of individual parties, i.e., conducting contrastive learning in model-level. 35 | 36 | Vision Transformer: 37 | * [Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers](https://arxiv.org/pdf/2103.16553.pdf) (Andrew Zisserman) 38 | * [Kaleido-BERT: Vision-Language Pre-training on Fashion Domain](https://arxiv.org/pdf/2103.16110.pdf) We present a new vision-language (VL) pre-training model dubbed Kaleido-BERT, which introduces a novel kaleido strategy for fashion cross-modality representations from transformers. To this end, we carry out five novel tasks, i.e., rotation, jigsaw, camouflage, grey-to-color, and blank-to-color for self-supervised VL pre-training at patches of different scale. [code](https://github.com/mczhuge/Kaleido-BERT/) (Deng-Ping Fan, Ling Shao) (CVPR21) 39 | * [Augmented Transformer with Adaptive Graph for Temporal Action Proposal Generation](https://arxiv.org/pdf/2103.16024.pdf) Frames its story around problems specific to the field: Most previous works focus on capturing the local temporal context and can well locate simple action instances with clean frames and clear boundaries. However, they generally fail in complicated scenarios where interested actions involve irrelevant frames and background clutters, and the local temporal context becomes less effective. 
(Jiashi Feng) 40 | * [Spatiotemporal Transformer for Video-based Person Re-identification](https://arxiv.org/pdf/2103.16469.pdf) This paper applies the Transformer to video-based person re-identification, where the key issue is to extract the discriminative information from a tracklet. We propose a novel pipeline where the model is pre-trained on a set of synthesized video data and then transferred to the downstream domains with the perception-constrained Spatiotemporal Transformer (STT) module and Global Transformer (GT) module. (Qi Tian) 41 | * [Rethinking Spatial Dimensions of Vision Transformers](https://arxiv.org/pdf/2103.16302.pdf) [code](https://github.com/naver-ai/pit) From the successful design principles of CNN, we investigate the role of the spatial dimension conversion and its effectiveness on the transformer-based architecture. Proposes the Pooling-based Vision Transformer (PiT). Experiments: image classification, object detection and robustness evaluation. Among vision transformers, it is compared only against ViT. 42 | 43 | Domain: 44 | * [Distribution Alignment: A Unified Framework for Long-tail Visual Recognition](https://arxiv.org/pdf/2103.16370.pdf) Specifically, we develop an adaptive calibration function that enables us to adjust the classification scores for each data point. (Dynamic network.) We then introduce a generalized re-weight method in the two-stage learning to balance the class prior, which provides a flexible and unified solution to diverse scenarios in visual recognition tasks. 
Experiments: four tasks, including image classification, semantic segmentation, object detection, and instance segmentation. [code](https://github.com/Megvii-BaseDetection/DisAlign) (Jian Sun) 45 | * [Progressive Domain Expansion Network for Single Domain Generalization](https://arxiv.org/pdf/2103.16050.pdf) [code](https://github.com/lileicv/PDEN) (CVPR21) 46 | * [Adaptive Pseudo-Label Refinement by Negative Ensemble Learning for Source-Free Unsupervised Domain Adaptation](https://arxiv.org/pdf/2103.15973.pdf) 47 | * [Source-Free Domain Adaptation for Semantic Segmentation](https://arxiv.org/pdf/2103.16372.pdf) (CVPR21) 48 | * [Dynamic Domain Adaptation for Efficient Inference](https://arxiv.org/pdf/2103.16403.pdf) [code](https://github.com/BIT-DA/DDA) (CVPR21) 49 | * [Learning Domain Invariant Representations for Generalizable Person Re-Identification](https://arxiv.org/pdf/2103.15890.pdf) 50 | * [Domain-robust VQA with diverse datasets and methods but no target labels](https://arxiv.org/pdf/2103.15974.pdf) (CVPR21) 51 | * [Leveraging Self-Supervision for Cross-Domain Crowd Counting](https://arxiv.org/pdf/2103.16291.pdf) 52 | * [Bilevel Online Adaptation for Out-of-Domain Human Mesh Reconstruction](https://arxiv.org/pdf/2103.16449.pdf) Our general idea is to dynamically finetune the source model on test video streams with additional temporal constraints, such that it can mitigate the domain gaps without over-fitting the 2D information of individual test frames. (Bingbing Ni) (CVPR21) 53 | * [Model-Contrastive Federated Learning](https://arxiv.org/pdf/2103.16257.pdf) A key challenge in federated learning is to handle the heterogeneity of local data distribution across parties. The key idea of MOON is to utilize the similarity between model representations to correct the local training of individual parties, i.e., conducting contrastive learning in model-level. 
(CVPR21) 54 | 55 | 56 | Others: 57 | * [Learning Target Candidate Association to Keep Track of What Not to Track](https://arxiv.org/pdf/2103.16556.pdf) (Luc Van Gool) 58 | * [The Elastic Lottery Ticket Hypothesis](https://arxiv.org/pdf/2103.16547.pdf) (Zhangyang Wang) 59 | * [Pre-training strategies and datasets for facial representation learning](https://arxiv.org/pdf/2103.16554.pdf) 60 | * [Visual Room Rearrangement](https://arxiv.org/pdf/2103.16544.pdf) 61 | 62 | #### 20210330 63 | CVPR21: 64 | * [RobustNet: Improving Domain Generalization in Urban-Scene Segmentation via Instance Selective Whitening](https://arxiv.org/pdf/2103.15597.pdf) New research direction: (single) domain generalization for semantic segmentation; the experiments compare almost exclusively against IBN-Net. This paper proposes a novel instance selective whitening loss to improve the robustness of the segmentation networks for unseen domains. [code](https://github.com/shachoi/RobustNet) (Oral) 65 | * [!!! Adaptive Methods for Real-World Domain Generalization](https://arxiv.org/pdf/2103.15796.pdf) New research direction. 1. In our work, we investigate whether it is possible to leverage domain information from the unseen test samples themselves. 2. For unseen domains, our method simply uses few unlabelled test examples to construct the domain embedding. This enables adaptive classification on any unseen domain. 3. In addition, we introduce the first real-world, large-scale domain generalization benchmark, Geo-YFCC, containing 1.1M samples over 40 training, 7 validation and 15 test domains, orders of magnitude larger than prior work. 66 | * [From Synthetic to Real: Unsupervised Domain Adaptation for Animal Pose Estimation](https://arxiv.org/pdf/2103.14843.pdf) Experiments: We evaluate our approach on the TigDog and VisDA 2019 datasets, where we outperform existing approaches by a large margin. We also demonstrate the generalization ability of our model by testing extensively on both unseen domains and unseen animal categories. 
[code](https://github.com/chaneyddtt/UDA-Animal-Pose) 67 | * [Generalizing to the Open World: Deep Visual Odometry with Online Adaptation](https://arxiv.org/pdf/2103.15279.pdf) Domain adaptation for visual odometry. In this paper, we propose an online adaptation framework for deep VO with the assistance of scene-agnostic geometric computations and Bayesian inference. (Hongbin Zha) 68 | * [Capsule Network is Not More Robust than Convolutional Network](https://arxiv.org/pdf/2103.15459.pdf) The Capsule Network is widely believed to be more robust than Convolutional Networks. The study reveals that some designs, which are thought critical to CapsNet, actually can harm its robustness, i.e., the dynamic routing layer and the transformation process, while others are beneficial for the robustness. Based on these findings, we propose enhanced ConvNets simply by introducing the essential components behind the CapsNet’s success. (Han Hu) 69 | * [Checkerboard Context Model for Efficient Learned Image Compression](https://arxiv.org/pdf/2103.15306.pdf) Parallelizable image compression with a 40x decoding speedup. However, the decoding process must be done in a strict scan order, which breaks the parallelization. We propose a parallelizable checkerboard context model (CCM) to solve the problem. 70 | * [Slimmable Compressive Autoencoders for Practical Neural Image Compression](https://arxiv.org/pdf/2103.15726.pdf) Focusing on practical image compression, we propose slimmable compressive autoencoders (SlimCAEs), where rate (R) and distortion (D) are jointly optimized for different capacities. [code](https://github.com/FireFYF/SlimCAE) 71 | * [Invertible Image Signal Processing](https://arxiv.org/pdf/2103.15061.pdf) Inspired by Invertible Image Rescaling (ECCV 2020), builds an invertible ISP. We design an Invertible Image Signal Processing (InvISP) pipeline, which not only enables rendering visually appealing sRGB images but also allows recovering nearly perfect RAW data. 
(Qifeng Chen) 72 | * [LiBRe: A Practical Bayesian Approach to Adversarial Detection](https://arxiv.org/pdf/2103.14835.pdf) Detects whether an input has been adversarially attacked. [code](https://github.com/thudzj/ScalableBDL) (Jun Zhu) 73 | * [Tuning IR-cut Filter for Illumination-aware Spectral Reconstruction from RGB](https://arxiv.org/pdf/2103.14708.pdf) 74 | * [Panoptic-PolarNet: Proposal-free LiDAR Point Cloud Panoptic Segmentation](https://arxiv.org/pdf/2103.14962.pdf) [code](https://github.com/edwardzhou130/Panoptic-PolarNet) 75 | * [LiDAR R-CNN: An Efficient and Universal 3D Object Detector](https://arxiv.org/pdf/2103.15297.pdf) Plug-and-play. We present LiDAR R-CNN, a second stage detector that can generally improve any existing 3D detector. [code](https://github.com/tusimple/LiDAR_RCNN) (Naiyan Wang) 76 | * [Self-Supervised Visibility Learning for Novel View Synthesis](https://arxiv.org/pdf/2103.15407.pdf) We address the problem of novel view synthesis (NVS) from a few sparse source view images. 77 | * [No frame left behind: Full Video Action Recognition](https://arxiv.org/pdf/2103.15395.pdf) 78 | * [SceneGraphFusion: Incremental 3D Scene Graph Prediction from RGB-D Sequences](https://arxiv.org/pdf/2103.14898.pdf) 79 | * [POSEFusion: Pose-guided Selective Fusion for Single-view Human Volumetric Capture](https://arxiv.org/pdf/2103.15331.pdf) We propose POse-guided SElective Fusion (POSEFusion), a single-view human volumetric capture method that leverages tracking-based methods and tracking-free inference to achieve high-fidelity and dynamic 3D reconstruction. 80 | * [Picasso: A CUDA-based Library for Deep Learning over 3D Meshes](https://arxiv.org/pdf/2103.15076.pdf) 81 | 82 | Vision Transformer: 83 | * [CvT: Introducing Convolutions to Vision Transformers](https://arxiv.org/pdf/2103.15808.pdf) This is accomplished through two primary modifications: a hierarchy of Transformers containing a new convolutional token embedding, and a convolutional Transformer block leveraging a convolutional projection. 
[code](https://github.com/leoxiaobin/CvT) 84 | * [PixelTransformer: Sample Conditioned Signal Generation](https://arxiv.org/pdf/2103.15813.pdf) We propose a generative model that can infer a distribution for the underlying spatial signal conditioned on sparse samples e.g. plausible images given a few observed pixels. 85 | * [TS-CAM: Token Semantic Coupled Attention Map for Weakly Supervised Object Localization](https://arxiv.org/pdf/2103.14862.pdf) (Bolei Zhou, Qi Tian) 86 | * [ViViT: A Video Vision Transformer](https://arxiv.org/pdf/2103.15691.pdf) Transformers for video classification. In order to handle the long sequences of tokens encountered in video, we propose several efficient variants of our model which factorise the spatial- and temporal-dimensions of the input. 87 | * [On the Adversarial Robustness of Visual Transformers](https://arxiv.org/pdf/2103.15670.pdf) Tested on various white-box and transfer attack settings, we find that ViTs possess better adversarial robustness when compared with convolutional neural networks (CNNs). We summarize the following main observations contributing to the improved robustness of ViTs: 1) Features learned by ViTs contain less low-level information and are more generalizable, which contributes to superior robustness against adversarial perturbations. 2) Introducing convolutional or tokens-to-token blocks for learning low-level features in ViTs can improve classification accuracy but at the cost of adversarial robustness. 3) Increasing the proportion of transformers in the model structure (when the model consists of both transformer and CNN blocks) leads to better robustness. But for a pure transformer model, simply increasing the size or adding layers cannot guarantee a similar effect. 4) Pre-training on larger datasets does not significantly improve adversarial robustness though it is critical for training ViTs. 5) Adversarial training is also applicable to ViT for training robust models. 
The results show that ViTs are less sensitive to high-frequency perturbations than CNNs and there is a high correlation between how well the model learns low-level features and its robustness against different frequency-based perturbations. 88 | * [Transformer Tracking](https://arxiv.org/pdf/2103.15436.pdf) [code](https://github.com/chenxin-dlut/TransT) Motivates the use of transformers from the importance and complexity of relation modeling and feature fusion in tracking. (Huchuan Lu) (CVPR21) 89 | * [Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding](https://arxiv.org/pdf/2103.15358.pdf) Multi-Scale Vision Longformer significantly enhances the ViT of [11] for encoding high-resolution images using two techniques: a multi-scale model structure and the attention mechanism of Vision Longformer, which is a variant of Longformer [2]. Experiments: image classification, object detection, and segmentation. 90 | * [TFPose: Direct Human Pose Estimation with Transformers](https://arxiv.org/pdf/2103.15320.pdf) We formulate the pose estimation task into a sequence prediction problem that can effectively be solved by transformers. Key point: the paper analyzes which problems introducing transformers solves and what benefits it brings: Our framework is simple and direct, bypassing the drawbacks of the heatmap-based pose estimation. Moreover, with the attention mechanism in transformers, our proposed framework is able to adaptively attend to the features most relevant to the target keypoints, which largely overcomes the feature misalignment issue of previous regression-based methods and considerably improves the performance. [AdelaiDet](https://github.com/aim-uofa/AdelaiDet/) (Chunhua Shen, Zhi Tian, Xinlong Wang, et al.) 91 | * [HiT: Hierarchical Transformer with Momentum Contrast for Video-Text Retrieval](https://arxiv.org/pdf/2103.15049.pdf) Combines Transformers with contrastive learning for video-text retrieval. 92 | * [CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification](https://arxiv.org/pdf/2103.14899.pdf) Introduces multi-scale processing and efficiency into ViT. 
To reduce computation, we develop a simple yet effective token fusion module based on cross attention, which uses a single token for each branch as a query to exchange information with other branches. (Linear) 93 | * [Looking Beyond Two Frames: End-to-End Multi-Object Tracking Using Spatial and Temporal Transformers](https://arxiv.org/pdf/2103.14829.pdf) 94 | * [Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers](https://arxiv.org/pdf/2103.15679.pdf) [code](https://github.com/hila-chefer/Transformer-MM-Explainability) Studies transformer explainability. Unlike Transformers that only use self-attention, Transformers with co-attention require considering multiple attention maps in 95 | parallel in order to highlight the information that is relevant to the prediction in the model’s input. 96 | (Is co-attention the same as cross attention?) 97 | * [Face Transformer for Recognition](https://arxiv.org/pdf/2103.14803.pdf) Simply tests ViT on face recognition. We wonder if transformer can be used in face recognition and whether it is better than CNNs. Therefore, we investigate the performance of Transformer models in face recognition. 98 | * [TransCenter: Transformers with Dense Queries for Multiple-Object Tracking](https://arxiv.org/pdf/2103.15145.pdf) Inspired by recent research, we propose TransCenter, the first transformer-based architecture for tracking the centers of multiple targets. Methodologically, we propose the use of dense queries in a double-decoder network, to be able to robustly infer the heatmap of targets’ centers and associate them through time. 99 | 100 | Others: 101 | * [Mining Latent Classes for Few-shot Segmentation](https://arxiv.org/pdf/2103.15402.pdf) 102 | * [AlignMix: Improving representation by interpolating aligned features](https://arxiv.org/pdf/2103.15375.pdf) In this work, we revisit mixup from the interpolation perspective and introduce AlignMix, where we geometrically align two images in the feature space. 
103 | * [Visual Distant Supervision for Scene Graph Generation](https://arxiv.org/pdf/2103.15365.pdf) 104 | * [Going Deeper Into Face Detection: A Survey](https://arxiv.org/pdf/2103.14983.pdf) 105 | 106 | #### 20210329 107 | CVPR21: 108 | * [PAConv: Position Adaptive Convolution with Dynamic Kernel Assembling on Point Clouds](https://arxiv.org/pdf/2103.14635.pdf) Dynamic network: Position Adaptive Convolution (PAConv) for 3D point cloud processing. The key of PAConv is to construct the convolution kernel by dynamically assembling basic weight matrices stored in Weight Bank, where the coefficients of these weight matrices are self-adaptively learned from point positions through ScoreNet. [code](https://github.com/CVMI-Lab/PAConv) 109 | * [Distilling Object Detectors via Decoupled Features](https://arxiv.org/pdf/2103.14475.pdf) Knowledge distillation for object detection; the method decouples foreground and background features. In this paper, we point out that the information of features derived from regions excluding objects are also essential for distilling the student detector, which is usually ignored in existing approaches. Two levels of decoupled features will be processed for embedding useful information into the student, i.e., decoupled features from neck and decoupled proposals from classification head. [code](https://github.com/ggjy/DeFeat.pytorch) (Yunhe Wang, Chang Xu) 110 | * [Few-Shot Human Motion Transfer by Personalized Geometry and Texture Modeling](https://arxiv.org/pdf/2103.14338.pdf) 111 | * [Confluent Vessel Trees with Accurate Bifurcations](https://arxiv.org/pdf/2103.14268.pdf) Unsupervised reconstruction of complex near-capillary vasculature with thousands of bifurcations where supervision and learning are infeasible. 
112 | * [Contrastive Learning based Hybrid Networks for Long-Tailed Image Classification](https://arxiv.org/pdf/2103.14267.pdf) Contrastive learning for long-tailed classification. [code](https://www.kaihan.org/HybridLT/) (Xiu-Shen Wei) 113 | * [OTA: Optimal Transport Assignment for Object Detection](https://arxiv.org/pdf/2103.14259.pdf) Inspired by label assignment in DETR, proposes a global label assignment mechanism based on optimal transport. We innovatively revisit the label assignment from a global perspective and propose to formulate the assigning procedure as an Optimal Transport (OT) problem, a well-studied topic in Optimization Theory. After formulation, finding the best assignment solution is converted to solve the optimal transport plan at minimal transportation costs, which can be solved via Sinkhorn-Knopp Iteration. [code](https://github.com/Megvii-BaseDetection/OTA) (Jian Sun) 114 | * [MagDR: Mask-guided Detection and Reconstruction for Defending Deepfakes](https://arxiv.org/pdf/2103.14211.pdf) 115 | * [Equivariant Point Network for 3D Point Cloud Analysis](https://arxiv.org/pdf/2103.14147.pdf) In this paper, we propose an effective and practical SE(3) (3D translation and rotation) equivariant network for point cloud analysis that addresses both problems. [code](https://github.com/nintendops/EPN_PointCloud) 116 | * [ACRE: Abstract Causal REasoning Beyond Covariation](https://arxiv.org/pdf/2103.14232.pdf) (Song-Chun Zhu) 117 | * [Abstract Spatial-Temporal Reasoning via Probabilistic Abduction and Execution](https://arxiv.org/pdf/2103.14230.pdf) (Song-Chun Zhu) 118 | 119 | Others: 120 | * [Understanding Robustness of Transformers for Image Classification](https://arxiv.org/pdf/2103.14586.pdf) Explores the robustness of vision transformers. We investigate robustness to input perturbations as well as robustness to model perturbations. We find that when pre-trained with a sufficient amount of data, ViT models are at least as robust as the ResNet counterparts on a broad range of perturbations. 
We also find that Transformers are robust to the removal of almost any single layer, and that while activations from later layers are highly correlated with each other, they nevertheless play an important role in classification. (Google Research) 121 | 122 | * [COTR: Correspondence Transformer for Matching Across Images](https://arxiv.org/pdf/2103.14167.pdf) The problem setting resembles the work of Jianlong Fu et al. that uses a transformer for reference-based super-resolution. 123 | 124 | * [Lifting Transformer for 3D Human Pose Estimation in Video](https://arxiv.org/pdf/2103.14304.pdf) 125 | 126 | * [Training a Better Loss Function for Image Restoration](https://arxiv.org/pdf/2103.14616.pdf) In this work, we explore the question of what makes a good loss function for an image restoration task. [code](https://github.com/gfxdisp/mdf) 127 | 128 | * [Marine Snow Removal Benchmarking Dataset](https://arxiv.org/pdf/2103.14249.pdf) A new low-level underwater task and benchmark. [code](https://github.com/ychtanaka/marine-snow) 129 | 130 | * [DivAug: Plug-in Automated Data Augmentation with Explicit Diversity Maximization](https://arxiv.org/pdf/2103.14545.pdf) 131 | 132 | * [On Generating Transferable Targeted Perturbations](https://arxiv.org/pdf/2103.14641.pdf) Changing an unseen model’s decisions to a specific ‘targeted’ class remains a challenging feat. In this paper, we propose a new generative approach for highly transferable targeted perturbations (TTP). [code](https://github.com/Muzammal-Naseer/TTP) 133 | 134 | * [Unsupervised Robust Domain Adaptation without Source Data](https://arxiv.org/pdf/2103.14577.pdf) Studies domain adaptation without source data together with adversarial robustness. This paper aims at answering the question of finding the right strategy to make the target model robust and accurate in the setting of unsupervised domain adaptation without source data. 
(Luc Van Gool) 135 | 136 | * [Geometry-Aware Unsupervised Domain Adaptation for Stereo Matching](https://arxiv.org/pdf/2103.14333.pdf) DA for Stereo Matching (journal) 137 | 138 | * [Non-Salient Region Object Mining for Weakly Supervised Semantic Segmentation](https://arxiv.org/pdf/2103.14581.pdf) However, existing works mainly concentrate on expanding the seed of pseudo labels within the image’s salient region. In this work, we propose a non-salient region object mining approach for weakly supervised semantic segmentation. [code](https://github.com/NUST-Machine-Intelligence-Laboratory/nsrom) 139 | 140 | * [Sparse Object-level Supervision for Instance Segmentation with Pixel Embeddings](https://arxiv.org/pdf/2103.14572.pdf) Instance segmentation for biological images. We propose to address the dense annotation bottleneck by introducing a proposal-free segmentation approach based on non-spatial embeddings, which exploits the structure of the learned embedding space to extract individual instances in a differentiable way. [code](https://github.com/kreshuklab/spoco) 141 | 142 | * [Towards a Unified Approach to Single Image Deraining and Dehazing](https://arxiv.org/pdf/2103.14204.pdf) (journal) 143 | 144 | 145 | 146 | 147 | #### 20210326 148 | TOP 149 | * [Swin Transformer: Hierarchical Vision Transformer using Shifted Windows](https://arxiv.org/pdf/2103.14030.pdf) A hierarchical Transformer whose representation is computed with shifted windows. This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size. BOOM RESULTS.
[code](https://github.com/microsoft/Swin-Transformer) (Han Hu, Yue Cao, Stephen Lin et al.) 150 | * [Vision Transformers for Dense Prediction](https://arxiv.org/pdf/2103.13413.pdf) Proposes a vision transformer for dense prediction tasks that aggregates features from the transformer's layers with an FPN-like structure; validated on monocular depth estimation and semantic segmentation. [code](https://github.com/intel-isl/DPT) 151 | * [High-Fidelity Pluralistic Image Completion with Transformers](https://arxiv.org/pdf/2103.14031.pdf) This paper brings the best of both worlds to pluralistic image completion: appearance prior reconstruction with transformer and texture replenishment with CNN. (Dongdong Chen) 152 | * [AutoLoss-Zero: Searching Loss Functions from Scratch for Generic Tasks](https://arxiv.org/pdf/2103.14026.pdf) In this paper, we propose AutoLoss-Zero, the first general framework for searching loss functions from scratch for generic tasks. (Jifeng Dai, Hongsheng Li, Gao Huang, Xizhou Zhu, et al.) 153 | * [Orthogonal Projection Loss](https://arxiv.org/pdf/2103.14021.pdf) Improves on the cross-entropy loss. Motivated by the observation that ground-truth class representations in CE loss are orthogonal (one-hot encoded vectors), we develop a novel loss function termed ‘Orthogonal Projection Loss’ (OPL) which imposes orthogonality in the feature space. Given the plug-and-play nature of OPL, we evaluate it on a diverse range of tasks including image recognition (CIFAR-100), large-scale classification (ImageNet), domain generalization (PACS) and few-shot learning (miniImageNet, CIFAR-FS, tiered-ImageNet and Meta-dataset) and demonstrate its effectiveness across the board.
(but each improvement is small) [code](https://github.com/kahnchana/opl) 154 | 155 | CVPR21: 156 | * [More Photos are All You Need: Semi-Supervised Learning for Fine-Grained Sketch Based Image Retrieval](https://arxiv.org/pdf/2103.13990.pdf) Semi-supervised Fine-Grained Sketch-Based Image Retrieval (FG-SBIR). Uses image translation to generate more paired data for training (the mainstream approach in low-level vision). [code](https://github.com/AyanKumarBhunia/semisupervised-FGSBIR) 157 | * [Robust and Accurate Object Detection via Adversarial Learning](https://arxiv.org/pdf/2103.13886.pdf) This work instead augments the fine-tuning stage for object detectors by exploring adversarial examples, which can be viewed as a model-dependent data augmentation. Our method dynamically selects the stronger adversarial images sourced from a detector’s classification and localization branches and evolves with the detector to ensure the augmentation policy stays current and relevant. (Boqing Gong) 158 | * [Learning Dynamic Alignment via Meta-filter for Few-shot Learning](https://arxiv.org/pdf/2103.13582.pdf) Most of the existing methods for feature alignment in few-shot learning only consider image-level or spatial-level alignment while omitting the channel disparity. Therefore, in this paper, we propose to learn a dynamic alignment, which can effectively highlight both query regions and channels according to different local support information. 159 | * [MetaAlign: Coordinating Domain Alignment and Classification for Unsupervised Domain Adaptation](https://arxiv.org/pdf/2103.13575.pdf) Addresses the mismatch between the training loss and the test metric: points out that the optimization directions of domain alignment and of the task differ, and proposes treating them as a meta-learning problem. Plug-and-play; validated on classification, detection domain adaptation, and domain generalization. Motivation: However, the optimization objective of such domain alignment is generally not coordinated with that of the object classification task itself such that their descent directions for optimization may be inconsistent. In this paper, we aim to study and alleviate the optimization inconsistency problem between the domain alignment and classification tasks.
Method: MetaAlign, where we treat the domain alignment objective and the classification objective as the meta-train and meta-test tasks in a meta-learning scheme. (曾文君, 陈志波) 160 | * [I^3Net: Implicit Instance-Invariant Network for Adapting One-Stage Object Detectors](https://arxiv.org/pdf/2103.13757.pdf) 161 | * [Vectorization and Rasterization: Self-Supervised Learning for Sketch and Handwriting](https://arxiv.org/pdf/2103.13716.pdf) Self-supervised representation learning for sketches and handwriting, treated as a modality between text and images. In this paper, we are interested in defining a self-supervised pre-text task for sketches and handwriting data. This data is uniquely characterised by its existence in dual modalities of rasterized images and vector coordinate sequences. Proposes two novel cross-modal translation pre-text tasks for self-supervised feature learning: Vectorization and Rasterization. 162 | * [OTCE: A Transferability Metric for Cross-Domain Cross-Task Representations](https://arxiv.org/pdf/2103.13843.pdf) A new transfer setting that considers domain transfer and task transfer at the same time. Transfer learning across heterogeneous data distributions (a.k.a. domains) and distinct tasks is a more general and challenging problem than conventional transfer learning, where either domains or tasks are assumed to be the same. We propose a transferability metric called Optimal Transport based Conditional Entropy (OTCE), to analytically predict the transfer performance for supervised classification tasks in such cross-domain and cross-task feature transfer settings. 163 | * [Closing the Loop: Joint Rain Generation and Removal via Disentangled Image Translation](https://arxiv.org/pdf/2103.13660.pdf) Image translation for deraining (the mainstream approach in low-level vision). 164 | 165 | Others 166 | * [An Image is Worth 16x16 Words, What is a Video Worth?](https://arxiv.org/pdf/2103.13915.pdf) Transformer for action recognition.
[code](https://github.com/Alibaba-MIIL/STAM) 167 | 168 | * [Contrast to Divide: Self-Supervised Pre-Training for Learning with Noisy Labels](https://arxiv.org/pdf/2103.13646.pdf) We identify a “warm-up obstacle”: the inability of standard warm-up stages to train high quality feature extractors and avert memorization of noisy labels. We propose “Contrast to Divide” (C2D), a simple framework that solves this problem by pre-training the feature extractor in a self-supervised fashion. 169 | 170 | * [Universal Representation Learning from Multiple Domains for Few-shot Classification](https://arxiv.org/pdf/2103.13841.pdf) In this work, we propose to learn a single set of universal deep representations by distilling knowledge of multiple separately trained networks after co-aligning their features with the help of adapters and centered kernel alignment. 171 | 172 | * [Inferring Latent Domains for Unsupervised Deep Domain Adaptation](https://arxiv.org/pdf/2103.13873.pdf) (TPAMI) 173 | 174 | * [USB: Universal-Scale Object Detection Benchmark](https://arxiv.org/pdf/2103.14027.pdf) In this paper, we introduce the Universal-Scale object detection Benchmark (USB). USB has variations in object scales and image domains by incorporating COCO with the recently proposed Waymo Open Dataset and Manga109-s dataset. UniverseNets. [code](https://github.com/shinya7y/UniverseNet) 175 | 176 | * [Multi-Target Domain Adaptation via Unsupervised Domain Classification for Weather Invariant Object Detection](https://arxiv.org/pdf/2103.13970.pdf) However, most existing domain adaptation methods either handle a single target domain or require domain labels. We propose a novel unsupervised domain classification method which can be used to generalize single-target domain adaptation methods to multi-target domains, and design a weather-invariant object detector training framework based on it.
177 | 178 | * [StyleLess layer: Improving robustness for real-world driving](https://arxiv.org/pdf/2103.13905.pdf) 179 | 180 | * [GridDehazeNet+: An Enhanced Multi-Scale Network with Intra-Task Knowledge Transfer for Single Image Dehazing](https://arxiv.org/pdf/2103.13998.pdf) 181 | 182 | * [Hierarchical Deep CNN Feature Set-Based Representation Learning for Robust Cross-Resolution Face Recognition](https://arxiv.org/pdf/2103.13851.pdf) (Guo-Jun Qi, TCSVT) 183 | 184 | * [Self-Supervised Training Enhances Online Continual Learning](https://arxiv.org/pdf/2103.14010.pdf) 185 | 186 | 187 | 188 | #### 20210325 189 | CVPR21: 190 | * [M3DSSD: Monocular 3D Single Stage Object Detector](https://arxiv.org/pdf/2103.13164.pdf) 1. feature mismatching -> shape alignment, center alignment 2. global information -> asymmetric non-local attention 191 | * [Temporal Context Aggregation Network for Temporal Action Proposal Refinement](https://arxiv.org/pdf/2103.13141.pdf) Temporal action proposal generation aims to estimate temporal intervals of actions in untrimmed videos. 192 | * [Learning Salient Boundary Feature for Anchor-free Temporal Action Localization](https://arxiv.org/pdf/2103.13137.pdf) Temporal action localization aims at inferring both the action category and localization of the start and end frame for each action instance in a long, untrimmed video. 193 | * [Revamping Cross-Modal Recipe Retrieval with Hierarchical Transformers and Self-supervised Learning](https://arxiv.org/pdf/2103.13061.pdf) Cross-modal recipe retrieval, covering both image-to-recipe and recipe-to-image; the recipe is in text.
[code](https://github.com/amzn/image-to-recipe-transformers) 194 | * [Coarse-to-Fine Domain Adaptive Semantic Segmentation with Photometric Alignment and Category-Center Regularization](https://arxiv.org/pdf/2103.13041.pdf) Argues that domain shift mainly occurs at the image level and the category level, and accordingly proposes a photometric alignment module, together with a category-oriented triplet loss for the source domain and a self-supervised consistency regularization for the target domain. 195 | * [From Shadow Generation to Shadow Removal](https://arxiv.org/pdf/2103.12997.pdf) Follows the general low-level vision trend of using image translation; training does not require shadow-free images. 196 | * [Relation-aware Instance Refinement for Weakly Supervised Visual Grounding](https://arxiv.org/pdf/2103.12989.pdf) Visual grounding aims to build a correspondence between visual objects and their language entities. 197 | * [Scene-Intuitive Agent for Remote Embodied Visual Grounding](https://arxiv.org/pdf/2103.12944.pdf) An agent that mimics human behaviors: The agent learns where to stop in the Scene Grounding task and what to attend to in the Object Grounding task respectively. 198 | * [Efficient Regional Memory Network for Video Object Segmentation](https://arxiv.org/pdf/2103.12934.pdf) Regional Memory Network: a novel local-to-local matching solution for semi-supervised VOS 199 | * [Weakly Supervised Instance Segmentation for Videos with Temporal Mask Consistency](https://arxiv.org/pdf/2103.12886.pdf) Problems in weakly supervised instance segmentation: (a) partial segmentation of objects and (b) missing object predictions. We are the first to explore the use of these video signals to tackle weakly supervised instance segmentation. Keys: 1. inter-pixel relation network 2.
MaskConsist module (Alan Yuille) 200 | * [Convex Online Video Frame Subset Selection using Multiple Criteria for Data Efficient Autonomous Driving](https://arxiv.org/pdf/2103.13021.pdf) Autonomous driving 201 | * [Dynamic Slimmable Network](https://arxiv.org/pdf/2103.13258.pdf) Problem: dynamic sparse patterns on convolutional filters fail to achieve actual acceleration in real-world implementation, due to the extra burden of indexing, weight-copying, or zero-masking. Method: 1. double-headed dynamic gate that comprises an attention head and a slimming head 2. a disentangled two-stage training scheme inspired by one-shot NAS 202 | * [Affective Processes: stochastic modelling of temporal context for emotion and facial expression recognition](https://arxiv.org/pdf/2103.13372.pdf) 203 | * [Structure-Aware Face Clustering on a Large-Scale Graph with 10^7 Nodes](https://arxiv.org/pdf/2103.13225.pdf) 204 | * [The Blessings of Unlabeled Background in Untrimmed Videos](https://arxiv.org/pdf/2103.13183.pdf) Causal inference for weakly-supervised temporal action localization. While previous works treat the background as “curses”, we consider it as “blessings” (Hanwang Zhang) 205 | 206 | Others: 207 | * [Can Vision Transformers Learn without Natural Images?](https://arxiv.org/pdf/2103.13023.pdf) 208 | 209 | 210 | 211 | 212 | #### 20210324 213 | CVPR21: 214 | * [Lifelong Person Re-Identification via Adaptive Knowledge Accumulation](https://arxiv.org/pdf/2103.12462.pdf) Proposes the lifelong person re-identification task, which requires the model to learn continually across multiple domains and to generalize to unseen domains. Introduces a dataset for studying the problem and a dedicated solution, the Adaptive Knowledge Accumulation framework.
[code](https://github.com/TPCD/LifelongReID) 215 | * [Group-aware Label Transfer for Domain Adaptive Person Re-identification](https://arxiv.org/pdf/2103.12366.pdf) 216 | * [Transferable Semantic Augmentation for Domain Adaptation](https://arxiv.org/pdf/2103.12562.pdf) Semantic augmentation for domain adaptation (王雨霖) [code](https://github.com/BIT-DA/TSA) 217 | * [MetaSAug: Meta Semantic Augmentation for Long-Tailed Visual Recognition](https://arxiv.org/pdf/2103.12579.pdf) Semantic augmentation for long-tailed recognition (王雨霖) [code](https://github.com/BIT-DA/MetaSAug) 218 | 219 | Others: 220 | * [Learning without Seeing nor Knowing: Towards Open Zero-Shot Learning](https://arxiv.org/pdf/2103.12437.pdf) 221 | 222 | * [BossNAS: Exploring Hybrid CNN-transformers with Block-wisely Self-supervised Neural Architecture Search](https://arxiv.org/pdf/2103.12424.pdf) 223 | 224 | * [Global Correlation Network: End-to-End Joint Multi-Object Detection and Tracking](https://arxiv.org/pdf/2103.12511.pdf) 225 | 226 | * [End-to-End Trainable Multi-Instance Pose Estimation with Transformers](https://arxiv.org/pdf/2103.12115.pdf) 227 | 228 | 229 | 230 | 231 | #### 20210323 232 | * [Transformer Meets Tracker: Exploiting Temporal Context for Robust Visual Tracking](https://arxiv.org/pdf/2103.11681.pdf) Applies transformers to visual tracking: the transformer encoder and decoder are separated into two parallel branches and embedded in a Siamese-like tracking pipeline (周文罡, 李厚强) 233 | * [DeepViT: Towards Deeper Vision Transformer](https://arxiv.org/pdf/2103.11886.pdf) Points out that ViT, unlike CNNs, does not improve simply by going deeper (the attention collapse issue), and based on this observation proposes Re-attention, which regenerates diverse attention maps (Jiashi Feng) 234 | * [Incorporating Convolution Designs into Visual Transformers](https://arxiv.org/pdf/2103.11816.pdf) Convolution-enhanced image Transformer (CeiT) brings CNN design ideas into ViT, proposing (1) an Image-to-Tokens (I2T) module that extracts patches from generated low-level features, (2) a Locally-enhanced Feed-Forward (LeFF) layer that strengthens the correlation among neighboring tokens, and (3) Layer-wise Class token Attention (LCA), improving ViT performance and training speed. (刘子纬) 235 | * [Multimodal Motion Prediction with Stacked
Transformers](https://arxiv.org/pdf/2103.11624.pdf) Transformer for multimodal motion prediction (traffic flow prediction) (周博磊) 236 | * [Learning Multi-Scene Absolute Pose Regression with Transformers](https://arxiv.org/pdf/2103.11468.pdf) Transformer for multi-scene absolute pose regression (estimating the camera position and orientation from captured images) 237 | 238 | 239 | 240 | #### 20210322 241 | CVPR21: 242 | * [Generic Perceptual Loss for Modeling Structured Output Dependencies](https://arxiv.org/pdf/2103.10571.pdf) Points out that the perceptual loss works because of the network's topology rather than its pretrained weights; a perceptual loss computed with a randomly initialized network performs well on semantic segmentation, depth estimation, instance segmentation, and other tasks (Chunhua Shen) 243 | * [CDFI: Compression-Driven Network Design for Frame Interpolation](https://arxiv.org/pdf/2103.10559.pdf) Compression-Driven Network Design for Frame Interpolation 244 | * [Skeleton Merger: an Unsupervised Aligned Keypoint Detector](https://arxiv.org/pdf/2103.10814.pdf) An unsupervised aligned keypoint detector (卢策吾) 245 | * [XProtoNet: Diagnosis in Chest Radiography with Global and Local Explanations](https://arxiv.org/pdf/2103.10663.pdf) Medical diagnosis 246 | * [Learning the Superpixel in a Non-iterative and Lifelong Manner](https://arxiv.org/pdf/2103.10681.pdf) Treats superpixel segmentation as a lifelong clustering task and proposes a CNN-based superpixel segmentation method 247 | * [Degrade is Upgrade: Learning Degradation for Low-light Image Enhancement](https://arxiv.org/pdf/2103.10621.pdf) Low-light image enhancement 248 | * [Dynamic Transfer for Multi-Source Domain Adaptation](https://arxiv.org/pdf/2103.10583.pdf) Dynamic networks for multi-source domain adaptation 249 | * [Sewer-ML: A Multi-Label Sewer Defect Classification Dataset and Benchmark](https://arxiv.org/pdf/2103.10895.pdf) Benchmark for sewer defect detection 250 | 251 | Vision Transformer: 252 | * [Reading Isn't Believing: Adversarial Attacks On Multi-Modal Neurons](https://arxiv.org/ftp/arxiv/papers/2103/2103.10480.pdf) Adversarial robustness of multimodal pretrained models (studied mainly on CLIP) 253 | * [UNETR: Transformers for 3D Medical Image Segmentation](https://arxiv.org/pdf/2103.10504.pdf) Transformers for 3D Medical Image Segmentation 254 | * [3D Human
Pose Estimation with Spatial and Temporal Transformers](https://arxiv.org/pdf/2103.10455.pdf) Pure transformer for 3D human pose estimation 255 | * [Hopper: Multi-hop Transformer for Spatiotemporal Reasoning](https://arxiv.org/pdf/2103.10574.pdf) Transformer for spatiotemporal reasoning (ICLR21) 256 | * [Scalable Visual Transformers with Hierarchical Pooling](https://arxiv.org/pdf/2103.10619.pdf) Points out that keeping a fixed-length sequence throughout inference, as ViT and DeiT do, may be redundant. Proposes the Hierarchical Visual Transformer (HVT), which downsamples the sequence length with CNN-style pooling and scales along the depth, width, resolution, and patch size dimensions. Finds that average pooling aggregates global information more discriminatively than the cls token 257 | * [ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases](https://arxiv.org/pdf/2103.10697.pdf) ConViT proposes gated positional self-attention (GPSA), which introduces the inductive bias of CNNs into vision transformers in a “soft” way 258 | * [MDMMT: Multidomain Multimodal Transformer for Video Retrieval](https://arxiv.org/pdf/2103.10699.pdf) Transformer for text-to-video retrieval 259 | * [Meta-DETR: Few-Shot Object Detection via Unified Image-Level Meta-Learning](https://arxiv.org/abs/2103.11731) 260 | 261 | Others: 262 | * [Robustness via Cross-Domain Ensembles](https://arxiv.org/pdf/2103.10919.pdf) Robustness via Cross-Domain Ensembles: ensembles the outputs over multiple domains based on uncertainty to improve model robustness 263 | * [Paint by Word](https://arxiv.org/pdf/2103.10951.pdf) Uses words to guide the inpainted content 264 | * [Boosting Adversarial Transferability through Enhanced Momentum](https://arxiv.org/pdf/2103.10609.pdf) (胡瀚, 王井东) 265 | * [UniMoCo: Unsupervised, Semi-Supervised and Full-Supervised Visual Representation Learning](https://arxiv.org/pdf/2103.10773.pdf) Proposes UniMoCo, which unifies unsupervised, semi-supervised, and fully supervised tasks for MoCo-based representation learning 266 | * [ClawCraneNet: Leveraging Object-level Relation for Text-based Video Segmentation](https://arxiv.org/pdf/2103.10702.pdf) Proposes a top-down method for text-based video segmentation (杨易) 267 | 268 | #### 20210319 269 | * [Generating Diverse Structure for Image Inpainting With Hierarchical VQ-VAE](https://arxiv.org/pdf/2103.10022.pdf) Diverse image inpainting (刘东, 李厚强, CVPR2021) 270
| * [Neural Parts: Learning Expressive 3D Shape Abstractions with Invertible Neural Networks](https://arxiv.org/pdf/2103.10429.pdf) (CVPR2021) 271 | * [Learning to Recommend Frame for Interactive Video Object Segmentation in the Wild](https://arxiv.org/pdf/2103.10391.pdf) Interactive video segmentation (CVPR2021) 272 | * [Large Scale Image Completion via Co-Modulated Generative Adversarial Networks](https://arxiv.org/pdf/2103.10428.pdf) GAN Image Completion (ICLR21) 273 | * [Using latent space regression to analyze and leverage compositionality in GANs](https://arxiv.org/pdf/2103.10426.pdf) Uses latent space regression to understand and leverage compositionality in GANs (ICLR21, Phillip Isola) 274 | * [Deep Wiener Deconvolution: Wiener Meets Deep Learning for Image Deblurring](https://arxiv.org/pdf/2103.09962.pdf) Combines Wiener deconvolution with deep learning for image deblurring (NIPS20) 275 | * [Learning to Resize Images for Computer Vision Tasks](https://arxiv.org/pdf/2103.09950.pdf) Learnable image resizing: the resized images do not look good visually, but they significantly improve performance on ImageNet classification and downstream tasks (Google Research) 276 | * [The Untapped Potential of Off-the-Shelf Convolutional Neural Networks](https://arxiv.org/pdf/2103.09891.pdf) Dynamically changes the topology of a pretrained network at test time for prediction; can raise ResNet50 top-1 accuracy to 95% (陈怡然) 277 | * [The Low-Rank Simplicity Bias in Deep Networks](https://arxiv.org/pdf/2103.10427.pdf) Points out that DNNs tend to learn lower-rank solutions, so they do not overfit the training set and generalize well 278 | * [DanceNet3D: Music Based Dance Generation with Parametric Motion Transformer](https://arxiv.org/pdf/2103.10206.pdf) Transformer that generates dance videos from music 279 | * [Consistency-based Active Learning for Object Detection](https://arxiv.org/pdf/2103.10374.pdf) Active learning for object detection 280 | * [SG-Net: Spatial Granularity Network for One-Stage Video Instance Segmentation](https://arxiv.org/pdf/2103.10284.pdf) One-Stage Video Instance Segmentation 281 | * [Pseudo-ISP: Learning Pseudo In-camera Signal Processing Pipeline from A Color Image Denoiser](https://arxiv.org/pdf/2103.10234.pdf) Real-world image denoising (Wangmeng Zuo) 282 | * [Decoupled Spatial Temporal Graphs for Generic Visual Grounding](https://arxiv.org/pdf/2103.10191.pdf) Proposes Generic
Visual Grounding, a visual grounding problem setting closer to real-world scenarios, together with a new dataset and method (程明明, 杨易) 283 | * [Space-Time Crop & Attend: Improving Cross-modal Video Representation Learning](https://arxiv.org/pdf/2103.10211.pdf) Points out that spatial augmentations such as cropping are very important in self-supervised video representation learning; proposes Feature Crop and replaces average pooling with transformer-based attention 284 | * [TrivialAugment: Tuning-free Yet State-of-the-Art Data Augmentation](https://arxiv.org/pdf/2103.10158.pdf) Proposes TrivialAugment, a simple and effective automatic data augmentation method 285 | * [Self-Supervised Adaptation for Video Super-Resolution](https://arxiv.org/pdf/2103.10081.pdf) Self-supervised adaptation for video super-resolution 286 | * [Enhancing Transformer for Video Understanding Using Gated Multi-Level Attention and Temporal Adversarial Training](https://arxiv.org/pdf/2103.10043.pdf) Transformer for video understanding 287 | * [RangeDet: In Defense of Range View for LiDAR-based 3D Object Detection](https://arxiv.org/pdf/2103.10039.pdf) LiDAR-based 3D object detection (Naiyan Wang) 288 | * [SparsePoint: Fully End-to-End Sparse 3D Object Detector](https://arxiv.org/pdf/2103.10042.pdf) Follows DETR and Sparse R-CNN for sparse 3D object detection 289 | 290 | #### 20210318 291 | CVPR21: 292 | * [Learning Discriminative Prototypes with Dynamic Time Warping](https://arxiv.org/pdf/2103.09458.pdf) Dynamic Time Warping (video summarization) 293 | * [You Only Look One-level Feature](https://arxiv.org/pdf/2103.09460.pdf) YOLOF argues that the success of FPN comes from its divide-and-conquer design rather than from multi-scale feature fusion. YOLOF uses a Dilated Encoder and Uniform Matching (孙剑, 张祥宇) 294 | 295 | Others: 296 | * [Multi-Prize Lottery Ticket Hypothesis: Finding Accurate Binary Neural Networks by Pruning A Randomly Weighted Network](https://arxiv.org/pdf/2103.09377.pdf) Multi-Prize Lottery Ticket Hypothesis: adds two further hypotheses on top of the Lottery Ticket Hypothesis to prune binary neural networks (ICLR21) 297 | * [Training GANs with Stronger Augmentations via Contrastive Discriminator](https://arxiv.org/pdf/2103.09742.pdf) Introduces contrastive learning into GANs (ICLR21) 298 | * [Bio-inspired Robustness: A Review](https://arxiv.org/ftp/arxiv/papers/2103/2103.09265.pdf) Survey of bio-inspired robustness 299 | * [Pros and Cons of GAN Evaluation Measures:
New Developments](https://arxiv.org/pdf/2103.09396.pdf) Survey of recent developments in GAN evaluation measures 300 | * [LightningDOT: Pre-training Visual-Semantic Embeddings for Real-Time Image-Text Retrieval](https://arxiv.org/pdf/2103.08784.pdf) Vision-language pretraining that replaces the time-consuming cross-modal attention to enable fast image-text retrieval (NAACL) 301 | * [Revisiting the Loss Weight Adjustment in Object Detection](https://arxiv.org/pdf/2103.09488.pdf) Studies how to properly weight the classification and localization losses in object detection; proposes Adaptive Loss Weight Adjustment (ALWA), which adapts the loss weights during training according to the statistics of each loss 302 | * [Large-Scale Zero-Shot Image Classification from Rich and Diverse Textual Descriptions](https://arxiv.org/pdf/2103.09669.pdf) Adds rich textual descriptions to zero-shot learning (ZSL), significantly improving ZSL performance (NAACL) 303 | * [Single Underwater Image Restoration by Contrastive Learning](https://arxiv.org/pdf/2103.09697.pdf) Combines contrastive learning and image translation for underwater image enhancement 304 | * [Trans-SVNet: Accurate Phase Recognition from Surgical Videos via Hybrid Embedding Aggregation Transformer](https://arxiv.org/pdf/2103.09712.pdf) Transformer for surgical phase recognition 305 | * [Prediction-assistant Frame Super-Resolution for Video Streaming](https://arxiv.org/pdf/2103.09455.pdf) Video compression and super-resolution for efficient data transmission 306 | * [Disentangled Cycle Consistency for Highly-realistic Virtual Try-On](https://arxiv.org/pdf/2103.09479.pdf) Disentangled Cycle Consistency for Highly-realistic Virtual Try-On (Ping Luo) 307 | * [PredRNN: A Recurrent Neural Network for Spatiotemporal Predictive Learning](https://arxiv.org/pdf/2103.09504.pdf) Spatiotemporal predictive learning (video frame prediction) (龙明盛) 308 | 309 | 310 | #### 20210317 311 | CVPR21: 312 | * [Frequency-aware Discriminative Feature Learning Supervised by Single-Center Loss for Face Forgery Detection](https://arxiv.org/pdf/2103.09096.pdf) Proposes frequency-aware feature learning and a new loss function for face forgery detection (Yongdong Zhang) 313 | * [BBAM: Bounding Box Attribution Map for Weakly Supervised Semantic and Instance Segmentation](https://arxiv.org/pdf/2103.08907.pdf) Weakly supervised instance segmentation 314 | * [Anti-Adversarially Manipulated Attributions for Weakly and Semi-Supervised Semantic
Segmentation](https://arxiv.org/pdf/2103.08896.pdf) Weakly supervised semantic segmentation 315 | * [Track to Detect and Segment: An Online Multi-Object Tracker](https://arxiv.org/pdf/2103.08808.pdf) Multi-Object Tracking 316 | 317 | Others: 318 | * [Multilingual Multimodal Pre-training for Zero-Shot Cross-Lingual Transfer of Vision-Language Models](https://arxiv.org/pdf/2103.08849.pdf) Multilingual vision-language pretrained model (NAACL) 319 | * [Is it Enough to Optimize CNN Architectures on ImageNet?](https://arxiv.org/pdf/2103.09108.pdf) Examines the generalization of ImageNet-pretrained CNNs 320 | * [Balancing Biases and Preserving Privacy on Balanced Faces in the Wild](https://arxiv.org/pdf/2103.09118.pdf) (TPAMI, Yun Fu) 321 | * [QueryDet: Cascaded Sparse Query for Accelerating High-Resolution Small Object Detection](https://arxiv.org/pdf/2103.09136.pdf) Adds object queries to the FPN to improve small-object detection (Naiyan Wang) 322 | * [Dense Interaction Learning for Video-based Person Re-identification](https://arxiv.org/pdf/2103.09013.pdf) Transformer for video-based re-ID (Zhibo Chen, Xian-Sheng Hua) 323 | * [Super-Resolving Cross-Domain Face Miniatures by Peeking at One-Shot Exemplar](https://arxiv.org/pdf/2103.08863.pdf) Cross-domain face super-resolution (杨易) 324 | * [A Large-Scale Dataset for Benchmarking Elevator Button Segmentation and Character Recognition](https://arxiv.org/pdf/2103.09030.pdf) Dataset for elevator button recognition 325 | 326 | 327 | #### 20210316 328 | CVPR21: 329 | * [Detecting Human-Object Interaction via Fabricated Compositional Learning](https://arxiv.org/pdf/2103.08214.pdf) Identifies and addresses the long-tailed distribution problem in HOI (侯志, Prof. Tao) 330 | * [Beyond Image to Depth: Improving Depth Prediction using Echoes](https://arxiv.org/pdf/2103.08468.pdf) Predicts depth from echoes and RGB images (multimodal) 331 | * [Semi-Supervised Video Deraining with Dynamic Rain Generator](https://arxiv.org/pdf/2103.07939.pdf) Semi-supervised video deraining 332 | * [DivCo: Diverse Conditional Image Synthesis via Contrastive Generative Adversarial Network](https://arxiv.org/pdf/2103.07893.pdf) Contrastive learning to improve the diversity of conditional GAN image synthesis (CHHK) 333 | * [Modular Interactive Video Object Segmentation: Interaction-to-Mask, Propagation and Difference-Aware
Fusion](https://arxiv.org/pdf/2103.07941.pdf) Interactive video segmentation 334 | * [ReDet: A Rotation-equivariant Detector for Aerial Object Detection](https://arxiv.org/pdf/2103.07733.pdf) Rotation-equivariant Detector for Aerial Object Detection 335 | * [Uncertainty-guided Model Generalization to Unseen Domains](https://arxiv.org/pdf/2103.07531.pdf) Single domain generalization based on uncertainty estimation 336 | * [Cross-Domain Similarity Learning for Face Recognition in Unseen Domains](https://arxiv.org/pdf/2103.07503.pdf) Domain generalization for face recognition (Yi-Hsuan Tsai) 337 | * [Refine Myself by Teaching Myself : Feature Refinement via Self-Knowledge Distillation](https://arxiv.org/pdf/2103.08273.pdf) Knowledge distillation with BiFPN-like connections 338 | 339 | Others: 340 | * [TransFG: A Transformer Architecture for Fine-grained Recognition](https://arxiv.org/pdf/2103.07976.pdf) Transformer for fine-grained classification (Alan Yuille's group) 341 | * [Revisiting ResNets: Improved Training and Scaling Strategies](https://arxiv.org/pdf/2103.07579.pdf) Improves ResNet to match EfficientNet performance with faster inference (Tsung-Yi Lin) 342 | * [Unsupervised Image Transformation Learning via Generative Adversarial Networks](https://arxiv.org/pdf/2103.07751.pdf) GAN for image transformation (周博磊's group) 343 | * [Learning Frequency-aware Dynamic Network for Efficient Super-Resolution](https://arxiv.org/pdf/2103.08357.pdf) Frequency-aware Dynamic Network for Efficient Super-Resolution (王云鹤) 344 | 345 | 346 | #### 20210312 347 | CVPR21: 348 | * [CoMoGAN: continuous model-guided image-to-image translation](https://arxiv.org/pdf/2103.06879.pdf) CoMoGAN for continuous image translation 349 | * [Fast and Accurate Model Scaling](https://arxiv.org/pdf/2103.06877.pdf) FAIR's counterpart to Google's EfficientNet: points out that existing model scaling methods consider only accuracy and compute and ignore actual runtime. Achieves fast and accurate model scaling by limiting the number of activations in the network 350 | * [Continual Semantic Segmentation via Repulsion-Attraction of Sparse and Disentangled Latent Representations](https://arxiv.org/pdf/2103.06342.pdf) Continual learning for semantic segmentation 351 | 352 | Others: 353 | * [WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training](https://arxiv.org/pdf/2103.06561.pdf)
WenLan, a large-scale vision-language pretrained model from Renmin University of China and the Chinese Academy of Sciences, positioned against CLIP 354 | * [Level-aware Haze Image Synthesis by Self-Supervised Content-Style Disentanglement](https://arxiv.org/pdf/2103.06501.pdf) Haze image synthesis based on self-supervised image disentanglement 355 | 356 | 357 | #### 20210311 358 | CVPR21: 359 | * [Involution: Inverting the Inherence of Convolution for Visual Recognition](https://arxiv.org/pdf/2103.06255.pdf) Involution network (a variant of self-attention) 360 | * [Spatially Consistent Representation Learning](https://arxiv.org/pdf/2103.06122.pdf) A contrastive learning method suited to dense prediction tasks 361 | * [Reformulating HOI Detection as Adaptive Set Prediction](https://arxiv.org/pdf/2103.05983.pdf) HOI transformer 362 | * [FSCE: Few-Shot Object Detection via Contrastive Proposal Encoding](https://arxiv.org/pdf/2103.05950.pdf) Contrastive learning for few-shot object detection 363 | * [VideoMoCo: Contrastive Video Representation Learning with Temporally Adversarial Examples](https://arxiv.org/pdf/2103.05905.pdf) VideoMoCo: contrastive learning for video tasks 364 | * [Capturing Omni-Range Context for Omnidirectional Segmentation](https://arxiv.org/pdf/2103.05687.pdf) Attention for semantic segmentation with wide-angle lenses 365 | * [AutoDO: Robust AutoAugment for Biased Data with Label Noise via Scalable Probabilistic Implicit Differentiation](https://arxiv.org/pdf/2103.05863.pdf) An upgraded AutoAugment that can handle noisy data 366 | 367 | Others: 368 | * [Regressive Domain Adaptation for Unsupervised Keypoint Detection](https://arxiv.org/pdf/2103.06175.pdf) Regressive domain adaptation for keypoint detection (龙明盛's group) 369 | * [U-Net Transformer: Self and Cross Attention for Medical Image Segmentation](https://arxiv.org/pdf/2103.06104.pdf) UNet Transformer 370 | 371 | 372 | #### 20210310 373 | CVPR2021: 374 | * [MetaCorrection: Domain-aware Meta Loss Correction for Unsupervised Domain Adaptation in Semantic Segmentation](https://arxiv.org/pdf/2103.05254.pdf) Domain adaptation for semantic segmentation 375 | * [ST3D: Self-training for Unsupervised Domain Adaptation on 3D Object Detection](https://arxiv.org/pdf/2103.05346.pdf) Domain adaptation for 3D object detection 376 | * [Contrastive Neural Architecture Search with Neural Architecture Comparators](https://arxiv.org/pdf/2103.05471.pdf)
contrastive NAS 377 | 378 | Others: 379 | * [Deep Learning based 3D Segmentation: A Survey](https://arxiv.org/pdf/2103.05423.pdf) Survey of 3D semantic segmentation 380 | 381 | 382 | #### 20210309 383 | CVPR21: 384 | * [Multi-Source Domain Adaptation with Collaborative Learning for Semantic Segmentation](https://arxiv.org/pdf/2103.04717.pdf) Multi-source domain adaptation for semantic segmentation 385 | * [Semi-supervised Domain Adaptation based on Dual-level Domain Mixing for Semantic Segmentation](https://arxiv.org/pdf/2103.04705.pdf) Semi-supervised domain adaptation for semantic segmentation 386 | * [MeGA-CDA: Memory Guided Attention for Category-Aware Unsupervised Domain Adaptive Object Detection](https://arxiv.org/pdf/2103.04224.pdf) Domain adaptation for object detection 387 | * [End-to-End Human Object Interaction Detection with HOI Transformer](https://arxiv.org/pdf/2103.04503.pdf) human object interaction transformer 388 | 389 | Others: 390 | * [Unsupervised Pretraining for Object Detection by Patch Reidentification](https://arxiv.org/pdf/2103.04814.pdf) Self-supervised object detection 391 | * [Perspectives and Prospects on Transformer Architecture for Cross-Modal Tasks with Language and Vision](https://arxiv.org/pdf/2103.04037.pdf) Discusses multimodal transformers 392 | * [TransBTS: Multimodal Brain Tumor Segmentation Using Transformer](https://arxiv.org/pdf/2103.04430.pdf) Transformer for medical image segmentation 393 | 394 | #### 20210302 395 | * [Generative Adversarial Transformers](https://arxiv.org/pdf/2103.01209.pdf) transformer GAN 396 | * [OmniNet: Omnidirectional Representations from Transformers](https://arxiv.org/pdf/2103.01075.pdf) OmniNet lets the transformer model global information across features from different layers of the network; effective on both CV and NLP tasks 397 | * [Single-Shot Motion Completion with Transformer](https://arxiv.org/pdf/2103.00776.pdf) Transformer for single-shot motion completion 398 | * [Transformer in Transformer](https://arxiv.org/pdf/2103.00112.pdf) Built on ViT; uses an inner transformer to model pixel-level feature representations 399 | 400 | 401 | ## NLP (Weekly) 402 | -------------------------------------------------------------------------------- /202104.md: -------------------------------------------------------------------------------- 1 | ## CV
(Daily) 2 | 3 | 4 | #### 20210430 5 | * [A Large-Scale Study on Unsupervised Spatiotemporal Representation Learning](https://arxiv.org/pdf/2104.14558.pdf) (Ross Girshick, Kaiming He) 6 | * [With a Little Help from My Friends: Nearest-Neighbor Contrastive Learning of Visual Representations](https://arxiv.org/pdf/2104.14548.pdf) (Andrew Zisserman) 7 | * [Ensembling with Deep Generative Views](https://arxiv.org/pdf/2104.14551.pdf) (Jun-Yan Zhu, Phillip Isola, Richard Zhang) 8 | * [Emerging Properties in Self-Supervised Vision Transformers](https://arxiv.org/pdf/2104.14294.pdf) 9 | * [Discover the Unknown Biased Attribute of an Image Classifier](https://arxiv.org/pdf/2104.14556.pdf) 10 | * [Decoupled Dynamic Filter Networks](https://arxiv.org/pdf/2104.14107.pdf) (Ming-Hsuan Yang) 11 | * [LightTrack: Finding Lightweight Neural Networks for Object Tracking via One-Shot Architecture Search](https://arxiv.org/pdf/2104.14545.pdf) (Jianlong Fu, Huchuan Lu) 12 | * [Bridge to Answer: Structure-aware Graph Interaction Network for Video Question Answering](https://arxiv.org/pdf/2104.14085.pdf) 13 | * [Pseudo-IoU: Improving Label Assignment in Anchor-Free Object Detection](https://arxiv.org/pdf/2104.14082.pdf) 14 | * [Segmentation-grounded Scene Graph Generation](https://arxiv.org/pdf/2104.14207.pdf) 15 | * [Privacy-Preserving Portrait Matting](https://arxiv.org/pdf/2104.14222.pdf) (Jing Zhang, Dacheng Tao) 16 | * [Learning Multi-Attention Context Graph for Group-Based Re-Identification](https://arxiv.org/pdf/2104.14236.pdf) (Ling Shao) 17 | * [AutoFlow: Learning a Better Training Set for Optical Flow](https://arxiv.org/pdf/2104.14544.pdf) 18 | * [The Temporal Opportunist: Self-Supervised Multi-Frame Monocular Depth](https://arxiv.org/pdf/2104.14540.pdf) 19 | * [GasHis-Transformer: A Multi-scale Visual Transformer Approach for Gastric Histopathology Image Classification](https://arxiv.org/pdf/2104.14528.pdf) 20 | * [Action Unit Memory Network for Weakly Supervised Temporal Action
Localization](https://arxiv.org/pdf/2104.14135.pdf) (Tao Mei, Feng Wu, Yongdong Zhang) 21 | * [Do Feature Attribution Methods Correctly Attribute Features?](https://arxiv.org/pdf/2104.14403.pdf) 22 | * [Pushing it out of the Way: Interactive Visual Navigation](https://arxiv.org/pdf/2104.14040.pdf) 23 | 24 | #### 20210429 25 | ##### Vision Transformer 26 | * [Twins: Revisiting the Design of Spatial Attention in Vision Transformers](https://arxiv.org/pdf/2104.13840.pdf) Very recently, a variety of vision transformer architectures for dense prediction tasks have been proposed and they show that the design of spatial attention is critical to their success in these tasks. In this work, we revisit the design of the spatial attention and demonstrate that a carefully-devised yet simple spatial attention mechanism performs favourably against the state-of-the-art schemes. As a result, we propose two vision transformer architectures, namely, Twins-PCPVT and Twins-SVT. Positioned against Swin Transformer, with comparable performance. [code](https://github.com/Meituan-AutoML/Twins) (Zhi Tian, Chunhua Shen) 27 | * [ConTNet: Why not use convolution and transformer at the same time?](https://arxiv.org/pdf/2104.13497.pdf) In this work, we innovatively propose ConTNet (Convolution-Transformer Network), combining transformer with ConvNet architectures to provide large receptive fields. The novelty is modest; the paper can only stack up “advantages” through experiments. [code](https://github.com/yan-hao-tian/ConTNet) 28 | * [HOTR: End-to-End Human-Object Interaction Detection with Transformers](https://arxiv.org/pdf/2104.13682.pdf) TASK: Human-Object Interaction (HOI) detection is a task of identifying “a set of interactions” in an image, which involves the i) localization of the subject (i.e., humans) and target (i.e., objects) of interaction, and ii) the classification of the interaction labels. PROBLEM: Most existing methods have indirectly addressed this task by detecting human and object instances and individually inferring every pair of the detected instances.
METHOD: In this paper, we present a novel framework, referred to as HOTR, which directly predicts a set of ⟨human, object, interaction⟩ triplets from an image based on a transformer encoder-decoder architecture. Through the set prediction, our method effectively exploits the inherent semantic relationships in an image and does not require time-consuming post-processing which is the main bottleneck of existing methods. Uses DETR for HOI; the storyline is again end-to-end: simplify the pipeline and eliminate post-processing. 29 | * [Point Cloud Learning with Transformer](https://arxiv.org/pdf/2104.13636.pdf) In this paper, we introduce a novel framework, called Multi-level Multi-scale Point Transformer (MLMSPT) that works directly on the irregular point clouds for representation learning. Specifically, a point pyramid transformer is investigated to model features with diverse resolutions or scales we defined, followed by a multi-level transformer module to aggregate contextual information from different levels of each scale and enhance their interactions 30 | * [Medical Transformer: Universal Brain Encoder for 3D MRI Analysis](https://arxiv.org/pdf/2104.13633.pdf) 31 | * [Inpainting Transformer for Anomaly Detection](https://arxiv.org/pdf/2104.13897.pdf) 32 | 33 | ##### Others 34 | * [Zero-Shot Detection via Vision and Language Knowledge Distillation](https://arxiv.org/pdf/2104.13921.pdf) MOTIVATION: Zero-shot image classification has made promising progress by training the aligned image and text encoders. The goal of this work is to advance zero-shot object detection, which aims to detect novel objects without bounding box nor mask annotations. We propose ViLD, a training method via Vision and Language knowledge Distillation. We distill the knowledge from a pre-trained zero-shot image classification model (e.g., CLIP [33]) into a two-stage detector (e.g., Mask R-CNN [17]). RESULT: We benchmark the performance on LVIS dataset [15] by holding out all rare categories as novel categories.
ViLD obtains 16.1 mask APr with a Mask R-CNN (ResNet-50 FPN) for zero-shot detection, outperforming the supervised counterpart by 3.8. The model can directly transfer to other datasets, achieving 72.2 AP50, 36.6 AP and 11.8 AP on PASCAL VOC, COCO and Objects365, respectively. (Tsung-Yi Lin) 35 | * [Shot Contrastive Self-Supervised Learning for Scene Boundary Detection](https://arxiv.org/pdf/2104.13537.pdf) TASK: We presented a self-supervised learning approach to learn a shot representation for long-form videos using unlabeled video data. MOTIVATION: Our approach is based on the key observation that nearby shots in movies and TV episodes tend to have the same set of actors enacting a cohesive story-arch, and are therefore in expectation more similar to each other than a set of randomly selected shots. METHOD: We used this observation to consider nearby similar shots as augmented versions of each other and demonstrated that when used in a contrastive learning setting, this augmentation scheme can encode the scene-structure more effectively than existing augmentation schemes that are primarily geared towards images and short videos 36 | * [Efficient Pre-trained Features and Recurrent Pseudo-Labeling in Unsupervised Domain Adaptation](https://arxiv.org/pdf/2104.13486.pdf) In this paper, we show how to efficiently opt for the best pre-trained features from seventeen well-known ImageNet models in unsupervised DA problems. In addition, we propose a recurrent pseudo-labeling model using the best pre-trained features (termed PRPL) to improve classification performance. 37 | * [Exploring Relational Context for Multi-Task Dense Prediction](https://arxiv.org/pdf/2104.13874.pdf) TASK: We consider a multi-task environment for dense prediction tasks, represented by a common backbone and independent task-specific heads. Our goal is to find the most efficient way to refine each task prediction by capturing cross-task contexts dependent on tasks’ relations. 
METHOD: Empirical findings confirm that different source-target task pairs benefit from different context types. To automate the selection process, we propose an Adaptive Task-Relational Context (ATRC) module, which samples the pool of all available contexts for each task pair using neural architecture search and outputs the optimal configuration for deployment. (Luc Van Gool) 38 | * [Domain Adaptive Semantic Segmentation with Self-Supervised Depth Estimation](https://arxiv.org/pdf/2104.13613.pdf) Uses self-supervised depth estimation to enhance domain-adaptive semantic segmentation (the two tasks reinforce each other) [code](https://github.com/qinenergy/corda) (Dengxin Dai, Luc Van Gool) 39 | * [Semi-Supervised Semantic Segmentation with Pixel-Level Contrastive Learning from a Class-wise Memory Bank](https://arxiv.org/pdf/2104.13415.pdf) Pixel-level contrastive learning for semi-supervised semantic segmentation. This module enforces the segmentation network to yield similar pixel-level feature representations for same-class samples across the whole dataset. To achieve this, we maintain a memory bank continuously updated with feature vectors from labeled data. These features are selected based on their quality and relevance for the contrastive learning. 40 | * [FrameExit: Conditional Early Exiting for Efficient Video Recognition](https://arxiv.org/pdf/2104.13400.pdf) While existing works focus on selecting a subset of salient frames to reduce the computation costs, we propose to use a simple sampling strategy combined with conditional early exiting to enable efficient recognition. 41 | * [AdvHaze: Adversarial Haze Attack](https://arxiv.org/pdf/2104.13673.pdf) MOTIVATION: However, previous attack methods have mainly focused on applying some ℓp norm-bounded noise perturbations. In this paper, we instead introduce a novel adversarial attack method based on haze, which is a common phenomenon in real-world scenery. Our method can synthesize potentially adversarial haze into an image based on the atmospheric scattering model with high realisticity and mislead classifiers to predict an incorrect class.
SIGNIFICANCE: We hope this work can boost the development of non-noise-based adversarial attacks and help evaluate and improve the robustness of DNNs. 42 | * [Contrastive Spatial Reasoning on Multi-View Line Drawings](https://arxiv.org/pdf/2104.13433.pdf) Spatial reasoning on multi-view line drawings by state-of-the-art supervised deep networks is recently shown with puzzling low performances on the SPARE3D dataset. To study the reason behind the low performance and to further our understandings of these tasks, we design controlled experiments on both input data and network designs. Guided by the hindsight from these experiment results, we propose a simple contrastive learning approach along with other network modifications to improve the baseline performance. 43 | * [LambdaUNet: 2.5D Stroke Lesion Segmentation of Diffusion-weighted MR Images](https://arxiv.org/pdf/2104.13917.pdf) applies LambdaNet to medical image processing 44 | * [MOD: Benchmark for Military Object Detection](https://arxiv.org/pdf/2104.13763.pdf) 45 | 46 | 47 | #### 20210428 48 | * [Multimodal Contrastive Training for Visual Representation Learning](https://arxiv.org/pdf/2104.12836.pdf) METHOD: Different from VirTex [10], our method not only learns the cross-modal correlation between images and captions, but also exploits intrinsic data properties in a self-supervised manner within each modality. RESULT: For example, the visual representations pre-trained on COCO by our method achieve state-of-the-art top-1 validation accuracy of 55.3% on ImageNet classification, under the common transfer protocol. 49 | * [Explaining in Style: Training a GAN to explain a classifier in StyleSpace](https://arxiv.org/pdf/2104.13369.pdf) Classifier explainability based on StyleGAN. Image classification models can depend on multiple different semantic attributes of the image. An explanation of the decision of the classifier needs to both discover and visualize these properties.
Here we present StylEx, a method for doing this, by training a generative model to specifically explain multiple attributes that underlie classifier decisions. (Phillip Isola) 50 | * [BasicVSR++: Improving Video Super-Resolution with Enhanced Propagation and Alignment](https://arxiv.org/pdf/2104.13371.pdf) Three first places at NTIRE 2021. BACKGROUND: A recurrent structure is a popular framework choice for the task of video super-resolution. The state-of-the-art method BasicVSR adopts bidirectional propagation with feature alignment to effectively exploit information from the entire input video. METHOD: In this study, we redesign BasicVSR by proposing second-order grid propagation and flow-guided deformable alignment. We show that by empowering the recurrent framework with the enhanced propagation and alignment, one can exploit spatiotemporal information across misaligned video frames more effectively. RESULTS: In addition to video super-resolution, BasicVSR++ generalizes well to other video restoration tasks such as compressed video enhancement. In NTIRE 2021, BasicVSR++ obtains three champions and one runner-up in the Video Super-Resolution and Compressed Video Enhancement Challenges. (Chen Change Loy) 51 | * [Unsupervised 3D Shape Completion through GAN Inversion](https://arxiv.org/pdf/2104.13366.pdf) (Chen Change Loy) 52 | * [Self-distillation with Batch Knowledge Ensembling Improves ImageNet Classification](https://arxiv.org/pdf/2104.13298.pdf) BACKGROUND: The recent studies of knowledge distillation have discovered that ensembling the “dark knowledge” from multiple teachers or students contributes to creating better soft targets for training, but at the cost of significantly more computations and/or parameters. METHOD: In this work, we present BAtch Knowledge Ensembling (BAKE) to produce refined soft targets for anchor images by propagating and ensembling the knowledge of the other samples in the same mini-batch.
Specifically, for each sample of interest, the propagation of knowledge is weighted in accordance with the inter-sample affinities, which are estimated on-the-fly with the current network. (Does this depend on a larger batch size?) RESULT: Extensive experiments demonstrate that the lightweight yet effective BAKE consistently boosts the classification performance of various architectures on multiple datasets, e.g., a significant +1.2% gain of ResNet-50 on ImageNet with only +3.7% computational overhead and zero additional parameters. (Hongsheng Li) 53 | * [Sifting out the features by pruning: Are convolutional networks the winning lottery ticket of fully connected ones?](https://arxiv.org/pdf/2104.13343.pdf) SIGNIFICANCE: Our results show that the winning lottery tickets of FCNs display the key features of CNNs. The ability of such automatic network-simplifying procedure to recover the key features “hand-crafted” in the design of CNNs suggests interesting applications to other datasets and tasks, in order to discover new and efficient architectural inductive biases. Funny Perspective (Physician) 54 | * [Adapting ImageNet-scale models to complex distribution shifts with self-learning](https://arxiv.org/pdf/2104.12928.pdf) Studies self-training for domain adaptation on ImageNet-scale datasets. EMPIRICAL: While self-learning methods are an important component in many recent domain adaptation techniques, they are not yet comprehensively evaluated on ImageNet-scale datasets common in robustness research. In extensive experiments on ResNet and EfficientNet models, we find that three components are crucial for increasing performance with self-learning: (i) using short update times between the teacher and the student network, (ii) fine-tuning only few affine parameters distributed across the network, and (iii) leveraging methods from robust classification to counteract the effect of label noise.
DATASET: We therefore re-purpose the dataset from the Visual Domain Adaptation Challenge 2019 and use a subset of it as a new robustness benchmark (ImageNet-D) which proves to be a more challenging dataset for all current state-of-the-art models (58.2% error) to guide future research efforts at the intersection of robustness and domain adaptation on ImageNet scale. 55 | * [Graphical Modeling for Multi-Source Domain Adaptation](https://arxiv.org/pdf/2104.13057.pdf) MOTIVATION: In Multi-Source Domain Adaptation, it is essential to utilize the labeled source data and the unlabeled target data to approach the conditional distribution of semantic label on target domain, which requires the joint modeling across different domains and also an effective domain combination scheme. The graphical structure among different domains is useful to tackle these challenges, in which the interdependency among various instances/categories can be effectively modeled. METHOD: In this work, we propose two types of graphical models, i.e. Conditional Random Field for MSDA (CRF-MSDA) and Markov Random Field for MSDA (MRF-MSDA), for cross-domain joint modeling and learnable domain combination. (Bingbing Ni) 56 | * [Unsupervised Multi-Source Domain Adaptation for Person Re-Identification](https://arxiv.org/pdf/2104.12961.pdf) TASK: To make full use of the valuable labeled data, we introduce the multi-source concept into UDA person re-ID field, where multiple source datasets are used during training. METHOD: In this paper, we try to address this problem from two perspectives, i.e. domain-specific view and domain-fusion view. RESULT: The proposed method outperforms state-of-the-art UDA person re-ID methods by a large margin, and even achieves comparable performance to the supervised approaches without any post-processing techniques.
57 | * [Width transfer: on the (in)variance of width optimization](https://arxiv.org/pdf/2104.13255.pdf) Optimizing the channel counts for different layers of a CNN has shown great promise in improving the efficiency of CNNs at test-time. In this work, we propose width transfer, a technique that harnesses the assumptions that the optimized widths (or channel counts) are regular across sizes and depths. 58 | * [Dual Transformer for Point Cloud Analysis](https://arxiv.org/pdf/2104.13044.pdf) 59 | * [Every Annotation Counts: Multi-label Deep Supervision for Medical Image Segmentation](https://arxiv.org/pdf/2104.13243.pdf) Medical image segmentation. Pixel-wise segmentation is one of the most data and annotation hungry tasks in our field. Providing representative and accurate annotations is often mission-critical especially for challenging medical applications. METHOD: Our approach is based on a new formulation of deep supervision and student-teacher model and allows for easy integration of different supervision signals. In contrast to previous work, we show that care has to be taken how deep supervision is integrated in lower layers and we present multi-label deep supervision as the most important secret ingredient for success. RESULT: We are able to cut the requirement for expensive labels by 94.22% – narrowing the gap to the best fully supervised baseline to only 5% mean IoU. Our approach is validated by extensive experiments on retinal fluid segmentation and we provide an in-depth analysis of the anticipated effect each annotation type can have in boosting segmentation performance.
(CVPR21) 60 | * [Underwater Image Enhancement via Medium Transmission-Guided Multi-Color Space Embedding](https://arxiv.org/pdf/2104.13015.pdf) (Wenqi Ren, TIP) 61 | 62 | 63 | 64 | #### 20210427 65 | ##### Vision Transformer 66 | * [Improve Vision Transformers Training by Suppressing Over-smoothing](https://arxiv.org/pdf/2104.12753.pdf) ANALYSIS: This work investigates how to stabilize the training of vision transformers without special structure modification. We observe that the instability of transformer training on vision tasks can be attributed to an over-smoothing problem: the self-attention layers tend to map the different patches from the input image into a similar latent representation, hence yielding the loss of information and degeneration of performance, especially when the number of layers is large. TECHNICALLY: We then propose a number of techniques to alleviate this problem, including introducing additional loss functions to encourage diversity, prevent loss of information, and discriminate different patches by additional patch classification loss for Cutmix. RESULT: We show that our proposed techniques stabilize the training and allow us to train wider and deeper vision transformers, achieving 85.0% top-1 accuracy on ImageNet validation set without introducing extra teachers or additional convolution layers. [code](https://github.com/ChengyueGongR/PatchVisionTransformer) 67 | * [MDETR - Modulated Detection for End-to-End Multi-Modal Understanding](https://arxiv.org/pdf/2104.12763.pdf) PROBLEM: Multi-modal reasoning systems rely on a pre-trained object detector to extract regions of interest from the image. However, this crucial module is typically used as a black box, trained independently of the downstream task and on a fixed vocabulary of objects and attributes. This makes it challenging for such systems to capture the long tail of visual concepts expressed in free form text. Existing cross-modal pre-training treats the detector as a black box and does not use it effectively.
METHOD: In this paper we propose MDETR, an end-to-end modulated detector that detects objects in an image conditioned on a raw text query, like a caption or a question. We use a transformer-based architecture to reason jointly over text and image by fusing the two modalities at an early stage of the model. RESULT: Besides achieving SOTA on downstream cross-modal tasks, we also investigate the utility of our model as an object detector on a given label set when fine-tuned in a few-shot setting. We show that our pre-training approach provides a way to handle the long tail of object categories which have very few labelled instances. [code](https://github.com/ashkamath/mdetr) (Yann LeCun, Nicolas Carion) 68 | * [Visformer: The Vision-friendly Transformer](https://arxiv.org/pdf/2104.12533.pdf) This paper offers an empirical study by performing step-by-step operations to gradually transit a Transformer-based model to a convolution-based model. The results we obtain during the transition process deliver useful messages for improving visual recognition. Based on these observations, we propose a new architecture named Visformer. RESULT: With the same computational complexity, Visformer outperforms both the Transformer-based and convolution-based models in terms of ImageNet classification accuracy, and the advantage becomes more significant when the model complexity is lower or the training set is smaller. [code](https://github.com/danczs/Visformer) (Qi Tian) 69 | * [Playing Lottery Tickets with Vision and Language](https://arxiv.org/pdf/2104.11832.pdf) Models such as LXMERT, ViLBERT and UNITER have significantly lifted the state of the art over a wide range of V+L tasks. However, the large number of parameters in such models hinders their application in practice. EMPIRICAL: In this work, we perform the first empirical study to assess whether such trainable subnetworks also exist in pre-trained V+L models.
We use UNITER, one of the best-performing V+L models, as the testbed, and consolidate 7 representative V+L tasks for experiments. FINDINGS: Through comprehensive analysis, we summarize our main findings as follows. (i) It is difficult to find subnetworks (i.e., the tickets) that strictly match the performance of the full UNITER model. However, it is encouraging to confirm that we can find “relaxed” winning tickets at 50%-70% sparsity that maintain 99% of the full accuracy. (ii) Subnetworks found by task-specific pruning transfer reasonably well to the other tasks, while those found on the pre-training tasks at 60%/70% sparsity transfer universally, matching 98%/96% of the full accuracy on average over all the tasks. (iii) Adversarial training can be further used to enhance the performance of the found lottery tickets. (Jingjing Liu) 70 | * [M3DETR: Multi-representation, Multi-scale, Mutual-relation 3D Object Detection with Transformers](https://arxiv.org/pdf/2104.11896.pdf) We present a novel architecture for 3D object detection, M3DETR, which combines different point cloud representations (raw, voxels, bird-eye view) with different feature scales based on multi-scale feature pyramids. SIGNIFICANCE: M3DETR is the first approach that unifies multiple point cloud representations, feature scales, as well as models mutual relationships between point clouds simultaneously using transformers. RESULT: Our method achieves state-of-the-art performance on the KITTI 3D object detection dataset and Waymo Open Dataset. Results show that M3DETR improves the baseline significantly by 1.48% mAP for all classes on Waymo Open Dataset.
In particular, our approach ranks 1st on the well-known KITTI 3D Detection Benchmark for both car and cyclist classes, and ranks 1st on Waymo Open Dataset with single frame point cloud input. 71 | * [Diverse Image Inpainting with Bidirectional and Autoregressive Transformers](https://arxiv.org/pdf/2104.12335.pdf) (Shijian Lu) 72 | * [GPT2MVS: Generative Pre-trained Transformer-2 for Multi-modal Video Summarization](https://arxiv.org/pdf/2104.12465.pdf) TASK: When multi-modal video summarization is used to help video exploration, a text-based query is considered as one of the main drivers of video summary generation, as it is user-defined. METHOD: In this work, a new method is proposed that uses a specialized attention network and contextualized word representations to tackle this task. The proposed model consists of a contextualized video summary controller, multi-modal attention mechanisms, an interactive attention network, and a video summary generator. RESULT: Experimental results show that the proposed model is effective with the increase of +5.88% in accuracy and +4.06% increase of F1-score, compared with the state-of-the-art method. [code](https://github.com/Jhhuangkay/GPT2MVS-Generative-Pre-trained-Transformer-2-for-Multi-modal-Video-Summarization) 73 | * [Visual Saliency Transformer](https://arxiv.org/pdf/2104.12099.pdf) (Ling Shao) 74 | * [RelTransformer: Balancing the Visual Relationship Detection from Local Context, Scene and Memory](https://arxiv.org/pdf/2104.11934.pdf) TASK: Visual relationship recognition (VRR) PROBLEM: Several recent studies showed that the long-tail problem in VRR is even more critical than that in object recognition due to the compositional complexity and structure.
METHOD: To overcome this limitation, we propose a novel transformer-based framework, dubbed as RelTransformer, which performs relationship prediction using rich semantic features from multiple image levels 75 | 76 | 77 | ##### Others 78 | * [How Well Self-Supervised Pre-Training Performs with Streaming Data?](https://arxiv.org/pdf/2104.12081.pdf) Considers streaming data in real-world scenarios. CONCEPT: The common self-supervised pre-training practice requires collecting massive unlabeled data together and then trains a representation model, dubbed joint training. However, in real-world scenarios where data are collected in a streaming fashion, the joint training scheme is usually storage-heavy and time-consuming. A more efficient alternative is to train a model continually with streaming data, dubbed sequential training. PURPOSE: Nevertheless, it is unclear how well sequential self-supervised pre-training performs with streaming data. In this paper, we conduct thorough experiments to investigate self-supervised pre-training with streaming data. Specifically, we evaluate the transfer performance of sequential self-supervised pre-training with four different data sequences on three different downstream tasks and make comparisons with joint self-supervised pre-training. FINDINGS: Surprisingly, we find sequential self-supervised learning exhibits almost the same performance as the joint training when the distribution shifts within streaming data are mild. Even for data sequences with large distribution shifts, sequential self-supervised training with simple techniques, e.g., parameter regularization or data replay, still performs comparably to joint training. CONCLUSION: Based on our findings, we recommend using sequential self-supervised training as a more efficient yet performance-competitive representation learning practice for real-world applications.
(Jiashi Feng) 79 | * [Joint Representation Learning and Novel Category Discovery on Single- and Multi-modal Data](https://arxiv.org/pdf/2104.12673.pdf) This paper studies the problem of novel category discovery on single- and multi-modal data with labels from different but relevant categories. We present a generic, end-to-end framework to jointly learn a reliable representation and assign clusters to unlabelled data. To avoid over-fitting the learnt embedding to labelled data, we take inspiration from self-supervised representation learning by noise-contrastive estimation and extend it to jointly handle labelled and unlabelled data. RESULT: We thoroughly evaluate our framework on large-scale multi-modal video benchmarks Kinetics-400 and VGG-Sound, and image benchmarks CIFAR10, CIFAR100 and ImageNet, obtaining state-of-the-art results. 80 | * [Multimodal Clustering Networks for Self-supervised Learning from Unlabeled Videos](https://arxiv.org/pdf/2104.12671.pdf) This paper proposes a self-supervised training framework that learns a common multimodal embedding space that, in addition to sharing representations across different modalities, enforces a grouping of semantically similar instances. To this end, we extend the concept of instance-level contrastive learning with a multimodal clustering step in the training pipeline to capture semantic similarities across modalities. Self-supervised pre-training over three modalities, with no need for paired data! (Shih-Fu Chang) 81 | * [Mutual Contrastive Learning for Visual Representation Learning](https://arxiv.org/pdf/2104.12565.pdf) We present a collaborative learning method called Mutual Contrastive Learning (MCL) for general visual representation learning. The core idea of MCL is to perform mutual interaction and transfer of contrastive distributions among a cohort of models.
RESULT: Experimental results on supervised and self-supervised image classification, transfer learning and few-shot learning show that MCL can lead to consistent performance gains, demonstrating that MCL can guide the network to generate better feature representation learning. 82 | * [2.5D Visual Relationship Detection](https://arxiv.org/pdf/2104.12727.pdf) PROBLEM: Visual 2.5D perception involves understanding the semantics and geometry of a scene through reasoning about object relationships with respect to the viewer in an environment. However, existing works in visual recognition primarily focus on the semantics. NEW TASK: To bridge this gap, we study 2.5D visual relationship detection (2.5VRD), in which the goal is to jointly detect objects and predict their relative depth and occlusion relationships. Unlike general VRD, 2.5VRD is egocentric, using the camera’s viewpoint as a common reference for all 2.5D relationships. Unlike depth estimation, 2.5VRD is object-centric and not only focuses on depth. DATASET: To enable progress on this task, we create a new dataset consisting of 220k human-annotated 2.5D relationships among 512K objects from 11K images. Our results show that existing models largely rely on semantic cues and simple heuristics to solve 2.5VRD, motivating further research on models for 2.5D perception. 
(Ming-Hsuan Yang, Boqing Gong) 83 | * [Towards Good Practices for Efficiently Annotating Large-Scale Image Classification Datasets](https://arxiv.org/pdf/2104.12690.pdf) A new dataset annotation method for collecting multi-class classification labels for a large collection of images, combining self-supervision with a machine labeler (CVPR'21, Oral) 84 | * [Carrying out CNN Channel Pruning in a White Box](https://arxiv.org/pdf/2104.11883.pdf) (Rongrong Ji) 85 | * [CompOFA: Compound Once-For-All Networks for Faster Multi-Platform Deployment](https://arxiv.org/pdf/2104.12642.pdf) (ICLR'21) 86 | * [Practical Wide-Angle Portraits Correction with Deep Structured Models](https://arxiv.org/pdf/2104.12464.pdf) (CVPR'21, Haoqiang Fan) 87 | * [Delving into Data: Effectively Substitute Training for Black-box Attack](https://arxiv.org/pdf/2104.12378.pdf) (CVPR'21) 88 | * [StegaPos: Preventing Crops and Splices with Imperceptible Positional Encodings](https://arxiv.org/pdf/2104.12290.pdf) Funny 89 | * [Piggyback GAN: Efficient Lifelong Learning for Image Conditioned Generation](https://arxiv.org/pdf/2104.11939.pdf) 90 | * [Clean Images are Hard to Reblur: A New Clue for Deblurring](https://arxiv.org/pdf/2104.12665.pdf) 91 | * [Rich Semantics Improve Few-shot Learning](https://arxiv.org/pdf/2104.12709.pdf) 92 | 93 | 94 | 95 | 96 | #### 20210426 97 | * [VidTr: Video Transformer Without Convolutions](https://arxiv.org/pdf/2104.11746.pdf) We introduce Video Transformer (VidTr) with separable-attention for video classification. Compared with commonly used 3D networks, VidTr is able to aggregate spatiotemporal information via stacked attentions and provide better performance with higher efficiency. 98 | * [Learning to Cluster Faces via Transformer](https://arxiv.org/pdf/2104.11502.pdf) PROBLEM: The main challenge is that it is difficult to cluster images from the same identity with different face poses, occlusions, and image quality.
METHOD: In this paper, we repurpose the well-known Transformer and introduce a Face Transformer for supervised face clustering. In Face Transformer, we decompose the face clustering into two steps: relation encoding and linkage predicting. 99 | * [Skeletor: Skeletal Transformers for Robust Body-Pose Estimation](https://arxiv.org/pdf/2104.11712.pdf) However, rather than tracking body parts and trying to temporally smooth them, we propose a novel transformer-based network that can learn a distribution over both pose and motion in an unsupervised fashion. 100 | * [Deep Lucas-Kanade Homography for Multimodal Image Alignment](https://arxiv.org/pdf/2104.11693.pdf) TASK: Estimating homography to align image pairs captured by different sensors or image pairs with large appearance changes is an important and general challenge for many computer vision applications. METHOD: In contrast to others, we propose a generic solution to pixel-wise align multimodal image pairs by extending the traditional Lucas-Kanade algorithm with networks. Funny 101 | * [A Closer Look at Self-training for Zero-Label Semantic Segmentation](https://arxiv.org/pdf/2104.11692.pdf) PROBLEM: Prior zero-label semantic segmentation works approach this task by learning visual-semantic embeddings or generative models. However, they are prone to overfitting on the seen classes because there is no training signal for them. METHOD: We assume that pixels of unseen classes could be present in the training images but without being annotated. Our idea is to capture the latent information on unseen classes by supervising the model with self-produced pseudo-labels for unlabeled pixels. We propose a consistency regularizer to filter out noisy pseudo-labels by taking the intersections of the pseudo-labels generated from different augmentations of the same image. Our framework generates pseudo-labels and then retrains the model with human-annotated and pseudo-labelled data. 
102 | * [Skip-Convolutions for Efficient Video Processing](https://arxiv.org/pdf/2104.11487.pdf) We propose Skip-Convolutions to leverage the large amount of redundancies in video streams and save computations. Each video is represented as a series of changes across frames and network activations, denoted as residuals. We reformulate standard convolution to be efficiently computed on residual frames: each layer is coupled with a binary gate deciding whether a residual is important to the model prediction, e.g. foreground regions, or it can be safely skipped, e.g. background regions. RESULT: By replacing all convolutions with Skip-Convolutions in two state-of-the-art architectures, namely EfficientDet and HRNet, we reduce their computational cost consistently by a factor of 3 ∼ 4× for two different tasks, without any accuracy drop. 103 | * [H2O: A Benchmark for Visual Human-human Object Handover Analysis](https://arxiv.org/pdf/2104.11466.pdf) Object handover is a common human collaboration behavior that attracts attention from researchers in Robotics and Cognitive Science. Though visual perception plays an important role in the object handover task, the whole handover process has been specifically explored. In this work, we propose a novel rich-annotated dataset, H2O, for visual analysis of human-human object handovers. The H2O, which contains 18K video clips involving 15 people who hand over 30 objects to each other, is a multi-purpose benchmark. Funny task. 104 | * [Motion Representations for Articulated Animation](https://arxiv.org/pdf/2104.11280.pdf) We propose novel motion representations for animating articulated objects consisting of distinct parts. In a completely unsupervised manner, our method identifies object parts, tracks them in a driving video, and infers their motions by considering their principal axes. 
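
A rough sketch of the residual-frame idea in the Skip-Convolutions note above, not the paper's implementation. It assumes a 1x1 convolution over a flattened frame and a hand-set threshold in place of the learned binary gate; the names `conv1x1`, `skip_conv_step`, and `threshold` are made up for illustration. Because convolution is linear, the output for the current frame equals the previous output plus the convolution of the residual, so pixels with a negligible residual can reuse their previous activation.

```python
# Sketch only: a 1x1 convolution over a flattened frame, with a hand-set
# threshold standing in for the learned binary gate (an assumption).

def conv1x1(pixel, weights):
    """Apply a 1x1 convolution to one pixel: out[o] = sum_c weights[o][c] * pixel[c]."""
    return [sum(w * v for w, v in zip(row, pixel)) for row in weights]


def skip_conv_step(prev_frame, prev_out, frame, weights, threshold=0.0):
    """Update the previous output using only pixels whose residual passes the gate.

    Convolution is linear, so conv(frame) == conv(prev_frame) + conv(frame - prev_frame).
    Returns the new output and the number of pixels that were actually computed.
    """
    out, computed = [], 0
    for p_prev, o_prev, p_cur in zip(prev_frame, prev_out, frame):
        residual = [c - p for c, p in zip(p_cur, p_prev)]
        # Gate: compute the pixel only if its residual is large enough.
        if max(abs(r) for r in residual) > threshold:
            delta = conv1x1(residual, weights)
            out.append([o + d for o, d in zip(o_prev, delta)])
            computed += 1
        else:
            out.append(list(o_prev))  # skipped: reuse the previous activation
    return out, computed


weights = [[1.0, 2.0], [0.5, -1.0]]            # 2 input channels -> 2 output channels
frame0 = [[1.0, 0.0], [2.0, 1.0], [0.0, 3.0]]  # 3 pixels, 2 channels each
frame1 = [[1.0, 0.0], [2.0, 1.0], [1.0, 3.0]]  # only the last pixel changed

out0 = [conv1x1(p, weights) for p in frame0]   # dense pass on the first frame
out1, computed = skip_conv_step(frame0, out0, frame1, weights)
dense1 = [conv1x1(p, weights) for p in frame1]  # reference: dense pass on frame1
```

With `threshold=0.0` the update matches the dense pass on this toy input while touching only the changed pixel. A larger threshold trades accuracy for fewer computed pixels, which is the role the gate plays in the note.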
105 | 106 | #### 20210423 107 | ##### vision transformer 108 | * [VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text](https://arxiv.org/pdf/2104.11178.pdf) Our Video-Audio-Text Transformer (VATT) takes raw signals as inputs and extracts multimodal representations that are rich enough to benefit a variety of downstream tasks. (Shih-Fu Chang, Boqing Gong) 109 | * [Multiscale Vision Transformers](https://arxiv.org/pdf/2104.11227.pdf) METHOD: We present Multiscale Vision Transformers (MViT) for video and image recognition, by connecting the seminal idea of multiscale feature hierarchies with transformer models. Starting from the input resolution and a small channel dimension, the stages hierarchically expand the channel capacity while reducing the spatial resolution. This creates a multiscale pyramid of features with early layers operating at high spatial resolution to model simple low-level visual information, and deeper layers at spatially coarse, but complex, high-dimensional features. RESULT: We evaluate this fundamental architectural prior for modeling the dense nature of visual signals for a variety of video recognition tasks where it outperforms concurrent vision transformers that rely on large scale external pre-training and are 5-10× more costly in computation and parameters. We further remove the temporal dimension and apply our model for image classification where it outperforms prior work on vision transformers. A very straightforward idea, but very effective (Haoqi Fan) 110 | * [ImageNet-21K Pretraining for the Masses](https://arxiv.org/pdf/2104.10972.pdf) This paper aims to close this gap, and make high-quality efficient pretraining on ImageNet-21K available for everyone. Via a dedicated preprocessing stage, utilizing WordNet hierarchies, and a novel training scheme called semantic softmax, we show that various models, including small mobile-oriented models, significantly benefit from ImageNet-21K pretraining on numerous datasets and tasks. 
We also show that we outperform previous ImageNet-21K pretraining schemes for prominent new models like ViT. [code](https://github.com/Alibaba-MIIL/ImageNet21K) TODO 111 | * [So-ViT: Mind Visual Tokens for Vision Transformer](https://arxiv.org/pdf/2104.10935.pdf) PROBLEM: However, the high performance of the original ViT heavily depends on pretraining using ultra large-scale datasets, and it significantly underperforms on ImageNet-1K if trained from scratch. METHOD: (1) This paper makes the efforts toward addressing this problem, by carefully considering the role of visual tokens. First, for classification head, existing ViT only exploits class token while entirely neglecting rich semantic information inherent in high-level visual tokens. Therefore, we propose a new classification paradigm, where the second-order, cross-covariance pooling of visual tokens is combined with class token for final classification. (2) Second, the original ViT employs the naive embedding of fixed-size image patches, lacking the ability to model translation equivariance and locality. To alleviate this problem, we develop a light-weight, hierarchical module based on off-the-shelf convolutions for visual token embedding. 112 | * [Token Labeling: Training a 85.5% Top-1 Accuracy Vision Transformer with 56M Parameters on ImageNet](https://arxiv.org/pdf/2104.10858.pdf) This paper aims to provide an accuracy versus model-complexity trade-off for vision transformers; Figure 1 gives a fairly comprehensive comparison. PROBLEM: While recent vision transformers have demonstrated promising results in ImageNet classification, their performance still lags behind powerful convolutional neural networks (CNNs) with approximately the same model size. METHOD: In this work, instead of describing a novel transformer architecture, we explore the potential of vision transformers in ImageNet classification by developing a bag of training techniques. 
We show that by slightly tuning the structure of vision transformers and introducing token labeling—a new training objective, our models are able to achieve better results than the CNN counterparts and other transformer-based classification models with similar amount of training parameters and computations. [code](https://github.com/zihangJiang/TokenLabeling) (Jiashi Feng) 113 | * [KeypointDeformer: Unsupervised 3D Keypoint Discovery for Shape Control](https://arxiv.org/pdf/2104.11224.pdf) 114 | 115 | ##### CVPR21 116 | * [Hierarchical Motion Understanding via Motion Programs](https://arxiv.org/pdf/2104.11216.pdf) (Jiajun Wu) 117 | * [Distilling Audio-Visual Knowledge by Compositional Contrastive Learning](https://arxiv.org/pdf/2104.10955.pdf) Cross-modal contrastive learning. In this work, we propose to transfer knowledge across heterogeneous modalities, even though these data modalities may not be semantically correlated. Rather than directly aligning the representations of different modalities, we compose audio, image, and video representations across modalities to uncover richer multi-modal knowledge. We establish a new, comprehensive multi-modal distillation benchmark on three video datasets: UCF101, ActivityNet, and VGGSound. Moreover, we demonstrate that our model significantly outperforms a variety of existing knowledge distillation methods in transferring audio-visual knowledge to improve video representation learning. [code](https://github.com/yanbeic/CCL) TODO 118 | * [Heterogeneous Grid Convolution for Adaptive, Efficient, and Controllable Computation](https://arxiv.org/pdf/2104.11176.pdf) This paper proposes a novel heterogeneous grid convolution that builds a graph-based image representation by exploiting heterogeneity in the image content, enabling adaptive, efficient, and controllable computations in a convolutional architecture. 
We have evaluated the proposed approach on four image understanding tasks, semantic segmentation, object localization, road extraction, and salient object detection. The idea is similar to Efficient Segmentation: Learning Downsampling Near Semantic Boundaries (ICCV'19), but the story is told in terms of graphs. 119 | * [Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation](https://arxiv.org/pdf/2104.11116.pdf) In this paper, we propose a clean yet effective framework to generate pose-controllable talking faces. (Ziwei Liu) 120 | * [DANNet: A One-Stage Domain Adaptation Network for Unsupervised Nighttime Semantic Segmentation](https://arxiv.org/pdf/2104.10834.pdf) SETTING: It employs an adversarial training with a labeled daytime dataset and an unlabeled dataset that contains coarsely aligned day-night image pairs. METHOD: Specifically, for the unlabeled day-night image pairs, we use the pixel-level predictions of static object categories on a daytime image as a pseudo supervision to segment its counterpart nighttime image. We further design a re-weighting strategy to handle the inaccuracy caused by misalignment between day-night image pairs and wrong predictions of daytime images, as well as boost the prediction accuracy of small objects. RESULT: Extensive experiments on Dark Zurich and Nighttime Driving datasets show that our method achieves state-of-the-art performance for nighttime semantic segmentation. 
121 | * [ManipulaTHOR: A Framework for Visual Object Manipulation](https://arxiv.org/pdf/2104.11213.pdf) (Oral) 122 | 123 | ##### Others 124 | * [Pri3D: Can 3D Priors Help 2D Representation Learning?](https://arxiv.org/pdf/2104.11225.pdf) (Saining Xie) 125 | * [Domain Adaptation for Semantic Segmentation via Patch-Wise Contrastive Learning](https://arxiv.org/pdf/2104.11056.pdf) Unlike many earlier methods that rely on adversarial learning for feature alignment, we leverage contrastive learning to bridge the domain gap by aligning the features of structurally similar label patches across domains. As a result, the networks are easier to train and deliver better performance, particularly with a small number of target domain annotations. It can also be naturally extended to weakly-supervised domain adaptation, where only a minor drop in accuracy can save up to 75% of annotation cost. 126 | * [Lighting the Darkness in the Deep Learning Era](https://arxiv.org/pdf/2104.10729.pdf) A survey of low-light image enhancement that also provides a new dataset and benchmark for future research. (Ming-Ming Cheng, Chen Change Loy) 127 | * [On Buggy Resizing Libraries and Surprising Subtleties in FID Calculation](https://arxiv.org/pdf/2104.11222.pdf) We investigate the sensitivity of the Fréchet Inception Distance (FID) score to inconsistent and often incorrect implementations across different image processing libraries. FID score is widely used to evaluate generative models, but each FID implementation uses a different low-level image processing process. OBSERVATION: We observe that numerous subtle choices need to be made for FID calculation and a lack of consistencies in these choices can lead to vastly different FID scores. In particular, we show that the following choices are significant: (1) selecting what image resizing library to use, (2) choosing what interpolation kernel to use, (3) what encoding to use when representing images. 
CONTRIBUTION: We additionally outline numerous common pitfalls that should be avoided and provide recommendations for computing the FID score accurately. We provide an easy-to-use optimized implementation of our proposed recommendations in the accompanying code. (Richard Zhang, Junyan Zhu) 128 | * [FCOS3D: Fully Convolutional One-Stage Monocular 3D Object Detection](https://arxiv.org/pdf/2104.10956.pdf) However, it is non-trivial to make a general adapted 2D detector work in this 3D task. In this technical report, we study this problem with a practice built on fully convolutional single-stage detector and propose a general framework FCOS3D. RESULT: Our solution achieves 1st place out of all the vision-only methods in the nuScenes 3D detection challenge of NeurIPS 2020. (Dahua Lin) 129 | * [Fully Convolutional Line Parsing](https://arxiv.org/pdf/2104.11207.pdf) We present a one-stage Fully Convolutional Line Parsing network (F-Clip) that detects line segments from images. F-Clip detects line segments in an end-to-end fashion by predicting them with each line’s center position, length, and angle. [code](https://github.com/Delay-Xili/F-Clip) (Yi Ma) 130 | 131 | #### 20210422 132 | * [MetricOpt: Learning to Optimize Black-Box Evaluation Metrics](https://arxiv.org/pdf/2104.10631.pdf) We study the problem of directly optimizing arbitrary non-differentiable task evaluation metrics such as misclassification rate and recall. Our method, named MetricOpt, operates in a black-box setting where the computational details of the target metric are unknown. We achieve this by learning a differentiable value function, which maps compact task-specific model parameters to metric observations. Result: MetricOpt achieves state-of-the-art performance on a variety of metrics for (image) classification, image retrieval and object detection. Solid benefits are found over competing methods, which often involve complex loss design or adaptation. 
MetricOpt also generalizes well to new tasks and model architectures. (CVPR21, Oral) 133 | * [Temporal Modulation Network for Controllable Space-Time Video Super-Resolution](https://arxiv.org/pdf/2104.10642.pdf) Space-time video super-resolution (STVSR) aims to increase the spatial and temporal resolutions of low-resolution and low-frame-rate videos. (Ming-Ming Cheng) 134 | * [Visualizing Adapted Knowledge in Domain Transfer](https://arxiv.org/pdf/2104.10602.pdf) To understand the adaptation process, we portray their knowledge difference with image translation. Specifically, we feed a translated image and its original version to the two models respectively, formulating two branches. Through updating the translated image, we force similar outputs from the two branches. When such requirements are met, differences between the two images can compensate for and hence represent the knowledge difference between models. (why?) Funny idea 135 | * [Balanced Knowledge Distillation for Long-tailed Learning](https://arxiv.org/pdf/2104.10510.pdf) 136 | * [Camouflaged Object Segmentation with Distraction Mining](https://arxiv.org/pdf/2104.10475.pdf) We develop a bio-inspired framework, termed Positioning and Focus Network (PFNet), which mimics the process of predation in nature. (Deng-Ping Fan) 137 | * [Fourier Contour Embedding for Arbitrary-Shaped Text Detection](https://arxiv.org/pdf/2104.10442.pdf) One of the main challenges for arbitrary-shaped text detection is to design a good text instance representation that allows networks to learn diverse text geometry variances. To tackle these problems, we model text instances in the Fourier domain and propose one novel Fourier Contour Embedding (FCE) method to represent arbitrary shaped text contours as compact signatures. 
138 | * [PP-YOLOv2: A Practical Object Detector](https://arxiv.org/pdf/2104.10419.pdf) [code](https://github.com/PaddlePaddle/PaddleDetection) 139 | * [Comprehensive Multi-Modal Interactions for Referring Image Segmentation](https://arxiv.org/pdf/2104.10412.pdf) We investigate Referring Image Segmentation (RIS), which outputs a segmentation map corresponding to the given natural language description. 140 | * [Towards Corruption-Agnostic Robust Domain Adaptation](https://arxiv.org/pdf/2104.10376.pdf) In this paper, we investigate a new task, Corruption-agnostic Robust Domain Adaptation (CRDA): to be accurate on original data and robust against unavailable-for-training corruptions on target domains. TODO 141 | * [Guided Interactive Video Object Segmentation Using Reliability-Based Attention Maps](https://arxiv.org/pdf/2104.10386.pdf) [code](https://github.com/yuk6heo/GIS-RAmap) We propose a novel guided interactive segmentation (GIS) algorithm for video objects to improve the segmentation accuracy and reduce the interaction time. Introduces an active-learning-like concept into interactive video segmentation, using reliability to recommend to the user which frames to annotate (CVPR21 Oral) 142 | * [SRWarp: Generalized Image Super-Resolution under Arbitrary Transformation](https://arxiv.org/pdf/2104.10386.pdf) Recent approaches extend the scope to real-valued upsampling factors, even with varying aspect ratios to handle the limitation. In this paper, we propose the SRWarp framework to further generalize the SR tasks toward an arbitrary image transformation. Compared with previous methods, we do not constrain the SR model on a regular grid but allow numerous possible deformations for flexible and diverse image editing. (CVPR21) Funny task. 143 | * [Invertible Denoising Network: A Light Solution for Real Noise Removal](https://arxiv.org/pdf/2104.10546.pdf) Invertible networks have various benefits for image denoising since they are lightweight, information-lossless, and memory-saving during back-propagation. 
InvDN transforms the noisy input into a low-resolution clean image and a latent representation containing noise. To discard noise and restore the clean image, InvDN replaces the noisy latent representation with another one sampled from a prior distribution during reversion. 144 | * [Auto-FedAvg: Learnable Federated Averaging for Multi-Institutional Medical Image Segmentation](https://arxiv.org/pdf/2104.10195.pdf) Federated learning (FL) enables collaborative model training while preserving each participant’s privacy, which is particularly beneficial to the medical field. (Alan Yuille) 145 | 146 | 147 | #### 20210421 148 | * [VideoGPT: Video Generation using VQ-VAE and Transformers](https://arxiv.org/pdf/2104.10157.pdf) We present VideoGPT: a conceptually simple architecture for scaling likelihood-based generative modeling to natural videos. VideoGPT uses VQ-VAE that learns downsampled discrete latent representations of a raw video by employing 3D convolutions and axial self-attention. A simple GPT-like architecture is then used to autoregressively model the discrete latents using spatio-temporal position encodings. Funny: views each frame as a word in GPT. [code](https://github.com/wilson1yan/VideoGPT) 149 | * [Understanding Synonymous Referring Expressions via Contrastive Features](https://arxiv.org/pdf/2104.10156.pdf) Task: Referring expression comprehension aims to localize objects identified by natural language descriptions. Motivation: One nature is that each object can be described by synonymous sentences with paraphrases, and such varieties in languages have critical impact on learning a comprehension model. While prior work usually treats each sentence and attends it to an object separately, we focus on learning a referring expression comprehension model that considers the property in synonymous sentences. 
Method: To this end, we develop an end-to-end trainable framework to learn contrastive features on the image and object instance levels, where features extracted from synonymous sentences to describe the same object should be closer to each other after mapping to the visual domain. Funny Story (Yi-Hsuan Tsai, Ming-Hsuan Yang) [code](https://github.com/wenz116/RefContrast) 150 | * [Variational Relational Point Completion Network](https://arxiv.org/pdf/2104.10154.pdf) (Ziwei Liu) 151 | * [Transformer Transforms Salient Object Detection and Camouflaged Object Detection](https://arxiv.org/pdf/2104.10127.pdf) In this paper, we conduct research on applying the transformer networks for salient object detection (SOD). Specifically, we adopt the dense transformer backbone for fully supervised RGB image based SOD, RGB-D image pair based SOD, and weakly supervised SOD via scribble supervision. As an extension, we also apply our fully supervised model to the task of camouflaged object detection (COD) for camouflaged object segmentation. (Deng-Ping Fan) 152 | * [Contrastive Learning for Sports Video: Unsupervised Player Classification](https://arxiv.org/pdf/2104.10068.pdf) Task: We address the problem of unsupervised classification of players in a team sport according to their team affiliation, when jersey colours and design are not known a priori. Method: We adopt a contrastive learning approach in which an embedding network learns to maximize the distance between representations of players on different teams relative to players on the same team, in a purely unsupervised fashion, without any labelled data. Funny application. 153 | * [Style-Aware Normalized Loss for Improving Arbitrary Style Transfer](https://arxiv.org/pdf/2104.10064.pdf) Neural Style Transfer (NST) has quickly evolved from single-style to infinite-style models, also known as Arbitrary Style Transfer (AST). 
Problem: more than 50% of the time, AST stylized images are not acceptable to human users, typically due to under- or over-stylization. Insight: Our studies show that the imbalanced style transferability (IST) issue is related to the conventional AST style loss, and reveal that the root cause is the equal weightage of training samples irrespective of the properties of their corresponding style images, which biases the model towards certain styles. Method: Through investigation of the theoretical bounds of the AST style loss, we propose a new loss that largely overcomes IST. Funny story. Long-tailed style distribution? 154 | * [VT-ADL: A Vision Transformer Network for Image Anomaly Detection and Localization](https://arxiv.org/pdf/2104.10036.pdf) 155 | * [T2VLAD: Global-Local Sequence Alignment for Text-Video Retrieval](https://arxiv.org/pdf/2104.10054.pdf) Task: Text-video retrieval is a challenging task that aims to search relevant video contents based on natural language descriptions. Problem: The key to this problem is to measure text-video similarities in a joint embedding space. However, most existing methods only consider the global cross-modal similarity and overlook the local details. Method: In this paper, we design an efficient global-local alignment method. The multi-modal video sequences and text features are adaptively aggregated with a set of shared semantic centers. The local cross-modal similarities are computed between the video feature and text feature within the same center. (Yi Yang) 156 | * [Posterior Sampling for Image Restoration using Explicit Patch Priors](https://arxiv.org/pdf/2104.09895.pdf) In this paper, we show how to combine explicit priors on patches of natural images in order to sample from the posterior probability of a full image given a degraded image. 
Unlike previous approaches that computed a single restoration using MAP or MMSE, our method makes explicit the uncertainty in the restored images and guarantees that all patches in the restored images will be typical given the patch prior. 157 | * [Lighting, Reflectance and Geometry Estimation from 360° Panoramic Stereo](https://arxiv.org/pdf/2104.09886.pdf) Our model takes advantage of the 360° input to observe the entire scene with geometric detail, then jointly estimates the scene’s properties with physical constraints. 158 | * [SelfReg: Self-supervised Contrastive Regularization for Domain Generalization](https://arxiv.org/pdf/2104.09841.pdf) In recent studies, contrastive learning-based domain generalization approaches have been proposed and achieved high performance. Problem: However, the performance of contrastive learning fundamentally depends on quality and quantity of negative data pairs. (the problem is not stated clearly enough) To address this issue, we propose a new regularization method for domain generalization based on contrastive learning, self-supervised contrastive regularization (SelfReg). The proposed approach uses only positive data pairs, thus it resolves various problems caused by negative pair sampling. (BYOL?) Moreover, we propose a class-specific domain perturbation layer (CDPL), which makes it possible to effectively apply mixup augmentation even when only positive data pairs are used. 159 | * [Detector-Free Weakly Supervised Grounding by Separation](https://arxiv.org/pdf/2104.09829.pdf) Task: Nowadays, there is an abundance of data involving images and surrounding free-form text weakly corresponding to those images. Weakly Supervised phrase-Grounding (WSG) deals with the task of using this data to learn to localize (or to ground) arbitrary text phrases in images without any additional annotations. 
160 | * [CTNet: Context-based Tandem Network for Semantic Segmentation](https://arxiv.org/pdf/2104.09805.pdf) This work proposes a novel Context-based Tandem Network (CTNet) by interactively exploring the spatial contextual information and the channel contextual information, which can discover the semantic context for semantic segmentation. (Jinhui Tang) 161 | * [SE-SSD: Self-Ensembling Single-Stage Object Detector From Point Cloud](https://arxiv.org/pdf/2104.09804.pdf) [code](https://github.com/Vegeta2020/SE-SSD) 162 | * [Does enhanced shape bias improve neural network robustness to common corruptions](https://arxiv.org/pdf/2104.09789.pdf) It has been shown that augmenting the training data with different image styles decreases this texture bias in favor of increased shape bias while at the same time improving robustness to common corruptions, such as noise and blur. Commonly, this is interpreted as shape bias increasing corruption robustness. However, this relationship is only hypothesized. We perform a systematic study of different ways of composing inputs based on natural images, explicit edge information, and stylization. While stylization is essential for achieving high corruption robustness, we do not find a clear correlation between shape bias and robustness. We conclude that the data augmentation caused by style-variation accounts for the improved corruption robustness and increased shape bias is only a byproduct. (ICLR21) 163 | * [Learning Semantic-Aware Dynamics for Video Prediction](https://arxiv.org/pdf/2104.09762.pdf) We propose an architecture and training scheme to predict video frames by explicitly modeling dis-occlusions and capturing the evolution of semantically consistent regions in the video. 
(CVPR21) 164 | * [Imaginative Walks: Generative Random Walk Deviation Loss for Improved Unseen Learning Representation](https://arxiv.org/pdf/2104.09757.pdf) We propose a novel loss for generative models, dubbed as GRaWD (Generative Random Walk Deviation), to improve learning representations of unexplored visual spaces. TODO 165 | * [Improving Cross-Modal Alignment in Vision Language Navigation via Syntactic Information](https://arxiv.org/pdf/2104.09580.pdf) Vision language navigation is the task that requires an agent to navigate through a 3D environment based on natural language instructions. 166 | 167 | 168 | 169 | #### 20210420 170 | * [Does language help generalization in vision models?](https://arxiv.org/pdf/2104.08313.pdf) Questions the claims made in CLIP. PROBLEM: One might assume that these abilities are derived, at least in part, from a “semantic grounding” of the visual feature space, learning meaningful structure by mirroring the space of linguistic representations. FIND1: Contrary to this intuition, we show that a visual model (BiT-M) trained on a very large supervised image dataset (ImageNet-21k) can be as efficient for generalization (few-shot learning, unsupervised clustering) as its multimodal counterpart (CLIP). FIND2: When compared to other standard visual or language models, the latent representations of BiT-M were found to be just as “linguistic” as those of CLIP. CONCLUSION: Overall, these findings suggest that the main factor driving improvements of generalization in current models is the size of the training dataset, not (solely) the multimodal grounding property. 171 | * [TransVG: End-to-End Visual Grounding with Transformers](https://arxiv.org/pdf/2104.08541.pdf) TASK: In this paper, we present a neat yet effective transformer-based framework for visual grounding, namely TransVG, to address the task of grounding a language query to the corresponding region onto an image. 
METHOD: We propose to establish the multi-modal correspondence by leveraging transformers, and empirically show that the complex fusion modules (e.g., modular attention network, dynamic graph, and multi-modal tree) can be replaced by a simple stack of transformer encoder layers with higher performance. Moreover, we re-formulate the visual grounding as a direct coordinates regression problem and avoid making predictions out of a set of candidates (i.e., region proposals or anchor boxes). (Wengang Zhou, Houqiang Li) 172 | * [Understanding Chinese Video and Language via Contrastive Multimodal Pre-Training](https://arxiv.org/pdf/2104.09411.pdf) (Houqiang Li) TODO 173 | * [Visual Transformer Pruning](https://arxiv.org/pdf/2104.08500.pdf) The pipeline for visual transformer pruning is as follows: 1) training with sparsity regularization; 2) pruning channels; 3) finetuning. (Yunhe Wang) 174 | 175 | CVPR21 176 | * [Temporal Query Networks for Fine-grained Video Understanding](https://arxiv.org/pdf/2104.09496.pdf) Our objective in this work is fine-grained classification of actions in untrimmed videos, where the actions may be temporally extended or may span only a few frames of the video. We cast this into a query-response mechanism, where each query addresses a particular question, and has its own response label set. (Andrew Zisserman) 177 | * [One More Check: Making “Fake Background” Be Tracked Again](https://arxiv.org/pdf/2104.09441.pdf) Once a target bounding box is mistakenly classified as background by the detector, the temporal consistency of its corresponding tracklet will be no longer maintained, as shown in Fig. 1. In this paper, we set out to restore the misclassified bounding boxes, i.e., fake background, by proposing a re-check network. Good title, nice story. 
[code](https://github.com/JudasDie/SOTS) 178 | * [Cross-Domain Adaptive Clustering for Semi-Supervised Domain Adaptation](https://arxiv.org/pdf/2104.09415.pdf) TASK: In semi-supervised domain adaptation, a few labeled samples per class in the target domain guide features of the remaining target samples to aggregate around them. PROBLEM: However, the trained model cannot produce a highly discriminative feature representation for the target domain because the training data is dominated by labeled samples from the source domain. This could lead to disconnection between the labeled and unlabeled target samples as well as misalignment between unlabeled target samples and the source domain. (the problem description is worth learning from) 179 | * [Contrastive Learning for Compact Single Image Dehazing](https://arxiv.org/pdf/2104.09367.pdf) Image dehazing with contrastive learning. PROBLEM: (1) existing deep learning based dehazing methods only adopt clear images as positive samples to guide the training of dehazing network while negative information is unexploited (2) most of them focus on strengthening the dehazing network with an increase of depth and width, leading to a significant requirement of computation and memory. METHOD: (1) we propose a novel contrastive regularization (CR) built upon contrastive learning to exploit both the information of hazy images and clear images as negative and positive samples, respectively (2) we develop a compact dehazing network based on autoencoder-like (AE) framework. 
It involves an adaptive mixup operation and a dynamic feature enhancement module [code](https://github.com/GlassyWu/AECR-Net) 180 | * [Multi-person Implicit Reconstruction from a Single Image](https://arxiv.org/pdf/2104.09283.pdf) 181 | * [Multi-Modal Fusion Transformer for End-to-End Autonomous Driving](https://arxiv.org/pdf/2104.09224.pdf) Funny 182 | * [Surrogate Gradient Field for Latent Space Manipulation](https://arxiv.org/pdf/2104.09065.pdf) 183 | * [Distilling Knowledge via Knowledge Review](https://arxiv.org/pdf/2104.09044.pdf) (Jiaya Jia) 184 | * [Self-Supervised Pillar Motion Learning for Autonomous Driving](https://arxiv.org/pdf/2104.08683.pdf) Autonomous driving can benefit from motion behavior comprehension when interacting with diverse traffic participants in highly dynamic environments. To this end, we propose a learning framework that leverages free supervisory signals from point clouds and paired camera images to estimate motion purely via self-supervision 185 | * [RefineMask: Towards High-Quality Instance Segmentation with Fine-Grained Features](https://arxiv.org/pdf/2104.08569.pdf) PROBLEM: However, the segmented masks are still very coarse due to the downsampling operations in both the feature pyramid and the instance-wise pooling process, especially for large objects. METHOD: In this work, we propose a new method called RefineMask for high-quality instance segmentation of objects and scenes, which incorporates fine-grained features during the instance-wise segmenting process in a multi-stage manner. RESULT: Without bells and whistles, RefineMask yields significant gains of 2.6, 3.4, 3.8 AP over Mask R-CNN on COCO, LVIS, and Cityscapes benchmarks respectively at a small amount of additional computational cost. 
186 | * [Few-Shot Model Adaptation for Customized Facial Landmark Detection, Segmentation, Stylization and Shadow Removal](https://arxiv.org/pdf/2104.09457.pdf) 187 | * [Learning To Count Everything](https://arxiv.org/pdf/2104.08391.pdf) Existing works on visual counting primarily focus on one specific category at a time, such as people, animals, and cells. In this paper, we are interested in counting everything, that is to count objects from any category given only a few annotated instances from that category. TODO 188 | * [Attention in Attention Network for Image Super-Resolution](https://arxiv.org/pdf/2104.09497.pdf) In this work, we attempt to quantify and visualize the static attention mechanisms and show that not all attention modules are equally beneficial. We then propose attention in attention network (A2N) for highly accurate image SR. Specifically, our A2N consists of a non-attention branch and a coupling attention branch. 189 | 190 | 191 | #### 20210419 192 | CVPR21: 193 | * [Fusing the Old with the New: Learning Relative Camera Pose with Geometry-Guided Uncertainty](https://arxiv.org/pdf/2104.08278.pdf) 194 | * [Divide-and-Conquer for Lane-Aware Diverse Trajectory Prediction](https://arxiv.org/pdf/2104.08277.pdf) Our work addresses two key challenges in trajectory prediction, learning multimodal outputs, and better predictions by imposing constraints using driving knowledge. 195 | * [Ego-Exo: Transferring Visual Representations from Third-person to First-person Videos](https://arxiv.org/pdf/2104.07905.pdf) We introduce an approach for pre-training egocentric video models using large-scale third-person video datasets. Learning from purely egocentric data is limited by low dataset scale and diversity, while using purely exocentric (third-person) data introduces a large domain mismatch. 
Funny 196 | 197 | Others: 198 | * [Deep Stable Learning for Out-Of-Distribution Generalization](https://arxiv.org/pdf/2104.07876.pdf) PROBLEM SETTING: explores a more open form of domain adaptation. Conventional methods assume either the known heterogeneity of training data (e.g. domain labels) or the approximately equal capacities of different domains. In this paper, we consider a more challenging case where neither of the above assumptions holds. METHOD (a causality-like solution): We propose to address this problem by removing the dependencies between features via learning weights for training samples, which helps deep models get rid of spurious correlations and, in turn, concentrate more on the true connection between discriminative features and labels. 199 | * [“BNN - BN = ?”: Training Binary Neural Networks without Batch Normalization](https://arxiv.org/pdf/2104.08215.pdf) PROBLEM: However, the BN layer is costly to calculate and is typically implemented with non-binary parameters, leaving a hurdle for the efficient implementation of BNN training. It also introduces undesirable dependence between samples 200 | within each batch. WORK: Inspired by the latest advance on Batch Normalization Free (BN-Free) training [7], we extend their framework to training BNNs, and for the first time demonstrate that BNs can be completely removed from BNN training and inference regimes. (Zhangyang Wang) (CVPRW) 201 | * [Dual Contrastive Learning for Unsupervised Image-to-Image Translation](https://arxiv.org/pdf/2104.07689.pdf) BACKGROUND: Contrastive learning for Unpaired image-to-image Translation (CUT) yields state-of-the-art results in modeling unsupervised image-to-image translation by maximizing mutual information between input and output patches using only one encoder for both domains. CONTRIBUTION: In this paper, we propose a novel method based on contrastive learning and a dual learning setting (exploiting two encoders) to infer an efficient mapping between unpaired data.
Additionally, while CUT suffers from mode collapse, a variant of our method efficiently addresses this issue. 202 | * [Contrastive Learning with Stronger Augmentations](https://arxiv.org/pdf/2104.07713.pdf) PROBLEM (existing contrastive learning): However, those carefully designed transformations limited us to further explore the novel patterns exposed by other transformations. Meanwhile, as found in our experiments, the strong augmentations distorted the images’ structures, resulting in difficult retrieval. METHOD: Thus, we propose a general framework called Contrastive Learning with Stronger Augmentations (CLSA) to complement current contrastive learning approaches. Here, the distribution divergence between the weakly and strongly augmented images over the representation bank is adopted to supervise the retrieval of strongly augmented queries from a pool of instances. (Guojun Qi) 203 | * [Meta Faster R-CNN: Towards Accurate Few-Shot Object Detection with Attentive Feature Alignment](https://arxiv.org/pdf/2104.07719.pdf) We propose a meta-learning based few-shot object detection method by transferring meta-knowledge learned from data-abundant base classes to data-scarce novel classes. To improve proposal generation for few-shot novel classes, we propose to learn a lightweight matching network to measure the similarity between each spatial position in the query image feature map and spatially-pooled class features, instead of the traditional object/non-object classifier, thus generating category-specific proposals and improving proposal recall for novel classes. (Shih-Fu Chang) 204 | * [Pareto Self-Supervised Training for Few-Shot Learning](https://arxiv.org/pdf/2104.07841.pdf) Explores combining few-shot learning with self-supervised learning. PROBLEM: Previous works benefit from sharing inductive bias between the main task (FSL) and auxiliary tasks (SSL), where the shared parameters of tasks are optimized by minimizing a linear combination of task losses.
However, it is challenging to select a proper weight to balance tasks and reduce task conflict. METHOD: To handle the problem as a whole, we propose a novel approach named as Pareto self-supervised training (PSST) for FSL. PSST explicitly decomposes the few-shot auxiliary problem into multiple constrained multi-objective subproblems with different trade-off preferences, and here a preference region in which the main task achieves the best performance is identified. Then, an effective preferred Pareto exploration is proposed to find a set of optimal solutions in such a preference region. 205 | * [Weakly Supervised Object Localization and Detection: A Survey](https://arxiv.org/pdf/2104.07918.pdf) (Ming-Hsuan Yang) 206 | * [Self-supervised Video Retrieval Transformer Network](https://arxiv.org/pdf/2104.07993.pdf) TASK and its applications: Content-based video retrieval aims to find videos from a large video database that are similar to or even near-duplicate of a given query video. It plays an important role in many video related applications, including copyright protection, recommendation, filtering, etc. METHOD: We propose a novel video retrieval system, termed SVRTN. It first applies self-supervised training to effectively learn video representation from unlabeled data to avoid the expensive cost of manual annotation. Then, it exploits transformer structure to aggregate frame-level features into clip-level to reduce both storage space and search complexity. It can learn the complementary and discriminative information from the interactions among clip frames, as well as acquire the frame permutation and missing invariant ability to support more flexible retrieval manners. 207 | * [Spatial-Temporal Correlation and Topology Learning for Person Re-Identification in Videos](https://arxiv.org/pdf/2104.08241.pdf) The key factor for video person re-identification is to effectively exploit both spatial and temporal clues from video sequences.
In this work, we propose a novel Spatial-Temporal Correlation and Topology Learning framework (CTL) to pursue discriminative and robust representation by modeling cross-scale spatial-temporal correlation. (Jiawei Liu, Zheng-Jun Zha, Kecheng Zheng) 208 | 209 | 210 | #### 20210405 211 | 212 | CVPR21: 213 | 214 | * [Group Collaborative Learning for Co-Salient Object Detection](https://arxiv.org/pdf/2104.01108.pdf) [code](https://github.com/fanq15/GCoNet) (Deng-Ping Fan, Ling Shao) 215 | * [MOST: A Multi-Oriented Scene Text Detector with Localization Refinement](https://arxiv.org/pdf/2104.01070.pdf) (Xiang Bai) 216 | * [Visual Semantic Role Labeling for Video Understanding](https://arxiv.org/pdf/2104.00990.pdf) 217 | * [UAV-Human: A Large Benchmark for Human Behavior Understanding with Unmanned Aerial Vehicles](https://arxiv.org/pdf/2104.00946.pdf) 218 | * [Video Prediction Recalling Long-term Motion Context via Memory Alignment Learning](https://arxiv.org/pdf/2104.00924.pdf) 219 | * [Background-Aware Pooling and Noise-Aware Loss for Weakly-Supervised Semantic Segmentation](https://arxiv.org/pdf/2104.00905.pdf) 220 | * [Network Quantization with Element-wise Gradient Scaling](https://arxiv.org/pdf/2104.00903.pdf) 221 | * [HVPR: Hybrid Voxel-Point Representation for Single-stage 3D Object Detection](https://arxiv.org/pdf/2104.00902.pdf) 222 | * [Adaptive Class Suppression Loss for Long-Tail Object Detection](https://arxiv.org/pdf/2104.00885.pdf) 223 | * [S2R-DepthNet: Learning a Generalizable Depth-specific Structural Representation](https://arxiv.org/pdf/2104.00877.pdf) (Xuejin Chen, Wenjun Zeng) 224 | * [Self-supervised Video Representation Learning by Context and Motion Decoupling](https://arxiv.org/pdf/2104.00862.pdf) 225 | * [Fully Understanding Generic Objects: Modeling, Segmentation, and Reconstruction](https://arxiv.org/pdf/2104.00858.pdf) 226 | * [Towards High Fidelity Face Relighting with Realistic Shadows](https://arxiv.org/pdf/2104.00825.pdf) 227 | * 
[Curriculum Graph Co-Teaching for Multi-Target Domain Adaptation](https://arxiv.org/pdf/2104.00808.pdf) 228 | * [FESTA: Flow Estimation via Spatial-Temporal Attention for Scene Point Clouds](https://arxiv.org/pdf/2104.00798.pdf) 229 | 230 | Vision Transformer: 231 | * [LeViT: a Vision Transformer in ConvNet’s Clothing for Faster Inference](https://arxiv.org/pdf/2104.01136.pdf) Frames the combination of CNNs and ViT from the speed-accuracy trade-off angle, and proposes the attention bias, a new way to integrate positional information in vision transformers: We design a family of image classification architectures that optimize the trade-off between accuracy and efficiency in a high-speed regime. We revisit principles from the extensive literature on convolutional neural networks to apply them to transformers, in particular activation maps with decreasing resolutions. We also introduce the attention bias, a new way to integrate positional information in vision transformers. As a result, we propose LeViT: a hybrid neural network for fast inference image classification. For example, at 80% ImageNet top-1 accuracy, LeViT is 3.3 times faster than EfficientNet on the CPU.
232 | * [Language-based Video Editing via Multi-Modal Multi-Level Transformer](https://arxiv.org/pdf/2104.01122.pdf) 233 | * [AAformer: Auto-Aligned Transformer for Person Re-Identification](https://arxiv.org/pdf/2104.00921.pdf) 234 | * [TubeR: Tube-Transformer for Action Detection](https://arxiv.org/pdf/2104.00969.pdf) 235 | * [TFill: Image Completion via a Transformer-Based Architecture](https://arxiv.org/pdf/2104.00845.pdf) [code](https://github.com/lyndonzheng/TFill) (Jianfei Cai) 236 | * [VisQA: X-raying Vision and Language Reasoning in Transformers](https://arxiv.org/pdf/2104.00926.pdf) 237 | 238 | Others: 239 | * [Scene Graphs: A Survey of Generations and Applications](https://arxiv.org/pdf/2104.01111.pdf) 240 | 241 | 242 | #### 20210402 243 | 244 | TOP: 245 | * [Group-Free 3D Object Detection via Transformers](https://arxiv.org/pdf/2104.00678.pdf) In this paper, we present a simple yet effective method for directly detecting 3D objects from the 3D point cloud. Instead of grouping local points to each object candidate, our method computes the feature of an object from all the points in the point cloud with the help of an attention mechanism in the Transformers, where the contribution of each point is automatically learned in the network training. [code](https://github.com/zeliu98/Group-Free-3D) (Ze Liu, Yue Cao, Han Hu) 246 | * [EfficientNetV2: Smaller Models and Faster Training](https://arxiv.org/pdf/2104.00298.pdf) Focuses on training efficiency. (1) To develop this family of models, we use a combination of training-aware neural architecture search and scaling, to jointly optimize training speed and parameter efficiency. Our experiments show that EfficientNetV2 models train much faster than state-of-the-art models while being up to 6.8x smaller. (2) we propose an improved method of progressive learning, which adaptively adjusts regularization (e.g., dropout and data augmentation) along with image size.
By pretraining on the same ImageNet21k, our EfficientNetV2 achieves 87.3% top-1 accuracy on ImageNet ILSVRC2012, outperforming the recent ViT by 2.0% accuracy while training 5x-11x faster using the same computing resources. [code](https://github.com/google/automl/efficientnetv2) (Mingxing Tan, Quoc V. Le) 247 | * [UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training](https://arxiv.org/pdf/2104.00332.pdf) To generalize this success to non-English languages, we introduce UC2 , the first machine translation-augmented framework for cross-lingual cross-modal representation learning. (1 ) augment existing English-only datasets with other languages via machine translation (MT) (2) shared visual context (i.e., using image as pivot) (3) To facilitate the learning of a joint embedding space of images and all languages of interest, we further propose two novel pre-training tasks, namely Masked Region-to-Token Modeling (MRTM) and Visual Translation Language Modeling (VTLM), leveraging MT-enhanced translated data. 248 | * [Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval](https://arxiv.org/pdf/2104.00650.pdf) Our objective in this work is video-text retrieval – in particular a joint embedding that enables efficient text-to-video retrieval. We propose an end-to-end trainable model that is designed to take advantage of both large-scale image and video captioning datasets. Our model is an adaptation and extension of the recent ViT and Timesformer architectures, and consists of attention in both space and time. It is trained with a curriculum learning schedule that begins by treating images as ‘frozen’ snapshots of video, and then gradually learns to attend to increasing temporal context when trained on video datasets. 
(Andrew Zisserman) 249 | * [Jigsaw Clustering for Unsupervised Visual Representation Learning](https://arxiv.org/pdf/2104.00323.pdf) An interesting pretext task design. We propose a new jigsaw clustering pretext task in this paper, which only needs to forward each training batch itself, and reduces the training cost. Our method makes use of information from both intra- and inter-images, and outperforms previous single-batch based ones by a large margin. It is even comparable to the contrastive learning methods when only half of training batches are used. Our method indicates that multiple batches during training are not necessary, and opens the door for future research of single-batch unsupervised methods. [code](https://github.com/Jia-Research-Lab/JigsawClustering) (Jiaya Jia, CVPR21) 250 | * [Unsupervised Sound Localization via Iterative Contrastive Learning](https://arxiv.org/pdf/2104.00315.pdf) Sound localization aims to find the source of the audio signal in the visual scene. In this work, we propose an iterative contrastive learning framework that requires no data annotations. At each iteration, the proposed method takes the 1) localization results in images predicted in the previous iteration, and 2) semantic relationships inferred from the audio signals as the pseudo-labels. Our iterative strategy gradually encourages the localization of the sounding objects and reduces the correlation between the non-sounding regions and the reference audio. (How can one guarantee that iterating on pseudo-labels makes results better rather than worse?) (Ming-Hsuan Yang) 251 | * [In&Out : Diverse Image Outpainting via GAN Inversion](https://arxiv.org/pdf/2104.00675.pdf) GAN inversion is gradually becoming a mainstream direction in GAN research; this paper uses GAN inversion for image outpainting. Image outpainting seeks for a semantically consistent extension of the input image beyond its available content. In this work, we formulate the problem from the perspective of inverting generative adversarial networks.
Our generator renders micro-patches conditioned on their joint latent code as well as their individual positions in the image. [code](https://github.com/yccyenchicheng/InOut) (Ming-Hsuan Yang) 252 | 253 | CVPR21: 254 | 255 | * [Online Multiple Object Tracking with Cross-Task Synergy](https://arxiv.org/pdf/2104.00380.pdf) [code](https://github.com/songguocode/TADAM) (Dacheng Tao) 256 | * [Dive into Ambiguity: Latent Distribution Mining and Pairwise Uncertainty Estimation for Facial Expression Recognition](https://arxiv.org/pdf/2104.00232.pdf) (Tao Mei) 257 | * [Divergence Optimization for Noisy Universal Domain Adaptation](https://arxiv.org/pdf/2104.00246.pdf) 258 | * [FAPIS: A Few-shot Anchor-free Part-based Instance Segmenter](https://arxiv.org/pdf/2104.00073.pdf) 259 | * [Self-supervised Motion Learning from Static Images](https://arxiv.org/pdf/2104.00240.pdf) 260 | * [Learning to Track Instances without Video Annotations](https://arxiv.org/pdf/2104.00287.pdf) Tracking segmentation masks of multiple instances has been intensively studied, but still faces two fundamental challenges: 1) the requirement of large-scale, frame-wise annotation, and 2) the complexity of two-stage approaches. 
This paper uses self-supervision to obtain a single-stage approach with only a labeled image dataset and unlabeled video sequences 261 | * [Improving Calibration for Long-Tailed Recognition](https://arxiv.org/pdf/2104.00466.pdf) [code](https://github.com/Jia-Research-Lab/MiSLAS) (Jiaya Jia) 262 | * [Towards Evaluating and Training Verifiably Robust Neural Networks](https://arxiv.org/pdf/2104.00447.pdf) (Dahua Lin) 263 | * [One-Shot Neural Ensemble Architecture Search by Diversity-Guided Search Space Shrinking](https://arxiv.org/pdf/2104.00597.pdf) (Jianlong Fu) 264 | * [Unsupervised Degradation Representation Learning for Blind Super-Resolution](https://arxiv.org/pdf/2104.00416.pdf) Funny. Builds images with different degrees of degradation for contrastive learning. In this paper, we propose an unsupervised degradation representation learning scheme for blind SR without explicit degradation estimation. Specifically, we learn abstract representations to distinguish various degradations in the representation space rather than explicit estimation in the pixel space. [code](https://github.com/LongguangWang/DASR) 265 | * [Bipartite Graph Network with Adaptive Message Passing for Unbiased Scene Graph Generation](https://arxiv.org/pdf/2104.00308.pdf) long-tailed class distribution and large intra-class variation. To address these issues, we introduce a novel confidence-aware bipartite graph neural network with adaptive message propagation mechanism for unbiased scene graph generation. In addition, we propose an efficient bi-level data resampling strategy to alleviate the imbalanced data distribution problem in training our graph network.
266 | * [A Realistic Evaluation of Semi-Supervised Learning for Fine-Grained Classification](https://arxiv.org/pdf/2104.00679.pdf) 267 | * [RGB-D Local Implicit Function for Depth Completion of Transparent Objects](https://arxiv.org/pdf/2104.00622.pdf) 268 | * [SimPoE: Simulated Character Control for 3D Human Pose Estimation](https://arxiv.org/pdf/2104.00683.pdf) 269 | * [NeuralRecon: Real-Time Coherent 3D Reconstruction from Monocular Video](https://arxiv.org/pdf/2104.00681.pdf) 270 | * [PhySG: Inverse Rendering with Spherical Gaussians for Physics-based Material Editing and Relighting](https://arxiv.org/pdf/2104.00674.pdf) 271 | * [LED2-Net: Monocular 360° Layout Estimation via Differentiable Depth Rendering](https://arxiv.org/pdf/2104.00568.pdf) Towards reconstructing the room layout in 3D, we formulate the task of 360° layout estimation as a problem of predicting depth on the horizon line of a panorama. 272 | * [Reconstructing 3D Human Pose by Watching Humans in the Mirror](https://arxiv.org/pdf/2104.00340.pdf) In this paper, we introduce the new task of reconstructing 3D human pose from a single image in which we can see the person and the person’s image through a mirror. [code](https://github.com/zju3dv/Mirrored-Human) 273 | * [Wide-Depth-Range 6D Object Pose Estimation in Space](https://arxiv.org/pdf/2104.00337.pdf) An interesting application. [code](https://github.com/cvlab-epfl/wide-depth-range-pose) 274 | * [Fostering Generalization in Single-view 3D Reconstruction by Learning a Hierarchy of Local and Global Shape Priors](https://arxiv.org/pdf/2104.00476.pdf) 275 | * [Deep Two-View Structure-from-Motion Revisited](https://arxiv.org/pdf/2104.00556.pdf) 276 | 277 | Vision Transformer: 278 | 279 | * [Group-Free 3D Object Detection via Transformers](https://arxiv.org/pdf/2104.00678.pdf) In this paper, we present a simple yet effective method for directly detecting 3D objects from the 3D point cloud.
Instead of grouping local points to each object candidate, our method computes the feature of an object from all the points in the point cloud with the help of an attention mechanism in the Transformers, where the contribution of each point is automatically learned in the network training. [code](https://github.com/zeliu98/Group-Free-3D) (Ze Liu, Yue Cao, Han Hu) 280 | * [Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval](https://arxiv.org/pdf/2104.00650.pdf) 281 | * [Spatial-Temporal Graph Transformer for Multiple Object Tracking](https://arxiv.org/pdf/2104.00194.pdf) Tracking multiple objects in videos relies on modeling the spatial-temporal interactions of the objects. In this paper, we propose a solution named Spatial-Temporal Graph Transformer (STGT), which leverages powerful graph transformers to efficiently model the spatial and temporal interactions among the objects. 282 | * [Latent Variable Nested Set Transformers & AutoBots](https://arxiv.org/pdf/2104.00563.pdf) We validate the Nested Set Transformer for autonomous driving settings which we refer to as (“AutoBot”), where we model the trajectory of an ego-agent based on the sequential observations of key attributes of multiple agents in a scene. 283 | * [LoFTR: Detector-Free Local Feature Matching with Transformers](https://arxiv.org/pdf/2104.00680.pdf) (CVPR21) 284 | * [Mesh Graphormer](https://arxiv.org/pdf/2104.00272.pdf) 285 | 286 | 287 | Others: 288 | 289 | * [The surprising impact of mask-head architecture on novel class segmentation](https://arxiv.org/pdf/2104.00613.pdf) We address the partially supervised instance segmentation problem in which one can train on (significantly cheaper) bounding boxes for all categories but use masks only for a subset of categories. [code](https://google.github.io/deepmac/) 290 | * [In&Out : Diverse Image Outpainting via GAN Inversion](https://arxiv.org/pdf/2104.00675.pdf) GAN inversion is gradually becoming a mainstream direction in GAN research; this paper uses GAN inversion for image outpainting.
Image outpainting seeks for a semantically consistent extension of the input image beyond its available content. In this work, we formulate the problem from the perspective of inverting generative adversarial networks. Our generator renders micro-patches conditioned on their joint latent code as well as their individual positions in the image. [code](https://github.com/yccyenchicheng/InOut) (Ming-Hsuan Yang) 291 | * [Is Label Smoothing Truly Incompatible with Knowledge Distillation: An Empirical Study](https://arxiv.org/pdf/2104.00676.pdf) (ICLR21) 292 | * [CUPID: Adaptive Curation of Pre-training Data for Video-and-Language Representation Learning](https://arxiv.org/pdf/2104.00285.pdf) 293 | * [Composable Augmentation Encoding for Video Representation Learning](https://arxiv.org/pdf/2104.00616.pdf) To overcome this limitation, we propose an ‘augmentation aware’ contrastive learning framework, where we explicitly provide a sequence of augmentation parameterisations (such as the values of the time shifts used to create data views) as composable augmentation encodings (CATE) to our model when projecting the video representations for contrastive learning. 
294 | * [Text to Image Generation with Semantic-Spatial Aware GAN](https://arxiv.org/pdf/2104.00567.pdf) 295 | * [Linear Semantics in Generative Adversarial Networks](https://arxiv.org/pdf/2104.00487.pdf) 296 | * [Unsupervised Foreground-Background Segmentation with Equivariant Layered GANs](https://arxiv.org/pdf/2104.00483.pdf) 297 | * [Improved Image Generation via Sparse Modeling](https://arxiv.org/pdf/2104.00464.pdf) 298 | * [Exploiting Relationship for Complex-scene Image Generation](https://arxiv.org/pdf/2104.00356.pdf) (Tao Mei) 299 | * [MeanShift++: Extremely Fast Mode-Seeking With Applications to Segmentation and Object Tracking](https://arxiv.org/pdf/2104.00303.pdf) 300 | * [SCALoss: Side and Corner Aligned Loss for Bounding Box Regression](https://arxiv.org/pdf/2104.00462.pdf) IoU-based loss has the gradient vanish problem in the case of low overlapping bounding boxes, and the model could easily ignore these simple cases. In this paper, we propose Side Overlap (SO) loss by maximizing the side overlap of two bounding boxes, which puts more penalty for low overlapping bounding box cases. 301 | * [Anchor Pruning for Object Detection](https://arxiv.org/pdf/2104.00432.pdf) This paper proposes anchor pruning for object detection in one-stage anchor-based detectors. In this work, we show that many anchors in the object detection head can be removed without any loss in accuracy. With additional retraining, anchor pruning can even lead to improved accuracy. Does not cite DETR or Sparse R-CNN. (Deng Cai) 302 | * [Modular Adaptation for Cross-Domain Few-Shot Learning](https://arxiv.org/pdf/2104.00619.pdf) 303 | * [A Survey on Natural Language Video Localization](https://arxiv.org/pdf/2104.00234.pdf) 304 | 305 | 306 | 307 | #### 20210401 308 | 309 | TOP: 310 | 311 | * [Going deeper with Image Transformers](https://arxiv.org/pdf/2103.17239.pdf) However the optimization of image transformers has been little studied so far.
In this work, we build and optimize deeper transformer networks for image classification. This leads us to produce models whose performance does not saturate early with more depth, for instance we obtain 86.3% top-1 accuracy on Imagenet when training with no external data (Facebook, DeiT team) 312 | * [StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery](https://arxiv.org/pdf/2103.17249.pdf) However, discovering semantically meaningful latent manipulations typically involves painstaking human examination of the many degrees of freedom, or an annotated collection of images for each desired manipulation. In this work, we explore leveraging the power of recently introduced Contrastive Language-Image Pre-training (CLIP) models in order to develop a text-based interface for StyleGAN image manipulation that does not require such manual effort. [code](https://github.com/orpatashnik/StyleCLIP) 313 | * [PiCIE: Unsupervised Semantic Segmentation using Invariance and Equivariance in Clustering](https://arxiv.org/pdf/2103.17070.pdf) 314 | 315 | CVPR21: 316 | 317 | * [Scale-aware Automatic Augmentation for Object Detection](https://arxiv.org/pdf/2103.17220.pdf) [code](https://github.com/Jia-Research-Lab/SA-AutoAug) (Jiaya Jia) 318 | * [Towards More Flexible and Accurate Object Tracking with Natural Language: Algorithms and Benchmark](https://arxiv.org/pdf/2103.16746.pdf) Tracking by natural language specification is a new rising research topic that aims at locating the target object in the video sequence based on its language description. In this work, we propose a new benchmark specifically dedicated to the tracking-by-language, including a large scale dataset, strong and diverse baseline methods. We also introduce two new challenges into TNL2K for the object tracking task, i.e., adversarial samples and modality switch.
(Feng Wu) 319 | * [SimPLE: Similar Pseudo Label Exploitation for Semi-Supervised Classification](https://arxiv.org/pdf/2103.16725.pdf) 320 | * [Denoise and Contrast for Category Agnostic Shape Completion](https://arxiv.org/pdf/2103.16671.pdf) 321 | * [DAP: Detection-Aware Pre-training with Weak Supervision](https://arxiv.org/pdf/2103.16651.pdf) we transform a classification dataset into a detection dataset through a weakly supervised object localization method based on Class Activation Maps to directly pre-train a detector, making the pre-trained model location-aware and capable of predicting bounding boxes. 322 | * [Unsupervised Disentanglement of Linear-Encoded Facial Semantics](https://arxiv.org/pdf/2103.16605.pdf) 323 | * [ArtFlow: Unbiased Image Style Transfer via Reversible Neural Flows](https://arxiv.org/pdf/2103.16877.pdf) (Jiebo Luo) 324 | * [Online Learning of a Probabilistic and Adaptive Scene Representation](https://arxiv.org/pdf/2103.16832.pdf) (Hongbin Zha) 325 | * [Convolutional Hough Matching Networks](https://arxiv.org/pdf/2103.16831.pdf) 326 | * [Rectification-based Knowledge Retention for Continual Learning](https://arxiv.org/pdf/2103.16597.pdf) 327 | * [Learning Scalable l∞-constrained Near-lossless Image Compression via Joint Lossy Image and Residual Compression](https://arxiv.org/pdf/2103.17015.pdf) 328 | * [Mask-ToF: Learning Microlens Masks for Flying Pixel Correction in Time-of-Flight Imaging](https://arxiv.org/pdf/2103.16693.pdf) 329 | * [Neural Response Interpretation through the Lens of Critical Pathways](https://arxiv.org/pdf/2103.16886.pdf) (VGG) 330 | * [Prototypical Cross-domain Self-supervised Learning for Few-shot Unsupervised Domain Adaptation](https://arxiv.org/pdf/2103.16765.pdf) 331 | * [Dense Relation Distillation with Context-aware Aggregation for Few-Shot Object Detection](https://arxiv.org/pdf/2103.17115.pdf) 332 | * [ReMix: Towards Image-to-Image Translation with Limited Data](https://arxiv.org/pdf/2103.16835.pdf) 333 | * 
[DER: Dynamically Expandable Representation for Class Incremental Learning](https://arxiv.org/pdf/2103.16788.pdf) 334 | * [GrooMeD-NMS: Grouped Mathematically Differentiable NMS for Monocular 3D Object Detection](https://arxiv.org/pdf/2103.17202.pdf) 335 | * [A Closer Look at Fourier Spectrum Discrepancies for CNN-generated Images Detection](https://arxiv.org/pdf/2103.17195.pdf) 336 | * [Semi-supervised Synthesis of High-Resolution Editable Textures for 3D Humans](https://arxiv.org/pdf/2103.17266.pdf) 337 | * [VITON-HD: High-Resolution Virtual Try-On via Misalignment-Aware Normalization](https://arxiv.org/pdf/2103.16874.pdf) While an increasing number of studies have been conducted, the resolution of synthesized images is still limited to low (e.g., 256×192), which acts as the critical limitation against satisfying online consumers. To address the challenges, we propose a novel virtual try-on method called VITON-HD that successfully synthesizes 1024×768 virtual try-on images. 338 | * [Learning Camera Localization via Dense Scene Matching](https://arxiv.org/pdf/2103.16792.pdf) 339 | * [Embracing Uncertainty: Decoupling and De-bias for Robust Temporal Grounding](https://arxiv.org/pdf/2103.16848.pdf) 340 | * [Human POSEitioning System (HPS): 3D Human Pose Estimation and Self-localization in Large Scenes from Body-Mounted Sensors](https://arxiv.org/pdf/2103.17265.pdf) We introduce (HPS) Human POSEitioning System, a method to recover the full 3D pose of a human registered with a 3D scan of the surrounding environment using wearable sensors. 
341 | * [Learning by Aligning Videos in Time](https://arxiv.org/pdf/2103.17260.pdf) 342 | * [Dogfight: Detecting Drones from Drones Videos](https://arxiv.org/pdf/2103.17242.pdf) 343 | * [Rainbow Memory: Continual Learning with a Memory of Diverse Samples](https://arxiv.org/pdf/2103.17230.pdf) 344 | * [Layout-Guided Novel View Synthesis from a Single Indoor Panorama](https://arxiv.org/pdf/2103.17022.pdf) 345 | 346 | 347 | Vision Transformer: 348 | 349 | * [Going deeper with Image Transformers](https://arxiv.org/pdf/2103.17239.pdf) However, the optimization of image transformers has been little studied so far. In this work, we build and optimize deeper transformer networks for image classification. This leads us to produce models whose performance does not saturate early with more depth, for instance we obtain 86.3% top-1 accuracy on ImageNet when training with no external data (Facebook, DeiT team) 350 | * [Learning Spatio-Temporal Transformer for Visual Tracking](https://arxiv.org/pdf/2103.17154.pdf) The encoder models the global spatio-temporal feature dependencies between target objects and search regions, while the decoder learns a query embedding to predict the spatial positions of the target objects. The whole method is end-to-end, does not need any postprocessing steps such as cosine window and bounding box smoothing, thus largely simplifying existing tracking pipelines. [code](https://github.com/researchmm/Stark) (Jianlong Fu, Huchuan Lu) 351 | * [Robust Facial Expression Recognition with Convolutional Visual Transformers](https://arxiv.org/pdf/2103.16854.pdf) Different from previous pure CNNs based methods, we argue that it is feasible and practical to translate facial images into sequences of visual words and perform expression recognition from a global perspective.
(Shutao Li) 352 | * [DA-DETR: Domain Adaptive Detection Transformer by Hybrid Attention](https://arxiv.org/pdf/2103.17084.pdf) Domain-adaptive object detection based on Deformable DETR 353 | 354 | 355 | 356 | ## NLP (Weekly) -------------------------------------------------------------------------------- /202106.md: -------------------------------------------------------------------------------- 1 | # Arxiv-Daily 2 | 3 | My daily arxiv reading notes. 4 | 5 | [2021 March](202103.md) 6 | 7 | [2021 April](202104.md) 8 | 9 | ## CV (Daily) 10 | 11 | #### 20210630 12 | 13 | * [CLIPDraw: Exploring Text-to-Drawing Synthesis through Language-Image Encoders](https://arxiv.org/pdf/2106.14843.pdf) 14 | * [Rethinking Token-Mixing MLP for MLP-based Vision Backbone](https://arxiv.org/pdf/2106.14882.pdf) 15 | * [A Theory-Driven Self-Labeling Refinement Method for Contrastive Representation Learning](https://arxiv.org/pdf/2106.14749.pdf) 16 | * [Post-Training Quantization for Vision Transformer](https://arxiv.org/pdf/2106.14156.pdf) (Yunhe Wang, Wen Gao) 17 | * [Semi-supervised Semantic Segmentation with Directional Context-aware Consistency](https://arxiv.org/pdf/2106.14133.pdf) (Jiaya Jia) 18 | * [An Image Classifier Can Suffice Video Understanding](https://arxiv.org/pdf/2106.14104.pdf) 19 | * [Multimodal Few-Shot Learning with Frozen Language Models](https://arxiv.org/pdf/2106.13884.pdf) 20 | * [Inverting and Understanding Object Detectors](https://arxiv.org/pdf/2106.13933.pdf) 21 | * [K-Net: Towards Unified Image Segmentation](https://arxiv.org/pdf/2106.14855.pdf) 22 | * [Early Convolutions Help Transformers See Better](https://arxiv.org/pdf/2106.14881.pdf) 23 | * [Domain Adaptive YOLO for One-Stage Cross-Domain Detection](https://arxiv.org/pdf/2106.13939.pdf) (BMVC) 24 | 25 | 26 | #### 20210615 27 | 28 | - [Styleformer: Transformer based Generative Adversarial Networks with Style Vector](https://arxiv.org/pdf/2106.07023.pdf) 29 | > Transformer-based GAN for (unconditional) image generation; comparable to SOTA 30 | > METHOD: 1. 
we change the demodulation of StyleGAN2 and modify the existing transformer structure (e.g., residual connection, layer normalization) to create a strong style-based generator with a convolution-free structure; 2. We also make Styleformer lighter by applying Linformer. 31 | > [code](https://github.com/Jeeseung-Park/Styleformer) 32 | 33 | - [Exploring and Distilling Posterior and Prior Knowledge for Radiology Report Generation](https://arxiv.org/pdf/2106.06963.pdf) (CVPR'21) 34 | > TASK: Automatically generating radiology reports 35 | > PROBLEM: Yet, this task remains a challenging job for data-driven neural networks, due to the serious visual and textual data biases 36 | > METHOD: To this end, we propose a Posterior-and-Prior Knowledge Exploring-and-Distilling approach (PPKED) to imitate the working patterns of radiologists, who will first examine the abnormal regions and assign the disease topic tags to the abnormal regions, and then rely on the years of prior medical knowledge and prior working experience accumulations to write reports. 37 | 38 | - [Cross-Modal Attention Consistency for Video-Audio Unsupervised Learning](https://arxiv.org/pdf/2106.06939.pdf) (Shaobo Min, Yongdong Zhang, Jingdong Wang) 39 | > MOTIVATION: Human visual perception could attend to regions where sounds are made, and our auditory perception could also ground their frequencies of sounding objects, which we call bidirectional local correspondence. Such supervision is intuitive but not well explored in the contrastive learning framework 40 | > A pretext task, Cross-Modal Attention Consistency (CMAC), aims to align the regional attention generated purely from the visual signal with the target attention generated under the guidance of acoustic signal, and do a similar alignment for frequency grounding on the acoustic attention. 
41 | 42 | - [Video Super-Resolution Transformer](https://arxiv.org/pdf/2106.06847.pdf) (Luc Van Gool) 43 | > TASK: Video super-resolution (VSR) 44 | > PROBLEM: However, the typical block design of Transformer with a fully connected self-attention layer and a token-wise feed-forward layer does not fit well for VSR due to the following two reasons. First, the fully connected self-attention layer neglects to exploit the data locality because this layer relies on linear layers to compute attention maps. Second, the token-wise feed-forward layer lacks the feature alignment which is important for VSR since this layer independently processes each of the input token embeddings without any interaction among them. 45 | > 46 | > METHOD: Specifically, to tackle the first issue, we present a spatial-temporal convolutional self-attention layer with a theoretical understanding to exploit the locality information. For the second issue, we design a bidirectional optical flow-based feed-forward layer to discover the correlations across different video frames and also align features. 47 | 48 | - [Go Small and Similar: A Simple Output Decay Brings Better Performance](https://arxiv.org/pdf/2106.06726.pdf) 49 | > FUNNY 50 | > This paper begins with empirical observations that better performances are significantly associated with output distributions, that have smaller average values and variances 51 | > By audaciously assuming there is causality involved, we propose a novel regularization term, called Output Decay, that enforces the model to assign smaller and similar output values on each class. 52 | 53 | - [Disrupting Model Training with Adversarial Shortcuts](https://arxiv.org/pdf/2106.06654.pdf) 54 | > A new attack method: adversarial shortcuts, which encourage models to rely on non-robust signals rather than semantic features. 
55 | 56 | - [Large-Scale Unsupervised Object Discovery](https://arxiv.org/pdf/2106.06650.pdf) 57 | > Existing approaches to unsupervised object discovery (UOD) do not scale up to large datasets without approximations which compromise their performance. We propose a novel formulation of UOD as a ranking problem. Impressive experimental results; scales to large datasets. 58 | 59 | - [PopSkipJump: Decision-Based Attack for Probabilistic Classifiers](https://arxiv.org/pdf/2106.07445.pdf) (ICML'21) 60 | > Many existing attack algorithms cover various settings, from white-box to black-box classifiers, but typically assume that the answers are **deterministic** and often fail when they are not. We therefore propose a new adversarial decision-based attack **specifically designed for classifiers with probabilistic outputs.** 61 | > [code](https://github.com/cjsg/PopSkipJump) 62 | 63 | - [Robust Representation Learning via Perceptual Similarity Metrics](https://arxiv.org/pdf/2106.06620.pdf) (ICML'21) 64 | > Contrastive Input Morphing (CIM), a representation learning framework that learns input-space transformations of the data to mitigate the effect of irrelevant input features on downstream performance. Our method leverages a perceptual similarity metric via a triplet loss to ensure that the transformation preserves task-relevant information. 65 | 66 | - :star: [Delving Deep into the Generalization of Vision Transformers under Distribution Shifts](https://arxiv.org/pdf/2106.07617.pdf) (Ziwei Liu) A starting point for ViT domain adaptation 67 | > In this work, we provide a comprehensive study on the out-of-distribution generalization of Vision Transformers 68 | > SETTINGS: we first present a taxonomy of distribution shifts by categorizing them into five conceptual groups: corruption shift, background shift, texture shift, destruction shift, and style shift. Then we perform extensive evaluations of ViT variants under different groups of distribution shifts and compare their generalization ability with Convolutional Neural Network (CNN) models. 
69 | > OBSERVATIONS: 1) ViTs generalize better than CNNs under multiple distribution shifts, with the same or fewer parameters; 2) Larger ViTs gradually narrow the in-distribution (ID) and out-of-distribution (OOD) performance gap. 70 | > To further improve the generalization of ViTs, we design the Generalization-Enhanced Vision Transformers by integrating adversarial learning, information theory, and self-supervised learning. We observe the gradient-sensitivity of Vision Transformers and design a smoother learning strategy to achieve a stable training process. 71 | > Further OBSERVATIONS: 1) For the enhanced model, larger ViTs still benefit more for the out-of-distribution generalization. 2) Generalization-enhanced Vision Transformers are more sensitive to the hyper-parameters than their corresponding CNN models. 72 | 73 | - [Improved Transformer for High-Resolution GANs](https://arxiv.org/pdf/2106.07631.pdf) 74 | > In this paper, we introduce two key ingredients to Transformer to address this challenge. First, in low-resolution stages of the generative process, standard global self-attention is replaced with the proposed multi-axis blocked self-attention which allows efficient mixing of local and global attention. Second, in high-resolution stages, we drop self-attention while only keeping multi-layer perceptrons reminiscent of the implicit neural function. To further improve the performance, we introduce an additional self-modulation component based on cross-attention. Reduces the computational cost of attention; the approach is straightforward. 75 | 76 | - [Magic Layouts: Structural Prior for Component Detection in User Interface Designs](https://arxiv.org/pdf/2106.07615.pdf) (CVPR'21) FUNNY APPLICATION 77 | 78 | - [PolarStream: Streaming Lidar Object Detection and Segmentation with Polar Pillars](https://arxiv.org/pdf/2106.07545.pdf) 79 | > However, due to use of cartesian coordinate systems these methods represent the sectors as rectangular regions, wasting memory and compute. 
In this work we propose using a polar coordinate system and make two key improvements on this design. 80 | 81 | - [S^2-MLP: Spatial-Shift MLP Architecture for Vision](https://arxiv.org/pdf/2106.07477.pdf) 82 | > An improvement along the MLP line: based on the observation that the spatial-specific property causes over-fitting, the paper proposes a spatial-shift module 83 | > The performance drop of MLP-Mixer motivates us to rethink the token-mixing MLP. We discover that token-mixing operation in MLP-Mixer is a variant of depthwise convolution with a global reception field and spatial-specific configuration. But the global reception field and the spatial-specific property make token-mixing MLP prone to over-fitting 84 | > In this paper, we propose a novel pure MLP architecture, spatial-shift MLP (S^2-MLP). Different from MLP-Mixer, our S^2-MLP only contains channel-mixing MLP. We devise a spatial-shift operation for achieving the communication between patches. It has a local reception field and is spatial-agnostic. 85 | 86 | - :star: :star: [Partial success in closing the gap between human and machine vision](https://arxiv.org/pdf/2106.07411.pdf) 87 | > Our findings are threefold. (1.) The longstanding robustness gap between humans and CNNs is closing, with the best models now matching or exceeding human performance on most OOD datasets. (2.) There is still a substantial image-level consistency gap, meaning that humans make different errors than models. In contrast, most models systematically agree in their categorisation errors, even substantially different ones like contrastive self-supervised vs. standard supervised models. (3.) 
In many cases, human-to-model consistency improves when training dataset size is increased by one to three orders of magnitude 88 | > [code](https://github.com/bethgelab/model-vs-human/) 89 | 90 | - [Variational Quanvolutional Neural Networks with enhanced image encoding](https://arxiv.org/pdf/2106.07327.pdf) FUNNY 91 | 92 | - [Time Lens: Event-based Video Frame Interpolation](https://arxiv.org/pdf/2106.07286.pdf) (CVPR'21) 93 | 94 | - [Attention-based Domain Adaptation for Single Stage Detectors](https://arxiv.org/pdf/2106.07283.pdf) 95 | > Domain-adaptive object detection for single-stage detectors 96 | > Previous work has mostly focused on two-stage detectors. This is because their use of region proposals makes it possible to perform local adaptation, which has been shown to significantly improve the adaptation effectiveness. 97 | > To nonetheless benefit from the strength of local adaptation, we introduce an attention mechanism that lets us identify the important regions on which adaptation should focus. Our approach is generic and can be integrated into any single-stage detector. 98 | 99 | - [SinIR: Efficient General Image Manipulation with Single Image Reconstruction](https://arxiv.org/pdf/2106.07140.pdf) (ICML'21, Qifeng Chen) 100 | > We propose SinIR, an efficient reconstruction-based framework **trained on a single natural image** for **general image manipulation, including super-resolution, editing, harmonization, paint-to-image, photo-realistic style transfer, and artistic style transfer**. 101 | > Moreover, with a much simpler training objective (i.e., reconstruction), SinIR is trained 33.5 times faster than SinGAN (for 500 × 500 images) that solves similar tasks. 
102 | > [code](https://github.com/YooJiHyeong/SinIR) 103 | 104 | - [Survey: Image Mixing and Deleting for Data Augmentation](https://arxiv.org/pdf/2106.07085.pdf) 105 | 106 | 107 | #### 20210611 108 | - :star: :star: [MST: Masked Self-Supervised Transformer for Visual Representation](https://arxiv.org/pdf/2106.05656.pdf) 109 | 110 | > Achieves local contrastive learning on transformers via attention masks, which benefits dense downstream tasks; outperforms DINO. 111 | > METHOD: Specifically, inspired by the Masked Language Modeling (MLM) in NLP, we propose a masked token strategy based on the multi-head self-attention map, which dynamically masks some tokens of local patches without damaging the crucial structure for self-supervised learning. More importantly, the masked tokens together with the remaining tokens are further recovered by a global image decoder, which preserves the spatial information of the image and is more friendly to the downstream dense prediction tasks. 112 | 113 | 114 | - :star: :star: [Revisiting Contrastive Methods for Unsupervised Learning of Visual Representations](https://arxiv.org/pdf/2106.05967.pdf) (Luc Van Gool) 115 | 116 | > However, current methods are still primarily applied to curated datasets like ImageNet. In this paper, we first study how biases in the dataset affect existing methods. Our results show that an approach like MoCo [22] works surprisingly well across: (i) object- versus scene-centric, (ii) uniform versus long-tailed and (iii) general versus domain-specific datasets. 117 | > Second, given the generality of the approach, we try to realize further gains with minor modifications. We show that learning additional invariances - through the use of multi-scale cropping, stronger augmentations and nearest neighbors - improves the representations. 118 | > Finally, we observe that MoCo learns spatially structured representations when trained with a multi-crop strategy. The representations can be used for semantic segment retrieval and video instance segmentation without finetuning. 
Moreover, the results are on par with specialized models 119 | 120 | - :star: [Learning to See by Looking at Noise](https://arxiv.org/pdf/2106.05963.pdf) (MIT) 121 | 122 | > FUNNY 123 | > Current vision systems are trained on huge datasets, and these datasets come with costs: curation is expensive, they inherit human biases, and there are concerns over privacy and usage rights. To counter these costs, interest has surged in learning from cheaper data sources, such as unlabeled images. In this paper we go a step further and ask if we can do away with real image datasets entirely, instead learning from noise processes. We investigate a suite of image generation models that produce images from simple random processes. These are then used as training data for a visual representation learner with a contrastive loss. 124 | 125 | - :star: [What Does Rotation Prediction Tell Us about Classifier Accuracy under Varying Testing Environments?](https://arxiv.org/pdf/2106.05961.pdf) 126 | 127 | > ICML'21 VERY FUNNY 128 | > A natural question then arises: given a trained classifier, can we evaluate its accuracy on varying unlabeled test sets? In this work, we train semantic classification and rotation prediction in a multi-task way. On a series of datasets, we report an interesting finding, i.e., the semantic classification accuracy exhibits a strong linear relationship with the accuracy of the rotation prediction task (Pearson’s Correlation r > 0.88). This finding allows us to utilize linear regression to estimate classifier performance from the accuracy of rotation prediction which can be obtained on the test set through the freely generated rotation labels. 
129 | 130 | - [Implicit Feature Alignment: Learn to Convert Text Recognizer to Text Spotter](https://arxiv.org/pdf/2106.05920.pdf) 131 | 132 | > In this paper, we propose a simple, elegant and effective paradigm called Implicit Feature Alignment (IFA), which can be easily integrated into current text recognizers, resulting in a novel inference mechanism called IFA inference. This enables an ordinary text recognizer to process multi-line text such that text detection can be completely freed. 133 | 134 | - [CAT: Cross Attention in Vision Transformer](https://arxiv.org/pdf/2106.05786.pdf) 135 | 136 | > In this paper, we propose a new attention mechanism in Transformer termed Cross Attention, which alternates attention inner the image patch instead of the whole image to capture local information and apply attention between image patches which are divided from single-channel feature maps to capture global information. Both operations have less computation than standard self-attention in Transformer. 137 | > [code](https://github.com/linhezheng19/CAT) 138 | 139 | - [Deep neural network loses attention to adversarial images](https://arxiv.org/pdf/2106.05657.pdf) 140 | 141 | > Studies the interpretability of adversarial examples through attention maps 142 | 143 | - [Space-time Mixing Attention for Video Transformer](https://arxiv.org/pdf/2106.05968.pdf) 144 | 145 | > In this work, we propose a Video Transformer model the complexity of which scales linearly with the number of frames in the video sequence and hence induces no overhead compared to an image-based Transformer model. 146 | 147 | - [Multi-Dataset Benchmarks for Masked Identification using Contrastive Representation Learning](https://arxiv.org/pdf/2106.05596.pdf) 148 | 149 | > Face recognition for people wearing masks 150 | 151 | - [Progressive Stage-wise Learning for Unsupervised Feature Representation Enhancement](https://arxiv.org/pdf/2106.05554.pdf) (Alan Yuille, Bingbing Ni, Wen Gao) 152 | 153 | > Progressive Stage-wise Learning (PSL) framework. 
For a given unsupervised task, we design multi-level tasks and define different learning stages for the deep network. Early learning stages are forced to focus on low-level tasks while late stages are guided to extract deeper information through harder tasks. 154 | > Introduces curriculum learning / self-paced learning into self-supervised and unsupervised learning 155 | 156 | - :star: [Cross-domain Contrastive Learning for Unsupervised Domain Adaptation](https://arxiv.org/pdf/2106.05528.pdf) (Guo-Jun Qi) 157 | 158 | > In this work, we build upon contrastive self-supervised learning to align features so as to reduce the domain discrepancy between training and testing sets. 159 | > Exploring the same set of categories shared by both domains, we introduce a simple yet effective framework CDCL, for domain alignment. In particular, given an anchor image from one domain, we minimize its distances to cross-domain samples from the same class relative to those from different categories. Since target labels are unavailable, we use a clustering-based approach with carefully initialized centers to produce pseudo labels. 160 | > In addition, we demonstrate that CDCL is a general framework and can be adapted to the data-free setting, where the source data are unavailable during training, with minimal modification 161 | 162 | - [Learning to Affiliate: Mutual Centralized Learning for Few-shot Classification](https://arxiv.org/pdf/2106.05517.pdf) (Deng Cai) 163 | 164 | > Existing work: They generally explore a unidirectional query-to-support paradigm in FSL, e.g., find the nearest/optimal support feature for each query feature and aggregate these local matches for a joint classification. 
165 | > This paper: In this paper, we propose a new method Mutual Centralized Learning (MCL) to fully affiliate the two disjoint sets of dense features in a bidirectional paradigm 166 | 167 | - :star: [Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers](https://arxiv.org/pdf/2106.05392.pdf) (FAIR, Oxford) 168 | 169 | > In video transformers, the time dimension is often treated in the same way as the two spatial dimensions. However, in a scene where objects or the camera may move, a physical point imaged at one location in frame t may be entirely unrelated to what is found at that location in frame t + k. These temporal correspondences should be modeled to facilitate learning about dynamic scenes. 170 | > To this end, we propose a new drop-in block for video transformers, trajectory attention, that aggregates information along implicitly determined motion paths. We additionally propose a new method to address the quadratic dependence of computation and memory on the input size, which is particularly important for high resolution or long videos. 171 | > [code](https://github.com/facebookresearch/Motionformer) 172 | 173 | - :star: [Beyond BatchNorm: Towards a General Understanding of Normalization in Deep Learning](https://arxiv.org/pdf/2106.05956.pdf) 174 | 175 | > In this work, we take a first step towards this goal by extending known properties of BatchNorm in randomly initialized deep neural networks (DNNs) to nine recently proposed normalization layers. 176 | > Our primary findings follow: (i) Similar to BatchNorm, activations-based normalization layers can avoid exploding activations in ResNets; (ii) Use of GroupNorm ensures rank of activations is at least Ω(√(width/Group Size)), thus explaining why LayerNorm witnesses slow optimization speed; and (iii) Small group sizes result in large gradient norm in earlier layers, hence justifying training instability issues in Instance Normalization and illustrating a speed-stability tradeoff in GroupNorm. 
177 | 178 | 179 | - [AFAN: Augmented Feature Alignment Network for Cross-Domain Object Detection](https://arxiv.org/pdf/2106.05499.pdf) (Ling Shao) 180 | 181 | #### 20210610 182 | 183 | * [VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation](https://arxiv.org/pdf/2106.04632.pdf) A video-and-language benchmark modeled on GLUE: an assemblage of 11 video-and-language datasets over 3 popular tasks: (i) text-to-video retrieval; (ii) video question answering; and (iii) video captioning. 184 | * [Check It Again: Progressive Visual Question Answering via Visual Entailment](https://arxiv.org/pdf/2106.04605.pdf) ACL 185 | * :star: [Distilling Image Classifiers in Object Detectors](https://arxiv.org/pdf/2106.05209.pdf) Nevertheless, the knowledge distillation literature remains limited to the scenario where the student and the teacher tackle the same task. Here, we investigate the problem of transferring knowledge not only across architectures but also across tasks. To this end, we study the case of object detection and, instead of following the standard detector-to-detector distillation approach, introduce a classifier-to-detector knowledge transfer framework. 
186 | 187 | #### 20210503 188 | 189 | * [A Good Image Generator Is What You Need for High-Resolution Video Synthesis](https://arxiv.org/pdf/2104.15069.pdf) (ICLR'21) 190 | * [Semantic Relation Preserving Knowledge Distillation for Image-to-Image Translation](https://arxiv.org/pdf/2104.15082.pdf) (ECCV'20) 191 | * [DriveGAN: Towards a Controllable High-Quality Neural Simulation](https://arxiv.org/pdf/2104.15060.pdf) (CVPR'21, Oral) 192 | * [Faster Meta Update Strategy for Noise-Robust Deep Learning](https://arxiv.org/pdf/2104.15092.pdf) (Yi Yang) 193 | * [Updatable Siamese Tracker with Two-stage One-shot Learning](https://arxiv.org/pdf/2104.15049.pdf) 194 | * [Unsupervised Data Augmentation for Object Detection](https://arxiv.org/pdf/2104.14965.pdf) 195 | * [Learning Multi-Granular Hypergraphs for Video-Based Person Re-Identification](https://arxiv.org/pdf/2104.14913.pdf) (CVPR'20, Ling Shao) 196 | * [BiCnet-TKS: Learning Efficient Spatial-Temporal Representation for Video Person Re-Identification](https://arxiv.org/pdf/2104.14783.pdf) 197 | * [CoCon: Cooperative-Contrastive Learning](https://arxiv.org/pdf/2104.14764.pdf) 198 | * [MOOD: Multi-level Out-of-distribution Detection](https://arxiv.org/pdf/2104.14726.pdf) 199 | 200 | ##### vision transformers 201 | 202 | * [CAT: Cross-Attention Transformer for One-Shot Object Detection](https://arxiv.org/pdf/2104.14984.pdf) 203 | * [End-to-End Attention-based Image Captioning](https://arxiv.org/pdf/2104.14721.pdf) 204 | * [Chop Chop BERT: Visual Question Answering by Chopping VisualBERT’s Heads](https://arxiv.org/pdf/2104.14741.pdf) 205 | * [HandsFormer: Keypoint Transformer for Monocular 3D Pose Estimation of Hands and Object in Interaction](https://arxiv.org/pdf/2104.14639.pdf) 206 | * [Pyramid Medical Transformer for Medical Image Segmentation](https://arxiv.org/ftp/arxiv/papers/2104/2104.14702.pdf) 207 | * [CoSformer: Detecting Co-Salient Object with Transformers](https://arxiv.org/pdf/2104.14729.pdf) 208 | * 
[Perceptual Image Quality Assessment with Transformers](https://arxiv.org/pdf/2104.14730.pdf) 209 | 210 | -------------------------------------------------------------------------------- /202107.md: -------------------------------------------------------------------------------- 1 | # Arxiv-Daily 2 | 3 | My daily arxiv reading notes. 4 | 5 | [2021 March](202103.md) 6 | 7 | [2021 April](202104.md) 8 | 9 | [2021 June](202106.md) 10 | 11 | ## CV (Daily) 12 | 13 | #### 20210730 14 | 15 | * :star: [Open-World Entity Segmentation](https://arxiv.org/pdf/2107.14228.pdf) (Jiaya Jia) [code](https://github.com/dvlab-research/Entity) 16 | * We introduce a new image segmentation task, termed Entity Segmentation (ES) with the aim to segment all visual entities in an image without considering semantic category labels 17 | * It has many practical applications in image manipulation/editing where the segmentation mask quality is typically crucial but category labels are less important. 18 | * In this setting, all semantically-meaningful segments are equally treated as categoryless entities and there is no thing-stuff distinction. 19 | * ES enables the following: (1) merging multiple datasets to form a large training set without the need to resolve label conflicts; (2) any model trained on one dataset can generalize exceptionally well to other datasets with unseen domains. 20 | * [Learning with Noisy Labels for Robust Point Cloud Segmentation](https://arxiv.org/pdf/2107.14230.pdf) (Dongdong Chen) 21 | * Point cloud segmentation is a fundamental task in 3D. Object class labels are often mislabeled in real-world point cloud datasets. In this work, we take the lead in solving this issue by proposing a novel Point Noise-Adaptive Learning (PNAL) framework. 22 | * The framework is noise-rate blind, to cope with the spatially variant noise rate problem specific to point clouds. 
23 | * :star: [Rethinking and Improving Relative Position Encoding for Vision Transformer](https://arxiv.org/pdf/2107.14222.pdf) (Jianlong Fu) [code](https://github.com/microsoft/AutoML/tree/main/iRPE) 24 | * Can relative position encoding work equally well as absolute position? In order to clarify this, we first review existing relative position encoding methods and analyze their pros and cons when applied in vision transformers. 25 | * We then propose new relative position encoding methods dedicated to 2D images, called image RPE (iRPE). Our methods consider **directional relative distance modeling** as well as the interactions between queries and relative position embeddings in self-attention mechanism. 26 | * Experiments demonstrate that solely due to the proposed encoding methods, DeiT and DETR obtain up to 1.5% (top-1 Acc) and 1.3% (mAP) stable improvements 27 | * [A Unified Efficient Pyramid Transformer for Semantic Segmentation](https://arxiv.org/pdf/2107.14209.pdf) (Mu Li) 28 | * Uses deformable attention 29 | * [ReFormer: The Relational Transformer for Image Captioning](https://arxiv.org/pdf/2107.14178.pdf) 30 | * We propose a novel architecture ReFormer, a RElational transFORMER, to generate features with relation information embedded and to explicitly express the pair-wise relationships between objects in the image 31 | * ReFormer incorporates the objective of scene graph generation with that of image captioning using one modified Transformer model. 32 | 33 | * [Probabilistic and Geometric Depth: Detecting Objects in Perspective](https://arxiv.org/pdf/2107.14160.pdf) (Dahua Lin) 34 | * [Personalized Image Semantic Segmentation](https://arxiv.org/pdf/2107.13978.pdf) (Ming-Ming Cheng) (ICCV'21) 35 | * The objective is to generate more accurate segmentation results on unlabeled personalized images by investigating the data’s personalized traits. 
36 | * To open up future research in this area, we collect a large dataset containing various users’ personalized images called PIS (Personalized Image Semantic Segmentation) 37 | * [FREE: Feature Refinement for Generalized Zero-Shot Learning](https://arxiv.org/pdf/2107.13807.pdf) (Ling Shao) (ICCV'21) 38 | * Generalized zero-shot learning (GZSL) has achieved significant progress, with many efforts dedicated to overcoming the problems of **visual-semantic domain gap** and **seen-unseen bias**. 39 | * However, most existing methods directly use feature extraction models trained on ImageNet alone, ignoring the **cross-dataset bias between ImageNet and GZSL benchmarks** 40 | * [Geometry Uncertainty Projection Network for Monocular 3D Object Detection](https://arxiv.org/pdf/2107.13774.pdf) (Wanli Ouyang) (ICCV'21) 41 | * [Discovering 3D Parts from Image Collections](https://arxiv.org/pdf/2107.13629.pdf) (Ming-Hsuan Yang) (ICCV'21) 42 | * [Few-Shot and Continual Learning with Attentive Independent Mechanisms](https://arxiv.org/pdf/2107.14053.pdf) (ICCV'21) 43 | 44 | #### 20210728 45 | 46 | * [Exploring Sequence Feature Alignment for Domain Adaptive Detection Transformers](https://arxiv.org/pdf/2107.12636.pdf) (MM'21) [code](https://github.com/encounter1997/SFA) 47 | * [Enriching Local and Global Contexts for Temporal Action Localization](https://arxiv.org/pdf/2107.12960.pdf) (ICCV'21) 48 | * [Adaptive Denoising via GainTuning](https://arxiv.org/pdf/2107.12815.pdf) 49 | 50 | #### 20210727 51 | 52 | * [Improve Unsupervised Pretraining for Few-label Transfer](https://arxiv.org/pdf/2107.12369.pdf) (Suichan Li, Dongdong Chen, Nenghai Yu) (ICCV'21) 53 | * Based on the analysis, we interestingly discover that only involving some unlabeled target domain into the unsupervised pretraining can improve the clustering quality, subsequently reducing the transfer performance gap with supervised pretraining. 
54 | * [Spatial-Temporal Transformer for Dynamic Scene Graph Generation](https://arxiv.org/pdf/2107.12309.pdf) (ICCV'21) 55 | * Dynamic scene graph generation aims at generating a scene graph of the given video 56 | * In this paper, we propose Spatial-temporal Transformer (STTran), a neural network that consists of two core modules: (1) a spatial encoder that takes an input frame to extract spatial context and reason about the visual relationships within a frame, and (2) a temporal decoder which takes the output of the spatial encoder as input in order to capture the temporal dependencies between frames and infer the dynamic relationships 57 | * [Contextual Transformer Networks for Visual Recognition ](https://arxiv.org/pdf/2107.12292.pdf) (Ting Yao, Tao Mei) 58 | * most of existing designs directly employ self-attention over a 2D feature map to obtain the attention matrix based on pairs of isolated queries and keys at each spatial location, but leave the rich contexts among neighbor keys under-exploited 59 | * CoT block first mines the static context among keys via a 3×3 convolution. Next, based on the query and contextualized key, two consecutive 1×1 convolutions are utilized to perform self-attention, yielding the dynamic context. The static and dynamic contexts are finally fused as outputs 60 | * Our CoT block is appealing in the view that it can readily replace each 3 × 3 convolution in ResNet architectures, yielding a Transformer-style backbone named as Contextual Transformer Networks (CoTNet). 61 | * [Adaptive Hierarchical Graph Reasoning with Semantic Coherence for Video-and-Language Inference](https://arxiv.org/pdf/2107.12270.pdf) (Yi Yang, Yueting Zhuang) 62 | * TASK: Video-and-Language Inference is a recently proposed task for joint video-and-language understanding. This new task requires a model to draw inference on whether a natural language statement entails or contradicts a given video clip. 
63 | * [Text is Text, No Matter What: Unifying Text Recognition using Knowledge Distillation](https://arxiv.org/pdf/2107.12087.pdf) (ICCV'21) 64 | * The challenging nature of the text recognition problem however dictated a fragmentation of research efforts: Scene Text Recognition (STR) that deals with text in everyday scenes, and Handwriting Text Recognition (HTR) that tackles hand-written text. 65 | * In this paper, for the first time, we argue for their unification – we aim for a single model that can compete favourably with two separate state-of-the-art STR and HTR models 66 | * [Parametric Contrastive Learning](https://arxiv.org/pdf/2107.12028.pdf) (Jiaya Jia) (ICCV'21) 67 | * we propose Parametric Contrastive Learning (PaCo) to tackle long-tailed recognition 68 | * Based on theoretical analysis, we observe **supervised contrastive loss** tends to bias on high-frequency classes and thus increases the difficulty of imbalance learning. 69 | * We introduce a set of **parametric class-wise learnable centers** to rebalance from an optimization perspective. Further, we analyze our PaCo loss under a balanced setting. 70 | * [Language Models as Zero-shot Visual Semantic Learner](https://arxiv.org/pdf/2107.12021.pdf) 71 | * 72 | 73 | #### 20210719 74 | 75 | * :star: [The Benchmark Lottery](https://arxiv.org/pdf/2107.07002.pdf) 76 | 77 | * [A Survey on Bias in Visual Datasets](https://arxiv.org/pdf/2107.07919.pdf) 78 | 79 | i) describe the biases that can affect visual datasets; ii) review the literature on methods for bias discovery and quantification in visual datasets; iii) discuss existing attempts to collect bias-aware visual datasets. 
80 | 81 | * [Rectifying the Shortcut Learning of Background: Shared Object Concentration for Few-Shot Image Recognition](https://arxiv.org/pdf/2107.07746.pdf) 82 | In this paper, we observe that image background serves as a source of domain-specific knowledge, which is a shortcut for models to learn in the source dataset, but is harmful when adapting to brand-new classes. 83 | 84 | * [CutDepth: Edge-aware Data Augmentation in Depth Estimation](https://arxiv.org/pdf/2107.07684.pdf) 85 | In this paper, we propose a data augmentation method, called CutDepth. In CutDepth, part of the depth is pasted onto an input image during training. The method extends variations data without destroying edge features. 86 | 87 | * [Align before Fuse: Vision and Language Representation Learning with Momentum Distillation](https://arxiv.org/pdf/2107.07651.pdf) 88 | 89 | * [Multi-Level Contrastive Learning for Few-Shot Problems](https://arxiv.org/pdf/2107.07608.pdf) 90 | Most current applications of contrastive learning benefit only a single representation from the last layer of an encoder. In this paper, we propose a multi-level contrastive learning approach which applies contrastive losses at different layers of an encoder to learn multiple representations from the encoder.
91 | 92 | 93 | #### 20210716 94 | 95 | * [Recurrent Parameter Generators](https://arxiv.org/pdf/2107.07110.pdf) (Yann LeCun) 96 | * [From Show to Tell: A Survey on Image Captioning](https://arxiv.org/pdf/2107.06912.pdf) 97 | * [StyleFusion: A Generative Model for Disentangling Spatial Segments](https://arxiv.org/pdf/2107.07437.pdf) 98 | 99 | #### 20210715 100 | 101 | * [Deep Neural Networks are Surprisingly Reversible: A Baseline for Zero-Shot Inversion](https://arxiv.org/pdf/2107.06304.pdf) (NVIDIA) 102 | * [A Generalized Lottery Ticket Hypothesis](https://arxiv.org/pdf/2107.06825.pdf) 103 | * [How Much Can CLIP Benefit Vision-and-Language Tasks?](https://arxiv.org/pdf/2107.06383.pdf) 104 | 105 | 106 | #### 20210706 107 | 108 | * [What Makes for Hierarchical Vision Transformer?](https://arxiv.org/pdf/2107.02174.pdf) (Xinggang Wang) 109 | * [MixStyle Neural Networks for Domain Generalization and Adaptation](https://arxiv.org/pdf/2107.02053.pdf) 110 | * [On Model Calibration for Long-Tailed Object Detection and Instance Segmentation](https://arxiv.org/pdf/2107.02170.pdf) (Boqing Gong) 111 | * [Test-Time Personalization with a Transformer for Human Pose Estimation](https://arxiv.org/pdf/2107.02133.pdf) (Xiaolong Wang) 112 | 113 | #### 20210705 114 | * :star: [Simpler, Faster, Stronger: Breaking The log-K Curse On Contrastive Learners With FlatNCE](https://arxiv.org/pdf/2107.01152.pdf) 115 | * [How Incomplete is Contrastive Learning?
An Inter-intra Variant Dual Representation Method for Self-supervised Video Recognition](https://arxiv.org/pdf/2107.01194.pdf) 116 | * [A Survey on Deep Learning Technique for Video Segmentation](https://arxiv.org/pdf/2107.01153.pdf) (Wenguan Wang, Luc Van Gool) 117 | * [Collaborative Visual Navigation](https://arxiv.org/pdf/2107.01151.pdf) (Wenguan Wang, Xizhou Zhu, Jifeng Dai) 118 | * [Unsupervised Single Image Super-resolution Under Complex Noise](https://arxiv.org/pdf/2107.00986.pdf) 119 | * [Polarized Self-Attention: Towards High-quality Pixel-wise Regression](https://arxiv.org/pdf/2107.00782.pdf) 120 | * [Blind Image Super-Resolution via Contrastive Representation Learning](https://arxiv.org/pdf/2107.00708.pdf) 121 | * [Rapid Neural Architecture Search by Learning to Generate Graphs from Datasets](https://arxiv.org/pdf/2107.00860.pdf) (ICLR'21) 122 | 123 | #### 20210702 124 | 125 | * [CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows](https://arxiv.org/pdf/2107.00652.pdf) (Dongdong Chen, Nenghai Yu, Baining Guo) 126 | * [AutoFormer: Searching Transformers for Visual Recognition](https://arxiv.org/pdf/2107.00651.pdf) (Jianlong Fu) 127 | * :star: [CLIP-It! Language-Guided Video Summarization](https://arxiv.org/pdf/2107.00650.pdf) (Trevor Darrell) 128 | * [On the Practicality of Deterministic Epistemic Uncertainty](https://arxiv.org/pdf/2107.00649.pdf) (Luc Van Gool, Fisher Yu) 129 | * [Global Filter Networks for Image Classification](https://arxiv.org/pdf/2107.00645.pdf) transformer + Fourier transform 130 | * [Focal Self-attention for Local-Global Interactions in Vision Transformers](https://arxiv.org/pdf/2107.00641.pdf) 131 | * :star: [CBNetV2: A Composite Backbone Network Architecture for Object Detection](https://arxiv.org/pdf/2107.00420.pdf) In this paper, we propose a novel backbone network, namely CBNetV2, by constructing compositions of existing open-sourced pretrained backbones.
132 | * [OPT: Omni-Perception Pre-Trainer for Cross-Modal Understanding and Generation](https://arxiv.org/pdf/2107.00249.pdf) (Hanqing Lu) we propose an Omni-perception PreTrainer (OPT) for cross-modal understanding and generation, by jointly modeling visual, text and audio resources. 133 | * :star: [Simple Training Strategies and Model Scaling for Object Detection](https://arxiv.org/pdf/2107.00057.pdf) (Tsung-Yi Lin) 134 | * [CLDA: Contrastive Learning for Semi-Supervised Domain Adaptation](https://arxiv.org/pdf/2107.00085.pdf) 135 | * [Attention Bottlenecks for Multimodal Fusion](https://arxiv.org/pdf/2107.00135.pdf) 136 | * [Learning to See before Learning to Act: Visual Pre-training for Manipulation](https://arxiv.org/pdf/2107.00646.pdf) (Phillip Isola, Tsung-Yi Lin) 137 | * [Stabilizing Deep Q-Learning with ConvNets and Vision Transformers under Data Augmentation](https://arxiv.org/pdf/2107.00644.pdf) (Xiaolong Wang) 138 | * :star: [AdaXpert: Adapting Neural Architecture for Growing Data](https://arxiv.org/pdf/2107.00254.pdf) 139 | * [FedMix: Approximation of Mixup under Mean Augmented Federated Learning](https://arxiv.org/pdf/2107.00233.pdf) (ICLR'21) 140 | * [Scalable Certified Segmentation via Randomized Smoothing](https://arxiv.org/pdf/2107.00228.pdf) (ICML'21) 141 | * [Revisiting Knowledge Distillation: An Inheritance and Exploration Framework](https://arxiv.org/pdf/2107.00181.pdf) (CVPR'21, Tongliang Liu, Xinmei Tian, Houqiang Li, Xian-Sheng Hua) 142 | * [Sanity Checks for Lottery Tickets: Does Your Winning Ticket Really Win the Jackpot?](https://arxiv.org/pdf/2107.00166.pdf) 143 | 144 | 145 | #### 20210701 146 | 147 | * [SOLO: A Simple Framework for Instance Segmentation](https://arxiv.org/pdf/2106.15947.pdf) SOLO TPAMI version 148 | * [Align Yourself: Self-supervised Pre-training for Fine-grained Recognition via Saliency Alignment](https://arxiv.org/pdf/2106.15788.pdf) 149 | * [Augmented Shortcuts for Vision 
Transformers](https://arxiv.org/pdf/2106.15941.pdf) (Yunhe Wang) 150 | * [Multi-Source Domain Adaptation for Object Detection](https://arxiv.org/pdf/2106.15793.pdf) -------------------------------------------------------------------------------- /202108.md: -------------------------------------------------------------------------------- 1 | # Arxiv-Daily 2 | 3 | My daily arxiv reading notes. 4 | 5 | [2021 March](202103.md) 6 | 7 | [2021 April](202104.md) 8 | 9 | [2021 June](202106.md) 10 | 11 | [2021 July](202107.md) 12 | 13 | ## CV (Daily) 14 | 15 | #### 20210809 16 | 17 | * :star: [Improving Contrastive Learning by Visualizing Feature Transformation](https://arxiv.org/pdf/2108.02982.pdf) (Chang Wen Chen, ICCV Oral) [code](https://github.com/DTennant/CL-Visualizing-Feature-Transformation) 18 | * In this paper, we attempt to devise a feature-level data manipulation, differing from data augmentation, to enhance the generic contrastive self-supervised learning 19 | * To this end, we first design a visualization scheme for pos/neg score distribution, which enables us to analyze, interpret and understand the learning process. To our knowledge, this is the first attempt of its kind. 20 | * leveraging this tool, we gain some significant observations, which inspire our novel Feature Transformation proposals including the extrapolation of positives. This operation creates harder positives to boost the learning because hard positives enable the model to be more view-invariant. 21 | * Besides, we propose the interpolation among negatives, which provides diversified negatives and makes the model more discriminative. 
22 | * :star: [Evaluating CLIP: Towards Characterization of Broader Capabilities and Downstream Implications](https://arxiv.org/pdf/2108.02818.pdf) 23 | * [Simpler is Better: Few-shot Semantic Segmentation with Classifier Weight Transformer](https://arxiv.org/pdf/2108.03032.pdf) (ICCV'21) 24 | * [Learning Meta-class Memory for Few-Shot Semantic Segmentation](https://arxiv.org/pdf/2108.02958.pdf) (ICCV'21) 25 | 26 | #### 20210806 27 | 28 | * :star: [Sketch Your Own GAN](https://arxiv.org/pdf/2108.02774.pdf) (Jun-Yan Zhu, ICCV'21) [code](https://github.com/peterwang512/GANSketching) 29 | * In this work, we present a method, GAN Sketching, for rewriting GANs with one or more sketches, to make GANs training easier for novice users. In particular, we change the weights of an original GAN model according to user sketches. 30 | * [Video Contrastive Learning with Global Context](https://arxiv.org/pdf/2108.02722.pdf) (Mu Li) [code](https://github.com/amazon-research/video-contrastive-learning) 31 | * However, existing approaches rely heavily on the short-range spatiotemporal salience to form clip-level contrastive signals, thus limit themselves from using global context 32 | * In this paper, we propose a new video-level contrastive learning method based on segments to formulate positive pairs. Our formulation is able to capture global context in a video, thus robust to temporal content change. We also incorporate a temporal order regularization term to enforce the inherent sequential structure of videos. 33 | * [Instance Similarity Learning for Unsupervised Feature Representation](https://arxiv.org/pdf/2108.02721) (ICCV'21) 34 | * Conventional methods assign close instance pairs in the feature space with high similarity, which usually leads to wrong pairwise relationship for large neighborhoods because the Euclidean distance fails to depict the true semantic similarity on the feature manifold.
35 | * On the contrary, our method mines the feature manifold in an unsupervised manner, through which the semantic similarity among instances is learned in order to obtain discriminative representations. 36 | * [A Low Rank Promoting Prior for Unsupervised Contrastive Learning](https://arxiv.org/pdf/2108.02696.pdf) (Tao Mei, Ting Yao) 37 | * In this paper, we construct a novel probabilistic graphical model that effectively incorporates the low rank promoting prior into the framework of contrastive learning, referred to as LORAC. 38 | * Most importantly, we argue that the low rank prior employed here is not unique, and many different priors can be invoked in a similar probabilistic way, corresponding to different hypotheses about underlying truth behind the contrastive features 39 | * [Residual Attention: A Simple but Effective Method for Multi-Label Recognition](https://arxiv.org/pdf/2108.02456.pdf) (ICCV'21) Funny 40 | * To effectively capture different spatial regions occupied by objects from different categories, we propose an embarrassingly simple module, named class-specific residual attention (CSRA). CSRA generates class-specific features for every category by proposing a simple spatial attention score, and then combines it with the class-agnostic average pooling feature. 
41 | * Furthermore, with only 4 lines of code, CSRA also leads to consistent improvement across many diverse pretrained models and datasets without any extra training 42 | * [Unifying Nonlocal Blocks for Neural Networks](https://arxiv.org/pdf/2108.02451.pdf) (ICCV'21) 43 | * [Token Shift Transformer for Video Classification](https://arxiv.org/pdf/2108.02432.pdf) (MM'21) 44 | * [IDM: An Intermediate Domain Module for Domain Adaptive Person Re-ID](https://arxiv.org/pdf/2108.02413.pdf) (ICCV'21 Oral) 45 | * [ACE: Ally Complementary Experts for Solving Long-Tailed Recognition in One-Shot](https://arxiv.org/pdf/2108.02385.pdf) 46 | * :star: ​[Hierarchical Aggregation for 3D Instance Segmentation](https://arxiv.org/pdf/2108.02350.pdf) (Xinggang Wang) [code](https://github.com/hustvl/HAIS) 47 | * [Boosting Few-shot Semantic Segmentation with Transformers](https://arxiv.org/pdf/2108.02266.pdf) (Luc Van Gool) 48 | 49 | 50 | 51 | #### 20210805 52 | 53 | * [Enhancing Self-supervised Video Representation Learning via Multi-level Feature Optimization](https://arxiv.org/pdf/2108.02183.pdf) (ICCV'21, John See) 54 | * However, most recent works have mainly focused on high-level semantics and neglected lower-level representations and their temporal relationship which are crucial for general video understanding. 55 | * Concretely, high-level features obtained from naive and prototypical contrastive learning are utilized to build distribution graphs, guiding the process of low-level and mid-level feature learning. We also devise a simple temporal modeling module from multi-level features to enhance motion pattern learning. 56 | * [Towards Coherent Visual Storytelling with Ordered Image Attention](https://arxiv.org/pdf/2108.02180.pdf) 57 | * We address the problem of visual storytelling, i.e., generating a story for a given sequence of images. 
While each sentence of the story should describe a corresponding image, a coherent story also needs to be consistent and relate to both future and past images 58 | * [Armour: Generalizable Compact Self-Attention for Vision Transformers](https://arxiv.org/pdf/2108.01778.pdf) 59 | * This paper introduces a compact self-attention mechanism that is fundamental and highly generalizable. The proposed method reduces redundancy and improves efficiency on top of the existing attention optimizations. 60 | * We show its drop-in applicability for both the regular attention mechanism and some most recent variants in vision transformers 61 | * [Vision Transformer with Progressive Sampling](https://arxiv.org/pdf/2108.01684.pdf) (ICCV'21) (Dahua Lin) 62 | * STRAIGHTFORWARD: improves ViT's patch splitting 63 | * However, such naive tokenization could destruct object structures, assign grids to uninterested regions such as background, and introduce interference signals. To mitigate the above issues, in this paper, we propose an iterative and progressive sampling strategy to locate discriminative regions 64 | * [Generic Neural Architecture Search via Regression](https://arxiv.org/pdf/2108.01899.pdf) 65 | * These observations inspire us to ask: Is it necessary to use the performance of specific downstream tasks to evaluate and search for good neural architectures? Can we perform NAS effectively and efficiently while being agnostic to the downstream task? 66 | * GenNAS does not use task-specific labels but instead adopts regression on a set of manually designed synthetic signal bases for architecture evaluation. Such a self-supervised regression task can effectively evaluate the intrinsic power of an architecture to capture and transform the input signal patterns, and allow more sufficient usage of training samples.
67 | 68 | #### 20210804 69 | 70 | * [Generalized Source-free Domain Adaptation](https://arxiv.org/pdf/2108.01614.pdf) (ICCV'21) 71 | * Some recent works tackle source-free domain adaptation (SFDA) where only a source pre-trained model is available for adaptation to the target domain. However, those methods do not consider keeping source performance which is of high practical value in real world applications. In this paper, we propose a new domain adaptation paradigm called Generalized Source-free Domain Adaptation (G-SFDA), where the learned model needs to perform well on both the target and source domains, with only access to current unlabeled target data during adaptation. 72 | * [Boosting Weakly Supervised Object Detection via Learning Bounding Box Adjusters](https://arxiv.org/pdf/2108.01499.pdf) (Wangmeng Zuo, ICCV'21) 73 | * In this paper, we defend the problem setting for improving localization performance by leveraging the bounding box regression knowledge from a well-annotated auxiliary dataset. 74 | * [Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer](https://arxiv.org/pdf/2108.01390.pdf) 75 | * ViT model compression 76 | * Recent efficient designs for vision transformers follow two pipelines, namely, structural compression based on local spatial prior and non-structural token pruning. However, token pruning breaks the spatial structure that is indispensable for local spatial prior. 77 | * To take advantage of both pipelines, this work seeks to dynamically identify uninformative tokens for each instance and trim down both the training and inference complexity while maintaining complete spatial structure and information flow 78 | * [Where do Models go Wrong? Parameter-Space Saliency Maps for Explainability](https://arxiv.org/pdf/2108.01335.pdf) (Funny) 79 | * Conventional saliency maps highlight input features to which neural network predictions are highly sensitive.
We take a different approach to saliency, in which we identify and analyze the network parameters, rather than inputs, which are responsible for erroneous decisions 80 | * We find that samples which cause similar parameters to malfunction are semantically similar. We also show that pruning the most salient parameters for a wrongly classified sample often improves model behavior. Furthermore, fine-tuning a small number of the most salient parameters on a single sample results in error correction on other samples that are misclassified for similar reasons. 81 | * Is improving explainability from the parameter side reliable? (Does it apply to the parameters of models with different architectures? At every stage of training? For different input data?) 82 | * [CanvasVAE: Learning to Generate Vector Graphic Documents](https://arxiv.org/pdf/2108.01249.pdf) (ICCV'21) 83 | * [Domain Generalization via Gradient Surgery](https://arxiv.org/pdf/2108.01621.pdf) (ICCV'21) 84 | * Our hypothesis is that when training with multiple domains, conflicting gradients within each mini-batch contain information specific to the individual domains which is irrelevant to the others, including the test domain. If left untouched, such disagreement may degrade generalization performance. 85 | * In this work, we characterize the conflicting gradients emerging in domain shift scenarios and devise novel gradient agreement strategies based on gradient surgery to alleviate their effect. 86 | * Similar in idea to Zhibo Chen's UDA paper 87 | * [Elastic Architecture Search for Diverse Tasks with Different Resources](https://arxiv.org/pdf/2108.01224.pdf) (Jianfei Cai) 88 | * We study a new challenging problem of efficient deployment for diverse tasks with different resources, where the resource constraint and task of interest corresponding to a group of classes are dynamically specified at testing time. 89 | * we present a novel and general framework, called Elastic Architecture Search (EAS), permitting instant specializations at runtime for diverse tasks with various resource constraints.
90 | * To this end, we first propose to effectively train the over-parameterized network via a task dropout strategy to disentangle the tasks during training. In this way, the resulting model is robust to the subsequent task dropping at inference time. Based on the well-trained over-parameterized network, we then propose an efficient architecture generator to obtain optimal architectures within a single forward pass. 91 | * [Toward Spatially Unbiased Generative Models](https://arxiv.org/pdf/2108.01285.pdf) (ICCV'21, Funny) 92 | * Recent image generation models show remarkable generation performance. However, they mirror strong location preference in datasets, which we call spatial bias. Therefore, generators render poor samples at unseen locations and scales. 93 | * We argue that the generators rely on their implicit positional encoding to render spatial content. From our observations, the generator’s implicit positional encoding is translation-variant, making the generator spatially biased. 94 | * To address this issue, we propose injecting explicit positional encoding at each scale of the generator. By learning the spatially unbiased generator, we facilitate the robust use of generators in multiple tasks, such as GAN inversion, multi-scale generation, generation of arbitrary sizes and aspect ratios.
95 | 96 | 97 | 98 | #### 20210803 99 | 100 | * [HiFT: Hierarchical Feature Transformer for Aerial Tracking](https://arxiv.org/pdf/2108.00202.pdf) (ICCV'21) 101 | * Uses DETR for tracking 102 | * [S^2-MLPv2: Improved Spatial-Shift MLP Architecture for Vision](https://arxiv.org/pdf/2108.01072.pdf) 103 | * :star: [Image Synthesis and Editing with Stochastic Differential Equations](https://arxiv.org/pdf/2108.01073.pdf) ([Jun-Yan Zhu](https://www.cs.cmu.edu/~junyanz/)) 104 | * [Multilevel Knowledge Transfer for Cross-Domain Object Detection](https://arxiv.org/pdf/2108.00977.pdf) 105 | * incremental: combines image translation, adversarial training, and pseudo-labels 106 | * [Word2Pix: Word to Pixel Cross Attention Transformer in Visual Grounding](https://arxiv.org/pdf/2108.00205.pdf) 107 | * In this paper we propose Word2Pix: a one-stage visual grounding network based on encoder-decoder transformer architecture that enables learning for textual to visual feature correspondence via word to pixel attention. 108 | * The embedding of each word from the query sentence is treated alike by attending to visual pixels individually instead of single holistic sentence embedding. 109 | * [CrossFormer: A Versatile Vision Transformer Based on Cross-scale Attention](https://arxiv.org/pdf/2108.00154.pdf) (Deng Cai) 110 | * However, existing vision transformers still do not possess an ability that is important to visual input: building the attention among features of different scales 111 | * we propose Cross-scale Embedding Layer (CEL) and Long Short Distance Attention (LSDA). 112 | * :star: [StyleGAN-NADA: CLIP-Guided Domain Adaptation of Image Generators](https://arxiv.org/pdf/2108.00946.pdf) (NVIDIA) 113 | * Leveraging the semantic power of large scale Contrastive-Language-Image-Pretraining (CLIP) models, we present a text-driven method that allows shifting a generative model to new domains, without having to collect even a single image from those domains.
114 | * We show that **through natural language prompts** and a few minutes of training, our method can adapt a generator across a multitude of domains characterized by diverse styles and shapes. 115 | * [GTNet:Guided Transformer Network for Detecting Human-Object Interactions](https://arxiv.org/pdf/2108.00596.pdf) 116 | * [GraphFPN: Graph Feature Pyramid Network for Object Detection](https://arxiv.org/pdf/2108.00580.pdf) (ICCV'21) 117 | * State-of-the-art methods for multi-scale feature learning focus on performing feature interactions across space and scales using neural networks **with a fixed topology**. 118 | * In this paper, we propose graph feature pyramid networks that are capable of **adapting their topological structures to varying intrinsic image structures**, and **supporting simultaneous feature interactions across all scales**. 119 | * [Greedy Network Enlarging](https://arxiv.org/pdf/2108.00177.pdf) (Yunhe Wang) 120 | * Scaling for CNNs 121 | * [Multi-scale Matching Networks for Semantic Correspondence](https://arxiv.org/pdf/2108.00211.pdf) (ICCV'21) 122 | * [Learning Instance-level Spatial-Temporal Patterns for Person Re-identification](https://arxiv.org/pdf/2108.00171.pdf) (Tieniu Tan) 123 | * :star: [Multi-Head Self-Attention via Vision Transformer for Zero-Shot Learning](https://arxiv.org/pdf/2108.00045.pdf) [code](https://github.com/FaisalAlamri0/ViT-ZSL) 124 | * [Object-aware Contrastive Learning for Debiased Scene Representation](https://arxiv.org/pdf/2108.00049.pdf) [code](git@github.com:alinlab/occon.git) 125 | * Addresses contrastive learning's over-attention to background regions (is this really a problem?) 126 | * However, the learned representations are often contextually biased to the spurious scene correlations of different objects or object and background, which may harm their generalization on the downstream tasks.
127 | * To tackle the issue, we develop a novel object-aware contrastive learning framework that first (a) localizes objects in a self-supervised manner and then (b) debias scene correlations via appropriate data augmentations considering the inferred object locations 128 | * [Conditional Bures Metric for Domain Adaptation](https://arxiv.org/pdf/2108.00302.pdf) (CVPR'21) 129 | * [Group Fisher Pruning for Practical Network Compression](https://arxiv.org/pdf/2108.00708.pdf) (ICML'21) 130 | 131 | #### 20210802 132 | 133 | * :star: [Perceiver IO: A General Architecture for Structured Inputs & Outputs ](https://arxiv.org/pdf/2107.14795.pdf) (Andrew Zisserman, deepmind) [code](https://github.com/deepmind/deepmind-research/tree/master/perceiver) 134 | * The recently-proposed Perceiver model obtains good results on several domains (images, audio, multimodal, point clouds) while scaling linearly in compute and memory with the input size. 135 | * While the Perceiver supports many kinds of inputs, it can only produce very simple outputs such as class scores. Perceiver IO overcomes this limitation without sacrificing the original’s appealing properties by learning to flexibly query the model’s latent space to produce outputs of arbitrary size and semantics. 136 | * :star: [Dynamic Neural Representational Decoders for High-Resolution Semantic Segmentation](https://arxiv.org/pdf/2107.14428.pdf) (Zhi Tian, Chunhua Shen) 137 | * Here, we propose a novel decoder, termed dynamic neural representational decoder (NRD), which is simple yet significantly more efficient. 138 | * As each location on the encoder’s output corresponds to a local patch of the semantic labels, in this work, **we represent these local patches of labels with compact neural networks**. This neural representation enables our decoder to leverage the **smoothness prior** in the semantic label space, and thus makes our decoder more efficient. 
139 | * Furthermore, these neural representations are **dynamically generated and conditioned on the outputs of the encoder networks**. The desired semantic labels can be efficiently decoded from the neural representations, resulting in high-resolution semantic segmentation predictions 140 | * [DPT: Deformable Patch-based Transformer for Visual Recognition ](https://arxiv.org/pdf/2107.14467.pdf) (MM'21) 141 | * straight forward, but hard to reject. [code](https://github.com/CASIA-IVA-Lab/DPT) 142 | * Existing methods usually use a fixed-size patch embedding which might destroy the semantics of objects. 143 | * To address this problem, we propose a new Deformable Patch (DePatch) module which learns to adaptively split the images into patches with different positions and scales in a data-driven way rather than using predefined fixed patches. In this way, our method can well preserve the semantics in patches. 144 | * The DePatch module can work as a plug-and-play module, which can easily be incorporated into different transformers to achieve an end-to-end training. 145 | * [T-SVDNet: Exploring High-Order Prototypical Correlations for Multi-Source Domain Adaptation](https://arxiv.org/pdf/2107.14447.pdf) (ICCV'21) 146 | * [Sparse-to-dense Feature Matching: Intra and Inter domain Cross-modal Learning in Domain Adaptation for 3D Semantic Segmentation](https://arxiv.org/pdf/2107.14724.pdf) 147 | * With the rise of multi-modal datasets, large amount of 2D images are accessible besides 3D point clouds. 
In light of this, we propose to further leverage 2D data for 3D domain adaptation by intra and inter domain cross modal learning 148 | * [Product1M: Towards Weakly Supervised Instance-Level Product Retrieval via Cross-modal Pretraining](https://arxiv.org/pdf/2107.14572.pdf) 149 | * [On the Efficacy of Small Self-Supervised Contrastive Models without Distillation Signals](https://arxiv.org/pdf/2107.14762.pdf) (Yueting Zhuang) 150 | * Problem: It is a consensus that small models perform quite poorly under the paradigm of self-supervised contrastive learning. (Really?) 151 | * Analysis: In this paper, we study the issue of training self-supervised small models without distillation signals. We first evaluate the representation spaces of the small models and make two non-negligible observations: (i) small models can complete the pretext task without overfitting despite their limited capacity; (ii) small models universally suffer the problem of overclustering. 152 | * Solution: Finally, we combine the validated techniques and improve the baseline of five small architectures with considerable margins, which indicates that training small self-supervised contrastive models is feasible even without distillation signals 153 | * [ADeLA: Automatic Dense Labeling with Attention for Viewpoint Adaptation in Semantic Segmentation](https://arxiv.org/pdf/2107.14285.pdf) 154 | * We describe an unsupervised domain adaptation method for image content shift caused by viewpoint changes for a semantic segmentation task.
155 | * [Fourier Series Expansion Based Filter Parametrization for Equivariant Convolutions](https://arxiv.org/pdf/2107.14519.pdf) 156 | * [Manipulating Identical Filter Redundancy for Efficient Pruning on Deep and Complicated CNN](https://arxiv.org/pdf/2107.14444.pdf) 157 | * [OpenForensics: Large-Scale Challenging Dataset For Multi-Face Forgery Detection And Segmentation In-The-Wild](https://arxiv.org/pdf/2107.14480.pdf) (ICCV'21) 158 | 159 | -------------------------------------------------------------------------------- /202109.md: -------------------------------------------------------------------------------- 1 | # Arxiv-Daily 2 | 3 | My daily arxiv reading notes. 4 | 5 | [2021 March](202103.md) 6 | 7 | [2021 April](202104.md) 8 | 9 | [2021 June](202106.md) 10 | 11 | [2021 July](202107.md) 12 | 13 | [2021 Aug](202108.md) 14 | 15 | ## CV (Daily) 16 | 17 | #### 20210917 18 | 19 | * :star: [An End-to-End Transformer Model for 3D Object Detection](https://arxiv.org/pdf/2109.08141.pdf) 20 | * [code](https://github.com/facebookresearch/3detr) ICCV oral 21 | * Specifically, we find that a standard Transformer with non-parametric queries and Fourier positional embeddings is competitive with specialized architectures that employ libraries of 3D-specific operators with hand-tuned hyperparameters. 22 | * 3DETR is conceptually simple and easy to implement, enabling further improvements by incorporating 3D domain knowledge. 23 | * :star: [Exploiting Activation based Gradient Output Sparsity to Accelerate Backpropagation in CNNs](https://arxiv.org/pdf/2109.07710.pdf) 24 | * However, training these models involving large parameters is both time-consuming and energy-hogging. In this regard, several prior works have advocated for sparsity to speed up DL training and, more so, the inference phase. 25 | * This work begins with the observation that during training, sparsity in the forward and backward passes is correlated.
In that context, we investigate two types of sparsity (input and output type) inherent in gradient descent-based optimization algorithms and propose a hardware micro-architecture to leverage the same. 26 | * Related to the ICLR'21 work by Yulin Wang and Gao Huang? 27 | * [Label Assignment Distillation for Object Detection](https://arxiv.org/pdf/2109.07843.pdf) 28 | * [Dense Semantic Contrast for Self-Supervised Visual Representation Learning](https://arxiv.org/pdf/2109.07756.pdf) (MM'21 oral) 29 | * [Few-Shot Object Detection by Attending to Per-Sample-Prototype](https://arxiv.org/pdf/2109.07734.pdf) 30 | * [METEOR: A Massive Dense & Heterogeneous Behavior Dataset for Autonomous Driving](https://arxiv.org/pdf/2109.07648.pdf) 31 | * [RAFT-Stereo: Multilevel Recurrent Field Transforms for Stereo Matching](https://arxiv.org/pdf/2109.07547.pdf) (Jia Deng) 32 | 33 | #### 20210913 34 | 35 | * :star: [Is Attention Better Than Matrix Decomposition?](https://arxiv.org/pdf/2109.04553.pdf) (ICLR'21) 36 | * Our intriguing finding is that self-attention is not better than the matrix decomposition (MD) model developed 20 years ago regarding the performance and computational cost for encoding the long-distance dependencies. 37 | * We model the global context issue as a low-rank recovery problem and show that its optimization algorithms can help design global information blocks. 38 | * This paper then proposes a series of Hamburgers, in which we employ the optimization algorithms for solving MDs to factorize the input representations into sub-matrices and reconstruct a low-rank embedding.
Hamburgers with different MDs can perform favorably against the popular global context module self-attention when carefully coping with gradients back-propagated through MDs. 39 | * [TADA: Taxonomy Adaptive Domain Adaptation](https://arxiv.org/pdf/2109.04813.pdf) (Dengxin Dai, Wenguan Wang, Fisher Yu, Luc Van Gool) 40 | * Funny 41 | * We therefore introduce the more general taxonomy adaptive domain adaptation (TADA) problem, allowing for inconsistent taxonomies between the two domains. 42 | * We further propose an approach that jointly addresses the image-level and label-level domain adaptation. On the label-level, we employ a bilateral mixed sampling strategy to augment the target domain, and a relabelling method to unify and align the label spaces. We address the image-level domain gap by proposing an uncertainty-rectified contrastive learning method, leading to more domain-invariant and class discriminative features. 43 | * Different TADA settings: open taxonomy, coarse-to-fine taxonomy, and partially-overlapping taxonomy 44 | * [EfficientCLIP: Efficient Cross-Modal Pre-training by Ensemble Confident Learning and Language Modeling](https://arxiv.org/pdf/2109.04699.pdf) 45 | * EfficientCLIP uses Ensemble Confident Learning to obtain a less noisy data subset. Extra rich non-paired single-modal text data is used for boosting the generalization of the text branch. 46 | * We achieve the state-of-the-art performance on Chinese cross-modal retrieval tasks with only 1/10 training resources compared to CLIP and WenLan. 47 | * [LAViTeR: Learning Aligned Visual and Textual Representations Assisted by Image and Caption Generation](https://arxiv.org/pdf/2109.04993.pdf) 48 | * This paper proposes LAViTeR, a novel architecture for visual and textual representation learning. The main module, Visual Textual Alignment (VTA) will be assisted by two auxiliary tasks, GAN-based image synthesis and Image Captioning.
We also propose a new evaluation metric measuring the similarity between the learnt visual and textual embedding. 49 | * [LibFewShot: A Comprehensive Library for Few-shot Learning](https://arxiv.org/pdf/2109.04898.pdf) (Jiebo Luo) 50 | * Code library 51 | * [An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA](https://arxiv.org/pdf/2109.05014.pdf) 52 | 53 | #### 20210911 54 | 55 | * [RobustART: Benchmarking Robustness on Architecture Design and Training Techniques](https://arxiv.org/pdf/2109.05211.pdf) 56 | * (Alan Yuille, Philip H.S. Torr, Dacheng Tao) 57 | * [CDTrans: Cross-domain Transformer for Unsupervised Domain Adaptation](https://arxiv.org/pdf/2109.06165.pdf) 58 | * With the success of Transformer in various tasks, we find that the cross-attention in Transformer is robust to the noisy input pairs for better feature alignment, thus in this paper Transformer is adopted for the challenging UDA task. 59 | * Specifically, to generate accurate input pairs, we design a two-way center-aware labeling algorithm to produce pseudo labels for target samples.
60 | * [DAFNe: A One-Stage Anchor-Free Deep Model for Oriented Object Detection](https://arxiv.org/pdf/2109.06148.pdf) 61 | * [BGT-Net: Bidirectional GRU Transformer Network for Scene Graph Generation](https://arxiv.org/pdf/2109.05346.pdf) (CVPR'21) 62 | * [Mutual Supervision for Dense Object Detection](https://arxiv.org/pdf/2109.05986.pdf) (ICCV'21) 63 | * [Sparse MLP for Image Recognition: Is Self-Attention Really Necessary?](https://arxiv.org/pdf/2109.05422.pdf) (Wenjun Zeng) 64 | * Applies the idea of CC-NET to MLPs 65 | * [Adversarially Trained Object Detector for Unsupervised Domain Adaptation](https://arxiv.org/pdf/2109.05751.pdf) 66 | * [Variational Disentanglement for Domain Generalization](https://arxiv.org/pdf/2109.05826.pdf) 67 | 68 | * [Cylindrical and Asymmetrical 3D Convolution Networks for LiDAR-based Perception](https://arxiv.org/pdf/2109.05441.pdf) (TPAMI'21) 69 | 70 | 71 | #### 20210910 72 | 73 | * [Leveraging Local Domains for Image-to-Image Translation](https://arxiv.org/pdf/2109.04468.pdf) 74 | * In this paper, we leverage human knowledge about spatial domain characteristics which we refer to as 'local domains' and demonstrate its benefit for image-to-image translation. Relying on a simple geometrical guidance, we train a patch-based GAN on few source data and hallucinate a new unseen domain which subsequently eases transfer learning to the target. 75 | * :star: [NEAT: Neural Attention Fields for End-to-End Autonomous Driving](https://arxiv.org/pdf/2109.04456.pdf) (ICCV'21) 76 | * [code](https://github.com/autonomousvision/neat) 77 | * [ConvMLP: Hierarchical Convolutional MLPs for Vision](https://arxiv.org/pdf/2109.04454.pdf) 78 | * To tackle these problems, we propose ConvMLP: a hierarchical Convolutional MLP for visual recognition, which is a light-weight, stage-wise co-design of convolution layers and MLPs.
79 | * In particular, ConvMLP-S achieves 76.8% top-1 accuracy on ImageNet-1k with 9M parameters and 2.4G MACs (15% and 19% of MLP-Mixer-B/16, respectively) 80 | * Experiments on object detection and semantic segmentation further show that visual representation learned by ConvMLP can be seamlessly transferred and achieve competitive results with fewer parameters 81 | * [TxT: Crossmodal End-to-End Learning with Transformers](https://arxiv.org/pdf/2109.04422.pdf) 82 | * [IICNet: A Generic Framework for Reversible Image Conversion](https://arxiv.org/pdf/2109.04242.pdf) 83 | * [Vision-and-Language or Vision-for-Language? On Cross-Modal Influence in Multimodal Transformers](https://arxiv.org/pdf/2109.04448.pdf) 84 | 85 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2021 WangWen 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Arxiv-Daily 2 | 3 | My daily arxiv reading notes. 4 | 5 | [2021 March](202103.md) 6 | 7 | [2021 April](202104.md) 8 | 9 | [2021 June](202106.md) 10 | 11 | [2021 July](202107.md) 12 | 13 | [2021 Aug](202108.md) 14 | 15 | [2021 Sep](202109.md) 16 | 17 | ## CV (Daily) 18 | 19 | #### 20211108 20 | 21 | * :star: [EditGAN: High-Precision Semantic Image Editing](https://arxiv.org/pdf/2111.03186.pdf) (NVIDIA, MIT) NIPS 22 | * [Improving Visual Quality of Image Synthesis by A Token-based Generator with Transformers](https://arxiv.org/pdf/2111.03481.pdf) (Jianlong Fu) NIPS 23 | * Uses a transformer for image generation; angles that avoid a forced, mechanical application: (1) token-by-token generation helps preserve the local characteristics of the generated image; (2) new perspective: token-based generator 24 | 25 | #### 20211110 26 | 27 | * [Data Augmentation Can Improve Robustness](https://arxiv.org/pdf/2111.05328.pdf) (NIPS21) 28 | 29 | * Adversarial training suffers from robust overfitting; this paper proposes a data augmentation method to improve robustness. 30 | * Adversarial training suffers from robust overfitting, a phenomenon where the robust test accuracy starts to decrease during training. In this paper, we focus on reducing robust overfitting by using common data augmentation schemes. 31 | * We demonstrate that, contrary to previous findings, when combined with model weight averaging, data augmentation can significantly boost robust accuracy.
32 | 33 | * [Sliced Recursive Transformer](https://arxiv.org/pdf/2111.05297.pdf) (Eric Xing) 34 | 35 | * ICLR submission (65553) 36 | * We present a neat yet effective recursive operation on vision transformers that can improve parameter utilization without involving additional parameters. This is achieved by sharing weights across depth of transformer networks. 37 | 38 | * [MixACM: Mixup-Based Robustness Transfer via Distillation of Activated Channel Maps](https://arxiv.org/pdf/2111.05073.pdf) (NIPS21) 39 | 40 | * The most common way to improve adversarial robustness is still adversarial training; this paper proposes an alternative that uses knowledge distillation from a robust teacher network to improve the student's robustness. 41 | 42 | * First, we theoretically show the transferability of robustness from an adversarially trained teacher model to a student model with the help of mixup augmentation. 43 | * MixACM transfers robustness from a robust teacher to a student by matching activated channel maps generated without expensive adversarial perturbations 44 | 45 | * [Self-Interpretable Model with Transformation Equivariant Interpretation]() (NIPS21) 46 | 47 | * Interpretation networks are unstable and easily disturbed by perturbations or transformations of the data. This paper proposes a robust interpretation method that introduces a transformation-invariance constraint into a self-interpretable model. 48 | * Recent studies have found that interpretation methods can be sensitive and unreliable, where the interpretations can be disturbed by perturbations or transformations of input data. 49 | * To address this issue, we propose to learn robust interpretations through transformation equivariant regularization in a self-interpretable model. 50 | 51 | --------------------------------------------------------------------------------