├── .gitignore
├── README.md
├── code
│   └── README.md
└── figs
    └── survey_pipeline.jpg

/.gitignore:
--------------------------------------------------------------------------------
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
work_dir/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# pyenv
.python-version

# celery beat schedule file
celerybeat-schedule

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/

data/
data
.vscode
.idea
.DS_Store

# custom
*.pkl
*.pkl.json
*.log.json

# Pytorch
*.pth
*.py~
*.sh~

debug/*
vis/
analysis/*
pretrain/*
--------------------------------------------------------------------------------

/README.md:
--------------------------------------------------------------------------------

[![Awesome](https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg)](https://github.com/sindresorhus/awesome)
[![PR's Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg?style=flat)](https://github.com/lxtGH/Awesome-Segmenation-With-Transformer/pulls)
# Transformer-Based Visual Segmentation: A Survey

**T-PAMI, 2024**

Xiangtai Li (Project Lead) · Henghui Ding · Haobo Yuan · Wenwei Zhang · Guangliang Cheng ·
Jiangmiao Pang · Kai Chen · Ziwei Liu · Chen Change Loy

[arXiv PDF](https://arxiv.org/abs/2304.09854) · S-Lab Project Page · TPAMI PDF
This repo is used for recording, tracking, and benchmarking several recent transformer-based visual segmentation methods,
as a supplement to our [survey](https://arxiv.org/abs/2304.09854).
If you find a missing work or have any suggestions (papers, implementations, and other resources), feel free to open
a [pull request](https://github.com/lxtGH/Awesome-Segmenation-With-Transformer/pulls).
We will add the missing papers to this repo ASAP.

### 🔥News

[-] 2024: Accepted by T-PAMI.

[-] 2024-03: Added several CVPR-24 works in these directions. You are welcome to add your CVPR work to our repo!

[-] 2023-12: The third version is on arXiv ([survey](https://arxiv.org/abs/2304.09854)); more benchmarks and methods are included!

[-] 2023-06: The second draft is on arXiv.

### 🔥Highlight!!

[1] Previous transformer surveys divide the methods by task and experimental setting. Unlike them, we revisit and group the existing transformer-based methods from the **technical perspective**.

[2] We survey the methods in two parts: one for the mainstream tasks based on the DETR-like meta-architecture, the other for related directions, organized by task.

[3] We further re-benchmark several representative works on image semantic segmentation and panoptic segmentation datasets.

[4] We also include query-based detection transformers, since both segmentation and detection tasks are unified by the object query.

## Introduction

We present the first detailed survey on transformer-based visual segmentation.

![Survey pipeline](./figs/survey_pipeline.jpg)

## Summary of Contents

- [Methods: A Survey](#methods-a-survey)
  - [Meta-Architecture](#meta-architecture)
  - [Strong Representation](#Strong-Representation)
  - [Interaction Design in Decoder](#Interaction-Design-in-Decoder)
  - [Optimizing Object Query](#Optimizing-Object-Query)
  - [Using Query For Association](#Using-Query-For-Association)
  - [Conditional Query Generation](#Conditional-Query-Generation)
- [Related Domains and Beyond](#Related-Domains-and-Beyond)
  - [Point Cloud Segmentation](#Point-Cloud-Segmentation)
  - [Tuning Foundation Models](#Tuning-Foundation-Models)
  - [Domain-aware Segmentation](#Domain-aware-Segmentation)
  - [Label and Model Efficient Segmentation](#Label-and-Model-Efficient-Segmentation)
  - [Class Agnostic Segmentation and Tracking](#Class-Agnostic-Segmentation-and-Tracking)
  - [Medical Image Segmentation](#Medical-Image-Segmentation)

## Methods: A Survey

### Meta-Architecture

| Year | Venue | Acronym | Paper Title | Code/Project |
|:----:|:-------:|:---------------:|-------------|--------------|
| 2020 | ECCV | DETR | [End-to-End Object Detection with Transformers](https://arxiv.org/abs/2005.12872) | [Code](https://github.com/facebookresearch/detr) |
| 2021 | ICLR | Deformable DETR | [Deformable DETR: Deformable Transformers for End-to-End Object Detection](https://arxiv.org/abs/2010.04159) | [Code](https://github.com/fundamentalvision/Deformable-DETR) |
| 2021 | CVPR | MaX-DeepLab | [MaX-DeepLab: End-to-End Panoptic Segmentation with Mask Transformers](https://arxiv.org/abs/2012.00759) | [Code](https://github.com/google-research/deeplab2) |
| 2021 | NeurIPS | MaskFormer | [MaskFormer: Per-Pixel Classification is Not All You Need for Semantic Segmentation](http://arxiv.org/abs/2107.06278) | [Code](https://github.com/facebookresearch/MaskFormer) |
| 2021 | NeurIPS | K-Net | [K-Net: Towards Unified Image Segmentation](https://arxiv.org/abs/2106.14855) | [Code](https://github.com/ZwwWayne/K-Net) |
| 2023 | CVPR | Lite-DETR | [Lite DETR: An Interleaved Multi-Scale Encoder for Efficient DETR](https://arxiv.org/pdf/2303.07335) | [Code](https://github.com/IDEA-Research/Lite-DETR) |
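All methods in this table share one meta-architecture: a backbone extracts features, a set of learnable object queries is refined by a transformer decoder, and each refined query predicts one class label plus one box or mask. The minimal PyTorch sketch below illustrates that flow; the module choices, sizes, and the simple dot-product mask head are our own illustrative assumptions, not the implementation of any specific paper.

```python
import torch
import torch.nn as nn

class MetaArchitecture(nn.Module):
    """Minimal DETR/MaskFormer-style meta-architecture sketch.

    Backbone features -> transformer decoder with N learnable object
    queries -> per-query class logits and mask logits. All hyper-
    parameters here are illustrative assumptions.
    """

    def __init__(self, num_queries=100, num_classes=80, dim=256):
        super().__init__()
        self.backbone = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # stand-in for ResNet/ViT
        self.queries = nn.Embedding(num_queries, dim)                 # learnable object queries
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        self.class_head = nn.Linear(dim, num_classes + 1)             # +1 for "no object"
        self.mask_head = nn.Linear(dim, dim)                          # query -> mask embedding

    def forward(self, images):
        feats = self.backbone(images)                  # (B, C, H, W)
        B = feats.shape[0]
        memory = feats.flatten(2).transpose(1, 2)      # (B, HW, C) tokens for cross-attention
        q = self.queries.weight.unsqueeze(0).expand(B, -1, -1)
        q = self.decoder(q, memory)                    # queries attend to image tokens
        logits = self.class_head(q)                    # (B, N, num_classes + 1)
        # Mask logits: dot product between mask embeddings and pixel features.
        masks = torch.einsum("bnc,bchw->bnhw", self.mask_head(q), feats)
        return logits, masks

logits, masks = MetaArchitecture()(torch.randn(2, 3, 224, 224))
print(logits.shape, masks.shape)  # (2, 100, 81) and (2, 100, 14, 14)
```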
### Strong Representation

#### Better ViTs Design

| Year | Venue | Acronym | Paper Title | Code/Project |
|:----:|:-------:|:-----------:|-------------|--------------|
| 2021 | CVPR | SETR | [Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers](https://arxiv.org/abs/2012.15840) | [Code](https://github.com/fudan-zvg/SETR) |
| 2021 | ICCV | MViT-V1 | [Multiscale Vision Transformers](https://arxiv.org/abs/2104.11227) | [Code](https://github.com/facebookresearch/mvit) |
| 2022 | CVPR | MViT-V2 | [MViTv2: Improved Multiscale Vision Transformers for Classification and Detection](https://arxiv.org/abs/2112.01526) | [Code](https://github.com/facebookresearch/mvit) |
| 2021 | NeurIPS | XCiT | [XCiT: Cross-Covariance Image Transformers](https://arxiv.org/abs/2106.09681) | [Code](https://github.com/facebookresearch/xcit) |
| 2021 | ICCV | Pyramid ViT | [Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions](https://arxiv.org/abs/2102.12122) | [Code](https://github.com/whai362/PVT) |
| 2021 | ICCV | CrossViT | [CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification](https://arxiv.org/abs/2103.14899) | [Code](https://github.com/IBM/CrossViT) |
| 2021 | ICCV | CoaT | [Co-Scale Conv-Attentional Image Transformers](https://arxiv.org/abs/2104.06399) | [Code](https://github.com/mlpc-ucsd/CoaT) |
| 2022 | CVPR | MPViT | [MPViT: Multi-Path Vision Transformer for Dense Prediction](https://arxiv.org/abs/2112.11010) | [Code](https://github.com/youngwanLEE/MPViT) |
| 2022 | NeurIPS | SegViT | [SegViT: Semantic Segmentation with Plain Vision Transformers](https://arxiv.org/abs/2210.05844) | [Code](https://github.com/zbwxp/SegVit) |
| 2022 | arXiv | RSSeg | [Representation Separation for Semantic Segmentation with Vision Transformers](https://arxiv.org/abs/2212.13764) | N/A |

#### Hybrid CNNs/Transformers/MLPs

| Year | Venue | Acronym | Paper Title | Code/Project |
|:----:|:-------:|:----------:|-------------|--------------|
| 2021 | ICCV | Swin | [Swin Transformer: Hierarchical Vision Transformer using Shifted Windows](https://arxiv.org/abs/2103.14030) | [Code](https://github.com/microsoft/Swin-Transformer) |
| 2022 | CVPR | Swin-V2 | [Swin Transformer V2: Scaling Up Capacity and Resolution](https://arxiv.org/abs/2111.09883) | [Code](https://github.com/microsoft/Swin-Transformer) |
| 2021 | NeurIPS | SegFormer | [SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers](https://arxiv.org/abs/2105.15203) | [Code](http://github.com/NVlabs/SegFormer) |
| 2022 | CVPR | CMT | [CMT: Convolutional Neural Networks Meet Vision Transformers](https://arxiv.org/abs/2107.06263) | [Code](https://github.com/FlyEgle/CMT-pytorch) |
| 2021 | NeurIPS | Twins | [Twins: Revisiting the Design of Spatial Attention in Vision Transformers](https://arxiv.org/abs/2104.13840) | [Code](https://github.com/Meituan-AutoML/Twins) |
| 2021 | ICCV | CvT | [CvT: Introducing Convolutions to Vision Transformers](https://arxiv.org/abs/2103.15808) | [Code](https://github.com/microsoft/CvT) |
| 2021 | NeurIPS | ViTAE | [ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias](https://arxiv.org/abs/2106.03348) | [Code](https://github.com/ViTAE-Transformer/ViTAE-Transformer) |
| 2022 | CVPR | ConvNeXt | [A ConvNet for the 2020s](https://arxiv.org/abs/2201.03545) | [Code](https://github.com/facebookresearch/ConvNeXt) |
| 2022 | NeurIPS | SegNeXt | [SegNeXt: Rethinking Convolutional Attention Design for Semantic Segmentation](https://github.com/visual-attention-network/segnext) | [Code](https://github.com/visual-attention-network/segnext) |
| 2022 | CVPR | PoolFormer | [PoolFormer: MetaFormer Is Actually What You Need for Vision](https://arxiv.org/abs/2111.11418) | [Code](https://github.com/sail-sg/poolformer) |
| 2023 | ICLR | STM | [Demystify Transformers & Convolutions in Modern Image Deep Networks](https://arxiv.org/abs/2211.05781) | [Code](https://github.com/OpenGVLab/STM-Evaluation) |
#### Self-Supervised Learning

| Year | Venue | Acronym | Paper Title | Code/Project |
|:----:|:-------:|:-----------:|-------------|--------------|
| 2021 | ICCV | MoCo-V3 | [An Empirical Study of Training Self-Supervised Vision Transformers](https://arxiv.org/abs/2104.02057) | [Code](https://github.com/facebookresearch/moco-v3) |
| 2022 | ICLR | BEiT | [BEiT: BERT Pre-Training of Image Transformers](https://arxiv.org/abs/2106.08254) | [Code](https://github.com/microsoft/unilm/tree/master/beit) |
| 2022 | CVPR | MaskFeat | [Masked Feature Prediction for Self-Supervised Visual Pre-Training](https://arxiv.org/abs/2112.09133) | [Code](https://github.com/facebookresearch/SlowFast) |
| 2022 | CVPR | MAE | [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377) | [Code](https://github.com/facebookresearch/mae) |
| 2022 | NeurIPS | ConvMAE | [MCMAE: Masked Convolution Meets Masked Autoencoders](https://arxiv.org/abs/2303.05475) | [Code](https://github.com/Alpha-VL/ConvMAE) |
| 2023 | ICLR | SparK | [SparK: The First Successful BERT/MAE-style Pretraining on Any Convolutional Networks](https://github.com/keyu-tian/SparK) | [Code](https://github.com/keyu-tian/SparK) |
| 2022 | CVPR | FLIP | [Scaling Language-Image Pre-training via Masking](https://arxiv.org/abs/2212.00794) | [Code](https://github.com/facebookresearch/flip) |
| 2023 | arXiv | ConvNeXt-V2 | [ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders](http://arxiv.org/abs/2301.00808) | [Code](https://github.com/facebookresearch/ConvNeXt-V2) |
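The masked-image-modeling entries above (MAE, ConvMAE, SparK) pre-train a backbone by hiding a large fraction of patch tokens and reconstructing them. Below is a sketch of the random-masking step alone, assuming a generic (B, N, C) patch-token tensor; the 75% ratio follows the MAE paper, everything else is an illustrative assumption.

```python
import torch

def random_masking(tokens: torch.Tensor, mask_ratio: float = 0.75):
    """Keep a random subset of patch tokens, MAE-style.

    tokens: (B, N, C) patch embeddings. Returns the visible tokens,
    a binary mask marking the hidden (reconstruction-target) tokens,
    and the indices that restore the original token order.
    """
    B, N, C = tokens.shape
    n_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N)                          # one random score per token
    shuffle = noise.argsort(dim=1)                    # random permutation per image
    keep = shuffle[:, :n_keep]                        # indices of visible tokens
    visible = tokens.gather(1, keep.unsqueeze(-1).expand(-1, -1, C))
    # Binary mask in original order: 1 = hidden, 0 = visible.
    mask = torch.ones(B, N)
    mask.scatter_(1, keep, 0.0)
    return visible, mask, shuffle.argsort(dim=1)      # last output un-shuffles tokens

vis, mask, restore = random_masking(torch.randn(4, 196, 768))
print(vis.shape, mask.sum(1))  # (4, 49, 768); 147 hidden tokens per image
```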
### Interaction Design in Decoder

#### Improved Cross Attention Design

| Year | Venue | Acronym | Paper Title | Code/Project |
|:----:|:-------:|:------------------:|-------------|--------------|
| 2021 | CVPR | Sparse R-CNN | [Sparse R-CNN: End-to-End Object Detection with Learnable Proposals](https://arxiv.org/abs/2011.12450) | [Code](https://github.com/PeizeSun/SparseR-CNN) |
| 2022 | CVPR | AdaMixer | [AdaMixer: A Fast-Converging Query-Based Object Detector](https://arxiv.org/abs/2203.16507) | [Code](https://github.com/MCG-NJU/AdaMixer) |
| 2021 | CVPR | MaX-DeepLab | [MaX-DeepLab: End-to-End Panoptic Segmentation with Mask Transformers](https://arxiv.org/abs/2012.00759) | [Code](https://github.com/google-research/deeplab2) |
| 2021 | NeurIPS | K-Net | [K-Net: Towards Unified Image Segmentation](https://arxiv.org/abs/2106.14855) | [Code](https://github.com/ZwwWayne/K-Net/) |
| 2022 | CVPR | Mask2Former | [Masked-attention Mask Transformer for Universal Image Segmentation](https://arxiv.org/abs/2112.01527) | [Code](https://github.com/facebookresearch/Mask2Former) |
| 2022 | ECCV | kMaX-DeepLab | [k-means Mask Transformer](https://arxiv.org/abs/2207.04044) | [Code](https://github.com/google-research/deeplab2) |
| 2021 | ICCV | QueryInst | [Instances as Queries](https://arxiv.org/abs/2105.01928) | [Code](https://github.com/hustvl/QueryInst) |
| 2021 | arXiv | ISTR | [ISTR: End-to-End Instance Segmentation via Transformers](https://arxiv.org/abs/2105.00637) | [Code](https://github.com/hujiecpp/ISTR) |
| 2021 | NeurIPS | SOLQ | [SOLQ: Segmenting Objects by Learning Queries](https://arxiv.org/abs/2106.02351) | [Code](https://github.com/megvii-research/SOLQ) |
| 2022 | CVPR | Panoptic SegFormer | [Panoptic SegFormer: Delving Deeper into Panoptic Segmentation with Transformers](https://arxiv.org/abs/2109.03814) | [Code](https://github.com/zhiqi-li/Panoptic-SegFormer) |
| 2022 | CVPR | CMT-DeepLab | [CMT-DeepLab: Clustering Mask Transformers for Panoptic Segmentation](https://arxiv.org/abs/2206.08948) | N/A |
| 2022 | CVPR | SparseInst | [Sparse Instance Activation for Real-Time Instance Segmentation](https://arxiv.org/abs/2203.12827) | [Code](https://github.com/hustvl/SparseInst) |
| 2022 | CVPR | SAM-DETR | [Accelerating DETR Convergence via Semantic-Aligned Matching](https://arxiv.org/abs/2203.06883) | [Code](https://github.com/ZhangGongjie/SAM-DETR) |
| 2021 | ICCV | SMCA-DETR | [Fast Convergence of DETR with Spatially Modulated Co-Attention](https://arxiv.org/abs/2101.07448) | [Code](https://github.com/gaopengcuhk/SMCA-DETR) |
| 2021 | BMVC | ACT-DETR | [End-to-End Object Detection with Adaptive Clustering Transformer](https://www.bmvc2021-virtualconference.com/assets/papers/0709.pdf) | [Code](https://github.com/gaopengcuhk/SMCA-DETR) |
| 2021 | ICCV | Dynamic DETR | [Dynamic DETR: End-to-End Object Detection with Dynamic Attention](https://ieeexplore.ieee.org/document/9709981) | N/A |
| 2022 | ICLR | Sparse DETR | [Sparse DETR: Efficient End-to-End Object Detection with Learnable Sparsity](https://arxiv.org/abs/2111.14330) | [Code](https://github.com/kakaobrain/sparse-detr) |
| 2023 | CVPR | FastInst | [FastInst: A Simple Query-Based Model for Real-Time Instance Segmentation](https://arxiv.org/abs/2303.08594) | [Code](https://github.com/junjiehe96/FastInst) |
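A recurring idea in this table (Mask2Former, kMaX-DeepLab) is to restrict each query's cross-attention to the foreground of its own mask predicted by the previous layer. The sketch below shows only that masking step, with single-head scaled dot-product attention; the 0.5 threshold and the fallback for empty masks follow Mask2Former's description, the rest is an illustrative assumption.

```python
import torch

def masked_cross_attention(queries, pixel_feats, prev_mask_logits):
    """Mask2Former-style masked cross-attention (single head, sketch).

    queries:          (B, N, C) object queries
    pixel_feats:      (B, HW, C) flattened image features
    prev_mask_logits: (B, N, HW) mask predictions from the previous layer
    """
    scale = queries.shape[-1] ** -0.5
    attn = torch.einsum("bnc,bmc->bnm", queries, pixel_feats) * scale
    # Block attention to pixels the previous layer considered background.
    background = prev_mask_logits.sigmoid() < 0.5        # (B, N, HW)
    # If a query's mask is empty, let it attend everywhere instead.
    empty = background.all(dim=-1, keepdim=True)
    attn = attn.masked_fill(background & ~empty, float("-inf"))
    attn = attn.softmax(dim=-1)
    return queries + torch.einsum("bnm,bmc->bnc", attn, pixel_feats)

q = masked_cross_attention(torch.randn(2, 100, 256),
                           torch.randn(2, 196, 256),
                           torch.randn(2, 100, 196))
print(q.shape)  # torch.Size([2, 100, 256])
```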
#### Spatial-Temporal Cross Attention Design

| Year | Venue | Acronym | Paper Title | Code/Project |
|:----:|:-------:|:------------------:|-------------|--------------|
| 2021 | CVPR | VisTR | [VisTR: End-to-End Video Instance Segmentation with Transformers](https://arxiv.org/abs/2011.14503) | [Code](https://github.com/Epiphqny/VisTR) |
| 2021 | NeurIPS | IFC | [Video Instance Segmentation using Inter-Frame Communication Transformers](https://arxiv.org/abs/2106.03299) | [Code](https://github.com/sukjunhwang/IFC) |
| 2022 | CVPR | Slot-VPS | [Slot-VPS: Object-centric Representation Learning for Video Panoptic Segmentation](https://arxiv.org/abs/2112.08949) | N/A |
| 2022 | CVPR | TubeFormer-DeepLab | [TubeFormer-DeepLab: Video Mask Transformer](https://arxiv.org/abs/2205.15361) | N/A |
| 2022 | CVPR | Video K-Net | [Video K-Net: A Simple, Strong, and Unified Baseline for Video Segmentation](https://arxiv.org/abs/2204.04656) | [Code](https://github.com/lxtGH/Video-K-Net) |
| 2022 | CVPR | TeViT | [Temporally Efficient Vision Transformer for Video Instance Segmentation](https://arxiv.org/abs/2204.08412) | [Code](https://github.com/hustvl/TeViT) |
| 2022 | ECCV | SeqFormer | [SeqFormer: Sequential Transformer for Video Instance Segmentation](https://arxiv.org/abs/2112.08275) | [Code](https://github.com/wjf5203/SeqFormer) |
| 2022 | arXiv | Mask2Former-VIS | [Mask2Former for Video Instance Segmentation](https://arxiv.org/abs/2112.10764) | [Code](https://github.com/facebookresearch/Mask2Former) |
| 2022 | PAMI | TransVOD | [TransVOD: End-to-End Video Object Detection with Spatial-Temporal Transformers](https://arxiv.org/abs/2201.05047) | [Code](https://github.com/SJTU-LuHe/TransVOD) |
| 2022 | NeurIPS | VITA | [VITA: Video Instance Segmentation via Object Token Association](https://arxiv.org/abs/2206.04403) | [Code](https://github.com/sukjunhwang/VITA) |

### Optimizing Object Query

#### Adding Position Information into Query

| Year | Venue | Acronym | Paper Title | Code/Project |
|:----:|:-----:|:-------------------:|-------------|--------------|
| 2021 | ICCV | Conditional-DETR | [Conditional DETR for Fast Training Convergence](https://arxiv.org/abs/2108.06152) | [Code](https://github.com/Atten4Vis/ConditionalDETR) |
| 2022 | arXiv | Conditional-DETR-v2 | [Conditional DETR v2: Efficient Detection Transformer with Box Queries](https://arxiv.org/abs/2207.08914) | [Code](https://github.com/Atten4Vis/ConditionalDETR) |
| 2022 | AAAI | Anchor DETR | [Anchor DETR: Query Design for Transformer-Based Detector](https://arxiv.org/abs/2109.07107) | [Code](https://github.com/megvii-model/AnchorDETR) |
| 2022 | ICLR | DAB-DETR | [DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR](https://arxiv.org/abs/2201.12329) | [Code](https://github.com/SlongLiu/DAB-DETR) |
| 2021 | arXiv | Efficient DETR | [Efficient DETR: Improving End-to-End Object Detector with Dense Prior](https://arxiv.org/abs/2104.01318) | N/A |
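The common thread in these works is to make the positional part of the query explicit, e.g., as a reference point or anchor box that is encoded with sinusoidal embeddings and refined layer by layer. Below is a sketch of turning normalized (cx, cy, w, h) anchors into positional query embeddings, loosely in the spirit of DAB-DETR; the dimensions and the projection MLP are illustrative assumptions.

```python
import math
import torch
import torch.nn as nn

def sine_embed(x: torch.Tensor, dim: int = 128, temp: float = 10000.0):
    """Sinusoidal embedding of normalized scalars in [0, 1] -> (..., dim)."""
    freq = temp ** (torch.arange(dim // 2, dtype=torch.float32) * 2 / dim)
    angles = x.unsqueeze(-1) * 2 * math.pi / freq
    return torch.cat([angles.sin(), angles.cos()], dim=-1)

class AnchorPositionalQuery(nn.Module):
    """Encode (cx, cy, w, h) anchor boxes as positional query embeddings."""

    def __init__(self, dim: int = 256):
        super().__init__()
        # Project the concatenated per-coordinate sine features back to dim.
        self.proj = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, anchors: torch.Tensor):          # (B, N, 4), normalized boxes
        per_coord = sine_embed(anchors, dim=128)       # (B, N, 4, 128)
        flat = per_coord.flatten(-2)                   # (B, N, 512)
        return self.proj(flat)                         # (B, N, 256), added to content queries

pos_q = AnchorPositionalQuery()(torch.rand(2, 100, 4))
print(pos_q.shape)  # torch.Size([2, 100, 256])
```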
#### Adding Extra Supervision into Query

| Year | Venue | Acronym | Paper Title | Code/Project |
|:----:|:-------:|:----------:|-------------|--------------|
| 2022 | ECCV | DE-DETR | [Towards Data-Efficient Detection Transformers](https://arxiv.org/abs/2203.09507) | [Code](https://github.com/encounter1997/DE-DETRs) |
| 2022 | CVPR | DN-DETR | [DN-DETR: Accelerate DETR Training by Introducing Query DeNoising](https://arxiv.org/abs/2203.01305) | [Code](https://github.com/IDEA-opensource/DN-DETR) |
| 2023 | ICLR | DINO | [DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection](https://arxiv.org/abs/2203.03605) | [Code](https://github.com/IDEA-Research/DINO) |
| 2023 | CVPR | MP-Former | [MP-Former: Mask-Piloted Transformer for Image Segmentation](https://arxiv.org/abs/2303.07336) | [Code](https://github.com/IDEA-Research/MP-Former) |
| 2023 | CVPR | Mask-DINO | [Mask DINO: Towards A Unified Transformer-based Framework for Object Detection and Segmentation](https://arxiv.org/abs/2206.02777) | [Code](https://github.com/IDEACVR/MaskDINO) |
| 2022 | NeurIPS | N/A | [Learning Equivariant Segmentation with Instance-Unique Querying](https://arxiv.org/abs/2210.00911) | [Code](https://github.com/JamesLiang819/Instance_Unique_Querying) |
| 2023 | CVPR | H-DETR | [DETRs with Hybrid Matching](https://arxiv.org/abs/2207.13080) | [Code](https://github.com/HDETR) |
| 2023 | ICCV | Group-DETR | [Group DETR: Fast DETR Training with Group-Wise One-to-Many Assignment](https://arxiv.org/abs/2207.13085) | N/A |
| 2023 | ICCV | Co-DETR | [DETRs with Collaborative Hybrid Assignments Training](https://arxiv.org/abs/2211.12860) | [Code](https://github.com/Sense-X/Co-DETR) |

### Using Query For Association

#### Query as Instance Association

| Year | Venue | Acronym | Paper Title | Code/Project |
|:----:|:-------:|:-----------:|-------------|--------------|
| 2022 | CVPR | TrackFormer | [TrackFormer: Multi-Object Tracking with Transformer](https://arxiv.org/abs/2101.02702) | [Code](https://github.com/timmeinhardt/trackformer) |
| 2021 | arXiv | TransTrack | [TransTrack: Multiple Object Tracking with Transformer](https://arxiv.org/abs/2012.15460) | [Code](https://github.com/PeizeSun/TransTrack) |
| 2022 | ECCV | MOTR | [MOTR: End-to-End Multiple-Object Tracking with TRansformer](https://arxiv.org/abs/2105.03247) | [Code](https://github.com/megvii-research/MOTR) |
| 2022 | NeurIPS | MinVIS | [MinVIS: A Minimal Video Instance Segmentation Framework without Video-based Training](https://arxiv.org/abs/2208.02245) | [Code](https://github.com/NVlabs/MinVIS) |
| 2022 | ECCV | IDOL | [In Defense of Online Models for Video Instance Segmentation](https://arxiv.org/abs/2207.10661) | [Code](https://github.com/wjf5203/VNext) |
| 2022 | CVPR | Video K-Net | [Video K-Net: A Simple, Strong, and Unified Baseline for Video Segmentation](https://arxiv.org/abs/2204.04656) | [Code](https://github.com/lxtGH/Video-K-Net) |
| 2023 | CVPR | GenVIS | [A Generalized Framework for Video Instance Segmentation](https://arxiv.org/abs/2211.08834) | [Code](https://github.com/miranheo/GenVIS) |
| 2023 | ICCV | Tube-Link | [Tube-Link: A Flexible Cross Tube Framework for Universal Video Segmentation](https://arxiv.org/abs/2303.12782) | [Code](https://github.com/lxtGH/Tube-Link) |
| 2023 | ICCV | CTVIS | [CTVIS: Consistent Training for Online Video Instance Segmentation](https://arxiv.org/abs/2303.12782) | [Code](https://github.com/KainingYing/CTVIS) |
| 2023 | CVPR-W | Video-kMaX | [Video-kMaX: A Simple Unified Approach for Online and Near-Online Video Panoptic Segmentation](https://arxiv.org/abs/2304.04694) | N/A |
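Most trackers in this table associate instances across frames by matching per-frame query embeddings; MinVIS, for example, does this with bipartite matching on embedding similarity and no video-specific training. Below is a sketch of that association step as Hungarian matching over a cosine-similarity cost (requires SciPy); the concrete names and shapes are our own illustrative choices.

```python
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def associate_queries(prev_embed: torch.Tensor, cur_embed: torch.Tensor):
    """Match per-frame instance queries by cosine similarity.

    prev_embed: (N, C) query embeddings from frame t-1
    cur_embed:  (N, C) query embeddings from frame t
    Returns, for each previous instance, the index of its one-to-one
    match among the current-frame queries (Hungarian matching).
    """
    sim = F.normalize(prev_embed, dim=-1) @ F.normalize(cur_embed, dim=-1).T
    row, col = linear_sum_assignment((-sim).numpy())   # maximize total similarity
    return torch.as_tensor(col)

prev_ids = torch.arange(10)                 # track ids carried from frame t-1
match = associate_queries(torch.randn(10, 256), torch.randn(10, 256))
cur_ids = prev_ids[match.argsort()]         # propagate ids to current-frame queries
print(cur_ids)
```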
#### Query as Linking Multi-Tasks

| Year | Venue | Acronym | Paper Title | Code/Project |
|:----:|:-----:|:-------------------:|-------------|--------------|
| 2022 | ECCV | Panoptic-PartFormer | [Panoptic-PartFormer: Learning a Unified Model for Panoptic Part Segmentation](https://arxiv.org/abs/2204.04655) | [Code](https://github.com/lxtGH/Panoptic-PartFormer) |
| 2022 | ECCV | PolyphonicFormer | [PolyphonicFormer: Unified Query Learning for Depth-aware Video Panoptic Segmentation](https://arxiv.org/abs/2112.02582) | [Code](https://github.com/HarborYuan/PolyphonicFormer) |
| 2022 | CVPR | PanopticDepth | [PanopticDepth: A Unified Framework for Depth-aware Panoptic Segmentation](https://arxiv.org/abs/2206.00468) | [Code](https://github.com/NaiyuGao/PanopticDepth) |
| 2022 | ECCV | FashionFormer | [FashionFormer: A Simple, Effective and Unified Baseline for Human Fashion Segmentation and Recognition](https://arxiv.org/abs/2204.04654) | [Code](https://github.com/xushilin1/FashionFormer) |
| 2022 | ECCV | InvPT | [InvPT: Inverted Pyramid Multi-task Transformer for Dense Scene Understanding](https://arxiv.org/abs/2203.07997) | [Code](https://github.com/prismformore/InvPT) |
| 2023 | CVPR | UNINEXT | [Universal Instance Perception as Object Discovery and Retrieval](https://arxiv.org/abs/2303.06674) | [Code](https://github.com/MasterBin-IIAU/UNINEXT) |
| 2024 | CVPR | GLEE | [GLEE: General Object Foundation Model for Images and Videos at Scale](https://arxiv.org/abs/2312.09158) | [Code](https://glee-vision.github.io/) |
| 2024 | CVPR | UniVS | [UniVS: Unified and Universal Video Segmentation with Prompts as Queries](https://arxiv.org/abs/2402.18115) | [Code](https://github.com/MinghanLi/UniVS) |
| 2024 | CVPR | OMG-Seg | [OMG-Seg: Is One Model Good Enough For All Segmentation?](https://arxiv.org/abs/2401.10229) | [Code](https://github.com/lxtGH/OMG-Seg) |

### Conditional Query Generation

#### Conditional Query Fusion on Language Features

| Year | Venue | Acronym | Paper Title | Code/Project |
|:----:|:-----:|:--------------:|-------------|--------------|
| 2021 | ICCV | VLT | [Vision-Language Transformer and Query Generation for Referring Segmentation](https://arxiv.org/abs/2108.05565) | [Code](https://github.com/henghuiding/Vision-Language-Transformer) |
| 2022 | CVPR | LAVT | [LAVT: Language-Aware Vision Transformer for Referring Image Segmentation](https://arxiv.org/abs/2112.02244) | [Code](https://github.com/yz93/LAVT-RIS) |
| 2022 | CVPR | ReSTR | [ReSTR: Convolution-free Referring Image Segmentation Using Transformers](https://arxiv.org/abs/2203.16768) | N/A |
| 2022 | CVPR | CRIS | [CRIS: CLIP-Driven Referring Image Segmentation](https://arxiv.org/abs/2111.15174) | [Code](https://github.com/DerrickWang005/CRIS.pytorch) |
| 2022 | CVPR | MTTR | [End-to-End Referring Video Object Segmentation with Multimodal Transformers](https://arxiv.org/abs/2111.14821) | [Code](https://github.com/mttr2021/MTTR) |
| 2022 | CVPR | LBDT | [Language-Bridged Spatial-Temporal Interaction for Referring Video Object Segmentation](https://arxiv.org/abs/2206.03789) | [Code](https://github.com/dzh19990407/LBDT) |
| 2022 | CVPR | ReferFormer | [Language as Queries for Referring Video Object Segmentation](https://arxiv.org/abs/2201.00487) | [Code](https://github.com/wjn922/ReferFormer) |
| 2024 | CVPR | MaskGrounding | [Mask Grounding for Referring Image Segmentation](https://arxiv.org/abs/2312.12198) | [Code](https://yxchng.github.io/projects/mask-grounding/) |
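These referring-segmentation methods share one mechanism: the object query is generated from, or fused with, the sentence representation, so the decoder only searches for the described object (ReferFormer, for instance, uses the language feature directly as queries). Below is a sketch of conditioning a small set of queries on word-level text features via cross-attention; the module shapes and the residual design are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LanguageConditionedQueries(nn.Module):
    """Generate object queries conditioned on a referring expression."""

    def __init__(self, num_queries: int = 5, dim: int = 256):
        super().__init__()
        self.base = nn.Embedding(num_queries, dim)        # task-level learnable queries
        self.cross = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_tokens: torch.Tensor):          # (B, L, C) word features
        B = text_tokens.shape[0]
        q = self.base.weight.unsqueeze(0).expand(B, -1, -1)
        # Queries absorb sentence semantics before ever touching the image.
        fused, _ = self.cross(q, text_tokens, text_tokens)
        return self.norm(q + fused)                        # (B, N, C) language-aware queries

queries = LanguageConditionedQueries()(torch.randn(2, 12, 256))
print(queries.shape)  # torch.Size([2, 5, 256])
```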
#### Conditional Query Fusion on Cross Image Features

| Year | Venue | Acronym | Paper Title | Code/Project |
|:----:|:-------:|:---------------:|-------------|--------------|
| 2021 | NeurIPS | CyCTR | [Few-Shot Segmentation via Cycle-Consistent Transformer](https://arxiv.org/abs/2106.02320) | [Code](https://github.com/GengDavid/CyCTR) |
| 2022 | CVPR | MatteFormer | [MatteFormer: Transformer-Based Image Matting via Prior-Tokens](https://arxiv.org/abs/2203.15662) | [Code](https://github.com/webtoon/matteformer) |
| 2022 | ECCV | SegDeformer | [A Transformer-based Decoder for Semantic Segmentation with Multi-level Context Mining](https://www.ecva.net/papers/eccv_2022/papers_ECCV/papers/136880617.pdf) | [Code](https://github.com/lygsbw/segdeformer) |
| 2022 | arXiv | StructToken | [StructToken: Rethinking Semantic Segmentation with Structural Prior](https://arxiv.org/abs/2203.12612) | N/A |
| 2022 | NeurIPS | MM-Former | [Mask Matching Transformer for Few-Shot Segmentation](https://arxiv.org/abs/2301.01208) | [Code](https://github.com/jiaosiyu1999/mmformer) |
| 2022 | ECCV | AAFormer | [Adaptive Agent Transformer for Few-shot Segmentation](https://www.ecva.net/papers/eccv_2022/papers_ECCV/papers/136890035.pdf) | N/A |
| 2023 | arXiv | ReferenceTwice | [Reference Twice: A Simple and Unified Baseline for Few-Shot Instance Segmentation](https://arxiv.org/abs/2301.01156) | [Code](https://github.com/hanyue1648/RefT) |

### Tuning Foundation Models

#### Vision Adapter

| Year | Venue | Acronym | Paper Title | Code/Project |
|:----:|:-----:|:-----------:|-------------|--------------|
| 2022 | CVPR | CoCoOp | [Conditional Prompt Learning for Vision-Language Models](https://arxiv.org/abs/2203.05557) | [Code](https://github.com/KaiyangZhou/CoOp) |
| 2022 | ECCV | Tip-Adapter | [Tip-Adapter: Training-free Adaption of CLIP for Few-shot Classification](https://arxiv.org/abs/2111.03930) | [Code](https://github.com/gaopengcuhk/Tip-Adapter) |
| 2022 | ECCV | EVL | [Frozen CLIP Models are Efficient Video Learners](https://arxiv.org/abs/2208.03550) | [Code](https://github.com/OpenGVLab/efficient-video-recognition) |
| 2023 | ICLR | ViT-Adapter | [Vision Transformer Adapter for Dense Predictions](https://arxiv.org/abs/2205.08534) | [Code](https://github.com/czczup/ViT-Adapter) |
| 2022 | CVPR | DenseCLIP | [DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting](https://arxiv.org/abs/2112.01518) | [Code](https://github.com/raoyongming/DenseCLIP) |
| 2022 | CVPR | CLIPSeg | [Image Segmentation Using Text and Image Prompts](https://arxiv.org/abs/2112.10003) | [Code](https://eckerlab.org/code/clipseg) |
| 2023 | CVPR | OneFormer | [OneFormer: One Transformer to Rule Universal Image Segmentation](https://arxiv.org/abs/2211.06220) | [Code](https://github.com/SHI-Labs/OneFormer) |
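The adapter line of work keeps the foundation model frozen and learns only a few lightweight modules on top of it. Below is a sketch of the standard residual bottleneck adapter (down-project, nonlinearity, up-project) attached to frozen transformer features; the dim/4 reduction and zero-initialized up-projection are common choices we assume here for illustration.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Residual adapter: the only trainable piece on a frozen backbone."""

    def __init__(self, dim: int = 768, reduction: int = 4):
        super().__init__()
        hidden = dim // reduction
        self.down = nn.Linear(dim, hidden)
        self.up = nn.Linear(hidden, dim)
        nn.init.zeros_(self.up.weight)        # start as identity so tuning is stable
        nn.init.zeros_(self.up.bias)

    def forward(self, x):                      # (B, N, C) frozen transformer features
        return x + self.up(torch.relu(self.down(x)))

backbone = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
for p in backbone.parameters():
    p.requires_grad_(False)                    # the foundation model stays frozen
adapter = BottleneckAdapter()
out = adapter(backbone(torch.randn(2, 196, 768)))
print(sum(p.numel() for p in adapter.parameters()))  # tiny next to the backbone
```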
#### Open Vocabulary Learning

| Year | Venue | Acronym | Paper Title | Code/Project |
|:----:|:-----:|:---------:|-------------|--------------|
| 2021 | CVPR | OVR-CNN | [Open-Vocabulary Object Detection Using Captions](https://arxiv.org/abs/2011.10678) | [Code](https://github.com/alirezazareian/ovr-cnn) |
| 2022 | ICLR | ViLD | [Open-vocabulary Object Detection via Vision and Language Knowledge Distillation](https://arxiv.org/abs/2104.13921) | [Code](https://github.com/tensorflow/tpu/tree/master/models/official/detection/projects/vild) |
| 2022 | ECCV | Detic | [Detecting Twenty-thousand Classes using Image-level Supervision](https://arxiv.org/abs/2201.02605) | [Code](https://github.com/facebookresearch/Detic) |
| 2022 | ECCV | OV-DETR | [Open-Vocabulary DETR with Conditional Matching](https://arxiv.org/abs/2203.11876) | [Code](https://github.com/yuhangzang/OV-DETR) |
| 2023 | ICLR | F-VLM | [F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models](https://arxiv.org/abs/2209.15639) | [Code](https://sites.google.com/view/f-vlm/home) |
| 2022 | ECCV | MViT | [Class-agnostic Object Detection with Multi-modal Transformer](https://arxiv.org/abs/2111.11430) | [Code](https://github.com/mmaaz60/mvits_for_class_agnostic_od) |
| 2022 | ECCV | OpenSeg | [Scaling Open-Vocabulary Image Segmentation with Image-Level Labels](https://arxiv.org/abs/2112.12143) | [Code](https://github.com/tensorflow/tpu/tree/master/models/official/detection/projects/openseg) |
| 2022 | ICLR | LSeg | [Language-driven Semantic Segmentation](https://arxiv.org/abs/2201.03546) | [Code](https://github.com/isl-org/lang-seg) |
| 2022 | ECCV | SimSeg | [A Simple Baseline for Open Vocabulary Semantic Segmentation with Pre-trained Vision-language Model](https://arxiv.org/abs/2112.14757) | [Code](https://github.com/MendelXu/zsseg.baseline) |
| 2022 | ECCV | MaskCLIP | [Extract Free Dense Labels from CLIP](https://arxiv.org/abs/2112.01071) | [Code](https://github.com/chongzhou96/MaskCLIP) |
| 2021 | ICCV | UVO | [Unidentified Video Objects: A Benchmark for Dense, Open-World Segmentation](https://arxiv.org/abs/2104.04691) | [Project](https://sites.google.com/view/unidentified-video-object) |
| 2023 | arXiv | CGG | [Betrayed by Captions: Joint Caption Grounding and Generation for Open Vocabulary Instance Segmentation](https://arxiv.org/abs/2301.00805) | [Code](https://github.com/jzwu48033552/betrayed-by-captions) |
| 2022 | TPAMI | ES | [Open-World Entity Segmentation](https://arxiv.org/abs/2107.14228) | [Code](https://github.com/dvlab-research/Entity/) |
| 2022 | CVPR | OW-DETR | [OW-DETR: Open-world Detection Transformer](https://arxiv.org/abs/2112.01513) | [Code](https://github.com/akshitac8/OW-DETR) |
| 2023 | CVPR | PROB | [PROB: Probabilistic Objectness for Open World Object Detection](https://arxiv.org/abs/2212.01424) | [Code](https://github.com/orrzohar/PROB) |
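The open-vocabulary methods above replace a fixed classifier with class-name text embeddings from a vision-language model, so new categories only require new prompts. Below is a sketch of classifying region or mask embeddings against normalized text embeddings; in practice both sides come from a model like CLIP, while here random stand-ins and the temperature value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def open_vocab_classify(region_embeds, text_embeds, temperature: float = 0.01):
    """Score each region embedding against class-name text embeddings.

    region_embeds: (N, C) mask/box embeddings from the segmenter
    text_embeds:   (K, C) embeddings of K class prompts (e.g., from CLIP);
                   swapping in new prompts changes the vocabulary, no retraining.
    """
    region = F.normalize(region_embeds, dim=-1)
    text = F.normalize(text_embeds, dim=-1)
    logits = region @ text.T / temperature      # cosine similarity, sharpened
    return logits.softmax(dim=-1)               # (N, K) class probabilities

probs = open_vocab_classify(torch.randn(100, 512), torch.randn(20, 512))
print(probs.shape, probs.sum(-1)[:3])  # torch.Size([100, 20]); rows sum to 1
```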
### Related Domains and Beyond

#### Point Cloud Segmentation

| Year | Venue | Acronym | Paper Title | Code/Project |
|:----:|:-------:|:----------------------:|-------------|--------------|
| 2021 | ICCV | Point Transformer | [Point Transformer](https://arxiv.org/abs/2012.09164) | N/A |
| 2021 | CVM | PCT | [PCT: Point Cloud Transformer](https://arxiv.org/abs/2012.09688) | [Code](https://github.com/MenghaoGuo/PCT) |
| 2022 | CVPR | Stratified Transformer | [Stratified Transformer for 3D Point Cloud Segmentation](https://arxiv.org/abs/2203.14508) | [Code](https://github.com/dvlab-research/Stratified-Transformer) |
| 2022 | CVPR | Point-BERT | [Point-BERT: Pre-training 3D Point Cloud Transformers with Masked Point Modeling](https://arxiv.org/abs/2111.14819) | [Code](https://github.com/lulutang0608/Point-BERT) |
| 2022 | ECCV | Point-MAE | [Masked Autoencoders for Point Cloud Self-supervised Learning](https://arxiv.org/abs/2203.06604) | [Code](https://github.com/Pang-Yatian/Point-MAE) |
| 2022 | NeurIPS | Point-M2AE | [Point-M2AE: Multi-scale Masked Autoencoders for Hierarchical Point Cloud Pre-training](https://arxiv.org/abs/2205.14401) | [Code](https://github.com/ZrrSkywalker/Point-M2AE) |
| 2022 | ICRA | Mask3D | [Mask3D for 3D Semantic Instance Segmentation](https://arxiv.org/abs/2210.03105) | [Code](https://github.com/JonasSchult/Mask3D) |
| 2023 | AAAI | SPFormer | [Superpoint Transformer for 3D Scene Instance Segmentation](https://arxiv.org/abs/2211.15766) | [Code](https://github.com/sunjiahao1999/SPFormer) |
| 2023 | AAAI | PUPS | [PUPS: Point Cloud Unified Panoptic Segmentation](https://arxiv.org/abs/2302.06185) | N/A |

#### Domain-aware Segmentation

| Year | Venue | Acronym | Paper Title | Code/Project |
|:----:|:------:|:-------------:|-------------|--------------|
| 2022 | CVPR | DAFormer | [DAFormer: Improving Network Architectures and Training Strategies for Domain-Adaptive Semantic Segmentation](https://arxiv.org/abs/2111.14887) | [Code](https://github.com/lhoyer/DAFormer) |
| 2022 | ECCV | HRDA | [HRDA: Context-Aware High-Resolution Domain-Adaptive Semantic Segmentation](https://arxiv.org/abs/2204.13132) | [Code](https://github.com/lhoyer/HRDA) |
| 2023 | CVPR | MIC | [MIC: Masked Image Consistency for Context-Enhanced Domain Adaptation](https://arxiv.org/abs/2212.01322) | [Code](https://github.com/lhoyer/MIC) |
| 2021 | ACM MM | SFA | [Exploring Sequence Feature Alignment for Domain Adaptive Detection Transformers](https://arxiv.org/abs/2107.12636) | [Code](https://github.com/encounter1997/SFA) |
| 2023 | CVPR | DA-DETR | [DA-DETR: Domain Adaptive Detection Transformer with Information Fusion](https://arxiv.org/abs/2103.17084) | N/A |
| 2022 | ECCV | MTTrans | [MTTrans: Cross-Domain Object Detection with Mean-Teacher Transformer](https://arxiv.org/abs/2205.01643) | [Code](https://github.com/Lafite-Yu/MTTrans-OpenSource) |
| 2022 | arXiv | Sentence-Seg | [The Devil is in the Labels: Semantic Segmentation from Sentences](https://arxiv.org/abs/2202.02002) | N/A |
| 2023 | ICLR | LMSeg | [LMSeg: Language-guided Multi-dataset Segmentation](https://arxiv.org/abs/2302.13495) | N/A |
| 2022 | CVPR | UniDet | [Simple Multi-dataset Detection](https://arxiv.org/abs/2102.13086) | [Code](https://github.com/xingyizhou/UniDet) |
| 2023 | CVPR | Detection Hub | [Detection Hub: Unifying Object Detection Datasets via Query Adaptation on Language Embedding](https://arxiv.org/abs/2206.03484) | N/A |
| 2022 | CVPR | WD2 | [Unifying Panoptic Segmentation for Autonomous Driving](https://openaccess.thecvf.com/content/CVPR2022/papers/Zendel_Unifying_Panoptic_Segmentation_for_Autonomous_Driving_CVPR_2022_paper.pdf) | [Data](https://github.com/ozendelait/wilddash_scripts) |
| 2023 | arXiv | TarVIS | [TarViS: A Unified Approach for Target-based Video Segmentation](https://arxiv.org/abs/2301.02657) | [Code](https://github.com/Ali2500/TarViS) |
#### Label and Model Efficient Segmentation

| Year | Venue | Acronym | Paper Title | Code/Project |
|:----:|:-------:|:-----------:|-------------|--------------|
| 2022 | CVPR | MCTformer | [Multi-class Token Transformer for Weakly Supervised Semantic Segmentation](https://arxiv.org/abs/2203.02891) | [Code](https://github.com/xulianuwa/MCTformer) |
| 2020 | CVPR | PCM | [Self-supervised Equivariant Attention Mechanism for Weakly Supervised Semantic Segmentation](https://arxiv.org/abs/2004.04581) | [Code](https://github.com/YudeWang/SEAM) |
| 2022 | ECCV | ViT-PCM | [Max Pooling with Vision Transformers Reconciles Class and Shape in Weakly Supervised Semantic Segmentation](https://arxiv.org/abs/2210.17400) | [Code](https://github.com/deepplants/ViT-PCM) |
| 2021 | ICCV | DINO | [Emerging Properties in Self-Supervised Vision Transformers](https://arxiv.org/abs/2104.14294) | [Code](https://github.com/facebookresearch/dino) |
| 2021 | BMVC | LOST | [Localizing Objects with Self-Supervised Transformers and no Labels](https://arxiv.org/abs/2109.14279) | [Code](https://github.com/valeoai/LOST) |
| 2022 | ICLR | STEGO | [Unsupervised Semantic Segmentation by Distilling Feature Correspondences](https://arxiv.org/abs/2203.08414) | [Code](https://github.com/mhamilton723/STEGO) |
| 2022 | NeurIPS | ReCo | [ReCo: Retrieve and Co-segment for Zero-shot Transfer](https://arxiv.org/abs/2206.07045) | [Code](https://github.com/NoelShin/reco) |
| 2022 | arXiv | MaskDistill | [Discovering Object Masks with Transformers for Unsupervised Semantic Segmentation](https://arxiv.org/abs/2206.06363) | N/A |
| 2022 | CVPR | FreeSOLO | [FreeSOLO: Learning to Segment Objects without Annotations](https://arxiv.org/abs/2202.12181) | [Code](http://github.com/NVlabs/FreeSOLO) |
| 2023 | CVPR | CutLER | [Cut and Learn for Unsupervised Object Detection and Instance Segmentation](https://arxiv.org/abs/2301.11320) | [Code](https://github.com/facebookresearch/CutLER) |
| 2022 | CVPR | TokenCut | [Self-Supervised Transformers for Unsupervised Object Discovery using Normalized Cut](https://arxiv.org/abs/2202.11539) | [Code](https://github.com/YangtaoWANG95/TokenCut) |
| 2022 | ICLR | MobileViT | [MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer](https://arxiv.org/abs/2110.02178) | [Code](https://github.com/apple/ml-cvnets) |
| 2023 | arXiv | EMO | [Rethinking Mobile Block for Efficient Neural Models](https://arxiv.org/abs/2301.01146) | [Code](https://github.com/zhangzjn/EMO) |
| 2022 | CVPR | TopFormer | [TopFormer: Token Pyramid Transformer for Mobile Semantic Segmentation](https://arxiv.org/abs/2204.05525) | [Code](https://github.com/hustvl/TopFormer) |
| 2023 | ICLR | SeaFormer | [SeaFormer: Squeeze-enhanced Axial Transformer for Mobile Semantic Segmentation](https://arxiv.org/abs/2301.13156) | [Code](https://github.com/fudan-zvg/SeaFormer) |
#### Class Agnostic Segmentation and Tracking

| Year | Venue | Acronym | Paper Title | Code/Project |
|:----:|:-------:|:-----------:|-------------|--------------|
| 2022 | CVPR | Transfiner | [Mask Transfiner for High-Quality Instance Segmentation](https://arxiv.org/abs/2111.13673) | [Code](https://github.com/SysCV/transfiner) |
| 2022 | ECCV | VMT | [Video Mask Transfiner for High-Quality Video Instance Segmentation](https://arxiv.org/abs/2207.14012) | [Code](https://github.com/SysCV/vmt) |
| 2022 | arXiv | SimpleClick | [SimpleClick: Interactive Image Segmentation with Simple Vision Transformers](https://arxiv.org/abs/2210.11006) | [Code](https://github.com/uncbiag/simpleclick) |
| 2023 | ICLR | PatchDCT | [PatchDCT: Patch Refinement for High Quality Instance Segmentation](https://arxiv.org/abs/2302.02693) | [Code](https://github.com/olivia-w12/PatchDCT) |
| 2019 | ICCV | STM | [Video Object Segmentation using Space-Time Memory Networks](https://arxiv.org/abs/1904.00607) | [Code](https://github.com/seoungwugoh/STM) |
| 2021 | NeurIPS | AOT | [Associating Objects with Transformers for Video Object Segmentation](https://arxiv.org/abs/2106.02638) | [Code](https://github.com/z-x-yang/AOT) |
| 2021 | NeurIPS | STCN | [Rethinking Space-Time Networks with Improved Memory Coverage for Efficient Video Object Segmentation](https://arxiv.org/abs/2106.05210) | [Code](https://github.com/hkchengrex/STCN) |
| 2022 | ECCV | XMem | [XMem: Long-Term Video Object Segmentation with an Atkinson-Shiffrin Memory Model](https://arxiv.org/abs/2207.07115) | [Code](https://hkchengrex.github.io/XMem) |
| 2022 | CVPR | PCVOS | [Per-Clip Video Object Segmentation](https://arxiv.org/abs/2208.01924) | [Code](https://github.com/pkyong95/PCVOS) |
| 2023 | CVPR | N/A | [Look Before You Match: Instance Understanding Matters in Video Object Segmentation](https://arxiv.org/abs/2212.06826) | N/A |

#### Medical Image Segmentation

| Year | Venue | Acronym | Paper Title | Code/Project |
|:----:|:-------------:|:---------:|-------------|--------------|
| 2020 | BIBM | CellDETR | [Attention-Based Transformers for Instance Segmentation of Cells in Microstructures](https://arxiv.org/abs/2011.09763) | [Code](https://github.com/ChristophReich1996/Cell-DETR) |
| 2021 | arXiv | TransUNet | [TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation](https://arxiv.org/abs/2102.04306) | [Code](https://github.com/Beckschen/TransUNet) |
| 2022 | ECCV Workshop | Swin-Unet | [Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation](https://arxiv.org/abs/2105.05537) | [Code](https://github.com/HuCaoFighting/Swin-Unet) |
| 2021 | MICCAI | TransFuse | [TransFuse: Fusing Transformers and CNNs for Medical Image Segmentation](https://arxiv.org/abs/2102.08005) | [Code](https://github.com/Rayicer/TransFuse) |
| 2022 | WACV | UNETR | [UNETR: Transformers for 3D Medical Image Segmentation](https://arxiv.org/abs/2103.10504) | [Code](https://github.com/Project-MONAI/research-contributions/tree/main/UNETR) |
## Acknowledgement

If you find our survey and repository useful for your research project, please consider citing our paper:

```bibtex
@article{li2023transformer,
  author={Li, Xiangtai and Ding, Henghui and Zhang, Wenwei and Yuan, Haobo and Cheng, Guangliang and Pang, Jiangmiao and Chen, Kai and Liu, Ziwei and Loy, Chen Change},
  title={Transformer-Based Visual Segmentation: A Survey},
  journal={T-PAMI},
  year={2024}
}
```

## Contact

```
xiangtai94@gmail.com (main)
```

```
lxtpku@pku.edu.cn
```

## Related Repos for Segmentation and Detection

Attention Model [Repo](https://github.com/cmhungsteve/Awesome-Transformer-Attention) by Min-Hung (Steve) Chen.

Detection Transformer [Repo](https://github.com/IDEA-Research/awesome-detection-transformer) by IDEA.

Open Vocabulary Learning [Repo](https://github.com/jianzongwu/Awesome-Open-Vocabulary) by PKU and NTU.

--------------------------------------------------------------------------------

/code/README.md:
--------------------------------------------------------------------------------

### Fair Re-Benchmark Code

--------------------------------------------------------------------------------

/figs/survey_pipeline.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lxtGH/Awesome-Segmentation-With-Transformer/9c7a4884c6a23590f11a694f39cd8caa618d3593/figs/survey_pipeline.jpg
--------------------------------------------------------------------------------