├── .gitignore
├── README.md
├── code
│   └── README.md
└── figs
    └── survey_pipeline.jpg
/.gitignore:
--------------------------------------------------------------------------------
1 | # Byte-compiled / optimized / DLL files
2 | __pycache__/
3 | *.py[cod]
4 | *$py.class
5 |
6 | # C extensions
7 | *.so
8 |
9 | # Distribution / packaging
10 | .Python
11 | build/
12 | develop-eggs/
13 | dist/
14 | downloads/
15 | eggs/
16 | .eggs/
17 | lib/
18 | lib64/
19 | parts/
20 | sdist/
21 | var/
22 | wheels/
23 | work_dir/
24 | *.egg-info/
25 | .installed.cfg
26 | *.egg
27 | MANIFEST
28 |
29 | # PyInstaller
30 | # Usually these files are written by a python script from a template
31 | # before PyInstaller builds the exe, so as to inject date/other infos into it.
32 | *.manifest
33 | *.spec
34 |
35 | # Installer logs
36 | pip-log.txt
37 | pip-delete-this-directory.txt
38 |
39 | # Unit test / coverage reports
40 | htmlcov/
41 | .tox/
42 | .coverage
43 | .coverage.*
44 | .cache
45 | nosetests.xml
46 | coverage.xml
47 | *.cover
48 | .hypothesis/
49 | .pytest_cache/
50 |
51 | # Translations
52 | *.mo
53 | *.pot
54 |
55 | # Django stuff:
56 | *.log
57 | local_settings.py
58 | db.sqlite3
59 |
60 | # Flask stuff:
61 | instance/
62 | .webassets-cache
63 |
64 | # Scrapy stuff:
65 | .scrapy
66 |
67 | # Sphinx documentation
68 | docs/_build/
69 |
70 | # PyBuilder
71 | target/
72 |
73 | # Jupyter Notebook
74 | .ipynb_checkpoints
75 |
76 | # pyenv
77 | .python-version
78 |
79 | # celery beat schedule file
80 | celerybeat-schedule
81 |
82 | # SageMath parsed files
83 | *.sage.py
84 |
85 | # Environments
86 | .env
87 | .venv
88 | env/
89 | venv/
90 | ENV/
91 | env.bak/
92 | venv.bak/
93 |
94 | # Spyder project settings
95 | .spyderproject
96 | .spyproject
97 |
98 | # Rope project settings
99 | .ropeproject
100 |
101 | # mkdocs documentation
102 | /site
103 |
104 | # mypy
105 | .mypy_cache/
106 |
107 | data/
108 | data
109 | .vscode
110 | .idea
111 | .DS_Store
112 |
113 | # custom
114 | *.pkl
115 | *.pkl.json
116 | *.log.json
117 |
118 | # Pytorch
119 | *.pth
120 | *.py~
121 | *.sh~
122 |
123 | debug/*
124 | vis/
125 | analysis/*
126 | pretrain/*
127 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 |
2 | [Awesome](https://github.com/sindresorhus/awesome)
3 | [PRs Welcome](https://github.com/lxtGH/Awesome-Segmenation-With-Transformer/pulls)
4 |
5 |
6 | # Transformer-Based Visual Segmentation: A Survey
7 |
8 | T-PAMI, 2024
9 |
10 | Xiangtai Li (Project Lead)
11 | ·
12 | Henghui Ding
13 | ·
14 | Haobo Yuan
15 | ·
16 | Wenwei Zhang
17 | ·
18 | Guangliang Cheng
19 | ·
20 | Jiangmiao Pang
21 | ·
22 | Kai Chen
23 | ·
24 | Ziwei Liu
25 | ·
26 | Chen Change Loy
27 |
28 |
29 |
30 |
31 |
32 |
33 |
34 |
35 |
36 |
37 |
38 |
39 |
40 |
41 |
42 | This repo records, tracks, and benchmarks recent transformer-based visual segmentation methods,
43 | as a supplement to our [survey](https://arxiv.org/abs/2304.09854).
44 | If you find any work missing or have suggestions (papers, implementations, and other resources), feel free
45 | to open a [pull request](https://github.com/lxtGH/Awesome-Segmenation-With-Transformer/pulls).
46 | We will add the missing papers to this repo as soon as possible.
47 |
48 |
49 | ### 🔥News
50 |
51 | - Accepted by T-PAMI, 2024.
52 |
53 | - Added several CVPR 2024 works in these directions (2024-03). You are welcome to add your CVPR work to this repo!
54 |
55 | - The third version of the [survey](https://arxiv.org/abs/2304.09854) is on arXiv, with more benchmarks and methods included (2023-12).
56 |
57 | - The second draft is on arXiv (2023-06).
58 |
59 |
60 | ### 🔥Highlights
61 |
62 | 1. Previous transformer surveys divide methods by task and setting.
63 | In contrast, we revisit and group the existing transformer-based methods from the **technical perspective**.
64 |
65 | 2. We survey the methods in two parts: one for the mainstream tasks built on the DETR-like meta-architecture, and the other for related directions, organized by task.
66 |
67 | 3. We further re-benchmark several representative works on image semantic segmentation and panoptic segmentation datasets.
68 |
69 | 4. We also include query-based detection transformers, since segmentation and detection are unified by the object query.
70 |
71 |
72 | ## Introduction
73 |
74 | We present the first detailed survey of transformer-based visual segmentation methods.
75 |
76 | ![Survey pipeline](./figs/survey_pipeline.jpg)
77 |
78 | ## Summary of Contents
79 |
80 | - [Methods: A Survey](#methods-a-survey)
81 |   - [Meta-Architecture](#meta-architecture)
82 |   - [Strong Representation](#strong-representation)
83 |   - [Interaction Design in Decoder](#interaction-design-in-decoder)
84 |   - [Optimizing Object Query](#optimizing-object-query)
85 |   - [Using Query For Association](#using-query-for-association)
86 |   - [Conditional Query Generation](#conditional-query-generation)
87 |   - [Tuning Foundation Models](#tuning-foundation-models)
88 | - [Related Domains and Beyond](#related-domains-and-beyond)
89 |   - [Point Cloud Segmentation](#point-cloud-segmentation)
90 |   - [Domain-aware Segmentation](#domain-aware-segmentation)
91 |   - [Label and Model Efficient Segmentation](#label-and-model-efficient-segmentation)
92 |   - [Class Agnostic Segmentation and Tracking](#class-agnostic-segmentation-and-tracking)
93 |   - [Medical Image Segmentation](#medical-image-segmentation)
94 |
95 |
96 | ## Methods: A Survey
97 |
98 | ### Meta-Architecture
99 |
100 | | Year | Venue | Acronym | Paper Title | Code/Project |
101 | |:----:|:-------:|:---------------:|-----------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------|
102 | | 2020 | ECCV | DETR | [End-to-End Object Detection with Transformers](https://arxiv.org/abs/2005.12872) | [Code](https://github.com/facebookresearch/detr) |
103 | | 2021 | ICLR | Deformable DETR | [Deformable DETR: Deformable Transformers for End-to-End Object Detection](https://arxiv.org/abs/2010.04159) | [Code](https://github.com/fundamentalvision/Deformable-DETR) |
104 | | 2021 | CVPR    | MaX-DeepLab     | [MaX-DeepLab: End-to-End Panoptic Segmentation with Mask Transformers](https://arxiv.org/abs/2012.00759) | [Code](https://github.com/google-research/deeplab2) |
105 | | 2021 | NeurIPS | MaskFormer      | [Per-Pixel Classification is Not All You Need for Semantic Segmentation](http://arxiv.org/abs/2107.06278) | [Code](https://github.com/facebookresearch/MaskFormer) |
106 | | 2021 | NeurIPS | K-Net           | [K-Net: Towards Unified Image Segmentation](https://arxiv.org/abs/2106.14855) | [Code](https://github.com/ZwwWayne/K-Net) |
107 | | 2023 | CVPR    | Lite-DETR       | [Lite-DETR: An Interleaved Multi-Scale Encoder for Efficient DETR](https://arxiv.org/pdf/2303.07335) | [Code](https://github.com/IDEA-Research/Lite-DETR) |
108 |
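All entries above share one DETR-style meta-architecture: a feature extractor, a fixed set of learnable object queries, a transformer decoder in which the queries cross-attend to image features, and per-query class and mask heads. Below is a minimal, illustrative PyTorch sketch of this pattern; every name (e.g., `QueryBasedSegmenter`) is ours rather than from any listed codebase, and a toy patchify convolution stands in for a real backbone.

```python
import torch
import torch.nn as nn

class QueryBasedSegmenter(nn.Module):
    def __init__(self, num_queries=100, dim=256, num_classes=80):
        super().__init__()
        self.backbone = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # stand-in for a real backbone
        self.queries = nn.Embedding(num_queries, dim)                 # learnable object queries
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        self.class_head = nn.Linear(dim, num_classes + 1)             # +1 for the "no object" class
        self.mask_embed = nn.Linear(dim, dim)                         # per-query mask embedding

    def forward(self, images):                                        # images: (B, 3, H, W)
        feats = self.backbone(images)                                 # (B, C, h, w) pixel features
        memory = feats.flatten(2).transpose(1, 2)                     # (B, h*w, C) feature tokens
        q = self.queries.weight.unsqueeze(0).expand(images.size(0), -1, -1)
        q = self.decoder(q, memory)                                   # queries cross-attend to pixels
        class_logits = self.class_head(q)                             # (B, N, num_classes + 1)
        mask_logits = torch.einsum("bnc,bchw->bnhw", self.mask_embed(q), feats)
        return class_logits, mask_logits                              # one (label, mask) pair per query
```

During training, per-query predictions are matched one-to-one to the ground truth with Hungarian matching, as in DETR; this shared recipe is what lets the same architecture cover detection, instance, and panoptic segmentation.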
109 | ### Strong Representation
110 |
111 | #### Better ViTs Design
112 |
113 | | Year | Venue | Acronym | Paper Title | Code/Project |
114 | |:----:|:-------:|:-----------:|--------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------|
115 | | 2021 | CVPR | SETR | [Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers](https://arxiv.org/abs/2012.15840) | [Code](https://github.com/fudan-zvg/SETR) |
116 | | 2021 | ICCV    | MViT-V1     | [Multiscale Vision Transformers](https://arxiv.org/abs/2104.11227) | [Code](https://github.com/facebookresearch/mvit) |
117 | | 2022 | CVPR    | MViT-V2     | [MViTv2: Improved Multiscale Vision Transformers for Classification and Detection](https://arxiv.org/abs/2112.01526) | [Code](https://github.com/facebookresearch/mvit) |
118 | | 2021 | NeurIPS | XCiT        | [XCiT: Cross-Covariance Image Transformers](https://arxiv.org/abs/2106.09681) | [Code](https://github.com/facebookresearch/xcit) |
119 | | 2021 | ICCV    | PVT         | [Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions](https://arxiv.org/abs/2102.12122) | [Code](https://github.com/whai362/PVT) |
120 | | 2021 | ICCV    | CrossViT    | [CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification](https://arxiv.org/abs/2103.14899) | [Code](https://github.com/IBM/CrossViT) |
121 | | 2021 | ICCV | CoaT | [Co-Scale Conv-Attentional Image Transformers](https://arxiv.org/abs/2104.06399) | [Code](https://github.com/mlpc-ucsd/CoaT) |
122 | | 2022 | CVPR | MPViT | [MPViT: Multi-Path Vision Transformer for Dense Prediction](https://arxiv.org/abs/2112.11010) | [Code](https://github.com/youngwanLEE/MPViT) |
123 | | 2022 | NeurIPS | SegViT | [SegViT: Semantic Segmentation with Plain Vision Transformers](https://arxiv.org/abs/2210.05844) | [Code](https://github.com/zbwxp/SegVit) |
124 | | 2022 | arXiv   | RSSeg       | [Representation Separation for Semantic Segmentation with Vision Transformers](https://arxiv.org/abs/2212.13764) | N/A |
125 |
126 | #### Hybrid CNNs/Transformers/MLPs
127 |
128 | | Year | Venue | Acronym | Paper Title | Code/Project |
129 | |:----:|:-------:|:----------:|------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------|
130 | | 2021 | ICCV | Swin | [Swin transformer: Hierarchical vision transformer using shifted windows](https://arxiv.org/abs/2103.14030) | [Code](https://github.com/microsoft/Swin-Transformer) |
131 | | 2022 | CVPR | Swin-v2 | [Swin Transformer V2: Scaling Up Capacity and Resolution](https://arxiv.org/abs/2111.09883) | [Code](https://github.com/microsoft/Swin-Transformer) |
132 | | 2021 | NeurIPS | SegFormer  | [SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers](https://arxiv.org/abs/2105.15203) | [Code](http://github.com/NVlabs/SegFormer) |
133 | | 2022 | CVPR | CMT | [CMT: Convolutional Neural Networks Meet Vision Transformers](https://arxiv.org/abs/2107.06263) | [Code](https://github.com/FlyEgle/CMT-pytorch) |
134 | | 2021 | NeurIPS | Twins | [Twins: Revisiting the Design of Spatial Attention in Vision Transformers](https://arxiv.org/abs/2104.13840) | [Code](https://github.com/Meituan-AutoML/Twins) |
135 | | 2021 | ICCV | CvT | [CvT: Introducing Convolutions to Vision Transformers](https://arxiv.org/abs/2103.15808) | [Code](https://github.com/microsoft/CvT) |
136 | | 2021 | NeurIPS | ViTAE      | [ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias](https://arxiv.org/abs/2106.03348) | [Code](https://github.com/ViTAE-Transformer/ViTAE-Transformer) |
137 | | 2022 | CVPR    | ConvNeXt   | [A ConvNet for the 2020s](https://arxiv.org/abs/2201.03545) | [Code](https://github.com/facebookresearch/ConvNeXt) |
138 | | 2022 | NeurIPS | SegNeXt    | [SegNeXt: Rethinking Convolutional Attention Design for Semantic Segmentation](https://github.com/visual-attention-network/segnext) | [Code](https://github.com/visual-attention-network/segnext) |
139 | | 2022 | CVPR | PoolFormer | [PoolFormer: MetaFormer Is Actually What You Need for Vision](https://arxiv.org/abs/2111.11418) | [Code](https://github.com/sail-sg/poolformer) |
140 | | 2023 | ICLR | STM | [Demystify Transformers & Convolutions in Modern Image Deep Networks](https://arxiv.org/abs/2211.05781) | [Code](https://github.com/OpenGVLab/STM-Evaluation) |
141 |
142 | #### Self-Supervised Learning
143 |
144 | | Year | Venue | Acronym | Paper Title | Code/Project |
145 | |:----:|:-------:|:-----------:|----------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------|
146 | | 2021 | ICCV | MOCOV3 | [An Empirical Study of Training Self-Supervised Vision Transformers](https://arxiv.org/abs/2104.02057) | [Code](https://github.com/facebookresearch/moco-v3) |
147 | | 2022 | ICLR    | BEiT        | [BEiT: BERT Pre-Training of Image Transformers](https://arxiv.org/abs/2106.08254) | [Code](https://github.com/microsoft/unilm/tree/master/beit) |
148 | | 2022 | CVPR | MaskFeat | [Masked Feature Prediction for Self-Supervised Visual Pre-Training](https://arxiv.org/abs/2112.09133) | [Code](https://github.com/facebookresearch/SlowFast) |
149 | | 2022 | CVPR | MAE | [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377) | [Code](https://github.com/facebookresearch/mae) |
150 | | 2022 | NeurIPS | ConvMAE | [MCMAE: Masked Convolution Meets Masked Autoencoders](https://arxiv.org/abs/2303.05475) | [Code](https://github.com/Alpha-VL/ConvMAE) |
151 | | 2023 | ICLR | Spark | [SparK: the first successful BERT/MAE-style pretraining on any convolutional networks](https://github.com/keyu-tian/SparK) | [Code](https://github.com/keyu-tian/SparK) |
152 | | 2023 | CVPR    | FLIP        | [Scaling Language-Image Pre-training via Masking](https://arxiv.org/abs/2212.00794) | [Code](https://github.com/facebookresearch/flip) |
153 | | 2023 | arXiv   | ConvNeXt V2 | [ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders](http://arxiv.org/abs/2301.00808) | [Code](https://github.com/facebookresearch/ConvNeXt-V2) |
154 |
155 | ### Interaction Design in Decoder
156 |
157 | #### Improved Cross Attention Design
158 |
159 | | Year | Venue | Acronym | Paper Title | Code/Project |
160 | |:----:|:-------:|:------------------:|---------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------|
161 | | 2021 | CVPR | Sparse R-CNN | [Sparse R-CNN: End-to-End Object Detection with Learnable Proposals](https://arxiv.org/abs/2011.12450) | [Code](https://github.com/PeizeSun/SparseR-CNN) |
162 | | 2022 | CVPR | AdaMixer | [AdaMixer: A Fast-Converging Query-Based Object Detector](https://arxiv.org/abs/2203.16507) | [Code](https://github.com/MCG-NJU/AdaMixer) |
163 | | 2021 | CVPR | MaX-DeepLab | [MaX-DeepLab: End-to-End Panoptic Segmentation with Mask Transformers](https://arxiv.org/abs/2012.00759) | [Code](https://github.com/google-research/deeplab2) |
164 | | 2021 | NeurIPS | K-Net | [K-Net: Towards Unified Image Segmentation](https://arxiv.org/abs/2106.14855) | [Code](https://github.com/ZwwWayne/K-Net/) |
165 | | 2022 | CVPR | Mask2Former | [Masked-attention Mask Transformer for Universal Image Segmentation](https://arxiv.org/abs/2112.01527) | [Code](https://github.com/facebookresearch/Mask2Former) |
166 | | 2022 | ECCV | kMaX-DeepLab | [k-means Mask Transformer](https://arxiv.org/abs/2207.04044) | [Code](https://github.com/google-research/deeplab2) |
167 | | 2021 | ICCV    | QueryInst          | [Instances as Queries](https://arxiv.org/abs/2105.01928) | [Code](https://github.com/hustvl/QueryInst) |
168 | | 2021 | arXiv   | ISTR               | [ISTR: End-to-End Instance Segmentation via Transformers](https://arxiv.org/abs/2105.00637) | [Code](https://github.com/hujiecpp/ISTR) |
169 | | 2021 | NeurIPS | SOLQ               | [SOLQ: Segmenting Objects by Learning Queries](https://arxiv.org/abs/2106.02351) | [Code](https://github.com/megvii-research/SOLQ) |
170 | | 2022 | CVPR    | Panoptic SegFormer | [Panoptic SegFormer: Delving Deeper into Panoptic Segmentation with Transformers](https://arxiv.org/abs/2109.03814) | [Code](https://github.com/zhiqi-li/Panoptic-SegFormer) |
171 | | 2022 | CVPR    | CMT-DeepLab        | [CMT-DeepLab: Clustering Mask Transformers for Panoptic Segmentation](https://arxiv.org/abs/2206.08948) | N/A |
172 | | 2022 | CVPR | SparseInst | [Sparse Instance Activation for Real-Time Instance Segmentation](https://arxiv.org/abs/2203.12827) | [Code](https://github.com/hustvl/SparseInst) |
173 | | 2022 | CVPR | SAM-DETR | [Accelerating DETR Convergence via Semantic-Aligned Matching](https://arxiv.org/abs/2203.06883) | [Code](https://github.com/ZhangGongjie/SAM-DETR) |
174 | | 2021 | ICCV | SMCA-DETR | [Fast Convergence of DETR with Spatially Modulated Co-Attention](https://arxiv.org/abs/2101.07448) | [Code](https://github.com/gaopengcuhk/SMCA-DETR) |
175 | | 2021 | BMVC | ACT-DETR | [End-to-End Object Detection with Adaptive Clustering Transformer](https://www.bmvc2021-virtualconference.com/assets/papers/0709.pdf) | [Code](https://github.com/gaopengcuhk/SMCA-DETR) |
176 | | 2021 | ICCV | Dynamic DETR | [Dynamic DETR: End-to-End Object Detection with Dynamic Attention](https://ieeexplore.ieee.org/document/9709981) | N/A |
177 | | 2022 | ICLR | Sparse DETR | [Sparse DETR: Efficient End-to-End Object Detection with Learnable Sparsity](https://arxiv.org/abs/2111.14330) | [Code](https://github.com/kakaobrain/sparse-detr) |
178 | | 2023 | CVPR | FastInst | [FastInst: A Simple Query-Based Model for Real-Time Instance Segmentation](https://arxiv.org/abs/2303.08594) | [Code](https://github.com/junjiehe96/FastInst) |
179 |
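A recurring refinement in the table above is constraining where each query may look. Mask2Former's masked attention, for instance, restricts a query's cross-attention to the foreground of the mask predicted at the previous decoder layer. Here is a single-head sketch of that one idea, with our own simplified shapes and names:

```python
import torch

def masked_cross_attention(q, x, prev_mask):
    # q: (B, N, C) query features; x: (B, HW, C) flattened pixel features;
    # prev_mask: (B, N, HW) mask logits from the previous decoder layer.
    B, N, C = q.shape
    attn = torch.einsum("bnc,bpc->bnp", q, x) / C ** 0.5        # (B, N, HW) attention logits
    fg = prev_mask.sigmoid() >= 0.5                             # attend only inside the predicted mask
    fg = fg | ~fg.any(dim=-1, keepdim=True)                     # fall back to full attention if empty
    attn = attn.masked_fill(~fg, float("-inf")).softmax(dim=-1)
    return q + torch.einsum("bnp,bpc->bnc", attn, x)            # residual update of the queries
```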
180 |
181 | #### Spatial-Temporal Cross Attention Design
182 |
183 | | Year | Venue | Acronym | Paper Title | Code/Project |
184 | |:----:|:-------:|:------------------:|----------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------|
185 | | 2021 | CVPR | VisTR | [VisTR: End-to-End Video Instance Segmentation with Transformers](https://arxiv.org/abs/2011.14503) | [Code](https://github.com/Epiphqny/VisTR) |
186 | | 2021 | NeurIPS | IFC | [Video instance segmentation using inter-frame communication transformers](https://arxiv.org/abs/2106.03299) | [Code](https://github.com/sukjunhwang/IFC) |
187 | | 2022 | CVPR | SlotVPS | [Slot-VPS: Object-centric Representation Learning for Video Panoptic Segmentation](https://arxiv.org/abs/2112.08949) | N/A |
188 | | 2022 | CVPR | TubeFormer-DeepLab | [TubeFormer-DeepLab: Video Mask Transformer](https://arxiv.org/abs/2205.15361) | N/A |
189 | | 2022 | CVPR | Video K-Net | [Video K-Net: A Simple, Strong, and Unified Baseline for Video Segmentation](https://arxiv.org/abs/2204.04656) | [Code](https://github.com/lxtGH/Video-K-Net) |
190 | | 2022 | CVPR | TeViT | [Temporally efficient vision transformer for video instance segmentation](https://arxiv.org/abs/2204.08412) | [Code](https://github.com/hustvl/TeViT) |
191 | | 2022 | ECCV | Seqformer | [SeqFormer: Sequential Transformer for Video Instance Segmentation](https://arxiv.org/abs/2112.08275) | [Code](https://github.com/wjf5203/SeqFormer) |
192 | | 2022 | arXiv   | Mask2Former-VIS    | [Mask2Former for Video Instance Segmentation](https://arxiv.org/abs/2112.10764) | [Code](https://github.com/facebookresearch/Mask2Former) |
193 | | 2022 | T-PAMI  | TransVOD           | [TransVOD: End-to-End Video Object Detection with Spatial-Temporal Transformers](https://arxiv.org/abs/2201.05047) | [Code](https://github.com/SJTU-LuHe/TransVOD) |
194 | | 2022 | NeurIPS | VITA | [VITA: Video Instance Segmentation via Object Token Association](https://arxiv.org/abs/2206.04403) | [Code](https://github.com/sukjunhwang/VITA) |
195 |
196 |
197 |
198 | ### Optimizing Object Query
199 |
200 | #### Adding Position Information into Query
201 |
202 | | Year | Venue | Acronym | Paper Title | Code/Project |
203 | |:----:|:-----:|:-------------------:|------------------------------------------------------------------------------------------------------------|------------------------------------------------------|
204 | | 2021 | ICCV | Conditional-DETR | [Conditional DETR for Fast Training Convergence](https://arxiv.org/abs/2108.06152) | [Code](https://github.com/Atten4Vis/ConditionalDETR) |
205 | | 2022 | arXiv | Conditional-DETR-v2 | [Conditional DETR v2: Efficient Detection Transformer with Box Queries](https://arxiv.org/abs/2207.08914) | [Code](https://github.com/Atten4Vis/ConditionalDETR) |
206 | | 2022 | AAAI  | Anchor DETR         | [Anchor DETR: Query Design for Transformer-Based Detector](https://arxiv.org/abs/2109.07107) | [Code](https://github.com/megvii-model/AnchorDETR) |
207 | | 2022 | ICLR  | DAB-DETR            | [DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR](https://arxiv.org/abs/2201.12329) | [Code](https://github.com/SlongLiu/DAB-DETR) |
208 | | 2021 | arXiv | Efficient DETR      | [Efficient DETR: Improving End-to-End Object Detector with Dense Prior](https://arxiv.org/abs/2104.01318) | N/A |
209 |
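The works above share one idea: make the positional part of each query explicit rather than fully learned; DAB-DETR, for example, treats queries as 4D anchor boxes encoded with sinusoidal embeddings that are refined layer by layer. A hedged sketch of the encoding step only, assuming normalized `(cx, cy, w, h)` anchors; function names are ours:

```python
import math
import torch

def sine_embed(x, num_feats=64, temperature=10000):
    # x: (N,) normalized coordinates in [0, 1] -> (N, num_feats) sine/cosine features.
    dim_t = torch.arange(num_feats // 2, dtype=torch.float32)
    dim_t = temperature ** (2 * dim_t / num_feats)
    pos = x.unsqueeze(-1) * 2 * math.pi / dim_t
    return torch.cat([pos.sin(), pos.cos()], dim=-1)

def anchor_to_query_pos(anchors, dim=256):
    # anchors: (N, 4) normalized (cx, cy, w, h) boxes -> (N, dim) positional queries.
    return torch.cat([sine_embed(anchors[:, i], dim // 4) for i in range(4)], dim=-1)

anchors = torch.rand(100, 4)               # 100 anchor boxes acting as queries
query_pos = anchor_to_query_pos(anchors)   # added to the content queries in each decoder layer
```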
210 | #### Adding Extra Supervision into Query
211 |
212 | | Year | Venue | Acronym | Paper Title | Code/Project |
213 | |:----:|:-------:|:----------:|------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------|
214 | | 2022 | ECCV | DE-DETR | [Towards Data-Efficient Detection Transformers](https://arxiv.org/abs/2203.09507) | [Code](https://github.com/encounter1997/DE-DETRs) |
215 | | 2022 | CVPR    | DN-DETR    | [DN-DETR: Accelerate DETR Training by Introducing Query DeNoising](https://arxiv.org/abs/2203.01305) | [Code](https://github.com/IDEA-opensource/DN-DETR) |
216 | | 2023 | ICLR | DINO | [DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection](https://arxiv.org/abs/2203.03605) | [Code](https://github.com/IDEA-Research/DINO) |
217 | | 2023 | CVPR    | MP-Former  | [MP-Former: Mask-Piloted Transformer for Image Segmentation](https://arxiv.org/abs/2303.07336) | [Code](https://github.com/IDEA-Research/MP-Former) |
218 | | 2023 | CVPR | Mask-DINO | [Mask DINO: Towards A Unified Transformer-based Framework for Object Detection and Segmentation](https://arxiv.org/abs/2206.02777) | [Code](https://github.com/IDEACVR/MaskDINO) |
219 | | 2022 | NeurIPS | N/A | [Learning equivariant segmentation with instance-unique querying](https://arxiv.org/abs/2210.00911) | [Code](https://github.com/JamesLiang819/Instance_Unique_Querying) |
220 | | 2023 | CVPR | H-DETR | [DETRs with Hybrid Matching](https://arxiv.org/abs/2207.13080) | [Code](https://github.com/HDETR) |
221 | | 2023 | ICCV    | Group-DETR | [Group DETR: Fast DETR Training with Group-Wise One-to-Many Assignment](https://arxiv.org/abs/2207.13085) | N/A |
222 | | 2023 | ICCV    | Co-DETR    | [DETRs with Collaborative Hybrid Assignments Training](https://arxiv.org/abs/2211.12860) | [Code](https://github.com/Sense-X/Co-DETR) |
223 |
224 | ### Using Query For Association
225 |
226 | #### Query as Instance Association
227 |
228 | | Year | Venue | Acronym | Paper Title | Code/Project |
229 | |:----:|:-------:|:-----------:|----------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------|
230 | | 2022 | CVPR | TrackFormer | [TrackFormer: Multi-Object Tracking with Transformer](https://arxiv.org/abs/2101.02702) | [Code](https://github.com/timmeinhardt/trackformer) |
231 | | 2021 | arXiv   | TransTrack  | [TransTrack: Multiple Object Tracking with Transformer](https://arxiv.org/abs/2012.15460) | [Code](https://github.com/PeizeSun/TransTrack) |
232 | | 2022 | ECCV | MOTR | [MOTR: End-to-End Multiple-Object Tracking with TRansformer](https://arxiv.org/abs/2105.03247) | [Code](https://github.com/megvii-research/MOTR) |
233 | | 2022 | NeurIPS | MinVIS | [MinVIS: A Minimal Video Instance Segmentation Framework without Video-based Training](https://arxiv.org/abs/2208.02245) | [Code](https://github.com/NVlabs/MinVIS) |
234 | | 2022 | ECCV | IDOL | [In defense of online models for video instance segmentation](https://arxiv.org/abs/2207.10661) | [Code](https://github.com/wjf5203/VNext) |
235 | | 2022 | CVPR | Video K-Net | [Video K-Net: A Simple, Strong, and Unified Baseline for Video Segmentation](https://arxiv.org/abs/2204.04656) | [Code](https://github.com/lxtGH/Video-K-Net) |
236 | | 2023 | CVPR | GenVIS | [A Generalized Framework for Video Instance Segmentation](https://arxiv.org/abs/2211.08834) | [Code](https://github.com/miranheo/GenVIS) |
237 | | 2023 | ICCV | Tube-Link | [Tube-Link: A Flexible Cross Tube Framework for Universal Video Segmentation](https://arxiv.org/abs/2303.12782) | [Code](https://github.com/lxtGH/Tube-Link) |
238 | | 2023 | ICCV | CTVIS | [CTVIS: Consistent Training for Online Video Instance Segmentation](https://arxiv.org/abs/2303.12782) | [Code](https://github.com/KainingYing/CTVIS) |
239 | | 2023 | CVPR-W | Video-kMaX | [Video-kMaX: A Simple Unified Approach for Online and Near-Online Video Panoptic Segmentation](https://arxiv.org/abs/2304.04694) | N/A |
240 |
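Across these methods the object query doubles as a track descriptor: per-frame query embeddings are linked over time by similarity, so instance IDs come for free from the matching. A minimal sketch in the spirit of MinVIS-style association (names are illustrative, and a real system would also handle newborn and disappearing instances):

```python
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def associate_queries(prev_q, cur_q):
    # prev_q, cur_q: (N, C) decoder query embeddings from two consecutive frames.
    sim = F.normalize(prev_q, dim=-1) @ F.normalize(cur_q, dim=-1).T  # (N, N) cosine similarity
    _, col = linear_sum_assignment((-sim).detach().cpu().numpy())     # maximize total similarity
    return cur_q[col]  # current queries reordered to inherit previous-frame instance IDs
```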
241 |
242 |
243 | #### Query as Linking Multi-Tasks
244 |
245 | | Year | Venue | Acronym | Paper Title | Code/Project |
246 | |:----:|:-----:|:-------------------:|----------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------|
247 | | 2022 | ECCV | Panoptic-PartFormer | [Panoptic-PartFormer: Learning a Unified Model for Panoptic Part Segmentation](https://arxiv.org/abs/2204.04655) | [Code](https://github.com/lxtGH/Panoptic-PartFormer) |
248 | | 2022 | ECCV | PolyphonicFormer | [PolyphonicFormer: Unified Query Learning for Depth-aware Video Panoptic Segmentation](https://arxiv.org/abs/2112.02582) | [Code](https://github.com/HarborYuan/PolyphonicFormer) |
249 | | 2022 | CVPR  | PanopticDepth       | [PanopticDepth: A Unified Framework for Depth-aware Panoptic Segmentation](https://arxiv.org/abs/2206.00468) | [Code](https://github.com/NaiyuGao/PanopticDepth) |
250 | | 2022 | ECCV  | Fashionformer       | [Fashionformer: A Simple, Effective and Unified Baseline for Human Fashion Segmentation and Recognition](https://arxiv.org/abs/2204.04654) | [Code](https://github.com/xushilin1/FashionFormer) |
251 | | 2022 | ECCV | InvPT | [InvPT: Inverted Pyramid Multi-task Transformer for Dense Scene Understanding](https://arxiv.org/abs/2203.07997) | [Code](https://github.com/prismformore/InvPT) |
252 | | 2023 | CVPR | UNINEXT | [Universal Instance Perception as Object Discovery and Retrieval](https://arxiv.org/abs/2303.06674) | [Code](https://github.com/MasterBin-IIAU/UNINEXT) |
253 | | 2024 | CVPR  | GLEE                | [GLEE: General Object Foundation Model for Images and Videos at Scale](https://arxiv.org/abs/2312.09158) | [Project](https://glee-vision.github.io/) |
254 | | 2024 | CVPR | UniVS | [UniVS: Unified and Universal Video Segmentation with Prompts as Queries](https://arxiv.org/abs/2402.18115) | [Code](https://github.com/MinghanLi/UniVS) |
255 | | 2024 | CVPR | OMG-Seg | [OMG-Seg: Is One Model Good Enough For All Segmentation?](https://arxiv.org/abs/2401.10229) | [Code](https://github.com/lxtGH/OMG-Seg) |
256 |
257 | ### Conditional Query Generation
258 |
259 | #### Conditional Query Fusion on Language Features
260 |
261 | | Year | Venue | Acronym | Paper Title | Code/Project |
262 | |:----:|:-----:|:--------------:|--------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------|
263 | | 2021 | ICCV | VLT | [Vision-Language Transformer and Query Generation for Referring Segmentation](https://arxiv.org/abs/2108.05565) | [Code](https://github.com/henghuiding/Vision-Language-Transformer) |
264 | | 2022 | CVPR  | LAVT           | [LAVT: Language-Aware Vision Transformer for Referring Image Segmentation](https://arxiv.org/abs/2112.02244) | [Code](https://github.com/yz93/LAVT-RIS) |
265 | | 2022 | CVPR  | ReSTR          | [ReSTR: Convolution-Free Referring Image Segmentation Using Transformers](https://arxiv.org/abs/2203.16768) | N/A |
266 | | 2022 | CVPR  | CRIS           | [CRIS: CLIP-Driven Referring Image Segmentation](https://arxiv.org/abs/2111.15174) | [Code](https://github.com/DerrickWang005/CRIS.pytorch) |
267 | | 2022 | CVPR | MTTR | [End-to-End Referring Video Object Segmentation with Multimodal Transformers](https://arxiv.org/abs/2111.14821) | [Code](https://github.com/mttr2021/MTTR) |
268 | | 2022 | CVPR | LBDT | [Language-Bridged Spatial-Temporal Interaction for Referring Video Object Segmentation](https://arxiv.org/abs/2206.03789) | [Code](https://github.com/dzh19990407/LBDT) |
269 | | 2022 | CVPR  | ReferFormer    | [Language as Queries for Referring Video Object Segmentation](https://arxiv.org/abs/2201.00487) | [Code](https://github.com/wjn922/ReferFormer) |
270 | | 2024 | CVPR | MaskGrounding | [Mask Grounding for Referring Image Segmentation](https://arxiv.org/abs/2312.12198) | [Code](https://yxchng.github.io/projects/mask-grounding/) |
271 |
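The common thread in this table is conditioning the segmentation queries on language, up to using the sentence feature itself as the query (ReferFormer's "language as queries"). A small hedged sketch of such a query-generation step, assuming a pooled sentence embedding from any text encoder; all names are ours:

```python
import torch
import torch.nn as nn

class LanguageQueryGenerator(nn.Module):
    # Projects a pooled sentence embedding into a small set of object queries,
    # so the decoder only searches for what the expression refers to.
    def __init__(self, text_dim=512, query_dim=256, num_queries=5):
        super().__init__()
        self.proj = nn.Linear(text_dim, query_dim)
        self.query_offsets = nn.Embedding(num_queries, query_dim)  # learned per-query variation

    def forward(self, sentence_feat):                  # (B, text_dim) pooled text feature
        base = self.proj(sentence_feat)                # (B, query_dim)
        return base.unsqueeze(1) + self.query_offsets.weight  # (B, num_queries, query_dim)
```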
272 | #### Conditional Query Fusion on Cross Image Features
273 |
274 | | Year | Venue | Acronym | Paper Title | Code/Project |
275 | |:----:|:-------:|:---------------:|-----------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------|
276 | | 2021 | NeurIPS | CyCTR | [Few-Shot Segmentation via Cycle-Consistent Transformer](https://arxiv.org/abs/2106.02320) | [Code](https://github.com/GengDavid/CyCTR) |
277 | | 2022 | CVPR | MatteFormer | [MatteFormer: Transformer-Based Image Matting via Prior-Tokens](https://arxiv.org/abs/2203.15662) | [Code](https://github.com/webtoon/matteformer) |
278 | | 2022 | ECCV | Segdeformer | [A Transformer-based Decoder for Semantic Segmentation with Multi-level Context Mining](https://www.ecva.net/papers/eccv_2022/papers_ECCV/papers/136880617.pdf) | [Code](https://github.com/lygsbw/segdeformer) |
279 | | 2022 | arXiv   | StructToken     | [StructToken: Rethinking Semantic Segmentation with Structural Prior](https://arxiv.org/abs/2203.12612) | N/A |
280 | | 2022 | NeurIPS | MM-Former | [Mask Matching Transformer for Few-Shot Segmentation](https://arxiv.org/abs/2301.01208) | [Code](https://github.com/jiaosiyu1999/mmformer) |
281 | | 2022 | ECCV | AAFormer | [Adaptive Agent Transformer for Few-shot Segmentation](https://www.ecva.net/papers/eccv_2022/papers_ECCV/papers/136890035.pdf) | N/A |
282 | | 2023 | arXiv   | ReferenceTwice  | [Reference Twice: A Simple and Unified Baseline for Few-Shot Instance Segmentation](https://arxiv.org/abs/2301.01156) | [Code](https://github.com/hanyue1648/RefT) |
283 |
284 | ### Tuning Foundation Models
285 |
286 | #### Vision Adapter
287 |
288 | | Year | Venue | Acronym | Paper Title | Code/Project |
289 | |:----:|:-----:|:-----------:|--------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------|
290 | | 2022 | CVPR | CoCoOp | [Conditional Prompt Learning for Vision-Language Models](https://arxiv.org/abs/2203.05557) | [Code](https://github.com/KaiyangZhou/CoOp) |
291 | | 2022 | ECCV | Tip-Adapter | [Tip-Adapter: Training-free Adaption of CLIP for Few-shot Classification](https://arxiv.org/abs/2111.03930) | [Code](https://github.com/gaopengcuhk/Tip-Adapter) |
292 | | 2022 | ECCV | EVL | [Frozen CLIP Models are Efficient Video Learners](https://arxiv.org/abs/2208.03550) | [Code](https://github.com/OpenGVLab/efficient-video-recognition) |
293 | | 2023 | ICLR | ViT-Adapter | [Vision Transformer Adapter for Dense Predictions](https://arxiv.org/abs/2205.08534) | [Code](https://github.com/czczup/ViT-Adapter) |
294 | | 2022 | CVPR | DenseCLIP | [DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting](https://arxiv.org/abs/2112.01518) | [Code](https://github.com/raoyongming/DenseCLIP) |
295 | | 2022 | CVPR | CLIPSeg | [Image Segmentation Using Text and Image Prompts](https://arxiv.org/abs/2112.10003) | [Code](https://eckerlab.org/code/clipseg) |
296 | | 2023 | CVPR | OneFormer | [OneFormer: One Transformer to Rule Universal Image Segmentation](https://arxiv.org/abs/2211.06220) | [Code](https://github.com/SHI-Labs/OneFormer) |
297 |
298 | #### Open Vocabulary Learning
299 |
300 | | Year | Venue | Acronym | Paper Title | Code/Project |
301 | |:----:|:-----:|:---------:|--------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------|
302 | | 2021 | CVPR | OVR-CNN | [Open-Vocabulary Object Detection Using Captions](https://arxiv.org/abs/2011.10678) | [Code](https://github.com/alirezazareian/ovr-cnn) |
303 | | 2022 | ICLR | ViLD | [Open-vocabulary Object Detection via Vision and Language Knowledge Distillation](https://arxiv.org/abs/2104.13921) | [Code](https://github.com/tensorflow/tpu/tree/master/models/official/detection/projects/vild) |
304 | | 2022 | ECCV | Detic | [Detecting Twenty-thousand Classes using Image-level Supervision](https://arxiv.org/abs/2201.02605) | [Code](https://github.com/facebookresearch/Detic) |
305 | | 2022 | ECCV | OV-DETR | [Open-Vocabulary DETR with Conditional Matching](https://arxiv.org/abs/2203.11876) | [Code](https://github.com/yuhangzang/OV-DETR) |
306 | | 2023 | ICLR | F-VLM | [F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models](https://arxiv.org/abs/2209.15639) | [Code](https://sites.google.com/view/f-vlm/home) |
307 | | 2022 | ECCV | MViT | [Class-agnostic Object Detection with Multi-modal Transformer](https://arxiv.org/abs/2111.11430) | [Code](https://github.com/mmaaz60/mvits_for_class_agnostic_od) |
308 | | 2022 | ECCV | OpenSeg | [Scaling Open-Vocabulary Image Segmentation with Image-Level Labels](https://arxiv.org/abs/2112.12143) | [Code](https://github.com/tensorflow/tpu/tree/master/models/official/detection/projects/openseg) |
309 | | 2022 | ICLR | LSeg | [Language-driven Semantic Segmentation](https://arxiv.org/abs/2201.03546) | [Code](https://github.com/isl-org/lang-seg) |
310 | | 2022 | ECCV | SimSeg | [A Simple Baseline for Open Vocabulary Semantic Segmentation with Pre-trained Vision-language Model](https://arxiv.org/abs/2112.14757) | [Code](https://github.com/MendelXu/zsseg.baseline) |
311 | | 2022 | ECCV  | MaskCLIP  | [Extract Free Dense Labels from CLIP](https://arxiv.org/abs/2112.01071) | [Code](https://github.com/chongzhou96/MaskCLIP) |
312 | | 2021 | ICCV | UVO | [Unidentified Video Objects: A Benchmark for Dense, Open-World Segmentation](https://arxiv.org/abs/2104.04691) | [Project](https://sites.google.com/view/unidentified-video-object) |
313 | | 2023 | arXiv | CGG | [Betrayed-by-Captions: Joint Caption Grounding and Generation for Open Vocabulary Instance Segmentation](https://arxiv.org/abs/2301.00805) | [Code](https://github.com/jzwu48033552/betrayed-by-captions) |
314 | | 2022 | TPAMI | ES | [Open-World Entity Segmentation](https://arxiv.org/abs/2107.14228) | [Code](https://github.com/dvlab-research/Entity/) |
315 | | 2022 | CVPR | OW-DETR | [OW-DETR: Open-world Detection Transformer](https://arxiv.org/abs/2112.01513) | [Code](https://github.com/akshitac8/OW-DETR) |
316 | | 2023 | CVPR | PROB | [PROB: Probabilistic Objectness for Open World Object Detection](https://arxiv.org/abs/2212.01424) | [Code](https://github.com/orrzohar/PROB) |
317 |
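Most open-vocabulary entries above share one classification recipe: replace the fixed linear classifier with text embeddings, scoring each region or query embedding against embeddings of arbitrary class names from a vision-language model such as CLIP. A hedged sketch of that step, assuming the two embedding spaces are already aligned:

```python
import torch
import torch.nn.functional as F

def open_vocab_classify(query_embed, text_embed, temperature=0.01):
    # query_embed: (N, C) region/query embeddings projected into the VLM space.
    # text_embed: (K, C) text embeddings of K arbitrary class names.
    q = F.normalize(query_embed, dim=-1)
    t = F.normalize(text_embed, dim=-1)
    logits = q @ t.T / temperature     # (N, K) cosine-similarity logits
    return logits.softmax(dim=-1)      # probabilities over the open vocabulary
```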
318 | ### Related Domains and Beyond
319 |
320 | #### Point Cloud Segmentation
321 |
322 | | Year | Venue | Acronym | Paper Title | Code/Project |
323 | |:----:|:-------:|:----------------------:|---------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------|
324 | | 2021 | ICCV | Point Transformer | [Point Transformer](https://arxiv.org/abs/2012.09164) | N/A |
325 | | 2021 | CVM     | PCT                    | [PCT: Point Cloud Transformer](https://arxiv.org/abs/2012.09688) | [Code](https://github.com/MenghaoGuo/PCT) |
326 | | 2022 | CVPR | Stratified Transformer | [Stratified Transformer for 3D Point Cloud Segmentation](https://arxiv.org/abs/2203.14508) | [Code](https://github.com/dvlab-research/Stratified-Transformer) |
327 | | 2022 | CVPR | Point-BERT | [Point-BERT: Pre-training 3D Point Cloud Transformers with Masked Point Modeling](https://arxiv.org/abs/2111.14819) | [Code](https://github.com/lulutang0608/Point-BERT) |
328 | | 2022 | ECCV | Point-MAE | [Masked Autoencoders for Point Cloud Self-supervised Learning](https://arxiv.org/abs/2203.06604) | [Code](https://github.com/Pang-Yatian/Point-MAE) |
329 | | 2022 | NeurIPS | Point-M2AE | [Point-M2AE: Multi-scale Masked Autoencoders for Hierarchical Point Cloud Pre-training](https://arxiv.org/abs/2205.14401) | [Code](https://github.com/ZrrSkywalker/Point-M2AE) |
330 | | 2022 | ICRA | Mask3D | [Mask3D for 3D Semantic Instance Segmentation](https://arxiv.org/abs/2210.03105) | [Code](https://github.com/JonasSchult/Mask3D) |
331 | | 2023 | AAAI | SPFormer | [Superpoint Transformer for 3D Scene Instance Segmentation](https://arxiv.org/abs/2211.15766) | [Code](https://github.com/sunjiahao1999/SPFormer) |
332 | | 2023 | AAAI | PUPS | [PUPS: Point Cloud Unified Panoptic Segmentation](https://arxiv.org/abs/2302.06185) | N/A |
333 |
334 | #### Domain-aware Segmentation
335 |
336 | | Year | Venue | Acronym | Paper Title | Code/Project |
337 | |:----:|:------:|:-------------:|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------|
338 | | 2022 | CVPR | DAFormer | [DAFormer: Improving Network Architectures and Training Strategies for Domain-Adaptive Semantic Segmentation](https://arxiv.org/abs/2111.14887) | [Code](https://github.com/lhoyer/DAFormer) |
339 | | 2022 | ECCV | HRDA | [HRDA: Context-Aware High-Resolution Domain-Adaptive Semantic Segmentation](https://arxiv.org/abs/2204.13132) | [Code](https://github.com/lhoyer/HRDA) |
340 | | 2023 | CVPR | MIC | [MIC: Masked Image Consistency for Context-Enhanced Domain Adaptation](https://arxiv.org/abs/2212.01322) | [Code](https://github.com/lhoyer/MIC) |
341 | | 2021 | ACM MM | SFA | [Exploring Sequence Feature Alignment for Domain Adaptive Detection Transformers](https://arxiv.org/abs/2107.12636) | [Code](https://github.com/encounter1997/SFA) |
342 | | 2023 | CVPR | DA-DETR | [DA-DETR: Domain Adaptive Detection Transformer with Information Fusion](https://arxiv.org/abs/2103.17084) | N/A |
343 | | 2022 | ECCV | MTTrans | [MTTrans: Cross-Domain Object Detection with Mean-Teacher Transformer](https://arxiv.org/abs/2205.01643) | [Code](https://github.com/Lafite-Yu/MTTrans-OpenSource) |
344 | | 2022 | arXiv | Sentence-Seg | [The devil is in the labels: Semantic segmentation from sentences](https://arxiv.org/abs/2202.02002) | N/A |
345 | | 2023 | ICLR | LMSeg | [LMSeg: Language-guided Multi-dataset Segmentation](https://arxiv.org/abs/2302.13495) | N/A |
346 | | 2022 | CVPR   | UniDet        | [Simple Multi-Dataset Detection](https://arxiv.org/abs/2102.13086) | [Code](https://github.com/xingyizhou/UniDet) |
347 | | 2023 | CVPR | Detection Hub | [Detection Hub: Unifying Object Detection Datasets via Query Adaptation on Language Embedding](https://arxiv.org/abs/2206.03484) | N/A |
348 | | 2022 | CVPR | WD2 | [Unifying Panoptic Segmentation for Autonomous Driving](https://openaccess.thecvf.com/content/CVPR2022/papers/Zendel_Unifying_Panoptic_Segmentation_for_Autonomous_Driving_CVPR_2022_paper.pdf) | [Data](https://github.com/ozendelait/wilddash_scripts) |
349 | | 2023 | arXiv  | TarViS        | [TarViS: A Unified Approach for Target-based Video Segmentation](https://arxiv.org/abs/2301.02657) | [Code](https://github.com/Ali2500/TarViS) |
350 |
351 | #### Label and Model Efficient Segmentation
352 |
353 | | Year | Venue | Acronym | Paper Title | Code/Project |
354 | |:----:|:-------:|:-----------:|------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------|
355 | | 2022 | CVPR | MCTformer | [Multi-class Token Transformer for Weakly Supervised Semantic Segmentation](https://arxiv.org/abs/2203.02891) | [Code](https://github.com/xulianuwa/MCTformer) |
356 | | 2020 | CVPR | PCM | [Self-supervised Equivariant Attention Mechanism for Weakly Supervised Semantic Segmentation](https://arxiv.org/abs/2004.04581) | [Code](https://github.com/YudeWang/SEAM) |
357 | | 2022 | ECCV | ViT-PCM | [Max Pooling with Vision Transformers reconciles class and shape in weakly supervised semantic segmentation](https://arxiv.org/abs/2210.17400) | [Code](https://github.com/deepplants/ViT-PCM) |
358 | | 2021 | ICCV | DINO | [Emerging Properties in Self-Supervised Vision Transformers](https://arxiv.org/abs/2104.14294) | [Code](https://github.com/facebookresearch/dino) |
359 | | 2021 | BMVC | LOST | [Localizing Objects with Self-Supervised Transformers and no Labels](https://arxiv.org/abs/2109.14279) | [Code](https://github.com/valeoai/LOST) |
360 | | 2022 | ICLR | STEGO | [Unsupervised Semantic Segmentation by Distilling Feature Correspondences](https://arxiv.org/abs/2203.08414) | [Code](https://github.com/mhamilton723/STEGO) |
361 | | 2022 | NeurIPS | ReCo | [ReCo: Retrieve and Co-segment for Zero-shot Transfer](https://arxiv.org/abs/2206.07045) | [Code](https://github.com/NoelShin/reco) |
362 | | 2022 | arXiv | MaskDistill | [Discovering Object Masks with Transformers for Unsupervised Semantic Segmentation](https://arxiv.org/abs/2206.06363) | N/A |
363 | | 2022 | CVPR | FreeSOLO | [FreeSOLO: Learning to Segment Objects without Annotations](https://arxiv.org/abs/2202.12181) | [Code](http://github.com/NVlabs/FreeSOLO) |
364 | | 2023 | CVPR | CutLER | [Cut and Learn for Unsupervised Object Detection and Instance Segmentation](https://arxiv.org/abs/2301.11320) | [Code](https://github.com/facebookresearch/CutLER) |
365 | | 2022 | CVPR | TokenCut | [Self-Supervised Transformers for Unsupervised Object Discovery using Normalized Cut](https://arxiv.org/abs/2202.11539) | [Code](https://github.com/YangtaoWANG95/TokenCut) |
366 | | 2022 | ICLR | MobileViT | [MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer](https://arxiv.org/abs/2110.02178) | [Code](https://github.com/apple/ml-cvnets) |
367 | | 2023 | arXiv | EMO | [Rethinking Mobile Block for Efficient Neural Models](https://arxiv.org/abs/2301.01146) | [Code](https://github.com/zhangzjn/EMO) |
368 | | 2022 | CVPR | TopFormer | [TopFormer: Token Pyramid Transformer for Mobile Semantic Segmentation](https://arxiv.org/abs/2204.05525) | [Code](https://github.com/hustvl/TopFormer) |
369 | | 2023 | ICLR | SeaFormer | [SeaFormer: Squeeze-enhanced Axial Transformer for Mobile Semantic Segmentation](https://arxiv.org/abs/2301.13156) | [Code](https://github.com/fudan-zvg/SeaFormer) |
370 |
371 | #### Class Agnostic Segmentation and Tracking
372 |
373 | | Year | Venue | Acronym | Paper Title | Code/Project |
374 | |:----:|:-------:|:-----------:|------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------|
375 | | 2022 | CVPR | Transfiner | [Mask Transfiner for High-Quality Instance Segmentation](https://arxiv.org/abs/2111.13673) | [Code](https://github.com/SysCV/transfiner) |
376 | | 2022 | ECCV | VMT | [Video Mask Transfiner for High-Quality Video Instance Segmentation](https://arxiv.org/abs/2207.14012) | [Code](https://github.com/SysCV/vmt) |
377 | | 2022 | arXiv | SimpleClick | [SimpleClick: Interactive Image Segmentation with Simple Vision Transformers](https://arxiv.org/abs/2210.11006) | [Code](https://github.com/uncbiag/simpleclick) |
378 | | 2023 | ICLR | PatchDCT | [PatchDCT: Patch Refinement for High Quality Instance Segmentation](https://arxiv.org/abs/2302.02693) | [Code](https://github.com/olivia-w12/PatchDCT) |
379 | | 2019 | ICCV | STM | [Video Object Segmentation using Space-Time Memory Networks](https://arxiv.org/abs/1904.00607) | [Code](https://github.com/seoungwugoh/STM) |
380 | | 2021 | NeurIPS | AOT | [Associating Objects with Transformers for Video Object Segmentation](https://arxiv.org/abs/2106.02638) | [Code](https://github.com/z-x-yang/AOT) |
381 | | 2021 | NeurIPS | STCN | [Rethinking Space-Time Networks with Improved Memory Coverage for Efficient Video Object Segmentation](https://arxiv.org/abs/2106.05210) | [Code](https://github.com/hkchengrex/STCN) |
382 | | 2022 | ECCV | XMem | [XMem: Long-Term Video Object Segmentation with an Atkinson-Shiffrin Memory Model](https://arxiv.org/abs/2207.07115) | [Code](https://hkchengrex.github.io/XMem) |
383 | | 2022 | CVPR | PCVOS | [Per-Clip Video Object Segmentation](https://arxiv.org/abs/2208.01924) | [Code](https://github.com/pkyong95/PCVOS) |
384 | | 2023 | CVPR | N/A | [Look Before You Match: Instance Understanding Matters in Video Object Segmentation](https://arxiv.org/abs/2212.06826) | N/A |
385 |
386 | #### Medical Image Segmentation
387 |
388 | | Year | Venue | Acronym | Paper Title | Code/Project |
389 | |:----:|:-------------:|:---------:|-----------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------|
390 | | 2020 | BIBM | CellDETR | [Attention-Based Transformers for Instance Segmentation of Cells in Microstructures](https://arxiv.org/abs/2011.09763) | [Code](https://github.com/ChristophReich1996/Cell-DETR) |
391 | | 2021 | arXiv | TransUNet | [TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation](https://arxiv.org/abs/2102.04306) | [Code](https://github.com/Beckschen/TransUNet) |
392 | | 2022 | ECCV Workshop | Swin-Unet | [Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation](https://arxiv.org/abs/2105.05537) | [Code](https://github.com/HuCaoFighting/Swin-Unet) |
393 | | 2021 | MICCAI | TransFuse | [TransFuse: Fusing Transformers and CNNs for Medical Image Segmentation](https://arxiv.org/abs/2102.08005) | [Code](https://github.com/Rayicer/TransFuse) |
394 | | 2022 | WACV | UNETR | [UNETR: Transformers for 3D Medical Image Segmentation](https://arxiv.org/abs/2103.10504) | [Code](https://github.com/Project-MONAI/research-contributions/tree/main/UNETR) |
395 |
396 | ## Acknowledgement
397 |
398 | If you find our survey and repository useful for your research project, please consider citing our paper:
399 |
400 | ```bibtex
401 | @article{li2023transformer,
402 |   author={Li, Xiangtai and Ding, Henghui and Zhang, Wenwei and Yuan, Haobo and Cheng, Guangliang and Pang, Jiangmiao and Chen, Kai and Liu, Ziwei and Loy, Chen Change},
403 | title={Transformer-Based Visual Segmentation: A Survey},
404 | journal={T-PAMI},
405 | year={2024}
406 | }
407 | ```
408 | ## Contact
409 | ```
410 | xiangtai94@gmail.com (main)
411 | lxtpku@pku.edu.cn
412 | ```
413 |
414 |
415 | ## Related Repos for Segmentation and Detection
416 |
417 | Attention Model [Repo](https://github.com/cmhungsteve/Awesome-Transformer-Attention) by Min-Hung (Steve) Chen.
418 |
419 | Detection Transformer [Repo](https://github.com/IDEA-Research/awesome-detection-transformer) by IDEA.
420 |
421 | Open Vocabulary Learning [Repo](https://github.com/jianzongwu/Awesome-Open-Vocabulary) by PKU and NTU.
422 |
423 |
--------------------------------------------------------------------------------
/code/README.md:
--------------------------------------------------------------------------------
1 | ### Fair Re-Benchmarking Code
2 |
--------------------------------------------------------------------------------
/figs/survey_pipeline.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lxtGH/Awesome-Segmentation-With-Transformer/9c7a4884c6a23590f11a694f39cd8caa618d3593/figs/survey_pipeline.jpg
--------------------------------------------------------------------------------