├── .gitignore
├── README.md
├── code
│   └── README.md
└── figs
    └── survey_pipeline.jpg

/.gitignore:
--------------------------------------------------------------------------------
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
work_dir/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# pyenv
.python-version

# celery beat schedule file
celerybeat-schedule

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/

data/
data
.vscode
.idea
.DS_Store

# custom
*.pkl
*.pkl.json
*.log.json

# Pytorch
*.pth
*.py~
*.sh~

debug/*
vis/
analysis/*
pretrain/*
--------------------------------------------------------------------------------

/README.md:
--------------------------------------------------------------------------------

[![Awesome](https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg)](https://github.com/sindresorhus/awesome)
[![PR's Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg?style=flat)](https://github.com/lxtGH/Awesome-Segmenation-With-Transformer/pulls)
# Transformer-Based Visual Segmentation: A Survey

**T-PAMI, 2024**

Xiangtai Li (Project Lead) · Henghui Ding · Haobo Yuan · Wenwei Zhang · Guangliang Cheng ·
Jiangmiao Pang · Kai Chen · Ziwei Liu · Chen Change Loy

[arXiv PDF](https://arxiv.org/abs/2304.09854) · S-Lab Project Page · TPAMI PDF
This repo is used for recording, tracking, and benchmarking several recent transformer-based visual segmentation methods,
as a supplement to our [survey](https://arxiv.org/abs/2304.09854).
If you find a missing work or have any suggestions (papers, implementations, and other resources), feel free to open
a [pull request](https://github.com/lxtGH/Awesome-Segmenation-With-Transformer/pulls).
We will add the missing papers to this repo ASAP.

### 🔥News

[-] 2024: Accepted by T-PAMI.

[-] 2024-03: Added several CVPR-24 works in these directions. You are welcome to add your CVPR work to our repo!

[-] 2023-12: The third version is on arXiv ([survey](https://arxiv.org/abs/2304.09854)); more benchmarks and methods are included!

[-] 2023-06: The second draft is on arXiv.

### 🔥Highlight!!

[1] Previous transformer surveys divide the methods by task and experimental setting. Unlike them, we revisit and group the existing transformer-based methods from the **technical perspective**.

[2] We survey the methods in two parts: one for the mainstream tasks based on the DETR-like meta-architecture, the other for related directions, organized by task.

[3] We further re-benchmark several representative works on image semantic segmentation and panoptic segmentation datasets.

[4] We also include query-based detection transformers, since both segmentation and detection tasks are unified by the object query.

## Introduction

We present the first detailed survey on transformer-based visual segmentation.

![Survey pipeline](./figs/survey_pipeline.jpg)

## Summary of Contents

- [Methods: A Survey](#methods-a-survey)
  - [Meta-Architecture](#meta-architecture)
  - [Strong Representation](#Strong-Representation)
  - [Interaction Design in Decoder](#Interaction-Design-in-Decoder)
  - [Optimizing Object Query](#Optimizing-Object-Query)
  - [Using Query For Association](#Using-Query-For-Association)
  - [Conditional Query Generation](#Conditional-Query-Generation)
- [Related Domains and Beyond](#Related-Domains-and-Beyond)
  - [Point Cloud Segmentation](#Point-Cloud-Segmentation)
  - [Tuning Foundation Models](#Tuning-Foundation-Models)
  - [Domain-aware Segmentation](#Domain-aware-Segmentation)
  - [Label and Model Efficient Segmentation](#Label-and-Model-Efficient-Segmentation)
  - [Class Agnostic Segmentation and Tracking](#Class-Agnostic-Segmentation-and-Tracking)
  - [Medical Image Segmentation](#Medical-Image-Segmentation)

## Methods: A Survey

### Meta-Architecture

| Year | Venue | Acronym | Paper Title | Code/Project |
|:----:|:-------:|:---------------:|-------------|--------------|
| 2020 | ECCV | DETR | [End-to-End Object Detection with Transformers](https://arxiv.org/abs/2005.12872) | [Code](https://github.com/facebookresearch/detr) |
| 2021 | ICLR | Deformable DETR | [Deformable DETR: Deformable Transformers for End-to-End Object Detection](https://arxiv.org/abs/2010.04159) | [Code](https://github.com/fundamentalvision/Deformable-DETR) |
| 2021 | CVPR | MaX-DeepLab | [MaX-DeepLab: End-to-End Panoptic Segmentation with Mask Transformers](https://arxiv.org/abs/2012.00759) | [Code](https://github.com/google-research/deeplab2) |
| 2021 | NeurIPS | MaskFormer | [MaskFormer: Per-Pixel Classification is Not All You Need for Semantic Segmentation](http://arxiv.org/abs/2107.06278) | [Code](https://github.com/facebookresearch/MaskFormer) |
| 2021 | NeurIPS | K-Net | [K-Net: Towards Unified Image Segmentation](https://arxiv.org/abs/2106.14855) | [Code](https://github.com/ZwwWayne/K-Net) |
| 2023 | CVPR | Lite-DETR | [Lite DETR: An Interleaved Multi-Scale Encoder for Efficient DETR](https://arxiv.org/pdf/2303.07335) | [Code](https://github.com/IDEA-Research/Lite-DETR) |
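All methods in this table share one meta-architecture: a backbone extracts features, a set of learnable object queries is refined by a transformer decoder, and each refined query predicts one class label plus one box or mask. The minimal PyTorch sketch below illustrates that flow; the module choices, sizes, and the simple dot-product mask head are our own illustrative assumptions, not the implementation of any specific paper.

```python
import torch
import torch.nn as nn

class MetaArchitecture(nn.Module):
    """Minimal DETR/MaskFormer-style meta-architecture sketch.

    Backbone features -> transformer decoder with N learnable object
    queries -> per-query class logits and mask logits. All hyper-
    parameters here are illustrative assumptions.
    """

    def __init__(self, num_queries=100, num_classes=80, dim=256):
        super().__init__()
        self.backbone = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # stand-in for ResNet/ViT
        self.queries = nn.Embedding(num_queries, dim)                 # learnable object queries
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        self.class_head = nn.Linear(dim, num_classes + 1)             # +1 for "no object"
        self.mask_head = nn.Linear(dim, dim)                          # query -> mask embedding

    def forward(self, images):
        feats = self.backbone(images)                  # (B, C, H, W)
        B = feats.shape[0]
        memory = feats.flatten(2).transpose(1, 2)      # (B, HW, C) tokens for cross-attention
        q = self.queries.weight.unsqueeze(0).expand(B, -1, -1)
        q = self.decoder(q, memory)                    # queries attend to image tokens
        logits = self.class_head(q)                    # (B, N, num_classes + 1)
        # Mask logits: dot product between mask embeddings and pixel features.
        masks = torch.einsum("bnc,bchw->bnhw", self.mask_head(q), feats)
        return logits, masks

logits, masks = MetaArchitecture()(torch.randn(2, 3, 224, 224))
print(logits.shape, masks.shape)  # (2, 100, 81) and (2, 100, 14, 14)
```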
### Strong Representation

#### Better ViTs Design

| Year | Venue | Acronym | Paper Title | Code/Project |
|:----:|:-------:|:-----------:|-------------|--------------|
| 2021 | CVPR | SETR | [Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers](https://arxiv.org/abs/2012.15840) | [Code](https://github.com/fudan-zvg/SETR) |
| 2021 | ICCV | MViT-V1 | [Multiscale Vision Transformers](https://arxiv.org/abs/2104.11227) | [Code](https://github.com/facebookresearch/mvit) |
| 2022 | CVPR | MViT-V2 | [MViTv2: Improved Multiscale Vision Transformers for Classification and Detection](https://arxiv.org/abs/2112.01526) | [Code](https://github.com/facebookresearch/mvit) |
| 2021 | NeurIPS | XCiT | [XCiT: Cross-Covariance Image Transformers](https://arxiv.org/abs/2106.09681) | [Code](https://github.com/facebookresearch/xcit) |
| 2021 | ICCV | Pyramid ViT | [Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions](https://arxiv.org/abs/2102.12122) | [Code](https://github.com/whai362/PVT) |
| 2021 | ICCV | CrossViT | [CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification](https://arxiv.org/abs/2103.14899) | [Code](https://github.com/IBM/CrossViT) |
| 2021 | ICCV | CoaT | [Co-Scale Conv-Attentional Image Transformers](https://arxiv.org/abs/2104.06399) | [Code](https://github.com/mlpc-ucsd/CoaT) |
| 2022 | CVPR | MPViT | [MPViT: Multi-Path Vision Transformer for Dense Prediction](https://arxiv.org/abs/2112.11010) | [Code](https://github.com/youngwanLEE/MPViT) |
| 2022 | NeurIPS | SegViT | [SegViT: Semantic Segmentation with Plain Vision Transformers](https://arxiv.org/abs/2210.05844) | [Code](https://github.com/zbwxp/SegVit) |
| 2022 | arXiv | RSSeg | [Representation Separation for Semantic Segmentation with Vision Transformers](https://arxiv.org/abs/2212.13764) | N/A |

#### Hybrid CNNs/Transformers/MLPs

| Year | Venue | Acronym | Paper Title | Code/Project |
|:----:|:-------:|:----------:|-------------|--------------|
| 2021 | ICCV | Swin | [Swin Transformer: Hierarchical Vision Transformer using Shifted Windows](https://arxiv.org/abs/2103.14030) | [Code](https://github.com/microsoft/Swin-Transformer) |
| 2022 | CVPR | Swin-V2 | [Swin Transformer V2: Scaling Up Capacity and Resolution](https://arxiv.org/abs/2111.09883) | [Code](https://github.com/microsoft/Swin-Transformer) |
| 2021 | NeurIPS | SegFormer | [SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers](https://arxiv.org/abs/2105.15203) | [Code](http://github.com/NVlabs/SegFormer) |
| 2022 | CVPR | CMT | [CMT: Convolutional Neural Networks Meet Vision Transformers](https://arxiv.org/abs/2107.06263) | [Code](https://github.com/FlyEgle/CMT-pytorch) |
| 2021 | NeurIPS | Twins | [Twins: Revisiting the Design of Spatial Attention in Vision Transformers](https://arxiv.org/abs/2104.13840) | [Code](https://github.com/Meituan-AutoML/Twins) |
| 2021 | ICCV | CvT | [CvT: Introducing Convolutions to Vision Transformers](https://arxiv.org/abs/2103.15808) | [Code](https://github.com/microsoft/CvT) |
| 2021 | NeurIPS | ViTAE | [ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias](https://arxiv.org/abs/2106.03348) | [Code](https://github.com/ViTAE-Transformer/ViTAE-Transformer) |
| 2022 | CVPR | ConvNeXt | [A ConvNet for the 2020s](https://arxiv.org/abs/2201.03545) | [Code](https://github.com/facebookresearch/ConvNeXt) |
| 2022 | NeurIPS | SegNeXt | [SegNeXt: Rethinking Convolutional Attention Design for Semantic Segmentation](https://github.com/visual-attention-network/segnext) | [Code](https://github.com/visual-attention-network/segnext) |
| 2022 | CVPR | PoolFormer | [PoolFormer: MetaFormer Is Actually What You Need for Vision](https://arxiv.org/abs/2111.11418) | [Code](https://github.com/sail-sg/poolformer) |
| 2023 | ICLR | STM | [Demystify Transformers & Convolutions in Modern Image Deep Networks](https://arxiv.org/abs/2211.05781) | [Code](https://github.com/OpenGVLab/STM-Evaluation) |
#### Self-Supervised Learning

| Year | Venue | Acronym | Paper Title | Code/Project |
|:----:|:-------:|:-----------:|-------------|--------------|
| 2021 | ICCV | MoCo-V3 | [An Empirical Study of Training Self-Supervised Vision Transformers](https://arxiv.org/abs/2104.02057) | [Code](https://github.com/facebookresearch/moco-v3) |
| 2022 | ICLR | BEiT | [BEiT: BERT Pre-Training of Image Transformers](https://arxiv.org/abs/2106.08254) | [Code](https://github.com/microsoft/unilm/tree/master/beit) |
| 2022 | CVPR | MaskFeat | [Masked Feature Prediction for Self-Supervised Visual Pre-Training](https://arxiv.org/abs/2112.09133) | [Code](https://github.com/facebookresearch/SlowFast) |
| 2022 | CVPR | MAE | [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377) | [Code](https://github.com/facebookresearch/mae) |
| 2022 | NeurIPS | ConvMAE | [MCMAE: Masked Convolution Meets Masked Autoencoders](https://arxiv.org/abs/2303.05475) | [Code](https://github.com/Alpha-VL/ConvMAE) |
| 2023 | ICLR | SparK | [SparK: The First Successful BERT/MAE-style Pretraining on Any Convolutional Networks](https://github.com/keyu-tian/SparK) | [Code](https://github.com/keyu-tian/SparK) |
| 2022 | CVPR | FLIP | [Scaling Language-Image Pre-training via Masking](https://arxiv.org/abs/2212.00794) | [Code](https://github.com/facebookresearch/flip) |
| 2023 | arXiv | ConvNeXt-V2 | [ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders](http://arxiv.org/abs/2301.00808) | [Code](https://github.com/facebookresearch/ConvNeXt-V2) |
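The masked-image-modeling entries above (MAE, ConvMAE, SparK) pre-train a backbone by hiding a large fraction of patch tokens and reconstructing them. Below is a sketch of the random-masking step alone, assuming a generic (B, N, C) patch-token tensor; the 75% ratio follows the MAE paper, everything else is an illustrative assumption.

```python
import torch

def random_masking(tokens: torch.Tensor, mask_ratio: float = 0.75):
    """Keep a random subset of patch tokens, MAE-style.

    tokens: (B, N, C) patch embeddings. Returns the visible tokens,
    a binary mask marking the hidden (reconstruction-target) tokens,
    and the indices that restore the original token order.
    """
    B, N, C = tokens.shape
    n_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N)                          # one random score per token
    shuffle = noise.argsort(dim=1)                    # random permutation per image
    keep = shuffle[:, :n_keep]                        # indices of visible tokens
    visible = tokens.gather(1, keep.unsqueeze(-1).expand(-1, -1, C))
    # Binary mask in original order: 1 = hidden, 0 = visible.
    mask = torch.ones(B, N)
    mask.scatter_(1, keep, 0.0)
    return visible, mask, shuffle.argsort(dim=1)      # last output un-shuffles tokens

vis, mask, restore = random_masking(torch.randn(4, 196, 768))
print(vis.shape, mask.sum(1))  # (4, 49, 768); 147 hidden tokens per image
```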
### Interaction Design in Decoder

#### Improved Cross Attention Design

| Year | Venue | Acronym | Paper Title | Code/Project |
|:----:|:-------:|:------------------:|-------------|--------------|
| 2021 | CVPR | Sparse R-CNN | [Sparse R-CNN: End-to-End Object Detection with Learnable Proposals](https://arxiv.org/abs/2011.12450) | [Code](https://github.com/PeizeSun/SparseR-CNN) |
| 2022 | CVPR | AdaMixer | [AdaMixer: A Fast-Converging Query-Based Object Detector](https://arxiv.org/abs/2203.16507) | [Code](https://github.com/MCG-NJU/AdaMixer) |
| 2021 | CVPR | MaX-DeepLab | [MaX-DeepLab: End-to-End Panoptic Segmentation with Mask Transformers](https://arxiv.org/abs/2012.00759) | [Code](https://github.com/google-research/deeplab2) |
| 2021 | NeurIPS | K-Net | [K-Net: Towards Unified Image Segmentation](https://arxiv.org/abs/2106.14855) | [Code](https://github.com/ZwwWayne/K-Net/) |
| 2022 | CVPR | Mask2Former | [Masked-attention Mask Transformer for Universal Image Segmentation](https://arxiv.org/abs/2112.01527) | [Code](https://github.com/facebookresearch/Mask2Former) |
| 2022 | ECCV | kMaX-DeepLab | [k-means Mask Transformer](https://arxiv.org/abs/2207.04044) | [Code](https://github.com/google-research/deeplab2) |
| 2021 | ICCV | QueryInst | [Instances as Queries](https://arxiv.org/abs/2105.01928) | [Code](https://github.com/hustvl/QueryInst) |
| 2021 | arXiv | ISTR | [ISTR: End-to-End Instance Segmentation via Transformers](https://arxiv.org/abs/2105.00637) | [Code](https://github.com/hujiecpp/ISTR) |
| 2021 | NeurIPS | SOLQ | [SOLQ: Segmenting Objects by Learning Queries](https://arxiv.org/abs/2106.02351) | [Code](https://github.com/megvii-research/SOLQ) |
| 2022 | CVPR | Panoptic SegFormer | [Panoptic SegFormer: Delving Deeper into Panoptic Segmentation with Transformers](https://arxiv.org/abs/2109.03814) | [Code](https://github.com/zhiqi-li/Panoptic-SegFormer) |
| 2022 | CVPR | CMT-DeepLab | [CMT-DeepLab: Clustering Mask Transformers for Panoptic Segmentation](https://arxiv.org/abs/2206.08948) | N/A |
| 2022 | CVPR | SparseInst | [Sparse Instance Activation for Real-Time Instance Segmentation](https://arxiv.org/abs/2203.12827) | [Code](https://github.com/hustvl/SparseInst) |
| 2022 | CVPR | SAM-DETR | [Accelerating DETR Convergence via Semantic-Aligned Matching](https://arxiv.org/abs/2203.06883) | [Code](https://github.com/ZhangGongjie/SAM-DETR) |
| 2021 | ICCV | SMCA-DETR | [Fast Convergence of DETR with Spatially Modulated Co-Attention](https://arxiv.org/abs/2101.07448) | [Code](https://github.com/gaopengcuhk/SMCA-DETR) |
| 2021 | BMVC | ACT-DETR | [End-to-End Object Detection with Adaptive Clustering Transformer](https://www.bmvc2021-virtualconference.com/assets/papers/0709.pdf) | [Code](https://github.com/gaopengcuhk/SMCA-DETR) |
| 2021 | ICCV | Dynamic DETR | [Dynamic DETR: End-to-End Object Detection with Dynamic Attention](https://ieeexplore.ieee.org/document/9709981) | N/A |
| 2022 | ICLR | Sparse DETR | [Sparse DETR: Efficient End-to-End Object Detection with Learnable Sparsity](https://arxiv.org/abs/2111.14330) | [Code](https://github.com/kakaobrain/sparse-detr) |
| 2023 | CVPR | FastInst | [FastInst: A Simple Query-Based Model for Real-Time Instance Segmentation](https://arxiv.org/abs/2303.08594) | [Code](https://github.com/junjiehe96/FastInst) |
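A recurring idea in this table (Mask2Former, kMaX-DeepLab) is to restrict each query's cross-attention to the foreground of its own mask predicted by the previous layer. The sketch below shows only that masking step, with single-head scaled dot-product attention; the 0.5 threshold and the fallback for empty masks follow Mask2Former's description, the rest is an illustrative assumption.

```python
import torch

def masked_cross_attention(queries, pixel_feats, prev_mask_logits):
    """Mask2Former-style masked cross-attention (single head, sketch).

    queries:          (B, N, C) object queries
    pixel_feats:      (B, HW, C) flattened image features
    prev_mask_logits: (B, N, HW) mask predictions from the previous layer
    """
    scale = queries.shape[-1] ** -0.5
    attn = torch.einsum("bnc,bmc->bnm", queries, pixel_feats) * scale
    # Block attention to pixels the previous layer considered background.
    background = prev_mask_logits.sigmoid() < 0.5        # (B, N, HW)
    # If a query's mask is empty, let it attend everywhere instead.
    empty = background.all(dim=-1, keepdim=True)
    attn = attn.masked_fill(background & ~empty, float("-inf"))
    attn = attn.softmax(dim=-1)
    return queries + torch.einsum("bnm,bmc->bnc", attn, pixel_feats)

q = masked_cross_attention(torch.randn(2, 100, 256),
                           torch.randn(2, 196, 256),
                           torch.randn(2, 100, 196))
print(q.shape)  # torch.Size([2, 100, 256])
```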
#### Spatial-Temporal Cross Attention Design

| Year | Venue | Acronym | Paper Title | Code/Project |
|:----:|:-------:|:------------------:|-------------|--------------|
| 2021 | CVPR | VisTR | [VisTR: End-to-End Video Instance Segmentation with Transformers](https://arxiv.org/abs/2011.14503) | [Code](https://github.com/Epiphqny/VisTR) |
| 2021 | NeurIPS | IFC | [Video Instance Segmentation using Inter-Frame Communication Transformers](https://arxiv.org/abs/2106.03299) | [Code](https://github.com/sukjunhwang/IFC) |
| 2022 | CVPR | Slot-VPS | [Slot-VPS: Object-centric Representation Learning for Video Panoptic Segmentation](https://arxiv.org/abs/2112.08949) | N/A |
| 2022 | CVPR | TubeFormer-DeepLab | [TubeFormer-DeepLab: Video Mask Transformer](https://arxiv.org/abs/2205.15361) | N/A |
| 2022 | CVPR | Video K-Net | [Video K-Net: A Simple, Strong, and Unified Baseline for Video Segmentation](https://arxiv.org/abs/2204.04656) | [Code](https://github.com/lxtGH/Video-K-Net) |
| 2022 | CVPR | TeViT | [Temporally Efficient Vision Transformer for Video Instance Segmentation](https://arxiv.org/abs/2204.08412) | [Code](https://github.com/hustvl/TeViT) |
| 2022 | ECCV | SeqFormer | [SeqFormer: Sequential Transformer for Video Instance Segmentation](https://arxiv.org/abs/2112.08275) | [Code](https://github.com/wjf5203/SeqFormer) |
| 2022 | arXiv | Mask2Former-VIS | [Mask2Former for Video Instance Segmentation](https://arxiv.org/abs/2112.10764) | [Code](https://github.com/facebookresearch/Mask2Former) |
| 2022 | PAMI | TransVOD | [TransVOD: End-to-End Video Object Detection with Spatial-Temporal Transformers](https://arxiv.org/abs/2201.05047) | [Code](https://github.com/SJTU-LuHe/TransVOD) |
| 2022 | NeurIPS | VITA | [VITA: Video Instance Segmentation via Object Token Association](https://arxiv.org/abs/2206.04403) | [Code](https://github.com/sukjunhwang/VITA) |

### Optimizing Object Query

#### Adding Position Information into Query

| Year | Venue | Acronym | Paper Title | Code/Project |
|:----:|:-----:|:-------------------:|-------------|--------------|
| 2021 | ICCV | Conditional-DETR | [Conditional DETR for Fast Training Convergence](https://arxiv.org/abs/2108.06152) | [Code](https://github.com/Atten4Vis/ConditionalDETR) |
| 2022 | arXiv | Conditional-DETR-v2 | [Conditional DETR v2: Efficient Detection Transformer with Box Queries](https://arxiv.org/abs/2207.08914) | [Code](https://github.com/Atten4Vis/ConditionalDETR) |
| 2022 | AAAI | Anchor DETR | [Anchor DETR: Query Design for Transformer-Based Detector](https://arxiv.org/abs/2109.07107) | [Code](https://github.com/megvii-model/AnchorDETR) |
| 2022 | ICLR | DAB-DETR | [DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR](https://arxiv.org/abs/2201.12329) | [Code](https://github.com/SlongLiu/DAB-DETR) |
| 2021 | arXiv | Efficient DETR | [Efficient DETR: Improving End-to-End Object Detector with Dense Prior](https://arxiv.org/abs/2104.01318) | N/A |
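The common thread in these works is to make the positional part of the query explicit, e.g., as a reference point or anchor box that is encoded with sinusoidal embeddings and refined layer by layer. Below is a sketch of turning normalized (cx, cy, w, h) anchors into positional query embeddings, loosely in the spirit of DAB-DETR; the dimensions and the projection MLP are illustrative assumptions.

```python
import math
import torch
import torch.nn as nn

def sine_embed(x: torch.Tensor, dim: int = 128, temp: float = 10000.0):
    """Sinusoidal embedding of normalized scalars in [0, 1] -> (..., dim)."""
    freq = temp ** (torch.arange(dim // 2, dtype=torch.float32) * 2 / dim)
    angles = x.unsqueeze(-1) * 2 * math.pi / freq
    return torch.cat([angles.sin(), angles.cos()], dim=-1)

class AnchorPositionalQuery(nn.Module):
    """Encode (cx, cy, w, h) anchor boxes as positional query embeddings."""

    def __init__(self, dim: int = 256):
        super().__init__()
        # Project the concatenated per-coordinate sine features back to dim.
        self.proj = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, anchors: torch.Tensor):          # (B, N, 4), normalized boxes
        per_coord = sine_embed(anchors, dim=128)       # (B, N, 4, 128)
        flat = per_coord.flatten(-2)                   # (B, N, 512)
        return self.proj(flat)                         # (B, N, 256), added to content queries

pos_q = AnchorPositionalQuery()(torch.rand(2, 100, 4))
print(pos_q.shape)  # torch.Size([2, 100, 256])
```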
#### Adding Extra Supervision into Query

| Year | Venue | Acronym | Paper Title | Code/Project |
|:----:|:-------:|:----------:|-------------|--------------|
| 2022 | ECCV | DE-DETR | [Towards Data-Efficient Detection Transformers](https://arxiv.org/abs/2203.09507) | [Code](https://github.com/encounter1997/DE-DETRs) |
| 2022 | CVPR | DN-DETR | [DN-DETR: Accelerate DETR Training by Introducing Query DeNoising](https://arxiv.org/abs/2203.01305) | [Code](https://github.com/IDEA-opensource/DN-DETR) |
| 2023 | ICLR | DINO | [DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection](https://arxiv.org/abs/2203.03605) | [Code](https://github.com/IDEA-Research/DINO) |
| 2023 | CVPR | MP-Former | [MP-Former: Mask-Piloted Transformer for Image Segmentation](https://arxiv.org/abs/2303.07336) | [Code](https://github.com/IDEA-Research/MP-Former) |
| 2023 | CVPR | Mask-DINO | [Mask DINO: Towards A Unified Transformer-based Framework for Object Detection and Segmentation](https://arxiv.org/abs/2206.02777) | [Code](https://github.com/IDEACVR/MaskDINO) |
| 2022 | NeurIPS | N/A | [Learning Equivariant Segmentation with Instance-Unique Querying](https://arxiv.org/abs/2210.00911) | [Code](https://github.com/JamesLiang819/Instance_Unique_Querying) |
| 2023 | CVPR | H-DETR | [DETRs with Hybrid Matching](https://arxiv.org/abs/2207.13080) | [Code](https://github.com/HDETR) |
| 2023 | ICCV | Group-DETR | [Group DETR: Fast DETR Training with Group-Wise One-to-Many Assignment](https://arxiv.org/abs/2207.13085) | N/A |
| 2023 | ICCV | Co-DETR | [DETRs with Collaborative Hybrid Assignments Training](https://arxiv.org/abs/2211.12860) | [Code](https://github.com/Sense-X/Co-DETR) |

### Using Query For Association

#### Query as Instance Association

| Year | Venue | Acronym | Paper Title | Code/Project |
|:----:|:-------:|:-----------:|-------------|--------------|
| 2022 | CVPR | TrackFormer | [TrackFormer: Multi-Object Tracking with Transformer](https://arxiv.org/abs/2101.02702) | [Code](https://github.com/timmeinhardt/trackformer) |
| 2021 | arXiv | TransTrack | [TransTrack: Multiple Object Tracking with Transformer](https://arxiv.org/abs/2012.15460) | [Code](https://github.com/PeizeSun/TransTrack) |
| 2022 | ECCV | MOTR | [MOTR: End-to-End Multiple-Object Tracking with TRansformer](https://arxiv.org/abs/2105.03247) | [Code](https://github.com/megvii-research/MOTR) |
| 2022 | NeurIPS | MinVIS | [MinVIS: A Minimal Video Instance Segmentation Framework without Video-based Training](https://arxiv.org/abs/2208.02245) | [Code](https://github.com/NVlabs/MinVIS) |
| 2022 | ECCV | IDOL | [In Defense of Online Models for Video Instance Segmentation](https://arxiv.org/abs/2207.10661) | [Code](https://github.com/wjf5203/VNext) |
| 2022 | CVPR | Video K-Net | [Video K-Net: A Simple, Strong, and Unified Baseline for Video Segmentation](https://arxiv.org/abs/2204.04656) | [Code](https://github.com/lxtGH/Video-K-Net) |
| 2023 | CVPR | GenVIS | [A Generalized Framework for Video Instance Segmentation](https://arxiv.org/abs/2211.08834) | [Code](https://github.com/miranheo/GenVIS) |
| 2023 | ICCV | Tube-Link | [Tube-Link: A Flexible Cross Tube Framework for Universal Video Segmentation](https://arxiv.org/abs/2303.12782) | [Code](https://github.com/lxtGH/Tube-Link) |
| 2023 | ICCV | CTVIS | [CTVIS: Consistent Training for Online Video Instance Segmentation](https://arxiv.org/abs/2303.12782) | [Code](https://github.com/KainingYing/CTVIS) |
| 2023 | CVPR-W | Video-kMaX | [Video-kMaX: A Simple Unified Approach for Online and Near-Online Video Panoptic Segmentation](https://arxiv.org/abs/2304.04694) | N/A |
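Most trackers in this table associate instances across frames by matching per-frame query embeddings; MinVIS, for example, does this with bipartite matching on embedding similarity and no video-specific training. Below is a sketch of that association step as Hungarian matching over a cosine-similarity cost (requires SciPy); the concrete names and shapes are our own illustrative choices.

```python
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def associate_queries(prev_embed: torch.Tensor, cur_embed: torch.Tensor):
    """Match per-frame instance queries by cosine similarity.

    prev_embed: (N, C) query embeddings from frame t-1
    cur_embed:  (N, C) query embeddings from frame t
    Returns, for each previous instance, the index of its one-to-one
    match among the current-frame queries (Hungarian matching).
    """
    sim = F.normalize(prev_embed, dim=-1) @ F.normalize(cur_embed, dim=-1).T
    row, col = linear_sum_assignment((-sim).numpy())   # maximize total similarity
    return torch.as_tensor(col)

prev_ids = torch.arange(10)                 # track ids carried from frame t-1
match = associate_queries(torch.randn(10, 256), torch.randn(10, 256))
cur_ids = prev_ids[match.argsort()]         # propagate ids to current-frame queries
print(cur_ids)
```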
#### Query as Linking Multi-Tasks

| Year | Venue | Acronym | Paper Title | Code/Project |
|:----:|:-----:|:-------------------:|-------------|--------------|
| 2022 | ECCV | Panoptic-PartFormer | [Panoptic-PartFormer: Learning a Unified Model for Panoptic Part Segmentation](https://arxiv.org/abs/2204.04655) | [Code](https://github.com/lxtGH/Panoptic-PartFormer) |
| 2022 | ECCV | PolyphonicFormer | [PolyphonicFormer: Unified Query Learning for Depth-aware Video Panoptic Segmentation](https://arxiv.org/abs/2112.02582) | [Code](https://github.com/HarborYuan/PolyphonicFormer) |
| 2022 | CVPR | PanopticDepth | [PanopticDepth: A Unified Framework for Depth-aware Panoptic Segmentation](https://arxiv.org/abs/2206.00468) | [Code](https://github.com/NaiyuGao/PanopticDepth) |
| 2022 | ECCV | FashionFormer | [FashionFormer: A Simple, Effective and Unified Baseline for Human Fashion Segmentation and Recognition](https://arxiv.org/abs/2204.04654) | [Code](https://github.com/xushilin1/FashionFormer) |
| 2022 | ECCV | InvPT | [InvPT: Inverted Pyramid Multi-task Transformer for Dense Scene Understanding](https://arxiv.org/abs/2203.07997) | [Code](https://github.com/prismformore/InvPT) |
| 2023 | CVPR | UNINEXT | [Universal Instance Perception as Object Discovery and Retrieval](https://arxiv.org/abs/2303.06674) | [Code](https://github.com/MasterBin-IIAU/UNINEXT) |
| 2024 | CVPR | GLEE | [GLEE: General Object Foundation Model for Images and Videos at Scale](https://arxiv.org/abs/2312.09158) | [Code](https://glee-vision.github.io/) |
| 2024 | CVPR | UniVS | [UniVS: Unified and Universal Video Segmentation with Prompts as Queries](https://arxiv.org/abs/2402.18115) | [Code](https://github.com/MinghanLi/UniVS) |
| 2024 | CVPR | OMG-Seg | [OMG-Seg: Is One Model Good Enough For All Segmentation?](https://arxiv.org/abs/2401.10229) | [Code](https://github.com/lxtGH/OMG-Seg) |

### Conditional Query Generation

#### Conditional Query Fusion on Language Features

| Year | Venue | Acronym | Paper Title | Code/Project |
|:----:|:-----:|:--------------:|-------------|--------------|
| 2021 | ICCV | VLT | [Vision-Language Transformer and Query Generation for Referring Segmentation](https://arxiv.org/abs/2108.05565) | [Code](https://github.com/henghuiding/Vision-Language-Transformer) |
| 2022 | CVPR | LAVT | [LAVT: Language-Aware Vision Transformer for Referring Image Segmentation](https://arxiv.org/abs/2112.02244) | [Code](https://github.com/yz93/LAVT-RIS) |
| 2022 | CVPR | ReSTR | [ReSTR: Convolution-free Referring Image Segmentation Using Transformers](https://arxiv.org/abs/2203.16768) | N/A |
| 2022 | CVPR | CRIS | [CRIS: CLIP-Driven Referring Image Segmentation](https://arxiv.org/abs/2111.15174) | [Code](https://github.com/DerrickWang005/CRIS.pytorch) |
| 2022 | CVPR | MTTR | [End-to-End Referring Video Object Segmentation with Multimodal Transformers](https://arxiv.org/abs/2111.14821) | [Code](https://github.com/mttr2021/MTTR) |
| 2022 | CVPR | LBDT | [Language-Bridged Spatial-Temporal Interaction for Referring Video Object Segmentation](https://arxiv.org/abs/2206.03789) | [Code](https://github.com/dzh19990407/LBDT) |
| 2022 | CVPR | ReferFormer | [Language as Queries for Referring Video Object Segmentation](https://arxiv.org/abs/2201.00487) | [Code](https://github.com/wjn922/ReferFormer) |
| 2024 | CVPR | MaskGrounding | [Mask Grounding for Referring Image Segmentation](https://arxiv.org/abs/2312.12198) | [Code](https://yxchng.github.io/projects/mask-grounding/) |
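These referring-segmentation methods share one mechanism: the object query is generated from, or fused with, the sentence representation, so the decoder only searches for the described object (ReferFormer, for instance, uses the language feature directly as queries). Below is a sketch of conditioning a small set of queries on word-level text features via cross-attention; the module shapes and the residual design are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LanguageConditionedQueries(nn.Module):
    """Generate object queries conditioned on a referring expression."""

    def __init__(self, num_queries: int = 5, dim: int = 256):
        super().__init__()
        self.base = nn.Embedding(num_queries, dim)        # task-level learnable queries
        self.cross = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_tokens: torch.Tensor):          # (B, L, C) word features
        B = text_tokens.shape[0]
        q = self.base.weight.unsqueeze(0).expand(B, -1, -1)
        # Queries absorb sentence semantics before ever touching the image.
        fused, _ = self.cross(q, text_tokens, text_tokens)
        return self.norm(q + fused)                        # (B, N, C) language-aware queries

queries = LanguageConditionedQueries()(torch.randn(2, 12, 256))
print(queries.shape)  # torch.Size([2, 5, 256])
```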
#### Conditional Query Fusion on Cross Image Features

| Year | Venue | Acronym | Paper Title | Code/Project |
|:----:|:-------:|:---------------:|-------------|--------------|
| 2021 | NeurIPS | CyCTR | [Few-Shot Segmentation via Cycle-Consistent Transformer](https://arxiv.org/abs/2106.02320) | [Code](https://github.com/GengDavid/CyCTR) |
| 2022 | CVPR | MatteFormer | [MatteFormer: Transformer-Based Image Matting via Prior-Tokens](https://arxiv.org/abs/2203.15662) | [Code](https://github.com/webtoon/matteformer) |
| 2022 | ECCV | SegDeformer | [A Transformer-based Decoder for Semantic Segmentation with Multi-level Context Mining](https://www.ecva.net/papers/eccv_2022/papers_ECCV/papers/136880617.pdf) | [Code](https://github.com/lygsbw/segdeformer) |
| 2022 | arXiv | StructToken | [StructToken: Rethinking Semantic Segmentation with Structural Prior](https://arxiv.org/abs/2203.12612) | N/A |
| 2022 | NeurIPS | MM-Former | [Mask Matching Transformer for Few-Shot Segmentation](https://arxiv.org/abs/2301.01208) | [Code](https://github.com/jiaosiyu1999/mmformer) |
| 2022 | ECCV | AAFormer | [Adaptive Agent Transformer for Few-shot Segmentation](https://www.ecva.net/papers/eccv_2022/papers_ECCV/papers/136890035.pdf) | N/A |
| 2023 | arXiv | ReferenceTwice | [Reference Twice: A Simple and Unified Baseline for Few-Shot Instance Segmentation](https://arxiv.org/abs/2301.01156) | [Code](https://github.com/hanyue1648/RefT) |

### Tuning Foundation Models

#### Vision Adapter

| Year | Venue | Acronym | Paper Title | Code/Project |
|:----:|:-----:|:-----------:|-------------|--------------|
| 2022 | CVPR | CoCoOp | [Conditional Prompt Learning for Vision-Language Models](https://arxiv.org/abs/2203.05557) | [Code](https://github.com/KaiyangZhou/CoOp) |
| 2022 | ECCV | Tip-Adapter | [Tip-Adapter: Training-free Adaption of CLIP for Few-shot Classification](https://arxiv.org/abs/2111.03930) | [Code](https://github.com/gaopengcuhk/Tip-Adapter) |
| 2022 | ECCV | EVL | [Frozen CLIP Models are Efficient Video Learners](https://arxiv.org/abs/2208.03550) | [Code](https://github.com/OpenGVLab/efficient-video-recognition) |
| 2023 | ICLR | ViT-Adapter | [Vision Transformer Adapter for Dense Predictions](https://arxiv.org/abs/2205.08534) | [Code](https://github.com/czczup/ViT-Adapter) |
| 2022 | CVPR | DenseCLIP | [DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting](https://arxiv.org/abs/2112.01518) | [Code](https://github.com/raoyongming/DenseCLIP) |
| 2022 | CVPR | CLIPSeg | [Image Segmentation Using Text and Image Prompts](https://arxiv.org/abs/2112.10003) | [Code](https://eckerlab.org/code/clipseg) |
| 2023 | CVPR | OneFormer | [OneFormer: One Transformer to Rule Universal Image Segmentation](https://arxiv.org/abs/2211.06220) | [Code](https://github.com/SHI-Labs/OneFormer) |
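The adapter line of work keeps the foundation model frozen and learns only a few lightweight modules on top of it. Below is a sketch of the standard residual bottleneck adapter (down-project, nonlinearity, up-project) attached to frozen transformer features; the dim/4 reduction and zero-initialized up-projection are common choices we assume here for illustration.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Residual adapter: the only trainable piece on a frozen backbone."""

    def __init__(self, dim: int = 768, reduction: int = 4):
        super().__init__()
        hidden = dim // reduction
        self.down = nn.Linear(dim, hidden)
        self.up = nn.Linear(hidden, dim)
        nn.init.zeros_(self.up.weight)        # start as identity so tuning is stable
        nn.init.zeros_(self.up.bias)

    def forward(self, x):                      # (B, N, C) frozen transformer features
        return x + self.up(torch.relu(self.down(x)))

backbone = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
for p in backbone.parameters():
    p.requires_grad_(False)                    # the foundation model stays frozen
adapter = BottleneckAdapter()
out = adapter(backbone(torch.randn(2, 196, 768)))
print(sum(p.numel() for p in adapter.parameters()))  # tiny next to the backbone
```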
#### Open Vocabulary Learning

| Year | Venue | Acronym | Paper Title | Code/Project |
|:----:|:-----:|:---------:|-------------|--------------|
| 2021 | CVPR | OVR-CNN | [Open-Vocabulary Object Detection Using Captions](https://arxiv.org/abs/2011.10678) | [Code](https://github.com/alirezazareian/ovr-cnn) |
| 2022 | ICLR | ViLD | [Open-vocabulary Object Detection via Vision and Language Knowledge Distillation](https://arxiv.org/abs/2104.13921) | [Code](https://github.com/tensorflow/tpu/tree/master/models/official/detection/projects/vild) |
| 2022 | ECCV | Detic | [Detecting Twenty-thousand Classes using Image-level Supervision](https://arxiv.org/abs/2201.02605) | [Code](https://github.com/facebookresearch/Detic) |
| 2022 | ECCV | OV-DETR | [Open-Vocabulary DETR with Conditional Matching](https://arxiv.org/abs/2203.11876) | [Code](https://github.com/yuhangzang/OV-DETR) |
| 2023 | ICLR | F-VLM | [F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models](https://arxiv.org/abs/2209.15639) | [Code](https://sites.google.com/view/f-vlm/home) |
| 2022 | ECCV | MViT | [Class-agnostic Object Detection with Multi-modal Transformer](https://arxiv.org/abs/2111.11430) | [Code](https://github.com/mmaaz60/mvits_for_class_agnostic_od) |
| 2022 | ECCV | OpenSeg | [Scaling Open-Vocabulary Image Segmentation with Image-Level Labels](https://arxiv.org/abs/2112.12143) | [Code](https://github.com/tensorflow/tpu/tree/master/models/official/detection/projects/openseg) |
| 2022 | ICLR | LSeg | [Language-driven Semantic Segmentation](https://arxiv.org/abs/2201.03546) | [Code](https://github.com/isl-org/lang-seg) |
| 2022 | ECCV | SimSeg | [A Simple Baseline for Open Vocabulary Semantic Segmentation with Pre-trained Vision-language Model](https://arxiv.org/abs/2112.14757) | [Code](https://github.com/MendelXu/zsseg.baseline) |
| 2022 | ECCV | MaskCLIP | [Extract Free Dense Labels from CLIP](https://arxiv.org/abs/2112.01071) | [Code](https://github.com/chongzhou96/MaskCLIP) |
| 2021 | ICCV | UVO | [Unidentified Video Objects: A Benchmark for Dense, Open-World Segmentation](https://arxiv.org/abs/2104.04691) | [Project](https://sites.google.com/view/unidentified-video-object) |
| 2023 | arXiv | CGG | [Betrayed by Captions: Joint Caption Grounding and Generation for Open Vocabulary Instance Segmentation](https://arxiv.org/abs/2301.00805) | [Code](https://github.com/jzwu48033552/betrayed-by-captions) |
| 2022 | TPAMI | ES | [Open-World Entity Segmentation](https://arxiv.org/abs/2107.14228) | [Code](https://github.com/dvlab-research/Entity/) |
| 2022 | CVPR | OW-DETR | [OW-DETR: Open-world Detection Transformer](https://arxiv.org/abs/2112.01513) | [Code](https://github.com/akshitac8/OW-DETR) |
| 2023 | CVPR | PROB | [PROB: Probabilistic Objectness for Open World Object Detection](https://arxiv.org/abs/2212.01424) | [Code](https://github.com/orrzohar/PROB) |
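The open-vocabulary methods above replace a fixed classifier with class-name text embeddings from a vision-language model, so new categories only require new prompts. Below is a sketch of classifying region or mask embeddings against normalized text embeddings; in practice both sides come from a model like CLIP, while here random stand-ins and the temperature value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def open_vocab_classify(region_embeds, text_embeds, temperature: float = 0.01):
    """Score each region embedding against class-name text embeddings.

    region_embeds: (N, C) mask/box embeddings from the segmenter
    text_embeds:   (K, C) embeddings of K class prompts (e.g., from CLIP);
                   swapping in new prompts changes the vocabulary, no retraining.
    """
    region = F.normalize(region_embeds, dim=-1)
    text = F.normalize(text_embeds, dim=-1)
    logits = region @ text.T / temperature      # cosine similarity, sharpened
    return logits.softmax(dim=-1)               # (N, K) class probabilities

probs = open_vocab_classify(torch.randn(100, 512), torch.randn(20, 512))
print(probs.shape, probs.sum(-1)[:3])  # torch.Size([100, 20]); rows sum to 1
```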
### Related Domains and Beyond

#### Point Cloud Segmentation

| Year | Venue | Acronym | Paper Title | Code/Project |
|:----:|:-------:|:----------------------:|-------------|--------------|
| 2021 | ICCV | Point Transformer | [Point Transformer](https://arxiv.org/abs/2012.09164) | N/A |
| 2021 | CVM | PCT | [PCT: Point Cloud Transformer](https://arxiv.org/abs/2012.09688) | [Code](https://github.com/MenghaoGuo/PCT) |
| 2022 | CVPR | Stratified Transformer | [Stratified Transformer for 3D Point Cloud Segmentation](https://arxiv.org/abs/2203.14508) | [Code](https://github.com/dvlab-research/Stratified-Transformer) |
| 2022 | CVPR | Point-BERT | [Point-BERT: Pre-training 3D Point Cloud Transformers with Masked Point Modeling](https://arxiv.org/abs/2111.14819) | [Code](https://github.com/lulutang0608/Point-BERT) |
| 2022 | ECCV | Point-MAE | [Masked Autoencoders for Point Cloud Self-supervised Learning](https://arxiv.org/abs/2203.06604) | [Code](https://github.com/Pang-Yatian/Point-MAE) |
| 2022 | NeurIPS | Point-M2AE | [Point-M2AE: Multi-scale Masked Autoencoders for Hierarchical Point Cloud Pre-training](https://arxiv.org/abs/2205.14401) | [Code](https://github.com/ZrrSkywalker/Point-M2AE) |
| 2022 | ICRA | Mask3D | [Mask3D for 3D Semantic Instance Segmentation](https://arxiv.org/abs/2210.03105) | [Code](https://github.com/JonasSchult/Mask3D) |
| 2023 | AAAI | SPFormer | [Superpoint Transformer for 3D Scene Instance Segmentation](https://arxiv.org/abs/2211.15766) | [Code](https://github.com/sunjiahao1999/SPFormer) |
| 2023 | AAAI | PUPS | [PUPS: Point Cloud Unified Panoptic Segmentation](https://arxiv.org/abs/2302.06185) | N/A |

#### Domain-aware Segmentation

| Year | Venue | Acronym | Paper Title | Code/Project |
|:----:|:------:|:-------------:|-------------|--------------|
| 2022 | CVPR | DAFormer | [DAFormer: Improving Network Architectures and Training Strategies for Domain-Adaptive Semantic Segmentation](https://arxiv.org/abs/2111.14887) | [Code](https://github.com/lhoyer/DAFormer) |
| 2022 | ECCV | HRDA | [HRDA: Context-Aware High-Resolution Domain-Adaptive Semantic Segmentation](https://arxiv.org/abs/2204.13132) | [Code](https://github.com/lhoyer/HRDA) |
| 2023 | CVPR | MIC | [MIC: Masked Image Consistency for Context-Enhanced Domain Adaptation](https://arxiv.org/abs/2212.01322) | [Code](https://github.com/lhoyer/MIC) |
| 2021 | ACM MM | SFA | [Exploring Sequence Feature Alignment for Domain Adaptive Detection Transformers](https://arxiv.org/abs/2107.12636) | [Code](https://github.com/encounter1997/SFA) |
| 2023 | CVPR | DA-DETR | [DA-DETR: Domain Adaptive Detection Transformer with Information Fusion](https://arxiv.org/abs/2103.17084) | N/A |
| 2022 | ECCV | MTTrans | [MTTrans: Cross-Domain Object Detection with Mean-Teacher Transformer](https://arxiv.org/abs/2205.01643) | [Code](https://github.com/Lafite-Yu/MTTrans-OpenSource) |
| 2022 | arXiv | Sentence-Seg | [The Devil is in the Labels: Semantic Segmentation from Sentences](https://arxiv.org/abs/2202.02002) | N/A |
| 2023 | ICLR | LMSeg | [LMSeg: Language-guided Multi-dataset Segmentation](https://arxiv.org/abs/2302.13495) | N/A |
| 2022 | CVPR | UniDet | [Simple Multi-dataset Detection](https://arxiv.org/abs/2102.13086) | [Code](https://github.com/xingyizhou/UniDet) |
| 2023 | CVPR | Detection Hub | [Detection Hub: Unifying Object Detection Datasets via Query Adaptation on Language Embedding](https://arxiv.org/abs/2206.03484) | N/A |
| 2022 | CVPR | WD2 | [Unifying Panoptic Segmentation for Autonomous Driving](https://openaccess.thecvf.com/content/CVPR2022/papers/Zendel_Unifying_Panoptic_Segmentation_for_Autonomous_Driving_CVPR_2022_paper.pdf) | [Data](https://github.com/ozendelait/wilddash_scripts) |
| 2023 | arXiv | TarVIS | [TarViS: A Unified Approach for Target-based Video Segmentation](https://arxiv.org/abs/2301.02657) | [Code](https://github.com/Ali2500/TarViS) |
#### Label and Model Efficient Segmentation

| Year | Venue | Acronym | Paper Title | Code/Project |
|:----:|:-------:|:-----------:|-------------|--------------|
| 2022 | CVPR | MCTformer | [Multi-class Token Transformer for Weakly Supervised Semantic Segmentation](https://arxiv.org/abs/2203.02891) | [Code](https://github.com/xulianuwa/MCTformer) |
| 2020 | CVPR | PCM | [Self-supervised Equivariant Attention Mechanism for Weakly Supervised Semantic Segmentation](https://arxiv.org/abs/2004.04581) | [Code](https://github.com/YudeWang/SEAM) |
| 2022 | ECCV | ViT-PCM | [Max Pooling with Vision Transformers Reconciles Class and Shape in Weakly Supervised Semantic Segmentation](https://arxiv.org/abs/2210.17400) | [Code](https://github.com/deepplants/ViT-PCM) |
| 2021 | ICCV | DINO | [Emerging Properties in Self-Supervised Vision Transformers](https://arxiv.org/abs/2104.14294) | [Code](https://github.com/facebookresearch/dino) |
| 2021 | BMVC | LOST | [Localizing Objects with Self-Supervised Transformers and no Labels](https://arxiv.org/abs/2109.14279) | [Code](https://github.com/valeoai/LOST) |
| 2022 | ICLR | STEGO | [Unsupervised Semantic Segmentation by Distilling Feature Correspondences](https://arxiv.org/abs/2203.08414) | [Code](https://github.com/mhamilton723/STEGO) |
| 2022 | NeurIPS | ReCo | [ReCo: Retrieve and Co-segment for Zero-shot Transfer](https://arxiv.org/abs/2206.07045) | [Code](https://github.com/NoelShin/reco) |
| 2022 | arXiv | MaskDistill | [Discovering Object Masks with Transformers for Unsupervised Semantic Segmentation](https://arxiv.org/abs/2206.06363) | N/A |
| 2022 | CVPR | FreeSOLO | [FreeSOLO: Learning to Segment Objects without Annotations](https://arxiv.org/abs/2202.12181) | [Code](http://github.com/NVlabs/FreeSOLO) |
| 2023 | CVPR | CutLER | [Cut and Learn for Unsupervised Object Detection and Instance Segmentation](https://arxiv.org/abs/2301.11320) | [Code](https://github.com/facebookresearch/CutLER) |
| 2022 | CVPR | TokenCut | [Self-Supervised Transformers for Unsupervised Object Discovery using Normalized Cut](https://arxiv.org/abs/2202.11539) | [Code](https://github.com/YangtaoWANG95/TokenCut) |
| 2022 | ICLR | MobileViT | [MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer](https://arxiv.org/abs/2110.02178) | [Code](https://github.com/apple/ml-cvnets) |
| 2023 | arXiv | EMO | [Rethinking Mobile Block for Efficient Neural Models](https://arxiv.org/abs/2301.01146) | [Code](https://github.com/zhangzjn/EMO) |
| 2022 | CVPR | TopFormer | [TopFormer: Token Pyramid Transformer for Mobile Semantic Segmentation](https://arxiv.org/abs/2204.05525) | [Code](https://github.com/hustvl/TopFormer) |
| 2023 | ICLR | SeaFormer | [SeaFormer: Squeeze-enhanced Axial Transformer for Mobile Semantic Segmentation](https://arxiv.org/abs/2301.13156) | [Code](https://github.com/fudan-zvg/SeaFormer) |
#### Class Agnostic Segmentation and Tracking

| Year | Venue | Acronym | Paper Title | Code/Project |
|:----:|:-------:|:-----------:|-------------|--------------|
| 2022 | CVPR | Transfiner | [Mask Transfiner for High-Quality Instance Segmentation](https://arxiv.org/abs/2111.13673) | [Code](https://github.com/SysCV/transfiner) |
| 2022 | ECCV | VMT | [Video Mask Transfiner for High-Quality Video Instance Segmentation](https://arxiv.org/abs/2207.14012) | [Code](https://github.com/SysCV/vmt) |
| 2022 | arXiv | SimpleClick | [SimpleClick: Interactive Image Segmentation with Simple Vision Transformers](https://arxiv.org/abs/2210.11006) | [Code](https://github.com/uncbiag/simpleclick) |
| 2023 | ICLR | PatchDCT | [PatchDCT: Patch Refinement for High Quality Instance Segmentation](https://arxiv.org/abs/2302.02693) | [Code](https://github.com/olivia-w12/PatchDCT) |
| 2019 | ICCV | STM | [Video Object Segmentation using Space-Time Memory Networks](https://arxiv.org/abs/1904.00607) | [Code](https://github.com/seoungwugoh/STM) |
| 2021 | NeurIPS | AOT | [Associating Objects with Transformers for Video Object Segmentation](https://arxiv.org/abs/2106.02638) | [Code](https://github.com/z-x-yang/AOT) |
| 2021 | NeurIPS | STCN | [Rethinking Space-Time Networks with Improved Memory Coverage for Efficient Video Object Segmentation](https://arxiv.org/abs/2106.05210) | [Code](https://github.com/hkchengrex/STCN) |
| 2022 | ECCV | XMem | [XMem: Long-Term Video Object Segmentation with an Atkinson-Shiffrin Memory Model](https://arxiv.org/abs/2207.07115) | [Code](https://hkchengrex.github.io/XMem) |
| 2022 | CVPR | PCVOS | [Per-Clip Video Object Segmentation](https://arxiv.org/abs/2208.01924) | [Code](https://github.com/pkyong95/PCVOS) |
| 2023 | CVPR | N/A | [Look Before You Match: Instance Understanding Matters in Video Object Segmentation](https://arxiv.org/abs/2212.06826) | N/A |

#### Medical Image Segmentation

| Year | Venue | Acronym | Paper Title | Code/Project |
|:----:|:-------------:|:---------:|-------------|--------------|
| 2020 | BIBM | CellDETR | [Attention-Based Transformers for Instance Segmentation of Cells in Microstructures](https://arxiv.org/abs/2011.09763) | [Code](https://github.com/ChristophReich1996/Cell-DETR) |
| 2021 | arXiv | TransUNet | [TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation](https://arxiv.org/abs/2102.04306) | [Code](https://github.com/Beckschen/TransUNet) |
| 2022 | ECCV Workshop | Swin-Unet | [Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation](https://arxiv.org/abs/2105.05537) | [Code](https://github.com/HuCaoFighting/Swin-Unet) |
| 2021 | MICCAI | TransFuse | [TransFuse: Fusing Transformers and CNNs for Medical Image Segmentation](https://arxiv.org/abs/2102.08005) | [Code](https://github.com/Rayicer/TransFuse) |
| 2022 | WACV | UNETR | [UNETR: Transformers for 3D Medical Image Segmentation](https://arxiv.org/abs/2103.10504) | [Code](https://github.com/Project-MONAI/research-contributions/tree/main/UNETR) |
## Acknowledgement

If you find our survey and repository useful for your research project, please consider citing our paper:

```bibtex
@article{li2023transformer,
  author={Li, Xiangtai and Ding, Henghui and Zhang, Wenwei and Yuan, Haobo and Cheng, Guangliang and Pang, Jiangmiao and Chen, Kai and Liu, Ziwei and Loy, Chen Change},
  title={Transformer-Based Visual Segmentation: A Survey},
  journal={T-PAMI},
  year={2024}
}
```

## Contact

```
xiangtai94@gmail.com (main)
```

```
lxtpku@pku.edu.cn
```

## Related Repos for Segmentation and Detection

Attention Model [Repo](https://github.com/cmhungsteve/Awesome-Transformer-Attention) by Min-Hung (Steve) Chen.

Detection Transformer [Repo](https://github.com/IDEA-Research/awesome-detection-transformer) by IDEA.

Open Vocabulary Learning [Repo](https://github.com/jianzongwu/Awesome-Open-Vocabulary) by PKU and NTU.

--------------------------------------------------------------------------------

/code/README.md:
--------------------------------------------------------------------------------

### Fair Re-Benchmark Code

--------------------------------------------------------------------------------

/figs/survey_pipeline.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lxtGH/Awesome-Segmentation-With-Transformer/9c7a4884c6a23590f11a694f39cd8caa618d3593/figs/survey_pipeline.jpg
--------------------------------------------------------------------------------