├── README.md
├── list_old_20240323.md
└── main
    ├── action.md
    ├── active-learning.md
    ├── adversarial-attacks.md
    ├── anomaly-detection.md
    ├── assessment.md
    ├── audio.md
    ├── augmentation.md
    ├── birds-eye-view.md
    ├── captioning.md
    ├── change-detection.md
    ├── classification-backbone.md
    ├── clustering.md
    ├── completion.md
    ├── compression.md
    ├── cross-view.md
    ├── crowd.md
    ├── deblurring.md
    ├── deepfake-detection.md
    ├── dehazing.md
    ├── denoising.md
    ├── depth.md
    ├── deraining.md
    ├── detection.md
    ├── diffusion.md
    ├── edge.md
    ├── enhancement.md
    ├── face.md
    ├── federated-learning.md
    ├── few-shot-learning.md
    ├── fusion.md
    ├── gait.md
    ├── gaze.md
    ├── generative-model.md
    ├── graph.md
    ├── hand-gesture.md
    ├── high-dynamic-range-imaging.md
    ├── hoi.md
    ├── hyperspectral.md
    ├── illumination.md
    ├── in-painting.md
    ├── incremental-learning.md
    ├── instance-segmentation.md
    ├── knowledge-distillation.md
    ├── lane.md
    ├── layout.md
    ├── lighting.md
    ├── llmlvm.md
    ├── matching.md
    ├── matting.md
    ├── medical.md
    ├── mesh.md
    ├── metric-learning.md
    ├── motion.md
    ├── multi-label.md
    ├── multi-taskmodal.md
    ├── multi-view-stereo.md
    ├── nas.md
    ├── navigation.md
    ├── neural-rendering.md
    ├── ocr.md
    ├── octree.md
    ├── open-world.md
    ├── optical-flow.md
    ├── others.md
    ├── panoptic-segmentation.md
    ├── planning.md
    ├── point-cloud.md
    ├── pose.md
    ├── pruning--quantization.md
    ├── re-identification.md
    ├── recognition.md
    ├── reconstruction.md
    ├── referring.md
    ├── registration.md
    ├── remote-sensing.md
    ├── restoration.md
    ├── retrieval.md
    ├── robotic.md
    ├── salient-detection.md
    ├── scene.md
    ├── self-supervised-learning.md
    ├── semantic-segmentation.md
    ├── shape.md
    ├── slam.md
    ├── snn.md
    ├── style-transfer.md
    ├── super-resolution.md
    ├── survey.md
    ├── synthesis.md
    ├── text-to-imagevideo.md
    ├── texture.md
    ├── time-series.md
    ├── tracking.md
    ├── traffic.md
    ├── transfer-learning.md
    ├── translation.md
    ├── uav.md
    ├── unsupervised-learning.md
    ├── video.md
    ├── visual-grounding.md
    ├── visual-question-answering.md
    ├── visual-reasoning.md
    ├── visual-relationship-detection.md
    ├── voxel.md
    ├── weakly-supervised-learning.md
    └── zero-shot-learning.md

/README.md:
--------------------------------------------------------------------------------
# Transformer-in-Vision [![Awesome](https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg)](https://github.com/sindresorhus/awesome)

A paper list of recent Transformer-based computer vision works. If you find any missing papers, please open an issue or a pull request.

The latest version has been updated; click the links below to view the paper lists and the code (if available). The old version is [20240323](list_old_20240323.md).

**Last updated: 2025/06/06**

## Table of Contents

- [Survey](main/survey.md)
- Recent Papers
  - [Action](main/action.md)
  - [Active Learning](main/active-learning.md)
  - [Adversarial Attacks](main/adversarial-attacks.md)
  - [Anomaly Detection](main/anomaly-detection.md)
  - [Assessment](main/assessment.md)
  - [Augmentation](main/augmentation.md)
  - [Audio](main/audio.md)
  - [Bird's-Eye-View](main/birds-eye-view.md)
  - [Captioning](main/captioning.md)
  - [Change Detection](main/change-detection.md)
  - [Classification (Backbone)](main/classification-backbone.md)
  - [Clustering](main/clustering.md)
  - [Completion](main/completion.md)
  - [Compression](main/compression.md)
  - [Cross-view](main/cross-view.md)
  - [Crowd](main/crowd.md)
  - [Deblurring](main/deblurring.md)
  - [Depth](main/depth.md)
  - [Deepfake Detection](main/deepfake-detection.md)
  - [Dehazing](main/dehazing.md)
  - [Deraining](main/deraining.md)
  - [Denoising](main/denoising.md)
  - [Detection](main/detection.md)
  - [Diffusion](main/diffusion.md)
  - [Edge](main/edge.md)
  - [Enhancement](main/enhancement.md)
  - [Face](main/face.md)
  - [Federated Learning](main/federated-learning.md)
  - [Few-shot Learning](main/few-shot-learning.md)
  - [Fusion](main/fusion.md)
  - [Gait](main/gait.md)
  - [Gaze](main/gaze.md)
  - [Generative Model](main/generative-model.md)
  - [Graph](main/graph.md)
  - [Hand Gesture](main/hand-gesture.md)
  - [High Dynamic Range Imaging](main/high-dynamic-range-imaging.md)
  - [HOI](main/hoi.md)
  - [Hyperspectral](main/hyperspectral.md)
  - [Illumination](main/illumination.md)
  - [Incremental Learning](main/incremental-learning.md)
  - [In-painting](main/in-painting.md)
  - [Instance Segmentation](main/instance-segmentation.md)
  - [Knowledge Distillation](main/knowledge-distillation.md)
  - [Lane](main/lane.md)
  - [Layout](main/layout.md)
  - [Lighting](main/lighting.md)
  - [LLM](main/llmlvm.md)
  - [Matching](main/matching.md)
  - [Matting](main/matting.md)
  - [Medical](main/medical.md)
  - [Mesh](main/mesh.md)
  - [Metric Learning](main/metric-learning.md)
  - [Motion](main/motion.md)
  - [Multi-label](main/multi-label.md)
  - [Multi-task/modal](main/multi-taskmodal.md)
  - [Multi-view Stereo](main/multi-view-stereo.md)
  - [NAS](main/nas.md)
  - [Navigation](main/navigation.md)
  - [Neural Rendering](main/neural-rendering.md)
  - [OCR](main/ocr.md)
  - [Octree](main/octree.md)
  - [Open World](main/open-world.md)
  - [Optical Flow](main/optical-flow.md)
  - [Panoptic Segmentation](main/panoptic-segmentation.md)
  - [Point Cloud](main/point-cloud.md)
  - [Pose](main/pose.md)
  - [Planning](main/planning.md)
  - [Pruning & Quantization](main/pruning--quantization.md)
  - [Recognition](main/recognition.md)
  - [Reconstruction](main/reconstruction.md)
  - [Referring](main/referring.md)
  - [Registration](main/registration.md)
  - [Re-identification](main/re-identification.md)
  - [Remote Sensing](main/remote-sensing.md)
  - [Restoration](main/restoration.md)
  - [Retrieval](main/retrieval.md)
  - [Robotic](main/robotic.md)
  - [Salient Detection](main/salient-detection.md)
  - [Scene](main/scene.md)
  - [Self-supervised Learning](main/self-supervised-learning.md)
  - [Semantic Segmentation](main/semantic-segmentation.md)
  - [Shape](main/shape.md)
  - [SLAM](main/slam.md)
  - [SNN](main/snn.md)
  - [Style Transfer](main/style-transfer.md)
  - [Super-Resolution](main/super-resolution.md)
  - [Synthesis](main/synthesis.md)
  - [Text-to-Image/Video](main/text-to-imagevideo.md)
  - [Texture](main/texture.md)
  - [Time Series](main/time-series.md)
  - [Tracking](main/tracking.md)
  - [Traffic](main/traffic.md)
  - [Transfer Learning](main/transfer-learning.md)
  - [Translation](main/translation.md)
  - [Unsupervised Learning](main/unsupervised-learning.md)
  - [UAV](main/uav.md)
  - [Video](main/video.md)
  - [Visual Grounding](main/visual-grounding.md)
  - [Visual Question Answering](main/visual-question-answering.md)
  - [Visual Reasoning](main/visual-reasoning.md)
  - [Visual Relationship Detection](main/visual-relationship-detection.md)
  - [Voxel](main/voxel.md)
  - [Weakly Supervised Learning](main/weakly-supervised-learning.md)
  - [Zero-Shot Learning](main/zero-shot-learning.md)
  - [Others](main/others.md)
- [Contact & Feedback](#contact--feedback)

## Contact & Feedback

If you have any suggestions about this project, feel free to contact me.

- E-mail: yzhangcst[at]gmail.com

--------------------------------------------------------------------------------
/main/active-learning.md:
--------------------------------------------------------------------------------
### Active Learning
- (arXiv 2022.06) Visual Transformer for Task-aware Active Learning, [[Paper]](https://arxiv.org/pdf/2206.06761.pdf), [[Code]](https://github.com/razvancaramalau/Visual-Transformer-for-Task-aware-Active-Learning)
- (arXiv 2024.11) GCI-ViTAL: Gradual Confidence Improvement with Vision Transformers for Active Learning on Label Noise, [[Paper]](https://arxiv.org/pdf/2411.05898.pdf)
- (arXiv 2025.05) Balancing Accuracy, Calibration, and Efficiency in Active Learning with Vision Transformers Under Label Noise, [[Paper]](https://arxiv.org/pdf/2505.04375.pdf)

--------------------------------------------------------------------------------
/main/adversarial-attacks.md:
--------------------------------------------------------------------------------
### Adversarial Attacks
- (arXiv 2022.06) Exploring Adversarial Attacks and Defenses in Vision Transformers trained with DINO, [[Paper]](https://arxiv.org/pdf/2206.06761.pdf), [[Code]](https://github.com/thobauma/AADefDINO)
- (arXiv 2022.06) Backdoor Attacks on Vision Transformers, [[Paper]](https://arxiv.org/pdf/2206.08477.pdf), [[Code]](https://github.com/UCDvision/backdoor_transformer.git)
- (arXiv 2022.06) Defending Backdoor Attacks on Vision Transformer via Patch Processing, [[Paper]](https://arxiv.org/pdf/2206.12381.pdf)
- (arXiv 2022.07) Towards Efficient Adversarial Training on Vision Transformers, [[Paper]](https://arxiv.org/pdf/2207.10498.pdf)
- (arXiv 2022.08) Understanding Adversarial Robustness of Vision Transformers via Cauchy Problem, [[Paper]](https://arxiv.org/pdf/2208.00906.pdf), [[Code]](https://github.com/TrustAI/ODE4RobustViT)
- (arXiv 2022.08) Analyzing Adversarial Robustness of Vision Transformers against Spatial and Spectral Attacks, [[Paper]](https://arxiv.org/pdf/2208.09602.pdf)
- (arXiv 2023.01) Inference Time Evidences of Adversarial Attacks for Forensic on Transformers, [[Paper]](https://arxiv.org/pdf/2301.13356.pdf)
- (arXiv 2023.03) Transferable Adversarial Attacks on Vision Transformers with Token Gradient Regularization, [[Paper]](https://arxiv.org/pdf/2303.15754.pdf)
- (arXiv 2023.05) On enhancing the robustness of Vision Transformers: Defensive Diffusion, [[Paper]](https://arxiv.org/pdf/2305.08031.pdf), [[Code]](https://github.com/Muhammad-Huzaifaa/Defensive_Diffusion)
- (arXiv 2023.06) Pre-trained transformer for adversarial purification, [[Paper]](https://arxiv.org/pdf/2306.01762.pdf)
- (arXiv 2023.07) Random Position Adversarial Patch for Vision Transformers, [[Paper]](https://arxiv.org/pdf/2307.04066.pdf)
- (arXiv 2023.07) Enhanced Security against Adversarial Examples Using a Random Ensemble of Encrypted Vision Transformer Models, [[Paper]](https://arxiv.org/pdf/2307.13985.pdf)
- (arXiv 2023.09) Exploring Non-additive Randomness on ViT against Query-Based Black-Box Attacks, [[Paper]](https://arxiv.org/pdf/2309.06438.pdf)
- (arXiv 2023.09) RBFormer: Improve Adversarial Robustness of Transformer by Robust Bias, [[Paper]](https://arxiv.org/pdf/2309.13245.pdf)
- (arXiv 2023.10) Multimodal Adversarial Attacks on Vision-Language Tasks via Pre-trained Models, [[Paper]](https://arxiv.org/pdf/2310.04655.pdf)
- (arXiv 2023.10) ConViViT -- A Deep Neural Network Combining Convolutions and Factorized Self-Attention for Human Activity Recognition, [[Paper]](https://arxiv.org/pdf/2310.14416.pdf)
- (arXiv 2023.10) Blacksmith: Fast Adversarial Training of Vision Transformers via a Mixture of Single-step and Multi-step Methods, [[Paper]](https://arxiv.org/pdf/2310.18975.pdf)
- (arXiv 2023.11) DialMAT: Dialogue-Enabled Transformer with Moment-Based Adversarial Training, [[Paper]](https://arxiv.org/pdf/2311.06855.pdf)
- (arXiv 2023.11) Attention Deficit is Ordered! Fooling Deformable Vision Transformers with Collaborative Adversarial Patches, [[Paper]](https://arxiv.org/pdf/2311.12914.pdf)
- (arXiv 2023.12) MIMIR: Masked Image Modeling for Mutual Information-based Adversarial Robustness, [[Paper]](https://arxiv.org/pdf/2312.04960.pdf), [[Code]](https://github.com/xiaoyunxxy/MIMIR)
- (arXiv 2024.01) FullLoRA-AT: Efficiently Boosting the Robustness of Pretrained Vision Transformers, [[Paper]](https://arxiv.org/pdf/2401.01752.pdf)
- (arXiv 2024.02) DeSparsify: Adversarial Attack Against Token Sparsification Mechanisms in Vision Transformers, [[Paper]](https://arxiv.org/pdf/2402.02554.pdf)
- (arXiv 2024.03) Attacking Transformers with Feature Diversity Adversarial Perturbation, [[Paper]](https://arxiv.org/pdf/2403.07942.pdf)
- (arXiv 2024.03) Approximate Nullspace Augmented Finetuning for Robust Vision Transformers, [[Paper]](https://arxiv.org/pdf/2403.10476.pdf)
- (arXiv 2024.05) Not All Prompts Are Secure: A Switchable Backdoor Attack Against Pre-trained Vision Transformers, [[Paper]](https://arxiv.org/pdf/2405.10612.pdf), [[Code]](https://github.com/20000yshust/SWARM)
- (arXiv 2024.07) Query-Efficient Hard-Label Black-Box Attack against Vision Transformers, [[Paper]](https://arxiv.org/pdf/2407.00389.pdf)
- (arXiv 2024.07) TrackPGD: A White-box Attack using Binary Masks against Robust Transformer Trackers, [[Paper]](https://arxiv.org/pdf/2407.03946.pdf)
- (arXiv 2024.07) S-E Pipeline: A Vision Transformer (ViT) based Resilient Classification Pipeline for Medical Imaging Against Adversarial Attacks, [[Paper]](https://arxiv.org/pdf/2407.17587.pdf)
- (arXiv 2024.08) Downstream Transfer Attack: Adversarial Attacks on Downstream Models with Pre-trained Vision Transformers, [[Paper]](https://arxiv.org/pdf/2408.01705.pdf)
- (arXiv 2024.09) ViTGuard: Attention-aware Detection against Adversarial Examples for Vision Transformer, [[Paper]](https://arxiv.org/pdf/2409.13828)
- (arXiv 2024.09) RoWSFormer: A Robust Watermarking Framework with Swin Transformer for Enhanced Geometric Attack Resilience, [[Paper]](https://arxiv.org/pdf/2409.14829)
- (arXiv 2024.10) Backdoor Attack Against Vision Transformers via Attention Gradient-Based Image Erosion, [[Paper]](https://arxiv.org/pdf/2410.22678)
- (arXiv 2024.11) Target-Guided Adversarial Point Cloud Transformer Towards Recognition Against Real-world Corruptions, [[Paper]](https://arxiv.org/pdf/2411.00462), [[Code]](https://github.com/Roywangj/APCT)
- (arXiv 2024.12) Megatron: Evasive Clean-Label Backdoor Attacks against Vision Transformer, [[Paper]](https://arxiv.org/pdf/2412.04776)
- (arXiv 2024.12) An Effective and Resilient Backdoor Attack Framework against Deep Neural Networks and Vision Transformers, [[Paper]](https://arxiv.org/pdf/2412.06149)
- (arXiv 2024.12) Distortion-Aware Adversarial Attacks on Bounding Boxes of Object Detectors, [[Paper]](https://arxiv.org/pdf/2412.18815), [[Code]](https://github.com/anonymous20210106/attack_detector)
- (arXiv 2024.12) Evaluating the Adversarial Robustness of Detection Transformers, [[Paper]](https://arxiv.org/pdf/2412.18718)
- (arXiv 2025.01) Protego: Detecting Adversarial Examples for Vision Transformers via Intrinsic Capabilities, [[Paper]](https://arxiv.org/pdf/2501.07044)
- (arXiv 2025.01) Generalized Single-Image-Based Morphing Attack Detection Using Deep Representations from Vision Transformer, [[Paper]](https://arxiv.org/pdf/2501.09817)
- (arXiv 2025.02) Mechanistic Understandings of Representation Vulnerabilities and Engineering Robust Vision Transformers, [[Paper]](https://arxiv.org/pdf/2502.04679)
- (arXiv 2025.03) Robustness Tokens: Towards Adversarial Robustness of Transformers, [[Paper]](https://arxiv.org/pdf/2503.10191)

--------------------------------------------------------------------------------
/main/anomaly-detection.md:
--------------------------------------------------------------------------------
### Anomaly Detection
- (arXiv 2021.04) VT-ADL: A Vision Transformer Network for Image Anomaly Detection and Localization, [[Paper]](https://arxiv.org/pdf/2104.10036.pdf)
- (arXiv 2021.04) Inpainting Transformer for Anomaly Detection, [[Paper]](https://arxiv.org/pdf/2104.13897.pdf)
- (arXiv 2022.03) AnoViT: Unsupervised Anomaly Detection and Localization with Vision Transformer-based Encoder-Decoder, [[Paper]](https://arxiv.org/pdf/2203.10808.pdf)
- (arXiv 2022.06) Anomaly detection in surveillance videos using transformer based attention model, [[Paper]](https://arxiv.org/pdf/2206.01524.pdf), [[Code]](https://github.com/kapildeshpande/Anomaly-Detection-in-Surveillance-Videos)
- (arXiv 2022.06) Multi-Contextual Predictions with Vision Transformer for Video Anomaly Detection, [[Paper]](https://arxiv.org/pdf/2206.08568.pdf)
- (arXiv 2022.08) HaloAE: An HaloNet based Local Transformer Auto-Encoder for Anomaly Detection and Localization, [[Paper]](https://arxiv.org/pdf/2208.03486.pdf), [[Code]](https://anonymous.4open.science/r/HaloAE-E27B/README.md)
- (arXiv 2022.08) ADTR: Anomaly Detection Transformer with Feature Reconstruction, [[Paper]](https://arxiv.org/pdf/2209.01816.pdf)
- (arXiv 2022.09) Self-Supervised Masked Convolutional Transformer Block for Anomaly Detection, [[Paper]](https://arxiv.org/pdf/2209.12148.pdf), [[Code]](https://github.com/ristea/ssmctb)
- (arXiv 2022.09) Anomaly Detection in Aerial Videos with Transformers, [[Paper]](https://arxiv.org/pdf/2209.13363.pdf), [[Code]](https://github.com/jin-pu/drone-anomaly)
- (arXiv 2022.10) Masked Transformer for image Anomaly Localization, [[Paper]](https://arxiv.org/pdf/2210.15540.pdf)
- (arXiv 2022.11) Generalizable Industrial Visual Anomaly Detection with Self-Induction Vision Transformer, [[Paper]](https://arxiv.org/pdf/2211.12311.pdf)
- (arXiv 2023.03) Incremental Self-Supervised Learning Based on Transformer for Anomaly Detection and Localization, [[Paper]](https://arxiv.org/pdf/2303.17354.pdf)
- (arXiv 2023.03) Unsupervised Anomaly Detection with Local-Sensitive VQVAE and Global-Sensitive Transformers, [[Paper]](https://arxiv.org/pdf/2303.17505.pdf)
- (arXiv 2023.03) Visual Anomaly Detection via Dual-Attention Transformer and Discriminative Flow, [[Paper]](https://arxiv.org/pdf/2303.17882.pdf)
- (arXiv 2023.05) Multiresolution Feature Guidance Based Transformer for Anomaly Detection, [[Paper]](https://arxiv.org/pdf/2305.14880.pdf)
- (arXiv 2023.06) Efficient Anomaly Detection with Budget Annotation Using Semi-Supervised Residual Transformer, [[Paper]](https://arxiv.org/pdf/2306.03492.pdf), [[Code]](https://github.com/BeJane/Semi_REST)
- (arXiv 2023.07) SelFormaly: Towards Task-Agnostic Unified Anomaly Detection, [[Paper]](https://arxiv.org/pdf/2307.12540.pdf)
- (arXiv 2023.08) Patch-wise Auto-Encoder for Visual Anomaly Detection, [[Paper]](https://arxiv.org/pdf/2308.00429.pdf)
- (arXiv 2023.09) Mask2Anomaly: Mask Transformer for Universal Open-set Segmentation, [[Paper]](https://arxiv.org/pdf/2309.04579.pdf)
- (arXiv 2023.10) Hierarchical Vector Quantized Transformer for Multi-class Unsupervised Anomaly Detection, [[Paper]](https://arxiv.org/pdf/2310.14228.pdf), [[Code]](https://github.com/RuiyingLu/HVQ-Trans)
- (arXiv 2023.12) Intelligent Anomaly Detection for Lane Rendering Using Transformer with Self-Supervised Pre-Training and Customized Fine-Tuning, [[Paper]](https://arxiv.org/pdf/2312.04398.pdf)
- (arXiv 2023.12) Exploring Plain ViT Reconstruction for Multi-class Unsupervised Anomaly Detection, [[Paper]](https://arxiv.org/pdf/2312.07495.pdf), [[Code]](https://zhangzjn.github.io/projects/ViTAD)
- (arXiv 2024.06) Prior Normality Prompt Transformer for Multi-class Industrial Image Anomaly Detection, [[Paper]](https://arxiv.org/pdf/2406.11507.pdf)
- (arXiv 2024.08) Feature Purified Transformer With Cross-level Feature Guiding Decoder For Multi-class OOD and Anomaly Detection, [[Paper]](https://arxiv.org/pdf/2406.15396.pdf)
- (arXiv 2024.08) PoseWatch: A Transformer-based Architecture for Human-centric Video Anomaly Detection Using Spatio-temporal Pose Tokenization, [[Paper]](https://arxiv.org/pdf/2408.15185.pdf)
- (arXiv 2024.10) RADAR: Robust Two-stage Modality-incomplete Industrial Anomaly Detection, [[Paper]](https://arxiv.org/pdf/2410.01737.pdf), [[Code]](https://anonymous.4open.science/r/RADAR-54F2/README.md)
- (arXiv 2025.03) Transformer Based Self-Context Aware Prediction for Few-Shot Anomaly Detection in Videos, [[Paper]](https://arxiv.org/pdf/2503.00670.pdf)

--------------------------------------------------------------------------------
/main/assessment.md:
--------------------------------------------------------------------------------
### Assessment
- (arXiv 2021.01) Transformer for Image Quality Assessment, [[Paper]](https://arxiv.org/abs/2101.01097), [[Code]](https://github.com/junyongyou/triq)
- (arXiv 2021.04) Perceptual Image Quality Assessment with Transformers, [[Paper]](https://arxiv.org/abs/2104.14730), [[Code]](https://github.com/manricheon/IQT)
- (arXiv 2021.08) No-Reference Image Quality Assessment via Transformers, Relative Ranking, and Self-Consistency, [[Paper]](https://arxiv.org/pdf/2108.06858.pdf), [[Code]](https://github.com/isalirezag/TReS)
- (arXiv 2021.08) MUSIQ: Multi-scale Image Quality Transformer, [[Paper]](https://arxiv.org/pdf/2108.05997.pdf), [[Code]](https://github.com/google-research/google-research/tree/master/musiq)
- (arXiv 2021.10) VTAMIQ: Transformers for Attention Modulated Image Quality Assessment, [[Paper]](https://arxiv.org/pdf/2110.01655.pdf)
- (arXiv 2021.12) Learning Transformer Features for Image Quality Assessment, [[Paper]](https://arxiv.org/pdf/2112.00485.pdf)
- (arXiv 2022.03) Visual Mechanisms Inspired Efficient Transformers for Image and Video Quality Assessment, [[Paper]](https://arxiv.org/pdf/2203.14557.pdf)
- (arXiv 2022.04) Multi-Scale Features and Parallel Transformers Based Image Quality Assessment, [[Paper]](https://arxiv.org/pdf/2204.09779.pdf), [[Code]](https://github.com/KomalPal9610/IQA)
- (arXiv 2022.05) SwinIQA: Learned Swin Distance for Compressed Image Quality Assessment, [[Paper]](https://arxiv.org/pdf/2205.04264.pdf)
- (arXiv 2022.05) MSTRIQ: No Reference Image Quality Assessment Based on Swin Transformer with Multi-Stage Fusion, [[Paper]](https://arxiv.org/pdf/2205.10101.pdf)
- (arXiv 2022.08) DAHiTrA: Damage Assessment Using a Novel Hierarchical Transformer Architecture, [[Paper]](https://arxiv.org/pdf/2208.02205.pdf)
- (arXiv 2022.10) DCVQE: A Hierarchical Transformer for Video Quality Assessment, [[Paper]](https://arxiv.org/pdf/2210.04377.pdf)
- (arXiv 2023.03) ST360IQ: No-Reference Omnidirectional Image Quality Assessment with Spherical Vision Transformers, [[Paper]](https://arxiv.org/pdf/2303.06907.pdf), [[Code]](https://github.com/Nafiseh-Tofighi/ST360IQ)
- (arXiv 2023.03) MRET: Multi-resolution Transformer for Video Quality Assessment, [[Paper]](https://arxiv.org/pdf/2303.07489.pdf)
- (arXiv 2023.05) Blind Image Quality Assessment via Transformer Predicted Error Map and Perceptual Quality Token, [[Paper]](https://arxiv.org/pdf/2305.09353.pdf), [[Code]](https://github.com/Srache/TempQT)
- (arXiv 2023.08) Local Distortion Aware Efficient Transformer Adaptation for Image Quality Assessment, [[Paper]](https://arxiv.org/pdf/2308.12001.pdf)
- (arXiv 2023.12) Activating Frequency and ViT for 3D Point Cloud Quality Assessment without Reference, [[Paper]](https://arxiv.org/pdf/2312.05972.pdf), [[Code]](https://github.com/o-messai/3D-PCQA)
- (arXiv 2024.01) Video Quality Assessment Based on Swin TransformerV2 and Coarse to Fine Strategy, [[Paper]](https://arxiv.org/pdf/2401.08522.pdf), [[Code]](https://github.com/o-messai/3D-PCQA)
- (arXiv 2024.05) RMT-BVQA: Recurrent Memory Transformer-based Blind Video Quality Assessment for Enhanced Video Content, [[Paper]](https://arxiv.org/pdf/2405.08621.pdf)
- (arXiv 2024.06) DSL-FIQA: Assessing Facial Image Quality via Dual-Set Degradation Learning and Landmark-Guided Transformer, [[Paper]](https://arxiv.org/pdf/2406.09622.pdf), [[Code]](https://dsl-fiqa.github.io/)
- (arXiv 2024.09) Attention Down-Sampling Transformer, Relative Ranking and Self-Consistency for Blind Image Quality Assessment, [[Paper]](https://arxiv.org/pdf/2409.07115.pdf), [[Code]](https://github.com/mas94/ADTRS)

--------------------------------------------------------------------------------
/main/audio.md:
--------------------------------------------------------------------------------
### Audio
- (arXiv 2021.08) The Right to Talk: An Audio-Visual Transformer Approach, [[Paper]](https://arxiv.org/pdf/2108.03256.pdf), [[Code]](https://github.com/uark-cviu/Right2Talk)
- (arXiv 2022.03) SepTr: Separable Transformer for Audio Spectrogram Processing, [[Paper]](https://arxiv.org/pdf/2203.09581.pdf), [[Code]](https://github.com/ristea/septr)
- (arXiv 2022.11) ASiT: Audio Spectrogram vIsion Transformer for General Audio Representation, [[Paper]](https://arxiv.org/pdf/2211.13189.pdf)
- (arXiv 2023.03) Multiscale Audio Spectrogram Transformer for Efficient Audio Classification, [[Paper]](https://arxiv.org/pdf/2303.10757.pdf)
- (arXiv 2023.03) ModEFormer: Modality-Preserving Embedding for Audio-Video Synchronization using Transformers, [[Paper]](https://arxiv.org/pdf/2303.11551.pdf)
- (arXiv 2023.07) AVSegFormer: Audio-Visual Segmentation with Transformer, [[Paper]](https://arxiv.org/pdf/2307.01146.pdf), [[Code]](https://github.com/vvvb-github/AVSegFormer)
- (arXiv 2023.11) Rethink Cross-Modal Fusion in Weakly-Supervised Audio-Visual Video Parsing, [[Paper]](https://arxiv.org/pdf/2311.08151.pdf)
- (arXiv 2023.12) Unveiling the Power of Audio-Visual Early Fusion Transformers with Dense Interactions through Masked Modeling, [[Paper]](https://arxiv.org/pdf/2312.01017.pdf)
- (arXiv 2024.01) Efficient Multiscale Multimodal Bottleneck Transformer for Audio-Video Classification, [[Paper]](https://arxiv.org/pdf/2312.01017.pdf)
- (arXiv 2024.01) Siamese Vision Transformers are Scalable Audio-visual Learners, [[Paper]](https://arxiv.org/pdf/2403.19638.pdf), [[Code]](https://github.com/GenjiB/AVSiam)
- (arXiv 2024.05) A Versatile Diffusion Transformer with Mixture of Noise Levels for Audiovisual Generation, [[Paper]](https://arxiv.org/pdf/2405.13762.pdf), [[Code]](http://avdit2024.github.io/)
- (arXiv 2024.06) MA-AVT: Modality Alignment for Parameter-Efficient Audio-Visual Transformers, [[Paper]](https://arxiv.org/pdf/2406.04930.pdf), [[Code]](https://github.com/enyac-group/MA-AVT)
- (arXiv 2024.06) Taming Data and Transformers for Audio Generation, [[Paper]](https://arxiv.org/pdf/2406.19388.pdf), [[Code]](https://snap-research.github.io/GenAU/)
- (arXiv 2024.06) Masked Generative Video-to-Audio Transformers with Enhanced Synchronicity, [[Paper]](https://arxiv.org/pdf/2407.10387.pdf), [[Code]](https://snap-research.github.io/GenAU/)
- (arXiv 2024.07) Multimodal Emotion Recognition using Audio-Video Transformer Fusion with Cross Attention, [[Paper]](https://arxiv.org/pdf/2407.18552.pdf)
- (arXiv 2024.08) AVESFormer: Efficient Transformer Design for Real-Time Audio-Visual Segmentation, [[Paper]](https://arxiv.org/pdf/2408.01708.pdf)

--------------------------------------------------------------------------------
/main/augmentation.md:
--------------------------------------------------------------------------------
### Augmentation
- (arXiv 2022.10) TokenMixup: Efficient Attention-guided Token-level Data Augmentation for Transformers, [[Paper]](https://arxiv.org/pdf/2210.07562.pdf), [[Code]](https://github.com/mlvlab/TokenMixup)
- (arXiv 2022.12) SMMix: Self-Motivated Image Mixing for Vision Transformers, [[Paper]](https://arxiv.org/pdf/2212.12977.pdf), [[Code]](https://github.com/ChenMnZ/SMMix)
- (arXiv 2023.05) Transformer-based Sequence Labeling for Audio Classification based on MFCCs, [[Paper]](https://arxiv.org/pdf/2305.00417.pdf)

--------------------------------------------------------------------------------
/main/birds-eye-view.md:
--------------------------------------------------------------------------------
### Bird's-Eye-View
- (arXiv 2022.03) BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers, [[Paper]](https://arxiv.org/pdf/2203.17270.pdf), [[Code]](https://github.com/zhiqi-li/BEVFormer)
- (arXiv 2022.05) ViT-BEVSeg: A Hierarchical Transformer Network for Monocular Birds-Eye-View Segmentation, [[Paper]](https://arxiv.org/pdf/2205.15667.pdf), [[Code]](https://github.com/robotvisionmu/ViT-BEVSeg)
- (arXiv 2022.06) PETRv2: A Unified Framework for 3D Perception from Multi-Camera Images, [[Paper]](https://arxiv.org/pdf/2206.01256.pdf)
- (arXiv 2022.06) Efficient and Robust 2D-to-BEV Representation Learning via Geometry-guided Kernel Transformer, [[Paper]](https://arxiv.org/pdf/2206.04584.pdf), [[Code]](https://github.com/hustvl/GKT)
- (arXiv 2022.06) PolarFormer: Multi-camera 3D Object Detection with Polar Transformer, [[Paper]](https://arxiv.org/pdf/2206.15398.pdf), [[Code]](https://github.com/fudan-zvg/PolarFormer)
- (arXiv 2022.07) CoBEVT: Cooperative Bird's Eye View Semantic Segmentation with Sparse Transformers, [[Paper]](https://arxiv.org/pdf/2207.02202.pdf)
- (arXiv 2022.07) UniFormer: Unified Multi-view Fusion Transformer for Spatial-Temporal Representation in Bird's-Eye-View, [[Paper]](https://arxiv.org/pdf/2207.08536.pdf)
- (arXiv 2022.09) A Dual-Cycled Cross-View Transformer Network for Unified Road Layout Estimation and 3D Object Detection in the Bird's-Eye-View, [[Paper]](https://arxiv.org/pdf/2209.08844.pdf)
- (arXiv 2022.09) BEV-LGKD: A Unified LiDAR-Guided Knowledge Distillation Framework for BEV 3D Object Detection, [[Paper]](https://arxiv.org/pdf/2212.00623.pdf)
- (arXiv 2023.02) DA-BEV: Depth Aware BEV Transformer for 3D Object Detection, [[Paper]](https://arxiv.org/pdf/2302.13002.pdf)
- (arXiv 2023.03) TBP-Former: Learning Temporal Bird's-Eye-View Pyramid for Joint Perception and Prediction in Vision-Centric Autonomous Driving, [[Paper]](https://arxiv.org/pdf/2303.09998.pdf), [[Code]](https://github.com/MediaBrain-SJTU/TBP-Former)
- (arXiv 2023.04) VoxelFormer: Bird's-Eye-View Feature Generation based on Dual-view Attention for Multi-view 3D Object Detection, [[Paper]](https://arxiv.org/pdf/2304.01054.pdf), [[Code]](https://github.com/Lizhuoling/VoxelFormer-public.git)
- (arXiv 2023.04) FedBEVT: Federated Learning Bird's Eye View Perception Transformer in Road Traffic Systems, [[Paper]](https://arxiv.org/pdf/2304.01534.pdf)
- (arXiv 2023.04) A Cross-Scale Hierarchical Transformer with Correspondence-Augmented Attention for inferring Bird's-Eye-View Semantic Segmentation, [[Paper]](https://arxiv.org/pdf/2304.03650.pdf)
- (arXiv 2023.06) OCBEV: Object-Centric BEV Transformer for Multi-View 3D Object Detection, [[Paper]](https://arxiv.org/pdf/2306.01738.pdf)
- (arXiv 2023.06) An Efficient Transformer for Simultaneous Learning of BEV and Lane Representations in 3D Lane Detection, [[Paper]](https://arxiv.org/pdf/2306.04927.pdf)
- (arXiv 2023.07) HeightFormer: Explicit Height Modeling without Extra Data for Camera-only 3D Object Detection in Bird's Eye View, [[Paper]](https://arxiv.org/pdf/2307.13510.pdf)
- (arXiv 2023.08) UniTR: A Unified and Efficient Multi-Modal Transformer for Bird's-Eye-View Representation, [[Paper]](https://arxiv.org/pdf/2308.07732.pdf), [[Code]](https://github.com/Haiyang-W/UniTR)
- (arXiv 2023.09) FusionFormer: A Multi-sensory Fusion in Bird's-Eye-View and Temporal Consistent Transformer for 3D Object Detection, [[Paper]](https://arxiv.org/pdf/2309.05257.pdf)
- (arXiv 2023.10) Towards Generalizable Multi-Camera 3D Object Detection via Perspective Debiasing, [[Paper]](https://arxiv.org/pdf/2310.11346.pdf)
- (arXiv 2023.12) Towards Efficient 3D Object Detection in Bird's-Eye-View Space for Autonomous Driving: A Convolutional-Only Approach, [[Paper]](https://arxiv.org/pdf/2312.00633.pdf)
- (arXiv 2023.12) BEVNeXt: Reviving Dense BEV Frameworks for 3D Object Detection, [[Paper]](https://arxiv.org/pdf/2312.01696.pdf)
- (arXiv 2023.12) COTR: Compact Occupancy TRansformer for Vision-based 3D Occupancy Prediction, [[Paper]](https://arxiv.org/pdf/2312.01919.pdf)
- (arXiv 2023.12) Learned Fusion: 3D Object Detection using Calibration-Free Transformer Feature Fusion, [[Paper]](https://arxiv.org/pdf/2312.09082.pdf)
- (arXiv 2023.12) Diffusion-Based Particle-DETR for BEV Perception, [[Paper]](https://arxiv.org/pdf/2312.11578.pdf)
- (arXiv 2023.12) Lift-Attend-Splat: Bird's-eye-view camera-lidar fusion using transformers, [[Paper]](https://arxiv.org/pdf/2312.14919.pdf)
- (arXiv 2024.01) WidthFormer: Toward Efficient Transformer-based BEV View Transformation, [[Paper]](https://arxiv.org/pdf/2401.03836.pdf), [[Code]](https://github.com/ChenhongyiYang/WidthFormer)
- (arXiv 2024.02) OccTransformer: Improving BEVFormer for 3D camera-only occupancy prediction, [[Paper]](https://arxiv.org/pdf/2402.18140.pdf)
- (arXiv 2024.03) CLIP-BEVFormer: Enhancing Multi-View Image-Based BEV Detector with Ground Truth Flow, [[Paper]](https://arxiv.org/pdf/2403.08919.pdf)
- (arXiv 2024.07) BEVWorld: A Multimodal World Model for Autonomous Driving via Unified BEV Latent Space, [[Paper]](https://arxiv.org/pdf/2407.05679.pdf), [[Code]](https://github.com/zympsyche/BevWorld)
- (arXiv 2024.07) CarFormer: Self-Driving with Learned Object-Centric Representations, [[Paper]](https://arxiv.org/pdf/2407.15843.pdf), [[Code]](https://github.com/Shamdan17/CarFormer)
- (arXiv 2024.07) RayFormer: Improving Query-Based Multi-Camera 3D Object Detection via Ray-Centric Strategies, [[Paper]](https://arxiv.org/pdf/2407.14923.pdf)
- (arXiv 2024.09) CASPFormer: Trajectory Prediction from BEV Images with Deformable Attention, [[Paper]](https://arxiv.org/pdf/2409.17790.pdf)
- (arXiv 2024.11) Fast and Efficient Transformer-based Method for Bird's Eye View Instance Prediction, [[Paper]](https://arxiv.org/pdf/2411.06851.pdf), [[Code]](https://github.com/miguelag99/Efficient-Instance-Prediction)
- (arXiv 2024.12) Epipolar Attention Field Transformers for Bird's Eye View Semantic Segmentation, [[Paper]](https://arxiv.org/pdf/2412.01595.pdf)
- (arXiv 2024.12) RaCFormer: Towards High-Quality 3D Object Detection via Query-based Radar-Camera Fusion, [[Paper]](https://arxiv.org/pdf/2412.12725.pdf)

--------------------------------------------------------------------------------
/main/change-detection.md:
--------------------------------------------------------------------------------
### Change Detection
- (arXiv 2022.01) A Transformer-Based Siamese Network for Change Detection, [[Paper]](https://arxiv.org/pdf/2201.01293.pdf), [[Code]](https://github.com/wgcban/ChangeFormer)
- (arXiv 2022.07) IDET: Iterative Difference-Enhanced Transformers for High-Quality Change Detection, [[Paper]](https://arxiv.org/pdf/2207.09240.pdf)
- (arXiv 2023.08) UCDFormer: Unsupervised Change Detection Using a Transformer-driven Image Translation, [[Paper]](https://arxiv.org/pdf/2308.01146.pdf), [[Code]](https://github.com/zhu-xlab/UCDFormer)
- (arXiv 2023.09) Changes-Aware Transformer: Learning Generalized Changes Representation, [[Paper]](https://arxiv.org/pdf/2309.13619.pdf)
- (arXiv 2023.10) Transformer-based Multimodal Change Detection with Multitask Consistency Constraints, [[Paper]](https://arxiv.org/pdf/2310.09276.pdf)
- (arXiv 2023.10) TransY-Net: Learning Fully Transformer Networks for Change Detection of Remote Sensing Images, [[Paper]](https://arxiv.org/pdf/2310.14214.pdf), [[Code]](https://github.com/Drchip61/TransYNet)
- (arXiv 2023.11) MS-Former: Memory-Supported Transformer for Weakly Supervised Change Detection with Patch-Level Annotations, [[Paper]](https://arxiv.org/pdf/2311.09726.pdf), [[Code]](https://github.com/guanyuezhen/MS-Former)
- (arXiv 2023.12) Adapting Vision Transformer for Efficient Change Detection, [[Paper]](https://arxiv.org/pdf/2312.04869.pdf)
- (arXiv 2024.06) ChangeViT: Unleashing Plain Vision Transformers for Change Detection, [[Paper]](https://arxiv.org/pdf/2406.12847.pdf), [[Code]](https://github.com/zhuduowang/ChangeViT)
- (arXiv 2024.07) Relating CNN-Transformer Fusion Network for Change Detection, [[Paper]](https://arxiv.org/pdf/2407.03178.pdf), [[Code]](https://github.com/NUST-Machine-Intelligence-Laboratory/RCTNet)
- (arXiv 2025.01) ME-CPT: Multi-Task Enhanced Cross-Temporal Point Transformer for Urban 3D Change Detection, [[Paper]](https://arxiv.org/pdf/2501.14004.pdf), [[Code]](https://github.com/zhangluqi0209/ME-CPT)

--------------------------------------------------------------------------------
/main/clustering.md:
--------------------------------------------------------------------------------
### Clustering
- (arXiv 2022.04) Not All Tokens Are Equal: Human-centric Visual Analysis via Token Clustering Transformer, [[Paper]](https://arxiv.org/pdf/2204.08680.pdf)
- (arXiv 2022.06) Vision Transformer for Contrastive Clustering, [[Paper]](https://arxiv.org/pdf/2206.12925.pdf)
- (arXiv 2023.04) Fairness in Visual Clustering: A Novel Transformer Clustering Approach, [[Paper]](https://arxiv.org/pdf/2304.07408.pdf)
- (arXiv 2023.06) Dynamic Clustering Transformer Network for Point Cloud Segmentation, [[Paper]](https://arxiv.org/pdf/2306.08073.pdf)
- (arXiv 2024.07) TCFormer: Visual Recognition via Token Clustering Transformer, [[Paper]](https://arxiv.org/pdf/2407.11321.pdf), [[Code]](https://github.com/zengwang430521/TCFormer)

--------------------------------------------------------------------------------
/main/completion.md:
--------------------------------------------------------------------------------
### Completion
- (arXiv 2021.03) High-Fidelity Pluralistic Image Completion with Transformers, [[Paper]](https://arxiv.org/pdf/2103.14031.pdf), [[Code]](http://raywzy.com/ICT)
- (arXiv 2021.04) TFill: Image Completion via a Transformer-Based Architecture, [[Paper]](https://arxiv.org/pdf/2111.06707.pdf), [[Code]](https://github.com/yhlleo/MJP)
- (arXiv 2021.08) Embodied BERT: A Transformer Model for Embodied, Language-guided Visual Task Completion, [[Paper]](https://arxiv.org/pdf/2108.04927.pdf), [[Code]](https://github.com/amazon-research/embert)
- (arXiv 2023.03) FishDreamer: Towards Fisheye Semantic Completion via Unified Image Outpainting and Segmentation, [[Paper]](https://arxiv.org/pdf/2303.13842.pdf), [[Code]](https://github.com/MasterHow/FishDreamer)
- (arXiv 2023.04) Contour Completion by Transformers and Its Application to Vector Font Data, [[Paper]](https://arxiv.org/pdf/2304.13988.pdf)
- (arXiv 2023.07) CVSformer: Cross-View Synthesis Transformer for Semantic Scene Completion, [[Paper]](https://arxiv.org/pdf/2307.07938.pdf)
- (arXiv 2023.10) Distance-based Weighted Transformer Network for Image Completion, [[Paper]](https://arxiv.org/pdf/2310.07440.pdf)
- (arXiv 2024.01) CRA-PCN: Point Cloud Completion with Intra- and Inter-level Cross-Resolution Transformers, [[Paper]](https://arxiv.org/pdf/2401.01552.pdf), [[Code]](https://github.com/EasyRy/CRA-PCN)
- (arXiv 2024.04) Transformer based Pluralistic Image Completion with Reduced Information Loss, [[Paper]](https://arxiv.org/pdf/2404.00513.pdf), [[Code]](https://github.com/liuqk3/PUT)
- (arXiv 2024.05) Context and Geometry Aware Voxel Transformer for Semantic Scene Completion, [[Paper]](https://arxiv.org/pdf/2405.13675.pdf), [[Code]](https://github.com/pkqbajng/CGFormer)
- (arXiv 2024.06) GeoFormer: Learning Point Cloud Completion with Tri-Plane Integrated Transformer, [[Paper]](https://arxiv.org/pdf/2406.06596.pdf), [[Code]](https://github.com/Jinpeng-Yu/GeoFormer)
- (arXiv 2024.10) SurgPointTransformer: Vertebrae Shape Completion with RGB-D Data, [[Paper]](https://arxiv.org/pdf/2410.01443.pdf)
- (arXiv 2024.10) ET-Former: Efficient Triplane Deformable Attention for 3D Semantic Scene Completion From Monocular Camera, [[Paper]](https://arxiv.org/pdf/2410.11019.pdf)
- (arXiv 2024.12) TDCNet: Transparent Objects Depth Completion with CNN-Transformer Dual-Branch Parallel Network, [[Paper]](https://arxiv.org/pdf/2412.14961.pdf), [[Code]](https://github.com/XianghuiFan/TDCNet)

--------------------------------------------------------------------------------
/main/compression.md:
--------------------------------------------------------------------------------
### Compression
- (arXiv 2021.10) Accelerating Framework of Transformer by Hardware Design and Model Compression Co-Optimization, [[Paper]](https://arxiv.org/pdf/2110.10030.pdf)
- (arXiv 2021.11) Transformer-based Image Compression, [[Paper]](https://arxiv.org/pdf/2104.00845.pdf)
- (arXiv 2021.12) Towards End-to-End Image Compression and Analysis with Transformers, [[Paper]](https://arxiv.org/pdf/2112.09300.pdf), [[Code]](https://github.com/BYchao100/Towards-Image-Compression-and-Analysis-with-Transformers)
- (arXiv 2021.12) CSformer: Bridging Convolution and Transformer for Compressive Sensing, [[Paper]](https://arxiv.org/pdf/2112.15299.pdf)
- (arXiv 2022.01) Multi-Dimensional Model Compression of Vision Transformer, [[Paper]](https://arxiv.org/pdf/2201.00043.pdf)
- (arXiv 2022.02) Entroformer: A Transformer-based Entropy Model for Learned Image Compression, [[Paper]](https://arxiv.org/pdf/2202.05492.pdf), [[Code]](https://github.com/mx54039q/entroformer)
- (arXiv 2022.03) Unified Visual Transformer Compression, [[Paper]](https://arxiv.org/pdf/2203.08243.pdf), [[Code]](https://github.com/VITA-Group/UVC)
- (arXiv 2022.03) Transformer Compressed Sensing via Global Image Tokens, [[Paper]](https://arxiv.org/pdf/2203.12861.pdf), [[supplementary]](https://github.com/uqmarlonbran/TCS)
- (arXiv 2022.03) Vision Transformer Compression with Structured Pruning and Low Rank Approximation, [[Paper]](https://arxiv.org/pdf/2203.13444.pdf)
- (arXiv 2022.04) Searching Intrinsic Dimensions of Vision Transformers, [[Paper]](https://arxiv.org/pdf/2204.07722.pdf)
- (arXiv 2022.04) Degradation-Aware Unfolding Half-Shuffle Transformer for Spectral Compressive Imaging, [[Paper]](https://arxiv.org/pdf/2205.10102.pdf)
- (arXiv 2022.06) VCT: A Video Compression Transformer, [[Paper]](https://arxiv.org/pdf/2206.07307.pdf), [[Code]](https://github.com/google-research/google-research/tree/master/vct)
- (arXiv 2022.07) TransCL: Transformer Makes Strong and Flexible Compressive Learning, [[Paper]](https://arxiv.org/pdf/2207.11972.pdf), [[Code]](https://github.com/MC-E/TransCL/)
- (arXiv 2022.08) Meta-DETR: Image-Level Few-Shot Detection with Inter-Class Correlation Exploitation, [[Paper]](https://arxiv.org/pdf/2208.00219.pdf), [[Code]](https://github.com/ZhangGongjie/Meta-DETR)
- (arXiv 2022.08) Unified Normalization for Accelerating and Stabilizing Transformers, [[Paper]](https://arxiv.org/pdf/2208.01313.pdf), [[Code]](https://github.com/hikvision-research/Unified-Normalization)
- (arXiv 2022.09) Uformer-ICS: A Specialized U-Shaped Transformer for Image Compressive Sensing, [[Paper]](https://arxiv.org/pdf/2209.01763.pdf)
- (arXiv 2022.09) Attacking Compressed Vision Transformers, [[Paper]](https://arxiv.org/pdf/2209.13785.pdf)
- (arXiv 2023.01) GOHSP: A Unified Framework of Graph and Optimization-based Heterogeneous Structured Pruning for Vision Transformer, [[Paper]](https://arxiv.org/pdf/2301.05345.pdf)
- (arXiv 2023.03) SeiT: Storage-Efficient Vision Training with Tokens Using 1% of Pixel Storage, [[Paper]](https://arxiv.org/pdf/2303.11114.pdf), [[Code]](https://github.com/naver-ai/seit)
- (arXiv 2023.03) Learned Image Compression with Mixed Transformer-CNN Architectures, [[Paper]](https://arxiv.org/pdf/2303.14978.pdf), [[Code]](https://github.com/jmliu206/LIC_TCM)
- (arXiv 2023.04) Optimization-Inspired Cross-Attention Transformer for Compressive Sensing, [[Paper]](https://arxiv.org/pdf/2304.13986.pdf), [[Code]](https://github.com/songjiechong/OCTUF)
- (arXiv 2023.05) ROI-based Deep Image Compression with Swin Transformers, [[Paper]](https://arxiv.org/pdf/2305.07783.pdf)
- (arXiv 2023.05) Transformer-based Variable-rate Image Compression with Region-of-interest Control, [[Paper]](https://arxiv.org/pdf/2305.10807.pdf)
- (arXiv 2023.06) Efficient Contextformer: Spatio-Channel Window Attention for Fast Context Modeling in Learned Image Compression, [[Paper]](https://arxiv.org/pdf/2306.14287.pdf)
- (arXiv 2023.07) AICT: An Adaptive Image Compression Transformer, [[Paper]](https://arxiv.org/pdf/2307.06091.pdf)
- (arXiv 2023.07) JPEG Quantized Coefficient Recovery via DCT Domain Spatial-Frequential Transformer, [[Paper]](https://arxiv.org/pdf/2308.09110.pdf)
- (arXiv 2023.09) Compressing Vision Transformers for Low-Resource Visual Learning, [[Paper]](https://arxiv.org/pdf/2309.02617.pdf)
- (arXiv 2023.09) CAIT: Triple-Win Compression towards High Accuracy, Fast Inference, and Favorable Transferability For ViTs, [[Paper]](https://arxiv.org/pdf/2309.15755.pdf)
- (arXiv 2023.10) USDC: Unified Static and Dynamic Compression for Visual Transformer, [[Paper]](https://arxiv.org/pdf/2310.11117.pdf)
- (arXiv 2023.10) Frequency-Aware Transformer for Learned Image Compression, [[Paper]](https://arxiv.org/pdf/2310.16387.pdf)
- (arXiv 2023.11) White-Box Transformers via Sparse Rate Reduction: Compression Is All There Is, [[Paper]](https://arxiv.org/pdf/2311.13110.pdf), [[Code]](https://ma-lab-berkeley.github.io/CRATE)
- (arXiv 2023.11) Corner-to-Center Long-range Context Model for Efficient Learned Image Compression, [[Paper]](https://arxiv.org/pdf/2311.18103.pdf)
- (arXiv 2023.12) Input Compression with Positional Consistency for Efficient Training and Inference of Transformer Neural Networks, [[Paper]](https://arxiv.org/pdf/2312.12385.pdf), [[Code]](https://github.com/amrnag/ICPC)
- (arXiv 2024.01) UPDP: A Unified Progressive Depth Pruner for CNN and Vision Transformer, [[Paper]](https://arxiv.org/pdf/2401.06426.pdf)
- (arXiv 2024.02) Memory-Efficient Vision Transformers: An Activation-Aware Mixed-Rank Compression Strategy, [[Paper]](https://arxiv.org/pdf/2402.06004.pdf)
- (arXiv 2024.03) Unifying Generation and Compression: Ultra-low bitrate Image Coding Via Multi-stage Transformer, [[Paper]](https://arxiv.org/pdf/2403.03736.pdf)
- (arXiv 2024.03) Content-aware Masked Image Modeling Transformer for Stereo Image Compression, [[Paper]](https://arxiv.org/pdf/2403.08505.pdf)
- (arXiv 2024.03) Dense Vision Transformer Compression with Few Samples, [[Paper]](https://arxiv.org/pdf/2403.18708.pdf)
- (arXiv 2024.06) ReduceFormer: Attention with Tensor Reduction by Summation, [[Paper]](https://arxiv.org/pdf/2406.07488.pdf)
- (arXiv 2024.08) Bi-Level Spatial and Channel-aware Transformer for Learned Image Compression, [[Paper]](https://arxiv.org/pdf/2408.03842.pdf)
- (arXiv 2024.12) Efficient Semantic Communication Through Transformer-Aided Compression, [[Paper]](https://arxiv.org/pdf/2412.01817.pdf)

--------------------------------------------------------------------------------
/main/cross-view.md:
--------------------------------------------------------------------------------
### Cross-view
- (arXiv 2022.03) Mutual Generative Transformer Learning for Cross-view Geo-localization, [[Paper]](https://arxiv.org/pdf/2203.09135.pdf)
- (arXiv 2022.04) TransGeo: Transformer Is All You Need for Cross-view Image Geo-localization, [[Paper]](https://arxiv.org/pdf/2204.00097.pdf), [[Code]](https://github.com/Jeff-Zilence/TransGeo2022)

--------------------------------------------------------------------------------
/main/crowd.md:
--------------------------------------------------------------------------------
### Crowd
- (arXiv 2021.04) TransCrowd: Weakly-Supervised Crowd Counting with Transformer, [[Paper]](https://arxiv.org/pdf/2104.09116.pdf), [[Code]](https://github.com/dk-liang/TransCrowd)
- (arXiv 2021.05) Boosting Crowd Counting with Transformers, [[Paper]](https://arxiv.org/pdf/2105.10926.pdf), [[Code]](https://github.com/dk-liang/TransCrowd)
- (arXiv 2021.08) Congested Crowd Instance Localization with Dilated Convolutional Swin Transformer, [[Paper]](https://arxiv.org/pdf/2108.00584.pdf)
- (arXiv 2021.09) Audio-Visual Transformer Based Crowd Counting, [[Paper]](https://arxiv.org/pdf/2109.01926.pdf), [[Code]](https://github.com/rucv/avcc)
- (arXiv 2021.09) CCTrans: Simplifying and Improving Crowd Counting with Transformer, [[Paper]](https://arxiv.org/pdf/2109.14483.pdf)
- (arXiv 2022.01) Scene-Adaptive Attention Network for Crowd Counting, [[Paper]](https://arxiv.org/pdf/2112.15509.pdf)
- (arXiv 2022.03) An End-to-End Transformer Model for Crowd Localization, [[Paper]](https://arxiv.org/pdf/2202.13065.pdf)
- (arXiv 2022.03) Joint CNN and Transformer Network via weakly supervised Learning for efficient crowd counting, [[Paper]](https://arxiv.org/pdf/2203.06388.pdf)
- (arXiv 2022.06) Counting Varying Density Crowds Through Density Guided Adaptive Selection CNN and Transformer Estimation, [[Paper]](https://arxiv.org/pdf/2206.10075.pdf)
- (arXiv 2022.08) CounTR: Transformer-based Generalised Visual Counting, [[Paper]](https://arxiv.org/pdf/2208.13721.pdf)
- (arXiv 2023.01) RGB-T Multi-Modal Crowd Counting Based on Transformer, [[Paper]](https://arxiv.org/pdf/2301.03033.pdf), [[Code]](https://github.com/liuzywen/RGBTCC)
- (arXiv 2023.03) InCrowdFormer: On-Ground Pedestrian World Model From Egocentric Views, [[Paper]](https://arxiv.org/pdf/2303.09534.pdf)
- (arXiv 2023.05) Selecting Learnable Training Samples is All DETRs Need in Crowded Pedestrian Detection, [[Paper]](https://arxiv.org/pdf/2305.10801.pdf)
- (arXiv 2023.10) Query-adaptive DETR for Crowded Pedestrian Detection, [[Paper]](https://arxiv.org/pdf/2310.15725.pdf)
- (arXiv 2023.12) Regressor-Segmenter Mutual Prompt Learning for Crowd Counting, [[Paper]](https://arxiv.org/pdf/2312.01711.pdf)
- (arXiv 2024.01) Gramformer: Learning Crowd Counting via Graph-Modulated Transformer, [[Paper]](https://arxiv.org/pdf/2401.03870.pdf), [[Code]](https://github.com/LoraLinH/Gramformer)
- (arXiv 2024.07) CountFormer: Multi-View Crowd Counting Transformer, [[Paper]](https://arxiv.org/pdf/2407.02047.pdf)
- (arXiv 2025.04) RCCFormer: A Robust Crowd Counting Network Based on Transformer, [[Paper]](https://arxiv.org/pdf/2504.04935.pdf), [[Code]](https://github.com/lp-094/RCCFormer)
- (arXiv 2025.04) A Transformer-based Multimodal Fusion Model for Efficient Crowd Counting Using Visual and Wireless Signals, [[Paper]](https://arxiv.org/pdf/2504.20178.pdf)
- (arXiv 2025.05) Transformer-Based Dual-Optical Attention Fusion Crowd Head Point Counting and Localization Network, [[Paper]](https://arxiv.org/pdf/2505.06937.pdf), [[Code]](https://github.com/zz-zik/TAPNet)

--------------------------------------------------------------------------------
/main/deblurring.md:
--------------------------------------------------------------------------------
### Deblurring
- (arXiv 2022.01) Flow-Guided Sparse Transformer for Video Deblurring, [[Paper]](https://arxiv.org/pdf/2201.01893.pdf)
- (arXiv 2022.04) Stripformer: Strip Transformer for Fast Image Deblurring, [[Paper]](https://arxiv.org/pdf/2204.04627.pdf)
- (arXiv 2022.04) VDTR: Video Deblurring with Transformer, [[Paper]](https://arxiv.org/pdf/2204.08023.pdf), [[Code]](https://github.com/ljzycmd/VDTR)
- (arXiv 2022.09) DMTNet: Dynamic Multi-scale Network for Dual-pixel Images Defocus Deblurring with Transformer, [[Paper]](https://arxiv.org/pdf/2209.06040.pdf)
- (arXiv 2022.11) Efficient Frequency Domain-based Transformers for High-Quality Image Deblurring, [[Paper]](https://arxiv.org/pdf/2211.12250.pdf), [[Code]](https://github.com/kkkls/FFTformer)
- (arXiv 2023.03) Image Deblurring by Exploring In-depth Properties of Transformer, [[Paper]](https://arxiv.org/pdf/2303.15198.pdf)
- (arXiv 2023.09) Aggregating Long-term Sharp Features via Hybrid Transformers for Video Deblurring, [[Paper]](https://arxiv.org/pdf/2309.07054.pdf), [[Code]](https://github.com/shangwei5/STGTN)
- (arXiv 2024.03) A Unified Framework for Microscopy Defocus Deblur with Multi-Pyramid Transformer and Contrastive Learning, [[Paper]](https://arxiv.org/pdf/2403.02611.pdf), [[Code]](https://github.com/PieceZhang/MPT-CataBlur)
- (arXiv 2024.03) DeblurDiNAT: A Lightweight and Effective Transformer for Image Deblurring, [[Paper]](https://arxiv.org/pdf/2403.13163.pdf), [[Code]](https://github.com/HanzhouLiu/DeblurDiNAT.git)
- (arXiv 2024.04) Spread Your Wings: A Radial Strip Transformer for Image Deblurring, [[Paper]](https://arxiv.org/pdf/2404.00358.pdf), [[Code]](https://github.com/Calvin11311/RST)
- (arXiv 2024.06) Blur-aware Spatio-temporal Sparse Transformer for Video Deblurring, [[Paper]](https://arxiv.org/pdf/2406.07551.pdf), [[Code]](https://vilab.hit.edu.cn/projects/bsstnet)
- (arXiv 2024.07) LoFormer: Local Frequency Transformer for Image Deblurring, [[Paper]](https://arxiv.org/pdf/2407.16993.pdf), [[Code]](https://github.com/DeepMed-Lab-ECNU/Single-Image-Deblur)
- (arXiv 2024.08) Rethinking Video Deblurring with Wavelet-Aware Dynamic Transformer and Diffusion Model, [[Paper]](https://arxiv.org/pdf/2408.13459.pdf), [[Code]](https://github.com/Chen-Rao/VD-Diff)
- (arXiv 2024.09) F2former: When Fractional Fourier Meets Deep Wiener Deconvolution and Selective Frequency Transformer for Image Deblurring, [[Paper]](https://arxiv.org/pdf/2409.02056.pdf)
- (arXiv 2025.01) Efficient Transformer for High Resolution Image Motion Deblurring, [[Paper]](https://arxiv.org/pdf/2501.18403.pdf), [[Code]](https://github.com/hamzafer/image-deblurring)

--------------------------------------------------------------------------------
/main/deepfake-detection.md:
--------------------------------------------------------------------------------
### Deepfake Detection
- (arXiv 2021.02) Deepfake Video Detection Using Convolutional Vision Transformer, [[Paper]](https://arxiv.org/abs/2102.11126)
- (arXiv 2021.04) Deepfake Detection Scheme Based on Vision Transformer and Distillation, [[Paper]](https://arxiv.org/abs/2104.01353)
- (arXiv 2021.04) M2TR: Multi-modal Multi-scale Transformers for Deepfake Detection, [[Paper]](https://arxiv.org/pdf/2104.09770.pdf)
- (arXiv 2021.07) Combining EfficientNet and Vision Transformers for Video Deepfake Detection, [[Paper]](https://arxiv.org/pdf/2107.02612.pdf)
- (arXiv 2021.08) Video Transformer for Deepfake Detection with Incremental Learning, [[Paper]](https://arxiv.org/pdf/2108.05307.pdf)
- (arXiv 2022.03) Self-supervised Transformer for Deepfake Detection, [[Paper]](https://arxiv.org/pdf/2203.01265.pdf)
- (arXiv 2022.06) Cross-Forgery Analysis of Vision Transformers and CNNs for Deepfake Image Detection, [[Paper]](https://arxiv.org/pdf/2206.13829.pdf)
- (arXiv 2022.07) Deepfake Video Detection with Spatiotemporal Dropout Transformer, [[Paper]](https://arxiv.org/pdf/2207.06612.pdf)
- (arXiv 2022.07) Hybrid Transformer Network for Deepfake Detection, [[Paper]](https://arxiv.org/pdf/2208.05820.pdf)
- (arXiv 2022.09) Deep Convolutional Pooling Transformer for Deepfake Detection, [[Paper]](https://arxiv.org/pdf/2303.07033.pdf)
- (arXiv 2023.04) Deepfake Detection with Deep Learning: Convolutional Neural Networks versus Transformers, [[Paper]](https://arxiv.org/pdf/2304.03698.pdf)
- (arXiv 2023.07) Deepfake Video Detection Using Generative Convolutional Vision Transformer, [[Paper]](https://arxiv.org/pdf/2307.07036.pdf), [[Code]](https://github.com/erprogs/GenConViT)
- (arXiv 2023.07) Self-Supervised Graph Transformer for Deepfake Detection, [[Paper]](https://arxiv.org/pdf/2307.15019.pdf)
- (arXiv 2023.09) DF-TransFusion: Multimodal Deepfake Detection via Lip-Audio Cross-Attention and Facial Self-Attention, [[Paper]](https://arxiv.org/pdf/2309.06511.pdf)
- (arXiv 2024.03) TT-BLIP: Enhancing Fake News Detection Using BLIP and Tri-Transformer, [[Paper]](https://arxiv.org/pdf/2403.12481.pdf)
- (arXiv 2024.03) AVT2-DWF: Improving Deepfake Detection with Audio-Visual Fusion and Dynamic Weighting Strategies, [[Paper]](https://arxiv.org/pdf/2403.14974.pdf), [[Code]](https://github.com/raining-dev/AVT2-DWF)
- (arXiv 2024.04) Texture-aware and Shape-guided Transformer for Sequential DeepFake Detection, [[Paper]](https://arxiv.org/pdf/2404.13873.pdf)
- (arXiv 2024.05) Exploring Self-Supervised Vision Transformers for Deepfake Detection: A Comparative Analysis, [[Paper]](https://arxiv.org/pdf/2405.00355.pdf)
- (arXiv 2024.05) A Timely Survey on Vision Transformer for Deepfake Detection, [[Paper]](https://arxiv.org/pdf/2405.08463.pdf)
- (arXiv 2024.09) Tex-ViT: A Generalizable, Robust, Texture-based dual-branch cross-attention deepfake detector, [[Paper]](https://arxiv.org/pdf/2408.16892.pdf)
- (arXiv 2024.10) FakeFormer: Efficient Vulnerability-Driven Transformers for Generalisable Deepfake Detection, [[Paper]](https://arxiv.org/pdf/2410.21964.pdf), [[Code]](https://github.com/10Ring/FakeFormer)
- (arXiv 2025.01) Classifying Deepfakes Using Swin Transformers, [[Paper]](https://arxiv.org/pdf/2501.15656.pdf)
- (arXiv 2025.02) Cross multiscale vision transformer for deep fake detection, [[Paper]](https://arxiv.org/pdf/2502.00833.pdf)
vision transformer for deep fake detection, [[Paper]](https://arxiv.org/pdf/2502.00833.pdf) 25 | - (arXiv 2025.04) Advance Fake Video Detection via Vision Transformers, [[Paper]](https://arxiv.org/pdf/2504.20669.pdf) 26 | -------------------------------------------------------------------------------- /main/dehazing.md: -------------------------------------------------------------------------------- 1 | ### Dehazing 2 | - (arXiv 2021.09) Hybrid Local-Global Transformer for Image Dehazing, [[Paper]](https://arxiv.org/pdf/2109.07100.pdf) 3 | - (arXiv 2022.04) Vision Transformers for Single Image Dehazing, [[Paper]](https://arxiv.org/pdf/2204.03883.pdf) 4 | - (arXiv 2022.10) Semi-UFormer: Semi-supervised Uncertainty-aware Transformer for Image Dehazing, [[Paper]](https://arxiv.org/pdf/2210.16057.pdf) 5 | - (arXiv 2023.03) SelfPromer: Self-Prompt Dehazing Transformers with Depth-Consistency, [[Paper]](https://arxiv.org/pdf/2210.16057.pdf) 6 | - (arXiv 2023.04) A Data-Centric Solution to NonHomogeneous Dehazing via Vision Transformer, [[Paper]](https://arxiv.org/pdf/2304.07874.pdf), [[Code]](https://github.com/yangyiliu21/ntire2023_ITBdehaze) 7 | - (arXiv 2023.05) NightHazeFormer: Single Nighttime Haze Removal Using Prior Query Transformer, [[Paper]](https://arxiv.org/pdf/2305.09533.pdf) 8 | - (arXiv 2023.08) MB-TaylorFormer: Multi-branch Efficient Transformer Expanded by Taylor Formula for Image Dehazing, [[Paper]](https://arxiv.org/pdf/2308.14036.pdf), [[Code]](https://github.com/FVL2020/ICCV-2023-MB-TaylorFormer) 9 | - (arXiv 2023.12) DHFormer: A Vision Transformer-Based Attention Module for Image Dehazing, [[Paper]](https://arxiv.org/pdf/2312.09955.pdf) 10 | - (arXiv 2024.01) WaveletFormerNet: A Transformer-based Wavelet Network for Real-world Non-homogeneous and Dense Fog Removal, [[Paper]](https://arxiv.org/pdf/2401.04550.pdf) 11 | - (arXiv 2024.07) Vision Transformer with Key-select Routing Attention for Single Image Dehazing, [[Paper]](https://arxiv.org/pdf/2406.19703.pdf) 12 | - (arXiv 2024.07) DehazeDCT: Towards Effective Non-Homogeneous Dehazing via Deformable Convolutional Transformer, [[Paper]](https://arxiv.org/pdf/2407.05169.pdf), [[Code]](https://github.com/movingforward100/Dehazing_R) 13 | - (arXiv 2024.07) Restoring Images in Adverse Weather Conditions via Histogram Transformer, [[Paper]](https://arxiv.org/pdf/2407.10172.pdf), [[Code]](https://github.com/sunshangquan/Histoformer) 14 | - (arXiv 2024.12) Distilled Pooling Transformer Encoder for Efficient Realistic Image Dehazing, [[Paper]](https://arxiv.org/pdf/2412.14220.pdf), [[Code]](https://github.com/tranleanh/dpte-net) 15 | - (arXiv 2025.01) Transformer-Driven Inverse Problem Transform for Fast Blind Hyperspectral Image Dehazing, [[Paper]](https://arxiv.org/pdf/2501.01924.pdf) 16 | - (arXiv 2025.04) Fine-Tuning Adversarially-Robust Transformers for Single-Image Dehazing, [[Paper]](https://arxiv.org/pdf/2504.17829.pdf), [[Code]](https://github.com/Vladimirescu/RobustDehazing) 17 | -------------------------------------------------------------------------------- /main/denoising.md: -------------------------------------------------------------------------------- 1 | ### Denoising 2 | - (arXiv 2021.12) Neuromorphic Camera Denoising using Graph Neural Network-driven Transformers, [[Paper]](https://arxiv.org/pdf/2112.09685.pdf) 3 | - (arXiv 2022.03) SUNet: Swin Transformer UNet for Image Denoising, [[Paper]](https://arxiv.org/pdf/2202.14009.pdf),
[[Code]](https://github.com/FanChiMao/SUNet) 4 | - (arXiv 2022.03) Practical Blind Denoising via Swin-Conv-UNet and Data Synthesis, [[Paper]](https://arxiv.org/pdf/2203.13278.pdf), [[Code]](https://github.com/cszn/SCUNet) 5 | - (arXiv 2022.05) Coarse-to-Fine Video Denoising with Dual-Stage Spatial-Channel Transformer, [[Paper]](https://arxiv.org/pdf/2205.00214.pdf) 6 | - (arXiv 2022.05) Dense residual Transformer for image denoising, [[Paper]](https://arxiv.org/pdf/2205.00214.pdf) 7 | - (arXiv 2022.07) DnSwin: Toward Real-World Denoising via Continuous Wavelet Sliding-Transformer, [[Paper]](https://arxiv.org/pdf/2207.13861.pdf) 8 | - (arXiv 2022.11) Spatial-Spectral Transformer for Hyperspectral Image Denoising, [[Paper]](https://arxiv.org/pdf/2211.14090.pdf), [[Code]](https://github.com/myuli/sst) 9 | - (arXiv 2023.03) Xformer: Hybrid X-Shaped Transformer for Image Denoising, [[Paper]](https://arxiv.org/pdf/2303.06440.pdf) 10 | - (arXiv 2023.03) Hybrid Spectral Denoising Transformer with Learnable Query, [[Paper]](https://arxiv.org/pdf/2303.09040.pdf) 11 | - (arXiv 2023.04) Spectral Enhanced Rectangle Transformer for Hyperspectral Image Denoising, [[Paper]](https://arxiv.org/pdf/2303.09040.pdf), [[Code]](https://github.com/MyuLi/SERT) 12 | - (arXiv 2023.04) Exploration of Lightweight Single Image Denoising with Transformers and Truly Fair Training, [[Paper]](https://arxiv.org/pdf/2304.01805.pdf), [[Code]](https://github.com/rami0205/LWDN) 13 | - (arXiv 2023.04) Self-Supervised Image Denoising for Real-World Images with Context-aware Transformer, [[Paper]](https://arxiv.org/pdf/2304.01627.pdf) 14 | - (arXiv 2023.04) DDT: Dual-branch Deformable Transformer for Image Denoising, [[Paper]](https://arxiv.org/pdf/2304.06346.pdf), [[Code]](https://github.com/Merenguelkl/DDT) 15 | - (arXiv 2023.04) EWT: Efficient Wavelet-Transformer for Single Image Denoising, [[Paper]](https://arxiv.org/pdf/2304.06274.pdf) 16 | - (arXiv 2023.04) NoiseTrans: Point Cloud Denoising with Transformers, [[Paper]](https://arxiv.org/pdf/2304.11812.pdf) 17 | - (arXiv 2023.05) RViDeformer: Efficient Raw Video Denoising Transformer with a Larger Benchmark Dataset, [[Paper]](https://arxiv.org/pdf/2305.00767.pdf) 18 | - (arXiv 2023.05) Degradation-Noise-Aware Deep Unfolding Transformer for Hyperspectral Image Denoising, [[Paper]](https://arxiv.org/pdf/2305.04047.pdf) 19 | - (arXiv 2023.10) Physics-guided Noise Neural Proxy for Low-light Raw Image Denoising, [[Paper]](https://arxiv.org/pdf/2310.09126.pdf) 20 | - (arXiv 2023.10) A cross Transformer for image denoising, [[Paper]](https://arxiv.org/pdf/2310.10408.pdf), [[Code]](https://github.com/hellloxiaotian/CTNet) 21 | - (arXiv 2023.10) Complex Image Generation SwinTransformer Network for Audio Denoising, [[Paper]](https://arxiv.org/pdf/2310.16109.pdf) 22 | - (arXiv 2024.01) Denoising Vision Transformers, [[Paper]](https://arxiv.org/pdf/2401.02957.pdf), [[Code]](https://jiawei-yang.github.io/DenoisingViT/) 23 | - (arXiv 2024.01) Hyperspectral Image Denoising via Spatial-Spectral Recurrent Transformer, [[Paper]](https://arxiv.org/pdf/2401.03885.pdf), [[Code]](https://github.com/lronkitty/SSRT) 24 | - (arXiv 2024.04) SGDFormer: One-stage Transformer-based Architecture for Cross-Spectral Stereo Image Guided Denoising, [[Paper]](https://arxiv.org/pdf/2404.00349.pdf) 25 | - (arXiv 2024.04) TBSN: Transformer-Based Blind-Spot Network for Self-Supervised Image Denoising, [[Paper]](https://arxiv.org/pdf/2404.07846.pdf), [[Code]](https://github.com/nagejacob/TBSN) 26 | - (arXiv 
2024.07) Heterogeneous window transformer for image denoising, [[Paper]](https://arxiv.org/pdf/2407.05709.pdf) 27 | - (arXiv 2024.07) Beyond Image Prior: Embedding Noise Prior into Conditional Denoising Transformer, [[Paper]](https://arxiv.org/pdf/2407.09094.pdf), [[Code]](https://github.com/YuanfeiHuang/Condformer) 28 | - (arXiv 2024.09) A Practical Gated Recurrent Transformer Network Incorporating Multiple Fusions for Video Denoising, [[Paper]](https://arxiv.org/pdf/2409.06603.pdf) 29 | - (arXiv 2025.02) Residual Transformer Fusion Network for Salt and Pepper Image Denoising, [[Paper]](https://arxiv.org/pdf/2502.09000.pdf) 30 | -------------------------------------------------------------------------------- /main/deraining.md: -------------------------------------------------------------------------------- 1 | ### Deraining 2 | - (arXiv 2022.04) DRT: A Lightweight Single Image Deraining Recursive Transformer, [[Paper]](https://arxiv.org/pdf/2204.11385.pdf) 3 | - (arXiv 2022.07) Magic ELF: Image Deraining Meets Association Learning and Transformer, [[Paper]](https://arxiv.org/pdf/2207.10455.pdf), [[Code]](https://github.com/kuijiang94/Magic-ELF) 4 | - (arXiv 2023.03) Learning A Sparse Transformer Network for Effective Image Deraining, [[Paper]](https://arxiv.org/pdf/2303.11950.pdf), [[Code]](https://github.com/cschenxiang/DRSformer) 5 | - (arXiv 2023.08) Learning Image Deraining Transformer Network with Dynamic Dual Self-Attention, [[Paper]](https://arxiv.org/pdf/2303.11950.pdf) 6 | - (arXiv 2023.08) Sparse Sampling Transformer with Uncertainty-Driven Ranking for Unified Removal of Raindrops and Rain Streaks, [[Paper]](https://arxiv.org/pdf/2308.14153.pdf) 7 | - (arXiv 2024.01) NightRain: Nighttime Video Deraining via Adaptive-Rain-Removal and Adaptive-Correction, [[Paper]](https://arxiv.org/pdf/2401.00729.pdf) 8 | - (arXiv 2024.02) Diving Deep into Regions: Exploiting Regional Information Transformer for Single Image Deraining, [[Paper]](https://arxiv.org/pdf/2402.16033.pdf), [[Code]](https://github.com/ztMotaLee/Regformer) 9 | - (arXiv 2024.03) Gabor-guided transformer for single image deraining, [[Paper]](https://arxiv.org/pdf/2403.07380.pdf) 10 | - (arXiv 2024.05) Dual-Path Multi-Scale Transformer for High-Quality Image Deraining, [[Paper]](https://arxiv.org/pdf/2405.18124.pdf) 11 | - (arXiv 2024.08) Improving Image De-raining Using Reference-Guided Transformers, [[Paper]](https://arxiv.org/pdf/2408.00258.pdf), [[Code]](https://github.com/ziiihoooYe/RefGT) 12 | - (arXiv 2024.09) A Hybrid Transformer-Mamba Network for Single Image Deraining, [[Paper]](https://arxiv.org/pdf/2409.00410.pdf), [[Code]](https://github.com/sunshangquan/TransMamba) 13 | - (arXiv 2025.04) Cross Paradigm Representation and Alignment Transformer for Image Deraining, [[Paper]](https://arxiv.org/pdf/2504.16455.pdf) 14 | -------------------------------------------------------------------------------- /main/edge.md: -------------------------------------------------------------------------------- 1 | ### Edge 2 | - (arXiv 2022.03) EDTER: Edge Detection with Transformer, [[Paper]](https://arxiv.org/pdf/2203.08566.pdf), [[Code]](https://github.com/MengyangPu/EDTER) 3 | - (arXiv 2022.06) XBound-Former: Toward Cross-scale Boundary Modeling in Transformers, [[Paper]](https://arxiv.org/pdf/2206.00806.pdf), [[Code]](https://github.com/jcwang123/xboundformer) 4 | - (arXiv 2022.06) Structured Context Transformer for Generic Event Boundary Detection, [[Paper]](https://arxiv.org/pdf/2206.02985.pdf) 5 | -
(arXiv 2022.06) SC-Transformer++: Structured Context Transformer for Generic Event Boundary Detection, [[Paper]](https://arxiv.org/pdf/2206.12634.pdf) 6 | - (arXiv 2023.07) CT-Net: Arbitrary-Shaped Text Detection via Contour Transformer, [[Paper]](https://arxiv.org/pdf/2307.13310.pdf) 7 | - (arXiv 2024.07) EAFormer: Scene Text Segmentation with Edge-Aware Transformers, [[Paper]](https://arxiv.org/pdf/2407.17020.pdf), [[Code]](https://hyangyu.github.io/EAFormer/) 8 | - (arXiv 2024.08) EdgeNAT: Transformer for Efficient Edge Detection, [[Paper]](https://arxiv.org/pdf/2408.10527.pdf), [[Code]](https://github.com/jhjie/EdgeNAT) 9 | -------------------------------------------------------------------------------- /main/enhancement.md: -------------------------------------------------------------------------------- 1 | ### Enhancement 2 | - (arXiv 2021.11) U-shape Transformer for Underwater Image Enhancement, [[Paper]](https://arxiv.org/pdf/2111.10135.pdf) 3 | - (arXiv 2022.01) DocEnTr: An End-to-End Document Image Enhancement Transformer, [[Paper]](https://arxiv.org/pdf/2201.10252.pdf), [[Code]](https://github.com/dali92002/DocEnTR) 4 | - (arXiv 2022.04) Underwater Image Enhancement Using Pre-trained Transformer, [[Paper]](https://arxiv.org/pdf/2204.04199.pdf) 5 | - (arXiv 2022.04) VISTA: Vision Transformer enhanced by U-Net and Image Colorfulness Frame Filtration for Automatic Retail Checkout, [[Paper]](https://arxiv.org/pdf/2204.11024.pdf), [[Code]](https://github.com/istiakshihab/automated-retail-checkout-aicity22) 6 | - (arXiv 2022.05) Reinforced Swin-Convs Transformer for Underwater Image Enhancement, [[Paper]](https://arxiv.org/pdf/2205.00434.pdf) 7 | - (arXiv 2022.07) Structural Prior Guided Generative Adversarial Transformers for Low-Light Image Enhancement, [[Paper]](https://arxiv.org/pdf/2207.07828.pdf) 8 | - (arXiv 2022.10) End-to-end Transformer for Compressed Video Quality Enhancement, [[Paper]](https://arxiv.org/pdf/2210.13827.pdf) 9 | - (arXiv 2022.12) WavEnhancer: Unifying Wavelet and Transformer for Image Enhancement, [[Paper]](https://arxiv.org/pdf/2212.08327.pdf) 10 | - (arXiv 2022.12) Ultra-High-Definition Low-Light Image Enhancement: A Benchmark and Transformer-Based Method, [[Paper]](https://arxiv.org/pdf/2212.11548.pdf), [[Code]](https://github.com/TaoWangzj/LLFormer) 11 | - (arXiv 2023.03) Retinexformer: One-stage Retinex-based Transformer for Low-light Image Enhancement, [[Paper]](https://arxiv.org/pdf/2303.06705.pdf) 12 | - (arXiv 2023.06) Unsupervised Low Light Image Enhancement Using SNR-Aware Swin Transformer, [[Paper]](https://arxiv.org/pdf/2306.02082.pdf) 13 | - (arXiv 2023.08) Low-Light Image Enhancement with Illumination-Aware Gamma Correction and Complete Image Modelling Network, [[Paper]](https://arxiv.org/pdf/2308.08220.pdf) 14 | - (arXiv 2023.09) Underwater Image Enhancement by Transformer-based Diffusion Model with Non-uniform Sampling for Skip Strategy, [[Paper]](https://arxiv.org/pdf/2309.03445.pdf), [[Code]](https://github.com/piggy2009/DM_underwater) 15 | - (arXiv 2023.09) DEFormer: DCT-driven Enhancement Transformer for Low-light Image and Dark Vision, [[Paper]](https://arxiv.org/pdf/2309.06941.pdf) 16 | - (arXiv 2023.10) UWFormer: Underwater Image Enhancement via a Semi-Supervised Multi-Scale Transformer, [[Paper]](https://arxiv.org/pdf/2310.20210.pdf) 17 | - (arXiv 2023.12) A Layer-Wise Tokens-to-Token Transformer Network for Improved Historical Document Image Enhancement,
[[Paper]](https://arxiv.org/pdf/2312.03946.pdf), [[Code]](https://github.com/RisabBiswas/T2T-BinFormer) 18 | - (arXiv 2023.12) Transformer-based No-Reference Image Quality Assessment via Supervised Contrastive Learning, [[Paper]](https://arxiv.org/pdf/2312.06995.pdf), [[Code]](https://github.com/I2-Multimedia-Lab/SaTQA) 19 | - (arXiv 2023.12) A Non-Uniform Low-Light Image Enhancement Method with Multi-Scale Attention Transformer and Luminance Consistency Loss, [[Paper]](https://arxiv.org/pdf/2312.16498.pdf), [[Code]](https://github.com/fang001021/MSATr) 20 | - (arXiv 2024.01) LYT-Net: Lightweight YUV Transformer-based Network for Low-Light Image Enhancement, [[Paper]](https://arxiv.org/pdf/2401.15204.pdf), [[Code]](https://github.com/albrateanu/LYT-Net) 21 | - (arXiv 2024.03) LoLiSRFlow: Joint Single Image Low-light Enhancement and Super-resolution via Cross-scale Transformer-based Conditional Flow, [[Paper]](https://arxiv.org/pdf/2402.18871.pdf) 22 | - (arXiv 2024.03) Learning A Physical-aware Diffusion Model Based on Transformer for Underwater Image Enhancement, [[Paper]](https://arxiv.org/pdf/2403.01497.pdf) 23 | - (arXiv 2024.07) Image-Conditional Diffusion Transformer for Underwater Image Enhancement, [[Paper]](https://arxiv.org/pdf/2407.05389.pdf) 24 | - (arXiv 2024.07) CAPformer: Compression-Aware Pre-trained Transformer for Low-Light Image Enhancement, [[Paper]](https://arxiv.org/pdf/2407.07056.pdf) 25 | - (arXiv 2024.07) Unified-EGformer: Exposure Guided Lightweight Transformer for Mixed-Exposure Image Enhancement, [[Paper]](https://arxiv.org/pdf/2407.13170.pdf) 26 | - (arXiv 2024.08) UIE-UnFold: Deep Unfolding Network with Color Priors and Vision Transformer for Underwater Image Enhancement, [[Paper]](https://arxiv.org/pdf/2408.10653.pdf), [[Code]](https://github.com/CXH-Research/UIE-UnFold) 27 | - (arXiv 2025.01) DLEN: Dual Branch of Transformer for Low-Light Image Enhancement in Dual Domains, [[Paper]](https://arxiv.org/pdf/2501.12235.pdf), [[Code]](https://github.com/LaLaLoXX/DLEN) 28 | - (arXiv 2025.04) Structure-guided Diffusion Transformer for Low-Light Image Enhancement, [[Paper]](https://arxiv.org/pdf/2504.15054.pdf) 29 | -------------------------------------------------------------------------------- /main/federated-learning.md: -------------------------------------------------------------------------------- 1 | ### Federated Learning 2 | - (arXiv 2022.11) FedTune: A Deep Dive into Efficient Federated Fine-Tuning with Pre-trained Transformers, [[Paper]](https://arxiv.org/pdf/2211.08025.pdf) 3 | - (arXiv 2023.06) FeSViBS: Federated Split Learning of Vision Transformer with Block Sampling, [[Paper]](https://arxiv.org/pdf/2306.14638.pdf), [[Code]](https://github.com/faresmalik/FeSViBS) 4 | - (arXiv 2023.08) Pelta: Shielding Transformers to Mitigate Evasion Attacks in Federated Learning, [[Paper]](https://arxiv.org/pdf/2308.04373.pdf) 5 | - (arXiv 2023.08) FedPerfix: Towards Partial Model Personalization of Vision Transformers in Federated Learning, [[Paper]](https://arxiv.org/pdf/2308.09160.pdf), [[Code]](https://github.com/imguangyu/FedPerfix) 6 | - (arXiv 2024.01) OnDev-LCT: On-Device Lightweight Convolutional Transformers towards federated learning, [[Paper]](https://arxiv.org/pdf/2401.11652.pdf) 7 | - (arXiv 2024.03) A General and Efficient Federated Split Learning with Pre-trained Image Transformers for Heterogeneous Data, [[Paper]](https://arxiv.org/pdf/2403.16050.pdf) 8 | - (arXiv 2024.04) Towards Multi-modal Transformers in Federated Learning,
[[Paper]](https://arxiv.org/pdf/2404.12467.pdf) 9 | - (arXiv 2024.12) EFTViT: Efficient Federated Training of Vision Transformers with Masked Images on Resource-Constrained Edge Devices, [[Paper]](https://arxiv.org/pdf/2412.00334.pdf) 10 | - (arXiv 2024.12) FLAMe: Federated Learning with Attention Mechanism using Spatio-Temporal Keypoint Transformers for Pedestrian Fall Detection in Smart Cities, [[Paper]](https://arxiv.org/pdf/2412.14768.pdf) 11 | - (arXiv 2025.04) Federated EndoViT: Pretraining Vision Transformers via Federated Learning on Endoscopic Image Collections, [[Paper]](https://arxiv.org/pdf/2504.16612.pdf) 12 | -------------------------------------------------------------------------------- /main/few-shot-learning.md: -------------------------------------------------------------------------------- 1 | ### Few-shot Learning 2 | - (arXiv 2021.04) Rich Semantics Improve Few-shot Learning, [[Paper]](https://arxiv.org/pdf/2104.12709.pdf), [[Code]](https://github.com/MohamedAfham/RS_FSL) 3 | - (arXiv 2021.06) Few-Shot Segmentation via Cycle-Consistent Transformer, [[Paper]](https://arxiv.org/pdf/2106.02320.pdf) 4 | - (arXiv 2021.09) Sparse Spatial Transformers for Few-Shot Learning, [[Paper]](https://arxiv.org/pdf/2109.12932.pdf) 5 | - (arXiv 2021.12) Cost Aggregation Is All You Need for Few-Shot Segmentation, [[Paper]](https://arxiv.org/pdf/2112.11685.pdf), [[Code]](https://github.com/Seokju-Cho/Volumetric-Aggregation-Transformer) 6 | - (arXiv 2022.01) HyperTransformer: Model Generation for Supervised and Semi-Supervised Few-Shot Learning, [[Paper]](https://arxiv.org/pdf/2201.04182.pdf) 7 | - (arXiv 2022.02) Task-Adaptive Feature Transformer with Semantic Enrichment for Few-Shot Segmentation, [[Paper]](https://arxiv.org/pdf/2202.06498.pdf) 8 | - (arXiv 2022.03) Self-Promoted Supervision for Few-Shot Transformer, [[Paper]](https://arxiv.org/pdf/2203.07057.pdf), [[Code]](https://github.com/DongSky/few-shot-vit) 9 | - (arXiv 2022.03) Attribute Surrogates Learning and Spectral Tokens Pooling in Transformers for Few-shot Learning, [[Paper]](https://arxiv.org/pdf/2203.09064.pdf), [[Code]](https://github.com/StomachCold/HCTransformers) 10 | - (arXiv 2022.04) CATrans: Context and Affinity Transformer for Few-Shot Segmentation, [[Paper]](https://arxiv.org/pdf/2204.12817.pdf) 11 | - (arXiv 2022.05) Mask-guided Vision Transformer (MG-ViT) for Few-Shot Learning, [[Paper]](https://arxiv.org/pdf/2205.09995.pdf) 12 | - (arXiv 2022.05) Few-Shot Diffusion Models, [[Paper]](https://arxiv.org/pdf/2205.15463.pdf) 13 | - (arXiv 2022.06) Prompting Decision Transformer for Few-Shot Policy Generalization, [[Paper]](https://arxiv.org/pdf/2206.13499.pdf), [[Code]](https://mxu34.github.io/PromptDT/) 14 | - (arXiv 2022.07) Learning Cross-Image Object Semantic Relation in Transformer for Few-Shot Fine-Grained Image Classification, [[Paper]](https://arxiv.org/pdf/2207.00784.pdf), [[Code]](https://github.com/JiakangYuan/HelixFormer) 15 | - (arXiv 2022.07) Few-shot Object Counting and Detection, [[Paper]](https://arxiv.org/pdf/2207.10988.pdf), [[Code]](https://github.com/VinAIResearch/Counting-DETR) 16 | - (arXiv 2022.07) Cost Aggregation with 4D Convolutional Swin Transformer for Few-Shot Segmentation, [[Paper]](https://arxiv.org/pdf/2207.10866.pdf), [[Code]](https://seokju-cho.github.io/VAT/) 17 | - (arXiv 2022.08) Few-Shot Learning Meets Transformer: Unified Query-Support Transformers for Few-Shot Classification, [[Paper]](https://arxiv.org/pdf/2208.12398.pdf) 18 | - (arXiv 2022.10) BaseTransformers: Attention
over base data-points for One Shot Learning, [[Paper]](https://arxiv.org/pdf/2210.02476.pdf), [[Code]](https://github.com/mayug/BaseTransformers) 19 | - (arXiv 2022.10) FS-DETR: Few-Shot DEtection TRansformer with prompting and without re-training, [[Paper]](https://arxiv.org/pdf/2210.04845.pdf) 20 | - (arXiv 2022.10) Feature-Proxy Transformer for Few-Shot Segmentation, [[Paper]](https://arxiv.org/pdf/2210.06908.pdf) 21 | - (arXiv 2022.11) tSF: Transformer-based Semantic Filter for Few-Shot Learning, [[Paper]](https://arxiv.org/pdf/2211.00868.pdf) 22 | - (arXiv 2022.11) Enhancing Few-shot Image Classification with Cosine Transformer, [[Paper]](https://arxiv.org/pdf/2211.06828.pdf), [[Code]](https://github.com/vinuni-vishc/Few-Shot-Cosine-Transformer) 23 | - (arXiv 2023.01) Mask Matching Transformer for Few-Shot Segmentation, [[Paper]](https://arxiv.org/pdf/2301.01208.pdf), [[Code]](https://github.com/Picsart-AI-Research/Mask-Matching-Transformer) 24 | - (arXiv 2023.01) Exploring Efficient Few-shot Adaptation for Vision Transformers, [[Paper]](https://arxiv.org/pdf/2301.02419.pdf), [[Code]](https://github.com/loadder/eTT_TMLR2022) 25 | - (arXiv 2023.01) Continual Few-Shot Learning Using HyperTransformers, [[Paper]](https://arxiv.org/pdf/2301.04584.pdf) 26 | - (arXiv 2023.03) SpatialFormer: Semantic and Target Aware Attentions for Few-Shot Learning, [[Paper]](https://arxiv.org/pdf/2303.09281.pdf) 27 | - (arXiv 2023.04) From Saliency to DINO: Saliency-guided Vision Transformer for Few-shot Keypoint Detection, [[Paper]](https://arxiv.org/pdf/2304.03140.pdf) 28 | - (arXiv 2023.04) Analogy-Forming Transformers for Few-Shot 3D Parsing, [[Paper]](https://arxiv.org/pdf/2304.14382.pdf), [[Project]](http://analogicalnets.github.io/) 29 | - (arXiv 2023.05) Vision Transformer Off-the-Shelf: A Surprising Baseline for Few-Shot Class-Agnostic Counting, [[Paper]](https://arxiv.org/pdf/2305.04440.pdf) 30 | - (arXiv 2023.07) Multiscale Memory Comparator Transformer for Few-Shot Video Segmentation, [[Paper]](https://arxiv.org/pdf/2307.07812.pdf), [[Code]](https://github.com/MSiam/MMC-MultiscaleMemory) 31 | - (arXiv 2023.09) Target-aware Bi-Transformer for Few-shot Segmentation, [[Paper]](https://arxiv.org/pdf/2309.09492.pdf) 32 | - (arXiv 2023.10) PrototypeFormer: Learning to Explore Prototype Relationships for Few-shot Image Classification, [[Paper]](https://arxiv.org/pdf/2310.03517.pdf) 33 | - (arXiv 2023.11) Focus on Query: Adversarial Mining Transformer for Few-Shot Segmentation, [[Paper]](https://arxiv.org/pdf/2311.17626.pdf), [[Code]](https://github.com/Wyxdm/AMNet) 34 | - (arXiv 2024.03) Cross-domain Multi-modal Few-shot Object Detection via Rich Text, [[Paper]](https://arxiv.org/pdf/2403.16188.pdf), [[Code]](https://github.com/zshanggu/CDMM) 35 | - (arXiv 2024.04) Weight Copy and Low-Rank Adaptation for Few-Shot Distillation of Vision Transformers, [[Paper]](https://arxiv.org/pdf/2404.09326.pdf) 36 | - (arXiv 2024.05) Intra-task Mutual Attention based Vision Transformer for Few-Shot Learning, [[Paper]](https://arxiv.org/pdf/2405.03109.pdf) 37 | - (arXiv 2024.08) Siamese Transformer Networks for Few-shot Image Classification, [[Paper]](https://arxiv.org/pdf/2408.01427.pdf) 38 | - (arXiv 2024.10) KNN Transformer with Pyramid Prompts for Few-Shot Learning, [[Paper]](https://arxiv.org/pdf/2410.10227.pdf) 39 | - (arXiv 2025.05) CDFormer: Cross-Domain Few-Shot Object Detection Transformer Against Feature Confusion, [[Paper]](https://arxiv.org/pdf/2505.00938.pdf), [[Code]](https://longxuanx.github.io/CDFormer/)
40 | -------------------------------------------------------------------------------- /main/fusion.md: -------------------------------------------------------------------------------- 1 | ### Fusion 2 | - (arXiv 2021.07) Image Fusion Transformer, [[Paper]](https://arxiv.org/pdf/2107.09011.pdf), [[Code]](https://github.com/Vibashan/Image-FusionTransformer) 3 | - (arXiv 2021.07) PPT Fusion: Pyramid Patch Transformer for a Case Study in Image Fusion, [[Paper]](https://arxiv.org/pdf/2107.13967.pdf) 4 | - (arXiv 2022.01) TransFuse: A Unified Transformer-based Image Fusion Framework using Self-supervised Learning, [[Paper]](https://arxiv.org/pdf/2201.07451.pdf) 5 | - (arXiv 2022.01) TGFuse: An Infrared and Visible Image Fusion Approach Based on Transformer and Generative Adversarial Network, [[Paper]](https://arxiv.org/pdf/2201.10147.pdf) 6 | - (arXiv 2022.04) SwinFuse: A Residual Swin Transformer Fusion Network for Infrared and Visible Images, [[Paper]](https://arxiv.org/pdf/2201.10147.pdf), [[Code]](https://github.com/Zhishe-Wang/SwinFuse) 7 | - (arXiv 2022.07) Array Camera Image Fusion using Physics-Aware Transformers, [[Paper]](https://arxiv.org/pdf/2207.02250.pdf) 8 | - (arXiv 2023.09) Holistic Dynamic Frequency Transformer for Image Fusion and Exposure Correction, [[Paper]](https://arxiv.org/pdf/2309.01183.pdf) 9 | - (arXiv 2024.02) FuseFormer: A Transformer for Visual and Thermal Image Fusion, [[Paper]](https://arxiv.org/pdf/2402.00971.pdf), [[Code]](https://github.com/aytekXR/FuseFormer-Infrared-Fusion) 10 | - (arXiv 2024.03) Fusion Transformer with Object Mask Guidance for Image Forgery Analysis, [[Paper]](https://arxiv.org/pdf/2403.12229.pdf) 11 | - (arXiv 2024.03) An Intermediate Fusion ViT Enables Efficient Text-Image Alignment in Diffusion Models, [[Paper]](https://arxiv.org/pdf/2403.16530.pdf) 12 | - (arXiv 2024.10) IRFusionFormer: Enhancing Pavement Crack Segmentation with RGB-T Fusion and Topological-Based Loss, [[Paper]](https://arxiv.org/pdf/2409.20474.pdf), [[Code]](https://github.com/sheauhuu/IRFusionFormer) 13 | -------------------------------------------------------------------------------- /main/gait.md: -------------------------------------------------------------------------------- 1 | ### Gait 2 | - (arXiv 2021.10) TNTC: two-stream network with transformer-based complementarity for gait-based emotion recognition, [[Paper]](https://arxiv.org/pdf/2110.13708.pdf) 3 | - (arXiv 2021.11) Attention-based Dual-stream Vision Transformer for Radar Gait Recognition, [[Paper]](https://arxiv.org/pdf/2111.12290.pdf) 4 | - (arXiv 2022.04) Spatial Transformer Network on Skeleton-based Gait Recognition, [[Paper]](https://arxiv.org/pdf/2204.03873.pdf) 5 | - (arXiv 2022.06) Exploring Transformers for Behavioural Biometrics: A Case Study in Gait Recognition, [[Paper]](https://arxiv.org/pdf/2206.01441.pdf) 6 | - (arXiv 2022.06) GaitForeMer: Self-Supervised Pre-Training of Transformers via Human Motion Forecasting for Few-Shot Gait Impairment Severity Estimation, [[Paper]](https://arxiv.org/pdf/2207.00106.pdf), [[Code]](https://github.com/markendo/GaitForeMer) 7 | - (arXiv 2022.10) Multi-view Gait Recognition based on Siamese Vision Transformer, [[Paper]](https://arxiv.org/pdf/2210.10421.pdf) 8 | - (arXiv 2023.07) GaitFormer: Revisiting Intrinsic Periodicity for Gait Recognition, [[Paper]](https://arxiv.org/pdf/2307.13259.pdf) 9 | - (arXiv 2023.08) GaitPT: Skeletons Are All You Need For Gait Recognition, [[Paper]](https://arxiv.org/pdf/2308.10623.pdf) 10 | - (arXiv 2023.10) HCT: Hybrid
Convnet-Transformer for Parkinson's disease detection and severity prediction from gait, [[Paper]](https://arxiv.org/pdf/2310.17078.pdf), [[Code]](https://github.com/SafwenNaimi/HCT-Hybrid-Convnet-Transformer-for-Parkinson-s-disease-detection-and-severity-prediction-from-gait) 11 | - (arXiv 2023.10) GaitFormer: Learning Gait Representations with Noisy Multi-Task Learning, [[Paper]](https://arxiv.org/pdf/2310.19418.pdf) 12 | - (arXiv 2023.11) 1D-Convolutional transformer for Parkinson disease diagnosis from gait, [[Paper]](https://arxiv.org/pdf/2311.03177.pdf), [[Code]](https://github.com/SafwenNaimi/1D-Convolutional-transformer-for-Parkinson-disease-diagnosis-from-gait) 13 | - (arXiv 2023.11) GaitContour: Efficient Gait Recognition based on a Contour-Pose Representation, [[Paper]](https://arxiv.org/pdf/2311.16497.pdf) 14 | - (arXiv 2023.12) Learning to Estimate Critical Gait Parameters from Single-View RGB Videos with Transformer-Based Attention Network, [[Paper]](https://arxiv.org/pdf/2312.00398.pdf), [[Code]](https://github.com/vinuni-vishc/transformer-gait-analysis) 15 | - (arXiv 2025.01) Quantitative Gait Analysis from Single RGB Videos Using a Dual-Input Transformer-Based Network, [[Paper]](https://arxiv.org/pdf/2501.01689.pdf), [[Code]](https://github.com/lmtszrl/Quantitative-Gait-Analysis-From-RGB-Videos-Using-a-Dual-Input-Transformer-Based-Network) 16 | -------------------------------------------------------------------------------- /main/gaze.md: -------------------------------------------------------------------------------- 1 | ### Gaze 2 | - (arXiv 2021.06) Gaze Estimation using Transformer, [[Paper]](https://arxiv.org/pdf/2105.14424.pdf), [[Code]](https://github.com/yihuacheng/GazeTR) 3 | - (arXiv 2022.03) End-to-End Human-Gaze-Target Detection with Transformers, [[Paper]](https://arxiv.org/pdf/2203.10433.pdf) 4 | - (arXiv 2022.05) Eye-gaze-guided Vision Transformer for Rectifying Shortcut Learning, [[Paper]](https://arxiv.org/pdf/2205.12466.pdf) 5 | - (arXiv 2022.08) In the Eye of Transformer: Global-Local Correlation for Egocentric Gaze Estimation, [[Paper]](https://arxiv.org/pdf/2208.04464.pdf), [[Code]](https://bolinlai.github.io/GLC-EgoGazeEst) 6 | - (arXiv 2022.09) MGTR: End-to-End Mutual Gaze Detection with Transformer, [[Paper]](https://arxiv.org/pdf/2209.10930.pdf), [[Code]](https://github.com/Gmbition/MGTR) 7 | - (arXiv 2023.08) Interaction-aware Joint Attention Estimation Using People Attributes, [[Paper]](https://arxiv.org/pdf/2308.05382.pdf), [[Code]](https://github.com/chihina/PJAE) 8 | - (arXiv 2023.08) DVGaze: Dual-View Gaze Estimation, [[Paper]](https://arxiv.org/pdf/2308.10310.pdf), [[Code]](https://github.com/yihuacheng/DVGaze) 9 | - (arXiv 2023.10) Sharingan: A Transformer-based Architecture for Gaze Following, [[Paper]](https://arxiv.org/pdf/2310.00816.pdf) 10 | - (arXiv 2023.11) Dual input stream transformer for eye-tracking line assignment, [[Paper]](https://arxiv.org/pdf/2311.06095.pdf) 11 | - (arXiv 2024.01) GazeCLIP: Towards Enhancing Gaze Estimation via Text Guidance, [[Paper]](https://arxiv.org/pdf/2401.00260.pdf) 12 | - (arXiv 2024.01) EmMixformer: Mix transformer for eye movement recognition, [[Paper]](https://arxiv.org/pdf/2401.04956.pdf) 13 | - (arXiv 2024.02) TransGOP: Transformer-Based Gaze Object Prediction, [[Paper]](https://arxiv.org/pdf/2402.13578.pdf), [[Code]](https://github.com/chenxi-Guo/TransGOP) 14 | - (arXiv 2024.03) ViTGaze: Gaze Following with Interaction Features in Vision Transformers,
[[Paper]](https://arxiv.org/pdf/2403.12778.pdf), [[Code]](https://github.com/hustvl/ViTGaze) 15 | - (arXiv 2024.04) Denoising Distillation Makes Event-Frame Transformers as Accurate Gaze Trackers, [[Paper]](https://arxiv.org/pdf/2404.00548.pdf), [[Code]](https://github.com/jdjdli/Denoise_distill_EF_gazetracker) 16 | - (arXiv 2024.05) Gaze-DETR: Using Expert Gaze to Reduce False Positives in Vulvovaginal Candidiasis Screening, [[Paper]](https://arxiv.org/pdf/2405.09463.pdf), [[Code]](https://github.com/YanKong0408/Gaze-DETR) 17 | - (arXiv 2024.07) OAT: Object-Level Attention Transformer for Gaze Scanpath Prediction, [[Paper]](https://arxiv.org/pdf/2407.13335.pdf), [[Code]](https://github.com/HKUST-NISL/oat_eccv24) 18 | -------------------------------------------------------------------------------- /main/graph.md: -------------------------------------------------------------------------------- 1 | ### Graph 2 | - (arXiv 2022.09) Graph Reasoning Transformer for Image Parsing, [[Paper]](https://arxiv.org/pdf/2209.09545.pdf) 3 | - (arXiv 2022.11) Rethinking Batch Sample Relationships for Data Representation: A Batch-Graph Transformer based Approach, [[Paper]](https://arxiv.org/pdf/2211.10622.pdf) 4 | - (arXiv 2022.12) A Generalization of ViT/MLP-Mixer to Graphs, [[Paper]](https://arxiv.org/pdf/2212.13350.pdf), [[Code]](https://github.com/XiaoxinHe/Graph-MLPMixer) 5 | - (arXiv 2023.02) Energy Transformer, [[Paper]](https://arxiv.org/pdf/2302.07253.pdf), [[Code]](https://github.com/bhoov/energy-transformer-jax) 6 | - (arXiv 2023.02) MulGT: Multi-task Graph-Transformer with Task-aware Knowledge Injection and Domain Knowledge-driven Pooling for Whole Slide Image Analysis, [[Paper]](https://arxiv.org/pdf/2302.10574.pdf) 7 | - (arXiv 2023.02) Contrastive Video Question Answering via Video Graph Transformer, [[Paper]](https://arxiv.org/pdf/2302.13668.pdf), [[Code]](https://github.com/doc-doc/CoVGT) 8 | - (arXiv 2023.03) AMIGO: Sparse Multi-Modal Graph Transformer with Shared-Context Processing for Representation Learning of Giga-pixel Images, [[Paper]](https://arxiv.org/pdf/2303.00865.pdf), [[Code]](https://github.com/doc-doc/CoVGT) 9 | - (arXiv 2023.03) An Adaptive GViT for Gas Mixture Identification and Concentration Estimation, [[Paper]](https://arxiv.org/pdf/2303.05685.pdf) 10 | - (arXiv 2023.04) Transformer-based Graph Neural Networks for Outfit Generation, [[Paper]](https://arxiv.org/pdf/2304.08098.pdf) 11 | - (arXiv 2023.05) GTNet: Graph Transformer Network for 3D Point Cloud Classification and Semantic Segmentation, [[Paper]](https://arxiv.org/pdf/2305.15213.pdf) 12 | - (arXiv 2023.05) Multi-scale Efficient Graph-Transformer for Whole Slide Image Classification, [[Paper]](https://arxiv.org/pdf/2305.15773.pdf) 13 | - (arXiv 2023.06) NAR-Former V2: Rethinking Transformer for Universal Neural Network Representation Learning, [[Paper]](https://arxiv.org/pdf/2306.10792.pdf) 14 | - (arXiv 2023.08) Geometric Learning-Based Transformer Network for Estimation of Segmentation Errors, [[Paper]](https://arxiv.org/pdf/2308.05068.pdf) 15 | - (arXiv 2023.08) Spectral Graphormer: Spectral Graph-based Transformer for Egocentric Two-Hand Reconstruction using Multi-View Color Images, [[Paper]](https://arxiv.org/pdf/2308.11015.pdf) 16 | - (arXiv 2023.09) Deep Prompt Tuning for Graph Transformers, [[Paper]](https://arxiv.org/pdf/2309.10131.pdf) 17 | - (arXiv 2023.11) GTP-ViT: Efficient Vision Transformers via Graph-based Token Propagation, [[Paper]](https://arxiv.org/pdf/2311.03035.pdf),
[[Code]](https://github.com/Ackesnal/GTP-ViT) 18 | - (arXiv 2023.11) GMTR: Graph Matching Transformers, [[Paper]](https://arxiv.org/pdf/2311.08141.pdf) 19 | - (arXiv 2023.12) GSGFormer: Generative Social Graph Transformer for Multimodal Pedestrian Trajectory Prediction, [[Paper]](https://arxiv.org/pdf/2312.04479.pdf) 20 | - (arXiv 2023.12) Large-scale Graph Representation Learning of Dynamic Brain Connectome with Transformers, [[Paper]](https://arxiv.org/pdf/2312.14939.pdf) 21 | - (arXiv 2024.01) Graph Transformer GANs with Graph Masked Modeling for Architectural Layout Generation, [[Paper]](https://arxiv.org/pdf/2401.07721.pdf) 22 | - (arXiv 2024.02) Stitching Gaps: Fusing Situated Perceptual Knowledge with Vision Transformers for High-Level Image Classification, [[Paper]](https://arxiv.org/pdf/2402.19339.pdf), [[Code]](https://github.com/delfimpandiani/Stitching-Gaps) 23 | - (arXiv 2024.03) ChebMixer: Efficient Graph Representation Learning with MLP Mixer, [[Paper]](https://arxiv.org/pdf/2403.16358.pdf) 24 | - (arXiv 2024.04) GvT: A Graph-based Vision Transformer with Talking-Heads Utilizing Sparsity, Trained from Scratch on Small Datasets, [[Paper]](https://arxiv.org/pdf/2404.04924.pdf) 25 | - (arXiv 2024.06) CYCLO: Cyclic Graph Transformer Approach to Multi-Object Relationship Modeling in Aerial Videos, [[Paper]](https://arxiv.org/pdf/2406.01029.pdf) 26 | - (arXiv 2024.06) HGTDP-DTA: Hybrid Graph-Transformer with Dynamic Prompt for Drug-Target Binding Affinity Prediction, [[Paper]](https://arxiv.org/pdf/2406.17697.pdf) 27 | - (arXiv 2024.07) Learning Lane Graphs from Aerial Imagery Using Transformers, [[Paper]](https://arxiv.org/pdf/2407.05687.pdf) 28 | - (arXiv 2024.07) Video-Language Alignment Pre-training via Spatio-Temporal Graph Transformer, [[Paper]](https://arxiv.org/pdf/2407.11677.pdf), [[Code]](https://github.com/GXYM/STGT) 29 | - (arXiv 2024.08) Integrating Features for Recognizing Human Activities through Optimized Parameters in Graph Convolutional Networks and Transformer Architectures, [[Paper]](https://arxiv.org/pdf/2408.16442.pdf) 30 | - (arXiv 2024.09) A Sinkhorn Regularized Adversarial Network for Image Guided DEM Super-resolution using Frequency Selective Hybrid Graph Transformer, [[Paper]](https://arxiv.org/pdf/2409.14198.pdf) 31 | - (arXiv 2024.10) GTransPDM: A Graph-embedded Transformer with Positional Decoupling for Pedestrian Crossing Intention Prediction, [[Paper]](https://arxiv.org/pdf/2409.20223.pdf) 32 | - (arXiv 2024.12) Dual-Branch Graph Transformer Network for 3D Human Mesh Reconstruction from Video, [[Paper]](https://arxiv.org/pdf/2412.01179.pdf), [[Code]](https://github.com/TangTao-PKU/DGTR) 33 | - (arXiv 2025.04) HGFormer: Topology-Aware Vision Transformer with HyperGraph Learning, [[Paper]](https://arxiv.org/pdf/2504.02440) 34 | - (arXiv 2025.04) Hypergraph Vision Transformers: Images are More than Nodes, More than Edges, [[Paper]](https://arxiv.org/pdf/2504.08710) 35 | -------------------------------------------------------------------------------- /main/hand-gesture.md: -------------------------------------------------------------------------------- 1 | ### Hand Gesture 2 | - (arXiv 2022.01) ViT-HGR: Vision Transformer-based Hand Gesture Recognition from High Density EMG Signals, [[Paper]](https://arxiv.org/pdf/2201.10060.pdf) 3 | - (arXiv 2023.07) Uncertainty-aware State Space Transformer for Egocentric 3D Hand Trajectory Forecasting, [[Paper]](https://arxiv.org/pdf/2307.08243.pdf), [[Code]](https://github.com/Cogito2012/USST) 4 | - (arXiv 2023.08) 
Nonrigid Object Contact Estimation With Regional Unwrapping Transformer, [[Paper]](https://arxiv.org/pdf/2308.14074.pdf) 5 | - (arXiv 2023.10) BodyFormer: Semantics-guided 3D Body Gesture Synthesis with Transformer, [[Paper]](https://arxiv.org/pdf/2310.06851.pdf) 6 | - (arXiv 2023.11) Improving Hand Recognition in Uncontrolled and Uncooperative Environments using Multiple Spatial Transformers and Loss Functions, [[Paper]](https://arxiv.org/pdf/2311.05383.pdf) 7 | - (arXiv 2023.12) Reconstructing Hands in 3D with Transformers, [[Paper]](https://arxiv.org/pdf/2312.05251.pdf), [[Code]](https://geopavlakos.github.io/hamer/) 8 | - (arXiv 2024.05) GestFormer: Multiscale Wavelet Pooling Transformer Network for Dynamic Hand Gesture Recognition, [[Paper]](https://arxiv.org/pdf/2405.11180.pdf), [[Code]](https://github.com/mallikagarg/GestFormer) 9 | - (arXiv 2024.08) MDT-A2G: Exploring Masked Diffusion Transformers for Co-Speech Gesture Generation, [[Paper]](https://arxiv.org/pdf/2408.03312.pdf), [[Code]](https://xiaofenmao.github.io/web-project/MDT-A2G/) 10 | - (arXiv 2024.09) MVTN: A Multiscale Video Transformer Network for Hand Gesture Recognition, [[Paper]](https://arxiv.org/pdf/2409.03890.pdf), [[Code]](https://github.com/mallikagarg/MVTN) 11 | - (arXiv 2024.11) ConvMixFormer- A Resource-efficient Convolution Mixer for Transformer-based Dynamic Hand Gesture Recognition, [[Paper]](https://arxiv.org/pdf/2411.07118.pdf), [[Code]](https://github.com/mallikagarg/ConvMixFormer) 12 | - (arXiv 2025.01) Multiscaled Multi-Head Attention-based Video Transformer Network for Hand Gesture Recognition, [[Paper]](https://arxiv.org/pdf/2501.00935.pdf) 13 | - (arXiv 2025.03) Stack Transformer Based Spatial-Temporal Attention Model for Dynamic Multi-Culture Sign Language Recognition, [[Paper]](https://arxiv.org/pdf/2503.16855.pdf) 14 | -------------------------------------------------------------------------------- /main/high-dynamic-range-imaging.md: -------------------------------------------------------------------------------- 1 | ### High Dynamic Range Imaging 2 | - (arXiv 2022.08) Ghost-free High Dynamic Range Imaging with Context-aware Transformer, [[Paper]](https://arxiv.org/pdf/2208.05114.pdf), [[Code]](https://github.com/megvii-research/HDR-Transformer) 3 | - (arXiv 2023.03) SpiderMesh: Spatial-aware Demand-guided Recursive Meshing for RGB-T Semantic Segmentation, [[Paper]](https://arxiv.org/pdf/2303.08704.pdf) 4 | - (arXiv 2023.04) High Dynamic Range Imaging with Context-aware Transformer, [[Paper]](https://arxiv.org/pdf/2304.04416.pdf) 5 | - (arXiv 2023.05) Alignment-free HDR Deghosting with Semantics Consistent Transformer, [[Paper]](https://arxiv.org/pdf/2305.18135.pdf), [[Code]](https://steven-tel.github.io/sctnet) 6 | - (arXiv 2023.09) IFT: Image Fusion Transformer for Ghost-free High Dynamic Range Imaging, [[Paper]](https://arxiv.org/pdf/2309.15019.pdf) 7 | -------------------------------------------------------------------------------- /main/hoi.md: -------------------------------------------------------------------------------- 1 | ### HOI 2 | - (CVPR'21) HOTR: End-to-End Human-Object Interaction Detection with Transformers, [[Paper]](https://arxiv.org/pdf/2104.13682.pdf), [[Code]](https://github.com/kakaobrain/HOTR) 3 | - (arXiv 2021.03) QPIC: Query-Based Pairwise Human-Object Interaction Detection with Image-Wide Contextual Information, [[Paper]](https://arxiv.org/pdf/2103.05399), [[Code]](https://github.com/hitachi-rd-cv/qpic) 4 | - (arXiv 2021.03) Reformulating HOI Detection as Adaptive Set
Prediction, [[Paper]](https://arxiv.org/pdf/2103.05983), [[Code]](https://github.com/yoyomimi/AS-Net) 5 | - (arXiv 2021.03) End-to-End Human Object Interaction Detection with HOI Transformer, [[Paper]](https://arxiv.org/pdf/2103.04503.pdf), [[Code]](https://github.com/bbepoch/HoiTransformer) 6 | - (arXiv 2021.05) Visual Composite Set Detection Using Part-and-Sum Transformers, [[Paper]](https://arxiv.org/pdf/2105.02170.pdf) 7 | - (arXiv 2021.08) GTNet: Guided Transformer Network for Detecting Human-Object Interactions, [[Paper]](https://arxiv.org/pdf/2108.00596.pdf), [[Code]](https://github.com/ASMIftekhar/GTNet) 8 | - (arXiv 2021.12) Efficient Two-Stage Detection of Human-Object Interactions with a Novel Unary-Pairwise Transformer, [[Paper]](https://arxiv.org/pdf/2112.01838.pdf), [[Code]](https://github.com/fredzzhang/upt) 9 | - (arXiv 2022.03) Iwin: Human-Object Interaction Detection via Transformer with Irregular Windows, [[Paper]](https://arxiv.org/pdf/2203.10537.pdf) 10 | - (arXiv 2022.03) MSTR: Multi-Scale Transformer for End-to-End Human-Object Interaction Detection, [[Paper]](https://arxiv.org/pdf/2203.14709.pdf) 11 | - (arXiv 2022.04) What to look at and where: Semantic and Spatial Refined Transformer for detecting human-object interactions, [[Paper]](https://arxiv.org/pdf/2204.00746.pdf) 12 | - (arXiv 2022.04) End-to-End Zero-Shot HOI Detection via Vision and Language Knowledge Distillation, [[Paper]](https://arxiv.org/pdf/2204.03541.pdf), [[Code]](https://github.com/mrwu-mac/EoID) 13 | - (arXiv 2022.04) Category-Aware Transformer Network for Better Human-Object Interaction Detection, [[Paper]](https://arxiv.org/pdf/2204.04911.pdf) 14 | - (arXiv 2022.04) Consistency Learning via Decoding Path Augmentation for Transformers in Human Object Interaction Detection, [[Paper]](https://arxiv.org/pdf/2204.04836.pdf), [[Code]](https://github.com/mlvlab/CPChoi) 15 | - (arXiv 2022.04) Human-Object Interaction Detection via Disentangled Transformer, [[Paper]](https://arxiv.org/pdf/2204.09290.pdf) 16 | - (arXiv 2022.06) Exploring Structure-aware Transformer over Interaction Proposals for Human-Object Interaction Detection, [[Paper]](https://arxiv.org/pdf/2206.06291.pdf), [[Code]](https://github.com/zyong812/STIP) 17 | - (arXiv 2022.07) Towards Hard-Positive Query Mining for DETR-based Human-Object Interaction Detection, [[Paper]](https://arxiv.org/pdf/2207.05293.pdf), [[Code]](https://github.com/MuchHair/HQM) 18 | - (arXiv 2022.07) IGFormer: Interaction Graph Transformer for Skeleton-based Human Interaction Recognition, [[Paper]](https://arxiv.org/pdf/2207.12100.pdf) 19 | - (arXiv 2023.04) ViPLO: Vision Transformer based Pose-Conditioned Self-Loop Graph for Human-Object Interaction Detection, [[Paper]](https://arxiv.org/pdf/2304.08114.pdf), [[Code]](https://github.com/Jeeseung-Park/ViPLO) 20 | - (arXiv 2023.08) Exploring Predicate Visual Context in Detecting of Human–Object Interactions, [[Paper]](https://arxiv.org/pdf/2308.06202.pdf), [[Code]](https://github.com/fredzzhang/pvic) 21 | - (arXiv 2023.08) Compositional Learning in Transformer-Based Human-Object Interaction Detection, [[Paper]](https://arxiv.org/pdf/2308.05961.pdf) 22 | - (arXiv 2023.08) Agglomerative Transformer for Human-Object Interaction Detection, [[Paper]](https://arxiv.org/pdf/2308.08370.pdf), [[Code]](https://github.com/six6607/AGER.git) 23 | - (arXiv 2024.01) A Two-stream Hybrid CNN-Transformer Network for Skeleton-based Human Interaction Recognition, [[Paper]](https://arxiv.org/pdf/2401.00409.pdf),
[[Code]](https://github.com/six6607/AGER.git) 24 | - (arXiv 2024.05) Bidirectional Progressive Transformer for Interaction Intention Anticipation, [[Paper]](https://arxiv.org/pdf/2405.05552.pdf) 25 | - (arXiv 2025.03) End-to-End HOI Reconstruction Transformer with Graph-based Encoding, [[Paper]](https://arxiv.org/pdf/2503.06012.pdf), [[Code]](https://github.com/ZhenrongWang/hoitg) 26 | -------------------------------------------------------------------------------- /main/illumination.md: -------------------------------------------------------------------------------- 1 | ### Illumination 2 | - (arXiv 2022.05) Illumination Adaptive Transformer, [[Paper]](https://arxiv.org/pdf/2205.14871.pdf), [[Code]](https://github.com/cuiziteng/Illumination-Adaptive-Transformer) 3 | - (arXiv 2024.10) 360U-Former: HDR Illumination Estimation with Panoramic Adapted Vision Transformers, [[Paper]](https://arxiv.org/pdf/2410.13566.pdf) 4 | -------------------------------------------------------------------------------- /main/in-painting.md: -------------------------------------------------------------------------------- 1 | ### In-painting 2 | - (ECCV'20) Learning Joint Spatial-Temporal Transformations for Video Inpainting, [[Paper]](https://arxiv.org/abs/2007.10247), [[Code]](https://github.com/researchmm/STTN) 3 | - (arXiv 2021.04) Aggregated Contextual Transformations for High-Resolution Image Inpainting, [[Paper]](https://arxiv.org/abs/2104.01431), [[Code]](https://github.com/researchmm/AOT-GAN-for-Inpainting) 4 | - (arXiv 2021.04) Decoupled Spatial-Temporal Transformer for Video Inpainting, [[Paper]](https://arxiv.org/pdf/2104.06637.pdf), [[Code]](https://github.com/ruiliu-ai/DSTT) 5 | - (arXiv 2022.03) Incremental Transformer Structure Enhanced Image Inpainting with Masking Positional Encoding, [[Paper]](https://arxiv.org/pdf/2203.00867.pdf), [[Code]](https://github.com/DQiaole/ZITS_inpainting) 6 | - (arXiv 2022.03) MAT: Mask-Aware Transformer for Large Hole Image Inpainting, [[Paper]](https://arxiv.org/pdf/2203.15270.pdf), [[Code]](https://github.com/fenglinglwb/MAT) 7 | - (arXiv 2022.05) Reduce Information Loss in Transformers for Pluralistic Image Inpainting, [[Paper]](https://arxiv.org/pdf/2205.05076.pdf) 8 | - (arXiv 2022.08) Flow-Guided Transformer for Video Inpainting, [[Paper]](https://arxiv.org/pdf/2208.06768.pdf), [[Code]](https://github.com/hitachinsk/FGT) 9 | - (arXiv 2022.09) DeViT: Deformed Vision Transformers in Video Inpainting, [[Paper]](https://arxiv.org/pdf/2209.13925.pdf) 10 | - (arXiv 2022.10) TPFNet: A Novel Text In-painting Transformer for Text Removal, [[Paper]](https://arxiv.org/pdf/2210.14461.pdf), [[Code]](https://github.com/CandleLabAI/TPFNet) 11 | - (arXiv 2023.01) Exploiting Optical Flow Guidance for Transformer-Based Video Inpainting, [[Paper]](https://arxiv.org/pdf/2301.10048.pdf) 12 | - (arXiv 2023.05) T-former: An Efficient Transformer for Image Inpainting, [[Paper]](https://arxiv.org/pdf/2305.07239.pdf), [[Code]](https://github.com/dengyecode/T-former_image_inpainting) 13 | - (arXiv 2023.06) TransRef: Multi-Scale Reference Embedding Transformer for Reference-Guided Image Inpainting, [[Paper]](https://arxiv.org/pdf/2306.11528.pdf), [[Code]](https://github.com/Cameltr/TransRef) 14 | - (arXiv 2023.07) Deficiency-Aware Masked Transformer for Video Inpainting, [[Paper]](https://arxiv.org/pdf/2307.08629.pdf), [[Code]](http://github.com/yeates/DMT) 15 | - (arXiv 2023.09) ProPainter: Improving Propagation and Transformer for Video Inpainting,
[[Paper]](https://arxiv.org/pdf/2309.03897.pdf), [[Code]](https://github.com/sczhou/ProPainter) 16 | - (arXiv 2024.01) Federated Class-Incremental Learning with Prototype Guided Transformer, [[Paper]](https://arxiv.org/pdf/2401.02094.pdf) 17 | - (arXiv 2024.02) HINT: High-quality INPainting Transformer with Mask-Aware Encoding and Enhanced Attention, [[Paper]](https://arxiv.org/pdf/2402.14185.pdf), [[Code]](https://github.com/ChrisChen1023/HINT) 18 | - (arXiv 2024.03) Towards Online Real-Time Memory-based Video Inpainting Transformers, [[Paper]](https://arxiv.org/pdf/2403.16161.pdf) 19 | - (arXiv 2024.04) Raformer: Redundancy-Aware Transformer for Video Wire Inpainting, [[Paper]](https://arxiv.org/pdf/2404.15802.pdf), [[Code]](https://github.com/Suyimu/WRV2) 20 | - (arXiv 2024.07) Transformer-based Image and Video Inpainting: Current Challenges and Future Directions, [[Paper]](https://arxiv.org/pdf/2407.00226.pdf) 21 | - (arXiv 2024.07) MxT: Mamba x Transformer for Image Inpainting, [[Paper]](https://arxiv.org/pdf/2407.16126.pdf) 22 | -------------------------------------------------------------------------------- /main/incremental-learning.md: -------------------------------------------------------------------------------- 1 | ### Incremental Learning 2 | - (arXiv 2021.12) Improving Vision Transformers for Incremental Learning, [[Paper]](https://arxiv.org/pdf/2112.06103.pdf) 3 | - (arXiv 2022.03) Meta-attention for ViT-backed Continual Learning, [[Paper]](https://arxiv.org/pdf/2203.11684.pdf), [[Code]](https://github.com/zju-vipa/MEAT-TIL) 4 | - (arXiv 2022.03) Towards Exemplar-Free Continual Learning in Vision Transformers: an Account of Attention, Functional and Weight Regularization, [[Paper]](https://arxiv.org/pdf/2203.13167.pdf) 5 | - (arXiv 2022.07) Online Continual Learning with Contrastive Vision Transformer, [[Paper]](https://arxiv.org/pdf/2207.13516.pdf) 6 | - (arXiv 2022.08) D3Former: Debiased Dual Distilled Transformer for Incremental Learning, [[Paper]](https://arxiv.org/pdf/2208.00777.pdf), [[Code]](https://tinyurl.com/d3former) 7 | - (arXiv 2022.10) A Memory Transformer Network for Incremental Learning, [[Paper]](https://arxiv.org/pdf/2210.04485.pdf) 8 | - (arXiv 2023.01) Combined Use of Federated Learning and Image Encryption for Privacy-Preserving Image Classification with Vision Transformer, [[Paper]](https://arxiv.org/pdf/2301.09255.pdf) 9 | - (arXiv 2023.03) Learning to Grow Artificial Hippocampi in Vision Transformers for Resilient Lifelong Learning, [[Paper]](https://arxiv.org/pdf/2303.08250.pdf) 10 | - (arXiv 2023.03) Dense Network Expansion for Class Incremental Learning, [[Paper]](https://arxiv.org/pdf/2303.12696.pdf) 11 | - (arXiv 2023.03) Semantic-visual Guided Transformer for Few-shot Class-incremental Learning, [[Paper]](https://arxiv.org/pdf/2303.15494.pdf) 12 | - (arXiv 2023.04) Continual Detection Transformer for Incremental Object Detection, [[Paper]](https://arxiv.org/pdf/2304.03110.pdf) 13 | - (arXiv 2023.04) Preserving Locality in Vision Transformers for Class Incremental Learning, [[Paper]](https://arxiv.org/pdf/2304.06971.pdf) 14 | - (arXiv 2023.05) BiRT: Bio-inspired Replay in Vision Transformers for Continual Learning, [[Paper]](https://arxiv.org/pdf/2305.04769.pdf), [[Code]](https://github.com/NeurAI-Lab/BiRT) 15 | - (arXiv 2023.06) TADIL: Task-Agnostic Domain-Incremental Learning through Task-ID Inference using Transformer Nearest-Centroid Embeddings, [[Paper]](https://arxiv.org/pdf/2306.11955.pdf)
16 | - (arXiv 2023.08) On the Effectiveness of LayerNorm Tuning for Continual Learning in Vision Transformers, [[Paper]](https://arxiv.org/pdf/2308.09372.pdf), [[Code]](https://github.com/tdemin16/Continual-LayerNorm-Tuning) 17 | - (arXiv 2023.08) Exemplar-Free Continual Transformer with Convolutions, [[Paper]](https://arxiv.org/pdf/2308.11357.pdf), [[Project]](https://cvir.github.io/projects/contracon) 18 | - (arXiv 2023.08) Introducing Language Guidance in Prompt-based Continual Learning, [[Paper]](https://arxiv.org/pdf/2308.15827.pdf) 19 | - (arXiv 2023.11) CMFDFormer: Transformer-based Copy-Move Forgery Detection with Continual Learning, [[Paper]](https://arxiv.org/pdf/2311.13263.pdf) 20 | - (arXiv 2023.12) Fine-Grained Knowledge Selection and Restoration for Non-Exemplar Class Incremental Learning, [[Paper]](https://arxiv.org/pdf/2312.12722.pdf), [[Code]](https://github.com/scok30/) 21 | - (arXiv 2024.01) PL-FSCIL: Harnessing the Power of Prompts for Few-Shot Class-Incremental Learning, [[Paper]](https://arxiv.org/pdf/2401.14807.pdf), [[Code]](https://github.com/TianSongS/PL-FSCIL) 22 | - (arXiv 2024.01) Dynamic Transformer Architecture for Continual Learning of Multimodal Tasks, [[Paper]](https://arxiv.org/pdf/2401.14807.pdf) 23 | - (arXiv 2024.03) Semantically-Shifted Incremental Adapter-Tuning is A Continual ViTransformer, [[Paper]](https://arxiv.org/pdf/2403.19979.pdf), [[Code]](https://github.com/HAIV-Lab/SSIAT) 24 | - (arXiv 2024.04) Pre-trained Vision and Language Transformers Are Few-Shot Incremental Learners, [[Paper]](https://arxiv.org/pdf/2404.02117.pdf), [[Code]](https://github.com/KHU-AGI/PriViLege) 25 | - (arXiv 2024.04) Calibrating Higher-Order Statistics for Few-Shot Class-Incremental Learning with Pre-trained Vision Transformers, [[Paper]](https://arxiv.org/pdf/2404.06622.pdf), [[Code]](https://github.com/dipamgoswami/FSCIL-Calibration) 26 | - (arXiv 2024.04) Remembering Transformer for Continual Learning, [[Paper]](https://arxiv.org/pdf/2404.07518.pdf) 27 | - (arXiv 2024.05) Less is more: Summarizing Patch Tokens for efficient Multi-Label Class-Incremental Learning, [[Paper]](https://arxiv.org/pdf/2405.15633.pdf), [[Code]](https://github.com/tdemin16/multi-lane) 28 | - (arXiv 2024.07) PECTP: Parameter-Efficient Cross-Task Prompts for Incremental Vision Transformer, [[Paper]](https://arxiv.org/pdf/2407.03813.pdf), [[Code]](https://github.com/RAIAN08/PECTP) 29 | - (arXiv 2024.08) Dynamic Object Queries for Transformer-based Incremental Object Detection, [[Paper]](https://arxiv.org/pdf/2407.21687.pdf) 30 | - (arXiv 2025.03) Few-Shot Class-Incremental Model Attribution Using Learnable Representation From CLIP-ViT Features, [[Paper]](https://arxiv.org/pdf/2503.08148.pdf) 31 | -------------------------------------------------------------------------------- /main/knowledge-distillation.md: -------------------------------------------------------------------------------- 1 | ### Knowledge Distillation 2 | - (arXiv 2022.04) DearKD: Data-Efficient Early Knowledge Distillation for Vision Transformers, [[Paper]](https://arxiv.org/pdf/2204.12997.pdf) 3 | - (arXiv 2022.05) Knowledge Distillation via the Target-aware Transformer, [[Paper]](https://arxiv.org/pdf/2205.10793.pdf) 4 | - (arXiv 2022.05) Contrastive Learning Rivals Masked Image Modeling in Fine-tuning via Feature Distillation, [[Paper]](https://arxiv.org/pdf/2205.14141.pdf), [[Code]](https://github.com/SwinTransformer/Feature-Distillation) 5 | - (arXiv 2022.07) Self-Distilled Vision Transformer for Domain Generalization,
[[Paper]](https://arxiv.org/pdf/2207.12392.pdf), [[Code]](https://github.com/maryam089/SDViT) 6 | - (arXiv 2022.09) ViTKD: Practical Guidelines for ViT feature knowledge distillation, [[Paper]](https://arxiv.org/pdf/2209.02432.pdf), [[Code]](https://github.com/yzd-v/cls_KD) 7 | - (arXiv 2022.10) Self-Distillation for Further Pre-training of Transformers, [[Paper]](https://arxiv.org/pdf/2210.02871.pdf) 8 | - (arXiv 2022.11) Knowledge Distillation for Detection Transformer with Consistent Distillation Points Sampling, [[Paper]](https://arxiv.org/pdf/2211.08071.pdf) 9 | - (arXiv 2022.11) D3ETR: Decoder Distillation for Detection Transformer, [[Paper]](https://arxiv.org/pdf/2211.09768.pdf) 10 | - (arXiv 2022.11) DETRDistill: A Universal Knowledge Distillation Framework for DETR-families, [[Paper]](https://arxiv.org/pdf/2211.10156.pdf) 11 | - (arXiv 2022.12) Autoencoders as Cross-Modal Teachers: Can Pretrained 2D Image Transformers Help 3D Representation Learning, [[Paper]](https://arxiv.org/pdf/2212.08320.pdf), [[Code]](https://github.com/RunpeiDong/ACT) 12 | - (arXiv 2022.12) OVO: One-shot Vision Transformer Search with Online distillation, [[Paper]](https://arxiv.org/pdf/2212.13766.pdf) 13 | - (arXiv 2023.02) Knowledge Distillation in Vision Transformers: A Critical Review, [[Paper]](https://arxiv.org/pdf/2302.02108.pdf) 14 | - (arXiv 2023.02) MaskedKD: Efficient Distillation of Vision Transformers with Masked Images, [[Paper]](https://arxiv.org/pdf/2302.10494.pdf) 15 | - (arXiv 2023.03) Multi-view knowledge distillation transformer for human action recognition, [[Paper]](https://arxiv.org/pdf/2303.14358.pdf) 16 | - (arXiv 2023.03) Supervised Masked Knowledge Distillation for Few-Shot Transformers, [[Paper]](https://arxiv.org/pdf/2303.15466.pdf), [[Code]](https://github.com/HL-hanlin/SMKD) 17 | - (arXiv 2023.05) Vision Transformers for Small Histological Datasets Learned through Knowledge Distillation, [[Paper]](https://arxiv.org/pdf/2305.17370.pdf) 18 | - (arXiv 2023.05) Are Large Kernels Better Teachers than Transformers for ConvNets?, [[Paper]](https://arxiv.org/pdf/2305.19412.pdf), [[Code]](https://github.com/VITA-Group/SLaK) 19 | - (arXiv 2023.07) Cumulative Spatial Knowledge Distillation for Vision Transformers, [[Paper]](https://arxiv.org/pdf/2307.08500.pdf) 20 | - (arXiv 2023.10) CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction, [[Paper]](https://arxiv.org/pdf/2310.01403.pdf), [[Code]](https://github.com/wusize/CLIPSelf) 21 | - (arXiv 2023.10) Distilling Efficient Vision Transformers from CNNs for Semantic Segmentation, [[Paper]](https://arxiv.org/pdf/2310.07265.pdf), [[Code]](https://vlislab22.github.io/C2VKD/) 22 | - (arXiv 2023.10) One-for-All: Bridge the Gap Between Heterogeneous Architectures in Knowledge Distillation, [[Paper]](https://arxiv.org/pdf/2310.19444.pdf), [[Code]](https://github.com/Hao840/OFAKD) 23 | - (arXiv 2023.11) Learning Contrastive Self-Distillation for Ultra-Fine-Grained Visual Categorization Targeting Limited Samples, [[Paper]](https://arxiv.org/pdf/2311.06056.pdf) 24 | - (arXiv 2023.12) GIST: Improving Parameter Efficient Fine Tuning via Knowledge Interaction, [[Paper]](https://arxiv.org/pdf/2312.07255.pdf) 25 | - (arXiv 2024.02) m2mKD: Module-to-Module Knowledge Distillation for Modular Transformers, [[Paper]](https://arxiv.org/pdf/2402.16918.pdf), [[Code]](https://github.com/kamanphoebe/m2mKD) 26 | - (arXiv 2024.04) Correlation-Decoupled Knowledge Distillation for Multimodal Sentiment Analysis with Incomplete Modalities,
[[Paper]](https://arxiv.org/pdf/2404.16456.pdf) 27 | - (arXiv 2024.07) Towards Optimal Trade-offs in Knowledge Distillation for CNNs and Vision Transformers at the Edge, [[Paper]](https://arxiv.org/pdf/2407.12808.pdf) 28 | - (arXiv 2024.07) Continual Distillation Learning, [[Paper]](https://arxiv.org/pdf/2407.13911v1.pdf), [[Code]](https://github.com/IRVLUTD/CDL) 29 | - (arXiv 2024.08) Optimizing Vision Transformers with Data-Free Knowledge Transfer, [[Paper]](https://arxiv.org/pdf/2408.05952.pdf) 30 | - (arXiv 2024.08) Adaptive Knowledge Distillation for Classification of Hand Images using Explainable Vision Transformers, [[Paper]](https://arxiv.org/pdf/2408.10503.pdf) 31 | - (arXiv 2024.11) ScaleKD: Strong Vision Transformers Could Be Excellent Teachers, [[Paper]](https://arxiv.org/pdf/2411.06786.pdf), [[Code]](https://github.com/deep-optimization/ScaleKD) 32 | - (arXiv 2025.02) Optimizing Knowledge Distillation in Transformers: Enabling Multi-Head Attention without Alignment Barriers, [[Paper]](https://arxiv.org/pdf/2502.07436.pdf) 33 | - (arXiv 2025.03) Distilling Knowledge into Quantum Vision Transformers for Biomedical Image Classification, [[Paper]](https://arxiv.org/pdf/2503.07294.pdf), [[Code]](https://github.com/surgical-vision/QViT-KD) 34 | - (arXiv 2025.03) ViT-Linearizer: Distilling Quadratic Knowledge into Linear-Time Vision Models, [[Paper]](https://arxiv.org/pdf/2504.00037.pdf) 35 | - (arXiv 2025.06) Distill CLIP (DCLIP): Enhancing Image-Text Retrieval via Cross-Modal Transformer Distillation, [[Paper]](https://arxiv.org/pdf/2505.21549.pdf) 36 | -------------------------------------------------------------------------------- /main/lane.md: -------------------------------------------------------------------------------- 1 | ### Lane 2 | - (arXiv 2022.03) Laneformer: Object-aware Row-Column Transformers for Lane Detection, [[Paper]](https://arxiv.org/pdf/2203.09830.pdf) 3 | - (arXiv 2022.03) PersFormer: 3D Lane Detection via Perspective Transformer and the OpenLane Benchmark, [[Paper]](https://arxiv.org/pdf/2203.11089.pdf), [[Project]](https://github.com/OpenPerceptionX/OpenLane) 4 | - (arXiv 2022.09) PriorLane: A Prior Knowledge Enhanced Lane Detection Approach Based on Transformer, [[Paper]](https://arxiv.org/pdf/2209.06994.pdf), [[Code]](https://github.com/vincentqqb/PriorLane) 5 | - (arXiv 2022.09) CurveFormer: 3D Lane Detection by Curve Propagation with Curve Queries and Attention, [[Paper]](https://arxiv.org/pdf/2209.07989.pdf) 6 | - (arXiv 2023.08) LATR: 3D Lane Detection from Monocular Images with Transformer, [[Paper]](https://arxiv.org/pdf/2308.04583.pdf), [[Code]](https://github.com/JMoonr/LATR) 7 | - (arXiv 2024.02) CurveFormer++: 3D Lane Detection by Curve Propagation with Temporal Curve Queries and Attention, [[Paper]](https://arxiv.org/pdf/2402.06423.pdf), [[Code]](https://github.com/JMoonr/LATR) 8 | - (arXiv 2024.03) LDTR: Transformer-based Lane Detection with Anchor-chain Representation, [[Paper]](https://arxiv.org/pdf/2403.14354.pdf) 9 | - (arXiv 2024.04) Sparse Laneformer, [[Paper]](https://arxiv.org/pdf/2404.07821.pdf) 10 | - (arXiv 2024.04) BezierFormer: A Unified Architecture for 2D and 3D Lane Detection, [[Paper]](https://arxiv.org/pdf/2404.16304.pdf) 11 | -------------------------------------------------------------------------------- /main/layout.md: -------------------------------------------------------------------------------- 1 | ### Layout 2 | - (CVPR'21) Variational Transformer Networks for Layout Generation, 
[[Paper]](https://arxiv.org/abs/2104.02416) 3 | - (arXiv 2021.10) The Layout Generation Algorithm of Graphic Design Based on Transformer-CVAE, [[Paper]](https://arxiv.org/abs/2110.06794) 4 | - (arXiv 2021.12) BLT: Bidirectional Layout Transformer for Controllable Layout Generation, [[Paper]](https://arxiv.org/abs/2112.05112) 5 | - (arXiv 2022.02) ATEK: Augmenting Transformers with Expert Knowledge for Indoor Layout Synthesis, [[Paper]](https://arxiv.org/abs/2112.05112) 6 | - (arXiv 2022.03) LGT-Net: Indoor Panoramic Room Layout Estimation with Geometry-Aware Transformer Network, [[Paper]](https://arxiv.org/abs/2203.01824), [[Code]](https://github.com/zhigangjiang/LGT-Net) 7 | - (arXiv 2022.08) UniLayout: Taming Unified Sequence-to-Sequence Transformers for Graphic Layout Generation, [[Paper]](https://arxiv.org/pdf/2208.08037.pdf) 8 | - (arXiv 2022.09) Geometry Aligned Variational Transformer for Image-conditioned Layout Generation, [[Paper]](https://arxiv.org/pdf/2209.00852.pdf) 9 | - (arXiv 2022.12) LayoutDETR: Detection Transformer Is a Good Multimodal Layout Designer, [[Paper]](https://arxiv.org/pdf/2212.09877.pdf), [[Code]](https://github.com/salesforce/LayoutDETR) 10 | - (arXiv 2022.12) PanoViT: Vision Transformer for Room Layout Estimation from a Single Panoramic Image, [[Paper]](https://arxiv.org/pdf/2212.12156.pdf) 11 | - (arXiv 2023.03) DLT: Conditioned layout generation with Joint Discrete-Continuous Diffusion Layout Transformer, [[Paper]](https://arxiv.org/pdf/2303.03755.pdf) 12 | - (arXiv 2023.04) GUILGET: GUI Layout GEneration with Transformer, [[Paper]](https://arxiv.org/pdf/2304.09012.pdf) 13 | - (arXiv 2023.05) LayoutDM: Transformer-based Diffusion Model for Layout Generation, [[Paper]](https://arxiv.org/pdf/2305.02567.pdf) 14 | - (arXiv 2023.08) MapPrior: Bird's-Eye View Map Layout Estimation with Generative Models, [[Paper]](https://arxiv.org/pdf/2308.12963.pdf), [[Code]](https://mapprior.github.io/) 15 | - (arXiv 2023.08) Vision Grid Transformer for Document Layout Analysis, [[Paper]](https://arxiv.org/pdf/2308.14978.pdf), [[Code]](https://github.com/AlibabaResearch/AdvancedLiterateMachinery) 16 | - (arXiv 2023.08) Document AI: A Comparative Study of Transformer-Based, Graph-Based Models, and Convolutional Neural Networks For Document Layout Analysis, [[Paper]](https://arxiv.org/pdf/2308.15517.pdf) 17 | - (arXiv 2023.10) Dolfin: Diffusion Layout Transformers without Autoencoder, [[Paper]](https://arxiv.org/pdf/2310.16305.pdf) 18 | - (arXiv 2023.11) LayoutPrompter: Awaken the Design Ability of Large Language Models, [[Paper]](https://arxiv.org/pdf/2311.06495.pdf), [[Code]](https://github.com/microsoft/LayoutGeneration/tree/main/LayoutPrompter) 19 | - (arXiv 2023.11) Retrieval-Augmented Layout Transformer for Content-Aware Layout Generation, [[Paper]](https://arxiv.org/pdf/2311.13602.pdf), [[Project]](https://udonda.github.io/RALF/) 20 | - (arXiv 2024.05) DLAFormer: An End-to-End Transformer For Document Layout Analysis, [[Paper]](https://arxiv.org/pdf/2405.11757.pdf) 21 | - (arXiv 2024.07) CGB-DM: Content and Graphic Balance Layout Generation with Transformer-based Diffusion Model, [[Paper]](https://arxiv.org/pdf/2405.11757.pdf), [[Code]](https://github.com/yuli0103/CGB-DM) 22 | - (arXiv 2025.05) Lay-Your-Scene: Natural Scene Layout Generation with Diffusion Transformers, [[Paper]](https://arxiv.org/pdf/2505.04718.pdf) 23 | -------------------------------------------------------------------------------- /main/lighting.md:
-------------------------------------------------------------------------------- 1 | ### Lighting 2 | - (arXiv 2022.02) Spatio-Temporal Outdoor Lighting Aggregation on Image Sequences using Transformer Networks, [[Paper]](https://arxiv.org/abs/2202.09206) 3 | - (arXiv 2023.05) Ray-Patch: An Efficient Decoder for Light Field Transformers, [[Paper]](https://arxiv.org/abs/2305.09566) 4 | -------------------------------------------------------------------------------- /main/matching.md: -------------------------------------------------------------------------------- 1 | ### Matching 2 | - (CVPR'21) LoFTR: Detector-Free Local Feature Matching with Transformers, [[Paper]](https://arxiv.org/abs/2104.00680), [[Code]](https://zju3dv.github.io/loftr/) 3 | - (arXiv 2022.02) Local Feature Matching with Transformers for low-end devices, [[Paper]](https://arxiv.org/pdf/2202.00770.pdf), [[Code]](https://github.com/Kolkir/Coarse_LoFTR_TRT) 4 | - (arXiv 2022.02) CATs++: Boosting Cost Aggregation with Convolutions and Transformers, [[Paper]](https://arxiv.org/pdf/2202.06817.pdf), [[Code]](https://github.com/SunghwanHong/Cost-Aggregation-transformers) 5 | - (arXiv 2022.03) MatchFormer: Interleaving Attention in Transformers for Feature Matching, [[Paper]](https://arxiv.org/pdf/2203.09645.pdf), [[Code]](https://github.com/jamycheung/MatchFormer) 6 | - (arXiv 2022.05) TransforMatcher: Match-to-Match Attention for Semantic Correspondence, [[Paper]](https://arxiv.org/pdf/2205.11634.pdf), [[Code]](http://cvlab.postech.ac.kr/research/TransforMatcher) 7 | - (arXiv 2022.07) Deep Laparoscopic Stereo Matching with Transformers, [[Paper]](https://arxiv.org/pdf/2207.12152.pdf) 8 | - (arXiv 2022.08) ASpanFormer: Detector-Free Image Matching with Adaptive Span Transformer, [[Paper]](https://arxiv.org/pdf/2208.14201.pdf), [[Project]](https://aspanformer.github.io/) 9 | - (arXiv 2023.01) DeepMatcher: A Deep Transformer-based Network for Robust and Accurate Local Feature Matching, [[Paper]](https://arxiv.org/pdf/2301.02993.pdf), [[Code]](https://github.com/XT-1997/DeepMatcher) 10 | - (arXiv 2023.03) ParaFormer: Parallel Attention Transformer for Efficient Feature Matching, [[Paper]](https://arxiv.org/pdf/2303.00941.pdf) 11 | - (arXiv 2023.03) Improving Transformer-based Image Matching by Cascaded Capturing Spatially Informative Keypoints, [[Paper]](https://arxiv.org/pdf/2303.02885.pdf) 12 | - (arXiv 2023.03) Adaptive Spot-Guided Transformer for Consistent Local Feature Matching, [[Paper]](https://arxiv.org/pdf/2303.16624.pdf), [[Code]](https://astr2023.github.io/) 13 | - (arXiv 2023.05) AMatFormer: Efficient Feature Matching via Anchor Matching Transformer, [[Paper]](https://arxiv.org/pdf/2305.19205.pdf) 14 | - (arXiv 2023.08) Multi-scale Alternated Attention Transformer for Generalized Stereo Matching, [[Paper]](https://arxiv.org/pdf/2308.03048.pdf) 15 | - (arXiv 2023.10) FMRT: Learning Accurate Feature Matching with Reconciliatory Transformer, [[Paper]](https://arxiv.org/pdf/2310.13605.pdf) 16 | - (arXiv 2023.11) LGFCTR: Local and Global Feature Convolutional Transformer for Image Matching, [[Paper]](https://arxiv.org/pdf/2311.17571.pdf), [[Code]](https://github.com/zwh0527/LGFCTR) 17 | - (arXiv 2023.12) Latent Space Editing in Transformer-Based Flow Matching, [[Paper]](https://arxiv.org/pdf/2312.10825.pdf), [[Code]](https://taohu.me/lfm/) 18 | - (arXiv 2024.04) IFViT: Interpretable Fixed-Length Representation for Fingerprint Matching via Vision Transformer, [[Paper]](https://arxiv.org/pdf/2404.08237.pdf) 19 | - (arXiv
2024.04) XoFTR: Cross-modal Feature Matching Transformer, [[Paper]](https://arxiv.org/pdf/2404.09692.pdf), [[Code]](https://github.com/OnderT/XoFTR) 20 | - (arXiv 2024.05) A Light-weight Transformer-based Self-supervised Matching Network for Heterogeneous Images, [[Paper]](https://arxiv.org/pdf/2404.19311.pdf), [[Code]](https://github.com/NUST-Machine-Intelligence-Laboratory/LTFormer) 21 | - (arXiv 2024.05) TP3M: Transformer-based Pseudo 3D Image Matching with Reference, [[Paper]](https://arxiv.org/pdf/2405.08434.pdf) 22 | - (arXiv 2024.06) CorrMAE: Pre-training Correspondence Transformers with Masked Autoencoder, [[Paper]](https://arxiv.org/pdf/2409.02545.pdf) 23 | - (arXiv 2024.10) ETO: Efficient Transformer-based Local Feature Matching by Organizing Multiple Homography Hypotheses, [[Paper]](https://arxiv.org/pdf/2410.22733.pdf) 24 | - (arXiv 2024.10) LoFLAT: Local Feature Matching using Focused Linear Attention Transformer, [[Paper]](https://arxiv.org/pdf/2410.22710.pdf) 25 | - (arXiv 2025.01) Hadamard Attention Recurrent Transformer: A Strong Baseline for Stereo Matching Transformer, [[Paper]](https://arxiv.org/pdf/2501.01023.pdf), [[Code]](https://github.com/ZYangChen/HART) 26 | - (arXiv 2025.02) Beyond the Permutation Symmetry of Transformers: The Role of Rotation for Model Fusion, [[Paper]](https://arxiv.org/pdf/2502.00264.pdf), [[Code]](https://github.com/zhengzaiyi/RotationSymmetry) 27 | - (arXiv 2025.03) Normalized Matching Transformer, [[Paper]](https://arxiv.org/pdf/2503.17715.pdf), [[Code]](https://github.com/Apollos1301/NormMatchTrans) 28 | - (arXiv 2025.03) CoMatch: Dynamic Covisibility-Aware Transformer for Bilateral Subpixel-Level Semi-Dense Image Matching, [[Paper]](https://arxiv.org/pdf/2503.23925.pdf) 29 | - (arXiv 2025.05) RDD: Robust Feature Detector and Descriptor using Deformable Transformer, [[Paper]](https://arxiv.org/pdf/2505.08013.pdf), [[Code]](https://github.com/xtcpete/rdd) 30 | -------------------------------------------------------------------------------- /main/matting.md: -------------------------------------------------------------------------------- 1 | ### Matting 2 | - (arXiv 2022.03) MatteFormer: Transformer-Based Image Matting via Prior-Tokens, [[Paper]](https://arxiv.org/pdf/2203.15662.pdf), [[Code]](https://github.com/webtoon/matteformer) 3 | - (arXiv 2022.08) TransMatting: Enhancing Transparent Objects Matting with Transformers, [[Paper]](https://arxiv.org/pdf/2208.03007.pdf), [[Code]](https://github.com/AceCHQ/TransMatting) 4 | - (arXiv 2022.08) VMFormer: End-to-End Video Matting with Transformer, [[Paper]](https://arxiv.org/pdf/2208.12801.pdf), [[Project]](https://chrisjuniorli.github.io/project/VMFormer/) 5 | - (arXiv 2023.03) TransMatting: Tri-token Equipped Transformer Model for Image Matting, [[Paper]](https://arxiv.org/pdf/2303.06476.pdf), [[Project]](https://github.com/AceCHQ/TransMatting) 6 | - (arXiv 2023.05) ViTMatte: Boosting Image Matting with Pretrained Plain Vision Transformers, [[Paper]](https://arxiv.org/pdf/2305.15272.pdf) 7 | - (arXiv 2023.08) EFormer: Enhanced Transformer towards Semantic-Contour Features of Foreground for Portraits Matting, [[Paper]](https://arxiv.org/pdf/2308.12831.pdf) 8 | -------------------------------------------------------------------------------- /main/mesh.md: -------------------------------------------------------------------------------- 1 | ### Mesh 2 | - (arXiv 2021.09) TransforMesh: A Transformer Network for Longitudinal modeling of Anatomical Meshes, [[Paper]](https://arxiv.org/pdf/2109.00532.pdf) 3
| - (arXiv 2022.07) Cross-Attention of Disentangled Modalities for 3D Human Mesh Recovery with Transformers, [[Paper]](https://arxiv.org/pdf/2207.13820.pdf), [[Code]](https://github.com/postech-ami/FastMETRO) 4 | - (arXiv 2022.11) TORE: Token Reduction for Efficient Human Mesh Recovery with Transformer, [[Paper]](https://arxiv.org/pdf/2211.10705.pdf) 5 | - (arXiv 2023.03) GATOR: Graph-Aware Transformer with Motion-Disentangled Regression for Human Mesh Recovery from a 2D Pose, [[Paper]](https://arxiv.org/pdf/2303.05652.pdf) 6 | - (arXiv 2023.03) DDT: A Diffusion-Driven Transformer-based Framework for Human Mesh Recovery from a Video, [[Paper]](https://arxiv.org/pdf/2303.13397.pdf) 7 | - (arXiv 2023.03) POTTER: Pooling Attention Transformer for Efficient Human Mesh Recovery, [[Paper]](https://arxiv.org/pdf/2303.13357.pdf), [[Project]](https://zczcwh.github.io/potter_page) 8 | - (arXiv 2023.03) One-Stage 3D Whole-Body Mesh Recovery with Component Aware Transformer, [[Paper]](https://arxiv.org/pdf/2303.16160.pdf), [[Project]](https://osx-ubody.github.io/) 9 | - (arXiv 2023.07) MeT: A Graph Transformer for Semantic Segmentation of 3D Meshes, [[Paper]](https://arxiv.org/pdf/2307.01115.pdf), [[Project]](https://osx-ubody.github.io/) 10 | - (arXiv 2023.07) 3Deformer: A Common Framework for Image-Guided Mesh Deformation, [[Paper]](https://arxiv.org/pdf/2307.09892.pdf), [[Project]](https://osx-ubody.github.io/) 11 | - (arXiv 2023.07) JOTR: 3D Joint Contrastive Learning with Transformers for Occluded Human Mesh Recovery, [[Paper]](https://arxiv.org/pdf/2307.16377.pdf), [[Code]](https://github.com/xljh0520/JOTR) 12 | - (arXiv 2023.08) Coordinate Transformer: Achieving Single-stage Multi-person Mesh Recovery from Videos, [[Paper]](https://arxiv.org/pdf/2308.10334.pdf), [[Code]](https://github.com/Li-Hao-yuan/CoordFormer) 13 | - (arXiv 2023.11) MeshGPT: Generating Triangle Meshes with Decoder-Only Transformers, [[Paper]](https://arxiv.org/pdf/2311.15475.pdf), [[Project]](https://nihalsid.github.io/mesh-gpt/) 14 | - (arXiv 2024.02) Multi-Human Mesh Recovery with Transformers, [[Paper]](https://arxiv.org/pdf/2402.16806.pdf) 15 | - (arXiv 2024.03) Distribution and Depth-Aware Transformers for 3D Human Mesh Recovery, [[Paper]](https://arxiv.org/pdf/2403.09063.pdf) 16 | - (arXiv 2024.03) T-Pixel2Mesh: Combining Global and Local Transformer for 3D Mesh Generation from a Single Image, [[Paper]](https://arxiv.org/pdf/2403.13663.pdf) 17 | - (arXiv 2024.03) PostoMETRO: Pose Token Enhanced Mesh Transformer for Robust 3D Human Mesh Recovery, [[Paper]](https://arxiv.org/pdf/2403.12473.pdf) 18 | - (arXiv 2024.03) InstantMesh: Efficient 3D Mesh Generation from a Single Image with Sparse-view Large Reconstruction Models, [[Paper]](https://arxiv.org/pdf/2404.06542.pdf), [[Code]](https://github.com/TencentARC/InstantMesh) 19 | - (arXiv 2024.05) Mesh Denoising Transformer, [[Paper]](https://arxiv.org/pdf/2405.06536v1.pdf) 20 | - (arXiv 2024.06) MeshAnything: Artist-Created Mesh Generation with Autoregressive Transformers, [[Paper]](https://arxiv.org/pdf/2406.10163.pdf), [[Code]](https://github.com/buaacyw/MeshAnything) 21 | - (arXiv 2024.07) STMR: Spiral Transformer for Hand Mesh Reconstruction, [[Paper]](https://arxiv.org/pdf/2407.05967.pdf), [[Code]](https://github.com/SmallXieGithub/STMR) 22 | - (arXiv 2024.08) MeshFormer: High-Quality Mesh Generation with 3D-Guided Reconstruction Model, [[Paper]](https://arxiv.org/pdf/2408.10198.pdf), [[Code]](https://meshformer3d.github.io/) 23 | - (arXiv 2024.09) G3PT: Unleash the power of
Autoregressive Modeling in 3D Generation via Cross-scale Querying Transformer, [[Paper]](https://arxiv.org/pdf/2409.06322.pdf) 24 | - (arXiv 2024.10) A Recipe for Geometry-Aware 3D Mesh Transformers, [[Paper]](https://arxiv.org/pdf/2411.00164.pdf) 25 | - (arXiv 2024.11) DeforHMR: Vision Transformer with Deformable Cross-Attention for 3D Human Mesh Recovery, [[Paper]](https://arxiv.org/pdf/2411.11214.pdf) 26 | - (arXiv 2024.12) MeshArt: Generating Articulated Meshes with Structure-guided Transformers, [[Paper]](https://arxiv.org/pdf/2412.11596.pdf), [[Code]](https://daoyig.github.io/Mesh_Art/) 27 | - (arXiv 2025.03) Fish2Mesh Transformer: 3D Human Mesh Recovery from Egocentric Vision, [[Paper]](https://arxiv.org/pdf/2503.06089.pdf), [[Project]](https://fish2mesh.github.io/) 28 | - (arXiv 2025.03) MeshCraft: Exploring Efficient and Controllable Mesh Generation with Flow-based DiTs, [[Paper]](https://arxiv.org/pdf/2503.23022.pdf) 29 | - (arXiv 2025.06) RenderFormer: Transformer-based Neural Rendering of Triangle Meshes with Global Illumination, [[Paper]](https://arxiv.org/pdf/2505.21925.pdf), [[Project]](https://microsoft.github.io/renderformer/) 30 | -------------------------------------------------------------------------------- /main/metric-learning.md: -------------------------------------------------------------------------------- 1 | ### Metric learning 2 | - (arXiv 2022.03) Hyperbolic Vision Transformers: Combining Improvements in Metric Learning, [[Paper]](https://arxiv.org/pdf/2203.10833.pdf), [[Code]](https://github.com/htdt/hyp_metric) 3 | -------------------------------------------------------------------------------- /main/multi-label.md: -------------------------------------------------------------------------------- 1 | ### Multi-label 2 | - (arXiv 2021.06) MlTr: Multi-label Classification with Transformer, [[Paper]](https://arxiv.org/pdf/2106.06195.pdf), [[Code]](https://github.com/starmemda/MlTr/) 3 | - (arXiv 2021.07) Query2Label: A Simple Transformer Way to Multi-Label Classification, [[Paper]](https://arxiv.org/pdf/2107.10834.pdf), [[Code]](https://github.com/SlongLiu/query2labels) 4 | - (arXiv 2021.10) Transformer-based Dual Relation Graph for Multi-label Image Recognition, [[Paper]](https://arxiv.org/pdf/2110.04722.pdf), [[Code]](https://github.com/iCVTEAM/TDRG) 5 | - (arXiv 2020.11) General Multi-label Image Classification with Transformers, [[Paper]](https://arxiv.org/pdf/2011.14027) 6 | - (arXiv 2022.03) Graph Attention Transformer Network for Multi-Label Image Classification, [[Paper]](https://arxiv.org/pdf/2203.04049.pdf) 7 | - (arXiv 2023.03) Incomplete Multi-View Multi-Label Learning via Label-Guided Masked View- and Category-Aware Transformers, [[Paper]](https://arxiv.org/pdf/2303.07180.pdf) 8 | - (arXiv 2023.09) Multi-Label Feature Selection Using Adaptive and Transformed Relevance, [[Paper]](https://arxiv.org/pdf/2309.14768.pdf) 9 | - (arXiv 2024.07) HSVLT: Hierarchical Scale-Aware Vision-Language Transformer for Multi-Label Image Classification, [[Paper]](https://arxiv.org/pdf/2407.16244.pdf) 10 | -------------------------------------------------------------------------------- /main/multi-view-stereo.md: -------------------------------------------------------------------------------- 1 | ### Multi-view Stereo 2 | - (arXiv 2021.11) TransMVSNet: Global Context-aware Multi-view Stereo Network with Transformers, [[Paper]](https://arxiv.org/pdf/2111.14600.pdf), [[Code]](https://github.com/MegviiRobot/TransMVSNet) 3 | - (arXiv 2021.12) Multi-View Stereo with
Transformer, [[Paper]](https://arxiv.org/pdf/2112.00336.pdf) 4 | - (arXiv 2022.04) MVSTER: Epipolar Transformer for Efficient Multi-View Stereo, [[Paper]](https://arxiv.org/pdf/2204.07346.pdf), [[Code]](https://github.com/JeffWang987) 5 | - (arXiv 2022.05) WT-MVSNet: Window-based Transformers for Multi-view Stereo, [[Paper]](https://arxiv.org/pdf/2205.14319.pdf), [[Code]](https://github.com/JeffWang987) 6 | - (arXiv 2022.08) MVSFormer: Learning Robust Image Representations via Transformers and Temperature-based Depth for Multi-View Stereo, [[Paper]](https://arxiv.org/pdf/2208.02541.pdf) 7 | - (arXiv 2022.08) A Light Touch Approach to Teaching Transformers Multi-view Geometry, [[Paper]](https://arxiv.org/pdf/2211.15107.pdf) 8 | - (arXiv 2023.03) Implicit Ray-Transformers for Multi-view Remote Sensing Image Segmentation, [[Paper]](https://arxiv.org/pdf/2303.08401.pdf) 9 | - (arXiv 2023.05) CostFormer: Cost Transformer for Cost Aggregation in Multi-view Stereo, [[Paper]](https://arxiv.org/pdf/2305.10320.pdf) 10 | - (arXiv 2023.10) GTA: A Geometry-Aware Attention Mechanism for Multi-View Transformers, [[Paper]](https://arxiv.org/pdf/2310.10375.pdf) 11 | - (arXiv 2023.12) CT-MVSNet: Efficient Multi-View Stereo with Cross-scale Transformer, [[Paper]](https://arxiv.org/pdf/2312.08594.pdf), [[Code]](https://github.com/wscstrive/CT-MVSNet) 12 | - (arXiv 2023.12) Global Occlusion-Aware Transformer for Robust Stereo Matching, [[Paper]](https://arxiv.org/pdf/2312.14650.pdf), [[Code]](https://github.com/Magicboomliu/GOAT) 13 | - (arXiv 2024.05) ViewFormer: Exploring Spatiotemporal Modeling for Multi-View 3D Occupancy Perception via View-Guided Transformers, [[Paper]](https://arxiv.org/pdf/2405.04299.pdf) 14 | - (arXiv 2024.11) A Global Depth-Range-Free Multi-View Stereo Transformer Network with Pose Embedding, [[Paper]](https://arxiv.org/pdf/2411.01893.pdf), [[Code]](https://zju3dv.github.io/GD-PoseMVS/) 15 | -------------------------------------------------------------------------------- /main/nas.md: -------------------------------------------------------------------------------- 1 | ### NAS 2 | - (CVPR'21) HR-NAS: Searching Efficient High-Resolution Neural Architectures with Lightweight Transformers, [[Paper]](https://arxiv.org/pdf/2106.06560.pdf), [[Code]](https://github.com/dingmyu/HR-NAS) 3 | - (arXiv 2021.02) Towards Accurate and Compact Architectures via Neural Architecture Transformer, [[Paper]](https://arxiv.org/pdf/2102.10301.pdf) 4 | - (arXiv 2021.03) BossNAS: Exploring Hybrid CNN-transformers with Block-wisely Self-supervised Neural Architecture Search, [[Paper]](https://arxiv.org/abs/2103.12424), [[Code]](https://github.com/changlin31/BossNAS) 5 | - (arXiv 2021.06) Vision Transformer Architecture Search, [[Paper]](https://arxiv.org/pdf/2106.13700.pdf), [[Code]](https://github.com/xiusu/ViTAS) 6 | - (arXiv 2021.07) AutoFormer: Searching Transformers for Visual Recognition, [[Paper]](https://arxiv.org/pdf/2107.00651.pdf), [[Code]](https://github.com/microsoft/AutoML) 7 | - (arXiv 2021.07) GLiT: Neural Architecture Search for Global and Local Image Transformer, [[Paper]](https://arxiv.org/pdf/2107.02960.pdf) 8 | - (arXiv 2021.09) Searching for Efficient Multi-Stage Vision Transformers, [[Paper]](https://arxiv.org/pdf/2109.00642.pdf) 9 | - (arXiv 2021.10) UniNet: Unified Architecture Search with Convolution, Transformer, and MLP, [[Paper]](https://arxiv.org/pdf/2110.04035.pdf) 10 | - (arXiv 2021.11) Searching the Search Space of Vision Transformer,
[[Paper]](https://arxiv.org/pdf/2111.14725.pdf), [[Code]](https://github.com/microsoft/Cream) 11 | - (arXiv 2022.01) Vision Transformer Slimming: Multi-Dimension Searching in Continuous Optimization Space, [[Paper]](https://arxiv.org/pdf/2201.00814.pdf) 12 | - (arXiv 2022.03) Vision Transformer with Convolutions Architecture Search, [[Paper]](https://arxiv.org/pdf/2203.10435.pdf) 13 | - (arXiv 2022.03) Training-free Transformer Architecture Search, [[Paper]](https://arxiv.org/pdf/2203.12217.pdf) 14 | - (arXiv 2022.06) Neural Prompt Search, [[Paper]](https://arxiv.org/pdf/2206.04673.pdf) 15 | - (arXiv 2022.07) UniNet: Unified Architecture Search with Convolution, Transformer, and MLP, [[Paper]](https://arxiv.org/pdf/2207.05420.pdf), [[Code]](https://github.com/Sense-X/UniNet) 16 | - (arXiv 2022.09) NasHD: Efficient ViT Architecture Performance Ranking using Hyperdimensional Computing, [[Paper]](https://arxiv.org/pdf/2209.11356.pdf) 17 | - (arXiv 2022.11) NAR-Former: Neural Architecture Representation Learning towards Holistic Attributes Prediction, [[Paper]](https://arxiv.org/pdf/2211.08024.pdf) 18 | - (arXiv 2023.03) HyT-NAS: Hybrid Transformers Neural Architecture Search for Edge Devices, [[Paper]](https://arxiv.org/pdf/2303.04440.pdf), [[Code]](https://anonymous.4open.science/r/HyT-NAS-Search-Algorithm-A864/README.md) 19 | - (arXiv 2023.07) AutoST: Training-free Neural Architecture Search for Spiking Transformers, [[Paper]](https://arxiv.org/pdf/2307.00293.pdf) 20 | - (arXiv 2023.08) TurboViT: Generating Fast Vision Transformers via Generative Architecture Search, [[Paper]](https://arxiv.org/pdf/2308.11421.pdf) 21 | - (arXiv 2023.11) FLORA: Fine-grained Low-Rank Architecture Search for Vision Transformer, [[Paper]](https://arxiv.org/pdf/2311.03912.pdf), [[Code]](https://github.com/shadowpa0327/FLORA) 22 | - (arXiv 2023.11) TVT: Training-Free Vision Transformer Search on Tiny Datasets, [[Paper]](https://arxiv.org/pdf/2311.14337.pdf) 23 | - (arXiv 2023.12) Auto-Prox: Training-Free Vision Transformer Architecture Search via Automatic Proxy Discovery, [[Paper]](https://arxiv.org/pdf/2312.09059.pdf) 24 | - (arXiv 2024.03) Once for Both: Single Stage of Importance and Sparsity Search for Vision Transformer Compression, [[Paper]](https://arxiv.org/pdf/2403.15835.pdf) 25 | - (arXiv 2024.05) When Training-Free NAS Meets Vision Transformer: A Neural Tangent Kernel Perspective, [[Paper]](https://arxiv.org/pdf/2405.04536.pdf) 26 | - (arXiv 2024.07) HyTAS: A Hyperspectral Image Transformer Architecture Search Benchmark and Analysis, [[Paper]](https://arxiv.org/pdf/2407.16269.pdf) 27 | - (arXiv 2024.07) Quasar-ViT: Hardware-Oriented Quantization-Aware Architecture Search for Vision Transformers, [[Paper]](https://arxiv.org/pdf/2407.18175.pdf) 28 | - (arXiv 2025.05) L-SWAG: Layer-Sample Wise Activation with Gradients information for Zero-Shot NAS on Vision Transformers, [[Paper]](https://arxiv.org/pdf/2505.07300.pdf) 29 | -------------------------------------------------------------------------------- /main/navigation.md: -------------------------------------------------------------------------------- 1 | ### Navigation 2 | - (ICLR'21) VTNet: Visual Transformer Network for Object Goal Navigation, [[Paper]](https://arxiv.org/pdf/2105.09447.pdf) 3 | - (arXiv 2021.03) MaAST: Map Attention with Semantic Transformers for Efficient Visual Navigation, [[Paper]](https://arxiv.org/pdf/2103.11374.pdf) 4 | - (arXiv 2021.04) Know What and Know Where: An Object-and-Room Informed Sequential BERT for Indoor
Vision-Language Navigation, [[Paper]](https://arxiv.org/pdf/2104.04167.pdf) 5 | - (arXiv 2021.05) Episodic Transformer for Vision-and-Language Navigation, [[Paper]](https://arxiv.org/pdf/2105.06453.pdf) 6 | - (arXiv 2021.07) Trans4Trans: Efficient Transformer for Transparent Object Segmentation to Help Visually Impaired People Navigate in the Real World, [[Paper]](https://arxiv.org/pdf/2107.03172.pdf) 7 | - (arXiv 2021.10) SOAT: A Scene- and Object-Aware Transformer for Vision-and-Language Navigation, [[Paper]](https://arxiv.org/pdf/2110.14143.pdf) 8 | - (arXiv 2021.10) History Aware Multimodal Transformer for Vision-and-Language Navigation, [[Paper]](https://arxiv.org/pdf/2110.13309.pdf), [[Code]](https://cshizhe.github.io/projects/vln_hamt.html) 9 | - (arXiv 2021.11) Multimodal Transformer with Variable-length Memory for Vision-and-Language Navigation, [[Paper]](https://arxiv.org/pdf/2111.05759.pdf) 10 | - (arXiv 2022.02) Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation, [[Paper]](https://arxiv.org/pdf/2202.11742.pdf), [[Project]](https://cshizhe.github.io/projects/vln_duet.html) 11 | - (arXiv 2022.03) Monocular Robot Navigation with Self-Supervised Pretrained Vision Transformers, [[Paper]](https://arxiv.org/pdf/2203.03682.pdf), [[Project]](https://sachamorin.github.io/dino/) 12 | - (arXiv 2022.03) Object Memory Transformer for Object Goal Navigation, [[Paper]](https://arxiv.org/pdf/2203.14708.pdf) 13 | - (arXiv 2022.07) Target-Driven Structured Transformer Planner for Vision-Language Navigation, [[Paper]](https://arxiv.org/pdf/2207.11201.pdf), [[Code]](https://github.com/YushengZhao/TD-STP) 14 | - (arXiv 2023.05) ASTS: Progress-Aware Spatio-Temporal Transformer Speaker For Vision-and-Language Navigation, [[Paper]](https://arxiv.org/pdf/2305.11918.pdf) 15 | - (arXiv 2023.06) ViNT: A Foundation Model for Visual Navigation, [[Paper]](https://arxiv.org/pdf/2306.14846.pdf), [[Code]](https://visualnav-transformer.github.io/) 16 | - (arXiv 2023.07) GridMM: Grid Memory Map for Vision-and-Language Navigation, [[Paper]](https://arxiv.org/pdf/2307.12907.pdf), [[Code]](https://github.com/MrZihan/GridMM) 17 | - (arXiv 2023.08) Bird's-Eye-View Scene Graph for Vision-Language Navigation, [[Paper]](https://arxiv.org/pdf/2308.04758.pdf), [[Code]](https://github.com/DefaultRui/BEV-Scene-Graph) 18 | - (arXiv 2023.08) Target-Grounded Graph-Aware Transformer for Aerial Vision-and-Dialog Navigation, [[Paper]](https://arxiv.org/pdf/2308.11561.pdf), [[Code]](https://github.com/yifeisu/avdn-challenge) 19 | - (arXiv 2023.11) Navigating Scaling Laws: Accelerating Vision Transformer's Training via Adaptive Strategies, [[Paper]](https://arxiv.org/pdf/2311.03314.pdf) 20 | - (arXiv 2024.05) Transformers for Image-Goal Navigation, [[Paper]](https://arxiv.org/pdf/2405.14128.pdf) 21 | - (arXiv 2024.05) Vision-and-Language Navigation Generative Pretrained Transformer, [[Paper]](https://arxiv.org/pdf/2405.16994.pdf) 22 | - (arXiv 2024.07) PoliFormer: Scaling On-Policy RL with Transformers Results in Masterful Navigators, [[Paper]](https://arxiv.org/pdf/2406.20083.pdf), [[Code]](https://poliformer.allen.ai/) 23 | -------------------------------------------------------------------------------- /main/neural-rendering.md: -------------------------------------------------------------------------------- 1 | ### Neural Rendering 2 | - (arXiv 2021.12) Light Field Neural Rendering, [[Paper]](https://arxiv.org/pdf/2112.09687.pdf), [[Project]](https://light-field-neural-rendering.github.io/) 3
| - (arXiv 2022.03) ViewFormer: NeRF-free Neural Rendering from Few Images Using Transformers, [[Paper]](https://arxiv.org/pdf/2203.10157.pdf), [[Code]](https://github.com/jkulhanek/viewformer) 4 | - (arXiv 2022.06) Generalizable Neural Radiance Fields for Novel View Synthesis with Transformer, [[Paper]](https://arxiv.org/pdf/2206.05375.pdf) 5 | - (arXiv 2022.06) IRISformer: Dense Vision Transformers for Single-Image Inverse Rendering in Indoor Scenes, [[Paper]](https://arxiv.org/pdf/2206.08423.pdf), [[Code]](https://github.com/ViLab-UCSD/IRISformer) 6 | - (arXiv 2022.07) Vision Transformer for NeRF-Based View Synthesis from a Single Input Image, [[Paper]](https://arxiv.org/pdf/2207.05736.pdf), [[Project]](https://cseweb.ucsd.edu/~viscomp/projects/VisionNeRF/) 7 | - (arXiv 2022.09) NeRF-Loc: Transformer-Based Object Localization Within Neural Radiance Fields, [[Paper]](https://arxiv.org/pdf/2209.12068.pdf) 8 | - (arXiv 2023.03) Single-view Neural Radiance Fields with Depth Teacher, [[Paper]](https://arxiv.org/pdf/2303.09952.pdf) 9 | - (arXiv 2024.01) CTNeRF: Cross-Time Transformer for Dynamic Neural Radiance Field from Monocular Video, [[Paper]](https://arxiv.org/pdf/2401.04861.pdf) 10 | -------------------------------------------------------------------------------- /main/octree.md: -------------------------------------------------------------------------------- 1 | ### Octree 2 | - (arXiv 2021.11) Octree Transformer: Autoregressive 3D Shape Generation on Hierarchically Structured Sequences, [[Paper]](https://arxiv.org/pdf/2111.12480.pdf), [[Code]](https://github.com/orrzohar/PROB) 3 | - (arXiv 2023.03) OcTr: Octree-based Transformer for 3D Object Detection, [[Paper]](https://arxiv.org/pdf/2303.12621.pdf) 4 | - (arXiv 2023.05) OctFormer: Octree-based Transformers for 3D Point Clouds, [[Paper]](https://arxiv.org/pdf/2305.03045.pdf), [[Code]](https://wang-ps.github.io/octformer) 5 | - (arXiv 2025.03) HOTFormerLoc: Hierarchical Octree Transformer for Versatile Lidar Place Recognition Across Ground and Aerial Views, [[Paper]](https://arxiv.org/pdf/2503.08140.pdf), [[Code]](https://csiro-robotics.github.io/HOTFormerLoc) 6 | -------------------------------------------------------------------------------- /main/open-world.md: -------------------------------------------------------------------------------- 1 | ### Open World 2 | - (arXiv 2022.03) Open Set Recognition using Vision Transformer with an Additional Detection Head, [[Paper]](https://arxiv.org/pdf/2203.08441.pdf) 3 | - (arXiv 2022.06) OOD Augmentation May Be at Odds with Open-Set Recognition, [[Paper]](https://arxiv.org/pdf/2206.04242.pdf) 4 | - (arXiv 2022.07) Scaling Novel Object Detection with Weakly Supervised Detection Transformers, [[Paper]](https://arxiv.org/pdf/2207.05205.pdf) 5 | - (arXiv 2022.09) Pre-training image-language transformers for open-vocabulary tasks, [[Paper]](https://arxiv.org/pdf/2209.04372.pdf) 6 | - (arXiv 2022.10) Transformer-Based Speech Synthesizer Attribution in an Open Set Scenario, [[Paper]](https://arxiv.org/pdf/2209.04372.pdf) 7 | - (arXiv 2022.12) PROB: Probabilistic Objectness for Open World Object Detection, [[Paper]](https://arxiv.org/pdf/2212.01424.pdf), [[Code]](https://github.com/feiyang-cai/osr_vit) 8 | - (arXiv 2022.12) Open World DETR: Transformer based Open World Object Detection, [[Paper]](https://arxiv.org/pdf/2212.02969.pdf) 9 | - (arXiv 2023.01) CAT: LoCalization and IdentificAtion Cascade Detection Transformer for Open-World Object Detection,
[[Paper]](https://arxiv.org/pdf/2301.01970.pdf), [[Code]](https://github.com/xiaomabufei/CAT) 10 | - (arXiv 2023.03) Prompt-Guided Transformers for End-to-End Open-Vocabulary Object Detection, [[Paper]](https://arxiv.org/pdf/2303.14386.pdf) 11 | - (arXiv 2023.05) Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers, [[Paper]](https://arxiv.org/pdf/2305.07011.pdf) 12 | - (arXiv 2023.08) SegPrompt: Boosting Open-World Segmentation via Category-level Prompt Learning, [[Paper]](https://arxiv.org/pdf/2308.06531.pdf), [[Code]](https://github.com/aim-uofa/SegPrompt) 13 | - (arXiv 2023.09) Contrastive Feature Masking Open-Vocabulary Vision Transformer, [[Paper]](https://arxiv.org/pdf/2309.00775.pdf) 14 | - (arXiv 2023.09) Diffusion Model is Secretly a Training-free Open Vocabulary Semantic Segmenter, [[Paper]](https://arxiv.org/pdf/2309.02773.pdf) 15 | - (arXiv 2023.09) Unsupervised Open-Vocabulary Object Localization in Videos, [[Paper]](https://arxiv.org/pdf/2309.09858.pdf), [[Code]](https://github.com/aim-uofa/SegPrompt) 16 | - (arXiv 2023.10) CoDA: Collaborative Novel Box Discovery and Cross-modal Alignment for Open-vocabulary 3D Object Detection, [[Paper]](https://arxiv.org/pdf/2310.02960.pdf), [[Code]](https://github.com/yangcaoai/CoDA_NeurIPS2023) 17 | - (arXiv 2023.11) Enhancing Novel Object Detection via Cooperative Foundational Models, [[Paper]](https://arxiv.org/pdf/2311.12068.pdf), [[Code]](https://github.com/rohit901/cooperative-foundational-models) 18 | - (arXiv 2023.11) Language-conditioned Detection Transformer, [[Paper]](https://arxiv.org/pdf/2311.17902.pdf), [[Code]](https://github.com/janghyuncho/DECOLA) 19 | - (arXiv 2023.12) Learning Pseudo-Labeler beyond Noun Concepts for Open-Vocabulary Object Detection, [[Paper]](https://arxiv.org/pdf/2312.02103.pdf) 20 | - (arXiv 2023.12) Boosting Segment Anything Model Towards Open-Vocabulary Learning, [[Paper]](https://arxiv.org/pdf/2312.03628.pdf), [[Code]](https://github.com/ucas-vg/Sambor) 21 | - (arXiv 2023.12) Open World Object Detection in the Era of Foundation Models, [[Paper]](https://arxiv.org/pdf/2312.05745.pdf), [[Code]](https://orrzohar.github.io/projects/fomo/) 22 | - (arXiv 2024.02) Semi-supervised Open-World Object Detection, [[Paper]](https://arxiv.org/pdf/2402.16013.pdf), [[Code]](https://github.com/sahalshajim/SS-OWFormer) 23 | - (arXiv 2024.03) CLIP-VIS: Adapting CLIP for Open-Vocabulary Video Instance Segmentation, [[Paper]](https://arxiv.org/pdf/2403.12455.pdf), [[Code]](https://github.com/zwq456/CLIP-VIS) 24 | - (arXiv 2024.03) OV-Uni3DETR: Towards Unified Open-Vocabulary 3D Object Detection via Cycle-Modality Propagation, [[Paper]](https://arxiv.org/pdf/2403.19580.pdf) 25 | - (arXiv 2024.04) DetCLIPv3: Towards Versatile Generative Open-vocabulary Object Detection, [[Paper]](https://arxiv.org/pdf/2404.09216.pdf) 26 | - (arXiv 2024.04) OSR-ViT: A Simple and Modular Framework for Open-Set Object Detection and Discovery, [[Paper]](https://arxiv.org/pdf/2404.10865.pdf) 27 | - (arXiv 2024.05) OV-DQUO: Open-Vocabulary DETR with Denoising Text Query Training and Open-World Unknown Objects Supervision, [[Paper]](https://arxiv.org/pdf/2405.17913.pdf) 28 | - (arXiv 2024.07) Unified Embedding Alignment for Open-Vocabulary Video Instance Segmentation, [[Paper]](https://arxiv.org/pdf/2407.07427.pdf), [[Code]](https://github.com/fanghaook/OVFormer) 29 | - (arXiv 2024.09) FrozenSeg: Harmonizing Frozen Foundation Models for Open-Vocabulary Segmentation, [[Paper]](https://arxiv.org/pdf/2409.03525),
[[Code]](https://github.com/chenxi52/FrozenSeg) 30 | - (arXiv 2024.09) Mamba-YOLO-World: Marrying YOLO-World with Mamba for Open-Vocabulary Detection, [[Paper]](https://arxiv.org/pdf/2409.08513.pdf), [[Code]](https://github.com/clearxu/GBA/) 31 | - (arXiv 2025.01) Modulating CNN Features with Pre-Trained ViT Representations for Open-Vocabulary Object Detection, [[Paper]](https://arxiv.org/pdf/2501.16981.pdf) 32 | -------------------------------------------------------------------------------- /main/optical-flow.md: -------------------------------------------------------------------------------- 1 | ### Optical Flow 2 | - (arXiv 2022.03) FlowFormer: A Transformer Architecture for Optical Flow, [[Paper]](https://arxiv.org/pdf/2203.16194.pdf), [[Project]](https://drinkingcoder.github.io/publication/flowformer/) 3 | - (arXiv 2022.03) CRAFT: Cross-Attentional Flow Transformer for Robust Optical Flow, [[Paper]](https://arxiv.org/pdf/2203.16896.pdf), [[Code]](https://github.com/askerlee/craft) 4 | - (arXiv 2023.03) FlowFormer++: Masked Cost Volume Autoencoding for Pretraining Optical Flow Estimation, [[Paper]](https://arxiv.org/pdf/2303.01237.pdf) 5 | - (arXiv 2023.04) TransFlow: Transformer as Flow Learner, [[Paper]](https://arxiv.org/pdf/2304.11523.pdf) 6 | - (arXiv 2023.05) SSTM: Spatiotemporal Recurrent Transformers for Multi-frame Optical Flow Estimation, [[Paper]](https://arxiv.org/pdf/2304.14418.pdf) 7 | - (arXiv 2023.06) FlowFormer: A Transformer Architecture and Its Masked Cost Volume Autoencoding for Optical Flow, [[Paper]](https://arxiv.org/pdf/2306.05442.pdf) 8 | - (arXiv 2024.03) CathFlow: Self-Supervised Segmentation of Catheters in Interventional Ultrasound Using Optical Flow and Transformers, [[Paper]](https://arxiv.org/pdf/2403.14465.pdf) 9 | - (arXiv 2024.09) SDformerFlow: Spatiotemporal swin spikeformer for event-based optical flow estimation, [[Paper]](https://arxiv.org/pdf/2409.04082.pdf), [[Code]](https://github.com/yitian97/SDformerFlow) 10 | -------------------------------------------------------------------------------- /main/panoptic-segmentation.md: -------------------------------------------------------------------------------- 1 | ### Panoptic Segmentation 2 | - (arXiv 2020.12) MaX-DeepLab: End-to-End Panoptic Segmentation with Mask Transformers, [[Paper]](https://arxiv.org/pdf/2012.00759.pdf) 3 | - (arXiv 2021.09) Panoptic SegFormer, [[Paper]](https://arxiv.org/pdf/2109.03814.pdf) 4 | - (arXiv 2021.09) PnP-DETR: Towards Efficient Visual Analysis with Transformers, [[Paper]](https://arxiv.org/pdf/2109.07036.pdf), [[Code]](https://github.com/twangnh/pnp-detr) 5 | - (arXiv 2021.10) An End-to-End Trainable Video Panoptic Segmentation Method using Transformers, [[Paper]](https://arxiv.org/pdf/2110.04009.pdf) 6 | - (arXiv 2021.12) Masked-attention Mask Transformer for Universal Image Segmentation, [[Paper]](https://arxiv.org/pdf/2112.01527.pdf), [[Code]](https://bowenc0221.github.io/mask2former/) 7 | - (arXiv 2021.12) PolyphonicFormer: Unified Query Learning for Depth-aware Video Panoptic Segmentation, [[Paper]](https://arxiv.org/pdf/2112.02582.pdf), [[Code]](https://github.com/HarborYuan/PolyphonicFormer) 8 | - (arXiv 2022.04) Panoptic-PartFormer: Learning a Unified Model for Panoptic Part Segmentation, [[Paper]](https://arxiv.org/pdf/2204.04655.pdf), [[Code]](https://github.com/lxtGH/Panoptic-PartFormer) 9 | - (arXiv 2022.05) CONSENT: Context Sensitive Transformer for Bold Words Classification, [[Paper]](https://arxiv.org/pdf/2205.07683.pdf) 10 | - (arXiv 2022.05)
CMT-DeepLab: Clustering Mask Transformers for Panoptic Segmentation, [[Paper]](https://arxiv.org/pdf/2206.08948.pdf) 11 | - (arXiv 2022.07) k-means Mask Transformer, [[Paper]](https://arxiv.org/pdf/2207.04044.pdf), [[Code]](https://github.com/google-research/deeplab2) 12 | - (arXiv 2022.07) Masked-attention Mask Transformer for Universal Image Segmentation, [[Paper]](https://arxiv.org/pdf/2112.01527.pdf), [[Code]](https://bowenc0221.github.io/mask2former) 13 | - (arXiv 2022.07) Behind Every Domain There is a Shift: Adapting Distortion-aware Vision Transformers for Panoramic Semantic Segmentation, [[Paper]](https://arxiv.org/pdf/2207.11860.pdf), [[Code]](https://github.com/jamycheung/Trans4PASS) 14 | - (arXiv 2022.10) Time-Space Transformers for Video Panoptic Segmentation, [[Paper]](https://arxiv.org/pdf/2210.03546.pdf), [[Code]](https://github.com/jamycheung/Trans4PASS) 15 | - (arXiv 2022.10) Uncertainty-aware LiDAR Panoptic Segmentation, [[Paper]](https://arxiv.org/pdf/2210.04472.pdf), [[Code]](https://github.com/kshitij3112/EvLPSNet) 16 | - (arXiv 2022.10) A Generalist Framework for Panoptic Segmentation of Images and Videos, [[Paper]](https://arxiv.org/pdf/2210.06366.pdf) 17 | - (arXiv 2022.10) Pointly-Supervised Panoptic Segmentation, [[Paper]](https://arxiv.org/pdf/2210.13950.pdf), [[Code]](https://github.com/BraveGroup/PSPS.git) 18 | - (arXiv 2023.03) Position-Guided Point Cloud Panoptic Segmentation Transformer, [[Paper]](https://arxiv.org/pdf/2303.13509.pdf), [[Code]](https://github.com/SmartBot-PJLab/P3Former) 19 | - (arXiv 2023.07) Towards Deeply Unified Depth-aware Panoptic Segmentation with Bi-directional Guidance Learning, [[Paper]](https://arxiv.org/pdf/2307.14786.pdf) 20 | - (arXiv 2023.08) LiDAR-Camera Panoptic Segmentation via Geometry-Consistent and Semantic-Aware Alignment, [[Paper]](https://arxiv.org/pdf/2308.01686.pdf), [[Code]](https://github.com/zhangzw12319/lcps.git) 21 | - (arXiv 2023.08) PanoSwin: a Pano-style Swin Transformer for Panorama Understanding, [[Paper]](https://arxiv.org/pdf/2308.14726.pdf) 22 | - (arXiv 2023.09) MASK4D: Mask Transformer for 4D Panoptic Segmentation, [[Paper]](https://arxiv.org/pdf/2309.16133.pdf), [[Code]](https://vision.rwth-aachen.de/mask4d) 23 | - (arXiv 2023.10) Hierarchical Mask2Former: Panoptic Segmentation of Crops, Weeds and Leaves, [[Paper]](https://arxiv.org/pdf/2310.06582.pdf), [[Code]](https://github.com/madeleinedarbyshire/HierarchicalMask2Former) 24 | - (arXiv 2023.11) 4D-Former: Multimodal 4D Panoptic Segmentation, [[Paper]](https://arxiv.org/pdf/2311.01520.pdf), [[Code]](https://waabi.ai/4dformer) 25 | - (arXiv 2023.11) MaXTron: Mask Transformer with Trajectory Attention for Video Panoptic Segmentation, [[Paper]](https://arxiv.org/pdf/2311.18537.pdf), [[Code]](https://github.com/TACJu/MaXTron) 26 | - (arXiv 2024.01) 3D Open-Vocabulary Panoptic Segmentation with 2D-3D Vision-Language Distillation, [[Paper]](https://arxiv.org/pdf/2401.02281.pdf) 27 | - (arXiv 2024.01) Scalable 3D Panoptic Segmentation With Superpoint Graph Clustering, [[Paper]](https://arxiv.org/pdf/2401.06704.pdf), [[Code]](https://github.com/drprojects/superpoint_transformer) 28 | - (arXiv 2024.02) Benchmarking the Robustness of Panoptic Segmentation for Automated Driving, [[Paper]](https://arxiv.org/pdf/2402.15469.pdf) 29 | - (arXiv 2024.03) PEM: Prototype-based Efficient MaskFormer for Image Segmentation, [[Paper]](https://arxiv.org/pdf/2402.19422.pdf), [[Code]](https://github.com/NiccoloCavagnero/PEM) 30 | - (arXiv 2024.03) ECLIPSE: Efficient Continual Learning in
Panoptic Segmentation with Visual Prompt Tuning, [[Paper]](https://arxiv.org/pdf/2403.20126.pdf), [[Code]](https://github.com/clovaai/ECLIPSE) 31 | - (arXiv 2024.04) Language-Guided Instance-Aware Domain-Adaptive Panoptic Segmentation, [[Paper]](https://arxiv.org/pdf/2404.03799.pdf) 32 | - (arXiv 2024.12) PanSR: An Object-Centric Mask Transformer for Panoptic Segmentation, [[Paper]](https://arxiv.org/pdf/2412.10589.pdf), [[Code]](https://github.com/lojzezust/PanSR) 33 | - (arXiv 2024.12) Open-Vocabulary Panoptic Segmentation Using BERT Pre-Training of Vision-Language Multiway Transformer Model, [[Paper]](https://arxiv.org/pdf/2412.18917.pdf), [[Code]](https://github.com/AI-Application-and-Integration-Lab/OMTSeg) 34 | - (arXiv 2025.04) Prior2Former -- Evidential Modeling of Mask Transformers for Assumption-Free Open-World Panoptic Segmentation, [[Paper]](https://arxiv.org/pdf/2504.04841.pdf) 35 | -------------------------------------------------------------------------------- /main/planning.md: -------------------------------------------------------------------------------- 1 | ### Planning 2 | - (arXiv 2021.12) Differentiable Spatial Planning using Transformers, [[Paper]](https://arxiv.org/pdf/2112.01010.pdf), [[Project]](https://devendrachaplot.github.io/projects/spatial-planning-transformers) 3 | -------------------------------------------------------------------------------- /main/re-identification.md: -------------------------------------------------------------------------------- 1 | ### Re-identification 2 | - (arXiv 2021.02) TransReID: Transformer-based Object Re-Identification, [[Paper]](https://arxiv.org/abs/2102.04378) 3 | - (arXiv 2021.03) Spatiotemporal Transformer for Video-based Person Re-identification, [[Paper]](https://arxiv.org/abs/2103.16469) 4 | - (arXiv 2021.04) AAformer: Auto-Aligned Transformer for Person Re-Identification, [[Paper]](https://arxiv.org/abs/2104.00921) 5 | - (arXiv 2021.04) A Video Is Worth Three Views: Trigeminal Transformers for Video-based Person Re-identification, [[Paper]](https://arxiv.org/abs/2104.01745) 6 | - (arXiv 2021.06) Transformer-Based Deep Image Matching for Generalizable Person Re-identification, [[Paper]](https://arxiv.org/pdf/2105.14432.pdf) 7 | - (arXiv 2021.06) Diverse Part Discovery: Occluded Person Re-identification with Part-Aware Transformer, [[Paper]](https://arxiv.org/pdf/2106.04095.pdf) 8 | - (arXiv 2021.06) Person Re-Identification with a Locally Aware Transformer, [[Paper]](https://arxiv.org/pdf/2106.03720.pdf) 9 | - (arXiv 2021.07) Learning Disentangled Representation Implicitly via Transformer for Occluded Person Re-Identification, [[Paper]](https://arxiv.org/pdf/2107.02380.pdf), [[Code]](https://github.com/Anonymous-release-code/DRL-Net) 10 | - (arXiv 2021.07) GiT: Graph Interactive Transformer for Vehicle Re-identification, [[Paper]](https://arxiv.org/pdf/2107.05448.pdf) 11 | - (arXiv 2021.07) HAT: Hierarchical Aggregation Transformers for Person Re-identification, [[Paper]](https://arxiv.org/pdf/2107.05946.pdf) 12 | - (arXiv 2021.09) Pose-guided Inter- and Intra-part Relational Transformer for Occluded Person Re-Identification, [[Paper]](https://arxiv.org/pdf/2109.03483.pdf) 13 | - (arXiv 2021.09) OH-Former: Omni-Relational High-Order Transformer for Person Re-Identification, [[Paper]](https://arxiv.org/pdf/2109.11159.pdf) 14 | - (arXiv 2021.10) CMTR: Cross-modality Transformer for Visible-infrared Person Re-identification, [[Paper]](https://arxiv.org/pdf/2110.08994.pdf) 15 | - (arXiv 2021.11) Self-Supervised 
Pre-Training for Transformer-Based Person Re-Identification, [[Paper]](https://arxiv.org/pdf/2111.12084.pdf), [[Code]](https://github.com/michuanhaohao/TransReID-SSL) 16 | - (arXiv 2021.12) Pose-guided Feature Disentangling for Occluded Person Re-identification Based on Transformer, [[Paper]](https://arxiv.org/pdf/2112.02466.pdf), [[Code]](https://github.com/WangTaoAs/PFD_Net) 17 | - (arXiv 2022.01) Short Range Correlation Transformer for Occluded Person Re-Identification, [[Paper]](https://arxiv.org/pdf/2201.01090.pdf) 18 | - (arXiv 2022.02) Motion-Aware Transformer For Occluded Person Re-identification, [[Paper]](https://arxiv.org/pdf/2202.04243.pdf) 19 | - (arXiv 2022.04) PSTR: End-to-End One-Step Person Search With Transformers, [[Paper]](https://arxiv.org/pdf/2204.03340.pdf), [[Code]](https://github.com/JialeCao001/PSTR) 20 | - (arXiv 2022.04) NFormer: Robust Person Re-identification with Neighbor Transformer, [[Paper]](https://arxiv.org/pdf/2204.09331.pdf), [[Code]](https://github.com/haochenheheda/NFormer) 21 | - (arXiv 2022.09) Uncertainty Aware Multitask Pyramid Vision Transformer For UAV-Based Object Re-Identification, [[Paper]](https://arxiv.org/pdf/2209.08686.pdf) 22 | - (arXiv 2022.11) Sequential Transformer for End-to-End Person Search, [[Paper]](https://arxiv.org/pdf/2211.04323.pdf) 23 | - (arXiv 2022.11) Transformer Based Multi-Grained Features for Unsupervised Person Re-Identification, [[Paper]](https://arxiv.org/pdf/2211.04323.pdf), [[Code]](https://github.com/RikoLi/WACV23-workshop-TMGF) 24 | - (arXiv 2022.11) Learning Progressive Modality-shared Transformers for Effective Visible-Infrared Person Re-identification, [[Paper]](https://arxiv.org/pdf/2212.00226.pdf), [[Code]](https://github.com/hulu88/PMT) 25 | - (arXiv 2023.01) Multi-Stage Spatio-Temporal Aggregation Transformer for Video Person Re-identification, [[Paper]](https://arxiv.org/pdf/2301.00531.pdf) 26 | - (arXiv 2023.02) X-ReID: Cross-Instance Transformer for Identity-Level Person Re-Identification, [[Paper]](https://arxiv.org/pdf/2302.02075.pdf) 27 | - (arXiv 2023.02) DC-Former: Diverse and Compact Transformer for Person Re-Identification, [[Paper]](https://arxiv.org/pdf/2302.14335.pdf) 28 | - (arXiv 2023.03) Feature Completion Transformer for Occluded Person Re-identification, [[Paper]](https://arxiv.org/pdf/2303.01656.pdf) 29 | - (arXiv 2023.03) TranSG: Transformer-Based Skeleton Graph Prototype Contrastive Learning with Structure-Trajectory Prompted Reconstruction for Person Re-Identification, [[Paper]](https://arxiv.org/pdf/2303.06819.pdf), [[Code]](https://github.com/Kali-Hac/TranSG) 30 | - (arXiv 2023.04) Deeply-Coupled Convolution-Transformer with Spatial-temporal Complementary Learning for Video-based Person Re-identification, [[Paper]](https://arxiv.org/pdf/2304.14122.pdf), [[Code]](https://github.com/Kali-Hac/TranSG) 31 | - (arXiv 2023.08) Part-Aware Transformer for Generalizable Person Re-identification, [[Paper]](https://arxiv.org/pdf/2308.03322.pdf), [[Code]](https://github.com/liyuke65535/Part-Aware-Transformer) 32 | - (arXiv 2023.10) GraFT: Gradual Fusion Transformer for Multimodal Re-Identification, [[Paper]](https://arxiv.org/pdf/2310.16856.pdf) 33 | - (arXiv 2024.02) Dynamic Patch-aware Enrichment Transformer for Occluded Person Re-Identification, [[Paper]](https://arxiv.org/pdf/2402.10435.pdf) 34 | - (arXiv 2024.03) View-decoupled Transformer for Person Re-identification under Aerial-ground Camera Network, [[Paper]](https://arxiv.org/pdf/2403.14513.pdf), 
[[Code]](https://github.com/LinlyAC/VDT-AGPReID) 35 | - (arXiv 2024.04) Other Tokens Matter: Exploring Global and Local Features of Vision Transformers for Object Re-Identification, [[Paper]](https://arxiv.org/pdf/2404.14985.pdf) 36 | - (arXiv 2024.08) PartFormer: Awakening Latent Diverse Representation from Vision Transformer for Object Re-Identification, [[Paper]](https://arxiv.org/pdf/2408.16684.pdf) 37 | - (arXiv 2024.09) Tran-GCN: A Transformer-Enhanced Graph Convolutional Network for Person Re-Identification in Monitoring Videos, [[Paper]](https://arxiv.org/pdf/2409.09391.pdf) 38 | - (arXiv 2024.10) Exploring Stronger Transformer Representation Learning for Occluded Person Re-Identification, [[Paper]](https://arxiv.org/pdf/2410.15613.pdf) 39 | - (arXiv 2024.11) Adaptive Aspect Ratios with Patch-Mixup-ViT-based Vehicle ReID, [[Paper]](https://arxiv.org/pdf/2411.06297.pdf), [[Code]](https://github.com/qiumei1101/Adaptive_AR_PM_TransReID) 40 | - (arXiv 2024.12) Motif Guided Graph Transformer with Combinatorial Skeleton Prototype Learning for Skeleton-Based Person Re-Identification, [[Paper]](https://arxiv.org/pdf/2412.09044.pdf), [[Code]](https://github.com/Kali-Hac/MoCos) 41 | - (arXiv 2024.12) Unity is Strength: Unifying Convolutional and Transformeral Features for Better Person Re-Identification, [[Paper]](https://arxiv.org/pdf/2412.17239.pdf), [[Code]](https://github.com/924973292/FusionReID) 42 | - (arXiv 2025.02) Two-Stream Spatial-Temporal Transformer Framework for Person Identification via Natural Conversational Keypoints, [[Paper]](https://arxiv.org/pdf/2502.20803.pdf) 43 | -------------------------------------------------------------------------------- /main/recognition.md: -------------------------------------------------------------------------------- 1 | ### Recognition 2 | - (arXiv 2021.03) Global Self-Attention Networks for Image Recognition, [[Paper]](https://arxiv.org/abs/2010.03019) 3 | - (arXiv 2021.03) TransFG: A Transformer Architecture for Fine-grained Recognition, [[Paper]](https://arxiv.org/pdf/2103.07976.pdf) 4 | - (arXiv 2021.05) Are Convolutional Neural Networks or Transformers more like human vision?, [[Paper]](https://arxiv.org/pdf/2103.07976.pdf) 5 | - (arXiv 2021.07) Transformer with Peak Suppression and Knowledge Guidance for Fine-grained Image Recognition, [[Paper]](https://arxiv.org/pdf/2107.06538.pdf) 6 | - (arXiv 2021.07) RAMS-Trans: Recurrent Attention Multi-scale Transformer for Fine-grained Image Recognition, [[Paper]](https://arxiv.org/pdf/2107.06538.pdf) 7 | - (arXiv 2021.08) DPT: Deformable Patch-based Transformer for Visual Recognition, [[Paper]](https://arxiv.org/pdf/2107.14467.pdf), [[Code]](https://github.com/CASIA-IVA-Lab/DPT) 8 | - (arXiv 2021.10) A free lunch from ViT: Adaptive Attention Multi-scale Fusion Transformer for Fine-grained Visual Recognition, [[Paper]](https://arxiv.org/pdf/2110.01240.pdf) 9 | - (arXiv 2021.10) MVT: Multi-view Vision Transformer for 3D Object Recognition, [[Paper]](https://arxiv.org/pdf/2111.09492.pdf) 10 | - (arXiv 2021.11) AdaViT: Adaptive Vision Transformers for Efficient Image Recognition, [[Paper]](https://arxiv.org/pdf/2111.15668.pdf) 11 | - (arXiv 2021.11) Grounded Situation Recognition with Transformers, [[Paper]](https://arxiv.org/pdf/2111.10135.pdf), [[Code]](https://github.com/jhcho99/gsrtr) 12 | - (arXiv 2022.01) TransVPR: Transformer-based place recognition with multi-level attention aggregation, [[Paper]](https://arxiv.org/pdf/2201.02001.pdf) 13 | - (arXiv 2022.03) MetaFormer: A Unified Meta
Framework for Fine-Grained Recognition, [[Paper]](https://arxiv.org/pdf/2203.02751.pdf), [[Code]](https://github.com/dqshuai/MetaFormer) 14 | - (arXiv 2022.04) Diverse Instance Discovery: Vision-Transformer for Instance-Aware Multi-Label Image Recognition, [[Paper]](https://arxiv.org/pdf/2204.10731.pdf), [[Code]](https://github.com/dqshuai/MetaFormer) 15 | - (arXiv 2022.07) Forensic License Plate Recognition with Compression-Informed Transformers, [[Paper]](https://arxiv.org/pdf/2207.14686.pdf), [[Code]](https://www.cs1.tf.fau.de/research/multimedia-security/) 16 | - (arXiv 2022.08) TSRFormer: Table Structure Recognition with Transformers, [[Paper]](https://arxiv.org/pdf/2208.04921.pdf) 17 | - (arXiv 2022.08) GSRFormer: Grounded Situation Recognition Transformer with Alternate Semantic Attention Refinement, [[Paper]](https://arxiv.org/pdf/2208.08965.pdf), [[Code]](https://github.com/zhiqic/GSRFormer) 18 | - (arXiv 2022.09) SeqOT: A Spatial-Temporal Transformer Network for Place Recognition Using Sequential LiDAR Data, [[Paper]](https://arxiv.org/pdf/2209.07951.pdf), [[Code]](https://github.com/BIT-MJY/SeqOT) 19 | - (arXiv 2022.12) Part-guided Relational Transformers for Fine-grained Visual Recognition, [[Paper]](https://arxiv.org/pdf/2212.13685.pdf), [[Code]](https://github.com/iCVTEAM/PART) 20 | - (arXiv 2023.02) CVTNet: A Cross-View Transformer Network for Place Recognition Using LiDAR Data, [[Paper]](https://arxiv.org/pdf/2302.01665.pdf), [[Code]](https://github.com/BIT-MJY/CVTNet) 21 | - (arXiv 2023.02) Rethink Long-tailed Recognition with Vision Transformers, [[Paper]](https://arxiv.org/pdf/2302.14284.pdf) 22 | - (arXiv 2023.04) R2Former: Unified Retrieval and Reranking Transformer for Place Recognition, [[Paper]](https://arxiv.org/pdf/2304.03410.pdf), [[Code]](https://github.com/Jeff-Zilence/R2Former) 23 | - (arXiv 2023.05) MASK-CNN-Transformer For Real-Time Multi-Label Weather Recognition, [[Paper]](https://arxiv.org/pdf/2304.14857.pdf) 24 | - (arXiv 2023.05) TReR: A Lightweight Transformer Re-Ranking Approach for 3D LiDAR Place Recognition, [[Paper]](https://arxiv.org/pdf/2305.18013.pdf) 25 | - (arXiv 2023.07) Convolutional Transformer for Autonomous Recognition and Grading of Tomatoes Under Various Lighting, Occlusion, and Ripeness Conditions, [[Paper]](https://arxiv.org/pdf/2307.01530.pdf) 26 | - (arXiv 2023.08) M2Former: Multi-Scale Patch Selection for Fine-Grained Visual Recognition, [[Paper]](https://arxiv.org/pdf/2308.02161.pdf) 27 | - (arXiv 2023.09) Parameter-Efficient Long-Tailed Recognition, [[Paper]](https://arxiv.org/pdf/2309.10019.pdf), [[Code]](https://github.com/shijxcs/PEL) 28 | - (arXiv 2023.09) MAGIC-TBR: Multiview Attention Fusion for Transformer-based Bodily Behavior Recognition in Group Settings, [[Paper]](https://arxiv.org/pdf/2309.10765.pdf), [[Code]](https://github.com/surbhimadan92/MAGIC-TBR) 29 | - (arXiv 2023.10) ClusVPR: Efficient Visual Place Recognition with Clustering-based Weighted Transformer, [[Paper]](https://arxiv.org/pdf/2310.04099.pdf), [[Code]](https://github.com/surbhimadan92/MAGIC-TBR) 30 | - (arXiv 2023.10) FaultSeg Swin-UNETR: Transformer-Based Self-Supervised Pretraining Model for Fault Recognition, [[Paper]](https://arxiv.org/pdf/2310.17974.pdf) 31 | - (arXiv 2023.12) Are Vision Transformers More Data Hungry Than Newborn Visual Systems, [[Paper]](https://arxiv.org/pdf/2312.02843.pdf) 32 | - (arXiv 2024.01) PlaceFormer: Transformer-based Visual Place Recognition using Multi-Scale Patch Selection and Fusion,
[[Paper]](https://arxiv.org/pdf/2401.13082.pdf) 33 | - (arXiv 2024.01) Regressing Transformers for Data-efficient Visual Place Recognition, [[Paper]](https://arxiv.org/pdf/2401.16304.pdf) 34 | - (arXiv 2024.01) A New Method for Vehicle Logo Recognition Based on Swin Transformer, [[Paper]](https://arxiv.org/pdf/2401.15458.pdf) 35 | - (arXiv 2024.07) Global-Local Similarity for Efficient Fine-Grained Image Recognition with Vision Transformers, [[Paper]](https://arxiv.org/pdf/2407.12891.pdf), [[Code]](https://github.com/arkel23/GLSim) 36 | - (arXiv 2024.10) big.LITTLE Vision Transformer for Efficient Visual Recognition, [[Paper]](https://arxiv.org/pdf/2410.10267.pdf) 37 | - (arXiv 2024.12) EDTformer: An Efficient Decoder Transformer for Visual Place Recognition, [[Paper]](https://arxiv.org/pdf/2412.00784.pdf), [[Code]](https://github.com/Tong-Jin01/EDTformer) 38 | - (arXiv 2025.02) A Transformer-in-Transformer Network Utilizing Knowledge Distillation for Image Recognition, [[Paper]](https://arxiv.org/pdf/2502.16762.pdf) 39 | - (arXiv 2025.03) Fraesormer: Learning Adaptive Sparse Transformer for Efficient Food Recognition, [[Paper]](https://arxiv.org/pdf/2503.11995.pdf), [[Code]](https://github.com/zs1314/Fraesormer) 40 | - (arXiv 2025.03) Siformer: Feature-isolated Transformer for Efficient Skeleton-based Sign Language Recognition, [[Paper]](https://arxiv.org/pdf/2503.20436.pdf) 41 | - (arXiv 2025.04) LM-MCVT: A Lightweight Multi-modal Multi-view Convolutional-Vision Transformer Approach for 3D Object Recognition, [[Paper]](https://arxiv.org/pdf/2504.19256.pdf) 42 | -------------------------------------------------------------------------------- /main/reconstruction.md: -------------------------------------------------------------------------------- 1 | ### Reconstruction 2 | - (arXiv 2021.03) Multi-view 3D Reconstruction with Transformer, [[Paper]](https://arxiv.org/pdf/2103.12957.pdf) 3 | - (arXiv 2021.06) THUNDR: Transformer-based 3D HUmaN Reconstruction with Markers, [[Paper]](https://arxiv.org/pdf/2106.09336.pdf) 4 | - (arXiv 2021.06) LegoFormer: Transformers for Block-by-Block Multi-view 3D Reconstruction, [[Paper]](https://arxiv.org/pdf/2106.12102.pdf) 5 | - (arXiv 2021.07) TransformerFusion: Monocular RGB Scene Reconstruction using Transformers, [[Paper]](https://arxiv.org/pdf/2107.08192.pdf) 6 | - (arXiv 2021.10) 3D-RETR: End-to-End Single and Multi-View 3D Reconstruction with Transformers, [[Paper]](https://arxiv.org/pdf/2110.08861.pdf), [[Code]](https://github.com/FomalhautB/3D-RETR) 7 | - (arXiv 2021.11) Reference-based Magnetic Resonance Image Reconstruction Using Texture Transformer, [[Paper]](https://arxiv.org/pdf/2111.09492.pdf) 8 | - (arXiv 2021.11) HEAT: Holistic Edge Attention Transformer for Structured Reconstruction, [[Paper]](https://arxiv.org/pdf/2111.15143.pdf) 9 | - (arXiv 2021.12) VoRTX: Volumetric 3D Reconstruction With Transformers for Voxelwise View Selection and Fusion, [[Paper]](https://arxiv.org/pdf/2112.00236.pdf), [[Code]](https://noahstier.github.io/vortx/) 10 | - (arXiv 2022.01) Spectral Compressive Imaging Reconstruction Using Convolution and Spectral Contextual Transformer, [[Paper]](https://arxiv.org/pdf/2201.05768.pdf) 11 | - (arXiv 2022.03) RayTran: 3D pose estimation and shape reconstruction of multiple objects from videos with ray-traced transformers, [[Paper]](https://arxiv.org/pdf/2203.13296.pdf) 12 | - (arXiv 2022.05) 3D-C2FT: Coarse-to-fine Transformer for Multi-view 3D Reconstruction, [[Paper]](https://arxiv.org/pdf/2205.14575.pdf) 13 | - (arXiv 
2022.05) HeatER: An Efficient and Unified Network for Human Reconstruction via Heatmap-based TransformER, [[Paper]](https://arxiv.org/pdf/2205.15448.pdf) 14 | - (arXiv 2022.06) Extreme Floorplan Reconstruction by Structure-Hallucinating Transformer Cascades, [[Paper]](https://arxiv.org/pdf/2206.00645.pdf) 15 | - (arXiv 2022.08) PlaneFormers: From Sparse View Planes to 3D Reconstruction, [[Paper]](https://arxiv.org/pdf/2208.04307.pdf) 16 | - (arXiv 2023.01) Monocular Scene Reconstruction with 3D SDF Transformers, [[Paper]](https://arxiv.org/pdf/2301.13510.pdf), [[Project]](https://weihaosky.github.io/sdfformer) 17 | - (arXiv 2023.02) Efficient 3D Object Reconstruction using Visual Transformers, [[Paper]](https://arxiv.org/pdf/2302.08474.pdf) 18 | - (arXiv 2023.02) UMIFormer: Mining the Correlations between Similar Tokens for Multi-View 3D Reconstruction, [[Paper]](https://arxiv.org/pdf/2302.13987.pdf) 19 | - (arXiv 2023.03) CryoFormer: Continuous Reconstruction of 3D Structures from Cryo-EM Data using Transformer-based Neural Representations, [[Paper]](https://arxiv.org/pdf/2303.16254.pdf), [[Project]](https://cryoformer.github.io/) 20 | - (arXiv 2023.04) CornerFormer: Boosting Corner Representation for Fine-Grained Structured Reconstruction, [[Paper]](https://arxiv.org/pdf/2304.07072.pdf) 21 | - (arXiv 2023.07) Image Reconstruction using Enhanced Vision Transformer, [[Paper]](https://arxiv.org/pdf/2307.05616.pdf) 22 | - (arXiv 2023.08) Long-Range Grouping Transformer for Multi-View 3D Reconstruction, [[Paper]](https://arxiv.org/pdf/2308.08724.pdf),[[Code]](https://github.com/LiyingCV/Long-Range-Grouping-Transformer) 23 | - (arXiv 2023.08) A Transformer-Conditioned Neural Fields Pipeline with Polar Coordinate Representation for Astronomical Radio Interferometric Data Reconstruction, [[Paper]](https://arxiv.org/pdf/2308.14610.pdf) 24 | - (arXiv 2023.09) Global-correlated 3D-decoupling Transformer for Clothed Avatar Reconstruction, [[Paper]](https://arxiv.org/pdf/2309.13524.pdf),[[Code]](https://github.com/River-Zhang/GTA) 25 | - (arXiv 2023.10) Sketch2CADScript: 3D Scene Reconstruction from 2D Sketch using Visual Transformer and Rhino Grasshopper, [[Paper]](https://arxiv.org/pdf/2309.16850.pdf) 26 | - (arXiv 2023.10) ShapeGraFormer: GraFormer-Based Network for Hand-Object Reconstruction from a Single Depth Map, [[Paper]](https://arxiv.org/pdf/2310.11811.pdf) 27 | - (arXiv 2023.10) DIAR: Deep Image Alignment and Reconstruction using Swin Transformers, [[Paper]](https://arxiv.org/pdf/2310.11605.pdf) 28 | - (arXiv 2023.12) Triplane Meets Gaussian Splatting: Fast and Generalizable Single-View 3D Reconstruction with Transformers, [[Paper]](https://arxiv.org/pdf/2312.09147.pdf),[[Project]](https://zouzx.github.io/TriplaneGaussian/) 29 | - (arXiv 2024.01) GridFormer: Point-Grid Transformer for Surface Reconstruction, [[Paper]](https://arxiv.org/pdf/2401.02292.pdf),[[Code]](https://github.com/list17/GridFormer) 30 | - (arXiv 2024.04) Joint Reconstruction of 3D Human and Object via Contact-Based Refinement Transformer, [[Paper]](https://arxiv.org/pdf/2404.04819.pdf),[[Code]](https://github.com/dqj5182/CONTHO_RELEASE) 31 | - (arXiv 2024.04) DIG3D: Marrying Gaussian Splatting with Deformable Transformer for Single Image 3D Reconstruction, [[Paper]](https://arxiv.org/pdf/2404.16323.pdf) 32 | - (arXiv 2024.08) TranSplat: Generalizable 3D Gaussian Splatting from Sparse Multi-View Images with Transformers,
[[Paper]](https://arxiv.org/pdf/2408.13770.pdf),[[Code]](https://xingyoujun.github.io/transplat) 33 | - (arXiv 2024.10) Disambiguating Monocular Reconstruction of 3D Clothed Human with Spatial-Temporal Transformer, [[Paper]](https://arxiv.org/pdf/2410.16337.pdf) 34 | - (arXiv 2024.11) TimeFormer: Capturing Temporal Relationships of Deformable 3D Gaussians for Robust Reconstruction, [[Paper]](https://arxiv.org/pdf/2411.11941.pdf) 35 | - (arXiv 2025.02) ViSIR: Vision Transformer Single Image Reconstruction Method for Earth System Models, [[Paper]](https://arxiv.org/pdf/2502.06741.pdf) 36 | - (arXiv 2025.03) HORT: Monocular Hand-held Objects Reconstruction with Transformers, [[Paper]](https://arxiv.org/pdf/2503.21313.pdf),[[Code]](https://zerchen.github.io/projects/hort.html) 37 | - (arXiv 2025.05) The Moon's Many Faces: A Single Unified Transformer for Multimodal Lunar Reconstruction, [[Paper]](https://arxiv.org/pdf/2505.05644.pdf) 38 | -------------------------------------------------------------------------------- /main/referring.md: -------------------------------------------------------------------------------- 1 | ### Referring 2 | - (arXiv 2021.08) Vision-Language Transformer and Query Generation for Referring Segmentation, [[Paper]](https://arxiv.org/pdf/2108.05565.pdf), [[Code]](https://github.com/henghuiding/Vision-Language-Transformer) 3 | - (arXiv 2021.12) LAVT: Language-Aware Vision Transformer for Referring Image Segmentation, [[Paper]](https://arxiv.org/pdf/2112.02244.pdf),[[Code]](https://github.com/yz93/LAVT-RIS) 4 | - (arXiv 2022.03) ReSTR: Convolution-free Referring Image Segmentation Using Transformers, [[Paper]](https://arxiv.org/pdf/2203.16768.pdf), [[Code]](http://cvlab.postech.ac.kr/research/restr/) 5 | - (arXiv 2022.10) VLT: Vision-Language Transformer and Query Generation for Referring Segmentation, [[Paper]](https://arxiv.org/pdf/2210.15871.pdf), [[Code]](https://github.com/henghuiding/Vision-Language-Transformer) 6 | - (arXiv 2023.09) Contrastive Grouping with Transformer for Referring Image Segmentation, [[Paper]](https://arxiv.org/pdf/2309.01017.pdf),[[Code]](https://github.com/Toneyaya/CGFormer) 7 | - (arXiv 2024.07) SafaRi: Adaptive Sequence Transformer for Weakly Supervised Referring Expression Segmentation, [[Paper]](https://arxiv.org/pdf/2407.02394.pdf),[[Code]](https://sayannag.github.io/safari_eccv2024/) 8 | - (arXiv 2024.07) RefMask3D: Language-Guided Transformer for 3D Referring Segmentation, [[Paper]](https://arxiv.org/pdf/2407.18244.pdf),[[Code]](https://github.com/heshuting555/RefMask3D) 9 | - (arXiv 2024.08) Cross-aware Early Fusion with Stage-divided Vision and Language Transformer Encoders for Referring Image Segmentation, [[Paper]](https://arxiv.org/pdf/2408.07539.pdf) 10 | -------------------------------------------------------------------------------- /main/registration.md: -------------------------------------------------------------------------------- 1 | ### Registration 2 | - (arXiv 2021.04) ViT-V-Net: Vision Transformer for Unsupervised Volumetric Medical Image Registration, [[Paper]](https://arxiv.org/pdf/2104.06468.pdf), [[Code]](https://bit.ly/3bWDynR) 3 | - (arXiv 2021.05) Attention for Image Registration (AiR): an unsupervised Transformer approach, [[Paper]](https://arxiv.org/pdf/2105.02282.pdf) 4 | - (arXiv 2022.02) A Transformer-based Network for Deformable Medical Image Registration, [[Paper]](https://arxiv.org/pdf/2202.12104.pdf) 5 | - (arXiv 2022.03) Affine Medical Image Registration with Coarse-to-Fine Vision Transformer,
[[Paper]](https://arxiv.org/pdf/2203.15216.pdf), [[Code]](https://github.com/cwmok/C2FViT) 6 | - (arXiv 2022.04) Symmetric Transformer-based Network for Unsupervised Image Registration, [[Paper]](https://arxiv.org/pdf/2204.13575.pdf), [[Code]](https://github.com/MingR-Ma/SymTrans) 7 | - (arXiv 2023.03) Spatially-varying Regularization with Conditional Transformer for Unsupervised Image Registration, [[Paper]](https://arxiv.org/pdf/2303.06168.pdf) 8 | - (arXiv 2023.03) RegFormer: An Efficient Projection-Aware Transformer Network for Large-Scale Point Cloud Registration, [[Paper]](https://arxiv.org/pdf/2303.12384.pdf) 9 | - (arXiv 2023.07) Non-iterative Coarse-to-fine Transformer Networks for Joint Affine and Deformable Image Registration, [[Paper]](https://arxiv.org/pdf/2307.03421.pdf) 10 | - (arXiv 2023.08) 2D3D-MATR: 2D-3D Matching Transformer for Detection-free Registration between Images and Point Clouds, [[Paper]](https://arxiv.org/pdf/2308.05667.pdf), [[Code]](https://github.com/minhaolee/2D3DMATR) 11 | - (arXiv 2023.08) GeoTransformer: Fast and Robust Point Cloud Registration with Geometric Transformer, [[Paper]](https://arxiv.org/pdf/2308.03768.pdf), [[Code]](https://github.com/qinzheng93/GeoTransformer) 12 | - (arXiv 2023.10) OAAFormer: Robust and Efficient Point Cloud Registration Through Overlapping-Aware Attention in Transformer, [[Paper]](https://arxiv.org/pdf/2310.09817.pdf) 13 | - (arXiv 2023.12) VSFormer: Visual-Spatial Fusion Transformer for Correspondence Pruning, [[Paper]](https://arxiv.org/pdf/2312.08774.pdf), [[Code]](https://github.com/sugar-fly/VSFormer) 14 | - (arXiv 2023.12) D3Former: Jointly Learning Repeatable Dense Detectors and Feature-enhanced Descriptors via Saliency-guided Transformer, [[Paper]](https://arxiv.org/pdf/2312.12970.pdf) 15 | - (arXiv 2024.03) EfficientMorph: Parameter-Efficient Transformer-Based Architecture for 3D Image Registration, [[Paper]](https://arxiv.org/pdf/2403.11026.pdf) 16 | - (arXiv 2024.04) PointDifformer: Robust Point Cloud Registration With Neural Diffusion and Transformer, [[Paper]](https://arxiv.org/pdf/2404.14034.pdf) 17 | - (arXiv 2024.06) Point Tree Transformer for Point Cloud Registration, [[Paper]](https://arxiv.org/pdf/2406.17530.pdf) 18 | - (arXiv 2024.10) TFCT-I2P: Three stream fusion network with color aware transformer for image-to-point cloud registration, [[Paper]](https://arxiv.org/pdf/2410.00360.pdf), [[Code]](https://github.com/muyao99/TFCT-I2P) 19 | - (arXiv 2024.10) A Consistency-Aware Spot-Guided Transformer for Versatile and Hierarchical Point Cloud Registration, [[Paper]](https://arxiv.org/pdf/2410.10295.pdf), [[Code]](https://github.com/RenlangHuang/CAST) 20 | -------------------------------------------------------------------------------- /main/retrieval.md: -------------------------------------------------------------------------------- 1 | ### Retrieval 2 | - (CVPR'21) Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers, [[Paper]](https://arxiv.org/abs/2103.16553) 3 | - (arXiv 2021.01) Investigating the Vision Transformer Model for Image Retrieval Tasks, [[Paper]](https://arxiv.org/pdf/2101.03771.pdf) 4 | - (arXiv 2021.02) Training Vision Transformers for Image Retrieval, [[Paper]](https://arxiv.org/pdf/2102.05644.pdf) 5 | - (arXiv 2021.03) Instance-level Image Retrieval using Reranking Transformers, [[Paper]](https://arxiv.org/abs/2103.12424) 6 | - (arXiv 2021.04) Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval,
[[Paper]](https://arxiv.org/pdf/2104.00650.pdf) 7 | - (arXiv 2021.04) Self-supervised Video Retrieval Transformer Network, [[Paper]](https://arxiv.org/pdf/2104.07993.pdf) 8 | - (arXiv 2021.05) TransHash: Transformer-based Hamming Hashing for Efficient Image Retrieval, [[Paper]](https://arxiv.org/pdf/2105.01823.pdf) 9 | - (arXiv 2021.06) Towards Efficient Cross-Modal Visual Textual Retrieval using Transformer-Encoder Deep Features, [[Paper]](https://arxiv.org/pdf/2106.00358.pdf) 10 | - (arXiv 2021.06) All You Can Embed: Natural Language based Vehicle Retrieval with Spatio-Temporal Transformers, [[Paper]](https://arxiv.org/pdf/2106.10153.pdf), [[Code]](https://github.com/cscribano/AYCE_2021) 11 | - (arXiv 2021.09) Vision Transformer Hashing for Image Retrieval, [[Paper]](https://arxiv.org/pdf/2109.12564.pdf) 12 | - (arXiv 2022.03) Zero-Shot Sketch Based Image Retrieval using Graph Transformer, [[Paper]](https://arxiv.org/pdf/2203.06429.pdf) 13 | - (arXiv 2022.07) TS2-Net: Token Shift and Selection Transformer for Text-Video Retrieval, [[Paper]](https://arxiv.org/pdf/2207.07852.pdf), [[Code]](https://github.com/yuqi657/ts2_net) 14 | - (arXiv 2022.08) EViT: Privacy-Preserving Image Retrieval via Encrypted Vision Transformer in Cloud Computing, [[Paper]](https://arxiv.org/pdf/2208.14657.pdf), [[Code]](https://github.com/onlinehuazai/EViT) 15 | - (arXiv 2022.10) ConTra: (Con)text (Tra)nsformer for Cross-Modal Video Retrieval, [[Paper]](https://arxiv.org/pdf/2210.04341.pdf) 16 | - (arXiv 2022.10) General Image Descriptors for Open World Image Retrieval using ViT CLIP, [[Paper]](https://arxiv.org/pdf/2210.11141.pdf) 17 | - (arXiv 2022.10) Boosting vision transformers for image retrieval, [[Paper]](https://arxiv.org/pdf/2210.11909.pdf), [[Code]](https://github.com/dealicious-inc/DToP) 18 | - (arXiv 2023.04) STIR: Siamese Transformer for Image Retrieval Postprocessing, [[Paper]](https://arxiv.org/pdf/2304.13393.pdf), [[Code]](https://github.com/OML-Team/open-metric-learning/tree/main/pipelines/postprocessing/) 19 | - (arXiv 2023.08) Unifying Two-Stream Encoders with Transformers for Cross-Modal Retrieval, [[Paper]](https://arxiv.org/pdf/2308.04343.pdf), [[Code]](https://github.com/LuminosityX/HAT) 20 | - (arXiv 2023.10) GMMFormer: Gaussian-Mixture-Model based Transformer for Efficient Partially Relevant Video Retrieval, [[Paper]](https://arxiv.org/pdf/2310.05195.pdf) 21 | - (arXiv 2024.01) Transformer-based Clipped Contrastive Quantization Learning for Unsupervised Image Retrieval, [[Paper]](https://arxiv.org/pdf/2401.15362.pdf) 22 | - (arXiv 2024.05) GMMFormer v2: An Uncertainty-aware Framework for Partially Relevant Video Retrieval, [[Paper]](https://arxiv.org/pdf/2405.13824.pdf), [[Code]](https://github.com/huangmozhi9527/GMMFormer_v2) 23 | - (arXiv 2024.06) Decomposing and Interpreting Image Representations via Text in ViTs Beyond CLIP, [[Paper]](https://arxiv.org/pdf/2406.01583.pdf) 24 | - (arXiv 2024.06) Multi-Scale Temporal Difference Transformer for Video-Text Retrieval, [[Paper]](https://arxiv.org/pdf/2406.16111.pdf) 25 | - (arXiv 2024.09) Self-Supervised Vision Transformers for Writer Retrieval, [[Paper]](https://arxiv.org/pdf/2409.00751.pdf) 26 | - (arXiv 2024.09) Evidential Transformers for Improved Image Retrieval, [[Paper]](https://arxiv.org/pdf/2409.01082.pdf) 27 | - (arXiv 2024.10) PtychoFormer: A Transformer-based Model for Ptychographic Phase Retrieval, [[Paper]](https://arxiv.org/pdf/2410.17377.pdf) 28 | - (arXiv 2025.01)
Length-Aware DETR for Robust Moment Retrieval, [[Paper]](https://arxiv.org/pdf/2412.20816.pdf), [[Code]](https://github.com/sjpark5800/LA-DETR) 29 | - (arXiv 2025.01) LD-DETR: Loop Decoder DEtection TRansformer for Video Moment Retrieval and Highlight Detection, [[Paper]](https://arxiv.org/pdf/2501.10787.pdf), [[Code]](https://github.com/qingchen239/ld-detr) 30 | -------------------------------------------------------------------------------- /main/robotic.md: -------------------------------------------------------------------------------- 1 | ### Robotic 2 | - (arXiv 2022.01) Look Closer: Bridging Egocentric and Third-Person Views with Transformers for Robotic Manipulation, [[Paper]](https://arxiv.org/pdf/2201.07779.pdf), [[Code]](https://github.com/jangirrishabh/look-closer) 3 | - (arXiv 2022.02) When Transformer Meets Robotic Grasping: Exploits Context for Efficient Grasp Detection, [[Paper]](https://arxiv.org/pdf/2202.11911.pdf), [[Code]](https://github.com/WangShaoSUN/grasp-transformer) 4 | - (arXiv 2022.07) 3D Part Assembly Generation with Instance Encoded Transformer, [[Paper]](https://arxiv.org/pdf/2207.01779.pdf) 5 | - (arXiv 2022.09) Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation, [[Paper]](https://arxiv.org/pdf/2209.05451.pdf), [[Project]](https://peract.github.io/) 6 | - (arXiv 2022.09) PACT: Perception-Action Causal Transformer for Autoregressive Robotics Pre-Training, [[Paper]](https://arxiv.org/pdf/2209.11133.pdf) 7 | - (arXiv 2022.12) RT-1: Robotics Transformer for Real-World Control at Scale, [[Paper]](https://arxiv.org/pdf/2212.06817.pdf), [[Project]](http://robotics-transformer.github.io/) 8 | - (arXiv 2023.06) RVT: Robotic View Transformer for 3D Object Manipulation, [[Paper]](https://arxiv.org/pdf/2306.14896.pdf), [[Project]](https://robotic-view-transformer.github.io/) 9 | - (arXiv 2023.09) AnyOKP: One-Shot and Instance-Aware Object Keypoint Extraction with Pretrained ViT, [[Paper]](https://arxiv.org/pdf/2309.08134.pdf) 10 | - (arXiv 2023.09) PolarNet: 3D Point Clouds for Language-Guided Robotic Manipulation, [[Paper]](https://arxiv.org/pdf/2309.15596.pdf), [[Project]](https://www.di.ens.fr/willow/research/polarnet/) 11 | - (arXiv 2023.10) Knolling bot: A Transformer-based Approach to Organizing a Messy Table, [[Paper]](https://arxiv.org/pdf/2310.04566.pdf) 12 | - (arXiv 2023.11) M2T2: Multi-Task Masked Transformer for Object-centric Pick and Place, [[Paper]](https://arxiv.org/pdf/2311.00926.pdf), [[Project]](https://m2-t2.github.io/) 13 | - (arXiv 2023.11) FViT-Grasp: Grasping Objects With Using Fast Vision Transformers, [[Paper]](https://arxiv.org/pdf/2311.13986.pdf) 14 | - (arXiv 2024.01) MResT: Multi-Resolution Sensing for Real-Time Control with Vision-Language Models, [[Paper]](https://arxiv.org/pdf/2401.14502.pdf), [[Code]](http://tinyurl.com/multi-res-realtime-control) 15 | - (arXiv 2024.02) EffLoc: Lightweight Vision Transformer for Efficient 6-DOF Camera Relocalization, [[Paper]](https://arxiv.org/pdf/2402.13537.pdf) 16 | - (arXiv 2024.05) Track2Act: Predicting Point Tracks from Internet Videos enables Diverse Zero-shot Robot Manipulation, [[Paper]](https://arxiv.org/pdf/2405.01527.pdf), [[Project]](https://homangab.github.io/track2act/) 17 | - (arXiv 2024.06) Adapting Pretrained ViTs with Convolution Injector for Visuo-Motor Control, [[Paper]](https://arxiv.org/pdf/2406.06072.pdf), [[Project]](https://github.com/dojeon-ai/CoIn) 18 | - (arXiv 2024.11) HiMemFormer: Hierarchical Memory-Aware Transformer for Multi-Agent Action Anticipation, 
[[Paper]](https://arxiv.org/pdf/2411.01455.pdf) 19 | - (arXiv 2025.02) UP-VLA: A Unified Understanding and Prediction Model for Embodied Agent, [[Paper]](https://arxiv.org/pdf/2501.18867.pdf) 20 | - (arXiv 2025.02) VertiFormer: A Data-Efficient Multi-Task Transformer for Off-Road Robot Mobility, [[Paper]](https://arxiv.org/pdf/2502.00543.pdf), [[Code]](https://github.com/mhnazeri/VertiFormer) 21 | - (arXiv 2025.03) ViT-VS: On the Applicability of Pretrained Vision Transformer Features for Generalizable Visual Servoing, [[Paper]](https://arxiv.org/pdf/2503.04545.pdf), [[Code]](https://github.com/AlessandroScherl/ViT-VS) 22 | - (arXiv 2025.03) High-Precision Transformer-Based Visual Servoing for Humanoid Robots in Aligning Tiny Objects, [[Paper]](https://arxiv.org/pdf/2503.04862) 23 | -------------------------------------------------------------------------------- /main/salient-detection.md: -------------------------------------------------------------------------------- 1 | ### Salient Detection 2 | - (arXiv 2021.04) Transformer Transforms Salient Object Detection and Camouflaged Object Detection, [[Paper]](https://arxiv.org/pdf/2104.10127.pdf) 3 | - (arXiv 2021.04) Visual Saliency Transformer, [[Paper]](https://arxiv.org/pdf/2104.12099.pdf) 4 | - (arXiv 2021.04) CoSformer: Detecting Co-Salient Object with Transformers, [[Paper]](https://arxiv.org/pdf/2104.14729.pdf) 5 | - (arXiv 2021.08) Unifying Global-Local Representations in Salient Object Detection with Transformer, [[Paper]](https://arxiv.org/pdf/2108.02759.pdf), [[Code]](https://github.com/OliverRensu/GLSTR) 6 | - (arXiv 2021.08) TriTransNet: RGB-D Salient Object Detection with a Triplet Transformer Embedding Network, [[Paper]](https://arxiv.org/pdf/2108.03990.pdf), [[Code]](https://github.com/liuzywen/TriTransNet) 7 | - (arXiv 2021.08) Boosting Salient Object Detection with Transformer-based Asymmetric Bilateral U-Net, [[Paper]](https://arxiv.org/pdf/2108.07851.pdf) 8 | - (arXiv 2021.12) Transformer-based Network for RGB-D Saliency Detection, [[Paper]](https://arxiv.org/pdf/2112.00582.pdf) 9 | - (arXiv 2021.12) MTFNet: Mutual-Transformer Fusion Network for RGB-D Salient Object Detection, [[Paper]](https://arxiv.org/pdf/2112.01177.pdf) 10 | - (arXiv 2021.12) Learning Generative Vision Transformer with Energy-Based Latent Space for Saliency Prediction, [[Paper]](https://arxiv.org/pdf/2112.13528.pdf) 11 | - (arXiv 2022.03) A Unified Transformer Framework for Group-based Segmentation: Co-Segmentation, Co-Saliency Detection and Video Salient Object Detection, [[Paper]](https://arxiv.org/pdf/2203.04708.pdf), [[Code]](https://github.com/suyukun666/UFO) 12 | - (arXiv 2022.03) DFTR: Depth-supervised Hierarchical Feature Fusion Transformer for Salient Object Detection, [[Paper]](https://arxiv.org/pdf/2112.13528.pdf) 13 | - (arXiv 2022.03) GroupTransNet: Group Transformer Network for RGB-D Salient Object Detection, [[Paper]](https://arxiv.org/pdf/2203.10785.pdf) 14 | - (arXiv 2022.03) Unsupervised Salient Object Detection with Spectral Cluster Voting, [[Paper]](https://arxiv.org/pdf/2203.12614.pdf), [[Code]](https://github.com/NoelShin/selfmask) 15 | - (arXiv 2022.05) SelfReformer: Self-Refined Network with Transformer for Salient Object Detection, [[Paper]](https://arxiv.org/pdf/2205.11283.pdf) 16 | - (arXiv 2022.06) Dual Swin-Transformer based Mutual Interactive Network for RGB-D Salient Object Detection, [[Paper]](https://arxiv.org/pdf/2206.03105.pdf) 17 | - (arXiv 2022.07) TANet: Transformer-based Asymmetric Network for RGB-D Salient 
Object Detection, [[Paper]](https://arxiv.org/pdf/2207.01172.pdf), [[Code]](https://github.com/lc012463/TANet) 18 | - (arXiv 2022.07) Mirror Complementary Transformer Network for RGB-thermal Salient Object Detection, [[Paper]](https://arxiv.org/pdf/2207.03558.pdf), [[Code]](https://github.com/jxr326/SwinMCNet) 19 | - (arXiv 2022.07) SiaTrans: Siamese Transformer Network for RGB-D Salient Object Detection with Depth Image Classification, [[Paper]](https://arxiv.org/pdf/2207.04224.pdf) 20 | - (arXiv 2022.09) Panoramic Vision Transformer for Saliency Detection in 360° Videos, [[Paper]](https://arxiv.org/pdf/2209.08956.pdf) 21 | - (arXiv 2023.01) HRTransNet: HRFormer-Driven Two-Modality Salient Object Detection, [[Paper]](https://arxiv.org/pdf/2301.03036.pdf), [[Code]](https://github.com/liuzywen/HRTransNet) 22 | - (arXiv 2023.02) Hierarchical Cross-modal Transformer for RGB-D Salient Object Detection, [[Paper]](https://arxiv.org/pdf/2302.08052.pdf) 23 | - (arXiv 2023.05) Discriminative Co-Saliency and Background Mining Transformer for Co-Salient Object Detection, [[Paper]](https://arxiv.org/pdf/2305.00514.pdf), [[Code]](https://github.com/dragonlee258079/DMT) 24 | - (arXiv 2023.05) Salient Mask-Guided Vision Transformer for Fine-Grained Classification, [[Paper]](https://arxiv.org/pdf/2305.07102.pdf) 25 | - (arXiv 2023.08) Recurrent Multi-scale Transformer for High-Resolution Salient Object Detection, [[Paper]](https://arxiv.org/pdf/2308.03826.pdf),[[Code]](https://github.com/DrowsyMon/RMFormer) 26 | - (arXiv 2023.08) Distortion-aware Transformer in 360° Salient Object Detection, [[Paper]](https://arxiv.org/pdf/2308.03359.pdf),[[Code]](https://github.com/yjzhao19981027/DATFormer/) 27 | - (arXiv 2023.09) UniST: Towards Unifying Saliency Transformer for Video Saliency Prediction and Detection, [[Paper]](https://arxiv.org/pdf/2309.08220.pdf) 28 | - (arXiv 2023.09) Salient Object Detection in Optical Remote Sensing Images Driven by Transformer, [[Paper]](https://arxiv.org/pdf/2309.08206.pdf), [[Code]](https://github.com/MathLee/GeleNet) 29 | - (arXiv 2023.10) VST++: Efficient and Stronger Visual Saliency Transformer, [[Paper]](https://arxiv.org/pdf/2310.11725.pdf) 30 | - (arXiv 2024.03) A Simple yet Effective Network based on Vision Transformer for Camouflaged Object and Salient Object Detection, [[Paper]](https://arxiv.org/pdf/2402.18922.pdf), [[Code]](https://github.com/linuxsino/SENet) 31 | -------------------------------------------------------------------------------- /main/shape.md: -------------------------------------------------------------------------------- 1 | ### Shape 2 | - (WACV'21) End-to-end Lane Shape Prediction with Transformers, [[Paper]](https://arxiv.org/abs/2011.04233), [[Code]](https://github.com/liuruijin17/LSTR) 3 | - (arXiv 2022.01) ShapeFormer: Transformer-based Shape Completion via Sparse Representation, [[Paper]](https://arxiv.org/abs/2201.10326), [[Project]](https://shapeformer.github.io/) 4 | - (arXiv 2022.03) AutoSDF: Shape Priors for 3D Completion, Reconstruction and Generation, [[Paper]](https://arxiv.org/pdf/2203.09516.pdf), [[Project]](https://yccyenchicheng.github.io/AutoSDF/) 5 | - (arXiv 2022.10) mm-Wave Radar Hand Shape Classification Using Deformable Transformers, [[Paper]](https://arxiv.org/pdf/2210.13079.pdf) 6 | - (arXiv 2023.07) DiT-3D: Exploring Plain Diffusion Transformers for 3D Shape Generation, [[Paper]](https://arxiv.org/pdf/2307.01831.pdf) 7 | - (arXiv 2023.09) DeFormer: Integrating Transformers with Deformable Models for 3D Shape
Abstraction from a Single Image, [[Paper]](https://arxiv.org/pdf/2309.12594.pdf) 8 | - (arXiv 2024.05) PT43D: A Probabilistic Transformer for Generating 3D Shapes from Single Highly-Ambiguous RGB Images, [[Paper]](https://arxiv.org/pdf/2405.11914.pdf) 9 | - (arXiv 2024.07) PASTA: Controllable Part-Aware Shape Generation with Autoregressive Transformers, [[Paper]](https://arxiv.org/pdf/2407.13677.pdf) 10 | - (arXiv 2024.09) VSFormer: Mining Correlations in Flexible View Set for Multi-view 3D Shape Understanding, [[Paper]](https://arxiv.org/pdf/2409.09254.pdf), [[Code]](https://github.com/auniquesun/VSFormer) 11 | - (arXiv 2024.11) POC-SLT: Partial Object Completion with SDF Latent Transformers, [[Paper]](https://arxiv.org/pdf/2411.05419.pdf) 12 | - (arXiv 2024.11) Geometric Point Attention Transformer for 3D Shape Reassembly, [[Paper]](https://arxiv.org/pdf/2411.17788.pdf) 13 | -------------------------------------------------------------------------------- /main/slam.md: -------------------------------------------------------------------------------- 1 | ### SLAM 2 | - (arXiv 2022.06) AFT-VO: Asynchronous Fusion Transformers for Multi-View Visual Odometry Estimation, [[Paper]](https://arxiv.org/abs/2206.12946) 3 | - (arXiv 2023.04) TransFusionOdom: Interpretable Transformer-based LiDAR-Inertial Fusion Odometry Estimation, [[Paper]](https://arxiv.org/abs/2304.07728) 4 | -------------------------------------------------------------------------------- /main/snn.md: -------------------------------------------------------------------------------- 1 | ### SNN 2 | - (arXiv 2022.10) Spikformer: When Spiking Neural Network Meets Transformer, [[Paper]](https://arxiv.org/abs/2209.15425) 3 | - (arXiv 2023.07) Spike-driven Transformer, [[Paper]](https://arxiv.org/abs/2307.01694), [[Code]](https://github.com/BICLab/Spike-Driven-Transformer) 4 | - (arXiv 2023.08) SSTFormer: Bridging Spiking Neural Network and Memory Support Transformer for Frame-Event based Recognition, [[Paper]](https://arxiv.org/pdf/2308.04369.pdf), [[Code]](https://github.com/Event-AHU/SSTFormer) 5 | - (arXiv 2023.08) Attention-free Spikformer: Mixing Spike Sequences with Simple Linear Transforms, [[Paper]](https://arxiv.org/pdf/2308.02557.pdf) 6 | - (arXiv 2023.11) SparseSpikformer: A Co-Design Framework for Token and Weight Pruning in Spiking Transformer, [[Paper]](https://arxiv.org/pdf/2311.08806.pdf) 7 | - (arXiv 2023.11) Spiking Neural Networks with Dynamic Time Steps for Vision Transformers, [[Paper]](https://arxiv.org/pdf/2311.16456.pdf) 8 | - (arXiv 2024.01) Spikformer V2: Join the High Accuracy Club on ImageNet with an SNN Ticket, [[Paper]](https://arxiv.org/pdf/2401.02020.pdf) 9 | - (arXiv 2024.01) TIM: An Efficient Temporal Interaction Module for Spiking Transformer, [[Paper]](https://arxiv.org/pdf/2401.11687.pdf) 10 | - (arXiv 2024.02) Spiking-PhysFormer: Camera-Based Remote Photoplethysmography with Parallel Spike-driven Transformer, [[Paper]](https://arxiv.org/pdf/2402.04798.pdf) 11 | - (arXiv 2024.02) SDiT: Spiking Diffusion Model with Transformer, [[Paper]](https://arxiv.org/pdf/2402.11588.pdf) 12 | - (arXiv 2024.03) SpikingResformer: Bridging ResNet and Vision Transformer in Spiking Neural Networks, [[Paper]](https://arxiv.org/pdf/2403.14302.pdf), [[Code]](https://github.com/xyshi2000/SpikingResformer) 13 | - (arXiv 2024.03) QKFormer: Hierarchical Spiking Transformer using Q-K Attention, [[Paper]](https://arxiv.org/pdf/2403.16552.pdf), [[Code]](https://github.com/zhouchenlin2096/QKFormer) 14 | - (arXiv 2024.03)
Fourier or Wavelet bases as counterpart self-attention in spikformer for efficient visual classification, [[Paper]](https://arxiv.org/pdf/2403.18228.pdf) 15 | - (arXiv 2024.04) Spike-driven Transformer V2: Meta Spiking Neural Network Architecture Inspiring the Design of Next-generation Neuromorphic Chips, [[Paper]](https://arxiv.org/pdf/2404.03663.pdf), [[Code]](https://github.com/BICLab/Spike-Driven-Transformer-V2) 16 | - (arXiv 2024.04) A Novel Spike Transformer Network for Depth Estimation from Event Cameras via Cross-modality Knowledge Distillation, [[Paper]](https://arxiv.org/pdf/2404.17335.pdf) 17 | - (arXiv 2024.06) SVFormer: A Direct Training Spiking Transformer for Efficient Video Action Recognition, [[Paper]](https://arxiv.org/pdf/2406.15034.pdf) 18 | - (arXiv 2024.07) Spiking Tucker Fusion Transformer for Audio-Visual Zero-Shot Learning, [[Paper]](https://arxiv.org/pdf/2407.08130.pdf) 19 | - (arXiv 2024.08) AT-SNN: Adaptive Tokens for Vision Transformer on Spiking Neural Network, [[Paper]](https://arxiv.org/pdf/2408.12293.pdf) 20 | - (arXiv 2024.09) DS2TA: Denoising Spiking Transformer with Attenuated Spatiotemporal Attention, [[Paper]](https://arxiv.org/pdf/2409.15375.pdf) 21 | - (arXiv 2024.11) Scaling Spike-driven Transformer with Efficient Spike Firing Approximation Training, [[Paper]](https://arxiv.org/pdf/2411.16061.pdf), [[Code]](https://github.com/BICLab/Spike-Driven-Transformer-V3) 22 | - (arXiv 2024.12) Hybrid Spiking Neural Network -- Transformer Video Classification Model, [[Paper]](https://arxiv.org/pdf/2412.00237.pdf), [[Code]](https://github.com/TheRNB/HyTSSN/tree/main) 23 | - (arXiv 2024.12) Efficient Event-based Semantic Segmentation with Spike-driven Lightweight Transformer-based Networks, [[Paper]](https://arxiv.org/pdf/2412.12843.pdf) 24 | - (arXiv 2024.12) Spike2Former: Efficient Spiking Transformer for High-performance Image Segmentation, [[Paper]](https://arxiv.org/pdf/2412.14587.pdf), [[Code]](https://github.com/BICLab/Spike2Former) 25 | - (arXiv 2025.01) Binary Event-Driven Spiking Transformer, [[Paper]](https://arxiv.org/pdf/2501.05904) 26 | - (arXiv 2025.01) Quantized Spike-driven Transformer, [[Paper]](https://arxiv.org/pdf/2501.13492), [[Code]](https://github.com/bollossom/QSD-Transformer/tree/main) 27 | - (arXiv 2025.02) Spiking Vision Transformer with Saccadic Attention, [[Paper]](https://arxiv.org/pdf/2502.12677) 28 | - (arXiv 2025.02) Towards High-performance Spiking Transformers from ANN to SNN Conversion, [[Paper]](https://arxiv.org/pdf/2502.21193), [[Code]](https://github.com/h-z-h-cell/Transformer-to-SNN-ECMT) 29 | - (arXiv 2025.03) Spiking Transformer: Introducing Accurate Addition-Only Spiking Self-Attention for Transformer, [[Paper]](https://arxiv.org/pdf/2503.00226) 30 | - (arXiv 2025.03) SpiLiFormer: Enhancing Spiking Transformers with Lateral Inhibition, [[Paper]](https://arxiv.org/pdf/2503.15986) 31 | - (arXiv 2025.05) Hybrid Spiking Vision Transformer for Object Detection with Event Cameras, [[Paper]](https://arxiv.org/pdf/2505.07715) 32 | - (arXiv 2025.05) SpikeVideoFormer: An Efficient Spike-Driven Video Transformer with Hamming Attention and O(T) Complexity, [[Paper]](https://arxiv.org/pdf/2505.10352), [[Code]](https://github.com/JimmyZou/SpikeVideoFormer) 33 | -------------------------------------------------------------------------------- /main/style-transfer.md: -------------------------------------------------------------------------------- 1 | ### Style Transfer 2 | - (arXiv 2021.06) StyTr2: Unbiased Image Style Transfer with
Transformers, [[Paper]](https://arxiv.org/pdf/2105.14576.pdf) 3 | - (arXiv 2022.10) Fine-Grained Image Style Transfer with Visual Transformers, [[Paper]](https://arxiv.org/abs/2210.05176) 4 | - (arXiv 2022.10) S2WAT: Image Style Transfer via Hierarchical Vision Transformer using Strips Window Attention, [[Paper]](https://arxiv.org/abs/2210.12381), [[Code]](https://github.com/AlienZhang1996/S2WAT) 5 | - (arXiv 2022.11) Learning Visual Representation of Underwater Acoustic Imagery Using Transformer-Based Style Transfer Method, [[Paper]](https://arxiv.org/abs/2211.05396) 6 | - (arXiv 2023.01) Edge Enhanced Image Style Transfer via Transformers, [[Paper]](https://arxiv.org/abs/2301.00592) 7 | - (arXiv 2023.04) Master: Meta Style Transformer for Controllable Zero-Shot and Few-Shot Artistic Style Transfer, [[Paper]](https://arxiv.org/abs/2304.11818) 8 | - (arXiv 2024.04) Rethink Arbitrary Style Transfer with Transformer and Contrastive Learning, [[Paper]](https://arxiv.org/pdf/2404.13584.pdf) 9 | -------------------------------------------------------------------------------- /main/synthesis.md: -------------------------------------------------------------------------------- 1 | ### Synthesis 2 | - (arXiv 2020.12) Taming Transformers for High-Resolution Image Synthesis, [[Paper]](https://arxiv.org/abs/2012.09841), [[Code]](https://compvis.github.io/taming-transformers/) 3 | - (arXiv 2021.04) Geometry-Free View Synthesis: Transformers and no 3D Priors, [[Paper]](https://arxiv.org/pdf/2104.07652.pdf) 4 | - (arXiv 2021.05) High-Resolution Complex Scene Synthesis with Transformers, [[Paper]](https://arxiv.org/pdf/2105.06458.pdf) 5 | - (arXiv 2021.06) The Image Local Autoregressive Transformer, [[Paper]](https://arxiv.org/pdf/2106.02514.pdf) 6 | - (arXiv 2021.10) ATISS: Autoregressive Transformers for Indoor Scene Synthesis, [[Paper]](https://arxiv.org/pdf/2110.03675.pdf), [[Project]](https://nv-tlabs.github.io/ATISS/) 7 | - (arXiv 2021.11) Improving Visual Quality of Image Synthesis by A Token-based Generator with Transformers, [[Paper]](https://arxiv.org/pdf/2111.03481.pdf) 8 | - (arXiv 2022.02) MaskGIT: Masked Generative Image Transformer, [[Paper]](https://arxiv.org/pdf/2202.04200.pdf) 9 | - (arXiv 2022.05) Synthesized Speech Detection Using Convolutional Transformer-Based Spectrogram Analysis, [[Paper]](https://arxiv.org/pdf/2205.01800.pdf) 10 | - (arXiv 2022.07) Diverse Dance Synthesis via Keyframes with Transformer Controllers, [[Paper]](https://arxiv.org/pdf/2207.05906.pdf) 11 | - (arXiv 2022.10) Style-Guided Inference of Transformer for High-resolution Image Synthesis, [[Paper]](https://arxiv.org/pdf/2210.05533.pdf) 12 | - (arXiv 2022.10) FontTransformer: Few-shot High-resolution Chinese Glyph Image Synthesis via Stacked Transformers, [[Paper]](https://arxiv.org/pdf/2210.06301.pdf) 13 | - (arXiv 2023.01) Geometry-biased Transformers for Novel View Synthesis, [[Paper]](https://arxiv.org/pdf/2301.04650.pdf), [[Project]](https://mayankgrwl97.github.io/gbt) 14 | - (arXiv 2023.12) NViST: In the Wild New View Synthesis from a Single Image with Transformers, [[Paper]](https://arxiv.org/pdf/2312.08568.pdf), [[Project]](https://wbjang.github.io/nvist_webpage) 15 | - (arXiv 2024.06) Revisiting Non-Autoregressive Transformers for Efficient Image Synthesis, [[Paper]](https://arxiv.org/pdf/2406.05478.pdf), [[Project]](https://github.com/LeapLabTHU/ImprovedNAT) 16 | - (arXiv 2024.07) Forest2Seq:
Revitalizing Order Prior for Sequential Indoor Scene Synthesis, [[Paper]](https://arxiv.org/pdf/2407.05388.pdf) 17 | - (arXiv 2024.09) Swin Transformer for Robust Differentiation of Real and Synthetic Images: Intra- and Inter-Dataset Analysis, [[Paper]](https://arxiv.org/pdf/2409.04734.pdf) 18 | - (arXiv 2024.09) Enhancing Image Authenticity Detection: Swin Transformers and Color Frame Analysis for CGI vs. Real Images, [[Paper]](https://arxiv.org/pdf/2409.04742.pdf) 19 | -------------------------------------------------------------------------------- /main/text-to-imagevideo.md: -------------------------------------------------------------------------------- 1 | ### Text-to-Image/Video 2 | - (arXiv 2021.01) VisualSparta: Sparse Transformer Fragment-level Matching for Large-scale Text-to-Image Search, [[Paper]](https://arxiv.org/abs/2101.00265) 3 | - (arXiv 2021.05) CogView: Mastering Text-to-Image Generation via Transformers, [[Paper]](https://arxiv.org/pdf/2105.13290.pdf) 4 | - (arXiv 2022.02) DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generative Transformers, [[Paper]](https://arxiv.org/pdf/2202.04053.pdf),[[Code]](https://github.com/j-min/DallEval) 5 | - (arXiv 2022.05) CogView2: Faster and Better Text-to-Image Generation via Hierarchical Transformers, [[Paper]](https://arxiv.org/pdf/2204.14217.pdf),[[Code]](https://github.com/THUDM/CogView2) 6 | - (arXiv 2022.05) CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers, [[Paper]](https://arxiv.org/pdf/2205.15868.pdf),[[Code]](https://github.com/THUDM/CogVideo) 7 | - (arXiv 2022.09) StoryDALL-E: Adapting Pretrained Text-to-Image Transformers for Story Continuation, [[Paper]](https://arxiv.org/pdf/2209.06192.pdf),[[Code]](https://github.com/adymaharana/storydalle) 8 | - (arXiv 2022.10) Swinv2-Imagen: Hierarchical Vision Transformer Diffusion Models for Text-to-Image Generation, [[Paper]](https://arxiv.org/pdf/2210.09549.pdf) 9 | - (arXiv 2022.12) Exploring Vision Transformers as Diffusion Learners, [[Paper]](https://arxiv.org/pdf/2212.13771.pdf) 10 | - (arXiv 2023.01) Muse: Text-To-Image Generation via Masked Generative Transformers, [[Paper]](https://arxiv.org/pdf/2301.00704.pdf),[[Code]](http://muse-model.github.io/) 11 | - (arXiv 2023.03) Lformer: Text-to-Image Generation with L-shape Block Parallel Decoding, [[Paper]](https://arxiv.org/pdf/2303.03800.pdf) 12 | - (arXiv 2023.09) A Simple Text to Video Model via Transformer, [[Paper]](https://arxiv.org/pdf/2309.14683.pdf) 13 | - (arXiv 2023.10) PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis, [[Paper]](https://arxiv.org/pdf/2310.00426.pdf) 14 | - (arXiv 2023.11) VideoDreamer: Customized Multi-Subject Text-to-Video Generation with Disen-Mix Finetuning, [[Paper]](https://arxiv.org/pdf/2311.00990.pdf),[[Code]](https://videodreamer23.github.io/) 15 | - (arXiv 2023.11) MetaDreamer: Efficient Text-to-3D Creation With Disentangling Geometry and Texture, [[Paper]](https://arxiv.org/pdf/2311.10123.pdf),[[Code]](https://metadreamer3d.github.io/) 16 | - (arXiv 2023.12) X-Dreamer: Creating High-quality 3D Content by Bridging the Domain Gap Between Text-to-2D and Text-to-3D Generation, [[Paper]](https://arxiv.org/pdf/2312.00085.pdf),[[Project]](https://xmuxiaoma666.github.io/Projects/X-Dreamer) 17 | - (arXiv 2023.12) GenTron: Delving Deep into Diffusion Transformers for Image and Video Generation, [[Paper]](https://arxiv.org/pdf/2312.04557.pdf),[[Code]](https://www.shoufachen.com/gentron_website/) 18 | - (arXiv
2024.02) Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis, [[Paper]](https://arxiv.org/pdf/2402.14797.pdf),[[Code]](https://snap-research.github.io/snapvideo/) 19 | - (arXiv 2024.03) PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation, [[Paper]](https://arxiv.org/pdf/2403.04692.pdf),[[Code]](https://pixart-alpha.github.io/PixArt-sigma-project/) 20 | - (arXiv 2024.03) BrightDreamer: Generic 3D Gaussian Generative Framework for Fast Text-to-3D Synthesis, [[Paper]](https://arxiv.org/pdf/2403.11273.pdf),[[Code]](https://vlislab22.github.io/BrightDreamer/) 21 | - (arXiv 2024.08) CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer, [[Paper]](https://arxiv.org/pdf/2408.06072.pdf),[[Code]](https://github.com/THUDM/CogVideo) 22 | - (arXiv 2024.12) Switti: Designing Scale-Wise Transformers for Text-to-Image Synthesis, [[Paper]](https://arxiv.org/pdf/2412.01819.pdf),[[Code]](https://github.com/yandex-research/switti) 23 | - (arXiv 2025.03) DiT-Air: Revisiting the Efficiency of Diffusion Model Architecture Design in Text to Image Generation, [[Paper]](https://arxiv.org/pdf/2503.10618.pdf) 24 | -------------------------------------------------------------------------------- /main/texture.md: -------------------------------------------------------------------------------- 1 | ### Texture 2 | - (arXiv 2021.09) 3D Human Texture Estimation from a Single Image with Transformers, [[Paper]](https://arxiv.org/pdf/2109.02563.pdf), [[Code]](https://www.mmlab-ntu.com/project/texformer/) 3 | - (arXiv 2022.02) Paying U-Attention to Textures: Multi-Stage Hourglass Vision Transformer for Universal Texture Synthesis, [[Paper]](https://arxiv.org/pdf/2202.11703.pdf) 4 | - (arXiv 2023.11) 3D-TexSeg: Unsupervised Segmentation of 3D Texture using Mutual Transformer Learning, [[Paper]](https://arxiv.org/pdf/2311.10651.pdf) 5 | - (arXiv 2024.02) Dynamic Texture Transfer using PatchMatch and Transformers, [[Paper]](https://arxiv.org/pdf/2402.00606.pdf) 6 | - (arXiv 2024.03) 3DTextureTransformer: Geometry Aware Texture Generation for Arbitrary Mesh Topology, [[Paper]](https://arxiv.org/pdf/2403.04225.pdf) 7 | - (arXiv 2024.03) TexDreamer: Towards Zero-Shot High-Fidelity 3D Human Texture Generation, [[Paper]](https://arxiv.org/pdf/2403.12906.pdf),[[Code]](https://ggxxii.github.io/texdreamer/) 8 | - (arXiv 2024.06) A Comparative Survey of Vision Transformers for Feature Extraction in Texture Analysis, [[Paper]](https://arxiv.org/pdf/2406.06136) 9 | - (arXiv 2025.03) VORTEX: Challenging CNNs at Texture Recognition by using Vision Transformers with Orderless and Randomized Token Encodings, [[Paper]](https://arxiv.org/pdf/2503.06368),[[Code]](https://github.com/scabini/VORTEX) 10 | -------------------------------------------------------------------------------- /main/time-series.md: -------------------------------------------------------------------------------- 1 | ### Time Series 2 | - (arXiv 2023.03) Time Series as Images: Vision Transformer for Irregularly Sampled Time Series, [[Paper]](https://arxiv.org/pdf/2303.12799.pdf), [[Code]](https://github.com/Leezekun/ViTST) 3 | - (arXiv 2023.05) Improving Position Encoding of Transformers for Multivariate Time Series Classification, [[Paper]](https://arxiv.org/pdf/2305.16642.pdf), [[Code]](https://github.com/Navidfoumani/ConvTran) 4 | - (arXiv 2023.07) U-shaped Transformer: Retain High Frequency Context in Time Series Analysis,
[[Paper]](https://arxiv.org/pdf/2307.09019.pdf) 5 | - (arXiv 2024.02) Multi-scale Spatio-temporal Transformer-based Imbalanced Longitudinal Learning for Glaucoma Forecasting from Irregular Time Series Images, [[Paper]](https://arxiv.org/pdf/2402.13475.pdf) 6 | - (arXiv 2024.03) From Pixels to Predictions: Spectrogram and Vision Transformer for Better Time Series Forecasting, [[Paper]](https://arxiv.org/pdf/2403.11047.pdf) 7 | - (arXiv 2024.03) HEAL-ViT: Vision Transformers on a spherical mesh for medium-range weather forecasting, [[Paper]](https://arxiv.org/pdf/2403.17016.pdf) 8 | -------------------------------------------------------------------------------- /main/translation.md: -------------------------------------------------------------------------------- 1 | ### Translation 2 | - (arXiv 2021.10) Tensor-to-Image: Image-to-Image Translation with Vision Transformers, [[Paper]](https://arxiv.org/pdf/2110.08037.pdf) 3 | - (arXiv 2022.03) UVCGAN: UNet Vision Transformer cycle-consistent GAN for unpaired image-to-image translation, [[Paper]](https://arxiv.org/pdf/2203.02557.pdf), [[Code]](https://github.com/LS4GAN/uvcgan) 4 | - (arXiv 2022.03) InstaFormer: Instance-Aware Image-to-Image Translation with Transformer, [[Paper]](https://arxiv.org/pdf/2203.16248.pdf) 5 | - (arXiv 2022.03) ITTR: Unpaired Image-to-Image Translation with Transformers, [[Paper]](https://arxiv.org/pdf/2203.16015.pdf) 6 | - (arXiv 2023.03) Masked and Adaptive Transformer for Exemplar Based Image Translation, [[Paper]](https://arxiv.org/pdf/2303.17123.pdf) 7 | - (arXiv 2024.04) Implicit Multi-Spectral Transformer: An Lightweight and Effective Visible to Infrared Image Translation Model, [[Paper]](https://arxiv.org/pdf/2404.07072.pdf) 8 | -------------------------------------------------------------------------------- /main/uav.md: -------------------------------------------------------------------------------- 1 | ### UAV 2 | - (arXiv 2021.11) The self-supervised channel-spatial attention-based transformer network for automated, accurate prediction of crop nitrogen status from UAV imagery, [[Paper]](https://arxiv.org/pdf/2111.06839.pdf) 3 | - (arXiv 2023.10) SoybeanNet: Transformer-Based Convolutional Neural Network for Soybean Pod Counting from Unmanned Aerial Vehicle (UAV) Images, [[Paper]](https://arxiv.org/pdf/2310.10861.pdf), [[Code]](https://github.com/JiajiaLi04/Soybean-Pod-Counting-from-UAV-Images) 4 | - (arXiv 2023.11) Rotation Invariant Transformer for Recognizing Object in UAVs, [[Paper]](https://arxiv.org/pdf/2311.02559.pdf), [[Code]](https://github.com/whucsy/RotTrans) 5 | - (arXiv 2023.11) SugarViT -- Multi-objective Regression of UAV Images with Vision Transformers and Deep Label Distribution Learning Demonstrated on Disease Severity Prediction in Sugar Beet, [[Paper]](https://arxiv.org/pdf/2311.03076.pdf) 6 | - (arXiv 2024.01) A Transformer-Based Adaptive Semantic Aggregation Method for UAV Visual Geo-Localization, [[Paper]](https://arxiv.org/pdf/2401.01574.pdf) 7 | - (arXiv 2024.06) Deep Transformer Network for Monocular Pose Estimation of Ship-Based UAV, [[Paper]](https://arxiv.org/pdf/2406.09260.pdf), [[Code]](https://github.com/fdcl-gwu/TNN-MO) 8 | - (arXiv 2024.07) PPTFormer: Pseudo Multi-Perspective Transformer for UAV Segmentation, [[Paper]](https://arxiv.org/pdf/2406.19632.pdf) 9 | - (arXiv 2024.07) Learning Motion Blur Robust Vision Transformers with Dynamic Early Exit for Real-Time UAV Tracking, [[Paper]](https://arxiv.org/pdf/2407.05383.pdf),
[[Code]](https://github.com/wuyou3474/BDTrack) 10 | - (arXiv 2025.01) Learning Adaptive and View-Invariant Vision Transformer with Multi-Teacher Knowledge Distillation for Real-Time UAV Tracking, [[Paper]](https://arxiv.org/pdf/2412.20002.pdf), [[Code]](https://github.com/wuyou3474/AVTrack) 11 | - (arXiv 2025.01) UAV-DETR: Efficient End-to-End Object Detection for Unmanned Aerial Vehicle Imagery, [[Paper]](https://arxiv.org/pdf/2501.01855.pdf), [[Code]](https://github.com/ValiantDiligent/UAV-DETR) 12 | - (arXiv 2025.01) UAV-Assisted Real-Time Disaster Detection Using Optimized Transformer Model, [[Paper]](https://arxiv.org/pdf/2501.12087.pdf) 13 | - (arXiv 2025.02) Automatic Vehicle Detection using DETR: A Transformer-Based Approach for Navigating Treacherous Roads, [[Paper]](https://arxiv.org/pdf/2502.17843.pdf) 14 | - (arXiv 2025.02) AeroReformer: Aerial Referring Transformer for UAV-based Referring Image Segmentation, [[Paper]](https://arxiv.org/pdf/2502.16680.pdf), [[Code]](https://github.com/lironui/AeroReformer) 15 | - (arXiv 2025.03) Similarity-Guided Layer-Adaptive Vision Transformer for UAV Tracking, [[Paper]](https://arxiv.org/pdf/2503.06625.pdf), [[Code]](https://github.com/GXNU-ZhongLab/SGLATrack) 16 | -------------------------------------------------------------------------------- /main/unsupervised-learning.md: -------------------------------------------------------------------------------- 1 | ### Unsupervised learning 2 | - (arXiv 2022.02) Handcrafted Histological Transformer (H2T): Unsupervised Representation of Whole Slide Images, [[Paper]](https://arxiv.org/pdf/2202.07001.pdf) 3 | - (arXiv 2022.07) SatMAE: Pre-training Transformers for Temporal and Multi-Spectral Satellite Imagery, [[Paper]](https://arxiv.org/pdf/2207.08051.pdf) 4 | - (arXiv 2023.12) SPOT: Self-Training with Patch-Order Permutation for Object-Centric Learning with Autoregressive Transformers, [[Paper]](https://arxiv.org/pdf/2312.00648.pdf), [[Code]](https://github.com/gkakogeorgiou/spot) 5 | -------------------------------------------------------------------------------- /main/visual-grounding.md: -------------------------------------------------------------------------------- 1 | ### Visual Grounding 2 | - (arXiv 2021.04) TransVG: End-to-End Visual Grounding with Transformers, [[Paper]](https://arxiv.org/abs/2104.08541) 3 | - (arXiv 2021.05) Visual Grounding with Transformers, [[Paper]](https://arxiv.org/pdf/2105.04281.pdf) 4 | - (arXiv 2021.06) Referring Transformer: A One-step Approach to Multi-task Visual Grounding, [[Paper]](https://arxiv.org/pdf/2106.03089.pdf) 5 | - (arXiv 2021.08) Word2Pix: Word to Pixel Cross Attention Transformer in Visual Grounding, [[Paper]](https://arxiv.org/pdf/2108.00205.pdf) 6 | - (arXiv 2021.08) TransRefer3D: Entity-and-Relation Aware Transformer for Fine-Grained 3D Visual Grounding, [[Paper]](https://arxiv.org/pdf/2108.02388.pdf) 7 | - (arXiv 2021.09) Multimodal Incremental Transformer with Visual Grounding for Visual Dialogue Generation, [[Paper]](https://arxiv.org/pdf/2109.08478.pdf) 8 | - (arXiv 2022.02) ViNTER: Image Narrative Generation with Emotion-Arc-Aware Transformer, [[Paper]](https://arxiv.org/pdf/2202.07305.pdf) 9 | - (arXiv 2022.03) TubeDETR: Spatio-Temporal Video Grounding with Transformers, [[Paper]](https://arxiv.org/pdf/2203.16434.pdf), [[Code]](https://antoyang.github.io/tubedetr.html) 10 | - (arXiv 2022.04) Multi-View Transformer for 3D Visual Grounding, [[Paper]](https://arxiv.org/pdf/2204.02174.pdf), [[Code]](https://github.com/sega-hsj/MVT-3DVG) 11 
| - (arXiv 2022.06) TransVG++: End-to-End Visual Grounding with Language Conditioned Vision Transformer, [[Paper]](https://arxiv.org/pdf/2206.06619v1.pdf), [[Code]](https://github.com/djiajunustc/TransVG) 12 | - (arXiv 2022.09) Dynamic MDETR: A Dynamic Multimodal Transformer Decoder for Visual Grounding, [[Paper]](https://arxiv.org/pdf/2209.13959.pdf) 13 | - (arXiv 2022.11) UniT3D: A Unified Transformer for 3D Dense Captioning and Visual Grounding, [[Paper]](https://arxiv.org/pdf/2212.00836.pdf) 14 | - (arXiv 2023.03) ViewRefer: Grasp the Multi-view Knowledge for 3D Visual Grounding with GPT and Prototype Guidance, [[Paper]](https://arxiv.org/pdf/2303.16894.pdf), [[Code]](https://github.com/ZiyuGuo99/ViewRefer3D) 15 | - (arXiv 2023.07) Distilling Coarse-to-Fine Semantic Matching Knowledge for Weakly Supervised 3D Visual Grounding, [[Paper]](https://arxiv.org/pdf/2307.09267.pdf) 16 | - (arXiv 2023.07) Advancing Visual Grounding with Scene Knowledge: Benchmark and Method, [[Paper]](https://arxiv.org/pdf/2307.11558.pdf), [[Code]](https://github.com/zhjohnchan/SK-VG) 17 | - (arXiv 2023.07) 3DRP-Net: 3D Relative Position-aware Network for 3D Visual Grounding, [[Paper]](https://arxiv.org/pdf/2307.13363.pdf) 18 | - (arXiv 2023.08) ViGT: Proposal-free Video Grounding with Learnable Token in Transformer, [[Paper]](https://arxiv.org/pdf/2308.06009.pdf) 19 | - (arXiv 2023.08) Efficient Temporal Sentence Grounding in Videos with Multi-Teacher Knowledge Distillation, [[Paper]](https://arxiv.org/pdf/2308.03725.pdf) 20 | - (arXiv 2023.08) Knowing Where to Focus: Event-aware Transformer for Video Grounding, [[Paper]](https://arxiv.org/pdf/2308.06947.pdf), [[Code]](https://github.com/jinhyunj/SANet) 21 | - (arXiv 2023.08) Language-Guided Diffusion Model for Visual Grounding, [[Paper]](https://arxiv.org/pdf/2308.09599.pdf), [[Code]](https://github.com/iQua/vgbase/tree/DiffusionVG) 22 | - (arXiv 2023.12) BAM-DETR: Boundary-Aligned Moment Detection Transformer for Temporal Sentence Grounding in Videos, [[Paper]](https://arxiv.org/pdf/2312.00083.pdf), [[Code]](https://github.com/Pilhyeon/BAM-DETR) 23 | - (arXiv 2023.12) Grounding Everything: Emerging Localization Properties in Vision-Language Transformers, [[Paper]](https://arxiv.org/pdf/2312.00878.pdf), [[Code]](https://github.com/WalBouss/GEM) 24 | - (arXiv 2023.12) Mono3DVG: 3D Visual Grounding in Monocular Images, [[Paper]](https://arxiv.org/pdf/2312.08022.pdf) 25 | - (arXiv 2023.12) GroundVLP: Harnessing Zero-shot Visual Grounding from Vision-Language Pre-training and Open-Vocabulary Object Detection, [[Paper]](https://arxiv.org/pdf/2312.15043.pdf), [[Code]](https://github.com/om-ai-lab/GroundVLP) 26 | - (arXiv 2024.01) Video-GroundingDINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding, [[Paper]](https://arxiv.org/pdf/2401.00901.pdf), [[Code]](https://github.com/TalalWasim/Video-GroundingDINO) 27 | - (arXiv 2024.03) MiKASA: Multi-Key-Anchor & Scene-Aware Transformer for 3D Visual Grounding, [[Paper]](https://arxiv.org/pdf/2403.03077.pdf), [[Code]](https://github.com/birdy666/MiKASA-3DVG) 28 | - (arXiv 2024.08) An Efficient and Effective Transformer Decoder-Based Framework for Multi-Task Visual Grounding, [[Paper]](https://arxiv.org/pdf/2408.01120.pdf), [[Code]](https://github.com/chenwei746/EEVG) 29 | - (arXiv 2024.11) LidaRefer: Outdoor 3D Visual Grounding for Autonomous Driving with Transformers, [[Paper]](https://arxiv.org/pdf/2411.04351.pdf) 30 | -------------------------------------------------------------------------------- 
/main/visual-question-answering.md: -------------------------------------------------------------------------------- 1 | ### Visual Question Answering 2 | - (arXiv 2021.06) Unified Questioner Transformer for Descriptive Question Generation in Goal-Oriented Visual Dialogue, [[Paper]](https://arxiv.org/pdf/2106.15550.pdf) 3 | - (arXiv 2021.12) LaTr: Layout-Aware Transformer for Scene-Text VQA, [[Paper]](https://arxiv.org/pdf/2112.12494.pdf) 4 | - (arXiv 2021.12) 3D Question Answering,[[Paper]](https://arxiv.org/pdf/2112.08359.pdf) 5 | - (arXiv 2022.01) On the Efficacy of Co-Attention Transformer Layers in Visual Question Answering, [[Paper]](https://arxiv.org/pdf/2201.03965.pdf) 6 | - (arXiv 2022.01) Transformer Module Networks for Systematic Generalization in Visual Question Answering, [[Paper]](https://arxiv.org/pdf/2201.11316.pdf) 7 | - (arXiv 2022.03) Towards Efficient and Elastic Visual Question Answering with Doubly Slimmable Transformer, [[Paper]](https://arxiv.org/pdf/2203.12814.pdf) 8 | - (arXiv 2022.04) Hypergraph Transformer: Weakly-supervised Multi-hop Reasoning for Knowledge-based Visual Question Answering, [[Paper]](https://arxiv.org/pdf/2204.10448.pdf), [[Code]](https://github.com/yujungheo/kbvqa-public) 9 | - (arXiv 2022.06) DisCoVQA: Temporal Distortion-Content Transformers for Video Quality Assessment, [[Paper]](https://arxiv.org/pdf/2206.09265.pdf) 10 | - (arXiv 2022.06) Surgical-VQA: Visual Question Answering in Surgical Scenes using Transformer, [[Paper]](https://arxiv.org/pdf/2206.11053.pdf), [[Code]](https://github.com/lalithjets/Surgical_VQA.git) 11 | - (arXiv 2022.07) Weakly Supervised Grounding for VQA in Vision-Language Transformers, [[Paper]](https://arxiv.org/pdf/2207.02334.pdf), [[Code]](https://github.com/aurooj/WSG-VQA-VLTransformers) 12 | - (arXiv 2022.07) Video Graph Transformer for Video Question Answering, [[Paper]](https://arxiv.org/pdf/2207.05342.pdf), [[Code]](https://github.com/sail-sg/VGT) 13 | - (arXiv 2022.10) Multi-Modal Fusion Transformer for Visual Question Answering in Remote Sensing, [[Paper]](https://arxiv.org/pdf/2210.04510.pdf) 14 | - (arXiv 2022.10) SlotFormer: Unsupervised Visual Dynamics Simulation with Object-Centric Models, [[Paper]](https://arxiv.org/pdf/2210.05950.pdf), [[Project]](https://slotformer.github.io/) 15 | - (arXiv 2022.12) MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering, [[Paper]](https://arxiv.org/pdf/2212.09522.pdf), [[Project]](https://github.com/showlab/mist) 16 | - (arXiv 2023.02) Efficient End-to-End Video Question Answering with Pyramidal Multimodal Transformer, [[Paper]](https://arxiv.org/pdf/2302.02136.pdf), [[Project]](https://github.com/Trunpm/PMT-AAAI23) 17 | - (arXiv 2023.03) PSVT: End-to-End Multi-person 3D Pose and Shape Estimation with Progressive Video Transformers, [[Paper]](https://arxiv.org/pdf/2303.09187.pdf) 18 | - (arXiv 2023.04) Q2ATransformer: Improving Medical VQA via an Answer Querying Decoder, [[Paper]](https://arxiv.org/pdf/2304.01611.pdf) 19 | - (arXiv 2023.05) Multimodal Graph Transformer for Multimodal Question Answering, [[Paper]](https://arxiv.org/pdf/2305.00581.pdf) 20 | - (arXiv 2023.05) Is a Video worth n×n Images?
A Highly Efficient Approach to Transformer-based Video Question Answering, [[Paper]](https://arxiv.org/pdf/2305.09107.pdf) 21 | - (arXiv 2023.06) LiT-4-RSVQA: Lightweight Transformer-based Visual Question Answering in Remote Sensing, [[Paper]](https://arxiv.org/pdf/2306.00758.pdf), [[Code]](https://git.tu-berlin.de/rsim/lit4rsvqa) 22 | - (arXiv 2023.07) Discovering Spatio-Temporal Rationales for Video Question Answering, [[Paper]](https://arxiv.org/pdf/2307.12058.pdf), [[Code]](https://github.com/yl3800/TranSTR) 23 | - (arXiv 2023.07) BARTPhoBEiT: Pre-trained Sequence-to-Sequence and Image Transformers Models for Vietnamese Visual Question Answering, [[Paper]](https://arxiv.org/pdf/2307.15335.pdf) 24 | - (arXiv 2023.08) Redundancy-aware Transformer for Video Question Answering, [[Paper]](https://arxiv.org/pdf/2308.03267.pdf) 25 | - (arXiv 2024.07) Advancing Vietnamese Visual Question Answering with Transformer and Convolutional Integration, [[Paper]](https://arxiv.org/pdf/2407.21229.pdf), [[Code]](https://github.com/nngocson2002/ViVQA) 26 | - (arXiv 2024.10) ViConsFormer: Constituting Meaningful Phrases of Scene Texts using Transformer-based Method in Vietnamese Text-based Visual Question Answering, [[Paper]](https://arxiv.org/pdf/2410.14132.pdf), [[Code]](https://github.com/hieunghia-pat/ViConsFormer) 27 | - (arXiv 2025.04) VLMT: Vision-Language Multimodal Transformer for Multimodal Multi-hop Question Answering, [[Paper]](https://arxiv.org/pdf/2504.08269.pdf) 28 | -------------------------------------------------------------------------------- /main/visual-reasoning.md: -------------------------------------------------------------------------------- 1 | ### Visual Reasoning 2 | - (arXiv 2021.11) Recurrent Vision Transformer for Solving Visual Reasoning Problems, [[Paper]](https://arxiv.org/pdf/2111.14576.pdf) 3 | - (arXiv 2022.04) RelViT: Concept-guided Vision Transformer for Visual Relational Reasoning, [[Paper]](https://arxiv.org/pdf/2204.11167.pdf), [[Code]](https://github.com/NVlabs/RelViT) 4 | - (arXiv 2022.06) SAViR-T: Spatially Attentive Visual Reasoning with Transformers, [[Paper]](https://arxiv.org/pdf/2204.11167.pdf) 5 | - (arXiv 2023.01) Pseudo 3D Perception Transformer with Multi-level Confidence Optimization for Visual Commonsense Reasoning, [[Paper]](https://arxiv.org/pdf/2301.13335.pdf) 6 | - (arXiv 2024.03) ViTCN: Vision Transformer Contrastive Network For Reasoning, [[Paper]](https://arxiv.org/pdf/2403.09962.pdf) 7 | -------------------------------------------------------------------------------- /main/visual-relationship-detection.md: -------------------------------------------------------------------------------- 1 | ### Visual Relationship Detection 2 | - (arXiv 2021.04) RelTransformer: Balancing the Visual Relationship Detection from Local Context, Scene and Memory, [[Paper]](https://arxiv.org/pdf/2104.11934.pdf) 3 | - (arXiv 2021.05) Visual Composite Set Detection Using Part-and-Sum Transformers, [[Paper]](https://arxiv.org/pdf/2108.00045.pdf) 4 | - (arXiv 2021.08) Discovering Spatial Relationships by Transformers for Domain Generalization, [[Paper]](https://arxiv.org/pdf/2108.10046.pdf) 5 | - (arXiv 2022.06) VReBERT: A Simple and Flexible Transformer for Visual Relationship Detection, [[Paper]](https://arxiv.org/pdf/2206.09111.pdf) 6 | - (arXiv 2023.11) Self-Supervised Learning for Visual Relationship Detection through Masked Bounding Box Reconstruction, [[Paper]](https://arxiv.org/pdf/2311.04834.pdf), [[Code]](https://github.com/deeplab-ai/SelfSupervisedVRD) 7 | - 
(arXiv 2023.11) RelVAE: Generative Pretraining for few-shot Visual Relationship Detection, [[Paper]](https://arxiv.org/pdf/2311.16261.pdf), [[Code]](https://github.com/deeplab-ai/SelfSupervisedVRD) 8 | - (arXiv 2024.03) Scene-Graph ViT: End-to-End Open-Vocabulary Visual Relationship Detection, [[Paper]](https://arxiv.org/pdf/2403.14270.pdf) 9 | - (arXiv 2024.03) Groupwise Query Specialization and Quality-Aware Multi-Assignment for Transformer-based Visual Relationship Detection, [[Paper]](https://arxiv.org/pdf/2403.17709.pdf), [[Code]](https://github.com/mlvlab/SpeaQ) 10 | - (arXiv 2024.09) End-to-end Open-vocabulary Video Visual Relationship Detection using Multi-modal Prompting, [[Paper]](https://arxiv.org/pdf/2409.12499.pdf) 11 | -------------------------------------------------------------------------------- /main/voxel.md: -------------------------------------------------------------------------------- 1 | ### Voxel 2 | - (arXiv 2021.05) SVT-Net: A Super Light-Weight Network for Large Scale Place Recognition using Sparse Voxel Transformers, [[Paper]](https://arxiv.org/abs/2105.00149) 3 | - (arXiv 2021.09) Voxel Transformer for 3D Object Detection, [[Paper]](https://arxiv.org/pdf/2109.02497.pdf) 4 | - (arXiv 2022.06) Unifying Voxel-based Representation with Transformer for 3D Object Detection, [[Paper]](https://arxiv.org/pdf/2206.00630.pdf), [[Code]](https://github.com/dvlab-research/UVTR) 5 | - (arXiv 2023.02) VoxFormer: Sparse Voxel Transformer for Camera-based 3D Semantic Scene Completion, [[Paper]](https://arxiv.org/pdf/2302.12251.pdf), [[Code]](https://github.com/NVlabs/VoxFormer) 6 | - (arXiv 2023.03) Event Voxel Set Transformer for Spatiotemporal Representation Learning on Event Streams, [[Paper]](https://arxiv.org/pdf/2303.03856.pdf) 7 | - (arXiv 2023.03) SnakeVoxFormer: Transformer-based Single Image Voxel Reconstruction with Run Length Encoding, [[Paper]](https://arxiv.org/pdf/2303.16293.pdf) 8 | - (arXiv 2023.05) PVT-SSD: Single-Stage 3D Object Detector with Point-Voxel Transformer, [[Paper]](https://arxiv.org/pdf/2305.06621.pdf), [[Code]](https://github.com/Nightmare-n/PVT-SSD) 9 | - (arXiv 2023.08) Learning Bottleneck Transformer for Event Image-Voxel Feature Fusion based Classification, [[Paper]](https://arxiv.org/pdf/2308.11937.pdf), [[Code]](https://github.com/Event-AHU/EFV_event_classification) 10 | - (arXiv 2024.01) ScatterFormer: Efficient Voxel Transformer with Scattered Linear Attention, [[Paper]](https://arxiv.org/pdf/2401.00912.pdf), [[Code]](https://github.com/skyhehe123/ScatterFormer) 11 | - (arXiv 2024.01) MsSVT++: Mixed-scale Sparse Voxel Transformer with Center Voting for 3D Object Detection, [[Paper]](https://arxiv.org/pdf/2401.11718.pdf) 12 | - (arXiv 2024.03) CVT-xRF: Contrastive In-Voxel Transformer for 3D Consistent Radiance Fields from Sparse Inputs, [[Paper]](https://arxiv.org/pdf/2403.16885.pdf), [[Project]](https://zhongyingji.github.io/CVT-xRF/) 13 | - (arXiv 2024.05) PVTransformer: Point-to-Voxel Transformer for Scalable 3D Object Detection, [[Paper]](https://arxiv.org/pdf/2405.02811.pdf) 14 | - (arXiv 2025.03) SparseVoxFormer: Sparse Voxel-based Transformer for Multi-modal 3D Object Detection, [[Paper]](https://arxiv.org/pdf/2503.08092.pdf) 15 | - (arXiv 2025.03) HeightFormer: Learning Height Prediction in Voxel Features for Roadside Vision Centric 3D Object Detection via Transformer, [[Paper]](https://arxiv.org/pdf/2503.10777.pdf), [[Code]](https://github.com/zhangzhang2024/HeightFormer) 16 | 
-------------------------------------------------------------------------------- /main/weakly-supervised-learning.md: -------------------------------------------------------------------------------- 1 | ### Weakly Supervised Learning 2 | - (arXiv 2021.03) TS-CAM: Token Semantic Coupled Attention Map for Weakly Supervised Object Localization, [[Paper]](https://arxiv.org/abs/2103.14862) 3 | - (arXiv 2021.12) LCTR: On Awakening the Local Continuity of Transformer for Weakly Supervised Object Localization, [[Paper]](https://arxiv.org/abs/2112.05291) 4 | - (arXiv 2022.01) CaFT: Clustering and Filter on Tokens of Transformer for Weakly Supervised Object Localization, [[Paper]](https://arxiv.org/abs/2201.00475) 5 | - (arXiv 2022.04) ViTOL: Vision Transformer for Weakly Supervised Object Localization, [[Paper]](https://arxiv.org/abs/2204.06772), [[Code]](https://github.com/Saurav-31/ViTOL) 6 | - (arXiv 2022.07) Weakly Supervised Object Localization via Transformer with Implicit Spatial Calibration, [[Paper]](https://arxiv.org/abs/2207.10447), [[Code]](https://github.com/164140757/SCM) 7 | - (arXiv 2022.08) Re-Attention Transformer for Weakly Supervised Object Localization, [[Paper]](https://arxiv.org/abs/2208.01838), [[Code]](https://github.com/su-hui-zz/ReAttentionTransformer) 8 | - (arXiv 2022.09) Discriminative Sampling of Proposals in Self-Supervised Transformers for Weakly Supervised Object Localization, [[Paper]](https://arxiv.org/abs/2209.09209), [[Code]](https://github.com/shakeebmurtaza/dips) 9 | - (arXiv 2022.09) PicT: A Slim Weakly Supervised Vision Transformer for Pavement Distress Classification, [[Paper]](https://arxiv.org/abs/2209.10074), [[Code]](https://github.com/DearCaat/PicT) 10 | - (arXiv 2023.09) Semantic-Constraint Matching Transformer for Weakly Supervised Object Localization, [[Paper]](https://arxiv.org/abs/2309.01331) 11 | - (arXiv 2023.10) DiPS: Discriminative Pseudo-Label Sampling with Self-Supervised Transformers for Weakly Supervised Object Localization, [[Paper]](https://arxiv.org/abs/2310.06196), [[Code]](https://github.com/shakeebmurtaza/dips) 12 | - (arXiv 2023.12) Multiscale Vision Transformer With Deep Clustering-Guided Refinement for Weakly Supervised Object Localization, [[Paper]](https://arxiv.org/abs/2312.09584) 13 | - (arXiv 2024.05) Multi-scale Bottleneck Transformer for Weakly Supervised Multimodal Violence Detection, [[Paper]](https://arxiv.org/abs/2405.05130), [[Code]](https://github.com/shengyangsun/MSBT) 14 | - (arXiv 2025.01) Weakly Supervised Segmentation of Hyper-Reflective Foci with Compact Convolutional Transformers and SAM 2, [[Paper]](https://arxiv.org/abs/2501.05933) 15 | -------------------------------------------------------------------------------- /main/zero-shot-learning.md: -------------------------------------------------------------------------------- 1 | ### Zero-Shot Learning 2 | - (arXiv 2021.08) Multi-Head Self-Attention via Vision Transformer for Zero-Shot Learning, [[Paper]](https://arxiv.org/pdf/2108.00205.pdf) 3 | - (arXiv 2021.12) Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks, [[Paper]](https://arxiv.org/pdf/2112.01522.pdf) 4 | - (arXiv 2021.12) TransZero: Attribute-guided Transformer for Zero-Shot Learning, [[Paper]](https://arxiv.org/pdf/2112.01683.pdf), [[Code]](https://github.com/shiming-chen/transzero) 5 | - (arXiv 2021.12) TransZero++: Cross Attribute-Guided Transformer for Zero-Shot Learning, [[Paper]](https://arxiv.org/pdf/2112.08643.pdf), 
[[Code]](https://github.com/shiming-chen/TransZero_pp) 6 | - (arXiv 2022.03) Hybrid Routing Transformer for Zero-Shot Learning, [[Paper]](https://arxiv.org/pdf/2203.15310.pdf) 7 | - (arXiv 2022.11) Efficient Zero-shot Visual Search via Target and Context-aware Transformer, [[Paper]](https://arxiv.org/pdf/2211.13470.pdf) 8 | - (arXiv 2023.02) Vision Transformer-based Feature Extraction for Generalized Zero-Shot Learning, [[Paper]](https://arxiv.org/pdf/2302.00875.pdf) 9 | - (arXiv 2023.05) Zero-TPrune: Zero-Shot Token Pruning through Leveraging of the Attention Graph in Pre-Trained Transformers, [[Paper]](https://arxiv.org/pdf/2305.17328.pdf) 10 | - (arXiv 2023.08) Isomer: Isomerous Transformer for Zero-shot Video Object Segmentation, [[Paper]](https://arxiv.org/pdf/2308.06693.pdf), [[Code]](https://github.com/DLUT-yyc/Isomer) 11 | - (arXiv 2023.08) Meta-ZSDETR: Zero-shot DETR with Meta-learning, [[Paper]](https://arxiv.org/pdf/2308.09540.pdf) 12 | - (arXiv 2023.08) ViT-Lens: Towards Omni-modal Representations, [[Paper]](https://arxiv.org/pdf/2308.10185.pdf) 13 | - (arXiv 2023.08) Masked Momentum Contrastive Learning for Zero-shot Semantic Understanding, [[Paper]](https://arxiv.org/pdf/2308.11448.pdf) 14 | - (arXiv 2023.11) SAMPro3D: Locating SAM Prompts in 3D for Zero-Shot Scene Segmentation, [[Paper]](https://arxiv.org/pdf/2311.17707.pdf), [[Project]](https://mutianxu.github.io/sampro3d/) 15 | - (arXiv 2024.04) Progressive Semantic-Guided Vision Transformer for Zero-Shot Learning, [[Paper]](https://arxiv.org/pdf/2404.07713.pdf) 16 | - (arXiv 2024.05) Dual Relation Mining Network for Zero-Shot Learning, [[Paper]](https://arxiv.org/pdf/2405.03613.pdf) 17 | - (arXiv 2025.01) Super-class guided Transformer for Zero-Shot Attribute Classification, [[Paper]](https://arxiv.org/pdf/2501.05728.pdf), [[Code]](https://github.com/mlvlab/SugaFormer) 18 | --------------------------------------------------------------------------------