# Awesome CLIP
This repo collects research resources based on CLIP (Contrastive Language-Image Pre-Training), proposed by OpenAI. If you would like to contribute, please open an issue.

## CLIP
- [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020) [[code](https://github.com/openai/CLIP)]
- [CLIP: Connecting Text and Images](https://openai.com/blog/clip/)
- [Multimodal Neurons in Artificial Neural Networks](https://openai.com/blog/multimodal-neurons/)
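
For quick reference, below is a minimal sketch of zero-shot classification with the official `openai/CLIP` package linked above (installed via `pip install git+https://github.com/openai/CLIP.git`); the image path and candidate labels are placeholders.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
# Load a pre-trained CLIP model together with its matching image preprocessing.
model, preprocess = clip.load("ViT-B/32", device=device)

# Placeholder image and candidate labels.
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["a diagram", "a dog", "a cat"]).to(device)

with torch.no_grad():
    # Similarity logits between the image and each candidate text prompt.
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print("Label probabilities:", probs)
```
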
## Training
- OpenCLIP (3rd-party, PyTorch) [[code](https://github.com/mlfoundations/open_clip)]
- Train-CLIP (3rd-party, PyTorch) [[code](https://github.com/Zasder3/train-CLIP)]
- Paddle-CLIP (3rd-party, PaddlePaddle) [[code](https://github.com/AgentMaker/Paddle-CLIP)]
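
These codebases implement variants of CLIP's symmetric contrastive objective. Below is a minimal PyTorch sketch of that loss for a batch of paired image/text embeddings; the random tensors are placeholders and the fixed temperature is a simplification (CLIP learns it as a logit scale).

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    """Symmetric InfoNCE-style loss over a batch of paired image/text embeddings."""
    # L2-normalize so dot products are cosine similarities.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Pairwise similarity matrix, scaled by the temperature.
    logits = image_features @ text_features.t() / temperature

    # The i-th image is paired with the i-th text, so targets are the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Toy usage with random placeholder embeddings (batch of 8, dimension 512).
image_emb = torch.randn(8, 512)
text_emb = torch.randn(8, 512)
print(clip_contrastive_loss(image_emb, text_emb).item())
```
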
## Applications

### GAN
- VQGAN-CLIP [[code](https://github.com/nerdyrodent/VQGAN-CLIP)]
- [StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery](https://arxiv.org/abs/2103.17249) [[code](https://github.com/orpatashnik/StyleCLIP)]
- CLIP Guided Diffusion [[code](https://github.com/afiaka87/clip-guided-diffusion)]
- [CLIP2StyleGAN: Unsupervised Extraction of StyleGAN Edit Directions](https://arxiv.org/abs/2112.05219) [[code](https://github.com/RameenAbdal/CLIP2StyleGAN)]
- [TargetCLIP: Image-Based CLIP-Guided Essence Transfer](https://arxiv.org/abs/2110.12427) [[code](https://github.com/hila-chefer/TargetCLIP)]
- [DiffusionCLIP: Text-Guided Diffusion Models for Robust Image Manipulation](https://arxiv.org/pdf/2110.02711.pdf) [[code](https://github.com/gwang-kim/DiffusionCLIP)]
- [Clip2latent: Text driven sampling of a pre-trained StyleGAN using denoising diffusion and CLIP](https://arxiv.org/pdf/2210.02347.pdf) [[code](https://github.com/justinpinkney/clip2latent)]

### Object Detection
- Roboflow Zero-shot Object Tracking [[code](https://github.com/roboflow-ai/zero-shot-object-tracking)]
- [Zero-Shot Detection via Vision and Language Knowledge Distillation](https://arxiv.org/abs/2104.13921) [[code](https://github.com/tensorflow/tpu/tree/master/models/official/detection/projects/vild)]
- Crop-CLIP [[code](https://github.com/vijishmadhavan/Crop-CLIP)]
- [Detic: Detecting Twenty-thousand Classes using Image-level Supervision](https://arxiv.org/abs/2201.02605) [[code](https://github.com/facebookresearch/Detic)]
- [CLIP-TD: CLIP Targeted Distillation for Vision-Language Tasks](https://arxiv.org/abs/2201.05729)
- [SLIP: Self-supervision meets Language-Image Pre-training](https://arxiv.org/abs/2112.12750) [[code](https://github.com/facebookresearch/SLIP)]
- [ReCLIP: A Strong Zero-Shot Baseline for Referring Expression Comprehension](https://arxiv.org/pdf/2204.05991.pdf) [[code](https://github.com/allenai/reclip)]

### Information Retrieval
- Unsplash Image Search [[code](https://github.com/haltakov/natural-language-image-search)]
- [CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval](https://arxiv.org/abs/2104.08860) [[code](https://github.com/ArrowLuo/CLIP4Clip)]
- [Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling](https://arxiv.org/abs/2102.06183) [[code](https://github.com/jayleicn/ClipBERT)]
- Natural Language YouTube Search [[code](https://github.com/haltakov/natural-language-youtube-search)]
- [CLIP-as-service: Embed images and sentences into fixed-length vectors with CLIP](https://github.com/jina-ai/clip-as-service/tree/main/docs) [[code](https://github.com/jina-ai/clip-as-service)]
- clip-retrieval [[code](https://github.com/rom1504/clip-retrieval)]
- [A CLIP-Hitchhiker’s Guide to Long Video Retrieval](https://arxiv.org/pdf/2205.08508.pdf) [code]
- [CLIP2Video: Mastering Video-Text Retrieval via Image CLIP](https://arxiv.org/pdf/2106.11097.pdf) [[code](https://github.com/CryhanFang/CLIP2Video)]
- [X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval](https://arxiv.org/pdf/2207.07285.pdf) [[code](https://github.com/xuguohai/X-CLIP)]
- [Extending CLIP for Category-to-image Retrieval in E-commerce](https://mariyahendriksen.github.io/files/ecir22.pdf) [[code](https://github.com/mariyahendriksen/ecir2022_category_to_image_retrieval)]

### Representation Learning
- [Wav2CLIP: Learning Robust Audio Representations From CLIP](https://arxiv.org/pdf/2110.11499.pdf) [[code](https://github.com/descriptinc/lyrebird-Wav2CLIP)]
- [CLIP-Lite: Information Efficient Visual Representation Learning from Textual Annotation](https://arxiv.org/abs/2112.07133) [code]
- [RegionCLIP: Region-based Language-Image Pretraining](https://arxiv.org/pdf/2112.09106.pdf) [[code](https://github.com/microsoft/RegionCLIP)]
- [CMA-CLIP: Cross-Modality Attention CLIP for Image-Text Classification](https://arxiv.org/abs/2112.03562) [code]
- [DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting](https://arxiv.org/pdf/2112.01518.pdf) [[code](https://github.com/raoyongming/DenseCLIP)]
- [CyCLIP: Cyclic Contrastive Language-Image Pretraining](https://arxiv.org/pdf/2205.14459v1.pdf) [[code](https://github.com/goel-shashank/CyCLIP)]
- [CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment](https://arxiv.org/pdf/2209.06430.pdf) [[code](https://github.com/microsoft/XPretrain/tree/main/CLIP-ViP)]
- [DetCLIP: Dictionary-Enriched Visual-Concept Paralleled Pre-training for Open-world Detection](https://arxiv.org/pdf/2209.09407.pdf) [[code](https://github.com/Sense-GVT/DeCLIP)]
- [UniCLIP: Unified Framework for Contrastive Language–Image Pre-training](https://arxiv.org/pdf/2209.13430.pdf) [code]
- [SpeechCLIP: Integrating Speech with Pre-Trained Vision and Language Model](https://arxiv.org/pdf/2210.00705.pdf) [[code](https://github.com/atosystem/SpeechCLIP)]
- [Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese](https://arxiv.org/pdf/2211.01335.pdf) [[code](https://github.com/OFA-Sys/Chinese-CLIP)]
- [PyramidCLIP: Hierarchical Feature Alignment for Vision-language Model Pretraining](https://arxiv.org/pdf/2204.14095v2.pdf) [[code](https://github.com/Yuting-Gao/PyramidCLIP)]
- [Learning Visual Representation from Modality-Shared Contrastive Language-Image Pre-training](https://arxiv.org/pdf/2207.12661.pdf) [[code](https://github.com/Hxyou/MSCLIP)]
- [Fine-tuned CLIP Models are Efficient Video Learners](https://arxiv.org/pdf/2212.03640.pdf) [[code](https://github.com/muzairkhattak/ViFi-CLIP)]
- [MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining](https://arxiv.org/pdf/2208.12262.pdf) [code]
- [Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm](https://arxiv.org/abs/2110.05208) [[code](https://github.com/Sense-GVT/DeCLIP)]
- [Democratizing Contrastive Language-Image Pre-training: A CLIP Benchmark of Data, Model, and Supervision](https://arxiv.org/pdf/2203.05796v1.pdf) [[code](https://github.com/sense-gvt/declip)]

### Text-to-3D Generation
- [CLIP-Forge: Towards Zero-Shot Text-to-Shape Generation](https://arxiv.org/pdf/2110.02624.pdf) [[code](https://github.com/AutodeskAILab/Clip-Forge)]
- [Text2Mesh: Text-Driven Neural Stylization for Meshes](https://arxiv.org/abs/2112.03221) [[code](https://github.com/threedle/text2mesh)]
- [CLIP-GEN: Language-Free Training of a Text-to-Image Generator with CLIP](https://arxiv.org/pdf/2203.00386.pdf) [[code](https://github.com/HFAiLab/clip-gen)]
- [CLIPDraw: Exploring Text-to-Drawing Synthesis through Language-Image Encoders](https://arxiv.org/pdf/2106.14843.pdf) [[code](https://github.com/kvfrans/clipdraw/)]
- [CLIP-NeRF: Text-and-Image Driven Manipulation of Neural Radiance Fields](https://arxiv.org/pdf/2112.05139.pdf) [[code](https://github.com/cassiePython/CLIPNeRF)]
- [MotionCLIP: Exposing Human Motion Generation to CLIP Space](https://arxiv.org/pdf/2203.08063.pdf) [[code](https://github.com/GuyTevet/MotionCLIP)]
- [AvatarCLIP: Zero-Shot Text-Driven Generation and Animation of 3D Avatars](https://arxiv.org/pdf/2205.08535.pdf) [[code](https://github.com/hongfz16/AvatarCLIP)]
- [ClipFace: Text-guided Editing of Textured 3D Morphable Models](https://arxiv.org/pdf/2212.01406.pdf) [[code](https://github.com/sanonymous22/ClipFace)]

### Text-to-Image Generation
- Big Sleep: A simple command line tool for text to image generation [[code](https://github.com/lucidrains/big-sleep)]
- Deep Daze: A simple command line tool for text to image generation [[code](https://github.com/lucidrains/deep-daze)]
- [CLIP-CLOP: CLIP-Guided Collage and Photomontage](https://arxiv.org/pdf/2205.03146v2.pdf) [[code](https://github.com/deepmind/arnheim)]
- [CLIP-GEN: Language-Free Training of a Text-to-Image Generator with CLIP](https://arxiv.org/pdf/2203.00386.pdf) [[code](https://github.com/HFAiLab/clip-gen/blob/main/README_en.md)]

### Prompt Learning
- [Learning to Prompt for Vision-Language Models](https://arxiv.org/abs/2109.01134) [[code](https://github.com/KaiyangZhou/CoOp)]
- [Conditional Prompt Learning for Vision-Language Models](https://arxiv.org/abs/2203.05557) [[code](https://github.com/KaiyangZhou/CoOp)]
- [Prompt-aligned Gradient for Prompt Tuning](https://arxiv.org/abs/2205.14865) [[code](https://github.com/BeierZhu/Prompt-align)]
- [CLIP-Adapter: Better Vision-Language Models with Feature Adapters](https://arxiv.org/abs/2110.04544) [[code](https://github.com/gaopengcuhk/CLIP-Adapter)]
- [Learning to Compose Soft Prompts for Compositional Zero-Shot Learning](https://arxiv.org/abs/2204.03574) [[code](https://github.com/BatsResearch/csp)]
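
The methods above learn soft prompt vectors in place of hand-written templates such as "a photo of a {}". For contrast, here is a minimal sketch of the hand-crafted prompt-ensembling baseline they build on, using the official `clip` package; the class names and templates are placeholders.

```python
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Placeholder class names and prompt templates.
classnames = ["cat", "dog", "car"]
templates = ["a photo of a {}.", "a blurry photo of a {}."]

with torch.no_grad():
    weights = []
    for name in classnames:
        # Encode every template for this class and average the normalized embeddings.
        tokens = clip.tokenize([t.format(name) for t in templates]).to(device)
        emb = model.encode_text(tokens)
        emb = emb / emb.norm(dim=-1, keepdim=True)
        mean_emb = emb.mean(dim=0)
        weights.append(mean_emb / mean_emb.norm())
    zeroshot_weights = torch.stack(weights)  # one text classifier weight per class

# Normalized image features from model.encode_image(...) can then be scored with:
# logits = 100.0 * image_features @ zeroshot_weights.t()
```
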
### Video Understanding
- [VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding](https://arxiv.org/pdf/2109.14084.pdf) [[code](https://github.com/pytorch/fairseq/tree/main/examples/MMPT)]
- [FitCLIP: Refining Large-Scale Pretrained Image-Text Models for Zero-Shot Video Understanding Tasks](https://arxiv.org/pdf/2203.13371.pdf) [[code](https://github.com/bryant1410/fitclip)]
- [Frozen CLIP Models are Efficient Video Learners](https://arxiv.org/pdf/2208.03550.pdf) [[code](https://github.com/OpenGVLab/efficient-video-recognition)]
- [Towards Real-Time Text2Video via CLIP-Guided, Pixel-Level Optimization](https://arxiv.org/pdf/2210.12826.pdf) [[code](https://github.com/pschaldenbrand/Text2Video)]
- [MovieCLIP: Visual Scene Recognition in Movies](https://arxiv.org/pdf/2210.11065v2.pdf) [[code](https://github.com/usc-sail/mica-MovieCLIP)]

### Image Captioning
- CLIP prefix captioning [[code](https://github.com/rmokady/CLIP_prefix_caption)]
- [CLIPScore: A Reference-free Evaluation Metric for Image Captioning](https://arxiv.org/abs/2104.08718) [[code](https://github.com/jmhessel/clipscore)]
- [ClipCap: CLIP Prefix for Image Captioning](https://arxiv.org/pdf/2111.09734v1.pdf) [[code](https://github.com/rmokady/CLIP_prefix_caption)]
- [Text-Only Training for Image Captioning using Noise-Injected CLIP](https://arxiv.org/pdf/2211.00575.pdf) [[code](https://github.com/DavidHuji/CapDec)]
- [Fine-grained Image Captioning with CLIP Reward](https://arxiv.org/pdf/2205.13115.pdf) [[code](https://github.com/j-min/CLIP-Caption-Reward)]

### Image Editing
- [HairCLIP: Design Your Hair by Text and Reference Image](https://arxiv.org/pdf/2112.05142.pdf) [[code](https://github.com/wty-ustc/HairCLIP)]
- [CLIPstyler: Image Style Transfer with a Single Text Condition](https://arxiv.org/pdf/2112.00374.pdf) [[code](https://github.com/paper11667/CLIPstyler)]
- [CLIPasso: Semantically-Aware Object Sketching](https://clipasso.github.io/clipasso/static/source/paper_CLIPasso_Semantically_Aware_Object_Sketching.pdf) [[code](https://clipasso.github.io/clipasso/)]
- [Image-based CLIP-Guided Essence Transfer](https://arxiv.org/pdf/2110.12427.pdf) [[code](https://github.com/hila-chefer/TargetCLIP)]
- [CLIPDraw: Synthesize drawings to match a text prompt!](https://arxiv.org/pdf/2106.14843.pdf) [[code](https://github.com/kvfrans/clipdraw)]
- [CLIP-CLOP: CLIP-Guided Collage and Photomontage](https://arxiv.org/pdf/2205.03146.pdf) [[code](https://github.com/deepmind/arnheim)]
- [Towards Counterfactual Image Manipulation via CLIP](https://arxiv.org/pdf/2207.02812.pdf) [[code](https://github.com/yingchen001/CF-CLIP)]
- [ClipCrop: Conditioned Cropping Driven by Vision-Language Model](https://arxiv.org/pdf/2211.11492.pdf) [code]
- [CLIPascene: Scene Sketching with Different Types and Levels of Abstraction](https://arxiv.org/pdf/2211.17256.pdf) [[code](https://clipascene.github.io/CLIPascene/)]

### Image Segmentation
- [CLIMS: Cross Language Image Matching for Weakly Supervised Semantic Segmentation](https://arxiv.org/pdf/2203.02668.pdf) [[code](https://github.com/CVI-SZU/CLIMS)]
- [Image Segmentation Using Text and Image Prompts](https://arxiv.org/pdf/2112.10003.pdf) [[code](https://github.com/timojl/clipseg)]
- [Extract Free Dense Labels from CLIP](https://arxiv.org/pdf/2112.01071.pdf) [[code](https://github.com/chongzhou96/MaskCLIP)]
- [Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP](https://arxiv.org/pdf/2210.04150.pdf) [[code](https://github.com/facebookresearch/ov-seg)]

### 3D Recognition
- [PointCLIP: Point Cloud Understanding by CLIP](https://arxiv.org/pdf/2112.02413.pdf) [[code](https://github.com/zrrskywalker/pointclip)]
- [CLIP2Point: Transfer CLIP to Point Cloud Classification with Image-Depth Pre-training](https://arxiv.org/pdf/2210.01055.pdf) [[code](https://github.com/tyhuang0428/CLIP2Point)]
- [MotionCLIP: Exposing Human Motion Generation to CLIP Space](https://arxiv.org/pdf/2203.08063.pdf) [[code](https://github.com/GuyTevet/MotionCLIP)]
- [LidarCLIP or: How I Learned to Talk to Point Clouds](https://arxiv.org/pdf/2212.06858.pdf) [[code](https://github.com/atonderski/lidarclip)]
- [CLIP2Scene: Towards Label-efficient 3D Scene Understanding by CLIP](https://arxiv.org/pdf/2301.04926.pdf) [code]

### Audio
- [AudioCLIP: Extending CLIP to Image, Text and Audio](https://arxiv.org/pdf/2106.13043.pdf) [[code](https://github.com/AndreyGuzhov/AudioCLIP)]
- [Wav2CLIP: Learning Robust Audio Representations from CLIP](https://arxiv.org/pdf/2110.11499.pdf) [[code](https://github.com/descriptinc/lyrebird-wav2clip)]
- [AVE-CLIP: AudioCLIP-based Multi-window Temporal Transformer for Audio Visual Event Localization](https://arxiv.org/pdf/2210.05060.pdf) [code]

### Language Tasks
- [CLIP Models are Few-shot Learners: Empirical Studies on VQA and Visual Entailment](https://arxiv.org/pdf/2203.07190v1.pdf) [code]

### Object Navigation
- [CLIP on Wheels: Zero-Shot Object Navigation as Object Localization and Exploration](https://arxiv.org/pdf/2203.10421.pdf) [code]

### Localization
- [Adapting CLIP For Phrase Localization Without Further Training](https://arxiv.org/pdf/2204.03647.pdf) [[code](https://github.com/pals-ttic/adapting-CLIP)]

### Others
- Multilingual-CLIP [[code](https://github.com/FreddeFrallan/Multilingual-CLIP)]
- CLIP (With Haiku + Jax!) [[code](https://github.com/kingoflolz/CLIP_JAX)]
- [CLIP-Event: Connecting Text and Images with Event Structures](https://arxiv.org/abs/2201.05078) [[code](https://github.com/limanling/clip-event)]
- [How Much Can CLIP Benefit Vision-and-Language Tasks?](https://openreview.net/forum?id=zf_Ll3HZWgy) [[code](https://github.com/clip-vil/CLIP-ViL)]
- [CLIP meets GamePhysics: Towards bug identification in gameplay videos using zero-shot transfer learning](https://arxiv.org/pdf/2203.11096.pdf) [[code](https://asgaardlab.github.io/CLIPxGamePhysics/)]
- [CLIP-Fields: Weakly Supervised Semantic Fields for Robotic Memory](https://arxiv.org/pdf/2210.05663.pdf) [[code](https://github.com/notmahi/clip-fields)]
- [CLIP Itself is a Strong Fine-tuner: Achieving 85.7% and 88.0% Top-1 Accuracy with ViT-B and ViT-L on ImageNet](https://arxiv.org/pdf/2212.06138v1.pdf) [[code](https://github.com/lightdxy/ft-clip)]
- [Task Residual for Tuning Vision-Language Models](https://arxiv.org/pdf/2211.10277.pdf) [[code](https://github.com/geekyutao/TaskRes)]

## Acknowledgment
Inspired by [Awesome Visual-Transformer](https://github.com/dk-liang/Awesome-Visual-Transformer).