# Awesome CLIP
This repo collects research resources based on CLIP (Contrastive Language-Image Pre-Training), proposed by OpenAI. If you would like to contribute, please open an issue.

## CLIP
- [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020) [[code](https://github.com/openai/CLIP)]
- [CLIP: Connecting Text and Images](https://openai.com/blog/clip/)
- [Multimodal Neurons in Artificial Neural Networks](https://openai.com/blog/multimodal-neurons/)
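
For quick reference, below is a minimal sketch of zero-shot classification with the official `openai/CLIP` package linked above (installed via `pip install git+https://github.com/openai/CLIP.git`); the image path and candidate labels are placeholders.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
# Load a pre-trained CLIP model together with its matching image preprocessing.
model, preprocess = clip.load("ViT-B/32", device=device)

# Placeholder image and candidate labels.
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["a diagram", "a dog", "a cat"]).to(device)

with torch.no_grad():
    # Similarity logits between the image and each candidate text prompt.
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print("Label probabilities:", probs)
```
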
## Training
- OpenCLIP (3rd-party, PyTorch) [[code](https://github.com/mlfoundations/open_clip)]
- Train-CLIP (3rd-party, PyTorch) [[code](https://github.com/Zasder3/train-CLIP)]
- Paddle-CLIP (3rd-party, PaddlePaddle) [[code](https://github.com/AgentMaker/Paddle-CLIP)]
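
These codebases implement variants of CLIP's symmetric contrastive objective. Below is a minimal PyTorch sketch of that loss for a batch of paired image/text embeddings; the random tensors are placeholders and the fixed temperature is a simplification (CLIP learns it as a logit scale).

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    """Symmetric InfoNCE-style loss over a batch of paired image/text embeddings."""
    # L2-normalize so dot products are cosine similarities.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Pairwise similarity matrix, scaled by the temperature.
    logits = image_features @ text_features.t() / temperature

    # The i-th image is paired with the i-th text, so targets are the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Toy usage with random placeholder embeddings (batch of 8, dimension 512).
image_emb = torch.randn(8, 512)
text_emb = torch.randn(8, 512)
print(clip_contrastive_loss(image_emb, text_emb).item())
```
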
## Applications

### GAN
- VQGAN-CLIP [[code](https://github.com/nerdyrodent/VQGAN-CLIP)]
- [StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery](https://arxiv.org/abs/2103.17249) [[code](https://github.com/orpatashnik/StyleCLIP)]
- CLIP Guided Diffusion [[code](https://github.com/afiaka87/clip-guided-diffusion)]
- [CLIP2StyleGAN: Unsupervised Extraction of StyleGAN Edit Directions](https://arxiv.org/abs/2112.05219) [[code](https://github.com/RameenAbdal/CLIP2StyleGAN)]
- [TargetCLIP: Image-Based CLIP-Guided Essence Transfer](https://arxiv.org/abs/2110.12427) [[code](https://github.com/hila-chefer/TargetCLIP)]
- [DiffusionCLIP: Text-Guided Diffusion Models for Robust Image Manipulation](https://arxiv.org/pdf/2110.02711.pdf) [[code](https://github.com/gwang-kim/DiffusionCLIP)]
- [Clip2latent: Text driven sampling of a pre-trained StyleGAN using denoising diffusion and CLIP](https://arxiv.org/pdf/2210.02347.pdf) [[code](https://github.com/justinpinkney/clip2latent)]

### Object Detection
- Roboflow Zero-shot Object Tracking [[code](https://github.com/roboflow-ai/zero-shot-object-tracking)]
- [Zero-Shot Detection via Vision and Language Knowledge Distillation](https://arxiv.org/abs/2104.13921) [[code](https://github.com/tensorflow/tpu/tree/master/models/official/detection/projects/vild)]
- Crop-CLIP [[code](https://github.com/vijishmadhavan/Crop-CLIP)]
- [Detic: Detecting Twenty-thousand Classes using Image-level Supervision](https://arxiv.org/abs/2201.02605) [[code](https://github.com/facebookresearch/Detic)]
- [CLIP-TD: CLIP Targeted Distillation for Vision-Language Tasks](https://arxiv.org/abs/2201.05729)
- [SLIP: Self-supervision meets Language-Image Pre-training](https://arxiv.org/abs/2112.12750) [[code](https://github.com/facebookresearch/SLIP)]
- [ReCLIP: A Strong Zero-Shot Baseline for Referring Expression Comprehension](https://arxiv.org/pdf/2204.05991.pdf) [[code](https://github.com/allenai/reclip)]

### Information Retrieval
- Unsplash Image Search [[code](https://github.com/haltakov/natural-language-image-search)]
- [CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval](https://arxiv.org/abs/2104.08860) [[code](https://github.com/ArrowLuo/CLIP4Clip)]
- [Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling](https://arxiv.org/abs/2102.06183) [[code](https://github.com/jayleicn/ClipBERT)]
- Natural Language YouTube Search [[code](https://github.com/haltakov/natural-language-youtube-search)]
- [CLIP-as-service: Embed images and sentences into fixed-length vectors with CLIP](https://github.com/jina-ai/clip-as-service/tree/main/docs) [[code](https://github.com/jina-ai/clip-as-service)]
- clip-retrieval [[code](https://github.com/rom1504/clip-retrieval)]
- [A CLIP-Hitchhiker’s Guide to Long Video Retrieval](https://arxiv.org/pdf/2205.08508.pdf) [code]
- [CLIP2Video: Mastering Video-Text Retrieval via Image CLIP](https://arxiv.org/pdf/2106.11097.pdf) [[code](https://github.com/CryhanFang/CLIP2Video)]
- [X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval](https://arxiv.org/pdf/2207.07285.pdf) [[code](https://github.com/xuguohai/X-CLIP)]
- [Extending CLIP for Category-to-image Retrieval in E-commerce](https://mariyahendriksen.github.io/files/ecir22.pdf) [[code](https://github.com/mariyahendriksen/ecir2022_category_to_image_retrieval)]

### Representation Learning
- [Wav2CLIP: Learning Robust Audio Representations From CLIP](https://arxiv.org/pdf/2110.11499.pdf) [[code](https://github.com/descriptinc/lyrebird-Wav2CLIP)]
- [CLIP-Lite: Information Efficient Visual Representation Learning from Textual Annotation](https://arxiv.org/abs/2112.07133) [code]
- [RegionCLIP: Region-based Language-Image Pretraining](https://arxiv.org/pdf/2112.09106.pdf) [[code](https://github.com/microsoft/RegionCLIP)]
- [CMA-CLIP: Cross-Modality Attention CLIP for Image-Text Classification](https://arxiv.org/abs/2112.03562) [code]
- [DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting](https://arxiv.org/pdf/2112.01518.pdf) [[code](https://github.com/raoyongming/DenseCLIP)]
- [CyCLIP: Cyclic Contrastive Language-Image Pretraining](https://arxiv.org/pdf/2205.14459v1.pdf) [[code](https://github.com/goel-shashank/CyCLIP)]
- [CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment](https://arxiv.org/pdf/2209.06430.pdf) [[code](https://github.com/microsoft/XPretrain/tree/main/CLIP-ViP)]
- [DetCLIP: Dictionary-Enriched Visual-Concept Paralleled Pre-training for Open-world Detection](https://arxiv.org/pdf/2209.09407.pdf) [[code](https://github.com/Sense-GVT/DeCLIP)]
- [UniCLIP: Unified Framework for Contrastive Language–Image Pre-training](https://arxiv.org/pdf/2209.13430.pdf) [code]
- [SpeechCLIP: Integrating Speech with Pre-Trained Vision and Language Model](https://arxiv.org/pdf/2210.00705.pdf) [[code](https://github.com/atosystem/SpeechCLIP)]
- [Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese](https://arxiv.org/pdf/2211.01335.pdf) [[code](https://github.com/OFA-Sys/Chinese-CLIP)]
- [PyramidCLIP: Hierarchical Feature Alignment for Vision-language Model Pretraining](https://arxiv.org/pdf/2204.14095v2.pdf) [[code](https://github.com/Yuting-Gao/PyramidCLIP)]
- [Learning Visual Representation from Modality-Shared Contrastive Language-Image Pre-training](https://arxiv.org/pdf/2207.12661.pdf) [[code](https://github.com/Hxyou/MSCLIP)]
- [Fine-tuned CLIP Models are Efficient Video Learners](https://arxiv.org/pdf/2212.03640.pdf) [[code](https://github.com/muzairkhattak/ViFi-CLIP)]
- [MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining](https://arxiv.org/pdf/2208.12262.pdf) [code]
- [Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm](https://arxiv.org/abs/2110.05208) [[code](https://github.com/Sense-GVT/DeCLIP)]
- [Democratizing Contrastive Language-Image Pre-training: A CLIP Benchmark of Data, Model, and Supervision](https://arxiv.org/pdf/2203.05796v1.pdf) [[code](https://github.com/sense-gvt/declip)]

### Text-to-3D Generation
- [CLIP-Forge: Towards Zero-Shot Text-to-Shape Generation](https://arxiv.org/pdf/2110.02624.pdf) [[code](https://github.com/AutodeskAILab/Clip-Forge)]
- [Text2Mesh: Text-Driven Neural Stylization for Meshes](https://arxiv.org/abs/2112.03221) [[code](https://github.com/threedle/text2mesh)]
- [CLIP-GEN: Language-Free Training of a Text-to-Image Generator with CLIP](https://arxiv.org/pdf/2203.00386.pdf) [[code](https://github.com/HFAiLab/clip-gen)]
- [CLIPDraw: Exploring Text-to-Drawing Synthesis through Language-Image Encoders](https://arxiv.org/pdf/2106.14843.pdf) [[code](https://github.com/kvfrans/clipdraw/)]
- [CLIP-NeRF: Text-and-Image Driven Manipulation of Neural Radiance Fields](https://arxiv.org/pdf/2112.05139.pdf) [[code](https://github.com/cassiePython/CLIPNeRF)]
- [MotionCLIP: Exposing Human Motion Generation to CLIP Space](https://arxiv.org/pdf/2203.08063.pdf) [[code](https://github.com/GuyTevet/MotionCLIP)]
- [AvatarCLIP: Zero-Shot Text-Driven Generation and Animation of 3D Avatars](https://arxiv.org/pdf/2205.08535.pdf) [[code](https://github.com/hongfz16/AvatarCLIP)]
- [ClipFace: Text-guided Editing of Textured 3D Morphable Models](https://arxiv.org/pdf/2212.01406.pdf) [[code](https://github.com/sanonymous22/ClipFace)]

### Text-to-Image Generation
- Big Sleep: A simple command line tool for text to image generation [[code](https://github.com/lucidrains/big-sleep)]
- Deep Daze: A simple command line tool for text to image generation [[code](https://github.com/lucidrains/deep-daze)]
- [CLIP-CLOP: CLIP-Guided Collage and Photomontage](https://arxiv.org/pdf/2205.03146v2.pdf) [[code](https://github.com/deepmind/arnheim)]
- [CLIP-GEN: Language-Free Training of a Text-to-Image Generator with CLIP](https://arxiv.org/pdf/2203.00386.pdf) [[code](https://github.com/HFAiLab/clip-gen/blob/main/README_en.md)]

### Prompt Learning
- [Learning to Prompt for Vision-Language Models](https://arxiv.org/abs/2109.01134) [[code](https://github.com/KaiyangZhou/CoOp)]
- [Conditional Prompt Learning for Vision-Language Models](https://arxiv.org/abs/2203.05557) [[code](https://github.com/KaiyangZhou/CoOp)]
- [Prompt-aligned Gradient for Prompt Tuning](https://arxiv.org/abs/2205.14865) [[code](https://github.com/BeierZhu/Prompt-align)]
- [CLIP-Adapter: Better Vision-Language Models with Feature Adapters](https://arxiv.org/abs/2110.04544) [[code](https://github.com/gaopengcuhk/CLIP-Adapter)]
- [Learning to Compose Soft Prompts for Compositional Zero-Shot Learning](https://arxiv.org/abs/2204.03574) [[code](https://github.com/BatsResearch/csp)]
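
The methods above learn soft prompt vectors in place of hand-written templates such as "a photo of a {}". For contrast, here is a minimal sketch of the hand-crafted prompt-ensembling baseline they build on, using the official `clip` package; the class names and templates are placeholders.

```python
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Placeholder class names and prompt templates.
classnames = ["cat", "dog", "car"]
templates = ["a photo of a {}.", "a blurry photo of a {}."]

with torch.no_grad():
    weights = []
    for name in classnames:
        # Encode every template for this class and average the normalized embeddings.
        tokens = clip.tokenize([t.format(name) for t in templates]).to(device)
        emb = model.encode_text(tokens)
        emb = emb / emb.norm(dim=-1, keepdim=True)
        mean_emb = emb.mean(dim=0)
        weights.append(mean_emb / mean_emb.norm())
    zeroshot_weights = torch.stack(weights)  # one text classifier weight per class

# Normalized image features from model.encode_image(...) can then be scored with:
# logits = 100.0 * image_features @ zeroshot_weights.t()
```
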
### Video Understanding
- [VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding](https://arxiv.org/pdf/2109.14084.pdf) [[code](https://github.com/pytorch/fairseq/tree/main/examples/MMPT)]
- [FitCLIP: Refining Large-Scale Pretrained Image-Text Models for Zero-Shot Video Understanding Tasks](https://arxiv.org/pdf/2203.13371.pdf) [[code](https://github.com/bryant1410/fitclip)]
- [Frozen CLIP Models are Efficient Video Learners](https://arxiv.org/pdf/2208.03550.pdf) [[code](https://github.com/OpenGVLab/efficient-video-recognition)]
- [Towards Real-Time Text2Video via CLIP-Guided, Pixel-Level Optimization](https://arxiv.org/pdf/2210.12826.pdf) [[code](https://github.com/pschaldenbrand/Text2Video)]
- [MovieCLIP: Visual Scene Recognition in Movies](https://arxiv.org/pdf/2210.11065v2.pdf) [[code](https://github.com/usc-sail/mica-MovieCLIP)]

### Image Captioning
- CLIP prefix captioning [[code](https://github.com/rmokady/CLIP_prefix_caption)]
- [CLIPScore: A Reference-free Evaluation Metric for Image Captioning](https://arxiv.org/abs/2104.08718) [[code](https://github.com/jmhessel/clipscore)]
- [ClipCap: CLIP Prefix for Image Captioning](https://arxiv.org/pdf/2111.09734v1.pdf) [[code](https://github.com/rmokady/CLIP_prefix_caption)]
- [Text-Only Training for Image Captioning using Noise-Injected CLIP](https://arxiv.org/pdf/2211.00575.pdf) [[code](https://github.com/DavidHuji/CapDec)]
- [Fine-grained Image Captioning with CLIP Reward](https://arxiv.org/pdf/2205.13115.pdf) [[code](https://github.com/j-min/CLIP-Caption-Reward)]

### Image Editing
- [HairCLIP: Design Your Hair by Text and Reference Image](https://arxiv.org/pdf/2112.05142.pdf) [[code](https://github.com/wty-ustc/HairCLIP)]
- [CLIPstyler: Image Style Transfer with a Single Text Condition](https://arxiv.org/pdf/2112.00374.pdf) [[code](https://github.com/paper11667/CLIPstyler)]
- [CLIPasso: Semantically-Aware Object Sketching](https://clipasso.github.io/clipasso/static/source/paper_CLIPasso_Semantically_Aware_Object_Sketching.pdf) [[code](https://clipasso.github.io/clipasso/)]
- [Image-based CLIP-Guided Essence Transfer](https://arxiv.org/pdf/2110.12427.pdf) [[code](https://github.com/hila-chefer/TargetCLIP)]
- [CLIPDraw: Synthesize drawings to match a text prompt!](https://arxiv.org/pdf/2106.14843.pdf) [[code](https://github.com/kvfrans/clipdraw)]
- [CLIP-CLOP: CLIP-Guided Collage and Photomontage](https://arxiv.org/pdf/2205.03146.pdf) [[code](https://github.com/deepmind/arnheim)]
- [Towards Counterfactual Image Manipulation via CLIP](https://arxiv.org/pdf/2207.02812.pdf) [[code](https://github.com/yingchen001/CF-CLIP)]
- [ClipCrop: Conditioned Cropping Driven by Vision-Language Model](https://arxiv.org/pdf/2211.11492.pdf) [code]
- [CLIPascene: Scene Sketching with Different Types and Levels of Abstraction](https://arxiv.org/pdf/2211.17256.pdf) [[code](https://clipascene.github.io/CLIPascene/)]

### Image Segmentation
- [CLIMS: Cross Language Image Matching for Weakly Supervised Semantic Segmentation](https://arxiv.org/pdf/2203.02668.pdf) [[code](https://github.com/CVI-SZU/CLIMS)]
- [Image Segmentation Using Text and Image Prompts](https://arxiv.org/pdf/2112.10003.pdf) [[code](https://github.com/timojl/clipseg)]
- [Extract Free Dense Labels from CLIP](https://arxiv.org/pdf/2112.01071.pdf) [[code](https://github.com/chongzhou96/MaskCLIP)]
- [Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP](https://arxiv.org/pdf/2210.04150.pdf) [[code](https://github.com/facebookresearch/ov-seg)]

### 3D Recognition
- [PointCLIP: Point Cloud Understanding by CLIP](https://arxiv.org/pdf/2112.02413.pdf) [[code](https://github.com/zrrskywalker/pointclip)]
- [CLIP2Point: Transfer CLIP to Point Cloud Classification with Image-Depth Pre-training](https://arxiv.org/pdf/2210.01055.pdf) [[code](https://github.com/tyhuang0428/CLIP2Point)]
- [MotionCLIP: Exposing Human Motion Generation to CLIP Space](https://arxiv.org/pdf/2203.08063.pdf) [[code](https://github.com/GuyTevet/MotionCLIP)]
- [LidarCLIP or: How I Learned to Talk to Point Clouds](https://arxiv.org/pdf/2212.06858.pdf) [[code](https://github.com/atonderski/lidarclip)]
- [CLIP2Scene: Towards Label-efficient 3D Scene Understanding by CLIP](https://arxiv.org/pdf/2301.04926.pdf) [code]

### Audio
- [AudioCLIP: Extending CLIP to Image, Text and Audio](https://arxiv.org/pdf/2106.13043.pdf) [[code](https://github.com/AndreyGuzhov/AudioCLIP)]
- [Wav2CLIP: Learning Robust Audio Representations from CLIP](https://arxiv.org/pdf/2110.11499.pdf) [[code](https://github.com/descriptinc/lyrebird-wav2clip)]
- [AVE-CLIP: AudioCLIP-based Multi-window Temporal Transformer for Audio Visual Event Localization](https://arxiv.org/pdf/2210.05060.pdf) [code]

### Language Tasks
- [CLIP Models are Few-shot Learners: Empirical Studies on VQA and Visual Entailment](https://arxiv.org/pdf/2203.07190v1.pdf) [code]

### Object Navigation
- [CLIP on Wheels: Zero-Shot Object Navigation as Object Localization and Exploration](https://arxiv.org/pdf/2203.10421.pdf) [code]

### Localization
- [Adapting CLIP For Phrase Localization Without Further Training](https://arxiv.org/pdf/2204.03647.pdf) [[code](https://github.com/pals-ttic/adapting-CLIP)]

### Others
- Multilingual-CLIP [[code](https://github.com/FreddeFrallan/Multilingual-CLIP)]
- CLIP (With Haiku + Jax!) [[code](https://github.com/kingoflolz/CLIP_JAX)]
- [CLIP-Event: Connecting Text and Images with Event Structures](https://arxiv.org/abs/2201.05078) [[code](https://github.com/limanling/clip-event)]
- [How Much Can CLIP Benefit Vision-and-Language Tasks?](https://openreview.net/forum?id=zf_Ll3HZWgy) [[code](https://github.com/clip-vil/CLIP-ViL)]
- [CLIP meets GamePhysics: Towards bug identification in gameplay videos using zero-shot transfer learning](https://arxiv.org/pdf/2203.11096.pdf) [[code](https://asgaardlab.github.io/CLIPxGamePhysics/)]
- [CLIP-Fields: Weakly Supervised Semantic Fields for Robotic Memory](https://arxiv.org/pdf/2210.05663.pdf) [[code](https://github.com/notmahi/clip-fields)]
- [CLIP Itself is a Strong Fine-tuner: Achieving 85.7% and 88.0% Top-1 Accuracy with ViT-B and ViT-L on ImageNet](https://arxiv.org/pdf/2212.06138v1.pdf) [[code](https://github.com/lightdxy/ft-clip)]
- [Task Residual for Tuning Vision-Language Models](https://arxiv.org/pdf/2211.10277.pdf) [[code](https://github.com/geekyutao/TaskRes)]

## Acknowledgment
Inspired by [Awesome Visual-Transformer](https://github.com/dk-liang/Awesome-Visual-Transformer).