├── hf_demo.py
├── LICENSE
└── README.md

/hf_demo.py:
--------------------------------------------------------------------------------
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration
import torch

# Run on GPU if available, otherwise fall back to CPU.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
processor = BlipProcessor.from_pretrained("noamrot/FuseCap")
model = BlipForConditionalGeneration.from_pretrained("noamrot/FuseCap").to(device)

# Download an example image.
img_url = 'https://huggingface.co/spaces/noamrot/FuseCap/resolve/main/bike.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# Condition the caption on a short text prompt.
text = "a picture of "
inputs = processor(raw_image, text, return_tensors="pt").to(device)

out = model.generate(**inputs, num_beams=3)
print(processor.decode(out[0], skip_special_tokens=True))

--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2023 RotsteinNoam

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# FuseCap: Leveraging Large Language Models for Enriched Fused Image Captions

Welcome to the GitHub repository of FuseCap, a framework designed to enhance image captioning by incorporating detailed visual information into traditional captions.

🎉 Exciting News: Paper accepted at WACV 2024!

## Resources

- 💻 **Project Page**: For more details, visit the official [project page](https://rotsteinnoam.github.io/FuseCap/).

- 📝 **Read the Paper**: You can find the paper [here](https://arxiv.org/abs/2305.17718).

- 🚀 **Demo**: Try out a [demo](https://huggingface.co/spaces/noamrot/FuseCap) of our BLIP-based model trained using FuseCap, hosted on Hugging Face Spaces.


## Release Status

### Done
- ✅ Paper publication.
- ✅ Release of the FuseCap dataset.
- ✅ Hugging Face captioner demo, including captioner weights.

## Hugging Face Demo
Try out our BLIP-based captioning model, trained using FuseCap, with this quick Python snippet.
The code below downloads an example image and generates a caption for it:

```python
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration
import torch

# Run on GPU if available, otherwise fall back to CPU.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
processor = BlipProcessor.from_pretrained("noamrot/FuseCap")
model = BlipForConditionalGeneration.from_pretrained("noamrot/FuseCap").to(device)

# Download an example image.
img_url = 'https://huggingface.co/spaces/noamrot/FuseCap/resolve/main/bike.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# Condition the caption on a short text prompt.
text = "a picture of "
inputs = processor(raw_image, text, return_tensors="pt").to(device)

out = model.generate(**inputs, num_beams=3)
print(processor.decode(out[0], skip_special_tokens=True))
```
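
The same `processor` and `model` can caption a local file as well. Below is a minimal variation of the snippet above, assuming it has already been run; the file path and the `max_length` value are illustrative placeholders, not part of the repository's demo:

```python
# Continues from the snippet above: Image, processor, model, and device
# are already defined. "my_photo.jpg" is a placeholder path; max_length
# is an optional, illustrative setting for longer enriched captions.
raw_image = Image.open("my_photo.jpg").convert('RGB')
inputs = processor(raw_image, "a picture of ", return_tensors="pt").to(device)
out = model.generate(**inputs, num_beams=3, max_length=100)
print(processor.decode(out[0], skip_special_tokens=True))
```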

## Datasets
We provide the fused captions that were created using the FuseCap framework.
These captions were used for both the pretraining and training phases of our image captioning model.
The images can be downloaded from the respective dataset websites or from the provided URLs (SBU, CC3, CC12).

Dataset | FuseCap Captions
--- | :---:
COCO | Train, Val, Test
SBU | Train
CC3 | Train
CC12 | Train

## BibTeX

```
@inproceedings{rotstein2024fusecap,
  title={Fusecap: Leveraging large language models for enriched fused image captions},
  author={Rotstein, Noam and Bensa{\"\i}d, David and Brody, Shaked and Ganz, Roy and Kimmel, Ron},
  booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision},
  pages={5689--5700},
  year={2024}
}
```
--------------------------------------------------------------------------------