├── .github
│   └── workflows
│       └── publish.yml
├── .gitignore
├── LICENSE
├── README.md
├── __init__.py
├── assets
│   ├── icon.png
│   └── workflow.png
├── image_captioner.py
├── pyproject.toml
└── requirements.txt
/.github/workflows/publish.yml:
--------------------------------------------------------------------------------
name: Publish to Comfy registry
on:
  workflow_dispatch:
  push:
    branches:
      - main
      - master
    paths:
      - "pyproject.toml"

permissions:
  issues: write

jobs:
  publish-node:
    name: Publish Custom Node to registry
    runs-on: ubuntu-latest
    if: ${{ github.repository_owner == 'neverbiasu' }}
    steps:
      - name: Check out code
        uses: actions/checkout@v4
      - name: Publish Custom Node
        uses: Comfy-Org/publish-node-action@v1
        with:
          ## Add your own personal access token to your GitHub repository secrets and reference it here.
          personal_access_token: ${{ secrets.REGISTRY_ACCESS_TOKEN }}
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/neverbiasu/ComfyUI-Image-Captioner/66fa6f3015c91087db4782de52befbac055a4547/.gitignore
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2024 Faych Chen

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# ComfyUI ImageCaptioner

![icon](assets/icon.png)

A [ComfyUI](https://github.com/comfyanonymous/ComfyUI) extension for generating captions for your images. Captions are generated by calling a hosted vision-language model (VLM) API, so an API key is required. You can give instructions or ask questions in natural language.

Try asking for:

* captions or long descriptions
* whether a person or object is in the image, and how many
* lists of keywords or tags
* a description of the opposite of the image

![workflow](assets/workflow.png)

## Installation

1. `git clone https://github.com/neverbiasu/ComfyUI-Image-Captioner` into your `custom_nodes` folder
   - e.g. `custom_nodes\ComfyUI-Image-Captioner`
2. Open a console (Command Prompt, Terminal, etc.)
3. Change to the `custom_nodes/ComfyUI-Image-Captioner` folder you just created
   - e.g. `cd C:\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI-Image-Captioner`, or wherever you have it installed
4. Run `pip install -r requirements.txt`

## Usage

Add the node via `image` -> `ImageCaptioner`.

- **image**: The image you want to caption. Batched inputs are accepted, but only the first image in the batch is captioned.
- **api**: Your DashScope API key.
- **user_prompt**: The prompt that drives the VLM.
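
Outside the graph, the node's function can also be called directly, which is handy for quick testing. A minimal sketch, run from the extension folder (the API key is a placeholder you must replace, and the random tensor stands in for a real image):

```python
import torch
from image_captioner import ImageCaptioner

node = ImageCaptioner()
# ComfyUI passes images as float tensors of shape (batch, height, width, channels) in [0, 1].
image = torch.rand(1, 512, 512, 3)
(caption,) = node.generate_image_captions(
    image=image,
    api="YOUR_DASHSCOPE_API_KEY",  # placeholder: your own DashScope key
    user_prompt="List comma-separated tags for this image.",
)
print(caption)
```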

## Requirements

You need a DashScope API key; see the [documentation](https://help.aliyun.com/zh/dashscope/developer-reference/acquisition-and-configuration-of-api-key?spm=a2c4g.11186623.0.0.7a32fa70GIg3tt) for how to obtain and configure one.
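
A minimal sketch for checking that the key works outside ComfyUI, mirroring the call the node makes (the tiny solid-color test image is illustrative):

```python
import os
import dashscope
from http import HTTPStatus
from PIL import Image

dashscope.api_key = "YOUR_DASHSCOPE_API_KEY"  # placeholder: your own key

# Write a tiny test image and pass it as a file:// URL, as the node does.
path = os.path.abspath("test.png")
Image.new("RGB", (64, 64), "red").save(path)

response = dashscope.MultiModalConversation.call(
    model="qwen-vl-plus",  # the model the node calls
    messages=[{
        "role": "user",
        "content": [{"image": f"file://{path}"}, {"text": "Describe this image in one sentence."}],
    }],
)
os.remove(path)

if response.status_code == HTTPStatus.OK:
    print(response.output.choices[0].message.content[0]["text"])
else:
    print(f"Error: {response.code} - {response.message}")
```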

## See also

* [ComfyUI-WD14-Tagger](https://github.com/pythongosssss/ComfyUI-WD14-Tagger)
* [ComfyUI-LLaVA-Captioner](https://github.com/ceruleandeep/ComfyUI-LLaVA-Captioner)
* [IELTSDuck](https://github.com/neverbiasu/IELTSDuck)
--------------------------------------------------------------------------------
/__init__.py:
--------------------------------------------------------------------------------
from .image_captioner import ImageCaptioner

# Map the node's internal name to its class so ComfyUI can register it.
NODE_CLASS_MAPPINGS = {
    "ImageCaptioner": ImageCaptioner
}

# Human-readable name shown in the ComfyUI node menu.
NODE_DISPLAY_NAME_MAPPINGS = {
    "ImageCaptioner": "Image Captioner"
}
--------------------------------------------------------------------------------
/assets/icon.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/neverbiasu/ComfyUI-Image-Captioner/66fa6f3015c91087db4782de52befbac055a4547/assets/icon.png
--------------------------------------------------------------------------------
/assets/workflow.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/neverbiasu/ComfyUI-Image-Captioner/66fa6f3015c91087db4782de52befbac055a4547/assets/workflow.png
--------------------------------------------------------------------------------
/image_captioner.py:
--------------------------------------------------------------------------------
import os
import torch
import dashscope

from http import HTTPStatus
import torchvision.transforms as transforms


def post_process_prompt(raw_prompt):
    # Normalize a raw comma-separated caption: lowercase each tag, join multi-word
    # tags with underscores, drop duplicates (keeping first occurrence), cap at 70.
    tags = [tag.strip().lower() for tag in raw_prompt.split(',') if tag.strip()]
    tags = ['_'.join(tag.split()) for tag in tags]
    seen = set()
    unique_tags = [tag for tag in tags if not (tag in seen or seen.add(tag))]
    final_tags = unique_tags[:70]
    return ', '.join(final_tags)
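
# For example (illustrative input and output):
#   post_process_prompt("Red  Dress, red dress, Beach Sunset")
#   -> "red_dress, beach_sunset"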


class ImageCaptioner:
    def __init__(self):
        pass

    @classmethod
    def INPUT_TYPES(cls):
        return {
            "required": {
                "image": ("IMAGE",),
                "api": ("STRING", {"multiline": True, "default": ""}),
                "user_prompt": ("STRING", {"multiline": True, "default": "As an AI image tagging expert, please provide precise tags for these images to enhance the CLIP model's understanding of the content. Employ succinct keywords or phrases, steering clear of elaborate sentences and extraneous conjunctions. Prioritize the tags by relevance. Your tags should capture key elements such as the main subject, setting, artistic style, composition, image quality, color tone, filter, and camera specifications, and any other tags crucial for the image. When tagging photos of people, include specific details like gender, nationality, attire, actions, pose, expressions, accessories, makeup, composition type, age, etc. For other image categories, apply appropriate and common descriptive tags as well. Recognize and tag any celebrities, well-known landmarks or IPs if clearly featured in the image. Your tags should be accurate, non-duplicative, and within a 20-75 word count range. These tags will be used for image re-creation, so the closer the resemblance to the original image, the better the tag quality. Tags should be comma-separated. Exceptional tagging will be rewarded with $10 per image."})
            }
        }

    RETURN_TYPES = ("STRING",)
    FUNCTION = "generate_image_captions"
    OUTPUT_NODE = True

    CATEGORY = "image"

    def generate_image_captions(self, image, api, user_prompt):
        assert isinstance(image, torch.Tensor), "Image must be a torch.Tensor."
        assert isinstance(api, str), "API key must be a string."
        assert isinstance(user_prompt, str), "User prompt must be a string."

        dashscope.api_key = api

        # ComfyUI images are (batch, H, W, C) floats in [0, 1]; take the first
        # image in the batch and convert it to a PIL image.
        image = image.squeeze(0)
        image = image.permute(2, 0, 1)
        image = transforms.ToPILImage()(image)

        # DashScope reads local files through a file:// URL with an absolute path.
        image_path = os.path.abspath("image.png")
        image.save(image_path)
        image_url = f"file://{image_path}"

        messages = [
            {
                "role": "system",
                "content": [{"text": "As an AI image tagging expert, please provide precise tags for these images to enhance the CLIP model's understanding of the content. Employ succinct keywords or phrases, steering clear of elaborate sentences and extraneous conjunctions. Prioritize the tags by relevance. Your tags should capture key elements such as the main subject, setting, artistic style, composition, image quality, color tone, filter, and camera specifications, and any other tags crucial for the image. When tagging photos of people, include specific details like gender, nationality, attire, actions, pose, expressions, accessories, makeup, composition type, age, etc. For other image categories, apply appropriate and common descriptive tags as well. Recognize and tag any celebrities, well-known landmarks or IPs if clearly featured in the image. Your tags should be accurate, non-duplicative, and within a 20-75 word count range. These tags will be used for image re-creation, so the closer the resemblance to the original image, the better the tag quality. Tags should be comma-separated. Exceptional tagging will be rewarded with $10 per image."}]
            },
            {
                "role": "user",
                "content": [
                    {"image": image_url},
                    {"text": user_prompt}
                ]
            }
        ]

        response = dashscope.MultiModalConversation.call(
            model="qwen-vl-plus",
            messages=messages
        )

        os.remove(image_path)

        if response.status_code == HTTPStatus.OK:
            raw_prompt = response.output.choices[0].message.content[0]["text"]
            if isinstance(raw_prompt, list):
                # Some responses return a list of fragments; join them into one string.
                raw_prompt = ', '.join(str(item) for item in raw_prompt)
            processed_prompt = post_process_prompt(raw_prompt)
            return (processed_prompt,)
        else:
            print(f"Error: {response.code} - {response.message}")
            return ("Error generating captions.",)
--------------------------------------------------------------------------------
/pyproject.toml:
--------------------------------------------------------------------------------
[project]
name = "comfyui-image-captioner"
description = "A ComfyUI extension for generating captions for your images using vision-language model (VLM) APIs. You can give instructions or ask questions in natural language."
version = "1.0.2"
license = { file = "LICENSE" }
dependencies = ["torch", "dashscope", "torchvision"]

[project.urls]
Repository = "https://github.com/neverbiasu/ComfyUI-Image-Captioner"
# Used by Comfy Registry https://comfyregistry.org

[tool.comfy]
PublisherId = "faych"
DisplayName = "ComfyUI-Image-Captioner"
Icon = "https://raw.githubusercontent.com/neverbiasu/ComfyUI-Image-Captioner/main/assets/icon.png"
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
torch
dashscope
torchvision
--------------------------------------------------------------------------------