├── .gitignore ├── LICENSE ├── README.md ├── assets ├── demo1.png └── demo2.png ├── download_weights.py └── ros2_vlm ├── package.xml ├── resource └── ros2_vlm ├── ros2_vlm ├── __init__.py ├── blip_visual_qna.py ├── grounded_sam.py └── modules │ ├── blipvisual.py │ ├── checkpoints │ └── README.md │ ├── groundedsam.py │ └── openvino_irs │ └── README.md ├── setup.cfg ├── setup.py └── test ├── test_copyright.py ├── test_flake8.py └── test_pep257.py /.gitignore: -------------------------------------------------------------------------------- 1 | devel/ 2 | logs/ 3 | build/ 4 | bin/ 5 | lib/ 6 | msg_gen/ 7 | srv_gen/ 8 | msg/*Action.msg 9 | msg/*ActionFeedback.msg 10 | msg/*ActionGoal.msg 11 | msg/*ActionResult.msg 12 | msg/*Feedback.msg 13 | msg/*Goal.msg 14 | msg/*Result.msg 15 | msg/_*.py 16 | build_isolated/ 17 | devel_isolated/ 18 | 19 | # Generated by dynamic reconfigure 20 | *.cfgc 21 | /cfg/cpp/ 22 | /cfg/*.py 23 | 24 | # Ignore generated docs 25 | *.dox 26 | *.wikidoc 27 | 28 | # eclipse stuff 29 | .project 30 | .cproject 31 | 32 | # qcreator stuff 33 | CMakeLists.txt.user 34 | 35 | srv/_*.py 36 | *.pcd 37 | *.pyc 38 | qtcreator-* 39 | *.user 40 | 41 | /planning/cfg 42 | /planning/docs 43 | /planning/src 44 | 45 | *~ 46 | 47 | # Emacs 48 | .#* 49 | 50 | # Catkin custom files 51 | CATKIN_IGNORE 52 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2024 Nilutpol Kashyap 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Application of Vision Language Models with ROS 2 workshop 2 | 3 | We explore how robots can perceive and understand their environment through the powerful combination of image understanding and natural language processing. This repository dives deep into the fascinating world of **vision-language models** for **robotics applications**, specifically utilizing the powerful **Intel OpenVINO Toolkit**. 
4 | 
5 | This repository is presented as a **workshop** at the **ROS meetup Lagos**.
6 | 
7 | ## Prerequisites
8 | - **Ubuntu 22.04 or newer**
9 | - **ROS 2 Humble or newer**
10 | - **Python 3**
11 | - **Intel OpenVINO toolkit**
12 | 
13 | ## Hardware Requirements
14 | Please note that to run the code in this repository, you will need a device compatible with the **[Intel OpenVINO Toolkit](https://docs.openvino.ai/2024/home.html)**. This typically includes **Intel CPUs**, Intel Neural Compute Sticks, or other Intel hardware supporting OpenVINO.
15 | 
16 | ### Create a ROS 2 colcon workspace
17 | 
18 | ```
19 | mkdir -p ~/ros2_ws/src
20 | ```
21 | 
22 | ### Create and set up a Python virtual environment
23 | ```
24 | cd ~/ros2_ws
25 | 
26 | virtualenv -p python3 ./vlm-venv
27 | source ./vlm-venv/bin/activate
28 | 
29 | # Make sure that colcon doesn't try to build the venv
30 | touch ./vlm-venv/COLCON_IGNORE
31 | ```
32 | 
33 | ### Install Python dependencies
34 | ```
35 | pip install timm --extra-index-url https://download.pytorch.org/whl/cpu  # timm pulls in torch from the CPU-only index
36 | 
37 | pip install "openvino>=2024.1" "torch>=2.1" opencv-python supervision transformers yapf pycocotools addict "gradio>=4.19" tqdm
38 | ```
39 | 
40 | ### Add your Python virtual environment package path
41 | **Make sure to replace <> with your system username.**
42 | ```
43 | export PYTHONPATH='/home/<>/ros2_ws/vlm-venv/lib/python3.10/site-packages'
44 | ```
45 | 
46 | ### Clone this repository inside the 'src' folder of your workspace
47 | ```
48 | cd ~/ros2_ws/src
49 | 
50 | git clone https://github.com/nilutpolkashyap/vlms_with_ros2_workshop.git
51 | ```
52 | 
53 | ### Download weights and required packages
54 | ```
55 | cd ~/ros2_ws/src/vlms_with_ros2_workshop
56 | 
57 | python3 download_weights.py
58 | ```
59 | 
60 | ### Download OpenVINO IR models from Google Drive
61 | 
62 | Download the zip file from the Google Drive [link here](https://drive.google.com/file/d/1yCpUEWr3KR76uVWgtHNreIdlYh59z0dS/view?usp=sharing).
63 | 
64 | Unzip it and place the contents inside the **'openvino_irs'** directory at the following path:
65 | 
66 | ```
67 | ~/ros2_ws/src/vlms_with_ros2_workshop/ros2_vlm/ros2_vlm/modules/openvino_irs
68 | ```
69 | 
70 | ### Build and source the workspace
71 | ```
72 | cd ~/ros2_ws
73 | colcon build --symlink-install
74 | 
75 | source ~/ros2_ws/install/setup.bash
76 | ```
77 | 
78 | ## Object detection and masking with GroundedSAM (GroundingDINO + SAM)
79 | 
80 | **GroundedSAM** tackles **object detection** and **segmentation**. It integrates various open-world models, allowing a robot not just to **detect** objects but also to **understand** their specific regions. This can empower robots to act on specific parts (e.g., grasping a cup's handle) based on textual instructions or visual cues.
81 | 
82 | <div align="center">
83 | GroundedSAM

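Before launching the node below, you can sanity-check the detector on a single image. The snippet is a minimal sketch (not part of the workshop nodes): it reuses the same `ImageProcessor` class that the `grounded_sam` node wraps, and it assumes the workspace has been built and sourced, the weights and OpenVINO IRs are in place, and the script is run from `~/ros2_ws` so the module's relative `checkpoints` and `openvino_irs` paths resolve. The image path and prompt list are only examples.

```
import cv2
import openvino as ov
from PIL import Image

from ros2_vlm.modules.groundedsam import ImageProcessor

core = ov.Core()
processor = ImageProcessor(core, "CPU")  # same class used by the grounded_sam node

# Load any test image and convert BGR (OpenCV) -> RGB (PIL)
frame = cv2.imread("src/vlms_with_ros2_workshop/assets/demo1.png")
pil_image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))

# Detect the prompted classes; pass use_segment=True to also generate EfficientSAM masks
annotated_bgr = processor.process_image(pil_image, ["person", "hair"], use_segment=False)
cv2.imwrite("grounded_sam_result.png", annotated_bgr)
```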
84 | 
85 | ### Run the GroundedSAM node
86 | ```
87 | ros2 run ros2_vlm grounded_sam --ros-args -p device:='CPU' -p video_source:=/dev/video2 -p isSegment:=False -p detectionList:='["eyes", "person", "hair"]'
88 | ```
89 | ### **ROS 2 CLI Arguments**
90 | - device - Inference device (e.g. CPU, GPU, NPU)
91 | - video_source - Video source to grab image frames from
92 | - isSegment - Whether to also run the EfficientSAM segmentation model (True/False)
93 | - detectionList - List of object classes to detect
94 | 
95 | Check out more in the [GroundedSAM OpenVINO Notebook](https://github.com/openvinotoolkit/openvino_notebooks/blob/latest/notebooks/grounded-segment-anything/grounded-segment-anything.ipynb)
96 | 
97 | ## Visual Question Answering using BLIP
98 | 
99 | **BLIP** bridges the gap between vision and language. It analyzes images and extracts meaningful information, generating captions that describe the scene or answering questions about it. This lets robots not only **"see"** their environment but also **understand** its context and **respond** to natural language instructions effectively.
100 | 
101 | <div align="center">
102 | Visual Question Answering

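The node below is a thin wrapper around the `BlipVQA` helper in `ros2_vlm/modules/blipvisual.py`, so the same pipeline can also be called directly, which is a quick way to confirm the BLIP IR files were placed correctly. A minimal sketch, assuming the workspace is sourced, the script is run from `~/ros2_ws`, and the paths below match your setup (the first run also downloads the BLIP processor and base model from Hugging Face):

```
from pathlib import Path

import openvino as ov

from ros2_vlm.modules.blipvisual import BlipVQA

core = ov.Core()
ir_dir = Path("src/vlms_with_ros2_workshop/ros2_vlm/ros2_vlm/modules/openvino_irs")

# Compile the three BLIP IRs (vision model, text encoder, text decoder with KV cache)
vqa = BlipVQA(
    core,
    ir_dir / "blip_vision_model.xml",
    ir_dir / "blip_text_encoder.xml",
    ir_dir / "blip_text_decoder_with_past.xml",
    device="CPU",
)

answer, seconds = vqa.generate_answer(
    "src/vlms_with_ros2_workshop/assets/demo2.png",
    "What is in the image?",
    max_length=20,
)
print(f"Answer: {answer} (inference time: {seconds:.2f} s)")
```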
103 | 
104 | ### Run the Blip Visual QnA node
105 | ```
106 | ros2 run ros2_vlm blip_visual_qna --ros-args -p device:="GPU.0" -p question:="What is in the image?" -p image_path:="/home/nilutpol/ai_ws/src/blip_qna_code/demo2.jpg"
107 | ```
108 | ### **ROS 2 CLI Arguments**
109 | - device - Inference device (e.g. CPU, GPU, NPU)
110 | - question - Question for the BLIP model
111 | - image_path - Path to the input image
112 | 
113 | Check out more in the [BLIP Visual Question Answering OpenVINO Notebook](https://github.com/openvinotoolkit/openvino_notebooks/blob/latest/notebooks/blip-visual-language-processing/blip-visual-language-processing.ipynb)
114 | 
115 | ## Resources
116 | - [Using Python Packages with ROS 2](https://docs.ros.org/en/humble/How-To-Guides/Using-Python-Packages.html)
117 | - [How to use (python) virtual environments with ROS2?](https://answers.ros.org/question/371083/how-to-use-python-virtual-environments-with-ros2/)
119 | - [Intel OpenVINO Toolkit](https://docs.openvino.ai/2024/home.html)
120 | - [OpenVINO Notebooks](https://github.com/openvinotoolkit/openvino_notebooks)
--------------------------------------------------------------------------------
/assets/demo1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nilutpolkashyap/vlms_with_ros2_workshop/e3cbffcbd9377f374767df15b1f89d9b650c04e3/assets/demo1.png
--------------------------------------------------------------------------------
/assets/demo2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nilutpolkashyap/vlms_with_ros2_workshop/e3cbffcbd9377f374767df15b1f89d9b650c04e3/assets/demo2.png
--------------------------------------------------------------------------------
/download_weights.py:
--------------------------------------------------------------------------------
1 | from pathlib import Path
2 | import os
3 | import urllib.request
4 | import subprocess
5 | 
6 | current_dir = os.getcwd()
7 | 
8 | grounding_dino_checkpoint_base_path = os.path.join(current_dir, 'ros2_vlm', 'ros2_vlm', 'modules', 'checkpoints')
9 | grounding_dino_checkpoint_name = "groundingdino_swint_ogc.pth"
10 | file_path = os.path.join(grounding_dino_checkpoint_base_path, grounding_dino_checkpoint_name)
11 | grounding_dino_checkpoint_link = "https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth"
12 | 
13 | # Download the GroundingDINO checkpoint only if it is not already present
14 | if not os.path.exists(file_path):
15 |     print(f"Downloading {grounding_dino_checkpoint_name}...")
16 |     # Fetch the checkpoint from the GroundingDINO release assets
17 |     urllib.request.urlretrieve(grounding_dino_checkpoint_link, file_path)
18 |     print("Download complete!")
19 | else:
20 |     print(f"File {grounding_dino_checkpoint_name} already exists.")
21 | # Clone the OpenVINO-compatible GroundingDINO fork and EfficientSAM into the modules directory
22 | ground_dino_dir = os.path.join(current_dir, 'ros2_vlm', 'ros2_vlm', 'modules', 'GroundingDINO')
23 | subprocess.run(["git", "clone", "https://github.com/wenyi5608/GroundingDINO/", ground_dino_dir])
24 | 
25 | efficient_sam_dir = os.path.join(current_dir, 'ros2_vlm', 'ros2_vlm', 'modules', 'EfficientSAM')
26 | subprocess.run(["git", "clone", "https://github.com/yformer/EfficientSAM/", efficient_sam_dir])
27 | 
28 | 
29 | 
30 | 
--------------------------------------------------------------------------------
/ros2_vlm/package.xml:
--------------------------------------------------------------------------------
1 | <?xml version="1.0"?>
2 | <?xml-model href="http://download.ros.org/schema/package_format3.xsd" schematypens="http://www.w3.org/2001/XMLSchema"?>
3 | <package format="3">
4 |   <name>ros2_vlm</name>
5 |   <version>0.0.0</version>
6 |   <description>ROS 2 OpenCV Python Package</description>
7 |   <maintainer email="nilutpolkashyap@todo.todo">Nilutpol Kashyap</maintainer>
8 |   <license>MIT License</license>
9 | 
10 |   <depend>rclpy</depend>
11 |   <depend>
cv_bridge</depend>
12 |   <depend>sensor_msgs</depend>
13 | 
14 |   <test_depend>ament_copyright</test_depend>
15 |   <test_depend>ament_flake8</test_depend>
16 |   <test_depend>ament_pep257</test_depend>
17 |   <test_depend>python3-pytest</test_depend>
18 | 
19 |   <export>
20 |     <build_type>ament_python</build_type>
21 |   </export>
22 | </package>
23 | 
--------------------------------------------------------------------------------
/ros2_vlm/resource/ros2_vlm:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nilutpolkashyap/vlms_with_ros2_workshop/e3cbffcbd9377f374767df15b1f89d9b650c04e3/ros2_vlm/resource/ros2_vlm
--------------------------------------------------------------------------------
/ros2_vlm/ros2_vlm/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nilutpolkashyap/vlms_with_ros2_workshop/e3cbffcbd9377f374767df15b1f89d9b650c04e3/ros2_vlm/ros2_vlm/__init__.py
--------------------------------------------------------------------------------
/ros2_vlm/ros2_vlm/blip_visual_qna.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 | import rclpy
3 | from rclpy.node import Node
4 | from PIL import Image
5 | from .modules.blipvisual import BlipVQA
6 | import openvino as ov
7 | from pathlib import Path
8 | import sys
9 | import os
10 | 
11 | current_dir = os.getcwd()
12 | additional_path = '/src/vlms_with_ros2_workshop/ros2_vlm/ros2_vlm/modules'  # modules folder of the cloned workshop repo, relative to the workspace root
13 | updated_path = os.path.join(current_dir + additional_path)
14 | 
15 | class QuestionImageProcessor(Node):
16 |     def __init__(self):
17 |         super().__init__('question_image_processor_node')
18 | 
19 |         self.core = ov.Core()
20 | 
21 |         self.declare_parameter('device', 'GPU.0')
22 |         self.declare_parameter('question', 'What is in the image?')
23 |         self.declare_parameter('image_path', '')
24 | 
25 |         self.device = self.get_parameter('device').get_parameter_value().string_value
26 |         self.question = self.get_parameter('question').get_parameter_value().string_value
27 |         self.image_path = self.get_parameter('image_path').get_parameter_value().string_value
28 | 
29 |         if not self.image_path:
30 |             self.get_logger().error('image_path parameter is required.')
31 |             return
32 | 
33 |         self.get_logger().info(f'Inference Device: {self.device}')
34 |         self.get_logger().info(f'Question: {self.question}')
35 |         self.get_logger().info(f'Image Path: {self.image_path}')
36 | 
37 |         self.irs_path = Path("openvino_irs")
38 | 
39 |         self.vision_model_path = updated_path / self.irs_path / "blip_vision_model.xml"
40 |         self.text_encoder_path = updated_path / self.irs_path / "blip_text_encoder.xml"
41 |         self.text_decoder_path = updated_path / self.irs_path / "blip_text_decoder_with_past.xml"
42 | 
43 |         self.blip_inference = BlipVQA(self.core, self.vision_model_path, self.text_encoder_path, self.text_decoder_path, self.device)
44 | 
45 |         try:
46 |             answer, inference_time = self.blip_inference.generate_answer(self.image_path, self.question, max_length=20)
47 | 
48 |             self.get_logger().info(f"Generated Answer: {answer}")
49 |             self.get_logger().info(f"Inference Time: {inference_time} seconds")
50 |         except Exception as e:
51 |             self.get_logger().error(f'Inference failed: {e}')
52 | 
53 | def main(args=None):
54 |     rclpy.init(args=args)
55 |     node = QuestionImageProcessor()
56 | 
57 |     if rclpy.ok():
58 |         rclpy.spin(node)
59 | 
60 |     node.destroy_node()
61 |     rclpy.shutdown()
62 | 
63 | if __name__ == '__main__':
64 |     main()
65 | 
--------------------------------------------------------------------------------
/ros2_vlm/ros2_vlm/grounded_sam.py:
--------------------------------------------------------------------------------
1 | import rclpy
2 | from
rclpy.node import Node 3 | import cv2 4 | import torch 5 | import openvino as ov 6 | from .modules.groundedsam import ImageProcessor 7 | from PIL import Image 8 | 9 | class VideoDisplayNode(Node): 10 | def __init__(self): 11 | super().__init__('video_display_node') 12 | 13 | self.declare_parameter('video_source', '/dev/video0') 14 | self.declare_parameter('isSegment', False) 15 | self.declare_parameter('detectionList', ['person']) 16 | 17 | self.video_source = self.get_parameter('video_source').value 18 | self.isSegment = self.get_parameter('isSegment').value 19 | self.detectionList = self.get_parameter('detectionList').value 20 | 21 | self.get_logger().info(f'Video source: {self.video_source}') 22 | self.get_logger().info(f'isSegment: {self.isSegment}') 23 | self.get_logger().info('My list:') 24 | for item in self.detectionList: 25 | self.get_logger().info(f' - {item}') 26 | 27 | self.cap = cv2.VideoCapture(self.video_source) 28 | 29 | self.core = ov.Core() 30 | self.device = "CPU" #"GPU.0" 31 | self.processor = ImageProcessor(self.core, self.device) 32 | 33 | def __del__(self): 34 | self.cap.release() 35 | 36 | def run(self): 37 | ret, frame = self.cap.read() 38 | if ret: 39 | self.pil_image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)) 40 | self.processed_image = self.processor.process_image(self.pil_image, self.detectionList, self.isSegment) 41 | cv2.imshow('Video Frame', self.processed_image) 42 | cv2.waitKey(0) 43 | 44 | def main(args=None): 45 | rclpy.init(args=args) 46 | node = VideoDisplayNode() 47 | node.run() 48 | rclpy.shutdown() 49 | 50 | if __name__ == '__main__': 51 | main() 52 | -------------------------------------------------------------------------------- /ros2_vlm/ros2_vlm/modules/blipvisual.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from pathlib import Path 3 | import openvino as ov 4 | import time 5 | from PIL import Image 6 | from transformers import BlipProcessor, BlipForQuestionAnswering 7 | from functools import partial 8 | import torch 9 | import numpy as np 10 | from typing import List, Dict 11 | from transformers.modeling_outputs import CausalLMOutputWithCrossAttentions 12 | 13 | def init_past_inputs(model_inputs: List): 14 | """ 15 | Helper function for initialization of past inputs on first inference step 16 | Parameters: 17 | model_inputs (List): list of model inputs 18 | Returns: 19 | pkv (List[ov.Tensor]): list of filled past key values 20 | """ 21 | pkv = [] 22 | for input_tensor in model_inputs[4:]: 23 | partial_shape = input_tensor.partial_shape 24 | partial_shape[0] = 1 25 | partial_shape[2] = 0 26 | pkv.append(ov.Tensor(ov.Type.f32, partial_shape.get_shape())) 27 | return pkv 28 | 29 | 30 | def postprocess_text_decoder_outputs(output: Dict): 31 | """ 32 | Helper function for rearranging model outputs and wrapping to CausalLMOutputWithCrossAttentions 33 | Parameters: 34 | output (Dict): dictionary with model output 35 | Returns 36 | wrapped_outputs (CausalLMOutputWithCrossAttentions): outputs wrapped to CausalLMOutputWithCrossAttentions format 37 | """ 38 | logits = torch.from_numpy(output[0]) 39 | past_kv = list(output.values())[1:] 40 | return CausalLMOutputWithCrossAttentions( 41 | loss=None, 42 | logits=logits, 43 | past_key_values=past_kv, 44 | hidden_states=None, 45 | attentions=None, 46 | cross_attentions=None, 47 | ) 48 | 49 | def text_decoder_forward( 50 | ov_text_decoder_with_past: ov.CompiledModel, 51 | input_ids: torch.Tensor, 52 | attention_mask: torch.Tensor, 53 | 
past_key_values: List[ov.Tensor], 54 | encoder_hidden_states: torch.Tensor, 55 | encoder_attention_mask: torch.Tensor, 56 | **kwargs 57 | ): 58 | """ 59 | Inference function for text_decoder in one generation step 60 | Parameters: 61 | input_ids (torch.Tensor): input token ids 62 | attention_mask (torch.Tensor): attention mask for input token ids 63 | past_key_values (List[ov.Tensor] list of cached decoder hidden states from previous step 64 | encoder_hidden_states (torch.Tensor): encoder (vision or text) hidden states 65 | encoder_attention_mask (torch.Tensor): attnetion mask for encoder hidden states 66 | Returns 67 | model outputs (CausalLMOutputWithCrossAttentions): model prediction wrapped to CausalLMOutputWithCrossAttentions class including predicted logits and hidden states for caching 68 | """ 69 | inputs = [input_ids, attention_mask, encoder_hidden_states, encoder_attention_mask] 70 | if past_key_values is None: 71 | inputs.extend(init_past_inputs(ov_text_decoder_with_past.inputs)) 72 | else: 73 | inputs.extend(past_key_values) 74 | outputs = ov_text_decoder_with_past(inputs) 75 | return postprocess_text_decoder_outputs(outputs) 76 | 77 | 78 | class OVBlipModel: 79 | """ 80 | Model class for inference BLIP model with OpenVINO 81 | """ 82 | 83 | def __init__( 84 | self, 85 | config, 86 | decoder_start_token_id: int, 87 | vision_model, 88 | text_encoder, 89 | text_decoder, 90 | ): 91 | """ 92 | Initialization class parameters 93 | """ 94 | self.vision_model = vision_model 95 | self.vision_model_out = vision_model.output(0) 96 | self.text_encoder = text_encoder 97 | self.text_encoder_out = text_encoder.output(0) 98 | self.text_decoder = text_decoder 99 | self.config = config 100 | self.decoder_start_token_id = decoder_start_token_id 101 | self.decoder_input_ids = config.text_config.bos_token_id 102 | 103 | def generate_answer(self, pixel_values: torch.Tensor, input_ids: torch.Tensor, attention_mask: torch.Tensor, **generate_kwargs): 104 | """ 105 | Visual Question Answering prediction 106 | Parameters: 107 | pixel_values (torch.Tensor): preprocessed image pixel values 108 | input_ids (torch.Tensor): question token ids after tokenization 109 | attention_mask (torch.Tensor): attention mask for question tokens 110 | Retruns: 111 | generation output (torch.Tensor): tensor which represents sequence of generated answer token ids 112 | """ 113 | image_embed = self.vision_model(pixel_values.detach().numpy())[self.vision_model_out] 114 | image_attention_mask = np.ones(image_embed.shape[:-1], dtype=int) 115 | if isinstance(input_ids, list): 116 | input_ids = torch.LongTensor(input_ids) 117 | question_embeds = self.text_encoder( 118 | [ 119 | input_ids.detach().numpy(), 120 | attention_mask.detach().numpy(), 121 | image_embed, 122 | image_attention_mask, 123 | ] 124 | )[self.text_encoder_out] 125 | question_attention_mask = np.ones(question_embeds.shape[:-1], dtype=int) 126 | 127 | bos_ids = np.full((question_embeds.shape[0], 1), fill_value=self.decoder_start_token_id) 128 | 129 | outputs = self.text_decoder.generate( 130 | input_ids=torch.from_numpy(bos_ids), 131 | eos_token_id=self.config.text_config.sep_token_id, 132 | pad_token_id=self.config.text_config.pad_token_id, 133 | encoder_hidden_states=torch.from_numpy(question_embeds), 134 | encoder_attention_mask=torch.from_numpy(question_attention_mask), 135 | **generate_kwargs, 136 | ) 137 | return outputs 138 | 139 | def generate_caption(self, pixel_values: torch.Tensor, input_ids: torch.Tensor = None, attention_mask: torch.Tensor = None, 
**generate_kwargs): 140 | """ 141 | Image Captioning prediction 142 | Parameters: 143 | pixel_values (torch.Tensor): preprocessed image pixel values 144 | input_ids (torch.Tensor, *optional*, None): pregenerated caption token ids after tokenization, if provided caption generation continue provided text 145 | attention_mask (torch.Tensor): attention mask for caption tokens, used only if input_ids provided 146 | Retruns: 147 | generation output (torch.Tensor): tensor which represents sequence of generated caption token ids 148 | """ 149 | batch_size = pixel_values.shape[0] 150 | 151 | image_embeds = self.vision_model(pixel_values.detach().numpy())[self.vision_model_out] 152 | 153 | image_attention_mask = torch.ones(image_embeds.shape[:-1], dtype=torch.long) 154 | 155 | if isinstance(input_ids, list): 156 | input_ids = torch.LongTensor(input_ids) 157 | elif input_ids is None: 158 | input_ids = torch.LongTensor( 159 | [ 160 | [ 161 | self.config.text_config.bos_token_id, 162 | self.config.text_config.eos_token_id, 163 | ] 164 | ] 165 | ).repeat(batch_size, 1) 166 | input_ids[:, 0] = self.config.text_config.bos_token_id 167 | attention_mask = attention_mask[:, :-1] if attention_mask is not None else None 168 | 169 | outputs = self.text_decoder.generate( 170 | input_ids=input_ids[:, :-1], 171 | eos_token_id=self.config.text_config.sep_token_id, 172 | pad_token_id=self.config.text_config.pad_token_id, 173 | attention_mask=attention_mask, 174 | encoder_hidden_states=torch.from_numpy(image_embeds), 175 | encoder_attention_mask=image_attention_mask, 176 | **generate_kwargs, 177 | ) 178 | return outputs 179 | 180 | class BlipVQA: 181 | def __init__(self, core, vision_model_path, text_encoder_path, text_decoder_path, device="CPU"): 182 | self.processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base") 183 | self.model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base") 184 | 185 | self.vision_model = self.model.vision_model 186 | self.text_encoder = self.model.text_encoder 187 | self.text_decoder = self.model.text_decoder 188 | 189 | self.VISION_MODEL_OV = Path(vision_model_path) 190 | self.TEXT_ENCODER_OV = Path(text_encoder_path) 191 | self.TEXT_DECODER_OV = Path(text_decoder_path) 192 | 193 | self.core = core 194 | self.device = device 195 | 196 | self.ov_vision_model = self.core.compile_model(self.VISION_MODEL_OV, self.device) 197 | self.ov_text_encoder = self.core.compile_model(self.TEXT_ENCODER_OV, self.device) 198 | self.ov_text_decoder_with_past = self.core.compile_model(self.TEXT_DECODER_OV, self.device) 199 | 200 | self.text_decoder.forward = partial(text_decoder_forward, ov_text_decoder_with_past=self.ov_text_decoder_with_past) 201 | 202 | self.ov_model = OVBlipModel(self.model.config, self.model.decoder_start_token_id, self.ov_vision_model, self.ov_text_encoder, self.text_decoder) 203 | 204 | def preprocess(self, image_path, question): 205 | raw_image = Image.open(image_path).convert("RGB") 206 | inputs = self.processor(raw_image, question, return_tensors="pt") 207 | return raw_image, inputs 208 | 209 | def generate_answer(self, image_path, question, max_length=20): 210 | raw_image, inputs = self.preprocess(image_path, question) 211 | start = time.perf_counter() 212 | out = self.ov_model.generate_answer(**inputs, max_length=max_length) 213 | end = time.perf_counter() - start 214 | answer = self.processor.decode(out[0], skip_special_tokens=True) 215 | return answer, end 216 | 217 | -------------------------------------------------------------------------------- 
/ros2_vlm/ros2_vlm/modules/checkpoints/README.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nilutpolkashyap/vlms_with_ros2_workshop/e3cbffcbd9377f374767df15b1f89d9b650c04e3/ros2_vlm/ros2_vlm/modules/checkpoints/README.md -------------------------------------------------------------------------------- /ros2_vlm/ros2_vlm/modules/groundedsam.py: -------------------------------------------------------------------------------- 1 | from pathlib import Path 2 | import sys 3 | import os 4 | import cv2 5 | import torch 6 | import numpy as np 7 | import openvino as ov 8 | from PIL import Image 9 | import supervision as sv 10 | import transformers 11 | from typing import Union, List 12 | from torchvision.transforms.functional import resize, InterpolationMode 13 | 14 | current_dir = os.getcwd() 15 | 16 | additional_path = '/src/vlms_with_ros2_workshop/ros2_vlm/ros2_vlm/modules' 17 | updated_path = os.path.join(current_dir + additional_path) 18 | 19 | sys.path.append(os.path.join(current_dir + '/src/vlms_with_ros2_workshop/ros2_vlm/ros2_vlm/modules/GroundingDINO')) 20 | sys.path.append(os.path.join(current_dir + '/src/vlms_with_ros2_workshop/ros2_vlm/ros2_vlm/modules/GroundingDINO')) 21 | sys.path.append(os.path.join(current_dir + '/src/vlms_with_ros2_workshop/ros2_vlm/ros2_vlm/modules/checkpoints')) 22 | sys.path.append(os.path.join(current_dir + '/src/vlms_with_ros2_workshop/ros2_vlm/ros2_vlm/modules/openvino_irs')) 23 | 24 | ground_dino_dir = updated_path / Path("GroundingDINO") 25 | efficient_sam_dir = updated_path / Path("EfficientSAM") 26 | 27 | from .GroundingDINO.groundingdino.models.GroundingDINO.bertwarper import generate_masks_with_special_tokens_and_transfer_map 28 | from .GroundingDINO.groundingdino.models import build_model 29 | from .GroundingDINO.groundingdino.util.slconfig import SLConfig 30 | from .GroundingDINO.groundingdino.util.utils import clean_state_dict 31 | from .GroundingDINO.groundingdino.util import get_tokenlizer 32 | from .GroundingDINO.groundingdino.util.utils import get_phrases_from_posmap 33 | from .GroundingDINO.groundingdino.util.inference import Model 34 | from .GroundingDINO.groundingdino.datasets import transforms as T 35 | 36 | class ImageProcessor: 37 | def __init__(self, core, device): 38 | self.core = core 39 | self.irs_path = Path("openvino_irs") 40 | self.ov_dino_name = "openvino_grounding_dino" 41 | self.ov_dino_path = updated_path / self.irs_path / f"{self.ov_dino_name}.xml" 42 | 43 | self.ov_dino_model = self.core.read_model(self.ov_dino_path) 44 | self.device = device #"AUTO" 45 | 46 | self.ground_dino_img_size = (1024, 1280) 47 | 48 | self.pt_device = "cpu" 49 | self.ckpt_base_path = Path("checkpoints") 50 | 51 | self.grounding_dino_config_path = f"{ground_dino_dir}/groundingdino/config/GroundingDINO_SwinT_OGC.py" 52 | self.grounding_dino_checkpoint_path = updated_path / self.ckpt_base_path / "groundingdino_swint_ogc.pth" 53 | 54 | self.ov_compiled_grounded_dino = self.core.compile_model(self.ov_dino_model, self.device) 55 | 56 | self.box_threshold = 0.3 57 | self.text_threshold = 0.25 58 | 59 | self.ov_efficient_sam_name = "openvino_efficient_sam" 60 | self.ov_efficient_sam_path = updated_path / self.irs_path / f"{self.ov_efficient_sam_name}.xml" 61 | 62 | self.ov_efficient_sam = core.read_model(self.ov_efficient_sam_path) 63 | 64 | self.ov_compiled_efficient_sam = core.compile_model(self.ov_efficient_sam, device_name=self.device) 65 | 66 | self.model, self.max_text_len, 
self.dino_tokenizer, *_ = self.load_pt_grounding_dino(self.grounding_dino_config_path, self.grounding_dino_checkpoint_path) 67 | 68 | def sig(self, x): 69 | return 1 / (1 + np.exp(-x)) 70 | 71 | def load_pt_grounding_dino(self, model_config_path, model_checkpoint_path): 72 | args = SLConfig.fromfile(model_config_path) 73 | 74 | args.device = self.pt_device 75 | args.use_checkpoint = False 76 | args.use_transformer_ckpt = False 77 | 78 | model = build_model(args) 79 | checkpoint = torch.load(model_checkpoint_path, map_location=self.pt_device) 80 | model.load_state_dict(clean_state_dict(checkpoint["model"]), strict=False) 81 | _ = model.eval() 82 | 83 | return ( 84 | model, 85 | args.max_text_len, 86 | get_tokenlizer.get_tokenlizer(args.text_encoder_type), 87 | ) 88 | 89 | def transform_image(self, pil_image: Image.Image) -> torch.Tensor: 90 | 91 | transform = T.Compose( 92 | [ 93 | T.RandomResize([800], max_size=1333), 94 | T.ToTensor(), 95 | T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]), 96 | ] 97 | ) 98 | image, _ = transform(pil_image, None) # 3, h, w 99 | return image 100 | 101 | def get_ov_grounding_output( 102 | self, 103 | model: ov.CompiledModel, 104 | pil_image: Image.Image, 105 | caption: Union[str, List[str]], 106 | box_threshold: float, 107 | text_threshold: float, 108 | dino_tokenizer: transformers.PreTrainedTokenizerBase, 109 | max_text_len: int) -> (torch.Tensor, List[str], torch.Tensor): 110 | 111 | if isinstance(caption, list): 112 | caption = ". ".join(caption) 113 | caption = caption.lower() 114 | caption = caption.strip() 115 | if not caption.endswith("."): 116 | caption = caption + "." 117 | captions = [caption] 118 | 119 | tokenized = dino_tokenizer(captions, padding="longest", return_tensors="pt") 120 | specical_tokens = dino_tokenizer.convert_tokens_to_ids(["[CLS]", "[SEP]", ".", "?"]) 121 | 122 | ( 123 | text_self_attention_masks, 124 | position_ids, 125 | cate_to_token_mask_list, 126 | ) = generate_masks_with_special_tokens_and_transfer_map(tokenized, specical_tokens, dino_tokenizer) 127 | 128 | if text_self_attention_masks.shape[1] > max_text_len: 129 | text_self_attention_masks = text_self_attention_masks[:, :max_text_len, :max_text_len] 130 | 131 | position_ids = position_ids[:, :max_text_len] 132 | tokenized["input_ids"] = tokenized["input_ids"][:, :max_text_len] 133 | tokenized["attention_mask"] = tokenized["attention_mask"][:, :max_text_len] 134 | tokenized["token_type_ids"] = tokenized["token_type_ids"][:, :max_text_len] 135 | 136 | inputs = {} 137 | inputs["attention_mask.1"] = tokenized["attention_mask"] 138 | inputs["text_self_attention_masks"] = text_self_attention_masks 139 | inputs["input_ids"] = tokenized["input_ids"] 140 | inputs["position_ids"] = position_ids 141 | inputs["token_type_ids"] = tokenized["token_type_ids"] 142 | 143 | 144 | input_img = resize( 145 | self.transform_image(pil_image), 146 | self.ground_dino_img_size, 147 | interpolation=InterpolationMode.BICUBIC, 148 | )[None, ...] 
149 | inputs["samples"] = input_img 150 | 151 | request = model.create_infer_request() 152 | request.start_async(inputs, share_inputs=False) 153 | request.wait() 154 | 155 | logits = torch.from_numpy(self.sig(np.squeeze(request.get_tensor("pred_logits").data, 0))) 156 | boxes = torch.from_numpy(np.squeeze(request.get_tensor("pred_boxes").data, 0)) 157 | 158 | filt_mask = logits.max(dim=1)[0] > box_threshold 159 | logits, boxes = logits[filt_mask], boxes[filt_mask] 160 | 161 | tokenized = dino_tokenizer(caption) 162 | pred_phrases = [] 163 | for logit in logits: 164 | pred_phrase = get_phrases_from_posmap(logit > text_threshold, tokenized, dino_tokenizer) 165 | pred_phrases.append(pred_phrase + f"({str(logit.max().item())[:4]})") 166 | 167 | return boxes, pred_phrases, logits.max(dim=1)[0] 168 | 169 | def predict_efficient_sam_mask(self, compiled_efficient_sam: ov.CompiledModel, image: Image.Image, bbox: torch.Tensor): 170 | input_size = 1024 171 | w, h = image.size[:2] 172 | scale = input_size / max(w, h) 173 | new_w = int(w * scale) 174 | new_h = int(h * scale) 175 | image = image.resize((new_w, new_h)) 176 | 177 | numpy_image = np.array(image, dtype=np.float32) / 255.0 178 | numpy_image = np.transpose(numpy_image, (2, 0, 1))[None, ...] 179 | 180 | scaled_points = bbox * scale 181 | 182 | bounding_box = scaled_points.reshape([1, 1, 2, 2]) 183 | bbox_labels = np.reshape(np.array([2, 3]), [1, 1, 2]) 184 | 185 | res = compiled_efficient_sam((numpy_image, bounding_box, bbox_labels)) 186 | 187 | predicted_logits, predicted_iou = res[0], res[1] 188 | 189 | all_masks = torch.ge(torch.sigmoid(torch.from_numpy(predicted_logits[0, 0, :, :, :])), 0.5).numpy() 190 | predicted_iou = predicted_iou[0, 0, ...] 191 | 192 | max_predicted_iou = -1 193 | selected_mask_using_predicted_iou = None 194 | for m in range(all_masks.shape[0]): 195 | curr_predicted_iou = predicted_iou[m] 196 | if curr_predicted_iou > max_predicted_iou or selected_mask_using_predicted_iou is None: 197 | max_predicted_iou = curr_predicted_iou 198 | selected_mask_using_predicted_iou = all_masks[m] 199 | return selected_mask_using_predicted_iou 200 | 201 | def predict_efficient_sam_masks(self, compiled_efficient_sam: ov.CompiledModel, pil_image: Image.Image, transformed_boxes) -> torch.Tensor: 202 | masks = [] 203 | for bbox in transformed_boxes: 204 | mask = self.predict_efficient_sam_mask(compiled_efficient_sam, pil_image, bbox) 205 | mask = Image.fromarray(mask).resize(pil_image.size) 206 | masks.append(np.array(mask)) 207 | masks = torch.from_numpy(np.array(masks)) 208 | return masks 209 | 210 | def process_image(self, pil_image: Image.Image, classes_prompt: List[str], use_segment: bool = False) -> np.ndarray: 211 | boxes, pred_phrases, logits = self.get_ov_grounding_output(self.ov_compiled_grounded_dino, pil_image, classes_prompt, self.box_threshold, self.text_threshold, self.dino_tokenizer, self.max_text_len) 212 | 213 | source_w, source_h = pil_image.size 214 | detections = Model.post_process_result(source_h=source_h, source_w=source_w, boxes=boxes, logits=logits) 215 | 216 | class_id = Model.phrases2classes(phrases=pred_phrases, classes=list(map(str.lower, classes_prompt))) 217 | detections.class_id = class_id 218 | 219 | if use_segment: 220 | masks = self.predict_efficient_sam_masks(self.ov_compiled_efficient_sam, pil_image, detections.xyxy) 221 | detections.mask = masks.numpy() 222 | 223 | box_annotator = sv.BoxAnnotator() 224 | mask_annotator = sv.MaskAnnotator() 225 | 226 | labels = [f"{classes_prompt[class_id] if class_id is 
not None else 'None'} {confidence:0.2f}" for _, _, confidence, class_id, _, _ in detections] 227 | 228 | annotated_image = np.array(pil_image) 229 | annotated_image = mask_annotator.annotate(scene=np.array(pil_image).copy(), detections=detections) 230 | mask_annotated_image = box_annotator.annotate(scene=annotated_image, detections=detections, labels=labels) 231 | 232 | annotated_frame_bgr = cv2.cvtColor(mask_annotated_image, cv2.COLOR_RGB2BGR) 233 | else: 234 | box_annotator = sv.BoxAnnotator() 235 | box_annotated_image = box_annotator.annotate(scene=np.array(pil_image).copy(), detections=detections) 236 | 237 | annotated_frame_bgr = cv2.cvtColor(box_annotated_image, cv2.COLOR_RGB2BGR) 238 | 239 | return annotated_frame_bgr 240 | 241 | -------------------------------------------------------------------------------- /ros2_vlm/ros2_vlm/modules/openvino_irs/README.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nilutpolkashyap/vlms_with_ros2_workshop/e3cbffcbd9377f374767df15b1f89d9b650c04e3/ros2_vlm/ros2_vlm/modules/openvino_irs/README.md -------------------------------------------------------------------------------- /ros2_vlm/setup.cfg: -------------------------------------------------------------------------------- 1 | [develop] 2 | script_dir=$base/lib/ros2_vlm 3 | [install] 4 | install_scripts=$base/lib/ros2_vlm 5 | -------------------------------------------------------------------------------- /ros2_vlm/setup.py: -------------------------------------------------------------------------------- 1 | from setuptools import setup 2 | import os 3 | from glob import glob 4 | 5 | package_name = 'ros2_vlm' 6 | modules = "ros2_vlm/modules" 7 | 8 | setup( 9 | name=package_name, 10 | version='0.0.0', 11 | packages=[package_name, modules], 12 | data_files=[ 13 | ('share/ament_index/resource_index/packages', 14 | ['resource/' + package_name]), 15 | ('share/' + package_name, ['package.xml']), 16 | ], 17 | install_requires=['setuptools'], 18 | zip_safe=True, 19 | maintainer='Nilutpol Kashyap', 20 | maintainer_email='nilutpolkashyap@todo.todo', 21 | description='Application of Vision Language Models with ROS 2 workshop', 22 | license='MIT License', 23 | tests_require=['pytest'], 24 | entry_points={ 25 | 'console_scripts': [ 26 | 'grounded_sam = ros2_vlm.grounded_sam:main', 27 | 'blip_visual_qna = ros2_vlm.blip_visual_qna:main', 28 | ], 29 | }, 30 | ) 31 | -------------------------------------------------------------------------------- /ros2_vlm/test/test_copyright.py: -------------------------------------------------------------------------------- 1 | # Copyright 2015 Open Source Robotics Foundation, Inc. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 
14 | 15 | from ament_copyright.main import main 16 | import pytest 17 | 18 | 19 | # Remove the `skip` decorator once the source file(s) have a copyright header 20 | @pytest.mark.skip(reason='No copyright header has been placed in the generated source file.') 21 | @pytest.mark.copyright 22 | @pytest.mark.linter 23 | def test_copyright(): 24 | rc = main(argv=['.', 'test']) 25 | assert rc == 0, 'Found errors' 26 | -------------------------------------------------------------------------------- /ros2_vlm/test/test_flake8.py: -------------------------------------------------------------------------------- 1 | # Copyright 2017 Open Source Robotics Foundation, Inc. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | 15 | from ament_flake8.main import main_with_errors 16 | import pytest 17 | 18 | 19 | @pytest.mark.flake8 20 | @pytest.mark.linter 21 | def test_flake8(): 22 | rc, errors = main_with_errors(argv=[]) 23 | assert rc == 0, \ 24 | 'Found %d code style errors / warnings:\n' % len(errors) + \ 25 | '\n'.join(errors) 26 | -------------------------------------------------------------------------------- /ros2_vlm/test/test_pep257.py: -------------------------------------------------------------------------------- 1 | # Copyright 2015 Open Source Robotics Foundation, Inc. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | 15 | from ament_pep257.main import main 16 | import pytest 17 | 18 | 19 | @pytest.mark.linter 20 | @pytest.mark.pep257 21 | def test_pep257(): 22 | rc = main(argv=['.', 'test']) 23 | assert rc == 0, 'Found code style errors / warnings' 24 | --------------------------------------------------------------------------------