├── .gitignore ├── LICENSE ├── README.md ├── assets ├── demo1.png └── demo2.png ├── download_weights.py └── ros2_vlm ├── package.xml ├── resource └── ros2_vlm ├── ros2_vlm ├── __init__.py ├── blip_visual_qna.py ├── grounded_sam.py └── modules │ ├── blipvisual.py │ ├── checkpoints │ └── README.md │ ├── groundedsam.py │ └── openvino_irs │ └── README.md ├── setup.cfg ├── setup.py └── test ├── test_copyright.py ├── test_flake8.py └── test_pep257.py /.gitignore: -------------------------------------------------------------------------------- 1 | devel/ 2 | logs/ 3 | build/ 4 | bin/ 5 | lib/ 6 | msg_gen/ 7 | srv_gen/ 8 | msg/*Action.msg 9 | msg/*ActionFeedback.msg 10 | msg/*ActionGoal.msg 11 | msg/*ActionResult.msg 12 | msg/*Feedback.msg 13 | msg/*Goal.msg 14 | msg/*Result.msg 15 | msg/_*.py 16 | build_isolated/ 17 | devel_isolated/ 18 | 19 | # Generated by dynamic reconfigure 20 | *.cfgc 21 | /cfg/cpp/ 22 | /cfg/*.py 23 | 24 | # Ignore generated docs 25 | *.dox 26 | *.wikidoc 27 | 28 | # eclipse stuff 29 | .project 30 | .cproject 31 | 32 | # qcreator stuff 33 | CMakeLists.txt.user 34 | 35 | srv/_*.py 36 | *.pcd 37 | *.pyc 38 | qtcreator-* 39 | *.user 40 | 41 | /planning/cfg 42 | /planning/docs 43 | /planning/src 44 | 45 | *~ 46 | 47 | # Emacs 48 | .#* 49 | 50 | # Catkin custom files 51 | CATKIN_IGNORE 52 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2024 Nilutpol Kashyap 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Application of Vision Language Models with ROS 2 workshop 2 | 3 | We explore how robots can perceive and understand their environment through the powerful combination of image understanding and natural language processing. This repository dives deep into the fascinating world of **vision-language models** for **robotics applications**, specifically utilizing the powerful **Intel OpenVINO Toolkit**. 
4 | 
5 | This repository is presented as a **workshop** at the **ROS meetup Lagos**.
6 | 
7 | ## Prerequisites
8 | - **Ubuntu 22.04 or newer**
9 | - **ROS 2 Humble or newer**
10 | - **Python 3**
11 | - **Intel OpenVINO toolkit**
12 | 
13 | ## Hardware Requirements
14 | Please note that to run the code in this repository, you will need a device compatible with the **[Intel OpenVINO Toolkit](https://docs.openvino.ai/2024/home.html)**. This typically includes **Intel CPUs**, Intel Neural Compute Sticks, or other Intel hardware supporting OpenVINO.
15 | 
16 | ### Create a ROS 2 colcon workspace
17 | 
18 | ```
19 | mkdir -p ~/ros2_ws/src
20 | ```
21 | 
22 | ### Create and set up a Python virtual environment
23 | ```
24 | cd ~/ros2_ws
25 | 
26 | virtualenv -p python3 ./vlm-venv
27 | source ./vlm-venv/bin/activate
28 | 
29 | # Make sure that colcon doesn't try to build the venv
30 | touch ./vlm-venv/COLCON_IGNORE
31 | ```
32 | 
33 | ### Install Python dependencies
34 | ```
35 | pip install timm --extra-index-url https://download.pytorch.org/whl/cpu  # timm pulls in torch from the CPU-only index
36 | 
37 | pip install "openvino>=2024.1" "torch>=2.1" opencv-python supervision transformers yapf pycocotools addict "gradio>=4.19" tqdm
38 | ```
39 | 
40 | ### Add your Python virtual environment package path
41 | **Make sure to replace <> with your system username.**
42 | ```
43 | export PYTHONPATH='/home/<>/ros2_ws/vlm-venv/lib/python3.10/site-packages'
44 | ```
45 | 
46 | ### Clone this repository inside the 'src' folder of your workspace
47 | ```
48 | cd ~/ros2_ws/src
49 | 
50 | git clone https://github.com/nilutpolkashyap/vlms_with_ros2_workshop.git
51 | ```
52 | 
53 | ### Download weights and required packages
54 | ```
55 | cd ~/ros2_ws/src/vlms_with_ros2_workshop
56 | 
57 | python3 download_weights.py
58 | ```
59 | 
60 | ### Download OpenVINO IR models from Google Drive
61 | 
62 | Download the zip file from the Google Drive [link here](https://drive.google.com/file/d/1yCpUEWr3KR76uVWgtHNreIdlYh59z0dS/view?usp=sharing).
63 | 
64 | Unzip it and place the contents inside the **'openvino_irs'** directory at the following path:
65 | 
66 | ```
67 | ~/ros2_ws/src/vlms_with_ros2_workshop/ros2_vlm/ros2_vlm/modules/openvino_irs
68 | ```
69 | 
70 | ### Build and source the workspace
71 | ```
72 | cd ~/ros2_ws
73 | colcon build --symlink-install
74 | 
75 | source ~/ros2_ws/install/setup.bash
76 | ```
77 | 
78 | ## Object detection and masking with GroundedSAM (GroundingDINO + SAM)
79 | 
80 | **GroundedSAM** tackles **object detection** and **segmentation**. It integrates various open-world models, allowing a robot not just to **detect** objects but also to **understand** their specific regions. This can empower robots to act on specific parts (e.g., grasping a cup's handle) based on textual instructions or visual cues.
81 | 
82 | <div align="center">
83 | GroundedSAM

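Before launching the node below, you can sanity-check the detector on a single image. The snippet is a minimal sketch (not part of the workshop nodes): it reuses the same `ImageProcessor` class that the `grounded_sam` node wraps, and it assumes the workspace has been built and sourced, the weights and OpenVINO IRs are in place, and the script is run from `~/ros2_ws` so the module's relative `checkpoints` and `openvino_irs` paths resolve. The image path and prompt list are only examples.

```
import cv2
import openvino as ov
from PIL import Image

from ros2_vlm.modules.groundedsam import ImageProcessor

core = ov.Core()
processor = ImageProcessor(core, "CPU")  # same class used by the grounded_sam node

# Load any test image and convert BGR (OpenCV) -> RGB (PIL)
frame = cv2.imread("src/vlms_with_ros2_workshop/assets/demo1.png")
pil_image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))

# Detect the prompted classes; pass use_segment=True to also generate EfficientSAM masks
annotated_bgr = processor.process_image(pil_image, ["person", "hair"], use_segment=False)
cv2.imwrite("grounded_sam_result.png", annotated_bgr)
```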
84 | 
85 | ### Run the GroundedSAM node
86 | ```
87 | ros2 run ros2_vlm grounded_sam --ros-args -p device:='CPU' -p video_source:=/dev/video2 -p isSegment:=False -p detectionList:='["eyes", "person", "hair"]'
88 | ```
89 | ### **ROS 2 CLI Arguments**
90 | - device - Inference device (e.g. CPU, GPU, NPU)
91 | - video_source - Video source to grab image frames from
92 | - isSegment - Whether to also run the EfficientSAM segmentation model (True/False)
93 | - detectionList - List of object classes to detect
94 | 
95 | Check out more in the [GroundedSAM OpenVINO Notebook](https://github.com/openvinotoolkit/openvino_notebooks/blob/latest/notebooks/grounded-segment-anything/grounded-segment-anything.ipynb)
96 | 
97 | ## Visual Question Answering using BLIP
98 | 
99 | **BLIP** bridges the gap between vision and language. It analyzes images and extracts meaningful information, generating captions that describe the scene or answering questions about it. This lets robots not only **"see"** their environment but also **understand** its context and **respond** to natural language instructions effectively.
100 | 
101 | <div align="center">
102 | Visual Question Answering

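The node below is a thin wrapper around the `BlipVQA` helper in `ros2_vlm/modules/blipvisual.py`, so the same pipeline can also be called directly, which is a quick way to confirm the BLIP IR files were placed correctly. A minimal sketch, assuming the workspace is sourced, the script is run from `~/ros2_ws`, and the paths below match your setup (the first run also downloads the BLIP processor and base model from Hugging Face):

```
from pathlib import Path

import openvino as ov

from ros2_vlm.modules.blipvisual import BlipVQA

core = ov.Core()
ir_dir = Path("src/vlms_with_ros2_workshop/ros2_vlm/ros2_vlm/modules/openvino_irs")

# Compile the three BLIP IRs (vision model, text encoder, text decoder with KV cache)
vqa = BlipVQA(
    core,
    ir_dir / "blip_vision_model.xml",
    ir_dir / "blip_text_encoder.xml",
    ir_dir / "blip_text_decoder_with_past.xml",
    device="CPU",
)

answer, seconds = vqa.generate_answer(
    "src/vlms_with_ros2_workshop/assets/demo2.png",
    "What is in the image?",
    max_length=20,
)
print(f"Answer: {answer} (inference time: {seconds:.2f} s)")
```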
103 | 
104 | ### Run the Blip Visual QnA node
105 | ```
106 | ros2 run ros2_vlm blip_visual_qna --ros-args -p device:="GPU.0" -p question:="What is in the image?" -p image_path:="/home/nilutpol/ai_ws/src/blip_qna_code/demo2.jpg"
107 | ```
108 | ### **ROS 2 CLI Arguments**
109 | - device - Inference device (e.g. CPU, GPU, NPU)
110 | - question - Question for the BLIP model
111 | - image_path - Path to the input image
112 | 
113 | Check out more in the [BLIP Visual Question Answering OpenVINO Notebook](https://github.com/openvinotoolkit/openvino_notebooks/blob/latest/notebooks/blip-visual-language-processing/blip-visual-language-processing.ipynb)
114 | 
115 | ## Resources
116 | - [Using Python Packages with ROS 2](https://docs.ros.org/en/humble/How-To-Guides/Using-Python-Packages.html)
117 | - [How to use (python) virtual environments with ROS2?](https://answers.ros.org/question/371083/how-to-use-python-virtual-environments-with-ros2/)
119 | - [Intel OpenVINO Toolkit](https://docs.openvino.ai/2024/home.html)
120 | - [OpenVINO Notebooks](https://github.com/openvinotoolkit/openvino_notebooks)
--------------------------------------------------------------------------------
/assets/demo1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nilutpolkashyap/vlms_with_ros2_workshop/e3cbffcbd9377f374767df15b1f89d9b650c04e3/assets/demo1.png
--------------------------------------------------------------------------------
/assets/demo2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nilutpolkashyap/vlms_with_ros2_workshop/e3cbffcbd9377f374767df15b1f89d9b650c04e3/assets/demo2.png
--------------------------------------------------------------------------------
/download_weights.py:
--------------------------------------------------------------------------------
1 | from pathlib import Path
2 | import os
3 | import urllib.request
4 | import subprocess
5 | 
6 | current_dir = os.getcwd()
7 | 
8 | grounding_dino_checkpoint_base_path = os.path.join(current_dir, 'ros2_vlm', 'ros2_vlm', 'modules', 'checkpoints')
9 | grounding_dino_checkpoint_name = "groundingdino_swint_ogc.pth"
10 | file_path = os.path.join(grounding_dino_checkpoint_base_path, grounding_dino_checkpoint_name)
11 | grounding_dino_checkpoint_link = "https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth"
12 | 
13 | # Download the GroundingDINO checkpoint only if it is not already present
14 | if not os.path.exists(file_path):
15 |     print(f"Downloading {grounding_dino_checkpoint_name}...")
16 |     # Fetch the checkpoint from the GroundingDINO release assets
17 |     urllib.request.urlretrieve(grounding_dino_checkpoint_link, file_path)
18 |     print("Download complete!")
19 | else:
20 |     print(f"File {grounding_dino_checkpoint_name} already exists.")
21 | # Clone the OpenVINO-compatible GroundingDINO fork and EfficientSAM into the modules directory
22 | ground_dino_dir = os.path.join(current_dir, 'ros2_vlm', 'ros2_vlm', 'modules', 'GroundingDINO')
23 | subprocess.run(["git", "clone", "https://github.com/wenyi5608/GroundingDINO/", ground_dino_dir])
24 | 
25 | efficient_sam_dir = os.path.join(current_dir, 'ros2_vlm', 'ros2_vlm', 'modules', 'EfficientSAM')
26 | subprocess.run(["git", "clone", "https://github.com/yformer/EfficientSAM/", efficient_sam_dir])
27 | 
28 | 
29 | 
30 | 
--------------------------------------------------------------------------------
/ros2_vlm/package.xml:
--------------------------------------------------------------------------------
1 | <?xml version="1.0"?>
2 | <?xml-model href="http://download.ros.org/schema/package_format3.xsd" schematypens="http://www.w3.org/2001/XMLSchema"?>
3 | <package format="3">
4 |   <name>ros2_vlm</name>
5 |   <version>0.0.0</version>
6 |   <description>ROS 2 OpenCV Python Package</description>
7 |   <maintainer email="nilutpolkashyap@todo.todo">Nilutpol Kashyap</maintainer>
8 |   <license>MIT License</license>
9 | 
10 |   <depend>rclpy</depend>
11 |   <depend>
cv_bridge</depend>
12 |   <depend>sensor_msgs</depend>
13 | 
14 |   <test_depend>ament_copyright</test_depend>
15 |   <test_depend>ament_flake8</test_depend>
16 |   <test_depend>ament_pep257</test_depend>
17 |   <test_depend>python3-pytest</test_depend>
18 | 
19 |   <export>
20 |     <build_type>ament_python</build_type>
21 |   </export>
22 | </package>
23 | 
--------------------------------------------------------------------------------
/ros2_vlm/resource/ros2_vlm:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nilutpolkashyap/vlms_with_ros2_workshop/e3cbffcbd9377f374767df15b1f89d9b650c04e3/ros2_vlm/resource/ros2_vlm
--------------------------------------------------------------------------------
/ros2_vlm/ros2_vlm/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nilutpolkashyap/vlms_with_ros2_workshop/e3cbffcbd9377f374767df15b1f89d9b650c04e3/ros2_vlm/ros2_vlm/__init__.py
--------------------------------------------------------------------------------
/ros2_vlm/ros2_vlm/blip_visual_qna.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 | import rclpy
3 | from rclpy.node import Node
4 | from PIL import Image
5 | from .modules.blipvisual import BlipVQA
6 | import openvino as ov
7 | from pathlib import Path
8 | import sys
9 | import os
10 | 
11 | current_dir = os.getcwd()
12 | additional_path = '/src/vlms_with_ros2_workshop/ros2_vlm/ros2_vlm/modules'  # modules folder of the cloned workshop repo, relative to the workspace root
13 | updated_path = os.path.join(current_dir + additional_path)
14 | 
15 | class QuestionImageProcessor(Node):
16 |     def __init__(self):
17 |         super().__init__('question_image_processor_node')
18 | 
19 |         self.core = ov.Core()
20 | 
21 |         self.declare_parameter('device', 'GPU.0')
22 |         self.declare_parameter('question', 'What is in the image?')
23 |         self.declare_parameter('image_path', '')
24 | 
25 |         self.device = self.get_parameter('device').get_parameter_value().string_value
26 |         self.question = self.get_parameter('question').get_parameter_value().string_value
27 |         self.image_path = self.get_parameter('image_path').get_parameter_value().string_value
28 | 
29 |         if not self.image_path:
30 |             self.get_logger().error('image_path parameter is required.')
31 |             return
32 | 
33 |         self.get_logger().info(f'Inference Device: {self.device}')
34 |         self.get_logger().info(f'Question: {self.question}')
35 |         self.get_logger().info(f'Image Path: {self.image_path}')
36 | 
37 |         self.irs_path = Path("openvino_irs")
38 | 
39 |         self.vision_model_path = updated_path / self.irs_path / "blip_vision_model.xml"
40 |         self.text_encoder_path = updated_path / self.irs_path / "blip_text_encoder.xml"
41 |         self.text_decoder_path = updated_path / self.irs_path / "blip_text_decoder_with_past.xml"
42 | 
43 |         self.blip_inference = BlipVQA(self.core, self.vision_model_path, self.text_encoder_path, self.text_decoder_path, self.device)
44 | 
45 |         try:
46 |             answer, inference_time = self.blip_inference.generate_answer(self.image_path, self.question, max_length=20)
47 | 
48 |             self.get_logger().info(f"Generated Answer: {answer}")
49 |             self.get_logger().info(f"Inference Time: {inference_time} seconds")
50 |         except Exception as e:
51 |             self.get_logger().error(f'Inference failed: {e}')
52 | 
53 | def main(args=None):
54 |     rclpy.init(args=args)
55 |     node = QuestionImageProcessor()
56 | 
57 |     if rclpy.ok():
58 |         rclpy.spin(node)
59 | 
60 |     node.destroy_node()
61 |     rclpy.shutdown()
62 | 
63 | if __name__ == '__main__':
64 |     main()
65 | 
--------------------------------------------------------------------------------
/ros2_vlm/ros2_vlm/grounded_sam.py:
--------------------------------------------------------------------------------
1 | import rclpy
2 | from
rclpy.node import Node 3 | import cv2 4 | import torch 5 | import openvino as ov 6 | from .modules.groundedsam import ImageProcessor 7 | from PIL import Image 8 | 9 | class VideoDisplayNode(Node): 10 | def __init__(self): 11 | super().__init__('video_display_node') 12 | 13 | self.declare_parameter('video_source', '/dev/video0') 14 | self.declare_parameter('isSegment', False) 15 | self.declare_parameter('detectionList', ['person']) 16 | 17 | self.video_source = self.get_parameter('video_source').value 18 | self.isSegment = self.get_parameter('isSegment').value 19 | self.detectionList = self.get_parameter('detectionList').value 20 | 21 | self.get_logger().info(f'Video source: {self.video_source}') 22 | self.get_logger().info(f'isSegment: {self.isSegment}') 23 | self.get_logger().info('My list:') 24 | for item in self.detectionList: 25 | self.get_logger().info(f' - {item}') 26 | 27 | self.cap = cv2.VideoCapture(self.video_source) 28 | 29 | self.core = ov.Core() 30 | self.device = "CPU" #"GPU.0" 31 | self.processor = ImageProcessor(self.core, self.device) 32 | 33 | def __del__(self): 34 | self.cap.release() 35 | 36 | def run(self): 37 | ret, frame = self.cap.read() 38 | if ret: 39 | self.pil_image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)) 40 | self.processed_image = self.processor.process_image(self.pil_image, self.detectionList, self.isSegment) 41 | cv2.imshow('Video Frame', self.processed_image) 42 | cv2.waitKey(0) 43 | 44 | def main(args=None): 45 | rclpy.init(args=args) 46 | node = VideoDisplayNode() 47 | node.run() 48 | rclpy.shutdown() 49 | 50 | if __name__ == '__main__': 51 | main() 52 | -------------------------------------------------------------------------------- /ros2_vlm/ros2_vlm/modules/blipvisual.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from pathlib import Path 3 | import openvino as ov 4 | import time 5 | from PIL import Image 6 | from transformers import BlipProcessor, BlipForQuestionAnswering 7 | from functools import partial 8 | import torch 9 | import numpy as np 10 | from typing import List, Dict 11 | from transformers.modeling_outputs import CausalLMOutputWithCrossAttentions 12 | 13 | def init_past_inputs(model_inputs: List): 14 | """ 15 | Helper function for initialization of past inputs on first inference step 16 | Parameters: 17 | model_inputs (List): list of model inputs 18 | Returns: 19 | pkv (List[ov.Tensor]): list of filled past key values 20 | """ 21 | pkv = [] 22 | for input_tensor in model_inputs[4:]: 23 | partial_shape = input_tensor.partial_shape 24 | partial_shape[0] = 1 25 | partial_shape[2] = 0 26 | pkv.append(ov.Tensor(ov.Type.f32, partial_shape.get_shape())) 27 | return pkv 28 | 29 | 30 | def postprocess_text_decoder_outputs(output: Dict): 31 | """ 32 | Helper function for rearranging model outputs and wrapping to CausalLMOutputWithCrossAttentions 33 | Parameters: 34 | output (Dict): dictionary with model output 35 | Returns 36 | wrapped_outputs (CausalLMOutputWithCrossAttentions): outputs wrapped to CausalLMOutputWithCrossAttentions format 37 | """ 38 | logits = torch.from_numpy(output[0]) 39 | past_kv = list(output.values())[1:] 40 | return CausalLMOutputWithCrossAttentions( 41 | loss=None, 42 | logits=logits, 43 | past_key_values=past_kv, 44 | hidden_states=None, 45 | attentions=None, 46 | cross_attentions=None, 47 | ) 48 | 49 | def text_decoder_forward( 50 | ov_text_decoder_with_past: ov.CompiledModel, 51 | input_ids: torch.Tensor, 52 | attention_mask: torch.Tensor, 53 | 
past_key_values: List[ov.Tensor], 54 | encoder_hidden_states: torch.Tensor, 55 | encoder_attention_mask: torch.Tensor, 56 | **kwargs 57 | ): 58 | """ 59 | Inference function for text_decoder in one generation step 60 | Parameters: 61 | input_ids (torch.Tensor): input token ids 62 | attention_mask (torch.Tensor): attention mask for input token ids 63 | past_key_values (List[ov.Tensor] list of cached decoder hidden states from previous step 64 | encoder_hidden_states (torch.Tensor): encoder (vision or text) hidden states 65 | encoder_attention_mask (torch.Tensor): attnetion mask for encoder hidden states 66 | Returns 67 | model outputs (CausalLMOutputWithCrossAttentions): model prediction wrapped to CausalLMOutputWithCrossAttentions class including predicted logits and hidden states for caching 68 | """ 69 | inputs = [input_ids, attention_mask, encoder_hidden_states, encoder_attention_mask] 70 | if past_key_values is None: 71 | inputs.extend(init_past_inputs(ov_text_decoder_with_past.inputs)) 72 | else: 73 | inputs.extend(past_key_values) 74 | outputs = ov_text_decoder_with_past(inputs) 75 | return postprocess_text_decoder_outputs(outputs) 76 | 77 | 78 | class OVBlipModel: 79 | """ 80 | Model class for inference BLIP model with OpenVINO 81 | """ 82 | 83 | def __init__( 84 | self, 85 | config, 86 | decoder_start_token_id: int, 87 | vision_model, 88 | text_encoder, 89 | text_decoder, 90 | ): 91 | """ 92 | Initialization class parameters 93 | """ 94 | self.vision_model = vision_model 95 | self.vision_model_out = vision_model.output(0) 96 | self.text_encoder = text_encoder 97 | self.text_encoder_out = text_encoder.output(0) 98 | self.text_decoder = text_decoder 99 | self.config = config 100 | self.decoder_start_token_id = decoder_start_token_id 101 | self.decoder_input_ids = config.text_config.bos_token_id 102 | 103 | def generate_answer(self, pixel_values: torch.Tensor, input_ids: torch.Tensor, attention_mask: torch.Tensor, **generate_kwargs): 104 | """ 105 | Visual Question Answering prediction 106 | Parameters: 107 | pixel_values (torch.Tensor): preprocessed image pixel values 108 | input_ids (torch.Tensor): question token ids after tokenization 109 | attention_mask (torch.Tensor): attention mask for question tokens 110 | Retruns: 111 | generation output (torch.Tensor): tensor which represents sequence of generated answer token ids 112 | """ 113 | image_embed = self.vision_model(pixel_values.detach().numpy())[self.vision_model_out] 114 | image_attention_mask = np.ones(image_embed.shape[:-1], dtype=int) 115 | if isinstance(input_ids, list): 116 | input_ids = torch.LongTensor(input_ids) 117 | question_embeds = self.text_encoder( 118 | [ 119 | input_ids.detach().numpy(), 120 | attention_mask.detach().numpy(), 121 | image_embed, 122 | image_attention_mask, 123 | ] 124 | )[self.text_encoder_out] 125 | question_attention_mask = np.ones(question_embeds.shape[:-1], dtype=int) 126 | 127 | bos_ids = np.full((question_embeds.shape[0], 1), fill_value=self.decoder_start_token_id) 128 | 129 | outputs = self.text_decoder.generate( 130 | input_ids=torch.from_numpy(bos_ids), 131 | eos_token_id=self.config.text_config.sep_token_id, 132 | pad_token_id=self.config.text_config.pad_token_id, 133 | encoder_hidden_states=torch.from_numpy(question_embeds), 134 | encoder_attention_mask=torch.from_numpy(question_attention_mask), 135 | **generate_kwargs, 136 | ) 137 | return outputs 138 | 139 | def generate_caption(self, pixel_values: torch.Tensor, input_ids: torch.Tensor = None, attention_mask: torch.Tensor = None, 
**generate_kwargs): 140 | """ 141 | Image Captioning prediction 142 | Parameters: 143 | pixel_values (torch.Tensor): preprocessed image pixel values 144 | input_ids (torch.Tensor, *optional*, None): pregenerated caption token ids after tokenization, if provided caption generation continue provided text 145 | attention_mask (torch.Tensor): attention mask for caption tokens, used only if input_ids provided 146 | Retruns: 147 | generation output (torch.Tensor): tensor which represents sequence of generated caption token ids 148 | """ 149 | batch_size = pixel_values.shape[0] 150 | 151 | image_embeds = self.vision_model(pixel_values.detach().numpy())[self.vision_model_out] 152 | 153 | image_attention_mask = torch.ones(image_embeds.shape[:-1], dtype=torch.long) 154 | 155 | if isinstance(input_ids, list): 156 | input_ids = torch.LongTensor(input_ids) 157 | elif input_ids is None: 158 | input_ids = torch.LongTensor( 159 | [ 160 | [ 161 | self.config.text_config.bos_token_id, 162 | self.config.text_config.eos_token_id, 163 | ] 164 | ] 165 | ).repeat(batch_size, 1) 166 | input_ids[:, 0] = self.config.text_config.bos_token_id 167 | attention_mask = attention_mask[:, :-1] if attention_mask is not None else None 168 | 169 | outputs = self.text_decoder.generate( 170 | input_ids=input_ids[:, :-1], 171 | eos_token_id=self.config.text_config.sep_token_id, 172 | pad_token_id=self.config.text_config.pad_token_id, 173 | attention_mask=attention_mask, 174 | encoder_hidden_states=torch.from_numpy(image_embeds), 175 | encoder_attention_mask=image_attention_mask, 176 | **generate_kwargs, 177 | ) 178 | return outputs 179 | 180 | class BlipVQA: 181 | def __init__(self, core, vision_model_path, text_encoder_path, text_decoder_path, device="CPU"): 182 | self.processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base") 183 | self.model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base") 184 | 185 | self.vision_model = self.model.vision_model 186 | self.text_encoder = self.model.text_encoder 187 | self.text_decoder = self.model.text_decoder 188 | 189 | self.VISION_MODEL_OV = Path(vision_model_path) 190 | self.TEXT_ENCODER_OV = Path(text_encoder_path) 191 | self.TEXT_DECODER_OV = Path(text_decoder_path) 192 | 193 | self.core = core 194 | self.device = device 195 | 196 | self.ov_vision_model = self.core.compile_model(self.VISION_MODEL_OV, self.device) 197 | self.ov_text_encoder = self.core.compile_model(self.TEXT_ENCODER_OV, self.device) 198 | self.ov_text_decoder_with_past = self.core.compile_model(self.TEXT_DECODER_OV, self.device) 199 | 200 | self.text_decoder.forward = partial(text_decoder_forward, ov_text_decoder_with_past=self.ov_text_decoder_with_past) 201 | 202 | self.ov_model = OVBlipModel(self.model.config, self.model.decoder_start_token_id, self.ov_vision_model, self.ov_text_encoder, self.text_decoder) 203 | 204 | def preprocess(self, image_path, question): 205 | raw_image = Image.open(image_path).convert("RGB") 206 | inputs = self.processor(raw_image, question, return_tensors="pt") 207 | return raw_image, inputs 208 | 209 | def generate_answer(self, image_path, question, max_length=20): 210 | raw_image, inputs = self.preprocess(image_path, question) 211 | start = time.perf_counter() 212 | out = self.ov_model.generate_answer(**inputs, max_length=max_length) 213 | end = time.perf_counter() - start 214 | answer = self.processor.decode(out[0], skip_special_tokens=True) 215 | return answer, end 216 | 217 | -------------------------------------------------------------------------------- 
/ros2_vlm/ros2_vlm/modules/checkpoints/README.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nilutpolkashyap/vlms_with_ros2_workshop/e3cbffcbd9377f374767df15b1f89d9b650c04e3/ros2_vlm/ros2_vlm/modules/checkpoints/README.md -------------------------------------------------------------------------------- /ros2_vlm/ros2_vlm/modules/groundedsam.py: -------------------------------------------------------------------------------- 1 | from pathlib import Path 2 | import sys 3 | import os 4 | import cv2 5 | import torch 6 | import numpy as np 7 | import openvino as ov 8 | from PIL import Image 9 | import supervision as sv 10 | import transformers 11 | from typing import Union, List 12 | from torchvision.transforms.functional import resize, InterpolationMode 13 | 14 | current_dir = os.getcwd() 15 | 16 | additional_path = '/src/vlms_with_ros2_workshop/ros2_vlm/ros2_vlm/modules' 17 | updated_path = os.path.join(current_dir + additional_path) 18 | 19 | sys.path.append(os.path.join(current_dir + '/src/vlms_with_ros2_workshop/ros2_vlm/ros2_vlm/modules/GroundingDINO')) 20 | sys.path.append(os.path.join(current_dir + '/src/vlms_with_ros2_workshop/ros2_vlm/ros2_vlm/modules/GroundingDINO')) 21 | sys.path.append(os.path.join(current_dir + '/src/vlms_with_ros2_workshop/ros2_vlm/ros2_vlm/modules/checkpoints')) 22 | sys.path.append(os.path.join(current_dir + '/src/vlms_with_ros2_workshop/ros2_vlm/ros2_vlm/modules/openvino_irs')) 23 | 24 | ground_dino_dir = updated_path / Path("GroundingDINO") 25 | efficient_sam_dir = updated_path / Path("EfficientSAM") 26 | 27 | from .GroundingDINO.groundingdino.models.GroundingDINO.bertwarper import generate_masks_with_special_tokens_and_transfer_map 28 | from .GroundingDINO.groundingdino.models import build_model 29 | from .GroundingDINO.groundingdino.util.slconfig import SLConfig 30 | from .GroundingDINO.groundingdino.util.utils import clean_state_dict 31 | from .GroundingDINO.groundingdino.util import get_tokenlizer 32 | from .GroundingDINO.groundingdino.util.utils import get_phrases_from_posmap 33 | from .GroundingDINO.groundingdino.util.inference import Model 34 | from .GroundingDINO.groundingdino.datasets import transforms as T 35 | 36 | class ImageProcessor: 37 | def __init__(self, core, device): 38 | self.core = core 39 | self.irs_path = Path("openvino_irs") 40 | self.ov_dino_name = "openvino_grounding_dino" 41 | self.ov_dino_path = updated_path / self.irs_path / f"{self.ov_dino_name}.xml" 42 | 43 | self.ov_dino_model = self.core.read_model(self.ov_dino_path) 44 | self.device = device #"AUTO" 45 | 46 | self.ground_dino_img_size = (1024, 1280) 47 | 48 | self.pt_device = "cpu" 49 | self.ckpt_base_path = Path("checkpoints") 50 | 51 | self.grounding_dino_config_path = f"{ground_dino_dir}/groundingdino/config/GroundingDINO_SwinT_OGC.py" 52 | self.grounding_dino_checkpoint_path = updated_path / self.ckpt_base_path / "groundingdino_swint_ogc.pth" 53 | 54 | self.ov_compiled_grounded_dino = self.core.compile_model(self.ov_dino_model, self.device) 55 | 56 | self.box_threshold = 0.3 57 | self.text_threshold = 0.25 58 | 59 | self.ov_efficient_sam_name = "openvino_efficient_sam" 60 | self.ov_efficient_sam_path = updated_path / self.irs_path / f"{self.ov_efficient_sam_name}.xml" 61 | 62 | self.ov_efficient_sam = core.read_model(self.ov_efficient_sam_path) 63 | 64 | self.ov_compiled_efficient_sam = core.compile_model(self.ov_efficient_sam, device_name=self.device) 65 | 66 | self.model, self.max_text_len, 
self.dino_tokenizer, *_ = self.load_pt_grounding_dino(self.grounding_dino_config_path, self.grounding_dino_checkpoint_path) 67 | 68 | def sig(self, x): 69 | return 1 / (1 + np.exp(-x)) 70 | 71 | def load_pt_grounding_dino(self, model_config_path, model_checkpoint_path): 72 | args = SLConfig.fromfile(model_config_path) 73 | 74 | args.device = self.pt_device 75 | args.use_checkpoint = False 76 | args.use_transformer_ckpt = False 77 | 78 | model = build_model(args) 79 | checkpoint = torch.load(model_checkpoint_path, map_location=self.pt_device) 80 | model.load_state_dict(clean_state_dict(checkpoint["model"]), strict=False) 81 | _ = model.eval() 82 | 83 | return ( 84 | model, 85 | args.max_text_len, 86 | get_tokenlizer.get_tokenlizer(args.text_encoder_type), 87 | ) 88 | 89 | def transform_image(self, pil_image: Image.Image) -> torch.Tensor: 90 | 91 | transform = T.Compose( 92 | [ 93 | T.RandomResize([800], max_size=1333), 94 | T.ToTensor(), 95 | T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]), 96 | ] 97 | ) 98 | image, _ = transform(pil_image, None) # 3, h, w 99 | return image 100 | 101 | def get_ov_grounding_output( 102 | self, 103 | model: ov.CompiledModel, 104 | pil_image: Image.Image, 105 | caption: Union[str, List[str]], 106 | box_threshold: float, 107 | text_threshold: float, 108 | dino_tokenizer: transformers.PreTrainedTokenizerBase, 109 | max_text_len: int) -> (torch.Tensor, List[str], torch.Tensor): 110 | 111 | if isinstance(caption, list): 112 | caption = ". ".join(caption) 113 | caption = caption.lower() 114 | caption = caption.strip() 115 | if not caption.endswith("."): 116 | caption = caption + "." 117 | captions = [caption] 118 | 119 | tokenized = dino_tokenizer(captions, padding="longest", return_tensors="pt") 120 | specical_tokens = dino_tokenizer.convert_tokens_to_ids(["[CLS]", "[SEP]", ".", "?"]) 121 | 122 | ( 123 | text_self_attention_masks, 124 | position_ids, 125 | cate_to_token_mask_list, 126 | ) = generate_masks_with_special_tokens_and_transfer_map(tokenized, specical_tokens, dino_tokenizer) 127 | 128 | if text_self_attention_masks.shape[1] > max_text_len: 129 | text_self_attention_masks = text_self_attention_masks[:, :max_text_len, :max_text_len] 130 | 131 | position_ids = position_ids[:, :max_text_len] 132 | tokenized["input_ids"] = tokenized["input_ids"][:, :max_text_len] 133 | tokenized["attention_mask"] = tokenized["attention_mask"][:, :max_text_len] 134 | tokenized["token_type_ids"] = tokenized["token_type_ids"][:, :max_text_len] 135 | 136 | inputs = {} 137 | inputs["attention_mask.1"] = tokenized["attention_mask"] 138 | inputs["text_self_attention_masks"] = text_self_attention_masks 139 | inputs["input_ids"] = tokenized["input_ids"] 140 | inputs["position_ids"] = position_ids 141 | inputs["token_type_ids"] = tokenized["token_type_ids"] 142 | 143 | 144 | input_img = resize( 145 | self.transform_image(pil_image), 146 | self.ground_dino_img_size, 147 | interpolation=InterpolationMode.BICUBIC, 148 | )[None, ...] 
149 | inputs["samples"] = input_img 150 | 151 | request = model.create_infer_request() 152 | request.start_async(inputs, share_inputs=False) 153 | request.wait() 154 | 155 | logits = torch.from_numpy(self.sig(np.squeeze(request.get_tensor("pred_logits").data, 0))) 156 | boxes = torch.from_numpy(np.squeeze(request.get_tensor("pred_boxes").data, 0)) 157 | 158 | filt_mask = logits.max(dim=1)[0] > box_threshold 159 | logits, boxes = logits[filt_mask], boxes[filt_mask] 160 | 161 | tokenized = dino_tokenizer(caption) 162 | pred_phrases = [] 163 | for logit in logits: 164 | pred_phrase = get_phrases_from_posmap(logit > text_threshold, tokenized, dino_tokenizer) 165 | pred_phrases.append(pred_phrase + f"({str(logit.max().item())[:4]})") 166 | 167 | return boxes, pred_phrases, logits.max(dim=1)[0] 168 | 169 | def predict_efficient_sam_mask(self, compiled_efficient_sam: ov.CompiledModel, image: Image.Image, bbox: torch.Tensor): 170 | input_size = 1024 171 | w, h = image.size[:2] 172 | scale = input_size / max(w, h) 173 | new_w = int(w * scale) 174 | new_h = int(h * scale) 175 | image = image.resize((new_w, new_h)) 176 | 177 | numpy_image = np.array(image, dtype=np.float32) / 255.0 178 | numpy_image = np.transpose(numpy_image, (2, 0, 1))[None, ...] 179 | 180 | scaled_points = bbox * scale 181 | 182 | bounding_box = scaled_points.reshape([1, 1, 2, 2]) 183 | bbox_labels = np.reshape(np.array([2, 3]), [1, 1, 2]) 184 | 185 | res = compiled_efficient_sam((numpy_image, bounding_box, bbox_labels)) 186 | 187 | predicted_logits, predicted_iou = res[0], res[1] 188 | 189 | all_masks = torch.ge(torch.sigmoid(torch.from_numpy(predicted_logits[0, 0, :, :, :])), 0.5).numpy() 190 | predicted_iou = predicted_iou[0, 0, ...] 191 | 192 | max_predicted_iou = -1 193 | selected_mask_using_predicted_iou = None 194 | for m in range(all_masks.shape[0]): 195 | curr_predicted_iou = predicted_iou[m] 196 | if curr_predicted_iou > max_predicted_iou or selected_mask_using_predicted_iou is None: 197 | max_predicted_iou = curr_predicted_iou 198 | selected_mask_using_predicted_iou = all_masks[m] 199 | return selected_mask_using_predicted_iou 200 | 201 | def predict_efficient_sam_masks(self, compiled_efficient_sam: ov.CompiledModel, pil_image: Image.Image, transformed_boxes) -> torch.Tensor: 202 | masks = [] 203 | for bbox in transformed_boxes: 204 | mask = self.predict_efficient_sam_mask(compiled_efficient_sam, pil_image, bbox) 205 | mask = Image.fromarray(mask).resize(pil_image.size) 206 | masks.append(np.array(mask)) 207 | masks = torch.from_numpy(np.array(masks)) 208 | return masks 209 | 210 | def process_image(self, pil_image: Image.Image, classes_prompt: List[str], use_segment: bool = False) -> np.ndarray: 211 | boxes, pred_phrases, logits = self.get_ov_grounding_output(self.ov_compiled_grounded_dino, pil_image, classes_prompt, self.box_threshold, self.text_threshold, self.dino_tokenizer, self.max_text_len) 212 | 213 | source_w, source_h = pil_image.size 214 | detections = Model.post_process_result(source_h=source_h, source_w=source_w, boxes=boxes, logits=logits) 215 | 216 | class_id = Model.phrases2classes(phrases=pred_phrases, classes=list(map(str.lower, classes_prompt))) 217 | detections.class_id = class_id 218 | 219 | if use_segment: 220 | masks = self.predict_efficient_sam_masks(self.ov_compiled_efficient_sam, pil_image, detections.xyxy) 221 | detections.mask = masks.numpy() 222 | 223 | box_annotator = sv.BoxAnnotator() 224 | mask_annotator = sv.MaskAnnotator() 225 | 226 | labels = [f"{classes_prompt[class_id] if class_id is 
not None else 'None'} {confidence:0.2f}" for _, _, confidence, class_id, _, _ in detections] 227 | 228 | annotated_image = np.array(pil_image) 229 | annotated_image = mask_annotator.annotate(scene=np.array(pil_image).copy(), detections=detections) 230 | mask_annotated_image = box_annotator.annotate(scene=annotated_image, detections=detections, labels=labels) 231 | 232 | annotated_frame_bgr = cv2.cvtColor(mask_annotated_image, cv2.COLOR_RGB2BGR) 233 | else: 234 | box_annotator = sv.BoxAnnotator() 235 | box_annotated_image = box_annotator.annotate(scene=np.array(pil_image).copy(), detections=detections) 236 | 237 | annotated_frame_bgr = cv2.cvtColor(box_annotated_image, cv2.COLOR_RGB2BGR) 238 | 239 | return annotated_frame_bgr 240 | 241 | -------------------------------------------------------------------------------- /ros2_vlm/ros2_vlm/modules/openvino_irs/README.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nilutpolkashyap/vlms_with_ros2_workshop/e3cbffcbd9377f374767df15b1f89d9b650c04e3/ros2_vlm/ros2_vlm/modules/openvino_irs/README.md -------------------------------------------------------------------------------- /ros2_vlm/setup.cfg: -------------------------------------------------------------------------------- 1 | [develop] 2 | script_dir=$base/lib/ros2_vlm 3 | [install] 4 | install_scripts=$base/lib/ros2_vlm 5 | -------------------------------------------------------------------------------- /ros2_vlm/setup.py: -------------------------------------------------------------------------------- 1 | from setuptools import setup 2 | import os 3 | from glob import glob 4 | 5 | package_name = 'ros2_vlm' 6 | modules = "ros2_vlm/modules" 7 | 8 | setup( 9 | name=package_name, 10 | version='0.0.0', 11 | packages=[package_name, modules], 12 | data_files=[ 13 | ('share/ament_index/resource_index/packages', 14 | ['resource/' + package_name]), 15 | ('share/' + package_name, ['package.xml']), 16 | ], 17 | install_requires=['setuptools'], 18 | zip_safe=True, 19 | maintainer='Nilutpol Kashyap', 20 | maintainer_email='nilutpolkashyap@todo.todo', 21 | description='Application of Vision Language Models with ROS 2 workshop', 22 | license='MIT License', 23 | tests_require=['pytest'], 24 | entry_points={ 25 | 'console_scripts': [ 26 | 'grounded_sam = ros2_vlm.grounded_sam:main', 27 | 'blip_visual_qna = ros2_vlm.blip_visual_qna:main', 28 | ], 29 | }, 30 | ) 31 | -------------------------------------------------------------------------------- /ros2_vlm/test/test_copyright.py: -------------------------------------------------------------------------------- 1 | # Copyright 2015 Open Source Robotics Foundation, Inc. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 
14 | 15 | from ament_copyright.main import main 16 | import pytest 17 | 18 | 19 | # Remove the `skip` decorator once the source file(s) have a copyright header 20 | @pytest.mark.skip(reason='No copyright header has been placed in the generated source file.') 21 | @pytest.mark.copyright 22 | @pytest.mark.linter 23 | def test_copyright(): 24 | rc = main(argv=['.', 'test']) 25 | assert rc == 0, 'Found errors' 26 | -------------------------------------------------------------------------------- /ros2_vlm/test/test_flake8.py: -------------------------------------------------------------------------------- 1 | # Copyright 2017 Open Source Robotics Foundation, Inc. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | 15 | from ament_flake8.main import main_with_errors 16 | import pytest 17 | 18 | 19 | @pytest.mark.flake8 20 | @pytest.mark.linter 21 | def test_flake8(): 22 | rc, errors = main_with_errors(argv=[]) 23 | assert rc == 0, \ 24 | 'Found %d code style errors / warnings:\n' % len(errors) + \ 25 | '\n'.join(errors) 26 | -------------------------------------------------------------------------------- /ros2_vlm/test/test_pep257.py: -------------------------------------------------------------------------------- 1 | # Copyright 2015 Open Source Robotics Foundation, Inc. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | 15 | from ament_pep257.main import main 16 | import pytest 17 | 18 | 19 | @pytest.mark.linter 20 | @pytest.mark.pep257 21 | def test_pep257(): 22 | rc = main(argv=['.', 'test']) 23 | assert rc == 0, 'Found code style errors / warnings' 24 | --------------------------------------------------------------------------------