├── README.md
├── README_zh.md
├── chat.py
├── chat_genai.py
├── convert.py
└── requirements.txt

/README.md:
--------------------------------------------------------------------------------
English | [简体中文](README_zh.md)

# Qwen2.openvino Demo

This sample shows how to deploy Qwen2 using OpenVINO.

## 1. Environment configuration

We recommend that you create a new virtual environment and then install the dependencies as follows. The
recommended Python version is `3.10+`.

Linux

```
python3 -m venv openvino_env

source openvino_env/bin/activate

python3 -m pip install --upgrade pip

pip install wheel setuptools

pip install -r requirements.txt
```

Windows PowerShell

```
python -m venv openvino_env

.\openvino_env\Scripts\activate

python -m pip install --upgrade pip

pip install wheel setuptools

pip install -r requirements.txt
```

> Note:
> If you are using an existing Python environment, we recommend the following command so that all
> dependencies are upgraded to their latest compatible versions:
> `pip install -U --upgrade-strategy eager -r requirements.txt`

## 2. Convert model

Since the Hugging Face model needs to be converted to an OpenVINO IR model, you need to download the model and convert it.

```
python3 convert.py --model_id qwen/Qwen2-7B-Instruct --precision int4 --output {your_path}/Qwen2-7B-Instruct-ov --modelscope
```

### Parameters that can be selected

* `--model_id` - model ID on the Hugging Face Hub (https://huggingface.co/models) or the absolute path to the directory
  where the model is located.
* `--precision` - model precision: fp16, int8 or int4.
* `--output` - the path where the converted model is saved.
* `--modelscope` - download the model from ModelScope instead of the Hugging Face Hub.
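
To sanity-check the conversion before starting the chatbot, you can load the exported IR back with
`optimum-intel` and run a single generation. The snippet below is a minimal sketch rather than a file
shipped in this repo; the path and prompt are placeholders for illustration.

```
# Hypothetical smoke test for the converted model (not part of this repo).
from optimum.intel.openvino import OVModelForCausalLM
from transformers import AutoTokenizer

model_dir = "{your_path}/Qwen2-7B-Instruct-ov"  # same path as --output above

tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = OVModelForCausalLM.from_pretrained(model_dir, device="CPU")

inputs = tokenizer("Hello, who are you?", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```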

## 3. Run streaming chatbot

```
python3 chat.py --model_path {your_path}/Qwen2-7B-Instruct-ov --max_sequence_length 4096 --device CPU
```

or

```
python3 chat_genai.py --model_path {your_path}/Qwen2-7B-Instruct-ov --max_sequence_length 4096 --device CPU
```

### Parameters that can be selected

* `--model_path` - the path to the directory where the OpenVINO IR model is located.
* `--max_sequence_length` - maximum number of output tokens.
* `--device` - the device to run inference on, e.g. "CPU", "GPU".

## Example

```
====Starting conversation====
User: hello
Qwen2-OpenVINO: Hello! How can I assist you today?

User: who are you ?
Qwen2-OpenVINO: I am an AI language model created by Alibaba Cloud. My purpose is to help users with their questions and provide them with accurate information. Is there anything specific you would like to know about me?

User: could you tell me a story ?
Qwen2-OpenVINO: Sure, here's a short story for you:

Once upon a time, in a small village nestled in the mountains, there lived a young girl named Lily who loved nature. She spent most of her days exploring the forest and watching the birds singing.

One day, while she was wandering through the woods, she stumbled upon a hidden cave deep within the forest. Inside, she found a beautiful crystal that sparkled with light. She picked it up and held it close to her heart, feeling a sense of joy and wonder.

As she walked away from the cave, she felt a sense of peace wash over her. She realized that sometimes, the things we miss the most are the simple things in life, like the beauty of nature or the warmth of the sun on our skin.

From that day forward, Lily made a habit of spending time in nature whenever she could. She would spend hours walking through the forest, watching the birds sing, and taking in the beauty around her. She knew that these moments were precious and that they would stay with her forever.

And so, Lily continued to live her life with a sense of joy and wonder, always cherishing the simple things in life.

User: please give this story a title
Qwen2-OpenVINO: "Nature's Magic: A Journey Through the Forest Crystal"
```

## FAQ

1. Do I need to install the OpenVINO C++ inference engine?
   - No, it is not required.

2. Do I have to use Intel hardware?
   - We recommend Intel x86 devices, which is where this demo was tested. For example:
   - Intel CPUs, including personal computer CPUs and server CPUs.
   - Intel integrated GPUs, such as the Arc™ and Iris® series.
   - Intel discrete graphics cards, such as the Arc™ A770.

3. Why can't OpenVINO find the GPU in my system (Linux)?
   - Ensure the OpenCL drivers are installed correctly.
   - Ensure you have the right permissions to access the GPU device.
   - More information can be found in [Install GPU drivers](https://github.com/openvinotoolkit/openvino_notebooks/wiki/Ubuntu#1-install-python-git-and-gpu-drivers-optional)

4. Is C++ supported?
   - Please refer to this [example](https://github.com/openvinotoolkit/openvino.genai/tree/master/src).

Post your questions [here](https://community.intel.com/t5/Intel-Distribution-of-OpenVINO/bd-p/distribution-openvino-toolkit).
--------------------------------------------------------------------------------
/README_zh.md:
--------------------------------------------------------------------------------
简体中文 | [English](README.md)

# Qwen2.openvino Demo

This is an example of how to deploy Qwen2 using OpenVINO.

## 1. Environment configuration

We recommend that you create a new virtual environment and then install the dependencies as follows.
Python 3.10 or later is recommended for running this example.

Linux

```
python3 -m venv openvino_env

source openvino_env/bin/activate

python3 -m pip install --upgrade pip

pip install wheel setuptools

pip install -r requirements.txt
```

Windows PowerShell

```
python -m venv openvino_env

.\openvino_env\Scripts\activate

python -m pip install --upgrade pip

pip install wheel setuptools

pip install -r requirements.txt
```

> Note:
> If you are using an existing Python environment, update the dependencies as follows:
> `pip install -U --upgrade-strategy eager -r requirements.txt`

## 2. Convert model

Since the Hugging Face model needs to be converted to an OpenVINO IR model, you need to download and convert the model.

```
python3 convert.py --model_id qwen/Qwen2-7B-Instruct --precision int4 --output {your_path}/Qwen2-7B-Instruct-ov --modelscope
```

### Parameters that can be selected

* `--model_id` - model ID on the Hugging Face Hub (https://huggingface.co/models) or the absolute path to the directory where the model is located.
* `--precision` - model precision: fp16, int8 or int4.
* `--output` - the path where the converted model is saved.
* `--modelscope` - download the model from ModelScope.

## 3. Run streaming chatbot

```
python3 chat.py --model_path {your_path}/Qwen2-7B-Instruct-ov --max_sequence_length 4096 --device CPU
```

or

```
python3 chat_genai.py --model_path {your_path}/Qwen2-7B-Instruct-ov --max_sequence_length 4096 --device CPU
```

### Parameters that can be selected

* `--model_path` - the path to the directory where the OpenVINO IR model is located.
* `--max_sequence_length` - maximum number of output tokens.
* `--device` - the device to run inference on, e.g. "CPU", "GPU".

## Example

```
====Starting conversation====
User: Hello
Qwen2-OpenVINO: Hello! Is there anything I can help you with?

User: Who are you?
Qwen2-OpenVINO: I am a large-scale language model from Alibaba Cloud. My name is Tongyi Qianwen.

User: Please tell me a story
Qwen2-OpenVINO: OK, here is a story about a little rabbit and its friends.

One day, the little rabbit and his friends decided to go on an adventure in the forest. They packed food, water and some tools and set off on their journey. Along the way they met all kinds of animals, including squirrels, foxes and little birds. They played together, shared food and helped each other solve problems. In the end, they found a mysterious cave deep in the forest with many treasures hidden inside. They brought all the treasures home and celebrated this happy adventure.

User: Please give this story a title
Qwen2-OpenVINO: "The Adventure of the Little Rabbit and Friends"
```

## FAQ

1. Do I need to install the OpenVINO C++ inference engine?
   - No, it is not required.

2. Do I have to use Intel hardware?
   - We have only tested this on Intel devices, and we recommend Intel devices with the x86 architecture, including but not limited to:
   - Intel CPUs, including personal computer CPUs and server CPUs.
   - Intel integrated GPUs, such as the Arc™ and Iris® series.
   - Intel discrete graphics cards, such as the Arc™ A770.

3. Why doesn't OpenVINO detect the GPU device on my system?
   - Ensure the OpenCL drivers are installed correctly.
   - Ensure you have sufficient permissions to access the GPU device.
   - More information can be found in [Install GPU drivers](https://github.com/openvinotoolkit/openvino_notebooks/wiki/Ubuntu#1-install-python-git-and-gpu-drivers-optional)

4. Is C++ supported?
   - Please refer to this C++ [example](https://github.com/openvinotoolkit/openvino.genai/tree/master/src).

You can also post your questions [here](https://community.intel.com/t5/Intel-Distribution-of-OpenVINO/bd-p/distribution-openvino-toolkit).
--------------------------------------------------------------------------------
/chat.py:
--------------------------------------------------------------------------------
import argparse
from typing import List, Tuple
from threading import Thread
import torch
from optimum.intel.openvino import OVModelForCausalLM
from transformers import (AutoTokenizer, AutoConfig,
                          TextIteratorStreamer, StoppingCriteriaList, StoppingCriteria)


class StopOnTokens(StoppingCriteria):
    """Stop generation as soon as the last generated token is one of `token_ids`."""

    def __init__(self, token_ids):
        self.token_ids = token_ids

    def __call__(
        self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs
    ) -> bool:
        for stop_id in self.token_ids:
            if input_ids[0][-1] == stop_id:
                return True
        return False


if __name__ == "__main__":
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument('-h',
                        '--help',
                        action='help',
                        help='Show this help message and exit.')
    parser.add_argument('-m',
                        '--model_path',
                        required=True,
                        type=str,
                        help='Required. Path to the OpenVINO IR model directory.')
    parser.add_argument('-l',
                        '--max_sequence_length',
                        default=256,
                        required=False,
                        type=int,
                        help='Optional. Maximum number of new tokens to generate.')
    parser.add_argument('-d',
                        '--device',
                        default='CPU',
                        required=False,
                        type=str,
                        help='Optional. Device for inference, e.g. CPU or GPU.')
    args = parser.parse_args()
    model_dir = args.model_path

    ov_config = {"PERFORMANCE_HINT": "LATENCY",
                 "NUM_STREAMS": "1", "CACHE_DIR": ""}

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    print("====Compiling model====")
    ov_model = OVModelForCausalLM.from_pretrained(
        model_dir,
        device=args.device,
        ov_config=ov_config,
        config=AutoConfig.from_pretrained(model_dir),
    )

    streamer = TextIteratorStreamer(
        tokenizer, timeout=60.0, skip_prompt=True, skip_special_tokens=True
    )
    # Qwen2 special tokens: 151643 = <|endoftext|>, 151645 = <|im_end|>
    stop_tokens = [151643, 151645]
    stop_tokens = [StopOnTokens(stop_tokens)]

    def convert_history_to_token(history: List[Tuple[str, str]]):
        # Rebuild the chat-template prompt from the accumulated (user, assistant) pairs.
        messages = []
        for idx, (user_msg, model_msg) in enumerate(history):
            if idx == len(history) - 1 and not model_msg:
                messages.append({"role": "user", "content": user_msg})
                break
            if user_msg:
                messages.append({"role": "user", "content": user_msg})
            if model_msg:
                messages.append({"role": "assistant", "content": model_msg})

        model_inputs = tokenizer.apply_chat_template(messages,
                                                     add_generation_prompt=True,
                                                     tokenize=True,
                                                     return_tensors="pt")
        return model_inputs

    history = []
    print("====Starting conversation====")
    while True:
        input_text = input("User: ")
        if input_text.lower() == 'stop':
            break

        if input_text.lower() == 'clear':
            history = []
            print("AI Assistant: Conversation history cleared")
            continue

        print("Qwen2-OpenVINO:", end=" ")
        history = history + [[input_text, ""]]
        model_inputs = convert_history_to_token(history)
        generate_kwargs = dict(
            input_ids=model_inputs,
            max_new_tokens=args.max_sequence_length,
            temperature=0.1,
            do_sample=True,
            top_p=1.0,
            top_k=50,
            repetition_penalty=1.1,
            streamer=streamer,
            stopping_criteria=StoppingCriteriaList(stop_tokens),
            pad_token_id=151645,
        )

        # Run generation on a background thread so tokens can be streamed as they arrive.
        t1 = Thread(target=ov_model.generate, kwargs=generate_kwargs)
        t1.start()

        partial_text = ""
        for new_text in streamer:
            print(new_text, end="", flush=True)
            partial_text += new_text
        print("\n")
        history[-1][1] = partial_text
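# Note: the stop token ids above are hard-coded for Qwen2. A more portable
# sketch (an assumption for illustration, not part of the original demo) would
# derive them from the tokenizer at runtime:
#
#     stop_ids = [tokenizer.convert_tokens_to_ids(t)
#                 for t in ("<|endoftext|>", "<|im_end|>")]
#     stop_tokens = [StopOnTokens(stop_ids)]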
--------------------------------------------------------------------------------
/chat_genai.py:
--------------------------------------------------------------------------------
import argparse
import openvino_genai


def streamer(subword):
    # Print each generated subword as soon as it arrives; returning False
    # tells the pipeline to continue generation.
    print(subword, end='', flush=True)
    return False


if __name__ == "__main__":
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument('-h',
                        '--help',
                        action='help',
                        help='Show this help message and exit.')
    parser.add_argument('-m',
                        '--model_path',
                        required=True,
                        type=str,
                        help='Required. Path to the OpenVINO IR model directory.')
    parser.add_argument('-l',
                        '--max_sequence_length',
                        default=256,
                        required=False,
                        type=int,
                        help='Optional. Maximum number of new tokens to generate.')
    parser.add_argument('-d',
                        '--device',
                        default='CPU',
                        required=False,
                        type=str,
                        help='Optional. Device for inference, e.g. CPU or GPU.')
    args = parser.parse_args()
    pipe = openvino_genai.LLMPipeline(args.model_path, args.device)

    config = openvino_genai.GenerationConfig()
    config.max_new_tokens = args.max_sequence_length

    pipe.start_chat()
    while True:
        try:
            prompt = input('question:\n')
        except EOFError:
            break
        pipe.generate(prompt, config, streamer)
        print('\n----------')
    pipe.finish_chat()
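# The pipeline above uses the model's default decoding settings. As a hedged
# sketch, sampling can be enabled through the same GenerationConfig before the
# chat loop; the values below are illustrative assumptions, not tuned defaults:
#
#     config.do_sample = True
#     config.temperature = 0.7
#     config.top_p = 0.9
#     config.top_k = 50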
--------------------------------------------------------------------------------
/convert.py:
--------------------------------------------------------------------------------
import argparse
from pathlib import Path

from transformers import AutoTokenizer
from optimum.intel import OVWeightQuantizationConfig
from optimum.intel.openvino import OVModelForCausalLM

if __name__ == '__main__':
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument('-h',
                        '--help',
                        action='help',
                        help='Show this help message and exit.')
    parser.add_argument('-m',
                        '--model_id',
                        default='Qwen/Qwen1.5-0.5B-Chat',
                        required=False,
                        type=str,
                        help='Original model ID or path.')
    parser.add_argument('-p',
                        '--precision',
                        required=False,
                        default="int4",
                        type=str,
                        choices=["fp16", "int8", "int4"],
                        help='fp16, int8 or int4')
    parser.add_argument('-o',
                        '--output',
                        required=False,
                        type=str,
                        help='Path to save the IR model.')
    parser.add_argument('-ms',
                        '--modelscope',
                        action='store_true',
                        help='Download the model from ModelScope.')
    args = parser.parse_args()

    # Default the output directory to "<model name>-ov" next to the script.
    ir_model_path = Path(args.model_id.split(
        "/")[-1] + '-ov') if args.output is None else Path(args.output)
    ir_model_path.mkdir(parents=True, exist_ok=True)

    # int4 weight-compression settings passed to NNCF via optimum-intel.
    compression_configs = {
        "sym": True,
        "group_size": 128,
        "ratio": 0.8,
    }
    if args.modelscope:
        from modelscope import snapshot_download

        print("====Downloading model from ModelScope=====")
        model_path = snapshot_download(args.model_id, cache_dir='./')
    else:
        model_path = args.model_id

    print("====Exporting IR=====")
    if args.precision == "int4":
        ov_model = OVModelForCausalLM.from_pretrained(model_path, export=True,
                                                      compile=False, quantization_config=OVWeightQuantizationConfig(
                                                          bits=4, **compression_configs))
    elif args.precision == "int8":
        ov_model = OVModelForCausalLM.from_pretrained(model_path, export=True,
                                                      compile=False, load_in_8bit=True)
    else:
        ov_model = OVModelForCausalLM.from_pretrained(model_path, export=True,
                                                      compile=False, load_in_8bit=False)

    print("====Saving IR=====")
    ov_model.save_pretrained(ir_model_path)

    print("====Exporting tokenizer=====")
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    tokenizer.save_pretrained(ir_model_path)

    print("====Exporting IR tokenizer=====")
    from optimum.exporters.openvino.convert import export_tokenizer
    export_tokenizer(tokenizer, ir_model_path)
    print("====Finished=====")
    del ov_model
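# The int4 settings above trade accuracy for size: "ratio" is roughly the
# fraction of weights compressed to 4 bits, with the remainder kept at 8 bits.
# A hedged sketch of a higher-accuracy variant (values are illustrative
# assumptions, not tested defaults):
#
#     quantization_config = OVWeightQuantizationConfig(
#         bits=4, sym=False, group_size=64, ratio=0.6)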
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
--extra-index-url https://download.pytorch.org/whl/cpu
numpy
openvino==2024.4.0
openvino-genai==2024.4.0.0
nncf>=2.11.0
optimum-intel>=1.17.0
transformers>=4.40.0,<4.42.0
onnx>=1.15.0
huggingface-hub>=0.21.3
torch>=2.1
modelscope
--------------------------------------------------------------------------------