├── Audio └── __init__.py ├── NLP ├── __init__.py ├── Qwen.md └── CogVLM.md ├── Multi-Modal └── __init__.py ├── Robotic └── __init__.py ├── Visual-Perception └── __init__.py ├── asserts ├── demo.jpg ├── robot.png ├── robot_l.png └── robot_s.png ├── requirements.txt ├── LICENSE ├── chatme.py ├── main.py ├── .gitmodules ├── README.md └── README_en.md /Audio/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /NLP/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /Multi-Modal/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /Robotic/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /Visual-Perception/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /asserts/demo.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Gorilla-Lab-SCUT/Visual-Auditory-Fusion-Perception/HEAD/asserts/demo.jpg -------------------------------------------------------------------------------- /asserts/robot.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Gorilla-Lab-SCUT/Visual-Auditory-Fusion-Perception/HEAD/asserts/robot.png -------------------------------------------------------------------------------- /asserts/robot_l.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Gorilla-Lab-SCUT/Visual-Auditory-Fusion-Perception/HEAD/asserts/robot_l.png -------------------------------------------------------------------------------- /asserts/robot_s.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Gorilla-Lab-SCUT/Visual-Auditory-Fusion-Perception/HEAD/asserts/robot_s.png -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | SwissArmyTransformer>=0.4.9 2 | transformers>=4.36.2 3 | xformers>=0.0.22 4 | torch>=2.1.0 5 | torchvision>=0.16.2 6 | spacy>=3.6.0 7 | pillow>=10.2.0 8 | deepspeed>=0.13.1 9 | seaborn>=0.13.2 10 | loguru~=0.7.2 11 | streamlit>=1.31.0 12 | timm>=0.9.12 13 | accelerate>=0.26.1 14 | pydantic>=2.6.0 -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2024 RVC-Boss 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 
11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. -------------------------------------------------------------------------------- /NLP/Qwen.md: -------------------------------------------------------------------------------- 1 | # Qwen 2 | 3 | Quickstart 4 | 5 | 6 | If not using docker, please make sure you have setup the environment and installed the required packages. Make sure you meet the above requirements, and then install the dependent libraries. 7 | 8 | ``` 9 | pip install -r requirements.txt 10 | ``` 11 | 12 | If your device supports fp16 or bf16, we recommend installing flash-attention (we support flash attention 2 now.) for higher efficiency and lower memory usage. (flash-attention is optional and the project can run normally without installing it) 13 | 14 | ``` 15 | git clone https://github.com/Dao-AILab/flash-attention 16 | cd flash-attention && pip install . 17 | # Below are optional. Installing them might be slow. 18 | # pip install csrc/layer_norm 19 | # If the version of flash-attn is higher than 2.1.1, the following is not needed. 20 | # pip install csrc/rotary 21 | ``` 22 | 23 | Now you can start with Transformers🤗. 24 | 25 | To use Qwen-Chat for the inference, all you need to do is to input a few lines of codes as demonstrated below. Remember to pass in the correct model names or paths, such as "Qwen/Qwen-7B-Chat" and "Qwen/Qwen-14B-Chat". However, please make sure that you are using the latest code. 26 | 27 | ``` 28 | from transformers import AutoModelForCausalLM, AutoTokenizer 29 | from transformers.generation import GenerationConfig 30 | 31 | # Model names: "Qwen/Qwen-7B-Chat", "Qwen/Qwen-14B-Chat" 32 | tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True) 33 | 34 | # use bf16 35 | # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True, bf16=True).eval() 36 | # use fp16 37 | # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True, fp16=True).eval() 38 | # use cpu only 39 | # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="cpu", trust_remote_code=True).eval() 40 | # use auto mode, automatically select precision based on the device. 41 | model = AutoModelForCausalLM.from_pretrained( 42 | "Qwen/Qwen-7B-Chat", 43 | device_map="auto", 44 | trust_remote_code=True 45 | ).eval() 46 | 47 | # Specify hyperparameters for generation. But if you use transformers>=4.32.0, there is no need to do this. 
48 | # model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True) 49 | 50 | # 1st dialogue turn 51 | response, history = model.chat(tokenizer, "你好", history=None) 52 | print(response) 53 | # 你好!很高兴为你提供帮助。 54 | 55 | # 2nd dialogue turn 56 | response, history = model.chat(tokenizer, "给我讲一个年轻人奋斗创业最终取得成功的故事。", history=history) 57 | print(response) 58 | # 这是一个关于一个年轻人奋斗创业最终取得成功的故事。 59 | # 故事的主人公叫李明,他来自一个普通的家庭,父母都是普通的工人。从小,李明就立下了一个目标:要成为一名成功的企业家。 60 | # 为了实现这个目标,李明勤奋学习,考上了大学。在大学期间,他积极参加各种创业比赛,获得了不少奖项。他还利用课余时间去实习,积累了宝贵的经验。 61 | # 毕业后,李明决定开始自己的创业之路。他开始寻找投资机会,但多次都被拒绝了。然而,他并没有放弃。他继续努力,不断改进自己的创业计划,并寻找新的投资机会。 62 | # 最终,李明成功地获得了一笔投资,开始了自己的创业之路。他成立了一家科技公司,专注于开发新型软件。在他的领导下,公司迅速发展起来,成为了一家成功的科技企业。 63 | # 李明的成功并不是偶然的。他勤奋、坚韧、勇于冒险,不断学习和改进自己。他的成功也证明了,只要努力奋斗,任何人都有可能取得成功。 64 | 65 | # 3rd dialogue turn 66 | response, history = model.chat(tokenizer, "给这个故事起一个标题", history=history) 67 | print(response) 68 | # 《奋斗创业:一个年轻人的成功之路》 69 | ``` 70 | 71 | Running Qwen, the base language model, is also simple. 72 | 73 | In the event of a network issue while attempting to download model checkpoints and codes from HuggingFace, an alternative approach is to initially fetch the checkpoint from ModelScope and then load it from the local directory as outlined below: 74 | 75 | ``` 76 | from modelscope import snapshot_download 77 | from transformers import AutoModelForCausalLM, AutoTokenizer 78 | 79 | # Downloading model checkpoint to a local dir model_dir 80 | # model_dir = snapshot_download('qwen/Qwen-7B') 81 | # model_dir = snapshot_download('qwen/Qwen-7B-Chat') 82 | # model_dir = snapshot_download('qwen/Qwen-14B') 83 | model_dir = snapshot_download('qwen/Qwen-14B-Chat') 84 | 85 | # Loading local checkpoints 86 | # trust_remote_code is still set as True since we still load codes from local dir instead of transformers 87 | tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True) 88 | model = AutoModelForCausalLM.from_pretrained( 89 | model_dir, 90 | device_map="auto", 91 | trust_remote_code=True 92 | ).eval() 93 | ``` 94 | -------------------------------------------------------------------------------- /chatme.py: -------------------------------------------------------------------------------- 1 | """ 2 | This is a demo for using CogAgent and CogVLM in CLI 3 | Make sure you have installed vicuna-7b-v1.5 tokenizer model (https://huggingface.co/lmsys/vicuna-7b-v1.5), full checkpoint of vicuna-7b-v1.5 LLM is not required. 4 | In this demo, We us chat template, you can use others to replace such as 'vqa'. 5 | Strongly suggest to use GPU with bfloat16 support, otherwise, it will be slow. 6 | Mention that only one picture can be processed at one conversation, which means you can not replace or insert another picture during the conversation. 
7 | """ 8 | 9 | import argparse 10 | import torch 11 | 12 | from PIL import Image 13 | from transformers import AutoModelForCausalLM, LlamaTokenizer 14 | 15 | parser = argparse.ArgumentParser() 16 | parser.add_argument("--image", type=str) 17 | parser.add_argument("--question", type=str) 18 | parser.add_argument("--quant", choices=[4], type=int, default=None, help='quantization bits') 19 | parser.add_argument("--from_pretrained", type=str, default="THUDM/cogagent-chat-hf", help='pretrained ckpt') 20 | parser.add_argument("--local_tokenizer", type=str, default="lmsys/vicuna-7b-v1.5", help='tokenizer path') 21 | parser.add_argument("--fp16", action="store_true") 22 | parser.add_argument("--bf16", action="store_true") 23 | 24 | args = parser.parse_args() 25 | MODEL_PATH = args.from_pretrained 26 | TOKENIZER_PATH = args.local_tokenizer 27 | DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu' 28 | 29 | tokenizer = LlamaTokenizer.from_pretrained(TOKENIZER_PATH) 30 | if args.bf16: 31 | torch_type = torch.bfloat16 32 | else: 33 | torch_type = torch.float16 34 | 35 | print("========Use torch type as:{} with device:{}========\n\n".format(torch_type, DEVICE)) 36 | 37 | if args.quant: 38 | model = AutoModelForCausalLM.from_pretrained( 39 | MODEL_PATH, 40 | torch_dtype=torch_type, 41 | low_cpu_mem_usage=True, 42 | load_in_4bit=True, 43 | trust_remote_code=True 44 | ).eval() 45 | else: 46 | model = AutoModelForCausalLM.from_pretrained( 47 | MODEL_PATH, 48 | torch_dtype=torch_type, 49 | low_cpu_mem_usage=True, 50 | load_in_4bit=args.quant is not None, 51 | trust_remote_code=True 52 | ).to(DEVICE).eval() 53 | 54 | text_only_template = "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: {} ASSISTANT:" 55 | 56 | while True: 57 | image_path = args.image 58 | if image_path == '': 59 | print('You did not enter image path, the following will be a plain text conversation.') 60 | image = None 61 | text_only_first_query = True 62 | else: 63 | image = Image.open(image_path).convert('RGB') 64 | 65 | history = [] 66 | 67 | query = args.question 68 | if query == "clear": 69 | break 70 | 71 | if image is None: 72 | if text_only_first_query: 73 | query = text_only_template.format(query) 74 | text_only_first_query = False 75 | else: 76 | old_prompt = '' 77 | for _, (old_query, response) in enumerate(history): 78 | old_prompt += old_query + " " + response + "\n" 79 | query = old_prompt + "用户: {} 小千:".format(query) 80 | 81 | if image is None: 82 | input_by_model = model.build_conversation_input_ids(tokenizer, query=query, history=history, template_version='base') 83 | else: 84 | input_by_model = model.build_conversation_input_ids(tokenizer, query=query, history=history, images=[image]) 85 | 86 | inputs = { 87 | 'input_ids': input_by_model['input_ids'].unsqueeze(0).to(DEVICE), 88 | 'token_type_ids': input_by_model['token_type_ids'].unsqueeze(0).to(DEVICE), 89 | 'attention_mask': input_by_model['attention_mask'].unsqueeze(0).to(DEVICE), 90 | 'images': [[input_by_model['images'][0].to(DEVICE).to(torch_type)]] if image is not None else None, 91 | } 92 | if 'cross_images' in input_by_model and input_by_model['cross_images']: 93 | inputs['cross_images'] = [[input_by_model['cross_images'][0].to(DEVICE).to(torch_type)]] 94 | 95 | # add any transformers params here. 
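# Note (added comment): "max_length" counts prompt plus generated tokens; "max_new_tokens" can be used instead to cap only the newly generated part. Sampling options such as "temperature" or "top_p" only take effect when "do_sample" is True.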
96 | gen_kwargs = {"max_length": 2048, 97 | "do_sample": False} # "temperature": 0.9 98 | with torch.no_grad(): 99 | outputs = model.generate(**inputs, **gen_kwargs) 100 | outputs = outputs[:, inputs['input_ids'].shape[1]:] 101 | response = tokenizer.decode(outputs[0]) 102 | response = response.split("</s>")[0] 103 | print("\n小千:", response) 104 |
-------------------------------------------------------------------------------- /main.py: --------------------------------------------------------------------------------
1 | """ 2 | This is a demo for using CogAgent and CogVLM in the CLI. 3 | Make sure you have installed the vicuna-7b-v1.5 tokenizer (https://huggingface.co/lmsys/vicuna-7b-v1.5); the full vicuna-7b-v1.5 LLM checkpoint is not required. 4 | In this demo, we use the 'chat' template; you can replace it with others such as 'vqa'. 5 | We strongly suggest using a GPU with bfloat16 support; otherwise, inference will be slow. 6 | Note that only one picture can be processed per conversation, which means you cannot replace or insert another picture during the conversation. 7 | """ 8 | 9 | import argparse 10 | import torch 11 | 12 | from PIL import Image 13 | from transformers import AutoModelForCausalLM, LlamaTokenizer 14 | 15 | parser = argparse.ArgumentParser() 16 | parser.add_argument("--quant", choices=[4], type=int, default=None, help='quantization bits') 17 | parser.add_argument("--from_pretrained", type=str, default="THUDM/cogagent-chat-hf", help='pretrained ckpt') 18 | parser.add_argument("--local_tokenizer", type=str, default="lmsys/vicuna-7b-v1.5", help='tokenizer path') 19 | parser.add_argument("--fp16", action="store_true") 20 | parser.add_argument("--bf16", action="store_true") 21 | 22 | args = parser.parse_args() 23 | MODEL_PATH = args.from_pretrained 24 | TOKENIZER_PATH = args.local_tokenizer 25 | DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu' 26 | 27 | tokenizer = LlamaTokenizer.from_pretrained(TOKENIZER_PATH) 28 | if args.bf16: 29 | torch_type = torch.bfloat16 30 | else: 31 | torch_type = torch.float16 32 | 33 | print("========Use torch type as:{} with device:{}========\n\n".format(torch_type, DEVICE)) 34 | 35 | if args.quant: 36 | model = AutoModelForCausalLM.from_pretrained( 37 | MODEL_PATH, 38 | torch_dtype=torch_type, 39 | low_cpu_mem_usage=True, 40 | load_in_4bit=True, 41 | trust_remote_code=True 42 | ).eval() 43 | else: 44 | model = AutoModelForCausalLM.from_pretrained( 45 | MODEL_PATH, 46 | torch_dtype=torch_type, 47 | low_cpu_mem_usage=True, 48 | load_in_4bit=args.quant is not None, 49 | trust_remote_code=True 50 | ).to(DEVICE).eval() 51 | 52 | text_only_template = "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: {} ASSISTANT:" 53 | 54 | while True: 55 | image_path = input("图片地址 >>>>> ") 56 | if image_path == '': 57 | print('You did not enter image path, the following will be a plain text conversation.') 58 | image = None 59 | text_only_first_query = True 60 | else: 61 | image = Image.open(image_path).convert('RGB') 62 | 63 | history = [] 64 | 65 | while True: 66 | query = input("Human:") 67 | if query == "clear": 68 | break 69 | 70 | if image is None: 71 | if text_only_first_query: 72 | query = text_only_template.format(query) 73 | text_only_first_query = False 74 | else: 75 | old_prompt = '' 76 | for _, (old_query, response) in enumerate(history): 77 | old_prompt += old_query + " " + response + "\n" 78 | query = old_prompt + "用户: {} 小千:".format(query) 79 | 80 | if image is None: 81 | input_by_model = model.build_conversation_input_ids(tokenizer, query=query, history=history, template_version='base') 82 | else: 83 | input_by_model = model.build_conversation_input_ids(tokenizer, query=query, history=history, images=[image]) 84 | 85 | inputs = { 86 | 'input_ids': input_by_model['input_ids'].unsqueeze(0).to(DEVICE), 87 | 'token_type_ids': input_by_model['token_type_ids'].unsqueeze(0).to(DEVICE), 88 | 'attention_mask': input_by_model['attention_mask'].unsqueeze(0).to(DEVICE), 89 | 'images': [[input_by_model['images'][0].to(DEVICE).to(torch_type)]] if image is not None else None, 90 | } 91 | if 'cross_images' in input_by_model and input_by_model['cross_images']: 92 | inputs['cross_images'] = [[input_by_model['cross_images'][0].to(DEVICE).to(torch_type)]] 93 | 94 | # add any transformers params here. 95 | gen_kwargs = {"max_length": 2048, 96 | "do_sample": False} # "temperature": 0.9 97 | with torch.no_grad(): 98 | outputs = model.generate(**inputs, **gen_kwargs) 99 | outputs = outputs[:, inputs['input_ids'].shape[1]:] 100 | response = tokenizer.decode(outputs[0]) 101 | response = response.split("</s>")[0] 102 | print("\nCog:", response) 103 | history.append((query, response)) 104 |
-------------------------------------------------------------------------------- /NLP/CogVLM.md: --------------------------------------------------------------------------------
1 | # CogVLM 2 | 3 | ## basic demo 4 | 5 | ``` 6 | """ 7 | This is a demo for using CogAgent and CogVLM in the CLI. 8 | Make sure you have installed the vicuna-7b-v1.5 tokenizer (https://huggingface.co/lmsys/vicuna-7b-v1.5); the full vicuna-7b-v1.5 LLM checkpoint is not required. 9 | In this demo, we use the 'chat' template; you can replace it with others such as 'vqa'. 10 | We strongly suggest using a GPU with bfloat16 support; otherwise, inference will be slow. 11 | Note that only one picture can be processed per conversation, which means you cannot replace or insert another picture during the conversation.
12 | """ 13 | 14 | import argparse 15 | import torch 16 | 17 | from PIL import Image 18 | from transformers import AutoModelForCausalLM, LlamaTokenizer 19 | 20 | parser = argparse.ArgumentParser() 21 | parser.add_argument("--quant", choices=[4], type=int, default=None, help='quantization bits') 22 | parser.add_argument("--from_pretrained", type=str, default="THUDM/cogagent-chat-hf", help='pretrained ckpt') 23 | parser.add_argument("--local_tokenizer", type=str, default="lmsys/vicuna-7b-v1.5", help='tokenizer path') 24 | parser.add_argument("--fp16", action="store_true") 25 | parser.add_argument("--bf16", action="store_true") 26 | 27 | args = parser.parse_args() 28 | MODEL_PATH = args.from_pretrained 29 | TOKENIZER_PATH = args.local_tokenizer 30 | DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu' 31 | 32 | tokenizer = LlamaTokenizer.from_pretrained(TOKENIZER_PATH) 33 | if args.bf16: 34 | torch_type = torch.bfloat16 35 | else: 36 | torch_type = torch.float16 37 | 38 | print("========Use torch type as:{} with device:{}========\n\n".format(torch_type, DEVICE)) 39 | 40 | if args.quant: 41 | model = AutoModelForCausalLM.from_pretrained( 42 | MODEL_PATH, 43 | torch_dtype=torch_type, 44 | low_cpu_mem_usage=True, 45 | load_in_4bit=True, 46 | trust_remote_code=True 47 | ).eval() 48 | else: 49 | model = AutoModelForCausalLM.from_pretrained( 50 | MODEL_PATH, 51 | torch_dtype=torch_type, 52 | low_cpu_mem_usage=True, 53 | load_in_4bit=args.quant is not None, 54 | trust_remote_code=True 55 | ).to(DEVICE).eval() 56 | 57 | text_only_template = "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: {} ASSISTANT:" 58 | 59 | while True: 60 | image_path = input("image path >>>>> ") 61 | if image_path == '': 62 | print('You did not enter image path, the following will be a plain text conversation.') 63 | image = None 64 | text_only_first_query = True 65 | else: 66 | image = Image.open(image_path).convert('RGB') 67 | 68 | history = [] 69 | 70 | while True: 71 | query = input("Human:") 72 | if query == "clear": 73 | break 74 | 75 | if image is None: 76 | if text_only_first_query: 77 | query = text_only_template.format(query) 78 | text_only_first_query = False 79 | else: 80 | old_prompt = '' 81 | for _, (old_query, response) in enumerate(history): 82 | old_prompt += old_query + " " + response + "\n" 83 | query = old_prompt + "USER: {} ASSISTANT:".format(query) 84 | 85 | if image is None: 86 | input_by_model = model.build_conversation_input_ids(tokenizer, query=query, history=history, template_version='base') 87 | else: 88 | input_by_model = model.build_conversation_input_ids(tokenizer, query=query, history=history, images=[image]) 89 | 90 | inputs = { 91 | 'input_ids': input_by_model['input_ids'].unsqueeze(0).to(DEVICE), 92 | 'token_type_ids': input_by_model['token_type_ids'].unsqueeze(0).to(DEVICE), 93 | 'attention_mask': input_by_model['attention_mask'].unsqueeze(0).to(DEVICE), 94 | 'images': [[input_by_model['images'][0].to(DEVICE).to(torch_type)]] if image is not None else None, 95 | } 96 | if 'cross_images' in input_by_model and input_by_model['cross_images']: 97 | inputs['cross_images'] = [[input_by_model['cross_images'][0].to(DEVICE).to(torch_type)]] 98 | 99 | # add any transformers params here. 
100 | gen_kwargs = {"max_length": 2048, 101 | "do_sample": False} # "temperature": 0.9 102 | with torch.no_grad(): 103 | outputs = model.generate(**inputs, **gen_kwargs) 104 | outputs = outputs[:, inputs['input_ids'].shape[1]:] 105 | response = tokenizer.decode(outputs[0]) 106 | response = response.split("")[0] 107 | print("\nCog:", response) 108 | history.append((query, response)) 109 | ``` 110 | -------------------------------------------------------------------------------- /.gitmodules: -------------------------------------------------------------------------------- 1 | [submodule "Visual-Perception/2D-Perception/WeSAM"] 2 | path = Visual-Perception/2D-Perception/WeSAM 3 | url = git@github.com:zhang-haojie/wesam.git 4 | [submodule "Visual-Perception/3D-Perception/SAM-6D"] 5 | path = Visual-Perception/3D-Perception/SAM-6D 6 | url = git@github.com:JiehongLin/SAM-6D.git 7 | [submodule "Visual-Perception/3D-Perception/VISTA"] 8 | path = Visual-Perception/3D-Perception/VISTA 9 | url = git@github.com:Gorilla-Lab-SCUT/VISTA.git 10 | [submodule "Visual-Perception/3D-Perception/Frustum-ConvNet"] 11 | path = Visual-Perception/3D-Perception/Frustum-ConvNet 12 | url = git@github.com:Gorilla-Lab-SCUT/frustum-convnet.git 13 | [submodule "Visual-Perception/3D-Perception/SSTNet"] 14 | path = Visual-Perception/3D-Perception/SSTNet 15 | url = git@github.com:Gorilla-Lab-SCUT/SSTNet.git 16 | [submodule "Audio/chatbot_SER"] 17 | path = Audio/chatbot_SER 18 | url = https://github.com/qiaoweima/chatbot_SER.git 19 | [submodule "Audio/chatbot_TTS"] 20 | path = Audio/chatbot_TTS 21 | url = https://github.com/qiaoweima/chatbot_TTS.git 22 | [submodule "Audio/Audio-Anti-Spoofing"] 23 | path = Audio/Audio-Anti-Spoofing 24 | url = https://github.com/qiaoweima/Audio-Anti-Spoofing.git 25 | [submodule "Audio/chatbot_ASR"] 26 | path = Audio/chatbot_ASR 27 | url = https://github.com/qiaoweima/chatbot_ASR.git 28 | [submodule "Audio/Blizzard_Challenge"] 29 | path = Audio/Blizzard_Challenge 30 | url = https://github.com/qiaoweima/Blizzard_Challenge.git 31 | [submodule "Visual-Perception/2D-Perception/DOQ"] 32 | path = Visual-Perception/2D-Perception/DOQ 33 | url = https://github.com/SherlockHolmes221/DOQ.git 34 | [submodule "Visual-Perception/2D-Perception/HQM"] 35 | path = Visual-Perception/2D-Perception/HQM 36 | url = https://github.com/MuchHair/HQM.git 37 | [submodule "Visual-Perception/2D-Perception/PD-Net"] 38 | path = Visual-Perception/2D-Perception/PD-Net 39 | url = https://github.com/MuchHair/PD-Net.git 40 | [submodule "Visual-Perception/2D-Perception/GGNet"] 41 | path = Visual-Perception/2D-Perception/GGNet 42 | url = https://github.com/SherlockHolmes221/GGNet.git 43 | [submodule "Visual-Perception/2D-Perception/HRNets"] 44 | path = Visual-Perception/2D-Perception/HRNets 45 | url = https://github.com/HRNet/HRNet-Image-Classification.git 46 | [submodule "Visual-Perception/2D-Perception/ISCLNet"] 47 | path = Visual-Perception/2D-Perception/ISCLNet 48 | url = https://github.com/lphxx6222712/ISCLNet.git 49 | [submodule "Visual-Perception/2D-Perception/SM-H"] 50 | path = Visual-Perception/2D-Perception/SM-H 51 | url = https://github.com/Dshaoshuai/Partitioning-stateful-data-stream-applications-in-dynamic-edge-cloud-environments.git 52 | [submodule "Visual-Perception/2D-Perception/CBFL"] 53 | path = Visual-Perception/2D-Perception/CBFL 54 | url = https://github.com/lizhipengcs/CBFL.git 55 | [submodule "Visual-Perception/2D-Perception/DRN-for-Single-Image-Super-Resolution"] 56 | path = 
Visual-Perception/2D-Perception/DRN-for-Single-Image-Super-Resolution 57 | url = https://github.com/guoyongcs/DRN.git 58 | [submodule "Visual-Perception/2D-Perception/DRN-for-Video-Grounding"] 59 | path = Visual-Perception/2D-Perception/DRN-for-Video-Grounding 60 | url = https://github.com/Alvin-Zeng/DRN.git 61 | [submodule "Visual-Perception/2D-Perception/PGCN"] 62 | path = Visual-Perception/2D-Perception/PGCN 63 | url = https://github.com/Alvin-Zeng/PGCN.git 64 | [submodule "Visual-Perception/2D-Perception/NAT"] 65 | path = Visual-Perception/2D-Perception/NAT 66 | url = https://github.com/guoyongcs/NAT.git 67 | [submodule "Visual-Perception/2D-Perception/EATA"] 68 | path = Visual-Perception/2D-Perception/EATA 69 | url = https://github.com/mr-eggplant/EATA.git 70 | [submodule "Visual-Perception/2D-Perception/CNAS"] 71 | path = Visual-Perception/2D-Perception/CNAS 72 | url = https://github.com/guoyongcs/CNAS.git 73 | [submodule "Visual-Perception/2D-Perception/CTNAS"] 74 | path = Visual-Perception/2D-Perception/CTNAS 75 | url = https://github.com/chenyaofo/CTNAS.git 76 | [submodule "Visual-Perception/2D-Perception/SymNets"] 77 | path = Visual-Perception/2D-Perception/SymNets 78 | url = https://github.com/Gorilla-Lab-SCUT/SymNets.git 79 | [submodule "Visual-Perception/2D-Perception/RSPNet"] 80 | path = Visual-Perception/2D-Perception/RSPNet 81 | url = https://github.com/PeihaoChen/RSPNet.git 82 | [submodule "Visual-Perception/2D-Perception/CPGA"] 83 | path = Visual-Perception/2D-Perception/CPGA 84 | url = https://github.com/SCUT-AILab/CPGA.git 85 | [submodule "Visual-Perception/2D-Perception/SGE-LA"] 86 | path = Visual-Perception/2D-Perception/SGE-LA 87 | url = https://github.com/Kali-Hac/SGE-LA.git 88 | [submodule "Visual-Perception/2D-Perception/SAR"] 89 | path = Visual-Perception/2D-Perception/SAR 90 | url = https://github.com/mr-eggplant/SAR.git 91 | [submodule "Visual-Perception/2D-Perception/EPS-AD"] 92 | path = Visual-Perception/2D-Perception/EPS-AD 93 | url = https://github.com/ZSHsh98/EPS-AD.git 94 | [submodule "Visual-Perception/2D-Perception/MME"] 95 | path = Visual-Perception/2D-Perception/MME 96 | url = https://github.com/XinyuSun/MME.git 97 | [submodule "Visual-Perception/2D-Perception/DENet"] 98 | path = Visual-Perception/2D-Perception/DENet 99 | url = https://github.com/lizhaoliu-Lec/DENet.git 100 | [submodule "Visual-Perception/2D-Perception/DAS"] 101 | path = Visual-Perception/2D-Perception/DAS 102 | url = https://github.com/lizhaoliu-Lec/DAS.git 103 | [submodule "Visual-Perception/2D-Perception/ProCA"] 104 | path = Visual-Perception/2D-Perception/ProCA 105 | url = https://github.com/Hongbin98/ProCA.git 106 | [submodule "Visual-Perception/2D-Perception/AGNet"] 107 | path = Visual-Perception/2D-Perception/AGNet 108 | url = https://github.com/HzFu/AGNet.git 109 | [submodule "Visual-Perception/2D-Perception/BPAI-Net"] 110 | path = Visual-Perception/2D-Perception/BPAI-Net 111 | url = https://github.com/SCUT-AILab/BPAI-Net.git 112 | [submodule "Visual-Perception/2D-Perception/LCCGAN-v2"] 113 | path = Visual-Perception/2D-Perception/LCCGAN-v2 114 | url = https://github.com/SCUTjinchengli/LCCGAN-v2.git 115 | [submodule "Visual-Perception/2D-Perception/CoUDA"] 116 | path = Visual-Perception/2D-Perception/CoUDA 117 | url = https://github.com/Vanint/CoUDA.git 118 | [submodule "Visual-Perception/2D-Perception/CAGEs"] 119 | path = Visual-Perception/2D-Perception/CAGEs 120 | url = https://github.com/Kali-Hac/Locality-Awareness-SGE.git 121 | [submodule 
"Visual-Perception/2D-Perception/CorrReg"] 122 | path = Visual-Perception/2D-Perception/CorrReg 123 | url = https://github.com/JiehongLin/CorrReg.git 124 | [submodule "Visual-Perception/2D-Perception/SkeletonNet"] 125 | path = Visual-Perception/2D-Perception/SkeletonNet 126 | url = https://github.com/Gorilla-Lab-SCUT/SkeletonNet.git 127 | [submodule "Visual-Perception/2D-Perception/PMF"] 128 | path = Visual-Perception/2D-Perception/PMF 129 | url = https://github.com/ICEORY/PMF.git 130 | [submodule "Visual-Perception/2D-Perception/CPEM"] 131 | path = Visual-Perception/2D-Perception/CPEM 132 | url = https://github.com/deepmo24/CPEM.git 133 | [submodule "Visual-Perception/2D-Perception/CR-NeRF"] 134 | path = Visual-Perception/2D-Perception/CR-NeRF 135 | url = https://github.com/YifYang993/CR-NeRF-PyTorch.git 136 | [submodule "Visual-Perception/2D-Perception/CPCM"] 137 | path = Visual-Perception/2D-Perception/CPCM 138 | url = https://github.com/lizhaoliu-Lec/CPCM.git 139 | [submodule "Visual-Perception/2D-Perception/SSTNet"] 140 | path = Visual-Perception/2D-Perception/SSTNet 141 | url = https://github.com/Gorilla-Lab-SCUT/SSTNet.git 142 | [submodule "Visual-Perception/2D-Perception/RegNet"] 143 | path = Visual-Perception/2D-Perception/RegNet 144 | url = https://github.com/PeihaoChen/regnet.git 145 | [submodule "Visual-Perception/2D-Perception/DRAW"] 146 | path = Visual-Perception/2D-Perception/DRAW 147 | url = https://github.com/menggehe/DRAW.git 148 | [submodule "NLP/DRAW"] 149 | path = NLP/DRAW 150 | url = https://github.com/menggehe/DRAW.git 151 | [submodule "Audio/RegNet"] 152 | path = Audio/RegNet 153 | url = https://github.com/PeihaoChen/regnet.git 154 | [submodule "Visual-Perception/3D-Perception/CPCM"] 155 | path = Visual-Perception/3D-Perception/CPCM 156 | url = https://github.com/lizhaoliu-Lec/CPCM.git 157 | [submodule "Visual-Perception/3D-Perception/CR-NeRF"] 158 | path = Visual-Perception/3D-Perception/CR-NeRF 159 | url = https://github.com/YifYang993/CR-NeRF-PyTorch.git 160 | [submodule "Visual-Perception/3D-Perception/CPEM"] 161 | path = Visual-Perception/3D-Perception/CPEM 162 | url = https://github.com/deepmo24/CPEM.git 163 | [submodule "Visual-Perception/3D-Perception/PMF"] 164 | path = Visual-Perception/3D-Perception/PMF 165 | url = https://github.com/ICEORY/PMF.git 166 | [submodule "Visual-Perception/3D-Perception/SkeletonNet"] 167 | path = Visual-Perception/3D-Perception/SkeletonNet 168 | url = https://github.com/Gorilla-Lab-SCUT/SkeletonNet.git 169 | [submodule "Visual-Perception/3D-Perception/CorrReg"] 170 | path = Visual-Perception/3D-Perception/CorrReg 171 | url = https://github.com/JiehongLin/CorrReg.git 172 | [submodule "Visual-Perception/3D-Perception/CAGEs"] 173 | path = Visual-Perception/3D-Perception/CAGEs 174 | url = https://github.com/Kali-Hac/Locality-Awareness-SGE.git 175 | [submodule "Multi-Modal/TDS"] 176 | path = Multi-Modal/TDS 177 | url = https://github.com/Zhiquan-Wen/TDS.git 178 | [submodule "Multi-Modal/D-VQA"] 179 | path = Multi-Modal/D-VQA 180 | url = https://github.com/Zhiquan-Wen/D-VQA.git 181 | [submodule "Multi-Modal/HPGM"] 182 | path = Multi-Modal/HPGM 183 | url = https://github.com/chenqi008/HPGM.git 184 | [submodule "Multi-Modal/CMRAN"] 185 | path = Multi-Modal/CMRAN 186 | url = https://github.com/FloretCat/CMRAN.git 187 | [submodule "Multi-Modal/CRN_tvqa"] 188 | path = Multi-Modal/CRN_tvqa 189 | url = https://github.com/guanghuixu/CRN_tvqa.git 190 | [submodule "Multi-Modal/LaBERT"] 191 | path = Multi-Modal/LaBERT 192 | url = 
https://github.com/bearcatt/LaBERT.git 193 | [submodule "Multi-Modal/V2C"] 194 | path = Multi-Modal/V2C 195 | url = https://github.com/chenqi008/V2C.git 196 | [submodule "Robotic/ActiveCamera"] 197 | path = Robotic/ActiveCamera 198 | url = https://github.com/PeihaoChen/ActiveCamera.git 199 | [submodule "Robotic/WS-MGMap"] 200 | path = Robotic/WS-MGMap 201 | url = https://github.com/PeihaoChen/WS-MGMap.git 202 | [submodule "Robotic/YouTube-VLN"] 203 | path = Robotic/YouTube-VLN 204 | url = https://github.com/JeremyLinky/YouTube-VLN.git 205 | [submodule "NLP/CogVLM"] 206 | path = NLP/CogVLM 207 | url = git@github.com:THUDM/CogVLM.git 208 | [submodule "NLP/Qwen"] 209 | path = NLP/Qwen 210 | url = git@github.com:QwenLM/Qwen.git 211 | [submodule "Visual-Perception/2D-Perception/TTAC"] 212 | path = Visual-Perception/2D-Perception/TTAC 213 | url = https://github.com/Gorilla-Lab-SCUT/TTAC 214 | [submodule "Visual-Perception/2D-Perception/TTAC2"] 215 | path = Visual-Perception/2D-Perception/TTAC2 216 | url = https://github.com/Gorilla-Lab-SCUT/TTAC2 217 | [submodule "Visual-Perception/2D-Perception/TRIBE"] 218 | path = Visual-Perception/2D-Perception/TRIBE 219 | url = https://github.com/Gorilla-Lab-SCUT/TRIBE 220 | [submodule "Audio/SSL-PVAD"] 221 | path = Audio/SSL-PVAD 222 | url = https://github.com/HolgerBovbjerg/SSL-PVAD.git 223 | [submodule "Visual-Perception/3D-Perception/HelixSurf"] 224 | path = Visual-Perception/3D-Perception/HelixSurf 225 | url = https://github.com/Gorilla-Lab-SCUT/HelixSurf 226 | [submodule "Visual-Perception/3D-Perception/QS3"] 227 | path = Visual-Perception/3D-Perception/QS3 228 | url = https://github.com/gorilla-lab-scut/qs3 229 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 |
2 | 3 | 4 | 视听融合感知智能引擎平台
5 | 6 | 9 | 10 | 11 | [![Project Page](https://img.shields.io/badge/Project-Page-F9AB00?style=for-the-badge)](http://183.63.152.178:6710/#/login)   12 | [![License](https://img.shields.io/badge/LICENSE-MIT-green.svg?style=for-the-badge)](https://github.com/Gorilla-Lab-SCUT/Visual-Auditory-Fusion-Perception/blob/main/LICENSE)   13 | [![Demo Website](https://img.shields.io/badge/Demo-Website-yellow.svg?style=for-the-badge)](http://183.63.152.178:6710/#/engine-platform/visual-semantics)   14 | 15 | 16 | 📕 中文版 README | 📗 [English README](./README_en.md) 17 | 18 |
19 | 20 | ### 📻 安装指南 21 | 22 | 在使用我们的模型之前,您需要先确保环境中已安装所有必要的依赖项。这些依赖项涵盖了模型运行所需的各类库和工具,确保您可以顺利进行模型推理。 23 | 24 | 请按照以下步骤进行安装: 25 | 26 | 1. **打开终端或命令提示符**:根据您的操作系统,打开相应的命令行界面。 27 | 2. **使用pip安装依赖项**:输入以下命令,通过pip安装所需的Python包和库。 28 | 29 | ```bash 30 | pip install -r requirements.txt 31 | ``` 32 | 33 | 34 | ### 🚀 推理指南 35 | 36 | 安装完所有必要的依赖项后,您就可以开始使用我们的模型进行推理了。我们提供了两种推理方式:使用终端进行推理和使用交互式推理。 37 | 38 | 这里我们以示例图片`asserts/demo.jpg`为例进行说明: 39 | 40 | 41 | 42 | #### 1. 使用终端进行推理 43 | 44 | 如果您希望直接在终端中运行推理脚本,可以使用以下命令: 45 | 46 | ```bash 47 | python chatme.py --image asserts/demo.jpg --question "货架上有几个苹果?" 48 | ``` 49 | 50 | 此命令会加载预训练的模型,并使用提供的图片(`demo.jpg`)和问题(`"货架上有几个苹果?"`)进行推理。 51 | 52 | 模型会分析图片并尝试回答提出的问题,推理结果将以文本形式输出到终端中,例如: 53 | 54 | ``` 55 | 小千:货架上有三个苹果。 56 | ``` 57 | 58 | #### 2. 使用交互式推理 59 | 60 | 除了使用终端进行推理,您还可以使用交互式推理功能与大模型进行实时交互。要启动交互式终端,请运行以下命令: 61 | 62 | ```bash 63 | python main.py 64 | ``` 65 | 66 | 此命令会启动一个交互式终端,等待您输入图片地址。您可以在终端中输入图片地址(例如`asserts/demo.jpg`),然后按下回车键。 67 | 68 | 模型会根据您提供的图片进行推理,并等待您输入问题。 69 | 70 | 一旦您输入了问题(例如`"货架上有几个苹果?"`),模型就会分析图片并尝试回答,推理结果将以文本形式输出到终端中,例如: 71 | 72 | ```bash 73 | 图片地址 >>>>> asserts/demo.jpg 74 | 用户:货架上有几个苹果? 75 | 小千:货架上有三个苹果。 76 | ``` 77 | 78 | 通过这种方式,您可以轻松地与模型进行交互,并向其提出各种问题。 79 | 80 | 81 | ### 🧾 References 82 | 83 | #### 📈 Benchmark #### 84 | 85 | - [AGE Challenge Dataset](https://age.grand-challenge.org) 86 | 87 | - [COVID-DA Dataset](https://drive.google.com/file/d/1w2brbYLn1s1hvmLkKKsBsm1mCbz4F512/view?usp=sharing) 88 | 89 | - [Visually Aligned Sound (VAS) Dataset](https://drive.google.com/file/d/14birixmH7vwIWKxCHI0MIWCcZyohF59g/view?usp=sharing) 90 | 91 | #### 📷 Visual Perception 92 | 93 | - [2D Perception](Visual-Perception/2D-Perception/) 94 | 95 | - [Improving the Generalization of Segmentation Foundation Model under Distribution Shift via Weakly Supervised Adaptation](https://github.com/zhang-haojie/wesam) 96 | 97 | - [Revisiting Realistic Test-Time Training: Sequential Inference and Adaptation by Anchored Clustering Regularizad Self-Training](https://github.com/Gorilla-Lab-SCUT/TTAC2) 98 | 99 | - [Towards Real-World Test-Time Adaptation: Tri-Net Self-Training with Balanced Normalization](https://github.com/Gorilla-Lab-SCUT/TRIBE) 100 | 101 | - [Revisiting Realistic Test-Time Training: Sequential Inference and Adaptation by Anchored Clustering](https://github.com/Gorilla-Lab-SCUT/TTAC) 102 | 103 | - [Distillation Using Oracle Queries for Transformer-based Human-Object Interaction Detection](https://github.com/SherlockHolmes221/DOQ) 104 | 105 | - [High-resolution networks (HRNets) for Image classification](https://github.com/HRNet/HRNet-Image-Classification) 106 | 107 | 108 | 109 | - [Intra- and Inter-Slice Contrastive Learning for Point Supervised OCT Fluid Segmentation](https://github.com/lphxx6222712/ISCLNet) 110 | 111 | - [Partitioning Stateful Data Stream Applications in Dynamic Edge Cloud Environments](https://github.com/Dshaoshuai/Partitioning-stateful-data-stream-applications-in-dynamic-edge-cloud-environments) 112 | 113 | 114 | 115 | 116 | 117 | 118 | 119 | - [Closed-loop Matters: Dual Regression Networks for Single Image Super-Resolution](https://github.com/guoyongcs/DRN) 120 | 121 | - [Dense Regression Network for Video Grounding](https://github.com/Alvin-Zeng/DRN) 122 | 123 | - [Graph Convolutional Networks for Temporal Action Localization](https://github.com/Alvin-Zeng/PGCN) 124 | 125 | - [NAT: Neural Architecture Transformer for Accurate and Compact Architectures](https://github.com/guoyongcs/NAT) 126 | 127 | 128 | 129 | - [Efficient Test-Time 
Model Adaptation without Forgetting](https://github.com/mr-eggplant/EATA) 130 | 131 | - [Breaking the Curse of Space Explosion: Towards Effcient NAS with Curriculum Search](https://github.com/guoyongcs/CNAS) 132 | 133 | - [Contrastive Neural Architecture Search with Neural Architecture Comparators](https://github.com/chenyaofo/CTNAS) 134 | 135 | - [Domain-Symnetric Networks for Adversarial Domain Adaptation](https://github.com/Gorilla-Lab-SCUT/SymNets) 136 | 137 | - [RSPNet: Relative Speed Perception for Unsupervised Video Representation Learning](https://github.com/PeihaoChen/RSPNet) 138 | 139 | 140 | 141 | - [Source-free Domain Adaptation via Avatar Prototype Generation and Adaptation](https://github.com/SCUT-AILab/CPGA) 142 | 143 | - [Self-Supervised Gait Encoding with Locality-Aware Attention for Person Re-Identification](https://github.com/Kali-Hac/SGE-LA) 144 | 145 | - [Towards Stable Test-Time Adaptation in Dynamic Wild World](https://github.com/mr-eggplant/SAR) 146 | 147 | - [Detecting Adversarial Data by Probing Multiple Perturbations Using Expected Perturbation Score](https://github.com/ZSHsh98/EPS-AD) 148 | 149 | - [Masked Motion Encoding for Self-Supervised Video Representation Learning](https://github.com/XinyuSun/MME) 150 | 151 | 152 | 153 | - [Source-free Domain Adaptation via Avatar Prototype Generation and Adaptation](https://github.com/SCUT-AILab/CPGA) 154 | 155 | - [Dynamic Extension Nets for Few-shot Semantic Segmentation](https://github.com/lizhaoliu-Lec/DENet) 156 | 157 | - [Densely-Anchored Sampling for Deep Metric Learning](https://github.com/lizhaoliu-Lec/DAS) 158 | 159 | - [Prototype-Guided Continual Adaptation for Class-Incremental Unsupervised Domain Adaptation](https://github.com/Hongbin98/ProCA) 160 | 161 | 162 | 163 | 164 | 165 | - [Attention Guided Network for Retinal Image Segmentation](https://github.com/HzFu/AGNet) 166 | 167 | 168 | 169 | 170 | 171 | 172 | 173 | 174 | 175 | 176 | 177 | - [Towards Hard-Positive Query Mining for DETR-based Human-Object Interaction Detection/Object Detection](https://github.com/MuchHair/HQM.git) 178 | 179 | - [Glance and Gaze: Inferring Action-aware Points for One-Stage Human-Object Interaction Detection](https://github.com/SherlockHolmes221/GGNet.git) 180 | 181 | - [Polysemy Deciphering Network for Human-Object Interaction Detection](https://github.com/MuchHair/PD-Net.git) 182 | 183 | - [Bidirectional Posture-Appearance Interaction Network for Driver Behavior Recognition](https://github.com/SCUT-AILab/BPAI-Net) 184 | 185 | - [Improving Generative Adversarial Networks with Local Coordinate Coding](https://github.com/SCUTjinchengli/LCCGAN-v2) 186 | 187 | 188 | 189 | 190 | 191 | - [Collaborative Unsupervised Domain Adaptation for Medical Image Diagnosis](https://github.com/Vanint/CoUDA) 192 | 193 | 194 | 195 | 196 | 197 | 198 | 199 | 200 | 201 | 202 | 203 | 204 | - [3D Perception](Visual-Perception/3D-Perception/) 205 | 206 | - [SAM-6D: Segment Anything Model Meets Zero-Shot 6D Object Pose Estimation](https://github.com/JiehongLin/SAM-6D) 207 | 208 | - [Frustum ConvNet: Sliding Frustums to Aggregate Local Point-Wise Features for Amodal 3D Object Detection](https://github.com/Gorilla-Lab-SCUT/frustum-convnet) 209 | 210 | - [Instance Segmentation in 3D Scenes using Semantic Superpoint Tree Networks](https://github.com/Gorilla-Lab-SCUT/SSTNet) 211 | 212 | - [VISTA: Boosting 3D Object Detection via Dual Cross-VIew SpaTial Attention](https://github.com/Gorilla-Lab-SCUT/VISTA) 213 | 214 | - [A Self-Supervised Gait Encoding Approach 
with Locality-Awareness for 3D Skeleton Based Person Re-Identification](https://github.com/Kali-Hac/Locality-Awareness-SGE) 215 | 216 | - [Deep Multi-View Learning Using Neuron-Wise Correlation-Maximizing Regularizers](https://github.com/JiehongLin/CorrReg) 217 | 218 | - [A Skeleton-Bridged Deep Learning Approach for Generating Meshes of Complex Topologies From Single RGB Images](https://github.com/Gorilla-Lab-SCUT/SkeletonNet) 219 | 220 | - [Perception-Aware Multi-Sensor Fusion for 3D LiDAR Semantic Segmentation](https://github.com/ICEORY/PMF) 221 | 222 | - [CPEM: Consistent Parameter Estimation Model](https://github.com/deepmo24/CPEM) 223 | 224 | - [CR-NeRF: Cross-Ray Neural Radiance Fields for Novel-view Synthesis from Unconstrained Image Collections](https://github.com/YifYang993/CR-NeRF-PyTorch) 225 | 226 | - [Contextual Point Cloud Modeling for Weakly-supervised Point Cloud Semantic Segmentation](https://github.com/lizhaoliu-Lec/CPCM) 227 | 228 | - [Quasi-Balanced Self-Training on Noise-Aware Synthesis of Object Point Clouds for Closing Domain Gap](https://github.com/gorilla-lab-scut/qs3) 229 | 230 | - [HelixSurf: A Robust and Efficient Neural Implicit Surface Learning of Indoor Scenes with Iterative Intertwined Regularization](https://github.com/Gorilla-Lab-SCUT/HelixSurf) 231 | 232 | 233 | #### 🎧 Audio 234 | 235 | - [Automatic Speech Recognition](https://github.com/qiaoweima/chatbot_ASR) 236 | 237 | - [Dialogue System](https://github.com/qiaoweima/chatbot_SER) 238 | 239 | - [Text To Speech](https://github.com/qiaoweima/chatbot_TTS.git) 240 | 241 | - [Audio Anti-spoofing](https://github.com/qiaoweima/Audio-Anti-Spoofing/tree/main) 242 | 243 | - [Blizzard_Challenge](https://github.com/qiaoweima/Blizzard_Challenge) 244 | 245 | - [Voice Activity Detection](https://github.com/HolgerBovbjerg/SSL-PVAD) 246 | 247 | - [RegNet](https://github.com/PeihaoChen/regnet) 248 | 249 | #### 💬 NLP 250 | 251 | - [How to Train Your Agent to Read and Write](https://github.com/menggehe/DRAW) 252 | 253 | - [CogVLM](https://github.com/THUDM/CogVLM.git) 254 | 255 | - [Qwen](https://github.com/QwenLM/Qwen.git) 256 | 257 | #### 🔮 Multi-Modal 258 | 259 | - [Test-Time Model Adaptation for Visual Question Answering with Debiased Self-Supervisions](https://github.com/Zhiquan-Wen/TDS) 260 | 261 | - [Debiased Visual Question Answering from Feature and Sample Perspectives](https://github.com/Zhiquan-Wen/D-VQA) 262 | 263 | - [Intelligent Home 3D: Automatic 3D-House Design from Linguistic Descriptions Only](https://github.com/chenqi008/HPGM) 264 | 265 | - [Cross-Modal Relation-Aware Networks for Audio-Visual Event Localization](https://github.com/FloretCat/CMRAN) 266 | 267 | - [Cascade Reasoning Network for Text-based Visual Question Answering](https://github.com/guanghuixu/CRN_tvqa) 268 | 269 | - [Length-Controllable Image Captioning](https://github.com/bearcatt/LaBERT) 270 | 271 | - [V2C: Visual Voice Cloning](https://github.com/chenqi008/V2C) 272 | 273 | #### 🤖 Robotic 274 | 275 | - [Learning Active Camera for Multi-Object Navigation](https://github.com/PeihaoChen/ActiveCamera) 276 | 277 | - [Weakly-Supervised Multi-Granularity Map Learning for Vision-and-Language Navigation](https://github.com/PeihaoChen/WS-MGMap) 278 | 279 | - [Learning Vision-and-Language Navigation from YouTube Videos](https://github.com/JeremyLinky/YouTube-VLN) 280 | 281 | 288 | -------------------------------------------------------------------------------- /README_en.md: 
-------------------------------------------------------------------------------- 1 |
2 | 3 | 4 | Visual-Auditory Fusion Perception AI Platform
5 | 6 | 7 | [![Project Page](https://img.shields.io/badge/Project-Page-F9AB00?style=for-the-badge)](http://183.63.152.178:6710/#/login)   8 | [![License](https://img.shields.io/badge/LICENSE-MIT-green.svg?style=for-the-badge)](https://github.com/Gorilla-Lab-SCUT/Visual-Auditory-Fusion-Perception/blob/main/LICENSE)   9 | [![Demo Website](https://img.shields.io/badge/Demo-Website-yellow.svg?style=for-the-badge)](http://183.63.152.178:6710/#/engine-platform/visual-semantics)   10 | 11 | [📕 中文版 README](./README.md) | 📗 English README 12 | 13 |
14 | 15 | ### 📻 Installation Guide 16 | 17 | Before using our model, you need to ensure that all necessary dependencies are installed in your environment. These dependencies cover various libraries and tools required for the model's operation, ensuring a smooth inference process. 18 | 19 | Please follow these steps for the installation: 20 | 21 | 1. **Open the Terminal or Command Prompt**: Depending on your operating system, open the corresponding command-line interface. 22 | 2. **Install dependencies using pip**: Enter the following command to install the required Python packages and libraries using pip. 23 | 24 | ```bash 25 | pip install -r requirements.txt 26 | ``` 27 | 28 | 29 | ### 🚀 Inference Guide 30 | 31 | After installing all the necessary dependencies, you can start using our model for inference. We provide two ways of performing inference: using the terminal and using the interactive inference. 32 | 33 | Here, we will use the example image `asserts/demo.jpg` for illustration: 34 | 35 | 36 | 37 | #### 1. Inference using the Terminal 38 | 39 | If you want to directly run the inference script in the terminal, you can use the following command: 40 | 41 | ```bash 42 | python chatme.py --image asserts/demo.jpg --question "How many apples are there on the shelf?" 43 | ``` 44 | 45 | This command will load the pre-trained model and perform inference using the provided image (`demo.jpg`) and question (`"How many apples are there on the shelf?"`). 46 | 47 | The model will analyze the image and attempt to answer the question. The inference result will be output to the terminal in text form, for example: 48 | 49 | ``` 50 | Xiaochuan: There are three apples on the shelf. 51 | ``` 52 | 53 | #### 2. Interactive Inference 54 | 55 | In addition to using the terminal for inference, you can also use the interactive inference feature to interact with the large model in real-time. To start the interactive terminal, run the following command: 56 | 57 | ```bash 58 | python main.py 59 | ``` 60 | 61 | This command will launch an interactive terminal that waits for you to enter the image path. You can type the image path (e.g., `asserts/demo.jpg`) in the terminal and press Enter. 62 | 63 | The model will perform inference based on the provided image and wait for you to enter a question. 64 | 65 | Once you enter a question (e.g., `"How many apples are there on the shelf?"`), the model will analyze the image and attempt to answer it. The inference result will be output to the terminal in text form, for example: 66 | 67 | ```bash 68 | Image Path >>>>> asserts/demo.jpg 69 | User: How many apples are there on the shelf? 70 | Xiaochuan: There are three apples on the shelf. 71 | ``` 72 | 73 | Using this approach, you can easily interact with the model and ask it various questions. 
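
If you prefer to call the model from your own Python code instead of going through `chatme.py` or `main.py`, the sketch below shows a minimal single-image, single-turn query. It simply mirrors the logic of the bundled scripts (the default checkpoints, `build_conversation_input_ids`, and the generation settings are taken from them) and assumes a CUDA GPU with bfloat16 support; adjust the device and dtype for your hardware.

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, LlamaTokenizer

# Same default checkpoints as chatme.py / main.py
tokenizer = LlamaTokenizer.from_pretrained("lmsys/vicuna-7b-v1.5")
model = AutoModelForCausalLM.from_pretrained(
    "THUDM/cogagent-chat-hf",
    torch_dtype=torch.bfloat16,  # use torch.float16 if bf16 is not supported
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).to("cuda").eval()

image = Image.open("asserts/demo.jpg").convert("RGB")
query = "How many apples are there on the shelf?"

# Build inputs for a single-turn conversation with one image
input_by_model = model.build_conversation_input_ids(
    tokenizer, query=query, history=[], images=[image]
)
inputs = {
    "input_ids": input_by_model["input_ids"].unsqueeze(0).to("cuda"),
    "token_type_ids": input_by_model["token_type_ids"].unsqueeze(0).to("cuda"),
    "attention_mask": input_by_model["attention_mask"].unsqueeze(0).to("cuda"),
    "images": [[input_by_model["images"][0].to("cuda").to(torch.bfloat16)]],
}
if input_by_model.get("cross_images"):  # provided by CogAgent checkpoints
    inputs["cross_images"] = [[input_by_model["cross_images"][0].to("cuda").to(torch.bfloat16)]]

with torch.no_grad():
    outputs = model.generate(**inputs, max_length=2048, do_sample=False)
    outputs = outputs[:, inputs["input_ids"].shape[1]:]

print(tokenizer.decode(outputs[0]).split("</s>")[0])
```

For a multi-turn conversation, keep a `history` list of `(query, response)` pairs and pass it to `build_conversation_input_ids` on each turn, as `main.py` does.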
74 | 75 | 76 | ### 🧾 References 77 | 78 | #### 📈 Benchmark #### 79 | 80 | - [AGE Challenge Dataset](https://age.grand-challenge.org) 81 | 82 | - [COVID-DA Dataset](https://drive.google.com/file/d/1w2brbYLn1s1hvmLkKKsBsm1mCbz4F512/view?usp=sharing) 83 | 84 | - [Visually Aligned Sound (VAS) Dataset](https://drive.google.com/file/d/14birixmH7vwIWKxCHI0MIWCcZyohF59g/view?usp=sharing) 85 | 86 | #### 📷 Visual Perception 87 | 88 | - [2D Perception](Visual-Perception/2D-Perception/) 89 | 90 | - [Improving the Generalization of Segmentation Foundation Model under Distribution Shift via Weakly Supervised Adaptation](https://github.com/zhang-haojie/wesam) 91 | 92 | - [Revisiting Realistic Test-Time Training: Sequential Inference and Adaptation by Anchored Clustering Regularizad Self-Training](https://github.com/Gorilla-Lab-SCUT/TTAC2) 93 | 94 | - [Towards Real-World Test-Time Adaptation: Tri-Net Self-Training with Balanced Normalization](https://github.com/Gorilla-Lab-SCUT/TRIBE) 95 | 96 | - [Revisiting Realistic Test-Time Training: Sequential Inference and Adaptation by Anchored Clustering](https://github.com/Gorilla-Lab-SCUT/TTAC) 97 | 98 | - [Distillation Using Oracle Queries for Transformer-based Human-Object Interaction Detection](https://github.com/SherlockHolmes221/DOQ) 99 | 100 | - [High-resolution networks (HRNets) for Image classification](https://github.com/HRNet/HRNet-Image-Classification) 101 | 102 | 103 | 104 | - [Intra- and Inter-Slice Contrastive Learning for Point Supervised OCT Fluid Segmentation](https://github.com/lphxx6222712/ISCLNet) 105 | 106 | - [Partitioning Stateful Data Stream Applications in Dynamic Edge Cloud Environments](https://github.com/Dshaoshuai/Partitioning-stateful-data-stream-applications-in-dynamic-edge-cloud-environments) 107 | 108 | 109 | 110 | 111 | 112 | 113 | 114 | - [Closed-loop Matters: Dual Regression Networks for Single Image Super-Resolution](https://github.com/guoyongcs/DRN) 115 | 116 | - [Dense Regression Network for Video Grounding](https://github.com/Alvin-Zeng/DRN) 117 | 118 | - [Graph Convolutional Networks for Temporal Action Localization](https://github.com/Alvin-Zeng/PGCN) 119 | 120 | - [NAT: Neural Architecture Transformer for Accurate and Compact Architectures](https://github.com/guoyongcs/NAT) 121 | 122 | 123 | 124 | - [Efficient Test-Time Model Adaptation without Forgetting](https://github.com/mr-eggplant/EATA) 125 | 126 | - [Breaking the Curse of Space Explosion: Towards Effcient NAS with Curriculum Search](https://github.com/guoyongcs/CNAS) 127 | 128 | - [Contrastive Neural Architecture Search with Neural Architecture Comparators](https://github.com/chenyaofo/CTNAS) 129 | 130 | - [Domain-Symnetric Networks for Adversarial Domain Adaptation](https://github.com/Gorilla-Lab-SCUT/SymNets) 131 | 132 | - [RSPNet: Relative Speed Perception for Unsupervised Video Representation Learning](https://github.com/PeihaoChen/RSPNet) 133 | 134 | 135 | 136 | - [Source-free Domain Adaptation via Avatar Prototype Generation and Adaptation](https://github.com/SCUT-AILab/CPGA) 137 | 138 | - [Self-Supervised Gait Encoding with Locality-Aware Attention for Person Re-Identification](https://github.com/Kali-Hac/SGE-LA) 139 | 140 | - [Towards Stable Test-Time Adaptation in Dynamic Wild World](https://github.com/mr-eggplant/SAR) 141 | 142 | - [Detecting Adversarial Data by Probing Multiple Perturbations Using Expected Perturbation Score](https://github.com/ZSHsh98/EPS-AD) 143 | 144 | - [Masked Motion Encoding for Self-Supervised Video Representation 
Learning](https://github.com/XinyuSun/MME) 145 | 146 | 147 | 148 | - [Source-free Domain Adaptation via Avatar Prototype Generation and Adaptation](https://github.com/SCUT-AILab/CPGA) 149 | 150 | - [Dynamic Extension Nets for Few-shot Semantic Segmentation](https://github.com/lizhaoliu-Lec/DENet) 151 | 152 | - [Densely-Anchored Sampling for Deep Metric Learning](https://github.com/lizhaoliu-Lec/DAS) 153 | 154 | - [Prototype-Guided Continual Adaptation for Class-Incremental Unsupervised Domain Adaptation](https://github.com/Hongbin98/ProCA) 155 | 156 | 157 | 158 | 159 | 160 | - [Attention Guided Network for Retinal Image Segmentation](https://github.com/HzFu/AGNet) 161 | 162 | 163 | 164 | 165 | 166 | 167 | 168 | 169 | 170 | 171 | 172 | - [Towards Hard-Positive Query Mining for DETR-based Human-Object Interaction Detection/Object Detection](https://github.com/MuchHair/HQM.git) 173 | 174 | - [Glance and Gaze: Inferring Action-aware Points for One-Stage Human-Object Interaction Detection](https://github.com/SherlockHolmes221/GGNet.git) 175 | 176 | - [Polysemy Deciphering Network for Human-Object Interaction Detection](https://github.com/MuchHair/PD-Net.git) 177 | 178 | - [Bidirectional Posture-Appearance Interaction Network for Driver Behavior Recognition](https://github.com/SCUT-AILab/BPAI-Net) 179 | 180 | - [Improving Generative Adversarial Networks with Local Coordinate Coding](https://github.com/SCUTjinchengli/LCCGAN-v2) 181 | 182 | 183 | 184 | 185 | 186 | - [Collaborative Unsupervised Domain Adaptation for Medical Image Diagnosis](https://github.com/Vanint/CoUDA) 187 | 188 | 189 | 190 | 191 | 192 | 193 | 194 | 195 | 196 | 197 | 198 | 199 | - [3D Perception](Visual-Perception/3D-Perception/) 200 | 201 | - [SAM-6D: Segment Anything Model Meets Zero-Shot 6D Object Pose Estimation](https://github.com/JiehongLin/SAM-6D) 202 | 203 | - [Frustum ConvNet: Sliding Frustums to Aggregate Local Point-Wise Features for Amodal 3D Object Detection](https://github.com/Gorilla-Lab-SCUT/frustum-convnet) 204 | 205 | - [Instance Segmentation in 3D Scenes using Semantic Superpoint Tree Networks](https://github.com/Gorilla-Lab-SCUT/SSTNet) 206 | 207 | - [VISTA: Boosting 3D Object Detection via Dual Cross-VIew SpaTial Attention](https://github.com/Gorilla-Lab-SCUT/VISTA) 208 | 209 | - [A Self-Supervised Gait Encoding Approach with Locality-Awareness for 3D Skeleton Based Person Re-Identification](https://github.com/Kali-Hac/Locality-Awareness-SGE) 210 | 211 | - [Deep Multi-View Learning Using Neuron-Wise Correlation-Maximizing Regularizers](https://github.com/JiehongLin/CorrReg) 212 | 213 | - [A Skeleton-Bridged Deep Learning Approach for Generating Meshes of Complex Topologies From Single RGB Images](https://github.com/Gorilla-Lab-SCUT/SkeletonNet) 214 | 215 | - [Perception-Aware Multi-Sensor Fusion for 3D LiDAR Semantic Segmentation](https://github.com/ICEORY/PMF) 216 | 217 | - [CPEM: Consistent Parameter Estimation Model](https://github.com/deepmo24/CPEM) 218 | 219 | - [CR-NeRF: Cross-Ray Neural Radiance Fields for Novel-view Synthesis from Unconstrained Image Collections](https://github.com/YifYang993/CR-NeRF-PyTorch) 220 | 221 | - [Contextual Point Cloud Modeling for Weakly-supervised Point Cloud Semantic Segmentation](https://github.com/lizhaoliu-Lec/CPCM) 222 | 223 | - [Quasi-Balanced Self-Training on Noise-Aware Synthesis of Object Point Clouds for Closing Domain Gap](https://github.com/gorilla-lab-scut/qs3) 224 | 225 | - [HelixSurf: A Robust and Efficient Neural Implicit Surface Learning of Indoor 
Scenes with Iterative Intertwined Regularization](https://github.com/Gorilla-Lab-SCUT/HelixSurf) 226 | 227 | 228 | #### 🎧 Audio 229 | 230 | - [Automatic Speech Recognition](https://github.com/qiaoweima/chatbot_ASR) 231 | 232 | - [Dialogue System](https://github.com/qiaoweima/chatbot_SER) 233 | 234 | - [Text To Speech](https://github.com/qiaoweima/chatbot_TTS.git) 235 | 236 | - [Audio Anti-spoofing](https://github.com/qiaoweima/Audio-Anti-Spoofing/tree/main) 237 | 238 | - [Blizzard_Challenge](https://github.com/qiaoweima/Blizzard_Challenge) 239 | 240 | - [Voice Activity Detection](https://github.com/HolgerBovbjerg/SSL-PVAD) 241 | 242 | - [RegNet](https://github.com/PeihaoChen/regnet) 243 | 244 | #### 💬 NLP 245 | 246 | - [How to Train Your Agent to Read and Write](https://github.com/menggehe/DRAW) 247 | 248 | - [CogVLM](https://github.com/THUDM/CogVLM.git) 249 | 250 | - [Qwen](https://github.com/QwenLM/Qwen.git) 251 | 252 | #### 🔮 Multi-Modal 253 | 254 | - [Test-Time Model Adaptation for Visual Question Answering with Debiased Self-Supervisions](https://github.com/Zhiquan-Wen/TDS) 255 | 256 | - [Debiased Visual Question Answering from Feature and Sample Perspectives](https://github.com/Zhiquan-Wen/D-VQA) 257 | 258 | - [Intelligent Home 3D: Automatic 3D-House Design from Linguistic Descriptions Only](https://github.com/chenqi008/HPGM) 259 | 260 | - [Cross-Modal Relation-Aware Networks for Audio-Visual Event Localization](https://github.com/FloretCat/CMRAN) 261 | 262 | - [Cascade Reasoning Network for Text-based Visual Question Answering](https://github.com/guanghuixu/CRN_tvqa) 263 | 264 | - [Length-Controllable Image Captioning](https://github.com/bearcatt/LaBERT) 265 | 266 | - [V2C: Visual Voice Cloning](https://github.com/chenqi008/V2C) 267 | 268 | #### 🤖 Robotic 269 | 270 | - [Learning Active Camera for Multi-Object Navigation](https://github.com/PeihaoChen/ActiveCamera) 271 | 272 | - [Weakly-Supervised Multi-Granularity Map Learning for Vision-and-Language Navigation](https://github.com/PeihaoChen/WS-MGMap) 273 | 274 | - [Learning Vision-and-Language Navigation from YouTube Videos](https://github.com/JeremyLinky/YouTube-VLN) 275 | 276 | 283 | --------------------------------------------------------------------------------