├── Audio └── __init__.py ├── NLP ├── __init__.py ├── Qwen.md └── CogVLM.md ├── Multi-Modal └── __init__.py ├── Robotic └── __init__.py ├── Visual-Perception └── __init__.py ├── asserts ├── demo.jpg ├── robot.png ├── robot_l.png └── robot_s.png ├── requirements.txt ├── LICENSE ├── chatme.py ├── main.py ├── .gitmodules ├── README.md └── README_en.md /Audio/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /NLP/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /Multi-Modal/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /Robotic/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /Visual-Perception/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /asserts/demo.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Gorilla-Lab-SCUT/Visual-Auditory-Fusion-Perception/HEAD/asserts/demo.jpg -------------------------------------------------------------------------------- /asserts/robot.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Gorilla-Lab-SCUT/Visual-Auditory-Fusion-Perception/HEAD/asserts/robot.png -------------------------------------------------------------------------------- /asserts/robot_l.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Gorilla-Lab-SCUT/Visual-Auditory-Fusion-Perception/HEAD/asserts/robot_l.png -------------------------------------------------------------------------------- /asserts/robot_s.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Gorilla-Lab-SCUT/Visual-Auditory-Fusion-Perception/HEAD/asserts/robot_s.png -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | SwissArmyTransformer>=0.4.9 2 | transformers>=4.36.2 3 | xformers>=0.0.22 4 | torch>=2.1.0 5 | torchvision>=0.16.2 6 | spacy>=3.6.0 7 | pillow>=10.2.0 8 | deepspeed>=0.13.1 9 | seaborn>=0.13.2 10 | loguru~=0.7.2 11 | streamlit>=1.31.0 12 | timm>=0.9.12 13 | accelerate>=0.26.1 14 | pydantic>=2.6.0 -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2024 RVC-Boss 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 
11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. -------------------------------------------------------------------------------- /NLP/Qwen.md: -------------------------------------------------------------------------------- 1 | # Qwen 2 | 3 | Quickstart 4 | 5 | 6 | If not using docker, please make sure you have setup the environment and installed the required packages. Make sure you meet the above requirements, and then install the dependent libraries. 7 | 8 | ``` 9 | pip install -r requirements.txt 10 | ``` 11 | 12 | If your device supports fp16 or bf16, we recommend installing flash-attention (we support flash attention 2 now.) for higher efficiency and lower memory usage. (flash-attention is optional and the project can run normally without installing it) 13 | 14 | ``` 15 | git clone https://github.com/Dao-AILab/flash-attention 16 | cd flash-attention && pip install . 17 | # Below are optional. Installing them might be slow. 18 | # pip install csrc/layer_norm 19 | # If the version of flash-attn is higher than 2.1.1, the following is not needed. 20 | # pip install csrc/rotary 21 | ``` 22 | 23 | Now you can start with Transformers🤗. 24 | 25 | To use Qwen-Chat for the inference, all you need to do is to input a few lines of codes as demonstrated below. Remember to pass in the correct model names or paths, such as "Qwen/Qwen-7B-Chat" and "Qwen/Qwen-14B-Chat". However, please make sure that you are using the latest code. 26 | 27 | ``` 28 | from transformers import AutoModelForCausalLM, AutoTokenizer 29 | from transformers.generation import GenerationConfig 30 | 31 | # Model names: "Qwen/Qwen-7B-Chat", "Qwen/Qwen-14B-Chat" 32 | tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True) 33 | 34 | # use bf16 35 | # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True, bf16=True).eval() 36 | # use fp16 37 | # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True, fp16=True).eval() 38 | # use cpu only 39 | # model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="cpu", trust_remote_code=True).eval() 40 | # use auto mode, automatically select precision based on the device. 41 | model = AutoModelForCausalLM.from_pretrained( 42 | "Qwen/Qwen-7B-Chat", 43 | device_map="auto", 44 | trust_remote_code=True 45 | ).eval() 46 | 47 | # Specify hyperparameters for generation. But if you use transformers>=4.32.0, there is no need to do this. 
48 | # model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True) 49 | 50 | # 1st dialogue turn 51 | response, history = model.chat(tokenizer, "你好", history=None) 52 | print(response) 53 | # 你好!很高兴为你提供帮助。 54 | 55 | # 2nd dialogue turn 56 | response, history = model.chat(tokenizer, "给我讲一个年轻人奋斗创业最终取得成功的故事。", history=history) 57 | print(response) 58 | # 这是一个关于一个年轻人奋斗创业最终取得成功的故事。 59 | # 故事的主人公叫李明,他来自一个普通的家庭,父母都是普通的工人。从小,李明就立下了一个目标:要成为一名成功的企业家。 60 | # 为了实现这个目标,李明勤奋学习,考上了大学。在大学期间,他积极参加各种创业比赛,获得了不少奖项。他还利用课余时间去实习,积累了宝贵的经验。 61 | # 毕业后,李明决定开始自己的创业之路。他开始寻找投资机会,但多次都被拒绝了。然而,他并没有放弃。他继续努力,不断改进自己的创业计划,并寻找新的投资机会。 62 | # 最终,李明成功地获得了一笔投资,开始了自己的创业之路。他成立了一家科技公司,专注于开发新型软件。在他的领导下,公司迅速发展起来,成为了一家成功的科技企业。 63 | # 李明的成功并不是偶然的。他勤奋、坚韧、勇于冒险,不断学习和改进自己。他的成功也证明了,只要努力奋斗,任何人都有可能取得成功。 64 | 65 | # 3rd dialogue turn 66 | response, history = model.chat(tokenizer, "给这个故事起一个标题", history=history) 67 | print(response) 68 | # 《奋斗创业:一个年轻人的成功之路》 69 | ``` 70 | 71 | Running Qwen, the base language model, is also simple. 72 | 73 | In the event of a network issue while attempting to download model checkpoints and codes from HuggingFace, an alternative approach is to initially fetch the checkpoint from ModelScope and then load it from the local directory as outlined below: 74 | 75 | ``` 76 | from modelscope import snapshot_download 77 | from transformers import AutoModelForCausalLM, AutoTokenizer 78 | 79 | # Downloading model checkpoint to a local dir model_dir 80 | # model_dir = snapshot_download('qwen/Qwen-7B') 81 | # model_dir = snapshot_download('qwen/Qwen-7B-Chat') 82 | # model_dir = snapshot_download('qwen/Qwen-14B') 83 | model_dir = snapshot_download('qwen/Qwen-14B-Chat') 84 | 85 | # Loading local checkpoints 86 | # trust_remote_code is still set as True since we still load codes from local dir instead of transformers 87 | tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True) 88 | model = AutoModelForCausalLM.from_pretrained( 89 | model_dir, 90 | device_map="auto", 91 | trust_remote_code=True 92 | ).eval() 93 | ``` 94 | -------------------------------------------------------------------------------- /chatme.py: -------------------------------------------------------------------------------- 1 | """ 2 | This is a demo for using CogAgent and CogVLM in CLI 3 | Make sure you have installed vicuna-7b-v1.5 tokenizer model (https://huggingface.co/lmsys/vicuna-7b-v1.5), full checkpoint of vicuna-7b-v1.5 LLM is not required. 4 | In this demo, We us chat template, you can use others to replace such as 'vqa'. 5 | Strongly suggest to use GPU with bfloat16 support, otherwise, it will be slow. 6 | Mention that only one picture can be processed at one conversation, which means you can not replace or insert another picture during the conversation. 
7 | """ 8 | 9 | import argparse 10 | import torch 11 | 12 | from PIL import Image 13 | from transformers import AutoModelForCausalLM, LlamaTokenizer 14 | 15 | parser = argparse.ArgumentParser() 16 | parser.add_argument("--image", type=str) 17 | parser.add_argument("--question", type=str) 18 | parser.add_argument("--quant", choices=[4], type=int, default=None, help='quantization bits') 19 | parser.add_argument("--from_pretrained", type=str, default="THUDM/cogagent-chat-hf", help='pretrained ckpt') 20 | parser.add_argument("--local_tokenizer", type=str, default="lmsys/vicuna-7b-v1.5", help='tokenizer path') 21 | parser.add_argument("--fp16", action="store_true") 22 | parser.add_argument("--bf16", action="store_true") 23 | 24 | args = parser.parse_args() 25 | MODEL_PATH = args.from_pretrained 26 | TOKENIZER_PATH = args.local_tokenizer 27 | DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu' 28 | 29 | tokenizer = LlamaTokenizer.from_pretrained(TOKENIZER_PATH) 30 | if args.bf16: 31 | torch_type = torch.bfloat16 32 | else: 33 | torch_type = torch.float16 34 | 35 | print("========Use torch type as:{} with device:{}========\n\n".format(torch_type, DEVICE)) 36 | 37 | if args.quant: 38 | model = AutoModelForCausalLM.from_pretrained( 39 | MODEL_PATH, 40 | torch_dtype=torch_type, 41 | low_cpu_mem_usage=True, 42 | load_in_4bit=True, 43 | trust_remote_code=True 44 | ).eval() 45 | else: 46 | model = AutoModelForCausalLM.from_pretrained( 47 | MODEL_PATH, 48 | torch_dtype=torch_type, 49 | low_cpu_mem_usage=True, 50 | load_in_4bit=args.quant is not None, 51 | trust_remote_code=True 52 | ).to(DEVICE).eval() 53 | 54 | text_only_template = "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: {} ASSISTANT:" 55 | 56 | while True: 57 | image_path = args.image 58 | if image_path == '': 59 | print('You did not enter image path, the following will be a plain text conversation.') 60 | image = None 61 | text_only_first_query = True 62 | else: 63 | image = Image.open(image_path).convert('RGB') 64 | 65 | history = [] 66 | 67 | query = args.question 68 | if query == "clear": 69 | break 70 | 71 | if image is None: 72 | if text_only_first_query: 73 | query = text_only_template.format(query) 74 | text_only_first_query = False 75 | else: 76 | old_prompt = '' 77 | for _, (old_query, response) in enumerate(history): 78 | old_prompt += old_query + " " + response + "\n" 79 | query = old_prompt + "用户: {} 小千:".format(query) 80 | 81 | if image is None: 82 | input_by_model = model.build_conversation_input_ids(tokenizer, query=query, history=history, template_version='base') 83 | else: 84 | input_by_model = model.build_conversation_input_ids(tokenizer, query=query, history=history, images=[image]) 85 | 86 | inputs = { 87 | 'input_ids': input_by_model['input_ids'].unsqueeze(0).to(DEVICE), 88 | 'token_type_ids': input_by_model['token_type_ids'].unsqueeze(0).to(DEVICE), 89 | 'attention_mask': input_by_model['attention_mask'].unsqueeze(0).to(DEVICE), 90 | 'images': [[input_by_model['images'][0].to(DEVICE).to(torch_type)]] if image is not None else None, 91 | } 92 | if 'cross_images' in input_by_model and input_by_model['cross_images']: 93 | inputs['cross_images'] = [[input_by_model['cross_images'][0].to(DEVICE).to(torch_type)]] 94 | 95 | # add any transformers params here. 
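# Note (added comment): "max_length" counts prompt plus generated tokens; "max_new_tokens" can be used instead to cap only the newly generated part. Sampling options such as "temperature" or "top_p" only take effect when "do_sample" is True.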
96 | gen_kwargs = {"max_length": 2048, 97 | "do_sample": False} # "temperature": 0.9 98 | with torch.no_grad(): 99 | outputs = model.generate(**inputs, **gen_kwargs) 100 | outputs = outputs[:, inputs['input_ids'].shape[1]:] 101 | response = tokenizer.decode(outputs[0]) 102 | response = response.split("</s>")[0] 103 | print("\n小千:", response) 104 |
-------------------------------------------------------------------------------- /main.py: --------------------------------------------------------------------------------
1 | """ 2 | This is a demo for using CogAgent and CogVLM in the CLI. 3 | Make sure you have installed the vicuna-7b-v1.5 tokenizer (https://huggingface.co/lmsys/vicuna-7b-v1.5); the full vicuna-7b-v1.5 LLM checkpoint is not required. 4 | In this demo, we use the 'chat' template; you can replace it with others such as 'vqa'. 5 | We strongly suggest using a GPU with bfloat16 support; otherwise, inference will be slow. 6 | Note that only one picture can be processed per conversation, which means you cannot replace or insert another picture during the conversation. 7 | """ 8 | 9 | import argparse 10 | import torch 11 | 12 | from PIL import Image 13 | from transformers import AutoModelForCausalLM, LlamaTokenizer 14 | 15 | parser = argparse.ArgumentParser() 16 | parser.add_argument("--quant", choices=[4], type=int, default=None, help='quantization bits') 17 | parser.add_argument("--from_pretrained", type=str, default="THUDM/cogagent-chat-hf", help='pretrained ckpt') 18 | parser.add_argument("--local_tokenizer", type=str, default="lmsys/vicuna-7b-v1.5", help='tokenizer path') 19 | parser.add_argument("--fp16", action="store_true") 20 | parser.add_argument("--bf16", action="store_true") 21 | 22 | args = parser.parse_args() 23 | MODEL_PATH = args.from_pretrained 24 | TOKENIZER_PATH = args.local_tokenizer 25 | DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu' 26 | 27 | tokenizer = LlamaTokenizer.from_pretrained(TOKENIZER_PATH) 28 | if args.bf16: 29 | torch_type = torch.bfloat16 30 | else: 31 | torch_type = torch.float16 32 | 33 | print("========Use torch type as:{} with device:{}========\n\n".format(torch_type, DEVICE)) 34 | 35 | if args.quant: 36 | model = AutoModelForCausalLM.from_pretrained( 37 | MODEL_PATH, 38 | torch_dtype=torch_type, 39 | low_cpu_mem_usage=True, 40 | load_in_4bit=True, 41 | trust_remote_code=True 42 | ).eval() 43 | else: 44 | model = AutoModelForCausalLM.from_pretrained( 45 | MODEL_PATH, 46 | torch_dtype=torch_type, 47 | low_cpu_mem_usage=True, 48 | load_in_4bit=args.quant is not None, 49 | trust_remote_code=True 50 | ).to(DEVICE).eval() 51 | 52 | text_only_template = "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: {} ASSISTANT:" 53 | 54 | while True: 55 | image_path = input("图片地址 >>>>> ") 56 | if image_path == '': 57 | print('You did not enter image path, the following will be a plain text conversation.') 58 | image = None 59 | text_only_first_query = True 60 | else: 61 | image = Image.open(image_path).convert('RGB') 62 | 63 | history = [] 64 | 65 | while True: 66 | query = input("Human:") 67 | if query == "clear": 68 | break 69 | 70 | if image is None: 71 | if text_only_first_query: 72 | query = text_only_template.format(query) 73 | text_only_first_query = False 74 | else: 75 | old_prompt = '' 76 | for _, (old_query, response) in enumerate(history): 77 | old_prompt += old_query + " " + response + "\n" 78 | query = old_prompt + "用户: {} 小千:".format(query) 79 | 80 | if image is None: 81 | input_by_model = model.build_conversation_input_ids(tokenizer, query=query, history=history, template_version='base') 82 | else: 83 | input_by_model = model.build_conversation_input_ids(tokenizer, query=query, history=history, images=[image]) 84 | 85 | inputs = { 86 | 'input_ids': input_by_model['input_ids'].unsqueeze(0).to(DEVICE), 87 | 'token_type_ids': input_by_model['token_type_ids'].unsqueeze(0).to(DEVICE), 88 | 'attention_mask': input_by_model['attention_mask'].unsqueeze(0).to(DEVICE), 89 | 'images': [[input_by_model['images'][0].to(DEVICE).to(torch_type)]] if image is not None else None, 90 | } 91 | if 'cross_images' in input_by_model and input_by_model['cross_images']: 92 | inputs['cross_images'] = [[input_by_model['cross_images'][0].to(DEVICE).to(torch_type)]] 93 | 94 | # add any transformers params here. 95 | gen_kwargs = {"max_length": 2048, 96 | "do_sample": False} # "temperature": 0.9 97 | with torch.no_grad(): 98 | outputs = model.generate(**inputs, **gen_kwargs) 99 | outputs = outputs[:, inputs['input_ids'].shape[1]:] 100 | response = tokenizer.decode(outputs[0]) 101 | response = response.split("</s>")[0] 102 | print("\nCog:", response) 103 | history.append((query, response)) 104 |
-------------------------------------------------------------------------------- /NLP/CogVLM.md: --------------------------------------------------------------------------------
1 | # CogVLM 2 | 3 | ## basic demo 4 | 5 | ``` 6 | """ 7 | This is a demo for using CogAgent and CogVLM in the CLI. 8 | Make sure you have installed the vicuna-7b-v1.5 tokenizer (https://huggingface.co/lmsys/vicuna-7b-v1.5); the full vicuna-7b-v1.5 LLM checkpoint is not required. 9 | In this demo, we use the 'chat' template; you can replace it with others such as 'vqa'. 10 | We strongly suggest using a GPU with bfloat16 support; otherwise, inference will be slow. 11 | Note that only one picture can be processed per conversation, which means you cannot replace or insert another picture during the conversation.
12 | """ 13 | 14 | import argparse 15 | import torch 16 | 17 | from PIL import Image 18 | from transformers import AutoModelForCausalLM, LlamaTokenizer 19 | 20 | parser = argparse.ArgumentParser() 21 | parser.add_argument("--quant", choices=[4], type=int, default=None, help='quantization bits') 22 | parser.add_argument("--from_pretrained", type=str, default="THUDM/cogagent-chat-hf", help='pretrained ckpt') 23 | parser.add_argument("--local_tokenizer", type=str, default="lmsys/vicuna-7b-v1.5", help='tokenizer path') 24 | parser.add_argument("--fp16", action="store_true") 25 | parser.add_argument("--bf16", action="store_true") 26 | 27 | args = parser.parse_args() 28 | MODEL_PATH = args.from_pretrained 29 | TOKENIZER_PATH = args.local_tokenizer 30 | DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu' 31 | 32 | tokenizer = LlamaTokenizer.from_pretrained(TOKENIZER_PATH) 33 | if args.bf16: 34 | torch_type = torch.bfloat16 35 | else: 36 | torch_type = torch.float16 37 | 38 | print("========Use torch type as:{} with device:{}========\n\n".format(torch_type, DEVICE)) 39 | 40 | if args.quant: 41 | model = AutoModelForCausalLM.from_pretrained( 42 | MODEL_PATH, 43 | torch_dtype=torch_type, 44 | low_cpu_mem_usage=True, 45 | load_in_4bit=True, 46 | trust_remote_code=True 47 | ).eval() 48 | else: 49 | model = AutoModelForCausalLM.from_pretrained( 50 | MODEL_PATH, 51 | torch_dtype=torch_type, 52 | low_cpu_mem_usage=True, 53 | load_in_4bit=args.quant is not None, 54 | trust_remote_code=True 55 | ).to(DEVICE).eval() 56 | 57 | text_only_template = "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: {} ASSISTANT:" 58 | 59 | while True: 60 | image_path = input("image path >>>>> ") 61 | if image_path == '': 62 | print('You did not enter image path, the following will be a plain text conversation.') 63 | image = None 64 | text_only_first_query = True 65 | else: 66 | image = Image.open(image_path).convert('RGB') 67 | 68 | history = [] 69 | 70 | while True: 71 | query = input("Human:") 72 | if query == "clear": 73 | break 74 | 75 | if image is None: 76 | if text_only_first_query: 77 | query = text_only_template.format(query) 78 | text_only_first_query = False 79 | else: 80 | old_prompt = '' 81 | for _, (old_query, response) in enumerate(history): 82 | old_prompt += old_query + " " + response + "\n" 83 | query = old_prompt + "USER: {} ASSISTANT:".format(query) 84 | 85 | if image is None: 86 | input_by_model = model.build_conversation_input_ids(tokenizer, query=query, history=history, template_version='base') 87 | else: 88 | input_by_model = model.build_conversation_input_ids(tokenizer, query=query, history=history, images=[image]) 89 | 90 | inputs = { 91 | 'input_ids': input_by_model['input_ids'].unsqueeze(0).to(DEVICE), 92 | 'token_type_ids': input_by_model['token_type_ids'].unsqueeze(0).to(DEVICE), 93 | 'attention_mask': input_by_model['attention_mask'].unsqueeze(0).to(DEVICE), 94 | 'images': [[input_by_model['images'][0].to(DEVICE).to(torch_type)]] if image is not None else None, 95 | } 96 | if 'cross_images' in input_by_model and input_by_model['cross_images']: 97 | inputs['cross_images'] = [[input_by_model['cross_images'][0].to(DEVICE).to(torch_type)]] 98 | 99 | # add any transformers params here. 
100 | gen_kwargs = {"max_length": 2048, 101 | "do_sample": False} # "temperature": 0.9 102 | with torch.no_grad(): 103 | outputs = model.generate(**inputs, **gen_kwargs) 104 | outputs = outputs[:, inputs['input_ids'].shape[1]:] 105 | response = tokenizer.decode(outputs[0]) 106 | response = response.split("")[0] 107 | print("\nCog:", response) 108 | history.append((query, response)) 109 | ``` 110 | -------------------------------------------------------------------------------- /.gitmodules: -------------------------------------------------------------------------------- 1 | [submodule "Visual-Perception/2D-Perception/WeSAM"] 2 | path = Visual-Perception/2D-Perception/WeSAM 3 | url = git@github.com:zhang-haojie/wesam.git 4 | [submodule "Visual-Perception/3D-Perception/SAM-6D"] 5 | path = Visual-Perception/3D-Perception/SAM-6D 6 | url = git@github.com:JiehongLin/SAM-6D.git 7 | [submodule "Visual-Perception/3D-Perception/VISTA"] 8 | path = Visual-Perception/3D-Perception/VISTA 9 | url = git@github.com:Gorilla-Lab-SCUT/VISTA.git 10 | [submodule "Visual-Perception/3D-Perception/Frustum-ConvNet"] 11 | path = Visual-Perception/3D-Perception/Frustum-ConvNet 12 | url = git@github.com:Gorilla-Lab-SCUT/frustum-convnet.git 13 | [submodule "Visual-Perception/3D-Perception/SSTNet"] 14 | path = Visual-Perception/3D-Perception/SSTNet 15 | url = git@github.com:Gorilla-Lab-SCUT/SSTNet.git 16 | [submodule "Audio/chatbot_SER"] 17 | path = Audio/chatbot_SER 18 | url = https://github.com/qiaoweima/chatbot_SER.git 19 | [submodule "Audio/chatbot_TTS"] 20 | path = Audio/chatbot_TTS 21 | url = https://github.com/qiaoweima/chatbot_TTS.git 22 | [submodule "Audio/Audio-Anti-Spoofing"] 23 | path = Audio/Audio-Anti-Spoofing 24 | url = https://github.com/qiaoweima/Audio-Anti-Spoofing.git 25 | [submodule "Audio/chatbot_ASR"] 26 | path = Audio/chatbot_ASR 27 | url = https://github.com/qiaoweima/chatbot_ASR.git 28 | [submodule "Audio/Blizzard_Challenge"] 29 | path = Audio/Blizzard_Challenge 30 | url = https://github.com/qiaoweima/Blizzard_Challenge.git 31 | [submodule "Visual-Perception/2D-Perception/DOQ"] 32 | path = Visual-Perception/2D-Perception/DOQ 33 | url = https://github.com/SherlockHolmes221/DOQ.git 34 | [submodule "Visual-Perception/2D-Perception/HQM"] 35 | path = Visual-Perception/2D-Perception/HQM 36 | url = https://github.com/MuchHair/HQM.git 37 | [submodule "Visual-Perception/2D-Perception/PD-Net"] 38 | path = Visual-Perception/2D-Perception/PD-Net 39 | url = https://github.com/MuchHair/PD-Net.git 40 | [submodule "Visual-Perception/2D-Perception/GGNet"] 41 | path = Visual-Perception/2D-Perception/GGNet 42 | url = https://github.com/SherlockHolmes221/GGNet.git 43 | [submodule "Visual-Perception/2D-Perception/HRNets"] 44 | path = Visual-Perception/2D-Perception/HRNets 45 | url = https://github.com/HRNet/HRNet-Image-Classification.git 46 | [submodule "Visual-Perception/2D-Perception/ISCLNet"] 47 | path = Visual-Perception/2D-Perception/ISCLNet 48 | url = https://github.com/lphxx6222712/ISCLNet.git 49 | [submodule "Visual-Perception/2D-Perception/SM-H"] 50 | path = Visual-Perception/2D-Perception/SM-H 51 | url = https://github.com/Dshaoshuai/Partitioning-stateful-data-stream-applications-in-dynamic-edge-cloud-environments.git 52 | [submodule "Visual-Perception/2D-Perception/CBFL"] 53 | path = Visual-Perception/2D-Perception/CBFL 54 | url = https://github.com/lizhipengcs/CBFL.git 55 | [submodule "Visual-Perception/2D-Perception/DRN-for-Single-Image-Super-Resolution"] 56 | path = 
Visual-Perception/2D-Perception/DRN-for-Single-Image-Super-Resolution 57 | url = https://github.com/guoyongcs/DRN.git 58 | [submodule "Visual-Perception/2D-Perception/DRN-for-Video-Grounding"] 59 | path = Visual-Perception/2D-Perception/DRN-for-Video-Grounding 60 | url = https://github.com/Alvin-Zeng/DRN.git 61 | [submodule "Visual-Perception/2D-Perception/PGCN"] 62 | path = Visual-Perception/2D-Perception/PGCN 63 | url = https://github.com/Alvin-Zeng/PGCN.git 64 | [submodule "Visual-Perception/2D-Perception/NAT"] 65 | path = Visual-Perception/2D-Perception/NAT 66 | url = https://github.com/guoyongcs/NAT.git 67 | [submodule "Visual-Perception/2D-Perception/EATA"] 68 | path = Visual-Perception/2D-Perception/EATA 69 | url = https://github.com/mr-eggplant/EATA.git 70 | [submodule "Visual-Perception/2D-Perception/CNAS"] 71 | path = Visual-Perception/2D-Perception/CNAS 72 | url = https://github.com/guoyongcs/CNAS.git 73 | [submodule "Visual-Perception/2D-Perception/CTNAS"] 74 | path = Visual-Perception/2D-Perception/CTNAS 75 | url = https://github.com/chenyaofo/CTNAS.git 76 | [submodule "Visual-Perception/2D-Perception/SymNets"] 77 | path = Visual-Perception/2D-Perception/SymNets 78 | url = https://github.com/Gorilla-Lab-SCUT/SymNets.git 79 | [submodule "Visual-Perception/2D-Perception/RSPNet"] 80 | path = Visual-Perception/2D-Perception/RSPNet 81 | url = https://github.com/PeihaoChen/RSPNet.git 82 | [submodule "Visual-Perception/2D-Perception/CPGA"] 83 | path = Visual-Perception/2D-Perception/CPGA 84 | url = https://github.com/SCUT-AILab/CPGA.git 85 | [submodule "Visual-Perception/2D-Perception/SGE-LA"] 86 | path = Visual-Perception/2D-Perception/SGE-LA 87 | url = https://github.com/Kali-Hac/SGE-LA.git 88 | [submodule "Visual-Perception/2D-Perception/SAR"] 89 | path = Visual-Perception/2D-Perception/SAR 90 | url = https://github.com/mr-eggplant/SAR.git 91 | [submodule "Visual-Perception/2D-Perception/EPS-AD"] 92 | path = Visual-Perception/2D-Perception/EPS-AD 93 | url = https://github.com/ZSHsh98/EPS-AD.git 94 | [submodule "Visual-Perception/2D-Perception/MME"] 95 | path = Visual-Perception/2D-Perception/MME 96 | url = https://github.com/XinyuSun/MME.git 97 | [submodule "Visual-Perception/2D-Perception/DENet"] 98 | path = Visual-Perception/2D-Perception/DENet 99 | url = https://github.com/lizhaoliu-Lec/DENet.git 100 | [submodule "Visual-Perception/2D-Perception/DAS"] 101 | path = Visual-Perception/2D-Perception/DAS 102 | url = https://github.com/lizhaoliu-Lec/DAS.git 103 | [submodule "Visual-Perception/2D-Perception/ProCA"] 104 | path = Visual-Perception/2D-Perception/ProCA 105 | url = https://github.com/Hongbin98/ProCA.git 106 | [submodule "Visual-Perception/2D-Perception/AGNet"] 107 | path = Visual-Perception/2D-Perception/AGNet 108 | url = https://github.com/HzFu/AGNet.git 109 | [submodule "Visual-Perception/2D-Perception/BPAI-Net"] 110 | path = Visual-Perception/2D-Perception/BPAI-Net 111 | url = https://github.com/SCUT-AILab/BPAI-Net.git 112 | [submodule "Visual-Perception/2D-Perception/LCCGAN-v2"] 113 | path = Visual-Perception/2D-Perception/LCCGAN-v2 114 | url = https://github.com/SCUTjinchengli/LCCGAN-v2.git 115 | [submodule "Visual-Perception/2D-Perception/CoUDA"] 116 | path = Visual-Perception/2D-Perception/CoUDA 117 | url = https://github.com/Vanint/CoUDA.git 118 | [submodule "Visual-Perception/2D-Perception/CAGEs"] 119 | path = Visual-Perception/2D-Perception/CAGEs 120 | url = https://github.com/Kali-Hac/Locality-Awareness-SGE.git 121 | [submodule 
"Visual-Perception/2D-Perception/CorrReg"] 122 | path = Visual-Perception/2D-Perception/CorrReg 123 | url = https://github.com/JiehongLin/CorrReg.git 124 | [submodule "Visual-Perception/2D-Perception/SkeletonNet"] 125 | path = Visual-Perception/2D-Perception/SkeletonNet 126 | url = https://github.com/Gorilla-Lab-SCUT/SkeletonNet.git 127 | [submodule "Visual-Perception/2D-Perception/PMF"] 128 | path = Visual-Perception/2D-Perception/PMF 129 | url = https://github.com/ICEORY/PMF.git 130 | [submodule "Visual-Perception/2D-Perception/CPEM"] 131 | path = Visual-Perception/2D-Perception/CPEM 132 | url = https://github.com/deepmo24/CPEM.git 133 | [submodule "Visual-Perception/2D-Perception/CR-NeRF"] 134 | path = Visual-Perception/2D-Perception/CR-NeRF 135 | url = https://github.com/YifYang993/CR-NeRF-PyTorch.git 136 | [submodule "Visual-Perception/2D-Perception/CPCM"] 137 | path = Visual-Perception/2D-Perception/CPCM 138 | url = https://github.com/lizhaoliu-Lec/CPCM.git 139 | [submodule "Visual-Perception/2D-Perception/SSTNet"] 140 | path = Visual-Perception/2D-Perception/SSTNet 141 | url = https://github.com/Gorilla-Lab-SCUT/SSTNet.git 142 | [submodule "Visual-Perception/2D-Perception/RegNet"] 143 | path = Visual-Perception/2D-Perception/RegNet 144 | url = https://github.com/PeihaoChen/regnet.git 145 | [submodule "Visual-Perception/2D-Perception/DRAW"] 146 | path = Visual-Perception/2D-Perception/DRAW 147 | url = https://github.com/menggehe/DRAW.git 148 | [submodule "NLP/DRAW"] 149 | path = NLP/DRAW 150 | url = https://github.com/menggehe/DRAW.git 151 | [submodule "Audio/RegNet"] 152 | path = Audio/RegNet 153 | url = https://github.com/PeihaoChen/regnet.git 154 | [submodule "Visual-Perception/3D-Perception/CPCM"] 155 | path = Visual-Perception/3D-Perception/CPCM 156 | url = https://github.com/lizhaoliu-Lec/CPCM.git 157 | [submodule "Visual-Perception/3D-Perception/CR-NeRF"] 158 | path = Visual-Perception/3D-Perception/CR-NeRF 159 | url = https://github.com/YifYang993/CR-NeRF-PyTorch.git 160 | [submodule "Visual-Perception/3D-Perception/CPEM"] 161 | path = Visual-Perception/3D-Perception/CPEM 162 | url = https://github.com/deepmo24/CPEM.git 163 | [submodule "Visual-Perception/3D-Perception/PMF"] 164 | path = Visual-Perception/3D-Perception/PMF 165 | url = https://github.com/ICEORY/PMF.git 166 | [submodule "Visual-Perception/3D-Perception/SkeletonNet"] 167 | path = Visual-Perception/3D-Perception/SkeletonNet 168 | url = https://github.com/Gorilla-Lab-SCUT/SkeletonNet.git 169 | [submodule "Visual-Perception/3D-Perception/CorrReg"] 170 | path = Visual-Perception/3D-Perception/CorrReg 171 | url = https://github.com/JiehongLin/CorrReg.git 172 | [submodule "Visual-Perception/3D-Perception/CAGEs"] 173 | path = Visual-Perception/3D-Perception/CAGEs 174 | url = https://github.com/Kali-Hac/Locality-Awareness-SGE.git 175 | [submodule "Multi-Modal/TDS"] 176 | path = Multi-Modal/TDS 177 | url = https://github.com/Zhiquan-Wen/TDS.git 178 | [submodule "Multi-Modal/D-VQA"] 179 | path = Multi-Modal/D-VQA 180 | url = https://github.com/Zhiquan-Wen/D-VQA.git 181 | [submodule "Multi-Modal/HPGM"] 182 | path = Multi-Modal/HPGM 183 | url = https://github.com/chenqi008/HPGM.git 184 | [submodule "Multi-Modal/CMRAN"] 185 | path = Multi-Modal/CMRAN 186 | url = https://github.com/FloretCat/CMRAN.git 187 | [submodule "Multi-Modal/CRN_tvqa"] 188 | path = Multi-Modal/CRN_tvqa 189 | url = https://github.com/guanghuixu/CRN_tvqa.git 190 | [submodule "Multi-Modal/LaBERT"] 191 | path = Multi-Modal/LaBERT 192 | url = 
https://github.com/bearcatt/LaBERT.git 193 | [submodule "Multi-Modal/V2C"] 194 | path = Multi-Modal/V2C 195 | url = https://github.com/chenqi008/V2C.git 196 | [submodule "Robotic/ActiveCamera"] 197 | path = Robotic/ActiveCamera 198 | url = https://github.com/PeihaoChen/ActiveCamera.git 199 | [submodule "Robotic/WS-MGMap"] 200 | path = Robotic/WS-MGMap 201 | url = https://github.com/PeihaoChen/WS-MGMap.git 202 | [submodule "Robotic/YouTube-VLN"] 203 | path = Robotic/YouTube-VLN 204 | url = https://github.com/JeremyLinky/YouTube-VLN.git 205 | [submodule "NLP/CogVLM"] 206 | path = NLP/CogVLM 207 | url = git@github.com:THUDM/CogVLM.git 208 | [submodule "NLP/Qwen"] 209 | path = NLP/Qwen 210 | url = git@github.com:QwenLM/Qwen.git 211 | [submodule "Visual-Perception/2D-Perception/TTAC"] 212 | path = Visual-Perception/2D-Perception/TTAC 213 | url = https://github.com/Gorilla-Lab-SCUT/TTAC 214 | [submodule "Visual-Perception/2D-Perception/TTAC2"] 215 | path = Visual-Perception/2D-Perception/TTAC2 216 | url = https://github.com/Gorilla-Lab-SCUT/TTAC2 217 | [submodule "Visual-Perception/2D-Perception/TRIBE"] 218 | path = Visual-Perception/2D-Perception/TRIBE 219 | url = https://github.com/Gorilla-Lab-SCUT/TRIBE 220 | [submodule "Audio/SSL-PVAD"] 221 | path = Audio/SSL-PVAD 222 | url = https://github.com/HolgerBovbjerg/SSL-PVAD.git 223 | [submodule "Visual-Perception/3D-Perception/HelixSurf"] 224 | path = Visual-Perception/3D-Perception/HelixSurf 225 | url = https://github.com/Gorilla-Lab-SCUT/HelixSurf 226 | [submodule "Visual-Perception/3D-Perception/QS3"] 227 | path = Visual-Perception/3D-Perception/QS3 228 | url = https://github.com/gorilla-lab-scut/qs3 229 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 |
2 | 3 | 4 | 视听融合感知智能引擎平台
5 | 6 | 9 | 10 | 11 | [![Project Page](https://img.shields.io/badge/Project-Page-F9AB00?style=for-the-badge)](http://183.63.152.178:6710/#/login)   12 | [![License](https://img.shields.io/badge/LICENSE-MIT-green.svg?style=for-the-badge)](https://github.com/Gorilla-Lab-SCUT/Visual-Auditory-Fusion-Perception/blob/main/LICENSE)   13 | [![Demo Website](https://img.shields.io/badge/Demo-Website-yellow.svg?style=for-the-badge)](http://183.63.152.178:6710/#/engine-platform/visual-semantics)   14 | 15 | 16 | 📕 中文版 README | 📗 [English README](./README_en.md) 17 | 18 |
19 | 20 | ### 📻 安装指南 21 | 22 | 在使用我们的模型之前,您需要先确保环境中已安装所有必要的依赖项。这些依赖项涵盖了模型运行所需的各类库和工具,确保您可以顺利进行模型推理。 23 | 24 | 请按照以下步骤进行安装: 25 | 26 | 1. **打开终端或命令提示符**:根据您的操作系统,打开相应的命令行界面。 27 | 2. **使用pip安装依赖项**:输入以下命令,通过pip安装所需的Python包和库。 28 | 29 | ```bash 30 | pip install -r requirements.txt 31 | ``` 32 | 33 | 34 | ### 🚀 推理指南 35 | 36 | 安装完所有必要的依赖项后,您就可以开始使用我们的模型进行推理了。我们提供了两种推理方式:使用终端进行推理和使用交互式推理。 37 | 38 | 这里我们以示例图片`asserts/demo.jpg`为例进行说明: 39 | 40 | 41 | 42 | #### 1. 使用终端进行推理 43 | 44 | 如果您希望直接在终端中运行推理脚本,可以使用以下命令: 45 | 46 | ```bash 47 | python chatme.py --image asserts/demo.jpg --question "货架上有几个苹果?" 48 | ``` 49 | 50 | 此命令会加载预训练的模型,并使用提供的图片(`demo.jpg`)和问题(`"货架上有几个苹果?"`)进行推理。 51 | 52 | 模型会分析图片并尝试回答提出的问题,推理结果将以文本形式输出到终端中,例如: 53 | 54 | ``` 55 | 小千:货架上有三个苹果。 56 | ``` 57 | 58 | #### 2. 使用交互式推理 59 | 60 | 除了使用终端进行推理,您还可以使用交互式推理功能与大模型进行实时交互。要启动交互式终端,请运行以下命令: 61 | 62 | ```bash 63 | python main.py 64 | ``` 65 | 66 | 此命令会启动一个交互式终端,等待您输入图片地址。您可以在终端中输入图片地址(例如`asserts/demo.jpg`),然后按下回车键。 67 | 68 | 模型会根据您提供的图片进行推理,并等待您输入问题。 69 | 70 | 一旦您输入了问题(例如`"货架上有几个苹果?"`),模型就会分析图片并尝试回答,推理结果将以文本形式输出到终端中,例如: 71 | 72 | ```bash 73 | 图片地址 >>>>> asserts/demo.jpg 74 | 用户:货架上有几个苹果? 75 | 小千:货架上有三个苹果。 76 | ``` 77 | 78 | 通过这种方式,您可以轻松地与模型进行交互,并向其提出各种问题。 79 | 80 | 81 | ### 🧾 References 82 | 83 | #### 📈 Benchmark #### 84 | 85 | - [AGE Challenge Dataset](https://age.grand-challenge.org) 86 | 87 | - [COVID-DA Dataset](https://drive.google.com/file/d/1w2brbYLn1s1hvmLkKKsBsm1mCbz4F512/view?usp=sharing) 88 | 89 | - [Visually Aligned Sound (VAS) Dataset](https://drive.google.com/file/d/14birixmH7vwIWKxCHI0MIWCcZyohF59g/view?usp=sharing) 90 | 91 | #### 📷 Visual Perception 92 | 93 | - [2D Perception](Visual-Perception/2D-Perception/) 94 | 95 | - [Improving the Generalization of Segmentation Foundation Model under Distribution Shift via Weakly Supervised Adaptation](https://github.com/zhang-haojie/wesam) 96 | 97 | - [Revisiting Realistic Test-Time Training: Sequential Inference and Adaptation by Anchored Clustering Regularizad Self-Training](https://github.com/Gorilla-Lab-SCUT/TTAC2) 98 | 99 | - [Towards Real-World Test-Time Adaptation: Tri-Net Self-Training with Balanced Normalization](https://github.com/Gorilla-Lab-SCUT/TRIBE) 100 | 101 | - [Revisiting Realistic Test-Time Training: Sequential Inference and Adaptation by Anchored Clustering](https://github.com/Gorilla-Lab-SCUT/TTAC) 102 | 103 | - [Distillation Using Oracle Queries for Transformer-based Human-Object Interaction Detection](https://github.com/SherlockHolmes221/DOQ) 104 | 105 | - [High-resolution networks (HRNets) for Image classification](https://github.com/HRNet/HRNet-Image-Classification) 106 | 107 | 108 | 109 | - [Intra- and Inter-Slice Contrastive Learning for Point Supervised OCT Fluid Segmentation](https://github.com/lphxx6222712/ISCLNet) 110 | 111 | - [Partitioning Stateful Data Stream Applications in Dynamic Edge Cloud Environments](https://github.com/Dshaoshuai/Partitioning-stateful-data-stream-applications-in-dynamic-edge-cloud-environments) 112 | 113 | 114 | 115 | 116 | 117 | 118 | 119 | - [Closed-loop Matters: Dual Regression Networks for Single Image Super-Resolution](https://github.com/guoyongcs/DRN) 120 | 121 | - [Dense Regression Network for Video Grounding](https://github.com/Alvin-Zeng/DRN) 122 | 123 | - [Graph Convolutional Networks for Temporal Action Localization](https://github.com/Alvin-Zeng/PGCN) 124 | 125 | - [NAT: Neural Architecture Transformer for Accurate and Compact Architectures](https://github.com/guoyongcs/NAT) 126 | 127 | 128 | 129 | - [Efficient Test-Time 
Model Adaptation without Forgetting](https://github.com/mr-eggplant/EATA) 130 | 131 | - [Breaking the Curse of Space Explosion: Towards Effcient NAS with Curriculum Search](https://github.com/guoyongcs/CNAS) 132 | 133 | - [Contrastive Neural Architecture Search with Neural Architecture Comparators](https://github.com/chenyaofo/CTNAS) 134 | 135 | - [Domain-Symnetric Networks for Adversarial Domain Adaptation](https://github.com/Gorilla-Lab-SCUT/SymNets) 136 | 137 | - [RSPNet: Relative Speed Perception for Unsupervised Video Representation Learning](https://github.com/PeihaoChen/RSPNet) 138 | 139 | 140 | 141 | - [Source-free Domain Adaptation via Avatar Prototype Generation and Adaptation](https://github.com/SCUT-AILab/CPGA) 142 | 143 | - [Self-Supervised Gait Encoding with Locality-Aware Attention for Person Re-Identification](https://github.com/Kali-Hac/SGE-LA) 144 | 145 | - [Towards Stable Test-Time Adaptation in Dynamic Wild World](https://github.com/mr-eggplant/SAR) 146 | 147 | - [Detecting Adversarial Data by Probing Multiple Perturbations Using Expected Perturbation Score](https://github.com/ZSHsh98/EPS-AD) 148 | 149 | - [Masked Motion Encoding for Self-Supervised Video Representation Learning](https://github.com/XinyuSun/MME) 150 | 151 | 152 | 153 | - [Source-free Domain Adaptation via Avatar Prototype Generation and Adaptation](https://github.com/SCUT-AILab/CPGA) 154 | 155 | - [Dynamic Extension Nets for Few-shot Semantic Segmentation](https://github.com/lizhaoliu-Lec/DENet) 156 | 157 | - [Densely-Anchored Sampling for Deep Metric Learning](https://github.com/lizhaoliu-Lec/DAS) 158 | 159 | - [Prototype-Guided Continual Adaptation for Class-Incremental Unsupervised Domain Adaptation](https://github.com/Hongbin98/ProCA) 160 | 161 | 162 | 163 | 164 | 165 | - [Attention Guided Network for Retinal Image Segmentation](https://github.com/HzFu/AGNet) 166 | 167 | 168 | 169 | 170 | 171 | 172 | 173 | 174 | 175 | 176 | 177 | - [Towards Hard-Positive Query Mining for DETR-based Human-Object Interaction Detection/Object Detection](https://github.com/MuchHair/HQM.git) 178 | 179 | - [Glance and Gaze: Inferring Action-aware Points for One-Stage Human-Object Interaction Detection](https://github.com/SherlockHolmes221/GGNet.git) 180 | 181 | - [Polysemy Deciphering Network for Human-Object Interaction Detection](https://github.com/MuchHair/PD-Net.git) 182 | 183 | - [Bidirectional Posture-Appearance Interaction Network for Driver Behavior Recognition](https://github.com/SCUT-AILab/BPAI-Net) 184 | 185 | - [Improving Generative Adversarial Networks with Local Coordinate Coding](https://github.com/SCUTjinchengli/LCCGAN-v2) 186 | 187 | 188 | 189 | 190 | 191 | - [Collaborative Unsupervised Domain Adaptation for Medical Image Diagnosis](https://github.com/Vanint/CoUDA) 192 | 193 | 194 | 195 | 196 | 197 | 198 | 199 | 200 | 201 | 202 | 203 | 204 | - [3D Perception](Visual-Perception/3D-Perception/) 205 | 206 | - [SAM-6D: Segment Anything Model Meets Zero-Shot 6D Object Pose Estimation](https://github.com/JiehongLin/SAM-6D) 207 | 208 | - [Frustum ConvNet: Sliding Frustums to Aggregate Local Point-Wise Features for Amodal 3D Object Detection](https://github.com/Gorilla-Lab-SCUT/frustum-convnet) 209 | 210 | - [Instance Segmentation in 3D Scenes using Semantic Superpoint Tree Networks](https://github.com/Gorilla-Lab-SCUT/SSTNet) 211 | 212 | - [VISTA: Boosting 3D Object Detection via Dual Cross-VIew SpaTial Attention](https://github.com/Gorilla-Lab-SCUT/VISTA) 213 | 214 | - [A Self-Supervised Gait Encoding Approach 
with Locality-Awareness for 3D Skeleton Based Person Re-Identification](https://github.com/Kali-Hac/Locality-Awareness-SGE) 215 | 216 | - [Deep Multi-View Learning Using Neuron-Wise Correlation-Maximizing Regularizers](https://github.com/JiehongLin/CorrReg) 217 | 218 | - [A Skeleton-Bridged Deep Learning Approach for Generating Meshes of Complex Topologies From Single RGB Images](https://github.com/Gorilla-Lab-SCUT/SkeletonNet) 219 | 220 | - [Perception-Aware Multi-Sensor Fusion for 3D LiDAR Semantic Segmentation](https://github.com/ICEORY/PMF) 221 | 222 | - [CPEM: Consistent Parameter Estimation Model](https://github.com/deepmo24/CPEM) 223 | 224 | - [CR-NeRF: Cross-Ray Neural Radiance Fields for Novel-view Synthesis from Unconstrained Image Collections](https://github.com/YifYang993/CR-NeRF-PyTorch) 225 | 226 | - [Contextual Point Cloud Modeling for Weakly-supervised Point Cloud Semantic Segmentation](https://github.com/lizhaoliu-Lec/CPCM) 227 | 228 | - [Quasi-Balanced Self-Training on Noise-Aware Synthesis of Object Point Clouds for Closing Domain Gap](https://github.com/gorilla-lab-scut/qs3) 229 | 230 | - [HelixSurf: A Robust and Efficient Neural Implicit Surface Learning of Indoor Scenes with Iterative Intertwined Regularization](https://github.com/Gorilla-Lab-SCUT/HelixSurf) 231 | 232 | 233 | #### 🎧 Audio 234 | 235 | - [Automatic Speech Recognition](https://github.com/qiaoweima/chatbot_ASR) 236 | 237 | - [Dialogue System](https://github.com/qiaoweima/chatbot_SER) 238 | 239 | - [Text To Speech](https://github.com/qiaoweima/chatbot_TTS.git) 240 | 241 | - [Audio Anti-spoofing](https://github.com/qiaoweima/Audio-Anti-Spoofing/tree/main) 242 | 243 | - [Blizzard_Challenge](https://github.com/qiaoweima/Blizzard_Challenge) 244 | 245 | - [Voice Activity Detection](https://github.com/HolgerBovbjerg/SSL-PVAD) 246 | 247 | - [RegNet](https://github.com/PeihaoChen/regnet) 248 | 249 | #### 💬 NLP 250 | 251 | - [How to Train Your Agent to Read and Write](https://github.com/menggehe/DRAW) 252 | 253 | - [CogVLM](https://github.com/THUDM/CogVLM.git) 254 | 255 | - [Qwen](https://github.com/QwenLM/Qwen.git) 256 | 257 | #### 🔮 Multi-Modal 258 | 259 | - [Test-Time Model Adaptation for Visual Question Answering with Debiased Self-Supervisions](https://github.com/Zhiquan-Wen/TDS) 260 | 261 | - [Debiased Visual Question Answering from Feature and Sample Perspectives](https://github.com/Zhiquan-Wen/D-VQA) 262 | 263 | - [Intelligent Home 3D: Automatic 3D-House Design from Linguistic Descriptions Only](https://github.com/chenqi008/HPGM) 264 | 265 | - [Cross-Modal Relation-Aware Networks for Audio-Visual Event Localization](https://github.com/FloretCat/CMRAN) 266 | 267 | - [Cascade Reasoning Network for Text-based Visual Question Answering](https://github.com/guanghuixu/CRN_tvqa) 268 | 269 | - [Length-Controllable Image Captioning](https://github.com/bearcatt/LaBERT) 270 | 271 | - [V2C: Visual Voice Cloning](https://github.com/chenqi008/V2C) 272 | 273 | #### 🤖 Robotic 274 | 275 | - [Learning Active Camera for Multi-Object Navigation](https://github.com/PeihaoChen/ActiveCamera) 276 | 277 | - [Weakly-Supervised Multi-Granularity Map Learning for Vision-and-Language Navigation](https://github.com/PeihaoChen/WS-MGMap) 278 | 279 | - [Learning Vision-and-Language Navigation from YouTube Videos](https://github.com/JeremyLinky/YouTube-VLN) 280 | 281 | 288 | -------------------------------------------------------------------------------- /README_en.md: 
-------------------------------------------------------------------------------- 1 |
2 | 3 | 4 | Visual-Auditory Fusion Perception AI Platform
5 | 6 | 7 | [![Project Page](https://img.shields.io/badge/Project-Page-F9AB00?style=for-the-badge)](http://183.63.152.178:6710/#/login)   8 | [![License](https://img.shields.io/badge/LICENSE-MIT-green.svg?style=for-the-badge)](https://github.com/Gorilla-Lab-SCUT/Visual-Auditory-Fusion-Perception/blob/main/LICENSE)   9 | [![Demo Website](https://img.shields.io/badge/Demo-Website-yellow.svg?style=for-the-badge)](http://183.63.152.178:6710/#/engine-platform/visual-semantics)   10 | 11 | [📕 中文版 README](./README.md) | 📗 English README 12 | 13 |
14 | 15 | ### 📻 Installation Guide 16 | 17 | Before using our model, you need to ensure that all necessary dependencies are installed in your environment. These dependencies cover various libraries and tools required for the model's operation, ensuring a smooth inference process. 18 | 19 | Please follow these steps for the installation: 20 | 21 | 1. **Open the Terminal or Command Prompt**: Depending on your operating system, open the corresponding command-line interface. 22 | 2. **Install dependencies using pip**: Enter the following command to install the required Python packages and libraries using pip. 23 | 24 | ```bash 25 | pip install -r requirements.txt 26 | ``` 27 | 28 | 29 | ### 🚀 Inference Guide 30 | 31 | After installing all the necessary dependencies, you can start using our model for inference. We provide two ways of performing inference: using the terminal and using the interactive inference. 32 | 33 | Here, we will use the example image `asserts/demo.jpg` for illustration: 34 | 35 | 36 | 37 | #### 1. Inference using the Terminal 38 | 39 | If you want to directly run the inference script in the terminal, you can use the following command: 40 | 41 | ```bash 42 | python chatme.py --image asserts/demo.jpg --question "How many apples are there on the shelf?" 43 | ``` 44 | 45 | This command will load the pre-trained model and perform inference using the provided image (`demo.jpg`) and question (`"How many apples are there on the shelf?"`). 46 | 47 | The model will analyze the image and attempt to answer the question. The inference result will be output to the terminal in text form, for example: 48 | 49 | ``` 50 | Xiaochuan: There are three apples on the shelf. 51 | ``` 52 | 53 | #### 2. Interactive Inference 54 | 55 | In addition to using the terminal for inference, you can also use the interactive inference feature to interact with the large model in real-time. To start the interactive terminal, run the following command: 56 | 57 | ```bash 58 | python main.py 59 | ``` 60 | 61 | This command will launch an interactive terminal that waits for you to enter the image path. You can type the image path (e.g., `asserts/demo.jpg`) in the terminal and press Enter. 62 | 63 | The model will perform inference based on the provided image and wait for you to enter a question. 64 | 65 | Once you enter a question (e.g., `"How many apples are there on the shelf?"`), the model will analyze the image and attempt to answer it. The inference result will be output to the terminal in text form, for example: 66 | 67 | ```bash 68 | Image Path >>>>> asserts/demo.jpg 69 | User: How many apples are there on the shelf? 70 | Xiaochuan: There are three apples on the shelf. 71 | ``` 72 | 73 | Using this approach, you can easily interact with the model and ask it various questions. 
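
If you prefer to call the model from your own Python code instead of going through `chatme.py` or `main.py`, the sketch below shows a minimal single-image, single-turn query. It simply mirrors the logic of the bundled scripts (the default checkpoints, `build_conversation_input_ids`, and the generation settings are taken from them) and assumes a CUDA GPU with bfloat16 support; adjust the device and dtype for your hardware.

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, LlamaTokenizer

# Same default checkpoints as chatme.py / main.py
tokenizer = LlamaTokenizer.from_pretrained("lmsys/vicuna-7b-v1.5")
model = AutoModelForCausalLM.from_pretrained(
    "THUDM/cogagent-chat-hf",
    torch_dtype=torch.bfloat16,  # use torch.float16 if bf16 is not supported
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).to("cuda").eval()

image = Image.open("asserts/demo.jpg").convert("RGB")
query = "How many apples are there on the shelf?"

# Build inputs for a single-turn conversation with one image
input_by_model = model.build_conversation_input_ids(
    tokenizer, query=query, history=[], images=[image]
)
inputs = {
    "input_ids": input_by_model["input_ids"].unsqueeze(0).to("cuda"),
    "token_type_ids": input_by_model["token_type_ids"].unsqueeze(0).to("cuda"),
    "attention_mask": input_by_model["attention_mask"].unsqueeze(0).to("cuda"),
    "images": [[input_by_model["images"][0].to("cuda").to(torch.bfloat16)]],
}
if input_by_model.get("cross_images"):  # provided by CogAgent checkpoints
    inputs["cross_images"] = [[input_by_model["cross_images"][0].to("cuda").to(torch.bfloat16)]]

with torch.no_grad():
    outputs = model.generate(**inputs, max_length=2048, do_sample=False)
    outputs = outputs[:, inputs["input_ids"].shape[1]:]

print(tokenizer.decode(outputs[0]).split("</s>")[0])
```

For a multi-turn conversation, keep a `history` list of `(query, response)` pairs and pass it to `build_conversation_input_ids` on each turn, as `main.py` does.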
74 | 75 | 76 | ### 🧾 References 77 | 78 | #### 📈 Benchmark #### 79 | 80 | - [AGE Challenge Dataset](https://age.grand-challenge.org) 81 | 82 | - [COVID-DA Dataset](https://drive.google.com/file/d/1w2brbYLn1s1hvmLkKKsBsm1mCbz4F512/view?usp=sharing) 83 | 84 | - [Visually Aligned Sound (VAS) Dataset](https://drive.google.com/file/d/14birixmH7vwIWKxCHI0MIWCcZyohF59g/view?usp=sharing) 85 | 86 | #### 📷 Visual Perception 87 | 88 | - [2D Perception](Visual-Perception/2D-Perception/) 89 | 90 | - [Improving the Generalization of Segmentation Foundation Model under Distribution Shift via Weakly Supervised Adaptation](https://github.com/zhang-haojie/wesam) 91 | 92 | - [Revisiting Realistic Test-Time Training: Sequential Inference and Adaptation by Anchored Clustering Regularizad Self-Training](https://github.com/Gorilla-Lab-SCUT/TTAC2) 93 | 94 | - [Towards Real-World Test-Time Adaptation: Tri-Net Self-Training with Balanced Normalization](https://github.com/Gorilla-Lab-SCUT/TRIBE) 95 | 96 | - [Revisiting Realistic Test-Time Training: Sequential Inference and Adaptation by Anchored Clustering](https://github.com/Gorilla-Lab-SCUT/TTAC) 97 | 98 | - [Distillation Using Oracle Queries for Transformer-based Human-Object Interaction Detection](https://github.com/SherlockHolmes221/DOQ) 99 | 100 | - [High-resolution networks (HRNets) for Image classification](https://github.com/HRNet/HRNet-Image-Classification) 101 | 102 | 103 | 104 | - [Intra- and Inter-Slice Contrastive Learning for Point Supervised OCT Fluid Segmentation](https://github.com/lphxx6222712/ISCLNet) 105 | 106 | - [Partitioning Stateful Data Stream Applications in Dynamic Edge Cloud Environments](https://github.com/Dshaoshuai/Partitioning-stateful-data-stream-applications-in-dynamic-edge-cloud-environments) 107 | 108 | 109 | 110 | 111 | 112 | 113 | 114 | - [Closed-loop Matters: Dual Regression Networks for Single Image Super-Resolution](https://github.com/guoyongcs/DRN) 115 | 116 | - [Dense Regression Network for Video Grounding](https://github.com/Alvin-Zeng/DRN) 117 | 118 | - [Graph Convolutional Networks for Temporal Action Localization](https://github.com/Alvin-Zeng/PGCN) 119 | 120 | - [NAT: Neural Architecture Transformer for Accurate and Compact Architectures](https://github.com/guoyongcs/NAT) 121 | 122 | 123 | 124 | - [Efficient Test-Time Model Adaptation without Forgetting](https://github.com/mr-eggplant/EATA) 125 | 126 | - [Breaking the Curse of Space Explosion: Towards Effcient NAS with Curriculum Search](https://github.com/guoyongcs/CNAS) 127 | 128 | - [Contrastive Neural Architecture Search with Neural Architecture Comparators](https://github.com/chenyaofo/CTNAS) 129 | 130 | - [Domain-Symnetric Networks for Adversarial Domain Adaptation](https://github.com/Gorilla-Lab-SCUT/SymNets) 131 | 132 | - [RSPNet: Relative Speed Perception for Unsupervised Video Representation Learning](https://github.com/PeihaoChen/RSPNet) 133 | 134 | 135 | 136 | - [Source-free Domain Adaptation via Avatar Prototype Generation and Adaptation](https://github.com/SCUT-AILab/CPGA) 137 | 138 | - [Self-Supervised Gait Encoding with Locality-Aware Attention for Person Re-Identification](https://github.com/Kali-Hac/SGE-LA) 139 | 140 | - [Towards Stable Test-Time Adaptation in Dynamic Wild World](https://github.com/mr-eggplant/SAR) 141 | 142 | - [Detecting Adversarial Data by Probing Multiple Perturbations Using Expected Perturbation Score](https://github.com/ZSHsh98/EPS-AD) 143 | 144 | - [Masked Motion Encoding for Self-Supervised Video Representation 
Learning](https://github.com/XinyuSun/MME) 145 | 146 | 147 | 148 | - [Source-free Domain Adaptation via Avatar Prototype Generation and Adaptation](https://github.com/SCUT-AILab/CPGA) 149 | 150 | - [Dynamic Extension Nets for Few-shot Semantic Segmentation](https://github.com/lizhaoliu-Lec/DENet) 151 | 152 | - [Densely-Anchored Sampling for Deep Metric Learning](https://github.com/lizhaoliu-Lec/DAS) 153 | 154 | - [Prototype-Guided Continual Adaptation for Class-Incremental Unsupervised Domain Adaptation](https://github.com/Hongbin98/ProCA) 155 | 156 | 157 | 158 | 159 | 160 | - [Attention Guided Network for Retinal Image Segmentation](https://github.com/HzFu/AGNet) 161 | 162 | 163 | 164 | 165 | 166 | 167 | 168 | 169 | 170 | 171 | 172 | - [Towards Hard-Positive Query Mining for DETR-based Human-Object Interaction Detection/Object Detection](https://github.com/MuchHair/HQM.git) 173 | 174 | - [Glance and Gaze: Inferring Action-aware Points for One-Stage Human-Object Interaction Detection](https://github.com/SherlockHolmes221/GGNet.git) 175 | 176 | - [Polysemy Deciphering Network for Human-Object Interaction Detection](https://github.com/MuchHair/PD-Net.git) 177 | 178 | - [Bidirectional Posture-Appearance Interaction Network for Driver Behavior Recognition](https://github.com/SCUT-AILab/BPAI-Net) 179 | 180 | - [Improving Generative Adversarial Networks with Local Coordinate Coding](https://github.com/SCUTjinchengli/LCCGAN-v2) 181 | 182 | 183 | 184 | 185 | 186 | - [Collaborative Unsupervised Domain Adaptation for Medical Image Diagnosis](https://github.com/Vanint/CoUDA) 187 | 188 | 189 | 190 | 191 | 192 | 193 | 194 | 195 | 196 | 197 | 198 | 199 | - [3D Perception](Visual-Perception/3D-Perception/) 200 | 201 | - [SAM-6D: Segment Anything Model Meets Zero-Shot 6D Object Pose Estimation](https://github.com/JiehongLin/SAM-6D) 202 | 203 | - [Frustum ConvNet: Sliding Frustums to Aggregate Local Point-Wise Features for Amodal 3D Object Detection](https://github.com/Gorilla-Lab-SCUT/frustum-convnet) 204 | 205 | - [Instance Segmentation in 3D Scenes using Semantic Superpoint Tree Networks](https://github.com/Gorilla-Lab-SCUT/SSTNet) 206 | 207 | - [VISTA: Boosting 3D Object Detection via Dual Cross-VIew SpaTial Attention](https://github.com/Gorilla-Lab-SCUT/VISTA) 208 | 209 | - [A Self-Supervised Gait Encoding Approach with Locality-Awareness for 3D Skeleton Based Person Re-Identification](https://github.com/Kali-Hac/Locality-Awareness-SGE) 210 | 211 | - [Deep Multi-View Learning Using Neuron-Wise Correlation-Maximizing Regularizers](https://github.com/JiehongLin/CorrReg) 212 | 213 | - [A Skeleton-Bridged Deep Learning Approach for Generating Meshes of Complex Topologies From Single RGB Images](https://github.com/Gorilla-Lab-SCUT/SkeletonNet) 214 | 215 | - [Perception-Aware Multi-Sensor Fusion for 3D LiDAR Semantic Segmentation](https://github.com/ICEORY/PMF) 216 | 217 | - [CPEM: Consistent Parameter Estimation Model](https://github.com/deepmo24/CPEM) 218 | 219 | - [CR-NeRF: Cross-Ray Neural Radiance Fields for Novel-view Synthesis from Unconstrained Image Collections](https://github.com/YifYang993/CR-NeRF-PyTorch) 220 | 221 | - [Contextual Point Cloud Modeling for Weakly-supervised Point Cloud Semantic Segmentation](https://github.com/lizhaoliu-Lec/CPCM) 222 | 223 | - [Quasi-Balanced Self-Training on Noise-Aware Synthesis of Object Point Clouds for Closing Domain Gap](https://github.com/gorilla-lab-scut/qs3) 224 | 225 | - [HelixSurf: A Robust and Efficient Neural Implicit Surface Learning of Indoor 
Scenes with Iterative Intertwined Regularization](https://github.com/Gorilla-Lab-SCUT/HelixSurf) 226 | 227 | 228 | #### 🎧 Audio 229 | 230 | - [Automatic Speech Recognition](https://github.com/qiaoweima/chatbot_ASR) 231 | 232 | - [Dialogue System](https://github.com/qiaoweima/chatbot_SER) 233 | 234 | - [Text To Speech](https://github.com/qiaoweima/chatbot_TTS.git) 235 | 236 | - [Audio Anti-spoofing](https://github.com/qiaoweima/Audio-Anti-Spoofing/tree/main) 237 | 238 | - [Blizzard_Challenge](https://github.com/qiaoweima/Blizzard_Challenge) 239 | 240 | - [Voice Activity Detection](https://github.com/HolgerBovbjerg/SSL-PVAD) 241 | 242 | - [RegNet](https://github.com/PeihaoChen/regnet) 243 | 244 | #### 💬 NLP 245 | 246 | - [How to Train Your Agent to Read and Write](https://github.com/menggehe/DRAW) 247 | 248 | - [CogVLM](https://github.com/THUDM/CogVLM.git) 249 | 250 | - [Qwen](https://github.com/QwenLM/Qwen.git) 251 | 252 | #### 🔮 Multi-Modal 253 | 254 | - [Test-Time Model Adaptation for Visual Question Answering with Debiased Self-Supervisions](https://github.com/Zhiquan-Wen/TDS) 255 | 256 | - [Debiased Visual Question Answering from Feature and Sample Perspectives](https://github.com/Zhiquan-Wen/D-VQA) 257 | 258 | - [Intelligent Home 3D: Automatic 3D-House Design from Linguistic Descriptions Only](https://github.com/chenqi008/HPGM) 259 | 260 | - [Cross-Modal Relation-Aware Networks for Audio-Visual Event Localization](https://github.com/FloretCat/CMRAN) 261 | 262 | - [Cascade Reasoning Network for Text-based Visual Question Answering](https://github.com/guanghuixu/CRN_tvqa) 263 | 264 | - [Length-Controllable Image Captioning](https://github.com/bearcatt/LaBERT) 265 | 266 | - [V2C: Visual Voice Cloning](https://github.com/chenqi008/V2C) 267 | 268 | #### 🤖 Robotic 269 | 270 | - [Learning Active Camera for Multi-Object Navigation](https://github.com/PeihaoChen/ActiveCamera) 271 | 272 | - [Weakly-Supervised Multi-Granularity Map Learning for Vision-and-Language Navigation](https://github.com/PeihaoChen/WS-MGMap) 273 | 274 | - [Learning Vision-and-Language Navigation from YouTube Videos](https://github.com/JeremyLinky/YouTube-VLN) 275 | 276 | 283 | --------------------------------------------------------------------------------