├── evaluation ├── __init__.py ├── main_utils.py └── eval.py ├── figures └── teaser.png ├── LICENSE.md ├── README.md └── data ├── croissanta_hf_data.json └── croissant_data.json /evaluation/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /figures/teaser.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/eric-ai-lab/MMWorld/HEAD/figures/teaser.png -------------------------------------------------------------------------------- /LICENSE.md: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) [2024] 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos 2 | 3 | [Xuehai He](https://sheehan1230.github.io/)†,1, [Weixi Feng*](https://weixi-feng.github.io/)2, [Kaizhi Zheng*](https://kzzheng.github.io/)1, [Yujie Lu*](https://yujielu10.github.io/)2, [Wanrong Zhu*](https://wanrong-zhu.com/)2, [Jiachen Li*](https://sites.google.com/view/jiachenli/)2, [Yue Fan*](http://www.yfan.site/)1, [Jianfeng Wang](https://scholar.google.com/citations?user=vJWEw_8AAAAJ&hl=en)3, [Linjie Li](https://www.linkedin.com/in/linjie-li/)3, [Zhengyuan Yang](https://zyang-ur.github.io/)3, [Kevin Lin](https://sites.google.com/site/kevinlin311tw/me)3, [William Yang Wang](https://sites.cs.ucsb.edu/~william/)2, [Xin Eric Wang](https://eric-xw.github.io/)†,1 4 | 5 | 6 | 1UCSC, 2UCSB, 3Microsoft 7 | 8 | *Equal contribution 9 | 10 | 11 | 12 | 13 | 14 | ![Teaser figure](figures/teaser.png) 15 | 16 | 17 | ## TODO 18 | - [x] Release dataset 19 | - [x] Release evaluation code 20 | - [x] EvalAI server setup 21 | - [x] Hugging Face server setup 22 | - [x] Support evaluation with lmms-eval 23 | 24 | ## :fire: News 25 | * **[2025.07.14]** We integrate the benchmark into [AGI-Eval](https://agi-eval.cn/evaluation/detail?id=66) platform. More models and results will be updated there. 26 | * **[2024.09.21]** We integrate the benchmark into lmms-eval. 27 | * **[2024.09.17]** We set up the Hugging Face server. 28 | * **[2024.08.9]** We set up the EvalAI server. 
The portal will open for submissions soon. 29 | * **[2024.07.1]** We add the evaluation toolkit. 30 | * **[2024.06.12]** We release our dataset. 31 | 32 | 33 | 34 | ## Dataset Structure 35 | The dataset can be downloaded from Hugging Face. 36 | Each entry in the dataset contains the following fields: 37 | - `video_id`: Unique identifier for the video. Same as the relative path of the downloaded video 38 | - `video_url`: URL of the video 39 | - `discipline`: Main discipline of the video content 40 | - `subdiscipline`: Sub-discipline of the video content 41 | - `captions`: List of captions describing the video content 42 | - `questions`: List of questions related to the video content, each with options and correct answer 43 | 44 | 45 | ## Example Entry 46 | 47 | ```json 48 | { 49 | "video_id": "eng_vid1", 50 | "video_url": "https://youtu.be/-e1_QhJ1EhQ", 51 | "discipline": "Tech & Engineering", 52 | "subdiscipline": "Robotics", 53 | "captions": [ 54 | "The humanoid robot Atlas interacts with objects and modifies the course to reach its goal." 55 | ], 56 | "questions": [ 57 | { 58 | "type": "Explanation", 59 | "question": "Why is the engineer included at the beginning of the video?", 60 | "options": { 61 | "a": "The reason might be to imply the practical uses of Atlas in a commercial setting, to be an assistant who can perform complex tasks", 62 | "b": "To show how professional engineers can be forgetful sometimes", 63 | "c": "The engineer is controlling the robot manually", 64 | "d": "The engineer is instructing Atlas to build a house" 65 | }, 66 | "answer": "The reason might be to imply the practical uses of Atlas in a commercial setting, to be an assistant who can perform complex tasks", 67 | "requires_domain_knowledge": false, 68 | "requires_audio": false, 69 | "requires_visual": true, 70 | "question_only": false, 71 | "correct_answer_label": "a" 72 | } 73 | ] 74 | } 75 | ``` 76 | 77 | 78 | ## Evaluation 79 | 80 | You can do evaluation by running our evaluation code [eval.py](evaluation/eval.py). Note that access to the GPT-4 API is required, as defined in line 387 of `eval.py`. 81 | To use our example evaluation code, you need to define your model initialization function, such as: 82 | ```python 83 | modelname_init() 84 | ``` 85 | at line 357 of eval.py, and the model answer function, such as: 86 | ```python 87 | modelname_answer() 88 | ``` 89 | at line 226 of eval.py. 90 | 91 | Alternatively, you may prepare your model results and submit them to the EvalAI server. The model results format should be as follows: 92 | 93 | ```json 94 | { 95 | "detailed_results": [ 96 | { 97 | "video_id": "eng_vid1", 98 | "model_answer": "a", 99 | }, 100 | ... 101 | ] 102 | } 103 | ``` 104 | 105 | 106 | 107 | 108 | ## License Agreement 109 | Please refer to [LICENSE](./LICENSE.md). 110 | All videos of the MMworld benchmark are obtained from the Internet which are not property of our institutions. The copyright remains with the original owners of the video. 111 | Should you encounter any data samples violating the copyright or licensing regulations of any site, please contact us. Upon verification, those samples will be promptly removed. 
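## Submission Example

The sketch below shows one way to assemble the submission file described in the Evaluation section above. It is a minimal example, assuming the annotation JSON (`mmworld.json`) has already been downloaded from the Hugging Face dataset page into the working directory; `my_model_answer` and the `submission.json` output name are hypothetical placeholders for your own model and file layout.

```python
import json

def my_model_answer(question, options, video_url):
    """Hypothetical stand-in for your model; it should return an option label such as 'a'."""
    return "a"

# Load the annotations (field names follow the Dataset Structure section above).
with open("mmworld.json", "r") as f:
    dataset = json.load(f)

# Collect one record per question in the format expected by the EvalAI server.
detailed_results = []
for entry in dataset:
    for q in entry["questions"]:
        label = my_model_answer(q["question"], q["options"], entry["video_url"])
        detailed_results.append({
            "video_id": entry["video_id"],
            "model_answer": label,
        })

with open("submission.json", "w") as f:
    json.dump({"detailed_results": detailed_results}, f, indent=4)
```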
112 | 113 | 114 | ## Citation 115 | ``` 116 | @misc{he2024mmworld, 117 | title={MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos}, 118 | author={Xuehai He and Weixi Feng and Kaizhi Zheng and Yujie Lu and Wanrong Zhu and Jiachen Li and Yue Fan and Jianfeng Wang and Linjie Li and Zhengyuan Yang and Kevin Lin and William Yang Wang and Lijuan Wang and Xin Eric Wang}, 119 | year={2024}, 120 | eprint={2406.08407}, 121 | archivePrefix={arXiv}, 122 | primaryClass={cs.CV} 123 | } 124 | ``` 125 | -------------------------------------------------------------------------------- /data/croissanta_hf_data.json: -------------------------------------------------------------------------------- 1 | { 2 | "@context": { 3 | "@language": "en", 4 | "@vocab": "https://schema.org/", 5 | "citeAs": "cr:citeAs", 6 | "column": "cr:column", 7 | "conformsTo": "dct:conformsTo", 8 | "cr": "http://mlcommons.org/croissant/", 9 | "data": { 10 | "@id": "cr:data", 11 | "@type": "@json" 12 | }, 13 | "dataBiases": "cr:dataBiases", 14 | "dataCollection": "cr:dataCollection", 15 | "dataType": { 16 | "@id": "cr:dataType", 17 | "@type": "@vocab" 18 | }, 19 | "dct": "http://purl.org/dc/terms/", 20 | "extract": "cr:extract", 21 | "field": "cr:field", 22 | "fileProperty": "cr:fileProperty", 23 | "fileObject": "cr:fileObject", 24 | "fileSet": "cr:fileSet", 25 | "format": "cr:format", 26 | "includes": "cr:includes", 27 | "isLiveDataset": "cr:isLiveDataset", 28 | "jsonPath": "cr:jsonPath", 29 | "key": "cr:key", 30 | "md5": "cr:md5", 31 | "parentField": "cr:parentField", 32 | "path": "cr:path", 33 | "personalSensitiveInformation": "cr:personalSensitiveInformation", 34 | "recordSet": "cr:recordSet", 35 | "references": "cr:references", 36 | "regex": "cr:regex", 37 | "repeated": "cr:repeated", 38 | "replace": "cr:replace", 39 | "sc": "https://schema.org/", 40 | "separator": "cr:separator", 41 | "source": "cr:source", 42 | "subField": "cr:subField", 43 | "transform": "cr:transform" 44 | }, 45 | "@type": "sc:Dataset", 46 | "distribution": [ 47 | { 48 | "@type": "cr:FileObject", 49 | "@id": "repo", 50 | "name": "repo", 51 | "description": "The Hugging Face git repository.", 52 | "contentUrl": "https://huggingface.co/datasets/Xuehai/MMWorld/tree/refs%2Fconvert%2Fparquet", 53 | "encodingFormat": "git+https", 54 | "sha256": "https://github.com/mlcommons/croissant/issues/80" 55 | }, 56 | { 57 | "@type": "cr:FileSet", 58 | "@id": "parquet-files-for-config-default", 59 | "name": "parquet-files-for-config-default", 60 | "description": "The underlying Parquet files as converted by Hugging Face (see: https://huggingface.co/docs/datasets-server/parquet).", 61 | "containedIn": { 62 | "@id": "repo" 63 | }, 64 | "encodingFormat": "application/x-parquet", 65 | "includes": "default/*/*.parquet" 66 | } 67 | ], 68 | "recordSet": [ 69 | { 70 | "@type": "cr:RecordSet", 71 | "@id": "default", 72 | "name": "default", 73 | "description": "Xuehai/MMWorld - 'default' subset\n\nAdditional information:\n- 2 skipped columns: questions, captions", 74 | "field": [ 75 | { 76 | "@type": "cr:Field", 77 | "@id": "default/video_url", 78 | "name": "default/video_url", 79 | "description": "Column 'video_url' from the Hugging Face parquet file.", 80 | "dataType": "sc:Text", 81 | "source": { 82 | "fileSet": { 83 | "@id": "parquet-files-for-config-default" 84 | }, 85 | "extract": { 86 | "column": "video_url" 87 | } 88 | } 89 | }, 90 | { 91 | "@type": "cr:Field", 92 | "@id": "default/correct_answer_label", 93 | "name": "default/correct_answer_label", 94 
| "description": "Column 'correct_answer_label' from the Hugging Face parquet file.", 95 | "dataType": "sc:Text", 96 | "source": { 97 | "fileSet": { 98 | "@id": "parquet-files-for-config-default" 99 | }, 100 | "extract": { 101 | "column": "correct_answer_label" 102 | } 103 | } 104 | }, 105 | { 106 | "@type": "cr:Field", 107 | "@id": "default/subdiscipline", 108 | "name": "default/subdiscipline", 109 | "description": "Column 'subdiscipline' from the Hugging Face parquet file.", 110 | "dataType": "sc:Text", 111 | "source": { 112 | "fileSet": { 113 | "@id": "parquet-files-for-config-default" 114 | }, 115 | "extract": { 116 | "column": "subdiscipline" 117 | } 118 | } 119 | }, 120 | { 121 | "@type": "cr:Field", 122 | "@id": "default/video_id", 123 | "name": "default/video_id", 124 | "description": "Column 'video_id' from the Hugging Face parquet file.", 125 | "dataType": "sc:Text", 126 | "source": { 127 | "fileSet": { 128 | "@id": "parquet-files-for-config-default" 129 | }, 130 | "extract": { 131 | "column": "video_id" 132 | } 133 | } 134 | }, 135 | { 136 | "@type": "cr:Field", 137 | "@id": "default/discipline", 138 | "name": "default/discipline", 139 | "description": "Column 'discipline' from the Hugging Face parquet file.", 140 | "dataType": "sc:Text", 141 | "source": { 142 | "fileSet": { 143 | "@id": "parquet-files-for-config-default" 144 | }, 145 | "extract": { 146 | "column": "discipline" 147 | } 148 | } 149 | }, 150 | { 151 | "@type": "cr:Field", 152 | "@id": "default/clip_video_url", 153 | "name": "default/clip_video_url", 154 | "description": "Column 'clip_video_url' from the Hugging Face parquet file.", 155 | "dataType": "sc:Text", 156 | "source": { 157 | "fileSet": { 158 | "@id": "parquet-files-for-config-default" 159 | }, 160 | "extract": { 161 | "column": "clip_video_url" 162 | } 163 | } 164 | }, 165 | { 166 | "@type": "cr:Field", 167 | "@id": "default/duration", 168 | "name": "default/duration", 169 | "description": "Column 'duration' from the Hugging Face parquet file.", 170 | "dataType": "sc:Text", 171 | "source": { 172 | "fileSet": { 173 | "@id": "parquet-files-for-config-default" 174 | }, 175 | "extract": { 176 | "column": "duration" 177 | } 178 | } 179 | } 180 | ] 181 | } 182 | ], 183 | "conformsTo": "http://mlcommons.org/croissant/1.0", 184 | "name": "MMWorld", 185 | "description": "Xuehai/MMWorld dataset hosted on Hugging Face and contributed by the HF Datasets community", 186 | "alternateName": [ 187 | "Xuehai/MMWorld" 188 | ], 189 | "creator": { 190 | "@type": "Person", 191 | "name": "He", 192 | "url": "https://huggingface.co/Xuehai" 193 | }, 194 | "keywords": [ 195 | "cc-by-4.0", 196 | "Croissant", 197 | "🇺🇸 Region: US" 198 | ], 199 | "license": "https://choosealicense.com/licenses/cc-by-4.0/", 200 | "url": "https://huggingface.co/datasets/Xuehai/MMWorld" 201 | } -------------------------------------------------------------------------------- /data/croissant_data.json: -------------------------------------------------------------------------------- 1 | { 2 | "@context": { 3 | "@language": "en", 4 | "@vocab": "https://schema.org/", 5 | "citeAs": "cr:citeAs", 6 | "column": "cr:column", 7 | "conformsTo": "dct:conformsTo", 8 | "cr": "http://mlcommons.org/croissant/", 9 | "rai": "http://mlcommons.org/croissant/RAI/", 10 | "data": { 11 | "@id": "cr:data", 12 | "@type": "@json" 13 | }, 14 | "dataType": { 15 | "@id": "cr:dataType", 16 | "@type": "@vocab" 17 | }, 18 | "dct": "http://purl.org/dc/terms/", 19 | "examples": { 20 | "@id": "cr:examples", 21 | "@type": "@json" 22 | }, 23 
| "extract": "cr:extract", 24 | "field": "cr:field", 25 | "fileProperty": "cr:fileProperty", 26 | "fileObject": "cr:fileObject", 27 | "fileSet": "cr:fileSet", 28 | "format": "cr:format", 29 | "includes": "cr:includes", 30 | "isLiveDataset": "cr:isLiveDataset", 31 | "jsonPath": "cr:jsonPath", 32 | "key": "cr:key", 33 | "md5": "cr:md5", 34 | "parentField": "cr:parentField", 35 | "path": "cr:path", 36 | "recordSet": "cr:recordSet", 37 | "references": "cr:references", 38 | "regex": "cr:regex", 39 | "repeated": "cr:repeated", 40 | "replace": "cr:replace", 41 | "sc": "https://schema.org/", 42 | "separator": "cr:separator", 43 | "source": "cr:source", 44 | "subField": "cr:subField", 45 | "transform": "cr:transform" 46 | }, 47 | "@type": "sc:Dataset", 48 | "name": "mmworld", 49 | "description": "Dataset containing video IDs, URLs, disciplines, subdisciplines, captions, and questions for various videos.", 50 | "conformsTo": "http://mlcommons.org/croissant/1.0", 51 | "license": "https://creativecommons.org/licenses/by/4.0/", 52 | "url": "https://mmworld-bench.github.io/", 53 | "distribution": [ 54 | { 55 | "@type": "cr:FileObject", 56 | "@id": "mmworld", 57 | "name": "mmworld.json", 58 | "description": "Dataset containing video IDs, URLs, disciplines, subdisciplines, captions, and questions.", 59 | "contentUrl": "mmworld.json", 60 | "encodingFormat": "application/json", 61 | "sha256": "658ed65e043b845be1adce62f995d4fd85b610eeee911d2d2b6ebf78e82b1f5a" 62 | } 63 | ], 64 | "recordSet": [ 65 | { 66 | "@type": "cr:RecordSet", 67 | "@id": "video_metadata", 68 | "name": "Video Metadata", 69 | "description": "Metadata for each video.", 70 | "field": [ 71 | { 72 | "@type": "cr:Field", 73 | "@id": "video_id", 74 | "name": "video_id", 75 | "description": "The video ID.", 76 | "dataType": "sc:Text", 77 | "source": { 78 | "fileObject": { 79 | "@id": "mmworld" 80 | }, 81 | "extract": { 82 | "jsonPath": "$[*].video_id" 83 | } 84 | } 85 | }, 86 | { 87 | "@type": "cr:Field", 88 | "@id": "video_url", 89 | "name": "video_url", 90 | "description": "The video URL.", 91 | "dataType": "sc:Text", 92 | "source": { 93 | "fileObject": { 94 | "@id": "mmworld" 95 | }, 96 | "extract": { 97 | "jsonPath": "$[*].video_url" 98 | } 99 | } 100 | }, 101 | { 102 | "@type": "cr:Field", 103 | "@id": "discipline", 104 | "name": "discipline", 105 | "description": "The discipline.", 106 | "dataType": "sc:Text", 107 | "source": { 108 | "fileObject": { 109 | "@id": "mmworld" 110 | }, 111 | "extract": { 112 | "jsonPath": "$[*].discipline" 113 | } 114 | } 115 | }, 116 | { 117 | "@type": "cr:Field", 118 | "@id": "subdiscipline", 119 | "name": "subdiscipline", 120 | "description": "The subdiscipline.", 121 | "dataType": "sc:Text", 122 | "source": { 123 | "fileObject": { 124 | "@id": "mmworld" 125 | }, 126 | "extract": { 127 | "jsonPath": "$[*].subdiscipline" 128 | } 129 | } 130 | }, 131 | { 132 | "@type": "cr:Field", 133 | "@id": "captions", 134 | "name": "captions", 135 | "description": "The video captions.", 136 | "dataType": "sc:Text", 137 | "source": { 138 | "fileObject": { 139 | "@id": "mmworld" 140 | }, 141 | "extract": { 142 | "jsonPath": "$[*].captions" 143 | } 144 | } 145 | } 146 | ] 147 | }, 148 | { 149 | "@type": "cr:RecordSet", 150 | "@id": "questions", 151 | "name": "Questions", 152 | "description": "Questions associated with the videos.", 153 | "field": [ 154 | { 155 | "@type": "cr:Field", 156 | "@id": "type", 157 | "name": "type", 158 | "description": "The type of question.", 159 | "dataType": "sc:Text", 160 | "source": { 161 | 
"fileObject": { 162 | "@id": "mmworld" 163 | }, 164 | "extract": { 165 | "jsonPath": "$[*].questions[*].type" 166 | } 167 | } 168 | }, 169 | { 170 | "@type": "cr:Field", 171 | "@id": "question", 172 | "name": "question", 173 | "description": "The question.", 174 | "dataType": "sc:Text", 175 | "source": { 176 | "fileObject": { 177 | "@id": "mmworld" 178 | }, 179 | "extract": { 180 | "jsonPath": "$[*].questions[*].question" 181 | } 182 | } 183 | }, 184 | { 185 | "@type": "cr:Field", 186 | "@id": "options", 187 | "name": "options", 188 | "description": "The options for the question.", 189 | "dataType": "sc:Text", 190 | "source": { 191 | "fileObject": { 192 | "@id": "mmworld" 193 | }, 194 | "extract": { 195 | "jsonPath": "$[*].questions[*].options" 196 | } 197 | } 198 | }, 199 | { 200 | "@type": "cr:Field", 201 | "@id": "answer", 202 | "name": "answer", 203 | "description": "The correct answer for the question.", 204 | "dataType": "sc:Text", 205 | "source": { 206 | "fileObject": { 207 | "@id": "mmworld" 208 | }, 209 | "extract": { 210 | "jsonPath": "$[*].questions[*].answer" 211 | } 212 | } 213 | }, 214 | { 215 | "@type": "cr:Field", 216 | "@id": "requires_domain_knowledge", 217 | "name": "requires_domain_knowledge", 218 | "description": "Whether the question requires domain knowledge.", 219 | "dataType": "sc:Text", 220 | "source": { 221 | "fileObject": { 222 | "@id": "mmworld" 223 | }, 224 | "extract": { 225 | "jsonPath": "$[*].questions[*].requires_domain_knowledge" 226 | } 227 | } 228 | }, 229 | { 230 | "@type": "cr:Field", 231 | "@id": "requires_audio", 232 | "name": "requires_audio", 233 | "description": "Whether the question requires audio.", 234 | "dataType": "sc:Text", 235 | "source": { 236 | "fileObject": { 237 | "@id": "mmworld" 238 | }, 239 | "extract": { 240 | "jsonPath": "$[*].questions[*].requires_audio" 241 | } 242 | } 243 | }, 244 | { 245 | "@type": "cr:Field", 246 | "@id": "requires_visual", 247 | "name": "requires_visual", 248 | "description": "Whether the question requires visual.", 249 | "dataType": "sc:Text", 250 | "source": { 251 | "fileObject": { 252 | "@id": "mmworld" 253 | }, 254 | "extract": { 255 | "jsonPath": "$[*].questions[*].requires_visual" 256 | } 257 | } 258 | }, 259 | { 260 | "@type": "cr:Field", 261 | "@id": "question_only", 262 | "name": "question_only", 263 | "description": "Whether the question is a question-only type.", 264 | "dataType": "sc:Text", 265 | "source": { 266 | "fileObject": { 267 | "@id": "mmworld" 268 | }, 269 | "extract": { 270 | "jsonPath": "$[*].questions[*].question_only" 271 | } 272 | } 273 | }, 274 | { 275 | "@type": "cr:Field", 276 | "@id": "correct_answer_label", 277 | "name": "correct_answer_label", 278 | "description": "The label of the correct answer.", 279 | "dataType": "sc:Text", 280 | "source": { 281 | "fileObject": { 282 | "@id": "mmworld" 283 | }, 284 | "extract": { 285 | "jsonPath": "$[*].questions[*].correct_answer_label" 286 | } 287 | } 288 | } 289 | ] 290 | } 291 | ] 292 | } -------------------------------------------------------------------------------- /evaluation/main_utils.py: -------------------------------------------------------------------------------- 1 | # from moviepy.editor import VideoFileClip 2 | # from pytube import YouTube 3 | import sys 4 | import os 5 | 6 | import base64 7 | 8 | def calculate_video_length(video_path): 9 | try: 10 | with VideoFileClip(video_path) as video: 11 | return video.duration 12 | except Exception as e: 13 | print(f"Error calculating video length for {video_path}: {e}") 14 | return 
0 15 | 16 | 17 | def show_progress(stream, chunk, bytes_remaining): 18 | total_size = stream.filesize 19 | bytes_downloaded = total_size - bytes_remaining 20 | sys.stdout.write(f"Downloading: {bytes_downloaded / total_size * 100:.2f}%\r") 21 | sys.stdout.flush() 22 | 23 | 24 | def answer_post_processing(model_answer): 25 | parts = model_answer.split('.') 26 | model_answer_processed = parts[0].replace("The answer is", "").strip().lower().strip("\"'") 27 | model_answer_label = model_answer_processed.split(':')[0].strip() 28 | return model_answer_label 29 | 30 | 31 | def format_time(seconds): 32 | hours = int(seconds // 3600) 33 | minutes = int((seconds % 3600) // 60) 34 | seconds = int(seconds % 60) 35 | return f"{hours:02}:{minutes:02}:{seconds:02}" 36 | 37 | 38 | def get_transcript_with_formatted_time(video_url): 39 | video_id = extract_video_id(video_url) 40 | if video_id is None: 41 | return "Invalid YouTube URL or Video ID not found" 42 | return YouTubeTranscriptApi.get_transcript(video_id) 43 | 44 | 45 | def encode_images_to_base64(directory): 46 | images_base64 = [] 47 | for image_name in os.listdir(directory): 48 | image_path = os.path.join(directory, image_name) 49 | with open(image_path, "rb") as image_file: 50 | encoded_string = base64.b64encode(image_file.read()).decode() 51 | images_base64.append({"image": encoded_string}) 52 | return images_base64 53 | 54 | 55 | def download_video(url, video_id, output_path): 56 | if os.path.exists(output_path): 57 | print(f"Video with {url} already downloaded. Skipping download.") 58 | return output_path 59 | try: 60 | if "shorts" in url: 61 | video_id = video_id.replace("shorts/", "") 62 | print(f"Shorts video detected. ") 63 | yt = YouTube(url, on_progress_callback=show_progress) 64 | video_stream = yt.streams.filter(progressive=True, file_extension='mp4').order_by('resolution').desc().first() 65 | if video_stream: 66 | video_stream.download(output_path=output_path, filename=f"{video_id}.mp4") 67 | print(f"\nDownloaded {yt.title} successfully.") 68 | return True 69 | else: 70 | print("No suitable video stream found for:", url) 71 | return False 72 | except Exception as e: 73 | print(f"Error downloading video {url}: {e}") 74 | return False 75 | 76 | 77 | def compute_question_accuracy(model_answer, correct_answer_label, options): 78 | return model_answer == correct_answer_label.lower().strip() 79 | 80 | 81 | def compute_question_accuracy_with_gpt(answer_evaluator, model_answer, correct_answer_label, question, options): 82 | options_str = "\n".join([f"Option {label.upper()}: {text}" for label, text in options.items()]) 83 | # prompt = (f"I will present a response from a question-answering model and several answer options. " 84 | # f"Your task is to evaluate the response and determine which of the following options it most closely aligns with.\n\n" 85 | # f"Response: '{model_answer}'\n\n" 86 | # f"Options:\n{options_str}\n\n" 87 | # "Indicate the most similar option by responding with the corresponding letter only (a, b, c, or d).") 88 | prompt="For question: "+question+"\n"\ 89 | + options_str \ 90 | + "(please select one)\nGround truth answer: "\ 91 | + correct_answer_label\ 92 | + "\nModel predicted answer: "+model_answer\ 93 | + "\nBased on the question and the ground truth answer, is the model's predicted answer correct? If multi-choice provided, think about which choice is selected by the model, is it correct? 
(please answer yes/no)\n" 94 | try: 95 | response = answer_evaluator.chat.completions.create( 96 | model="xx", 97 | messages=[ 98 | {"role": "system", "content": "You are a helpful assistant that provides concise answers."}, 99 | {"role": "user", "content": prompt} 100 | ] 101 | ) 102 | 103 | 104 | except Exception as e: 105 | print(f"Error evaluating response: {e}") 106 | # ai_response = model_answer 107 | return 'Error', False 108 | ai_response = response.choices[0].message.content.strip().lower() 109 | if "yes" in ai_response : 110 | gpt_judge_correct = True 111 | else: 112 | gpt_judge_correct = False 113 | return ai_response, gpt_judge_correct 114 | 115 | 116 | 117 | def gpt(question, options, prompt): 118 | constructed_url = 'xx' 119 | headers = { 120 | 'Content-Type': 'application/json', 121 | "api-key": "xx", 122 | } 123 | 124 | def run_api(body): 125 | request = requests.post(constructed_url, headers=headers, json=body) 126 | response = request.json() 127 | return response 128 | 129 | body = [{ 130 | 'role' : 'system', 131 | 'content' : ['You are an expert in assisting human. Follows the user prompt in a completion mode. Generate precise and clear response. End your response with {END}.'], 132 | }, 133 | { 134 | 'role' : 'user', 135 | 'content' : [prompt], 136 | }, 137 | ] 138 | 139 | inputs = {} 140 | inputs['messages'] = body # for "chat" 141 | inputs['max_tokens'] = 1024 142 | inputs['stop'] = "{END}" 143 | results = run_api(inputs) 144 | return results 145 | 146 | def gpt4v(question, options, prompt, video, video_id): 147 | no_of_frames_to_returned = 8 148 | 149 | videoframefolder = f"./video_benchmark/clipped_video/{video_id}" 150 | if not os.path.exists(videoframefolder): 151 | os.makedirs(videoframefolder) 152 | diskwriter = KeyFrameDiskWriter(location=videoframefolder) 153 | video_file_path = video 154 | 155 | print(f"Input video file path = {video_file_path}") 156 | 157 | 158 | try: 159 | vd.extract_video_keyframes( 160 | no_of_frames=no_of_frames_to_returned, file_path=video_file_path, 161 | writer=diskwriter 162 | ) 163 | except Exception as e: 164 | print(f"Error in extracting video keyframes: {e}") 165 | images_base64 = [] 166 | 167 | 168 | if len(os.listdir(videoframefolder)) == 0: 169 | images_base64 = [] 170 | else: 171 | images_base64 = encode_images_to_base64(videoframefolder) 172 | constructed_url = 'xx' 173 | headers = { 174 | 'Content-Type': 'application/json', 175 | 'api-key': 'xx' 176 | } 177 | 178 | def run_api(body): 179 | request = requests.post(constructed_url, headers=headers, json=body) 180 | response = request.json() 181 | return response 182 | 183 | prompt = f"Based on the following video frames extracted from the video, answer the question {question} by selecting one from the giving answers {options}." 184 | 185 | body = [{ 186 | 'role' : 'system', 187 | 'content' : ['You are an expert in assisting human. Follows the user prompt in a completion mode. Generate precise and clear response. 
End your response with {END}.'], 188 | }, 189 | { 190 | 'role' : 'user', 191 | 'content' : [prompt, *images_base64], 192 | }, 193 | ] 194 | 195 | inputs = {} 196 | inputs['messages'] = body 197 | inputs['max_tokens'] = 1024 198 | inputs['stop'] = "{END}" 199 | results = run_api(inputs) 200 | return results 201 | 202 | def gpt4o(question, options, prompt, video, video_id): 203 | no_of_frames_to_returned = 8 204 | 205 | videoframefolder = f"./video_benchmark/clipped_video/{video_id}" 206 | if not os.path.exists(videoframefolder): 207 | os.makedirs(videoframefolder) 208 | diskwriter = KeyFrameDiskWriter(location=videoframefolder) 209 | video_file_path = video 210 | 211 | print(f"Input video file path = {video_file_path}") 212 | 213 | 214 | try: 215 | vd.extract_video_keyframes( 216 | no_of_frames=no_of_frames_to_returned, file_path=video_file_path, 217 | writer=diskwriter 218 | ) 219 | except Exception as e: 220 | print(f"Error in extracting video keyframes: {e}") 221 | images_base64 = [] 222 | 223 | 224 | if len(os.listdir(videoframefolder)) == 0: 225 | images_base64 = [] 226 | else: 227 | images_base64 = encode_images_to_base64(videoframefolder) 228 | api_base = "xx" 229 | deployment_name = "xx" 230 | api_version = "2024-03-01-preview" 231 | constructed_url = f"{api_base}/openai/deployments/{deployment_name}/chat/completions?api-version={api_version}" 232 | headers = { 233 | 'Content-Type': 'application/json', 234 | 'api-key': 'xx' 235 | } 236 | 237 | def run_api(body): 238 | request = requests.post(constructed_url, headers=headers, json=body) 239 | response = request.json() 240 | return response 241 | 242 | prompt = f"Based on the following video frames extracted from the video, answer the question {question} by selecting one from the giving answers {options}." 243 | 244 | 245 | body = [ 246 | { 247 | 'role': 'system', 248 | 'content': 'You are an expert in assisting humans. Follow the user prompt in a completion mode. Generate precise and clear response. End your response with {END}.' 249 | }, 250 | { 251 | 'role': 'user', 252 | 'content': prompt 253 | }, 254 | { 255 | 'role': 'user', 256 | 'content': images_base64 257 | } 258 | ] 259 | 260 | inputs = {} 261 | inputs['messages'] = body # for "chat" 262 | inputs['max_tokens'] = 2000 263 | inputs['stop'] = "{END}" 264 | results = run_api(inputs) 265 | return results -------------------------------------------------------------------------------- /evaluation/eval.py: -------------------------------------------------------------------------------- 1 | import json 2 | import sys 3 | import os 4 | import PIL.Image 5 | import glob 6 | 7 | import requests 8 | import time 9 | import argparse 10 | 11 | import argparse 12 | from main_utils import * 13 | from openai import AzureOpenAI 14 | 15 | import copy 16 | 17 | from Katna.video import Video 18 | from Katna.writer import KeyFrameDiskWriter 19 | vd = Video() 20 | 21 | 22 | 23 | 24 | 25 | 26 | def answer_generator(answer_evaluator, video_file, question, options, correct_answer, correct_answer_label, question_type, annotations, video_id, detailed_results, subject_data, modelname, tokenizer=None, processor=None, image_processor=None): 27 | prompt = f"Answer the question {question} by selecting one from the giving answers {options}. Respond with only single letter such as a, b, c ,d." 28 | # prompt = f"Answer the question {question} by selecting one from the giving answers {options}. Also give reasons of why you select this answer after your selelcted amswer." 
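    # NOTE: subject_data holds the per-discipline tallies (results_by_subject[subject] in __main__).
    # The model's raw reply is later reduced to an option letter by answer_post_processing() and/or
    # verified against correct_answer_label by the GPT judge in compute_question_accuracy_with_gpt()
    # (both defined in main_utils.py).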
29 | subject_data["total_questions"] += 1 30 | 31 | 32 | 33 | if modelname == 'gpt' or modelname == 'gpt4o': 34 | max_retries= 3000000 35 | retry_delay = 0.0001 36 | retry_count = 0 37 | 38 | while retry_count < max_retries: 39 | if modelname == 'gpt': 40 | model_response = gpt4v(question, options, prompt, video_file, video_id) 41 | elif modelname == 'gpt4o': 42 | model_response = gpt4o(question, options, prompt, video_file, video_id) 43 | if 'choices' in model_response: 44 | model_answer = model_response['choices'][0]['message']['content'] 45 | print('The model answer is:', model_answer) 46 | break 47 | elif model_response['error']['code'] == '429': 48 | print(f"Rate limit exceeded. Error message is {model_response}, Retrying in {retry_delay} seconds..., retry count: {retry_count}") 49 | time.sleep(retry_delay) 50 | elif model_response['error']['code'] == 'content_filter': 51 | print(f"Content filter triggered. Error message is {model_response}, Retrying in {retry_delay} seconds..., retry count: {retry_count}") 52 | model_answer = 'content_filter' 53 | time.sleep(retry_delay) 54 | break 55 | elif 'error' in model_response: 56 | print(f"Error message is {model_response['error']['message']}, Retrying in {retry_delay} seconds..., retry count: {retry_count}") 57 | model_answer = model_response['error']['message'] 58 | time.sleep(retry_delay) 59 | break 60 | 61 | retry_count += 1 62 | print('Model selected: gpt, question:', question, 'options:', options, 'answer:', model_answer) 63 | elif modelname == 'gemini': 64 | safety_settings = [ 65 | { 66 | "category": "HARM_CATEGORY_DANGEROUS", 67 | "threshold": "BLOCK_NONE", 68 | }, 69 | { 70 | "category": "HARM_CATEGORY_HARASSMENT", 71 | "threshold": "BLOCK_NONE", 72 | }, 73 | { 74 | "category": "HARM_CATEGORY_HATE_SPEECH", 75 | "threshold": "BLOCK_NONE", 76 | }, 77 | { 78 | "category": "HARM_CATEGORY_SEXUALLY_EXPLICIT", 79 | "threshold": "BLOCK_NONE", 80 | }, 81 | { 82 | "category": "HARM_CATEGORY_DANGEROUS_CONTENT", 83 | "threshold": "BLOCK_NONE", 84 | }, 85 | ] 86 | no_of_frames_to_returned = 10 87 | 88 | videoframefolder = f"./clipped_video/{video_id}" 89 | images = [] 90 | if os.path.exists(videoframefolder) and len(os.listdir(videoframefolder)) > 0: 91 | for image_file_name in os.listdir(videoframefolder): 92 | image_path = os.path.join(videoframefolder, image_file_name) 93 | try: 94 | img = PIL.Image.open(image_path) 95 | images.append(img) 96 | except Exception as e: 97 | print(f"Error loading image {image_file_name}: {e}") 98 | else: 99 | if not os.path.exists(videoframefolder): 100 | os.makedirs(videoframefolder) 101 | diskwriter = KeyFrameDiskWriter(location=videoframefolder) 102 | video_file_path = videofile 103 | 104 | print(f"Input video file path = {video_file_path}") 105 | 106 | 107 | try: 108 | vd.extract_video_keyframes( 109 | no_of_frames=no_of_frames_to_returned, file_path=video_file_path, 110 | writer=diskwriter 111 | ) 112 | image_path = os.path.join(videoframefolder, image_file_name) 113 | img = PIL.Image.open(image_path) 114 | images.append(img) 115 | except Exception as e: 116 | print(f"Error in extracting video keyframes: {e}") 117 | 118 | if images: 119 | for attempt in range(5): 120 | try: 121 | model_answer = models.generate_content([prompt] + images, safety_settings=safety_settings).text 122 | break 123 | except Exception as e: 124 | print(f"Attempt {attempt+1} failed: {e}") 125 | if attempt == 4: 126 | model_answer = 'error' 127 | else: 128 | print("No images found in the directory.") 129 | model_answer = 'No images to 
process.' 130 | print('Model selected: gemini, question:', question, 'options:', options, 'answer:', model_answer) 131 | elif modelname == 'claude': 132 | import anthropic 133 | from io import BytesIO 134 | 135 | client = anthropic.Anthropic( 136 | api_key="xx", 137 | ) 138 | 139 | no_of_frames_to_returned = 10 140 | 141 | videoframefolder = f"./video_benchmark/clipped_video/{video_id}" 142 | images = [] 143 | if os.path.exists(videoframefolder) and len(os.listdir(videoframefolder)) > 0: 144 | for image_file_name in os.listdir(videoframefolder): 145 | image_path = os.path.join(videoframefolder, image_file_name) 146 | try: 147 | img = PIL.Image.open(image_path) 148 | buffered = BytesIO() 149 | img.save(buffered, format="JPEG") 150 | img_base64 = base64.b64encode(buffered.getvalue()).decode("utf-8") 151 | images.append(img_base64) 152 | except Exception as e: 153 | print(f"Error loading image {image_file_name}: {e}") 154 | 155 | 156 | message_content = [] 157 | 158 | for img_base64 in images: 159 | message_content.append( 160 | { 161 | "type": "image", 162 | "source": { 163 | "type": "base64", 164 | "media_type": "image/jpeg", 165 | "data": img_base64, 166 | }, 167 | } 168 | ) 169 | 170 | 171 | message_content.append( 172 | { 173 | "type": "text", 174 | "text": prompt 175 | } 176 | ) 177 | 178 | 179 | messages_payload = [{"role": "user", "content": message_content}] 180 | message = client.messages.create( 181 | model="claude-3-5-sonnet-20240620", 182 | max_tokens=1024, 183 | messages=messages_payload, 184 | ) 185 | 186 | model_answer = message.content[0].text 187 | model_answer = answer_post_processing(model_answer) 188 | print('Model selected:', modelname, 'question:', question, 'options:', options, 'answer:', model_answer) 189 | elif modelname == 'videochat': 190 | model_answer = videochat_answer(models, video_file, question, options) 191 | model_answer = answer_post_processing(model_answer) 192 | print('Model selected:', modelname, 'question:', question, 'options:', options, 'answer:', model_answer) 193 | elif modelname == 'videollama': 194 | model_answer = videollama_answer(models, video_file, question, options) 195 | model_answer = answer_post_processing(model_answer) 196 | print('Model selected:', modelname, 'question:', question, 'options:', options, 'answer:', model_answer) 197 | elif modelname == 'chatunivi': 198 | model_answer = chatunivi_answer(models, video_file, question, options, prompt, tokenizer) 199 | model_answer = answer_post_processing(model_answer) 200 | print('Model selected:', modelname, 'question:', question, 'options:', options, 'answer:', model_answer) 201 | elif modelname == 'mplugowl': 202 | model_answer = mplugowl_answer(models, video_id, video_file, question, options, prompt, tokenizer, image_processor) 203 | model_answer = answer_post_processing(model_answer) 204 | print('Model selected:', modelname, 'question:', question, 'options:', options, 'answer:', model_answer) 205 | elif modelname == 'otter': 206 | model_answer = otter_answer(models, video_file, question, options, prompt, image_processor) 207 | model_answer = answer_post_processing(model_answer) 208 | print('Model selected:', modelname, 'question:', question, 'options:', options, 'answer:', model_answer) 209 | elif 'xinstruct' in modelname: 210 | model_answer = xinstruct_answer(models, video_file, question, options, image_processor) 211 | model_answer = answer_post_processing(model_answer) 212 | print('Model selected: xinstruct, question:', question, 'options:', options, 'answer:', model_answer) 213 | 
elif modelname == 'pandagpt': 214 | model_answer = pandagpt_answer(models, video_file, question, options, prompt) 215 | model_answer = answer_post_processing(model_answer) 216 | print('Model selected: PandaGPT, question:', question, 'options:', options, 'answer:', model_answer) 217 | elif modelname == 'imagebind_llm': 218 | model_answer = imagebind_llm_answer(models, video_file, question, options, prompt) 219 | model_answer = answer_post_processing(model_answer) 220 | print('Model selected:', modelname, 'question:', question, 'options:', options, 'answer:', model_answer) 221 | elif modelname == 'lwm': 222 | model_answer = lwm_answer(models, video_file, question, options, prompt) 223 | model_answer = answer_post_processing(model_answer) 224 | print('Model selected:', modelname, 'question:', question, 'options:', options, 'answer:', model_answer) 225 | elif modelname == 'videollava': 226 | model_answer = videollava_answer(models, video_file, question, options, prompt, tokenizer, processor, video_processor) 227 | model_answer = answer_post_processing(model_answer) 228 | print('Model selected:', modelname, 'question:', question, 'options:', options, 'answer:', model_answer) 229 | else: 230 | print("Invalid model name. Exiting.") 231 | sys.exit(1) 232 | 233 | gpt_processed_answer, is_correct = compute_question_accuracy_with_gpt(answer_evaluator, model_answer, correct_answer_label, question ,options) 234 | if is_correct: 235 | subject_data["correct_answers"] += 1 236 | 237 | for annotation, value in annotations.items(): 238 | subject_data["accuracy_per_annotation"].setdefault(annotation, {"total": 0, "correct": 0}) 239 | if value: 240 | subject_data["accuracy_per_annotation"][annotation]["total"] += 1 241 | if value and is_correct: 242 | subject_data["accuracy_per_annotation"][annotation]["correct"] += 1 243 | 244 | test = copy.deepcopy(subject_data["accuracy_per_annotation"]) 245 | subject_data["accuracy_per_question_type"].setdefault(question_type, {"total": 0, "correct": 0}) 246 | subject_data["accuracy_per_question_type"][question_type]["total"] += 1 247 | if is_correct: 248 | subject_data["accuracy_per_question_type"][question_type]["correct"] += 1 249 | 250 | detailed_results.append({ 251 | "subject": subject, 252 | "video_id": video_id, 253 | "question": question, 254 | "correct_answer": correct_answer, 255 | "correct_answer_label": correct_answer_label, 256 | "model_answer": model_answer, 257 | 'gpt_processed_answer': gpt_processed_answer, 258 | "options": options, 259 | "is_correct": is_correct, 260 | "annotations": annotations, 261 | "question_type": question_type, 262 | "subject_data": test 263 | }) 264 | 265 | with open(detailed_results_paths[run_idx], 'w') as f: 266 | json.dump(detailed_results, f, indent=4) 267 | print(f"Saved detailed results to {detailed_results_paths[run_idx]}") 268 | 269 | 270 | 271 | 272 | if __name__ == "__main__": 273 | 274 | parser = argparse.ArgumentParser(description="Initialize and run model") 275 | parser.add_argument("modelname", type=str, help="Name of the model to initialize and run") 276 | parser.add_argument("--textonly", action="store_true", help="Flag to indicate if the model should run in text-only mode") 277 | 278 | args = parser.parse_args() 279 | 280 | 281 | modelname = args.modelname 282 | textonly = args.textonly 283 | 284 | 285 | 286 | if modelname == "imagebind_llm": 287 | sys.path.append(os.path.abspath("./LLaMA-Adapter/imagebind_LLM")) 288 | from eval_imagebind_llm import imagebind_llm_init, imagebind_llm_answer 289 | elif modelname == 
"lwm": 290 | sys.path.append('./video_benchmark/LWM') 291 | from eval_LWM import lwm_init, lwm_answer 292 | elif modelname == "mplugowl": 293 | sys.path.append('./video_benchmark/mPLUG-Owl2') 294 | from eval_mplug_owl import mplugowl_init, mplugowl_answer 295 | elif modelname == "otter": 296 | sys.path.append('./video_benchmark/otter') 297 | sys.path.append('./video_benchmark/otter/src') 298 | from eval_otter import otter_init, otter_answer 299 | elif modelname == "videochat": 300 | sys.path.append('./video_benchmark/video_chat2') 301 | from eval_video_chat import videochat2_init, videochat_answer 302 | elif modelname == "videollama": 303 | sys.path.append('./video_benchmark/Video_llama') 304 | from eval_video_llama import videollama_init, videollama_answer 305 | elif modelname == "videollava": 306 | sys.path.append('./video_benchmark/Video-LLaVA') 307 | from eval_video_llava import videollava_init, videollava_answer 308 | elif modelname == "xinstruct": 309 | sys.path.append('./video_benchmark/LAVIS-XInstructBLIP') 310 | from eval_xinstruct import xinstruct_init, xinstruct_answer 311 | elif modelname == "pandagpt": 312 | sys.path.append(os.path.abspath("./PandaGPT")) 313 | sys.path.append(os.path.abspath('./PandaGPT/code')) 314 | from eval_pandagpt import pandagpt_init, pandagpt_answer 315 | elif modelname == "llamaadapter": 316 | sys.path.append(os.path.abspath("./LLaMA-Adapter/imagebind_LLM")) 317 | from eval_imagebind_llm import imagebind_llm_init, imagebind_llm_answer 318 | 319 | 320 | if modelname == 'gemini': 321 | import google.generativeai as genai 322 | models = genai.GenerativeModel('gemini-pro-vision') 323 | GOOGLE_API_KEY="xx" 324 | genai.configure(api_key=GOOGLE_API_KEY) 325 | 326 | elif modelname == 'videochat': 327 | models = videochat2_init() 328 | 329 | elif modelname == 'videollama': 330 | models = videollama_init() 331 | 332 | elif modelname == 'chatunivi': 333 | models, tokenizer = chatunivi_init() 334 | 335 | elif modelname == 'otter': 336 | models, image_processor = otter_init() 337 | 338 | elif modelname == 'mplugowl': 339 | models, tokenizer, image_processor = mplugowl_init() 340 | 341 | elif modelname == 'xinstruct-7b': 342 | models, image_processor = xinstruct_init("vicuna7b_v2") 343 | 344 | elif modelname == 'xinstruct-13b': 345 | models, image_processor = xinstruct_init("vicuna13b") 346 | 347 | elif modelname == 'pandagpt': 348 | models = pandagpt_init() 349 | 350 | elif modelname == 'imagebind_llm': 351 | models = imagebind_llm_init() 352 | 353 | elif modelname == 'lwm': 354 | models = lwm_init() 355 | 356 | elif modelname == 'videollava': 357 | models, tokenizer, processor, video_processor = videollava_init() 358 | 359 | 360 | 361 | 362 | 363 | 364 | 365 | num_runs = 3 366 | 367 | detailed_results_dir = 'detailed_results' 368 | final_results_dir = 'final_results' 369 | 370 | detailed_results_paths = [os.path.join(detailed_results_dir, f'{modelname}_detailed_results_{i}.json') for i in range(num_runs)] 371 | final_results_paths = [os.path.join(final_results_dir, f'{modelname}_final_results_run_{i}.json') for i in range(num_runs)] 372 | 373 | if not os.path.exists(detailed_results_dir): 374 | os.makedirs(detailed_results_dir ) 375 | if not os.path.exists(final_results_dir): 376 | os.makedirs(final_results_dir) 377 | 378 | print(f"Using model: {modelname}, textonly: {textonly}") 379 | 380 | 381 | videofile = "./video_benchmark/dataset/mmworld.json" 382 | with open(videofile, 'r') as file: 383 | dataset = json.load(file) 384 | 385 | 386 | 387 | answer_evaluator = 
AzureOpenAI( 388 | azure_endpoint="xx", 389 | api_key="xx", 390 | api_version="2023-12-01-preview" 391 | ) 392 | 393 | 394 | for run_idx in range(num_runs): 395 | total_questions = 0 396 | correct_answers = 0 397 | detailed_results = [] 398 | accuracy_per_annotation = {} 399 | accuracy_per_question_type = {} 400 | results_by_subject = {} 401 | 402 | 403 | failed_downloads = [] 404 | missed_video = set() 405 | wrong_video = set() 406 | success_downloads = [] 407 | 408 | 409 | for video_data in dataset: 410 | subject = video_data["discipline"] 411 | if subject not in results_by_subject: 412 | results_by_subject[subject] = { 413 | "total_questions": 0, 414 | "correct_answers": 0, 415 | "accuracy_per_annotation": {}, 416 | "accuracy_per_question_type": {}, 417 | "detailed_results": [] 418 | } 419 | 420 | video_id = video_data["video_id"] 421 | 422 | 423 | for question_data in video_data["questions"]: 424 | question = question_data["question"] 425 | options = question_data["options"] 426 | 427 | correct_answer = question_data["answer"] 428 | correct_answer_label = question_data["correct_answer_label"] 429 | question_type = question_data["type"] 430 | requires_video = question_data["requires_visual"] 431 | annotations = { 432 | "requires_audio": question_data["requires_audio"], 433 | "requires_domain_knowledge": question_data["requires_domain_knowledge"], 434 | "requires_video": requires_video, 435 | "question_only": question_data["question_only"] 436 | } 437 | 438 | video_files = glob.glob(f"./all_data/{video_data['video_id']}/*.mp4") 439 | 440 | if len(video_files) == 0: 441 | missed_video.add(video_data["video_id"]) 442 | 443 | for video_file in video_files: 444 | try: 445 | answer_generator(answer_evaluator, video_file, question, options, correct_answer, correct_answer_label, question_type, annotations, video_data["video_id"], 446 | results_by_subject[subject]["detailed_results"], results_by_subject[subject], modelname, locals().get('tokenizer', None), locals().get('processor', None), locals().get('image_processor', None)) 447 | except Exception as e: 448 | print(f"Error encountered: {e}") 449 | wrong_video.add(video_data["video_id"]) 450 | 451 | 452 | with open('failed_downloads.json', 'w') as f: 453 | json.dump(list(missed_video), f, indent=4) 454 | with open('problemed_video.json', 'w') as f: 455 | json.dump(list(wrong_video), f, indent=4) 456 | 457 | for subject, data in results_by_subject.items(): 458 | 459 | total_questions += data["total_questions"] 460 | correct_answers += data["correct_answers"] 461 | for annotation, value in data["accuracy_per_annotation"].items(): 462 | accuracy_per_annotation.setdefault(annotation, {"total": 0, "correct": 0}) 463 | accuracy_per_annotation[annotation]["total"] += value["total"] 464 | accuracy_per_annotation[annotation]["correct"] += value["correct"] 465 | for question_type, value in data["accuracy_per_question_type"].items(): 466 | accuracy_per_question_type.setdefault(question_type, {"total": 0, "correct": 0}) 467 | accuracy_per_question_type[question_type]["total"] += value["total"] 468 | accuracy_per_question_type[question_type]["correct"] += value["correct"] 469 | 470 | 471 | overall_accuracy = correct_answers / total_questions * 100 472 | results = { 473 | "overall_accuracy": overall_accuracy, 474 | "total_questions": total_questions, 475 | "correct_answers": correct_answers, 476 | "accuracy_per_annotation": accuracy_per_annotation, 477 | "accuracy_per_question_type": accuracy_per_question_type, 478 | "results_by_subject": results_by_subject 
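            # overall_accuracy is a percentage; the per-annotation and per-question-type entries keep
            # raw {"total": ..., "correct": ...} counts aggregated over every discipline in results_by_subject.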
479 | } 480 | 481 | with open(final_results_paths[run_idx], 'w') as file: 482 | json.dump(results, file, indent=4) 483 | print(f'Final results saved in {final_results_paths[run_idx]}') 484 | 485 | --------------------------------------------------------------------------------