├── evaluation ├── __init__.py ├── main_utils.py └── eval.py ├── figures └── teaser.png ├── LICENSE.md ├── README.md └── data ├── croissanta_hf_data.json └── croissant_data.json /evaluation/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /figures/teaser.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/eric-ai-lab/MMWorld/HEAD/figures/teaser.png -------------------------------------------------------------------------------- /LICENSE.md: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) [2024] 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos 2 | 3 | [Xuehai He](https://sheehan1230.github.io/)†,1, [Weixi Feng*](https://weixi-feng.github.io/)2, [Kaizhi Zheng*](https://kzzheng.github.io/)1, [Yujie Lu*](https://yujielu10.github.io/)2, [Wanrong Zhu*](https://wanrong-zhu.com/)2, [Jiachen Li*](https://sites.google.com/view/jiachenli/)2, [Yue Fan*](http://www.yfan.site/)1, [Jianfeng Wang](https://scholar.google.com/citations?user=vJWEw_8AAAAJ&hl=en)3, [Linjie Li](https://www.linkedin.com/in/linjie-li/)3, [Zhengyuan Yang](https://zyang-ur.github.io/)3, [Kevin Lin](https://sites.google.com/site/kevinlin311tw/me)3, [William Yang Wang](https://sites.cs.ucsb.edu/~william/)2, [Xin Eric Wang](https://eric-xw.github.io/)†,1 4 | 5 | 6 | 1UCSC, 2UCSB, 3Microsoft 7 | 8 | *Equal contribution 9 | 10 | 11 | 12 | 13 | 14 | ![Teaser figure](figures/teaser.png) 15 | 16 | 17 | ## TODO 18 | - [x] Release dataset 19 | - [x] Release evaluation code 20 | - [x] EvalAI server setup 21 | - [x] Hugging Face server setup 22 | - [x] Support evaluation with lmms-eval 23 | 24 | ## :fire: News 25 | * **[2025.07.14]** We integrate the benchmark into [AGI-Eval](https://agi-eval.cn/evaluation/detail?id=66) platform. More models and results will be updated there. 26 | * **[2024.09.21]** We integrate the benchmark into lmms-eval. 27 | * **[2024.09.17]** We set up the Hugging Face server. 28 | * **[2024.08.9]** We set up the EvalAI server. 
The portal will open for submissions soon. 29 | * **[2024.07.1]** We add the evaluation toolkit. 30 | * **[2024.06.12]** We release our dataset. 31 | 32 | 33 | 34 | ## Dataset Structure 35 | The dataset can be downloaded from Hugging Face. 36 | Each entry in the dataset contains the following fields: 37 | - `video_id`: Unique identifier for the video. Same as the relative path of the downloaded video 38 | - `video_url`: URL of the video 39 | - `discipline`: Main discipline of the video content 40 | - `subdiscipline`: Sub-discipline of the video content 41 | - `captions`: List of captions describing the video content 42 | - `questions`: List of questions related to the video content, each with options and correct answer 43 | 44 | 45 | ## Example Entry 46 | 47 | ```json 48 | { 49 | "video_id": "eng_vid1", 50 | "video_url": "https://youtu.be/-e1_QhJ1EhQ", 51 | "discipline": "Tech & Engineering", 52 | "subdiscipline": "Robotics", 53 | "captions": [ 54 | "The humanoid robot Atlas interacts with objects and modifies the course to reach its goal." 55 | ], 56 | "questions": [ 57 | { 58 | "type": "Explanation", 59 | "question": "Why is the engineer included at the beginning of the video?", 60 | "options": { 61 | "a": "The reason might be to imply the practical uses of Atlas in a commercial setting, to be an assistant who can perform complex tasks", 62 | "b": "To show how professional engineers can be forgetful sometimes", 63 | "c": "The engineer is controlling the robot manually", 64 | "d": "The engineer is instructing Atlas to build a house" 65 | }, 66 | "answer": "The reason might be to imply the practical uses of Atlas in a commercial setting, to be an assistant who can perform complex tasks", 67 | "requires_domain_knowledge": false, 68 | "requires_audio": false, 69 | "requires_visual": true, 70 | "question_only": false, 71 | "correct_answer_label": "a" 72 | } 73 | ] 74 | } 75 | ``` 76 | 77 | 78 | ## Evaluation 79 | 80 | You can do evaluation by running our evaluation code [eval.py](evaluation/eval.py). Note that access to the GPT-4 API is required, as defined in line 387 of `eval.py`. 81 | To use our example evaluation code, you need to define your model initialization function, such as: 82 | ```python 83 | modelname_init() 84 | ``` 85 | at line 357 of eval.py, and the model answer function, such as: 86 | ```python 87 | modelname_answer() 88 | ``` 89 | at line 226 of eval.py. 90 | 91 | Alternatively, you may prepare your model results and submit them to the EvalAI server. The model results format should be as follows: 92 | 93 | ```json 94 | { 95 | "detailed_results": [ 96 | { 97 | "video_id": "eng_vid1", 98 | "model_answer": "a", 99 | }, 100 | ... 101 | ] 102 | } 103 | ``` 104 | 105 | 106 | 107 | 108 | ## License Agreement 109 | Please refer to [LICENSE](./LICENSE.md). 110 | All videos of the MMworld benchmark are obtained from the Internet which are not property of our institutions. The copyright remains with the original owners of the video. 111 | Should you encounter any data samples violating the copyright or licensing regulations of any site, please contact us. Upon verification, those samples will be promptly removed. 
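## Submission Example

The sketch below shows one way to assemble the submission file described in the Evaluation section above. It is a minimal example, assuming the annotation JSON (`mmworld.json`) has already been downloaded from the Hugging Face dataset page into the working directory; `my_model_answer` and the `submission.json` output name are hypothetical placeholders for your own model and file layout.

```python
import json

def my_model_answer(question, options, video_url):
    """Hypothetical stand-in for your model; it should return an option label such as 'a'."""
    return "a"

# Load the annotations (field names follow the Dataset Structure section above).
with open("mmworld.json", "r") as f:
    dataset = json.load(f)

# Collect one record per question in the format expected by the EvalAI server.
detailed_results = []
for entry in dataset:
    for q in entry["questions"]:
        label = my_model_answer(q["question"], q["options"], entry["video_url"])
        detailed_results.append({
            "video_id": entry["video_id"],
            "model_answer": label,
        })

with open("submission.json", "w") as f:
    json.dump({"detailed_results": detailed_results}, f, indent=4)
```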
112 | 113 | 114 | ## Citation 115 | ``` 116 | @misc{he2024mmworld, 117 | title={MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos}, 118 | author={Xuehai He and Weixi Feng and Kaizhi Zheng and Yujie Lu and Wanrong Zhu and Jiachen Li and Yue Fan and Jianfeng Wang and Linjie Li and Zhengyuan Yang and Kevin Lin and William Yang Wang and Lijuan Wang and Xin Eric Wang}, 119 | year={2024}, 120 | eprint={2406.08407}, 121 | archivePrefix={arXiv}, 122 | primaryClass={cs.CV} 123 | } 124 | ``` 125 | -------------------------------------------------------------------------------- /data/croissanta_hf_data.json: -------------------------------------------------------------------------------- 1 | { 2 | "@context": { 3 | "@language": "en", 4 | "@vocab": "https://schema.org/", 5 | "citeAs": "cr:citeAs", 6 | "column": "cr:column", 7 | "conformsTo": "dct:conformsTo", 8 | "cr": "http://mlcommons.org/croissant/", 9 | "data": { 10 | "@id": "cr:data", 11 | "@type": "@json" 12 | }, 13 | "dataBiases": "cr:dataBiases", 14 | "dataCollection": "cr:dataCollection", 15 | "dataType": { 16 | "@id": "cr:dataType", 17 | "@type": "@vocab" 18 | }, 19 | "dct": "http://purl.org/dc/terms/", 20 | "extract": "cr:extract", 21 | "field": "cr:field", 22 | "fileProperty": "cr:fileProperty", 23 | "fileObject": "cr:fileObject", 24 | "fileSet": "cr:fileSet", 25 | "format": "cr:format", 26 | "includes": "cr:includes", 27 | "isLiveDataset": "cr:isLiveDataset", 28 | "jsonPath": "cr:jsonPath", 29 | "key": "cr:key", 30 | "md5": "cr:md5", 31 | "parentField": "cr:parentField", 32 | "path": "cr:path", 33 | "personalSensitiveInformation": "cr:personalSensitiveInformation", 34 | "recordSet": "cr:recordSet", 35 | "references": "cr:references", 36 | "regex": "cr:regex", 37 | "repeated": "cr:repeated", 38 | "replace": "cr:replace", 39 | "sc": "https://schema.org/", 40 | "separator": "cr:separator", 41 | "source": "cr:source", 42 | "subField": "cr:subField", 43 | "transform": "cr:transform" 44 | }, 45 | "@type": "sc:Dataset", 46 | "distribution": [ 47 | { 48 | "@type": "cr:FileObject", 49 | "@id": "repo", 50 | "name": "repo", 51 | "description": "The Hugging Face git repository.", 52 | "contentUrl": "https://huggingface.co/datasets/Xuehai/MMWorld/tree/refs%2Fconvert%2Fparquet", 53 | "encodingFormat": "git+https", 54 | "sha256": "https://github.com/mlcommons/croissant/issues/80" 55 | }, 56 | { 57 | "@type": "cr:FileSet", 58 | "@id": "parquet-files-for-config-default", 59 | "name": "parquet-files-for-config-default", 60 | "description": "The underlying Parquet files as converted by Hugging Face (see: https://huggingface.co/docs/datasets-server/parquet).", 61 | "containedIn": { 62 | "@id": "repo" 63 | }, 64 | "encodingFormat": "application/x-parquet", 65 | "includes": "default/*/*.parquet" 66 | } 67 | ], 68 | "recordSet": [ 69 | { 70 | "@type": "cr:RecordSet", 71 | "@id": "default", 72 | "name": "default", 73 | "description": "Xuehai/MMWorld - 'default' subset\n\nAdditional information:\n- 2 skipped columns: questions, captions", 74 | "field": [ 75 | { 76 | "@type": "cr:Field", 77 | "@id": "default/video_url", 78 | "name": "default/video_url", 79 | "description": "Column 'video_url' from the Hugging Face parquet file.", 80 | "dataType": "sc:Text", 81 | "source": { 82 | "fileSet": { 83 | "@id": "parquet-files-for-config-default" 84 | }, 85 | "extract": { 86 | "column": "video_url" 87 | } 88 | } 89 | }, 90 | { 91 | "@type": "cr:Field", 92 | "@id": "default/correct_answer_label", 93 | "name": "default/correct_answer_label", 94 
| "description": "Column 'correct_answer_label' from the Hugging Face parquet file.", 95 | "dataType": "sc:Text", 96 | "source": { 97 | "fileSet": { 98 | "@id": "parquet-files-for-config-default" 99 | }, 100 | "extract": { 101 | "column": "correct_answer_label" 102 | } 103 | } 104 | }, 105 | { 106 | "@type": "cr:Field", 107 | "@id": "default/subdiscipline", 108 | "name": "default/subdiscipline", 109 | "description": "Column 'subdiscipline' from the Hugging Face parquet file.", 110 | "dataType": "sc:Text", 111 | "source": { 112 | "fileSet": { 113 | "@id": "parquet-files-for-config-default" 114 | }, 115 | "extract": { 116 | "column": "subdiscipline" 117 | } 118 | } 119 | }, 120 | { 121 | "@type": "cr:Field", 122 | "@id": "default/video_id", 123 | "name": "default/video_id", 124 | "description": "Column 'video_id' from the Hugging Face parquet file.", 125 | "dataType": "sc:Text", 126 | "source": { 127 | "fileSet": { 128 | "@id": "parquet-files-for-config-default" 129 | }, 130 | "extract": { 131 | "column": "video_id" 132 | } 133 | } 134 | }, 135 | { 136 | "@type": "cr:Field", 137 | "@id": "default/discipline", 138 | "name": "default/discipline", 139 | "description": "Column 'discipline' from the Hugging Face parquet file.", 140 | "dataType": "sc:Text", 141 | "source": { 142 | "fileSet": { 143 | "@id": "parquet-files-for-config-default" 144 | }, 145 | "extract": { 146 | "column": "discipline" 147 | } 148 | } 149 | }, 150 | { 151 | "@type": "cr:Field", 152 | "@id": "default/clip_video_url", 153 | "name": "default/clip_video_url", 154 | "description": "Column 'clip_video_url' from the Hugging Face parquet file.", 155 | "dataType": "sc:Text", 156 | "source": { 157 | "fileSet": { 158 | "@id": "parquet-files-for-config-default" 159 | }, 160 | "extract": { 161 | "column": "clip_video_url" 162 | } 163 | } 164 | }, 165 | { 166 | "@type": "cr:Field", 167 | "@id": "default/duration", 168 | "name": "default/duration", 169 | "description": "Column 'duration' from the Hugging Face parquet file.", 170 | "dataType": "sc:Text", 171 | "source": { 172 | "fileSet": { 173 | "@id": "parquet-files-for-config-default" 174 | }, 175 | "extract": { 176 | "column": "duration" 177 | } 178 | } 179 | } 180 | ] 181 | } 182 | ], 183 | "conformsTo": "http://mlcommons.org/croissant/1.0", 184 | "name": "MMWorld", 185 | "description": "Xuehai/MMWorld dataset hosted on Hugging Face and contributed by the HF Datasets community", 186 | "alternateName": [ 187 | "Xuehai/MMWorld" 188 | ], 189 | "creator": { 190 | "@type": "Person", 191 | "name": "He", 192 | "url": "https://huggingface.co/Xuehai" 193 | }, 194 | "keywords": [ 195 | "cc-by-4.0", 196 | "Croissant", 197 | "🇺🇸 Region: US" 198 | ], 199 | "license": "https://choosealicense.com/licenses/cc-by-4.0/", 200 | "url": "https://huggingface.co/datasets/Xuehai/MMWorld" 201 | } -------------------------------------------------------------------------------- /data/croissant_data.json: -------------------------------------------------------------------------------- 1 | { 2 | "@context": { 3 | "@language": "en", 4 | "@vocab": "https://schema.org/", 5 | "citeAs": "cr:citeAs", 6 | "column": "cr:column", 7 | "conformsTo": "dct:conformsTo", 8 | "cr": "http://mlcommons.org/croissant/", 9 | "rai": "http://mlcommons.org/croissant/RAI/", 10 | "data": { 11 | "@id": "cr:data", 12 | "@type": "@json" 13 | }, 14 | "dataType": { 15 | "@id": "cr:dataType", 16 | "@type": "@vocab" 17 | }, 18 | "dct": "http://purl.org/dc/terms/", 19 | "examples": { 20 | "@id": "cr:examples", 21 | "@type": "@json" 22 | }, 23 
| "extract": "cr:extract", 24 | "field": "cr:field", 25 | "fileProperty": "cr:fileProperty", 26 | "fileObject": "cr:fileObject", 27 | "fileSet": "cr:fileSet", 28 | "format": "cr:format", 29 | "includes": "cr:includes", 30 | "isLiveDataset": "cr:isLiveDataset", 31 | "jsonPath": "cr:jsonPath", 32 | "key": "cr:key", 33 | "md5": "cr:md5", 34 | "parentField": "cr:parentField", 35 | "path": "cr:path", 36 | "recordSet": "cr:recordSet", 37 | "references": "cr:references", 38 | "regex": "cr:regex", 39 | "repeated": "cr:repeated", 40 | "replace": "cr:replace", 41 | "sc": "https://schema.org/", 42 | "separator": "cr:separator", 43 | "source": "cr:source", 44 | "subField": "cr:subField", 45 | "transform": "cr:transform" 46 | }, 47 | "@type": "sc:Dataset", 48 | "name": "mmworld", 49 | "description": "Dataset containing video IDs, URLs, disciplines, subdisciplines, captions, and questions for various videos.", 50 | "conformsTo": "http://mlcommons.org/croissant/1.0", 51 | "license": "https://creativecommons.org/licenses/by/4.0/", 52 | "url": "https://mmworld-bench.github.io/", 53 | "distribution": [ 54 | { 55 | "@type": "cr:FileObject", 56 | "@id": "mmworld", 57 | "name": "mmworld.json", 58 | "description": "Dataset containing video IDs, URLs, disciplines, subdisciplines, captions, and questions.", 59 | "contentUrl": "mmworld.json", 60 | "encodingFormat": "application/json", 61 | "sha256": "658ed65e043b845be1adce62f995d4fd85b610eeee911d2d2b6ebf78e82b1f5a" 62 | } 63 | ], 64 | "recordSet": [ 65 | { 66 | "@type": "cr:RecordSet", 67 | "@id": "video_metadata", 68 | "name": "Video Metadata", 69 | "description": "Metadata for each video.", 70 | "field": [ 71 | { 72 | "@type": "cr:Field", 73 | "@id": "video_id", 74 | "name": "video_id", 75 | "description": "The video ID.", 76 | "dataType": "sc:Text", 77 | "source": { 78 | "fileObject": { 79 | "@id": "mmworld" 80 | }, 81 | "extract": { 82 | "jsonPath": "$[*].video_id" 83 | } 84 | } 85 | }, 86 | { 87 | "@type": "cr:Field", 88 | "@id": "video_url", 89 | "name": "video_url", 90 | "description": "The video URL.", 91 | "dataType": "sc:Text", 92 | "source": { 93 | "fileObject": { 94 | "@id": "mmworld" 95 | }, 96 | "extract": { 97 | "jsonPath": "$[*].video_url" 98 | } 99 | } 100 | }, 101 | { 102 | "@type": "cr:Field", 103 | "@id": "discipline", 104 | "name": "discipline", 105 | "description": "The discipline.", 106 | "dataType": "sc:Text", 107 | "source": { 108 | "fileObject": { 109 | "@id": "mmworld" 110 | }, 111 | "extract": { 112 | "jsonPath": "$[*].discipline" 113 | } 114 | } 115 | }, 116 | { 117 | "@type": "cr:Field", 118 | "@id": "subdiscipline", 119 | "name": "subdiscipline", 120 | "description": "The subdiscipline.", 121 | "dataType": "sc:Text", 122 | "source": { 123 | "fileObject": { 124 | "@id": "mmworld" 125 | }, 126 | "extract": { 127 | "jsonPath": "$[*].subdiscipline" 128 | } 129 | } 130 | }, 131 | { 132 | "@type": "cr:Field", 133 | "@id": "captions", 134 | "name": "captions", 135 | "description": "The video captions.", 136 | "dataType": "sc:Text", 137 | "source": { 138 | "fileObject": { 139 | "@id": "mmworld" 140 | }, 141 | "extract": { 142 | "jsonPath": "$[*].captions" 143 | } 144 | } 145 | } 146 | ] 147 | }, 148 | { 149 | "@type": "cr:RecordSet", 150 | "@id": "questions", 151 | "name": "Questions", 152 | "description": "Questions associated with the videos.", 153 | "field": [ 154 | { 155 | "@type": "cr:Field", 156 | "@id": "type", 157 | "name": "type", 158 | "description": "The type of question.", 159 | "dataType": "sc:Text", 160 | "source": { 161 | 
"fileObject": { 162 | "@id": "mmworld" 163 | }, 164 | "extract": { 165 | "jsonPath": "$[*].questions[*].type" 166 | } 167 | } 168 | }, 169 | { 170 | "@type": "cr:Field", 171 | "@id": "question", 172 | "name": "question", 173 | "description": "The question.", 174 | "dataType": "sc:Text", 175 | "source": { 176 | "fileObject": { 177 | "@id": "mmworld" 178 | }, 179 | "extract": { 180 | "jsonPath": "$[*].questions[*].question" 181 | } 182 | } 183 | }, 184 | { 185 | "@type": "cr:Field", 186 | "@id": "options", 187 | "name": "options", 188 | "description": "The options for the question.", 189 | "dataType": "sc:Text", 190 | "source": { 191 | "fileObject": { 192 | "@id": "mmworld" 193 | }, 194 | "extract": { 195 | "jsonPath": "$[*].questions[*].options" 196 | } 197 | } 198 | }, 199 | { 200 | "@type": "cr:Field", 201 | "@id": "answer", 202 | "name": "answer", 203 | "description": "The correct answer for the question.", 204 | "dataType": "sc:Text", 205 | "source": { 206 | "fileObject": { 207 | "@id": "mmworld" 208 | }, 209 | "extract": { 210 | "jsonPath": "$[*].questions[*].answer" 211 | } 212 | } 213 | }, 214 | { 215 | "@type": "cr:Field", 216 | "@id": "requires_domain_knowledge", 217 | "name": "requires_domain_knowledge", 218 | "description": "Whether the question requires domain knowledge.", 219 | "dataType": "sc:Text", 220 | "source": { 221 | "fileObject": { 222 | "@id": "mmworld" 223 | }, 224 | "extract": { 225 | "jsonPath": "$[*].questions[*].requires_domain_knowledge" 226 | } 227 | } 228 | }, 229 | { 230 | "@type": "cr:Field", 231 | "@id": "requires_audio", 232 | "name": "requires_audio", 233 | "description": "Whether the question requires audio.", 234 | "dataType": "sc:Text", 235 | "source": { 236 | "fileObject": { 237 | "@id": "mmworld" 238 | }, 239 | "extract": { 240 | "jsonPath": "$[*].questions[*].requires_audio" 241 | } 242 | } 243 | }, 244 | { 245 | "@type": "cr:Field", 246 | "@id": "requires_visual", 247 | "name": "requires_visual", 248 | "description": "Whether the question requires visual.", 249 | "dataType": "sc:Text", 250 | "source": { 251 | "fileObject": { 252 | "@id": "mmworld" 253 | }, 254 | "extract": { 255 | "jsonPath": "$[*].questions[*].requires_visual" 256 | } 257 | } 258 | }, 259 | { 260 | "@type": "cr:Field", 261 | "@id": "question_only", 262 | "name": "question_only", 263 | "description": "Whether the question is a question-only type.", 264 | "dataType": "sc:Text", 265 | "source": { 266 | "fileObject": { 267 | "@id": "mmworld" 268 | }, 269 | "extract": { 270 | "jsonPath": "$[*].questions[*].question_only" 271 | } 272 | } 273 | }, 274 | { 275 | "@type": "cr:Field", 276 | "@id": "correct_answer_label", 277 | "name": "correct_answer_label", 278 | "description": "The label of the correct answer.", 279 | "dataType": "sc:Text", 280 | "source": { 281 | "fileObject": { 282 | "@id": "mmworld" 283 | }, 284 | "extract": { 285 | "jsonPath": "$[*].questions[*].correct_answer_label" 286 | } 287 | } 288 | } 289 | ] 290 | } 291 | ] 292 | } -------------------------------------------------------------------------------- /evaluation/main_utils.py: -------------------------------------------------------------------------------- 1 | # from moviepy.editor import VideoFileClip 2 | # from pytube import YouTube 3 | import sys 4 | import os 5 | 6 | import base64 7 | 8 | def calculate_video_length(video_path): 9 | try: 10 | with VideoFileClip(video_path) as video: 11 | return video.duration 12 | except Exception as e: 13 | print(f"Error calculating video length for {video_path}: {e}") 14 | return 
0 15 | 16 | 17 | def show_progress(stream, chunk, bytes_remaining): 18 | total_size = stream.filesize 19 | bytes_downloaded = total_size - bytes_remaining 20 | sys.stdout.write(f"Downloading: {bytes_downloaded / total_size * 100:.2f}%\r") 21 | sys.stdout.flush() 22 | 23 | 24 | def answer_post_processing(model_answer): 25 | parts = model_answer.split('.') 26 | model_answer_processed = parts[0].replace("The answer is", "").strip().lower().strip("\"'") 27 | model_answer_label = model_answer_processed.split(':')[0].strip() 28 | return model_answer_label 29 | 30 | 31 | def format_time(seconds): 32 | hours = int(seconds // 3600) 33 | minutes = int((seconds % 3600) // 60) 34 | seconds = int(seconds % 60) 35 | return f"{hours:02}:{minutes:02}:{seconds:02}" 36 | 37 | 38 | def get_transcript_with_formatted_time(video_url): 39 | video_id = extract_video_id(video_url) 40 | if video_id is None: 41 | return "Invalid YouTube URL or Video ID not found" 42 | return YouTubeTranscriptApi.get_transcript(video_id) 43 | 44 | 45 | def encode_images_to_base64(directory): 46 | images_base64 = [] 47 | for image_name in os.listdir(directory): 48 | image_path = os.path.join(directory, image_name) 49 | with open(image_path, "rb") as image_file: 50 | encoded_string = base64.b64encode(image_file.read()).decode() 51 | images_base64.append({"image": encoded_string}) 52 | return images_base64 53 | 54 | 55 | def download_video(url, video_id, output_path): 56 | if os.path.exists(output_path): 57 | print(f"Video with {url} already downloaded. Skipping download.") 58 | return output_path 59 | try: 60 | if "shorts" in url: 61 | video_id = video_id.replace("shorts/", "") 62 | print(f"Shorts video detected. ") 63 | yt = YouTube(url, on_progress_callback=show_progress) 64 | video_stream = yt.streams.filter(progressive=True, file_extension='mp4').order_by('resolution').desc().first() 65 | if video_stream: 66 | video_stream.download(output_path=output_path, filename=f"{video_id}.mp4") 67 | print(f"\nDownloaded {yt.title} successfully.") 68 | return True 69 | else: 70 | print("No suitable video stream found for:", url) 71 | return False 72 | except Exception as e: 73 | print(f"Error downloading video {url}: {e}") 74 | return False 75 | 76 | 77 | def compute_question_accuracy(model_answer, correct_answer_label, options): 78 | return model_answer == correct_answer_label.lower().strip() 79 | 80 | 81 | def compute_question_accuracy_with_gpt(answer_evaluator, model_answer, correct_answer_label, question, options): 82 | options_str = "\n".join([f"Option {label.upper()}: {text}" for label, text in options.items()]) 83 | # prompt = (f"I will present a response from a question-answering model and several answer options. " 84 | # f"Your task is to evaluate the response and determine which of the following options it most closely aligns with.\n\n" 85 | # f"Response: '{model_answer}'\n\n" 86 | # f"Options:\n{options_str}\n\n" 87 | # "Indicate the most similar option by responding with the corresponding letter only (a, b, c, or d).") 88 | prompt="For question: "+question+"\n"\ 89 | + options_str \ 90 | + "(please select one)\nGround truth answer: "\ 91 | + correct_answer_label\ 92 | + "\nModel predicted answer: "+model_answer\ 93 | + "\nBased on the question and the ground truth answer, is the model's predicted answer correct? If multi-choice provided, think about which choice is selected by the model, is it correct? 
(please answer yes/no)\n" 94 | try: 95 | response = answer_evaluator.chat.completions.create( 96 | model="xx", 97 | messages=[ 98 | {"role": "system", "content": "You are a helpful assistant that provides concise answers."}, 99 | {"role": "user", "content": prompt} 100 | ] 101 | ) 102 | 103 | 104 | except Exception as e: 105 | print(f"Error evaluating response: {e}") 106 | # ai_response = model_answer 107 | return 'Error', False 108 | ai_response = response.choices[0].message.content.strip().lower() 109 | if "yes" in ai_response : 110 | gpt_judge_correct = True 111 | else: 112 | gpt_judge_correct = False 113 | return ai_response, gpt_judge_correct 114 | 115 | 116 | 117 | def gpt(question, options, prompt): 118 | constructed_url = 'xx' 119 | headers = { 120 | 'Content-Type': 'application/json', 121 | "api-key": "xx", 122 | } 123 | 124 | def run_api(body): 125 | request = requests.post(constructed_url, headers=headers, json=body) 126 | response = request.json() 127 | return response 128 | 129 | body = [{ 130 | 'role' : 'system', 131 | 'content' : ['You are an expert in assisting human. Follows the user prompt in a completion mode. Generate precise and clear response. End your response with {END}.'], 132 | }, 133 | { 134 | 'role' : 'user', 135 | 'content' : [prompt], 136 | }, 137 | ] 138 | 139 | inputs = {} 140 | inputs['messages'] = body # for "chat" 141 | inputs['max_tokens'] = 1024 142 | inputs['stop'] = "{END}" 143 | results = run_api(inputs) 144 | return results 145 | 146 | def gpt4v(question, options, prompt, video, video_id): 147 | no_of_frames_to_returned = 8 148 | 149 | videoframefolder = f"./video_benchmark/clipped_video/{video_id}" 150 | if not os.path.exists(videoframefolder): 151 | os.makedirs(videoframefolder) 152 | diskwriter = KeyFrameDiskWriter(location=videoframefolder) 153 | video_file_path = video 154 | 155 | print(f"Input video file path = {video_file_path}") 156 | 157 | 158 | try: 159 | vd.extract_video_keyframes( 160 | no_of_frames=no_of_frames_to_returned, file_path=video_file_path, 161 | writer=diskwriter 162 | ) 163 | except Exception as e: 164 | print(f"Error in extracting video keyframes: {e}") 165 | images_base64 = [] 166 | 167 | 168 | if len(os.listdir(videoframefolder)) == 0: 169 | images_base64 = [] 170 | else: 171 | images_base64 = encode_images_to_base64(videoframefolder) 172 | constructed_url = 'xx' 173 | headers = { 174 | 'Content-Type': 'application/json', 175 | 'api-key': 'xx' 176 | } 177 | 178 | def run_api(body): 179 | request = requests.post(constructed_url, headers=headers, json=body) 180 | response = request.json() 181 | return response 182 | 183 | prompt = f"Based on the following video frames extracted from the video, answer the question {question} by selecting one from the giving answers {options}." 184 | 185 | body = [{ 186 | 'role' : 'system', 187 | 'content' : ['You are an expert in assisting human. Follows the user prompt in a completion mode. Generate precise and clear response. 
End your response with {END}.'], 188 | }, 189 | { 190 | 'role' : 'user', 191 | 'content' : [prompt, *images_base64], 192 | }, 193 | ] 194 | 195 | inputs = {} 196 | inputs['messages'] = body 197 | inputs['max_tokens'] = 1024 198 | inputs['stop'] = "{END}" 199 | results = run_api(inputs) 200 | return results 201 | 202 | def gpt4o(question, options, prompt, video, video_id): 203 | no_of_frames_to_returned = 8 204 | 205 | videoframefolder = f"./video_benchmark/clipped_video/{video_id}" 206 | if not os.path.exists(videoframefolder): 207 | os.makedirs(videoframefolder) 208 | diskwriter = KeyFrameDiskWriter(location=videoframefolder) 209 | video_file_path = video 210 | 211 | print(f"Input video file path = {video_file_path}") 212 | 213 | 214 | try: 215 | vd.extract_video_keyframes( 216 | no_of_frames=no_of_frames_to_returned, file_path=video_file_path, 217 | writer=diskwriter 218 | ) 219 | except Exception as e: 220 | print(f"Error in extracting video keyframes: {e}") 221 | images_base64 = [] 222 | 223 | 224 | if len(os.listdir(videoframefolder)) == 0: 225 | images_base64 = [] 226 | else: 227 | images_base64 = encode_images_to_base64(videoframefolder) 228 | api_base = "xx" 229 | deployment_name = "xx" 230 | api_version = "2024-03-01-preview" 231 | constructed_url = f"{api_base}/openai/deployments/{deployment_name}/chat/completions?api-version={api_version}" 232 | headers = { 233 | 'Content-Type': 'application/json', 234 | 'api-key': 'xx' 235 | } 236 | 237 | def run_api(body): 238 | request = requests.post(constructed_url, headers=headers, json=body) 239 | response = request.json() 240 | return response 241 | 242 | prompt = f"Based on the following video frames extracted from the video, answer the question {question} by selecting one from the giving answers {options}." 243 | 244 | 245 | body = [ 246 | { 247 | 'role': 'system', 248 | 'content': 'You are an expert in assisting humans. Follow the user prompt in a completion mode. Generate precise and clear response. End your response with {END}.' 249 | }, 250 | { 251 | 'role': 'user', 252 | 'content': prompt 253 | }, 254 | { 255 | 'role': 'user', 256 | 'content': images_base64 257 | } 258 | ] 259 | 260 | inputs = {} 261 | inputs['messages'] = body # for "chat" 262 | inputs['max_tokens'] = 2000 263 | inputs['stop'] = "{END}" 264 | results = run_api(inputs) 265 | return results -------------------------------------------------------------------------------- /evaluation/eval.py: -------------------------------------------------------------------------------- 1 | import json 2 | import sys 3 | import os 4 | import PIL.Image 5 | import glob 6 | 7 | import requests 8 | import time 9 | import argparse 10 | 11 | import argparse 12 | from main_utils import * 13 | from openai import AzureOpenAI 14 | 15 | import copy 16 | 17 | from Katna.video import Video 18 | from Katna.writer import KeyFrameDiskWriter 19 | vd = Video() 20 | 21 | 22 | 23 | 24 | 25 | 26 | def answer_generator(answer_evaluator, video_file, question, options, correct_answer, correct_answer_label, question_type, annotations, video_id, detailed_results, subject_data, modelname, tokenizer=None, processor=None, image_processor=None): 27 | prompt = f"Answer the question {question} by selecting one from the giving answers {options}. Respond with only single letter such as a, b, c ,d." 28 | # prompt = f"Answer the question {question} by selecting one from the giving answers {options}. Also give reasons of why you select this answer after your selelcted amswer." 
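    # NOTE: subject_data holds the per-discipline tallies (results_by_subject[subject] in __main__).
    # The model's raw reply is later reduced to an option letter by answer_post_processing() and/or
    # verified against correct_answer_label by the GPT judge in compute_question_accuracy_with_gpt()
    # (both defined in main_utils.py).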
29 | subject_data["total_questions"] += 1 30 | 31 | 32 | 33 | if modelname == 'gpt' or modelname == 'gpt4o': 34 | max_retries= 3000000 35 | retry_delay = 0.0001 36 | retry_count = 0 37 | 38 | while retry_count < max_retries: 39 | if modelname == 'gpt': 40 | model_response = gpt4v(question, options, prompt, video_file, video_id) 41 | elif modelname == 'gpt4o': 42 | model_response = gpt4o(question, options, prompt, video_file, video_id) 43 | if 'choices' in model_response: 44 | model_answer = model_response['choices'][0]['message']['content'] 45 | print('The model answer is:', model_answer) 46 | break 47 | elif model_response['error']['code'] == '429': 48 | print(f"Rate limit exceeded. Error message is {model_response}, Retrying in {retry_delay} seconds..., retry count: {retry_count}") 49 | time.sleep(retry_delay) 50 | elif model_response['error']['code'] == 'content_filter': 51 | print(f"Content filter triggered. Error message is {model_response}, Retrying in {retry_delay} seconds..., retry count: {retry_count}") 52 | model_answer = 'content_filter' 53 | time.sleep(retry_delay) 54 | break 55 | elif 'error' in model_response: 56 | print(f"Error message is {model_response['error']['message']}, Retrying in {retry_delay} seconds..., retry count: {retry_count}") 57 | model_answer = model_response['error']['message'] 58 | time.sleep(retry_delay) 59 | break 60 | 61 | retry_count += 1 62 | print('Model selected: gpt, question:', question, 'options:', options, 'answer:', model_answer) 63 | elif modelname == 'gemini': 64 | safety_settings = [ 65 | { 66 | "category": "HARM_CATEGORY_DANGEROUS", 67 | "threshold": "BLOCK_NONE", 68 | }, 69 | { 70 | "category": "HARM_CATEGORY_HARASSMENT", 71 | "threshold": "BLOCK_NONE", 72 | }, 73 | { 74 | "category": "HARM_CATEGORY_HATE_SPEECH", 75 | "threshold": "BLOCK_NONE", 76 | }, 77 | { 78 | "category": "HARM_CATEGORY_SEXUALLY_EXPLICIT", 79 | "threshold": "BLOCK_NONE", 80 | }, 81 | { 82 | "category": "HARM_CATEGORY_DANGEROUS_CONTENT", 83 | "threshold": "BLOCK_NONE", 84 | }, 85 | ] 86 | no_of_frames_to_returned = 10 87 | 88 | videoframefolder = f"./clipped_video/{video_id}" 89 | images = [] 90 | if os.path.exists(videoframefolder) and len(os.listdir(videoframefolder)) > 0: 91 | for image_file_name in os.listdir(videoframefolder): 92 | image_path = os.path.join(videoframefolder, image_file_name) 93 | try: 94 | img = PIL.Image.open(image_path) 95 | images.append(img) 96 | except Exception as e: 97 | print(f"Error loading image {image_file_name}: {e}") 98 | else: 99 | if not os.path.exists(videoframefolder): 100 | os.makedirs(videoframefolder) 101 | diskwriter = KeyFrameDiskWriter(location=videoframefolder) 102 | video_file_path = videofile 103 | 104 | print(f"Input video file path = {video_file_path}") 105 | 106 | 107 | try: 108 | vd.extract_video_keyframes( 109 | no_of_frames=no_of_frames_to_returned, file_path=video_file_path, 110 | writer=diskwriter 111 | ) 112 | image_path = os.path.join(videoframefolder, image_file_name) 113 | img = PIL.Image.open(image_path) 114 | images.append(img) 115 | except Exception as e: 116 | print(f"Error in extracting video keyframes: {e}") 117 | 118 | if images: 119 | for attempt in range(5): 120 | try: 121 | model_answer = models.generate_content([prompt] + images, safety_settings=safety_settings).text 122 | break 123 | except Exception as e: 124 | print(f"Attempt {attempt+1} failed: {e}") 125 | if attempt == 4: 126 | model_answer = 'error' 127 | else: 128 | print("No images found in the directory.") 129 | model_answer = 'No images to 
process.' 130 | print('Model selected: gemini, question:', question, 'options:', options, 'answer:', model_answer) 131 | elif modelname == 'claude': 132 | import anthropic 133 | from io import BytesIO 134 | 135 | client = anthropic.Anthropic( 136 | api_key="xx", 137 | ) 138 | 139 | no_of_frames_to_returned = 10 140 | 141 | videoframefolder = f"./video_benchmark/clipped_video/{video_id}" 142 | images = [] 143 | if os.path.exists(videoframefolder) and len(os.listdir(videoframefolder)) > 0: 144 | for image_file_name in os.listdir(videoframefolder): 145 | image_path = os.path.join(videoframefolder, image_file_name) 146 | try: 147 | img = PIL.Image.open(image_path) 148 | buffered = BytesIO() 149 | img.save(buffered, format="JPEG") 150 | img_base64 = base64.b64encode(buffered.getvalue()).decode("utf-8") 151 | images.append(img_base64) 152 | except Exception as e: 153 | print(f"Error loading image {image_file_name}: {e}") 154 | 155 | 156 | message_content = [] 157 | 158 | for img_base64 in images: 159 | message_content.append( 160 | { 161 | "type": "image", 162 | "source": { 163 | "type": "base64", 164 | "media_type": "image/jpeg", 165 | "data": img_base64, 166 | }, 167 | } 168 | ) 169 | 170 | 171 | message_content.append( 172 | { 173 | "type": "text", 174 | "text": prompt 175 | } 176 | ) 177 | 178 | 179 | messages_payload = [{"role": "user", "content": message_content}] 180 | message = client.messages.create( 181 | model="claude-3-5-sonnet-20240620", 182 | max_tokens=1024, 183 | messages=messages_payload, 184 | ) 185 | 186 | model_answer = message.content[0].text 187 | model_answer = answer_post_processing(model_answer) 188 | print('Model selected:', modelname, 'question:', question, 'options:', options, 'answer:', model_answer) 189 | elif modelname == 'videochat': 190 | model_answer = videochat_answer(models, video_file, question, options) 191 | model_answer = answer_post_processing(model_answer) 192 | print('Model selected:', modelname, 'question:', question, 'options:', options, 'answer:', model_answer) 193 | elif modelname == 'videollama': 194 | model_answer = videollama_answer(models, video_file, question, options) 195 | model_answer = answer_post_processing(model_answer) 196 | print('Model selected:', modelname, 'question:', question, 'options:', options, 'answer:', model_answer) 197 | elif modelname == 'chatunivi': 198 | model_answer = chatunivi_answer(models, video_file, question, options, prompt, tokenizer) 199 | model_answer = answer_post_processing(model_answer) 200 | print('Model selected:', modelname, 'question:', question, 'options:', options, 'answer:', model_answer) 201 | elif modelname == 'mplugowl': 202 | model_answer = mplugowl_answer(models, video_id, video_file, question, options, prompt, tokenizer, image_processor) 203 | model_answer = answer_post_processing(model_answer) 204 | print('Model selected:', modelname, 'question:', question, 'options:', options, 'answer:', model_answer) 205 | elif modelname == 'otter': 206 | model_answer = otter_answer(models, video_file, question, options, prompt, image_processor) 207 | model_answer = answer_post_processing(model_answer) 208 | print('Model selected:', modelname, 'question:', question, 'options:', options, 'answer:', model_answer) 209 | elif 'xinstruct' in modelname: 210 | model_answer = xinstruct_answer(models, video_file, question, options, image_processor) 211 | model_answer = answer_post_processing(model_answer) 212 | print('Model selected: xinstruct, question:', question, 'options:', options, 'answer:', model_answer) 213 | 
elif modelname == 'pandagpt': 214 | model_answer = pandagpt_answer(models, video_file, question, options, prompt) 215 | model_answer = answer_post_processing(model_answer) 216 | print('Model selected: PandaGPT, question:', question, 'options:', options, 'answer:', model_answer) 217 | elif modelname == 'imagebind_llm': 218 | model_answer = imagebind_llm_answer(models, video_file, question, options, prompt) 219 | model_answer = answer_post_processing(model_answer) 220 | print('Model selected:', modelname, 'question:', question, 'options:', options, 'answer:', model_answer) 221 | elif modelname == 'lwm': 222 | model_answer = lwm_answer(models, video_file, question, options, prompt) 223 | model_answer = answer_post_processing(model_answer) 224 | print('Model selected:', modelname, 'question:', question, 'options:', options, 'answer:', model_answer) 225 | elif modelname == 'videollava': 226 | model_answer = videollava_answer(models, video_file, question, options, prompt, tokenizer, processor, video_processor) 227 | model_answer = answer_post_processing(model_answer) 228 | print('Model selected:', modelname, 'question:', question, 'options:', options, 'answer:', model_answer) 229 | else: 230 | print("Invalid model name. Exiting.") 231 | sys.exit(1) 232 | 233 | gpt_processed_answer, is_correct = compute_question_accuracy_with_gpt(answer_evaluator, model_answer, correct_answer_label, question ,options) 234 | if is_correct: 235 | subject_data["correct_answers"] += 1 236 | 237 | for annotation, value in annotations.items(): 238 | subject_data["accuracy_per_annotation"].setdefault(annotation, {"total": 0, "correct": 0}) 239 | if value: 240 | subject_data["accuracy_per_annotation"][annotation]["total"] += 1 241 | if value and is_correct: 242 | subject_data["accuracy_per_annotation"][annotation]["correct"] += 1 243 | 244 | test = copy.deepcopy(subject_data["accuracy_per_annotation"]) 245 | subject_data["accuracy_per_question_type"].setdefault(question_type, {"total": 0, "correct": 0}) 246 | subject_data["accuracy_per_question_type"][question_type]["total"] += 1 247 | if is_correct: 248 | subject_data["accuracy_per_question_type"][question_type]["correct"] += 1 249 | 250 | detailed_results.append({ 251 | "subject": subject, 252 | "video_id": video_id, 253 | "question": question, 254 | "correct_answer": correct_answer, 255 | "correct_answer_label": correct_answer_label, 256 | "model_answer": model_answer, 257 | 'gpt_processed_answer': gpt_processed_answer, 258 | "options": options, 259 | "is_correct": is_correct, 260 | "annotations": annotations, 261 | "question_type": question_type, 262 | "subject_data": test 263 | }) 264 | 265 | with open(detailed_results_paths[run_idx], 'w') as f: 266 | json.dump(detailed_results, f, indent=4) 267 | print(f"Saved detailed results to {detailed_results_paths[run_idx]}") 268 | 269 | 270 | 271 | 272 | if __name__ == "__main__": 273 | 274 | parser = argparse.ArgumentParser(description="Initialize and run model") 275 | parser.add_argument("modelname", type=str, help="Name of the model to initialize and run") 276 | parser.add_argument("--textonly", action="store_true", help="Flag to indicate if the model should run in text-only mode") 277 | 278 | args = parser.parse_args() 279 | 280 | 281 | modelname = args.modelname 282 | textonly = args.textonly 283 | 284 | 285 | 286 | if modelname == "imagebind_llm": 287 | sys.path.append(os.path.abspath("./LLaMA-Adapter/imagebind_LLM")) 288 | from eval_imagebind_llm import imagebind_llm_init, imagebind_llm_answer 289 | elif modelname == 
"lwm": 290 | sys.path.append('./video_benchmark/LWM') 291 | from eval_LWM import lwm_init, lwm_answer 292 | elif modelname == "mplugowl": 293 | sys.path.append('./video_benchmark/mPLUG-Owl2') 294 | from eval_mplug_owl import mplugowl_init, mplugowl_answer 295 | elif modelname == "otter": 296 | sys.path.append('./video_benchmark/otter') 297 | sys.path.append('./video_benchmark/otter/src') 298 | from eval_otter import otter_init, otter_answer 299 | elif modelname == "videochat": 300 | sys.path.append('./video_benchmark/video_chat2') 301 | from eval_video_chat import videochat2_init, videochat_answer 302 | elif modelname == "videollama": 303 | sys.path.append('./video_benchmark/Video_llama') 304 | from eval_video_llama import videollama_init, videollama_answer 305 | elif modelname == "videollava": 306 | sys.path.append('./video_benchmark/Video-LLaVA') 307 | from eval_video_llava import videollava_init, videollava_answer 308 | elif modelname == "xinstruct": 309 | sys.path.append('./video_benchmark/LAVIS-XInstructBLIP') 310 | from eval_xinstruct import xinstruct_init, xinstruct_answer 311 | elif modelname == "pandagpt": 312 | sys.path.append(os.path.abspath("./PandaGPT")) 313 | sys.path.append(os.path.abspath('./PandaGPT/code')) 314 | from eval_pandagpt import pandagpt_init, pandagpt_answer 315 | elif modelname == "llamaadapter": 316 | sys.path.append(os.path.abspath("./LLaMA-Adapter/imagebind_LLM")) 317 | from eval_imagebind_llm import imagebind_llm_init, imagebind_llm_answer 318 | 319 | 320 | if modelname == 'gemini': 321 | import google.generativeai as genai 322 | models = genai.GenerativeModel('gemini-pro-vision') 323 | GOOGLE_API_KEY="xx" 324 | genai.configure(api_key=GOOGLE_API_KEY) 325 | 326 | elif modelname == 'videochat': 327 | models = videochat2_init() 328 | 329 | elif modelname == 'videollama': 330 | models = videollama_init() 331 | 332 | elif modelname == 'chatunivi': 333 | models, tokenizer = chatunivi_init() 334 | 335 | elif modelname == 'otter': 336 | models, image_processor = otter_init() 337 | 338 | elif modelname == 'mplugowl': 339 | models, tokenizer, image_processor = mplugowl_init() 340 | 341 | elif modelname == 'xinstruct-7b': 342 | models, image_processor = xinstruct_init("vicuna7b_v2") 343 | 344 | elif modelname == 'xinstruct-13b': 345 | models, image_processor = xinstruct_init("vicuna13b") 346 | 347 | elif modelname == 'pandagpt': 348 | models = pandagpt_init() 349 | 350 | elif modelname == 'imagebind_llm': 351 | models = imagebind_llm_init() 352 | 353 | elif modelname == 'lwm': 354 | models = lwm_init() 355 | 356 | elif modelname == 'videollava': 357 | models, tokenizer, processor, video_processor = videollava_init() 358 | 359 | 360 | 361 | 362 | 363 | 364 | 365 | num_runs = 3 366 | 367 | detailed_results_dir = 'detailed_results' 368 | final_results_dir = 'final_results' 369 | 370 | detailed_results_paths = [os.path.join(detailed_results_dir, f'{modelname}_detailed_results_{i}.json') for i in range(num_runs)] 371 | final_results_paths = [os.path.join(final_results_dir, f'{modelname}_final_results_run_{i}.json') for i in range(num_runs)] 372 | 373 | if not os.path.exists(detailed_results_dir): 374 | os.makedirs(detailed_results_dir ) 375 | if not os.path.exists(final_results_dir): 376 | os.makedirs(final_results_dir) 377 | 378 | print(f"Using model: {modelname}, textonly: {textonly}") 379 | 380 | 381 | videofile = "./video_benchmark/dataset/mmworld.json" 382 | with open(videofile, 'r') as file: 383 | dataset = json.load(file) 384 | 385 | 386 | 387 | answer_evaluator = 
AzureOpenAI( 388 | azure_endpoint="xx", 389 | api_key="xx", 390 | api_version="2023-12-01-preview" 391 | ) 392 | 393 | 394 | for run_idx in range(num_runs): 395 | total_questions = 0 396 | correct_answers = 0 397 | detailed_results = [] 398 | accuracy_per_annotation = {} 399 | accuracy_per_question_type = {} 400 | results_by_subject = {} 401 | 402 | 403 | failed_downloads = [] 404 | missed_video = set() 405 | wrong_video = set() 406 | success_downloads = [] 407 | 408 | 409 | for video_data in dataset: 410 | subject = video_data["discipline"] 411 | if subject not in results_by_subject: 412 | results_by_subject[subject] = { 413 | "total_questions": 0, 414 | "correct_answers": 0, 415 | "accuracy_per_annotation": {}, 416 | "accuracy_per_question_type": {}, 417 | "detailed_results": [] 418 | } 419 | 420 | video_id = video_data["video_id"] 421 | 422 | 423 | for question_data in video_data["questions"]: 424 | question = question_data["question"] 425 | options = question_data["options"] 426 | 427 | correct_answer = question_data["answer"] 428 | correct_answer_label = question_data["correct_answer_label"] 429 | question_type = question_data["type"] 430 | requires_video = question_data["requires_visual"] 431 | annotations = { 432 | "requires_audio": question_data["requires_audio"], 433 | "requires_domain_knowledge": question_data["requires_domain_knowledge"], 434 | "requires_video": requires_video, 435 | "question_only": question_data["question_only"] 436 | } 437 | 438 | video_files = glob.glob(f"./all_data/{video_data['video_id']}/*.mp4") 439 | 440 | if len(video_files) == 0: 441 | missed_video.add(video_data["video_id"]) 442 | 443 | for video_file in video_files: 444 | try: 445 | answer_generator(answer_evaluator, video_file, question, options, correct_answer, correct_answer_label, question_type, annotations, video_data["video_id"], 446 | results_by_subject[subject]["detailed_results"], results_by_subject[subject], modelname, locals().get('tokenizer', None), locals().get('processor', None), locals().get('image_processor', None)) 447 | except Exception as e: 448 | print(f"Error encountered: {e}") 449 | wrong_video.add(video_data["video_id"]) 450 | 451 | 452 | with open('failed_downloads.json', 'w') as f: 453 | json.dump(list(missed_video), f, indent=4) 454 | with open('problemed_video.json', 'w') as f: 455 | json.dump(list(wrong_video), f, indent=4) 456 | 457 | for subject, data in results_by_subject.items(): 458 | 459 | total_questions += data["total_questions"] 460 | correct_answers += data["correct_answers"] 461 | for annotation, value in data["accuracy_per_annotation"].items(): 462 | accuracy_per_annotation.setdefault(annotation, {"total": 0, "correct": 0}) 463 | accuracy_per_annotation[annotation]["total"] += value["total"] 464 | accuracy_per_annotation[annotation]["correct"] += value["correct"] 465 | for question_type, value in data["accuracy_per_question_type"].items(): 466 | accuracy_per_question_type.setdefault(question_type, {"total": 0, "correct": 0}) 467 | accuracy_per_question_type[question_type]["total"] += value["total"] 468 | accuracy_per_question_type[question_type]["correct"] += value["correct"] 469 | 470 | 471 | overall_accuracy = correct_answers / total_questions * 100 472 | results = { 473 | "overall_accuracy": overall_accuracy, 474 | "total_questions": total_questions, 475 | "correct_answers": correct_answers, 476 | "accuracy_per_annotation": accuracy_per_annotation, 477 | "accuracy_per_question_type": accuracy_per_question_type, 478 | "results_by_subject": results_by_subject 
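            # overall_accuracy is a percentage; the per-annotation and per-question-type entries keep
            # raw {"total": ..., "correct": ...} counts aggregated over every discipline in results_by_subject.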
479 | } 480 | 481 | with open(final_results_paths[run_idx], 'w') as file: 482 | json.dump(results, file, indent=4) 483 | print(f'Final results saved in {final_results_paths[run_idx]}') 484 | 485 | --------------------------------------------------------------------------------