├── evaluation
│   ├── __init__.py
│   ├── main_utils.py
│   └── eval.py
├── figures
│   └── teaser.png
├── LICENSE.md
├── README.md
└── data
    ├── croissanta_hf_data.json
    └── croissant_data.json
/evaluation/__init__.py:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/figures/teaser.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/eric-ai-lab/MMWorld/HEAD/figures/teaser.png
--------------------------------------------------------------------------------
/LICENSE.md:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) [2024]
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos
2 |
3 | [Xuehai He](https://sheehan1230.github.io/)†,1, [Weixi Feng*](https://weixi-feng.github.io/)2, [Kaizhi Zheng*](https://kzzheng.github.io/)1, [Yujie Lu*](https://yujielu10.github.io/)2, [Wanrong Zhu*](https://wanrong-zhu.com/)2, [Jiachen Li*](https://sites.google.com/view/jiachenli/)2, [Yue Fan*](http://www.yfan.site/)1, [Jianfeng Wang](https://scholar.google.com/citations?user=vJWEw_8AAAAJ&hl=en)3, [Linjie Li](https://www.linkedin.com/in/linjie-li/)3, [Zhengyuan Yang](https://zyang-ur.github.io/)3, [Kevin Lin](https://sites.google.com/site/kevinlin311tw/me)3, [William Yang Wang](https://sites.cs.ucsb.edu/~william/)2, [Xin Eric Wang](https://eric-xw.github.io/)†,1
4 |
5 |
6 | 1UCSC, 2UCSB, 3Microsoft
7 |
8 | *Equal contribution
9 |
10 |
11 |
12 |
13 |
14 | 
15 |
16 |
17 | ## TODO
18 | - [x] Release dataset
19 | - [x] Release evaluation code
20 | - [x] EvalAI server setup
21 | - [x] Hugging Face server setup
22 | - [x] Support evaluation with lmms-eval
23 |
24 | ## :fire: News
25 | * **[2025.07.14]** We integrated the benchmark into the [AGI-Eval](https://agi-eval.cn/evaluation/detail?id=66) platform. More models and results will be updated there.
26 | * **[2024.09.21]** We integrated the benchmark into lmms-eval.
27 | * **[2024.09.17]** We set up the Hugging Face server.
28 | * **[2024.08.09]** We set up the EvalAI server. The portal will open for submissions soon.
29 | * **[2024.07.01]** We added the evaluation toolkit.
30 | * **[2024.06.12]** We released our dataset.
31 |
32 |
33 |
34 | ## Dataset Structure
35 | The dataset can be downloaded from [Hugging Face](https://huggingface.co/datasets/Xuehai/MMWorld).
36 | Each entry in the dataset contains the following fields (a loading sketch follows the list):
37 | - `video_id`: Unique identifier for the video; it also serves as the relative path of the downloaded video file
38 | - `video_url`: URL of the video
39 | - `discipline`: Main discipline of the video content
40 | - `subdiscipline`: Sub-discipline of the video content
41 | - `captions`: List of captions describing the video content
42 | - `questions`: List of questions related to the video content, each with options and correct answer
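For a quick look at the annotations, the snippet below is a minimal loading sketch using the Hugging Face `datasets` library. It is not part of the official toolkit; the repository id `Xuehai/MMWorld` comes from the Croissant metadata under `data/`, and the split name used here is an assumption that may need adjusting.

```python
# Minimal loading sketch (not part of the official toolkit).
# Assumes the `datasets` library is installed and that the default
# configuration exposes a "train" split; adjust if your copy differs.
from datasets import load_dataset

ds = load_dataset("Xuehai/MMWorld", split="train")

entry = ds[0]
print(entry["video_id"], entry["discipline"], entry["subdiscipline"])
for q in entry["questions"]:
    print(q["type"], q["question"], q["correct_answer_label"])
```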
43 |
44 |
45 | ## Example Entry
46 |
47 | ```json
48 | {
49 |     "video_id": "eng_vid1",
50 |     "video_url": "https://youtu.be/-e1_QhJ1EhQ",
51 |     "discipline": "Tech & Engineering",
52 |     "subdiscipline": "Robotics",
53 |     "captions": [
54 |         "The humanoid robot Atlas interacts with objects and modifies the course to reach its goal."
55 |     ],
56 |     "questions": [
57 |         {
58 |             "type": "Explanation",
59 |             "question": "Why is the engineer included at the beginning of the video?",
60 |             "options": {
61 |                 "a": "The reason might be to imply the practical uses of Atlas in a commercial setting, to be an assistant who can perform complex tasks",
62 |                 "b": "To show how professional engineers can be forgetful sometimes",
63 |                 "c": "The engineer is controlling the robot manually",
64 |                 "d": "The engineer is instructing Atlas to build a house"
65 |             },
66 |             "answer": "The reason might be to imply the practical uses of Atlas in a commercial setting, to be an assistant who can perform complex tasks",
67 |             "requires_domain_knowledge": false,
68 |             "requires_audio": false,
69 |             "requires_visual": true,
70 |             "question_only": false,
71 |             "correct_answer_label": "a"
72 |         }
73 |     ]
74 | }
75 | ```
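For reference, scoring a predicted option letter against an entry of this shape only requires comparing it with `correct_answer_label` (the toolkit's `compute_question_accuracy` in `evaluation/main_utils.py` does the same, and `eval.py` additionally uses a GPT judge for free-form answers). The helper below is purely illustrative.

```python
# Illustrative helper, not part of the toolkit: check one predicted option
# letter (e.g. "a") against an entry shaped like the example above.
def is_prediction_correct(entry, question_idx, predicted_label):
    question = entry["questions"][question_idx]
    return predicted_label.strip().lower() == question["correct_answer_label"].strip().lower()

# For the example entry, a model that answers "a" to the first question is marked correct.
```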
76 |
77 |
78 | ## Evaluation
79 |
80 | You can run the evaluation with our evaluation script [eval.py](evaluation/eval.py). Note that access to the GPT-4 API is required; the GPT-based answer evaluator is configured at line 387 of `eval.py`.
81 | To evaluate your own model with our example evaluation code, you need to define a model initialization function, such as:
82 | ```python
83 | modelname_init()
84 | ```
85 | at line 357 of `eval.py`, and a model answer function, such as:
86 | ```python
87 | modelname_answer()
88 | ```
89 | at line 226 of `eval.py`.
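As a rough sketch of the interface `eval.py` expects (mirroring the existing `*_init()` / `*_answer()` pairs), the placeholder functions below show where your model loading and inference would go. The names `mymodel_*`, `load_my_model`, and `run_my_model` are hypothetical and not part of this repository.

```python
# Hypothetical interface sketch only; adapt to your own model.
def mymodel_init():
    # Load the model (and any tokenizer/processor) once; eval.py keeps the return value.
    model = load_my_model()  # placeholder: your own loading code
    return model

def mymodel_answer(model, video_file, question, options):
    # Answer a single (video, question) pair and return the raw text answer;
    # eval.py post-processes the string and scores it with the GPT evaluator.
    prompt = (f"Answer the question {question} by selecting one from the given "
              f"answers {options}. Respond with only a single letter such as a, b, c, d.")
    return run_my_model(model, video_file, prompt)  # placeholder: your own inference code
```

Once the two functions are hooked in, the script is launched with the model name as the positional argument, e.g. `python evaluation/eval.py mymodel` (an optional `--textonly` flag is also parsed).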
90 |
91 | Alternatively, you may prepare your model results and submit them to the EvalAI server. The model results format should be as follows:
92 |
93 | ```json
94 | {
95 |     "detailed_results": [
96 |         {
97 |             "video_id": "eng_vid1",
98 |             "model_answer": "a"
99 |         },
100 |         ...
101 |     ]
102 | }
103 | ```
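As a starting point, a submission file in this format can be assembled with a few lines of Python. The snippet below is only a sketch; the single example prediction is a placeholder for your model's outputs.

```python
# Minimal sketch: collect per-question predictions and write them in the
# submission format shown above. The entry here is a placeholder.
import json

predictions = [
    {"video_id": "eng_vid1", "model_answer": "a"},
    # ... one record per question ...
]

with open("submission.json", "w") as f:
    json.dump({"detailed_results": predictions}, f, indent=4)
```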
104 |
105 |
106 |
107 |
108 | ## License Agreement
109 | Please refer to [LICENSE](./LICENSE.md).
110 | All videos in the MMWorld benchmark are obtained from the Internet and are not the property of our institutions. The copyright remains with the original owners of the videos.
111 | Should you encounter any data samples violating the copyright or licensing regulations of any site, please contact us. Upon verification, those samples will be promptly removed.
112 |
113 |
114 | ## Citation
115 | ```
116 | @misc{he2024mmworld,
117 | title={MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos},
118 | author={Xuehai He and Weixi Feng and Kaizhi Zheng and Yujie Lu and Wanrong Zhu and Jiachen Li and Yue Fan and Jianfeng Wang and Linjie Li and Zhengyuan Yang and Kevin Lin and William Yang Wang and Lijuan Wang and Xin Eric Wang},
119 | year={2024},
120 | eprint={2406.08407},
121 | archivePrefix={arXiv},
122 | primaryClass={cs.CV}
123 | }
124 | ```
125 |
--------------------------------------------------------------------------------
/data/croissanta_hf_data.json:
--------------------------------------------------------------------------------
1 | {
2 | "@context": {
3 | "@language": "en",
4 | "@vocab": "https://schema.org/",
5 | "citeAs": "cr:citeAs",
6 | "column": "cr:column",
7 | "conformsTo": "dct:conformsTo",
8 | "cr": "http://mlcommons.org/croissant/",
9 | "data": {
10 | "@id": "cr:data",
11 | "@type": "@json"
12 | },
13 | "dataBiases": "cr:dataBiases",
14 | "dataCollection": "cr:dataCollection",
15 | "dataType": {
16 | "@id": "cr:dataType",
17 | "@type": "@vocab"
18 | },
19 | "dct": "http://purl.org/dc/terms/",
20 | "extract": "cr:extract",
21 | "field": "cr:field",
22 | "fileProperty": "cr:fileProperty",
23 | "fileObject": "cr:fileObject",
24 | "fileSet": "cr:fileSet",
25 | "format": "cr:format",
26 | "includes": "cr:includes",
27 | "isLiveDataset": "cr:isLiveDataset",
28 | "jsonPath": "cr:jsonPath",
29 | "key": "cr:key",
30 | "md5": "cr:md5",
31 | "parentField": "cr:parentField",
32 | "path": "cr:path",
33 | "personalSensitiveInformation": "cr:personalSensitiveInformation",
34 | "recordSet": "cr:recordSet",
35 | "references": "cr:references",
36 | "regex": "cr:regex",
37 | "repeated": "cr:repeated",
38 | "replace": "cr:replace",
39 | "sc": "https://schema.org/",
40 | "separator": "cr:separator",
41 | "source": "cr:source",
42 | "subField": "cr:subField",
43 | "transform": "cr:transform"
44 | },
45 | "@type": "sc:Dataset",
46 | "distribution": [
47 | {
48 | "@type": "cr:FileObject",
49 | "@id": "repo",
50 | "name": "repo",
51 | "description": "The Hugging Face git repository.",
52 | "contentUrl": "https://huggingface.co/datasets/Xuehai/MMWorld/tree/refs%2Fconvert%2Fparquet",
53 | "encodingFormat": "git+https",
54 | "sha256": "https://github.com/mlcommons/croissant/issues/80"
55 | },
56 | {
57 | "@type": "cr:FileSet",
58 | "@id": "parquet-files-for-config-default",
59 | "name": "parquet-files-for-config-default",
60 | "description": "The underlying Parquet files as converted by Hugging Face (see: https://huggingface.co/docs/datasets-server/parquet).",
61 | "containedIn": {
62 | "@id": "repo"
63 | },
64 | "encodingFormat": "application/x-parquet",
65 | "includes": "default/*/*.parquet"
66 | }
67 | ],
68 | "recordSet": [
69 | {
70 | "@type": "cr:RecordSet",
71 | "@id": "default",
72 | "name": "default",
73 | "description": "Xuehai/MMWorld - 'default' subset\n\nAdditional information:\n- 2 skipped columns: questions, captions",
74 | "field": [
75 | {
76 | "@type": "cr:Field",
77 | "@id": "default/video_url",
78 | "name": "default/video_url",
79 | "description": "Column 'video_url' from the Hugging Face parquet file.",
80 | "dataType": "sc:Text",
81 | "source": {
82 | "fileSet": {
83 | "@id": "parquet-files-for-config-default"
84 | },
85 | "extract": {
86 | "column": "video_url"
87 | }
88 | }
89 | },
90 | {
91 | "@type": "cr:Field",
92 | "@id": "default/correct_answer_label",
93 | "name": "default/correct_answer_label",
94 | "description": "Column 'correct_answer_label' from the Hugging Face parquet file.",
95 | "dataType": "sc:Text",
96 | "source": {
97 | "fileSet": {
98 | "@id": "parquet-files-for-config-default"
99 | },
100 | "extract": {
101 | "column": "correct_answer_label"
102 | }
103 | }
104 | },
105 | {
106 | "@type": "cr:Field",
107 | "@id": "default/subdiscipline",
108 | "name": "default/subdiscipline",
109 | "description": "Column 'subdiscipline' from the Hugging Face parquet file.",
110 | "dataType": "sc:Text",
111 | "source": {
112 | "fileSet": {
113 | "@id": "parquet-files-for-config-default"
114 | },
115 | "extract": {
116 | "column": "subdiscipline"
117 | }
118 | }
119 | },
120 | {
121 | "@type": "cr:Field",
122 | "@id": "default/video_id",
123 | "name": "default/video_id",
124 | "description": "Column 'video_id' from the Hugging Face parquet file.",
125 | "dataType": "sc:Text",
126 | "source": {
127 | "fileSet": {
128 | "@id": "parquet-files-for-config-default"
129 | },
130 | "extract": {
131 | "column": "video_id"
132 | }
133 | }
134 | },
135 | {
136 | "@type": "cr:Field",
137 | "@id": "default/discipline",
138 | "name": "default/discipline",
139 | "description": "Column 'discipline' from the Hugging Face parquet file.",
140 | "dataType": "sc:Text",
141 | "source": {
142 | "fileSet": {
143 | "@id": "parquet-files-for-config-default"
144 | },
145 | "extract": {
146 | "column": "discipline"
147 | }
148 | }
149 | },
150 | {
151 | "@type": "cr:Field",
152 | "@id": "default/clip_video_url",
153 | "name": "default/clip_video_url",
154 | "description": "Column 'clip_video_url' from the Hugging Face parquet file.",
155 | "dataType": "sc:Text",
156 | "source": {
157 | "fileSet": {
158 | "@id": "parquet-files-for-config-default"
159 | },
160 | "extract": {
161 | "column": "clip_video_url"
162 | }
163 | }
164 | },
165 | {
166 | "@type": "cr:Field",
167 | "@id": "default/duration",
168 | "name": "default/duration",
169 | "description": "Column 'duration' from the Hugging Face parquet file.",
170 | "dataType": "sc:Text",
171 | "source": {
172 | "fileSet": {
173 | "@id": "parquet-files-for-config-default"
174 | },
175 | "extract": {
176 | "column": "duration"
177 | }
178 | }
179 | }
180 | ]
181 | }
182 | ],
183 | "conformsTo": "http://mlcommons.org/croissant/1.0",
184 | "name": "MMWorld",
185 | "description": "Xuehai/MMWorld dataset hosted on Hugging Face and contributed by the HF Datasets community",
186 | "alternateName": [
187 | "Xuehai/MMWorld"
188 | ],
189 | "creator": {
190 | "@type": "Person",
191 | "name": "He",
192 | "url": "https://huggingface.co/Xuehai"
193 | },
194 | "keywords": [
195 | "cc-by-4.0",
196 | "Croissant",
197 | "🇺🇸 Region: US"
198 | ],
199 | "license": "https://choosealicense.com/licenses/cc-by-4.0/",
200 | "url": "https://huggingface.co/datasets/Xuehai/MMWorld"
201 | }
--------------------------------------------------------------------------------
/data/croissant_data.json:
--------------------------------------------------------------------------------
1 | {
2 | "@context": {
3 | "@language": "en",
4 | "@vocab": "https://schema.org/",
5 | "citeAs": "cr:citeAs",
6 | "column": "cr:column",
7 | "conformsTo": "dct:conformsTo",
8 | "cr": "http://mlcommons.org/croissant/",
9 | "rai": "http://mlcommons.org/croissant/RAI/",
10 | "data": {
11 | "@id": "cr:data",
12 | "@type": "@json"
13 | },
14 | "dataType": {
15 | "@id": "cr:dataType",
16 | "@type": "@vocab"
17 | },
18 | "dct": "http://purl.org/dc/terms/",
19 | "examples": {
20 | "@id": "cr:examples",
21 | "@type": "@json"
22 | },
23 | "extract": "cr:extract",
24 | "field": "cr:field",
25 | "fileProperty": "cr:fileProperty",
26 | "fileObject": "cr:fileObject",
27 | "fileSet": "cr:fileSet",
28 | "format": "cr:format",
29 | "includes": "cr:includes",
30 | "isLiveDataset": "cr:isLiveDataset",
31 | "jsonPath": "cr:jsonPath",
32 | "key": "cr:key",
33 | "md5": "cr:md5",
34 | "parentField": "cr:parentField",
35 | "path": "cr:path",
36 | "recordSet": "cr:recordSet",
37 | "references": "cr:references",
38 | "regex": "cr:regex",
39 | "repeated": "cr:repeated",
40 | "replace": "cr:replace",
41 | "sc": "https://schema.org/",
42 | "separator": "cr:separator",
43 | "source": "cr:source",
44 | "subField": "cr:subField",
45 | "transform": "cr:transform"
46 | },
47 | "@type": "sc:Dataset",
48 | "name": "mmworld",
49 | "description": "Dataset containing video IDs, URLs, disciplines, subdisciplines, captions, and questions for various videos.",
50 | "conformsTo": "http://mlcommons.org/croissant/1.0",
51 | "license": "https://creativecommons.org/licenses/by/4.0/",
52 | "url": "https://mmworld-bench.github.io/",
53 | "distribution": [
54 | {
55 | "@type": "cr:FileObject",
56 | "@id": "mmworld",
57 | "name": "mmworld.json",
58 | "description": "Dataset containing video IDs, URLs, disciplines, subdisciplines, captions, and questions.",
59 | "contentUrl": "mmworld.json",
60 | "encodingFormat": "application/json",
61 | "sha256": "658ed65e043b845be1adce62f995d4fd85b610eeee911d2d2b6ebf78e82b1f5a"
62 | }
63 | ],
64 | "recordSet": [
65 | {
66 | "@type": "cr:RecordSet",
67 | "@id": "video_metadata",
68 | "name": "Video Metadata",
69 | "description": "Metadata for each video.",
70 | "field": [
71 | {
72 | "@type": "cr:Field",
73 | "@id": "video_id",
74 | "name": "video_id",
75 | "description": "The video ID.",
76 | "dataType": "sc:Text",
77 | "source": {
78 | "fileObject": {
79 | "@id": "mmworld"
80 | },
81 | "extract": {
82 | "jsonPath": "$[*].video_id"
83 | }
84 | }
85 | },
86 | {
87 | "@type": "cr:Field",
88 | "@id": "video_url",
89 | "name": "video_url",
90 | "description": "The video URL.",
91 | "dataType": "sc:Text",
92 | "source": {
93 | "fileObject": {
94 | "@id": "mmworld"
95 | },
96 | "extract": {
97 | "jsonPath": "$[*].video_url"
98 | }
99 | }
100 | },
101 | {
102 | "@type": "cr:Field",
103 | "@id": "discipline",
104 | "name": "discipline",
105 | "description": "The discipline.",
106 | "dataType": "sc:Text",
107 | "source": {
108 | "fileObject": {
109 | "@id": "mmworld"
110 | },
111 | "extract": {
112 | "jsonPath": "$[*].discipline"
113 | }
114 | }
115 | },
116 | {
117 | "@type": "cr:Field",
118 | "@id": "subdiscipline",
119 | "name": "subdiscipline",
120 | "description": "The subdiscipline.",
121 | "dataType": "sc:Text",
122 | "source": {
123 | "fileObject": {
124 | "@id": "mmworld"
125 | },
126 | "extract": {
127 | "jsonPath": "$[*].subdiscipline"
128 | }
129 | }
130 | },
131 | {
132 | "@type": "cr:Field",
133 | "@id": "captions",
134 | "name": "captions",
135 | "description": "The video captions.",
136 | "dataType": "sc:Text",
137 | "source": {
138 | "fileObject": {
139 | "@id": "mmworld"
140 | },
141 | "extract": {
142 | "jsonPath": "$[*].captions"
143 | }
144 | }
145 | }
146 | ]
147 | },
148 | {
149 | "@type": "cr:RecordSet",
150 | "@id": "questions",
151 | "name": "Questions",
152 | "description": "Questions associated with the videos.",
153 | "field": [
154 | {
155 | "@type": "cr:Field",
156 | "@id": "type",
157 | "name": "type",
158 | "description": "The type of question.",
159 | "dataType": "sc:Text",
160 | "source": {
161 | "fileObject": {
162 | "@id": "mmworld"
163 | },
164 | "extract": {
165 | "jsonPath": "$[*].questions[*].type"
166 | }
167 | }
168 | },
169 | {
170 | "@type": "cr:Field",
171 | "@id": "question",
172 | "name": "question",
173 | "description": "The question.",
174 | "dataType": "sc:Text",
175 | "source": {
176 | "fileObject": {
177 | "@id": "mmworld"
178 | },
179 | "extract": {
180 | "jsonPath": "$[*].questions[*].question"
181 | }
182 | }
183 | },
184 | {
185 | "@type": "cr:Field",
186 | "@id": "options",
187 | "name": "options",
188 | "description": "The options for the question.",
189 | "dataType": "sc:Text",
190 | "source": {
191 | "fileObject": {
192 | "@id": "mmworld"
193 | },
194 | "extract": {
195 | "jsonPath": "$[*].questions[*].options"
196 | }
197 | }
198 | },
199 | {
200 | "@type": "cr:Field",
201 | "@id": "answer",
202 | "name": "answer",
203 | "description": "The correct answer for the question.",
204 | "dataType": "sc:Text",
205 | "source": {
206 | "fileObject": {
207 | "@id": "mmworld"
208 | },
209 | "extract": {
210 | "jsonPath": "$[*].questions[*].answer"
211 | }
212 | }
213 | },
214 | {
215 | "@type": "cr:Field",
216 | "@id": "requires_domain_knowledge",
217 | "name": "requires_domain_knowledge",
218 | "description": "Whether the question requires domain knowledge.",
219 | "dataType": "sc:Text",
220 | "source": {
221 | "fileObject": {
222 | "@id": "mmworld"
223 | },
224 | "extract": {
225 | "jsonPath": "$[*].questions[*].requires_domain_knowledge"
226 | }
227 | }
228 | },
229 | {
230 | "@type": "cr:Field",
231 | "@id": "requires_audio",
232 | "name": "requires_audio",
233 | "description": "Whether the question requires audio.",
234 | "dataType": "sc:Text",
235 | "source": {
236 | "fileObject": {
237 | "@id": "mmworld"
238 | },
239 | "extract": {
240 | "jsonPath": "$[*].questions[*].requires_audio"
241 | }
242 | }
243 | },
244 | {
245 | "@type": "cr:Field",
246 | "@id": "requires_visual",
247 | "name": "requires_visual",
248 | "description": "Whether the question requires visual.",
249 | "dataType": "sc:Text",
250 | "source": {
251 | "fileObject": {
252 | "@id": "mmworld"
253 | },
254 | "extract": {
255 | "jsonPath": "$[*].questions[*].requires_visual"
256 | }
257 | }
258 | },
259 | {
260 | "@type": "cr:Field",
261 | "@id": "question_only",
262 | "name": "question_only",
263 | "description": "Whether the question is a question-only type.",
264 | "dataType": "sc:Text",
265 | "source": {
266 | "fileObject": {
267 | "@id": "mmworld"
268 | },
269 | "extract": {
270 | "jsonPath": "$[*].questions[*].question_only"
271 | }
272 | }
273 | },
274 | {
275 | "@type": "cr:Field",
276 | "@id": "correct_answer_label",
277 | "name": "correct_answer_label",
278 | "description": "The label of the correct answer.",
279 | "dataType": "sc:Text",
280 | "source": {
281 | "fileObject": {
282 | "@id": "mmworld"
283 | },
284 | "extract": {
285 | "jsonPath": "$[*].questions[*].correct_answer_label"
286 | }
287 | }
288 | }
289 | ]
290 | }
291 | ]
292 | }
--------------------------------------------------------------------------------
/evaluation/main_utils.py:
--------------------------------------------------------------------------------
1 | # from moviepy.editor import VideoFileClip  # needed by calculate_video_length
2 | # from pytube import YouTube  # needed by download_video
3 | import sys
4 | import os, base64, requests  # requests is used by the gpt/gpt4v/gpt4o helpers below
5 | from Katna.video import Video  # keyframe extraction used by gpt4v/gpt4o
6 | from Katna.writer import KeyFrameDiskWriter
7 | vd = Video()
8 | def calculate_video_length(video_path):
9 | try:
10 | with VideoFileClip(video_path) as video:
11 | return video.duration
12 | except Exception as e:
13 | print(f"Error calculating video length for {video_path}: {e}")
14 | return 0
15 |
16 |
17 | def show_progress(stream, chunk, bytes_remaining):
18 | total_size = stream.filesize
19 | bytes_downloaded = total_size - bytes_remaining
20 | sys.stdout.write(f"Downloading: {bytes_downloaded / total_size * 100:.2f}%\r")
21 | sys.stdout.flush()
22 |
23 |
24 | def answer_post_processing(model_answer):
25 | parts = model_answer.split('.')
26 | model_answer_processed = parts[0].replace("The answer is", "").strip().lower().strip("\"'")
27 | model_answer_label = model_answer_processed.split(':')[0].strip()
28 | return model_answer_label
29 |
30 |
31 | def format_time(seconds):
32 | hours = int(seconds // 3600)
33 | minutes = int((seconds % 3600) // 60)
34 | seconds = int(seconds % 60)
35 | return f"{hours:02}:{minutes:02}:{seconds:02}"
36 |
37 |
38 | def get_transcript_with_formatted_time(video_url):
39 | video_id = extract_video_id(video_url)
40 | if video_id is None:
41 | return "Invalid YouTube URL or Video ID not found"
42 | return YouTubeTranscriptApi.get_transcript(video_id)  # requires youtube_transcript_api and an extract_video_id helper (not included in this file)
43 |
44 |
45 | def encode_images_to_base64(directory):
46 | images_base64 = []
47 | for image_name in os.listdir(directory):
48 | image_path = os.path.join(directory, image_name)
49 | with open(image_path, "rb") as image_file:
50 | encoded_string = base64.b64encode(image_file.read()).decode()
51 | images_base64.append({"image": encoded_string})
52 | return images_base64
53 |
54 |
55 | def download_video(url, video_id, output_path):
56 | if os.path.exists(output_path):
57 | print(f"Video with {url} already downloaded. Skipping download.")
58 | return output_path
59 | try:
60 | if "shorts" in url:
61 | video_id = video_id.replace("shorts/", "")
62 | print(f"Shorts video detected. ")
63 | yt = YouTube(url, on_progress_callback=show_progress)
64 | video_stream = yt.streams.filter(progressive=True, file_extension='mp4').order_by('resolution').desc().first()
65 | if video_stream:
66 | video_stream.download(output_path=output_path, filename=f"{video_id}.mp4")
67 | print(f"\nDownloaded {yt.title} successfully.")
68 | return True
69 | else:
70 | print("No suitable video stream found for:", url)
71 | return False
72 | except Exception as e:
73 | print(f"Error downloading video {url}: {e}")
74 | return False
75 |
76 |
77 | def compute_question_accuracy(model_answer, correct_answer_label, options):
78 | return model_answer == correct_answer_label.lower().strip()
79 |
80 |
81 | def compute_question_accuracy_with_gpt(answer_evaluator, model_answer, correct_answer_label, question, options):
82 | options_str = "\n".join([f"Option {label.upper()}: {text}" for label, text in options.items()])
83 | # prompt = (f"I will present a response from a question-answering model and several answer options. "
84 | # f"Your task is to evaluate the response and determine which of the following options it most closely aligns with.\n\n"
85 | # f"Response: '{model_answer}'\n\n"
86 | # f"Options:\n{options_str}\n\n"
87 | # "Indicate the most similar option by responding with the corresponding letter only (a, b, c, or d).")
88 | prompt="For question: "+question+"\n"\
89 | + options_str \
90 | + "(please select one)\nGround truth answer: "\
91 | + correct_answer_label\
92 | + "\nModel predicted answer: "+model_answer\
93 | + "\nBased on the question and the ground truth answer, is the model's predicted answer correct? If multi-choice provided, think about which choice is selected by the model, is it correct? (please answer yes/no)\n"
94 | try:
95 | response = answer_evaluator.chat.completions.create(
96 | model="xx",
97 | messages=[
98 | {"role": "system", "content": "You are a helpful assistant that provides concise answers."},
99 | {"role": "user", "content": prompt}
100 | ]
101 | )
102 |
103 |
104 | except Exception as e:
105 | print(f"Error evaluating response: {e}")
106 | # ai_response = model_answer
107 | return 'Error', False
108 | ai_response = response.choices[0].message.content.strip().lower()
109 | if "yes" in ai_response :
110 | gpt_judge_correct = True
111 | else:
112 | gpt_judge_correct = False
113 | return ai_response, gpt_judge_correct
114 |
115 |
116 |
117 | def gpt(question, options, prompt):
118 | constructed_url = 'xx'  # replace 'xx' with your Azure OpenAI chat completions URL
119 | headers = {
120 | 'Content-Type': 'application/json',
121 | "api-key": "xx",  # replace 'xx' with your API key
122 | }
123 |
124 | def run_api(body):
125 | request = requests.post(constructed_url, headers=headers, json=body)
126 | response = request.json()
127 | return response
128 |
129 | body = [{
130 | 'role' : 'system',
131 | 'content' : ['You are an expert in assisting human. Follows the user prompt in a completion mode. Generate precise and clear response. End your response with {END}.'],
132 | },
133 | {
134 | 'role' : 'user',
135 | 'content' : [prompt],
136 | },
137 | ]
138 |
139 | inputs = {}
140 | inputs['messages'] = body # for "chat"
141 | inputs['max_tokens'] = 1024
142 | inputs['stop'] = "{END}"
143 | results = run_api(inputs)
144 | return results
145 |
146 | def gpt4v(question, options, prompt, video, video_id):
147 | no_of_frames_to_returned = 8
148 |
149 | videoframefolder = f"./video_benchmark/clipped_video/{video_id}"
150 | if not os.path.exists(videoframefolder):
151 | os.makedirs(videoframefolder)
152 | diskwriter = KeyFrameDiskWriter(location=videoframefolder)
153 | video_file_path = video
154 |
155 | print(f"Input video file path = {video_file_path}")
156 |
157 |
158 | try:
159 | vd.extract_video_keyframes(
160 | no_of_frames=no_of_frames_to_returned, file_path=video_file_path,
161 | writer=diskwriter
162 | )
163 | except Exception as e:
164 | print(f"Error in extracting video keyframes: {e}")
165 | images_base64 = []
166 |
167 |
168 | if len(os.listdir(videoframefolder)) == 0:
169 | images_base64 = []
170 | else:
171 | images_base64 = encode_images_to_base64(videoframefolder)
172 | constructed_url = 'xx'  # replace 'xx' with your GPT-4V chat completions URL
173 | headers = {
174 | 'Content-Type': 'application/json',
175 | 'api-key': 'xx'  # replace 'xx' with your API key
176 | }
177 |
178 | def run_api(body):
179 | request = requests.post(constructed_url, headers=headers, json=body)
180 | response = request.json()
181 | return response
182 |
183 | prompt = f"Based on the following video frames extracted from the video, answer the question {question} by selecting one from the giving answers {options}."
184 |
185 | body = [{
186 | 'role' : 'system',
187 | 'content' : ['You are an expert in assisting human. Follows the user prompt in a completion mode. Generate precise and clear response. End your response with {END}.'],
188 | },
189 | {
190 | 'role' : 'user',
191 | 'content' : [prompt, *images_base64],
192 | },
193 | ]
194 |
195 | inputs = {}
196 | inputs['messages'] = body
197 | inputs['max_tokens'] = 1024
198 | inputs['stop'] = "{END}"
199 | results = run_api(inputs)
200 | return results
201 |
202 | def gpt4o(question, options, prompt, video, video_id):
203 | no_of_frames_to_returned = 8
204 |
205 | videoframefolder = f"./video_benchmark/clipped_video/{video_id}"
206 | if not os.path.exists(videoframefolder):
207 | os.makedirs(videoframefolder)
208 | diskwriter = KeyFrameDiskWriter(location=videoframefolder)
209 | video_file_path = video
210 |
211 | print(f"Input video file path = {video_file_path}")
212 |
213 |
214 | try:
215 | vd.extract_video_keyframes(
216 | no_of_frames=no_of_frames_to_returned, file_path=video_file_path,
217 | writer=diskwriter
218 | )
219 | except Exception as e:
220 | print(f"Error in extracting video keyframes: {e}")
221 | images_base64 = []
222 |
223 |
224 | if len(os.listdir(videoframefolder)) == 0:
225 | images_base64 = []
226 | else:
227 | images_base64 = encode_images_to_base64(videoframefolder)
228 | api_base = "xx"
229 | deployment_name = "xx"
230 | api_version = "2024-03-01-preview"
231 | constructed_url = f"{api_base}/openai/deployments/{deployment_name}/chat/completions?api-version={api_version}"
232 | headers = {
233 | 'Content-Type': 'application/json',
234 | 'api-key': 'xx'
235 | }
236 |
237 | def run_api(body):
238 | request = requests.post(constructed_url, headers=headers, json=body)
239 | response = request.json()
240 | return response
241 |
242 | prompt = f"Based on the following video frames extracted from the video, answer the question {question} by selecting one from the giving answers {options}."
243 |
244 |
245 | body = [
246 | {
247 | 'role': 'system',
248 | 'content': 'You are an expert in assisting humans. Follow the user prompt in a completion mode. Generate precise and clear response. End your response with {END}.'
249 | },
250 | {
251 | 'role': 'user',
252 | 'content': prompt
253 | },
254 | {
255 | 'role': 'user',
256 | 'content': images_base64
257 | }
258 | ]
259 |
260 | inputs = {}
261 | inputs['messages'] = body # for "chat"
262 | inputs['max_tokens'] = 2000
263 | inputs['stop'] = "{END}"
264 | results = run_api(inputs)
265 | return results
--------------------------------------------------------------------------------
/evaluation/eval.py:
--------------------------------------------------------------------------------
1 | import json
2 | import sys
3 | import os
4 | import PIL.Image
5 | import glob
6 |
7 | import requests
8 | import time
9 | import argparse
10 |
11 |
12 | from main_utils import *
13 | from openai import AzureOpenAI
14 |
15 | import copy
16 |
17 | from Katna.video import Video
18 | from Katna.writer import KeyFrameDiskWriter
19 | vd = Video()
20 |
21 |
22 |
23 |
24 |
25 |
26 | def answer_generator(answer_evaluator, video_file, question, options, correct_answer, correct_answer_label, question_type, annotations, video_id, detailed_results, subject_data, modelname, tokenizer=None, processor=None, image_processor=None):
27 | prompt = f"Answer the question {question} by selecting one from the giving answers {options}. Respond with only single letter such as a, b, c ,d."
28 | # prompt = f"Answer the question {question} by selecting one from the giving answers {options}. Also give reasons of why you select this answer after your selelcted amswer."
29 | subject_data["total_questions"] += 1
30 |
31 |
32 |
33 | if modelname == 'gpt' or modelname == 'gpt4o':
34 | max_retries= 3000000
35 | retry_delay = 0.0001
36 | retry_count = 0
37 |
38 | while retry_count < max_retries:
39 | if modelname == 'gpt':
40 | model_response = gpt4v(question, options, prompt, video_file, video_id)
41 | elif modelname == 'gpt4o':
42 | model_response = gpt4o(question, options, prompt, video_file, video_id)
43 | if 'choices' in model_response:
44 | model_answer = model_response['choices'][0]['message']['content']
45 | print('The model answer is:', model_answer)
46 | break
47 | elif model_response.get('error', {}).get('code') == '429':
48 | print(f"Rate limit exceeded. Error message is {model_response}, Retrying in {retry_delay} seconds..., retry count: {retry_count}")
49 | time.sleep(retry_delay)
50 | elif model_response.get('error', {}).get('code') == 'content_filter':
51 | print(f"Content filter triggered. Error message is {model_response}, Retrying in {retry_delay} seconds..., retry count: {retry_count}")
52 | model_answer = 'content_filter'
53 | time.sleep(retry_delay)
54 | break
55 | elif 'error' in model_response:
56 | print(f"Error message is {model_response['error']['message']}, Retrying in {retry_delay} seconds..., retry count: {retry_count}")
57 | model_answer = model_response['error']['message']
58 | time.sleep(retry_delay)
59 | break
60 |
61 | retry_count += 1
62 | print('Model selected:', modelname, 'question:', question, 'options:', options, 'answer:', model_answer)
63 | elif modelname == 'gemini':
64 | safety_settings = [
65 | {
66 | "category": "HARM_CATEGORY_DANGEROUS",
67 | "threshold": "BLOCK_NONE",
68 | },
69 | {
70 | "category": "HARM_CATEGORY_HARASSMENT",
71 | "threshold": "BLOCK_NONE",
72 | },
73 | {
74 | "category": "HARM_CATEGORY_HATE_SPEECH",
75 | "threshold": "BLOCK_NONE",
76 | },
77 | {
78 | "category": "HARM_CATEGORY_SEXUALLY_EXPLICIT",
79 | "threshold": "BLOCK_NONE",
80 | },
81 | {
82 | "category": "HARM_CATEGORY_DANGEROUS_CONTENT",
83 | "threshold": "BLOCK_NONE",
84 | },
85 | ]
86 | no_of_frames_to_returned = 10
87 |
88 | videoframefolder = f"./clipped_video/{video_id}"
89 | images = []
90 | if os.path.exists(videoframefolder) and len(os.listdir(videoframefolder)) > 0:
91 | for image_file_name in os.listdir(videoframefolder):
92 | image_path = os.path.join(videoframefolder, image_file_name)
93 | try:
94 | img = PIL.Image.open(image_path)
95 | images.append(img)
96 | except Exception as e:
97 | print(f"Error loading image {image_file_name}: {e}")
98 | else:
99 | if not os.path.exists(videoframefolder):
100 | os.makedirs(videoframefolder)
101 | diskwriter = KeyFrameDiskWriter(location=videoframefolder)
102 | video_file_path = video_file  # the clip passed to answer_generator, not the dataset json path
103 |
104 | print(f"Input video file path = {video_file_path}")
105 |
106 |
107 | try:
108 | vd.extract_video_keyframes(
109 | no_of_frames=no_of_frames_to_returned, file_path=video_file_path,
110 | writer=diskwriter
111 | )
112 | for image_file_name in os.listdir(videoframefolder):
113 |     img = PIL.Image.open(os.path.join(videoframefolder, image_file_name))
114 |     images.append(img)
115 | except Exception as e:
116 | print(f"Error in extracting video keyframes: {e}")
117 |
118 | if images:
119 | for attempt in range(5):
120 | try:
121 | model_answer = models.generate_content([prompt] + images, safety_settings=safety_settings).text
122 | break
123 | except Exception as e:
124 | print(f"Attempt {attempt+1} failed: {e}")
125 | if attempt == 4:
126 | model_answer = 'error'
127 | else:
128 | print("No images found in the directory.")
129 | model_answer = 'No images to process.'
130 | print('Model selected: gemini, question:', question, 'options:', options, 'answer:', model_answer)
131 | elif modelname == 'claude':
132 | import anthropic
133 | from io import BytesIO
134 |
135 | client = anthropic.Anthropic(
136 | api_key="xx",
137 | )
138 |
139 | no_of_frames_to_returned = 10
140 |
141 | videoframefolder = f"./video_benchmark/clipped_video/{video_id}"
142 | images = []
143 | if os.path.exists(videoframefolder) and len(os.listdir(videoframefolder)) > 0:
144 | for image_file_name in os.listdir(videoframefolder):
145 | image_path = os.path.join(videoframefolder, image_file_name)
146 | try:
147 | img = PIL.Image.open(image_path)
148 | buffered = BytesIO()
149 | img.save(buffered, format="JPEG")
150 | img_base64 = base64.b64encode(buffered.getvalue()).decode("utf-8")
151 | images.append(img_base64)
152 | except Exception as e:
153 | print(f"Error loading image {image_file_name}: {e}")
154 |
155 |
156 | message_content = []
157 |
158 | for img_base64 in images:
159 | message_content.append(
160 | {
161 | "type": "image",
162 | "source": {
163 | "type": "base64",
164 | "media_type": "image/jpeg",
165 | "data": img_base64,
166 | },
167 | }
168 | )
169 |
170 |
171 | message_content.append(
172 | {
173 | "type": "text",
174 | "text": prompt
175 | }
176 | )
177 |
178 |
179 | messages_payload = [{"role": "user", "content": message_content}]
180 | message = client.messages.create(
181 | model="claude-3-5-sonnet-20240620",
182 | max_tokens=1024,
183 | messages=messages_payload,
184 | )
185 |
186 | model_answer = message.content[0].text
187 | model_answer = answer_post_processing(model_answer)
188 | print('Model selected:', modelname, 'question:', question, 'options:', options, 'answer:', model_answer)
189 | elif modelname == 'videochat':
190 | model_answer = videochat_answer(models, video_file, question, options)
191 | model_answer = answer_post_processing(model_answer)
192 | print('Model selected:', modelname, 'question:', question, 'options:', options, 'answer:', model_answer)
193 | elif modelname == 'videollama':
194 | model_answer = videollama_answer(models, video_file, question, options)
195 | model_answer = answer_post_processing(model_answer)
196 | print('Model selected:', modelname, 'question:', question, 'options:', options, 'answer:', model_answer)
197 | elif modelname == 'chatunivi':
198 | model_answer = chatunivi_answer(models, video_file, question, options, prompt, tokenizer)
199 | model_answer = answer_post_processing(model_answer)
200 | print('Model selected:', modelname, 'question:', question, 'options:', options, 'answer:', model_answer)
201 | elif modelname == 'mplugowl':
202 | model_answer = mplugowl_answer(models, video_id, video_file, question, options, prompt, tokenizer, image_processor)
203 | model_answer = answer_post_processing(model_answer)
204 | print('Model selected:', modelname, 'question:', question, 'options:', options, 'answer:', model_answer)
205 | elif modelname == 'otter':
206 | model_answer = otter_answer(models, video_file, question, options, prompt, image_processor)
207 | model_answer = answer_post_processing(model_answer)
208 | print('Model selected:', modelname, 'question:', question, 'options:', options, 'answer:', model_answer)
209 | elif 'xinstruct' in modelname:
210 | model_answer = xinstruct_answer(models, video_file, question, options, image_processor)
211 | model_answer = answer_post_processing(model_answer)
212 | print('Model selected: xinstruct, question:', question, 'options:', options, 'answer:', model_answer)
213 | elif modelname == 'pandagpt':
214 | model_answer = pandagpt_answer(models, video_file, question, options, prompt)
215 | model_answer = answer_post_processing(model_answer)
216 | print('Model selected: PandaGPT, question:', question, 'options:', options, 'answer:', model_answer)
217 | elif modelname == 'imagebind_llm':
218 | model_answer = imagebind_llm_answer(models, video_file, question, options, prompt)
219 | model_answer = answer_post_processing(model_answer)
220 | print('Model selected:', modelname, 'question:', question, 'options:', options, 'answer:', model_answer)
221 | elif modelname == 'lwm':
222 | model_answer = lwm_answer(models, video_file, question, options, prompt)
223 | model_answer = answer_post_processing(model_answer)
224 | print('Model selected:', modelname, 'question:', question, 'options:', options, 'answer:', model_answer)
225 | elif modelname == 'videollava':
226 | model_answer = videollava_answer(models, video_file, question, options, prompt, tokenizer, processor, video_processor)
227 | model_answer = answer_post_processing(model_answer)
228 | print('Model selected:', modelname, 'question:', question, 'options:', options, 'answer:', model_answer)
229 | else:
230 | print("Invalid model name. Exiting.")
231 | sys.exit(1)
232 |
233 | gpt_processed_answer, is_correct = compute_question_accuracy_with_gpt(answer_evaluator, model_answer, correct_answer_label, question, options)
234 | if is_correct:
235 | subject_data["correct_answers"] += 1
236 |
237 | for annotation, value in annotations.items():
238 | subject_data["accuracy_per_annotation"].setdefault(annotation, {"total": 0, "correct": 0})
239 | if value:
240 | subject_data["accuracy_per_annotation"][annotation]["total"] += 1
241 | if value and is_correct:
242 | subject_data["accuracy_per_annotation"][annotation]["correct"] += 1
243 |
244 | test = copy.deepcopy(subject_data["accuracy_per_annotation"])
245 | subject_data["accuracy_per_question_type"].setdefault(question_type, {"total": 0, "correct": 0})
246 | subject_data["accuracy_per_question_type"][question_type]["total"] += 1
247 | if is_correct:
248 | subject_data["accuracy_per_question_type"][question_type]["correct"] += 1
249 |
250 | detailed_results.append({
251 | "subject": subject,
252 | "video_id": video_id,
253 | "question": question,
254 | "correct_answer": correct_answer,
255 | "correct_answer_label": correct_answer_label,
256 | "model_answer": model_answer,
257 | 'gpt_processed_answer': gpt_processed_answer,
258 | "options": options,
259 | "is_correct": is_correct,
260 | "annotations": annotations,
261 | "question_type": question_type,
262 | "subject_data": test
263 | })
264 |
265 | with open(detailed_results_paths[run_idx], 'w') as f:
266 | json.dump(detailed_results, f, indent=4)
267 | print(f"Saved detailed results to {detailed_results_paths[run_idx]}")
268 |
269 |
270 |
271 |
272 | if __name__ == "__main__":
273 |
274 | parser = argparse.ArgumentParser(description="Initialize and run model")
275 | parser.add_argument("modelname", type=str, help="Name of the model to initialize and run")
276 | parser.add_argument("--textonly", action="store_true", help="Flag to indicate if the model should run in text-only mode")
277 |
278 | args = parser.parse_args()
279 |
280 |
281 | modelname = args.modelname
282 | textonly = args.textonly
283 |
284 |
285 |
286 | if modelname == "imagebind_llm":
287 | sys.path.append(os.path.abspath("./LLaMA-Adapter/imagebind_LLM"))
288 | from eval_imagebind_llm import imagebind_llm_init, imagebind_llm_answer
289 | elif modelname == "lwm":
290 | sys.path.append('./video_benchmark/LWM')
291 | from eval_LWM import lwm_init, lwm_answer
292 | elif modelname == "mplugowl":
293 | sys.path.append('./video_benchmark/mPLUG-Owl2')
294 | from eval_mplug_owl import mplugowl_init, mplugowl_answer
295 | elif modelname == "otter":
296 | sys.path.append('./video_benchmark/otter')
297 | sys.path.append('./video_benchmark/otter/src')
298 | from eval_otter import otter_init, otter_answer
299 | elif modelname == "videochat":
300 | sys.path.append('./video_benchmark/video_chat2')
301 | from eval_video_chat import videochat2_init, videochat_answer
302 | elif modelname == "videollama":
303 | sys.path.append('./video_benchmark/Video_llama')
304 | from eval_video_llama import videollama_init, videollama_answer
305 | elif modelname == "videollava":
306 | sys.path.append('./video_benchmark/Video-LLaVA')
307 | from eval_video_llava import videollava_init, videollava_answer
308 | elif modelname == "xinstruct":
309 | sys.path.append('./video_benchmark/LAVIS-XInstructBLIP')
310 | from eval_xinstruct import xinstruct_init, xinstruct_answer
311 | elif modelname == "pandagpt":
312 | sys.path.append(os.path.abspath("./PandaGPT"))
313 | sys.path.append(os.path.abspath('./PandaGPT/code'))
314 | from eval_pandagpt import pandagpt_init, pandagpt_answer
315 | elif modelname == "llamaadapter":
316 | sys.path.append(os.path.abspath("./LLaMA-Adapter/imagebind_LLM"))
317 | from eval_imagebind_llm import imagebind_llm_init, imagebind_llm_answer
318 |
319 |
320 | if modelname == 'gemini':
321 | import google.generativeai as genai
322 | models = genai.GenerativeModel('gemini-pro-vision')
323 | GOOGLE_API_KEY="xx"
324 | genai.configure(api_key=GOOGLE_API_KEY)
325 |
326 | elif modelname == 'videochat':
327 | models = videochat2_init()
328 |
329 | elif modelname == 'videollama':
330 | models = videollama_init()
331 |
332 | elif modelname == 'chatunivi':
333 | models, tokenizer = chatunivi_init()  # note: add the matching sys.path/import for Chat-UniVi in the import block above
334 |
335 | elif modelname == 'otter':
336 | models, image_processor = otter_init()
337 |
338 | elif modelname == 'mplugowl':
339 | models, tokenizer, image_processor = mplugowl_init()
340 |
341 | elif modelname == 'xinstruct-7b':
342 | models, image_processor = xinstruct_init("vicuna7b_v2")
343 |
344 | elif modelname == 'xinstruct-13b':
345 | models, image_processor = xinstruct_init("vicuna13b")
346 |
347 | elif modelname == 'pandagpt':
348 | models = pandagpt_init()
349 |
350 | elif modelname == 'imagebind_llm':
351 | models = imagebind_llm_init()
352 |
353 | elif modelname == 'lwm':
354 | models = lwm_init()
355 |
356 | elif modelname == 'videollava':
357 | models, tokenizer, processor, video_processor = videollava_init()
358 |
359 |
360 |
361 |
362 |
363 |
364 |
365 | num_runs = 3
366 |
367 | detailed_results_dir = 'detailed_results'
368 | final_results_dir = 'final_results'
369 |
370 | detailed_results_paths = [os.path.join(detailed_results_dir, f'{modelname}_detailed_results_{i}.json') for i in range(num_runs)]
371 | final_results_paths = [os.path.join(final_results_dir, f'{modelname}_final_results_run_{i}.json') for i in range(num_runs)]
372 |
373 | if not os.path.exists(detailed_results_dir):
374 | os.makedirs(detailed_results_dir)
375 | if not os.path.exists(final_results_dir):
376 | os.makedirs(final_results_dir)
377 |
378 | print(f"Using model: {modelname}, textonly: {textonly}")
379 |
380 |
381 | videofile = "./video_benchmark/dataset/mmworld.json"
382 | with open(videofile, 'r') as file:
383 | dataset = json.load(file)
384 |
385 |
386 |
387 | answer_evaluator = AzureOpenAI(  # GPT-4 judge used to score model answers (see README)
388 | azure_endpoint="xx",  # replace "xx" with your Azure OpenAI endpoint
389 | api_key="xx",  # replace "xx" with your API key
390 | api_version="2023-12-01-preview"
391 | )
392 |
393 |
394 | for run_idx in range(num_runs):
395 | total_questions = 0
396 | correct_answers = 0
397 | detailed_results = []
398 | accuracy_per_annotation = {}
399 | accuracy_per_question_type = {}
400 | results_by_subject = {}
401 |
402 |
403 | failed_downloads = []
404 | missed_video = set()
405 | wrong_video = set()
406 | success_downloads = []
407 |
408 |
409 | for video_data in dataset:
410 | subject = video_data["discipline"]
411 | if subject not in results_by_subject:
412 | results_by_subject[subject] = {
413 | "total_questions": 0,
414 | "correct_answers": 0,
415 | "accuracy_per_annotation": {},
416 | "accuracy_per_question_type": {},
417 | "detailed_results": []
418 | }
419 |
420 | video_id = video_data["video_id"]
421 |
422 |
423 | for question_data in video_data["questions"]:
424 | question = question_data["question"]
425 | options = question_data["options"]
426 |
427 | correct_answer = question_data["answer"]
428 | correct_answer_label = question_data["correct_answer_label"]
429 | question_type = question_data["type"]
430 | requires_video = question_data["requires_visual"]
431 | annotations = {
432 | "requires_audio": question_data["requires_audio"],
433 | "requires_domain_knowledge": question_data["requires_domain_knowledge"],
434 | "requires_video": requires_video,
435 | "question_only": question_data["question_only"]
436 | }
437 |
438 | video_files = glob.glob(f"./all_data/{video_data['video_id']}/*.mp4")
439 |
440 | if len(video_files) == 0:
441 | missed_video.add(video_data["video_id"])
442 |
443 | for video_file in video_files:
444 | try:
445 | answer_generator(answer_evaluator, video_file, question, options, correct_answer, correct_answer_label, question_type, annotations, video_data["video_id"],
446 | results_by_subject[subject]["detailed_results"], results_by_subject[subject], modelname, locals().get('tokenizer', None), locals().get('processor', None), locals().get('image_processor', None))
447 | except Exception as e:
448 | print(f"Error encountered: {e}")
449 | wrong_video.add(video_data["video_id"])
450 |
451 |
452 | with open('failed_downloads.json', 'w') as f:
453 | json.dump(list(missed_video), f, indent=4)
454 | with open('problemed_video.json', 'w') as f:
455 | json.dump(list(wrong_video), f, indent=4)
456 |
457 | for subject, data in results_by_subject.items():
458 |
459 | total_questions += data["total_questions"]
460 | correct_answers += data["correct_answers"]
461 | for annotation, value in data["accuracy_per_annotation"].items():
462 | accuracy_per_annotation.setdefault(annotation, {"total": 0, "correct": 0})
463 | accuracy_per_annotation[annotation]["total"] += value["total"]
464 | accuracy_per_annotation[annotation]["correct"] += value["correct"]
465 | for question_type, value in data["accuracy_per_question_type"].items():
466 | accuracy_per_question_type.setdefault(question_type, {"total": 0, "correct": 0})
467 | accuracy_per_question_type[question_type]["total"] += value["total"]
468 | accuracy_per_question_type[question_type]["correct"] += value["correct"]
469 |
470 |
471 | overall_accuracy = correct_answers / total_questions * 100
472 | results = {
473 | "overall_accuracy": overall_accuracy,
474 | "total_questions": total_questions,
475 | "correct_answers": correct_answers,
476 | "accuracy_per_annotation": accuracy_per_annotation,
477 | "accuracy_per_question_type": accuracy_per_question_type,
478 | "results_by_subject": results_by_subject
479 | }
480 |
481 | with open(final_results_paths[run_idx], 'w') as file:
482 | json.dump(results, file, indent=4)
483 | print(f'Final results saved in {final_results_paths[run_idx]}')
484 |
485 |
--------------------------------------------------------------------------------