├── LICENSE
├── MultiMAE-DER_Fine-Tuning Code
│   └── MultiMAE_DER_FSLF.ipynb
├── MultiMAE-DER_Preprocessing Code
│   ├── MFCC.jpg
│   ├── Preprocessing_Audio.py
│   ├── Preprocessing_CFAS.py
│   ├── Preprocessing_FFLS.py
│   ├── Preprocessing_FSLF.py
│   ├── Preprocessing_OFOS.py
│   ├── Preprocessing_RFAS.py
│   ├── Preprocessing_SFAS.py
│   ├── Tool.py
│   ├── audio_img.jpg
│   ├── img.jpg
│   └── video_img.jpg
├── README.md
└── images
    ├── MultiMAE-DER.png
    ├── MultiMAE-DER_Program_Flowchart.png
    ├── Multimodal_Sequence_Fusion_Strategy.png
    ├── Result_on_CREMA-D.png
    ├── Result_on_IEMOCAP.png
    └── Result_on_RAVDESS.png
/LICENSE:
--------------------------------------------------------------------------------
1 | Apache License
2 | Version 2.0, January 2004
3 | http://www.apache.org/licenses/
4 |
5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
6 |
7 | 1. Definitions.
8 |
9 | "License" shall mean the terms and conditions for use, reproduction,
10 | and distribution as defined by Sections 1 through 9 of this document.
11 |
12 | "Licensor" shall mean the copyright owner or entity authorized by
13 | the copyright owner that is granting the License.
14 |
15 | "Legal Entity" shall mean the union of the acting entity and all
16 | other entities that control, are controlled by, or are under common
17 | control with that entity. For the purposes of this definition,
18 | "control" means (i) the power, direct or indirect, to cause the
19 | direction or management of such entity, whether by contract or
20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the
21 | outstanding shares, or (iii) beneficial ownership of such entity.
22 |
23 | "You" (or "Your") shall mean an individual or Legal Entity
24 | exercising permissions granted by this License.
25 |
26 | "Source" form shall mean the preferred form for making modifications,
27 | including but not limited to software source code, documentation
28 | source, and configuration files.
29 |
30 | "Object" form shall mean any form resulting from mechanical
31 | transformation or translation of a Source form, including but
32 | not limited to compiled object code, generated documentation,
33 | and conversions to other media types.
34 |
35 | "Work" shall mean the work of authorship, whether in Source or
36 | Object form, made available under the License, as indicated by a
37 | copyright notice that is included in or attached to the work
38 | (an example is provided in the Appendix below).
39 |
40 | "Derivative Works" shall mean any work, whether in Source or Object
41 | form, that is based on (or derived from) the Work and for which the
42 | editorial revisions, annotations, elaborations, or other modifications
43 | represent, as a whole, an original work of authorship. For the purposes
44 | of this License, Derivative Works shall not include works that remain
45 | separable from, or merely link (or bind by name) to the interfaces of,
46 | the Work and Derivative Works thereof.
47 |
48 | "Contribution" shall mean any work of authorship, including
49 | the original version of the Work and any modifications or additions
50 | to that Work or Derivative Works thereof, that is intentionally
51 | submitted to Licensor for inclusion in the Work by the copyright owner
52 | or by an individual or Legal Entity authorized to submit on behalf of
53 | the copyright owner. For the purposes of this definition, "submitted"
54 | means any form of electronic, verbal, or written communication sent
55 | to the Licensor or its representatives, including but not limited to
56 | communication on electronic mailing lists, source code control systems,
57 | and issue tracking systems that are managed by, or on behalf of, the
58 | Licensor for the purpose of discussing and improving the Work, but
59 | excluding communication that is conspicuously marked or otherwise
60 | designated in writing by the copyright owner as "Not a Contribution."
61 |
62 | "Contributor" shall mean Licensor and any individual or Legal Entity
63 | on behalf of whom a Contribution has been received by Licensor and
64 | subsequently incorporated within the Work.
65 |
66 | 2. Grant of Copyright License. Subject to the terms and conditions of
67 | this License, each Contributor hereby grants to You a perpetual,
68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable
69 | copyright license to reproduce, prepare Derivative Works of,
70 | publicly display, publicly perform, sublicense, and distribute the
71 | Work and such Derivative Works in Source or Object form.
72 |
73 | 3. Grant of Patent License. Subject to the terms and conditions of
74 | this License, each Contributor hereby grants to You a perpetual,
75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable
76 | (except as stated in this section) patent license to make, have made,
77 | use, offer to sell, sell, import, and otherwise transfer the Work,
78 | where such license applies only to those patent claims licensable
79 | by such Contributor that are necessarily infringed by their
80 | Contribution(s) alone or by combination of their Contribution(s)
81 | with the Work to which such Contribution(s) was submitted. If You
82 | institute patent litigation against any entity (including a
83 | cross-claim or counterclaim in a lawsuit) alleging that the Work
84 | or a Contribution incorporated within the Work constitutes direct
85 | or contributory patent infringement, then any patent licenses
86 | granted to You under this License for that Work shall terminate
87 | as of the date such litigation is filed.
88 |
89 | 4. Redistribution. You may reproduce and distribute copies of the
90 | Work or Derivative Works thereof in any medium, with or without
91 | modifications, and in Source or Object form, provided that You
92 | meet the following conditions:
93 |
94 | (a) You must give any other recipients of the Work or
95 | Derivative Works a copy of this License; and
96 |
97 | (b) You must cause any modified files to carry prominent notices
98 | stating that You changed the files; and
99 |
100 | (c) You must retain, in the Source form of any Derivative Works
101 | that You distribute, all copyright, patent, trademark, and
102 | attribution notices from the Source form of the Work,
103 | excluding those notices that do not pertain to any part of
104 | the Derivative Works; and
105 |
106 | (d) If the Work includes a "NOTICE" text file as part of its
107 | distribution, then any Derivative Works that You distribute must
108 | include a readable copy of the attribution notices contained
109 | within such NOTICE file, excluding those notices that do not
110 | pertain to any part of the Derivative Works, in at least one
111 | of the following places: within a NOTICE text file distributed
112 | as part of the Derivative Works; within the Source form or
113 | documentation, if provided along with the Derivative Works; or,
114 | within a display generated by the Derivative Works, if and
115 | wherever such third-party notices normally appear. The contents
116 | of the NOTICE file are for informational purposes only and
117 | do not modify the License. You may add Your own attribution
118 | notices within Derivative Works that You distribute, alongside
119 | or as an addendum to the NOTICE text from the Work, provided
120 | that such additional attribution notices cannot be construed
121 | as modifying the License.
122 |
123 | You may add Your own copyright statement to Your modifications and
124 | may provide additional or different license terms and conditions
125 | for use, reproduction, or distribution of Your modifications, or
126 | for any such Derivative Works as a whole, provided Your use,
127 | reproduction, and distribution of the Work otherwise complies with
128 | the conditions stated in this License.
129 |
130 | 5. Submission of Contributions. Unless You explicitly state otherwise,
131 | any Contribution intentionally submitted for inclusion in the Work
132 | by You to the Licensor shall be under the terms and conditions of
133 | this License, without any additional terms or conditions.
134 | Notwithstanding the above, nothing herein shall supersede or modify
135 | the terms of any separate license agreement you may have executed
136 | with Licensor regarding such Contributions.
137 |
138 | 6. Trademarks. This License does not grant permission to use the trade
139 | names, trademarks, service marks, or product names of the Licensor,
140 | except as required for reasonable and customary use in describing the
141 | origin of the Work and reproducing the content of the NOTICE file.
142 |
143 | 7. Disclaimer of Warranty. Unless required by applicable law or
144 | agreed to in writing, Licensor provides the Work (and each
145 | Contributor provides its Contributions) on an "AS IS" BASIS,
146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
147 | implied, including, without limitation, any warranties or conditions
148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
149 | PARTICULAR PURPOSE. You are solely responsible for determining the
150 | appropriateness of using or redistributing the Work and assume any
151 | risks associated with Your exercise of permissions under this License.
152 |
153 | 8. Limitation of Liability. In no event and under no legal theory,
154 | whether in tort (including negligence), contract, or otherwise,
155 | unless required by applicable law (such as deliberate and grossly
156 | negligent acts) or agreed to in writing, shall any Contributor be
157 | liable to You for damages, including any direct, indirect, special,
158 | incidental, or consequential damages of any character arising as a
159 | result of this License or out of the use or inability to use the
160 | Work (including but not limited to damages for loss of goodwill,
161 | work stoppage, computer failure or malfunction, or any and all
162 | other commercial damages or losses), even if such Contributor
163 | has been advised of the possibility of such damages.
164 |
165 | 9. Accepting Warranty or Additional Liability. While redistributing
166 | the Work or Derivative Works thereof, You may choose to offer,
167 | and charge a fee for, acceptance of support, warranty, indemnity,
168 | or other liability obligations and/or rights consistent with this
169 | License. However, in accepting such obligations, You may act only
170 | on Your own behalf and on Your sole responsibility, not on behalf
171 | of any other Contributor, and only if You agree to indemnify,
172 | defend, and hold each Contributor harmless for any liability
173 | incurred by, or claims asserted against, such Contributor by reason
174 | of your accepting any such warranty or additional liability.
175 |
176 | END OF TERMS AND CONDITIONS
177 |
178 | APPENDIX: How to apply the Apache License to your work.
179 |
180 | To apply the Apache License to your work, attach the following
181 | boilerplate notice, with the fields enclosed by brackets "[]"
182 | replaced with your own identifying information. (Don't include
183 | the brackets!) The text should be enclosed in the appropriate
184 | comment syntax for the file format. We also recommend that a
185 | file or class name and description of purpose be included on the
186 | same "printed page" as the copyright notice for easier
187 | identification within third-party archives.
188 |
189 | Copyright [yyyy] [name of copyright owner]
190 |
191 | Licensed under the Apache License, Version 2.0 (the "License");
192 | you may not use this file except in compliance with the License.
193 | You may obtain a copy of the License at
194 |
195 | http://www.apache.org/licenses/LICENSE-2.0
196 |
197 | Unless required by applicable law or agreed to in writing, software
198 | distributed under the License is distributed on an "AS IS" BASIS,
199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
200 | See the License for the specific language governing permissions and
201 | limitations under the License.
202 |
--------------------------------------------------------------------------------
/MultiMAE-DER_Preprocessing Code/MFCC.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Peihao-Xiang/MultiMAE-DER/88d3f671f4e5d1e26d4bd04848179320ec674ec2/MultiMAE-DER_Preprocessing Code/MFCC.jpg
--------------------------------------------------------------------------------
/MultiMAE-DER_Preprocessing Code/Preprocessing_Audio.py:
--------------------------------------------------------------------------------
1 | import os, warnings
2 | import cv2
3 | import shutil
4 | import numpy as np
5 | import pandas as pd
6 | import tensorflow as tf
7 | import matplotlib.pyplot as plt
8 | from decord import VideoReader
9 | from moviepy.editor import AudioFileClip
10 |
11 | from scipy.io import wavfile # scipy library to read wav files
12 | import numpy as np
13 | from scipy.fftpack import dct
14 | from matplotlib import pyplot as plt
15 | from PIL import Image
16 |
17 | input_size = 224
18 | mid_point = 17
19 |
20 | def normalize_audio(audio):
21 | audio = audio / np.max(np.abs(audio))
22 | return audio
23 |
24 | def MFCC(signal,sample_rate):
25 | pre_emphasis = 0.97
26 | emphasized_signal = np.append(signal[0], signal[1:] - pre_emphasis * signal[:-1])
27 |
28 | frame_size = 0.025
29 | frame_stride = 0.0001
30 |
31 | frame_length, frame_step = frame_size * sample_rate, frame_stride * sample_rate # Convert from seconds to samples
32 | signal_length = len(emphasized_signal)
33 | frame_length = int(round(frame_length))
34 | frame_step = int(round(frame_step))
35 | num_frames = int(np.ceil(float(np.abs(signal_length - frame_length)) / frame_step)) # Make sure that we have at least 1 frame
36 |
37 | pad_signal_length = num_frames * frame_step + frame_length
38 | z = np.zeros((pad_signal_length - signal_length))
39 | pad_signal = np.append(emphasized_signal, z) # Pad Signal to make sure that all frames have equal number of samples without truncating any samples from the original signal
40 |
41 | indices = np.tile(np.arange(0, frame_length), (num_frames, 1)) + np.tile(np.arange(0, num_frames * frame_step, frame_step), (frame_length, 1)).T
42 | frames = pad_signal[indices.astype(np.int32, copy=False)]
43 | frames *= np.hamming(frame_length)
44 | NFFT = 512
45 |
46 | mag_frames = np.absolute(np.fft.rfft(frames, NFFT)) # Magnitude of the FFT
47 | pow_frames = ((1.0 / NFFT) * ((mag_frames) ** 2)) # Power Spectrum
48 | nfilt = 40
49 |
50 | low_freq_mel = 0
51 | high_freq_mel = (2595 * np.log10(1 + (sample_rate / 2) / 700)) # Convert Hz to Mel
52 | mel_points = np.linspace(low_freq_mel, high_freq_mel, nfilt + 2) # Equally spaced in Mel scale
53 | hz_points = (700 * (10**(mel_points / 2595) - 1)) # Convert Mel to Hz
54 | bin = np.floor((NFFT + 1) * hz_points / sample_rate)
55 |
56 | fbank = np.zeros((nfilt, int(np.floor(NFFT / 2 + 1))))
57 | for m in range(1, nfilt + 1):
58 | f_m_minus = int(bin[m - 1]) # left
59 | f_m = int(bin[m]) # center
60 | f_m_plus = int(bin[m + 1]) # right
61 |
62 | for k in range(f_m_minus, f_m):
63 | fbank[m - 1, k] = (k - bin[m - 1]) / (bin[m] - bin[m - 1])
64 | for k in range(f_m, f_m_plus):
65 | fbank[m - 1, k] = (bin[m + 1] - k) / (bin[m + 1] - bin[m])
66 | filter_banks = np.dot(pow_frames, fbank.T)
67 | filter_banks = np.where(filter_banks == 0, np.finfo(float).eps, filter_banks) # Numerical Stability
68 | filter_banks = 20 * np.log10(filter_banks) # dB
69 | num_ceps = 13
70 | mfcc = dct(filter_banks, type = 2, axis=1, norm="ortho")[:,1: (num_ceps + 1)] # keep 2-13
71 | cep_lifter = 22
72 | (nframes, ncoeff) = mfcc.shape
73 | n = np.arange(ncoeff)
74 | lift = 1 + (cep_lifter / 2) * np.sin(np.pi * n/ cep_lifter)
75 | mfcc *= lift
76 | return mfcc
77 |
78 | def preprocessing_audio(path, save_path):
79 | n = 1
80 |
81 | for class_name in os.listdir(path):
82 | class_dir = os.path.join(path, class_name)
83 | save_dir = os.path.join(save_path, class_name)
84 |
85 | for video_file in os.listdir(class_dir):
86 | video_path = os.path.join(class_dir, video_file)
87 |
88 | video_name = os.path.basename(video_path).split(".")[0]
89 | mp4_name = str(n) + '.mp4'
90 | path_video_save = os.path.join(save_dir, mp4_name)
91 |
92 | fourcc = cv2.VideoWriter_fourcc(*'mp4v')
93 | output_video = cv2.VideoWriter(path_video_save, fourcc, 16.0, (224, 224))
94 |
95 | audio_clip = AudioFileClip(video_path)
96 | audio_name = os.path.basename(video_path).split(".")[0]
97 | wave_name = str(audio_name) + '.wav'
98 | path_audio_save = os.path.join('Data\\RAVDESS\\RAVDESS_WAVE', wave_name)
99 |
100 | audio_clip.write_audiofile(path_audio_save)
101 | fs, Audiodata = wavfile.read(path_audio_save)
102 | Audiodata = normalize_audio(Audiodata)
103 | step=int((len(Audiodata))/mid_point) - 1
104 | tx=np.arange(0,len(Audiodata),step)
105 |
106 | # Only Save Audio Spectrograms
107 | for i in range(16):
108 | signal=Audiodata[tx[i]:tx[i+2]]
109 | mfcc=MFCC(signal,fs)
110 |
111 | fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(4, 4))
112 | cax = ax.matshow(
113 | np.transpose(mfcc),
114 | interpolation="nearest",
115 | aspect="auto",
116 | # cmap=plt.cm.afmhot_r,
117 | origin="lower",
118 | )
119 |
120 | plt.axis('off')
121 | fig.savefig("MFCC.jpg")
122 | audio_img = Image.open("MFCC.jpg")
123 | audio_img = audio_img.resize((224, 224))
124 | audio_img = np.array(audio_img)
125 |
126 | cv2.imwrite('audio_img.jpg',audio_img)
127 | audio_img = Image.open("audio_img.jpg")
128 | audio_img = audio_img.resize((224, 224))
129 | audio_img = np.array(audio_img)
130 |
131 | plt.close('all')
132 | output_video.write(audio_img)
133 |
134 | output_video.release()
135 | cv2.destroyAllWindows()
136 | n = n + 1
137 |
138 | return n
139 |
140 | if __name__ == '__main__':
141 |
142 | shutil.rmtree("Data\\RAVDESS\\RAVDESS_WAVE")
143 | os.mkdir("Data\\RAVDESS\\RAVDESS_WAVE")
144 |
145 | path_train = 'Data\\RAVDESS\\RAVDESS_RAW\\train'
146 | save_train_path = 'Data\\RAVDESS\\RAVDESS_Audio\\train'
147 |
148 | # path_test = 'Data\\RAVDESS\\RAVDESS_RAW\\test'
149 | # save_test_path = 'Data\\RAVDESS\\RAVDESS_Audio\\test'
150 |
151 | # path_val = 'Data\\RAVDESS\\RAVDESS_RAW\\val'
152 | # save_val_path = 'Data\\RAVDESS\\RAVDESS_Audio\\val'
153 |
154 | n_train = preprocessing_audio(path_train, save_train_path)
155 | # n_test = preprocessing_audio(path_test, save_test_path)
156 | # n_val = preprocessing_audio(path_val, save_val_path)
157 |
158 | print(n_train)
159 | # print(n_test)
160 | # print(n_val)
161 |
--------------------------------------------------------------------------------
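Note: the following is a minimal sketch, not part of the repository files. It shows how the MFCC() and normalize_audio() helpers defined in Preprocessing_Audio.py can be sanity-checked on a single extracted WAV segment. The path 'sample.wav' is a hypothetical mono recording, and the two helpers are assumed to already be defined in the same Python session.

import numpy as np
from scipy.io import wavfile
from matplotlib import pyplot as plt

fs, audio = wavfile.read('sample.wav')      # hypothetical mono WAV, e.g. one written by write_audiofile()
audio = normalize_audio(audio)              # scale samples to [-1, 1], as in the script

mid_point = 17                              # yields 16 overlapping analysis windows
step = int(len(audio) / mid_point) - 1
tx = np.arange(0, len(audio), step)

segment = audio[tx[0]:tx[2]]                # first of the 16 windows
mfcc = MFCC(segment, fs)                    # (num_frames, 13) liftered cepstral coefficients

plt.matshow(np.transpose(mfcc), aspect='auto', origin='lower')
plt.axis('off')
plt.savefig('MFCC_check.jpg')               # compare against the MFCC.jpg frames written by the script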
/MultiMAE-DER_Preprocessing Code/Preprocessing_CFAS.py:
--------------------------------------------------------------------------------
1 | import os, warnings
2 | import cv2
3 | import shutil
4 | import numpy as np
5 | import pandas as pd
6 | import tensorflow as tf
7 | import matplotlib.pyplot as plt
8 | from decord import VideoReader
9 | from moviepy.editor import AudioFileClip
10 |
11 | from scipy.io import wavfile # scipy library to read wav files
12 | import numpy as np
13 | from scipy.fftpack import dct
14 | from matplotlib import pyplot as plt
15 | from PIL import Image
16 |
17 | input_size = 224
18 | num_frame = 16
19 | sampling_rate = 3
20 |
21 | def normalize_audio(audio):
22 | audio = audio / np.max(np.abs(audio))
23 | return audio
24 |
25 | def MFCC(signal,sample_rate):
26 | pre_emphasis = 0.97
27 | emphasized_signal = np.append(signal[0], signal[1:] - pre_emphasis * signal[:-1])
28 |
29 | frame_size = 0.025
30 | frame_stride = 0.0001
31 |
32 | frame_length, frame_step = frame_size * sample_rate, frame_stride * sample_rate # Convert from seconds to samples
33 | signal_length = len(emphasized_signal)
34 | frame_length = int(round(frame_length))
35 | frame_step = int(round(frame_step))
36 | num_frames = int(np.ceil(float(np.abs(signal_length - frame_length)) / frame_step)) # Make sure that we have at least 1 frame
37 |
38 | pad_signal_length = num_frames * frame_step + frame_length
39 | z = np.zeros((pad_signal_length - signal_length))
40 | pad_signal = np.append(emphasized_signal, z) # Pad Signal to make sure that all frames have equal number of samples without truncating any samples from the original signal
41 |
42 | indices = np.tile(np.arange(0, frame_length), (num_frames, 1)) + np.tile(np.arange(0, num_frames * frame_step, frame_step), (frame_length, 1)).T
43 | frames = pad_signal[indices.astype(np.int32, copy=False)]
44 | frames *= np.hamming(frame_length)
45 | NFFT = 512
46 |
47 | mag_frames = np.absolute(np.fft.rfft(frames, NFFT)) # Magnitude of the FFT
48 | pow_frames = ((1.0 / NFFT) * ((mag_frames) ** 2)) # Power Spectrum
49 | nfilt = 40
50 |
51 | low_freq_mel = 0
52 | high_freq_mel = (2595 * np.log10(1 + (sample_rate / 2) / 700)) # Convert Hz to Mel
53 | mel_points = np.linspace(low_freq_mel, high_freq_mel, nfilt + 2) # Equally spaced in Mel scale
54 | hz_points = (700 * (10**(mel_points / 2595) - 1)) # Convert Mel to Hz
55 | bin = np.floor((NFFT + 1) * hz_points / sample_rate)
56 |
57 | fbank = np.zeros((nfilt, int(np.floor(NFFT / 2 + 1))))
58 | for m in range(1, nfilt + 1):
59 | f_m_minus = int(bin[m - 1]) # left
60 | f_m = int(bin[m]) # center
61 | f_m_plus = int(bin[m + 1]) # right
62 |
63 | for k in range(f_m_minus, f_m):
64 | fbank[m - 1, k] = (k - bin[m - 1]) / (bin[m] - bin[m - 1])
65 | for k in range(f_m, f_m_plus):
66 | fbank[m - 1, k] = (bin[m + 1] - k) / (bin[m + 1] - bin[m])
67 | filter_banks = np.dot(pow_frames, fbank.T)
68 | filter_banks = np.where(filter_banks == 0, np.finfo(float).eps, filter_banks) # Numerical Stability
69 | filter_banks = 20 * np.log10(filter_banks) # dB
70 | num_ceps = 13
71 | mfcc = dct(filter_banks, type = 2, axis=1, norm="ortho")[:,1: (num_ceps + 1)] # keep 2-13
72 | cep_lifter = 22
73 | (nframes, ncoeff) = mfcc.shape
74 | n = np.arange(ncoeff)
75 | lift = 1 + (cep_lifter / 2) * np.sin(np.pi * n/ cep_lifter)
76 | mfcc *= lift
77 | return mfcc
78 |
79 | def read_video(file_path):
80 | vr = VideoReader(file_path)
81 | frames = vr.get_batch(range(len(vr))).asnumpy()
82 | return format_frames(
83 | frames,
84 | output_size=(input_size, input_size)
85 | )
86 |
87 | def format_frames(frame, output_size):
88 | frame = tf.image.convert_image_dtype(frame, tf.uint8)
89 | frame = tf.image.resize(frame, size=list(output_size))
90 | return frame
91 |
92 | def uniform_temporal_subsample(
93 | x, num_samples, clip_idx, total_clips, frame_rate=1, temporal_dim=-4
94 | ):
95 | t = tf.shape(x)[temporal_dim]
96 | max_offset = t - num_samples * frame_rate
97 | step = max_offset // total_clips
98 | offset = clip_idx * step
99 | indices = tf.linspace(
100 | tf.cast(offset, tf.float32),
101 | tf.cast(offset + (num_samples-1) * frame_rate, tf.float32),
102 | num_samples
103 | )
104 | indices = tf.clip_by_value(indices, 0, tf.cast(t - 1, tf.float32))
105 | indices = tf.cast(tf.round(indices), tf.int32)
106 | return tf.gather(x, indices, axis=temporal_dim)
107 |
108 |
109 | def clip_generator(
110 | image, num_frames=32, frame_rate=1, num_clips=1, crop_size=224
111 | ):
112 | clips_list = []
113 | for i in range(num_clips):
114 | frame = uniform_temporal_subsample(
115 | image, num_frames, i, num_clips, frame_rate=frame_rate, temporal_dim=0
116 | )
117 | clips_list.append(frame)
118 |
119 | video = tf.stack(clips_list)
120 | video = tf.reshape(
121 | video, [num_clips*num_frames, crop_size, crop_size, 3]
122 | )
123 | return video
124 |
125 | def video_audio(path, save_path):
126 | n = 1
127 |
128 | for class_name in os.listdir(path):
129 | class_dir = os.path.join(path, class_name)
130 | save_dir = os.path.join(save_path, class_name)
131 |
132 | for video_file in os.listdir(class_dir):
133 | video_path = os.path.join(class_dir, video_file)
134 |
135 | video_name = os.path.basename(video_path).split(".")[0]
136 | mp4_name = str(video_name) + '.mp4'
137 | path_video_save = os.path.join(save_dir, mp4_name)
138 |
139 | fourcc = cv2.VideoWriter_fourcc(*'mp4v')
140 | output_video = cv2.VideoWriter(path_video_save, fourcc, 16.0, (224, 224))
141 |
142 | video_ds = read_video(video_path)
143 | video_ds = clip_generator(video_ds, num_frame, sampling_rate, num_clips=1)
144 |
145 | audio_clip = AudioFileClip(video_path)
146 | audio_name = os.path.basename(video_path).split(".")[0]
147 | wave_name = str(audio_name) + '.wav'
148 | path_audio_save = os.path.join('Data\\MEAD\\MEAD_WAVE', wave_name)
149 |
150 | audio_clip.write_audiofile(path_audio_save)
151 | fs, Audiodata = wavfile.read(path_audio_save)
152 | Audiodata = normalize_audio(Audiodata)
153 | step=int((len(Audiodata))/17) - 1
154 | tx=np.arange(0,len(Audiodata),step)
155 |
156 | # Combine of Face and Spectrogram
157 | for i in range(16):
158 | video_img = video_ds.numpy()[i]
159 | video_img = video_img.astype('uint8')
160 |
161 | signal=Audiodata[tx[i]:tx[i+2]]
162 | mfcc=MFCC(signal,fs)
163 |
164 | fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(4, 4))
165 | cax = ax.matshow(
166 | np.transpose(mfcc),
167 | interpolation="nearest",
168 | aspect="auto",
169 | #cmap=plt.cm.afmhot_r,
170 | origin="lower",
171 | )
172 |
173 | plt.axis('off')
174 | fig.savefig("MFCC.jpg")
175 | audio_img = Image.open("MFCC.jpg")
176 | audio_img = audio_img.resize((224, 224))
177 | audio_img = np.array(audio_img)
178 |
179 | img = np.concatenate((video_img, audio_img), axis = 0)
180 | cv2.imwrite('img.jpg',img)
181 | img = Image.open("img.jpg")
182 | img = img.resize((224, 224))
183 | img = np.array(img)
184 |
185 | plt.close('all')
186 | output_video.write(img)
187 |
188 | output_video.release()
189 | cv2.destroyAllWindows()
190 | n = n + 1
191 |
192 | return n
193 |
194 | if __name__ == '__main__':
195 |
196 | shutil.rmtree("Data\\MEAD\\MEAD_WAVE")
197 | os.mkdir("Data\\MEAD\\MEAD_WAVE")
198 |
199 | path_train = 'Data\\MEAD\\MEAD\\train'
200 | save_train_path = 'Data\\MEAD\\MEAD_CFAS\\train'
201 |
202 | path_test = 'Data\\MEAD\\MEAD\\test'
203 | save_test_path = 'Data\\MEAD\\MEAD_CFAS\\test'
204 |
205 | path_val = 'Data\\MEAD\\MEAD\\val'
206 | save_val_path = 'Data\\MEAD\\MEAD_CFAS\\val'
207 |
208 | n_train = video_audio(path_train, save_train_path)
209 | n_test = video_audio(path_test, save_test_path)
210 | n_val = video_audio(path_val, save_val_path)
211 |
212 | print(n_train)
213 | print(n_test)
214 | print(n_val)
215 |
--------------------------------------------------------------------------------
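Note: the following is a minimal sketch, not part of the repository files. It restates in plain NumPy the index arithmetic used by uniform_temporal_subsample()/clip_generator() in Preprocessing_CFAS.py, to trace which source frames get sampled. The 90-frame input length is a hypothetical example; num_frame=16, sampling_rate=3, and a single clip match the constants used above.

import numpy as np

t, num_samples, frame_rate, total_clips, clip_idx = 90, 16, 3, 1, 0

max_offset = t - num_samples * frame_rate             # 90 - 16*3 = 42
step = max_offset // total_clips                       # 42
offset = clip_idx * step                               # 0 for the only clip

indices = np.linspace(offset, offset + (num_samples - 1) * frame_rate, num_samples)
indices = np.clip(np.round(indices), 0, t - 1).astype(int)
print(indices)  # [ 0  3  6  9 12 15 18 21 24 27 30 33 36 39 42 45]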
/MultiMAE-DER_Preprocessing Code/Preprocessing_FFLS.py:
--------------------------------------------------------------------------------
1 | import os, warnings
2 | import cv2
3 | import shutil
4 | import numpy as np
5 | import pandas as pd
6 | import tensorflow as tf
7 | import matplotlib.pyplot as plt
8 | from decord import VideoReader
9 | from moviepy.editor import AudioFileClip
10 |
11 | from scipy.io import wavfile # scipy library to read wav files
12 | import numpy as np
13 | from scipy.fftpack import dct
14 | from matplotlib import pyplot as plt
15 | from PIL import Image
16 |
17 | input_size = 224
18 | num_frame = 8
19 | sampling_rate = 6
20 |
21 | def normalize_audio(audio):
22 | audio = audio / np.max(np.abs(audio))
23 | return audio
24 |
25 | def MFCC(signal,sample_rate):
26 | pre_emphasis = 0.97
27 | emphasized_signal = np.append(signal[0], signal[1:] - pre_emphasis * signal[:-1])
28 |
29 | frame_size = 0.025
30 | frame_stride = 0.0001
31 |
32 | frame_length, frame_step = frame_size * sample_rate, frame_stride * sample_rate # Convert from seconds to samples
33 | signal_length = len(emphasized_signal)
34 | frame_length = int(round(frame_length))
35 | frame_step = int(round(frame_step))
36 | num_frames = int(np.ceil(float(np.abs(signal_length - frame_length)) / frame_step)) # Make sure that we have at least 1 frame
37 |
38 | pad_signal_length = num_frames * frame_step + frame_length
39 | z = np.zeros((pad_signal_length - signal_length))
40 | pad_signal = np.append(emphasized_signal, z) # Pad Signal to make sure that all frames have equal number of samples without truncating any samples from the original signal
41 |
42 | indices = np.tile(np.arange(0, frame_length), (num_frames, 1)) + np.tile(np.arange(0, num_frames * frame_step, frame_step), (frame_length, 1)).T
43 | frames = pad_signal[indices.astype(np.int32, copy=False)]
44 | frames *= np.hamming(frame_length)
45 | NFFT = 512
46 |
47 | mag_frames = np.absolute(np.fft.rfft(frames, NFFT)) # Magnitude of the FFT
48 | pow_frames = ((1.0 / NFFT) * ((mag_frames) ** 2)) # Power Spectrum
49 | nfilt = 40
50 |
51 | low_freq_mel = 0
52 | high_freq_mel = (2595 * np.log10(1 + (sample_rate / 2) / 700)) # Convert Hz to Mel
53 | mel_points = np.linspace(low_freq_mel, high_freq_mel, nfilt + 2) # Equally spaced in Mel scale
54 | hz_points = (700 * (10**(mel_points / 2595) - 1)) # Convert Mel to Hz
55 | bin = np.floor((NFFT + 1) * hz_points / sample_rate)
56 |
57 | fbank = np.zeros((nfilt, int(np.floor(NFFT / 2 + 1))))
58 | for m in range(1, nfilt + 1):
59 | f_m_minus = int(bin[m - 1]) # left
60 | f_m = int(bin[m]) # center
61 | f_m_plus = int(bin[m + 1]) # right
62 |
63 | for k in range(f_m_minus, f_m):
64 | fbank[m - 1, k] = (k - bin[m - 1]) / (bin[m] - bin[m - 1])
65 | for k in range(f_m, f_m_plus):
66 | fbank[m - 1, k] = (bin[m + 1] - k) / (bin[m + 1] - bin[m])
67 | filter_banks = np.dot(pow_frames, fbank.T)
68 | filter_banks = np.where(filter_banks == 0, np.finfo(float).eps, filter_banks) # Numerical Stability
69 | filter_banks = 20 * np.log10(filter_banks) # dB
70 | num_ceps = 13
71 | mfcc = dct(filter_banks, type = 2, axis=1, norm="ortho")[:,1: (num_ceps + 1)] # keep 2-13
72 | cep_lifter = 22
73 | (nframes, ncoeff) = mfcc.shape
74 | n = np.arange(ncoeff)
75 | lift = 1 + (cep_lifter / 2) * np.sin(np.pi * n/ cep_lifter)
76 | mfcc *= lift
77 | return mfcc
78 |
79 | def read_video(file_path):
80 | vr = VideoReader(file_path)
81 | frames = vr.get_batch(range(len(vr))).asnumpy()
82 | return format_frames(
83 | frames,
84 | output_size=(input_size, input_size)
85 | )
86 |
87 | def format_frames(frame, output_size):
88 | frame = tf.image.convert_image_dtype(frame, tf.uint8)
89 | frame = tf.image.resize(frame, size=list(output_size))
90 | return frame
91 |
92 | def uniform_temporal_subsample(
93 | x, num_samples, clip_idx, total_clips, frame_rate=1, temporal_dim=-4
94 | ):
95 | t = tf.shape(x)[temporal_dim]
96 | max_offset = t - num_samples * frame_rate
97 | step = max_offset // total_clips
98 | offset = clip_idx * step
99 | indices = tf.linspace(
100 | tf.cast(offset, tf.float32),
101 | tf.cast(offset + (num_samples-1) * frame_rate, tf.float32),
102 | num_samples
103 | )
104 | indices = tf.clip_by_value(indices, 0, tf.cast(t - 1, tf.float32))
105 | indices = tf.cast(tf.round(indices), tf.int32)
106 | return tf.gather(x, indices, axis=temporal_dim)
107 |
108 |
109 | def clip_generator(
110 | image, num_frames=32, frame_rate=1, num_clips=1, crop_size=224
111 | ):
112 | clips_list = []
113 | for i in range(num_clips):
114 | frame = uniform_temporal_subsample(
115 | image, num_frames, i, num_clips, frame_rate=frame_rate, temporal_dim=0
116 | )
117 | clips_list.append(frame)
118 |
119 | video = tf.stack(clips_list)
120 | video = tf.reshape(
121 | video, [num_clips*num_frames, crop_size, crop_size, 3]
122 | )
123 | return video
124 |
125 | def video_audio(path, save_path):
126 | n = 1
127 |
128 | for class_name in os.listdir(path):
129 | class_dir = os.path.join(path, class_name)
130 | save_dir = os.path.join(save_path, class_name)
131 |
132 | for video_file in os.listdir(class_dir):
133 | video_path = os.path.join(class_dir, video_file)
134 |
135 | video_name = os.path.basename(video_path).split(".")[0]
136 | mp4_name = str(video_name) + '.mp4'
137 | path_video_save = os.path.join(save_dir, mp4_name)
138 |
139 | fourcc = cv2.VideoWriter_fourcc(*'mp4v')
140 | output_video = cv2.VideoWriter(path_video_save, fourcc, 16.0, (224, 224))
141 |
142 | video_ds = read_video(video_path)
143 | video_ds = clip_generator(video_ds, num_frame, sampling_rate, num_clips=1)
144 |
145 | audio_clip = AudioFileClip(video_path)
146 | audio_name = os.path.basename(video_path).split(".")[0]
147 | wave_name = str(audio_name) + '.wav'
148 | path_audio_save = os.path.join('Data\\MEAD\\MEAD_WAVE', wave_name)
149 |
150 | audio_clip.write_audiofile(path_audio_save)
151 | fs, Audiodata = wavfile.read(path_audio_save)
152 | Audiodata = normalize_audio(Audiodata)
153 | step=int((len(Audiodata))/9) - 1
154 | tx=np.arange(0,len(Audiodata),step)
155 |
156 | # First Face Late Spectrogram
157 | for i in range(8):
158 | video_img = video_ds.numpy()[i]
159 | video_img = video_img.astype('uint8')
160 | plt.axis('off')
161 |
162 | cv2.imwrite('video_img.jpg',video_img)
163 | video_img = Image.open("video_img.jpg")
164 | video_img = video_img.resize((224, 224))
165 | video_img = np.array(video_img)
166 |
167 | plt.close('all')
168 | output_video.write(video_img)
169 |
170 | for i in range(8):
171 | signal=Audiodata[tx[i]:tx[i+2]]
172 | mfcc=MFCC(signal,fs)
173 |
174 | fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(4, 4))
175 | cax = ax.matshow(
176 | np.transpose(mfcc),
177 | interpolation="nearest",
178 | aspect="auto",
179 | #cmap=plt.cm.afmhot_r,
180 | origin="lower",
181 | )
182 |
183 | plt.axis('off')
184 | fig.savefig("MFCC.jpg")
185 | audio_img = Image.open("MFCC.jpg")
186 | audio_img = audio_img.resize((224, 224))
187 | audio_img = np.array(audio_img)
188 |
189 | cv2.imwrite('audio_img.jpg',audio_img)
190 | audio_img = Image.open("audio_img.jpg")
191 | audio_img = audio_img.resize((224, 224))
192 | audio_img = np.array(audio_img)
193 |
194 | plt.close('all')
195 | output_video.write(audio_img)
196 |
197 | output_video.release()
198 | cv2.destroyAllWindows()
199 | n = n + 1
200 |
201 | return n
202 |
203 | if __name__ == '__main__':
204 |
205 | shutil.rmtree("Data\\MEAD\\MEAD_WAVE")
206 | os.mkdir("Data\\MEAD\\MEAD_WAVE")
207 |
208 | path_train = 'Data\\MEAD\\MEAD\\train'
209 | save_train_path = 'Data\\MEAD\\MEAD_FFLS\\train'
210 |
211 | path_test = 'Data\\MEAD\\MEAD\\test'
212 | save_test_path = 'Data\\MEAD\\MEAD_FFLS\\test'
213 |
214 | path_val = 'Data\\MEAD\\MEAD\\val'
215 | save_val_path = 'Data\\MEAD\\MEAD_FFLS\\val'
216 |
217 | n_train = video_audio(path_train, save_train_path)
218 | n_test = video_audio(path_test, save_test_path)
219 | n_val = video_audio(path_val, save_val_path)
220 |
221 | print(n_train)
222 | print(n_test)
223 | print(n_val)
224 |
--------------------------------------------------------------------------------
/MultiMAE-DER_Preprocessing Code/Preprocessing_FSLF.py:
--------------------------------------------------------------------------------
1 | import os, warnings
2 | import cv2
3 | import shutil
4 | import numpy as np
5 | import pandas as pd
6 | import tensorflow as tf
7 | import matplotlib.pyplot as plt
8 | from decord import VideoReader
9 | from moviepy.editor import AudioFileClip
10 |
11 | from scipy.io import wavfile # scipy library to read wav files
12 | import numpy as np
13 | from scipy.fftpack import dct
14 | from matplotlib import pyplot as plt
15 | from PIL import Image
16 |
17 | input_size = 224
18 | num_frame = 8
19 | sampling_rate = 6
20 |
21 | def normalize_audio(audio):
22 | audio = audio / np.max(np.abs(audio))
23 | return audio
24 |
25 | def MFCC(signal,sample_rate):
26 | pre_emphasis = 0.97
27 | emphasized_signal = np.append(signal[0], signal[1:] - pre_emphasis * signal[:-1])
28 |
29 | frame_size = 0.025
30 | frame_stride = 0.0001
31 |
32 | frame_length, frame_step = frame_size * sample_rate, frame_stride * sample_rate # Convert from seconds to samples
33 | signal_length = len(emphasized_signal)
34 | frame_length = int(round(frame_length))
35 | frame_step = int(round(frame_step))
36 | num_frames = int(np.ceil(float(np.abs(signal_length - frame_length)) / frame_step)) # Make sure that we have at least 1 frame
37 |
38 | pad_signal_length = num_frames * frame_step + frame_length
39 | z = np.zeros((pad_signal_length - signal_length))
40 | pad_signal = np.append(emphasized_signal, z) # Pad Signal to make sure that all frames have equal number of samples without truncating any samples from the original signal
41 |
42 | indices = np.tile(np.arange(0, frame_length), (num_frames, 1)) + np.tile(np.arange(0, num_frames * frame_step, frame_step), (frame_length, 1)).T
43 | frames = pad_signal[indices.astype(np.int32, copy=False)]
44 | frames *= np.hamming(frame_length)
45 | NFFT = 512
46 |
47 | mag_frames = np.absolute(np.fft.rfft(frames, NFFT)) # Magnitude of the FFT
48 | pow_frames = ((1.0 / NFFT) * ((mag_frames) ** 2)) # Power Spectrum
49 | nfilt = 40
50 |
51 | low_freq_mel = 0
52 | high_freq_mel = (2595 * np.log10(1 + (sample_rate / 2) / 700)) # Convert Hz to Mel
53 | mel_points = np.linspace(low_freq_mel, high_freq_mel, nfilt + 2) # Equally spaced in Mel scale
54 | hz_points = (700 * (10**(mel_points / 2595) - 1)) # Convert Mel to Hz
55 | bin = np.floor((NFFT + 1) * hz_points / sample_rate)
56 |
57 | fbank = np.zeros((nfilt, int(np.floor(NFFT / 2 + 1))))
58 | for m in range(1, nfilt + 1):
59 | f_m_minus = int(bin[m - 1]) # left
60 | f_m = int(bin[m]) # center
61 | f_m_plus = int(bin[m + 1]) # right
62 |
63 | for k in range(f_m_minus, f_m):
64 | fbank[m - 1, k] = (k - bin[m - 1]) / (bin[m] - bin[m - 1])
65 | for k in range(f_m, f_m_plus):
66 | fbank[m - 1, k] = (bin[m + 1] - k) / (bin[m + 1] - bin[m])
67 | filter_banks = np.dot(pow_frames, fbank.T)
68 | filter_banks = np.where(filter_banks == 0, np.finfo(float).eps, filter_banks) # Numerical Stability
69 | filter_banks = 20 * np.log10(filter_banks) # dB
70 | num_ceps = 13
71 | mfcc = dct(filter_banks, type = 2, axis=1, norm="ortho")[:,1: (num_ceps + 1)] # keep 2-13
72 | cep_lifter = 22
73 | (nframes, ncoeff) = mfcc.shape
74 | n = np.arange(ncoeff)
75 | lift = 1 + (cep_lifter / 2) * np.sin(np.pi * n/ cep_lifter)
76 | mfcc *= lift
77 | return mfcc
78 |
79 | def read_video(file_path):
80 | vr = VideoReader(file_path)
81 | frames = vr.get_batch(range(len(vr))).asnumpy()
82 | return format_frames(
83 | frames,
84 | output_size=(input_size, input_size)
85 | )
86 |
87 | def format_frames(frame, output_size):
88 | frame = tf.image.convert_image_dtype(frame, tf.uint8)
89 | frame = tf.image.resize(frame, size=list(output_size))
90 | return frame
91 |
92 | def uniform_temporal_subsample(
93 | x, num_samples, clip_idx, total_clips, frame_rate=1, temporal_dim=-4
94 | ):
95 | t = tf.shape(x)[temporal_dim]
96 | max_offset = t - num_samples * frame_rate
97 | step = max_offset // total_clips
98 | offset = clip_idx * step
99 | indices = tf.linspace(
100 | tf.cast(offset, tf.float32),
101 | tf.cast(offset + (num_samples-1) * frame_rate, tf.float32),
102 | num_samples
103 | )
104 | indices = tf.clip_by_value(indices, 0, tf.cast(t - 1, tf.float32))
105 | indices = tf.cast(tf.round(indices), tf.int32)
106 | return tf.gather(x, indices, axis=temporal_dim)
107 |
108 |
109 | def clip_generator(
110 | image, num_frames=32, frame_rate=1, num_clips=1, crop_size=224
111 | ):
112 | clips_list = []
113 | for i in range(num_clips):
114 | frame = uniform_temporal_subsample(
115 | image, num_frames, i, num_clips, frame_rate=frame_rate, temporal_dim=0
116 | )
117 | clips_list.append(frame)
118 |
119 | video = tf.stack(clips_list)
120 | video = tf.reshape(
121 | video, [num_clips*num_frames, crop_size, crop_size, 3]
122 | )
123 | return video
124 |
125 | def video_audio(path, save_path):
126 | n = 1
127 |
128 | for class_name in os.listdir(path):
129 | class_dir = os.path.join(path, class_name)
130 | save_dir = os.path.join(save_path, class_name)
131 |
132 | for video_file in os.listdir(class_dir):
133 | video_path = os.path.join(class_dir, video_file)
134 |
135 | video_name = os.path.basename(video_path).split(".")[0]
136 | mp4_name = str(video_name) + '.mp4'
137 | path_video_save = os.path.join(save_dir, mp4_name)
138 |
139 | fourcc = cv2.VideoWriter_fourcc(*'mp4v')
140 | output_video = cv2.VideoWriter(path_video_save, fourcc, 16.0, (224, 224))
141 |
142 | video_ds = read_video(video_path)
143 | video_ds = clip_generator(video_ds, num_frame, sampling_rate, num_clips=1)
144 |
145 | audio_clip = AudioFileClip(video_path)
146 | audio_name = os.path.basename(video_path).split(".")[0]
147 | wave_name = str(audio_name) + '.wav'
148 | path_audio_save = os.path.join('Data\\MEAD\\MEAD_WAVE', wave_name)
149 |
150 | audio_clip.write_audiofile(path_audio_save)
151 | fs, Audiodata = wavfile.read(path_audio_save)
152 | Audiodata = normalize_audio(Audiodata)
153 | step=int((len(Audiodata))/9) - 1
154 | tx=np.arange(0,len(Audiodata),step)
155 |
156 | # First Spectrogram Late Face
157 | for i in range(8):
158 | signal=Audiodata[tx[i]:tx[i+2]]
159 | mfcc=MFCC(signal,fs)
160 |
161 | fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(4, 4))
162 | cax = ax.matshow(
163 | np.transpose(mfcc),
164 | interpolation="nearest",
165 | aspect="auto",
166 | #cmap=plt.cm.afmhot_r,
167 | origin="lower",
168 | )
169 |
170 | plt.axis('off')
171 | fig.savefig("MFCC.jpg")
172 | audio_img = Image.open("MFCC.jpg")
173 | audio_img = audio_img.resize((224, 224))
174 | audio_img = np.array(audio_img)
175 |
176 | cv2.imwrite('audio_img.jpg',audio_img)
177 | audio_img = Image.open("audio_img.jpg")
178 | audio_img = audio_img.resize((224, 224))
179 | audio_img = np.array(audio_img)
180 |
181 | plt.close('all')
182 | output_video.write(audio_img)
183 |
184 | for i in range(8):
185 | video_img = video_ds.numpy()[i]
186 | video_img = video_img.astype('uint8')
187 | plt.axis('off')
188 |
189 | cv2.imwrite('video_img.jpg',video_img)
190 | video_img = Image.open("video_img.jpg")
191 | video_img = video_img.resize((224, 224))
192 | video_img = np.array(video_img)
193 |
194 | plt.close('all')
195 | output_video.write(video_img)
196 |
197 | output_video.release()
198 | cv2.destroyAllWindows()
199 | n = n + 1
200 |
201 | return n
202 |
203 | if __name__ == '__main__':
204 |
205 | shutil.rmtree("Data\\MEAD\\MEAD_WAVE")
206 | os.mkdir("Data\\MEAD\\MEAD_WAVE")
207 |
208 | path_train = 'Data\\MEAD\\MEAD\\train'
209 | save_train_path = 'Data\\MEAD\\MEAD_FSLF\\train'
210 |
211 | path_test = 'Data\\MEAD\\MEAD\\test'
212 | save_test_path = 'Data\\MEAD\\MEAD_FSLF\\test'
213 |
214 | path_val = 'Data\\MEAD\\MEAD\\val'
215 | save_val_path = 'Data\\MEAD\\MEAD_FSLF\\val'
216 |
217 | n_train = video_audio(path_train, save_train_path)
218 | n_test = video_audio(path_test, save_test_path)
219 | n_val = video_audio(path_val, save_val_path)
220 |
221 | print(n_train)
222 | print(n_test)
223 | print(n_val)
224 |
--------------------------------------------------------------------------------
/MultiMAE-DER_Preprocessing Code/Preprocessing_OFOS.py:
--------------------------------------------------------------------------------
1 | import os, warnings
2 | import cv2
3 | import shutil
4 | import numpy as np
5 | import pandas as pd
6 | import tensorflow as tf
7 | import matplotlib.pyplot as plt
8 | from decord import VideoReader
9 | from moviepy.editor import AudioFileClip
10 |
11 | from scipy.io import wavfile # scipy library to read wav files
12 | import numpy as np
13 | from scipy.fftpack import dct
14 | from matplotlib import pyplot as plt
15 | from PIL import Image
16 |
17 | input_size = 224
18 | num_frame = 8
19 | sampling_rate = 6
20 |
21 | def normalize_audio(audio):
22 | audio = audio / np.max(np.abs(audio))
23 | return audio
24 |
25 | def MFCC(signal,sample_rate):
26 | pre_emphasis = 0.97
27 | emphasized_signal = np.append(signal[0], signal[1:] - pre_emphasis * signal[:-1])
28 |
29 | frame_size = 0.025
30 | frame_stride = 0.0001
31 |
32 | frame_length, frame_step = frame_size * sample_rate, frame_stride * sample_rate # Convert from seconds to samples
33 | signal_length = len(emphasized_signal)
34 | frame_length = int(round(frame_length))
35 | frame_step = int(round(frame_step))
36 | num_frames = int(np.ceil(float(np.abs(signal_length - frame_length)) / frame_step)) # Make sure that we have at least 1 frame
37 |
38 | pad_signal_length = num_frames * frame_step + frame_length
39 | z = np.zeros((pad_signal_length - signal_length))
40 | pad_signal = np.append(emphasized_signal, z) # Pad Signal to make sure that all frames have equal number of samples without truncating any samples from the original signal
41 |
42 | indices = np.tile(np.arange(0, frame_length), (num_frames, 1)) + np.tile(np.arange(0, num_frames * frame_step, frame_step), (frame_length, 1)).T
43 | frames = pad_signal[indices.astype(np.int32, copy=False)]
44 | frames *= np.hamming(frame_length)
45 | NFFT = 512
46 |
47 | mag_frames = np.absolute(np.fft.rfft(frames, NFFT)) # Magnitude of the FFT
48 | pow_frames = ((1.0 / NFFT) * ((mag_frames) ** 2)) # Power Spectrum
49 | nfilt = 40
50 |
51 | low_freq_mel = 0
52 | high_freq_mel = (2595 * np.log10(1 + (sample_rate / 2) / 700)) # Convert Hz to Mel
53 | mel_points = np.linspace(low_freq_mel, high_freq_mel, nfilt + 2) # Equally spaced in Mel scale
54 | hz_points = (700 * (10**(mel_points / 2595) - 1)) # Convert Mel to Hz
55 | bin = np.floor((NFFT + 1) * hz_points / sample_rate)
56 |
57 | fbank = np.zeros((nfilt, int(np.floor(NFFT / 2 + 1))))
58 | for m in range(1, nfilt + 1):
59 | f_m_minus = int(bin[m - 1]) # left
60 | f_m = int(bin[m]) # center
61 | f_m_plus = int(bin[m + 1]) # right
62 |
63 | for k in range(f_m_minus, f_m):
64 | fbank[m - 1, k] = (k - bin[m - 1]) / (bin[m] - bin[m - 1])
65 | for k in range(f_m, f_m_plus):
66 | fbank[m - 1, k] = (bin[m + 1] - k) / (bin[m + 1] - bin[m])
67 | filter_banks = np.dot(pow_frames, fbank.T)
68 | filter_banks = np.where(filter_banks == 0, np.finfo(float).eps, filter_banks) # Numerical Stability
69 | filter_banks = 20 * np.log10(filter_banks) # dB
70 | num_ceps = 13
71 | mfcc = dct(filter_banks, type = 2, axis=1, norm="ortho")[:,1: (num_ceps + 1)] # keep 2-13
72 | cep_lifter = 22
73 | (nframes, ncoeff) = mfcc.shape
74 | n = np.arange(ncoeff)
75 | lift = 1 + (cep_lifter / 2) * np.sin(np.pi * n/ cep_lifter)
76 | mfcc *= lift
77 | return mfcc
78 |
79 | def read_video(file_path):
80 | vr = VideoReader(file_path)
81 | frames = vr.get_batch(range(len(vr))).asnumpy()
82 | return format_frames(
83 | frames,
84 | output_size=(input_size, input_size)
85 | )
86 |
87 | def format_frames(frame, output_size):
88 | frame = tf.image.convert_image_dtype(frame, tf.uint8)
89 | frame = tf.image.resize(frame, size=list(output_size))
90 | return frame
91 |
92 | def uniform_temporal_subsample(
93 | x, num_samples, clip_idx, total_clips, frame_rate=1, temporal_dim=-4
94 | ):
95 | t = tf.shape(x)[temporal_dim]
96 | max_offset = t - num_samples * frame_rate
97 | step = max_offset // total_clips
98 | offset = clip_idx * step
99 | indices = tf.linspace(
100 | tf.cast(offset, tf.float32),
101 | tf.cast(offset + (num_samples-1) * frame_rate, tf.float32),
102 | num_samples
103 | )
104 | indices = tf.clip_by_value(indices, 0, tf.cast(t - 1, tf.float32))
105 | indices = tf.cast(tf.round(indices), tf.int32)
106 | return tf.gather(x, indices, axis=temporal_dim)
107 |
108 |
109 | def clip_generator(
110 | image, num_frames=32, frame_rate=1, num_clips=1, crop_size=224
111 | ):
112 | clips_list = []
113 | for i in range(num_clips):
114 | frame = uniform_temporal_subsample(
115 | image, num_frames, i, num_clips, frame_rate=frame_rate, temporal_dim=0
116 | )
117 | clips_list.append(frame)
118 |
119 | video = tf.stack(clips_list)
120 | video = tf.reshape(
121 | video, [num_clips*num_frames, crop_size, crop_size, 3]
122 | )
123 | return video
124 |
125 | def video_audio(path, save_path):
126 | n = 1
127 |
128 | for class_name in os.listdir(path):
129 | class_dir = os.path.join(path, class_name)
130 | save_dir = os.path.join(save_path, class_name)
131 |
132 | for video_file in os.listdir(class_dir):
133 | video_path = os.path.join(class_dir, video_file)
134 |
135 | video_name = os.path.basename(video_path).split(".")[0]
136 | mp4_name = str(video_name) + '.mp4'
137 | path_video_save = os.path.join(save_dir, mp4_name)
138 |
139 | fourcc = cv2.VideoWriter_fourcc(*'mp4v')
140 | output_video = cv2.VideoWriter(path_video_save, fourcc, 16.0, (224, 224))
141 |
142 | video_ds = read_video(video_path)
143 | video_ds = clip_generator(video_ds, num_frame, sampling_rate, num_clips=1)
144 |
145 | audio_clip = AudioFileClip(video_path)
146 | audio_name = os.path.basename(video_path).split(".")[0]
147 | wave_name = str(audio_name) + '.wav'
148 | path_audio_save = os.path.join('Data\\MEAD\\MEAD_WAVE', wave_name)
149 |
150 | audio_clip.write_audiofile(path_audio_save)
151 | fs, Audiodata = wavfile.read(path_audio_save)
152 | Audiodata = normalize_audio(Audiodata)
153 | step=int((len(Audiodata))/9) - 1
154 | tx=np.arange(0,len(Audiodata),step)
155 |
156 | # One Face One Spectrogram
157 | for i in range(8):
158 | video_img = video_ds.numpy()[i]
159 | video_img = video_img.astype('uint8')
160 | plt.axis('off')
161 |
162 | cv2.imwrite('video_img.jpg',video_img)
163 | video_img = Image.open("video_img.jpg")
164 | video_img = video_img.resize((224, 224))
165 | video_img = np.array(video_img)
166 |
167 | plt.close('all')
168 | output_video.write(video_img)
169 |
170 | signal=Audiodata[tx[i]:tx[i+2]]
171 | mfcc=MFCC(signal,fs)
172 |
173 | fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(4, 4))
174 | cax = ax.matshow(
175 | np.transpose(mfcc),
176 | interpolation="nearest",
177 | aspect="auto",
178 | #cmap=plt.cm.afmhot_r,
179 | origin="lower",
180 | )
181 |
182 | plt.axis('off')
183 | fig.savefig("MFCC.jpg")
184 | audio_img = Image.open("MFCC.jpg")
185 | audio_img = audio_img.resize((224, 224))
186 | audio_img = np.array(audio_img)
187 |
188 | cv2.imwrite('audio_img.jpg',audio_img)
189 | audio_img = Image.open("audio_img.jpg")
190 | audio_img = audio_img.resize((224, 224))
191 | audio_img = np.array(audio_img)
192 |
193 | plt.close('all')
194 | output_video.write(audio_img)
195 |
196 | output_video.release()
197 | cv2.destroyAllWindows()
198 | n = n + 1
199 |
200 | return n
201 |
202 | if __name__ == '__main__':
203 |
204 | shutil.rmtree("Data\\MEAD\\MEAD_WAVE")
205 | os.mkdir("Data\\MEAD\\MEAD_WAVE")
206 |
207 | path_train = 'Data\\MEAD\\MEAD\\train'
208 | save_train_path = 'Data\\MEAD\\MEAD_OFOS\\train'
209 |
210 | path_test = 'Data\\MEAD\\MEAD\\test'
211 | save_test_path = 'Data\\MEAD\\MEAD_OFOS\\test'
212 |
213 | path_val = 'Data\\MEAD\\MEAD\\val'
214 | save_val_path = 'Data\\MEAD\\MEAD_OFOS\\val'
215 |
216 | n_train = video_audio(path_train, save_train_path)
217 | n_test = video_audio(path_test, save_test_path)
218 | n_val = video_audio(path_val, save_val_path)
219 |
220 | print(n_train)
221 | print(n_test)
222 | print(n_val)
223 |
--------------------------------------------------------------------------------
/MultiMAE-DER_Preprocessing Code/Preprocessing_RFAS.py:
--------------------------------------------------------------------------------
1 | import os, warnings
2 | import cv2
3 | import random
4 | import numpy as np
5 | import pandas as pd
6 | import tensorflow as tf
7 | import matplotlib.pyplot as plt
8 | from decord import VideoReader
9 | from moviepy.editor import AudioFileClip
10 |
11 | from scipy.io import wavfile # scipy library to read wav files
12 | import numpy as np
13 | from scipy.fftpack import dct
14 | from matplotlib import pyplot as plt
15 | from PIL import Image
16 |
17 | input_size = 224
18 | num_frame = 16
19 | sampling_rate = 1
20 |
21 | def read_video(file_path):
22 | vr = VideoReader(file_path)
23 | frames = vr.get_batch(range(len(vr))).asnumpy()
24 | return format_frames(
25 | frames,
26 | output_size=(input_size, input_size)
27 | )
28 |
29 | def format_frames(frame, output_size):
30 | frame = tf.image.convert_image_dtype(frame, tf.uint8)
31 | frame = tf.image.resize(frame, size=list(output_size))
32 | return frame
33 |
34 | def uniform_temporal_subsample(
35 | x, num_samples, clip_idx, total_clips, frame_rate=1, temporal_dim=-4
36 | ):
37 | t = tf.shape(x)[temporal_dim]
38 | max_offset = t - num_samples * frame_rate
39 | step = max_offset // total_clips
40 | offset = clip_idx * step
41 | indices = tf.linspace(
42 | tf.cast(offset, tf.float32),
43 | tf.cast(offset + (num_samples-1) * frame_rate, tf.float32),
44 | num_samples
45 | )
46 | indices = tf.clip_by_value(indices, 0, tf.cast(t - 1, tf.float32))
47 | indices = tf.cast(tf.round(indices), tf.int32)
48 | return tf.gather(x, indices, axis=temporal_dim)
49 |
50 |
51 | def clip_generator(
52 | image, num_frames=32, frame_rate=1, num_clips=1, crop_size=224
53 | ):
54 | clips_list = []
55 | for i in range(num_clips):
56 | frame = uniform_temporal_subsample(
57 | image, num_frames, i, num_clips, frame_rate=frame_rate, temporal_dim=0
58 | )
59 | clips_list.append(frame)
60 |
61 | video = tf.stack(clips_list)
62 | video = tf.reshape(
63 | video, [num_clips*num_frames, crop_size, crop_size, 3]
64 | )
65 | return video
66 |
67 | def video_audio(path, save_path):
68 | n = 1
69 |
70 | for class_name in os.listdir(path):
71 | class_dir = os.path.join(path, class_name)
72 | save_dir = os.path.join(save_path, class_name)
73 |
74 | for video_file in os.listdir(class_dir):
75 | video_path = os.path.join(class_dir, video_file)
76 |
77 | video_name = os.path.basename(video_path).split(".")[0]
78 | mp4_name = str(video_name) + '.mp4'
79 | path_video_save = os.path.join(save_dir, mp4_name)
80 |
81 | fourcc = cv2.VideoWriter_fourcc(*'mp4v')
82 | output_video = cv2.VideoWriter(path_video_save, fourcc, 16.0, (224, 224))
83 |
84 | video_ds = read_video(video_path)
85 | video_ds = clip_generator(video_ds, num_frame, sampling_rate, num_clips=1)
86 |
87 | L1 = random.sample(range(0, 16), 16)
88 | # Random Face and Spectrogram
89 | for i in L1:
90 | video_img = video_ds.numpy()[i]
91 | video_img = video_img.astype('uint8')
92 | plt.axis('off')
93 |
94 | cv2.imwrite('video_img.jpg',video_img)
95 | video_img = Image.open("video_img.jpg")
96 | video_img = video_img.resize((224, 224))
97 | video_img = np.array(video_img)
98 |
99 | plt.close('all')
100 | output_video.write(video_img)
101 |
102 | output_video.release()
103 | cv2.destroyAllWindows()
104 | n = n + 1
105 |
106 | return n
107 |
108 | if __name__ == '__main__':
109 |
110 | path_train = 'Data\\MEAD\\MEAD_OFOS\\train'
111 | save_train_path = 'Data\\MEAD\\MEAD_RFAS\\train'
112 |
113 | path_test = 'Data\\MEAD\\MEAD_OFOS\\test'
114 | save_test_path = 'Data\\MEAD\\MEAD_RFAS\\test'
115 |
116 | path_val = 'Data\\MEAD\\MEAD_OFOS\\val'
117 | save_val_path = 'Data\\MEAD\\MEAD_RFAS\\val'
118 |
119 | n_train = video_audio(path_train, save_train_path)
120 | n_test = video_audio(path_test, save_test_path)
121 | n_val = video_audio(path_val, save_val_path)
122 |
123 | print(n_train)
124 | print(n_test)
125 | print(n_val)
126 |
--------------------------------------------------------------------------------
/MultiMAE-DER_Preprocessing Code/Preprocessing_SFAS.py:
--------------------------------------------------------------------------------
1 | import os, warnings
2 | import cv2
3 | import shutil
4 | import numpy as np
5 | import pandas as pd
6 | import tensorflow as tf
7 | import matplotlib.pyplot as plt
8 | from decord import VideoReader
9 | from moviepy.editor import AudioFileClip
10 |
11 | from scipy.io import wavfile # scipy library to read wav files
12 | import numpy as np
13 | from scipy.fftpack import dct
14 | from matplotlib import pyplot as plt
15 | from PIL import Image
16 |
17 | input_size = 224
18 | num_frame = 16
19 | sampling_rate = 3
20 |
21 | def normalize_audio(audio):
22 | audio = audio / np.max(np.abs(audio))
23 | return audio
24 |
25 | def MFCC(signal,sample_rate):
26 | pre_emphasis = 0.97
27 | emphasized_signal = np.append(signal[0], signal[1:] - pre_emphasis * signal[:-1])
28 |
29 | frame_size = 0.025
30 | frame_stride = 0.0001
31 |
32 | frame_length, frame_step = frame_size * sample_rate, frame_stride * sample_rate # Convert from seconds to samples
33 | signal_length = len(emphasized_signal)
34 | frame_length = int(round(frame_length))
35 | frame_step = int(round(frame_step))
36 | num_frames = int(np.ceil(float(np.abs(signal_length - frame_length)) / frame_step)) # Make sure that we have at least 1 frame
37 |
38 | pad_signal_length = num_frames * frame_step + frame_length
39 | z = np.zeros((pad_signal_length - signal_length))
40 | pad_signal = np.append(emphasized_signal, z) # Pad Signal to make sure that all frames have equal number of samples without truncating any samples from the original signal
41 |
42 | indices = np.tile(np.arange(0, frame_length), (num_frames, 1)) + np.tile(np.arange(0, num_frames * frame_step, frame_step), (frame_length, 1)).T
43 | frames = pad_signal[indices.astype(np.int32, copy=False)]
44 | frames *= np.hamming(frame_length)
45 | NFFT = 512
46 |
47 | mag_frames = np.absolute(np.fft.rfft(frames, NFFT)) # Magnitude of the FFT
48 | pow_frames = ((1.0 / NFFT) * ((mag_frames) ** 2)) # Power Spectrum
49 | nfilt = 40
50 |
51 | low_freq_mel = 0
52 | high_freq_mel = (2595 * np.log10(1 + (sample_rate / 2) / 700)) # Convert Hz to Mel
53 | mel_points = np.linspace(low_freq_mel, high_freq_mel, nfilt + 2) # Equally spaced in Mel scale
54 | hz_points = (700 * (10**(mel_points / 2595) - 1)) # Convert Mel to Hz
55 | bin = np.floor((NFFT + 1) * hz_points / sample_rate)
56 |
57 | fbank = np.zeros((nfilt, int(np.floor(NFFT / 2 + 1))))
58 | for m in range(1, nfilt + 1):
59 | f_m_minus = int(bin[m - 1]) # left
60 | f_m = int(bin[m]) # center
61 | f_m_plus = int(bin[m + 1]) # right
62 |
63 | for k in range(f_m_minus, f_m):
64 | fbank[m - 1, k] = (k - bin[m - 1]) / (bin[m] - bin[m - 1])
65 | for k in range(f_m, f_m_plus):
66 | fbank[m - 1, k] = (bin[m + 1] - k) / (bin[m + 1] - bin[m])
67 | filter_banks = np.dot(pow_frames, fbank.T)
68 | filter_banks = np.where(filter_banks == 0, np.finfo(float).eps, filter_banks) # Numerical Stability
69 | filter_banks = 20 * np.log10(filter_banks) # dB
70 | num_ceps = 13
71 | mfcc = dct(filter_banks, type = 2, axis=1, norm="ortho")[:,1: (num_ceps + 1)] # keep 2-13
72 | cep_lifter = 22
73 | (nframes, ncoeff) = mfcc.shape
74 | n = np.arange(ncoeff)
75 | lift = 1 + (cep_lifter / 2) * np.sin(np.pi * n/ cep_lifter)
76 | mfcc *= lift
77 | return mfcc
78 |
79 | def read_video(file_path):
80 | vr = VideoReader(file_path)
81 | frames = vr.get_batch(range(len(vr))).asnumpy()
82 | return format_frames(
83 | frames,
84 | output_size=(input_size, input_size)
85 | )
86 |
87 | def format_frames(frame, output_size):
88 | frame = tf.image.convert_image_dtype(frame, tf.uint8)
89 | frame = tf.image.resize(frame, size=list(output_size))
90 | return frame
91 |
92 | def uniform_temporal_subsample(
93 | x, num_samples, clip_idx, total_clips, frame_rate=1, temporal_dim=-4
94 | ):
95 | t = tf.shape(x)[temporal_dim]
96 | max_offset = t - num_samples * frame_rate
97 | step = max_offset // total_clips
98 | offset = clip_idx * step
99 | indices = tf.linspace(
100 | tf.cast(offset, tf.float32),
101 | tf.cast(offset + (num_samples-1) * frame_rate, tf.float32),
102 | num_samples
103 | )
104 | indices = tf.clip_by_value(indices, 0, tf.cast(t - 1, tf.float32))
105 | indices = tf.cast(tf.round(indices), tf.int32)
106 | return tf.gather(x, indices, axis=temporal_dim)
107 |
108 |
109 | def clip_generator(
110 | image, num_frames=32, frame_rate=1, num_clips=1, crop_size=224
111 | ):
112 | clips_list = []
113 | for i in range(num_clips):
114 | frame = uniform_temporal_subsample(
115 | image, num_frames, i, num_clips, frame_rate=frame_rate, temporal_dim=0
116 | )
117 | clips_list.append(frame)
118 |
119 | video = tf.stack(clips_list)
120 | video = tf.reshape(
121 | video, [num_clips*num_frames, crop_size, crop_size, 3]
122 | )
123 | return video
124 |
125 | def video_audio(path, save_path):
126 | n = 1
127 |
128 | for class_name in os.listdir(path):
129 | class_dir = os.path.join(path, class_name)
130 | save_dir = os.path.join(save_path, class_name)
131 |
132 | for video_file in os.listdir(class_dir):
133 | video_path = os.path.join(class_dir, video_file)
134 |
135 | video_name = os.path.basename(video_path).split(".")[0]
136 | mp4_name = str(video_name) + '.mp4'
137 | path_video_save = os.path.join(save_dir, mp4_name)
138 |
139 | fourcc = cv2.VideoWriter_fourcc(*'mp4v')
140 | output_video = cv2.VideoWriter(path_video_save, fourcc, 16.0, (224, 224))
141 |
142 | video_ds = read_video(video_path)
143 | video_ds = clip_generator(video_ds, num_frame, sampling_rate, num_clips=1)
144 |
145 | audio_clip = AudioFileClip(video_path)
146 | audio_name = os.path.basename(video_path).split(".")[0]
147 | wave_name = str(audio_name) + '.wav'
148 | path_audio_save = os.path.join('Data\\MEAD\\MEAD_WAVE', wave_name)
149 |
150 | audio_clip.write_audiofile(path_audio_save)
151 | fs, Audiodata = wavfile.read(path_audio_save)
152 | Audiodata = normalize_audio(Audiodata)
153 | step=int((len(Audiodata))/17) - 1
154 | tx=np.arange(0,len(Audiodata),step)
155 |
156 | # Sum of Face and Spectrogram
157 | for i in range(16):
158 | video_img = video_ds.numpy()[i]
159 | video_img = video_img.astype('uint8')
160 | plt.axis('off')
161 |
162 | cv2.imwrite('video_img.jpg',video_img)
163 | video_img = Image.open("video_img.jpg")
164 | video_img = video_img.resize((224, 224))
165 | video_img = np.array(video_img)
166 |
167 | signal=Audiodata[tx[i]:tx[i+2]]
168 | mfcc=MFCC(signal,fs)
169 |
170 | fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(4, 4))
171 | cax = ax.matshow(
172 | np.transpose(mfcc),
173 | interpolation="nearest",
174 | aspect="auto",
175 | # cmap=plt.cm.afmhot_r,
176 | origin="lower",
177 | )
178 |
179 | plt.axis('off')
180 | fig.savefig("MFCC.jpg")
181 | audio_img = Image.open("MFCC.jpg")
182 | audio_img = audio_img.resize((224, 224))
183 | audio_img = np.array(audio_img)
184 |
185 | cv2.imwrite('audio_img.jpg',audio_img)
186 | audio_img = Image.open("audio_img.jpg")
187 | audio_img = audio_img.resize((224, 224))
188 | audio_img = np.array(audio_img)
189 |
190 | img = video_img + audio_img
191 |
192 | plt.close('all')
193 | output_video.write(img)
194 |
195 | output_video.release()
196 | cv2.destroyAllWindows()
197 | n = n + 1
198 |
199 | return n
200 |
201 | if __name__ == '__main__':
202 |
203 | shutil.rmtree("Data\\MEAD\\MEAD_WAVE")
204 | os.mkdir("Data\\MEAD\\MEAD_WAVE")
205 |
206 | path_train = 'Data\\MEAD\\MEAD\\train'
207 | save_train_path = 'Data\\MEAD\\MEAD_SFAS\\train'
208 |
209 | path_test = 'Data\\MEAD\\MEAD\\test'
210 | save_test_path = 'Data\\MEAD\\MEAD_SFAS\\test'
211 |
212 | path_val = 'Data\\MEAD\\MEAD\\val'
213 | save_val_path = 'Data\\MEAD\\MEAD_SFAS\\val'
214 |
215 | n_train = video_audio(path_train, save_train_path)
216 | n_test = video_audio(path_test, save_test_path)
217 | n_val = video_audio(path_val, save_val_path)
218 |
219 | print(n_train)
220 | print(n_test)
221 | print(n_val)
222 |
--------------------------------------------------------------------------------
/MultiMAE-DER_Preprocessing Code/Tool.py:
--------------------------------------------------------------------------------
1 | import os
2 |
3 | def rename(read_path, save_path):
4 | n = 0
5 |
6 | for people_name in os.listdir(read_path):
7 | people_dir = os.path.join(read_path, people_name)
8 |
9 | for class_name in os.listdir(people_dir):
10 | class_dir = os.path.join(people_dir, class_name)
11 | save_dir = os.path.join(save_path, class_name)
12 |
13 | for level_name in os.listdir(class_dir):
14 | level_dir = os.path.join(class_dir, level_name)
15 |
16 | for video_file in os.listdir(level_dir):
17 | video_path = os.path.join(level_dir, video_file)
18 | video_name = os.path.basename(video_path).split(".")[0]
19 |
20 | rename = str(people_name) + '_' + str(class_name) + '_' + str(level_name) + '_' + str(video_name) + '.mp4'
21 | rename_path = os.path.join(save_dir, rename)
22 |
23 | if os.path.exists(video_path):
24 | os.rename(video_path, rename_path)
25 | n = n + 1
26 |
27 | return n
28 |
29 | if __name__ == '__main__':
30 |
31 | read_path = 'Data\\MEAD_RAW'
32 | save_path = 'Data\\MEAD'
33 |
34 | n = rename(read_path, save_path)
35 | print(n)
36 |
--------------------------------------------------------------------------------
/MultiMAE-DER_Preprocessing Code/audio_img.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Peihao-Xiang/MultiMAE-DER/88d3f671f4e5d1e26d4bd04848179320ec674ec2/MultiMAE-DER_Preprocessing Code/audio_img.jpg
--------------------------------------------------------------------------------
/MultiMAE-DER_Preprocessing Code/img.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Peihao-Xiang/MultiMAE-DER/88d3f671f4e5d1e26d4bd04848179320ec674ec2/MultiMAE-DER_Preprocessing Code/img.jpg
--------------------------------------------------------------------------------
/MultiMAE-DER_Preprocessing Code/video_img.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Peihao-Xiang/MultiMAE-DER/88d3f671f4e5d1e26d4bd04848179320ec674ec2/MultiMAE-DER_Preprocessing Code/video_img.jpg
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # MultiMAE-DER: Multimodal Masked Autoencoder for Dynamic Emotion Recognition (IEEE ICPRS 2024)
2 |
3 | [PapersWithCode: Emotion Recognition on RAVDESS](https://paperswithcode.com/sota/emotion-recognition-on-ravdess?p=multimae-der-multimodal-masked-autoencoder)
4 | [PapersWithCode: Video Emotion Recognition on CREMA-D](https://paperswithcode.com/sota/video-emotion-recognition-on-crema-d?p=multimae-der-multimodal-masked-autoencoder)
5 | [PapersWithCode: Multimodal Emotion Recognition on IEMOCAP](https://paperswithcode.com/sota/multimodal-emotion-recognition-on-iemocap-4?p=multimae-der-multimodal-masked-autoencoder)
6 |
7 | > [HCPS Lab](https://hcps.fiu.edu/) [arXiv](https://arxiv.org/abs/2404.18327) [Citation](#citation) [IEEE Xplore](https://ieeexplore.ieee.org/document/10677820)
8 | > [Peihao Xiang](https://scholar.google.com/citations?user=k--3fM4AAAAJ&hl=zh-CN&oi=ao), [Chaohao Lin](https://scholar.google.com/citations?hl=zh-CN&user=V3l7dAEAAAAJ), [Kaida Wu](https://ieeexplore.ieee.org/author/167739911238744), and [Ou Bai](https://scholar.google.com/citations?hl=zh-CN&user=S0j4DOoAAAAJ)
9 | > HCPS Laboratory, Department of Electrical and Computer Engineering, Florida International University
10 |
11 | [Open in Colab: MultiMAE_DER_FSLF.ipynb](https://colab.research.google.com/github/Peihao-Xiang/MultiMAE-DER/blob/main/MultiMAE-DER_Fine-Tuning%20Code/MultiMAE_DER_FSLF.ipynb)
12 | [RAVDESS dataset on Hugging Face](https://huggingface.co/datasets/NoahMartinezXiang/RAVDESS)
13 |
14 | Official TensorFlow implementation and pre-trained VideoMAE models for MultiMAE-DER: Multimodal Masked Autoencoder for Dynamic Emotion Recognition.
15 |
16 | Note: The provided .ipynb is only a minimal example. The VideoMAE encoder is expected to be pre-trained with the MAE-DFER method; the pre-trained encoder is not included in this repository.
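
For orientation only, a fine-tuning step of the kind the notebook performs can be sketched as follows: a pre-trained VideoMAE-style encoder (obtained separately, e.g. with MAE-DFER) is wrapped with a softmax classification head and trained on the fused 16-frame clips produced by the preprocessing scripts. All names below (`videomae_encoder`, `train_ds`, `val_ds`, `num_classes`) are illustrative placeholders, not the notebook's actual code.

```python
import tensorflow as tf

# Assumption: `videomae_encoder` is a pre-trained tf.keras.Model that maps a
# (16, 224, 224, 3) fused clip to a feature vector; it is not provided by this repo.
def build_finetune_model(videomae_encoder: tf.keras.Model, num_classes: int) -> tf.keras.Model:
    clip = tf.keras.Input(shape=(16, 224, 224, 3), name="fused_clip")
    features = videomae_encoder(clip)                        # spatiotemporal embedding
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(features)
    model = tf.keras.Model(clip, outputs)
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# model = build_finetune_model(videomae_encoder, num_classes=8)  # e.g. 8 RAVDESS emotion classes
# model.fit(train_ds, validation_data=val_ds, epochs=10)
```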
17 |
18 | ## Overview
19 |
20 | This paper presents a novel approach to processing multimodal data for dynamic emotion recognition, named the Multimodal Masked Autoencoder for Dynamic Emotion Recognition (MultiMAE-DER). MultiMAE-DER leverages the closely correlated representation information within spatiotemporal sequences across the visual and audio modalities. By building on a pre-trained masked autoencoder model, MultiMAE-DER requires only simple, straightforward fine-tuning. Its performance is enhanced by optimizing six fusion strategies for the multimodal input sequences; these strategies address dynamic feature correlations within cross-domain data across spatial, temporal, and spatiotemporal sequences. Compared with state-of-the-art multimodal supervised learning models for dynamic emotion recognition, MultiMAE-DER improves the weighted average recall (WAR) by 4.41% on the RAVDESS dataset and by 2.06% on CREMA-D. Compared with the state-of-the-art multimodal self-supervised learning model, it achieves a 1.86% higher WAR on the IEMOCAP dataset.
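
To make the idea of a fusion strategy concrete, the sketch below mirrors what the preprocessing scripts in `MultiMAE-DER_Preprocessing Code/` produce: every sampled face frame and its MFCC spectrogram image are resized to 224x224 and merged into one frame of the fused input clip, shown here for the Sum-of-Face-And-Spectrogram variant implemented in `Preprocessing_SFAS.py`. This is a simplified outline for illustration, not a replacement for the scripts.

```python
import numpy as np

def fuse_clip_sfas(face_frames: np.ndarray, spec_images: np.ndarray) -> np.ndarray:
    """Illustrative SFAS (Sum of Face and Spectrogram) fusion.

    face_frames : (16, 224, 224, 3) uint8 face frames sampled from the video.
    spec_images : (16, 224, 224, 3) uint8 MFCC spectrogram images, one per frame.
    Returns a (16, 224, 224, 3) uint8 fused clip that is written out as an .mp4
    and later fed to the MultiMAE-DER encoder.
    """
    assert face_frames.shape == spec_images.shape == (16, 224, 224, 3)
    # Element-wise sum, as in Preprocessing_SFAS.py; uint8 arithmetic wraps above 255.
    return face_frames + spec_images
```

The other scripts in that folder implement the remaining fusion strategies by changing how the face and spectrogram frames are ordered or combined within the 16-frame sequence (for example, `Preprocessing_RFAS.py` samples the frames in random order).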
21 |
22 |
23 | *(Figure: `images/MultiMAE-DER.png`)*
25 | Illustration of our MultiMAE-DER.
26 |
31 | *(Figure: `images/Multimodal_Sequence_Fusion_Strategy.png`)*
33 | Multimodal Sequence Fusion Strategies.
34 |
39 | *(Figure)*
40 | The architecture of MultiMAE-DER.
41 |