TY - GEN
T1 - ConTrans-Detect
T2 - 29th International Conference on Mechatronics and Machine Vision in Practice, M2VIP 2023
AU - Sun, Weirong
AU - Ma, Yujun
AU - Zhang, Hong
AU - Wang, Ruili
N1 - Publisher Copyright:
© 2023 IEEE.
PY - 2023
Y1 - 2023
AB - With recent advances in generative deep learning, DeepFakes are synthetic media produced by manipulation, such as swapping a person's face in a video with a face from another video. Deep generative models now make it easy to produce fake videos that are hard to detect. Existing methods have used Convolutional Neural Networks (CNNs) to identify manipulated regions for DeepFake video detection. However, these methods may not fully address the difficulties of learning low-level spatial features and capturing temporal variations, both of which are crucial for face forgery detection. We therefore propose a Convolution-Transformer Deepfake Detection (ConTrans-Detect) model, comprising a multi-scale CNN module for spatial feature representation and a multi-branch Transformer for temporal feature modeling. The multi-scale CNN module uses a 3D Inception block to extract multi-scale low-level features (e.g., edges, corners, and angles) from videos. The multi-branch Transformer module consists of multi-stream Transformer layers, each taking a different temporal resolution and spatial feature dimension as input to perceive various motion variations. Our model achieves an AUC of 0.929 and an F1 score of 0.920, surpassing several state-of-the-art methods on the DeepFake Detection Challenge (DFDC) dataset.
KW - Convolutional neural network
KW - DeepFake video detection
KW - Privacy
KW - Security
KW - Vision transformer
UR - http://www.scopus.com/inward/record.url?scp=85186120766&partnerID=8YFLogxK
U2 - 10.1109/M2VIP58386.2023.10413387
DO - 10.1109/M2VIP58386.2023.10413387
M3 - Conference contribution
AN - SCOPUS:85186120766
T3 - 2023 29th International Conference on Mechatronics and Machine Vision in Practice, M2VIP 2023
BT - 2023 29th International Conference on Mechatronics and Machine Vision in Practice, M2VIP 2023
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 21 November 2023 through 24 November 2023
ER -