Face forgery techniques such as Generative Adversarial Networks (GANs) are widely used for image synthesis in movie production, journalism, and other fields. However, these generative technologies are also widely abused to impersonate credible people and to distribute illegal, misleading, and confusing information to the public. Previous fake face detection methods fail to distinguish between different forgery generation modalities (for example, various GANs), so they do not generalize to open forgery scenarios: they are almost ineffective at identifying fake faces produced by unknown forgery approaches.
To address this challenge, this paper first analyzes the weaknesses of GAN-based generators. Our validation experiments on different face generation models, such as Deepfakes, Face2Face, and FaceSwap, found that detection does not generalize to faces generated by other models. The experiments also revealed that recent GAN-generated fake faces are still not robust enough, because the generators do not consider enough pixels. Inspired by this finding, we design a novel convolutional neural network that uses frequency texture augmentation and knowledge distillation to enhance its global texture perception, describe textures at different semantic levels in images, and improve robustness. The network has two core components: the Discrete Cosine Transform (DCT) and Knowledge Distillation (KDL). DCT serves both to compress images and to help distinguish fake faces from real ones. KDL is used to extract features from both fake and real images, allowing our model to generalize to multiple types of fake face generation methods.
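As a rough illustration of the frequency view that the DCT provides, the following minimal sketch computes an orthonormal 2-D DCT-II of a small grayscale block and measures how much of its energy lies in higher-frequency coefficients. It is not the network described here: the function names (`dct_1d`, `dct_2d`, `high_frequency_energy_ratio`), the `cutoff` rule, and the use of a single energy ratio as a texture cue are all assumptions made for this example.

```python
import math


def dct_1d(signal):
    """Orthonormal DCT-II of a 1-D sequence of numbers."""
    n = len(signal)
    out = []
    for k in range(n):
        total = sum(
            x * math.cos(math.pi * (i + 0.5) * k / n)
            for i, x in enumerate(signal)
        )
        # Orthonormal scaling keeps total energy unchanged (Parseval).
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * total)
    return out


def dct_2d(block):
    """Separable 2-D DCT-II: transform every row, then every column."""
    rows = [dct_1d(row) for row in block]
    cols = [dct_1d(list(col)) for col in zip(*rows)]
    # cols is indexed [v][u]; transpose back to [u][v].
    return [list(row) for row in zip(*cols)]


def high_frequency_energy_ratio(block, cutoff):
    """Fraction of spectral energy at indices with u + v >= cutoff.

    The cutoff rule is a hypothetical choice for this sketch.
    """
    coeffs = dct_2d(block)
    total = sum(c * c for row in coeffs for c in row)
    if total == 0:
        return 0.0
    high = sum(
        c * c
        for u, row in enumerate(coeffs)
        for v, c in enumerate(row)
        if u + v >= cutoff
    )
    return high / total


# A flat block has only a DC component; a checkerboard is all fine texture.
flat = [[1.0] * 4 for _ in range(4)]
checker = [[(-1.0) ** (i + j) for j in range(4)] for i in range(4)]
flat_ratio = high_frequency_energy_ratio(flat, cutoff=1)
checker_ratio = high_frequency_energy_ratio(checker, cutoff=2)
```

In this toy setting the flat block yields a ratio near zero and the checkerboard a ratio near one, which is the kind of contrast a frequency-domain feature can expose; a real detector would feed the full coefficient map to the network rather than a single scalar.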
Experiments on two datasets, Celeb-DF and FaceForensics++, demonstrate that DCT facilitates deepfake detection in some cases and that knowledge distillation plays a key role in our model. Our model achieves better and more consistent performance under image processing operations and in cross-domain settings, especially when images are subject to Gaussian noise.
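For readers unfamiliar with knowledge distillation, the sketch below shows the commonly used temperature-softened formulation, in which a student is trained to match a teacher's softened class probabilities in addition to the ground-truth label. This is an assumption made for illustration: the exact distillation objective of our model is not specified here, and the names `softmax` and `distillation_loss` and the defaults `temperature=4.0` and `alpha=0.5` are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    norm = sum(exps)
    return [e / norm for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=4.0, alpha=0.5):
    """alpha * T^2 * KL(teacher || student) + (1 - alpha) * CE(student, label).

    Generic soft-target loss, not necessarily the one used in this work.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        pt * (math.log(pt) - math.log(ps))
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0
    )
    # Hard-label cross-entropy uses the unsoftened student distribution.
    ce = -math.log(softmax(student_logits)[label])
    return alpha * temperature ** 2 * kl + (1 - alpha) * ce


# Two-class example (index 0 = real, index 1 = fake; an arbitrary convention).
teacher = [2.0, -1.0]
loss_agree = distillation_loss([1.5, -0.5], teacher, label=0)
loss_disagree = distillation_loss([-1.0, 2.0], teacher, label=0)
```

A student whose logits agree with the teacher and the label receives a smaller loss than one that contradicts both, which is the signal that drives the student toward the teacher's representation.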