Recurrent Convolutional Structures for Audio Spoof and Video Deepfake Detection

Deepfakes, or artificially generated audiovisual renderings, can be used to defame a public figure or influence public opinion. With the recent advent of generative adversarial networks, an attacker using an ordinary desktop computer fitted with an off-the-shelf graphics processing unit can produce renditions realistic enough to easily fool a human observer. Detecting deepfakes is thus becoming important for reporters, social media platforms, and the general public. In this work, we introduce simple, yet surprisingly effective, digital forensic methods for audio spoof and visual deepfake detection. Our methods combine convolutional latent representations with bidirectional recurrent structures and entropy-based cost functions. The latent representations for both audio and video are carefully chosen to extract semantically rich information from the recordings. By feeding these into a recurrent framework, we can detect both spatial and temporal signatures of deepfake renditions. The entropy-based cost functions work well in isolation as well as in conjunction with traditional cost functions. We demonstrate our methods on the FaceForensics++ and Celeb-DF video datasets and the ASVspoof 2019 Logical Access audio dataset, achieving new benchmarks in all categories. We also perform extensive studies to demonstrate generalization to new domains and gain further insight into the effectiveness of the new architectures.
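
To make the architecture concrete, below is a minimal PyTorch sketch of the general convolutional-recurrent pattern the abstract describes: a per-frame convolutional encoder producing latent features, a bidirectional recurrent layer over the frame sequence, and a cross-entropy (entropy-based) objective. All layer sizes, the backbone choice, and the class/variable names (`ConvRecurrentDetector`, `feat_dim`, `hidden_dim`) are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of a convolutional-recurrent deepfake detector.
# Layer sizes and the CNN backbone are illustrative assumptions.
import torch
import torch.nn as nn

class ConvRecurrentDetector(nn.Module):
    def __init__(self, feat_dim=128, hidden_dim=64):
        super().__init__()
        # Per-frame convolutional encoder -> latent feature vector.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        # Bidirectional recurrence over the frame sequence captures
        # temporal inconsistencies left by the generation process.
        self.rnn = nn.LSTM(feat_dim, hidden_dim, batch_first=True,
                           bidirectional=True)
        self.head = nn.Linear(2 * hidden_dim, 2)  # real vs. fake logits

    def forward(self, frames):            # frames: (batch, time, 3, H, W)
        b, t = frames.shape[:2]
        z = self.encoder(frames.flatten(0, 1)).view(b, t, -1)
        h, _ = self.rnn(z)
        return self.head(h[:, -1])        # classify from the final state

# Cross-entropy on the logits is the standard entropy-based objective;
# additional entropy-based terms can be combined with it as auxiliary losses.
model = ConvRecurrentDetector()
logits = model(torch.randn(2, 8, 3, 64, 64))
loss = nn.CrossEntropyLoss()(logits, torch.tensor([0, 1]))
```

The same pattern applies to audio by swapping the 2-D image encoder for one operating on spectrogram-like features before the recurrent layer.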
