Deepfake Detection Across Image, Video, and Audio: A Comprehensive Survey with Empirical Evaluation of Generalization and Robustness
Hong-Hanh Nguyen-Le (1)
Van-Tuan Tran (2)
Dinh-Thuc Nguyen (3)
Nhien-An Le-Khac (1, corresponding author)
(1) School of Computer Science, University College Dublin, Dublin, Ireland
(2) School of Computer Science and Statistics, Trinity College Dublin, Dublin, Ireland
(3) Knowledge Engineering Department, University of Science, Ho Chi Minh City, Vietnam
Abstract
Deepfakes (DFs) have emerged as a significant threat in recent years, being exploited for malicious purposes such as impersonation, misinformation dissemination, and artistic style imitation, thereby raising critical ethical and security concerns. This survey presents a comprehensive analysis of passive DF detection methods across image, video, and audio modalities, addressing critical gaps in existing literature. Unlike previous surveys, which examine modalities in isolation, we explore inter-modality relationships and shared challenges. We systematically categorize detection approaches based on their underlying methodologies: forensic-based, data-driven, fingerprint-based, and hybrid techniques for visual modalities, and handcrafted versus learnable features for audio. We also extend our analysis beyond mere detection accuracy to include essential performance dimensions for real-world deployment, including generalization and robustness. Additionally, this survey provides an in-depth empirical evaluation of 50 state-of-the-art detection methods across 10 popular datasets, assessing their performance in three critical dimensions: (1) within-domain accuracy, (2) cross-domain generalization, and (3) robustness against adversarial attacks. Our experiments reveal a persistent generalization gap, with performance degradations of 15-20% in cross-domain scenarios, and vulnerability to white-box adversarial attacks exceeding 80% success rates. We also analyze the advantages and limitations of existing datasets, benchmarks, and evaluation metrics for passive DF detection. Finally, we propose future research directions to address these unexplored and emerging issues in the field of passive DF detection. This survey serves as a comprehensive resource for researchers and practitioners, providing insights into the current landscape, methodological approaches, and promising future directions in this rapidly evolving field.
Keywords
Deepfake Detection
Generalization
Robustness
Empirical Evaluation
Cross-Modality Analysis
Introduction
The term deepfake (DF) describes synthetic media (e.g., image, video, audio) produced by generative models (GMs) for malicious purposes, causing serious threats across financial, political, and social domains. In February 2024, a sophisticated DF attack on Arup Group caused a
million loss when the attacker used AI-generated avatars of the CEO in a video conference \cite{Chen2024-os}. Similarly, in Singapore, a coordinated campaign impersonating political leaders defrauded
investors of
million through synthetic endorsements of cryptocurrency schemes \cite{Surashit-2024}. Beyond financial fraud, DFs have been used for electoral interference:
voters received AI-generated robocalls that mimicked President Biden's voice patterns to suppress voter turnout \cite{Guardian-2024}.
In response to these emerging threats, two main categories of DF detection approaches have been developed: proactive and passive. Proactive approaches counteract DF manipulation before it occurs by adding perturbations to human data or embedding watermarks during the synthetic data generation process \cite{nguyen2025survey}. In contrast, passive approaches are a popular and straightforward solution against DFs, as they cast detection as a binary classification problem. Passive approaches detect DFs after they are generated, analyzing the intrinsic properties of media to distinguish between authentic and synthetic content. In this work, we focus on passive approaches rather than proactive ones.
Motivation
While there are several surveys of passive DF detection that examine specific modalities in isolation \cite{kaur2024deepfake, mirsky2021creation, wang2202gan, yi2023audio, malik2022deepfake}, the field requires a holistic analysis that transcends these single-modality approaches. This survey is motivated by three fundamental imperatives that are essential for advancing passive DF detection research toward practical and deployable solutions.
First, understanding inter-modality relationships is crucial for developing unified detection strategies. The current landscape of DF detection research suffers from a fragmented approach where image, video, and audio detection methods are developed and evaluated independently. This isolation fails to recognize that DF artifacts manifest differently across modalities, yet there may be underlying commonalities that can be exploited. Image-based detectors that achieve near-perfect accuracy cannot simply be applied frame-by-frame to videos, as they miss critical temporal inconsistencies such as unnatural head movements, flickering artifacts, and inter-frame coherence violations \cite{liu2023ti2net, gu2021spatiotemporal, yin2023dynamic}. Similarly, audio DFs exhibit entirely distinct artifact patterns, such as spectral anomalies, phase coherence irregularities, and prosodic inconsistencies, which require specialized detection architectures \cite{sun2023ai, yan2022initial}. By exploring these inter-modality relationships, we can identify shared features and patterns that transcend individual modalities, paving the way for more robust and versatile detection frameworks.
Second, existing surveys fundamentally fail to address critical properties required for real-world deployment beyond accuracy metrics. Current survey papers \cite{mirsky2021creation, malik2022deepfake, wang2202gan, pei2024deepfake, wang2022deepfake, kaur2024deepfake, yi2023audio} primarily focus on reviewing detection methods based on their accuracy within the same dataset (within-domain evaluation), where reported accuracies are typically high. However, these evaluations do not account for the diverse challenges encountered in real-world scenarios:
Generalization capability is a critical concern, as detectors trained on specific datasets often fail to maintain performance when exposed to unseen data distributions, novel manipulation techniques, or different generator architectures \cite{corvi2023intriguing, ojha2023towards, corvi2023detection, chen2022ost}. While surveys report impressive within-domain accuracies, they fail to examine how these same detectors perform when facing new data types, unseen manipulation techniques, or emerging generator architectures \cite{kaur2024deepfake, malik2022deepfake, wang2202gan, pei2024deepfake}. This gap highlights the need for a comprehensive survey and evaluation of existing generalization methods.
Robustness to adversarial and distortion attacks represents another critical dimension. Existing surveys do not examine how detection methods withstand adversarial attacks designed to evade detection \cite{corvi2023intriguing, hou2023evading} or post-processing operations that degrade detection performance \cite{feng2023self, xu2024learning, le2023quality}. This oversight leaves researchers and practitioners without guidance on the resilience of current methods against real-world manipulations.
Third, the absence of empirical evaluation in existing surveys prevents meaningful comparison and validation of proposed methods. While current survey papers \cite{kaur2024deepfake, malik2022deepfake, wang2202gan, pei2024deepfake} provide extensive taxonomies and theoretical discussions, they lack empirical evaluations that would enable researchers to understand the relative strengths and limitations of different approaches. Without comparative experiments, it is impossible to assess whether reported improvements represent genuine advances or merely result from favorable experimental setups. This survey addresses this critical gap by providing a comprehensive empirical evaluation of current methods across three essential protocols: (1) in-distribution (within-domain) evaluation to establish baseline performance, (2) out-of-distribution (cross-domain) evaluation to assess generalization capabilities, and (3) robustness evaluation against adversarial attacks.
These motivations underscore the need for this comprehensive survey. By providing a unified analysis of detection methods across all modalities, reviewing approaches that address generalization and robustness challenges, and conducting empirical experiments, this survey aims to bridge the gap between academic research and practical deployment requirements.
\begin{landscape}
\begin{longtable}{|p{9cm}|c|c|}
\caption{Comparative analysis of existing surveys on passive DF detection methods}\label{tab:compare}\\
\hline
Survey & Year & Modality \\
\hline
\endfirsthead
\hline
Survey & Year & Modality \\
\hline
\endhead
\hline
\endfoot
The creation and detection of deepfakes: A survey \cite{mirsky2021creation} & 2021 & Image \\
\hline
Deepfake detection for human face images and videos: A survey \cite{malik2022deepfake} & 2022 & Image \& Video \\
\hline
Gan-generated faces detection: A survey and new perspectives \cite{wang2202gan} & 2022 & Image \\
\hline
Audio deepfake detection: A survey \cite{yi2023audio} & 2023 & Audio \\
\hline
A Survey on Speech Deepfake Detection \cite{li2025survey} & 2025 & Audio \\
\hline
Deepfakes generation and detection: state-of-the-art, open challenges, countermeasures, and way forward \cite{masood2023deepfakes} & 2024 & Unimodal \\
\hline
A Survey on the Detection and Impacts of Deepfakes in Visual, Audio, and Textual Formats \cite{mubarak2023survey} & 2024 & Unimodal \\
\hline
Deepfake Generation and Detection: A Benchmark and Survey \cite{pei2024deepfake} & 2024 & Image \& Video \\
\hline
Deepfake video detection: challenges and opportunities \cite{kaur2024deepfake} & 2024 & Video \\
\hline
Evolving from Single-modal to Multimodal Facial Deepfake Detection: A Survey \cite{liu2024evolving} & 2024 & Unimodal \& Multimodal \\
\hline
Deepfake detection: A comprehensive study from the reliability perspective \cite{wang2022deepfake} & 2024 & Image \\
\hline
Ours & 2025 & Unimodal \\
\hline
\end{longtable}
\noindent Each survey is further compared on its coverage of generation techniques, detection methods, generalization, robustness, inter-modality analysis, empirical evaluation (unimodal detection accuracy, generalization, robustness), and dataset \& benchmark analysis, rated on a four-level scale from comprehensive coverage to not covered.
\end{landscape}
Related Survey Work
Previous surveys have typically reviewed a single modality in isolation, including image \cite{mirsky2021creation, malik2022deepfake, wang2202gan, pei2024deepfake, wang2022deepfake}, video \cite{malik2022deepfake, pei2024deepfake, kaur2024deepfake}, and audio \cite{yi2023audio, li2025survey}. Table \ref{tab:compare} compares our work with existing survey papers, highlighting the strengths and drawbacks of each work.
\citeauthor{mirsky2021creation}\cite{mirsky2021creation} provides comprehensive insights into the GAN architectures employed in DF generation, and \citeauthor{pei2024deepfake}\cite{pei2024deepfake} focuses more on benchmarks for evaluating GMs. \citeauthor{wang2202gan}\cite{wang2202gan} reviews detection techniques specifically for identifying GAN-generated artifacts, encompassing DL-based, physical-based, and physiological-based methods, while \citeauthor{malik2022deepfake}\cite{malik2022deepfake} provides broader categories of DF detection approaches in both image and video modalities. \cite{yi2023audio} is the first work that reviews existing approaches for detecting fake audio, while \citeauthor{li2025survey}\cite{li2025survey} extends this by reviewing and evaluating optimization techniques applied in the model training process (e.g., data augmentation), activation functions, and loss functions. Instead of reviewing state-of-the-art (SoTA) detection approaches, \citeauthor{kaur2024deepfake}\cite{kaur2024deepfake} examines real-world applications of DF video detectors, particularly focusing on computational complexity and scalability considerations. Additionally, \citeauthor{wang2022deepfake}\cite{wang2022deepfake} contributes valuable insights into the reliability aspects of image-based DF detection methods. Recent surveys have reviewed single-modal DF detection approaches \cite{masood2023deepfakes, mubarak2023survey} or both single- and multi-modal approaches \cite{liu2024evolving}.
As shown in Table \ref{tab:compare}, existing surveys primarily focus on reviewing generation techniques and detection methods, with limited attention to critical aspects such as generalization, robustness, inter-modality analysis, empirical evaluation, and dataset analysis. Aside from the full coverage of robustness in \cite{wang2022deepfake}, no existing survey provides a systematic categorization of the generalization and robustness capabilities of DF detectors. Furthermore, prior works rely exclusively on results reported in the original papers, without independent evaluation under varied conditions.
Our survey aims to fill these gaps by providing a comprehensive cross-modality review of passive DF detection methods, with a strong emphasis on generalization and robustness. We also conduct extensive empirical evaluations to validate and compare existing approaches (50 approaches across modalities).
Contributions
In summary, the main contributions of our work are as follows:
Comprehensive cross-modality review with real-world focus: We systematically examine passive DF detection across image, video, and audio modalities, analyzing inter-modality relationships and critical real-world requirements including generalization to novel forgery methods and robustness against adversarial and distortion attacks.
Extensive empirical evaluation: We conduct experiments evaluating SoTA unimodal DF detection methods on three protocols: (1) In-distribution accuracy, (2) Out-of-distribution generalization, and (3) Robustness against adversarial attacks. This empirical analysis provides practical insights into the strengths and limitations of existing approaches. Specifically, we evaluate a total of 50 approaches across 10 popular datasets.
Dataset limitation analysis: We identify critical gaps in current benchmark datasets, including class imbalance and insufficient coverage of challenging scenarios (e.g., multiple faces, occlusions).
Emerging challenges and future directions: We synthesize key unexplored areas including detection under challenging conditions, computational efficiency for real-time deployment, privacy-preserving mechanisms, and adaptation to rapidly evolving generative models.
Survey methodology
This survey follows a systematic approach to select relevant literature on passive DF detection and ensure reproducibility of our literature review process.
Literature Search Strategy
Databases. We utilize multiple academic databases to ensure comprehensive coverage of relevant literature, including IEEE Xplore, ACM Digital Library, Google Scholar, and arXiv. We only select papers published in peer-reviewed journals or conferences in the AI and security fields, such as CVPR, ICCV, ECCV, NeurIPS, ICML, AAAI, IJCAI, ICASSP, Interspeech, ACM CCS, IEEE TIFS, IEEE T-PAMI, and IEEE T-NNLS.
Search strings. We employ a combination of keywords and phrases to capture relevant literature. The primary search terms include "deepfake detection," "deepfake generalization," "deepfake robustness," "image deepfake," "video deepfake," "audio deepfake," and "multimodal deepfake detection." We also use Boolean operators (AND, OR) to refine our search queries. For example, we use search strings such as "image AND deepfake detection AND generalization," "image AND deepfake detection AND adversarial attacks," and "audio AND deepfake detection AND contrastive learning." We also include synonyms and related terms to ensure a comprehensive search. For instance, we use "forgery detection" as an alternative to "deepfake detection" and "adversarial robustness" as a synonym for "robustness against adversarial attacks."
Inclusion and exclusion criteria
We establish clear inclusion and exclusion criteria to filter relevant studies. The inclusion criteria are as follows:
The study must be published in a peer-reviewed journal or conference.
The study must focus on deepfake detection using deep learning techniques.
The study is published between 2020 and 2025 (with emphasis on post-2022 work).
The exclusion criteria are as follows:
Studies that do not involve deep learning techniques for deepfake detection.
Studies that focus on generation-only without detection.
Studies that do not provide empirical results or evaluations.
Fig. 1
Overview of our survey structure.
Survey Structure
The rest of this survey is structured as follows: Sect. 9 presents common benchmarks, datasets, and metrics. We review unimodal DF detection methods across image, video, and audio modalities and analyze inter-modality relationships in Sect. 13. Sect. 20 provides a comprehensive review of generalization and robustness techniques. We present our comprehensive empirical evaluation in Sect. 23, which first verifies results reported in the original papers, then evaluates the generalization and robustness capabilities of SoTA methods. Sect. 37 discusses current challenges and future research directions. Finally, we conclude the survey in Sect. 44. An overview of our survey structure is illustrated in Figure 1.
Datasets, Benchmarks, and Metrics
Datasets
The deepfake detection field has produced extensive datasets across visual and audio modalities, which are summarized in Table \ref{tab:passive-dataset}. Visual modality datasets dominate, with over 20 publicly available resources \cite{dang2020detection, he2021forgerynet, le2021openforensics, wang2023dire, yan2024df40, yang2019exposing, korshunov2018deepfakes, rossler2019faceforensics++, dolhansky2019deepfake, li2020celeb, jiang2020deeperforensics, zi2020wilddeepfake, zhou2021face, kwon2021kodf, narayan2022deephy, narayan2023df}, while audio modality datasets are fewer, with 11 notable datasets \cite{wang2020asvspoof, liu2023asvspoof, frank2021wavefake, yi2022add, yi2023add, li2024cross, muller2022does, yaroshchuk2023open, zhang2022partialspoof, muller2024mlaad, yan2024voicewukong}.
In the visual domain, FaceForensics++ (FF++) \cite{rossler2019faceforensics++} is the most widely used training dataset, containing over 1,000 videos manipulated with four different techniques: DeepFakes, Face2Face, FaceSwap, and NeuralTextures. Building on FF++, subsequent datasets expanded in scale and diversity. DFDC \cite{dolhansky2019deepfake} increased volume with 4,113 fake videos, while CelebDF \cite{li2020celeb} focused on improved visual quality with 5,689 celebrity DFs. ForgeryNet \cite{he2021forgerynet} further scaled to 2.9 million images and 220,000 videos across 15 manipulation methods. Recent datasets have begun incorporating diffusion models (DMs) to generate DFs, such as DF40 \cite{yan2024df40}, which includes 40 distinct synthesis models spanning both GANs and DMs, while DiffusionForensics \cite{wang2023dire} specifically targets DMs with 6 different architectures. The audio modality shows a similar evolution. The ASVspoof challenges \cite{wang2020asvspoof, liu2023asvspoof} provided the first audio DF datasets, with hundreds of thousands of samples targeting voice conversion and text-to-speech attacks. WaveFake \cite{frank2021wavefake} introduced a large-scale dataset with 117,985 samples generated by various SoTA models. More recent datasets such as ADD \cite{yi2023add}, MLAAD \cite{muller2024mlaad}, and VoiceWukong \cite{yan2024voicewukong} have expanded the diversity of synthesis techniques and languages covered, spanning up to 38 languages and 50 synthesis models.
However, these datasets exhibit significant limitations that hinder the advancement of real-world DF detection methods. The first issue is severe class imbalance, where fake samples often outnumber real ones, leading to biased training and poor generalization \cite{layton2024sok}. For instance, DFDC \cite{dolhansky2019deepfake} has a 3:1 ratio of fake to real videos, while CelebDF \cite{li2020celeb} has nearly 5 times more fake videos than real ones. Only VoiceWukong \cite{yan2024voicewukong} maintains a balanced distribution between real and fake samples. The second issue is the lack of challenging scenarios. Most datasets contain relatively clean, single-face manipulations under controlled conditions. Only a few datasets, such as OpenForensics \cite{le2021openforensics}, DF-Platter \cite{narayan2023df}, and DeePhy \cite{narayan2022deephy}, involve challenging scenarios such as multiple fake faces in an image or video, low-quality DFs, or face occlusion. Third, many datasets focus on specific generation techniques (e.g., GANs) and do not reflect the latest advancements in generative models (e.g., DMs). While newer datasets like DF40 \cite{yan2024df40} and DiffusionForensics \cite{wang2023dire} have started to include diffusion-based DFs, most existing datasets still lack this diversity. Finally, the most concerning issue is the complete absence of adversarial robustness testing in current datasets. None of the existing datasets incorporate adversarially perturbed DFs designed to evade detection, which is a critical real-world threat \cite{corvi2023intriguing, hou2023evading}.
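A common way to compensate for such skewed real-to-fake ratios during training is inverse-frequency class weighting. The following minimal numpy sketch (function name hypothetical, data illustrative) computes per-class loss weights that can be plugged into a weighted cross-entropy loss:

```python
import numpy as np

def inverse_frequency_weights(labels):
    """Per-class weights w_c = N / (K * N_c): the rarer class receives the
    larger weight, so a weighted loss counteracts the imbalance."""
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    weights = labels.size / (classes.size * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

# e.g., a 3:1 fake-to-real ratio (0 = real, 1 = fake)
weights = inverse_frequency_weights([0, 1, 1, 1])  # real gets weight 2.0, fake 2/3
```

In practice these weights would be passed to the loss function (e.g., per-class weights in a cross-entropy loss) or used to configure a weighted sampler.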
Benchmarks
DeepfakeBench \cite{yan2023deepfakebench} is the first comprehensive public benchmark for DF detection in the image and video modalities, offering an integrated framework of 34 detectors and 10 datasets, standardized data preprocessing, and evaluation protocols. For the audio modality, XMAD-Bench \cite{ciobanu2025xmad} is the first comprehensive benchmark, supporting 12 SoTA detectors and 7 languages. However, these benchmarks do not cover datasets for challenging scenarios, such as multiple fake faces, face occlusions, partial fakes, and low quality. Recently, CDDB \cite{li2023continual} introduced a continual benchmark that simulates a diverse stream of DFs arriving sequentially over time, challenging detection models to learn new fake types without forgetting previously encountered ones. Despite its innovative approach to testing detector robustness against catastrophic forgetting, CDDB has received limited adoption in subsequent research.
Evaluation Metrics
Standard evaluation metrics for DF detection include accuracy, precision, recall, F1-score, area under the receiver operating characteristic curve (AUC-ROC), and equal error rate (EER). Accuracy measures the overall correctness of predictions but can be misleading in imbalanced datasets. Precision quantifies the proportion of true positive detections among all positive predictions, while recall measures the proportion of actual positives correctly identified. The F1-score provides a harmonic mean of precision and recall, balancing both metrics. AUC-ROC evaluates the trade-off between true positive and false positive rates across different thresholds, providing a comprehensive measure of model performance. EER represents the point where false positive and false negative rates are equal, commonly used in audio DF detectors. For robustness evaluation against adversarial attacks, the attack success rate (ASR) metric is commonly used.
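The metrics above can be computed in a few lines. Below is a hedged sketch (function name hypothetical) using scikit-learn, with EER approximated as the ROC operating point where the false positive rate equals the false negative rate:

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support, roc_auc_score, roc_curve

def evaluate_detector(y_true, y_score, threshold=0.5):
    """Compute standard DF-detection metrics from binary labels
    (0 = real, 1 = fake) and fake-probability scores."""
    y_pred = (y_score >= threshold).astype(int)
    acc = float((y_pred == y_true).mean())
    prec, rec, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="binary", zero_division=0)
    fpr, tpr, _ = roc_curve(y_true, y_score)
    # EER: the point on the ROC curve where FPR equals FNR (= 1 - TPR)
    eer = float(fpr[np.nanargmin(np.abs(fpr - (1 - tpr)))])
    return {"accuracy": acc, "precision": prec, "recall": rec,
            "f1": f1, "auc": roc_auc_score(y_true, y_score), "eer": eer}
```

The EER approximation here picks the discrete ROC point closest to the FPR = FNR diagonal; interpolating between points gives a slightly more precise value.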
\begin{landscape}
{\fontsize{6pt}{6pt}\selectfont
\setlength{\tabcolsep}{3pt}
\renewcommand{\arraystretch}{1.5}
\begin{longtable}{c|c|c|cc|cc|c|c|c|c|c|c|c}
\caption{A Summarization of Datasets for Each Modality in Passive DF Detection Approaches. Empty cells denote features not present in the dataset; ``-'' denotes not applicable.\label{tab:passive-dataset}}\\
\toprule
 & Name & Year & \multicolumn{2}{c|}{Samples} & \multicolumn{2}{c|}{Gen.} & SM & \makecell{Mani.\\Tech.} & \makecell{Mult.\\Faces} & \makecell{Low\\Qual.} & \makecell{Face\\Occl.} & Lang. & \makecell{Partial\\Fake} \\
 & & & Real & Fake & GANs & DMs & & & & & & & \\
\hline\hline
\endfirsthead
\toprule
 & Name & Year & \multicolumn{2}{c|}{Samples} & \multicolumn{2}{c|}{Gen.} & SM & \makecell{Mani.\\Tech.} & \makecell{Mult.\\Faces} & \makecell{Low\\Qual.} & \makecell{Face\\Occl.} & Lang. & \makecell{Partial\\Fake} \\
 & & & Real & Fake & GANs & DMs & & & & & & & \\
\hline\hline
\endhead
\hline
\multicolumn{14}{r}{Continued on next page} \\
\endfoot
\bottomrule
\endlastfoot
\multirow{6}{*}{\rotatebox{90}{Image}}
 & DFFD \cite{dang2020detection} & 2019 & 58,703 & 240,336 & \checkmark & & 7 & FS, FR, FE & & & & - & \\
 & ForgeryNet \cite{he2021forgerynet} & 2021 & 1,438,201 & 1,457,861 & \checkmark & & 15 & FS, FR, FE & & & & - & \\
 & OpenForensics \cite{le2021openforensics} & 2021 & 160,467 & 173,660 & \checkmark & & 1 & FS & \checkmark & & \checkmark & - & \\
 & DiffusionForensics \cite{wang2023dire} & 2023 & 134,000 & 137,200 & & \checkmark & 9 & ES & & & & - & \\
 & DiffusionFace \cite{chen2024diffusionface} & 2024 & 30,000 & 600,000 & & \checkmark & 11 & FS, ES, FE & & & & - & \\
 & DF40 \cite{yan2024df40} & 2024 & 1,590 & 1M & \checkmark & \checkmark & 40 & ES, FE & & & & - & \\
\hline
\multirow{14}{*}{\rotatebox{90}{Video}}
 & UADFV \cite{yang2019exposing} & 2018 & 49 & 49 & \checkmark & & 1 & FS & & & & - & \\
 & DF-TIMIT \cite{korshunov2018deepfakes} & 2018 & 320 & 640 & \checkmark & & 2 & FS & & \checkmark & & - & \\
 & DFFD \cite{dang2020detection} & 2019 & 1,000 & 3,000 & \checkmark & & 7 & FS, FR, FE & & & & - & \\
 & FF++ \cite{rossler2019faceforensics++} & 2019 & 1,000 & 4,000 & \checkmark & & 4 & FS, FR & & & & - & \\
 & DFDC \cite{dolhansky2019deepfake} & 2019 & 1,131 & 4,113 & \checkmark & & 8 & - & & & & - & \\
 & Celeb-DF \cite{li2020celeb} & 2020 & 590 & 5,639 & \checkmark & & 1 & FS, FR & & & & - & \\
 & DF-1.0 \cite{jiang2020deeperforensics} & 2020 & 50,000 & 10,000 & \checkmark & & 1 & FS & & & & - & \\
 & Wild-DF \cite{zi2020wilddeepfake} & 2021 & 3,805 & 3,509 & \checkmark & & 1 & - & & & & - & \\
 & ForgeryNet \cite{he2021forgerynet} & 2021 & 99,630 & 121,617 & \checkmark & & 15 & FS, FR, FE & & & & - & \\
 & FFIW \cite{zhou2021face} & 2021 & 10,000 & 10,000 & \checkmark & & 3 & FS & & & & - & \\
 & KoDF \cite{kwon2021kodf} & 2021 & 62,166 & 175,776 & \checkmark & & 6 & FS, FR & & & & - & \\
 & DeePhy \cite{narayan2022deephy} & 2022 & 100 & 5,040 & \checkmark & & 3 & FS, FR & & & \checkmark & - & \\
 & DF-Platter \cite{narayan2023df} & 2023 & 133,260 & 132,496 & \checkmark & & 3 & FS & \checkmark & \checkmark & \checkmark & - & \\
 & DF40 \cite{yan2024df40} & 2024 & 1,590 & 0.1M & \checkmark & \checkmark & 40 & FS, FR & & & & - & \\
\hline
\multirow{11}{*}{\rotatebox{90}{Audio}}
 & ASVspoof 2019 \cite{wang2020asvspoof} & 2019 & 41,913 & 300,678 & - & - & - & VC, TTS & - & & - & en & \\
 & ASVspoof 2021 \cite{liu2023asvspoof} & 2021 & 22,617 & 589,212 & - & - & - & VC, TTS & - & & - & en & \\
 & WaveFake \cite{frank2021wavefake} & 2021 & 18,100 & 117,985 & \checkmark & & 6 & VC & - & & - & en, jp & \\
 & ADD 2022 \cite{yi2022add} & 2022 & 36,953 & 123,932 & - & - & - & VC & - & \checkmark & - & ch & \checkmark \\
 & ADD 2023 \cite{yi2023add} & 2023 & 172,819 & 113,042 & - & - & - & VC & - & \checkmark & - & ch & \checkmark \\
 & ITW \cite{muller2022does} & 2022 & 17,000 & 14,000 & - & - & - & VC & - & & - & en & \\
 & ODSS \cite{yaroshchuk2023open} & 2023 & 11,032 & 18,993 & \checkmark & & 2 & TTS & - & & - & en, es, de & \\
 & PS \cite{zhang2022partialspoof} & 2023 & 12,483 & 121,461 & - & - & - & VC, TTS & - & & - & en & \checkmark \\
 & CD-ADD \cite{li2024cross} & 2024 & 28,212 & 120,459 & \checkmark & \checkmark & 5 & VC, TTS & - & & - & en & \\
 & MLAAD \cite{muller2024mlaad} & 2024 & 20,000 & 134,000 & \checkmark & \checkmark & 82 & TTS & - & & - & many (38) & \\
 & VoiceWukong \cite{yan2024voicewukong} & 2024 & 413,400 & 413,400 & \checkmark & \checkmark & 34 & VC, TTS & - & \checkmark & - & en, ch & \\
\end{longtable}}
\end{landscape}
Unimodal Detection
This section provides a comprehensive overview of unimodal approaches, which develop detectors for a specific modality, typically either visual or audio, to identify manipulated content. In the following subsections, we first provide a comprehensive review of SoTA approaches for each modality (Sec. 14, 15, 16), then discuss their strengths, limitations, and inter-modality relationships in Sec. 17. Figure 2 illustrates the taxonomy of unimodal DF detection approaches.
Fig. 2
Taxonomy of unimodal DF detection approaches.
Image Modality
We divide DF image detection methods into four main categories: forensic-based, data-driven, fingerprint-based, and hybrid. Figure 3 illustrates the four main categories in the image modality.
Fig. 3
Illustration of passive DF detection approaches in image modality.
Forensic-based methods.
Methods in this group detect DFs based on predefined rules related to differences in physics-based and physiological signals between real and fake images, for example, inconsistencies between the left and right eyes or unnatural lighting conditions in fake images. The key insight behind these approaches is that generators cannot perfectly replicate the complex physical properties and biological consistencies inherent in natural image formation. Some studies reveal that natural images exhibit consistent relationships between luminance and chrominance components \cite{chen2021robust, talib2025chrominance}; thus, analyzing multiple color spaces can provide a more comprehensive view of potential manipulation traces. Other works use cross-color spatial co-occurrence matrices \cite{qiao2023csc} or the Wasserstein distance \cite{amin2024exploring} to analyze statistical relationships between the color channels of real and fake images. Biological inconsistencies are also analyzed to identify DFs, such as differences between the left and right eyes \cite{hu2021exposing, wang2022eyes} and discrepancies between the internal face region and the surrounding context \cite{nirkin2021deepfake}.
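To make the color-statistics idea concrete, the numpy-only sketch below (all names and data hypothetical) compares per-channel intensity distributions of two images using the 1-D Wasserstein distance between equal-size empirical samples; genuine forensic pipelines such as \cite{qiao2023csc, amin2024exploring} operate on far richer cross-channel statistics.

```python
import numpy as np

def wasserstein_1d(a, b):
    """1-D Wasserstein distance between two equal-size empirical samples:
    the mean absolute difference of the sorted values."""
    return np.abs(np.sort(a) - np.sort(b)).mean()

def channel_distances(img_a, img_b):
    """Per-channel (R, G, B) distribution distances between two HxWx3 images."""
    return [
        wasserstein_1d(img_a[..., c].ravel().astype(float),
                       img_b[..., c].ravel().astype(float))
        for c in range(3)
    ]

rng = np.random.default_rng(0)
real = rng.integers(0, 256, (64, 64, 3))
fake = np.clip(real + rng.normal(0, 8, real.shape), 0, 255)  # perturbed copy
dists = channel_distances(real, fake)  # one distance per color channel
```

A detector built on this cue would compare such channel statistics against reference distributions learned from authentic images.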
Data-driven methods. Instead of relying on predefined rules, these methods leverage deep learning (DL) to automatically learn discriminative features from large-scale datasets. Based on how they formulate the detection problem, data-driven approaches can be categorized into three main groups.
(1) Conventional classification methods cast DF detection as a binary classification problem, where the model learns to classify an input image as real or fake \cite{yang2021mtd, sha2023fake}. However, simply treating DF detection as binary classification may not be optimal due to the subtle and localized nature of the differences between real and fake images \cite{zhao2021multi, han2023fcd}. \citeauthor{zhao2021multi}\cite{zhao2021multi} reformulated detection as a fine-grained classification task that enables the detector to focus on local regions, while \citeauthor{han2023fcd}\cite{han2023fcd} reformulated it as multi-task learning by designing a network that detects multiple types of DF face images simultaneously.
(2) Pixel-level segmentation methods formulate detection as a dense prediction task that generates localization maps identifying manipulated regions at the pixel level. By predicting manipulation masks, these methods not only determine whether the input is real or fake but also pinpoint which image regions have been manipulated. This localization capability yields more interpretable results and can help reveal the manipulation techniques used. These methods localize the manipulated area in a fully supervised setting through an encoder-decoder architecture \cite{huang2022fakelocator, wang2022lisiam, mazaheri2022detection, katamneni2024contextual} or attention mechanisms \cite{das2022gca, guo2023hierarchical, hong2024contrastive}. However, such approaches require dense pixel-wise ground-truth masks, which are not always available. \citeauthor{tantaru2024weakly}\cite{tantaru2024weakly} overcome this limitation by applying the GradCAM explainability technique to the activations of a classification network to highlight the regions most predictive of the fake class.
(3) Reconstruction-based learning methods formulate detection through the lens of image reconstruction. These approaches typically reconstruct the input image with an encoder-decoder architecture and analyze the reconstruction errors. The key principle is to learn what real images should look like through reconstruction, so that fake images reconstruct differently from real ones. However, \citeauthor{shi2023discrepancy}\cite{shi2023discrepancy} recognized that single-reconstruction methods \cite{he2021beyond, cao2022end, liu2023fedforgery, guo2024deepfake} have limited feature representation and proposed a double-head reconstruction module that combines discrepancy-guided encoding, dual reconstruction, and aggregation-based detection to improve forgery detection performance.
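The reconstruction-error principle can be sketched with a linear stand-in: below, PCA fitted on "real" samples plays the role of the learned autoencoder. The cited methods use deep encoder-decoders; everything here, including the synthetic data, is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for "real" images: samples lying near a low-dimensional subspace.
basis = rng.standard_normal((4, 32))                 # 4 latent directions in R^32
real = rng.standard_normal((200, 4)) @ basis + 0.01 * rng.standard_normal((200, 32))

# Fit a linear "autoencoder" (PCA) on real data only.
mean = real.mean(axis=0)
U, S, Vt = np.linalg.svd(real - mean, full_matrices=False)
components = Vt[:4]                                  # top-4 principal directions

def reconstruction_error(x):
    """Encode into the learned subspace, decode back, and measure what is lost."""
    z = (x - mean) @ components.T                    # encode
    x_hat = z @ components + mean                    # decode
    return float(np.linalg.norm(x - x_hat))

real_err = reconstruction_error(real[0])             # near zero: on the manifold
fake = rng.standard_normal(32)                       # off-manifold sample
fake_err = reconstruction_error(fake)                # much larger
```

Thresholding `fake_err` against errors observed on held-out real data yields the detection rule these methods rely on.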
Fingerprint-based methods explore the unique patterns or artifacts unintentionally embedded into generated images by the architectural design and training process of GMs. The key insight is that different GMs leave distinct fingerprints in their outputs. For GAN-generated images, these methods focus on artifacts introduced by the upsampling operations commonly used in generator architectures: several studies have shown that the frequency spectrum of GAN-generated images often contains distinctive "checkerboard" patterns, particularly in the middle and high-frequency bands \cite{frank2020leveraging, qian2020thinking, dzanic2020fourier, liu2020global, tan2024rethinking}. Regarding images generated by DMs, the iterative denoising process tends to leave characteristic traces that can be detected through careful analysis \cite{wang2023dire, ma2023exposing}. Recently, \citeauthor{tan2024rethinking}\cite{tan2024rethinking} explored how upsampling operations in GANs and DMs create distinctive patterns in the relationships between neighboring pixels, while the work \cite{tan2024frequency} applies convolutional layers to the phase and amplitude spectra of the network's internal feature maps rather than solely analyzing the frequency artifacts of the input image.
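To make the spectral intuition concrete, the toy sketch below measures the fraction of energy outside the low/mid-frequency band of the 2D spectrum: zero-insertion upsampling (the first step of a transposed convolution) replicates the low-resolution spectrum into high-frequency bands, whereas a smooth signal concentrates its energy near DC. This is an illustration of the underlying signal-processing fact, not a reproduction of any cited detector.

```python
import numpy as np

def high_freq_ratio(img):
    """Fraction of spectral energy outside the central (low/mid-frequency) band."""
    spec = np.abs(np.fft.fftshift(np.fft.fft2(img))) ** 2
    h, w = spec.shape
    q = h // 4
    total = spec.sum()
    center = spec[q:h - q, q:w - q].sum()            # low/mid frequencies
    return float((total - center) / total)

rng = np.random.default_rng(2)
low = rng.standard_normal((32, 32))

# Zero-insertion upsampling, as used inside transposed convolutions: the
# spectrum of the result is a periodic replica of the low-res spectrum,
# duplicating its energy into the high-frequency bands.
up = np.zeros((64, 64))
up[::2, ::2] = low

# A smooth reference signal with almost no high-frequency content.
x = np.linspace(0, 2 * np.pi, 64)
smooth = np.sin(x)[:, None] * np.cos(x)[None, :]

r_up = high_freq_ratio(up)        # large: spectral replicas fill high bands
r_smooth = high_freq_ratio(smooth)  # near zero: energy sits near DC
```

Fingerprint detectors build far more refined statistics on top of this, but the separation between `r_up` and `r_smooth` captures the core cue.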
Hybrid methods. Detection approaches that rely solely on spatial-domain information have been shown to be highly susceptible to variations in dataset quality and post-processing operations \cite{qian2020thinking, wang2020cnn}. To address this limitation, hybrid approaches leverage complementary information from different domains to enhance detection performance. The key insight is that DFs often leave traces across multiple feature spaces, and combining these cues can provide more robust detection than relying on a single domain. Based on the types of features being fused, hybrid methods can be categorized into two main groups.
(1) Spatial and frequency fusion methods combine features from the spatial domain (capturing visual content and local artifacts) with frequency-domain information (revealing generation artifacts in the spectral space). The spatial stream typically employs convolutional networks to extract content-level features, while the frequency stream analyzes spectral patterns through the discrete cosine transform (DCT) or discrete Fourier transform (DFT) \cite{miao2023f, miao2022hierarchical, wang2022m2tr, wang2023dynamic, luo2021generalizing}. This dual-stream architecture helps capture both semantic inconsistencies and subtle frequency-domain artifacts introduced during generation.
(2) Spatial and noise fusion methods integrate standard RGB features with noise patterns extracted through SRM noise filters \cite{fridrich2012rich}. These approaches recognize that deepfake generation often leaves distinctive noise patterns that differ from those found in authentic images. By analyzing both content features and noise residuals, these methods can better distinguish between natural imaging noise and synthetic artifacts \cite{kong2022detect, shuai2023locate}. Instead of relying on handcrafted SRM noise filters, TruFor \cite{guillaro2023trufor} trains a Noiseprint++ extractor in a self-supervised manner using contrastive learning to capture stronger noise traces.
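A noise residual is obtained by convolving the image with a zero-sum high-pass kernel. The sketch below uses one classic residual predictor kernel of the kind employed in the rich models of \cite{fridrich2012rich}; the full SRM uses a large bank of such filters plus quantization, none of which is shown here.

```python
import numpy as np

# One classic zero-sum high-pass residual kernel from steganalysis rich models;
# SRM-style detectors apply a whole bank of such filters.
KB = np.array([[-1,  2, -1],
               [ 2, -4,  2],
               [-1,  2, -1]], dtype=np.float64) / 4.0

def noise_residual(img, kernel=KB):
    """Valid-mode 2D convolution: suppresses smooth content, keeps noise."""
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * img[i:i + h - kh + 1, j:j + w - kw + 1]
    return out

# On content-dominated (locally linear) regions the residual vanishes, so any
# remaining signal is sensor or generator noise -- the cue these methods fuse
# with RGB features.
x = np.linspace(0, 1, 64)
smooth = x[:, None] + x[None, :]       # a pure gradient image, no noise
res = noise_residual(smooth)           # essentially all zeros
```

Because the kernel sums to zero and annihilates linear ramps, the residual of the gradient image is zero up to floating-point error, demonstrating the content-suppression property.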
Fig. 4
Illustration of passive DF detection approaches in video modality.
Video Modality
Comparison with image modality. Video-based DF detection presents unique challenges compared to image-based detection \cite{vahdati2024beyond}. On the one hand, approaches developed for image detection can be naturally extended to videos by applying them to individual frames: forensic-based methods can analyze physical or physiological inconsistencies in each frame, while data-driven approaches can classify frames independently, with the final decision obtained by aggregating frame-level predictions through voting or averaging mechanisms. However, frame-by-frame analysis fails to capture a crucial aspect of DF videos: temporal coherence. Manipulated videos often exhibit subtle temporal inconsistencies, such as unnatural head movements, inconsistent facial expressions across frames, or flickering artifacts in manipulated regions. These temporal artifacts cannot be detected by examining frames in isolation, making purely image-based approaches insufficient for robust video DF detection. It is therefore imperative to develop dedicated video-level approaches that explicitly model temporal relationships. In this survey, we categorize detection approaches in the video modality into two primary groups: (i) forensic-based and (ii) data-driven, illustrated in Figure 4.
Forensic-based methods primarily extend insights from image-modality forensic analysis to video by examining physical and biological inconsistencies frame by frame. \citeauthor{xia2022towards}\cite{xia2022towards} analyze multiple color channels, while \citeauthor{huda2024fake}\cite{huda2024fake} use multiple texture feature descriptors to capture surface inconsistencies per frame. Other studies analyze physiological signals that should remain consistent in authentic videos but may be imperfectly replicated in DFs, such as blood flow patterns reflected in skin color changes \cite{jeon2022deepfake}, facial movements \cite{sun2021improving, li2023forensic}, or differences between front and side face images \cite{li2023forensic}. Advanced methods \cite{sun2021improving, li2023forensic} utilize recurrent neural networks to model the temporal characteristics of the embedded feature sequences.
Data-driven approaches. Based on how temporal information is processed, these methods can be categorized into two main approaches: Frame-level and Video-level.
(1) Frame-level methods essentially apply image-modality DL techniques to individual frames, treating the video as a sequence of independent images. These methods typically use CNN architectures to classify each frame as real or fake, then aggregate the frame-level predictions through techniques such as majority voting or temporal averaging to reach a video-level decision \cite{wang2023noise, ciamarra2024deepfake, bonettini2021video}. To capture a broad set of forgery clues (e.g., blending ghosts, skin tone inconsistencies, tooth details, stitching seams), \citeauthor{ba2024exposing}\cite{ba2024exposing} propose a local disentanglement module that extracts multiple local representations and fuses them into a global semantic-rich feature. Rather than relying on per-frame predictions, some works \cite{gu2021spatiotemporal, gu2022delving, gu2022region, gu2022hierarchical} focus on capturing local inconsistency within densely sampled video snippets, where the entire video is densely divided into multiple snippets. While straightforward to implement, these frame-based approaches may miss temporal inconsistencies that are only visible when analyzing frame sequences.
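The aggregation step can be sketched as follows; the hypothetical `video_decision` helper below also shows why both averaging and majority voting can miss a video in which only a few frames carry strong manipulation cues.

```python
import numpy as np

def video_decision(frame_probs, how="average", threshold=0.5):
    """Aggregate per-frame fake probabilities into one video-level decision.

    'average' thresholds the mean probability; 'vote' takes the majority of
    per-frame decisions. Both are common baselines for frame-level detectors.
    Returns 1.0 for fake, 0.0 for real.
    """
    p = np.asarray(frame_probs, dtype=float)
    if how == "average":
        return float(p.mean() > threshold)
    if how == "vote":
        return float((p > threshold).mean() > 0.5)
    raise ValueError(how)

# A video where only two of seven frames show strong manipulation cues:
probs = [0.1, 0.2, 0.15, 0.95, 0.9, 0.1, 0.2]
avg = video_decision(probs, "average")   # mean ~0.37 -> declared real
vote = video_decision(probs, "vote")     # 2/7 frames fake -> declared real
```

Both aggregation rules declare this video real even though two frames are confidently flagged, which is precisely the failure mode that motivates the video-level methods discussed next.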
(2) Video-level methods explicitly model temporal relationships through architectures designed to capture both spatial and temporal information. Early attempts combined convolutional networks with recurrent networks to extract global spatio-temporal features \cite{montserrat2020deepfakes, saikia2022hybrid}; however, this approach has been shown to be less effective \cite{zhao2023istvt}. Recent approaches utilize advanced architectures or techniques that capture long-range dependencies and inter-frame inconsistencies, such as transformers \cite{zhao2023istvt, coccomini2022combining, zhao2022self, zhang2022deepfake}, 3D CNNs \cite{zhang2021detecting}, temporal convolution networks \cite{zheng2021exploring}, and attention mechanisms \cite{hu2024delocate}.
Another line of work identifies identity-related inconsistencies in DF videos. These approaches recognize that while DFs may maintain visual quality in individual frames, they often struggle to consistently preserve identity characteristics across an entire video sequence. Two main strategies are employed. The first utilizes pre-trained face recognition models to extract identity embeddings and analyzes how these embeddings evolve over time to detect unnatural variations \cite{agarwal2020detecting, huang2023implicit}. The second captures spatio-temporal identity features directly through end-to-end training, using 3D face reconstruction or identity-specific transformers \cite{cozzolino2021id, liu2023ti2net, dong2022protecting}. Recently, MinTime \cite{coccomini2024mintime} addressed a crucial limitation of previous identity-driven approaches, which struggle with videos containing multiple people: the authors introduce an identity-aware attention mechanism that applies masking to process each identity in the video independently, enabling the detector to handle multiple identities.
Fig. 5
Illustration of passive DF detection approaches in audio modality.
Audio Modality
Comparison with image modality. Methodologies developed for visual media cannot be directly applied to detect DF audio \cite{lee2024tug}. Image DF detection typically relies on visual artifacts such as inconsistent lighting, unnatural eye reflections, or imperfect facial blending, which can be analyzed using CNNs trained on pixel-level data. In contrast, audio signals exist primarily as 1D temporal waveforms or 2D time-frequency representations, where artifacts manifest as spectral anomalies, inconsistent phase coherence, or irregular prosodic patterns. For example, synthetic voices often exhibit unnatural pauses, overly consistent pitch modulation, or artifacts in high-frequency bands that are imperceptible to humans but detectable via mel-spectrogram analysis. Given this distinct nature of audio signals, current data-driven approaches differ primarily in how they extract discriminative acoustic features before classification, typically following a two-stage pipeline: a front-end for feature extraction and a back-end for classification. The front-end transforms raw audio signals into acoustic representations, while the back-end analyzes these features to make a binary real/fake decision. A further group of audio DF detection methods, fingerprint-based approaches, explores artifacts left by speech synthesis models, especially vocoders. Figure 5 illustrates these approaches.
Data-driven methods. Based on the front-end feature extraction techniques employed, data-driven methods can be classified into two main categories.
(1) Handcrafted-feature-based methods use expert-designed audio processing techniques as the front-end to convert the audio waveform into predefined acoustic features. These features fall into two main categories, physical and perceptual, each capturing different aspects that help distinguish authentic from synthetic speech \cite{li2022comparative}. Mel-frequency cepstral coefficients (MFCC) capture the spectral envelope of speech, which often exhibits unnatural patterns in synthetic audio due to imperfect modeling of vocal tract characteristics \cite{alzantot2019deep}. Constant-Q transform (CQT) and spectrogram representations reveal frequency distributions and power spectrum patterns that may be distorted in DF audio \cite{yang2018extended, xue2022audio, kwak2021resmax, wani2024abc}.
In contrast, perceptual features are designed to capture characteristics that align with human auditory perception and natural speech properties. Techniques like jitter and shimmer measure frequency and amplitude variations, respectively, capturing micro-perturbations that are naturally present in the human voice but often missing or incorrectly reproduced in synthetic speech \cite{gao2021detection}. Chromagram analysis examines pitch-related features, helping identify unnatural pitch progressions or tonal qualities in DF audio \cite{saleem2019voice}. These extracted features provide rich information for the classifier to distinguish between real and fake audio.
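All of these handcrafted front-ends build on a time-frequency analysis of the waveform. The sketch below computes a minimal log-power spectrogram with plain NumPy; MFCC and CQT add mel filterbanks, cepstral transforms, or geometrically spaced frequency bins on top of such a representation (none of that is shown here).

```python
import numpy as np

def log_power_spectrogram(y, n_fft=256, hop=128):
    """Minimal STFT front-end: frame the signal, window, FFT, log power.

    Handcrafted front-ends such as MFCC or CQT build on exactly this kind of
    time-frequency representation before the back-end classifier.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(y) - n_fft) // hop
    frames = np.stack([y[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return np.log(spec + 1e-10)            # shape: (n_frames, n_fft // 2 + 1)

sr = 8000
t = np.arange(sr) / sr                     # one second of "audio"
tone = np.sin(2 * np.pi * 440 * t)         # a 440 Hz test tone
S = log_power_spectrogram(tone)

# The spectral peak should sit at the FFT bin nearest 440 Hz.
peak_bin = int(S.mean(axis=0).argmax())
peak_hz = peak_bin * sr / 256
```

With a 256-point FFT at 8 kHz the bin spacing is 31.25 Hz, so the peak lands within one bin of 440 Hz, confirming the front-end behaves as intended.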
(2) Learnable-feature-based methods leverage NNs to automatically extract discriminative features directly from raw audio in an end-to-end (E2E) manner. These methods design specialized network architectures that learn feature representations through supervised learning, such as spectro-temporal graph attention networks \cite{jung2022aasist, huang2023discriminative}, Inception-style architectures \cite{hua2021towards}, RawNet-based architectures \cite{tak2021end}, and Conformer-based architectures \cite{shin2024hm, rosello2023conformer}. Taking advantage of Mamba \cite{gu2023mamba}, \citeauthor{chen2024rawbmamba}\cite{chen2024rawbmamba} propose a novel E2E bidirectional state space model for audio DF detection that captures longer-range relationships. \citeauthor{yang2024robust}\cite{yang2024robust} introduce a multi-view feature fusion approach that concatenates handcrafted and learnable features to capture a broad range of audio characteristics. For resource-constrained applications, \citeauthor{chakravarty2024lightweight}\cite{chakravarty2024lightweight} propose a lightweight feature extractor using linear discriminant analysis to reduce a 2048-dimensional feature vector to a single, highly discriminative feature.
Inspired by advances in speech self-supervised learning, several works \cite{wang2024can, tran2024spoofed, guo2024audio} leverage self-supervised learning (SSL) speech models such as Wav2vec2 and XLS-R as front-end feature extractors for DF detection. These models, pre-trained on large amounts of unlabeled speech data, demonstrate remarkable capacity for capturing subtle acoustic characteristics across diverse speaking conditions. However, naively fine-tuning these large pre-trained models for DF detection presents challenges, including overfitting on limited training data and computational overhead \cite{martin2024exploring, wu2024adapter}. To address these concerns, \citeauthor{martin2024exploring}\cite{martin2024exploring} propose a lightweight downstream classifier with minimal trainable parameters to preserve the generalized audio representations of the SSL model, while \citeauthor{pan2024attentive}\cite{pan2024attentive} selectively fine-tune only early and middle layers to reduce computational requirements. Beyond traditional supervised learning paradigms, some researchers have explored alternative training strategies, including re-synthesizing real samples using various neural vocoders \cite{doan2024balance} or modeling only genuine speech representations and classifying any sample falling outside these established boundaries as fake \cite{kim2024one}.
Fingerprint-based methods. As in the image modality, these methods recognize that different vocoder architectures produce characteristic artifacts in the frequency spectrum, phase patterns, or temporal structure of the generated audio. The core idea is to identify and extract these vocoder-specific "fingerprints" through signal analysis. For instance, neural vocoders often struggle to reproduce certain aspects of natural speech, such as phase coherence across frequency bands or precise harmonic structures. In particular, these methods have shown that spectrograms of original audio exhibit more natural and consistent high-frequency components and harmonic structures than those of synthetic audio \cite{yan2022initial, sun2023ai}.
Comparative Analysis across Modalities
Strengths and Limitations between Methods of each Modality
Table \ref{tab:passive-detection} summarizes the key ideas, strengths, and limitations of unimodal approaches across modalities.
Image modality. Forensic-based and fingerprint-based methods provide strong interpretability for their detection decisions and do not rely on specific generator architectures or manipulation techniques. However, they may become less effective against DM-based images, which can preserve semantic content \cite{vahdati2024beyond}, and they struggle with low-quality fake images or images subjected to distortion \cite{barni2020cnn}. Data-driven methods can automatically learn discriminative features from data and enable end-to-end training. However, they depend on the quality and quantity of training data, generalize poorly across datasets and manipulation techniques, and lack interpretability since the detectors are often black-box.
Video modality. Video detection methods extend image-based approaches by incorporating temporal dynamics, enabling detection of flickering artifacts, unnatural movements, and identity inconsistencies across frames. Nevertheless, video methods face substantial challenges in practical deployment. The computational overhead from processing multiple frames simultaneously creates bottlenecks in real-time applications, with memory requirements often exceeding available GPU capacity for high-resolution content. Real-world video compression and transmission artifacts further degrade performance, as many video-based detectors are trained on high-quality datasets and struggle with lower-quality inputs.
Audio modality. Handcrafted-feature-based methods are computationally efficient and provide interpretability by using predefined acoustic features. However, they often lack generalization to out-of-domain data and cannot adapt to the specific characteristics of different datasets \cite{yang2024robust}. Learnable-feature-based methods offer an end-to-end training process and can capture complex temporal patterns across multiple time scales. However, using large pre-trained SSL models as front-end feature extractors can lead to computational overhead and overfitting to limited downstream data. Additionally, learned features often lack clear acoustic meaning, making their decisions harder to explain. Fingerprint-based methods provide better explanations for their decisions and are less sensitive to speech content. However, they are vulnerable to distortion techniques that can remove artifacts.
\begin{landscape}{\fontsize{6.5pt}{6.5pt}\selectfont\setlength{\tabcolsep}{3pt}\renewcommand{\arraystretch}{2}
\begin{longtable}{>{\centering\arraybackslash}p{0.12\textwidth}|>{\centering\arraybackslash}p{0.14\textwidth}|>{\centering\arraybackslash}p{0.12\textwidth}|>{\centering\arraybackslash}p{0.2\textwidth}|>{\raggedright\arraybackslash}p{0.2\textwidth}|>{\raggedright\arraybackslash}p{0.3\textwidth}}
\caption{A Summarization of Ideas, Strengths, and Limitations of Approaches across Multi-Modality\label{tab:passive-detection}}\\
\hline
 & Approach & Article & Key Idea & Strengths & Limitations \\
\hline
\endfirsthead
\multicolumn{6}{c}{\tablename\ \thetable{} -- Continued from previous page} \\
\hline
 & Approach & Article & Key Idea & Strengths & Limitations \\
\hline
\endhead
\hline
\multicolumn{6}{r}{Continued on next page} \\
\endfoot
\hline
\endlastfoot
\hhline{======}
\multicolumn{6}{c}{Image Modality} \\
\hline
\multirow{2}{*}{\makecell{\textbf{Forensic-}\\\textbf{based}}}
 & Physical-based & \cite{chen2021robust}, \cite{amin2024exploring}, \cite{qiao2023csc}, \cite{talib2025chrominance} & Color statistics analysis
 & \multirow{2}{*}{\parbox{\linewidth}{- Provide strong interpretability for the detection decisions\newline - Do not rely on generator architectures or manipulation techniques}}
 & \multirow{2}{*}{\parbox{\linewidth}{- May become less effective for DM-based images that can preserve semantic content \cite{vahdati2024beyond}\newline - Struggle with low-quality fake images or images subjected to distortion \cite{barni2020cnn}}} \\
 & Physiologically-based & \cite{nirkin2021deepfake}, \cite{hu2021exposing}, \cite{wang2022eyes} & Biological consistency between eyes or facial symmetry & & \\
\hline
\multirow{3}{*}{\makecell{\textbf{Data-}\\\textbf{driven}}}
 & Conventional classification & \cite{yang2021mtd}, \cite{sha2023fake}, \cite{zhao2021multi}, \cite{han2023fcd} & Cast as simple binary classification or fine-grained classification problem
 & \multirow{3}{*}{\parbox{\linewidth}{- Automatically learn discriminative features from data\newline - End-to-end training}}
 & \multirow{3}{*}{\parbox{\linewidth}{- Rely on the quality and quantity of training data\newline - Limited generalization across different datasets or manipulation techniques\newline - Lack of interpretability since detectors are black-box}} \\
 & Pixel-level segmentation & \cite{huang2022fakelocator}, \cite{wang2022lisiam}, \cite{mazaheri2022detection}, \cite{katamneni2024contextual}, \cite{das2022gca}, \cite{guo2023hierarchical}, \cite{hong2024contrastive}, \cite{tantaru2024weakly} & Cast as segmentation problem to predict manipulated regions & & \\
 & Reconstruction-based learning & \cite{he2021beyond}, \cite{cao2022end}, \cite{liu2023fedforgery}, \cite{shi2023discrepancy}, \cite{guo2024deepfake} & Formulate through the lens of image reconstruction & & \\
\hline
\multirow{2}{*}{\makecell{\textbf{Fingerprint-}\\\textbf{based}}}
 & Upsampling operation fingerprint & \cite{frank2020leveraging}, \cite{qian2020thinking}, \cite{dzanic2020fourier}, \cite{liu2020global}, \cite{tan2024rethinking}, \cite{tan2024frequency} & Upsampling operations leave checkerboard patterns in the frequency domain
 & \multirow{2}{*}{\parbox{\linewidth}{- Less sensitive to image content since methods focus on model-specific artifacts\newline - Provide clearer reasoning for their decision compared to black-box DL methods}}
 & \multirow{2}{*}{\parbox{\linewidth}{- Generators can remove these fingerprints through additional constraints \cite{chandrasegaran2021closer, neves2020ganprintr}\newline - Vulnerable to distortion techniques which can remove these fingerprints}} \\
 & Denoising process fingerprint & \cite{wang2023dire}, \cite{ma2023exposing} & Denoising process leaves lower reconstruction errors between fake and reconstructed images & & \\
\hline
\multirow{2}{*}{\makecell{\textbf{Hybrid}}}
 & Spatial and frequency fusion & \cite{miao2023f}, \cite{miao2022hierarchical}, \cite{wang2022m2tr}, \cite{wang2023dynamic} & Leverage spatial and frequency domain
 & \multirow{2}{*}{\parbox{\linewidth}{- Capture a broader range of artifacts through different modalities\newline - More robust against single-modality evasion techniques}}
 & \multirow{2}{*}{\parbox{\linewidth}{- Processing multiple feature streams requires more computational resources and may increase inference time\newline - Multi-stream architectures can be harder to train}} \\
 & Spatial and noise fusion & \cite{kong2022detect}, \cite{shuai2023locate}, \cite{guillaro2023trufor} & Leverage RGB features and noise patterns & & \\
\hline
\hhline{======}
\multicolumn{6}{c}{Video Modality} \\
\hline
\multirow{2}{*}{\makecell{\textbf{Forensic-}\\\textbf{based}}}
 & Physical-based & \cite{xia2022towards}, \cite{huda2024fake} & Color statistics or surface inconsistencies analysis frame-by-frame
 & \multirow{2}{*}{\parbox{\linewidth}{- Provide strong interpretability for the detection decisions\newline - Do not rely on generator architectures or manipulation techniques}}
 & \multirow{2}{*}{\parbox{\linewidth}{- May become less effective for DM-based images that can preserve semantic content \cite{vahdati2024beyond}\newline - Struggle with low-quality fake images or images subjected to distortion \cite{barni2020cnn}}} \\
 & Physiologically-based & \cite{sun2021improving}, \cite{li2023forensic}, \cite{jeon2022deepfake} & Skin color changes, facial movements, and symmetrical face patches & & \\
\hline
\multirow{3}{*}{\makecell{\textbf{Data-}\\\textbf{driven}}}
 & Frame-based & \cite{wang2023noise}, \cite{ciamarra2024deepfake}, \cite{ba2024exposing}, \cite{bonettini2021video}, \cite{gu2021spatiotemporal}, \cite{gu2022delving}, \cite{gu2022region}, \cite{gu2022hierarchical} & Based on CNN architectures to classify each frame as real or fake and then fuse them for the final decision & - Take advantage of image-level approaches & - Lack temporal information to detect DFs\newline - Low performance when detecting DF videos \cite{vahdati2024beyond} \\
\hline
 & \multirow{2}{*}{Video-level} & \cite{montserrat2020deepfakes}, \cite{saikia2022hybrid}, \cite{coccomini2022combining}, \cite{zhang2022deepfake}, \cite{zhang2021detecting}, \cite{zheng2021exploring} & Design architectures or techniques that can capture inter-frame inconsistencies
 & \multirow{2}{*}{\parbox{\linewidth}{- Capture both spatial and temporal relationships to identify inconsistencies in video}}
 & \multirow{2}{*}{\parbox{\linewidth}{- High computational and GPU memory requirements for processing multiple frames simultaneously\newline - Identity-driven approaches may struggle when dealing with face reenactment where the identity is preserved throughout the manipulated video}} \\
 & & \cite{agarwal2020detecting}, \cite{huang2023implicit}, \cite{cozzolino2021id}, \cite{liu2023ti2net}, \cite{dong2022protecting}, \cite{coccomini2024mintime} & Explore identity inconsistencies over the video & & \\
\hline
\hhline{======}
\multicolumn{6}{c}{Audio Modality} \\
\hline
\multirow{2}{*}{\makecell{\textbf{Data-}\\\textbf{driven}}}
 & Handcrafted-feature-based & \cite{alzantot2019deep}, \cite{yang2018extended}, \cite{xue2022audio}, \cite{kwak2021resmax}, \cite{gao2021detection}, \cite{saleem2019voice}, \cite{wani2024abc} & Utilize expert-based audio processing techniques to extract physical and perceptual features for DF audio detection & - Extracting predefined features typically requires less computational power\newline - Makes detection decisions more transparent and explainable & - Lack of generalization to out-of-domain data \cite{yang2024robust}\newline - Predefined features cannot adapt to the specific characteristics of different datasets \\
 & Learnable-feature-based & \cite{jung2022aasist}, \cite{huang2023discriminative}, \cite{hua2021towards}, \cite{tak2021end}, \cite{shin2024hm}, \cite{chakravarty2024lightweight}, \cite{chen2024rawbmamba}, \cite{yang2024robust}, \cite{wang2024can}, \cite{tran2024spoofed}, \cite{guo2024audio}, \cite{martin2024exploring}, \cite{pan2024attentive}, \cite{doan2024balance}, \cite{kim2024one} & Leverage supervised-trainable front-ends or pre-trained self-supervised models to learn optimal feature representations & - Offer an E2E training process\newline - Capture sophisticated temporal patterns across multiple time scales & - Using SSL models as front-end feature extractors can lead to computational overhead and overfitting to limited downstream data\newline - Learned features often lack clear acoustic meaning, making their decisions harder to explain \\
\hline
\makecell{\textbf{Fingerprint-}\\\textbf{based}} & Vocoder fingerprint & \cite{yan2022initial}, \cite{sun2023ai} & Explore artifacts left by vocoders & - Provide better explanation for the decisions\newline - Less sensitive to speech content & - Vulnerable to distortion techniques that can remove artifacts \\
\hline
\end{longtable}}\end{landscape}
Inter-connected Relationships between Modalities
While each modality has unique characteristics, there are inter-connected relationships and shared challenges across modalities.
Fundamental detection principles. Despite input-level differences, all modalities converge on three core detection principles. First, each exploits artifacts specifically introduced during the generation process, including upsampling artifacts in images, temporal flickering in videos, and vocoder fingerprints in audio. Second, frequency domain analysis proves universally effective, revealing checkerboard patterns in images, temporal frequency inconsistencies in videos, and spectral anomalies in audio. Third, identity consistency manifests differently across modalities through facial symmetry in images, temporal identity tracking in videos, and speaker characteristics in audio.
Cross-modality knowledge transfer. Successful techniques from one modality can potentially transfer to others. The effectiveness of self-supervised learning models (Wav2vec2, XLS-R) in audio detection suggests similar pre-trained visual models could enhance image and video detection. Reconstruction-based approaches that learn authentic data distributions in images show promise for adaptation to audio spectrograms. Similarly, self-blending augmentation techniques proven effective for images have successfully transferred to video domains, indicating potential benefits for audio applications.
Fig. 6
Taxonomy of generalization and robustness approaches in passive DF detection.
Real-world Requirements of Effective Passive Deepfake Detection
In addition to high detection accuracy, real-world DF detection systems must meet several critical requirements to be effective in practice, including generalization to unseen data and robustness against adversarial and distortion attacks. In this section, we discuss each requirement in detail, along with current approaches proposed to address them. Figure 6 provides a taxonomy of existing approaches to enhance the generalization and robustness of DF detectors.
Generalization
Generalization refers to the capability of DF detectors to maintain performance when confronted with previously unseen data distributions, manipulation techniques, and generator architectures. This capability is crucial as DF generation techniques continuously evolve. Researchers typically assess generalization through two protocols: (i) within-domain evaluation and (ii) cross-domain evaluation. We identify five key approaches to enhance generalization: data augmentation, designed training strategies, disentanglement learning, unsupervised learning, and adaptive learning.
Data augmentation (DA) approaches enhance model generalization by expanding training data diversity. Rather than collecting additional samples, these methods apply various transformations to existing data, creating a broader distribution of examples for the model to learn from. The augmentation operations vary by modality: (i) image modality: Gaussian blur, JPEG compression, random crop \cite{ni2022core, luo2023beyond}; (ii) video modality: temporal dropout, temporal repeat, clip-level blending \cite{wang2023altfreezing}; (iii) audio modality: band-pass filters, time- and pitch-shifting, RawBoost \cite{doan2024balance, martin2024exploring}. Other works employ more sophisticated DA techniques, including dynamically erasing sensitive facial regions \cite{wang2021representative, das2021towards} to prevent overfitting and latent space augmentation to simulate variations within forgery features \cite{yan2023transcending, huang2025generalizable}. Contrastive learning (CL) has been integrated with DA to eliminate task-irrelevant contextual factors \cite{sun2022dual}.
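The common image-side operations above can be sketched in a minimal numpy pipeline (ours; `quantize` is only a crude stand-in for true JPEG compression, and all function names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)

def gaussian_blur(img, sigma=1.0):
    """Separable Gaussian blur via two 1-D convolutions."""
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x ** 2 / (2 * sigma ** 2))
    k /= k.sum()
    out = np.apply_along_axis(lambda row: np.convolve(row, k, mode="same"), 1, img)
    return np.apply_along_axis(lambda col: np.convolve(col, k, mode="same"), 0, out)

def quantize(img, step=16):
    """Coarse intensity quantisation, a crude stand-in for JPEG compression."""
    return (img // step) * step

def random_crop(img, size=96):
    h, w = img.shape
    top = rng.integers(0, h - size + 1)
    left = rng.integers(0, w - size + 1)
    return img[top:top + size, left:left + size]

def augment(img):
    """Apply one randomly chosen transformation, as in typical DA pipelines."""
    op = rng.choice(["blur", "jpeg", "crop", "none"])
    if op == "blur":
        return gaussian_blur(img.astype(float))
    if op == "jpeg":
        return quantize(img)
    if op == "crop":
        return random_crop(img)
    return img

img = rng.integers(0, 256, size=(128, 128))
batch = [augment(img) for _ in range(4)]
```

Sampling a fresh transformation per training example is what broadens the effective data distribution the detector sees.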
Designed training strategy methods structure the learning process to better capture fundamental differences between real and synthetic content. Rather than treating DF detection as a simple binary classification problem, they incorporate inductive biases about the underlying structure of manipulation artifacts into the training process, helping models learn more abstract, generalizable representations. \citeauthor{choi2024exploiting}\cite{choi2024exploiting} propose a two-stage training approach that first learns a robust style representation through supervised CL and then integrates it with content features for DF detection. This idea is extended by \citeauthor{zhu2025slim}\cite{zhu2025slim} to the audio modality. \citeauthor{cheng2024can}\cite{cheng2024can} introduce a progressive training strategy that organizes the latent feature space so that the transition from real to fake follows a specific path: Real $\rightarrow$ Blendfake (self-blended) $\rightarrow$ Blendfake (cross-blended) $\rightarrow$ DF. This strategy helps the model learn a more structured representation of forgery artifacts rather than treating each type as disconnected examples. Other works aim to mitigate the dependence on large-scale labeled data by generating synthetic training examples during the training process. \citeauthor{chen2022self}\cite{chen2022self} leverages the generator-discriminator framework with self-supervised auxiliary tasks to encourage learning generalizable features. In contrast, other works produce more challenging forgery artifacts by re-synthesizing real samples \cite{doan2024balance} or self-blending transformed versions of real samples \cite{shiohara2022detecting, yan2024generalizing}. \citeauthor{hasanaath2025fsbi}\cite{hasanaath2025fsbi} extends the idea of self-blending by extracting frequency-domain features from self-blended images to amplify artifact information. Meanwhile, FreqBlender \cite{hanzhefreqblender} directly generates pseudo-fake faces in the frequency domain by using a decoder to estimate the probability map of the corresponding frequency knowledge.
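A minimal sketch of the self-blending idea (ours; the actual pipelines of \cite{shiohara2022detecting} use facial landmarks, colour transforms, and learned masks rather than the fixed circular mask and shift assumed here):

```python
import numpy as np

def soft_mask(h, w, cy, cx, radius):
    """Soft circular blending mask with linear falloff."""
    yy, xx = np.mgrid[0:h, 0:w]
    return np.clip(1.0 - np.hypot(yy - cy, xx - cx) / radius, 0.0, 1.0)

def self_blend(img, shift=2, radius=24):
    """Create a pseudo-fake by blending a slightly perturbed copy of an image
    back into itself, leaving blending-boundary artifacts inside the mask."""
    perturbed = np.roll(img, shift, axis=1) * 0.95 + 2.0  # small shift + jitter
    h, w = img.shape
    m = soft_mask(h, w, h // 2, w // 2, radius)
    return m * perturbed + (1.0 - m) * img

rng = np.random.default_rng(1)
face = rng.normal(size=(64, 64))       # stand-in for an aligned face crop
pseudo_fake = self_blend(face)

# Differences appear only inside the blended region; pixels far from the
# mask centre are untouched, so the "fake" label comes for free.
diff = np.abs(pseudo_fake - face)
```

Because the pseudo-fake is derived from a single real sample, no generator or manual annotation is needed to supply forgery-like training data.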
Disentanglement learning techniques separate content-specific information (e.g., identity, background) from manipulation artifacts, allowing models to focus specifically on learning manipulation-related features without content-specific biases. \citeauthor{liang2022exploring}\cite{liang2022exploring} designs a dual-encoder architecture that separately extracts content and artifact features and employs CL to maximize the separation between these feature spaces. \citeauthor{yan2023ucf}\cite{yan2023ucf} proposes a multi-task disentanglement framework with a contrastive regularization loss that encourages separation between common and specific forgery features, while \citeauthor{guo2023controllable}\cite{guo2023controllable} constructs a controllable geometric embedding space to decouple irrelevant correlations. DiffusionFake \cite{chen2025diffusionfake} forces the detection network to learn disentangled representations of the source and target features inherent in DFs by leveraging a pre-trained Stable Diffusion model to reconstruct both source and target identities, helping the model identify the mixed identity information that DFs contain.
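The separation objective behind such dual-encoder designs can be illustrated with a toy cosine-based regularizer (ours; the cited works employ richer contrastive losses and learned encoders):

```python
import numpy as np

def cosine(a, b):
    """Row-wise cosine similarity between two batches of feature vectors."""
    num = (a * b).sum(axis=-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + 1e-8
    return num / den

def separation_loss(content_feats, artifact_feats):
    """Penalise absolute cosine similarity so the two encoders' feature
    spaces are pushed toward orthogonality."""
    return float(np.abs(cosine(content_feats, artifact_feats)).mean())

rng = np.random.default_rng(0)
content = rng.normal(size=(8, 16))
entangled = content + 0.01 * rng.normal(size=(8, 16))     # artifact encoder copying content
independent = rng.normal(size=(8, 16))                    # artifact encoder ignoring content

entangled_loss = separation_loss(content, entangled)      # near 1: poor separation
independent_loss = separation_loss(content, independent)  # much smaller
```

Minimising this term during training discourages the artifact branch from simply re-encoding identity or background information.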
Unsupervised learning methods leverage unlabeled data to learn more generalizable representations. Without relying on explicit labels, which are time-consuming to collect, these approaches focus on learning universal patterns that distinguish authentic from manipulated content. To eliminate the need for pixel-level forgery annotations, UPCL \cite{zhuang2022uia} uses multivariate Gaussian estimation to model the distribution of real and fake image patches in a face region. Unsupervised contrastive learning (UCL) has been employed in \cite{qiao2024fully} to model the feature space based on intrinsic data characteristics, where similar samples are pulled together while dissimilar samples are pushed apart. Several researchers have explored transfer learning from self-supervised representations on generic datasets before fine-tuning for DF detection \cite{kim2024one, oorloff2024avff}.
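The pull-together/push-apart behaviour described above corresponds to the standard NT-Xent contrastive loss, sketched here on toy features (the encoders and augmentations of \cite{qiao2024fully} are abstracted away):

```python
import numpy as np

def nt_xent(z1, z2, tau=0.5):
    """NT-Xent contrastive loss: (z1[i], z2[i]) are positive pairs; every
    other sample in the concatenated batch serves as a negative."""
    z = np.concatenate([z1, z2], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = z @ z.T / tau
    n = len(z1)
    np.fill_diagonal(sim, -np.inf)             # exclude self-similarity
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    logits = sim - sim.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * n), pos].mean()

rng = np.random.default_rng(0)
base = rng.normal(size=(4, 8))
z1 = base + 0.05 * rng.normal(size=(4, 8))   # two augmented "views" per sample
z2 = base + 0.05 * rng.normal(size=(4, 8))

loss_pos = nt_xent(z1, z2)                        # aligned views: low loss
loss_rand = nt_xent(z1, rng.normal(size=(4, 8)))  # unrelated pairs: higher loss
```

The loss is low only when each sample's two views are closer to each other than to the rest of the batch, which is exactly the intrinsic-similarity structure UCL exploits without labels.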
Adaptive learning approaches enable DF detectors to continuously adapt to new manipulation methods, generator families, or data distributions without compromising performance on previously learned knowledge. Commonly used techniques include meta-learning \cite{sun2021domain, han2023sigma, zhu2024tmfd, chen2022ost}, incremental learning \cite{pan2023dfil}, knowledge distillation \cite{lu2024one}, and domain generalization \cite{xie2023domain}. We refer readers to \cite{hospedales2021meta, van2022three, gou2021knowledge, zhou2022domain} for a deeper understanding of these techniques. Instead of fine-tuning the entire model, some studies design an adaptation module that trains only a few parameters of the detector on new types of fake samples, for example by leveraging low-rank adaptation matrices \cite{zhang2023adaptive, wu2024adapter} or by designing a domain alignment loss on the source and target domains \cite{seraj2024semi}. Similarly, \citeauthor{oiso2024prompt}\cite{oiso2024prompt} uses prompt tuning to adapt pre-trained models to new target domains by adding learnable prompts to the intermediate feature vectors in the front-end. \citeauthor{yan2025orthogonal}\cite{yan2025orthogonal} proposes to use Singular Value Decomposition (SVD) to explicitly decompose a pre-trained model's feature space into two orthogonal subspaces, preserving general knowledge while learning new forgery-specific patterns in the residual components. Recently, \citeauthor{nguyen2025think}\cite{nguyen2025think} introduces an adaptation method that enables backpropagation-based approaches to adapt to new forgery types without requiring any training data. The key novelty is an Uncertainty-aware Negative Learning objective that uses noisy pseudo-labels during online test-time adaptation to encourage the model to explore alternative options rather than reinforcing potentially incorrect initial predictions.
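The SVD-based decomposition of \citeauthor{yan2025orthogonal}\cite{yan2025orthogonal} can be illustrated in a few lines (ours; the rank `k` and the projector `P_res` are illustrative choices, not the paper's exact procedure):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 16))        # a pre-trained layer's weight matrix

U, S, Vt = np.linalg.svd(W)          # full SVD, singular values in descending order
k = 8                                # rank of the preserved "general knowledge" part
W_general = (U[:, :k] * S[:k]) @ Vt[:k]   # principal subspace: kept frozen
W_residual = W - W_general                # residual subspace: free to adapt

# Projecting an update onto the residual column space guarantees the
# principal (general-knowledge) directions are left untouched.
delta = 0.1 * rng.normal(size=W.shape)
P_res = np.eye(16) - U[:, :k] @ U[:, :k].T
W_adapted = W_general + W_residual + P_res @ delta
```

Confining forgery-specific updates to the residual subspace is what lets the model learn new patterns without overwriting the knowledge encoded in the dominant singular directions.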
Robustness
Robustness reflects the detector's ability to maintain reliable performance under various attack scenarios. We identify two critical threats that can compromise DF detectors' resilience: adversarial attacks and distortion attacks. Table \ref{tab:robustness} summarizes existing methods to improve robustness of DF detectors.
Threat landscape. DF detectors are increasingly vulnerable to adversarial attacks, which introduce carefully crafted perturbations into DF content to deceive detection systems while remaining imperceptible to humans. Recent studies have demonstrated that even SoTA detection models can be significantly compromised by adversarial examples (AEs) \cite{li2021exploring, jia2022exploring, neekhara2021adversarial, hou2023evading}. In contrast to adversarial attacks, distortion attacks employ various post-processing operations to alter DF quality, including compression, resizing, noise addition, and color adjustments. The studies in \cite{corvi2023intriguing, xu2024profake} show that these techniques effectively obscure generation artifacts, substantially reducing detection performance. These attacks are particularly concerning as they often mirror common media processing operations, making them difficult to distinguish from benign transformations. Therefore, it is crucial to develop methods that enhance the detectors' robustness to both adversarial and distortion attacks.
Robustness to adversarial attacks. Adversarial training \cite{bai2021recent} is widely adopted to improve model robustness against adversarial attacks by incorporating adversarial examples into the training process. This technique has been demonstrated to be effective across both visual and audio modalities \cite{devasthale2022adversarially, nguyen2024d}. Beyond conventional adversarial training, more sophisticated approaches have recently emerged, including adaptive adversarial training that dynamically adjusts training samples based on attack difficulty \cite{kawa2022defense} and frequency-based adversarial training that uses a generator to produce frequency-level perturbation maps \cite{jeong2022frepgan}. For the black-box setting, D4 \cite{hooda2024d4} presents an ensemble-based method using multiple detector models to examine disjoint parts of the frequency spectrum, forcing attackers to find perturbations effective across multiple frequency ranges. \citeauthor{khan2024adversarially}\cite{khan2024adversarially} designs an adversarial similarity loss that maintains feature-space proximity between original samples and their adversarial versions. Both techniques reduce the space of possible attacks and make successful attacks much harder to generate.
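A minimal sketch of FGSM-based adversarial training (ours; a logistic-regression "detector" on toy Gaussian features stands in for a deep model, and all hyperparameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm(x, y, w, b, eps):
    """Fast Gradient Sign Method against a logistic-regression 'detector':
    perturb inputs in the direction that increases the BCE loss."""
    grad_x = (sigmoid(x @ w + b) - y)[:, None] * w  # dL/dx for BCE loss
    return x + eps * np.sign(grad_x)

# Toy features standing in for real (label 0) and fake (label 1) samples.
X = np.concatenate([rng.normal(-1.0, 1.0, (100, 4)), rng.normal(1.0, 1.0, (100, 4))])
y = np.concatenate([np.zeros(100), np.ones(100)])

w, b = np.zeros(4), 0.0
for _ in range(200):                      # adversarial training loop
    X_adv = fgsm(X, y, w, b, eps=0.1)     # craft AEs against the current model
    X_mix = np.concatenate([X, X_adv])    # train on clean + adversarial batches
    y_mix = np.concatenate([y, y])
    p = sigmoid(X_mix @ w + b)
    w -= 0.1 * X_mix.T @ (p - y_mix) / len(y_mix)
    b -= 0.1 * (p - y_mix).mean()

acc_clean = float(((sigmoid(X @ w + b) > 0.5) == y).mean())
acc_adv = float(((sigmoid(fgsm(X, y, w, b, 0.1) @ w + b) > 0.5) == y).mean())
```

Regenerating the adversarial batch against the current parameters at every step is the defining ingredient: the model is always trained on the attacks it is currently most vulnerable to.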
Robustness to distortion attacks. One straightforward approach is to simulate diverse distortion techniques during the training process. Multiple studies have demonstrated that data augmentation significantly improves detection robustness by exposing models to the distortion techniques they might encounter in real-world scenarios \cite{luo2023beyond, wang2023altfreezing, yan2023transcending}. Beyond simple augmentation, \citeauthor{xu2024profake}\cite{xu2024profake} combines a generator with a progressive learning strategy to train models from easy scenarios (non-degraded samples) to increasingly difficult ones (heavily degraded samples). This method helps the model build robust representations while avoiding the training instability that can occur when heavily degraded examples are introduced immediately. \citeauthor{feng2023self}\cite{feng2023self} casts the problem as anomaly detection, where both original DFs and their degraded versions are considered anomalous deviations from authentic content. However, these methods assume that the distortion techniques are known in advance, an assumption that does not hold in real-world settings where adversaries can apply unknown techniques. To mitigate this, OST \cite{chen2022ost} applies the test-time-training technique \cite{liu2021ttt++}, which enables detectors to adapt to unknown techniques at inference time.
Real-world resilience represents a critical dimension of DF detection systems that focuses on maintaining reliable performance under naturally occurring real-world conditions. While robustness addresses deliberate adversarial and distortion attacks, real-world resilience concerns the detector's ability to function effectively despite routine transformations media undergoes without malicious intent. \circled{1} First, content uploaded to social media platforms (Instagram, TikTok, YouTube) or messaging applications (WhatsApp, Telegram) undergoes automatic compression with varying quality settings based on bandwidth considerations, device specifications, and platform requirements. For instance, YouTube employs different compression algorithms for different resolution options, while WhatsApp significantly reduces image quality during transmission. \circled{2} Second, detection systems must handle wide variations in media presentation. Videos uploaded to TikTok may be automatically fitted with thumbnails that resize content with different aspect ratios, potentially cropping key areas of the frame. Similarly, Instagram Reels limit video length and apply format-specific processing. In audio contexts, platforms like Instagram or Facebook limit clip duration, forcing content to be truncated. \circled{3} Third, videos in the real world often suffer from natural misalignment between their audio and visual streams due to technical issues in encoding and recording processes. A common example is when there is a consistent shift (time delay) of a few frames between what is seen and what is heard throughout the video. This creates a significant challenge for multimodal detectors since they cannot simply flag videos where the audio and visuals do not perfectly sync up. These transformations can inadvertently degrade detection performance by altering or removing subtle manipulation artifacts that detectors rely on.
One straightforward way to address the first problem is to employ multiple models for different quality levels \cite{liao2023famm, woo2022add, lee2022bznet}. However, these methods have significant limitations: (i) they incur additional computational and training-data overhead, and (ii) they are impractical in real-world settings because they require prior knowledge of the input video quality to select the appropriate detection model. \citeauthor{le2023quality}\cite{le2023quality} address these gaps by proposing a universal intra-model collaborative learning framework that enables effective simultaneous detection of DFs of different qualities using a single model. Differently, \citeauthor{chen2024compressed}\cite{chen2024compressed} uses 3D spatiotemporal trajectories to analyze facial landmark movements across frames, under the hypothesis that video compression does not significantly alter the distribution of facial landmarks. In the audio modality, FTDKD \cite{wang2024ftdkd} uses knowledge distillation to help the student model recover from the teacher model the high-frequency information lost during compression. Regarding the second problem, the works in \cite{xu2023tall, xu2024learning} simulate thumbnails by using augmentation techniques that mask fixed-size square areas at the same positions within frames. To handle the third problem in multimodal detectors, \citeauthor{feng2023self}\cite{feng2023self} utilizes an autoregressive model to estimate the temporal offset between each video frame and its corresponding audio signal, effectively capturing the time-delay distribution.
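A much simpler stand-in for such offset estimation is normalised cross-correlation over candidate lags (ours; \citeauthor{feng2023self}\cite{feng2023self} use a learned autoregressive model rather than this heuristic, and the feature tracks here are synthetic):

```python
import numpy as np

def estimate_offset(visual, audio, max_lag=10):
    """Estimate a constant audio-visual delay (in frames) by maximising the
    normalised cross-correlation over candidate integer lags."""
    best_lag, best_corr = 0, -np.inf
    for lag in range(-max_lag, max_lag + 1):
        c = np.corrcoef(visual, np.roll(audio, lag))[0, 1]
        if c > best_corr:
            best_lag, best_corr = lag, c
    return best_lag

rng = np.random.default_rng(0)
visual = rng.normal(size=200)                # per-frame mouth-motion feature track
audio = np.roll(visual, -3) + 0.1 * rng.normal(size=200)  # stream shifted by 3 frames

lag = estimate_offset(visual, audio)         # recovers the 3-frame misalignment
```

Once the constant offset is estimated and compensated, a multimodal detector can reserve its "out-of-sync" signal for genuinely manipulated content rather than benign encoding delays.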
Empirical Evaluation and Analysis
While existing surveys have extensively reviewed deepfake detection methods from theoretical and architectural perspectives \cite{li2025survey, mirsky2021creation, wang2202gan, pei2024deepfake, kaur2024deepfake}, there remains a critical gap in systematic empirical evaluations of these methods under diverse evaluation protocols. To address this limitation, we conduct comprehensive experiments with three distinct objectives. First, we verify the reported performance of 13 popular unimodal detection methods across image, video, and audio modalities, providing independent validation of their baseline detection capabilities. Second, we systematically evaluate 33 methods specifically designed to enhance generalization, assessing their effectiveness in maintaining performance across different datasets. Third, we examine 6 approaches that claim to improve robustness against adversarial attacks, measuring their resilience when confronted with deliberately crafted adversarial examples. Through this multi-faceted evaluation, we not only validate claimed improvements in generalization and robustness but also reveal practical insights into the trade-offs between detection accuracy, cross-domain performance, and adversarial robustness. These findings provide researchers and practitioners with evidence-based guidance for selecting and developing detection methods that can withstand the evolving challenges of real-world deployment.
Experimental Settings
Datasets
We utilize 10 widely adopted datasets across image, video, and audio modalities to evaluate the detection methods. To ensure a fair comparison, we select training and testing datasets that are commonly used in the literature. Specifically, for the visual modality, we use FF++ \cite{rossler2019faceforensics++} as the training dataset and evaluate methods on DFFD \cite{dang2020detection}, Celeb-DF \cite{li2020celeb}, Wild-DF \cite{zi2020wilddeepfake}, and DFDC \cite{dolhansky2019deepfake}. For the audio modality, we select ASVspoof2019 \cite{wang2020asvspoof} as the training dataset and utilize ASVSpoof21 \cite{liu2023asvspoof}, WaveFake \cite{frank2021wavefake}, and ITW \cite{muller2022does} for evaluation. More details about these datasets can be found in Sect. 9.
Evaluation Metrics
We adopt standard metrics to comprehensively evaluate detection performance across different aspects. For assessing the reported performance of detection methods and their generalization capability, we employ three widely used metrics: accuracy (ACC) and area under the ROC curve (AUC) for the visual modality, and equal error rate (EER) for the audio modality. To quantify robustness against adversarial attacks, we measure the attack success rate (ASR), which indicates the percentage of successful evasions. For video-level methods, metrics are computed at the video level through majority-voting aggregation of frame-level predictions, while audio evaluations are performed at the utterance level to maintain consistency with standard benchmarks. More details about these metrics can be found in Sect. 9.
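The EER computation and the video-level majority-voting aggregation described above can be sketched as follows (ours; the score convention, higher meaning "more likely fake", is an assumption, and the scores are synthetic):

```python
import numpy as np

def eer(scores, labels):
    """Equal error rate: the operating point where the false-positive rate
    (real flagged as fake) equals the false-negative rate (fake missed).
    Convention here: higher score means 'more likely fake' (label 1)."""
    best_gap, value = np.inf, 1.0
    for t in np.unique(scores):
        fpr = np.mean(scores[labels == 0] >= t)
        fnr = np.mean(scores[labels == 1] < t)
        if abs(fpr - fnr) < best_gap:
            best_gap, value = abs(fpr - fnr), (fpr + fnr) / 2
    return value

def video_prediction(frame_preds):
    """Video-level label via majority vote over binary frame-level predictions."""
    return int(np.mean(frame_preds) >= 0.5)

rng = np.random.default_rng(0)
labels = np.concatenate([np.zeros(500), np.ones(500)])
scores = rng.normal(loc=labels * 2.0, scale=1.0)  # partially separable toy scores
err = eer(scores, labels)  # analytically about 0.16 for this two-sigma mean gap
```

Unlike a fixed-threshold accuracy, the EER is threshold-free, which is why it is the standard headline metric in audio anti-spoofing benchmarks.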
Evaluation Protocols
Detection approaches are evaluated under distinct protocols to assess their performance, robustness, and generalization capabilities \cite{yan2023deepfakebench, yan2024df40}. In this work, we focus on three key protocols: (i) within-domain evaluation, (ii) cross-domain evaluation, and (iii) adversarial attack evaluation.
\circled{1} Within-domain evaluation examines the accuracy of the detector when training and testing on the same dataset. This protocol assesses detection capability under controlled conditions where the manipulation techniques and data distributions are consistent. \circled{2} Cross-domain evaluation assesses the detector's generalization capability across different/new datasets, reflecting the distribution shifts of real-world settings. \circled{3} Adversarial attack evaluation measures the detector's resilience to adversarial examples (AEs). There are three common levels of attack: (i) white-box attacks, where attackers have complete knowledge of the model architecture and parameters; (ii) black-box attacks, where attackers have access only to model outputs; and (iii) transferable attacks, the most challenging setting, in which attackers craft AEs without direct access to the ultimate target model. More details about the different types of adversarial attacks, including their definitions and mathematical formulations, can be found in \cite{li2024survey}.
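One common convention for computing the ASR, measured over the samples the detector originally classified correctly, can be sketched as follows (the exact convention varies across papers; the toy predictions below are illustrative):

```python
import numpy as np

def attack_success_rate(clean_preds, adv_preds, labels):
    """ASR over the samples the detector originally classified correctly:
    the fraction whose prediction the attack flips to an incorrect one."""
    correct = clean_preds == labels
    flipped = (adv_preds != labels) & correct
    return flipped.sum() / max(correct.sum(), 1)

labels      = np.array([1, 1, 1, 0, 0, 1])
clean_preds = np.array([1, 1, 0, 0, 0, 1])   # detector correct on 5 of 6 samples
adv_preds   = np.array([0, 0, 0, 0, 1, 1])   # predictions after the attack
asr = attack_success_rate(clean_preds, adv_preds, labels)  # 3 of 5 flipped -> 0.6
```

Restricting the denominator to initially correct samples prevents already-misclassified inputs from inflating the apparent attack strength.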
Evaluation Results
Reported Performance Verification
In this experiment, we select 13 popular unimodal methods, including 8 visual methods (UCF \cite{yan2023ucf}, TALL \cite{xu2023tall}, STIL \cite{gu2021spatiotemporal}, SRM \cite{luo2021generalizing}, F3Net \cite{qian2020thinking}, UIA-ViT \cite{zhuang2022uia}, RECCE \cite{cao2022end}, CORE \cite{ni2022core}) and 5 audio methods (Aasist \cite{jung2022aasist}, Conformer \cite{rosello2023conformer}, SCL \cite{doan2024balance}, SSL \cite{wang2024can}, RawBMamba \cite{chen2024rawbmamba}). We evaluate on 4 datasets per modality, chosen because they are commonly used by these methods. For the visual modality, we train all methods on FF++ and evaluate them on Celeb-DF, DFDC, and DFD. For the audio modality, we train all methods on ASVspoof2019 and evaluate them on ASVSpoof21, WaveFake, and ITW. Note that some methods do not report results on certain datasets or metrics (denoted by "-") in their original papers; in those cases we provide only our verified results.
Table \ref{tab:verification} shows the comparison between the performance reported in the original papers and our verified results. We observe that our verified results are generally lower than the reported performance. In particular, several methods show a significant performance drop, such as UCF \cite{yan2023ucf} (AUC drops from 82.4% to 70.61% on Celeb-DF), STIL \cite{gu2021spatiotemporal} (AUC drops from 89.80% to 66.89% on DFDC), and Conformer \cite{rosello2023conformer} (EER increases from 3.28% to 94.82% on ASVSpoof21). Only a few methods achieve comparable results, such as Aasist \cite{jung2022aasist} (EER improves slightly from 0.99% to 0.83% on ASVSpoof19) and RawBMamba \cite{chen2024rawbmamba} (EER remains stable at around 1.2% on ASVSpoof19).
The discrepancies can be attributed to several factors: (i) differences in experimental settings, such as data preprocessing, training procedures, and hyperparameter tuning; (ii) variations in dataset splits or versions used for training and testing; (iii) potential overfitting to specific datasets or manipulation techniques in the original studies; and (iv) incomplete reporting of experimental details in some papers, making exact replication challenging. These findings highlight the importance of independent verification of reported results to ensure the reliability and reproducibility of DF detection methods. Moreover, we recommend that future research utilize the DeepfakeBench \cite{yan2023deepfakebench} framework to facilitate standardized evaluations and comparisons across studies.
\begin{landscape}\fontsize{6.25pt}{6.25pt}\selectfont\setlength{\tabcolsep}{3pt}\renewcommand{\arraystretch}{2}\begin{longtable}{|c|c|c|c|c|c|}\caption{Verification of Reported Performance for Popular DF Detection Methods. The symbol $\Delta$ indicates the performance difference between the reported and our verified results.}\label{tab:verification}\\\hline Method & Dataset & Metric & Reported & Verified & $\Delta$ \\\hline\endfirsthead\multicolumn{6}{c}{Continued from previous page} \\\hline Method & Dataset & Metric & Reported & Verified & $\Delta$ \\\hline\endhead\hline\multicolumn{6}{|r|}{
Continued on next page} \\\endfoot\hline\endlastfoot\multicolumn{6}{|c|}{Visual Methods} \\\hline\multirow{8}{*}{UCF \cite{yan2023ucf}}& \multirow{2}{*}{FF++} & AUC & 99.66 & 98.25 & -1.35 \\& & ACC & - & 93.51 & - \\& \multirow{2}{*}{CelebDF} & AUC & 82.4 & 70.61 & -11.79 \\& & ACC & - & 58.61 & - \\& \multirow{2}{*}{DFDC} & AUC & 80.5 & 69.95 & -10.55 \\& & ACC & - & 63.25 & - \\& \multirow{2}{*}{DFD} & AUC & 94.5 & 77.04 & -17.46 \\& & ACC & - & 51.31 & - \\\hline\multirow{8}{*}{TALL \cite{xu2023tall}}& \multirow{2}{*}{FF++} & AUC & 99.87 & 96.04 & -3.83 \\& & ACC & 98.65 & 91.80 & -6.85 \\& \multirow{2}{*}{CelebDF} & AUC & 90.79 & 84.04 & -6.75 \\& & ACC & - & 77.58 & - \\& \multirow{2}{*}{DFDC} & AUC & 76.78 & 74.05 & -2.73 \\& & ACC & - & 65.19 & - \\& \multirow{2}{*}{DFD} & AUC & - & 85.55 & - \\& & ACC & - & 83.46 & - \\\hline\multirow{8}{*}{STIL \cite{gu2021spatiotemporal}}& \multirow{2}{*}{FF++} & AUC & 99.64 & 96.16 & -3.48 \\& & ACC & - & 90.57 & - \\& \multirow{2}{*}{CelebDF} & AUC & 75.58 & 70.49 & -5.09 \\& & ACC & - & 62.46 & - \\& \multirow{2}{*}{DFDC} & AUC & 89.80 & 66.89 & -22.91 \\& & ACC & - & 60.09 & - \\& \multirow{2}{*}{DFD} & AUC & - & 77.11 & - \\& & ACC & - & 71.20 & - \\\hline\multirow{8}{*}{SRM \cite{luo2021generalizing}}& \multirow{2}{*}{FF++} & AUC & 99.4 & 98.23 & -1.17 \\& & ACC & 97.1 & 95.15 & -1.95 \\& \multirow{2}{*}{CelebDF} & AUC & 79.4 & 56.37 & -23.03 \\& & ACC & - & 55.52 & - \\& \multirow{2}{*}{DFDC} & AUC & 79.7 & 77.06 & -2.64 \\& & ACC & - & 71.22 & - \\& \multirow{2}{*}{DFD} & AUC & 91.9 & 85.88 & -6.02 \\& & ACC & - & 85.83 & - \\\hline\multirow{8}{*}{F3Net \cite{qian2020thinking}}& \multirow{2}{*}{FF++} & AUC & 99.9 & 98.58 & -1.32 \\& & ACC & 99.9 & 95.36 & -4.54 \\& \multirow{2}{*}{CelebDF} & AUC & - & 58.64 & - \\& & ACC & - & 56.06 & - \\& \multirow{2}{*}{DFDC} & AUC & - & 71.25 & - \\& & ACC & - & 64.67 & - \\& \multirow{2}{*}{DFD} & AUC & - & 84.58 & - \\& & ACC & - & 77.28 & - \\\hline\multirow{8}{*}{UIA-ViT 
\cite{zhuang2022uia}}& \multirow{2}{*}{FF++} & AUC & 99.33 & 94.06 & -5.27 \\& & ACC & 99.15 & 89.15 & -10 \\& \multirow{2}{*}{CelebDF} & AUC & 84.5 & 69.25 & -15.25 \\& & ACC & - & 64.09 & - \\& \multirow{2}{*}{DFDC} & AUC & 75.80 & 72.43 & -3.37 \\& & ACC & - & 60.97 & - \\& \multirow{2}{*}{DFD} & AUC & 94.68 & 76.23 & -18.45 \\& & ACC & - & 70.51 & - \\\hline\pagebreak[4] \multirow{8}{*}{RECCE \cite{cao2022end}}& \multirow{2}{*}{FF++} & AUC & 99.32 & 98.05 & -1.27 \\& & ACC & 97.06 & 93.51 & -3.55 \\& \multirow{2}{*}{CelebDF} & AUC & 68.71 & 50.47 & -18.24 \\& & ACC & - & 57.05 & - \\& \multirow{2}{*}{DFDC} & AUC & 69.06 & 71.62 & +2.56 \\& & ACC & - & 65.45 & - \\& \multirow{2}{*}{DFD} & AUC & - & 81.22 & - \\& & ACC & - & 85.19 & - \\\hline\multirow{8}{*}{CORE \cite{ni2022core}}& \multirow{2}{*}{FF++} & AUC & 100.0 & 97.93 & -2.07 \\& & ACC & 99.97 & 92.49 & -7.48 \\& \multirow{2}{*}{CelebDF} & AUC & 75.71 & 66.10 & -9.61 \\& & ACC & - & 63.49 & - \\& \multirow{2}{*}{DFDC} & AUC & 72.41 & 71.25 & -1.16 \\& & ACC & - & 64.67 & - \\& \multirow{2}{*}{DFD} & AUC & 94.0 & 85.88 & -8.12 \\& & ACC & - & 87.18 & - \\\hline\multicolumn{6}{|c|}{Audio Methods} \\\hline\multirow{4}{*}{Aasist \cite{jung2022aasist}}& ASVSpoof-19-LA & EER & 0.99 & 0.83 & -0.16 \\& ASVSpoof-21-DF & EER & - & 23.64 & - \\& WaveFake & EER & - & 39.01 & - \\& ITW & EER & - & 56.45 & - \\\hline\multirow{4}{*}{Conformer \cite{rosello2023conformer}}& ASVSpoof-19-LA & EER & 0.97 & 0.95 & -0.02 \\& ASVSpoof-21-DF & EER & 3.28 & 94.82 & +91.54 \\& WaveFake & EER & - & 68.17 & - \\& ITW & EER & - & 56.45 & - \\\hline\pagebreak[4]\multirow{4}{*}{SCL \cite{doan2024balance}}& ASVSpoof-19-LA & EER & 2.88 & 3.33 & +0.45 \\& ASVSpoof-21-DF & EER & 2.17 & 2.87 & +0.7 \\& WaveFake & EER & - & 22.91 & - \\& ITW & EER & 4.51 & 4.55 & +0.04 \\\hline\multirow{4}{*}{SSL \cite{wang2024can}}& ASVSpoof-19-LA & EER & 1.91 & 2.66 & +0.75 \\& ASVSpoof-21-DF & EER & 5.67 & 5.72 & +0.05 \\& WaveFake & 
EER & 1.3 & 1.99 & +0.69 \\& ITW & EER & 6.1 & 5.26 & -0.84 \\\hline\multirow{4}{*}{RawBMamba \cite{chen2024rawbmamba}}& ASVSpoof-19-LA & EER & 1.19 & 1.2 & +0.01 \\& ASVSpoof-21-DF & EER & 15.85 & 49.99 & +34.14 \\& WaveFake & EER & - & 35.45 & - \\& ITW & EER & - & 48.16 & - \\\hline\end{longtable}\end{landscape}
Generalization Evaluation
Table \ref{tab:generalization} summarizes the generalization performance of 33 methods across image, video, and audio modalities. For each method, we report the within-domain (WD) performance on the training dataset and cross-domain (CD) performance on multiple unseen datasets.
Observations.
Across modalities, most methods reach near-perfect performance in WD settings but drop significantly under CD evaluation. In the visual track, many methods exceed 94% ACC and 96% AUC on FF++, yet CD ACC typically falls into the 68-85% band, with only a few approaches sustaining close to 90% ACC. Among augmentation-based methods, static schemes improve CD performance but remain moderate (e.g., image/video: 73-78% ACC and 80-81% AUC), while dynamic and semantics-aware augmentation is substantially stronger: representative-based erasing \cite{wang2021representative} retains 80.78% ACC and 82.03% AUC, and frequency-domain self-blending (FreqBlender) \cite{hanzhefreqblender} attains 83.01% ACC and 85.18% AUC. Designed training strategies yield mixed results, with frequency-domain self-blending \cite{hanzhefreqblender} emerging as the most effective technique (83.01% ACC and 85.18% AUC in CD). This superior performance suggests that manipulating features in the frequency domain helps models learn more universal artifacts. Disentanglement methods \cite{liang2022exploring, guo2023controllable, chen2025diffusionfake} generally underperform compared to other categories, with the best method \cite{chen2025diffusionfake} achieving only 80.5% ACC in CD evaluation. This indicates that while disentangling content and artifact features is conceptually appealing, practical implementations often struggle with imperfect separation, leading to overfitting to content biases. Unsupervised learning methods show promise but with high variance in performance. Transfer learning from self-supervised representations \cite{kim2024one} achieves competitive results in audio (7.31% EER in CD), while unsupervised contrastive learning \cite{qiao2024fully} maintains reasonable visual detection performance (81.98% ACC in CD). Adaptive learning methods show the most significant improvements in CD performance. Recent methods, including orthogonal weight modification \cite{yan2025orthogonal} and test-time adaptation \cite{nguyen2025think}, deliver substantial boosts without requiring full retraining, achieving 89.15% and 89.05% CD ACC, respectively.
\begin{landscape}\fontsize{7pt}{7pt}\selectfont\setlength{\tabcolsep}{3pt}\renewcommand{\arraystretch}{2}\begin{longtable}{|>{\centering\arraybackslash}p{2.15cm}|c|>{\centering\arraybackslash}p{3.5cm}|>{\centering\arraybackslash}p{1.25cm}|c|c|c|c|>{\centering\arraybackslash}p{1.75cm}|c|c|c|}\caption{Evaluation of Deepfake Detection Generalization Methods on Within-Domain and Cross-Domain Protocols}\label{tab:generalization}\\\hline\multirow{3}{*}{Approach} &\multirow{3}{*}{Article} &\multirow{3}{*}{Key Idea} &\multirow{3}{*}{\parbox{1.2cm}{\centering Training Dataset}} &\multicolumn{8}{c|}{Evaluation} \\& & & &\multicolumn{4}{c|}{WD} &\multicolumn{4}{c|}{CD} \\& & & &Dataset &ACC (\%) &AUC (\%) &EER (\%) &Dataset &ACC (\%) &AUC (\%) &EER (\%) \\\hline\endfirsthead\multicolumn{12}{c}{{\bfseries \tablename\ \thetable{} -- continued from previous page}} \\\hline\multirow{3}{*}{Approach} &\multirow{3}{*}{Article} &\multirow{3}{*}{Key Idea} &\multirow{3}{*}{\parbox{1.2cm}{\centering Training Dataset}} &\multicolumn{8}{c|}{Evaluation} \\& & & &\multicolumn{4}{c|}{WD} &\multicolumn{4}{c|}{CD} \\& & & &Dataset &ACC (\%) &AUC (\%) &EER (\%) &Dataset &ACC (\%) &AUC (\%) &EER (\%) \\\hline\endhead\hline \multicolumn{12}{|r|}{{Continued on next page}} \\\hline\endfoot\hline\endlastfoot\multirow{9}{*}{\parbox{1.8cm}{\centering Data Augmentation}} & \cite{luo2023beyond} & Static image augmentation & \circled{2} & \circled{2} & 94.57 & 97.87 & - & \circled{3}, \circled{4}, \circled{5}, \circled{6}, \circled{7} & 73.21 & 81.43 & - \\& \cite{doan2024balance} & Static audio augmentation & \circled{1} & \circled{1} & - & - & 0.15 & \circled{8}, \circled{9}, \circled{10} & - & - & 4.76 \\& \cite{martin2024exploring} & Static audio augmentation & \circled{1} & \circled{1} & - & - & 0.22 & \circled{8}, \circled{9}, \circled{10} & - & - & 11.49 \\& \cite{wang2023altfreezing} & Static video augmentation & \circled{2} & \circled{2} & 98.03 & 99.4 & - & \circled{3}, \circled{4}, \circled{5}, \circled{6}, 
\circled{7} & 78.14 & 80.07 & - \\& \cite{das2021towards} & Dynamic image augmentation & \circled{2} & \circled{2} & 98.24 & 96.73 & - & \circled{3}, \circled{4}, \circled{5}, \circled{6}, \circled{7} & 69.59 & 71.01 & - \\& \cite{wang2021representative} & Dynamic image augmentation & \circled{2} & \circled{2} & 99.34 & 99.97 & - & \circled{3}, \circled{4}, \circled{5}, \circled{6}, \circled{7} & 80.78 & 82.03 & - \\& \cite{yan2023transcending} & Image-latent space augmentation & \circled{2} & \circled{2} & 97.54 & 92.34 & - & \circled{3}, \circled{4}, \circled{5}, \circled{6}, \circled{7} & 83.45 & 81.06 & - \\& \cite{huang2025generalizable} & Audio-latent space augmentation & \circled{1} & \circled{1} & - & - & 1.9 & \circled{8}, \circled{9}, \circled{10} & - & - & 5.76 \\& \cite{sun2022dual} & Dual-contrastive learning & \circled{2} & \circled{2} & 98.71 & 99.3 & - & \circled{3}, \circled{4}, \circled{5}, \circled{6}, \circled{7} & 82.16 & 80.45 & 5.76 \\\hline\multirow{7}{*}{\parbox{1.8cm}{\centering Designed Training Strategy}} & \cite{chen2022self} & GAN framework + self-supervised auxiliary tasks & \circled{2} & \circled{2} & 97.13 & 98.21 & - & \circled{3}, \circled{4}, \circled{5}, \circled{6}, \circled{7} & 84.86 & 76.89 & - \\& \cite{shiohara2022detecting} & Self-blending technique & \circled{2} & \circled{2} & 93.21 & 99.65 & - & \circled{3}, \circled{4}, \circled{5}, \circled{6}, \circled{7} & 84.35 & 81.35 & - \\& \cite{hasanaath2025fsbi} & Self-blending frequency technique & \circled{2} & \circled{2} & 83.15 & 86.31 & - & \circled{3}, \circled{4}, \circled{5}, \circled{6}, \circled{7} & 73.15 & 70.11 & - \\& \cite{hanzhefreqblender} & Self-blending in frequency domain & \circled{2} & \circled{2} & 97.56 & 96.13 & - & \circled{3}, \circled{4}, \circled{5}, \circled{6}, \circled{7} & 83.01 & 85.18 & - \\& \cite{yan2024generalizing} & Video-level self-blending technique & \circled{2} & \circled{2} & 97.8 & 95.23 & - & \circled{3}, \circled{4}, 
\circled{5}, \circled{6}, \circled{7} & 81.03 & 83.21 & - \\& \cite{zhu2025slim} & Two-stage training to learn style and linguistic representations & \circled{1} & \circled{1} & - & - & 2.0 & \circled{8}, \circled{9}, \circled{10} & - & - & 9.2 \\& \cite{choi2024exploiting} & Two-stage training to learn style representation & \circled{2} & \circled{2} & 94.21 & 99.1 & - & \circled{3}, \circled{4}, \circled{5}, \circled{6}, \circled{7} & 82.9 & 84.78 & - \\\hline\multirow{3}{*}{\parbox{1.8cm}{\centering Disentanglement Learning}} & \cite{liang2022exploring} & Dual-encoder architecture & \circled{2} & \circled{2} & 97.83 & 94.5 & - & \circled{3}, \circled{4}, \circled{5}, \circled{6}, \circled{7} & 72.3 & 76.83 & - \\& \cite{guo2023controllable} & Model a controllable geometric embedding space & \circled{2} & \circled{2} & 93.43 & 98.4 & - & \circled{3}, \circled{4}, \circled{5}, \circled{6}, \circled{7} & 73.14 & 83.31 & - \\& \cite{chen2025diffusionfake} & Identity disentanglement by reversing the generative process & \circled{2} & \circled{2} & 94.23 & 97.5 & - & \circled{3}, \circled{4}, \circled{5}, \circled{6}, \circled{7} & 80.5 & 83.78 & - \\\hline\multirow{3}{*}{\parbox{1.8cm}{\centering Unsupervised Learning}} & \cite{zhuang2022uia} & Multivariate Gaussian estimation & \circled{2} & \circled{2} & 92.15 & 99.33 & - & \circled{3}, \circled{4}, \circled{5}, \circled{6}, \circled{7} & 79.45 & 84.87 & - \\& \cite{qiao2024fully} & Unsupervised contrastive learning & \circled{2} & \circled{2} & 97.45 & 99.24 & - & \circled{3}, \circled{4}, \circled{5}, \circled{6}, \circled{7} & 81.98 & 82.56 & - \\& \cite{kim2024one} & Transfer learning from self-supervised representations & \circled{1} & \circled{1} & - & - & 1.02 & \circled{8}, \circled{9}, \circled{10} & - & - & 7.31 \\\hline\multirow{11}{*}{\parbox{1.8cm}{\centering Adaptive Learning}} & \cite{han2023sigma} & Meta learning & \circled{2} & \circled{2} & 97.0 & 98.1 & - & \circled{3}, \circled{4}, \circled{5},
\circled{6}, \circled{7} & 78.26 & 82.01 & - \\& \cite{zhu2024tmfd} & Meta learning & \circled{2} & \circled{2} & 95.6 & 97.24 & - & \circled{3}, \circled{4}, \circled{5}, \circled{6}, \circled{7} & 68.67 & 75.97 & - \\& \cite{chen2022ost} & Meta-learning + test-time training & \circled{2} & \circled{2} & 92.56 & 98.04 & - & \circled{3}, \circled{4}, \circled{5}, \circled{6}, \circled{7} & 69.7 & 82.1 & - \\& \cite{pan2023dfil} & Incremental learning + knowledge distillation & \circled{2} & \circled{2} & 98.55 & 95.53 & - & \circled{3}, \circled{4}, \circled{5}, \circled{6}, \circled{7} & 70.78 & 78.9 & - \\& \cite{xie2023domain} & Source domains aggregation + triplet loss & \circled{1} & \circled{1} & - & - & 4.21 & \circled{8}, \circled{9}, \circled{10} & - & - & 17.06 \\& \cite{lu2024one} & One-class knowledge distillation & \circled{1} & \circled{1} & - & - & 1.78 & \circled{8}, \circled{9}, \circled{10} & - & - & 8.36 \\& \cite{zhang2023adaptive} & Low-rank adaptation matrices & \circled{1} & \circled{1} & - & - & 6.51 & \circled{8}, \circled{9}, \circled{10} & - & - & 3.28 \\& \cite{wu2024adapter} & Low-rank adaptation matrices & \circled{1} & \circled{1} & - & - & 1.98 & \circled{8}, \circled{9}, \circled{10} & - & - & 11.25 \\& \cite{oiso2024prompt} & Prompt tuning & \circled{1} & \circled{1} & - & - & 6.43 & \circled{8}, \circled{9}, \circled{10} & - & - & 9.05 \\& \cite{yan2025orthogonal} & Orthogonal weight modification & \circled{2} & \circled{2} & 97.61 & 99.0 & - & \circled{3}, \circled{4}, \circled{5}, \circled{6}, \circled{7} & 89.15 & 91.72 & - \\& \cite{nguyen2025think} & Test-time adaptation and negative learning & \circled{2} & \circled{2} & 94.26 & 96.14 & - & \circled{3}, \circled{4}, \circled{5}, \circled{6}, \circled{7} & 89.05 & 90.25 & - \\\end{longtable}\noindent\begin{minipage}{\linewidth}Table Notes:\begin{itemize}[leftmargin=*, parsep=0pt, itemsep=0pt, topsep=0pt]\item i) Training Dataset: The dataset the method was trained on\item
ii) Evaluation: Protocols the method was evaluated on: Within-Domain (WD), Cross-Domain (CD)\item iii) Numbers denote the datasets used for training and evaluation: ASVspoof19 (\circled{1}), FF++ (\circled{2}), CelebDF (\circled{3}), DFDC (\circled{4}), DFFD (\circled{5}), WildDF (\circled{6}), UADFV (\circled{7}), ASVspoof21 (\circled{8}), WaveFake (\circled{9}), ITW (\circled{10}).\end{itemize}\end{minipage}\end{landscape}
Insights.
From these results, several key insights emerge:
Methods leveraging frequency-domain information consistently outperform spatial-only approaches, indicating that the spectral artifacts left by generative models are more stable across datasets than spatial cues.
While data augmentation provides modest improvements at low computational cost, adaptive learning methods achieve superior generalization at the expense of increased complexity and runtime adaptation requirements. This trade-off must be carefully considered for real-world deployment scenarios.
An AUC-ACC mismatch indicates calibration drift. Several methods show high AUC but lower ACC in the CD setting, implying threshold miscalibration under distribution shift. Post-hoc calibration or domain-aware thresholding is recommended before deployment.
Audio detection methods generally exhibit larger performance drops in CD settings compared to visual methods. This suggests that audio artifacts may be more sensitive to dataset-specific characteristics, necessitating further research into robust audio feature representations.
These findings underscore a critical challenge for the field: despite significant research efforts, the generalization gap remains substantial enough to limit practical deployment of deepfake detectors. The consistent
performance drop across domains indicates that current detection methods remain vulnerable to novel generation techniques, highlighting the urgent need for fundamentally new approaches that can learn truly invariant forgery representations.
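One practical mitigation for the calibration drift noted in these insights is to re-select the decision threshold on a small labeled sample from the target domain rather than reusing the training-domain threshold. The sketch below picks the threshold at the EER operating point; the function name, and the convention that higher scores mean "more likely fake", are our own illustrative choices, not any cited method's implementation.

```python
import numpy as np

def eer_threshold(scores_real, scores_fake):
    """Pick the decision threshold where the false-acceptance rate
    (real flagged as fake) and false-rejection rate (fake passed as
    real) are closest -- the EER operating point."""
    thresholds = np.sort(np.concatenate([scores_real, scores_fake]))
    best_t, best_gap = thresholds[0], 1.0
    for t in thresholds:
        far = np.mean(scores_real >= t)  # real wrongly flagged as fake
        frr = np.mean(scores_fake < t)   # fake wrongly passed as real
        if abs(far - frr) < best_gap:
            best_gap, best_t = abs(far - frr), t
    return best_t
```

Re-running this on a few hundred labeled target-domain scores recalibrates ACC without touching the model, which is why AUC (threshold-free) and ACC can diverge in the first place.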
Robustness Evaluation
We evaluate six methods against three common adversarial settings: white-box (WA), black-box (BA), and transferable attacks (TA), using Attack Success Rate (ASR; lower is better). Table \ref{tab:robustness} shows that all defenses reduce ASR, but the magnitude and consistency of the reduction vary notably across threat models.
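For concreteness, ASR can be computed as the fraction of originally correctly classified samples whose prediction flips after the adversarial perturbation. This is one common definition; the helper below is an illustrative sketch, not the exact protocol of the cited works.

```python
import numpy as np

def attack_success_rate(pred_clean, pred_adv, labels):
    """ASR over originally-correct samples: the fraction whose
    prediction becomes incorrect after the perturbation."""
    correct_before = pred_clean == labels
    flipped = pred_adv != labels
    denom = correct_before.sum()
    return float((correct_before & flipped).sum() / denom) if denom else 0.0
```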
Attack Severity Analysis.
The results reveal a clear hierarchy of vulnerability to adversarial attacks. WA is consistently the most effective, achieving pre-defense ASRs of 81--96\% (Table \ref{tab:robustness}), indicating that full model knowledge allows an attacker to evade detection in roughly 9 out of 10 cases. This severity stems from the attacker's ability to compute gradients and craft optimal perturbations tailored to the target model's decision boundaries. BAs pose a moderate threat with a 74.05\% pre-defense ASR, while transferable attacks show the lowest baseline ASR at 34.17\%, suggesting that effective AEs are far harder to generate without direct access to the target model: gradient obfuscation and model mismatch significantly limit attack success.
Defense Effectiveness Analysis.
Table \ref{tab:robustness} shows that detection methods exhibit varying degrees of robustness improvement across defense strategies. Pure adversarial training \cite{devasthale2022adversarially,nguyen2024d} cuts WA ASR from 81.06\% to 37.03\% and TA ASR from 34.17\% to 2.01\%, demonstrating that exposure to AEs during training enhances model robustness. However, the post-defense WA ASR remains relatively high at 37.03\%, indicating that static adversarial training alone is insufficient against fully informed attackers. Adaptive adversarial training \cite{kawa2022defense} demonstrates exceptional performance against WAs, reducing ASR from 85.49\% to 5.25\%. This dramatic reduction suggests that dynamically adjusting training samples based on attack difficulty enables models to learn more robust decision boundaries. For BAs, disjoint frequency ensembles \cite{hooda2024d4} lower ASR from 74.05\% to 27.54\%, indicating that leveraging diverse spectral features can improve robustness in query-limited scenarios.
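To make the adversarial-training idea concrete, the toy sketch below runs an FGSM-style inner step against a differentiable stand-in detector (plain logistic regression with hand-written gradients). All names, hyperparameters, and the FGSM choice are illustrative assumptions, not the cited methods' actual setups.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def adversarial_train(X, y, eps=0.1, lr=0.5, epochs=200):
    """Toy adversarial training: at each step, craft FGSM perturbations
    against the current weights, then take a gradient step on the
    perturbed batch instead of the clean one."""
    rng = np.random.default_rng(0)
    w, b = rng.normal(size=X.shape[1]), 0.0
    for _ in range(epochs):
        # FGSM: move each input along the sign of dLoss/dx = (p - y) * w
        p = sigmoid(X @ w + b)
        grad_x = (p - y)[:, None] * w[None, :]
        X_adv = X + eps * np.sign(grad_x)
        # standard logistic-regression update on the adversarial batch
        err = sigmoid(X_adv @ w + b) - y
        w -= lr * X_adv.T @ err / len(y)
        b -= lr * err.mean()
    return w, b
```

Adaptive variants would additionally re-weight or re-generate the perturbed samples according to how hard they are for the current model, which is the ingredient that Table \ref{tab:robustness} suggests matters most.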
Insights.
Several key insights emerge from this evaluation:
No single defense method provides universal protection across all attack types. Therefore, it is important to adopt a multi-faceted approach, combining adversarial training, input transformations, and ensemble strategies to enhance overall robustness.
The substantial gap between WA (81--96\% ASR) and TA (34.17\% ASR) success rates suggests that many detectors inadvertently rely on gradient masking rather than learning genuinely robust features. This vulnerability becomes critical as soon as attackers gain model access.
While the adaptive method of \cite{kawa2022defense} achieves superior robustness, it requires a sophisticated training procedure and increased computational overhead. This trade-off between robustness and efficiency must be balanced against deployment constraints.
\begin{table}[ht]\begin{talltblr}[ caption = {Evaluation of Deepfake Detection Robustness Methods against Adversarial Attacks}, label = {tab:robustness},]{ width = \linewidth, colspec = {Q[163]Q[120]Q[200]Q[102]Q[125]Q[113]}, cells = {c}, cell{1}{1} = {r=2}{}, cell{1}{2} = {r=2}{}, cell{1}{3} = {r=2}{}, cell{1}{4} = {r=2}{}, cell{1}{5} = {c=2}{0.238\linewidth}, cell{3}{1} = {r=6}{}, vline{2-5} = {1-8}{}, vline{6} = {2-8}{}, hline{1,9} = {-}{0.08em}, hline{2} = {5-6}{}, hline{3} = {-}{}, hline{3} = {2}{-}{},}Approach & Article & Key Idea & Attack Type & Evaluation (ASR (\%)) & \\& & & & Before Defense & After Defense \\{Robustness to \\adversarial attacks} & \cite{devasthale2022adversarially} & Purely adversarial training & WA & 81.06 & 37.03 \\& \cite{nguyen2024d} & Purely adversarial training & TA & 34.17 & 2.01 \\& \cite{jeong2022frepgan} & Frequency Perturbation & WA & 87.69 & 40.67 \\& \cite{kawa2022defense} & Adaptive adversarial training & WA & 85.49 & 5.25 \\& \cite{hooda2024d4} & Disjoint frequency ensemble & BA & 74.05 & 27.54 \\& \cite{khan2024adversarially} & Adversarial similarity loss & WA & 95.62 & 25.38\end{talltblr}\end{table}
Challenges and Future Directions
Detection of DFs in Challenging Conditions
Current approaches often focus on fully synthetic content and demonstrate limited effectiveness in identifying DFs under challenging conditions, including face occlusion, multiple facial DFs, and partial manipulations.
Face occlusion. Advanced techniques, such as 3D face restoration \cite{chen2024blind} and blind inpainting \cite{criminisi2004region}, have successfully addressed occlusion challenges in generic face recognition. There exists significant potential to adapt and integrate these techniques into DF detection frameworks to maintain performance when faces are partially obscured or occluded.
Multiple facial DFs. Researchers evaluate their detection methods on specialized datasets designed for multi-subject forgery detection, including OpenForensics \cite{le2021openforensics} and DF-Platter \cite{narayan2023df}. These datasets specifically address the increasingly common scenario where multiple subjects within a single frame/image have been manipulated, presenting distinct challenges beyond single-face DF detection.
Partial manipulation. Detection systems must evolve to identify localized manipulations that alter only specific facial regions, partial frames, or partial utterances. Developing patch-based architectures such as Vision Transformers \cite{khan2022transformers} and implementing region-specific attention mechanisms could improve the detection of these increasingly sophisticated partial manipulations.
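A patch-based detector of the kind suggested above can be sketched as scoring fixed-size patches independently and max-pooling the scores, so that a manipulation confined to one region still dominates the image-level decision. The `scorer` callback is a placeholder for any per-patch detector; the example below is a minimal sketch for 2-D grayscale arrays, not a proposed architecture.

```python
import numpy as np

def patch_scores(image, patch, scorer):
    """Split a 2-D image into non-overlapping patches and score each
    one, so a locally confined manipulation still stands out."""
    H, W = image.shape
    scores = []
    for i in range(0, H - patch + 1, patch):
        for j in range(0, W - patch + 1, patch):
            scores.append(scorer(image[i:i + patch, j:j + patch]))
    return np.array(scores)

def image_is_fake(image, patch, scorer, thresh):
    # max-pooling over patches: one manipulated patch flags the image
    return bool(patch_scores(image, patch, scorer).max() > thresh)
```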
Generalization to Emerging Generation Techniques
Our empirical evaluation above reveals that existing DF detectors exhibit significant performance degradation when confronted with unseen generation methods. With the rapid evolution of GMs, detectors must adapt to novel architectures and training paradigms that may introduce entirely new artifact patterns. This necessitates detection approaches that learn invariant features robust to a wide spectrum of generation techniques. Promising directions include: (1) foundation-model approaches that leverage large-scale pre-training on diverse synthetic data to learn generalizable representations; (2) continual learning frameworks that enable detectors to incrementally update their knowledge base as new generation methods emerge, without catastrophic forgetting of previously learned artifacts; and (3) generation-aware meta-learning strategies that explicitly model the space of possible generation techniques rather than memorizing specific artifacts.
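As a minimal illustration of the continual-learning direction, reservoir-sampled exemplar replay keeps a bounded memory of past generators' samples and mixes them into every update, a simple hedge against catastrophic forgetting. Class and method names below are our own; real continual-learning frameworks add regularization or parameter isolation on top.

```python
import random

class ReplayBuffer:
    """Reservoir-style exemplar memory: retains a uniform bounded
    sample of everything seen, so updates on a new generator's data
    can be interleaved with examples of older generators."""
    def __init__(self, capacity, seed=0):
        self.capacity, self.buf, self.seen = capacity, [], 0
        self.rng = random.Random(seed)

    def add(self, item):
        self.seen += 1
        if len(self.buf) < self.capacity:
            self.buf.append(item)
        else:
            j = self.rng.randrange(self.seen)  # reservoir sampling
            if j < self.capacity:
                self.buf[j] = item

    def batch_with_replay(self, new_batch, k):
        # training batch = fresh samples + k replayed exemplars
        return new_batch + self.rng.sample(self.buf, min(k, len(self.buf)))
```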
Generalization and Robustness Trade-off.
Research from the broader computer vision field suggests that there exists an inherent trade-off between a model's ability to generalize to unseen data and its robustness against adversarial perturbations \cite{zhang2019theoretically,stutz2019disentangling,li2023trade,gowda2024conserve}. In the context of DF detection, this trade-off may also exist: models optimized for high generalization may inadvertently become more susceptible to adversarial attacks, and vice versa. Understanding and managing this trade-off is crucial for developing practical DF detectors that can operate effectively in real-world scenarios where both unseen generation techniques and adversarial threats are prevalent. Future research should focus on: (1) theoretical analyses that formally characterize the relationship between generalization and robustness in DF detection, identifying conditions under which improvements in one aspect degrade the other; and (2) multi-objective optimization that explicitly balances generalization and robustness during training, allowing deployment-specific trade-off selection.
Dataset Imbalances and Biases
DF detectors trained on current datasets face significant performance degradation when deployed in real-world settings due to class imbalance \cite{layton2024sok}. In particular, these detectors exhibit a bias toward classifying content as fake in realistic deployment scenarios where authentic content vastly outnumbers DFs. This imbalance leads to excessive false positive rates that undermine the practical utility of detection systems. Long-tailed recognition is a promising direction for improving accuracy on the ``real'' class with minimal impact on the ``fake'' class \cite{miao2024out,bai2023effectiveness}. Additionally, researchers should consider reformulating DF detection as an anomaly detection problem in which models learn a representation of authentic content variation rather than a decision boundary between real and fake classes \cite{ho2024long,salehi2021unified}. This approach naturally accommodates imbalanced scenarios by focusing on characterizing normal content patterns.
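The anomaly-detection reformulation can be sketched by modeling authentic-content features with a multivariate Gaussian and scoring test samples by Mahalanobis distance, in the spirit of the Gaussian-estimation approach of \cite{zhuang2022uia}. The feature extractor is assumed given; only genuine samples are needed to fit the model, which is exactly what makes it robust to class imbalance.

```python
import numpy as np

def fit_real(features):
    """Fit a multivariate Gaussian to features of authentic content
    only; the ridge term keeps the covariance invertible."""
    mu = features.mean(axis=0)
    cov = np.cov(features, rowvar=False) + 1e-6 * np.eye(features.shape[1])
    return mu, np.linalg.inv(cov)

def anomaly_score(x, mu, cov_inv):
    # squared Mahalanobis distance: large = unlike any authentic sample
    d = x - mu
    return float(d @ cov_inv @ d)
```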
Computational Efficiency for Edge Deployment
Many advanced detection approaches incur significant computational overhead, making them impractical for real-time or resource-constrained applications. For example, large-scale self-supervised learning (SSL) models used as feature extractors demand substantial memory and processing power, hindering deployment on edge devices or mobile platforms \cite{wu2024adapter}. Future research should prioritize the development of lightweight architectures and efficient training paradigms that maintain high detection performance while minimizing computational requirements. Techniques such as model pruning \cite{he2023structured}, quantization \cite{egashira2024exploiting}, and knowledge distillation \cite{gou2021knowledge} can be leveraged to create compact yet effective DF detectors suitable for deployment in diverse environments.
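As a sketch of the distillation option, the standard soft-target objective blends a temperature-softened KL term against the teacher with the usual hard-label cross-entropy; a compact student trained this way can approximate a large SSL-based teacher. Temperature and weighting values below are illustrative defaults, not prescriptions.

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    z = z - z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def distill_loss(student_logits, teacher_logits, label, T=4.0, alpha=0.7):
    """Soft-target distillation: alpha * T^2 * KL(teacher || student)
    at temperature T, plus (1 - alpha) * hard-label cross-entropy."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = float(np.sum(p_t * (np.log(p_t) - np.log(p_s)))) * T * T
    ce = -float(np.log(softmax(student_logits)[label]))
    return alpha * kl + (1 - alpha) * ce
```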
Privacy-Preserving Detection
Current detection approaches often require access to full media content or datasets for training, raising significant privacy concerns when deployed at scale across communication platforms or personal content. Federated learning \cite{zhang2021survey} is a distributed machine learning paradigm that enables multiple models to collaboratively train a global model without sharing their raw data. This framework should be adapted for DF detection to enable detector training and updating without centralizing sensitive media. Furthermore, researchers should consider developing privacy-preserving feature extraction techniques that convert media into non-reversible representations before analysis. These approaches would enable detection systems to operate on transformed data that preserves manipulation artifacts while removing personally identifiable information (e.g., speech content or identity information).
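The aggregation step at the heart of federated learning (FedAvg) is simple: a server averages client parameter updates weighted by local dataset size, so raw media never leaves the clients. The sketch below operates on flattened parameter vectors; a real system would iterate this over communication rounds with secure aggregation on top.

```python
import numpy as np

def fed_avg(client_weights, client_sizes):
    """FedAvg aggregation: a sample-size-weighted average of client
    parameter vectors; only parameters, never raw media, are shared."""
    total = sum(client_sizes)
    return sum(w * (n / total) for w, n in zip(client_weights, client_sizes))
```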
Conclusion
This survey has systematically analyzed the inter-modality relationships between unimodal deepfake detection methods across image, video, and audio domains. We also reviewed state-of-the-art techniques for enhancing detection generalization and robustness against adversarial attacks. Our empirical evaluations of 50 methods across 10 datasets reveal critical insights. First, there exists a gap between reported performance and our validated results, underscoring the need for future research to adopt standardized evaluation protocols. Second, while many methods achieve near-perfect accuracy in within-domain settings, current detectors exhibit a persistent 15--20\% performance degradation in cross-domain scenarios, indicating fundamental limitations in learning invariant forgery representations. Third, adversarial attacks pose a significant threat to detection reliability, with white-box attacks achieving over 80\% success rates against undefended models. Although adaptive adversarial training can substantially mitigate this vulnerability, it introduces considerable computational overhead. Our analysis also reveals critical dataset limitations, including severe class imbalance and the absence of adversarial robustness benchmarks, which constrain the development of practical detection systems.
Despite significant research efforts, deepfake detection remains an open problem. The rapid evolution of generative models demands a paradigm shift from pattern memorization toward learning truly invariant forgery representations, requiring fundamental advances in both theoretical understanding and practical implementation to achieve reliable real-world deployment.
Declarations
Funding. This publication has emanated from research conducted with the financial support of Science Foundation Ireland under Grant number 18/CRT/6183.
Conflict of interest/Competing interests. The authors declare no competing interests.
\bibliography{my_bib}
Appendix A
Author Contribution
Hong-Hanh Nguyen-Le conducted the systematic review, analyzed trends and relationships, and wrote the main manuscript. Van-Tuan Tran conducted the empirical experiments and drew all the figures. Dinh-Thuc Nguyen and Nhien-An Le-Khac supervised the project, provided critical feedback, and revised the manuscript.
References:
Mirsky, Yisroel and Lee, Wenke (2021) The creation and detection of deepfakes: A survey. ACM computing surveys (CSUR) 54(1): 1--41 ACM New York, NY, USA
Wang, X and Guo, H and Hu, S and Chang, MC and Lyu, S Gan-generated faces detection: A survey and new perspectives. arXiv 2022. arXiv preprint arXiv:2202.07145
Yi, Jiangyan and Wang, Chenglong and Tao, Jianhua and Zhang, Xiaohui and Zhang, Chu Yuan and Zhao, Yan (2023) Audio deepfake detection: A survey. arXiv preprint arXiv:2308.14970
Mubarak, Rami and Alsboui, Tariq and Alshaikh, Omar and Inuwa-Dutse, Isa and Khan, Saad and Parkinson, Simon (2023) A survey on the detection and impacts of deepfakes in visual, audio, and textual formats. Ieee Access 11: 144497--144529 IEEE
Pei, Gan and Zhang, Jiangning and Hu, Menghan and Zhang, Zhenyu and Wang, Chengjie and Wu, Yunsheng and Zhai, Guangtao and Yang, Jian and Shen, Chunhua and Tao, Dacheng (2024) Deepfake generation and detection: A benchmark and survey. arXiv preprint arXiv:2403.17881
Kaur, Achhardeep and Noori Hoshyar, Azadeh and Saikrishna, Vidya and Firmin, Selena and Xia, Feng (2024) Deepfake video detection: challenges and opportunities. Artificial Intelligence Review 57(6): 1--47 Springer
Liu, Ping and Tao, Qiqi and Zhou, Joey Tianyi (2024) Evolving from Single-modal to Multi-modal Facial Deepfake Detection: A Survey. arXiv preprint arXiv:2406.06965
Masood, Momina and Nawaz, Mariam and Malik, Khalid Mahmood and Javed, Ali and Irtaza, Aun and Malik, Hafiz (2023) Deepfakes generation and detection: State-of-the-art, open challenges, countermeasures, and way forward. Applied intelligence 53(4): 3974--4026 Springer
Liu, Xuechen and Wang, Xin and Sahidullah, Md and Patino, Jose and Delgado, H{\'e}ctor and Kinnunen, Tomi and Todisco, Massimiliano and Yamagishi, Junichi and Evans, Nicholas and Nautsch, Andreas and others (2023) Asvspoof 2021: Towards spoofed and deepfake speech detection in the wild. IEEE/ACM Transactions on Audio, Speech, and Language Processing IEEE
Frank, Joel and Sch{\"o}nherr, Lea (2021) Wavefake: A data set to facilitate audio deepfake detection. arXiv preprint arXiv:2111.02813
Yi, Jiangyan and Fu, Ruibo and Tao, Jianhua and Nie, Shuai and Ma, Haoxin and Wang, Chenglong and Wang, Tao and Tian, Zhengkun and Bai, Ye and Fan, Cunhang and others (2022) Add 2022: the first audio deep synthesis detection challenge. IEEE, 9216--9220, ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Yi, Jiangyan and Tao, Jianhua and Fu, Ruibo and Yan, Xinrui and Wang, Chenglong and Wang, Tao and Zhang, Chu Yuan and Zhang, Xiaohui and Zhao, Yan and Ren, Yong and others (2023) Add 2023: the second audio deepfake detection challenge. arXiv preprint arXiv:2305.13774
He, Yinan and Gan, Bei and Chen, Siyu and Zhou, Yichun and Yin, Guojun and Song, Luchuan and Sheng, Lu and Shao, Jing and Liu, Ziwei (2021) Forgerynet: A versatile benchmark for comprehensive forgery analysis. 4360--4369, Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Korshunov, Pavel and Marcel, S{\'e}bastien (2018) Deepfakes: a new threat to face recognition? assessment and detection. arXiv preprint arXiv:1812.08685
Li, Yuezun and Yang, Xin and Sun, Pu and Qi, Honggang and Lyu, Siwei (2020) Celeb-df: A large-scale challenging dataset for deepfake forensics. 3207--3216, Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Rossler, Andreas and Cozzolino, Davide and Verdoliva, Luisa and Riess, Christian and Thies, Justus and Nie{\ss}ner, Matthias (2019) FaceForensics++: Learning to detect manipulated facial images. 1--11, Proceedings of the IEEE/CVF international conference on computer vision
Dolhansky, Brian and Howes, Russ and Pflaum, Ben and Baram, Nicole and Ferrer, Cristian Canton (2019) The deepfake detection challenge (dfdc) preview dataset. arXiv preprint arXiv:1910.08854
Jiang, Liming and Li, Ren and Wu, Wayne and Qian, Chen and Loy, Chen Change (2020) Deeperforensics-1.0: A large-scale dataset for real-world face forgery detection. 2889--2898, Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Zi, Bojia and Chang, Minghao and Chen, Jingjing and Ma, Xingjun and Jiang, Yu-Gang (2020) Wilddeepfake: A challenging real-world dataset for deepfake detection. 2382--2390, Proceedings of the 28th ACM international conference on multimedia
Narayan, Kartik and Agarwal, Harsh and Thakral, Kartik and Mittal, Surbhi and Vatsa, Mayank and Singh, Richa (2023) Df-platter: Multi-face heterogeneous deepfake dataset. 9739--9748, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Barni, Mauro and Kallas, Kassem and Nowroozi, Ehsan and Tondi, Benedetta (2020) CNN detection of GAN-generated face images based on cross-band co-occurrences analysis. IEEE, 1--6, 2020 IEEE international workshop on information forensics and security (WIFS)
Chen, Beijing and Liu, Xin and Zheng, Yuhui and Zhao, Guoying and Shi, Yun-Qing (2021) A robust GAN-generated face detection method based on dual-color spaces and an improved Xception. IEEE Transactions on Circuits and Systems for Video Technology 32(6): 3527--3538 IEEE
Hu, Shu and Li, Yuezun and Lyu, Siwei (2021) Exposing GAN-generated faces using inconsistent corneal specular highlights. IEEE, 2500--2504, ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Nirkin, Yuval and Wolf, Lior and Keller, Yosi and Hassner, Tal (2021) Deepfake detection based on discrepancies between faces and their context. IEEE Transactions on Pattern Analysis and Machine Intelligence 44(10): 6111--6121 IEEE
Han, Ruidong and Wang, Xiaofeng and Bai, Ningning and Wang, Qin and Liu, Zinian and Xue, Jianru (2023) FCD-Net: Learning to detect multiple types of homologous deepfake face images. IEEE Transactions on Information Forensics and Security IEEE
Zhao, Hanqing and Zhou, Wenbo and Chen, Dongdong and Wei, Tianyi and Zhang, Weiming and Yu, Nenghai (2021) Multi-attentional deepfake detection. 2185--2194, Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Guo, Xiao and Liu, Xiaohong and Ren, Zhiyuan and Grosz, Steven and Masi, Iacopo and Liu, Xiaoming (2023) Hierarchical fine-grained image forgery detection and localization. 3155--3165, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Dang, Hao and Liu, Feng and Stehouwer, Joel and Liu, Xiaoming and Jain, Anil K (2020) On the detection of digital face manipulation. 5781--5790, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern recognition
Huang, Yihao and Juefei-Xu, Felix and Guo, Qing and Liu, Yang and Pu, Geguang (2022) Fakelocator: Robust localization of gan-based face manipulations. IEEE Transactions on Information Forensics and Security 17: 2657--2672 IEEE
Wang, Jian and Sun, Yunlian and Tang, Jinhui (2022) LiSiam: Localization invariance Siamese network for deepfake detection. IEEE Transactions on Information Forensics and Security 17: 2425--2436 IEEE
He, Yang and Yu, Ning and Keuper, Margret and Fritz, Mario (2021) Beyond the spectrum: Detecting deepfakes via re-synthesis. Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence (IJCAI-21)
Cao, Junyi and Ma, Chao and Yao, Taiping and Chen, Shen and Ding, Shouhong and Yang, Xiaokang (2022) End-to-end reconstruction-classification learning for face forgery detection. 4113--4122, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Liu, Decheng and Dang, Zhan and Peng, Chunlei and Zheng, Yu and Li, Shuang and Wang, Nannan and Gao, Xinbo (2023) FedForgery: generalized face forgery detection with residual federated learning. IEEE Transactions on Information Forensics and Security IEEE
Shi, Zenan and Chen, Haipeng and Chen, Long and Zhang, Dong (2023) Discrepancy-guided reconstruction learning for image forgery detection. Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence (IJCAI-23)
Fridrich, Jessica and Kodovsky, Jan (2012) Rich models for steganalysis of digital images. IEEE Transactions on information Forensics and Security 7(3): 868--882 IEEE
Luo, Yuchen and Zhang, Yong and Yan, Junchi and Liu, Wei (2021) Generalizing face forgery detection with high-frequency features. 16317--16326, Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Kong, Chenqi and Chen, Baoliang and Li, Haoliang and Wang, Shiqi and Rocha, Anderson and Kwong, Sam (2022) Detect and locate: Exposing face manipulation by semantic-and noise-level telltales. IEEE Transactions on Information Forensics and Security 17: 1741--1756 IEEE
Shuai, Chao and Zhong, Jieming and Wu, Shuang and Lin, Feng and Wang, Zhibo and Ba, Zhongjie and Liu, Zhenguang and Cavallaro, Lorenzo and Ren, Kui (2023) Locate and verify: A two-stream network for improved deepfake detection. 7131--7142, Proceedings of the 31st ACM International Conference on Multimedia
Wang, Sheng-Yu and Wang, Oliver and Zhang, Richard and Owens, Andrew and Efros, Alexei A (2020) CNN-generated images are surprisingly easy to spot... for now. 8695--8704, Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Das, Sowmen and Seferbekov, Selim and Datta, Arup and Islam, Md Saiful and Amin, Md Ruhul (2021) Towards solving the deepfake problem: An analysis on improving deepfake detection using dynamic face augmentation. 3776--3785, Proceedings of the IEEE/CVF International Conference on Computer Vision
Guillaro, Fabrizio and Cozzolino, Davide and Sud, Avneesh and Dufour, Nicholas and Verdoliva, Luisa (2023) Trufor: Leveraging all-round clues for trustworthy image forgery detection and localization. 20606--20615, Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Wang, Chengrui and Deng, Weihong (2021) Representative forgery mining for fake face detection. 14923--14932, Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Yan, Zhiyuan and Luo, Yuhao and Lyu, Siwei and Liu, Qingshan and Wu, Baoyuan (2023) Transcending forgery specificity with latent space augmentation for generalizable deepfake detection. arXiv preprint arXiv:2311.11278
Chen, Liang and Zhang, Yong and Song, Yibing and Liu, Lingqiao and Wang, Jue (2022) Self-supervised learning of adversarial example: Towards good generalizations for deepfake detection. 18710--18719, Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Sun, Ke and Yao, Taiping and Chen, Shen and Ding, Shouhong and Li, Jilin and Ji, Rongrong (2022) Dual contrastive learning for general face forgery detection. 2316--2324, 2, 36, Proceedings of the AAAI Conference on Artificial Intelligence
Shiohara, Kaede and Yamasaki, Toshihiko (2022) Detecting deepfakes with self-blended images. 18720--18729, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Yan, Zhiyuan and Zhao, Yandan and Chen, Shen and Fu, Xinghe and Yao, Taiping and Ding, Shouhong and Yuan, Li (2024) Generalizing deepfake video detection with plug-and-play: Video-level blending and spatiotemporal adapter tuning. arXiv preprint arXiv:2408.17065
Sun, Ke and Liu, Hong and Ye, Qixiang and Gao, Yue and Liu, Jianzhuang and Shao, Ling and Ji, Rongrong (2021) Domain general face forgery detection by learning to weight. 2638--2646, 3, 35, Proceedings of the AAAI conference on artificial intelligence
Chen, Liang and Zhang, Yong and Song, Yibing and Wang, Jue and Liu, Lingqiao (2022) Ost: Improving generalization of deepfake detection via one-shot test-time training. Advances in Neural Information Processing Systems 35: 24597--24610
Liu, Yuejiang and Kothari, Parth and Van Delft, Bastien and Bellot-Gurlet, Baptiste and Mordan, Taylor and Alahi, Alexandre (2021) TTT++: When does self-supervised test-time training fail or thrive?. Advances in Neural Information Processing Systems 34: 21808--21820
Yan, Zhiyuan and Zhang, Yong and Fan, Yanbo and Wu, Baoyuan (2023) Ucf: Uncovering common features for generalizable deepfake detection. 22412--22423, Proceedings of the IEEE/CVF International Conference on Computer Vision
Liang, Jiahao and Shi, Huafeng and Deng, Weihong (2022) Exploring disentangled content information for face forgery detection. Springer, 128--145, European Conference on Computer Vision
Zhuang, Wanyi and Chu, Qi and Tan, Zhentao and Liu, Qiankun and Yuan, Haojie and Miao, Changtao and Luo, Zixiang and Yu, Nenghai (2022) UIA-ViT: Unsupervised inconsistency-aware method based on vision transformer for face forgery detection. Springer, 391--407, European Conference on Computer Vision
Pan, Kun and Yin, Yifang and Wei, Yao and Lin, Feng and Ba, Zhongjie and Liu, Zhenguang and Wang, Zhibo and Cavallaro, Lorenzo and Ren, Kui (2023) DFIL: Deepfake Incremental Learning by Exploiting Domain-invariant Forgery Clues. 8035--8046, Proceedings of the 31st ACM International Conference on Multimedia
Wang, Zhendong and Bao, Jianmin and Zhou, Wengang and Wang, Weilun and Hu, Hezhen and Chen, Hong and Li, Houqiang (2023) Dire for diffusion-generated image detection. 22445--22455, Proceedings of the IEEE/CVF International Conference on Computer Vision
Ma, Ruipeng and Duan, Jinhao and Kong, Fei and Shi, Xiaoshuang and Xu, Kaidi (2023) Exposing the fake: Effective diffusion-generated images detection. arXiv preprint arXiv:2307.06272
Neekhara, Paarth and Dolhansky, Brian and Bitton, Joanna and Ferrer, Cristian Canton (2021) Adversarial threats to deepfake detection: A practical perspective. 923--932, Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Jeong, Yonghyun and Kim, Doyeon and Ro, Youngmin and Choi, Jongwon (2022) FrePGAN: Robust deepfake detection using frequency-level perturbations. 1060--1068, 1, 36, Proceedings of the AAAI conference on artificial intelligence
Hooda, Ashish and Mangaokar, Neal and Feng, Ryan and Fawaz, Kassem and Jha, Somesh and Prakash, Atul (2024) D4: Detection of adversarial diffusion deepfakes using disjoint ensembles. 3812--3822, Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision
Sha, Zeyang and Li, Zheng and Yu, Ning and Zhang, Yang (2023) DE-FAKE: Detection and attribution of fake images generated by text-to-image generation models. 3418--3432, Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security
Nguyen-Le, Hong-Hanh and Tran, Van-Tuan and Nguyen, Dinh-Thuc and Le-Khac, Nhien-An (2024) D-CAPTCHA++: A Study of Resilience of Deepfake CAPTCHA under Transferable Imperceptible Adversarial Attack. IEEE, 1--8, 2024 International Joint Conference on Neural Networks (IJCNN)
Liu, Zhengzhe and Qi, Xiaojuan and Torr, Philip HS (2020) Global texture enhancement for fake face detection in the wild. 8060--8069, Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Frank, Joel and Eisenhofer, Thorsten and Schönherr, Lea and Fischer, Asja and Kolossa, Dorothea and Holz, Thorsten (2020) Leveraging frequency analysis for deep fake image recognition. PMLR, 3247--3258, International conference on machine learning
Dzanic, Tarik and Shah, Karan and Witherden, Freddie (2020) Fourier spectrum discrepancies in deep network generated images. Advances in neural information processing systems 33: 3022--3032
Qian, Yuyang and Yin, Guojun and Sheng, Lu and Chen, Zixuan and Shao, Jing (2020) Thinking in frequency: Face forgery detection by mining frequency-aware clues. Springer, 86--103, European conference on computer vision
Ojha, Utkarsh and Li, Yuheng and Lee, Yong Jae (2023) Towards universal fake image detectors that generalize across generative models. 24480--24489, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Chandrasegaran, Keshigeyan and Tran, Ngoc-Trung and Cheung, Ngai-Man (2021) A closer look at fourier spectrum discrepancies for cnn-generated images detection. 7200--7209, Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Neves, Joao C and Tolosana, Ruben and Vera-Rodriguez, Ruben and Lopes, Vasco and Proença, Hugo and Fierrez, Julian (2020) GANprintR: Improved fakes and evaluation of the state of the art in face manipulation detection. IEEE Journal of Selected Topics in Signal Processing 14(5): 1038--1048 IEEE
Miao, Changtao and Tan, Zichang and Chu, Qi and Yu, Nenghai and Guo, Guodong (2022) Hierarchical frequency-assisted interactive networks for face manipulation detection. IEEE Transactions on Information Forensics and Security 17: 3008--3021 IEEE
Miao, Changtao and Tan, Zichang and Chu, Qi and Liu, Huan and Hu, Honggang and Yu, Nenghai (2023) F2Trans: High-frequency fine-grained transformer for face forgery detection. IEEE Transactions on Information Forensics and Security 18: 1039--1051 IEEE
Wang, Junke and Wu, Zuxuan and Ouyang, Wenhao and Han, Xintong and Chen, Jingjing and Jiang, Yu-Gang and Li, Ser-Nam (2022) M2TR: Multi-modal multi-scale transformers for deepfake detection. 615--623, Proceedings of the 2022 international conference on multimedia retrieval
Wang, Yuan and Yu, Kun and Chen, Chen and Hu, Xiyuan and Peng, Silong (2023) Dynamic graph learning with content-guided spatial-frequency relation reasoning for deepfake detection. 7278--7287, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Guo, Yinlin and Huang, Haofan and Chen, Xi and Zhao, He and Wang, Yuehai (2024) Audio Deepfake Detection With Self-Supervised WavLM and Multi-Fusion Attentive Classifier. IEEE, 12702--12706, ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Shin, Hyun-seo and Heo, Jungwoo and Kim, Ju-ho and Lim, Chan-yeong and Kim, Wonbin and Yu, Ha-Jin (2024) HM-Conformer: A Conformer-based audio deepfake detection system with hierarchical pooling and multi-level classification token aggregation methods. IEEE, 10581--10585, ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Yan, Xinrui and Yi, Jiangyan and Tao, Jianhua and Wang, Chenglong and Ma, Haoxin and Wang, Tao and Wang, Shiming and Fu, Ruibo (2022) An initial investigation for detecting vocoder fingerprints of fake audio. 61--68, Proceedings of the 1st International Workshop on Deepfake Detection for Audio Multimedia
Sun, Chengzhe and Jia, Shan and Hou, Shuwei and Lyu, Siwei (2023) Ai-synthesized voice detection using neural vocoder artifacts. 904--912, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Xie, Yuankun and Cheng, Haonan and Wang, Yutian and Ye, Long (2023) Domain Generalization Via Aggregation and Separation for Audio Deepfake Detection. IEEE Transactions on Information Forensics and Security IEEE
Lu, Jingze and Zhang, Yuxiang and Wang, Wenchao and Shang, Zengqiang and Zhang, Pengyuan (2024) One-Class Knowledge Distillation for Spoofing Speech Detection. IEEE, 11251--11255, ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Kawa, Piotr and Plata, Marcin and Syga, Piotr (2022) Defense against adversarial attacks on audio deepfake detection. INTERSPEECH 2022
Sun, Zekun and Han, Yujie and Hua, Zeyu and Ruan, Na and Jia, Weijia (2021) Improving the efficiency and robustness of deepfakes detection through precise geometric features. 3609--3618, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Bonettini, Nicolo and Cannas, Edoardo Daniele and Mandelli, Sara and Bondi, Luca and Bestagini, Paolo and Tubaro, Stefano (2021) Video face manipulation detection through ensemble of cnns. IEEE, 5012--5019, 2020 25th international conference on pattern recognition (ICPR)
Wang, Tianyi and Chow, Kam Pui (2023) Noise based deepfake detection via multi-head relative-interaction. 14548--14556, 12, Proceedings of the AAAI Conference on Artificial Intelligence
Ciamarra, Andrea and Caldelli, Roberto and Becattini, Federico and Seidenari, Lorenzo and Del Bimbo, Alberto (2024) Deepfake detection by exploiting surface anomalies: the SurFake approach. 1024--1033, Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision
Ba, Zhongjie and Liu, Qingyu and Liu, Zhenguang and Wu, Shuang and Lin, Feng and Lu, Li and Ren, Kui (2024) Exposing the Deception: Uncovering More Forgery Clues for Deepfake Detection. 719--728, 2, 38, Proceedings of the AAAI Conference on Artificial Intelligence
Zhao, Cairong and Wang, Chutian and Hu, Guosheng and Chen, Haonan and Liu, Chun and Tang, Jinhui (2023) ISTVT: interpretable spatial-temporal video transformer for deepfake detection. IEEE Transactions on Information Forensics and Security 18: 1335--1348 IEEE
Montserrat, Daniel Mas and Hao, Hanxiang and Yarlagadda, Sri K and Baireddy, Sriram and Shao, Ruiting and Horváth, János and Bartusiak, Emily and Yang, Justin and Guera, David and others (2020) Deepfakes detection with automatic face weighting. 668--669, Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops
Saikia, Pallabi and Dholaria, Dhwani and Yadav, Priyanka and Patel, Vaidehi and Roy, Mohendra (2022) A hybrid CNN-LSTM model for video deepfake detection by leveraging optical flow features. IEEE, 1--7, 2022 international joint conference on neural networks (IJCNN)
Agarwal, Shruti and Farid, Hany and El-Gaaly, Tarek and Lim, Ser-Nam (2020) Detecting deep-fake videos from appearance and behavior. IEEE, 1--6, 2020 IEEE international workshop on information forensics and security (WIFS)
Cozzolino, Davide and R{\"o}ssler, Andreas and Thies, Justus and Nie{\ss}ner, Matthias and Verdoliva, Luisa (2021) Id-reveal: Identity-aware deepfake video detection. 15108--15117, Proceedings of the IEEE/CVF International Conference on Computer Vision
Liu, Baoping and Liu, Bo and Ding, Ming and Zhu, Tianqing and Yu, Xin (2023) TI2Net: temporal identity inconsistency network for deepfake detection. 4691--4700, Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision
Huang, Baojin and Wang, Zhongyuan and Yang, Jifan and Ai, Jiaxin and Zou, Qin and Wang, Qian and Ye, Dengpan (2023) Implicit identity driven deepfake face swapping detection. 4490--4499, Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Dong, Xiaoyi and Bao, Jianmin and Chen, Dongdong and Zhang, Ting and Zhang, Weiming and Yu, Nenghai and Chen, Dong and Wen, Fang and Guo, Baining (2022) Protecting celebrities from deepfake with identity consistency transformer. 9468--9478, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Gu, Zhihao and Chen, Yang and Yao, Taiping and Ding, Shouhong and Li, Jilin and Huang, Feiyue and Ma, Lizhuang (2021) Spatiotemporal inconsistency learning for deepfake video detection. 3473--3481, Proceedings of the 29th ACM international conference on multimedia
Gu, Zhihao and Chen, Yang and Yao, Taiping and Ding, Shouhong and Li, Jilin and Ma, Lizhuang (2022) Delving into the local: Dynamic inconsistency learning for deepfake video detection. 744--752, 1, 36, Proceedings of the AAAI Conference on Artificial Intelligence
Gu, Zhihao and Yao, Taiping and Chen, Yang and Ding, Shouhong and Ma, Lizhuang (2022) Hierarchical contrastive inconsistency learning for deepfake video detection. Springer, 596--613, European Conference on Computer Vision
Gu, Zhihao and Yao, Taiping and Chen, Yang and Yi, Ran and Ding, Shouhong and Ma, Lizhuang (2022) Region-Aware Temporal Inconsistency Learning for DeepFake Video Detection.. 920--926, IJCAI
Yin, Qilin and Lu, Wei and Li, Bin and Huang, Jiwu (2023) Dynamic difference learning with spatio-temporal correlation for deepfake video detection. IEEE Transactions on Information Forensics and Security IEEE
Han, Bing and Li, Jianshu and Ren, Wenqi and Luo, Man and Liu, Jian and Cao, Xiaochun (2023) SIGMA-DF: Single-Side Guided Meta-Learning for Deepfake Detection. 153--161, Proceedings of the 2023 ACM International Conference on Multimedia Retrieval
Zhu, Xiaogang and Lin, Bo and He, Xinan and Xu, Jianfeng and Ding, Feng (2024) TMFD: Two-Stage Meta-learning Feature Disentanglement Framework for DeepFake Detection. IEEE, 1--8, 2024 IEEE International Joint Conference on Biometrics (IJCB)
Devasthale, Aditya and Sural, Shamik (2022) Adversarially robust deepfake video detection. IEEE, 396--403, 2022 IEEE Symposium Series on Computational Intelligence (SSCI)
Luo, Anwei and Kong, Chenqi and Huang, Jiwu and Hu, Yongjian and Kang, Xiangui and Kot, Alex C (2023) Beyond the prior forgery knowledge: Mining critical clues for general face forgery detection. IEEE Transactions on Information Forensics and Security 19: 1168--1182 IEEE
Wang, Zhendong and Bao, Jianmin and Zhou, Wengang and Wang, Weilun and Li, Houqiang (2023) Altfreezing for more general video face forgery detection. 4129--4138, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Ni, Yunsheng and Meng, Depu and Yu, Changqian and Quan, Chengbin and Ren, Dongchun and Zhao, Youjian (2022) CORE: Consistent representation learning for face forgery detection. 12--21, Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Feng, Chao and Chen, Ziyang and Owens, Andrew (2023) Self-supervised video forensics by audio-visual anomaly detection. 10491--10503, proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Wang, Xueyu and Huang, Jiajun and Ma, Siqi and Nepal, Surya and Xu, Chang (2022) Deepfake disrupter: The detector of deepfake is my friend. 14920--14929, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Corvi, Riccardo and Cozzolino, Davide and Zingarini, Giada and Poggi, Giovanni and Nagano, Koki and Verdoliva, Luisa (2023) On the detection of synthetic images generated by diffusion models. IEEE, 1--5, ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Yan, Zhiyuan and Zhang, Yong and Yuan, Xinhang and Lyu, Siwei and Wu, Baoyuan (2023) DeepfakeBench: A comprehensive benchmark of deepfake detection. arXiv preprint arXiv:2307.01426
Chen, Heather and Magramo, Kathleen. Finance worker pays out $25 million after video call with deepfake 'chief financial officer'. Accessed on 14 March 2024. https://edition.cnn.com/2024/02/04/asia/deepfake-cfo-scam-hong-kong-intl-hnk/index.html, 2024
Natnicha, Surasit. Criminal exploitation of deepfakes in South East Asia. Accessed on 18 March 2025. https://globalinitiative.net/analysis/deepfakes-ai-cyber-scam-south-east-asia-organized-crime/, 2024
The Guardian. Democrats sound alarm over AI robocall to voters mimicking Biden. Accessed on 18 March 2025. https://www.theguardian.com/us-news/2024/jan/22/biden-fake-robocalls-new-hampshire, 2024
Malik, Asad and Kuribayashi, Minoru and Abdullahi, Sani M and Khan, Ahmad Neyaz (2022) DeepFake detection for human face images and videos: A survey. IEEE Access 10: 18757--18775 IEEE
Kwon, Patrick and You, Jaeseong and Nam, Gyuhyeon and Park, Sungwoo and Chae, Gyeongsu (2021) KoDF: A large-scale Korean deepfake detection dataset. 10744--10753, Proceedings of the IEEE/CVF international conference on computer vision
Zhou, Tianfei and Wang, Wenguan and Liang, Zhiyuan and Shen, Jianbing (2021) Face forensics in the wild. 5778--5788, Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Yan, Zhiyuan and Yao, Taiping and Chen, Shen and Zhao, Yandan and Fu, Xinghe and Zhu, Junwei and Luo, Donghao and Wang, Chengjie and Ding, Shouhong and Wu, Yunsheng and others (2024) DF40: Toward next-generation deepfake detection. Advances in Neural Information Processing Systems
Narayan, Kartik and Agarwal, Harsh and Thakral, Kartik and Mittal, Surbhi and Vatsa, Mayank and Singh, Richa (2022) DeePhy: On deepfake phylogeny. IEEE, 1--10, 2022 IEEE International Joint Conference on Biometrics (IJCB)
Li, Yuang and Zhang, Min and Ren, Mengxin and Ma, Miaomiao and Wei, Daimeng and Yang, Hao (2024) Cross-Domain Audio Deepfake Detection: Dataset and Analysis. The 62nd Annual Meeting of the Association for Computational Linguistics
Le, Trung-Nghia and Nguyen, Huy H and Yamagishi, Junichi and Echizen, Isao (2021) Openforensics: Large-scale challenging dataset for multi-face forgery detection and segmentation in-the-wild. 10117--10127, Proceedings of the IEEE/CVF international conference on computer vision
Layton, Seth and Tucker, Tyler and Olszewski, Daniel and Warren, Kevin and Butler, Kevin and Traynor, Patrick (2024) SoK: The Good, The Bad, and The Unbalanced: Measuring Structural Limitations of Deepfake Media Datasets. 1027--1044, 33rd USENIX Security Symposium (USENIX Security 24)
Yan, Ziwei and Zhao, Yanjie and Wang, Haoyu (2024) VoiceWukong: Benchmarking deepfake voice detection. arXiv preprint arXiv:2409.06348
Amin, Muhammad Ahmad and Hu, Yongjian and Guan, Yu and Amin, Muhammad Zain (2024) Exploring varying color spaces through representative forgery learning to improve deepfake detection. Digital Signal Processing 147: 104426 Elsevier
Qiao, Tong and Chen, Yuxing and Zhou, Xiaofei and Shi, Ran and Shao, Hang and Shen, Kunye and Luo, Xiangyang (2023) CSC-Net: Cross-color spatial co-occurrence matrix network for detecting synthesized fake images. IEEE Transactions on Cognitive and Developmental Systems 16(1): 369--379 IEEE
Wang, Jun and Tondi, Benedetta and Barni, Mauro (2022) An eyes-based siamese neural network for the detection of gan-generated face images. Frontiers in Signal Processing 2: 918725 Frontiers Media SA
Talib, Manar Abu and Nasir, Qassim and Nassif, Ali Bou and Fadhl, Norah Ba and Gouda, Omar Mohamed (2025) Chrominance and Luminance: a study to detect deepfakes. Multimedia Tools and Applications : 1--19 Springer
Yang, Jiachen and Li, Aiyun and Xiao, Shuai and Lu, Wen and Gao, Xinbo (2021) MTD-Net: Learning to detect deepfakes images by multi-scale texture difference. IEEE Transactions on Information Forensics and Security 16: 4234--4245 IEEE
Katamneni, Vinaya Sree and Rattani, Ajita (2024) Contextual cross-modal attention for audio-visual deepfake detection and localization. IEEE, 1--11, 2024 IEEE International Joint Conference on Biometrics (IJCB)
Mazaheri, Ghazal and Roy-Chowdhury, Amit K (2022) Detection and localization of facial expression manipulations. 1035--1045, Proceedings of the IEEE/CVF winter conference on applications of computer vision
Das, Sowmen and Islam, Md Saiful and Amin, Md Ruhul (2022) Gca-net: utilizing gated context attention for improving image forgery localization and detection. 81--90, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Tan, Chuangchuang and Zhao, Yao and Wei, Shikui and Gu, Guanghua and Liu, Ping and Wei, Yunchao (2024) Rethinking the up-sampling operations in cnn-based generative network for generalizable deepfake detection. 28130--28139, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Hong, Cheng-Yao and Hsu, Yen-Chi and Liu, Tyng-Luh (2024) Contrastive learning for deepfake classification and localization via multi-label ranking. 17627--17637, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Xia, Zhiming and Qiao, Tong and Xu, Ming and Zheng, Ning and Xie, Shichuang (2022) Towards DeepFake video forensics based on facial textural disparities in multi-color channels. Information sciences 607: 654--669 Elsevier
Jeon, Su Min and Seong, Hyeon Ah and Lee, Eui Chul (2022) Deepfake video detection using the frequency characteristic of remote photoplethysmography. Springer, 1--6, International Conference on Intelligent Human Computer Interaction
Huda, Noor ul and Javed, Ali and Maswadi, Kholoud and Alhazmi, Ali and Ashraf, Rehan (2024) Fake-checker: A fusion of texture features and deep learning for deepfakes detection. Multimedia Tools and Applications 83(16): 49013--49037 Springer
Hu, Juan and Liao, Xin and Gao, Difei and Tsutsui, Satoshi and Wang, Qian and Qin, Zheng and Shou, Mike Zheng (2024) Delocate: Detection and localization for deepfake videos with randomly-located tampered traces. Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence : 5862--5871
Yaroshchuk, Artem and Papastergiopoulos, Christoforos and Cuccovillo, Luca and Aichroth, Patrick and Votis, Konstantinos and Tzovaras, Dimitrios (2023) An open dataset of synthetic speech. IEEE, 1--6, 2023 IEEE International Workshop on Information Forensics and Security (WIFS)
Zhang, Lin and Wang, Xin and Cooper, Erica and Evans, Nicholas and Yamagishi, Junichi (2022) The partialspoof database and countermeasures for the detection of short fake speech segments embedded in an utterance. IEEE/ACM Transactions on Audio, Speech, and Language Processing 31: 813--825 IEEE
M{\"u}ller, Nicolas M and Kawa, Piotr and Choong, Wei Herng and Casanova, Edresson and G{\"o}lge, Eren and M{\"u}ller, Thorsten and Syga, Piotr and Sperl, Philip and B{\"o}ttinger, Konstantin (2024) Mlaad: The multi-language audio anti-spoofing dataset. IEEE, 1--7, 2024 International Joint Conference on Neural Networks (IJCNN)
Xue, Jun and Fan, Cunhang and Lv, Zhao and Tao, Jianhua and Yi, Jiangyan and Zheng, Chengshi and Wen, Zhengqi and Yuan, Minmin and Shao, Shegang (2022) Audio deepfake detection based on a combination of f0 information and real plus imaginary spectrogram features. 19--26, Proceedings of the 1st international workshop on deepfake detection for audio multimedia
Li, Menglu and Ahmadiadli, Yasaman and Zhang, Xiao-Ping (2022) A comparative study on physical and perceptual features for deepfake audio detection. 35--41, Proceedings of the 1st International Workshop on Deepfake Detection for Audio Multimedia
Alzantot, Moustafa and Wang, Ziqi and Srivastava, Mani B (2019) Deep Residual Neural Networks for Audio Spoofing Detection. arXiv preprint arXiv:1907.00501
Yang, Jichen and Das, Rohan Kumar and Li, Haizhou (2018) Extended constant-Q cepstral coefficients for detection of spoofing attacks. IEEE, 1024--1029, 2018 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)
Vahdati, Danial Samadi and Nguyen, Tai D and Azizpour, Aref and Stamm, Matthew C (2024) Beyond deepfake images: Detecting ai-generated videos. 4397--4408, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Li, Gen and Zhao, Xianfeng and Cao, Yun (2023) Forensic symmetry for deepfakes. IEEE Transactions on Information Forensics and Security 18: 1095--1110 IEEE
Guo, Siyou and Gao, Mingliang and Li, Qilei and Jeon, Gwanggil and Camacho, David (2024) Deepfake Detection via a Progressive Attention Network. IEEE, 1--6, 2024 International Joint Conference on Neural Networks (IJCNN)
Coccomini, Davide Alessandro and Messina, Nicola and Gennaro, Claudio and Falchi, Fabrizio (2022) Combining efficientnet and vision transformers for video deepfake detection. Springer, 219--229, International conference on image analysis and processing
Zhang, Daichi and Lin, Fanzhao and Hua, Yingying and Wang, Pengju and Zeng, Dan and Ge, Shiming (2022) Deepfake video detection with spatiotemporal dropout transformer. 5833--5841, Proceedings of the 30th ACM international conference on multimedia
Zhang, Daichi and Li, Chenyu and Lin, Fanzhao and Zeng, Dan and Ge, Shiming (2021) Detecting Deepfake Videos with Temporal Dropout 3DCNN.. 1288--1294, IJCAI
Zheng, Yinglin and Bao, Jianmin and Chen, Dong and Zeng, Ming and Wen, Fang (2021) Exploring temporal coherence for more general video face forgery detection. 15044--15054, Proceedings of the IEEE/CVF international conference on computer vision
Zhao, Hanqing and Zhou, Wenbo and Chen, Dongdong and Zhang, Weiming and Yu, Nenghai (2022) Self-supervised transformer for deepfake detection. arXiv preprint arXiv:2203.01265
Coccomini, Davide Alessandro and Kordopatis-Zilos, Giorgos and Amato, Giuseppe and Caldelli, Roberto and Falchi, Fabrizio and Papadopoulos, Symeon and Gennaro, Claudio (2024) MINTIME: Multi-identity size-invariant video deepfake detection. IEEE Transactions on Information Forensics and Security IEEE
Kwak, Il-Youp and Kwag, Sungsu and Lee, Junhee and Huh, Jun Ho and Lee, Choong-Hoon and Jeon, Youngbae and Hwang, Jeonghwan and Yoon, Ji Won (2021) ResMax: Detecting voice spoofing attacks with residual network and max feature map. IEEE, 4837--4844, 2020 25th International Conference on Pattern Recognition (ICPR)
Gao, Yang and Lian, Jiachen and Raj, Bhiksha and Singh, Rita (2021) Detection and evaluation of human and machine generated speech in spoofing attacks on automatic speaker verification systems. IEEE, 544--551, 2021 IEEE Spoken Language Technology Workshop (SLT)
Saleem, Summra and Dilawari, Aniqa and Khan, Muhammad Usman Ghani and Husnain, Muhammad (2019) Voice conversion and spoofed voice detection from parallel English and Urdu corpus using cyclic GANs. IEEE, 1--6, 2019 International Conference on Robotics and Automation in Industry (ICRAI)
Lee, Hannah and Lee, Changyeon and Farhat, Kevin and Qiu, Lin and Geluso, Steve and Kim, Aerin and Etzioni, Oren (2024) The Tug-of-War Between Deepfake Generation and Detection. arXiv preprint arXiv:2407.06174
Wani, Taiba Majid and Gulzar, Reeva and Amerini, Irene (2024) ABC-CapsNet: Attention based cascaded capsule network for audio deepfake detection. 2464--2472, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops
Jung, Jee-weon and Heo, Hee-Soo and Tak, Hemlata and Shim, Hye-jin and Chung, Joon Son and Lee, Bong-Jin and Yu, Ha-Jin and Evans, Nicholas (2022) AASIST: Audio anti-spoofing using integrated spectro-temporal graph attention networks. IEEE, 6367--6371, ICASSP 2022-2022 IEEE international conference on acoustics, speech and signal processing (ICASSP)
Rosello, Eros and Alanís, Alejandro Gómez and Gomez, Angel M and Peinado, Antonio M and Harte, N and Carson-Berndsen, J and Jones, G (2023) A conformer-based classifier for variable-length utterance processing in anti-spoofing. 5281--5285, 2023, Interspeech
Nguyen-Le, Hong-Hanh and Tran, Van-Tuan and Nguyen, Dinh-Thuc and Le-Khac, Nhien-An (2025) A Survey on Proactive Deepfake Defense: Disruption and Watermarking. Authorea Preprints Authorea
Huang, Bingyuan and Cui, Sanshuai and Huang, Jiwu and Kang, Xiangui (2023) Discriminative frequency information learning for end-to-end speech anti-spoofing. IEEE Signal Processing Letters 30: 185--189 IEEE
Hua, Guang and Teoh, Andrew Beng Jin and Zhang, Haijian (2021) Towards end-to-end synthetic speech detection. IEEE Signal Processing Letters 28: 1265--1269 IEEE
Tak, Hemlata and Patino, Jose and Todisco, Massimiliano and Nautsch, Andreas and Evans, Nicholas and Larcher, Anthony (2021) End-to-end anti-spoofing with RawNet2. IEEE, 6369--6373, ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Gu, Albert and Dao, Tri (2023) Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752
Chen, Yujie and Yi, Jiangyan and Xue, Jun and Wang, Chenglong and Zhang, Xiaohui and Dong, Shunbo and Zeng, Siding and Tao, Jianhua and Lv, Zhao and Fan, Cunhang (2024) RawBMamba: End-to-End Bidirectional State Space Model for Audio Deepfake Detection. 2720--2724, Proc. Interspeech 2024
Chakravarty, Nidhi and Dua, Mohit (2024) A lightweight feature extraction technique for deepfake audio detection. Multimedia Tools and Applications 83(26): 67443--67467 Springer
Wang, Xin and Yamagishi, Junichi (2024) Can large-scale vocoded spoofed data improve speech spoofing countermeasure with a self-supervised front end?. IEEE, 10311--10315, ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Martín-Doñas, Juan M and Álvarez, Aitor and Rosello, Eros and Gomez, Angel M and Peinado, Antonio M (2024) Exploring Self-supervised Embeddings and Synthetic Data Augmentation for Robust Audio Deepfake Detection. 2085--2089, Interspeech 2024
Tran, Hoan My and Guennec, David and Martin, Philippe and Sini, Aghilas and Lolive, Damien and Delhay, Arnaud and Marteau, Pierre-François (2024) Spoofed speech detection with a focus on speaker embedding. INTERSPEECH 2024
Pan, Zihan and Liu, Tianchi and Sailor, Hardik B and Wang, Qiongqiong (2024) Attentive Merging of Hidden Embeddings from Pre-trained Speech Model for Anti-spoofing Detection. 2090--2094, Proc. Interspeech 2024
Kim, Hyun Myung and Jang, Kangwook and Kim, Hoirin (2024) One-class learning with adaptive centroid shift for audio deepfake detection. 4853--4857, Proc. Interspeech 2024
Doan, Thien-Phuc and Nguyen-Vu, Long and Hong, Kihun and Jung, Souhwan (2024) Balance, Multiple Augmentation, and Re-synthesis: A Triad Training Strategy for Enhanced Audio Deepfake Detection. 2105--2109, Proc. Interspeech 2024
Yang, Yujie and Qin, Haochen and Zhou, Hang and Wang, Chengcheng and Guo, Tianyu and Han, Kai and Wang, Yunhe (2024) A robust audio deepfake detection system via multi-view feature. IEEE, 13131--13135, ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Oorloff, Trevine and Koppisetti, Surya and Bonettini, Nicolò and Solanki, Divyaraj and Colman, Ben and Yacoob, Yaser and Shahriyari, Ali and Bharaj, Gaurav (2024) AVFF: Audio-visual feature fusion for video deepfake detection. 27102--27112, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Huang, Wen and Gu, Yanmei and Wang, Zhiming and Zhu, Huijia and Qian, Yanmin (2025) Generalizable Audio Deepfake Detection via Latent Space Refinement and Augmentation. arXiv preprint arXiv:2501.14240
Hasanaath, Ahmed Abul and Luqman, Hamzah and Katib, Raed and Anwar, Saeed (2025) FSBI: Deepfake detection with frequency enhanced self-blended images. Image and Vision Computing : 105418 Elsevier
Chen, Shen and Yao, Taiping and Liu, Hong and Sun, Xiaoshuai and Ding, Shouhong and Ji, Rongrong and others (2025) DiffusionFake: Enhancing Generalization in Deepfake Detection via Guided Stable Diffusion. Advances in Neural Information Processing Systems 37: 101474--101497
Chen, Zhongxi and Sun, Ke and Zhou, Ziyin and Lin, Xianming and Sun, Xiaoshuai and Cao, Liujuan and Ji, Rongrong (2024) DiffusionFace: Towards a comprehensive dataset for diffusion-based face forgery analysis. arXiv preprint arXiv:2403.18471
Guo, Ying and Zhen, Cheng and Yan, Pengfei (2023) Controllable guide-space for generalizable face forgery detection. 20818--20827, Proceedings of the IEEE/CVF international conference on computer vision
Li, Hanzhe and Zhou, Jiaran and Li, Yuezun and Wu, Baoyuan and Li, Bin and Dong, Junyu (2024) FreqBlender: Enhancing DeepFake Detection by Blending Frequency Knowledge. The Thirty-eighth Annual Conference on Neural Information Processing Systems
Qiao, Tong and Xie, Shichuang and Chen, Yanli and Retraint, Florent and Luo, Xiangyang (2024) Fully unsupervised deepfake video detection via enhanced contrastive learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 46(7): 4654--4668 IEEE
Choi, Jongwook and Kim, Taehoon and Jeong, Yonghyun and Baek, Seungryul and Choi, Jongwon (2024) Exploiting style latent flows for generalizing deepfake video detection. 1133--1143, Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Cheng, Jikang and Yan, Zhiyuan and Zhang, Ying and Luo, Yuhao and Wang, Zhongyuan and Li, Chen (2024) Can We Leave Deepfake Data Behind in Training Deepfake Detector?. The Thirty-eighth Annual Conference on Neural Information Processing Systems
Zhu, Yi and Koppisetti, Surya and Tran, Trang and Bharaj, Gaurav (2025) SLIM: Style-linguistics mismatch model for generalized audio deepfake detection. Advances in Neural Information Processing Systems 37: 67901--67928
Zhang, Xiaohui and Yi, Jiangyan and Tao, Jianhua and Wang, Chenlong and Xu, Le and Fu, Ruibo (2023) Adaptive fake audio detection with low-rank model squeezing. arXiv preprint arXiv:2306.04956
Wu, Haochen and Guo, Wu and Peng, Shengyu and Li, Zhuhai and Zhang, Jie (2024) Adapter Learning from Pre-trained Model for Robust Spoof Speech Detection. 2095--2099, Proc. Interspeech 2024
Seraj, Md Shamim and Singh, Ankita and Chakraborty, Shayok (2024) Semi-supervised deep domain adaptation for deepfake detection. 1061--1071, Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision
Oiso, Hideyuki and Matsunaga, Yuto and Kakizaki, Kazuya and Miyagawa, Taiki (2024) Prompt Tuning for Audio Deepfake Detection: Computationally Efficient Test-time Domain Adaptation with Limited Target Dataset. 2710--2714, Proc. Interspeech 2024
M{\"u}ller, Nicolas and Czempin, Pavel and Diekmann, Franziska and Froghyar, Adam and B{\"o}ttinger, Konstantin (2022) Does Audio Deepfake Detection Generalize?. 2783--2787, Proc. Interspeech 2022
Hospedales, Timothy and Antoniou, Antreas and Micaelli, Paul and Storkey, Amos (2021) Meta-learning in neural networks: A survey. IEEE transactions on pattern analysis and machine intelligence 44(9): 5149--5169 IEEE
Van de Ven, Gido M and Tuytelaars, Tinne and Tolias, Andreas S (2022) Three types of incremental learning. Nature Machine Intelligence 4(12): 1185--1197 Nature Publishing Group UK London
Gou, Jianping and Yu, Baosheng and Maybank, Stephen J and Tao, Dacheng (2021) Knowledge distillation: A survey. International Journal of Computer Vision 129(6): 1789--1819 Springer
Zhou, Kaiyang and Liu, Ziwei and Qiao, Yu and Xiang, Tao and Loy, Chen Change (2022) Domain generalization: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 45(4): 4396--4415 IEEE
Jia, Shuai and Ma, Chao and Yao, Taiping and Yin, Bangjie and Ding, Shouhong and Yang, Xiaokang (2022) Exploring frequency adversarial attacks for face forgery detection. 4103--4112, Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Li, Dongze and Wang, Wei and Fan, Hongxing and Dong, Jing (2021) Exploring adversarial fake images on face manifold. 5789--5798, Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Hou, Yang and Guo, Qing and Huang, Yihao and Xie, Xiaofei and Ma, Lei and Zhao, Jianjun (2023) Evading deepfake detectors via adversarial statistical consistency. 12271--12280, Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Corvi, Riccardo and Cozzolino, Davide and Poggi, Giovanni and Nagano, Koki and Verdoliva, Luisa (2023) Intriguing properties of synthetic images: from generative adversarial networks to diffusion models. 973--982, Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Bai, Tao and Luo, Jinqi and Zhao, Jun and Wen, Bihan and Wang, Qian (2021) Recent advances in adversarial training for adversarial robustness. arXiv preprint arXiv:2102.01356
Li, Yanjie and Xie, Bin and Guo, Songtao and Yang, Yuanyuan and Xiao, Bin (2024) A survey of robustness and safety of 2D and 3D deep learning models against adversarial attacks. ACM Computing Surveys 56(6): 1--37 ACM New York, NY
Khan, Sarwar and Chen, Jun-Cheng and Liao, Wen-Hung and Chen, Chu-Song (2024) Adversarially robust Deepfake detection via adversarial feature similarity learning. Springer, 503--516, International conference on multimedia modeling
Xu, Huiyu and Wang, Yaopeng and Wang, Zhibo and Ba, Zhongjie and Liu, Wenxin and Jin, Lu and Weng, Haiqin and Wei, Tao and Ren, Kui (2024) ProFake: Detecting Deepfakes in the Wild against Quality Degradation with Progressive Quality-adaptive Learning. 2207--2221, Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security
Yang, Xin and Li, Yuezun and Lyu, Siwei (2019) Exposing deep fakes using inconsistent head poses. IEEE, 8261--8265, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Wang, Xin and Yamagishi, Junichi and Todisco, Massimiliano and Delgado, Héctor and Nautsch, Andreas and Evans, Nicholas and Sahidullah, Md and Vestman, Ville and Kinnunen, Tomi and Lee, Kong Aik and others (2020) ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech. Computer Speech & Language 64: 101114 Elsevier
Liao, Xin and Wang, Yumei and Wang, Tianyi and Hu, Juan and Wu, Xiaoshuai (2023) FAMM: Facial muscle motions for detecting compressed deepfake videos over social networks. IEEE Transactions on Circuits and Systems for Video Technology 33(12): 7236--7251 IEEE
Woo, Simon and others (2022) ADD: Frequency attention and multi-view based knowledge distillation to detect low-quality compressed deepfake images. 122--130, 1, 36, Proceedings of the AAAI conference on artificial intelligence
Lee, Sangyup and An, Jaeju and Woo, Simon S (2022) BZNet: Unsupervised multi-scale branch zooming network for detecting low-quality deepfake videos. 3500--3510, Proceedings of the ACM Web Conference 2022
Le, Binh M and Woo, Simon S (2023) Quality-agnostic deepfake detection with intra-model collaborative learning. 22378--22389, Proceedings of the IEEE/CVF International Conference on Computer Vision
Chen, Zongmei and Liao, Xin and Wu, Xiaoshuai and Chen, Yanxiang (2024) Compressed Deepfake Video Detection Based on 3D Spatiotemporal Trajectories. IEEE, 1--8, 2024 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)
Wang, Bo and Tang, Yeling and Wei, Fei and Ba, Zhongjie and Ren, Kui (2024) FTDKD: Frequency-Time Domain Knowledge Distillation for Low-Quality Compressed Audio Deepfake Detection. IEEE/ACM Transactions on Audio, Speech, and Language Processing IEEE
Xu, Yuting and Liang, Jian and Jia, Gengyun and Yang, Ziming and Zhang, Yanhao and He, Ran (2023) TALL: Thumbnail layout for deepfake video detection. 22658--22668, Proceedings of the IEEE/CVF international conference on computer vision
Xu, Yuting and Liang, Jian and Sheng, Lijun and Zhang, Xiao-Yu (2024) Learning spatiotemporal inconsistency via thumbnail layout for face deepfake detection. International Journal of Computer Vision 132(12): 5663--5680 Springer
Stutz, David and Hein, Matthias and Schiele, Bernt (2019) Disentangling adversarial robustness and generalization. 6976--6987, Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Gowda, Shruthi and Zonooz, Bahram and Arani, Elahe (2024) Conserve-Update-Revise to Cure Generalization and Robustness Trade-off in Adversarial Training. The Twelfth International Conference on Learning Representations
Chen, Zhengrui and Lu, Liying and Yuan, Ziyang and Zhu, Yiming and Li, Yu and Yuan, Chun and Deng, Weihong (2024) Blind face restoration under extreme conditions: leveraging 3D-2D prior fusion for superior structural and texture recovery. 1263--1271, 2, 38, Proceedings of the AAAI Conference on Artificial Intelligence
Criminisi, Antonio and Pérez, Patrick and Toyama, Kentaro (2004) Region filling and object removal by exemplar-based image inpainting. IEEE Transactions on image processing 13(9): 1200--1212 IEEE
Khan, Salman and Naseer, Muzammal and Hayat, Munawar and Zamir, Syed Waqas and Khan, Fahad Shahbaz and Shah, Mubarak (2022) Transformers in vision: A survey. ACM Computing Surveys (CSUR) 54(10s): 1--41 ACM New York, NY
Zhang, Chen and Xie, Yu and Bai, Hang and Yu, Bin and Li, Weihong and Gao, Yuan (2021) A survey on federated learning. Knowledge-Based Systems 216: 106775 Elsevier
Bai, Jianhong and Liu, Zuozhu and Wang, Hualiang and Hao, Jin and Feng, Yang and Chu, Huanpeng and Hu, Haoji (2023) On the effectiveness of out-of-distribution data in self-supervised long-tail learning. The Eleventh International Conference on Learning Representations
Miao, Wenjun and Pang, Guansong and Bai, Xiao and Li, Tianqi and Zheng, Jin (2024) Out-of-distribution detection in long-tailed recognition with calibrated outlier class learning. 4216--4224, 5, 38, Proceedings of the AAAI Conference on Artificial Intelligence
Ho, Chih-Hui and Peng, Kuan-Chuan and Vasconcelos, Nuno (2024) Long-tailed anomaly detection with learnable class names. 12435--12446, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Salehi, Mohammadreza and Mirzaei, Hossein and Hendrycks, Dan and Li, Yixuan and Rohban, Mohammad Hossein and Sabokrou, Mohammad (2022) A unified survey on anomaly, novelty, open-set, and out-of-distribution detection: Solutions and future challenges. Transactions on Machine Learning Research
Tantaru, Dragos-Constantin and Oneata, Elisabeta and Oneata, Dan (2024) Weakly-supervised deepfake localization in diffusion-generated images. IEEE Computer Society, 6246--6256, 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)
Li, Chuqiao and Huang, Zhiwu and Paudel, Danda Pani and Wang, Yabin and Shahbazi, Mohamad and Hong, Xiaopeng and Van Gool, Luc (2023) A continual deepfake detection benchmark: Dataset, methods, and essentials. 1339--1349, Proceedings of the IEEE/CVF winter conference on applications of computer vision
Tan, Chuangchuang and Zhao, Yao and Wei, Shikui and Gu, Guanghua and Liu, Ping and Wei, Yunchao (2024) Frequency-aware deepfake detection: Improving generalizability through frequency space domain learning. 5052--5060, 5, 38, Proceedings of the AAAI Conference on Artificial Intelligence
Yan, Zhiyuan and Wang, Jiangming and Jin, Peng and Zhang, Ke-Yue and Liu, Chengchun and Chen, Shen and Yao, Taiping and Ding, Shouhong and Wu, Baoyuan and Yuan, Li (2025) Orthogonal Subspace Decomposition for Generalizable AI-Generated Image Detection. Forty-second International Conference on Machine Learning
Nguyen-Le, Hong-Hanh and Tran, Van-Tuan and Nguyen, Dinh-Thuc and Le-Khac, Nhien-An (2025) Think Twice Before Adaptation: Improving Adaptability of DeepFake Detection via Online Test-Time Adaptation. 7679--7687, Main Track, Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence, IJCAI-25. International Joint Conferences on Artificial Intelligence Organization. https://doi.org/10.24963/ijcai.2025/854
Li, Menglu and Ahmadiadli, Yasaman and Zhang, Xiao-Ping (2025) A survey on speech deepfake detection. ACM Computing Surveys 57(7): 1--38 ACM New York, NY
Ciobanu, Ioan-Paul and Hiji, Andrei-Iulian and Ristea, Nicolae-Catalin and Irofti, Paul and Rusu, Cristian and Ionescu, Radu Tudor (2025) XMAD-Bench: Cross-Domain Multilingual Audio Deepfake Benchmark. arXiv preprint arXiv:2506.00462
Zhang, Hongyang and Yu, Yaodong and Jiao, Jiantao and Xing, Eric and El Ghaoui, Laurent and Jordan, Michael (2019) Theoretically principled trade-off between robustness and accuracy. PMLR, 7472--7482, International conference on machine learning
Li, Yanxi and Xu, Chang (2023) Trade-off between robustness and accuracy of vision transformers. 7558--7568, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
He, Yang and Xiao, Lingao (2023) Structured pruning for deep convolutional neural networks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 46(5): 2900--2919 IEEE
Egashira, Kazuki and Vero, Mark and Staab, Robin and He, Jingxuan and Vechev, Martin (2024) Exploiting LLM quantization. Advances in Neural Information Processing Systems 37: 41709--41732