Program-Guided Refinement with Debate: A Multi-Agent LLM-Based Automated Fact-Checking Model
Tao Xue (1,2,3,4), Wenzhuo Liu (2,3,4), Long Xi (1,2,3,4) ✉, Wen Lv (1,2,3,4)
1 School of Cybersecurity, Xi'an Polytechnic University, Xi'an 710048, Shaanxi, China
2 School of Computer Science, Xi'an Polytechnic University, Xi'an 710048, Shaanxi, China
3 Shaanxi Key Laboratory of Clothing Intelligence, Xi'an 710048, Shaanxi, China
4 State-Province Joint Engineering and Research Center of Advanced Networking and Intelligent Information Services, Xi'an 710048, Shaanxi, China
Abstract
The explosive spread of online information has led to a proliferation of false claims, making automated fact-checking increasingly urgent. Existing automated fact-checking models share three common problems: a lack of interpretability, hallucination, and low inference efficiency. In this paper, we propose a novel model, PGR-Debate (Program-Guided Refinement with Debate), to address these challenges. PGR-Debate decomposes complex claims into three executable sub-tasks, Question, Verify, and Predict, thereby significantly enhancing the interpretability of fact-checking. To alleviate the hallucination problem, we design two Debater agents and one Finalizer agent: the two Debaters engage in interactive debates to identify and correct errors in the reasoning program, and the Finalizer then rewrites the program, gradually improving the faithfulness and credibility of explanations. To accelerate inference and enable lightweight deployment, we adopt a knowledge distillation strategy: a high-performance model serves as the teacher, and a task-aware distillation framework transfers its multi-hop reasoning capability to a smaller student model, improving inference efficiency while preserving reasoning consistency. The model requires neither domain-specific pretraining nor task-specific fine-tuning, relying instead on instruction-based prompting and knowledge distillation. Experiments on the standard FEVEROUS-S and HOVER datasets demonstrate that PGR-Debate outperforms multiple baselines under different evidence-availability settings (Gold, Open-book), reduces reasoning time to 30%–50% of that of traditional methods, and boosts the student model's inference speed by 1.9× after distillation. Moreover, the prediction error rate is reduced by about 50%, with the semantic error rate dropping from 6.8% to 2.88% after distillation.
Experimental results on the HOVER dataset demonstrate that, compared with the ProgramFC baseline (Qwen2.5-14B), our PGR-Debate model improves explanation faithfulness by 42–43 percentage points (about 3.3×) at the sentence level and by 7–8 percentage points (about 1.4×) at the program level, significantly enhancing the factual consistency of reasoning chains.
Keywords
Fact checking
LLM
Multi-agent debate
Claim decomposition
Distillation
Tao Xue and Wenzhuo Liu: These authors contributed equally to this work.
Introduction
With the rapid growth of the Internet and social media, information spreads faster and on a larger scale than ever before. However, this efficiency also accelerates the proliferation of misinformation, threatening social stability and public understanding bib42. Fact-checking serves as a key defense mechanism by verifying the authenticity of claims and providing clear, credible explanations. In fast-spreading contexts such as rumor diffusion during emergent events bib43, fact-checking must go beyond simple verification to generate high-quality explanations that build public trust. This is especially crucial in multi-hop fact-checking, where complex claims require logical reasoning across multiple evidence sources bib40,bib41.
However, existing fact-checking models suffer from several critical limitations. First, they exhibit limited semantic understanding, which weakens their ability to capture complex claim–evidence relations bib7,bib8. Second, their generated explanations often lack interpretability and faithfulness bib41,bib42. Finally, their inference processes are computationally inefficient, making real-time deployment challenging bib23.
The emergence of large language models (LLMs) has brought new opportunities to fact-checking. Their powerful text generation capabilities enable the production of explanations that resemble human language, and their semantic understanding helps capture complex claim–evidence relations bib13. However, most existing models still struggle with accurately modeling subtle semantic dependencies in multi-hop reasoning, which limits their ability to handle complex claims bib7,bib8.
Moreover, although LLMs possess strong contextual reasoning and multimodal learning capabilities bib21, their generated explanations often lack interpretability and faithfulness. This gap leads to situations where the final prediction may be correct, but the reasoning process is unreliable or inconsistent with the evidence bib41,bib42. Such issues undermine the credibility of fact-checking results, especially in high-stakes scenarios.
Finally, the deployment of fact-checking models remains constrained by computational efficiency. Pretraining and fine-tuning large models require substantial resources, while high-capacity models incur significant inference latency. Lightweight models offer faster responses but typically sacrifice reasoning accuracy in multi-hop fact-checking, resulting in a trade-off between performance and efficiency bib23.
To address the aforementioned challenges of large language models for fact-checking, we propose a novel automated fact-checking framework, PGR-Debate (Program-Guided Refinement with Debate). The model aims to improve semantic reasoning accuracy, enhance explanation faithfulness, and enable efficient lightweight deployment. Specifically, PGR-Debate first employs zero-shot and few-shot prompting to decompose complex fact-checking tasks into structured and executable subtasks, clarifying the claim–evidence relationship. It then integrates a multi-agent debate refinement mechanism to detect and correct reasoning errors, thereby mitigating hallucinations and improving interpretability. We treat hallucination as the inverse manifestation of explanation faithfulness: mitigating hallucination improves faithfulness. Finally, knowledge distillation transfers multi-hop reasoning capabilities from a high-capacity teacher model to a lightweight student model, significantly boosting inference efficiency without sacrificing accuracy.
By combining program-guided reasoning with multi-agent debate refinement, the PGR-Debate model effectively addresses the limitations of existing LLMs in fact-checking, namely insufficient accuracy, poor interpretability, and low efficiency.
The main contributions of this paper are as follows:
We propose a novel fact-checking model, PGR-Debate, which significantly improves accuracy and interpretability, enabling large language models to perform fact-checking tasks and generate coherent explanations without domain-specific pretraining or fine-tuning.
We design a multi-agent debate refinement mechanism that further enhances the credibility and faithfulness of fact-checking results while reducing hallucinations to some extent.
We construct a knowledge distillation dataset to transfer the reasoning and analytical capabilities of the DeepSeek teacher model to a lightweight student model, which substantially improves inference efficiency while maintaining multi-hop reasoning accuracy.
Related Work
Fact-Checking
Fact-checking tasks can be categorized into three directions: the advances and limitations of fact-checking datasets, deep learning methods, and large language model approaches.
Advances and Limitations of Fact-Checking Datasets. Many researchers have proposed various datasets for developing and evaluating automated fact-checking systems. However, it is worth noting that the construction of most datasets typically relies on evidence drawn from a single document to support or refute a claim. For example, in the FEVER dataset, over 87% of claims can be verified using information from just one Wikipedia article bib7. To address this limitation, subsequent studies have introduced datasets specifically designed to examine multi-step reasoning in fact-checking bib7,bib8.
Deep Learning Model. These approaches mainly include logic-based and attention-based methods bib9. Sheng et al. bib13 proposed the Pref-FEND model, which attempts to achieve joint detection by integrating graph convolutional networks with attention mechanisms. Hu et al. bib14 focused on the quality of evidence and the verification process, introducing a three-step method consisting of retrieval, querying, and verification, followed by answer generation. Graph-based models bib8 have also been applied to facilitate reasoning over multiple pieces of evidence. Although such methods have achieved substantial performance improvements, they still suffer from weak interpretability and heavy reliance on large-scale training data.
Large Language Model. In the field of fact-checking with large language models (LLMs), current tasks mainly focus on two directions: image-based fact verification bib25 and text-based fact verification bib24. Some researchers adopt knowledge-based approaches, leveraging external knowledge to determine the veracity of rumors bib33,bib34,bib35. Beyond fact-checking tasks themselves, Moritz et al. bib22 investigated how different groups on social platforms prioritize various objectives during the fact-checking process. In addition, several studies have explored few-shot bib37 and prompt-based bib38 methods, where precise task descriptions are provided as input to LLMs to enable step-by-step reasoning and prediction. Existing research has largely concentrated on the creation or adaptation of fact-checking datasets tailored for specific tasks bib19,bib20, which are then used to fine-tune LLMs for predicting the authenticity of real-world claims bib17,bib18. However, most methods rely on cumbersome prompts or repeated generations when handling complex claims, leading to low reasoning efficiency. Some studies have achieved promising results through zero-shot and few-shot learning bib12,bib36. For example, Pan et al. bib2 proposed decomposing the process of verifying complex text into multiple steps and introduced an automated fact-checking model. Dhankar et al. bib21 developed a model to address multimodal fact-checking, employing multimodal LLMs to detect both textual entailment and visual entailment tasks. Despite their performance improvements, these models still suffer from severe hallucination issues.
Justification Production
When dealing with complex real-world claims, providing only a binary judgment of veracity is often unconvincing bib1. Justifying the reasoning behind decisions is a core element of journalistic fact-checking, as fact-checkers must persuade readers of the validity of their evidence-based explanations bib15. Prior studies have proposed various approaches to generate explanations for model predictions, such as highlighting relevant evidence through attention weights bib11, employing knowledge-graph-based logical systems to construct arguments bib28, and producing summaries of the retrieved evidence bib10.
In recent years, Guo et al. bib23 observed that the widespread use of black-box models in automated fact-checking systems has drawn significant attention to the generation of explanations for fact-checking claims. Logic-based or knowledge-graph-based approaches bib26,bib27 are capable of producing relatively rigorous explanations, but they rely heavily on handcrafted rules and thus lack scalability. Beyond focusing solely on the logical relations within claims, some studies bib28,bib29 have explored automatically generating explanations by summarizing fact-checking articles. Research has also revealed that existing fact-checking tasks primarily emphasize determining the veracity of claims from evidence, while lacking effective tools for deeper reasoning about the logical connections between claims and evidence bib30. Yao et al. bib31 proposed a cross-modal attention framework to fuse image and text features for multimodal fact verification. Although such studies provide plausible explanations for decision-making, the generated explanations may not faithfully reflect the model's actual reasoning process. Maynez et al. bib39 found that despite recent progress, abstractive summarization models still produce factual errors and hallucinations, resulting in misleading justifications. To address these issues, some research has introduced multi-agent frameworks for LLMs bib3, allowing them to generate more faithful explanations.
Existing methods still suffer from unstable attention, rigid logic rules, and frequent hallucinations, resulting in unreliable reasoning. To address these issues, PGR-Debate employs program-guided reasoning to build explicit logical chains, integrates multi-agent debate to detect and correct errors, and applies knowledge distillation to improve efficiency. This design enhances explanation faithfulness, reduces hallucinations, and enables lightweight deployment while maintaining accuracy.
The PGR-Debate Model for Automated Fact-Checking
Figure 1 illustrates the overall structure of PGR-Debate. PGR-Debate comprises claim decomposition and veracity prediction (Sect. 6), multi-agent debate (Sect. 7), refinement and rewriting, and knowledge distillation (Sect. 8). This design not only alleviates the difficulty LLMs face in handling complex prompts but also mitigates their limited self-correction ability by enabling multiple agents to debate and supervise each other over the textual content.
Fig. 1
Overall structure of PGR-Debate. It includes knowledge distillation, claim decomposition, multi-agent debate, refine and rewrite, veracity prediction, and their execution process.
To enhance the lightweight deployment capability of the model, the high-performance model DeepSeek-R1 serves as the teacher and Qwen2.5-14B as the student for knowledge distillation, after which the distilled student model is applied to claim decomposition. The decomposed subtasks are then refined through multi-agent debate, and finally, the system outputs the veracity prediction for the original claim.
PGR-Debate divides the fact-checking model into two main modules: claim decomposition and fact checking. From a black-box perspective, the original claim first enters the claim decomposition module, where the natural language statement is semantically decomposed into multiple subtasks that can be directly executed by the large language model.
Before proceeding to final verification, the decomposed reasoning program is refined through a multi-agent debate mechanism. In this stage, two Debater agents interactively detect and correct logical or semantic errors in the reasoning steps, while the Finalizer agent integrates their feedback to produce a coherent and logically consistent program. This process effectively mitigates potential hallucinations and improves the faithfulness of the reasoning chain.
The refined subtasks are then passed to the fact-checking module, where the LLM predicts the veracity of each sub-claim. Furthermore, the label prediction tasks are categorized based on the nature of the knowledge sources. Finally, the PGR-Debate model outputs a veracity label for the original claim.
Claim Decomposition
In the PGR-Debate model, the generation of reasoning programs is a crucial step, aimed at transforming complex claims into clear and executable reasoning procedures to ensure the smooth execution of subsequent verification. In the claim decomposition stage, this study draws on and extends the ProgramFC framework bib2 with an agent-based design, decomposing complex claims into three types of subtasks: Question, Verify, and Predict, which are then refined through multi-agent debate. As illustrated in Figure 2, the execution process of claim decomposition follows a structured framework.
Fig. 2
Execution model of the claim decomposition module
For each input claim, an agent (the Programmer) is employed to perform the claim decomposition task. Based on predefined prompts, the Programmer generates a program $P = (s_1, s_2, \ldots, s_n)$ consisting of $n$ sequential reasoning steps $s_i$. As shown on the left side of Figure 2, the original claim $C$ is taken as input, and the Programmer outputs the reasoning program $P$. Each reasoning step is a natural-language-controlled instruction that guides the use of auxiliary subtask-handling functions. Every reasoning step is associated with a task label, and each label corresponds to a specific action within a subtask handler. Although this method effectively produces reasoning chains, it often introduces logical errors or hallucinations when handling complex claims. To address this issue, we incorporate a debate refinement mechanism to enhance logical consistency and the faithfulness of explanations.
Task Labels. Question: In this step, explicit questions are posed for ambiguous or uncertain descriptions within the original claim $C$. The execution result $A_i$ is a string, typically a declarative answer in natural language.
Verify: Based on the answers obtained in the Question step, this step verifies the truthfulness of the sub-claims decomposed from the original claim $C$. Leveraging the natural language processing capabilities of the LLM, the model uses prompts to interpret and decompose each sub-claim, with each sub-claim corresponding to a Verify step $s_i$. The execution result $V_i$ is a veracity label, where $V_i \in \{\text{TRUE}, \text{FALSE}\}$.
Predict: This step combines the veracity results of the Verify steps according to the semantic decomposition of the original claim. It generates an instruction containing logical relations (e.g., in Figure 2, the Predict step references $V_1$ and $V_2$), which is then evaluated by an interpreter to determine the overall logical outcome. The execution result $R$ is likewise a veracity label, where $R \in \{\text{TRUE}, \text{FALSE}\}$.
Fig. 3
Example of claim decomposition
Figure 3 illustrates an example of claim decomposition. The original claim is broken down into multiple subtasks, including Question, Verify, and Predict steps. Each subtask produces intermediate results such as $A_1$ and $V_1$, and finally a prediction $R$. This example demonstrates how the structured reasoning program guides the verification process from natural language claims to executable logical steps.
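As an illustration, a decomposed program of this form can be represented and executed as follows. This is a minimal sketch under our own assumptions: the `Step` structure, the `{QUESTION_1}`-style placeholder naming, and the toy handlers (which stand in for actual LLM and evidence calls) are illustrative, not the exact implementation.

```python
from dataclasses import dataclass

@dataclass
class Step:
    label: str        # task label: "Question", "Verify", or "Predict"
    instruction: str  # natural-language instruction, may reference earlier results

def execute_program(steps, handlers):
    """Run each reasoning step in order, substituting earlier results
    (e.g., {QUESTION_1}) into later instructions."""
    results = {}
    for i, step in enumerate(steps, start=1):
        instruction = step.instruction.format(**results)
        results[f"{step.label.upper()}_{i}"] = handlers[step.label](instruction)
    return results

# Toy handlers standing in for LLM/evidence calls (illustrative only)
handlers = {
    "Question": lambda q: "Paris",  # would query the LLM over evidence
    "Verify":   lambda c: "TRUE",   # would check a sub-claim against evidence
    "Predict":  lambda p: "TRUE" if "FALSE" not in p else "FALSE",
}

program = [
    Step("Question", "What city is the Eiffel Tower in?"),
    Step("Verify", "The Eiffel Tower is in {QUESTION_1}."),
    Step("Predict", "{VERIFY_2}"),
]
final = execute_program(program, handlers)
```

Here the Predict step simply aggregates the Verify labels; in the actual model this logical combination is evaluated by an interpreter.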
Multi-Agent Debate Refinement with LLMs
Multi-agent debate refinement is one of the key innovations of PGR-Debate, aiming to mitigate the hallucination problem of LLMs and to enhance the faithfulness of reasoning as well as the credibility of explanations. The multi-agent debate refinement framework is illustrated in Figure 4, and the procedure is defined in Algorithm 1 (Debate Refinement Model).
Fig. 4
Illustration of the multi-agent debate refinement framework
Multi-agent Debate. Given an initial reasoning program $P$, the two debaters (D1 and D2) examine errors in $P$ and provide feedback $f_t^{(d)}$, where $t$ denotes the debate round and $d \in \{1, 2\}$ identifies the debater. Different task objectives are assigned to each debater: Debater 1 (D1) identifies errors according to predefined error types specified in the prompts. The prompt for D1 also includes few-shot examples of errors together with corresponding feedback. Debater 2 (D2), in contrast, focuses on potential errors that may affect the faithfulness of explanations, without relying on predefined error types. For this purpose, the prompt of D2 adopts a zero-shot design, enabling it to detect not only predefined error types but also additional logical flaws and inconsistencies in claim decomposition that may undermine faithfulness. This design enables the debate process to comprehensively detect potential logical and semantic errors in $P$.
In subsequent debate rounds, D1 and D2 cross-examine each other's feedback from the previous round $t-1$, where the feedback of D1 is denoted as $f_{t-1}^{(1)}$ and that of D2 as $f_{t-1}^{(2)}$. They then revise their own feedback, for instance by adding details they had previously overlooked or by removing incorrect parts. To ensure the accuracy of the feedback, D1 and D2 continue their discussion until a final consensus is reached. In our implementation, each debater outputs a structured response in the format \texttt{\{judge: \{CORRECT/REJECT\}, asw: \{"content"\}\}}. The debate is automatically terminated once both debaters independently return \texttt{judge: CORRECT}, indicating mutual agreement that the reasoning program contains no remaining logical or semantic errors.
To handle rare cases where convergence cannot be reached (e.g., oscillating or contradictory judgments), a maximum number of debate rounds $T_{\max}$ is additionally imposed as a safeguard. This hybrid strategy, rule-based termination supplemented by an empirical upper bound, ensures both interpretability and computational stability while maintaining reproducible experimental control.
Refine and Rewrite. During the $t$-th debate round, a Finalizer agent is introduced to evaluate whether the feedback from D1 and D2 has reached consensus. When the condition $f_t^{(1)}.\text{judge} = f_t^{(2)}.\text{judge} = \text{CORRECT}$ is satisfied, the debate is terminated. Finally, the Finalizer concatenates the final feedback from both debaters and integrates it into the rewriting process. The Finalizer receives the initial claim $C$, the original reasoning program $P$, and the feedback from both agents, and performs a comprehensive assessment to rewrite the reasoning program as $P^{*}$.
The considerations of the Finalizer are not limited to checking whether the debaters' feedback is reasonable. It must also ensure that the rewritten reasoning program $P^{*}$ maintains a consistent format, and that the content of both the original program $P$ and the rewritten program $P^{*}$ does not deviate from the description of the original claim $C$.
\begin{algorithm}
\caption{Debate Refinement Model}
\label{algo1}
\begin{algorithmic}[1]
\Require Original claim $C$, original reasoning program $P$
\Ensure Refined reasoning program $P^{*}$
\State Initialize debaters $D_1$ and $D_2$ with bidirectional reasoning process
\State Initialize finalizer agent $F$ for rewriting based on debaters' feedback
\State $f_0^{(1)} \gets D_1(C, P)$
\State $f_0^{(2)} \gets D_2(C, P)$
\State Set maximum number of debate iterations $T_{\max}$
\For{$t = 1$ \textbf{to} $T_{\max}$}
    \If{both $f_{t-1}^{(1)}.\text{judge} = \texttt{CORRECT}$ \textbf{and} $f_{t-1}^{(2)}.\text{judge} = \texttt{CORRECT}$}
        \State \textbf{break}
    \EndIf
    \State $f_t^{(1)} \gets D_1(C, P, f_{t-1}^{(2)})$
    \State $f_t^{(2)} \gets D_2(C, P, f_{t-1}^{(1)})$
\EndFor
\State $P^{*} \gets F(C, P, f^{(1)}, f^{(2)})$ \Comment{final feedback from both debaters}
\State \Return $P^{*}$
\end{algorithmic}
\end{algorithm}
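The debate and termination logic described above can be sketched as follows. This is a minimal sketch under our own assumptions: the debater and finalizer callables stand in for LLM invocations, and the toy agents merely emulate the structured `{judge, asw}` response format.

```python
def debate_refine(claim, program, debater1, debater2, finalizer, max_rounds=5):
    """Run multi-agent debate until both debaters judge the program CORRECT,
    then let the finalizer rewrite the program from the final feedback."""
    fb1 = debater1(claim, program, peer_feedback=None)
    fb2 = debater2(claim, program, peer_feedback=None)
    for _ in range(max_rounds):
        if fb1["judge"] == "CORRECT" and fb2["judge"] == "CORRECT":
            break  # consensus reached: no remaining logical/semantic errors
        # Cross-examination: each debater revises given the other's feedback
        fb1, fb2 = (debater1(claim, program, peer_feedback=fb2),
                    debater2(claim, program, peer_feedback=fb1))
    return finalizer(claim, program, fb1, fb2)

# Toy agents: D2 flags an error once, then agrees after cross-examination
def make_debater(name):
    state = {"rounds": 0}
    def debater(claim, program, peer_feedback):
        state["rounds"] += 1
        if name == "D2" and state["rounds"] == 1:
            return {"judge": "REJECT", "asw": "Verify step ignores the date."}
        return {"judge": "CORRECT", "asw": "No remaining errors."}
    return debater

finalizer = lambda claim, program, f1, f2: program + "  # refined"
refined = debate_refine("claim", "program",
                        make_debater("D1"), make_debater("D2"), finalizer)
```

In practice each callable would be a prompted LLM agent; the loop structure and the rule-based termination with a `max_rounds` safeguard are what matter here.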
Knowledge Distillation
To enhance the reasoning efficiency and accuracy of the PGR-Debate model in fact-checking tasks while reducing computational resource requirements, knowledge distillation (KD) is utilized to obtain a lightweight model with efficient inference.
Specifically, the high-performance large language model DeepSeek-R1 is employed as the teacher model, whose reasoning ability is distilled to guide the training of the lightweight student model Qwen2.5-14B. The distillation process focuses on transferring the logic of claim decomposition and multi-hop reasoning capabilities, enabling the student model to approximate the performance of the teacher model with a limited number of parameters.
A task-aware distillation framework is adopted, where the optimization objective consists of two levels:
Output-layer distillation: The loss function minimizes the KL divergence between the student model and the teacher model on the final label predictions:
\[
\mathcal{L}_{\text{output}} = \mathrm{KL}\!\left(P_T(y \mid x) \,\|\, P_S(y \mid x)\right)
\]
where $x$ is the input claim or sub-claim to the fact-checking model, $y$ is the veracity label corresponding to input $x$ with $y \in \{\text{TRUE}, \text{FALSE}\}$, $P_T(y \mid x)$ denotes the probability distribution over labels predicted by the teacher model (DeepSeek-R1) given input $x$, $P_S(y \mid x)$ denotes the probability distribution over labels predicted by the student model (Qwen2.5-14B) given input $x$, $\mathrm{KL}(\cdot \,\|\, \cdot)$ represents the Kullback–Leibler (KL) divergence, which measures the distance between the teacher's and student's output distributions, and $\mathcal{L}_{\text{output}}$ is the output-layer distillation loss.
Intermediate-layer Distillation: This component aligns the capability of reasoning program generation, i.e., the student model is trained to learn the sub-claim decomposition logic of the teacher model through instruction tuning. The loss function is defined as the cross-entropy:
\[
\mathcal{L}_{\text{inter}} = -\sum_{t=1}^{L} \log P_S\!\left(y_t \mid x, y_{<t}\right)
\]
where $L$ is the length of the reasoning program sequence, $y_t$ denotes the target token at position $t$ generated by the teacher model, and $P_S(y_t \mid x, y_{<t})$ is the probability assigned by the student model to token $y_t$ given the input $x$ and the previously generated tokens $y_{<t}$.
Overall Knowledge Distillation Objective: The final distillation loss combines the output-layer distillation loss and the intermediate-layer distillation loss:
\[
\mathcal{L}_{\text{KD}} = \alpha \,\mathcal{L}_{\text{output}} + (1 - \alpha)\,\mathcal{L}_{\text{inter}}
\]
where $\alpha$ is a balancing coefficient that controls the relative importance between output-layer distillation and intermediate-layer distillation.
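Under these definitions, the combined objective of an output-layer KL term and a token-level cross-entropy term can be sketched in plain Python. This is an illustrative sketch over toy values: the label distributions, token probabilities, and the choice $\alpha = 0.5$ are assumptions, and weighting the two terms by $\alpha$ and $1-\alpha$ is one common way to realize a single balancing coefficient.

```python
import math

def kl_divergence(p_teacher, p_student):
    """KL(P_T || P_S) over a discrete label distribution (e.g., TRUE/FALSE)."""
    return sum(pt * math.log(pt / ps)
               for pt, ps in zip(p_teacher, p_student) if pt > 0)

def token_cross_entropy(student_token_probs):
    """-sum_t log P_S(y_t | x, y_<t): student_token_probs[t] is the probability
    the student assigns to the teacher's target token at position t."""
    return -sum(math.log(p) for p in student_token_probs)

def kd_loss(p_teacher, p_student, student_token_probs, alpha=0.5):
    """L_KD = alpha * L_output + (1 - alpha) * L_inter."""
    l_output = kl_divergence(p_teacher, p_student)
    l_inter = token_cross_entropy(student_token_probs)
    return alpha * l_output + (1 - alpha) * l_inter

# Example: teacher confident in TRUE, student less so; three-token program
loss = kd_loss(p_teacher=[0.9, 0.1], p_student=[0.7, 0.3],
               student_token_probs=[0.95, 0.8, 0.9], alpha=0.5)
```

In an actual training loop these quantities would come from model logits (with softmax and batching); the sketch only makes the arithmetic of the two-level objective concrete.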
Experiment
All experiments run on Ubuntu 24.04.1 LTS with two RTX A6000 GPUs (96 GB total VRAM) and an Intel i7-12700F CPU. Implementations use Python 3.9.20 and the MetaGPT agent framework. Knowledge distillation is conducted on the Baidu Qianfan Cloud Platform. Since the platform does not expose underlying resource allocation details, we report the data scale and setup (teacher: DeepSeek-R1; student: Qwen2.5-14B; 3,000 distillation samples), while other hardware details are automatically allocated by the platform.
Experimental Design.
We structure our evaluation into four tiers to provide a clear and systematic validation path: (i) overall prediction quality, (ii) model-capability factors, (iii) explanation faithfulness, and (iv) special/diagnostic studies. Concretely, we conduct seven experiments on FEVEROUS-S and HOVER:
First, Comprehensive Prediction Verification assesses the overall F1 performance of the model (Sect. 12). Second, Impact of Model Parameter Size on Verification Performance compares different parameter sizes within the same LLM family under our framework (Sect. 13). Third, Faithfulness Evaluation with FactCC bib44 compares the faithfulness of explanations produced by PGR-Debate and the ProgramFC baseline at both sentence and program levels (Sect. 14). Fourth, Evaluation of Error Rate measures differences in label-prediction error rates between our model and baselines, reflecting generation quality (Sect. 15). Fifth, Effectiveness of Distillation and Reasoning Efficiency demonstrates that the distilled student model achieves faster inference while maintaining multi-hop verification accuracy (Sect. 16). Sixth, Closed-book Experiments report model performance in the closed-book setting (Sect. 17). Finally, Ablation Experiment examines the contribution of the debate refinement module (Sect. 18).
We evaluate under three evidence-availability settings: Gold (oracle evidence provided), Open-book (evidence retrieved from external knowledge sources), and Closed-book (parametric knowledge only).
Models. We instantiate PGR-Debate with Qwen2.5-14B and Qwen2.5-32B to study the impact of parameter size within a controlled model family. Baselines include ProgramFC instantiated with the same backbones for fair comparison. Qwen is a family of large language and multimodal models developed by the Qwen team at Alibaba Group, with capabilities spanning natural language understanding, text generation, vision and audio comprehension, tool use, role-playing, and interactive AI agent applications.
Datasets
Training Dataset. The distillation dataset is constructed from the multi-hop reasoning dataset HOVER bib7 (including 2-hop to 4-hop claims) and the structured information dataset FEVEROUS-S bib8 training set. A 3:2 random sampling ratio was applied across the two datasets to ensure coverage of claims with varying complexity (e.g., multi-evidence fusion, table parsing). For each claim, reasoning programs were generated by DeepSeek-R1, including sub-claim decomposition steps (Question/Verify/Predict), variable reference logic (e.g., {ANSWER_1}), and the final veracity labels with explanation chains. In total, a high-quality distillation dataset of 3,000 samples was constructed.
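The 3:2 sampling described above might be implemented as follows. This is a sketch with toy claim pools: the function name, pool sizes, and fixed seed are our own illustrative assumptions.

```python
import random

def sample_distillation_set(hover_claims, feverous_claims, total=3000, seed=42):
    """Sample HOVER and FEVEROUS-S claims at a 3:2 ratio for distillation."""
    n_hover = total * 3 // 5      # 1800 claims from HOVER
    n_feverous = total - n_hover  # 1200 claims from FEVEROUS-S
    rng = random.Random(seed)     # fixed seed for reproducibility
    return (rng.sample(hover_claims, n_hover)
            + rng.sample(feverous_claims, n_feverous))

# Toy pools standing in for the real training splits
hover = [f"hover-{i}" for i in range(2000)]
feverous = [f"fev-{i}" for i in range(1500)]
subset = sample_distillation_set(hover, feverous, total=3000)
```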
Evaluation Datasets. The evaluation was conducted on the FEVEROUS-S and HOVER datasets.
FEVEROUS bib8: A large-scale benchmark integrating fact extraction and verification from both unstructured text and structured tables. It contains 87,026 annotated claims labeled as Supported, Refuted, or Not Enough Information, with 3,000 claims in the development set. In this study, we use the FEVEROUS-S subset, which provides around 3K simplified samples for efficient evaluation and knowledge distillation experiments.
HOVER bib7: A multi-hop fact verification dataset requiring evidence from up to four Wikipedia articles, designed to test long-range reasoning and co-reference handling. The development set includes 4,000 claims.
Both datasets are widely used for benchmarking fact verification models and provide complementary challenges in reasoning complexity.
Evaluation Metrics
We employ the F1 score to evaluate the outputs generated by different models. The F1 score is a metric that integrates both precision and recall, and is widely used in classification tasks, especially in scenarios with imbalanced data distributions. By taking the harmonic mean of precision and recall, the F1 score provides a single value that reflects the overall performance of the model. The formula is defined as:
\[
F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
\]
where precision and recall are defined as follows:
\[
\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}
\]
where $TP$ denotes true positives, $FP$ denotes false positives, and $FN$ denotes false negatives.
The F1 score is particularly suitable for handling class imbalance. For example, in tasks such as text classification or sentiment analysis, when the number of negative samples is much larger than that of positive samples, models tend to be biased toward predicting the negative class. In such cases, relying solely on precision or recall may lead to unfair evaluation. By jointly considering both precision and recall, the F1 score provides a more balanced assessment. The value of the F1 score ranges between 0 and 1, where 1 indicates the best performance and 0 represents the worst.
We then define the error rate $E$ to measure the proportion of incorrectly predicted claims relative to all claims. The formula is:
\[
E = \frac{N_{\text{err}}}{N}
\]
where $N_{\text{err}}$ denotes the number of misclassified claims, and $N$ denotes the total number of claims.
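The metrics above can be computed with a short script. This is a minimal sketch: the two-label macro averaging and the toy predictions are illustrative (the paper reports Macro-F1 over the dataset labels).

```python
def f1_for_label(y_true, y_pred, label):
    """Per-label F1 from TP/FP/FN counts."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

def macro_f1(y_true, y_pred, labels=("SUPPORTS", "REFUTES")):
    """Unweighted mean of per-label F1 scores."""
    return sum(f1_for_label(y_true, y_pred, l) for l in labels) / len(labels)

def error_rate(y_true, y_pred):
    """E = N_err / N: fraction of misclassified claims."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy ground truth and predictions for four claims
y_true = ["SUPPORTS", "SUPPORTS", "REFUTES", "REFUTES"]
y_pred = ["SUPPORTS", "REFUTES", "REFUTES", "REFUTES"]
```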
Comprehensive Prediction Verification
We conducted comparative experiments between PGR-Debate and the baseline models (ProgramFC). The overall results are presented in Table 1, and the corresponding confusion matrix is shown in Figure 5.
Table 1
Performance comparison of different models on the HOVER and FEVEROUS-S datasets. Macro-F1 scores of PGR-Debate (IV–V) and baselines (I–III) on the HOVER and FEVEROUS-S evaluation sets for few-shot fact-checking. Gold and Open denote the gold-evidence setting and the open-book setting, respectively. I: pretrained Transformers; II: FC/NLI fine-tuned models; III: in-context learning models; IV: PGR-Debate and its baseline models; V: PGR-Debate with distilled student models and their baselines. Abbreviations: Q32B = Qwen2.5-32B; QDD = Qwen-DeepSeek-Distill-By-Us-14B.
\begin{tabular}{llcccccccc}
\toprule
 & \multirow{2}{*}{Language Model} & \multicolumn{2}{c}{HOVER (2-hop)} & \multicolumn{2}{c}{HOVER (3-hop)} & \multicolumn{2}{c}{HOVER (4-hop)} & \multicolumn{2}{c}{FEVEROUS-S} \\
\cmidrule(lr){3-4}\cmidrule(lr){5-6}\cmidrule(lr){7-8}\cmidrule(lr){9-10}
 & & Gold & Open & Gold & Open & Gold & Open & Gold & Open \\
\midrule
\textbf{I} & BERT-FC & 53.40 & 50.68 & 50.90 & 49.60 & 50.86 & 48.57 & 74.71 & 51.67 \\
 & LisT5 & 56.15 & 52.56 & 53.76 & 51.89 & 51.67 & 50.46 & 77.88 & 54.15 \\
\midrule
\textbf{II} & RoBERTa-NLI & 74.62 & 63.62 & 62.23 & 53.99 & 57.98 & 52.40 & 88.28 & 57.80 \\
 & DeBERTaV3-NLI & 77.22 & 68.72 & 65.98 & 60.76 & 60.49 & 56.00 & 91.98 & 58.81 \\
 & MULTIVERS & 68.86 & 60.70 & 59.87 & 52.55 & 55.67 & 51.86 & 86.03 & 56.61 \\
\midrule
\textbf{III} & Codex & 70.63 & 65.07 & 66.46 & 56.63 & 63.49 & 57.27 & 89.77 & 62.58 \\
 & FLAN-T5 & 73.69 & 69.02 & 65.66 & 60.23 & 58.08 & 55.42 & 90.81 & 63.73 \\
\midrule
\textbf{IV} & ProgramFC (Q32B) & 68.47 & 55.86 & 65.75 & 52.92 & 63.91 & 52.55 & 88.01 & 51.69 \\
 & PGR-Debate (Q32B) & 71.49 & 55.95 & 63.60 & 54.06 & 63.43 & 50.34 & 89.50 & 52.63 \\
\midrule
\textbf{V} & ProgramFC (QDD) & 51.24 & 51.60 & 50.57 & 48.01 & 53.68 & 48.22 & 79.81 & 50.27 \\
 & PGR-Debate (QDD) & 68.29 & 51.95 & 61.20 & 51.34 & 62.27 & 54.38 & 85.28 & 56.75 \\
\bottomrule
\end{tabular}
Fig. 5
Confusion matrices of ProgramFC and PGR-Debate: (i) ProgramFC on HOVER; (ii) PGR-Debate on HOVER; (iii) ProgramFC on FEVEROUS-S; (iv) PGR-Debate on FEVEROUS-S
Figure 5 (i–iv) presents the confusion matrices of the ProgramFC and PGR-Debate models on the HOVER and FEVEROUS-S datasets. These matrices provide an intuitive comparison of the prediction distributions across the veracity categories (SUPPORTS, REFUTES, and ERROR) against the ground-truth labels. Figure 5(i) and Figure 5(ii) show the comparison between ProgramFC and PGR-Debate on the HOVER dataset. The diagonal elements (i.e., correctly classified samples) in the PGR-Debate confusion matrix (Figure 5(ii)) are generally higher than those in ProgramFC (Figure 5(i)), particularly in the key categories of SUPPORTS and REFUTES. This indicates that PGR-Debate demonstrates a stronger capability in correctly identifying both true and false claims. At the same time, the off-diagonal elements (misclassifications) of PGR-Debate are relatively lower, especially in severe cases such as misclassifying SUPPORTS as REFUTES or vice versa. This reduction in severe errors is consistent with the quantitative results in Table 1, where PGR-Debate achieves higher F1 scores and lower error rates on the HOVER dataset.
On the FEVEROUS-S dataset (Figure 5(iii) and 5(iv)), PGR-Debate (Figure 5(iv)) likewise shows clear advantages along the main diagonal, particularly in the SUPPORTS category, where the number of correctly predicted samples is significantly higher than that of ProgramFC (Figure 5(iii)). Although predictions in the ERROR category remain challenging for both models (as reflected by non-zero off-diagonal elements), PGR-Debate still achieves a certain improvement in prediction accuracy for this category. These confusion matrices clearly demonstrate that PGR-Debate outperforms the baseline model ProgramFC in distinguishing between different veracity categories and in reducing severe misclassifications, providing visual evidence for its higher F1 score on the FEVEROUS-S dataset.
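The confusion matrices in Figure 5 can be reproduced from prediction logs in a few lines; the sketch below uses the three veracity categories discussed above, with illustrative toy predictions.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Rows index the ground truth, columns index the predictions."""
    idx = {c: i for i, c in enumerate(labels)}
    m = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        m[idx[t]][idx[p]] += 1
    return m

labels = ["SUPPORTS", "REFUTES", "ERROR"]
y_true = ["SUPPORTS", "SUPPORTS", "REFUTES", "ERROR"]
y_pred = ["SUPPORTS", "REFUTES", "REFUTES", "ERROR"]
m = confusion_matrix(y_true, y_pred, labels)
# Diagonal entries are correct predictions; off-diagonal entries are errors
diag = sum(m[i][i] for i in range(len(labels)))
```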
As shown in Table 1, the comparison between PGR-Debate and the baseline models demonstrates that PGR-Debate outperforms the baselines in the open-book tasks on the HOVER dataset as well as on the FEVEROUS-S dataset, but performs less effectively in the gold-evidence tasks of the HOVER dataset. Since this study adopts a computationally efficient experimental setting, only relatively small-scale baseline models were used (large language models with no more than 32B parameters), whereas some state-of-the-art systems (e.g., Codex, 175B \cite{bib2}) use substantially larger architectures. To address this limitation, we provide a theoretical scale analysis rather than a direct comparison: according to established scaling laws of large language models, performance grows sub-linearly with parameter size once model capacity exceeds task complexity. Therefore, while absolute F1 scores may differ, the relative improvements introduced by the debate mechanism are expected to generalize across scales, suggesting that PGR-Debate offers consistent reasoning gains even when applied to larger models.

Furthermore, we calculated the standard deviation of scores for 2-hop, 3-hop, and 4-hop predictions. Compared with the baselines, the deviation decreased from 0.92 to 0.48, indicating that applying PGR-Debate yields more stable reasoning performance.
After applying knowledge distillation, the experimental results of the Qwen-DeepSeek-Distill-14B model show a substantial performance gap between the PGR-Debate and ProgramFC frameworks. Under the same setting, enabling the reasoning aggregation capability in ProgramFC leads to a significant decline in accuracy: the same distilled model, when evaluated under the ProgramFC framework (see row V in Table 1), exhibits a severe performance drop (e.g., the HOVER 3-hop Open score falls to only 48.01). In contrast, within the PGR-Debate framework, the distilled model maintains a stable score of 62.27 on the HOVER 4-hop Gold task (with the standard deviation reduced to 0.48), confirming that the multi-agent debate mechanism enhances the generalization ability of lightweight models.
Impact of Model Parameter Size on Verification Performance
The reasoning capability of our model is primarily constrained by the reasoning ability of the embedded large language model (LLM). To validate this perspective, we conducted not only horizontal comparisons across different models but also comparative experiments on different versions of the same model.
Table 2
Performance of PGR-Debate with different model parameter sizes on HOVER and FEVEROUS-S. Abbreviations: Q14B = Qwen2.5-14B; Q32B = Qwen2.5-32B; QP = Qwen-Plus.
Language Model     | HOVER 2-hop   | HOVER 3-hop   | HOVER 4-hop   | FEVEROUS-S
                   | Gold  | Open  | Gold  | Open  | Gold  | Open  | Gold  | Open
-------------------|-------|-------|-------|-------|-------|-------|-------|------
PGR-Debate (Q14B)  | 62.04 | 53.55 | 61.17 | 51.93 | 61.96 | 52.17 | 88.97 | 51.92
PGR-Debate (Q32B)  | 71.49 | 55.95 | 63.60 | 54.06 | 63.43 | 50.34 | 89.50 | 52.63
PGR-Debate (QP)    | 68.38 | 53.73 | 64.09 | 51.23 | 63.81 | 53.32 | 88.95 | 52.61
As shown in Table 2, when different parameter-size versions of the same large language model are applied within the PGR-Debate framework, the larger-parameter versions consistently achieve higher F1 scores. This indicates that the reasoning capability of large language models is strongly constrained by their parameter size. According to Hugging Face's official benchmark scores, Qwen2.5-32B achieves an overall score of 37.98, while the 14B version achieves only 31.75, a 19.62% difference in overall capability. The experimental results corroborate this: the stronger the reasoning capability of the embedded LLM, the better the generated outputs.
Faithfulness Evaluation with FactCC
To further evaluate the faithfulness of the reasoning process, we conduct an additional experiment using the FactCC model to check the consistency between the original claims and the generated outputs. In the context of fact-checking, faithfulness reflects the degree to which the model's reasoning chain aligns with the factual content of the input claims, while hallucination refers to the generation of unsupported or fabricated information during the reasoning process. Enhancing faithfulness is therefore closely tied to mitigating hallucination, ensuring that each step of the reasoning process is grounded in verifiable evidence rather than spurious content. Specifically, we measure faithfulness from two perspectives: the sentence level, which evaluates the correctness of each verification subtask generated from the claim, and the program level, which evaluates the faithfulness of the final reasoning program corresponding to each claim. Since each claim may produce multiple verification subtasks, the number of samples at the sentence level can exceed the dataset size (4,000), while at the program level the sample count should match it; a count that deviates from this expectation indicates potential generation errors.
Table 3
Faithfulness evaluation results on HOVER using FactCC at sentence and program levels. Abbreviations: Q32B = Qwen2.5-32B; Q14B = Qwen2.5-14B; QDD = Qwen-DeepSeek-Distill-By-Us-14B; QP = Qwen-Plus.
Model              | Sentence-level Acc. (%) | Program-level Acc. (%)
-------------------|-------------------------|-----------------------
PGR-Debate (QDD)   | 60.77                   | 26.22
PGR-Debate (Q14B)  | 48.65                   | 6.60
PGR-Debate (QP)    | 61.68                   | 27.13
ProgramFC (Q14B)   | 18.21                   | 19.13
ProgramFC (QDD)    | 58.25                   | 7.17
We set the threshold at 0.5 to determine whether the model's output is classified as CORRECT. The faithfulness accuracy is then computed as the proportion of CORRECT outputs over the total number of samples at each level. As shown in Table 3, PGR-Debate models generally achieve higher faithfulness accuracy at both sentence and program levels than ProgramFC. Notably, PGR-Debate (QP) achieves 61.68% at the sentence level and 27.13% at the program level, outperforming the ProgramFC baselines. These results indicate that program-guided decomposition and multi-agent debate refinement contribute to generating more faithful reasoning chains.
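The thresholding step can be sketched as follows. We assume, for illustration, that FactCC emits one scalar consistency score per sample; the score list is a toy example.

```python
def faithfulness_accuracy(scores, threshold=0.5):
    """Fraction of outputs whose consistency score clears the threshold."""
    correct = sum(s >= threshold for s in scores)
    return correct / len(scores)

# Hypothetical per-sentence consistency scores from the checker
sentence_scores = [0.91, 0.45, 0.73, 0.62, 0.30]
print(faithfulness_accuracy(sentence_scores))  # 0.6
```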
Evaluation of Error Rate
One advantage of PGR-Debate is that, compared with end-to-end models, it generates explicit logical reasoning programs, which can assist humans in task understanding and debugging, thereby improving the interpretability of fact-checking. To evaluate the quality of the generated reasoning programs, we randomly selected 150 claims and conducted manual inspection of PGR-Debate's reasoning outputs, with 50 examples each from 2-hop, 3-hop, and 4-hop categories.
We report the error proportions observed in our study in Table 4, using ProgramFC as the baseline model for comparison. Our findings indicate that as the complexity of claims increases, the reasoning programs become more complex, and the proportion of semantic errors rises accordingly. This highlights the difficulty of designing appropriate step-by-step reasoning strategies for complex claims. PGR-Debate, through its multi-agent debate refinement mechanism, successfully identifies and corrects most of the semantic errors that arise in reasoning programs for complex claims, effectively mitigating the sharp increase in semantic errors caused by greater claim complexity.
Table 4
Error rates of different models on the HOVER and FEVEROUS-S datasets (%)

Language Model                           | HOVER (%) | FEVEROUS-S (%)
-----------------------------------------|-----------|---------------
ProgramFC (Qwen2.5-14B)                  | 6.80      | 10.97
PGR-Debate (Qwen2.5-14B)                 | 3.33      | 11.95
PGR-Debate (Qwen2.5-32B)                 | 2.55      | 9.89
PGR-Debate (Qwen-Plus)                   | 5.70      | 11.21
PGR-Debate (Qwen2.5-14B-Distilled-By-Us) | 2.88      | 9.92
ProgramFC (Qwen2.5-14B-Distilled-By-Us)  | 66.35     | 68.27
In the HOVER dataset, we use the ProgramFC model as the baseline for comparison and conduct experiments with different large language models. When adopting the PGR-Debate model, the proportion of erroneous reasoning program outputs decreases from 6.80% with the baseline to 3.33%. Here, the "error proportion" refers to the ratio of claims predicted as Error by PGR-Debate to the total number of test samples, which is not directly equivalent to F1 score or accuracy. On the FEVEROUS-S dataset, the error rate of the PGR-Debate model remains around 10%, comparable to the baseline; however, under the same experimental conditions, its prediction accuracy (the proportion of correctly predicted claims) improves. This indicates that even though a certain proportion of errors persists under some metrics, the overall number of correctly predicted claims increases, thereby improving the macro-F1 score and other performance indicators.
Effectiveness of Distillation and Reasoning Efficiency
To further validate the applicability of PGR-Debate in lightweight deployment scenarios, this section focuses on examining the performance retention of the model after knowledge distillation and the extent of improvement in reasoning efficiency. By comparing the performance of the student model before and after distillation with that of the baseline models, we systematically analyze the advantages of this approach in terms of both accuracy and efficiency. The experimental results are presented in Table 5.
Table 5
Comparison of reasoning program generation efficiency of PGR-Debate and baseline models (in hours)
Language Model                           | HOVER (h) | FEVEROUS-S (h)
-----------------------------------------|-----------|---------------
ProgramFC (Qwen2.5-14B)                  | 40        | 30
PGR-Debate (Qwen2.5-14B)                 | 22        | 10
PGR-Debate (Qwen2.5-32B)                 | 24        | 12
PGR-Debate (Qwen-Plus)                   | 20        | 9
PGR-Debate (Qwen2.5-14B-Distilled-By-Us) | 22        | 11
The results show that, compared with the ProgramFC baseline model, PGR-Debate achieves higher reasoning performance and better interpretability of results while requiring only 30%–50% of the generation time of the baseline. The reason lies in the generation paradigm of the baseline model, which adopts an aggregation reasoning approach: it generates multiple answers to the same question and then integrates their reasoning paths to determine the final explanation. Although this strategy can improve performance in terms of F1 score, it comes at the cost of significantly increased computation time. By contrast, the proposed PGR-Debate model generates only a single response, and during the multi-agent debate refinement process, claim decomposition tailored for fact-checking enables the other debate agents to use much simpler prompts than those in the baseline. Concise prompts without implicit complex instructions not only accelerate response generation by the LLM but also reduce the overall reasoning time.
In terms of distillation effectiveness, the student model (Qwen2.5-14B-Distilled) reduces the semantic error rate on the HOVER dataset from 6.8% to 3.3%, corresponding to a 51.47% relative reduction in erroneous predictions. At the same time, the standard deviation of F1 scores for 4-hop claims decreases from 0.92 to 0.48. This indicates that the distillation process effectively transfers the teacher model's multi-hop reasoning capability, significantly enhancing stability and robustness in complex reasoning scenarios.
Overall, the distilled PGR-Debate model achieves substantial improvements in both performance and efficiency: on the one hand, it effectively reduces semantic errors and strengthens stability in complex reasoning; on the other hand, it enables efficient reasoning under lightweight conditions. These results confirm the feasibility and potential value of PGR-Debate in practical applications.
Closed-book Experiments
We compared the experimental results under the closed-book setting, in which the model cannot access any external knowledge sources and must rely solely on its parametric knowledge. The baseline models in groups I and II of Table 1 were trained for specific domains and are therefore not applicable in this case. We compare PGR-Debate with other LLM-based prompting approaches, including Codex, ProgramFC, and FLAN-T5. Four prompting strategies were employed in our experiments: (1) Direct, i.e., direct prompting; (2) CoT, few-shot prompting based on chain-of-thought reasoning; (3) ZS-CoT, zero-shot chain-of-thought prompting, appended with the phrase "let's think step by step"; and (4) Self-Ask, a variant of CoT that guides the model to answer by decomposing the task into a series of sub-questions.
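For illustration, the four strategies differ only in their prompt templates. The wording below is an assumption for the sketch, not the verbatim prompts used in our experiments.

```python
# Hypothetical claim used only to make the templates concrete
CLAIM = "The film directed by X was released before the novel by Y."

PROMPTS = {
    # (1) Direct: ask for the verdict with no intermediate reasoning
    "Direct": f"Claim: {CLAIM}\nIs this claim SUPPORTED or REFUTED?",
    # (2) CoT: few-shot demonstrations with explicit reasoning chains
    "CoT": (
        "Q: <demonstration claim>\nA: <step-by-step reasoning> ... REFUTED\n\n"
        f"Q: {CLAIM}\nA:"
    ),
    # (3) ZS-CoT: zero-shot, triggered by the step-by-step phrase
    "ZS-CoT": f"Claim: {CLAIM}\nLet's think step by step.",
    # (4) Self-Ask: decompose the task into explicit sub-questions
    "Self-Ask": (
        f"Claim: {CLAIM}\n"
        "Follow-up question 1: Who directed the film?\n"
        "Follow-up question 2: When was the novel published?\n"
        "Final answer:"
    ),
}
```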
Table 6
Closed-book experimental results: Comparison of PGR-Debate (Qwen2.5-14B) with other baselines in terms of Macro-F1 scores
Model                  | HOVER (2-hop) | HOVER (3-hop) | HOVER (4-hop) | FEVEROUS-S
-----------------------|---------------|---------------|---------------|-----------
InstructGPT - Direct   | 56.51         | 51.75         | 49.68         | 60.13
InstructGPT - CoT      | 57.20         | 53.66         | 51.83         | 61.05
InstructGPT - ZS-CoT   | 50.30         | 52.30         | 51.58         | 54.78
InstructGPT - Self-Ask | 51.54         | 51.47         | 52.45         | 56.82
Codex                  | 55.57         | 53.42         | 45.59         | 57.85
FLAN-T5                | 48.27         | 52.11         | 51.53         | 55.16
ProgramFC              | 54.27         | 54.18         | 52.88         | 59.66
PGR-Debate             | 54.88         | 53.62         | 53.22         | 53.24
The results presented in Table 6 indicate that most models achieve macro-F1 scores on the HOVER dataset only slightly above random guessing, suggesting that verifying complex claims solely from the parametric knowledge of large language models is highly challenging. Consistent with the observations in Sect. 10, we find that model performance degrades as the number of required reasoning steps increases. On average, CoT prompting achieves scores 2.7 points higher than Direct prompting, underscoring the importance of step-by-step reasoning in complex fact-checking. Although PGR-Debate performs weaker than other models under most settings, it exhibits the smallest performance degradation as the reasoning depth increases, and in the 4-hop setting it achieves the highest score among all compared models, demonstrating greater stability over longer reasoning chains.
Ablation Studies
Refinement Module
To evaluate the practical role of the refinement module within the PGR-Debate framework, we first conducted an ablation study on the refinement and rewriting component. This module is responsible for unifying and optimizing the reasoning program based on the feedback from the Debaters; therefore, its removal may significantly affect the model's reasoning stability and prediction accuracy. In the experimental setup, after removing the Finalizer, the system directly adopts the final debate output of Debater 2 as the prediction result, which is then compared with the full PGR-Debate model. Figure 6 illustrates the performance differences between the two settings on the HOVER dataset.
Fig. 6
Ablation Study: Removing the Refinement Module
As shown in Figure 6, removing the refinement module leads to a significant performance drop in multi-hop reasoning tasks. Specifically, the error rate increases from 2.55% to 15.61%, while both the F1 score and reasoning stability exhibit large fluctuations. This indicates that relying solely on the final output of a Debater results in inconsistent outcomes, often producing semantically biased and logically incomplete reasoning chains. In contrast, the full PGR-Debate model, by leveraging the Finalizer to unify and rewrite the feedback from both Debaters, effectively enhances the faithfulness and stability of the reasoning chain. Therefore, the refinement module plays an irreplaceable role within the overall framework and serves as a critical component in ensuring model performance.
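The debate-and-refine loop and its ablation setting can be sketched as follows. Here `llm` is a hypothetical helper wrapping a chat-model call, and the prompt strings are illustrative rather than our actual agent prompts.

```python
def debate_refine(llm, claim, program, rounds=2, use_finalizer=True):
    """Run the multi-agent debate over a reasoning program."""
    for _ in range(rounds):
        # Debater 1 critiques the current reasoning program
        critique = llm(f"Find errors in the reasoning program for '{claim}':\n{program}")
        # Debater 2 revises the program in light of the critique
        program = llm(f"Revise the program given this critique:\n{critique}\n{program}")
    if use_finalizer:
        # Finalizer unifies the debate feedback into one rewritten program
        program = llm(f"Rewrite into a final, consistent program:\n{program}")
    return program  # ablation: use_finalizer=False keeps Debater 2's last output

# Count LLM calls with a stub to illustrate the cost of the full setting
calls = []
stub = lambda prompt: (calls.append(prompt) or "revised program")
debate_refine(stub, "claim", "initial program", rounds=2, use_finalizer=True)
```

With two rounds and the Finalizer enabled, the loop issues five LLM calls; dropping the Finalizer removes the last call, which is the configuration compared in this ablation.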
Debate Rounds
In addition to the refinement module, the number of debate rounds is also an important parameter of the PGR-Debate model. While increasing the number of rounds may improve result accuracy, it also significantly increases reasoning time; conversely, too few rounds may lead to insufficient stability in complex reasoning. To address this, we further conducted an ablation study on the number of debate rounds, comparing the performance of the model on the HOVER dataset when the number of rounds was reduced or completely removed. The experimental results are shown in Figure 7.
Fig. 7
Ablation Study: Reducing (or Removing) Debate Rounds
As shown in Figure 7, reducing the number of debate rounds significantly affects the stability of the model’s reasoning. When the rounds are reduced to one, the model’s multi-hop reasoning performance drops markedly, with higher error rates and increased result fluctuations; when debate rounds are completely removed, this issue becomes even more pronounced, and the model often fails to maintain a stable reasoning chain. Corresponding to the performance degradation, reasoning efficiency improves substantially as the number of rounds decreases: with a single debate round, the reasoning time is reduced from 24 hours to 16 hours, and when debates are entirely removed, the time is further shortened to 9 hours.
This phenomenon indicates that debate rounds play a critical role in error correction and stabilization within PGR-Debate, with their number directly determining the reliability of the reasoning chain. However, excessive debate rounds also incur significant time costs. Therefore, in practical applications, a trade-off between performance and efficiency must be made: in resource-constrained scenarios, fewer debate rounds can be chosen to improve speed, while in scenarios requiring high reliability and stability, the full debate process should be retained.
Conclusion
We have presented an automated fact-checking model, PGR-Debate, targeting the problems of poor interpretability, severe hallucinations, and low inference efficiency in existing fact-checking methods. PGR-Debate first employs program-guided claim decomposition to transform complex claims into executable subtasks, and then introduces a multi-agent debate and refinement mechanism that enhances the faithfulness and credibility of explanations through logical chain construction and error correction. Furthermore, a knowledge distillation strategy transfers multi-hop reasoning capabilities from a high-performance teacher model to a lightweight student model, significantly improving reasoning efficiency while maintaining accuracy. Experiments on the FEVEROUS-S and HOVER datasets demonstrate that PGR-Debate outperforms multiple baselines in terms of accuracy, error rate, and reasoning stability, while reducing reasoning time to 30%–50% of existing methods; after distillation, the student model achieves an additional 1.9× speedup. Moreover, the multi-agent debate mechanism effectively mitigates hallucinations, yielding significantly improved explanation faithfulness and more reliable reasoning chains. These results validate both the effectiveness and practicality of PGR-Debate for complex fact-checking tasks. Beyond academic research, PGR-Debate offers new solutions for practical applications such as news fact-checking, rumor detection, and multilingual cross-modal verification, providing significant societal value and broad prospects for adoption.
In the future, we will focus on optimizing claim decomposition and prompt engineering to better align with real-world fact-checking workflows, thereby further enhancing the applicability and robustness of the model.
Author Contribution
T.X. proposed the initial research idea and supervised the overall project. W.L. (Wenzhuo Liu) designed and implemented the PGR-Debate framework, conducted the experiments on the FEVEROUS-S and HOVER datasets, analyzed the results, and wrote the main manuscript text. L.X. contributed to model optimization and result validation. W.Lv. assisted in literature review, data preprocessing, and manuscript revision. All authors discussed the results, provided critical feedback, and approved the final version of the manuscript.
References:
Guo, Zhijiang and Schlichtkrull, Michael and Vlachos, Andreas (2022) A survey on automated fact-checking. Transactions of the Association for Computational Linguistics 10: 178--206
Pan, Liangming and Wu, Xiaobao and Lu, Xinyuan and Luu, Anh Tuan and Wang, William Yang and Kan, Min-Yen and Nakov, Preslav (2023) Fact-Checking Complex Claims with Program-Guided Reasoning. Proceedings of ACL
Ma, Jiatong and Hu, Linmei and Li, Rang and Fu, Wenbo (2025) LoCal: Logical and Causal Fact-Checking with LLM-Based Multi-Agents. Association for Computing Machinery, New York, NY, USA, WWW '25, Sydney NSW, Australia, 12, 1614 –1625, Proceedings of the ACM on Web Conference 2025, 10.1145/3696410.3714748, 9798400712746
Jiang, Yichen and Bordia, Shikha and Zhong, Zheng and Dognin, Charles and Singh, Maneesh and Bansal, Mohit (2020) HoVer: A Dataset for Many-Hop Fact Extraction And Claim Verification. Findings of the Association for Computational Linguistics: EMNLP 2020, 3441--3460
Aly, Rami and Guo, Zhijiang and Schlichtkrull, Michael Sejr and Thorne, James and Vlachos, Andreas and Christodoulopoulos, Christos and Cocarascu, Oana and Mittal, Arpit (2021) FEVEROUS: Fact Extraction and VERification Over Unstructured and Structured information. Proceedings of the 35th Conference on Neural Information Processing Systems Datasets and Benchmarks Track. https://openreview.net/forum?id=h-flVCIlstW
Lu, Yi Ju and Te Li, Cheng (2020) GCAN: Graph-aware co-attention networks for explainable fake news detection on social media. Association for Computational Linguistics (ACL), 505--514, 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020
Kim, Jiho and Park, Sungjin and Kwon, Yeonsu and Jo, Yohan and Thorne, James and Choi, Edward (2023) FACTKG: Fact Verification via Reasoning on Knowledge Graphs. Association for Computational Linguistics (ACL), 16190--16206, 61st Annual Meeting of the Association for Computational Linguistics, ACL 2023
Nakov, Preslav and Corney, David and Hasanain, Maram and Alam, Firoj and Elsayed, Tamer and Barrón-Cedeño, Alberto and Papotti, Paolo and Shaar, Shaden and da San Martino, Giovanni (2021) Automated Fact-Checking for Assisting Human Fact-Checkers. 30th International Joint Conference on Artificial Intelligence, IJCAI 2021, 4551--4558
Xi, Zhaohan and Du, Tianyu and Li, Changjiang and Pang, Ren and Ji, Shouling and Chen, Jinghui and Ma, Fenglong and Wang, Ting (2023) Defending pre-trained language models as few-shot learners against backdoor attacks. Advances in Neural Information Processing Systems 36: 32748--32764
Srikanth, Neha Pundlik and Sarkar, Rupak and Mane, Heran Y and Aparicio, Elizabeth M and Nguyen, Quynh C and Rudinger, Rachel and Boyd-Graber, Jordan (2024) Large Language Models Help Humans Verify Truthfulness —Except When They Are Convincingly Wrong. North American Association for Computational Linguistics
Sheng, Qiang and Zhang, Xueyao and Cao, Juan and Zhong, Lei (2021) Integrating pattern-and fact-based fake news detection via model preference learning. 1640--1650, Proceedings of the 30th ACM international conference on information & knowledge management
Thorne, James and Vlachos, Andreas (2021) Elastic weight consolidation for better bias inoculation. Association for Computational Linguistics (ACL), 957--964, 16th Conference of the European Chapter of the Associationfor Computational Linguistics, EACL 2021
Mathieu, Camille (2017) The Chicago Guide to Fact-Checking. Reference Reviews 31(6): 7--8 Emerald Publishing Limited
Choi, Eun Cheol and Ferrara, Emilio (2024) Fact-gpt: Fact-checking augmentation via claim matching with llms. 883--886, Companion Proceedings of the ACM Web Conference 2024
Lai, Cameron and Toriumi, Fujio and Yoshida, Mitsuo (2024) A multilingual analysis of pro Russian misinformation on Twitter during the Russian invasion of Ukraine. Scientific Reports 14(1): 10155 Nature Publishing Group UK London
Schlichtkrull, Michael and Guo, Zhijiang and Vlachos, Andreas (2023) Averitec: A dataset for real-world claim verification with evidence from the web. Advances in Neural Information Processing Systems 36: 65128--65167
Sharma, Karishma and Ferrara, Emilio and Liu, Yan (2022) Construction of large-scale misinformation labeled datasets from social media discourse using label refinement. 3755--3764, Proceedings of the ACM Web Conference 2022
Dhankar, Abhishek and Zaïane, Osmar and Bolduc, Francois (2022) UofA-Truth at Factify 2022: A Simple Approach to Multi-Modal Fact-Checking. DE-FACTIFY@AAAI
Pilarski, Moritz and Solovev, Kirill Olegovich and Pröllochs, Nicolas (2024) Community notes vs. snoping: how the crowd selects fact-checking targets on social media. Proceedings of the International AAAI Conference on Web and Social Media 18, 1262--1275
Guo, Zhijiang and Schlichtkrull, Michael and Vlachos, Andreas (2022) A survey on automated fact-checking. Transactions of the Association for Computational Linguistics 10: 178--206
Zeng, Fengzhu and Gao, Wei (2024) Justilm: Few-shot justification generation for explainable fact-checking of real-world claims. Transactions of the Association for Computational Linguistics 12: 334--354
Tonglet, Jonathan and Moens, Marie Francine and Gurevych, Iryna (2024) "Image, Tell me your story!" Predicting the original meta-context of visual misinformation. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 7845--7864
Vedula, Nikhita and Parthasarathy, Srinivasan (2021) Face-keg: Fact checking explained using knowledge graphs. 526--534, Proceedings of the 14th ACM International Conference on Web Search and Data Mining
Krishna, Amrith and Riedel, Sebastian and Vlachos, Andreas (2022) Proofver: Natural logic theorem proving for fact verification. Transactions of the Association for Computational Linguistics 10: 1013--1030
Atanasova, Pepa Generating fact checking explanations. Accountable and Explainable Methods for Complex Reasoning over Text, Springer, 2024, Cham, 83--103
Russo, Daniel and Tekiroğlu, Serra Sinem and Guerini, Marco (2023) Benchmarking the generation of fact checking explanations. Transactions of the Association for Computational Linguistics 11: 1250--1264
Khan, Kashif and Wang, Ruizhe and Poupart, Pascal (2022) WatClaimCheck: A new dataset for claim entailment and inference. 1293--1304, Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Yao, Barry Menglong and Shah, Aditya and Sun, Lichao and Cho, Jin-Hee and Huang, Lifu (2023) End-to-end multimodal fact-checking and explanation generation: A challenging dataset and models. 2733--2743, Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval
Wan, Pengfei and Wang, Xiaoming and Pang, Guangyao and Wang, Liang and Min, Geyong (2023) A novel rumor detection with multi-objective loss functions in online social networks. Expert Systems with Applications 213: 119239 Elsevier
Jiang, Gongyao and Liu, Shuang and Zhao, Yu and Sun, Yueheng and Zhang, Meishan (2022) Fake news detection via knowledgeable prompt learning. Information Processing & Management 59(5): 103029 Elsevier
Sun, Mengzhu and Zhang, Xi and Ma, Jianqiang and Xie, Sihong and Liu, Yazheng and Yu, Philip S (2023) Inconsistent matters: A knowledge-guided dual-consistency network for multi-modal rumor detection. IEEE Transactions on Knowledge and Data Engineering 35(12): 12736--12749 IEEE
Al-Onaizan, Yaser and Bansal, Mohit and Chen, Yun-Nung (2024) Findings of the association for computational linguistics: EMNLP 2024. Findings of the Association for Computational Linguistics: EMNLP 2024
Lee, Nayeon and Bang, Yejin and Madotto, Andrea and Fung, Pascale (2021) Towards Few-shot Fact-Checking via Perplexity. 1971--1981, Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Zhang, Xuan and Gao, Wei (2023) Towards LLM-based Fact Verification on News Claims with a Hierarchical Step-by-Step Prompting Method. 996--1011, Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Maynez, Joshua and Narayan, Shashi and Bohnet, Bernd and McDonald, Ryan (2020) On Faithfulness and Factuality in Abstractive Summarization. Association for Computational Linguistics, Online, 1906--1919, 10.18653/v1/2020.acl-main.173, jul, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Jurafsky, Dan and Chai, Joyce and Schluter, Natalie and Tetreault, Joel
de Arriba-Pérez, Francisco and García-Méndez, Silvia and Leal, Fátima and Malheiro, Benedita and Burguillo, Juan Carlos (2024) Exposing and explaining fake news on-the-fly. Machine Learning 113(7): 4615--4637 Springer
Kumari, Rina and Ashok, Nischal and Agrawal, Pawan Kumar and Ghosal, Tirthankar and Ekbal, Asif (2023) Identifying multimodal misinformation leveraging novelty detection and emotion recognition. Journal of Intelligent Information Systems 61(3): 673--694 Springer
Chen, Jing and Zhou, Gang and Lan, Mingjing and Wang, Shiyu and Li, Shunhang and Lu, Jicang (2025) Semantic-aware fake news detection with heterogeneous graph attention. Journal of Intelligent Information Systems : 1--26 Springer
Su, Chen and Zhou, Junkang and Jiang, Zhentao and Zhu, Shuwei and Li, Chao and Fang, Wei and Lu, Heng-yang (2025) Rumor detection for emergency events via few-shot ensembled prompt learning. Journal of Intelligent Information Systems : 1--32 Springer
Kryscinski, Wojciech and McCann, Bryan. Evaluating the Factual Consistency of Abstractive Text Summarization. US Patent App. 16/750,598. Google Patents, 2021
1. https://github.com/deepseek-ai/DeepSeek-R1
2. https://modelscope.cn/models/Qwen/Qwen2.5-14B-Instruct