Data and Context Matter: Towards Generalizing AI-based Software Vulnerability Detection
Rijha Safdar (a, d) · Danyail Mateen · Syed Taha Ali (b) · Umer Ashfaq · Wajahat Hussain (c)

School of Electrical Engineering and Computer Science (SEECS), National University of Sciences and Technology (NUST), Islamabad, Pakistan, 44000

Received: date / Accepted: date
Abstract
AI-based solutions demonstrate remarkable results in identifying vulnerabilities in software, but research has consistently found that this performance does not generalize to unseen codebases. In this paper, we specifically investigate the impact of model architecture, parameter configuration, and quality of training data on the ability of these systems to generalize.
For this purpose, we introduce VulGate, a high quality state of the art dataset that mitigates the shortcomings of prior datasets by removing mislabeled and duplicate samples, adding new vulnerabilities, incorporating additional metadata, integrating hard samples, and including dedicated test sets. We undertake a series of experiments to demonstrate that improved dataset diversity and quality substantially enhances vulnerability detection. We also introduce and benchmark multiple encoder-only and decoder-only models. We find that encoder-based models outperform other models in terms of accuracy and generalization. Our model achieves a 6.8% improvement in recall on the benchmark BigVul dataset and outperforms others on unseen projects, demonstrating enhanced generalizability. Our results highlight the role of data quality and model selection in the development of robust vulnerability detection systems. Our findings suggest a direction for future systems with high cross-project effectiveness.
Keywords
LLM fine-tuning
Encoder-only
Vulnerability Detection
Generalizability
Codebases
(a) e-mail: rsafdar.dphd19seecs@seecs.edu.pk
(b) e-mail: taha.ali@seecs.edu.pk
(c) e-mail: wajahat.hussain@seecs.edu.pk
(d) Corresponding author.
1 Introduction
With the rapid growth of digitization and of software applications and systems in recent years, the issue of software vulnerabilities has become a critical concern. In 2024, a record-breaking 40,000 Common Vulnerabilities and Exposures (CVEs) were published, an average of 108 per day, marking a 38% increase over 2023 (with 28,818 CVEs) [69]. This number continues to climb dramatically: the first half of 2025 has witnessed an average of 131 CVEs per day [71]. In the open-source software ecosystem, which underpins a wide range of industries, including finance, energy, aerospace, and healthcare, a recent study found reported vulnerabilities surging by 98% per year [70].
The real-world impact is also significant. Whereas critical vulnerabilities grew 37% in 2024, known exploited vulnerabilities jumped 96% [73]. Furthermore, 23.6% of exploited vulnerabilities were leveraged on or before public disclosure, and half were typically exploited within 192 days [74]. Meanwhile, the average time to fix software vulnerabilities has grown to eight and a half months, an increase of 47% over the last five years [72].
These alarming trends highlight the need for reliable automated solutions to identify software vulnerabilities at scale. The emergence of large language models (LLMs) appears to be a promising development in this direction. Although originally trained on massive corpora of natural language, these tools demonstrate strong performance in generating code, identifying bugs, and patching vulnerabilities [26, 28, 29]. Popular coding tools, including GitHub Copilot [31] and Cursor [32], have integrated LLMs for real-time debugging.
The research community is actively investigating the use of LLMs, particularly pre-trained models, to identify logic bugs and subtle security vulnerabilities [27, 30]. Whereas existing architectures demonstrate remarkable results on training and testing datasets, a fundamental limitation is that this performance does not generalize to unseen codebases [33]. For instance, the state of the art systems VulDeePecker [3] and SySeVR [8] report F1 scores of 85.4% and 84.4% respectively on benchmark datasets, but suffer severe performance degradation on the more realistic ReVeal dataset, scoring 12.18% and 19.05% respectively [9]. This lack of generalization limits the practical applicability of LLMs in real-world scenarios, where software code can vary considerably in structure, semantics, and vulnerabilities.
Whereas prior work has speculated on multiple reasons for this lack of generalization, in this paper we undertake a rigorous empirical study to evaluate the role of three key factors. First, multiple researchers have indicated that datasets used in prior assessments are problematic, featuring significant label noise, duplicated samples, and class bias [9], [38], [52]. This results in overly optimistic assessment outcomes, with models that seem accurate in benchmarks yet perform poorly in practice.
Second, most datasets lack sufficient diversity, with a bias toward a limited set of projects or vulnerability types [38], [53]. This narrow scope limits models' ability to learn generalizable patterns, leading to missed or misclassified vulnerabilities outside the dataset's scope.
The third factor is the choice of models and their configuration. Our experiments demonstrate that newer transformer architectures, pre-trained on code, dramatically outperform GNN and deep-learning solutions. Parameter configuration is also critical: prior models had limited context window sizes, mostly restricted to 512 tokens, which constrains their ability to capture long-range dependencies in code. A small context window forces the model to truncate input, omitting critical code pertaining to a vulnerability [43].
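As a minimal illustration (ours, not from the paper's code) of why truncation matters: if the flawed line sits beyond the token budget, the model never sees it. A toy whitespace tokenizer stands in here for the subword tokenizers real models use, but the truncation effect is the same.

```python
# Illustrative sketch: a fixed context window silently drops code that
# lies past the token budget, including the vulnerable line itself.

def truncate_tokens(code: str, max_tokens: int) -> list[str]:
    """Keep only the first max_tokens tokens, as a 512-token model would."""
    return code.split()[:max_tokens]

# A long function whose flaw (strcpy) appears only near the end.
long_function = " ".join(f"stmt_{i} ;" for i in range(600)) + " strcpy ( buf , src ) ;"

window_512 = truncate_tokens(long_function, 512)
window_2048 = truncate_tokens(long_function, 2048)

print("strcpy" in window_512)   # False: the flaw is truncated away
print("strcpy" in window_2048)  # True: a larger window retains it
```

The same effect occurs with real subword tokenizers; only the exact cut-off point differs.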
Our precise contributions in this paper are:
1.
We introduce VulGate, a large scale, rigorously curated dataset to train models for software vulnerability detection, classification, and patching. VulGate unifies and cleans samples from established corpora (including Devign, BigVul, ReVeal, VDISC, D2A, CVEfixes, CrossVul, DiverseVul, PrimeVul and MegaVul). The dataset contains 236,663 function-level samples with newly-scraped samples up to May 2025, along with cleaned vulnerability-related diff-based annotations. Compared to the widely used BigVul dataset, VulGate expands coverage from 91 to 180 CWE types, adds 48,027 new samples (a 25.5% increase), and incorporates new subsets manually verified by security experts. It also includes VulGate+, a 500-sample expert-verified test set explicitly designed to evaluate cross-project generalization.
2.
We deliberately incorporate hard negative samples in VulGate: function pairs with high semantic similarity (cosine similarity ≥ 0.90; nearly 90% of the training set). These pairs typically exhibit minimal syntactic difference, often just one line or token (e.g., replacing strcpy with strncpy), reflecting real-world CVEs where vulnerabilities are introduced by minimal changes. Hard negatives force models to learn semantic patterns as opposed to superficial syntax, thereby improving generalization to unseen codebases. Prior work shows that incorporating hard negatives consistently boosts performance (e.g., Li et al. [67] reported a +0.046 MRR improvement for CodeBERT on the Java subset of CodeSearchNet [58, 61]). This is a novel contribution in the context of vulnerability detection, which has not been explored in prior work.
3.
To assess model architecture and the particular role of context window size in vulnerability detection, we undertake comprehensive benchmarking exercises for five state of the art LLMs. These comprise encoder-only models (CodeBERT, UniXcoder variants) and decoder-only models (CodeGPT-2, Code Llama), spanning the range of lightweight to large-scale deployments with parameter counts from 125M to 7B. We find that an extended context window and an encoder-only architecture jointly influence performance, as masked language modeling training allows the model to process whole code sequences simultaneously.
Our results indicate that encoder-only models capable of handling longer contexts (e.g., UniXcoder-Base-Nine with a 1024-token window) and trained on VulGate demonstrate significantly improved performance. On the widely-used BigVul benchmark, UniXcoder-Base-Nine achieves an F1 score of 94.73% and a 6.8% improvement in absolute recall over CodeBERT. UniXcoder-Base-Nine also consistently outperforms baselines, including a 47-point F1 improvement on a synthetic code subset.
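As a quick arithmetic check (ours, not the paper's code), the reported F1 score is internally consistent with the precision and recall figures via F1 = 2PR / (P + R):

```python
# Verify that the reported F1 follows from precision and recall.
def f1(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

# UniXcoder-Base-Nine (1024-token window): P = 96.74, R = 92.8
print(round(f1(96.74, 92.8), 2))  # 94.73, matching the reported F1 score
```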
Moreover, our results represent a breakthrough in terms of generalization. Recent benchmark studies demonstrate precipitous performance losses of 40–70% for state of the art models transitioning from seen to unseen datasets [38], [33]. In contrast, our performance loss is an order of magnitude lower, dropping only 4–6% on the PrimeVul benchmark dataset. To the best of our knowledge, this represents the most robust cross-dataset generalization result reported in the literature to date.
The remainder of the paper is organized as follows: Section 2 provides background material on vulnerability detection. In Section 3 we undertake a literature review of leading datasets. Section 4 introduces our dataset VulGate and describes the data construction pipeline. In Section 5, we review vulnerability detection models in the research literature and describe our benchmarking and fine-tuning experiments. In Section 6 we discuss insights. Section 7 concludes the paper with a summary of findings and future work.
2 Background
Here we provide an overview of automated techniques for software vulnerability detection.
2.1 Traditional Vulnerability Detection Solutions
Conventional vulnerability detection solutions comprise static and dynamic analysis, hybrid solutions, and symbolic execution techniques.
Static analysis tools (such as Fortify [19], Clang [20], Cppcheck [21] etc.) review source code, analyzing program structure, data and control flow, syntax accuracy, etc. and identify well-known bugs and vulnerabilities [18]. The code itself is not executed in a live environment, which allows more subtle flaws to escape detection. Static analyzers also have limited contextual understanding of the code and often suffer from high false positive (FP) rates.
Industry guidelines recommend keeping FPs below 20–30% to make a tool acceptable to users. An analysis of popular open-source embedded software using GitHub's CodeQL reported an FP rate of around 23% [75]. Google's development environment limits FPs to 10% based on user feedback [44]. In practice, however, FP rates can be much higher: a large-scale study reports that up to 96% of static analyzer warnings were false [45].

Dynamic analysis tools evaluate software behavior during runtime, helping identify issues like memory leaks, performance bottlenecks, and security vulnerabilities. Detection techniques include taint analysis (i.e., tracking the data flow of untrusted inputs) and fuzzing (i.e., testing for robustness against abnormal inputs) [23, 24]. Dynamic analysis can be more effective at identifying vulnerabilities than static analysis, but this approach has various issues: it requires security experts to run applications in virtual environments, a process that can be computationally expensive and time-consuming, sometimes taking days to scale across large codebases. Dynamic analysis also struggles with limited code coverage, only detecting issues in code paths that were executed.
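To illustrate the core idea of fuzzing (our toy sketch, not a real fuzzer: production tools such as AFL and libFuzzer add coverage feedback and corpus mutation), we feed randomized inputs to a target routine and record which ones trigger a failure:

```python
# Minimal fuzzing loop: random inputs against a toy length-prefixed parser.
import random

def parse_length_prefixed(data: bytes) -> bytes:
    """Toy target: reads a 1-byte length, then that many payload bytes."""
    n = data[0]
    payload = data[1:1 + n]
    if len(payload) != n:            # declared length exceeds the buffer
        raise ValueError("short read")
    return payload

random.seed(0)  # deterministic for reproducibility
crashes = []
for _ in range(200):
    blob = bytes(random.randrange(256) for _ in range(random.randrange(1, 8)))
    try:
        parse_length_prefixed(blob)
    except ValueError:
        crashes.append(blob)

print(len(crashes) > 0)  # fuzzing quickly finds inputs violating the length check
```

Note the coverage limitation mentioned above: only code paths actually reached by some generated input are ever tested.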
Hybrid analysis combines the aforementioned static and dynamic approaches by integrating run-time data extracted from dynamic analysis into a static analysis algorithm. This results in significant performance improvements (e.g., one evaluation reported 17% improved accuracy and a 25% faster detection rate [47]). However, hybrid analysis is still resource intensive, requiring a run-time environment, computational resources, and oversight by security experts.
Symbolic execution is another traditional technique, where tools (such as KLEE [22]) explore program paths systematically to identify bugs. The major challenge in this regard is scalability due to path explosion: an analysis by Bessler et al. demonstrates that the number of program paths often grows exponentially with function complexity [48]. This process often requires substantial computational resources and time for constraint solving.
These traditional techniques, though valuable, often struggle to scale across a wide range of evolving codebases. Consequently, machine learning (ML) and deep learning (DL) approaches are gaining attention for their ability to learn vulnerability patterns, enabling more accurate, scalable, and quicker results. For instance, an ensemble learning solution, VELVET, outperforms baseline static analyzers by a factor of 4.5, with significantly reduced FP rates [49].
2.2 AI for Vulnerability Detection
We refer the reader to [76] for a detailed survey on this topic. Here we provide a brief overview:
ML and DL approaches treat vulnerability detection as a classification task, relying on features extracted from abstract syntax trees, control flow graphs, data flow graphs, or program dependence graphs [3, 4]. Whereas this approach significantly improves automation, the tools required to generate input graphs are mostly language dependent. Cross-project generalization is also an issue, with models overfitting to specific datasets.
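To make the classification framing concrete, here is a minimal bag-of-tokens featurizer (our illustration, not any published system's code); real pipelines use far richer AST/CFG/PDG features, but the output is the same kind of fixed-length vector a classifier such as a random forest consumes:

```python
# Sketch: turn a function body into a fixed-length count vector over a
# small vocabulary of security-relevant tokens. Vocabulary is illustrative.
from collections import Counter

VOCAB = ["strcpy", "strncpy", "malloc", "free", "memcpy"]

def bow_features(code: str) -> list[int]:
    """Count occurrences of each vocabulary token in the source text."""
    counts = Counter(code.replace("(", " ").replace(")", " ").split())
    return [counts[token] for token in VOCAB]

print(bow_features("char buf[10]; strcpy(buf, src); strcpy(dst, src);"))
# [2, 0, 0, 0, 0]: 'strcpy' appears twice, no other vocabulary token occurs
```

Such surface features are exactly what hard negatives (Section 4.2) are designed to defeat, since a patched function can share almost the same token counts as its vulnerable twin.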
LLMs, based on transformer architectures and pretrained on large code and natural language datasets, have also demonstrated an exceptional capability to process and generate code.
However, pre-trained models require fine-tuning for vulnerability detection, the effectiveness of which in turn depends considerably on the quality and diversity of training data. As documented in our literature review (Sec. 3), these models give strong results on the datasets on which they are trained, but performance degrades considerably on unseen codebases. As we demonstrate in this paper, a significant component of this drop in performance is due to label noise and data duplication, imbalanced datasets, and small context windows.
3 Literature Review: Datasets
The construction of specialized datasets for vulnerability detection has progressed in stages, building up from small, manually curated corpora to large-scale, automatically mined datasets. This evolution reflects the research community's efforts to cater to essential properties including scale, label quality, and diversity (Table 1).
Early benchmarks such as Devign [7] and ReVeal [9] established the importance of manually validated data with relatively high label quality. However, both these datasets are limited in scope, covering only two projects each, and lacking support for tasks such as CWE classification or fine-grained localization. Moreover, ReVeal suffers from severe class imbalance [55], with a nearly 10:1 ratio of secure to vulnerable samples, making it difficult for ML models to learn robust representations.
To expand in scale, the community shifted toward automated data mining. BigVul is a large, real-world C/C++ dataset comprising linked CVE reports with commits across 348 projects and nearly 190,000 samples [2]. This dataset significantly broadens coverage and is widely used as a benchmark in the community. However, it has shortcomings: one study reveals substantial label noise caused by reliance on commit snapshots [33], where entire commits are marked as either vulnerable or secure. Manual inspection reports approximately 25% correct labels, with label duplication exceeding 12.7%.
The VDISC [4] and D2A [40] datasets rely on traditional vulnerability detection approaches for labeling. VDISC collects over a million vulnerable functions from Juliet, Debian, and GitHub, labeled by static analyzers (Clang/Cppcheck/Flawfinder). Though impressive in scale, the reliance on static analysis inevitably results in significant noise: Le et al. report that 50% of samples are incorrectly labeled, which can adversely affect model evaluation [56]. D2A performs Infer-based differential analysis (an optimized "fast feedback" form of static code analysis) across six GitHub repositories. A commit is flagged as vulnerable if static analysis warnings resolve when a patch is applied. This approach reduces the false positive rate but may introduce other mislabeling biases due to its reliance on static analysis.

To address these limitations and improve precision, researchers focused on ground truth: CVEfixes provides high-quality CVE-based labels with rich metadata across 564 projects [36]. However, this dataset only includes publicly disclosed and assigned CVEs, and is hence inherently biased toward known and reported cases. CrossVul is a multi-language vulnerability dataset spanning 40+ languages with paired files [39]. However, label reliability is limited: manual scrutiny found label accuracy to be 48% [57].
Attempts were also made to integrate and unify prior datasets. DiverseVul consolidates samples from previous datasets (Devign, BigVul, ReVeal, CrossVul, and CVEfixes), spanning 150 CWEs with deduplication procedures [38]. This is a diverse dataset; however, the authors report a label accuracy of only 60%.
MegaVul further expands in terms of scale, comprising 17,380 vulnerabilities collected from 992 open-source repositories and spanning 169 different vulnerability types [37]. This dataset includes multiple code representations (AST, IR, PDG, etc.). However, MegaVul also inherits the limitations of commit-heuristic labeling and is imbalanced. A research study reports that 16.7% of CVEs correspond to "undecidable" patches that do not map cleanly to a single function [54].
PrimeVul, comprising 6,968 vulnerable and 228,800 secure functions, attempts to address these issues by correcting labels, carefully filtering prior corpora, and ensuring chronological splits [33]. This yields higher precision but excludes vulnerabilities that span multiple functions or do not belong to the National Vulnerability Database (NVD), thereby introducing selection bias and limiting generalization. The authors also acknowledge that while the dataset is realistic, it is significantly imbalanced.
The evolution of datasets thus far reflects a trade-off between scale and reliability. Despite significant progress to date, persisting challenges include label noise, duplication, data leakage, lack of diversity, class imbalance, and limited temporal coverage. To address these challenges, we introduce VulGate, our state of the art dataset for vulnerability detection research.

Table 1: Comparison of Major C/C++ Vulnerability Datasets.

| Dataset | Cut-off Date | Sample Size (Secure/Vulnerable) | Projects | Data Quality | Balanced | Unique CWEs | Mean Duplicate % (CWE) | Hard Neg. % | Strengths | Limitations |
|---|---|---|---|---|---|---|---|---|---|---|
| Devign∗ [7] | 2019 | 25,872 (14,978/10,894) | FFmpeg, QEMU | ★★★★ | ✓ | – | 0.31 | 86.41 | Manually validated by experts | Limited projects (2); lacks metadata |
| BigVul [2] | 2002–2019 | 188,636 (177,736/10,900) | 348+ | ★★★ | ✗ | 91 | 11.54 | 19.45 | Large dataset; real-world CVE-linked commits | Heavy label noise (~25% valid); 12.7% duplicates [33] |
| BigVulψ | – | 152,256 (143,571/8,685) | – | – | – | 88 | 1.47 | 19.13 | – | – |
| ReVeal∗ [9] | 2020 | 22,734 (20,494/2,240) | Chromium, Debian | ★★★ | ✗ | – | 1.59 | 36.13 | Real CVEs; manually validated | Severe class imbalance (~10:1); limited projects |
| VDISC∗ [4] | 2020 | 1.27M (vulnerable only) | Juliet, Debian, GitHub etc. | ★★ | ✗ | 4 | 30.83 | 0.00 | Large dataset | Labels via static analyzers; noisy |
| D2A∗ [40] | 2021 | 1.30M (1,276,970/18,653) | GitHub (6 repos) | ★★ | ✗ | – | 84.48 | 6.21 | Differential static analysis reduces false positives | Mislabeling; highly imbalanced |
| CVEfixes [36] | 2002–Jul 2024 | 168,089 (159,157/8,932) | 564 | ★★★★ | ✗ | 180 | 30.57 | 9.36 | High quality CVE-based labels; rich metadata | Biased to known CVEs; not balanced |
| CVEfixesψ | – | 139,400 (134,100/5,300) | – | – | – | 180 | 8.07 | 67.5 | – | – |
| CrossVul [39] | 2021 | 134,126 (127,242/6,884) | 498 | ★★★ | ✗ | 94 | 37.19 | 79.76 | Multi-language (40+); paired files | Label noise (manual ~48%); file-level only |
| CrossVulψ | – | 102,386 (99,220/3,166) | – | – | – | 91 | 10.50 | 79.00 | – | – |
| DiverseVul∗ [38] | 2023 | 348,987 (330,492/18,495) | 797 | ★★★ | ✗ | 150 | 0.95 | 25.22 | Covers 150 CWEs; deduplicated | Label noise persists (~60% acc.); leakage risk |
| PrimeVul∗ [33] | 2024 | 235,768 (228,800/6,968) | libpng, openssl etc. | ★★★★ | ✗ | – | – | – | Corrected labels; chronological split | Excludes multi-function/non-NVD cases |
| MegaVul [37] | 2006–Oct 2023 | 353,873 (335,898/17,975) | 992 | ★★★★ | ✗ | 169 | 2.26 | 0.00 | 169 CWEs; multi-representation (AST, IR, PDG) | Imbalanced; inherits commit-heuristic limitations |
| MegaVulψ | – | 309,988 (300,788/9,200) | – | – | – | 168 | 1.10 | 10.09 | – | – |
| VulGate (ours) | May 2025 | 236,663 (119,231/117,432) | 792 | ★★★★★ | ✓ | 180 | 0.44 | 89.0 | Cleaned; structured; up-to-date; supports multiple tasks | CWE bias |

ψ denotes the refined/updated version of the original dataset which we incorporate in our dataset VulGate.
∗ denotes datasets whose commit information is not available for updating.
Balanced column: ✓ indicates vulnerable and secure samples are equally represented; ✗ indicates an imbalanced dataset.
Data Quality column (star-based scale): ★★★★★ (five stars) corresponds to large datasets with expert/automated validation and minimal noise; ★★★★ (four stars) indicates large datasets with expert/semi-validated labels; ★★★ (three stars) reflects datasets with noticeable noise but still widely used; ★★ (two stars) denotes datasets whose labels are automatically generated by static analyzers.

Fig. 1: Automated end-to-end data collection pipeline for VulGate. The workflow integrates CVE/CWE records, GitHub commits, and function-level parsing to build a structured vulnerability dataset.
4 VulGate: Pipeline and Dataset
VulGate significantly extends prior datasets in the research literature and includes considerable new content. VulGate consists of 1.36 GB of structured data, comprising 236,663 function-level samples, including 117,432 secure code samples and 119,231 vulnerable code samples. This dataset includes real-world vulnerabilities curated from GitHub commits linked to CVEs, along with CWE mapping and metadata. VulGate also expands the temporal coverage of prior datasets, spanning May 2002 to May 2025.

Table 2: Illustrative examples of VulGate entries, mapping vulnerable code (Function Before) to secure patches (Function After) with CWE annotations and metadata.

| Function Before (Vulnerable) | Function After (Secure) | CWE ID | CWE Type | CWE Description | Vuln. Index | Vuln. Line No. | Vulnerable Code | Patch Index | Patch Line No. | Patch Code |
|---|---|---|---|---|---|---|---|---|---|---|
| `void copy(char *src) { char buf[10]; strcpy(buf, src); }` | `void copy(char *src) { char buf[10]; strncpy(buf, src, sizeof(buf) - 1); buf[9] = '\0'; }` | CWE-120 | Buffer Overflow | Buffer Copy without Checking Size | 2 | 3 | `strcpy(buf, src);` | 2 | 3 | `strncpy(buf, src, sizeof(buf) - 1); buf[9] = '\0';` |
| `void authenticate(char *input) { if(strcmp(input, "admin")) { printf("Access granted\n"); } }` | `void authenticate(char *input) { if(strcmp(input, "admin") == 0) { printf("Access granted\n"); } }` | CWE-253 | Logic Error in Auth | Incorrect Check of Return Value | 1 | 2 | `if(strcmp(input, "admin"))` | 1 | 2 | `if(strcmp(input, "admin") == 0)` |
Overall, the VulGate dataset spans 792 projects and 7,600 commits, and covers 180 CWE classes, including memory corruption, input validation, logic errors, and access control. Each code sample includes function-level code from before and after patching, exact line information for vulnerabilities and patches, CWE information, and commit metadata. This fine-grained annotation makes VulGate suitable not only for binary classification but also for a range of related applications, including vulnerability localization, type classification, and automatic patching. We believe these unique properties make VulGate one of the most up to date, accurate, and versatile datasets currently available for vulnerability research.
4.1 The VulGate Data Collection Pipeline
We first describe our data collection process. We develop a fully automated data collection pipeline to scrape C/C++ vulnerabilities, integrating multiple sources, including Common Vulnerabilities and Exposures (CVE) records, the Common Weakness Enumeration (CWE) taxonomy, and GitHub repositories, and extracting function-level code snippets. The step-by-step process is depicted in Fig. 1.
The pipeline begins with (1) selecting C/C++ projects from the aforementioned vulnerability datasets, as well as new projects, and extracting relevant keywords, including project names or vulnerability-related terms. Then, (2) using Selenium-based web automation, the system queries public CVE databases with these keywords to retrieve vulnerability reports. To ensure inclusion of high-impact cases, we select CVEs with a severity score greater than 5.
Next, (3) we extract the corresponding CWE identifiers for each CVE and query the MITRE CWE repository to enrich the dataset with descriptive metadata, including the CWE title, type, and classification. In parallel, (4) we parse reference links from CVE entries to locate corresponding GitHub repositories or commits and retrieve vulnerability fixes. We collect both current and parent commit identifiers, thereby enabling access to complete before and after code versions.
Table 2 presents VulGate samples, where Function Before shows the vulnerable code and Function After its secure patch. The first case demonstrates a buffer overflow (CWE-120) fixed by replacing strcpy with a bounds-checked strncpy, while the second shows an authentication error (CWE-253) corrected by an explicit comparison of the strcmp return value.
Then, (5) we process these code snapshots using Clang and srcML to extract structured function-level code blocks. This enables us to precisely analyze code changes, including pinpointing flaw lines, patch lines, and their respective positions, as well as the exact code differences introduced in the fix. Finally, (6) all collected data, including CVE severity, commit metadata, and detailed code diffs, is recorded in structured CSV files, creating a high-quality dataset for downstream vulnerability research.
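The diff-extraction part of step 5 can be sketched as follows. This is our illustrative approximation: difflib stands in for the actual Clang/srcML tooling, and the toy function pair mirrors the first Table 2 example.

```python
# Sketch: recover flaw lines and patch lines from before/after versions
# of a function by aligning them line-by-line.
import difflib

before = ["void copy(char *src) {",
          "  char buf[10];",
          "  strcpy(buf, src);",
          "}"]
after  = ["void copy(char *src) {",
          "  char buf[10];",
          "  strncpy(buf, src, sizeof(buf) - 1);",
          "  buf[9] = '\\0';",
          "}"]

flaw_lines, patch_lines = [], []
matcher = difflib.SequenceMatcher(a=before, b=after)
for op, a1, a2, b1, b2 in matcher.get_opcodes():
    if op in ("replace", "delete"):    # lines removed by the fix = flaw lines
        flaw_lines += [(i + 1, before[i]) for i in range(a1, a2)]
    if op in ("replace", "insert"):    # lines added by the fix = patch lines
        patch_lines += [(j + 1, after[j]) for j in range(b1, b2)]

print(flaw_lines)   # [(3, '  strcpy(buf, src);')]
print(patch_lines)  # the strncpy line (3) and the terminator line (4)
```

The recovered (line number, line text) pairs correspond to the Vulnerable Line No./Code and Patch Line No./Code fields recorded in the dataset.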
It took approximately three months to collect and pre-process data for VulGate, followed by an additional 100 person-hours for manual verification of subsets by three security experts.
4.2 Introducing Hard Negatives
A key contribution of our paper is the incorporation of hard negative samples in the training data. This is an underexplored topic in the literature. By exposing the model to hard negative samples that are semantically similar to vulnerable instances but carry different labels, we encourage the model to learn fine-grained distinctions between vulnerable and non-vulnerable samples that are difficult even for human reviewers. This approach significantly reduces false positives and improves cross-domain generalization.
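A minimal sketch of how such pairs can be mined (our illustration only): pair each vulnerable function with secure functions whose embeddings exceed a similarity threshold. The toy 3-d vectors stand in for embeddings from a code encoder such as CodeBERT, and all names are hypothetical.

```python
# Hard-negative mining sketch: keep secure functions whose embedding is
# nearly identical to a vulnerable function's embedding.
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

vulnerable = {"copy_v1": [0.9, 0.1, 0.4]}
secure = {"copy_v2": [0.88, 0.12, 0.41],   # one-token fix: nearly identical
          "parser":  [0.1, 0.9, 0.2]}      # unrelated function

THRESHOLD = 0.90
hard_negatives = [
    (v, s) for v, ve in vulnerable.items()
    for s, se in secure.items() if cosine(ve, se) >= THRESHOLD
]
print(hard_negatives)  # only the near-duplicate pair qualifies
```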
These findings are consistent with work in related domains: for instance, a hard negative sampling approach demonstrated significant improvement in the capacity of models to discriminate and understand code [58]. Likewise, contrastive learning with hard negative samples was found to encourage clear decision boundaries and thereby improve representation quality [59]. Similarly, the effectiveness of retrieval tasks increased when hard negatives were paired with augmentation [60].
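The intuition behind the contrastive-learning results cited above can be seen in a toy InfoNCE-style loss (our illustration; the similarity values are made up): an easy negative contributes almost no loss, while a hard negative keeps the loss, and hence the learning signal, high.

```python
# Why hard negatives sharpen training: compare InfoNCE loss for an easy
# versus a hard negative at fixed positive similarity.
import math

def info_nce(sim_pos: float, sim_neg: float, temp: float = 0.1) -> float:
    """One-negative InfoNCE loss over temperature-scaled similarities."""
    num = math.exp(sim_pos / temp)
    den = num + math.exp(sim_neg / temp)
    return -math.log(num / den)

easy = info_nce(sim_pos=0.9, sim_neg=0.1)   # unrelated negative
hard = info_nce(sim_pos=0.9, sim_neg=0.85)  # near-duplicate negative
print(easy < hard)  # True: the hard negative yields a far larger loss
```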
4.3 Incorporating Prior Datasets
Table 1 summarizes the features of widely used vulnerability datasets, including size, duplication, CWE diversity, CWE distribution, and the percentage of hard negatives in each dataset.
We make every effort to retain and enhance the fundamental features of prior datasets, to ensure that VulGate is compatible with existing research pipelines and benchmark datasets. Like Devign, ReVeal, and PrimeVul, VulGate provides function-level samples with explicit vulnerable and secure labels. As in the case of BigVul, CVEfixes, and MegaVul, VulGate links CVE and CWE metadata, so that each function is linked to real-world vulnerabilities. In line with DiverseVul, VulGate also integrates information from multiple datasets across a diverse set of projects.
We also address key limitations of prior datasets. To rectify the issues of duplication, label noise, and poor accuracy encountered in BigVul, CrossVul, and DiverseVul, we apply a strict duplicate removal and cleaning policy. To mitigate the class imbalance encountered in ReVeal and CVEfixes, VulGate maintains a balanced distribution of vulnerable and patched samples. We also address the data leakage concerns other researchers have observed in Devign, BigVul, and DiverseVul [33], by ensuring clean train/validation/test splits free from code overlap.
Hard negatives vary considerably across the datasets. For instance, Devign shows a very high ratio (86.41%), but its small scale and restriction to only two projects limit its generalizability. In contrast, large datasets such as MegaVul and VDISC contain virtually no hard negatives, making them less suitable for training. A high proportion (79.76%) is reported in CrossVul, though this figure reflects hard negatives across more than 40 languages and is not directly representative of C/C++ datasets. Notably, VulGate is estimated to contain nearly 89% hard negatives, combining scale with challenging samples, and thus offers a strong foundation for generalization in vulnerability detection.
Furthermore, unlike prior datasets, VulGate includes rich metadata (CWE ID, CWE description, flaw and patch lines, commit information, and project information). This not only supports multi-task learning for vulnerability detection, but significantly expands the scope and application of this dataset to vulnerability localization, classification, and patch generation.
5 Experimental Evaluation
We now describe our models and experiments. First, in Sec. 5.1, we review prior models used for vulnerability detection and rank their performance on a standard dataset (BigVul) used in the literature. In Sec. 5.2 (Experiment #1), we introduce state of the art encoder-only and decoder-only models for vulnerability detection and evaluate their performance. These models are fine-tuned for vulnerability detection and demonstrate improved performance compared to prior models.
In Sec. 5.3 (Experiment #2), we investigate the effect of increasing context window size on vulnerability detection. Our results indicate that increasing context window size together with the right architectural choice (encoder-only models) again significantly improves performance.
In Sec. 5.4 (Experiment #3), we train our new models on our VulGate dataset and showcase further performance gains across diverse projects. In Sec. 5.5 (Experiment #4), we specifically test generalization capability. We introduce a test set, VulGate+, containing diverse real-world code and synthetic code snippets. Our positive results highlight the importance of hard negatives and balanced training data.
All VulGate datasets and scripts are available at: Gen-VulGate.
5.1 Review and Evaluation of Detection Models
We now review the models commonly used for vulnerability detection research. We cover a comprehensive set of approaches ranging from static analyzers and early ML baselines to state-of-the-art DL and LLM-based models. For purposes of comparison, in Table 3 we enumerate the performance of these models on a common dataset: to maintain consistency, we opt for the BigVul dataset, which is the de facto benchmark in the research literature. We report performance using standard classification metrics: F1-score, Precision (P), and Recall (R). Results for some models (Devign, ReVeal, and CodeBERT) were replicated and cross-checked against their original publications and found to be in agreement.

Table 3: F1-score, Precision (P), and Recall (R) on the BigVul benchmark dataset. Bold values denote our experimental runs, with the best results in each column also highlighted in bold. Results of BoW + RF, Russell, VulDeePecker, SySeVR, and IVDetect are taken from [1].

| Category | Technique/Model | F1 | P | R |
|---|---|---|---|---|
| Static Analyzer | Cppcheck [21] | 12% | 10% | 15% |
| | Infer [42] | 19.5% | 15% | 28% |
| ML | BoW + RF [5, 6] | 25% | 48% | 17% |
| | Russell et al. [4] | 24% | 16% | 48% |
| DL | VulDeePecker [3] | 19% | 12% | 49% |
| | SySeVR [8] | 27% | 15% | 74% |
| GNN | Devign [7] | 26% | 18% | 52% |
| | ReVeal [9] | 30% | 19% | 74% |
| | IVDetect [10] | 35% | 23% | 72% |
| Decoder | CodeGPT-2 (context window: 1024) | 90.45% | 97% | 84.45% |
| | CodeLlama (context window: 1024) | 79.73% | 79.34% | 80.13% |
| Encoder | CodeBERT [1] | 91% | 97% | 86% |
| | UniXcoder-Base (context window: 512) | 84.68% | 84.32% | 85.04% |
| | UniXcoder-Base (context window: 1024) | 94.23% | 97.28% | 91.37% |
| | UniXcoder-Base-Nine (context window: 512) | 88.8% | 87.9% | 89.7% |
| | UniXcoder-Base-Nine (context window: 1024) | 94.73% | 96.74% | 92.8% |
We include the static analyzers Cppcheck and Infer in our evaluation, since they represent industry-standard baselines. These results also demonstrate the practical gap between traditional rule-based vulnerability detection and ML/DL approaches, particularly in terms of recall and robustness. Table 3 shows that Cppcheck achieves 12% F1, indicating that it misses the vast majority of vulnerable samples. Infer performs slightly better with 19.5% F1 but still exhibits very low overall effectiveness, showing the limitations of static analyzers.
Early ML approaches relied primarily on shallow features, such as bag-of-words (BoW) and n-gram token frequency, often coupled with classification algorithms like random forests (RF) [5, 6]. These solutions are easy to deploy and computationally efficient but, as Risse et al. report, this reliance on lexical features results in a failure to reason about the semantic behavior of code and achieves limited recall [50]. Consistent with this, the BoW + RF baseline achieves a relatively high precision of 48% but very poor recall of 17%, highlighting that such models tend to overfit due to their reliance on superficial features.
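To make this baseline concrete, here is a minimal sketch of a BoW + RF classifier in scikit-learn (the snippets and labels are toy illustrations, not drawn from any dataset):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

# Toy code snippets standing in for real functions (illustrative labels).
snippets = [
    "strcpy(buf, input);",                    # unchecked copy  -> vulnerable
    "strncpy(buf, input, sizeof(buf) - 1);",  # bounded copy    -> safe
    "gets(line);",                            # unbounded read  -> vulnerable
    "fgets(line, sizeof(line), stdin);",      # bounded read    -> safe
]
labels = [1, 0, 1, 0]  # 1 = vulnerable, 0 = non-vulnerable

# Bag-of-words over code identifiers + random forest, as in the BoW + RF baseline.
model = make_pipeline(
    CountVectorizer(token_pattern=r"[A-Za-z_]\w*"),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
model.fit(snippets, labels)
pred = model.predict(["strcpy(dst, user_data);"])
```

Note that such a model keys entirely on surface tokens like `strcpy`; wrapping the same unsafe call behind a renamed helper defeats it, which is precisely the lexical-feature weakness described above.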
A pivotal shift in vulnerability detection occurred with the use of learning-based techniques, first developed by Russell et al. [4], who used a deep feature representation learning technique that directly interprets lexed source code. However, their use of synthetic code fragments, i.e., “good” and “bad” words injected into source code samples, limits real-world applicability and also significantly impacts the ability of the model to generalize. The results show an F1 of 24% with 48% recall, demonstrating improved but still weak overall performance.
This was followed by the first wave of DL solutions: VulDeePecker used bidirectional LSTMs to model code sequences for vulnerability detection [3]. The model achieves only 19% F1, reflecting many false positives and false negatives. SySeVR integrated an enriched vulnerability representation by combining syntax (Sy) and semantics (Se) slicing [8]. These models outperformed previous approaches with a recall of 74%, but precision remained low at 15%, indicating that the models struggled with complex long-range dependencies in code.
Devign [7], a GNN-based model, achieves 26% F1 but still suffers from imbalanced precision and recall. ReVeal [9], also GNN-based, continues this trend, achieving 30% F1 with high recall (74%) but low precision (19%). Finally, IVDetect captures long-term dependencies [10], reaching 35% F1, the best in this group, but still far below LLM-based approaches.
The results in Table 3 highlight a clear progression. Vulnerability detection solutions in prior work, such as VulDeePecker, SySeVR, and IVDetect, show limited performance (achieving F1 scores in the range of 19–35%), largely due to their reliance on handcrafted program representations and limited ability to capture semantic context.
5.2 Experiment #1: Encoder vs Decoder Architectures
Pre-trained transformer architectures mark another breakthrough in vulnerability detection by bringing advances from NLP tasks into the domain of code understanding and software security. To systematically investigate the role of architectural choices, we undertake an evaluation exercise using decoder-only and encoder-only families, specifically CodeGPT-2, CodeLlama, CodeBERT, UniXcoder-Base, and UniXcoder-Base-Nine. The results are listed in Table 3.
Implementation details and experimental setup
All models were implemented in PyTorch [62] with the HuggingFace Transformers [63] and DeepSpeed [64] libraries. For the encoder models (CodeBERT, UniXcoder-Base, and UniXcoder-Base-Nine), we use the standard configuration, i.e., 12 encoder blocks, a hidden size of 768, and 12 attention heads. Following common practice [34], we fine-tune using the AdamW optimizer with a learning rate of 2 × 10⁻⁵, applying a linear decay schedule. The decoder-only models CodeGPT-2 and CodeLlama are fine-tuned using DeepSpeed and parameter-efficient fine-tuning (LoRA), respectively, with a learning rate of 1 × 10⁻⁴; warmup steps, gradient accumulation, mixed precision (fp16/bf16), and 4-bit quantization were used where supported.
For all models, training was run for up to 15 epochs. We used a batch size of 30 per GPU and early stopping based on validation F1-score. Cross-entropy loss was used as the training objective. All experiments were conducted on an NVIDIA RTX 2080 graphics card. Fine-tuning typically converged within 3–12 epochs.
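The learning-rate schedule described above can be sketched as a plain function of the step count (the values mirror the encoder hyperparameters from the text; the actual runs used the HuggingFace scheduler utilities):

```python
def linear_decay_lr(step, total_steps, base_lr=2e-5, warmup_steps=0):
    """Linear warmup followed by linear decay to zero, the schedule
    applied on top of AdamW in the fine-tuning recipe above."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0.0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# Learning rate at the start, middle, and end of a 10,000-step run.
lrs = [linear_decay_lr(s, 10_000) for s in (0, 5_000, 10_000)]
```

With no warmup, the rate starts at the base value and falls linearly to zero by the final step, which is the behavior the linear decay schedule provides.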
Results
Decoder-based models such as CodeGPT-2 and CodeLlama were included to evaluate whether autoregressive architectures designed for next-token prediction can be effective for vulnerability detection. These models were chosen because they are pre-trained on code and are powerful for generative tasks. Results indicate that they underperformed compared to encoder-based models, even with optimization strategies (DeepSpeed, mixed precision, and quantization) and LoRA.
Although CodeGPT-2 and CodeLlama are large models, having 1.5B and 7B parameters respectively, and they benefit from large-scale pre-training, the relatively weak performance, especially of CodeLlama, can be attributed to a misalignment between the architecture and our objective, i.e., classification. Decoder-only models are primarily optimized for generative tasks such as code generation rather than discriminative classification. This makes them less effective at capturing the bidirectional dependencies and fine-grained structural similarities that are critical for vulnerability detection (a point noted in recent surveys on LLMs in software security [66]). Additionally, the scale of CodeLlama relative to the available labeled data further decreases performance, since larger models generally require very large amounts of data to adapt effectively [65].
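The LoRA update used for the decoder models can be illustrated with a tiny, self-contained example (hypothetical 2×2 shapes; the actual fine-tuning used the peft library): the pretrained weight W is frozen, and only a low-rank pair (A, B) is trained, so the effective weight is W + (α/r)·BA.

```python
def matmul(X, Y):
    """Plain-Python matrix multiply for the tiny example below."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def lora_weight(W, A, B, alpha, r):
    """Effective weight after merging the low-rank LoRA update:
    W_eff = W + (alpha / r) * B @ A, with W frozen during training."""
    scale = alpha / r
    BA = matmul(B, A)
    return [[W[i][j] + scale * BA[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]

# Rank-1 update on a 2x2 weight: B is 2x1, A is 1x2 (illustrative values).
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]
A = [[0.5, 0.5]]
W_eff = lora_weight(W, A, B, alpha=2, r=1)
```

For a d×d weight, this trains only 2·d·r parameters instead of d², which is why LoRA made fine-tuning the 7B-parameter CodeLlama feasible on the hardware described above.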
Encoder-only models such as CodeBERT represent bidirectional models trained on source code [34] and transfer well to downstream vulnerability classification tasks. For instance, CodeBERT achieves 91% F1 and 86% recall on BigVul, outperforming all other solutions in the literature. We replicate these results and our findings are consistent with prior reports [1].

Table 4: Distribution of samples in the VulGate vulnerability dataset used for fine-tuning. Class 0 represents non-vulnerable samples and Class 1 represents vulnerable samples.

| Split | Total Samples | Class 0 | Class 1 |
|---|---|---|---|
| Train | 189,330 | 95,295 | 94,035 |
| Validation | 23,666 | 11,995 | 11,671 |
| Test | 23,667 | 11,941 | 11,726 |
| Total | 236,663 | 119,231 | 117,432 |
UniXcoder [13] and UniXcoder-Base-Nine [14] consistently outperform the other models, likely due to their pretraining strategy, which is explicitly tailored for code comprehension. These models were pre-trained on NL-PL pairs from the CodeSearchNet dataset, which spans six programming languages: Go, Java, JavaScript, PHP, Python, and Ruby. Additionally, UniXcoder-Base-Nine is further trained on 1.5 million NL-PL pairs of C, C++, and C#.
The performance improvement can be attributed to the encoder's masked language modeling (MLM) pretraining and bidirectional attention, which allow the model to attend to whole sequences and effectively capture structural dependencies within source code, making these models more suitable for vulnerability detection tasks.
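A toy sketch of the masking step behind MLM pretraining (the rate and tokens here are illustrative; real MLM objectives also replace some masked positions with random tokens): each hidden token must be predicted from context on both sides, which is what yields bidirectional representations.

```python
import random

def mask_tokens(tokens, mask_rate=0.15, mask_token="<mask>", seed=0):
    """Randomly hide tokens and record the originals the model must
    recover; prediction uses context to the left *and* right of each
    masked position."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            masked.append(mask_token)
            targets[i] = tok  # ground truth the encoder is trained to predict
        else:
            masked.append(tok)
    return masked, targets

tokens = "if ( len > MAX ) return error ;".split()
masked, targets = mask_tokens(tokens, mask_rate=0.3)
```

A decoder-only model, by contrast, may only condition on tokens to the left, which is the architectural mismatch discussed in Experiment #1.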
From this experiment, we conclude that encoder-only models are generally more effective for software vulnerability detection. To further disentangle the effect of model architecture from input length, we undertook a follow-up experiment varying the context window size.
5.3 Experiment #2: Context Window Size
We vary input length from 512 to 1024 tokens. CodeBERT's maximum context window size is 512, whereas both variants of UniXcoder support a larger context window of 1024. We experimented with both variants of UniXcoder at context lengths of 512 and 1024.
Implementation details and experimental setup
To increase the context window of decoder-only models on our GPU resources (NVIDIA RTX 2080 GPU and Intel Xeon Gold 6230 CPU), we used the DeepSpeed library to optimize training of CodeGPT-2 [11] and applied parameter-efficient fine-tuning (PEFT) via LoRA for CodeLlama [12]. In contrast, the UniXcoder variants with a 1024-token context window were fully fine-tuned, which allowed us to adapt all model parameters for the vulnerability detection task.

Table 5: F1-score, Precision (P), and Recall (R) on the refined vulnerability dataset for our best-performing models, the previous SOTA CodeBERT [1], and the static analyzers Infer [42] and Cppcheck [21].

| Dataset | Technique/Model | F1 | P | R |
|---|---|---|---|---|
| VulGate | Cppcheck | 29.0% | 54.0% | 20.0% |
| | Infer | 42.0% | 26.0% | 98.0% |
| | CodeBERT | 85.9% | 83.2% | 89% |
| | UniXcoder-Base | 87% | 85.4% | 88.7% |
| | UniXcoder-Base-Nine | 88.9% | 87.7% | 90.0% |
Results
As observed in Table 3, for both UniXcoder variants, extending the input length from 512 to 1024 tokens consistently improved performance. Specifically, UniXcoder-Base improved from 84.68% F1 (context window: 512) to 94.23% F1 (context window: 1024). Similarly, UniXcoder-Base-Nine improved from 88.8% F1 (context window: 512) to 94.73% F1 (context window: 1024).
These results confirm that many real-world vulnerabilities span contexts exceeding 512 tokens, and that extending the input window allows the model to capture the broader control and data flow required for correct classification. In contrast, CodeBERT is constrained to 512 tokens, which limits its ability to reason over long functions despite achieving competitive performance (91% F1).
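A minimal sketch of why the window size matters (a list of placeholder tokens stands in for the real BPE tokenizer output): any vulnerable line past the truncation point is simply invisible to the model.

```python
def truncate(tokens, max_len):
    """Keep only the first max_len tokens, as the input pipeline does
    before feeding a function to the model."""
    return tokens[:max_len]

# A function of 800 "tokens" whose flaw sits at position 700.
tokens = ["tok"] * 800
tokens[700] = "strcpy"  # the vulnerable call, deep inside the function

visible_512 = truncate(tokens, 512)    # CodeBERT-sized window
visible_1024 = truncate(tokens, 1024)  # UniXcoder-sized window
```

The 512-token window drops the vulnerable call entirely, while the 1024-token window retains it, which is consistent with the F1 gains observed when the window is extended.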
5.4 Experiment #3: Benchmark Evaluation on VulGate
We now investigate the performance of our leading models on the VulGate dataset.
Implementation details and experimental setup: Table 4 summarizes the train/validation/test split with an 80:10:10 ratio, ensuring class balance. As noted earlier, to strengthen robustness and generalizability, VulGate explicitly incorporates hard negative samples.
The training data comprises 792 projects containing 189,330 samples, with an equal number of vulnerable and non-vulnerable samples. The dataset retains pairs of samples with more than 90% cosine similarity, thereby forcing the model to learn the distinction between vulnerable and non-vulnerable patterns. This approach also reduces false positives and ensures that model performance reflects true generalization rather than memorization. A selection of samples (approx. 1,000) was verified by security experts.
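One way to sketch the hard-negative criterion (token-count cosine similarity is an assumption here; the paper's exact vector representation may differ): a patched function that stays above the 90% similarity threshold relative to its vulnerable original is a useful hard negative.

```python
import math
import re
from collections import Counter

def cosine_sim(a: str, b: str) -> float:
    """Cosine similarity over bag-of-token counts (illustrative metric)."""
    ta, tb = Counter(re.findall(r"\w+", a)), Counter(re.findall(r"\w+", b))
    dot = sum(ta[t] * tb[t] for t in ta)
    na = math.sqrt(sum(v * v for v in ta.values()))
    nb = math.sqrt(sum(v * v for v in tb.values()))
    return dot / (na * nb) if na and nb else 0.0

# A one-token fix: nearly identical code, opposite labels.
vulnerable = "void copy(char *dst, char *src) { strcpy(dst, src); }"
patched    = "void copy(char *dst, char *src) { strncpy(dst, src, LEN); }"
sim = cosine_sim(vulnerable, patched)  # above the 0.9 threshold
```

Because the two functions differ by a single call, a model trained on such pairs cannot rely on superficial lexical overlap and must learn the semantic difference between the bounded and unbounded copy.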
Results
As depicted in Table 5, the encoder and decoder models significantly outperform the static analyzers, Cppcheck and Infer. The UniXcoder family surpassed CodeBERT, which served as a strong baseline in prior work. UniXcoder-Base-Nine achieves an F1 score of 88.9%, precision of 87.7%, and recall of 90.0%, compared to CodeBERT's 85.9% F1. These results confirm that our fine-tuning strategy captures vulnerability patterns better than both traditional analyzers and pre-trained baselines.

Fig. 2: A visual example of a vulnerability (CVE-2021-40568) correctly identified by UniXcoder-Base-Nine: vulnerable lines in svc_parse_slice. The check on pps_id is insufficient, leading to a potential buffer overflow.
Performance on VulGate is generally lower than on BigVul. This is because VulGate is intentionally harder and more realistic: it removes duplicates, maintains a balanced distribution of vulnerable and non-vulnerable samples, spans diverse projects, includes carefully verified labels, and incorporates a large number of hard negatives, which forces the model to learn true semantic features rather than superficial patterns. These characteristics make VulGate a stricter yet reliable benchmark for evaluating vulnerability detection models, while also improving generalizability, as shown in Table 6.
We observe a contrast for static analyzers: Infer's recall rises to 98% on VulGate, compared to 28% on BigVul, but its precision drops to 26%. This highlights the impact of VulGate's balanced and carefully cleaned samples, which increase the likelihood that static analyzers trigger on vulnerable samples. However, this comes at the cost of over-flagging many secure snippets. Thus, the high recall here should not be interpreted as improved analyzer performance but rather as sensitivity to dataset characteristics.
A visual example of a vulnerability instance correctly identified by UniXcoder-Base-Nine but missed by other models (including CodeBERT [1]) is shown in Fig. 2. The snippet is from the gpac project (CVE-2021-40568) and corresponds to CWE-120 (Buffer Copy/Access without Checking Size) in svc_parse_slice. This vulnerability may be exploited using a crafted MP4 file in GPAC 1.0.1 and may result in denial of service, arbitrary code execution, or privilege escalation. The bounds check only ensures that pps_id does not exceed 255, but does not account for the actual size of the avc->pps array. As a result, the subsequent indexing operation may access memory out-of-bounds and cause a buffer overflow.

Table 6: F1-score, Precision (P), and Recall (R) of CodeBERT and UniXcoder-Base-Nine on the VulGate+ datasets.

| Dataset | Model | F1 | P | R |
|---|---|---|---|---|
| Linux Data | CodeBERT [1] | 75.3% | 69.5% | 82.1% |
| | UniXcoder-Base-Nine | 76.4% | 93.7% | 64.1% |
| PrimeVul Data | CodeBERT [1] | 87% | 87% | 87% |
| | UniXcoder-Base-Nine | 89.14% | 87.1% | 91.3% |
| Claude Data | CodeBERT [1] | 17.25% | 53.1% | 10.3% |
| | UniXcoder-Base-Nine | 64.8% | 92% | 50% |
5.5 Experiment #4: Cross-Dataset Performance
We now evaluate the generalization capabilities of CodeBERT and UniXcoder-Base-Nine beyond in-distribution data.
Implementation details and experimental setup

We enhance our dataset with a test set specifically crafted for this task and strictly excluded from training, validation, and testing. Denoted VulGate+, this test set comprises:
1. a collection of recent Linux vulnerabilities containing 78 vulnerable and 1,805 non-vulnerable samples from 2024 and 2025;
2. a synthetic but realistic dataset of 150 vulnerable and 150 non-vulnerable samples generated using Claude [41] and manually verified by three security experts (approximately 25 hours of review); and
3. the PrimeVul dataset, which also builds on prior vulnerability datasets but employs a distinct cleaning methodology [33]. For this set, we ensure no overlap with our training, validation, or test sets, resulting in 7,422 vulnerable and 7,164 non-vulnerable unique samples.
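A minimal sketch of one way to enforce the no-overlap requirement (the whitespace normalization and hashing are assumptions; the actual pipeline may differ): normalize each function, hash it, and drop any candidate whose fingerprint already appears in the training, validation, or test sets.

```python
import hashlib
import re

def code_fingerprint(src: str) -> str:
    """Whitespace-normalized SHA-256 of a function, so trivially
    reformatted duplicates map to the same fingerprint."""
    normalized = re.sub(r"\s+", " ", src).strip()
    return hashlib.sha256(normalized.encode()).hexdigest()

# Stand-in for the VulGate train/val/test corpus (illustrative samples).
vulgate_samples = {"int f(int a){ return a+1; }"}
seen = {code_fingerprint(s) for s in vulgate_samples}

# Same function with different spacing -> same fingerprint -> excluded.
candidate = "int f(int a){   return a+1;   }"
other = "int g(int a){ return a-1; }"
dup = code_fingerprint(candidate) in seen
fresh = code_fingerprint(other) in seen
```

A stricter pipeline could additionally apply the near-duplicate cosine-similarity filter, since hashing only catches exact (post-normalization) matches.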
Results
Table 6 shows that across all test sets, UniXcoder-Base-Nine consistently outperforms CodeBERT [1]. On the Linux data, where CodeBERT achieved a 75.3% F1 score, UniXcoder-Base-Nine slightly outperforms it with 76.4%, though its recall remained limited due to the small dataset size. On the PrimeVul data, our UniXcoder-Base-Nine achieved an 89.1% F1 score compared to CodeBERT's 87%, with a higher recall of 91.3%, showing its ability to capture more true vulnerabilities. Most importantly, on the expert-verified Claude dataset, the toughest due to its synthetic yet realistic nature, CodeBERT collapsed to just 17.25% F1, whereas UniXcoder-Base-Nine sustained a 64.8% F1 score, a 3.7× relative improvement.
6 Further Discussion
Prior literature consistently shows that detection models suffer catastrophic performance degradation when tested on cross-dataset or unseen projects, rendering them unfit for real-world use.
For instance, Chakraborty et al. demonstrate that, on average, the performance of pre-trained models drops by about 73% when tested on real-world vulnerabilities [9]. For example, VulDeePecker's precision drops from 86.9% to just 11.1%. Moreover, even after retraining on real-world data, models perform on average 54% lower than initially reported. Similarly, Yizheng et al. [38] found that GNNs, encoders, decoders, and transformers achieve a 49% F1 score on seen projects, but this score drops drastically to only 9.4% when tested on unseen projects.
In this sense, our positive scores in Table 6 represent a significant breakthrough in generalization. Even our lowest F1 score (64.8% on the Claude data) significantly surpasses the best reported generalization scores in the literature.
The appropriate comparison to highlight our results is a recent benchmark study [33]: Yangruibo et al. test CodeBERT and UniXcoder on the BigVul dataset, achieving F1 scores of 62.88% and 65.46%, whereas on the (unseen) PrimeVul dataset these scores drop dramatically to 20.86% and 21.43%, respectively. The performance loss in both cases exceeds 40%.
In contrast, our pretrained CodeBERT and UniXcoder-Base-Nine models achieve F1 scores of 91% and 94.7% on BigVul (Table 3) and 87% and 89.1%, respectively, on PrimeVul (Table 6). Our performance drop is 4–6%, an order of magnitude less than the results achieved by Yangruibo et al. and, to the best of our knowledge, the highest generalization results reported in the literature to date.
These results validate our approach: transformer architectures pre-trained on code, particularly encoder models like UniXcoder-Base-Nine, significantly outperform prior models in the literature. Context window size has a significant impact on performance. Furthermore, training models on a clean, diverse, and balanced dataset is critical for generalization capability.
7 Conclusion and Future Work
In this study, we undertook a comprehensive vulnerability detection exercise using LLMs. We specifically addressed the critical challenge of poor generalization. Our experiments demonstrate that model architecture, context window size, training data quality, and hard negative mining strategies all significantly impact and improve generalization.
We conducted extensive experiments with encoder-only and decoder-only models. Among them, UniXcoder-Base-Nine consistently outperformed alternatives such as CodeBERT, CodeGPT-2, and CodeLlama across both benchmark and unseen datasets. Moreover, increasing the context window size enabled models to identify vulnerabilities distributed over larger functions. Our generalization results represent a dramatic improvement over prior results in the literature.
In this paper, we also introduce VulGate, our custom dataset, which addresses the shortcomings and limitations of prior datasets. We present a complete pipeline that removes mislabeled and duplicate samples, incorporates recent vulnerabilities up to May 2025, adds rich metadata, integrates hard negative samples, and includes a dedicated test set for generalization testing. To the best of our knowledge, VulGate is the most advanced dataset for vulnerability detection to date, and it is also useful for the related applications of code classification, localization, and patching.
In future work, we intend to extend our approach along three directions. First, evaluation across multiple programming languages will verify cross-language generalization. Second, we will explore instruction fine-tuning using the information in the VulGate metadata to further improve vulnerability detection, along with automated remediation. Finally, we intend to investigate hybrid training that incorporates contrastive learning to amplify the benefits of hard negatives and further improve generalization.
We hope our effort contributes to the development of effective and reliable automated vulnerability detection solutions.
Author Contribution
R.S. conceived the study, developed the VulGate dataset, conducted the experiments, and wrote the main manuscript. D.M. contributed to the methodological design and validation. U.A. assisted with data collection and preprocessing. S.T.A. supervised the research and provided critical revisions. W.H. co-supervised the project and contributed to manuscript review. All authors read and approved the final manuscript.
Data Availability
The dataset generated in this study, VulGate, is hosted on GitHub at: https://github.com/Rijha/Gen-Vulgate. This repository includes the curated data, scripts for processing, and usage instructions.
References
1. Fu, M., Tantithamthavorn, C.: LineVul: A transformer-based line-level vulnerability prediction. In: Proceedings of the 19th International Conference on Mining Software Repositories, pp. 608–620 (2022)
2. Fan, J., Li, Y., Wang, S., Nguyen, T.N.: A C/C++ code vulnerability dataset with code changes and CVE summaries. In: Proceedings of the 17th International Conference on Mining Software Repositories, pp. 508–512 (2020)
3. Li, Z., Zou, D., Xu, S., Ou, X., Jin, H., Wang, S., Deng, Z., Zhong, Y.: VulDeePecker: A deep learning-based system for vulnerability detection. arXiv preprint arXiv:1801.01681 (2018)
4. Russell, R., Kim, L., Hamilton, L., Lazovich, T., Harer, J., Ozdemir, O., Ellingwood, P., McConley, M.: Automated vulnerability detection in source code using deep representation learning. In: 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 757–762. IEEE (2018)
5. Pornprasit, C., Tantithamthavorn, C.K.: JITLine: A simpler, better, faster, finer-grained just-in-time defect prediction. In: 2021 IEEE/ACM 18th International Conference on Mining Software Repositories (MSR), pp. 369–379. IEEE (2021)
6. Wattanakriengkrai, S., Thongtanunam, P., Tantithamthavorn, C., Hata, H., Matsumoto, K.: Predicting defective lines using a model-agnostic technique. IEEE Transactions on Software Engineering, 48(5), 1480–1496. IEEE (2020)
7. Zhou, Y., Liu, S., Siow, J., Du, X., Liu, Y.: Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks. Advances in Neural Information Processing Systems, 32 (2019)
8. Li, Z., Zou, D., Xu, S., Jin, H., Zhu, Y., Chen, Z.: SySeVR: A framework for using deep learning to detect software vulnerabilities. IEEE Transactions on Dependable and Secure Computing, 19(4), 2244–2258. IEEE (2021)
9. Chakraborty, S., Krishna, R., Ding, Y., Ray, B.: Deep learning based vulnerability detection: Are we there yet? IEEE Transactions on Software Engineering, 48(9), 3280–3296. IEEE (2021)
10. Li, Y., Wang, S., Nguyen, T.N.: Vulnerability detection with fine-grained interpretations. In: Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp. 292–303 (2021)
11. Lu, S., Barthe, G., Bieber, D., et al.: CodeGPT-small-py-adaptedGPT2. Hugging Face [Online]. (2020). Available: https://huggingface.co/microsoft/CodeGPT-small-py-adaptedGPT2 (Accessed: 2025-06-10)
12. Meta AI: Code Llama: Open Foundation Models for Code. Hugging Face [Online]. (2023). Available: https://huggingface.co/codellama (Accessed: 2025-06-10)
13. Guo, D., Lu, S., Duan, N., Wang, Y., Zhou, M., Yin, J.: UniXcoder: Unified cross-modal pre-training for code representation. arXiv preprint arXiv:2203.03850 (2022)
14. Microsoft: UniXcoder-base-nine. Hugging Face [Online]. (2022). Available: https://huggingface.co/microsoft/unixcoder-base-nine (Accessed: 2025-06-10)
15. CVE Details: CVE Details - Vulnerability Database. [Online]. (2024). Available: https://www.cvedetails.com/ (Accessed: 2025-06-10)
16. U.S. House of Representatives Committee on Oversight and Government Reform: Report of Investigation: Equifax Inc. Data Breach. U.S. Government Publishing Office [Online]. (2018). Available: https://oversight.house.gov/wp-content/uploads/2018/12/Equifax-Report.pdf (Accessed May 2025)
17. CVE Details: 2024 Vulnerabilities. [Online]. (2025). Available: https://www.cvedetails.com (Accessed
18. Bessey, A., et al.: A few billion lines of code later: Using static analysis to find bugs in the real world. Commun. ACM, 53(2) (2010). ACM
19. Micro Focus: Fortify Static Code Analyzer. [Online]. (2020). Available: https://www.microfocus.com/documentation/fortify-static-code-analyzer/
20. LLVM Project: Clang Static Analyzer. [Online]. (2024). Available: https://clang-analyzer.llvm.org/ (Accessed May 2025)
21. Daniel Marjamäki: Cppcheck - A tool for static C/C++ code analysis. [Online]. (2024). Available: http://cppcheck.sourceforge.net/ (Accessed May 2025)
22. Cadar, C., Dunbar, D., Engler, D.: KLEE: Unassisted and automatic generation of high-coverage tests for complex systems programs. In: USENIX Symposium on Operating Systems Design and Implementation (OSDI) (2008)
23. Li, Z., Vijaykumar, T.N., Snavely, A., Falsafi, B.: Vulcan: Binary transformation in a distributed environment. In: International Symposium on Code Generation and Optimization (CGO), pp. 271–283. IEEE (2005). 10.1109/CGO.2005.39
24. Zalewski, M.: American Fuzzy Lop (AFL) - Security-oriented fuzzer. [Online]. (2014). Available: https://lcamtuf.coredump.cx/afl/ (Accessed May 2025)
25. Cheng, W., Zhang, J., Zhou, Z., Zhu, S., Zou, W., Gong, X.: TaintTrace: Efficient flow tracing with dynamic binary rewriting. In: IEEE Symposium on Computers and Communications (ISCC), pp. 807–812. IEEE (2009). 10.1109/ISCC.2009.5202251
26. Chen, M., et al.: Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374 (2021)
27. Chakraborty, S., et al.: Are Large Language Models Capable of Vulnerability Detection? In: IEEE S&P (2023)
28. Niu, L., Lin, Q., Han, K., Xu, T., Liu, Y., Liu, Z.: SAFE: Self-attentive function embeddings for binary similarity. In: Proceedings of the 31st USENIX Security Symposium (USENIX Security), pp. 4979–4996 (2022)
29. Zhou, S., Han, S., Duan, M., Wang, Y., Xue, J., Zhang, D.: Devil is in the Details: Evaluating Large Language Models for Vulnerability Detection and Localization. In: Proceedings of the 32nd USENIX Security Symposium (USENIX Security) (2023)
30. Lu, F., Tunstall, L., Rabe, M., Xu, J., et al.: Code Llama: Open Foundation Models for Code. arXiv preprint arXiv:2308.12950 (2023)
31. GitHub: GitHub Copilot: Your AI pair programmer. [Online]. (2023). Available: https://github.com/features/copilot (Accessed May 2025)
32. Cursor: Cursor: The AI-first Code Editor. [Online]. (2024). Available: https://www.cursor.sh (Accessed May 2025)
33. Ding, Y., Fu, Y., Ibrahim, O., Sitawarin, C., Chen, X., Alomair, B., Wagner, D., Ray, B., Chen, Y.: Vulnerability detection with code language models: How far are we? arXiv preprint arXiv:2403.18624 (2024)
34. Feng, Z., Guo, D., Tang, D., et al.: CodeBERT: A pre-trained model for programming and natural languages. Findings of EMNLP (2020)
35. Guo, D., Ren, S., et al.: GraphCodeBERT: Pre-training Code Representations with Data Flow. In: ICLR (2021)
36. Bhandari, G., Naseer, A., Moonen, L.: CVEfixes: automated collection of vulnerabilities and their fixes from open-source software. In: Proceedings of the 17th International Conference on Predictive Models and Data Analytics in Software Engineering, pp. 30–39 (2021)
37. Ni, C., Shen, L., Yang, X., Zhu, Y., Wang, S.: MegaVul: A C/C++ vulnerability dataset with comprehensive code representations. In: Proceedings of the 21st International Conference on Mining Software Repositories, pp. 738–742 (2024)
38. Chen, Y., Ding, Z., Alowain, L., Chen, X., Wagner, D.: DiverseVul: A new vulnerable source code dataset for deep learning based vulnerability detection. In: Proceedings of the 26th International Symposium on Research in Attacks, Intrusions and Defenses, pp. 654–668 (2023)
39. Nikitopoulos, G., Dritsa, K., Louridas, P., Mitropoulos, D.: CrossVul: a cross-language vulnerability dataset with commit data. In: Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp. 1565–1569 (2021)
40. Zheng, Y., Pujar, S., Lewis, B., Buratti, L., Epstein, E., Yang, B., Laredo, J., Morari, A., Su, Z.: D2A: A dataset built for AI-based vulnerability detection methods using differential analysis. In: 2021 IEEE/ACM 43rd International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP), pp. 111–120. IEEE (2021)
41. Anthropic: Claude Language Model. [Online]. (2024). Available: https://www.anthropic.com/claude (Accessed: July 2025)
42. Facebook AI Research: Infer: Static Analyzer for Java, C, C++, and Objective-C. [Online]. (2024). Available: https://fbinfer.com/ (Accessed: 2025-07-28)
43. Liu, X., Zheng, J., Yang, G., Wen, S., Liu, Q., Wang, X.: Improving the Context Length and Efficiency of Code Retrieval for Tracing Security Vulnerability Fixes. arXiv preprint arXiv:2503.22935 (2025)
44. Tymchuk, Y.: The False False Positives of Static Analysis. Seminar Series on Advanced Techniques and Tools for Software Evolution (SATToSE), pp. 07–09 (2017)
45. Cui, H., Xie, M., Su, T., Zhang, C., Tan, S.H.: An Empirical Study of False Negatives and Positives of Static Code Analyzers From the Perspective of Historical Issues. arXiv preprint arXiv:2408.13855 (2024)
46. Murali, A., Mathews, N., Alfadel, M., Nagappan, M., Xu, M.: FuzzSlice: Pruning false positives in static analysis warnings through function-level fuzzing. In: Proceedings of the 46th IEEE/ACM International Conference on Software Engineering, pp. 1–13 (2024)
47. Shields, P.: Hybrid testing: Combining static analysis and directed fuzzing. PhD thesis, Massachusetts Institute of Technology (2023)
48. Bessler, G., Cordova, J., Cullen-Baratloo, S., Dissem, S., Lu, E., Devin, S., Abughararh, I., Bang, L.: Metrinome: Path complexity predicts symbolic execution path explosion. In: 2021 IEEE/ACM 43rd International Conference on Software Engineering: Companion Proceedings (ICSE-Companion), pp. 29–32. IEEE (2021)
49. Ding, Y., Suneja, S., Zheng, Y., Laredo, J., Morari, A., Kaiser, G., Ray, B.: VELVET: a noVel Ensemble Learning approach to automatically locate VulnErable sTatements. In: 2022 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), pp. 959–970. IEEE (2022)
50. Risse, N., Böhme, M.: Uncovering the limits of machine learning for automatic vulnerability detection. In: 33rd USENIX Security Symposium (USENIX Security 24), pp. 4247–4264 (2024)
A
51.
Li, Y., Bui, N.T., Zhang, T., Weyssow, M., Yang, C., Zhou, X., Jiang, J., Chen, J., Huang, H., Nguyen, H.H.: others: Out of Distribution, Out of Luck: How Well Can LLMs Trained on Vulnerability Datasets Detect Top 25 CWE Weaknesses? (2025). arXiv preprint arXiv:2507.21817
52. Croft, R., Babar, M.A., Kholoosi, M.M.: Data quality for software vulnerability datasets. In: 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), pp. 121–133. IEEE (2023)
53. Yadav, A.S., Wilson, J.N.: R + R: Security Vulnerability Dataset Quality Is Critical. In: 2024 Annual Computer Security Applications Conference (ACSAC), pp. 1047–1061. IEEE (2024)
54. Gao, Z., Zhou, J., Zhang, B., He, Y., Zhang, C., Cui, Y., Wang, H.: Mono: Is Your Clean Vulnerability Dataset Really Solvable? Exposing and Trapping Undecidable Patches and Beyond. arXiv preprint arXiv:2506.03651 (2025)
55. Wang, Z., Li, G., Li, J., Xiong, Y., Li, J., Jin, Z.: M2CVD: Multi-model collaboration for code vulnerability detection. arXiv e-prints (2024)
56. Le, T.H.M., Babar, M.A.: Automatic data labeling for software vulnerability prediction models: How far are we? In: Proceedings of the 18th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, pp. 131–142 (2024)
57. Gao, Z., Wang, H., Zhou, Y., Zhu, W., Zhang, C.: How far have we gone in vulnerability detection using large language models. arXiv preprint arXiv:2311.12420 (2023)
58. Dong, H., Lin, J., Wang, Y., Leng, Y., Chen, J., Xie, Y.: Improving Code Search with Hard Negative Sampling Based on Fine-tuning. In: 2024 31st Asia-Pacific Software Engineering Conference (APSEC), pp. 221–230. IEEE (2024)
59. Robinson, J., Chuang, C.-Y., Sra, S., Jegelka, S.: Contrastive learning with hard negative samples. arXiv preprint arXiv:2010.04592 (2020)
60. Shi, W., Chen, J., Feng, F., Zhang, J., Wu, J., Gao, C., He, X.: On the theories behind hard negative sampling for recommendation. In: Proceedings of the ACM Web Conference 2023, pp. 812–822 (2023)
61. Kalantidis, Y., Sariyildiz, M.B., Pion, N., Weinzaepfel, P., Larlus, D.: Hard negative mixing for contrastive learning. Advances in Neural Information Processing Systems 33, 21798–21809 (2020)
62. Collobert, R., Kavukcuoglu, K., Farabet, C.: Torch7: A matlab-like environment for machine learning. In: BigLearn, NIPS Workshop, vol. 5, p. 10, Lake Tahoe, NV (2011)
63. Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface's transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019)
64. Rasley, J., Rajbhandari, S., Ruwase, O., He, Y.: DeepSpeed: System optimizations enable training deep learning models with over 100 billion parameters. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 3505–3506 (2020)
65. Shestov, A., Levichev, R., Mussabayev, R., Maslov, E., Zadorozhny, P., Cheshkov, A., Mussabayev, R., Toleu, A., Tolegen, G., Krassovitskiy, A.: Finetuning large language models for vulnerability detection. IEEE Access (2025)
66. Sheng, Z., Chen, Z., Gu, S., Huang, H., Gu, G., Huang, J.: Large language models in software security: A survey of vulnerability detection techniques and insights. arXiv preprint arXiv:2502.07049 (2025)
67. Steenhoek, B., Rahman, M.M., Jiles, R., Le, W.: An empirical study of deep learning models for vulnerability detection. In: 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), pp. 2237–2248. IEEE (2023)
68. Li, H., Zhou, X., Tuan, L.A., Miao, C.: Rethinking negative pairs in code search. arXiv preprint arXiv:2310.08069 (2023)
69. Baran, G.: 40,000+ CVEs Published in 2024, Marking a 38% Increase From 2023. Cyber Security News (2024). [Online]. Available: https://cybersecuritynews.com/40000-cves-published-in-2024
70. Akhavani, S.A., Kharraz, B.O.A.: Open Source, Open Threats? Investigating Security Challenges in Open-Source Software. arXiv preprint arXiv:2506.12995 (2025). [Online]. Available: https://arxiv.org/abs/2506.12995
71. Techzine Global: An average of 131 CVE reports per day (2025). [Online]. Available: https://www.techzine.eu/news/security/133037/an-average-of-131-cve-reports-per-day (Accessed 2 Aug. 2025)
72. Coker, J.: Software Vulnerabilities Take Almost Nine Months to Patch. Infosecurity Magazine (2025). [Online]. Available: https://www.infosecurity-magazine.com/news/software-vulnerabilities-nine/
73. McDade, M.: Discover key statistics on common software vulnerabilities, the market, and predicted trends. Expert Insights (2025). [Online]. Available: https://expertinsights.com/network-management/software-vulnerability-statistics-and-trends-2025
74. VulnCheck: Trends in Vulnerability Exploitation. VulnCheck Blog (2024). [Online]. Available: https://www.vulncheck.com/blog/2024-exploitation-trends (Accessed 2 Aug. 2025)
75. Shen, M., Pillai, A., Yuan, B.A., Davis, J.C., Machiry, A.: An empirical study on the use of static analysis tools in open source embedded software. arXiv preprint arXiv:2310.00205 (2023). [Online]. Available: https://arxiv.org/abs/2310.00205
76. Kaniewski, S., Schmidt, F., Enzweiler, M., Menth, M., Heer, T.: A Systematic Literature Review on Detecting Software Vulnerabilities with Large Language Models. arXiv preprint arXiv:2507.22659 (2025). [Online]. Available: https://arxiv.org/abs/2507.22659