UAVOD-Net: A Lightweight YOLOv11n-Bob Cat Optimization Framework for High-Precision Object Detection in Aerial Imagery

Dama Durga Bhavani¹ and Usha Rani Nelakuditi²

¹ Department of ECE, Vignan’s Foundation for Science, Technology and Research (Deemed to be University), Guntur, India

² Department of ECE, Vignan’s Foundation for Science, Technology and Research (Deemed to be University), Guntur, India

bhavanidama@gmail.com, usharani.nsai@gmail.com

Abstract

The rapid deployment of Unmanned Aerial Vehicles (UAVs) for tasks such as traffic monitoring, agricultural inspection, and public safety has significantly increased the volume of aerial visual data, yet efficient, real-time object detection remains a critical challenge, especially in complex urban environments. Existing studies show that object detection models often struggle with the occlusion, small object sizes, and varying lighting conditions prevalent in drone-captured imagery, leading to sub-optimal performance and higher false detection rates. To address these limitations, this work proposes an enhanced UAV Object Detection (UAVOD-Net) framework leveraging the VisDrone dataset. Robust image preprocessing techniques are applied to improve contrast, remove noise, and standardize input dimensions, enhancing feature clarity. For detection, the nano variant of You Only Look Once version 11 (YOLOv11n), known for its lightweight architecture and high-speed processing, is deployed to efficiently localize objects such as pedestrians, vehicles, and bicycles. Further performance improvement is achieved through the integration of the Bob Cat Optimization (BCO) algorithm, a nature-inspired metaheuristic approach used to optimize network weights and hyperparameters. BCO enhances convergence speed, reduces model loss, and improves detection accuracy under challenging conditions such as occlusion and variable scales. This combined methodology boosts object detection precision and reliability for UAV-based surveillance and monitoring applications. The proposed UAVOD-Net achieved an overall Precision of 0.594, Recall of 0.485, mAP@50 of 0.516, and mAP@50–95 of 0.326 across all object classes. These results demonstrate improved detection accuracy and robustness for UAV-based aerial imagery. Furthermore, the same UAVOD-Net was applied to the Detection in Adverse Weather Nature (DAWN) dataset, where it achieved superior object detection and classification performance under adverse weather conditions compared to existing approaches.

Keywords: VisDrone dataset, UAV object detection, aerial imagery, YOLOv11n, Bob Cat Optimization, loss reduction, real-time detection, image preprocessing, metaheuristic optimization, drone surveillance

1. Introduction

The widespread adoption of UAVs has multiplied the volume of airborne data: recent reports indicate that about 3 million commercial drones are in use worldwide, and the UAV market is expected to more than triple to 47 billion dollars by 2025 [1]. At the same time, large annotated benchmarks such as VisDrone2019, which provides 2.6 million labels across 261,908 video frames and 10,209 static images [2], point to the growing need for efficient and scalable object detection methods for aerial surveillance, smart farming, and city monitoring.

Conventional object detection approaches have relied mostly on handcrafted features, such as edge descriptors and the Histogram of Oriented Gradients (HOG) [3], or on background subtraction combined with sliding-window classifiers. These methods work well in controlled settings, but on UAV imagery they have consistently proved ineffective, yielding poor detection rates and inaccurate results because of complex backgrounds, small objects, varying altitudes, and open, crowded scenes.

To address this, Artificial Intelligence (AI) [4] driven object detection, especially with deep learning models, has become a strong contender. Deep Convolutional Neural Networks (CNNs) [5], such as Faster R-CNN, the Single Shot MultiBox Detector (SSD), and YOLO [6], have made substantial progress in combining detection precision, speed, and flexibility. These models nevertheless have limitations in UAV-based applications, particularly when detecting small, occluded, or densely packed objects in complex aerial scenes.

Various AI-driven organizations and research groups have turned to the VisDrone dataset to develop and evaluate efficient object detection designs. For example, Ultralytics has applied its YOLOv5 and YOLOv8 series to UAV detection tasks on real-time aerial imagery [7]. These models are heavily benchmarked on VisDrone, which has improved detection speed and accuracy in drone-related applications.

Similarly, companies such as Hesai Technology [9] and research projects such as Aerial Image Tiny Object Detection (AI-TOD) [8] have used VisDrone in their pipelines to improve small-object detection in aerial settings. Despite these advances, current methods still carry a significant computational burden, lose precision under occlusion, and become unstable across different settings, so further work is needed on lightweight, real-time, and accurate UAV detection frameworks [9].

The rest of the paper is organized as follows: Section 2 presents a comprehensive survey of existing UAV object detection methods, highlighting their strengths and limitations. Section 3 details our proposed methodology, while Section 4 reports experimental results and analysis, and Section 5 offers concluding remarks and future research directions.

2. Literature Survey

The related work highlights a variety of lightweight, attention-enhanced YOLO architectures tailored for UAV object detection, yet most methods either increased architectural complexity, incurred inference latency, or sacrificed large-object performance to improve small-object accuracy.

2.1 Related Work

Liu et al. [10] introduced the efficient feature aggregation network (EFANet) to detect small objects in UAV images, using multi-level context to improve detection precision in complicated aerial environments. It suffered a large computational overhead because of its several aggregation layers. Prakash et al. [11] developed a drone-based detection system built on YOLOv8 that improved inference speed and accuracy through model refinement via hardware optimization. The system achieved real-time performance on embedded platforms. Bai et al. [12] proposed SFFEF-YOLO, which combined fine-grained feature extraction and fusion modules to improve the detection of tiny aerial objects. The additional fusion modules increased model complexity and made real-time deployment harder.

He and Cao [13] proposed SOD-YOLO, a small-object detection network customized for UAV aerial images with a lightweight feature-boosting block. It showed reduced robustness in highly cluttered or heavily occluded environments. Ge et al. [14] introduced a real-time detector for aerial imagery that enhanced detection accuracy through high-frequency feature learning and context-aware fusion. It required additional parameters and a longer training period. Bae et al. [15] proposed YOLO-RACE, which incorporated reassembly and convolutional block attention to strengthen dense object detection in UAV images. It outperformed available YOLO variants on crowded aerial scenes.

Wang et al. [16] added a scale-aware hierarchy to the neck of a YOLO-based detector for UAV scenes, organizing feature learning from coarse to fine to improve the detection of objects at diverse scales. It further complicated the neck, lengthening inference time on edge platforms. Zhong et al. [17] created PS-YOLO, which uses partial convolution in the backbone, FasterBIFFPN in the neck, a GSCD head, and the NWDLoss loss function. The model cut parameters by more than 40% and achieved a 1.3% mAP increase on VisDrone2019. The single-shot detector proposed by Qi et al. [18] combined multi-scale features with context-enhanced spatial sparse convolution to strengthen the detection of small targets in UAV images. The spatial sparse convolution made the architecture more irregular and difficult to run on standard UAV hardware.

Bikku et al. [19] introduced MSRP-TODNets, a small-object detection network that strengthened region-based analysis with multi-scale feature aggregation. The region analysis method led to uneven processing latency, which lowered real-time consistency. Bi et al. [20] proposed DR-YOLO, a multi-scale small-object detection model based on YOLOv7 that uses improved anchor strategies and decoupled features for aerial photography. It relied heavily on the YOLOv7 design and was therefore resource-intensive for real-time UAV deployment. Yang et al. [22] introduced ADD‑YOLO, which replaced standard convolution with AKConv, applied C2f_DRAC with CBAM attention, and implemented a DABFPN neck including a dedicated small‑object detection layer, achieving up to 15.7% mAP@0.5 gains on aerial data. Jobaer et al. [23] developed a self‑supervised knowledge distillation framework that trained a deblurring subnet with dual attention modules to boost small‑object detection in blurry UAV images, achieving 4.3–4.6% mAP improvements on a VisDrone blur dataset. Li and Qu [24] proposed VMC‑Net, which combined multi‑scale feature aggregation (MFADM) with distribution modules, VHeat C2f for fine tuning, and context‑attention guided fusion (CAGFM) to enhance small-object detection; experiments on VisDrone showed notable precision gains.

Sun et al. [25] proposed SOD-YOLOv10, a lightweight variant of YOLOv10 for detecting small objects in remote sensing imagery, which adopted efficient feature refinement and anchor design optimization to enhance small-object recall and accuracy. Li et al. [26] designed ED-YOLO, a modification of YOLOv5-n with a new Efficient Edge Information Extraction (EStem) module and Multi-Path Coordinate Attention (MPCA) with Deformable Convolution V2 (DCNv2) to capture edge and small-object features. They additionally used a Multi-Scale Efficient Decoupled Head (MSE-head), adding a dedicated small-object detection head and removing the large-object detection head, which gave a significant mAP@50 improvement and parameter reduction on VisDrone2019. Xu et al. [27] proposed ESOD-YOLO, an efficient small-object detector that integrated lightweight backbone enhancements with efficient neck feature fusion and spatial attention to improve detection accuracy in UAV imagery. The extensive streamlining of the architecture restricted the model's capacity for general object features.

Wu et al. [28] proposed AAPW‑YOLO, which improved the YOLOv8 backbone by integrating adaptive kernel convolution (AKConv) and reconstructed feature fusion (ASFP2), alongside introducing Wise‑IoU loss for sharper regression. It added architectural complexity that could challenge deployment on lightweight embedded UAV systems. Su et al. [29] introduced a novel drone-oriented detector incorporating multi‑head mixed self‑attention and a dynamic regression mapping loss, which enhanced feature representation and refined bounding box regression during training. The self‑attention layers increased inference latency, reducing speed-critical suitability. Li et al. [30] developed RLRD‑YOLO, an improved YOLOv8 adapted for UAV perspectives by embedding Receptive Field Attention Convolution via RFCBAM Conv and large‑scale kernel attention (LSKA) in the SPPF layer. It delivered a 12.2% mAP@0.5 improvement on VisDrone2019.

Kumar et al. [31] proposed a data merging approach that integrated heterogeneous weather condition datasets with the YOLOv8 detection framework to improve object localization accuracy under fog, rain, and snow. However, the approach exhibited reduced generalization on preprocessed synthetic data. Aloufi et al. [32] proposed a multi-objective framework that integrated weather classification and object detection using a unified CNN backbone optimized for autonomous driving perception. Jing et al. [33] proposed a Dual-Stream Network (DSNet) that performed feature fusion and detail restoration through a hierarchical encoder–decoder architecture for accurate object detection in foggy scenes. Alzanin et al. [34] proposed an explainable artificial intelligence framework based on Temporal Convolutional Networks (TCNs) to identify adverse weather conditions affecting autonomous vehicle performance. It was, however, restricted by the accuracy of sequential time information and was therefore less useful in sparse framerate or low-light situations.

2.2 Research Gaps

Although most studies have adopted YOLO-based models for UAV object detection because of their real-time capability and high detection accuracy, recent research concentrates mainly on architectural adjustments or attention mechanisms, while weight and hyperparameter optimization through nature-inspired metaheuristics receives comparatively little attention. Techniques such as feature fusion, attention modules, and scale-aware designs have been incorporated to help detect small objects in complicated aerial tasks, but they tend to increase model complexity, inference time, or computational cost. Moreover, current models often degrade under occlusion, scale changes, and high-density object distributions, particularly on datasets such as VisDrone and DAWN. The outstanding research gap is therefore the development of lightweight, real-time detection frameworks that use bio-inspired optimizers to tune network weights, reduce loss, and improve robustness without compromising speed or computational efficiency.

2.3 Novel Contributions

The main contributions of this work are:

To present the first integrated framework combining VisDrone and DAWN data, image enhancement, YOLOv11n, and BCO for UAV object detection.

To implement YOLOv11n with optimized parameters via BCO, improving detection precision and reducing computational load.

To achieve significant loss reduction and accuracy enhancement for real-time detection in occluded and densely populated aerial scenes.

3. Proposed UAVOD-Net

The proposed approach combines the VisDrone dataset, image preprocessing, YOLOv11n lightweight object detection, and BCO to fine-tune weights and parameters, a combination that has not been reported in existing surveys. In addition, experiments were conducted on the DAWN dataset as a case study that employs the Adaptive Luminance and Gamma Enhancement (ALGE) preprocessing method. This hybrid solution addresses the most important limitations of existing UAV object detection methods, especially low detection accuracy in cluttered, occluded, and small-object scenes. The proposed framework reduces detection loss by deploying an optimized, fast detector model, applying a nature-inspired optimization algorithm, and enhancing image clarity, making it suitable for real-time aerial surveillance. The proposed UAVOD-Net system architecture is shown in Fig. 1, and its steps are described in detail below:

Step 1: VisDrone and DAWN Datasets: This work performs object detection and classification independently on each dataset; the same algorithm is trained separately on VisDrone and on DAWN.

Step 2: Image Preprocessing: Basic preprocessing of images and video frames consists of resizing and normalization, which standardize the input dimensions and help accentuate object boundaries. These steps improve the visibility of features, particularly small and partially hidden ones, and thereby improve model performance.

Step 3: ALGE Preprocessing: The ALGE method improves image visibility and clarity under unfavorable conditions such as poor lighting or bad weather. It first boosts luminance contrast in the LAB color space using CLAHE, emphasizing important features while avoiding overexposure. Gamma correction is then applied to brighten dark areas naturally, producing a balanced and detailed image suitable for UAV-based vision.

Step 4: YOLOv11n Lightweight Object Detection: The processed images are fed into the YOLOv11n object detector, which has an efficient architecture and a high inference rate. This variant is chosen because it balances detection precision and computation speed, making it applicable to resource-constrained UAV applications. The YOLOv11n model was trained separately on the VisDrone and DAWN datasets, and the model files were saved separately.

Step 5: BCO for Network Tuning: The BCO algorithm is incorporated to optimize the weights and hyperparameters of YOLOv11n and thereby increase model accuracy. BCO mimics the adaptive hunting strategies of bobcats to iteratively adjust the network, achieving better convergence, lower detection loss, and greater robustness under variable aerial conditions. BCO generates the optimal network parameters for the VisDrone and DAWN datasets independently, and the two parameter sets do not overlap.

Step 6: Outputs: After training, the proposed UAVOD-Net generates bounding boxes and class labels for the VisDrone dataset; for the DAWN dataset it generates the ALGE-preprocessed image together with bounding boxes and class labels.
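The gamma-correction stage of the ALGE step can be illustrated with a minimal Python sketch. It covers only the second stage: the CLAHE stage is omitted here (a library such as OpenCV provides it through `cv2.createCLAHE`), and the function names, the toy luminance patch, and the gamma value of 0.6 are assumptions made for illustration, not settings reported in this work.

```python
def gamma_lut(gamma):
    """Build a 256-entry lookup table for out = 255 * (in / 255) ** gamma."""
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    return [round(255 * (v / 255) ** gamma) for v in range(256)]


def apply_gamma(luminance, gamma=0.6):
    """Apply gamma correction to a 2D list of 8-bit luminance values.

    gamma < 1 brightens dark regions; gamma = 0.6 is an assumed value.
    """
    lut = gamma_lut(gamma)
    return [[lut[p] for p in row] for row in luminance]


# Toy 2x2 luminance patch (hypothetical values) standing in for the L channel.
dark_patch = [[10, 40], [90, 200]]
bright_patch = apply_gamma(dark_patch, gamma=0.6)
```

With a gamma below 1, dark pixels are lifted proportionally more than bright ones, which matches the stated goal of brightening dark areas without overexposing the rest of the image.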

Fig. 1

Proposed UAVOD-Net System Architecture.

3.1 YOLOv11n Lightweight Object Detection

The YOLOv11n architecture is a lightweight but highly efficient object detection framework suited to the problems of UAV-based aerial imaging, such as small objects, occlusion, and varying environmental conditions. Figure 2 presents the YOLOv11n block diagram. The architecture starts with a compact Backbone of sequential blocks that extracts multi-level features from the input drone image, resized to 640 × 640 × 3, through successive convolutional blocks and C3K2 blocks that reduce computational complexity with depthwise separable convolutions without compromising feature richness. The final Backbone output is passed through Spatial Pyramid Fast Fusion (SPFF) and Channel and Spatial Attention (C2PSA), which refine the multi-scale contextual representation. This is essential for VisDrone and DAWN, whose images contain densely packed objects such as pedestrians, cars, and bicycles in street and countryside scenes.

The neck performs feature fusion: outputs of various layers are upsampled and concatenated, retaining the spatial information needed to identify small and medium objects in drone imagery. The three detection heads, at resolutions of 80×80, 40×40, and 20×20 and targeting small, medium, and large objects respectively, provide robust performance across the wide range of object scales and distances seen in VisDrone and DAWN images. Each detection head predicts bounding box coordinates, objectness scores, and class probabilities, which are then refined by post-processing such as Non-Maximum Suppression (NMS). YOLOv11n thus offers high detection speed and accuracy under the variations in lighting, occlusion, and crowd density found in the VisDrone and DAWN datasets, making it well suited to real-time applications such as UAV surveillance.

Fig. 2

YOLOv11n Architecture.
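The NMS post-processing mentioned in this subsection can be sketched in plain Python. This is the common greedy, IoU-based form and not necessarily the exact variant used in this work; the box format `(x1, y1, x2, y2)`, the function names, and the IoU threshold of 0.5 are assumptions for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_thr=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes overlapping it."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep box i only if it does not overlap any already-kept box too much.
        if all(iou(boxes[i], boxes[j]) <= iou_thr for j in keep):
            keep.append(i)
    return keep


# Two heavily overlapping boxes and one distant box (hypothetical values).
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)  # indices of the surviving boxes: [0, 2]
```

The second box overlaps the first with an IoU of 81/119 (about 0.68), so it is suppressed, while the distant third box survives.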

3.1.1 Backbone for Feature Extraction

The Backbone of YOLOv11n progressively extracts hierarchical features from the input UAV image using convolutional layers and lightweight C3K2 modules. It reduces spatial resolution while increasing feature depth to capture both low-level edges and high-level semantics. This stage generates rich feature maps essential for accurate object detection in aerial VisDrone images. Figure 3 shows the Bottleneck block diagram, which is implemented using serial convolutions with skip connections. Figure 4 shows the C3K module, which is built from parallel Bottleneck modules whose outputs are concatenated. Finally, Fig. 5 shows the C3K2 module, which is built from C3K modules connected in series.

First Conv Block: The input aerial image is denoted as $I \in \mathbb{R}^{640 \times 640 \times 3}$, where 640×640 is the fixed spatial dimension and 3 indicates the RGB channels. The image is resized and normalized before being fed into the model. Eq. (1) applies a convolution with kernel weights $W_1$ over the input image $I$ of size 640×640×3 to extract low-level features; $b_1$ is the bias, and $\sigma$ is the activation function (Leaky ReLU). The output feature map $F_1$ has reduced spatial dimensions 320×320×64.

$F_1 = \sigma\left(W_1 * I + b_1\right), \quad F_1 \in \mathbb{R}^{320 \times 320 \times 64} \qquad (1)$

Second Conv Block

A second convolution layer extracts further features from $F_1$, with kernel weights $W_2$ and bias $b_2$. The activation $\sigma$ introduces non-linearity. The output $F_2$ has dimensions 160×160×128, capturing higher-level patterns.

$F_2 = \sigma\left(W_2 * F_1 + b_2\right), \quad F_2 \in \mathbb{R}^{160 \times 160 \times 128}$

Fig. 3

Bottleneck Block Diagram.

Fig. 4

C3K Module Using Bottleneck.

Fig. 5

C3K2 Module Using C3K.

C3K2 Block (Convolutions n = 3, shortcut = False)

The C3K2 module applies depthwise separable convolutions and residual connections (if enabled) to reduce computation while enhancing feature learning. Here, it uses $n = 3$ layers without shortcuts, operating on $F_2$. The output $F_3$ remains 160×160×128.

$F_3 = \text{C3K2}\left(F_2\right), \quad F_3 \in \mathbb{R}^{160 \times 160 \times 128}$

Next Conv Block

This standard convolution transforms the features $F_3$ into a more compact form. With kernel weights $W_4$, bias $b_4$, and activation $\sigma$, it outputs $F_4$ of size 80×80×256. It increases depth while reducing spatial resolution.

$F_4 = \sigma\left(W_4 * F_3 + b_4\right), \quad F_4 \in \mathbb{R}^{80 \times 80 \times 256}$

C3K2 Block (n = 6, shortcut = False)

A deeper C3K2 block with $n = 6$ layers processes $F_4$, improving the representation of mid-level features. This version excludes shortcuts for efficient learning. The output $F_5$ maintains 80×80×256 dimensions.

$F_5 = \text{C3K2}\left(F_4\right), \quad F_5 \in \mathbb{R}^{80 \times 80 \times 256}$

Next Conv Block

A further convolution transforms $F_5$ into higher-level semantic features. The learnable weights $W_6$ and bias $b_6$ adjust the feature representation. The output $F_6$ has 40×40×512 dimensions, suitable for complex patterns.

$F_6 = \sigma\left(W_6 * F_5 + b_6\right), \quad F_6 \in \mathbb{R}^{40 \times 40 \times 512}$

C3K2 Block (n = 6, shortcut = True)

This C3K2 block with $n = 6$ layers includes shortcut connections to preserve gradient flow. It processes $F_6$ to extract deep features with improved stability. The output $F_7$ remains 40×40×512.

$F_7 = \text{C3K2}\left(F_6\right), \quad F_7 \in \mathbb{R}^{40 \times 40 \times 512}$

Next Conv Block

The last convolutional layer in the backbone applies filters $W_8$ and bias $b_8$ to $F_7$. It compresses the spatial size while increasing depth, giving 20×20×1024. The output $F_8$ holds rich semantic information.

$F_8 = \sigma\left(W_8 * F_7 + b_8\right), \quad F_8 \in \mathbb{R}^{20 \times 20 \times 1024}$

Final C3K2 Block (n = 3, shortcut = True)

This C3K2 block with $n = 3$ layers includes shortcut connections to preserve gradient flow. It processes $F_8$ to extract deep features with improved stability. The output $F_9$ remains 20×20×1024.

$F_9 = \text{C3K2}\left(F_8\right), \quad F_9 \in \mathbb{R}^{20 \times 20 \times 1024}$

This completes the backbone, which progressively shrinks the spatial dimensions while deepening the features.
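The shape progression described in this subsection can be checked with a short Python sketch. It assumes that each downsampling convolution uses a 3×3 kernel with stride 2 and padding 1, a common choice in YOLO-style backbones that is not stated in the text; the stage names and channel counts are taken from the description above, and the helper names are hypothetical.

```python
def conv_out(size, kernel=3, stride=2, padding=1):
    """Output spatial size of a convolution (standard floor formula)."""
    return (size + 2 * padding - kernel) // stride + 1


# (stage, halves the resolution?, output channels) as described in the text.
STAGES = [
    ("F1 conv", True, 64),
    ("F2 conv", True, 128),
    ("F3 C3K2", False, 128),
    ("F4 conv", True, 256),
    ("F5 C3K2", False, 256),
    ("F6 conv", True, 512),
    ("F7 C3K2", False, 512),
    ("F8 conv", True, 1024),
    ("F9 C3K2", False, 1024),
]


def trace_backbone(size=640):
    """Return (stage, height, width, channels) for every backbone stage."""
    shapes = []
    for name, downsample, channels in STAGES:
        if downsample:
            size = conv_out(size)  # assumed k=3, s=2, p=1 convolution
        shapes.append((name, size, size, channels))
    return shapes


for stage in trace_backbone():
    print(stage)
```

Under these assumptions the spatial size follows 640 → 320 → 160 → 80 → 40 → 20, with the C3K2 blocks leaving both the resolution and the channel count unchanged.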

3.1.2 SPFF and C2PSA Modules

First, SPFF enriches the multi-scale context of $F_9$, and then C2PSA highlights significant areas. The overall output $F_{C2PSA}$ retains the 20×20×1024 dimensions.

SPFF

It uses multi-scale convolutions or pooling to enrich the feature map with both local and global context. This enables the network to identify objects of different sizes with minimal additional computational cost, which is essential for handling the varied object scales in the aerial images of the VisDrone and DAWN datasets. The SPFF module, depicted in Fig. 6, is a feature aggregation block that improves multi-scale feature representation. First, a 1×1 convolution reduces the feature dimensions, and hierarchical spatial features are then extracted by several MaxPool2d layers at different levels. The output of every pooling layer, together with the initial processed feature map, is concatenated to preserve fine and coarse spatial details. Lastly, a 1×1 convolution merges the concatenated features, producing an enriched multi-scale output that improves detection performance, especially for objects of different sizes.

Fig. 6

SPFF Module.

$F_{SPFF} = \text{SPFF}\left(F_9\right), \quad F_{SPFF} \in \mathbb{R}^{20 \times 20 \times 1024}$
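The pooling hierarchy inside the SPFF module can be sketched in plain Python on a single-channel map. The sketch assumes three sequential 5×5 max-pooling layers with stride 1 and same-size output, as in the SPPF block commonly used in YOLO-family models, which the SPFF module appears to resemble; the two 1×1 convolutions are omitted, and the function names are hypothetical.

```python
def max_pool_same(fmap, k=5):
    """k x k max pooling with stride 1 that keeps the input size.

    Windows are clipped at the borders, which is equivalent to padding
    with negative infinity.
    """
    h, w = len(fmap), len(fmap[0])
    r = k // 2
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [fmap[j][i]
                      for j in range(max(0, y - r), min(h, y + r + 1))
                      for i in range(max(0, x - r), min(w, x + r + 1))]
            row.append(max(window))
        out.append(row)
    return out


def sppf_branches(fmap, k=5, n_pools=3):
    """Return the input map plus n_pools successively pooled maps.

    In the full module these branches would be concatenated along the
    channel axis and fused by a 1x1 convolution (omitted here).
    """
    branches = [fmap]
    for _ in range(n_pools):
        branches.append(max_pool_same(branches[-1], k))
    return branches


# A 13x13 map with a single active cell shows the growing receptive field.
size = 13
fmap = [[0] * size for _ in range(size)]
fmap[6][6] = 1
active = [sum(map(sum, b)) for b in sppf_branches(fmap)]  # [1, 25, 81, 169]
```

Each additional pooling layer widens the effective window (5×5, then 9×9, then 13×13), so the concatenated branches carry fine and coarse context at the same spatial resolution.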

C2PSA

It applies attention along both the channel and spatial dimensions, emphasizing informative features and suppressing irrelevant background. This focuses the network on significant areas, which increases accuracy in detecting important objects in difficult scenes. It is particularly useful for the VisDrone and DAWN datasets, where occlusion and cluttered backgrounds are widespread. Figure 7 shows the Parallel Split Attention (PSA) module, which supports feature extraction by applying attention mechanisms to parallel feature branches. First, a 1×1 convolution reduces the feature dimensionality, and a split operation then separates the feature map into several branches. Each branch is fed through a PSA block, which applies attention separately to capture different contextual information. The attended branches are then concatenated, followed by another 1×1 convolution that fuses the features. This structure helps the model concentrate on important spatial and channel-wise information, which improves detection.

$F_{C2PSA} = \text{C2PSA}\left(F_{SPFF}\right), \quad F_{C2PSA} \in \mathbb{R}^{20 \times 20 \times 1024}$

Fig. 7

PSA Module.

Figure 8 shows the C2PSA module, which is intended to make feature learning more efficient by integrating attention and cross-scale interaction. The input feature map is split into two paths: one passes through an Attention block to extract significant spatial and channel information, and the other bypasses the Attention block. The two paths are added together and passed to a Feed Forward Network (FFN), after which feature mixing takes place through a 1×1 convolution. A 1×1 convolution without activation is then applied to preserve linear feature transformations. Lastly, the input and processed features are combined to generate the output, which benefits multi-scale feature fusion and attention-focused learning.

Fig. 8

C2PSA Module.

3.1.3 Neck for Feature Aggregation

The Neck of YOLOv11n upsamples high-level features and combines them with the downsampled backbone feature maps. This fusion retains the fine spatial detail and rich semantic content required to identify small and medium objects. The process enhances the network's capacity to cope with varied object sizes and the complicated aerial scenes of the VisDrone and DAWN datasets.

C3K2 Block

The backbone feature map $F_7$ is refined by a C3K2 module with fast depthwise convolutions. This enriches semantic features while maintaining spatial resolution. The output $F_{N1}$ retains size 40×40×512.

$F_{N1} = \text{C3K2}\left(F_7\right), \quad F_{N1} \in \mathbb{R}^{40 \times 40 \times 512}$

Upsample and Concatenate

The high-level feature map $F_{C2PSA}$ is upsampled, doubling its spatial dimensions so that it can be merged with the mid-level features. Upsampling uses interpolation to limit the loss of information. The output $F_{U1}$ has a spatial size of 40×40.

$F_{U1} = \text{Upsample}\left(F_{C2PSA}\right)$
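The 2× upsampling step can be illustrated with a minimal Python sketch on a single-channel map. Nearest-neighbour interpolation is assumed here because it is the usual choice in YOLO-style necks; the text only states that interpolation is used, and the function name is hypothetical.

```python
def upsample_nearest_2x(fmap):
    """Double the height and width of a 2D map by repeating each value."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # repeat along the width
        out.append(wide)
        out.append(list(wide))                     # repeat along the height
    return out


small = [[1, 2],
         [3, 4]]
large = upsample_nearest_2x(small)
# large == [[1, 1, 2, 2], [1, 1, 2, 2], [3, 3, 4, 4], [3, 3, 4, 4]]
```

Applied per channel, the same operation turns a 20×20 map into a 40×40 map without changing the number of channels.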

The upsampled feature $F_{U1}$ and $F_{N1}$ are concatenated along the channel dimension. This combines deep semantic and mid-level features to provide stronger detection cues. The channel count is doubled to 1024 in the output $F_{C1}$.

$F_{C1} = \text{Concat}\left(F_{U1}, F_{N1}\right), \quad F_{C1} \in \mathbb{R}^{40 \times 40 \times 1024}$

C3K2 Block

The concatenated features $F_{C1}$ pass through another C3K2 block. This reduces the channel depth back to 512 and refines the information. The output $F_{N2}$ is rich in multi-scale features.

$F_{N2} = \text{C3K2}\left(F_{C1}\right), \quad F_{N2} \in \mathbb{R}^{40 \times 40 \times 512}$

Upsample and Concatenate

The mid-level features $F_{N2}$ are upsampled to 80×80 resolution. This prepares them to be fused with the higher-resolution, lower-level feature maps. Upsampling improves spatial resolution.

$F_{U2} = \text{Upsample}\left(F_{N2}\right)$

The upsampled features $F_{U2}$ are concatenated with $F_5$ to retain both fine-grained and coarse-grained details. This multi-scale fusion is essential for detecting small objects. The output $F_{C2}$ has 512 channels.

$F_{C2} = \text{Concat}\left(F_{U2}, F_5\right), \quad F_{C2} \in \mathbb{R}^{80 \times 80 \times 512}$

C3K2 Block

The fused $F_{C2}$ passes through a C3K2 block to be compressed and refined. This step reduces the channels to 256, which suits small-object detection. The output $F_{N3}$ preserves the spatial resolution.

$F_{N3} = \text{C3K2}\left(F_{C2}\right), \quad F_{N3} \in \mathbb{R}^{80 \times 80 \times 256}$

3.1.4 Head for Multi-Scale Detection

The Head of YOLOv11n performs multi-scale detection, applying dedicated detection layers to feature maps at three resolutions: 80×80, 40×40, and 20×20. This design allows small, medium, and large objects to be identified in the same image. It works particularly well with the DAWN and VisDrone aerial data, where object sizes vary widely between scenes.

Small Object Detection (80×80 grid)

The small-object detection head is fed with the high-resolution feature map

$\:{F}_{N3}$

. This head predicts bounding boxes, class probabilities, and objectness for small-scale targets, which makes it appropriate for identifying pedestrians, bicycles, and similar objects.

$\:{D}_{s}=\text{D}\text{e}\text{t}\text{e}\text{c}\text{t}\left({F}_{N3}\right)$

Medium Object Detection (40×40 grid)

The mid-resolution feature

$\:{F}_{N2}$

assists in detecting medium-sized objects. This head balances spatial detail against semantic depth and is suited to cars, motorcycles, and other medium-sized objects.

$\:{D}_{m}=\text{D}\text{e}\text{t}\text{e}\text{c}\text{t}\left({F}_{N2}\right)$

Large Object Detection (20×20 grid)

At the lowest resolution, the deepest feature map

$\:{F}_{C2PSA}$

is used to detect large objects. It is well suited to recognising large vehicles, infrastructure, and other major targets, since its deep semantics support reliable detection.

$\:{D}_{l}=\text{D}\text{e}\text{t}\text{e}\text{c}\text{t}\left({F}_{C2PSA}\right)$
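The three grid sizes follow directly from the input resolution and the downsampling stride of each detection scale. The sketch below assumes a 640×640 input (the size listed in the complexity comparison later in the paper) and strides of 8, 16, and 32, which are the usual values in YOLO-family detectors but are not stated in this section; `grid_sizes` is a hypothetical helper.

```python
def grid_sizes(input_size=640, strides=(8, 16, 32)):
    """Map each stride to the side length of its detection grid."""
    return {stride: input_size // stride for stride in strides}


sizes = grid_sizes()
for stride, side in sizes.items():
    # Each cell of the grid covers a stride x stride patch of the input.
    print(f"stride {stride:2d}: {side}x{side} grid")
```

Under these assumptions a cell of the 80×80 grid covers an 8×8 pixel patch, which is why the high-resolution head is the one assigned to small targets.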

3.1.5 Weight Optimization

Each predicted bounding box has a final confidence score, denoted S_conf, which combines the objectness, i.e., the probability that an object is present,

$\:{P}_{o}$

and class probability

$\:{P}_{c}$

. Thresholding this score filters out low-confidence detections.

$\:{S}_{conf}={P}_{o}\times\:{P}_{c}$
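As a minimal illustration of this score, the sketch below multiplies the two probabilities and keeps only detections above a threshold. The threshold of 0.25, the dictionary layout, and the function names are assumptions made for the example, not values taken from the paper.

```python
def final_confidence(p_obj, p_cls):
    """Final confidence as the product of objectness and class probability."""
    return p_obj * p_cls


def filter_detections(detections, threshold=0.25):
    """Keep detections whose final confidence reaches the threshold."""
    kept = []
    for det in detections:
        score = final_confidence(det["p_obj"], det["p_cls"])
        if score >= threshold:
            kept.append({**det, "score": score})
    return kept


# Hypothetical detections for illustration only.
detections = [
    {"label": "car", "p_obj": 0.9, "p_cls": 0.8},      # score 0.72
    {"label": "bicycle", "p_obj": 0.5, "p_cls": 0.3},  # score 0.15
]
kept = filter_detections(detections)
print([d["label"] for d in kept])  # ['car']
```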

3.2 BCO for YOLOv11n Weight Optimization

BCO is a metaheuristic algorithm inspired by the stealthy hunting behavior of the bobcat. In YOLOv11n, BCO optimizes network weights and hyperparameters, improving convergence, reducing loss, and enhancing object detection performance, especially for VisDrone aerial data characterized by small, occluded, or densely packed objects. Figure 9 shows the proposed BCO flowchart. First, a population of candidate solutions (network weights) is initialized randomly within defined bounds:

$\:{X}_{i}^{0}={X}_{min}+r\times\:\left({X}_{max}-{X}_{min}\right)$

Where

$\:{X}_{i}^{0}$

is the initial position (weights) of the i-th bobcat,

$\:r\in\:\left[\text{0,1}\right]$

is a random vector, and

$\:{X}_{min}$

and

$\:{X}_{max}$

are the lower and upper bounds for the weights. The fitness of each solution is evaluated using the YOLOv11n loss function:

$\:L={\lambda\:}_{loc}{L}_{loc}+{\lambda\:}_{obj}{L}_{obj}+{\lambda\:}_{cls}{L}_{cls}$

Here,

$\:{L}_{loc}$

$\:{L}_{obj}$

and

$\:{L}_{cls}$

are localization, objectness, and classification losses, respectively, weighted by hyperparameters

$\:\lambda\:$

. The best-performing bobcat (weight set) is identified based on minimum loss:

$\:{X}_{best}=\underset{{X}_{i}}{\text{arg}\,\text{min}}\:L\left({X}_{i}\right)\:$

Where

$\:{X}_{best}$

holds the weights leading to the lowest detection loss. Bobcats update positions by stealthily approaching the best solution:

$\:{X}_{i}^{t+1}={X}_{i}^{t}+s\times\:\left({X}_{best}-{X}_{i}^{t}\right)+\delta\:$

Here,

$\:s\in\:\left[\text{0,1}\right]$

controls step size, and

$\:\delta\:$

is a small random disturbance simulating adaptive movement. To avoid local minima, some agents explore new regions:

$\:{X}_{i}^{t+1}={X}_{i}^{t}+\alpha\:\times\:\left({X}_{j}^{t}-{X}_{k}^{t}\right)$

Here,

$\:{X}_{j}$

and

$\:{X}_{k}$

are randomly selected bobcats, and α is a scaling factor promoting diversity. The step size

$\:s$

is reduced over iterations to balance exploration and exploitation:

$\:s={s}_{0}\times\:{e}^{-\beta\:t}$

Where

$\:{s}_{0}$

is the initial step size,

$\:\beta\:$

is the decay rate, and

$\:t$

is the iteration index. To ensure valid weights, positions are clipped within allowable bounds:

$\:{X}_{i}^{t+1}=\text{min}\left(\text{max}\left({X}_{i}^{t+1},{X}_{min}\right),\:{X}_{max}\right)\:$

This clipping keeps the weights within the specified limits, which stabilizes the search. The optimization process stops when the change in the best loss falls below a threshold ϵ:

$\:\mid\:{L}_{best}^{t+1}-{L}_{best}^{t}\mid\:\:<ϵ$

This allows early termination when no material improvement occurs. The final optimized weights after T iterations are:

$\:{W}_{optimal}={X}_{best}^{T}$

These weights reduce the detection loss and enhance performance on the VisDrone aerial data and the DAWN data. The confidence is then computed using the optimized weights:

$\:{S}_{conf}^{optimal}={W}_{optimal}*\:{S}_{conf}$

This yields accurate and robust detection in complex UAV settings. In summary, BCO tunes YOLOv11n through the adaptive hunting behavior of bobcats, which improves object detection, reduces loss, and provides strong performance across a variety of VisDrone conditions, such as occlusion, small objects, and overcrowding.

Fig. 9

Proposed BCO Flowchart.
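The update rules of Section 3.2 can be put together in a short, self-contained sketch. It is not the authors' implementation: a toy quadratic stands in for the YOLOv11n loss, and the population size, exploration probability `explore_prob`, noise range, and all default values are assumptions chosen for illustration. The function name `bco_minimize` is hypothetical.

```python
import math
import random


def bco_minimize(loss, dim, x_min, x_max, n_agents=20, max_iter=200,
                 s0=0.9, beta=0.02, alpha=0.5, explore_prob=0.2,
                 noise=0.01, eps=1e-12, seed=0):
    """Toy BCO loop: returns (best position, best loss, best-loss history)."""
    rng = random.Random(seed)

    def clip(v):
        # Boundary handling: min(max(v, X_min), X_max).
        return min(max(v, x_min), x_max)

    # Random initialization within the bounds.
    pop = [[x_min + rng.random() * (x_max - x_min) for _ in range(dim)]
           for _ in range(n_agents)]
    best = list(min(pop, key=loss))
    best_loss = loss(best)
    history = [best_loss]

    for t in range(max_iter):
        s = s0 * math.exp(-beta * t)  # decaying step size
        for i in range(n_agents):
            x = pop[i]
            if rng.random() < explore_prob:
                # Exploration: move along the difference of two random agents.
                j, k = rng.sample(range(n_agents), 2)
                new = [x[d] + alpha * (pop[j][d] - pop[k][d])
                       for d in range(dim)]
            else:
                # Exploitation: approach the best agent with a small disturbance.
                new = [x[d] + s * (best[d] - x[d]) + rng.uniform(-noise, noise)
                       for d in range(dim)]
            pop[i] = [clip(v) for v in new]

        candidate = min(pop, key=loss)
        candidate_loss = loss(candidate)
        previous = best_loss
        if candidate_loss < best_loss:
            best, best_loss = list(candidate), candidate_loss
        history.append(best_loss)
        # Stopping rule: change in best loss below eps.
        if abs(previous - best_loss) < eps:
            break

    return best, best_loss, history


def sphere(x):
    """Stand-in objective; the paper uses the detection loss instead."""
    return sum(v * v for v in x)


best, best_loss, history = bco_minimize(sphere, dim=3, x_min=-5.0, x_max=5.0)
print(len(history) - 1, "iterations, best loss", best_loss)
```

Note that the stopping rule, taken literally, also fires on the first iteration in which no agent improves on the best solution; a practical variant would likely require several stalled iterations before terminating.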

3.3 ALGE Preprocessing

Figure 10 shows the flowchart of the ALGE method, which aims to enhance the clarity, sharpness, and brightness of drone images acquired in low light or degraded by haze or weather conditions. The method starts by converting the image to the LAB color space, where the L channel carries the luminance. Contrast Limited Adaptive Histogram Equalization (CLAHE) is then applied to the luminance component; it equalizes brightness locally to increase contrast while limiting noise amplification. After that, the pixel intensity is adjusted by gamma correction, a nonlinear power-law transformation that brightens darker areas while keeping the natural appearance of bright areas. Combining local contrast enhancement with global gamma adjustment preserves local details, edges, and object boundaries, and produces clearer and more informative images for UAV-based object detection.

Fig. 10

Proposed ALGE Flowchart.

RGB to LAB Color Conversion

Converting from RGB to LAB separates the luminance (L) from the chrominance (a, b), so that contrast can be adjusted on the luminance alone. Here, R, G, and B are the color components, and X, Y, and Z are the coordinates of the CIE XYZ color space obtained through a linear transformation. This intermediate step supports perceptual uniformity during luminance enhancement.

$\:\left[\begin{array}{c}X\\\:Y\\\:Z\end{array}\right]=\left[\begin{array}{ccc}0.4124&\:0.3576&\:0.1805\\\:0.2126&\:0.7152&\:0.0722\\\:0.0193&\:0.1192&\:0.9505\end{array}\right]\left[\begin{array}{c}R\\\:G\\\:B\end{array}\right]$
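The linear transformation above can be checked numerically with a few lines of plain Python. The sketch assumes RGB components that are already linear and scaled to [0, 1]; the function name `rgb_to_xyz` is hypothetical.

```python
# RGB-to-XYZ matrix as given in the equation above.
M = [
    [0.4124, 0.3576, 0.1805],
    [0.2126, 0.7152, 0.0722],
    [0.0193, 0.1192, 0.9505],
]


def rgb_to_xyz(rgb):
    """Multiply the 3x3 matrix by an (R, G, B) triple in [0, 1]."""
    return [sum(m * c for m, c in zip(row, rgb)) for row in M]


white = rgb_to_xyz([1.0, 1.0, 1.0])
print(white)  # Y is 1.0 for white because the middle row sums to 1
```

The middle row holds the luminance weights, so green contributes most to Y and blue least.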

Luminance Normalization in LAB Space

The luminance

$\:{L}^{*}$

is scaled so that equal numerical steps correspond to approximately equal perceived changes in brightness. In this case,

$\:Y/{Y}_{n}$

is the measured luminance divided by the reference white luminance, and

$\:f(\cdot\:)$

is a nonlinear function to simulate human visual perception.

$\:{L}^{*}=116\cdot\:f\left(\frac{Y}{{Y}_{n}}\right)-16$
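The text leaves the nonlinear function unspecified. The sketch below uses the standard CIE definition of f (a cube root with a linear segment near zero), which is an assumption about what the authors intend; `lightness` is a hypothetical helper name.

```python
def f(t):
    """Standard CIE nonlinearity: cube root with a linear toe near zero."""
    delta = 6.0 / 29.0
    if t > delta ** 3:
        return t ** (1.0 / 3.0)
    return t / (3.0 * delta ** 2) + 4.0 / 29.0


def lightness(y, y_n=1.0):
    """L* = 116 * f(Y / Y_n) - 16."""
    return 116.0 * f(y / y_n) - 16.0


print(lightness(1.0))   # reference white maps to 100
print(lightness(0.18))  # mid-grey (18% luminance) lands near L* = 50
```

The second value illustrates the perceptual scaling: a surface reflecting only 18% of the reference luminance is perceived as roughly half as light as white.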

CLAHE

CLAHE divides the luminance plane into contextual tiles, equalizes the histogram of each tile, and clips the histogram to limit contrast amplification.

$\:H\left(i\right)$

is the histogram of the luminance values,

$\:{H}_{clip}$

is the clip limit, and

$\:{H}_{eq}\left(i\right)$

is the equalized output obtained by accumulating the clipped histogram.

$\:{H}_{eq}\left(i\right)=\frac{1}{N}\sum\:_{j=0}^{i}\text{min}\left[H\left(j\right),{H}_{clip}\right]$

Enhanced Luminance Reconstruction

Once the data have been equalized, the improved luminance

$\:\:{L}_{enh}\:$

is obtained by mapping the cumulative distribution function CDF(i) back to pixel intensity space, where i represents the pixel intensity level and

$\:{L}_{max}\:$

is the maximum luminance intensity (usually 255).

$\:{L}_{enh}\left(i\right)={L}_{max}\times\:CDF\left(i\right)$
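The clipping and remapping steps can be sketched for a single tile as follows. This is a simplification rather than full CLAHE: it omits the redistribution of clipped counts and the interpolation between neighbouring tiles, and it takes N to be the sum of the clipped histogram so that the mapping spans the full range, which is an assumption because the text does not define N. The function name `clipped_equalization_map` is hypothetical.

```python
def clipped_equalization_map(tile, h_clip, levels=256, l_max=255):
    """Build a lookup table L_enh(i) from a clipped histogram of one tile."""
    # Histogram H(i) of the luminance values in the tile.
    hist = [0] * levels
    for value in tile:
        hist[value] += 1

    # Clip each bin at H_clip to limit contrast amplification.
    clipped = [min(h, h_clip) for h in hist]
    total = sum(clipped)  # assumed normalizer N

    # Cumulative sum gives CDF(i); scale by L_max to get L_enh(i).
    lut, running = [], 0
    for count in clipped:
        running += count
        lut.append(round(l_max * running / total))
    return lut


# Hypothetical tile: 90 dark pixels and 10 bright pixels.
tile = [10] * 90 + [200] * 10
lut = clipped_equalization_map(tile, h_clip=20)
print(lut[10], lut[200])  # 170 255
```

Without clipping, the 90 dark pixels would dominate the cumulative histogram and map to about 230; with the clip limit of 20 they map to 170, which shows how the limit restrains amplification in near-uniform regions.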

Gamma Correction for Intensity Adjustment

Gamma correction is a power-law transformation, where

$\:{I}_{out}$

is the corrected pixel intensity,

$\:{I}_{in}$

is the input intensity, and gamma

$\:\gamma\:$

is the exponent of the power law that determines the level of brightness. For γ < 1 the image becomes brighter, for γ > 1 it becomes darker, and γ = 1 leaves it unchanged.

$\:{I}_{out}=255\times\:{\left(\frac{{I}_{in}}{255}\right)}^{\gamma\:\:}$
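A quick numerical check makes the effect of the exponent concrete. The gamma values below are arbitrary illustrations, not the settings used in the paper, and `gamma_correct` is a hypothetical helper.

```python
def gamma_correct(i_in, gamma):
    """Power-law transform on an 8-bit intensity: 255 * (I_in / 255) ** gamma."""
    return 255.0 * (i_in / 255.0) ** gamma


dark_pixel = 64
for gamma in (0.5, 1.0, 2.0):
    print(gamma, round(gamma_correct(dark_pixel, gamma), 1))
```

With γ = 0.5 the dark pixel is lifted to roughly twice its original value, whereas γ = 2.0 pushes it further toward black; the end points 0 and 255 are fixed for any positive γ.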

Merging of Enhanced LAB Components

Once the L channel is enhanced, it is recombined with the original chromatic components a and b to form the enhanced LAB image. The channels

$\:{L}_{enh},a,b$

are merged into a single image, which is then converted back into RGB space for visualization.

$\:LA{B}_{enh}=[{L}_{enh},a,b]$

Final RGB Reconstruction

The improved LAB image is transformed back into the RGB space to produce the final output image

$\:{I}_{RGB}^{enh}$

. The inverse of the RGB-to-XYZ transformation matrix,

$\:{M}^{-1}$

, is applied to the XYZ coordinates obtained from the enhanced LAB image:

$\:{I}_{RGB}^{enh}={M}^{-1}\times\:\left[\begin{array}{c}X\\\:Y\\\:Z\end{array}\right]$

4. Results and Discussion

This section presents a comparative analysis of several object detection methods and evaluation metrics on the VisDrone dataset. The comparison highlights the differences in performance in order to identify the most suitable approach for UAV detection tasks. In addition, a case study was carried out on the DAWN dataset.

4.1. VisDrone Dataset

UAVs are now used in agriculture, aerial photography, surveillance, and rapid delivery, which makes automatic interpretation of drone-captured imagery increasingly important and brings computer vision and drone technology closer together. The VisDrone2019 dataset was introduced as a comprehensive large-scale benchmark specifically designed to support key computer vision tasks in aerial imagery. Developed by the AISKYEYE team at the Lab of Machine Learning and Data Mining, Tianjin University, China, VisDrone2019 comprises 288 video sequences totaling 261,908 frames, along with 10,209 high-resolution static images, all captured by various drone-mounted cameras. The data were collected in 14 cities in China and cover both urban and rural scenes, a variety of object types (including people, bikes, vehicles, and tricycles), and both sparse and dense environments. Diverse drone platforms, weather conditions, and lighting conditions were used during collection, which broadens the applicability of the data. The dataset contains more than 2.6 million manually annotated bounding boxes, together with metadata on object class, occlusion level, and scene visibility, making VisDrone2019 one of the largest and most flexible datasets for drone-based computer vision research.

4.2 Simulation Outcomes on VisDrone Dataset

Table 1 reports the object detection performance of the proposed UAVOD-Net, which integrates YOLOv11n with the BCO framework, on the VisDrone dataset. The evaluation covered 38,759 instances in 548 images across 10 object categories. Over all classes, the model achieved a precision of 0.594, a recall of 0.485, an mAP@50 of 0.516, and an mAP@50–95 of 0.326, which indicates balanced performance across Intersection-over-Union (IoU) thresholds. For cars, a key object class, the model achieved a precision of 0.811, a recall of 0.833, and an mAP@50 of 0.870, which supports its applicability in urban traffic settings. Pedestrians were detected with an mAP@50 of 0.595, which demonstrates the capability of the system to handle the small and occluded objects common in UAV images. Classes such as bicycle, tricycle, and awning-tricycle, which are generally considered difficult because of their small size and low contrast, obtained lower scores (mAP@50 of 0.261, 0.389, and 0.248, respectively), but these values still represent a gain over conventional methods given the complexity of aerial scenes. Larger objects such as buses showed consistent results, with an mAP@50 of 0.693 and an mAP@50–95 of 0.530, confirming that the model adapts to different object sizes. These findings indicate that BCO optimized the model weights in a way that improved convergence and detection accuracy, particularly in harsh UAV surveillance contexts characterized by heavy traffic, scale variation, and occlusion.

Table 1
Performance Evaluation of Proposed UAVOD-Net.
Class	Precision (P)	Recall (R)	mAP@50	mAP@50–95
All Classes	0.594	0.485	0.516	0.326
Pedestrian	0.650	0.541	0.595	0.302
People	0.657	0.402	0.468	0.201
Bicycle	0.357	0.294	0.261	0.130
Car	0.811	0.833	0.870	0.648
Van	0.582	0.545	0.561	0.414
Truck	0.564	0.467	0.499	0.359
Tricycle	0.533	0.368	0.389	0.231
Awning-Tricycle	0.361	0.261	0.248	0.165
Bus	0.789	0.609	0.693	0.530
Motorcycle	0.640	0.530	0.579	0.281
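The mAP@50 and mAP@50–95 columns depend on the IoU between a predicted box and a ground-truth box: under mAP@50 a detection counts as correct when its IoU is at least 0.5, while mAP@50–95 averages over stricter thresholds as well. A minimal sketch of the IoU computation is given below; the corner-based box format (x1, y1, x2, y2) and the example boxes are assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Two 10x10 boxes overlapping by half their width.
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # 1/3, below the 0.5 threshold
```

A shift of only half a box width already drops the IoU to one third, which helps explain why small objects, where a few pixels of localization error are a large fraction of the box, score much lower on mAP@50–95 than on mAP@50.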

Fig. 11 shows the training and validation performance trends of the proposed YOLOv11n model combined with BCO on the VisDrone dataset over 50 epochs. The first row shows the training losses: the box loss, the classification (cls) loss, and the Distribution Focal Loss (dfl) all decrease steadily across the epochs. In particular, train/box_loss decreases from about 2.2 to almost 1.18, which indicates improved bounding box regression. Similarly, train/cls_loss drops from around 2.5 to approximately 1.1, reflecting enhanced classification capability, while train/dfl_loss decreases from 1.5 to about 1.0, suggesting better localization confidence calibration. The second row shows the validation losses and metrics. The val/box_loss decreases steadily from 1.6 to about 1.18, and val/cls_loss from 1.5 to about 0.95, which suggests that the model does not overfit substantially. The val/dfl_loss follows a similar trend and stabilizes at 0.92, indicating better confidence in the object boundaries. Regarding detection performance, metrics/precision(B) improves to about 0.52, which indicates more accurate positive detections. Likewise, metrics/recall(B) rises from 0.2 to almost 0.49, which indicates better coverage of the objects, particularly in difficult cases such as occlusions or small objects. Moreover, mAP50 increases steadily from 0.15 to almost 0.52, and mAP50-95 increases steadily from 0.1 to approximately 0.33 by the last few epochs, which demonstrates improved multi-scale detection capability.
These findings support the lightweight YOLOv11n architecture combined with BCO weight optimization as an effective way to increase detection accuracy, localization accuracy, and small-object identification across different levels of aerial-scene complexity.

Fig. 11

Training and Validation Performance Curves of Proposed UAVOD-Net.

Figure 12 shows the F1-Confidence curve of each object class detected by the proposed UAVOD-Net on the VisDrone dataset. The F1-Score, which is the harmonic mean of precision and recall, is plotted against confidence thresholds from 0 to 1. The thick blue line represents the overall performance of the model on all classes; it reaches its maximum F1-Score of around 0.53 at a confidence threshold of 0.232, which can therefore be regarded as the best balance between precision and recall for overall detection. At the class level, cars (red curve) are detected best, with an F1-Score as high as 0.85 at lower confidence thresholds, which shows that the model detects larger objects such as cars with high confidence. The bus category (yellow-green curve) follows, achieving an F1-Score around 0.7, while motorcycles and pedestrians attain peak F1-Scores of approximately 0.55 and 0.6, respectively, reflecting moderate detection success for smaller, moving objects.

Fig. 12

F1-Confidence Curve of Proposed UAVOD-Net.
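The peak F1-Score read from the curve can be cross-checked against the all-class precision and recall reported in Table 1. The sketch below only applies the harmonic-mean definition; it assumes that the Table 1 values correspond to an operating point near the peak of the curve.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# All-class precision and recall from Table 1.
f1_all = f1_score(0.594, 0.485)
print(round(f1_all, 3))  # 0.534, in line with the peak of about 0.53
```

Because the harmonic mean is pulled toward the smaller of its two inputs, the overall F1 is limited mainly by recall in this case.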

Figure 13 shows the Precision-Recall (P-R) curve of each object class recognized by the proposed YOLOv11n model combined with BCO on the VisDrone dataset. Precision, the ratio of correct detections to all positive predictions, is plotted against Recall, the ratio of correctly detected objects to all ground-truth objects. The blue curve represents the average over all object classes, and its mAP@0.5 (mean Average Precision at an IoU threshold of 0.5) of 0.516 indicates consistent detection across classes. Among the individual categories, the car class (red curve) performs best, with the highest mAP of 0.870, indicating that the model identifies vehicles with few false positives or false negatives. Likewise, the bus class (yellow-green curve) has a high mAP of 0.693, which indicates effective recognition of larger objects. The pedestrian and motorbike classes are also competitive, with mAP values of 0.595 and 0.579, respectively, indicating moderate accuracy in detecting small moving objects. In contrast, more difficult categories such as bicycle (mAP 0.261), awning-tricycle (mAP 0.248), and tricycle (mAP 0.389) have lower precision-recall curves, which shows how hard it is to reliably detect small or partially covered objects in aerial views. The sharp drop in precision at higher recall for these classes indicates that false positives increase as the model tries to retrieve more of their instances.

Fig. 13

P-R Curve of Proposed UAVOD-Net.

Figure 14 shows the Precision-Confidence curve of the proposed UAVOD-Net on the VisDrone dataset, which examines detection precision across confidence thresholds for each object class. The x-axis denotes the model’s confidence score, ranging from 0 to 1, while the y-axis shows the corresponding precision, indicating the accuracy of positive detections at each confidence level. The thick blue curve depicts the average precision across all classes; it reaches a precision of 1.00 at the maximum confidence threshold of 1.00, which means that detections retained at the highest confidence contain no false positives. The car class (red curve) performs best along the whole curve, with over 90% precision at all confidence levels and almost perfect precision up to a confidence of 0.8, which is a strong indication of reliability in recognizing vehicles. Other object classes such as bus and van also show strong precision, exceeding 85% as confidence rises, which indicates well-developed detection ability even in difficult aerial scenes. The pedestrian and people classes have moderate precision that increases consistently with the confidence score, reflecting reliable detection of human-related objects. Conversely, classes such as bicycle, awning-tricycle, and tricycle are less precise at most confidence levels, indicating that detecting small, complex, and obscured targets in UAV imagery remains a challenge. Fluctuations in the motor and awning-tricycle curves at higher confidence levels indicate weaker detection of objects with faint or indistinct visual cues.

Fig. 14

Precision-Confidence Curve of Proposed UAVOD-Net.

Fig. 15

Recall-Confidence Curve of Proposed UAVOD-Net.

Fig. 15 shows the Recall-Confidence curve of the proposed UAVOD-Net on the VisDrone dataset. The curve shows the recall, i.e., the capacity of the model to find all the correct objects, at confidence levels between 0 and 1. The blue curve is the average recall over all object categories; it reaches its highest value of 0.72 at a confidence threshold of 0.0. Recall declines as the confidence threshold increases, which is the usual trade-off: stricter confidence filtering reduces false positives but can also discard some true positives. The car category (red curve) has the highest recall, with more than 90 percent at low confidence and above 80 percent at moderate confidence thresholds, which confirms the strong detection ability for vehicle objects. The bus and van classes also show relatively high recall at lower confidence levels followed by a steady decline with increasing confidence, as expected for relatively large objects. Smaller objects such as bicycle, tricycle, and awning-tricycle, on the other hand, have lower recall over the entire confidence range, which shows the difficulty of consistently identifying small, low-visibility objects in aerial images. The pedestrian and people classes show moderately good recall, with a maximum of around 70 percent at low confidence that decreases as the detection threshold is tightened.

Figure 16 shows the confusion matrix of the proposed UAVOD-Net on the VisDrone data. The car class shows good detection accuracy, with 11,446 correct predictions, 1,789 instances misclassified as background, and 655 confused with van. Pedestrian objects had 4,686 correct detections, whereas 1,731 false negatives indicate that small objects are difficult to detect. Similarly, for the people class the count was 1,893, of which 635 were incorrectly predicted as background. For smaller or more visually ambiguous classes such as bicycle, 341 instances were correctly recognized and 388 were misidentified as background, which confirms that small and low-contrast objects are difficult to detect. The motor class had 2,389 correct predictions, with 960 instances misclassified as background. Other difficult categories are tricycle and awning-tricycle, with fewer correct detections (334 and 137, respectively) and high levels of background misclassification. The dominance of the diagonal in the matrix indicates that the model classifies the main object categories competently, especially large and salient objects such as cars.

Fig. 16

Confusion Matrix of Proposed UAVOD-Net.

Fig. 17 shows the object detection and classification results of the proposed UAVOD-Net on various video frames captured by UAVs in diverse urban settings. The red bounding boxes indicate detected objects such as pedestrians, cars, buses, trucks, tricycles, awning-tricycles, vans, and motorcycles, together with their confidence scores. Despite low light and complex backgrounds, the model correctly localizes numerous overlapping and small objects, which indicates its strength in real-world aerial surveillance. These visual results support the effectiveness of UAVOD-Net in handling various object types under demanding UAV imaging conditions.

Fig. 17

Object Detection and Classification Outcomes on Video Frames of Proposed UAVOD-Net.

4.3 Comparative Evaluation on VisDrone Dataset

Table 2 compares the mean Average Precision at 50% IoU (mAP@50), recall, and precision of five object detection models: SOD-YOLO [37], MSUD-YOLO [21], CF-YOLO [35], LRDS-YOLO [36], and the proposed UAVOD-Net. UAVOD-Net matches or outperforms the other four models in every category except Bus, which demonstrates its ability to detect various classes of objects in aerial imagery. For example, UAVOD-Net has the highest overall mAP of 0.516, compared with 0.428 for SOD-YOLO [37], 0.434 for MSUD-YOLO [21], 0.449 for CF-YOLO [35], and 0.436 for LRDS-YOLO [36]. The improvement is particularly large for Pedestrian (0.595 vs. 0.441, 0.480, 0.341, and 0.499), People (0.468 vs. 0.405, 0.398, 0.214, and 0.419), and Bicycle (0.261 vs. 0.129, 0.175, 0.109, and 0.168), which means that small objects are detected more effectively. For the larger Car and Bus classes, UAVOD-Net records competitive scores of 0.870 and 0.693, close to SOD-YOLO [37] (0.87 and 0.704) and higher than MSUD-YOLO [21] (0.84 and 0.60). In other transport classes such as Van (0.561), Truck (0.499), and Tricycle (0.389), UAVOD-Net improves on all of the compared methods, which indicates better overall object detection. The proposed model also remains more accurate in more difficult categories such as Awning-Tricycle and Motorcycle, with scores of 0.248 and 0.579, both higher than the baseline scores. Overall, UAVOD-Net shows better detection accuracy and robustness across nearly all categories, which indicates the benefit of its architectural improvements over CF-YOLO [35], SOD-YOLO [37], MSUD-YOLO [21], and LRDS-YOLO [36].

Table 2
mAP@50, Recall, Precision Comparison of Various Object Detection Methods.
Class	SOD-YOLO [37]	MSUD-YOLO [21]	CF-YOLO [35]	LRDS-YOLO[36]	Proposed UAVOD-Net
All Classes	0.428	0.434	0.449	0.436	0.516
Pedestrian	0.441	0.480	0.341	0.499	0.595
People	0.405	0.398	0.214	0.419	0.468
Bicycle	0.129	0.175	0.109	0.168	0.261
Car	0.87	0.84	0.763	0.837	0.870
Van	0.396	0.47	0.41	0.488	0.561
Truck	0.396	0.374	-	0.372	0.499
Tricycle	0.266	0.309	0.20	0.295	0.389
Awning-Tricycle	0.154	0.158	0.206	0.199	0.248
Bus	0.704	0.60	-	0.575	0.693
Motorcycle	0.516	0.527	0.344	0.505	0.579
Recall	-	-	0.434	0.416	0.485
Precision	-	-	0.528	0.533	0.594

Table 3 compares the computational complexity and detection accuracy of several state-of-the-art object detection models with the proposed UAVOD-Net. With only 5.7 GFLOPs and 2.5 million parameters, UAVOD-Net has one of the lowest computational costs in the comparison, yet it reaches the highest mAP@50, at 51.6%. For example, YOLOv10n, YOLOv9t, and YOLOv8n achieve 31.8%, 33.8%, and 33.4% mAP@50, respectively, at a higher cost (6.7 to 8.1 GFLOPs) and with significantly lower accuracy. Other compact models such as YOLOv7-Tiny (13.9G, 6.2M) and YOLOv5s (15.9G, 7.2M) reach only 37.6% and 32.7% mAP@50, respectively. The Transformer-based RT-DETR_1 [13] achieves a competitive 48.9% mAP@50 but is computationally expensive at 103.5 GFLOPs and 32M parameters, which makes it poorly suited to real-time UAV use. Similarly, SOD-YOLO [37], MSUD-YOLO [21], CF-YOLO [35], and LRDS-YOLO [36] achieve 39.2%, 43.4%, 44.9%, and 43.6% mAP, respectively, with SOD-YOLO, CF-YOLO, and LRDS-YOLO requiring 12.9, 3.77, and 24.1 GFLOPs, and all four fall short of the efficiency-accuracy balance of UAVOD-Net. These findings support the claim that the proposed UAVOD-Net offers a favorable trade-off between precision and computational efficiency, providing lightweight and high-precision detection suitable for real-time aerial and embedded vision systems.

Table 3
Computational Complexity Comparison of Proposed Method with Existing Approaches.
Model	Input size	FLOPs (G)	Parameters (M)	mAP@50 (%)
YOLOv10n [13]	640x640	6.7	2.7	31.8
YOLOv9t [13]	640x640	7.7	2.2	33.8
YOLOv8n [13]	640x640	8.1	3.0	33.4
YOLOv7-Tiny [13]	640x640	13.9	6.2	37.6
YOLOv5n [13]	640x640	7.1	2.5	31.9
YOLOv5s [13]	640x640	15.9	7.2	32.7
RT-DETR_1 [13]	640x640	103.5	32.0	48.9
SOD-YOLO [37]	640x640	12.9	6.9	50.7
MSUD-YOLO [21]	640x640	-	6.766	43.4
CF-YOLO [35]	640x640	3.77	23.9	44.9
LRDS-YOLO [36]	640x640	24.1	4.17	43.6
Proposed UAVOD-Net	640x640	5.7	2.5	51.6

Table 4
Performance Comparison of the Proposed Method with Baseline YOLOv11n.
Class	Precision		Recall		mAP@0.5		mAP@0.5–0.95
Class	YOLOv11n	UAVOD-Net	YOLOv11n	UAVOD-Net	YOLOv11n	UAVOD-Net	YOLOv11n	UAVOD-Net
All Classes	0.56	0.59	0.44	0.48	0.48	0.516	0.32	0.33
Pedestrian	0.61	0.65	0.50	0.54	0.55	0.595	0.28	0.30
People	0.62	0.66	0.38	0.40	0.43	0.468	0.18	0.20
Bicycle	0.33	0.36	0.26	0.29	0.23	0.261	0.12	0.13
Car	0.79	0.81	0.81	0.83	0.85	0.870	0.34	0.65
Van	0.55	0.58	0.51	0.55	0.53	0.561	0.39	0.41
Truck	0.54	0.56	0.44	0.47	0.47	0.499	0.31	0.36
Tricycle	0.50	0.53	0.34	0.37	0.36	0.389	0.21	0.23
Awning-Tricycle	0.34	0.36	0.23	0.26	0.22	0.248	0.13	0.17
Bus	0.76	0.79	0.59	0.61	0.66	0.693	0.52	0.53
Motorcycle	0.61	0.64	0.50	0.53	0.54	0.579	0.21	0.28

Table 4 compares YOLOv11n and the proposed UAVOD-Net for each object category in terms of Precision, Recall, mAP@0.5, and mAP@0.5–0.95. Overall, UAVOD-Net improves on every metric, indicating better feature extraction and detection. Over all classes, UAVOD-Net achieves a precision of 0.59 and a recall of 0.48, compared with 0.56 and 0.44 for YOLOv11n, together with an mAP@0.5 of 0.516 and an mAP@0.5–0.95 of 0.33 (vs. 0.48 and 0.32), indicating improved detection consistency. Precision gains of 0.04 in both the Pedestrian and People categories, together with Recall gains (0.54 vs. 0.50 and 0.40 vs. 0.38), indicate that UAVOD-Net detects humans more reliably. Similarly, for Bicycle and Car, UAVOD-Net raises mAP@0.5 from 0.23 to 0.261 and from 0.85 to 0.870, respectively, which implies better detection of both small and large objects. UAVOD-Net obtained higher mAP@0.5–0.95 scores (0.41, 0.36, and 0.53) for Van, Truck, and Bus than YOLOv11n (0.39, 0.31, and 0.52), along with higher precision. For smaller and more complicated targets such as Tricycle and Awning-Tricycle, UAVOD-Net achieves higher precision and mAP@0.5 (0.53 vs. 0.50 and 0.389 vs. 0.36 for Tricycle; 0.36 vs. 0.34 and 0.248 vs. 0.22 for Awning-Tricycle), which demonstrates its strength in detecting irregular shapes. Motorcycle detection also improves: mAP@0.5 increases from 0.54 to 0.579, and mAP@0.5–0.95 increases by 7 percentage points. These steady gains across all object types show that UAVOD-Net is more accurate than YOLOv11n while remaining robust to variation in object size and complexity.
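The per-class gains discussed above can be recomputed directly from the mAP@0.5 columns of Table 4. The sketch below only does this arithmetic on the tabulated values; the dictionary layout and variable names are hypothetical.

```python
# mAP@0.5 per class from Table 4: (YOLOv11n, UAVOD-Net).
map50 = {
    "Pedestrian": (0.55, 0.595),
    "People": (0.43, 0.468),
    "Bicycle": (0.23, 0.261),
    "Car": (0.85, 0.870),
    "Van": (0.53, 0.561),
    "Truck": (0.47, 0.499),
    "Tricycle": (0.36, 0.389),
    "Awning-Tricycle": (0.22, 0.248),
    "Bus": (0.66, 0.693),
    "Motorcycle": (0.54, 0.579),
}

# Absolute gain of UAVOD-Net over the baseline for each class.
gains = {name: proposed - base for name, (base, proposed) in map50.items()}
mean_gain = sum(gains.values()) / len(gains)
largest = max(gains, key=gains.get)

for name, gain in sorted(gains.items(), key=lambda kv: -kv[1]):
    print(f"{name:16s} +{gain:.3f}")
print(f"mean gain over classes: +{mean_gain:.4f}, largest: {largest}")
```

Every class improves under this comparison, with the per-class gains in mAP@0.5 ranging from 0.020 (Car) to 0.045 (Pedestrian).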

4.4 Case Study on DAWN Dataset

The DAWN dataset is a real-world image dataset designed specifically to test and improve object detection performance in adverse weather conditions. Sample images from the DAWN dataset are displayed in Fig. 18. It consists of 1,000 images taken in various traffic environments, including urban roads, highways, and freeways, which provides broad coverage of real traffic situations. The data are divided into four types of adverse weather (fog, snow, rain, and sandstorms), each with distinct visibility and environmental challenges. All images are annotated with object bounding boxes, which makes the dataset useful for autonomous driving and smart video surveillance. The DAWN dataset is an important reference for researchers developing robust vision-based models that operate effectively in non-ideal conditions, since it offers insight into how weather-related distortions affect vehicle and object detection accuracy.

Fig. 18

Sample Images from DAWN Dataset.

In the sandstorm image in Fig. 19, a heavy layer of dust lowers visibility; ALGE addresses this effectively and produces a clearer, sharper image in which objects are easy to identify. A similar improvement is shown in Fig. 20, where ALGE restores contrast and depth perception under foggy conditions and allows the detection model to recognize vehicles and road features that were previously obscured. Figure 21 shows a rain image degraded by motion blur and raindrop distortion; with ALGE, object edges become much more distinct and detection accuracy increases. Fig. 22, also captured in rainy conditions, supports the consistency of ALGE across different rainfall intensities and demonstrates its strength in preserving both natural color tone and fine object detail.

Fig. 19

ALGE Enhanced and Object Detection on Sand Images.

Fig. 20

ALGE Enhanced and Object Detection on Fog Images.

Fig. 21

ALGE Enhanced and Object Detection on Rain Images.

Fig. 22

ALGE Enhanced and Object Detection on Rain Images.

Table 5 presents a detailed comparison of object detection methods under adverse weather conditions. The Data Merging + YOLOv8 model [31] achieved a Mean IoU of 0.9642, an MSE of 0.0086, and an Accuracy of 0.9625, indicating good detection performance but the lowest scores among the compared methods. The Multi-Objective CNN model [32] improved on these results by combining weather classification and detection, reaching a Mean IoU of 0.9716, an MSE of 0.0074, and an Accuracy of 0.9702. DSNet [33] performed better still, with a Mean IoU of 0.9827, an MSE of 0.0049, and an Accuracy of 0.9814, reflecting its strong feature fusion and detail recovery. Explainable TCN [34] offered balanced performance, with a Mean IoU of 0.9758, an MSE of 0.0061, and an Accuracy of 0.9743, while remaining interpretable and handling temporal variations. The proposed method outperformed all of the existing models, with the highest Mean IoU of 0.9993, the lowest MSE of 0.0007, and Accuracy, Precision, Recall, and F1-Score of 0.9933, 0.9946, 0.9933, and 0.9937, respectively, indicating strong reliability and detection accuracy across adverse weather conditions.

Table 5
Comparative Analysis of Object Detection Methods under Adverse Weather Conditions
Method	Mean IoU	MSE	Accuracy	Precision	Recall	F1-Score
Data Merging + YOLOv8 [31]	0.9642	0.0086	0.9625	0.9658	0.9612	0.9635
Multi-Objective CNN [32]	0.9716	0.0074	0.9702	0.9725	0.9689	0.9707
DSNet [33]	0.9827	0.0049	0.9814	0.9835	0.9796	0.9815
Explainable TCN [34]	0.9758	0.0061	0.9743	0.9761	0.9728	0.9744
Proposed Method	0.9993	0.0007	0.9933	0.9946	0.9933	0.9937
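
For readers who wish to reproduce metrics of the kind reported in Table 5, the following Python sketch shows how Mean IoU, MSE, Precision, Recall, and F1-Score are conventionally computed. It is an illustration only: the function names, the (x_min, y_min, x_max, y_max) box format, the one-to-one pairing of predicted and ground-truth boxes, and the coordinate-wise definition of MSE are assumptions, since the exact evaluation protocol is not specified in this section.

```python
from typing import Sequence, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in consistent units.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mean_iou(preds: Sequence[Box], truths: Sequence[Box]) -> float:
    """Average IoU over already-matched (prediction, ground truth) pairs."""
    pairs = list(zip(preds, truths))
    return sum(iou(p, t) for p, t in pairs) / len(pairs)


def box_mse(preds: Sequence[Box], truths: Sequence[Box]) -> float:
    """Mean squared error over box coordinates (assumed definition)."""
    sq = [(pc - tc) ** 2
          for p, t in zip(preds, truths)
          for pc, tc in zip(p, t)]
    return sum(sq) / len(sq)


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, Recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

In a detection setting, the counts passed to precision_recall_f1 would typically come from matching predictions to ground-truth boxes at a chosen IoU threshold; that matching step is omitted here for brevity.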

5. Conclusion

In conclusion, the proposed UAVOD-Net framework, which combines YOLOv11n with BCO, effectively enhances object detection accuracy, especially for small and occluded targets in UAV aerial imagery. UAVOD-Net achieved an average improvement of 4.25% in Precision (P), 4.83% in Recall (R), and 3.65% in mAP@50 over existing methods across all object categories. Small object classes such as bicycles and awning-tricycles improved by up to 7%, addressing limitations in detecting tiny targets. These gains support the suitability of the proposed method for UAV-based real-time object detection. Future work can explore integrating advanced attention mechanisms and multi-modal sensor fusion to further improve detection reliability in complex and dynamic real-world environments. Additionally, real-time deployment on embedded UAV platforms presents a promising direction for practical applications.

Data Availability

The VisDrone-2019 DET dataset was released by the AISKYEYE team and is publicly available in the VisDrone Dataset repository under project identifier VisDrone-2019, and is accessible at the following URL: https://datasetninja.com/vis-drone-2019-det#download

The DAWN dataset was deposited in the Mendeley Data repository under dataset identifier 766ygrbt8y, version 3, and is available at the following URL: https://data.mendeley.com/datasets/766ygrbt8y/3

References

1. Hou, W. et al. Small Object Detection Method for UAV Remote Sensing Images Based on αS-YOLO. IEEE J. Sel. Top. Appl. Earth Observations Remote Sens. 18, 8984–8994. https://doi.org/10.1109/JSTARS.2025.3539873 (2025).

2. Liao, H. et al. ViDroneNet: An efficient detector specialized for Target Detection in Aerial Images. Digital Signal Processing, 105270 (2025).

3. Dong, Y., Guo, J. & Xu, F. Cross-YOLO: an object detection algorithm for UAV based on improved YOLOv8 model. Signal Image Video Process. 19 (6), 489 (2025).

4. Wang, D., Liu, J. & Jin, S. Lightweight small object detection network for aerial images based on cross-attention and information injection. J. Real-Time Image Proc. 22 (3), 1–15 (2025).

5. Liu, Y., He, M. & Hui, B. ESO-DETR: An Improved Real-Time Detection Transformer Model for Enhanced Small Object Detection in UAV Imagery. Drones 9 (2), 143 (2025).

6. Xie, J. et al. KL-YOLO: A Lightweight Adaptive Global Feature Enhancement Network for Small Object Detection in Low-Altitude Remote Sensing Imagery. IEEE Trans. Instrum. Measurement. https://doi.org/10.1109/TIM.2025.3576957

7. Dong, D. et al. EA-YOLO: An Efficient and Accurate UAV Image Object Detection Algorithm. IEEJ Trans. Electr. Electron. Eng. 20 (1), 61–68 (2025).

8. Wang, W. & Li, Q. TPM-EViT: Tri-Probability Map-Enhanced Vision Transformer Framework for UAV Object Detection. Knowledge-Based Systems, 113983 (2025).

9. Song, B., Zhao, S., Wang, Z., Liu, W. & Liu, X. DAF-DETR: A dynamic adaptation feature transformer for enhanced object detection in unmanned aerial vehicles. Knowledge-Based Systems, 113760 (2025).

10. Liu, X., Zhang, G. & Zhou, B. An efficient feature aggregation network for small object detection in UAV aerial images. J. Supercomputing 81 (4), 1–26 (2025).

11. Prakash, P., Yamalakonda, V. G. & Abhinoy Kumar, S. Drone design for object detection using YOLOv8. In 2025 International Conference on Innovation in Computing and Engineering (ICE), 1–6 (IEEE, 2025).

12. Bai, C. et al. SFFEF-YOLO: Small object detection network based on fine-grained feature extraction and fusion for unmanned aerial images. Image Vis. Comput. 156, 105469 (2025).

13. He, Z. & Cao, L. SOD-YOLO: Small Object Detection Network for UAV Aerial Images. IEEJ Trans. Electr. Electron. Eng. 20 (3), 431–439 (2025).

14. Ge, X. et al. Enhancing Real-Time Aerial Image Object Detection with High-Frequency Feature Learning and Context-Aware Fusion. Remote Sens. 17 (12), 1994 (2025).

15. Bae, M. H., Park, S. W., Park, J., Jung, S. H. & Sim, C.-B. YOLO-RACE: reassembly and convolutional block attention for enhanced dense object detection. Pattern Anal. Appl. 28 (2), 90 (2025).

16. Wang, S. et al. Hierarchical Scale Awareness for object detection in Unmanned Aerial Vehicle Scenes. Appl. Soft Comput. 168, 112487 (2025).

17. Zhong, H., Zhang, Y., Shi, Z., Zhang, Y. & Zhao, L. PS-YOLO: A Lighter and Faster Network for UAV Object Detection. Remote Sens. 17 (9), 1641 (2025).

18. Qi, G. Multi-Scale Feature Fusion and Context-Enhanced Spatial Sparse Convolution Single-Shot Detector for Unmanned Aerial Vehicle Image Object Detection. Appl. Sci. 15 (2), 924 (2025).

19. Bikku, T., Sree, K. P. N. V. S., Thota, S., Kumar, M. K. & Shanmugasundaram, P. MSRP-TODNet: a multi-scale reinforced region wise analyser for tiny object detection. BMC Res. Notes 18 (1), 200 (2025).

20. Bi, H., Dai, R., Han, F. & Zhang, C. DR-YOLO: An improved multi-scale small object detection model for drone aerial photography scenes based on YOLOv7. Digital Signal Processing, 105265 (2025).

21. Zhao, X. et al. MSUD-YOLO: A Novel Multiscale Small Object Detection Model for UAV Aerial Images. Drones 9 (6), 429 (2025).

22. Yang, Y., Feng, Z., Jin, W. & Miao, P. ADD-YOLO: a new model for object detection in aerial images. Multimedia Syst. 31 (2), 120 (2025).

23. Jobaer, S., Tang, X., Zhang, Y., Li, G. & Ahmed, F. A novel knowledge distillation framework for enhancing small object detection in blurry environments with unmanned aerial vehicle-assisted images. Complex Intell. Syst. 11 (1), 1–27 (2025).

24. Li, H. & Qu, H. VMC-Net: multi-scale feature aggregation and distribution with contextual attention guided fusion for aerial object detection. Complex Intell. Syst. 11 (8), 1–25 (2025).

25. Sun, H. et al. SOD-YOLOv10: Small Object Detection in Remote Sensing Images Based on YOLOv10. IEEE Geosci. Remote Sens. Lett. 22, 1–5, Art. no. 8000705. https://doi.org/10.1109/LGRS.2025.3534786 (2025).

26. Li, W. et al. ED-YOLO: an object detection algorithm for drone imagery focusing on edge information and small object features. Multimedia Syst. 31 (3), 1–15 (2025).

27. Xu, X. et al. ESOD-YOLO: an enhanced efficient small object detection framework for aerial images. Computing 107 (2), 1–19 (2025).

28. Wu, Y., Mu, X., Shi, H. & Hou, M. An object detection model AAPW-YOLO for UAV remote sensing images based on adaptive convolution and reconstructed feature fusion. Sci. Rep. 15 (1), 1–20 (2025).

29. Su, Q. et al. Drone object detection incorporating multi-head mixed self-attention and dynamic regression mapping loss function. J. Real-Time Image Proc. 22 (2), 56 (2025).

30. Li, H. et al. RLRD-YOLO: An Improved YOLOv8 Algorithm for Small Object Detection from an Unmanned Aerial Vehicle (UAV) Perspective. Drones 9 (4), 293 (2025).

31. Kumar, D. & Muhammad, N. Object Detection in Adverse Weather for Autonomous Driving through Data Merging and YOLOv8. Sensors 23, 8471. https://doi.org/10.3390/s23208471 (2023).

32. Aloufi, N., Alnori, A. & Basuhail, A. Enhancing Autonomous Vehicle Perception in Adverse Weather: A Multi Objectives Model for Integrated Weather Classification and Object Detection. Electronics 13, 3063. https://doi.org/10.3390/electronics13153063 (2024).

33. Jing, Z. et al. DSNet enables feature fusion and detail restoration for accurate object detection in foggy conditions. Sci. Rep. 15, 21584. https://doi.org/10.1038/s41598-025-03902-y (2025).

34. Alzanin, S. Explainable artificial intelligence with temporal convolutional networks for adverse weather condition detection in driverless vehicles. Sci. Rep. 15 (1), 19475. https://doi.org/10.1038/s41598-025-05136-4 (2025). PMID: 40461805; PMCID: PMC12134332.

35. Wang, C. et al. CF-YOLO for small target detection in drone imagery based on YOLOv11 algorithm. Sci. Rep. 15 (1), 16741 (2025).

36. Han, Y. et al. LRDS-YOLO enhances small object detection in UAV aerial images with a lightweight and efficient design. Sci. Rep. 15 (1), 22627 (2025).

37. Xiao, Y. & Di, N. SOD-YOLO: A lightweight small object detection framework. Sci. Rep. 14 (1), 25624 (2024).

Author Contribution

Dama Durga Bhavani: methodology and writing of the original draft. Usha Rani Nelakuditi: conceptualization, review, and editing.

Declarations

Competing interests

The authors declare that they have no competing interests as defined by Nature Research, or other interests that might be perceived to influence the results and/or discussion reported in this paper.

Funding

This work was undertaken as part of the authors’ academic research and did not receive any dedicated funding.
