1. Introduction
The rapid adoption of UAVs has sharply increased the volume of airborne data: recent reports indicate that about 3 million commercial drones are in use worldwide and that the UAV market will more than triple to 47 billion dollars by 2025 [1]. At the same time, aerial benchmarks now contain images with hundreds of labelled object instances; VisDrone2019, for example, provides 2.6 million annotations across 261,908 video frames and 10,209 static images [2]. This growth points to an increasing need for efficient and scalable object detection methods for aerial surveillance, smart farming, and city monitoring.
Conventional object detection approaches have relied mostly on handcrafted features, such as edge descriptors and Histogram of Oriented Gradients (HOG) [3], or on background subtraction combined with sliding-window classifiers. These methods work well in controlled settings, but on UAV imagery they have consistently yielded poor detection rates and inaccurate results because of complex backgrounds, small objects, varying altitudes, and crowded open scenes.
To address this, Artificial Intelligence (AI) [4] driven object detection, especially with deep learning models, has become a strong contender. Deep Convolutional Neural Network (CNN) [5] detectors such as Faster R-CNN, Single Shot MultiBox Detector (SSD), and YOLO [6] have made substantial progress in balancing precision, speed, and flexibility. These models nevertheless have limitations in UAV-based applications, particularly when detecting small, occluded, or densely packed objects in complex aerial conditions.
Several AI-focused organizations and research groups have turned to the VisDrone dataset to design and evaluate efficient object detection models. For example, Ultralytics has applied its YOLOv5 and YOLOv8 series to UAV detection tasks, optimizing them for real-time aerial imagery [7]. These models are extensively benchmarked on VisDrone, which has improved the detection speed and accuracy of drone-related applications.
Similarly, companies such as Hesai Technology [9] and research projects such as Aerial Image Tiny Object Detection (AI-TOD) [8] have used VisDrone in their pipelines and worked on improving small-object detection in aerial settings. Despite these innovations, current methods still tend to carry a significant computational burden, lose precision under occlusion, and become unstable across different settings, so further work is needed on lightweight, real-time, and accurate UAV detection frameworks [9].
The rest of the paper is organized as follows: Section 2 presents a comprehensive survey of existing UAV object detection methods, highlighting their strengths and limitations. Section 3 details our proposed methodology, while Section 4 reports experimental results and analysis, and Section 5 offers concluding remarks and future research directions.
2. Literature Survey
The related work highlights a variety of lightweight, attention-enhanced YOLO architectures tailored for UAV object detection, yet most methods either increased architectural complexity, incurred inference latency, or sacrificed large-object performance to improve small-object accuracy.
2.1 Related Work
Liu et al. [10] introduced the efficient feature aggregation network (EFANet) to detect small objects in UAV images, using multi-level context to improve detection precision in complicated aerial environments. It suffered a large computational overhead because of its several aggregation layers. Prakash et al. [11] developed a drone-based detection system built on YOLOv8 that improved inference speed and accuracy through model refinement and hardware optimization. The system achieved real-time performance on embedded platforms. Bai et al. [12] proposed SFFEF-YOLO, which combined fine-grained feature extraction and fusion modules to improve the detection of tiny aerial objects. The additional fusion modules increased model complexity and made real-time deployment harder.
He and Cao [13] proposed SOD-YOLO, a small-object detection network customized for UAV aerial images with a lightweight feature-boosting block. It showed reduced robustness in highly cluttered or heavily occluded environments. Ge et al. [14] introduced a real-time detector for aerial imagery that improved detection accuracy through high-frequency feature learning and context-aware fusion. It required additional parameters and a longer training period. Bae et al. [15] proposed YOLO-RACE, which incorporated reassembly and convolutional block attention to strengthen dense object detection in UAV images. It outperformed existing YOLO variants on crowded aerial scenes.
Wang, Shijie et al. [16] added a scale-aware hierarchy to the neck of a YOLO-based detector for UAV scenes, organizing feature learning hierarchically from coarse to fine to improve the detection of objects at diverse scales. This further complicated the neck and lengthened inference time on edge platforms. Zhong, Han et al. [17] created the PS-YOLO network, using partial convolution in the backbone, FasterBIFFPN in the neck, a GSCD head, and the NWDLoss loss function. The model cut parameters by more than 40 percent and achieved a 1.3 percent mAP increase on VisDrone2019. The single-shot detector proposed by Qi, Guimei et al. [18] combined multi-scale features with a context-enhanced spatial sparse convolution to strengthen the detection of small targets in UAV images. The spatial sparse convolution made the architecture more irregular and harder to run on standard UAV hardware.
Bikku, Thulasi et al. [19] introduced MSRP-TODNets, a small-object detection network that strengthened region-based analysis with multi-scale feature aggregation. The region analysis led to uneven processing latency, which lowered real-time consistency. Bi, Hongbo et al. [20] proposed DR-YOLO, a multi-scale small-object detection model based on YOLOv7 that uses improved anchor strategies and feature decoupling for aerial photography. Its strong reliance on the YOLOv7 design made it resource-intensive for real-time UAV deployment. Yang et al. [22] introduced ADD‑YOLO, which replaced standard convolution with AKConv, applied C2f_DRAC with CBAM attention, and implemented a DABFPN neck including a special small‑object detection layer, achieving up to 15.7% mAP@0.5 gains on aerial data. Jobaer et al. [23] developed a self‑supervised knowledge distillation framework that trained a deblurring subnet with dual attention modules to boost small‑object detection in blurry UAV images, achieving 4.3–4.6% mAP improvements on a VisDrone blur dataset. Li and Qu [24] proposed VMC‑Net, which combined multi‑scale feature aggregation (MFADM) with distribution modules, VHeat C2f for fine tuning, and context‑attention guided fusion (CAGFM) to enhance small object detection; experiments on VisDrone showed notable precision gains.
Sun et al. [25] proposed SOD-YOLOv10, a lightweight YOLOv10 variant focused on detecting small objects in remote sensing imagery, which adopted efficient feature refinement and anchor design optimization to enhance small-object recall and accuracy. Li et al. [26] designed ED-YOLO, a modification of YOLOv5-n with a new Efficient Edge Information Extraction (EStem) module and Multi-Path Coordinate Attention (MPCA) with Deformable Convolution V2 (DCNv2) to capture edge and small-object features. They additionally used a Multi-Scale Efficient Decoupled Head (MSE-head), adding a dedicated small-object detection head and removing the large-object detection head, which gave a significant mAP@50 improvement and parameter reduction on VisDrone2019. Xu et al. [27] proposed ESOD-YOLO, an efficient small-object detector that integrated lightweight backbone enhancements with efficient neck feature fusion and spatial attention to improve detection accuracy in UAV imagery. The extensive streamlining of the architecture restricted the model's capacity for general object features.
Wu et al. [28] proposed AAPW‑YOLO, which improved the YOLOv8 backbone by integrating adaptive kernel convolution (AKConv) and reconstructed feature fusion (ASFP2), alongside introducing Wise‑IoU loss for sharper regression. It added architectural complexity that could challenge deployment on lightweight embedded UAV systems. Su et al. [29] introduced a novel drone-oriented detector incorporating multi‑head mixed self‑attention and a dynamic regression mapping loss, which enhanced feature representation and refined bounding box regression during training. The self‑attention layers increased inference latency, making the detector less suitable for speed-critical applications. Li et al. [30] developed RLRD‑YOLO, an improved YOLOv8 adapted for UAV perspectives by embedding Receptive Field Attention Convolution via RFCBAM Conv and large‑scale kernel attention (LSKA) in the SPPF layer. It delivered a 12.2% mAP@0.5 improvement on VisDrone2019.
Kumar et al. [31] proposed a data merging approach that integrated heterogeneous weather condition datasets with the YOLOv8 detection framework to improve object localization accuracy under fog, rain, and snow. However, the approach exhibited reduced generalization on preprocessed synthetic data. Aloufi et al. [32] proposed a multi-objective framework that integrated weather classification and object detection using a unified CNN backbone optimized for autonomous driving perception. Jing et al. [33] proposed a Dual-Stream Network (DSNet) that performed feature fusion and detail restoration through a hierarchical encoder–decoder architecture for accurate object detection in foggy scenes. Alzanin et al. [34] proposed an explainable artificial intelligence framework based on Temporal Convolutional Networks (TCNs) to identify adverse weather conditions affecting autonomous vehicle performance. It was, however, restricted by the accuracy of sequential time information and was therefore less useful in sparse framerate or low-light situations.
2.2 Research Gaps
Although most studies have adopted YOLO-based models for UAV object detection because of their real-time capability and high accuracy, recent research concentrates mainly on architectural adjustments or attention mechanisms, while weight and hyperparameter optimization through nature-inspired metaheuristics is also a significant concern that has received less attention. Feature fusion, attention modules, and scale-aware designs have been incorporated to help detect small objects in complicated aerial scenes, but they tend to increase model complexity, inference time, or computational cost. Moreover, current models often degrade in the presence of occlusions, scale changes, and high-density object distributions, particularly on datasets such as VisDrone and DAWN. The outstanding research gap is therefore the development of lightweight, real-time detection frameworks that use bio-inspired optimizers to tune network weights, reduce loss, and improve robustness without compromising speed or computational efficiency.
2.3 Novel Contributions
To present the first integrated framework combining VisDrone and DAWN data, image enhancement, YOLOv11n, and BCO for UAV object detection.
To implement YOLOv11n with optimized parameters via BCO, improving detection precision and reducing computational load.
To achieve significant loss reduction and accuracy enhancement for real-time detection in occluded and densely populated aerial scenes.
3. Proposed UAVOD-Net
The proposed approach combines the VisDrone dataset, image preprocessing, YOLOv11n lightweight object detection, and BCO for fine-tuning weights and parameters, a combination that has not been reported in existing surveys. In addition, experiments were conducted on the DAWN dataset as a case study that employs the Adaptive Luminance and Gamma Enhancement (ALGE) preprocessing method. This hybrid solution addresses the most important limitations of existing UAV object detection methods, especially low detection accuracy in cluttered, occluded, and small-object scenes. The proposed framework reduces detection loss by deploying an optimized and fast detector, applying a nature-inspired optimization algorithm, and enhancing image clarity, which makes it suitable for real-time aerial surveillance. The proposed UAVOD-Net system architecture is shown in Fig. 1, and its steps are described in detail as follows:
Step 1: VisDrone and DAWN Datasets: This work performs object detection and classification independently on each dataset; the same algorithm is trained separately on VisDrone and on DAWN.
Step 2: Image Preprocessing: Images and video frames undergo basic preprocessing, consisting of resizing to standardize input dimensions and normalization to scale pixel intensities. These steps improve the visibility of features, particularly small and partially hidden ones, and thereby improve model performance.
Step 3: ALGE Preprocessing: The ALGE method enhances image visibility and clarity under unfavorable conditions such as poor lighting or bad weather. It first boosts luminance contrast in the LAB color space using CLAHE, emphasizing important features while avoiding overexposure. Gamma correction is then applied to brighten dark areas naturally, producing a balanced and detailed image suitable for UAV-based vision.
Step 4: YOLOv11n Lightweight Object Detection: The processed images are fed into the YOLOv11n object detector, which has an efficient architecture and a high inference rate. This version is chosen because it balances detection precision and computation speed, making it applicable to resource-constrained UAV applications. The YOLOv11n model was trained separately on the VisDrone and DAWN datasets, and the resulting model files were saved separately.
Step 5: BCO for Network Tuning: The BCO algorithm is incorporated to optimize the weights and hyperparameters of YOLOv11n and thereby increase model accuracy. BCO mimics the adaptive hunting strategies of bobcats to adjust the network iteratively, achieving better convergence, lower detection loss, and greater detection robustness under variable aerial conditions. BCO generates the optimal network parameters for the VisDrone and DAWN datasets independently, with no overlap between the two.
Step 6: Outputs: After training, the proposed UAVOD-Net produces bounding boxes and class labels for the VisDrone dataset, and ALGE-preprocessed images together with bounding boxes and class labels for the DAWN dataset.
3.1 YOLOv11n Lightweight Object Detection
The YOLOv11n architecture is a lightweight yet efficient object detection framework aimed at the problems that arise in UAV-based aerial imaging, such as small-object detection, occlusion, and objects captured under varying environmental conditions. Figure 2 presents the YOLOv11n block diagram. The architecture starts with a compact Backbone of sequential residual blocks that extract multi-level features from the input drone image, resized to 640 x 640 x 3, through successive convolutional blocks and C3K2 blocks, which reduce computational complexity with depthwise separable convolutions without compromising feature richness. The final Backbone output is passed through Spatial Pyramid Fast Fusion (SPFF) and Channel and Spatial Attention (C2PSA), which refine the multi-scale contextual representation. This is essential for VisDrone and DAWN, where images contain densely packed objects such as pedestrians, cars, and bicycles in street and countryside scenes.
The Neck uses feature fusion: the outputs of various layers are upsampled and concatenated, retaining the spatial information needed to identify small and medium objects in drone imagery. The three detection heads, with resolutions of 80x80, 40x40, and 20x20 targeting small, medium-sized, and large objects respectively, ensure robust performance across the wide range of object scales and distances often seen in VisDrone and DAWN images. Each detection head predicts bounding box coordinates, objectness scores, and class probabilities, which are then refined by post-processing such as Non-Maximum Suppression (NMS). The design of YOLOv11n offers high detection speed and accuracy under the variations in lighting, occlusion, and crowd density exhibited in the VisDrone and DAWN datasets, making it well suited to real-time applications such as UAV surveillance.
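The NMS post-processing step mentioned above can be illustrated with a minimal greedy sketch in plain Python. The box format `(x1, y1, x2, y2)`, the function names `iou` and `nms`, and the IoU threshold of 0.5 are assumptions made for illustration; they are not taken from the YOLOv11n implementation.

```python
def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_thr=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes that overlap it."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep box i only if it does not overlap any already-kept box too much.
        if all(iou(boxes[i], boxes[j]) <= iou_thr for j in keep):
            keep.append(i)
    return keep


# Hypothetical detections: two overlapping boxes and one separate box.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)  # indices of the boxes that survive suppression
```

In this toy case the second box overlaps the first with an IoU of about 0.68, so only the first and third boxes are kept.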
3.1.1 Backbone for Feature Extraction
The Backbone of YOLOv11n progressively extracts hierarchical features from the input UAV image using convolutional layers and lightweight C3K2 modules. It reduces spatial resolution while increasing feature depth to capture both low-level edges and high-level semantics. This stage generates rich feature maps essential for accurate object detection in aerial VisDrone images. Figure 3 shows the Bottleneck block diagram, which is implemented using serial convolutions with a skip connection. Fig. 4 shows the C3K module, which is built from parallel Bottleneck modules whose outputs are concatenated. Finally, Fig. 5 shows the C3K2 module, which is built from C3K modules arranged in series.
First Conv Block: The input aerial image is denoted as $X \in \mathbb{R}^{640 \times 640 \times 3}$, where 640×640 is the fixed spatial dimension and 3 indicates the RGB channels. The image is resized and normalized before being fed into the model. Eq. (1) applies a convolution $W_1$ over the input image $X$ of size 640×640×3 to extract low-level features:

$$F_1 = \sigma(W_1 * X + b_1) \tag{1}$$

Here $b_1$ is the bias, and $\sigma$ is the activation function (Leaky ReLU). The output feature map $F_1$ has reduced spatial dimensions of 320×320×64.
Repeating this pattern forms the backbone, progressively shrinking the spatial dimensions while deepening the features.
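The reduction from 640×640 to 320×320 is what a stride-2 convolution produces. As a minimal sketch, the standard output-size formula can be checked in Python; the kernel size of 3, stride of 2, and padding of 1 are assumptions chosen to match the stated dimensions and are not given in the text.

```python
def conv_out_size(size, kernel=3, stride=2, padding=1):
    """Spatial output size of a convolution: floor((n + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1


# Apply five assumed stride-2 stages to the 640-pixel input side.
sizes = [640]
for _ in range(5):
    sizes.append(conv_out_size(sizes[-1]))

print(sizes)  # [640, 320, 160, 80, 40, 20]
```

Under these assumptions, the last three sizes (80, 40, 20) coincide with the feature-map resolutions used by the detection heads.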
3.1.2 SPFF and C2PSA Modules
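First, SPFF enriches the multi-scale context of the final backbone feature map $F_{B}$ with $F_{SPFF} = \mathrm{SPFF}(F_{B})$, and then C2PSA highlights the significant areas, giving $F_{20} = \mathrm{C2PSA}(F_{SPFF})$. The overall output $F_{20}$ retains the 20×20×1024 dimensions, which preserves the accuracy of the detection.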
Figure 8 shows the C2PSA module, which is intended to make feature learning more efficient by integrating attention with cross-scale interaction. The input feature map is split into two paths: one passes through an Attention block to extract significant spatial and channel information, and the other bypasses it. The two paths are summed and passed to a Feed Forward Network (FFN), after which feature mixing takes place through a 1 x 1 convolution. A 1x1 convolution without activation is then applied to preserve linear feature transformations. Lastly, the input and processed features are combined to generate the output, which benefits multi-scale feature fusion and attention-focused learning.
3.1.3 Neck for Feature Aggregation
The Neck of YOLOv11n upscales high-level features and combines them with downsampled feature maps. This fusion retains the fine spatial detail and rich semantic content required to identify small and medium objects. The process enhances the network's capacity to cope with varied object sizes and complicated aerial scenes in the VisDrone and DAWN datasets.
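The upsampled deep feature $\mathrm{Up}(F_{20})$, obtained from the 20×20 SPFF and C2PSA output, and the 40×40 backbone feature $F_{40}$ are joined in the channel dimension:

$$N_{40} = \mathrm{Concat}\big(\mathrm{Up}(F_{20}),\, F_{40}\big)$$

This combines deep semantic and mid-level characteristics to provide stronger detection cues. The channels (1024) are doubled in the output $N_{40}$.
The upsampled $N_{40}$ features are then concatenated with the 80×80 backbone feature $F_{80}$ to retain both fine and coarse-grained details:

$$N_{80} = \mathrm{Concat}\big(\mathrm{Up}(N_{40}),\, F_{80}\big)$$

This multi-scale fusion is essential to the detection of small objects. The output $N_{80}$ has 512 channels.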
3.1.4 Head for Multi-Scale Detection
The Head of YOLOv11n performs multi-scale detection, applying dedicated detection layers to feature maps at three resolutions: 80×80, 40×40, and 20×20. This design allows small, medium, and large objects to be identified in the same image. It works particularly well with DAWN and VisDrone aerial data, where object sizes can vary widely between scenes.
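To make the three resolutions concrete, the sketch below maps a pixel location to its grid cell at each scale. It assumes a 640×640 input and strides of 8, 16, and 32, which are the values implied by the 80×80, 40×40, and 20×20 grids; the helper names and the example point are hypothetical.

```python
IMG_SIZE = 640
STRIDES = (8, 16, 32)  # assumed strides for the small, medium, and large heads


def grid_size(stride, img_size=IMG_SIZE):
    """Number of cells per side of the feature map at a given stride."""
    return img_size // stride


def cell_of(cx, cy, stride):
    """Grid cell (column, row) that contains the pixel (cx, cy)."""
    return int(cx // stride), int(cy // stride)


grids = {s: grid_size(s) for s in STRIDES}
total_cells = sum(g * g for g in grids.values())

print(grids)        # {8: 80, 16: 40, 32: 20}
print(total_cells)  # 8400 candidate locations over the three heads
for s in STRIDES:
    print(s, cell_of(123, 456, s))
```

Each cell of the 80×80 grid covers an 8×8 pixel patch, which is why that head is the one suited to the small objects that dominate aerial scenes.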
3.1.5 Weight Optimization
Each predicted bounding box has a final confidence $S_{conf}$ that combines the objectness, i.e. the probability of object presence $P_{obj}$, and the class probability $P_{cls}$:

$$S_{conf} = P_{obj} \times P_{cls}$$

This ensures accurate predictions by filtering out low-confidence detections.
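A minimal sketch of this scoring and filtering step is given below. The confidence threshold of 0.25, the function names, and the example predictions are assumptions made for illustration only.

```python
def final_confidence(p_obj, class_probs):
    """Return (best class index, objectness * class probability)."""
    cls_id = max(range(len(class_probs)), key=class_probs.__getitem__)
    return cls_id, p_obj * class_probs[cls_id]


def keep_confident(predictions, threshold=0.25):
    """Drop predictions whose final confidence is below the threshold."""
    kept = []
    for p_obj, class_probs in predictions:
        cls_id, score = final_confidence(p_obj, class_probs)
        if score >= threshold:
            kept.append((cls_id, score))
    return kept


# Hypothetical predictions: (objectness, per-class probabilities).
preds = [
    (0.9, [0.1, 0.8, 0.1]),  # 0.9 * 0.8 = 0.72, kept
    (0.3, [0.5, 0.3, 0.2]),  # 0.3 * 0.5 = 0.15, filtered out
]
kept = keep_confident(preds)
print(kept)
```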
3.2 BCO for YOLOv11n Weight Optimization
The BCO is a metaheuristic algorithm inspired by the stealth hunting behaviour of bobcats. In YOLOv11n, BCO optimizes network weights and hyperparameters, improving convergence, reducing loss, and enhancing object detection performance, especially for VisDrone aerial data characterized by small, occluded, or densely packed objects. Figure 9 shows the proposed BCO flowchart. Here, a population of candidate solutions (network weights) is initialized randomly within defined bounds:

$$X_i^{0} = LB + r \odot (UB - LB)$$

where $X_i^{0}$ is the initial position (weights) of the i-th bobcat, $r$ is a random vector, and $LB$, $UB$ are the lower and upper bounds for the weights. Each solution's fitness is evaluated using the YOLOv11n loss function:

$$L(X_i) = \lambda_{box} L_{box} + \lambda_{obj} L_{obj} + \lambda_{cls} L_{cls}$$

Here, $L_{box}$, $L_{obj}$ and $L_{cls}$ are the localization, objectness, and classification losses, respectively, weighted by the hyperparameters $\lambda_{box}$, $\lambda_{obj}$, $\lambda_{cls}$. The best-performing bobcat (weight set) is identified based on minimum loss:

$$X_{best} = \arg\min_{X_i} L(X_i)$$

where $X_{best}$ holds the weights leading to the lowest detection loss. Bobcats update their positions by stealthily approaching the best solution:

$$X_i^{t+1} = X_i^{t} + \beta_t \,(X_{best} - X_i^{t}) + \delta$$

Here, $\beta_t$ controls the step size, and $\delta$ is a small random disturbance simulating adaptive movement. To avoid local minima, some agents explore new regions:

$$X_i^{t+1} = X_i^{t} + \alpha \,(X_j^{t} - X_k^{t})$$

Here, $X_j^{t}$ and $X_k^{t}$ are randomly selected bobcats, and $\alpha$ is a scaling factor promoting diversity. The step size $\beta_t$ is reduced over iterations to balance exploration and exploitation:

$$\beta_t = \beta_0 \, e^{-\gamma t}$$

where $\beta_0$ is the initial step size, $\gamma$ is the decay rate, and $t$ is the iteration index. To ensure valid weights, positions are clipped within the allowable bounds:

$$X_i^{t+1} = \min\big(\max(X_i^{t+1}, LB),\, UB\big)$$

This keeps the weights within the specified limits, which stabilizes the search. The optimization process is stopped when the loss change is less than a threshold $\epsilon$:

$$\big| L(X_{best}^{t+1}) - L(X_{best}^{t}) \big| < \epsilon$$

This ensures early termination when no material improvement occurs. The final optimized weights after $T$ iterations are:

$$W^{*} = X_{best}^{T}$$

These weights reduce detection loss and enhance performance on the VisDrone aerial data and the DAWN data. The confidence is then calculated using the optimized weights:

$$S_{conf} = P_{obj}(W^{*}) \times P_{cls}(W^{*})$$

This results in accurate and robust detection under complicated UAV settings. Overall, BCO tunes YOLOv11n through the adaptive hunting behaviour of bobcats, which improves object detection, minimizes loss, and provides strong performance in a variety of VisDrone settings, such as occlusion, small objects, and overcrowding.
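The search loop described in this section can be sketched in plain Python on a toy problem. This is an illustrative sketch under several assumptions that the text does not specify: a two-dimensional quadratic `toy_loss` stands in for the YOLOv11n loss, the step size decays exponentially, an agent explores with a fixed probability `explore_prob`, a move is accepted only if it lowers that agent's loss, and the search stops after `patience` iterations without improvement. All parameter names and default values are hypothetical.

```python
import math
import random


def bco_minimize(loss, dim, lower, upper, pop_size=20, iters=200,
                 beta0=0.8, gamma=0.02, alpha=0.5, explore_prob=0.2,
                 noise=0.01, eps=1e-12, patience=10, seed=0):
    """Population search following the steps described in the text."""
    rng = random.Random(seed)

    def clip(v):
        # Keep every weight inside [lower, upper].
        return min(max(v, lower), upper)

    # Random initialization within the bounds.
    pop = [[lower + rng.random() * (upper - lower) for _ in range(dim)]
           for _ in range(pop_size)]
    fit = [loss(x) for x in pop]
    b = min(range(pop_size), key=fit.__getitem__)
    best, best_fit = pop[b][:], fit[b]

    stalled = 0
    for t in range(iters):
        beta = beta0 * math.exp(-gamma * t)  # assumed exponential decay
        prev = best_fit
        for i in range(pop_size):
            if rng.random() < explore_prob:
                # Exploration: move along the difference of two random agents.
                j, k = rng.sample(range(pop_size), 2)
                cand = [x + alpha * (pj - pk)
                        for x, pj, pk in zip(pop[i], pop[j], pop[k])]
            else:
                # Exploitation: approach the best agent with a small disturbance.
                cand = [x + beta * (g - x) + rng.gauss(0.0, noise)
                        for x, g in zip(pop[i], best)]
            cand = [clip(v) for v in cand]
            f = loss(cand)
            if f < fit[i]:  # greedy acceptance (assumption)
                pop[i], fit[i] = cand, f
                if f < best_fit:
                    best, best_fit = cand[:], f
        # Stop once the best loss has not improved by eps for several iterations.
        stalled = stalled + 1 if prev - best_fit < eps else 0
        if stalled >= patience:
            break
    return best, best_fit


def toy_loss(w):
    """Toy stand-in for the detection loss, with its minimum at (0.3, -0.2)."""
    target = (0.3, -0.2)
    return sum((a - b) ** 2 for a, b in zip(w, target))


weights, value = bco_minimize(toy_loss, dim=2, lower=-1.0, upper=1.0)
print(weights, value)
```

In a real setting each loss evaluation would require running the detector on validation data, so the population size and iteration budget would dominate the cost; the sketch only shows the control flow.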
3.3 ALGE Preprocessing
Figure 10 shows the flowchart of the ALGE method, which aims to enhance the clarity, sharpness, and brightness of drone images acquired in low light or degraded by haze or weather. It starts by converting the image to the LAB color space, where the L channel carries the luminance. Contrast Limited Adaptive Histogram Equalization (CLAHE) is applied to the luminance component, redistributing brightness locally to increase contrast without amplifying noise. Gamma correction, a nonlinear power-law transformation of pixel intensity, then brightens the darker areas while preserving the natural appearance of bright areas. This combination of local contrast enhancement and global gamma adjustment ensures that local details, edges, and object boundaries are preserved, yielding clearer and more informative images for UAV-based object detection.
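The gamma stage of ALGE can be sketched as a 256-entry lookup table applied to an 8-bit luminance channel. This sketch covers only the power-law step; the CLAHE stage is omitted (in practice it is commonly done with a library such as OpenCV). The convention `out = 255 * (in / 255) ** gamma`, under which `gamma < 1` brightens, and the value `gamma = 0.5` are assumptions, since the text does not state the exponent.

```python
def gamma_lut(gamma):
    """Lookup table for out = 255 * (in / 255) ** gamma on 8-bit values."""
    return [round(255 * (v / 255) ** gamma) for v in range(256)]


def apply_gamma(luminance, gamma=0.5):
    """Apply gamma correction to a 2D list of 8-bit luminance values."""
    lut = gamma_lut(gamma)
    return [[lut[v] for v in row] for row in luminance]


# Hypothetical 2x3 luminance patch with dark and bright pixels.
dark = [[0, 16, 64], [128, 200, 255]]
bright = apply_gamma(dark, gamma=0.5)
print(bright)
```

With an exponent below 1, dark values are lifted much more than bright ones, while 0 and 255 stay fixed, which matches the described behaviour of brightening dark regions without overexposing bright ones.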
4. Results and Discussion
This section presents a comparative analysis of object detection performance and evaluation metrics on the VisDrone dataset. It reports the variations in performance in order to determine the most suitable approach for UAV detection tasks. In addition, case studies were carried out on the DAWN dataset.
4.1. VisDrone Dataset
UAVs are now used in many areas, including agriculture, aerial photography, surveillance, and rapid delivery, which makes automatic interpretation of drone-captured imagery increasingly important and brings computer vision and drone technology closer together. The VisDrone2019 dataset was introduced as a comprehensive large-scale benchmark specifically designed to support key computer vision tasks in aerial imagery. Developed by the AISKYEYE team at the Lab of Machine Learning and Data Mining, Tianjin University, China, VisDrone2019 comprises 288 video sequences totalling 261,908 frames, along with 10,209 high-resolution static images, all captured from various drone-mounted cameras. The data were collected in 14 cities in China and cover both urban and rural scenes, a variety of object types, including people, bikes, vehicles, and tricycles, and both sparse and dense environments. Different drone platforms were used under varying weather and lighting conditions, which broadens the dataset's applicability. It contains more than 2.6 million manually annotated bounding boxes, together with metadata on object class, occlusion level, and scene visibility, making VisDrone2019 one of the largest and most versatile datasets for advancing drone-based computer vision research.
4.2 Simulation Outcomes on VisDrone Dataset
Table 1 reports the object detection performance of the proposed UAVOD-Net, which integrates YOLOv11n with BCO, on the VisDrone dataset. The results show improved detection of small objects alongside large ones, as well as better precision and overall stability. The model was evaluated on 38,759 instances in 548 images across 10 object categories. Averaged over all classes, it achieved a precision of 0.594, a recall of 0.485, an mAP@50 of 0.516, and an mAP@50–95 of 0.326, indicating balanced performance across Intersection-over-Union (IoU) thresholds. For the important car class, the model achieved a precision of 0.811, a recall of 0.833, and an mAP@50 of 0.870, which supports its applicability in urban traffic settings. Pedestrians were detected with an mAP@50 of 0.595, demonstrating that the system can handle the small and occluded objects common in UAV images. Classes such as bicycle, tricycle, and awning-tricycle, which are usually considered difficult because of their small size and low contrast, scored lower, but these values are still a gain over conventional methods given the complexity of aerial scenes. Larger objects such as buses showed consistent results, with an mAP@50 of 0.693 and an mAP@50–95 of 0.530, confirming that the model adapts to different object sizes. The findings indicate that BCO tuned the model weights in a way that improved convergence and detection accuracy, particularly in demanding UAV surveillance contexts characterized by heavy traffic, scale variation, and occlusions.
Table 1
Performance Evaluation of Proposed UAVOD-Net.
| Class | Precision (P) | Recall (R) | mAP@50 | mAP@50–95 |
|---|---|---|---|---|
| All Classes | 0.594 | 0.485 | 0.516 | 0.326 |
| Pedestrian | 0.650 | 0.541 | 0.595 | 0.302 |
| People | 0.657 | 0.402 | 0.468 | 0.201 |
| Bicycle | 0.357 | 0.294 | 0.261 | 0.130 |
| Car | 0.811 | 0.833 | 0.870 | 0.648 |
| Van | 0.582 | 0.545 | 0.561 | 0.414 |
| Truck | 0.564 | 0.467 | 0.499 | 0.359 |
| Tricycle | 0.533 | 0.368 | 0.389 | 0.231 |
| Awning-Tricycle | 0.361 | 0.261 | 0.248 | 0.165 |
| Bus | 0.789 | 0.609 | 0.693 | 0.530 |
| Motorcycle | 0.640 | 0.530 | 0.579 | 0.281 |
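The "All Classes" row of Table 1 appears to be the unweighted (macro) average of the ten per-class rows. The short check below, with the per-class values copied from the table, illustrates this; the dictionary layout and the function name are chosen here for illustration.

```python
# Per-class (precision, recall, mAP@50, mAP@50-95) values from Table 1.
PER_CLASS = {
    "Pedestrian":      (0.650, 0.541, 0.595, 0.302),
    "People":          (0.657, 0.402, 0.468, 0.201),
    "Bicycle":         (0.357, 0.294, 0.261, 0.130),
    "Car":             (0.811, 0.833, 0.870, 0.648),
    "Van":             (0.582, 0.545, 0.561, 0.414),
    "Truck":           (0.564, 0.467, 0.499, 0.359),
    "Tricycle":        (0.533, 0.368, 0.389, 0.231),
    "Awning-Tricycle": (0.361, 0.261, 0.248, 0.165),
    "Bus":             (0.789, 0.609, 0.693, 0.530),
    "Motorcycle":      (0.640, 0.530, 0.579, 0.281),
}


def macro_average(rows):
    """Unweighted mean of each metric column, rounded to three decimals."""
    n = len(rows)
    return tuple(round(sum(col) / n, 3) for col in zip(*rows.values()))


overall = macro_average(PER_CLASS)
print(overall)  # compare with the "All Classes" row: 0.594, 0.485, 0.516, 0.326
```

Because the average is unweighted, rare and difficult classes such as bicycle and awning-tricycle pull the overall mAP down as much as frequent classes such as car pull it up.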
Fig. 11 shows the training and validation performance of the proposed YOLOv11n model combined with BCO on the VisDrone dataset over 50 epochs. The first row shows the training losses: the box loss, the classification (cls) loss, and the Distribution Focal Loss (dfl), all of which decrease steadily across the epochs. In particular, train/box_loss decreases from about 2.2 to about 1.18, indicating improved bounding box regression. Similarly, train/cls_loss drops from around 2.5 to approximately 1.1, reflecting enhanced classification capability, while train/dfl_loss decreases from 1.5 to about 1.0, suggesting better localization confidence calibration. The second row shows the validation losses and metrics. The val/box_loss decreases steadily from about 1.6 to about 1.18, and val/cls_loss from about 1.5 to about 0.95, which suggests that the model does not overfit substantially. The val/dfl_loss follows a similar trend and stabilizes at about 0.92, indicating more confident object boundaries on the validation data. Regarding detection performance, metrics/precision(B) rises to about 0.52, showing that positive detections become more accurate. Likewise, metrics/recall(B) rises from 0.2 to almost 0.49, indicating that more objects are found, particularly in difficult cases such as occlusions or small objects. Moreover, mAP50 increases steadily from 0.15 to almost 0.52 and mAP50-95 from 0.1 to approximately 0.33 by the final epochs, which reflects the improved multi-scale detection capability of the system.
These findings support the lightweight YOLOv11n architecture complemented with BCO weight optimization as an effective way to increase detection accuracy, localization accuracy, and small-object identification across different levels of aerial-scene complexity.
Figure 12 shows the F1-Confidence curve of each object class detected by the proposed UAVOD-Net on the VisDrone dataset. The F1-Score, the harmonic mean of precision and recall, is plotted against confidence thresholds ranging from 0 to 1. The thick blue line is the overall performance of the model across all classes; it peaks at an F1-Score of around 0.53 at a confidence threshold of 0.232, which can be taken as the best balance between precision and recall for overall detection. Among individual classes, cars (red curve) are detected best, with an F1-Score as high as 0.85 toward the lower confidence thresholds, showing that the model detects larger objects such as cars with high confidence. The bus category (yellow-green curve) follows, achieving an F1-Score around 0.7, while motorcycles and pedestrians attain peak F1-Scores of approximately 0.55 and 0.6, respectively, reflecting moderate detection success for smaller, moving objects.
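As a quick consistency check, the overall F1-Score can be computed from the all-class precision and recall in Table 1. The helper name below is chosen for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    total = precision + recall
    return 2 * precision * recall / total if total > 0 else 0.0


# All-class precision and recall from Table 1.
overall_f1 = f1_score(0.594, 0.485)
print(round(overall_f1, 3))  # 0.534
```

The result of roughly 0.53 agrees with the peak of the all-class curve in Figure 12.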
Figure 13 shows the Precision-Recall (P-R) curve of each object class recognized by the proposed YOLOv11n model combined with BCO on the VisDrone dataset. Precision, the ratio of correct detections to all positive predictions, is plotted against recall, the ratio of correctly detected objects to all ground-truth objects. The blue curve is the average over all object classes; its mAP@0.5 (mean Average Precision at an IoU threshold of 0.5) of 0.516 indicates consistent detection across classes. Among the individual categories, 'car' (red curve) performs best with an mAP of 0.870, indicating that the model identifies vehicles with few false positives and false negatives. Similarly, the bus class (yellow-green curve) has a high mAP of 0.693, reflecting effective recognition of larger objects. The pedestrian and motorbike classes are also competitive, with mAP of 0.595 and 0.579, respectively, indicating moderate accuracy on small moving objects. In contrast, harder categories such as bicycle (mAP 0.261), awning-tricycle (mAP 0.248), and tricycle (mAP 0.389) have lower precision-recall curves, which shows how difficult it is to detect small or partially occluded objects reliably in aerial views. The sharp drop in precision at higher recall for these classes indicates that the model produces more false positives as it tries to recover every instance.
Figure 14 shows the Precision-Confidence curve of the proposed UAVOD-Net on the VisDrone dataset, which examines detection precision across confidence thresholds for each object class. The x-axis denotes the model’s confidence score, ranging from 0 to 1, while the y-axis shows the corresponding precision, that is, the share of positive detections that are correct at each confidence level. The thick blue curve depicts the average precision across all classes; it reaches a precision of 1.00 at the maximum confidence threshold of 1.00, indicating that the detections retained at the highest confidence contain no false positives. The car class (red curve) performs best along the whole curve, with over 90% precision at all confidence levels and near-perfect precision at confidence values around 0.8, which indicates high reliability in recognizing vehicles. Other classes such as bus and van are also strong, exceeding 85 percent precision as confidence rises, which implies solid detection ability even in difficult aerial scenes. The pedestrian and people classes have moderate precision that increases consistently with the confidence score, reflecting reliable detection of human-related objects. Conversely, classes such as bicycle, awning-tricycle, and tricycle are less precise at most confidence levels, indicating that detecting small, complex, and occluded targets in UAV imagery remains a challenge. Fluctuations in the motor and awning-tricycle curves at higher confidence levels indicate unstable detection of objects with weak or indistinct visual cues.
The Recall-Confidence curve of the proposed UAVOD-Net on the VisDrone dataset is shown in Fig. 15. The curve shows recall, the model's capacity to find all ground-truth objects, at confidence thresholds between 0 and 1. The blue curve is the average recall over all object categories; it peaks at a recall of 0.72 at a confidence threshold of 0.0. Recall declines as the confidence threshold increases, which is the usual trade-off: stricter confidence filtering reduces false positives but can also discard some true positives. The car category (red curve) has the highest recall, exceeding 90 percent at low confidence and remaining above 80 percent at moderate confidence thresholds, which confirms strong detection of vehicles. The 'bus' and van classes also have relatively high recall at lower confidence, followed by a steady decline as confidence increases, as expected for relatively large objects. Smaller objects such as bicycle, tricycle, and awning-tricycle, on the other hand, have lower recall across the entire confidence range, which shows that small, low-visibility objects are hard to identify consistently in aerial images. The pedestrian and people classes achieve moderately good recall, with a maximum of around 70 percent at low confidence that decreases as the detection threshold is tightened.
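The trade-off between the Precision-Confidence and Recall-Confidence curves can be illustrated with a short threshold sweep. The sketch below is an assumption-laden toy example in Python: the detection list, the ground-truth count, and the function name are hypothetical, each detection is assumed to be already matched to ground truth as a true or false positive, and precision is defined as 1.0 when no detection survives the threshold.

```python
def precision_recall_at(detections, num_ground_truth, threshold):
    """Precision and recall when only detections scoring >= threshold are kept.

    detections: list of (confidence, is_true_positive) pairs.
    """
    kept = [is_tp for score, is_tp in detections if score >= threshold]
    true_positives = sum(kept)
    # Convention assumed here: no detections kept means no false positives.
    precision = true_positives / len(kept) if kept else 1.0
    recall = true_positives / num_ground_truth
    return precision, recall


# Hypothetical detections for one class with 5 ground-truth objects.
detections = [
    (0.95, True), (0.90, True), (0.80, False), (0.60, True),
    (0.40, False), (0.30, True), (0.10, False),
]

for t in (0.0, 0.5, 0.85):
    p, r = precision_recall_at(detections, num_ground_truth=5, threshold=t)
    print(f"threshold={t:.2f}  precision={p:.3f}  recall={r:.3f}")
```

As the threshold is raised, fewer detections are kept, so precision tends to rise while recall can only stay level or fall, which matches the opposite trends of the two curves.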
Figure 16 shows the confusion matrix of the proposed UAVOD-Net on the VisDrone dataset. The car class shows good detection accuracy, with 11,446 correct detections, 1,789 instances misclassified as background, and 655 confused with van. For the pedestrian class, 4,686 instances are detected correctly, whereas 1,731 false negatives indicate that small objects are difficult to detect. Likewise, the people class has 1,893 instances, of which 635 were incorrectly predicted as background. For smaller or more visually ambiguous classes such as 'bicycle', 341 instances were recognized correctly and 388 were misidentified as background, which confirms that small and low-contrast objects are difficult to detect. The motor class has 2,389 correct predictions and 960 instances misclassified as background. Other problematic categories are tricycle and awning-tricycle, which have few correct detections (334 and 137, respectively) and high levels of background misclassification. The dominance of the diagonal in the matrix shows that the model classifies the main object categories competently, especially large and salient objects such as cars.
The object detection and classification results of the proposed UAVOD-Net on various UAV video frames captured in diverse urban settings are shown in Fig. 17. The red bounding boxes mark detected objects such as pedestrians, cars, buses, trucks, tricycles, awning-tricycles, vans, and motorcycles, together with their confidence scores. Despite low light and complex backgrounds, the model correctly localizes numerous overlapping and small objects, which indicates its robustness in real-world aerial surveillance. These visual results confirm that UAVOD-Net handles various object types under demanding UAV imaging conditions.
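The counts quoted from the confusion matrix can be converted into a rough per-class hit rate. The Python sketch below divides the correct detections by the sum of correct detections and background misclassifications. It assumes that the background counts are missed objects and it ignores confusion with other classes, so each value should be read only as an upper bound on recall; the dictionary and function names are illustrative and do not come from the UAVOD-Net code.

```python
# (correct detections, instances misclassified as background), as quoted in the text.
counts = {
    "car": (11446, 1789),
    "pedestrian": (4686, 1731),
    "motor": (2389, 960),
    "bicycle": (341, 388),
}


def hit_rate(correct, as_background):
    """Share of correct detections among correct + background-misclassified instances."""
    return correct / (correct + as_background)


rates = {name: hit_rate(*pair) for name, pair in counts.items()}

# Print classes from easiest to hardest under this simplified measure.
for name, rate in sorted(rates.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<12}{rate:.3f}")
```

Under this simplified measure the ordering is car, pedestrian, motor, bicycle, which agrees with the observation that large, salient objects are detected more reliably than small ones.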
4.3 Comparative Evaluation on VisDrone Dataset
Table 2 compares the mean Average Precision at 50% IoU (mAP@0.5), recall, and precision of five object detection models: SOD-YOLO [37], MSUD-YOLO [21], CF-YOLO [35], LRDS-YOLO [36], and the proposed UAVOD-Net. In nearly all categories, UAVOD-Net outperforms the other four models, which demonstrates its ability to detect various classes of objects in aerial imagery. For example, UAVOD-Net has the highest overall mAP of 0.516, compared with 0.434 for MSUD-YOLO [21], 0.428 for SOD-YOLO [37], 0.449 for CF-YOLO [35], and 0.436 for LRDS-YOLO [36]. The improvement is particularly large for Pedestrian (0.595 vs. 0.441, 0.480, 0.341, and 0.499), People (0.468 vs. 0.405, 0.398, 0.214, and 0.419), and Bicycle (0.261 vs. 0.129, 0.175, 0.109, and 0.168), which means that small objects are detected more effectively. For the larger Car and Bus classes, UAVOD-Net records competitive scores of 0.870 and 0.693, close to SOD-YOLO [37] (0.87 and 0.704) and higher than MSUD-YOLO [21] (0.84 and 0.60). In the other transport classes, Van (0.561), Truck (0.499), and Tricycle (0.389), UAVOD-Net also improves on all four baselines. The proposed model remains more accurate in the more difficult categories, Awning-Tricycle and Motorcycle, with scores of 0.248 and 0.579, which are higher than all baseline scores. Overall, UAVOD-Net shows better detection accuracy and robustness across nearly all categories, which indicates the benefit of its architectural improvements over CF-YOLO [35], SOD-YOLO [37], MSUD-YOLO [21], and LRDS-YOLO [36].
Table 2
mAP@50, Recall, Precision Comparison of Various Object Detection Methods.
Class | SOD-YOLO [37] | MSUD-YOLO [21] | CF-YOLO [35] | LRDS-YOLO [36] | Proposed UAVOD-Net |
|---|---|---|---|---|---|
All Classes | 0.428 | 0.434 | 0.449 | 0.436 | 0.516 |
Pedestrian | 0.441 | 0.480 | 0.341 | 0.499 | 0.595 |
People | 0.405 | 0.398 | 0.214 | 0.419 | 0.468 |
Bicycle | 0.129 | 0.175 | 0.109 | 0.168 | 0.261 |
Car | 0.87 | 0.84 | 0.763 | 0.837 | 0.870 |
Van | 0.396 | 0.47 | 0.41 | 0.488 | 0.561 |
Truck | 0.396 | 0.374 | - | 0.372 | 0.499 |
Tricycle | 0.266 | 0.309 | 0.20 | 0.295 | 0.389 |
Awning-Tricycle | 0.154 | 0.158 | 0.206 | 0.199 | 0.248 |
Bus | 0.704 | 0.60 | - | 0.575 | 0.693 |
Motorcycle | 0.516 | 0.527 | 0.344 | 0.505 | 0.579 |
Recall (%) | - | - | 43.4 | 41.6 | 48.5 |
Precision (%) | - | - | 52.8 | 53.3 | 59.4 |
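The All Classes row of Table 2 is the mean of the ten per-class mAP@0.5 values, which can be checked with a few lines of Python. The sketch below uses only the UAVOD-Net and SOD-YOLO columns of Table 2; the variable names are illustrative.

```python
# Per-class mAP@0.5 from Table 2, in row order (Pedestrian ... Motorcycle).
uavod_net = [0.595, 0.468, 0.261, 0.870, 0.561, 0.499, 0.389, 0.248, 0.693, 0.579]
sod_yolo = [0.441, 0.405, 0.129, 0.87, 0.396, 0.396, 0.266, 0.154, 0.704, 0.516]


def mean_ap(per_class_ap):
    """mAP as the unweighted mean of per-class average precision."""
    return sum(per_class_ap) / len(per_class_ap)


uavod_map = mean_ap(uavod_net)
sod_map = mean_ap(sod_yolo)
print(f"UAVOD-Net mAP@0.5: {uavod_map:.3f}")
print(f"SOD-YOLO  mAP@0.5: {sod_map:.3f}")
print(f"absolute gain: {uavod_map - sod_map:.3f}")
```

Both means round to the values reported in the All Classes row (0.516 and 0.428), so the overall gain of UAVOD-Net over SOD-YOLO is about 0.09 mAP@0.5.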
Table 3 compares the computational complexity and detection accuracy of several state-of-the-art object detection models with the proposed UAVOD-Net. UAVOD-Net has one of the lowest computational costs, at just 5.7 GFLOPs and 2.5 million parameters, yet its mAP@50 of 51.6% is the highest among the compared approaches. For example, YOLOv10n, YOLOv9t, and YOLOv8n achieve 31.8%, 33.8%, and 33.4% mAP, respectively; they require more FLOPs (6.7G to 8.1G) and deliver significantly lower accuracy. Even larger models such as YOLOv7-Tiny (13.9G, 6.2M) and YOLOv5s (15.9G, 7.2M) reach only 37.6% and 32.7% mAP@50, respectively. The transformer-based RT-DETR_1 [13] attains a competitive 48.9% mAP but is computationally expensive at 103.5 GFLOPs and 32M parameters, which makes it poorly suited to real-time UAV use. Similarly, SOD-YOLO [37], MSUD-YOLO [21], CF-YOLO [35], and LRDS-YOLO [36] achieve 39.2%, 43.4%, 44.9%, and 43.6% mAP, respectively, with reported complexities of 12.9G (SOD-YOLO), 3.77G (CF-YOLO), and 24.1G (LRDS-YOLO), and none of them matches the efficiency-performance balance of UAVOD-Net. These findings support the claim that the proposed UAVOD-Net achieves a favorable trade-off between precision and computational efficiency and provides lightweight, high-precision detection that is suitable for real-time aerial and embedded vision systems.
Table 3
Computational Complexity Comparison of Proposed Method with Existing Approaches.
Model | Input size | FLOPs (G) | Parameters (M) | mAP@50 (%) |
|---|---|---|---|---|
YOLOv10n [13] | 640x640 | 6.7 | 2.7 | 31.8 |
YOLOv9t [13] | 640x640 | 7.7 | 2.2 | 33.8 |
YOLOv8n [13] | 640x640 | 8.1 | 3.0 | 33.4 |
YOLOv7-Tiny [13] | 640x640 | 13.9 | 6.2 | 37.6 |
YOLOv5n [13] | 640x640 | 7.1 | 2.5 | 31.9 |
YOLOv5s [13] | 640x640 | 15.9 | 7.2 | 32.7 |
RT-DETR_1 [13] | 640x640 | 103.5 | 32.0 | 48.9 |
SOD-YOLO [37] | 640x640 | 12.9 | 6.9 | 50.7 |
MSUD-YOLO [21] | 640x640 | - | 6.766 | 43.4 |
CF-YOLO [35] | 640x640 | 3.77 | 23.9 | 44.9 |
LRDS-YOLO [36] | 640x640 | 24.1 | 4.17 | 43.6 |
Proposed UAVOD-Net | 640x640 | 5.7 | 2.5 | 51.6 |
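One way to read Table 3 is to ask which models are not dominated, that is, for which no other model has FLOPs and parameters that are no higher and an mAP@50 that is no lower, with at least one strict improvement. The Python sketch below applies this Pareto check to the rows of Table 3 exactly as listed; MSUD-YOLO is left out because its FLOPs are not reported, and the function and variable names are illustrative rather than part of the proposed method.

```python
# (model, GFLOPs, parameters in millions, mAP@50 in %), copied from Table 3.
rows = [
    ("YOLOv10n", 6.7, 2.7, 31.8),
    ("YOLOv9t", 7.7, 2.2, 33.8),
    ("YOLOv8n", 8.1, 3.0, 33.4),
    ("YOLOv7-Tiny", 13.9, 6.2, 37.6),
    ("YOLOv5n", 7.1, 2.5, 31.9),
    ("YOLOv5s", 15.9, 7.2, 32.7),
    ("RT-DETR_1", 103.5, 32.0, 48.9),
    ("SOD-YOLO", 12.9, 6.9, 50.7),
    ("CF-YOLO", 3.77, 23.9, 44.9),
    ("LRDS-YOLO", 24.1, 4.17, 43.6),
    ("UAVOD-Net", 5.7, 2.5, 51.6),
]


def dominates(a, b):
    """True if model a is at least as good as b on every axis and better on one."""
    _, a_flops, a_params, a_map = a
    _, b_flops, b_params, b_map = b
    no_worse = a_flops <= b_flops and a_params <= b_params and a_map >= b_map
    better = a_flops < b_flops or a_params < b_params or a_map > b_map
    return no_worse and better


pareto = sorted(
    row[0] for row in rows
    if not any(dominates(other, row) for other in rows if other is not row)
)
print(pareto)
```

With the values as listed, UAVOD-Net is on the Pareto front together with the two models that have the fewest FLOPs or the fewest parameters, and it dominates every other row, including the models with higher computational cost.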
Table 4
Performance Comparison of the Proposed Method with YOLOv11n.
Class | Precision (YOLOv11n) | Precision (UAVOD-Net) | Recall (YOLOv11n) | Recall (UAVOD-Net) | mAP@0.5 (YOLOv11n) | mAP@0.5 (UAVOD-Net) | mAP@0.5–0.95 (YOLOv11n) | mAP@0.5–0.95 (UAVOD-Net) |
|---|---|---|---|---|---|---|---|---|
All Classes | 0.56 | 0.59 | 0.44 | 0.48 | 0.48 | 0.516 | 0.32 | 0.33 |
Pedestrian | 0.61 | 0.65 | 0.50 | 0.54 | 0.55 | 0.595 | 0.28 | 0.30 |
People | 0.62 | 0.66 | 0.38 | 0.40 | 0.43 | 0.468 | 0.18 | 0.20 |
Bicycle | 0.33 | 0.36 | 0.26 | 0.29 | 0.23 | 0.261 | 0.12 | 0.13 |
Car | 0.79 | 0.81 | 0.81 | 0.83 | 0.85 | 0.870 | 0.34 | 0.65 |
Van | 0.55 | 0.58 | 0.51 | 0.55 | 0.53 | 0.561 | 0.39 | 0.41 |
Truck | 0.54 | 0.56 | 0.44 | 0.47 | 0.47 | 0.499 | 0.31 | 0.36 |
Tricycle | 0.50 | 0.53 | 0.34 | 0.37 | 0.36 | 0.389 | 0.21 | 0.23 |
Awning-Tricycle | 0.34 | 0.36 | 0.23 | 0.26 | 0.22 | 0.248 | 0.13 | 0.17 |
Bus | 0.76 | 0.79 | 0.59 | 0.61 | 0.66 | 0.693 | 0.52 | 0.53 |
Motorcycle | 0.61 | 0.64 | 0.50 | 0.53 | 0.54 | 0.579 | 0.21 | 0.28 |
Table 4 compares YOLOv11n and the proposed UAVOD-Net per object category in terms of Precision, Recall, mAP@0.5, and mAP@0.5–0.95. UAVOD-Net improves on YOLOv11n across all metrics, which indicates better feature extraction and detection. Over all classes, UAVOD-Net reaches a precision of 0.59 and a recall of 0.48, compared with 0.56 and 0.44 for YOLOv11n, together with an mAP@0.5 of 0.516 and an mAP@0.5–0.95 of 0.33, compared with 0.48 and 0.32, indicating improved detection consistency. Precision gains of 0.04 in both the Pedestrian and People categories, together with recall gains (0.54 vs. 0.50 and 0.40 vs. 0.38), indicate that UAVOD-Net detects humans more reliably. Likewise, for Bicycle and Car, UAVOD-Net raises mAP@0.5 from 0.23 to 0.261 and from 0.85 to 0.870, respectively, which implies better detection of both small and large objects. For Van, Truck, and Bus, UAVOD-Net achieves higher mAP@0.5–0.95 scores (0.41, 0.36, and 0.53) than YOLOv11n (0.39, 0.31, and 0.52), along with higher precision. For smaller and more complicated targets such as Tricycle and Awning-Tricycle, UAVOD-Net yields precision and mAP@0.5 gains (0.53 vs. 0.50 and 0.389 vs. 0.36 for Tricycle; 0.36 vs. 0.34 and 0.248 vs. 0.22 for Awning-Tricycle), which demonstrates its strength in detecting irregular shapes. Motorcycle detection also improves: mAP@0.5 increases from 0.54 to 0.579, and mAP@0.5–0.95 rises by 7 percentage points. These steady gains across all object types show that UAVOD-Net is superior to YOLOv11n in both accuracy and generalization, detecting objects more accurately while remaining robust to variation in object size and complexity.
4.4 Case Study on DAWN Dataset
The DAWN dataset is a real-world image dataset designed specifically to test and optimize object detection performance in adverse weather conditions. Sample images from the DAWN dataset are displayed in Fig. 18. It consists of 1,000 images taken in various traffic environments (urban roads, highways, and freeways), which gives a broad representation of real traffic situations. The data is divided into four types of adverse weather, namely fog, snow, rain, and sandstorms, each with distinct visibility and environmental challenges. All images are annotated with object bounding boxes, which makes the dataset useful for autonomous driving and smart video surveillance. The DAWN dataset is an essential reference point for researchers who develop robust vision-based models that must operate effectively in non-ideal conditions, as it offers insight into the impact of weather-related distortions on the accuracy of vehicle and object detection.
The sandstorm image in Fig. 19 is covered by a heavy layer of dust and has low visibility; ALGE addresses this effectively and generates a clearer, sharper image in which objects can be identified easily. A similar improvement is depicted in Fig. 20, where ALGE restores contrast and depth perception in foggy conditions and allows the detection model to recognize vehicles and road features that had previously been obscured. Figure 21 shows a rainy image that is blurred by motion and distorted by raindrops; with the help of ALGE, object edges become much more distinct and detection accuracy increases. Figure 22, also taken in rainy conditions, supports the consistency of ALGE across different rainfall intensities and demonstrates the strength of the algorithm in preserving both the natural color tone and the fine detail of objects.
Table 5 gives a thorough comparison of object detection methods under adverse weather conditions. The Data Merging + YOLOv8 model [31] had a Mean IoU of 0.9642, an MSE of 0.0086, and an Accuracy of 0.9625, indicating good detection performance but leaving room for improvement in harsh weather conditions. The Multi-Objective CNN model [32] improved on these results, with a Mean IoU of 0.9716, an MSE of 0.0074, and an Accuracy of 0.9702, by combining weather classification and detection. DSNet [33] performed better still, with a Mean IoU of 0.9827, an MSE of 0.0049, and an Accuracy of 0.9814, reflecting strong feature fusion and detail recovery. Explainable TCN [34] achieved balanced performance, with a Mean IoU of 0.9758, an MSE of 0.0061, and an Accuracy of 0.9743, while offering interpretability and handling temporal variations. In contrast, the proposed method outperformed all of the existing models, with the highest Mean IoU of 0.9993, the lowest MSE of 0.0007, and Accuracy, Precision, Recall, and F1-Score of 0.9933, 0.9946, 0.9933, and 0.9937, respectively, which denotes high reliability and detection accuracy across various adverse weather conditions.
Table 5
Comparative Analysis of Object Detection Methods under Adverse Weather Conditions.
Method | Mean IoU | MSE | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|---|---|
Data Merging + YOLOv8 [31] | 0.9642 | 0.0086 | 0.9625 | 0.9658 | 0.9612 | 0.9635 |
Multi-Objective CNN [32] | 0.9716 | 0.0074 | 0.9702 | 0.9725 | 0.9689 | 0.9707 |
DSNet [33] | 0.9827 | 0.0049 | 0.9814 | 0.9835 | 0.9796 | 0.9815 |
Explainable TCN [34] | 0.9758 | 0.0061 | 0.9743 | 0.9761 | 0.9728 | 0.9744 |
Proposed Method | 0.9993 | 0.0007 | 0.9933 | 0.9946 | 0.9933 | 0.9937 |
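For reference, the Intersection over Union (IoU) underlying the Mean IoU column can be sketched as follows. The text does not state how Mean IoU was computed for Table 5, so this Python example only shows the standard definition for axis-aligned boxes in (x1, y1, x2, y2) form, averaged over matched prediction and ground-truth pairs; the function names and the sample boxes are hypothetical.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Width and height of the overlap region, clamped at zero for disjoint boxes.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - intersection
    return intersection / union if union > 0 else 0.0


def mean_iou(pairs):
    """Average IoU over matched (prediction, ground truth) box pairs."""
    return sum(box_iou(pred, truth) for pred, truth in pairs) / len(pairs)


# Hypothetical matched pairs: one exact match and one partial overlap.
pairs = [
    ((0, 0, 10, 10), (0, 0, 10, 10)),
    ((0, 0, 10, 10), (5, 5, 15, 15)),
]
print(f"mean IoU: {mean_iou(pairs):.4f}")
```

An IoU of 1 means that the predicted and ground-truth boxes coincide exactly, so a Mean IoU close to 1 corresponds to very tight localization on the matched objects.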