1. Introduction
Chicken meat was the most produced type of meat globally in 2021, with a total output of 121,588 thousand tons, reflecting a 107% increase compared to 2000 [1]. Forecasts from major market and consumer data companies predict further growth, with chicken meat consumption expected to rise by approximately 20 kilotons by 2032 [2]. Despite this growth, the poultry industry continues to face significant challenges, including monitoring broiler growth, diagnosing diseases, assessing performance, and maintaining animal welfare. Additionally, ensuring production efficiency, improving product quality, and supporting sustainable production practices remain critical concerns.
Traditionally, these tasks have relied heavily on manual labor, which is tedious, time-consuming, and can disrupt broiler welfare. To address these limitations, Precision Livestock Farming (PLF) approaches have been developed. PLF encompasses technologies for housing and microclimate control, weight monitoring, sound analysis, locomotion and activity tracking, disease detection, and hygiene maintenance [3]. Among the available technologies, computer vision algorithms have shown particular promise. Convolutional Neural Networks (CNNs), a key architecture in deep learning, enable efficient segmentation of livestock and poultry images for various purposes. Animal segmentation in farm imagery allows for the identification of sickness, evaluation of growth and nutritional status, refinement of breeding strategies, and improvements in both animal welfare and farm productivity. These techniques are not limited to broilers but also apply to sheep, cattle, and pigs [4–6].
Image segmentation involves separating relevant pixels from the background to enhance the meaningfulness of images, making them easier to analyze and interpret. In broiler monitoring, pixel-wise classification through segmentation represents a fundamental step [7]. Traditional image processing techniques have been widely used to segment broilers from the background for specific objectives. Depth images captured by 3D Kinect cameras, for example, have been segmented using conventional approaches. Methods such as determining maximum and minimum threshold values, applying image smoothing, and converting images to binary formats are commonly employed for segmenting depth data [8]. Watershed-based segmentation, a classical image processing method, has also been applied for this purpose. Mortensen, Lisouski and Ahrendt [9] utilized a range-based watershed algorithm to effectively separate broilers from the background in 3D images.
The Active Shape Model (ASM) is another image-processing-based algorithm that has been used to segment depth images obtained from 3D cameras. It was applied to determine the cutting positions on broiler carcasses [10]. ASM segments objects by using a statistical shape model trained on annotated images and iteratively aligning it to object boundaries in the input image [11]. ASM's effectiveness relies heavily on accurate initial shape estimation, making it sensitive to initialization. While 3D cameras help address this issue in broiler segmentation, their high cost, complexity, environmental sensitivity, and processing demands limit widespread use.
Image processing plays a key role in broiler segmentation, often using color spaces and thresholding as fundamental methods. The optimal color space is usually determined through observation and trial and error, which in turn aids in selecting the best threshold. In one study, the GB color space performed well in distinguishing broilers from the background in top-view images of each pen. Otsu’s algorithm [12] was then used to determine the threshold and create a binary image, simplifying further processing. The aim of this operation was to monitor broilers in each pen, focusing on their distribution, total count, and placement near feeding and drinking areas [13]. In addition, Amraei, Abdanan Mehdizadeh and Salari [14] used Otsu’s algorithm to segment images for estimating broiler weight. Erosion and dilation techniques were applied to address artifacts, shadows, and isolated areas.
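To make the thresholding step concrete, the following is a minimal sketch of Otsu's method in plain Python. It is not the implementation used in the cited studies: the function names and the flat list of 8-bit intensities (for example, one channel of a pen image) are assumptions made for illustration only.

```python
def otsu_threshold(pixels, levels=256):
    """Return the grey level that maximises between-class variance (Otsu's method)."""
    # Histogram of integer grey levels in [0, levels).
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0    # number of pixels at or below the candidate threshold
    sum_bg = 0  # sum of grey levels at or below the candidate threshold
    for t in range(levels):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue  # no background pixels yet
        w_fg = total - w_bg
        if w_fg == 0:
            break     # no foreground pixels left
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        # Between-class variance (up to a constant factor of 1 / total**2).
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(pixels, threshold):
    """Map pixels above the threshold to 1 (foreground) and the rest to 0."""
    return [1 if p > threshold else 0 for p in pixels]


# Hypothetical example: 50 dark background pixels and 50 bright broiler pixels.
sample = [10] * 50 + [200] * 50
t = otsu_threshold(sample)
mask = binarize(sample, t)
```

In practice the binary mask would then be cleaned with morphological erosion and dilation, as described above; that step is omitted here for brevity.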
Despite their widespread use, image processing algorithms face challenges in the poultry industry. Factors like broiler behavior, lighting changes, and background clutter complicate their accuracy and reliability in monitoring. Another drawback of image processing-based segmentation is its multi-step nature, requiring extensive parameter tuning and experimentation for optimal results. To segment thermographic images captured from hens, Zaninelli, Redaelli [15] employed a three-stage process involving histogram analysis of pixels, identifying and applying the suitable segmentation threshold, and subsequently applying appropriate filters to refine the image and remove unwanted noise.
Convolutional Neural Networks (CNNs) are fundamental tools in deep learning. CNNs draw inspiration from biological processes, aiming to emulate the neural connectivity observed in the visual cortex of the brain. Unlike traditional image processing algorithms that rely on manually crafted pre-processing filters, CNNs demand significantly less data pre-processing. They have a wide range of applications in image and video analysis, chiefly classification, detection, and segmentation, and can also be applied to tasks such as natural language processing (NLP) and speech recognition. CNN architectures differ from conventional neural networks by employing convolution rather than standard matrix multiplication in at least one layer. In each layer of a convolutional network, there are typically three steps:
1. Parallel convolution to produce linear activations.
2. Applying non-linear activation functions (detector stage) to infer non-linear mappings.
3. Pooling function to modify the output based on nearby statistical values.
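The three steps listed above can be sketched in plain Python as follows. This is a toy, single-channel illustration under simplifying assumptions (no padding, stride 1, ReLU as the detector stage, non-overlapping max pooling); the function names and the example image and kernel are hypothetical and not taken from any model in this study.

```python
def conv2d_valid(image, kernel):
    """Step 1: cross-correlate a 2-D image with a 2-D kernel (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def relu(feature_map):
    """Step 2: detector stage, here a ReLU non-linearity applied element-wise."""
    return [[max(0.0, v) for v in row] for row in feature_map]


def max_pool2d(feature_map, size=2):
    """Step 3: non-overlapping max pooling; incomplete border windows are dropped."""
    h = len(feature_map) // size
    w = len(feature_map[0]) // size
    return [
        [
            max(feature_map[r * size + i][c * size + j]
                for i in range(size) for j in range(size))
            for c in range(w)
        ]
        for r in range(h)
    ]


# Hypothetical 5x5 image with a vertical edge between columns 1 and 2.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Hypothetical 2x2 kernel that responds to dark-to-bright vertical edges.
kernel = [[-1, 1], [-1, 1]]

pooled = max_pool2d(relu(conv2d_valid(image, kernel)))
```

The pooled map keeps a strong response only in the column block that contains the edge, which is the kind of local summary the pooling stage is meant to provide.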
The field of deep learning continues to broaden and has achieved significant successes in both industry and academia within a short span of time [16].
CNNs are used in poultry farming for segmentation, classification, and detection in images or videos, reducing the time and challenges of manual disease diagnosis. With the aim of early and rapid diagnosis of diseases in poultry farms, Zhuang and Zhang [17] used CNNs. The InceptionV3 architecture, serving as the backbone of the Improved Feature Fusion Single Shot MultiBox Detector (IFSSD), was trained and then evaluated for disease detection in images and videos from a poultry farm. The findings indicated that the CNN, which achieved a mean average precision of 99.7% (with Intersection over Union greater than 0.5), was notably effective in providing early warnings of changes in the health status of broilers.
Manual dead chicken detection in large poultry farms disrupts welfare and is time-consuming. A mechanical system using CNN-based computer vision and three versions of the YOLO (You Only Look Once) algorithm [18], namely YOLOv3, Tiny-YOLOv4, and YOLOv4 [19–21], was designed for automated detection and removal. The designed system was capable of detecting dead chickens using a CNN based on the YOLOv4 algorithm, with an accuracy of 97.5% [22]. CNNs are also used to estimate poses and classify broiler behaviors, providing insights into flock health. A pose skeleton tracks body parts, and a naive Bayesian model classifies behaviors such as standing, walking, eating, and resting, with promising results [23].
Convolutional Neural Networks (CNNs) have the capability to segment images into distinct regions or objects, and can therefore be used to segment images of broilers captured by cameras. This segmentation facilitates the identification of individual broilers within a flock, thereby aiding in the monitoring of their health, behavior, and growth. Instance segmentation is particularly important for broiler monitoring because it allows individual animals to be tracked. Mask R-CNN [24], a widely used model for instance segmentation, not only detects objects but also generates pixel-level segmentation, enabling precise identification of each broiler. Estimating a broiler's respiration rate and assessing its health may require segmenting a series of successive video frames. In one such study, a dataset of 3000 images extracted from broiler videos was used to train and evaluate models based on the Mask R-CNN and YOLACT [25] convolutional neural networks. Both models performed strongly on the test data in segmenting broilers from the background. Accurate segmentation made it possible to extract image features with high accuracy, so that respiration rate estimation and health monitoring of the broilers can be more reliable [26]. This study illustrates how convolutional neural networks differ significantly from conventional image processing techniques.
Tracking broilers in poultry houses with computer vision serves purposes such as health checks, welfare assessment, and data-driven decision making [27, 28]. To evaluate the performance of the Segment Anything Model (SAM) [29] and to track broiler chickens, Yang, Dai [30] performed semantic segmentation of broiler chickens. Two datasets of original and thermal images of cage-free chicks were used to evaluate SAM's zero-shot segmentation on poultry images, focusing on semantic and part-based segmentation. The model's robustness was tested with two use cases, one with a single-point user prompt and the other with all points for every object. Furthermore, two transformer-based models [31], SegFormer [32] and SETR [33], were compared with SAM. SAM outperformed both in whole and part-based segmentation, showing significant improvements, particularly with total points. However, the study also highlighted SAM's limitations in handling behaviors, occlusion, and flock density.
U-Net is a CNN architecture that is widely used for image segmentation tasks [34]. Generally, an encoder and a decoder are the two main parts of the U-Net structure. Although U-Net is an older model, researchers continue to develop more specialized and efficient architectures for image segmentation [35]. A chicken detector for poultry house monitoring used a U-Net for semantic segmentation. It was trained with 400 images and tested with 100, using optimal hyperparameters and three methods of data augmentation. Multi-Otsu thresholding [36] was used to binarize the U-Net model's output. Applying a scaled watershed segmentation [37] to the resulting binary map then made it possible to determine the bounding box and center location of each chicken. When compared to a few state-of-the-art models, the proposed model demonstrated a strong ability to segment chickens, with a precision of 98% [27]. The U-Net architecture was chosen mostly for its demonstrated ability to detect objects with little training data [38] and its efficacy in extremely crowded environments [39]. Although the aforementioned model achieves relatively good accuracy in image segmentation, U-Net's symmetric encoder-decoder structure and skip connections may be deemed relatively complex.
Semantic segmentation was performed on chicken videos encompassing four categories: broilers and layers, both dyed and undyed. The objective was to accurately track the chickens and assess their health status. The segmentation abilities of TAM (Track Anything Model) [40] were compared to those of other state-of-the-art models, namely YOLOv5 and YOLOv8. TAM is designed to track diverse objects in visual data, such as images or videos, regardless of their type or characteristics. Its segmentation combines XMem Video Object Segmentation (VOS) [41] with the Segment Anything Model (SAM). In contrast, the YOLO-based models [18], YOLOv5 and YOLOv8, had to be trained on 1000 images before they could be evaluated and tested. TAM achieved a mean Intersection over Union (mIoU) of over 90%, demonstrating effective segmentation of both dyed and undyed classes of both types of chickens. The model exhibited a higher proficiency in segmenting dyed chickens. Dyeing the chickens proved beneficial for segmentation by enhancing color contrast between the chickens and the background [42].
YOLOv8 is a modern and enhanced object detection model designed for computer vision. Given the wide success of the YOLO [43] family in a variety of fields, and because many investigations, particularly in the poultry industry, demonstrate the success of its eighth version [44, 45], efforts were made to apply YOLOv8 to broiler segmentation. Since YOLOv8 is a one-stage detection model, it is popular in real-time applications. One of the key techniques employed in the YOLOv8 architecture is anchor-free detection, which the model uses to enhance object localization. A soft non-maximum suppression step filters the output and removes redundant detections. Additionally, YOLOv8's backbone (DarkNet53) makes use of Cross-Stage Partial (CSP) connections, which greatly enhance information flow between the network's stages. With this architecture, feature extraction is more precise and efficient while also improving feature reuse and reducing computational bottlenecks [46].
The reviewed research indicates that image processing, particularly threshold-based techniques, is the predominant method for chicken recognition in poultry monitoring systems [7]. However, despite their widespread usage, these methods come with drawbacks. Notably, they are sensitive to environmental factors such as variations in lighting, camera angles, and background alterations, which pose significant challenges for broiler segmentation. Because image processing is multi-stage, segmentation adds computing load and lengthens processing times, which can be a limiting factor in real-time applications where response times must be quick. CNNs were therefore explored as potential solutions. Given the erratic movements of chickens in real-time scenarios, rapid detection proves challenging. Moreover, architectures investigated in previous research, such as Mask R-CNN and U-Net, imposed a considerable computational burden, resulting in notably slow detection speeds.
The primary goal of this study is to explore advanced CNN models (Mask R-CNN with MobileNetv2, YOLOv8, and SAM) and develop a lightweight, fast architecture for broiler segmentation, addressing both semantic and instance segmentation tasks without compromising accuracy. Unlike semantic segmentation, which assigns a label to each pixel in an image, instance segmentation goes a step further by detecting and distinguishing individual objects. This distinction is particularly crucial in poultry monitoring, where precise identification of each broiler is required. The study emphasizes eliminating multi-step operations in image processing-based methods and minimizing background errors, with a focus on optimizing performance for low-power and embedded systems widely utilized in poultry production systems. The RGB color space was chosen due to its compatibility with display systems, simplicity, and intuitive handling in poultry monitoring applications (Chavolla et al., 2018).
Mask R-CNN with MobileNetv2 is a two-stage model that first detects object regions and then refines segmentation, offering accurate results but at a relatively higher computational cost. YOLOv8 is a single-stage anchor-free model designed for fast and efficient object detection and segmentation, making it highly suitable for real-time applications and embedded systems with limited computational resources. SAM, a foundation model developed for zero-shot segmentation, provides highly flexible and general-purpose segmentation capabilities without requiring task-specific training. However, this flexibility comes at the cost of increased computational demand. By investigating these models, the study explores diverse segmentation strategies, each with distinct strengths, aiming to develop a solution that balances accuracy, speed, and efficiency.
2. Materials and methods
2.1. Data acquisition
To capture images, a dedicated imaging platform was constructed. The design featured a corridor structure with a width of 50 cm, enabling broilers to be identified and photographed individually as they walked through the passage. The camera was mounted at a height of 120 cm above the platform to provide a clear top-down view. An ELP USB industrial camera equipped with a 2 MP Sony IMX323 sensor, a 2.1 mm wide-angle lens, and low-light sensitivity (0.01 Lux) was used for image acquisition. A schematic representation of the platform design, created using SolidWorks 2018, is presented in Fig. 1. Data collection was conducted over 13 days during the growth period at a research poultry farm affiliated with Tabriz University (38°01'47"N 46°23'45"E), which has a capacity of 1000 broilers. This operation took place every 3 to 4 days, involving at least 85 Arian broilers each time. Each image was captured from the top view and contained a single broiler in a vertical orientation. In total, approximately 1122 random images of broilers were obtained throughout the breeding period.
2.2. Image annotations
In the context of image segmentation using CNNs, accurate annotations are essential because they serve as the ground truth that the model learns to predict. Manual labeling involves annotators meticulously outlining objects or regions of interest in an image to create precise segmentation masks. Class labels are then assigned to each mask, distinguishing between different types of objects or regions. This process ensures that the segmentation model has accurate and detailed ground truth data to learn from, significantly impacting its performance and reliability. To maintain high annotation consistency and minimize variability, all images were manually annotated by a single expert annotator. This approach helped ensure uniform labeling quality across the dataset and reduced discrepancies that could arise from multiple annotators. The images of broilers were annotated using an open-source tool called "LabelMe," which generates outputs in a standard format. These outputs were later converted into formats suitable for training different models.
2.3. Pre-processing
Pre-processing algorithms assist neural networks in better understanding image features and enhancing model performance. It is crucial that the type and stages of pre-processing are tailored to the specific characteristics of the problem to ensure optimal outcomes (De Raad et al., 2021). Initially, the dataset was randomly divided into training, validation, and testing subsets at a 70:10:20 ratio, respectively, with each subset reviewed to preserve the dataset's overall diversity. Subsequent pre-processing steps involved preparing images for neural network training through resizing to fixed dimensions and normalization based on the mean and standard deviation values typically used for pre-trained models. To further enhance the diversity and robustness of the training set, several data augmentation techniques were applied. Scaling was used to simulate different broiler sizes and distances from the camera; HSV augmentation addressed variations in lighting and color conditions common in farm environments; mosaic augmentation enabled the model to learn from composite scenes with multiple broilers and backgrounds, improving generalization; and horizontal flipping introduced spatial variability, helping the model better handle changes in broiler orientation. These augmentations collectively aimed to improve model generalization, ensuring more accurate segmentation under diverse real-world conditions.
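As a rough illustration of the split and normalization described above, the sketch below shuffles a list of image identifiers and divides it 70:10:20, then standardizes one RGB pixel. The function names, the fixed seed, and the use of ImageNet channel statistics are assumptions; the text only states that the mean and standard deviation values typical of pre-trained models were used, and the count of 1122 images is the approximate figure reported in Section 2.1.

```python
import random


def split_dataset(items, ratios=(0.7, 0.1, 0.2), seed=0):
    """Shuffle items and split them into train/validation/test subsets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_train = int(ratios[0] * len(shuffled))
    n_val = int(ratios[1] * len(shuffled))
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


def normalize_pixel(rgb, mean, std):
    """Scale an 8-bit RGB pixel to [0, 1], then standardize each channel."""
    return tuple((c / 255.0 - m) / s for c, m, s in zip(rgb, mean, std))


# Assumed ImageNet statistics, commonly paired with pre-trained backbones.
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

# Hypothetical image identifiers standing in for the roughly 1122 images.
train, val, test = split_dataset(range(1122))
example = normalize_pixel((128, 128, 128), IMAGENET_MEAN, IMAGENET_STD)
```

With 1122 items this yields subsets of 785, 112, and 225 images; the augmentations listed above would be applied to the training subset only.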
2.4. Models
Deep learning approaches employ deep artificial neural networks for a range of applications, and these networks are trained and evaluated on a suitable dataset. In recent years, deep learning models have become the preferred technique for segmenting images in many industries, and advances in convolutional architectures for segmentation have prompted researchers to adopt these models, particularly in the agricultural sector. Segmentation models, especially instance segmentation models, are designed to identify all pixels associated with one object in the image. The main goals of this research were to adopt and modify appropriate architectures for the identification and segmentation of broilers. The management of poultry farms stands to benefit greatly from the segmentation of broiler images.
2.4.1 Mask R-CNN
By combining object detection and segmentation, the Mask R-CNN model is able to predict a bounding box surrounding an object and identify which pixels inside the box are associated with that object. This approach generates a mask that highlights which pixels are part of an object using color or grayscale values. Segmentation algorithms such as Mask R-CNN are designed to produce this type of mask, which is essential for distinguishing individual objects within complex scenes. A Convolutional Neural Network (CNN) serves as the backbone in Mask R-CNN, extracting features from input images. After applying a Region Proposal Network (RPN) to generate candidate object regions and RoIAlign to refine these regions, the network splits into three heads: one for object classification, one for bounding box prediction, and a third for pixel-wise segmentation mask prediction (Fig. 2). This segmentation process is critical in tasks like instance segmentation, where precise identification of objects at the pixel level is required for applications such as tracking individual animals or assessing their health status in real-time systems.
An effective neural network architecture designed for mobile and resource-constrained devices is MobileNetv2 (Sandler et al., 2018). Through notable advances like depthwise separable convolutions and inverted residuals with linear bottlenecks, MobileNetv2 enhances the original MobileNet (Howard, 2017), reducing computational load while maintaining accuracy. MobileNetv2, with its speed, minimal computational costs, and efficient design, is an ideal feature extractor for models like Mask R-CNN, especially in real-time applications and scenarios where resource limitations are a critical consideration. Therefore, MobileNetv2 was employed in this study as the feature extractor for the Mask R-CNN model to achieve both accuracy and efficiency under these constraints.
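The saving offered by depthwise separable convolutions can be checked with simple parameter arithmetic. The sketch below compares the weight count of a standard convolution with that of a depthwise convolution followed by a 1x1 pointwise convolution. Biases and batch-normalization parameters are ignored, and the layer sizes in the example are hypothetical rather than taken from MobileNetv2.

```python
def standard_conv_params(kernel_size, in_channels, out_channels):
    """Weights in a standard convolution: k * k * C_in * C_out."""
    return kernel_size * kernel_size * in_channels * out_channels


def depthwise_separable_params(kernel_size, in_channels, out_channels):
    """Weights in a depthwise (k * k * C_in) plus pointwise (C_in * C_out) pair."""
    depthwise = kernel_size * kernel_size * in_channels
    pointwise = in_channels * out_channels
    return depthwise + pointwise


# Hypothetical layer: 3x3 kernel, 32 input channels, 64 output channels.
standard = standard_conv_params(3, 32, 64)
separable = depthwise_separable_params(3, 32, 64)
reduction = standard / separable
```

For this example the separable form needs 2336 weights instead of 18432, roughly an eightfold reduction, which is the kind of saving that makes such backbones attractive on embedded hardware.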
2.4.2 YOLOv8
YOLOv8 is a modern and efficient object detection model, well-suited for real-time applications such as broiler segmentation. Its anchor-free detection mechanism, improved localization accuracy, and fast inference speed make it particularly advantageous for monitoring dynamic and dense environments like poultry houses. Therefore, YOLOv8 was selected in this study to address the specific challenges of broiler image segmentation, aiming to achieve both high precision and computational efficiency on practical systems.
According to prior research, several key strategies are necessary to successfully train and optimize the YOLOv8 model. By integrating transfer learning, data augmentation, and other optimization techniques, the model can generate more accurate segmentation masks across diverse conditions and improve generalization to unseen data. Transfer learning, in particular, accelerates model convergence by utilizing pretrained CNN architectures initialized on large datasets [47]. Based on these considerations, the YOLOv8-large model was selected for broiler image segmentation. Compared to smaller variants like YOLOv8-small or YOLOv8-medium, YOLOv8-large achieves higher accuracy due to its larger architecture and greater number of parameters, with only a modest increase in computational cost. This choice was confirmed through trial-and-error testing of different YOLOv8 versions to ensure optimal performance.
2.4.3 Segment Anything Model (SAM)
The Segment Anything Model (SAM) is a foundation model that presents a novel approach to object segmentation. SAM offers zero-shot transfer to various tasks and is designed to operate on new image types without additional training, so it can generalize well to various visual inputs and segmentation challenges. It was trained on the largest segmentation dataset to date, with over a billion masks from 11 million images. SAM consists of three primary components: an image encoder, a prompt encoder, and a mask decoder. Using a pretrained autoencoder-based backbone, the image encoder first extracts the image's key features. The prompt encoder transforms various prompts, including points, boxes, and masks, into features that the model can understand. Sparse prompts require positional encoding, while dense prompts, such as masks, are processed using convolutions and integrated with the image features. With this configuration, SAM can successfully comprehend and respond to various prompt types. The mask decoder rapidly produces segmentation masks from the features extracted by the image and prompt encoders. It employs cross-attention techniques to update these features, ensuring swift interactive responses. The general architecture of the SAM model is shown in Fig. 3.
In this research, the Region of Interest (RoI) technique was integrated within the SAM model to perform segmentation of broilers. A rectangular region focused on the central area of the designated platform was selected as the RoI (Fig. 4), which helps optimize processing efficiency by allowing the model to focus exclusively on the broiler rather than the entire platform, thereby reducing computational demands. This targeted approach not only lowers processing costs but also enhances the model's focus and accuracy for broiler segmentation, leading to more reliable results. However, it’s important to note that variations in lighting conditions can impact the model's performance, potentially affecting segmentation accuracy (Fig. 9). By concentrating the segmentation task within a controlled RoI, the model may be better able to adapt to these lighting fluctuations, ultimately improving segmentation consistency under varying environmental conditions.
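The RoI step can be sketched as a simple geometric operation. The code below selects a rectangle centred in the frame, crops it, and returns the RoI centre as a candidate point prompt. How the RoI and the point prompt were actually combined with SAM is not specified in the text, so the helper names, the RoI fractions, and the choice of the centre as the prompt are all assumptions for illustration.

```python
def central_roi(width, height, frac_w=0.5, frac_h=0.5):
    """Return (x0, y0, x1, y1) of a rectangle centred in a width x height image."""
    roi_w = int(width * frac_w)
    roi_h = int(height * frac_h)
    x0 = (width - roi_w) // 2
    y0 = (height - roi_h) // 2
    return x0, y0, x0 + roi_w, y0 + roi_h


def crop(image, box):
    """Crop a row-major 2-D image (list of rows) to the given box."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]


def roi_centre_prompt(box):
    """Centre of the RoI in full-image coordinates, usable as a single point prompt."""
    x0, y0, x1, y1 = box
    return (x0 + x1) // 2, (y0 + y1) // 2


# Hypothetical 1920x1080 frame with an RoI covering the central half in each axis.
box = central_roi(1920, 1080)
prompt_xy = roi_centre_prompt(box)

# Tiny 4x4 image whose pixel values encode their own position (row * 4 + column).
tiny = [[r * 4 + c for c in range(4)] for r in range(4)]
tiny_roi = crop(tiny, central_roi(4, 4))
```

Restricting the model input to the cropped region reduces the number of pixels to encode, which is the efficiency argument made above.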
2.5 Model Training and Evaluation
To ensure dependable application in poultry monitoring, the proposed models were trained and evaluated for precise broiler segmentation against several criteria. Before training the Mask R-CNN and YOLOv8 models, certain parameters, known as hyperparameters, need to be set. These hyperparameters play a critical role in optimizing the network's training during each epoch and are important in the model's configuration. Since SAM is used as a zero-shot model, however, it does not require these hyperparameters. Instead, the type of prompt becomes crucial, and a point-based prompt was used for segmentation. Table 1 displays some of the key hyperparameters used to train the Mask R-CNN and YOLOv8 models. One complete pass of the model through the entire training dataset is called an epoch. Batch size defines the number of images processed together in each step. Convergence speed is affected by the learning rate (Lr), which sets the step size used when updating weights. To reduce loss, the optimizer function modifies weights according to the gradients and Lr. To increase training efficiency, the Lr-scheduler adjusts the learning rate over time. The selected hyperparameters were determined through trial and error, and the resulting models performed best with the chosen values. The training process took approximately 24 minutes for the YOLOv8 model and around 95 minutes for the Mask R-CNN model. The longer training time of Mask R-CNN can be attributed to its two-stage approach, which involves region proposal followed by pixel-wise segmentation and is therefore more computationally intensive. The one-stage architecture of YOLOv8, by contrast, directly predicts bounding boxes and class labels in a single step, making it faster and more efficient for real-time applications.
The model training was conducted on an NVIDIA QUADRO RTX 4000 GPU, utilizing the CUDA® parallel computing platform to accelerate processing. The training environment was set up on a Linux-based system running Ubuntu, ensuring optimal performance and compatibility for deep learning tasks.
Table 1
Training hyperparameters of models
| Model | Epochs | Batch size | Learning rate (Lr) | Optimizer function | Lr-scheduler |
|---|---|---|---|---|---|
| Mask R-CNN | 100 | 8 | 0.0001 | Stochastic gradient descent (SGD) | Cosine annealing |
| YOLOv8 | 50 | 8 | 0.0001 | Stochastic gradient descent (SGD) | Cosine annealing |
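To show what the cosine annealing scheduler in Table 1 does to the learning rate, the sketch below evaluates the standard cosine annealing formula without warm restarts. The initial rate of 0.0001 comes from Table 1; the minimum rate of 0 and the per-epoch update are assumptions, since the text does not state them, and the function name is hypothetical.

```python
import math


def cosine_annealing_lr(epoch, total_epochs, lr_max=1e-4, lr_min=0.0):
    """Learning rate at a given epoch under cosine annealing (no restarts).

    lr = lr_min + 0.5 * (lr_max - lr_min) * (1 + cos(pi * epoch / total_epochs))
    """
    cos_term = math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + cos_term)


# Schedule for a 50-epoch run, matching the YOLOv8 row of Table 1.
schedule = [cosine_annealing_lr(e, 50) for e in range(51)]
```

The rate starts at its maximum, falls slowly at first, passes half of the initial value at the midpoint, and approaches the minimum at the final epoch.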
During training, the error trend is analyzed using specific loss functions at each epoch. Ideally, after the designated number of epochs, the error should reach its minimum while accuracy attains its maximum. Evaluation on the test set compared target pixels with detected pixels using Intersection over Union (IoU), Average Precision (AP), and Average Recall (AR). These metrics were chosen because they give a clear measure of how well a model identifies and segments the broiler pixels. The IoU metric is particularly important because it directly measures the overlap between predicted and ground truth pixels. For this study, the IoU range of 0.50:0.95 was used to account for different levels of overlap and ensure a more robust evaluation of the model's performance, particularly in real-world scenarios where the required level of precision may vary.
Furthermore, the time required for broiler segmentation is a key evaluation factor examined in this research, as achieving a fast, lightweight model with high detection and segmentation accuracy is essential for the efficient management of poultry farms. Fast and lightweight models are well-suited for use on embedded systems, minimizing hardware and processing costs while enabling real time monitoring and decision-making, which are crucial for maintaining productivity and animal welfare in farm operations.
True Positive (TP) refers to pixels correctly identified as belonging to the target class, while False Positive (FP) counts pixels incorrectly predicted as the target class when they do not belong to it. False Negative (FN) represents target pixels that the model failed to detect. In segmentation tasks, True Negative (TN), indicating correctly identified non-target pixels, is typically less relevant due to the large number of background pixels. Precision, recall and IoU are calculated based on TP, FP, and FN, as shown in equations (1), (2) and (3), respectively.
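The definitions above can be turned into a small worked example. The sketch below counts TP, FP, and FN from a pair of binary masks and applies the standard pixel-wise formulas: precision = TP / (TP + FP), recall = TP / (TP + FN), and IoU = TP / (TP + FP + FN). The function names and the toy masks are hypothetical.

```python
def confusion_counts(pred_mask, true_mask):
    """Count pixel-wise TP, FP and FN between two binary masks of equal shape."""
    tp = fp = fn = 0
    for pred_row, true_row in zip(pred_mask, true_mask):
        for p, t in zip(pred_row, true_row):
            if p and t:
                tp += 1   # predicted broiler, actually broiler
            elif p and not t:
                fp += 1   # predicted broiler, actually background
            elif t and not p:
                fn += 1   # missed broiler pixel
    return tp, fp, fn


def precision(tp, fp):
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    return tp / (tp + fn) if (tp + fn) else 0.0


def iou(tp, fp, fn):
    return tp / (tp + fp + fn) if (tp + fp + fn) else 0.0


# Hypothetical 2x4 ground-truth and predicted masks (1 = broiler pixel).
true_mask = [[0, 1, 1, 0],
             [0, 1, 1, 0]]
pred_mask = [[0, 0, 1, 1],
             [0, 1, 1, 0]]

tp, fp, fn = confusion_counts(pred_mask, true_mask)
scores = (precision(tp, fp), recall(tp, fn), iou(tp, fp, fn))
```

TN does not appear in any of the three formulas, which is why the large number of background pixels does not inflate the scores.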
3. Results and discussion
3.1 Training results
As previously stated, the evaluation of the presented models has two components: examining the training process and testing. The training total loss for the YOLOv8 and Mask R-CNN models is shown in Fig. 5. The Mask R-CNN model’s total loss decreases sharply at first and stabilizes gradually over 100 epochs, with minor oscillations even in later epochs (graph (a)). This total loss includes bounding box loss, classification loss, mask loss, and region proposal loss. The slight fluctuations reflect the complexity of segmenting and detecting objects simultaneously, as Mask R-CNN continues to refine its mask predictions and object boundaries. As can be seen in graph (b), the YOLOv8 model, which was trained for segmentation for 50 epochs, likewise showed a sharp decline in total loss during the first epochs. The rapid stabilization of the total loss, which consists of segmentation loss, localization loss, and confidence loss, indicates effective convergence. Given the reduced training time, this early stabilization indicates that YOLOv8 adapts effectively to the segmentation problem.
Overall, YOLOv8 converges efficiently within 50 epochs, with a quick initial drop in total loss, and stabilizes faster than Mask R-CNN. Mask R-CNN, trained for 100 epochs, stabilizes more slowly because of its complex multi-task configuration, which combines object detection and segmentation. The faster convergence of YOLOv8 is attributed to its simpler architecture. Note that the total loss reported here includes training and validation loss for both models.
3.2 Performance metrics
To further optimize the training process, the precision on the validation dataset was tracked at each training epoch. The graphs in Fig. 6 show the trend in validation average precision (AP) for the Mask R-CNN (a) and YOLOv8 (b) models across training epochs, calculated over the IoU (Intersection over Union) range 0.50:0.95. IoU measures the overlap between the predicted segmentation and the ground truth, with values closer to 1 indicating a better match. AP is averaged over IoU thresholds from 0.50 to 0.95 to provide a comprehensive measure of precision across different levels of overlap accuracy. YOLOv8 demonstrates faster convergence and higher stability in precision: it reaches close to 0.99 AP within the first few epochs and maintains this level with minimal fluctuations, suggesting that its simpler architecture is well-suited to segmentation tasks. In contrast, Mask R-CNN shows more variability and lower overall precision, with values generally fluctuating between 0.3 and 0.4 AP across the training period. This variability is likely due to Mask R-CNN's more complex multi-task architecture, which combines object detection and segmentation and makes it slower to stabilize. Therefore, YOLOv8 is a more effective choice for tasks that require quick convergence and high segmentation precision.
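The role of the IoU range 0.50:0.95 can be illustrated with a minimal sketch. It computes the IoU of one predicted mask against one target mask and reports the fraction of thresholds at which the prediction would count as a match. It assumes the COCO-style step of 0.05 between thresholds, and it is not a full AP computation, which also involves ranking predictions by confidence and integrating a precision-recall curve. The function names and the toy masks are illustrative only.

```python
import numpy as np

# Assumed COCO-style thresholds: 0.50, 0.55, ..., 0.95 (ten values).
IOU_THRESHOLDS = np.linspace(0.50, 0.95, 10)


def mask_iou(pred, target):
    """Intersection over Union of two boolean masks with the same shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        # Both masks empty: treated here as a perfect match (an assumption).
        return 1.0
    intersection = np.logical_and(pred, target).sum()
    return float(intersection) / float(union)


def fraction_of_thresholds_passed(pred, target):
    """Share of the IoU thresholds at which the prediction counts as a match."""
    iou = mask_iou(pred, target)
    return float(np.mean(iou >= IOU_THRESHOLDS))


# Toy example: the prediction misses one column of the target region.
target = np.zeros((10, 10), dtype=bool)
target[2:8, 2:8] = True   # 36 target pixels
pred = np.zeros((10, 10), dtype=bool)
pred[2:8, 2:7] = True     # 30 predicted pixels, all inside the target

print(mask_iou(pred, target))
print(fraction_of_thresholds_passed(pred, target))
```

In this toy case the prediction counts as a match at the looser thresholds but fails the stricter ones, which is why a score averaged over 0.50:0.95 rewards accurate mask boundaries more than a score at IoU = 0.50 alone.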
Table 2 presents the broiler segmentation performance of the Mask R-CNN, YOLOv8 and SAM models according to several criteria. YOLOv8 delivers the most accurate segmentation results, outperforming the other two models in both average precision and average recall over the IoU range 0.50:0.95. In addition, YOLOv8 detects all target areas, achieving a perfect recall, unlike the Mask R-CNN and SAM models. With a segmentation time of 30.7 milliseconds, the YOLOv8 model can segment broilers at approximately 33 frames per second, which exceeds the commonly accepted minimum of 30 frames per second for real-time processing. The Mask R-CNN model uses the MobileNetv2 backbone, which gives it high speed (4 ms per image) and a small model size (75.3 MB). However, this lightweight backbone results in lower precision and recall than YOLOv8. Examining the segmentation speed of the models reveals that SAM is considerably slower, particularly without the Region of Interest (RoI) approach (6253.4 ms per image, compared with 57.6 ms with RoI). In addition, SAM is a much larger model (2.5 GB), which may limit its use on resource-restricted platforms.
Table 2
Evaluation results of all three models
| Model | Average Precision (IoU = 0.50:0.95) | Average Recall (IoU = 0.50:0.95) | Segmentation time (GPU, per image) | Model size |
|---|---|---|---|---|
| Mask R-CNN | 0.891 | 0.895 | 4 ms | 75.3 MB |
| YOLOv8-large | 0.995 | 1 | 30.7 ms | 87.9 MB |
| SAM | 0.912 | 0.915 | Without RoI: 6253.4 ms; with RoI: 57.6 ms | 2.5 GB |
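The segmentation times in Table 2 can be converted to throughput with a short calculation. The sketch below assumes that images are processed one at a time and that the reported time covers the whole per-image cost, so that frames per second is simply 1000 divided by the time in milliseconds. The dictionary keys are labels chosen for this example.

```python
def frames_per_second(latency_ms):
    """Convert a per-image processing time in milliseconds to frames per second."""
    if latency_ms <= 0:
        raise ValueError("latency_ms must be positive")
    return 1000.0 / latency_ms


# Per-image GPU segmentation times taken from Table 2.
SEGMENTATION_TIME_MS = {
    "Mask R-CNN": 4.0,
    "YOLOv8-large": 30.7,
    "SAM (with RoI)": 57.6,
    "SAM (without RoI)": 6253.4,
}

REAL_TIME_FPS = 30.0  # threshold quoted in the text

for model, latency in SEGMENTATION_TIME_MS.items():
    fps = frames_per_second(latency)
    status = "meets" if fps >= REAL_TIME_FPS else "is below"
    print(f"{model}: {fps:.1f} fps ({status} {REAL_TIME_FPS:.0f} fps)")
```

Under these assumptions, 30.7 ms corresponds to about 32.6 frames per second, which is the figure rounded to 33 in the text, while SAM stays below 30 frames per second both with and without RoI.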
3.3 Visual analysis
After the Mask R-CNN and YOLOv8 models had been trained and the SAM model had been implemented with the RoI method, the segmentation of randomly selected test images was analyzed and compared across the three models (Fig. 7). Each model produced visually different predictions. Compared to Mask R-CNN, YOLOv8 and SAM generated more accurate segmentation borders. The lower accuracy of Mask R-CNN is probably due to its MobileNetv2 backbone, which was chosen to speed up segmentation: MobileNetv2 favors efficiency over detailed feature extraction, which makes it less effective for complex segmentation tasks.
Since YOLOv8 is faster and uses less processing power than SAM, it is more appropriate for real-time applications on embedded systems such as mobile devices or low-power processors. Even though SAM produces acceptable results, it requires far more processing time and resources (Table 2), making it impractical for real-time or resource-constrained applications. This is why SAM is primarily used as an annotation tool in many studies [48, 49]. Therefore, YOLOv8 is a more practical choice for real-time segmentation tasks in embedded systems where quick processing and lower power consumption are essential. The target masks and the masks predicted by SAM and YOLOv8 are compared in Fig. 8. The masks predicted by YOLOv8 nearly match the target masks down to fine details, whereas the masks predicted by SAM appear smoother and lack this detail. The difference might arise because SAM targets general segmentation tasks, placing greater importance on broad accuracy than on fine details, while YOLOv8 is designed for object detection and segmentation tasks with high precision.
The high accuracy achieved in broiler segmentation with SAM is largely due to the application of the region of interest (RoI) technique. When the RoI is not used, lighting conditions clearly make it more difficult to separate broilers, as can be seen by comparing the results in Fig. 9. By focusing the segmentation on the relevant regions, RoI decreases the influence of background noise and lighting changes that may confuse the model. Without RoI, the model processes the entire image, including distracting background elements that change with lighting, which often results in inaccurate segmentation. The RoI approach therefore improves segmentation by minimizing the effect of environmental factors, such as lighting, on broiler detection. Additionally, using SAM with the RoI technique reduces computational cost (57.6 ms instead of 6253.4 ms per image, Table 2) while achieving high accuracy.
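The general idea of restricting segmentation to a region of interest can be sketched as follows. The text does not specify how the RoI is supplied to SAM, so this example assumes the simplest variant: the RoI is an axis-aligned box, the image is cropped to that box, the crop is segmented, and the resulting mask is placed back into a full-size mask. The names `segment_within_roi` and `toy_brightness_segmenter` are hypothetical, and the brightness threshold is only a stand-in for a real segmentation model.

```python
import numpy as np


def segment_within_roi(image, roi, segment_fn):
    """Segment only the pixels inside roi and return a full-size boolean mask.

    image      : H x W x C array
    roi        : (x0, y0, x1, y1) pixel box, end-exclusive (assumed format)
    segment_fn : callable mapping a crop (h x w x C) to an h x w mask;
                 a hypothetical stand-in for the actual model call
    """
    x0, y0, x1, y1 = roi
    crop = image[y0:y1, x0:x1]
    crop_mask = np.asarray(segment_fn(crop), dtype=bool)

    # Everything outside the RoI is background by construction.
    full_mask = np.zeros(image.shape[:2], dtype=bool)
    full_mask[y0:y1, x0:x1] = crop_mask
    return full_mask


def toy_brightness_segmenter(crop):
    """Stand-in segmenter: marks bright pixels as foreground."""
    return crop.mean(axis=2) > 127


# Toy image: a bright "broiler" patch inside the RoI and a bright
# background distractor outside it.
image = np.zeros((6, 8, 3), dtype=np.uint8)
image[2:4, 3:5] = 255   # object of interest
image[0, 0] = 255       # distractor, e.g. a lighting artifact

mask = segment_within_roi(image, roi=(2, 1, 6, 5), segment_fn=toy_brightness_segmenter)
print(int(mask.sum()), bool(mask[0, 0]))
```

In this toy case the distractor outside the box cannot enter the mask, and the segmenter only processes the cropped pixels, which illustrates why an RoI can reduce both background-induced errors and computation.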
Based on both visual observations and evaluation metrics, the precision and recall of the SAM model are lower than those of YOLOv8. One reason is SAM's tendency to overlook small details. Additionally, factors such as feathers and waste can lead to segmentation errors, reducing the model's overall accuracy (Fig. 10).
Considering the overall evaluation results, YOLOv8 emerges as the most practical and effective solution for broiler segmentation in real-time applications. Its superior precision (99.5%), rapid convergence, high recall, and real-time processing capability at over 30 frames per second make it an ideal choice for deployment on embedded systems where speed, accuracy, and lightweight architecture are critical. In contrast, although Mask R-CNN and SAM offer acceptable accuracy, their relatively lower precision, higher variability, and heavier computation requirements limit their practical utility in resource-constrained or time-sensitive environments. Therefore, for efficient poultry farm monitoring and management, YOLOv8 is strongly recommended as the optimal model.