A Crack Segmentation Network Integrating Multi-Scale Attention Residual and Context-Enhanced Transformer Block

Jiahui Li (1), Email: 1195664513@qq.com

Yong Li (1, 2, ✉), Email: liyong07@cdut.edu.cn

Yingtian Liu (2), Email: liuyingtian@stu.cdut.edu.cn

(1) College of Computer Science and Cyber Security, Chengdu University of Technology

(2) College of Geophysics, Chengdu University of Technology

Abstract

Crack segmentation is crucial for infrastructure maintenance, yet it remains challenging due to false detections caused by crack-like textures and discontinuous predictions caused by inadequate continuity modeling. To address these issues, we propose a complementary synergistic fusion network (CSF-Net). Our approach features a dual-branch encoder: a local texture branch equipped with a novel multi-scale attention residual (MSAR) module to suppress texture interference, and a global structure branch incorporating a context-enhanced attention module (CEAM) to enhance the modeling of crack continuity. These complementary features are fused via a cross-branch fusion block (CFB). Extensive experiments on three public benchmarks demonstrate the superiority of CSF-Net. On the DeepCrack dataset, it achieves an mIoU of 72.79% and an F1-score of 86.67%, outperforming the U-Net baseline by 7.08 and 7.49 percentage points, respectively. CSF-Net also exhibits state-of-the-art and robust performance on the CFD and Crack500 datasets, confirming its strong generalization capability.

Keywords

Crack Detection

Semantic Segmentation

Deep Learning

Transformer

Attention Mechanism

Jiahui Li, Yong Li, and Yingtian Liu: These authors contributed equally to this work.

Introduction

Detection of road cracks is crucial to prevent their deterioration into more severe damage \cite{ref1,ref2}. Early crack segmentation methods primarily relied on traditional image processing techniques, including threshold segmentation \cite{ref3}, edge detection \cite{ref4}, and mathematical morphology \cite{ref5}, but their robustness in complex real-world scenarios remains limited \cite{ref6,ref7}.

In recent years, deep learning has gradually replaced traditional segmentation methods owing to its powerful feature extraction capability \cite{ref8,ref9}. Crack segmentation techniques can be broadly categorized into CNN-based models and hybrid models that combine CNNs with Transformers. CNN-based approaches have been widely adopted because of their strong ability to capture local features, achieving relatively satisfactory segmentation results \cite{ref10,ref11}. \cite{ref12} first introduced fully convolutional networks (FCNs) for semantic segmentation. Subsequently, U-Net \cite{ref13} and its variants, through encoder-decoder architectures and skip connections, alleviated the spatial information loss caused by downsampling \cite{ref14,ref15}. \cite{ref16} proposed skip-level round-trip sampling blocks for cross-level feature interaction, while \cite{ref17} replaced the U-Net encoder with diverse CNNs to reduce computation and memory while maintaining accuracy. Multi-scale feature fusion was further employed to aggregate contextual information from different receptive fields, balancing the capture of fine crack details and overall structure. For example, \cite{ref18} proposed DeepCrack, which fuses multi-scale convolutional features to enhance perception of crack details and global structure. However, multi-scale fusion still relies on local convolution, limiting long-range dependency modeling \cite{ref19}. Hence, attention mechanisms have been introduced to enhance global context and suppress noise. \cite{ref20} proposed MST-Net, jointly modeling spatial, channel, and pixel dimensions to improve global context awareness and segmentation accuracy, while \cite{ref21} introduced AHC-Net, employing convolutional block attention in the encoder and criss-cross attention in the decoder. Despite their effectiveness in capturing global correlations, attention mechanisms are often weaker at modeling fine crack details \cite{ref22,ref23}.

With the development of Transformers in visual tasks, self-attention mechanisms have been applied to crack segmentation to enhance long-range dependency modeling \cite{ref24,ref25}. \cite{ref26} proposed MSDCrack, which enhances global dependency modeling through self-attention pooling while preserving local feature extraction. \cite{ref27} introduced ISTD-CrackNet, a hierarchical Transformer-based model that strengthens local detail representation using deformable convolutions and multi-scale convolutions. Although these hybrid models balance local and global representation \cite{ref28}, they lack specialized modules for two key challenges: false detections arising from crack-like textures and discontinuous predictions due to inadequate continuity modeling.

Therefore, this paper proposes a Complementary Synergistic Fusion Network (CSF-Net). It includes a Multi-Scale Attention Residual module (MSAR) that processes features with multi-directional convolutions, followed by a crack-like filtering unit to suppress interference, and employs a residual connection to preserve crack details. In addition, a Context-Enhanced Attention Module (CEAM) optimizes self-attention via an asymmetric attention structure and dual-path feature enhancement mechanism, effectively addressing broken or blurry predictions while modeling crack continuity and suppressing background noise. This design mitigates false detection and improves global structural continuity, preventing cracks from being segmented into fragments.

The main contributions of this paper are summarized as follows:

1. We propose CSF-Net, which achieves collaborative modeling of crack details and overall structure. CSF-Net adopts a dual-branch encoder architecture consisting of a local texture branch and a global structure branch, and employs a cross-branch fusion block (CFB) to enable semantic complementarity and feature fusion between the two branches.

2. We design the MSAR module for the local texture branch. MSAR combines multi-directional convolutions with crack-like filtering units, effectively reducing false detections caused by crack-like texture interference and enhancing the model's ability to distinguish genuine cracks.

3. We propose CEAM for the global structure branch. CEAM optimizes the self-attention mechanism by employing an asymmetric attention structure and a dual-path feature enhancement mechanism. This enhances the model's ability to model the connectivity of the overall crack structure, thereby preventing continuous cracks from being broken into fragments.

Methodology

Fig. 1

Illustration of CSF-Net. The network consists of a dual-branch encoder (MSAR-based local texture branch and CETB-based global structure branch). The CFB integrates features from both branches, and its outputs are fed into the RFEB in the decoder. The decoder includes RFEBs and a skip connection.

Overall Architecture

The proposed CSF-Net adopts a dual-branch encoder architecture to simultaneously capture local crack details and global structural information. As shown in Fig. 1, the local texture branch uses MSAR modules to suppress texture interference and enhance crack details, while the global structure branch employs ConvBlocks and context-enhanced transformer blocks (CETBs) for global semantic modeling of crack continuity. Features from the two branches are fused via the CFB and then progressively refined in the decoder through Residual Feature Enhancement Blocks (RFEBs). Additionally, skip connections in shallow decoder layers prevent the loss of crack details caused by excessive abstraction of deep features.

Local Texture Branch

Crack-like textures often lead to false detection in segmentation. To mitigate this, we design the Multi-Scale Attention Residual (MSAR) module. As shown in Fig. 2, the MSAR module enhances the CrossNet \cite{ref29} architecture by integrating a crack-like filtering unit to improve genuine crack discrimination. The filtering unit generates a soft weight map $W \in [0,1]^{1\times H\times W}$ to suppress interference from crack-like textures. Let $F_{in} \in \mathbb{R}^{C \times H \times W}$ be the input feature map. The filtering process is defined as:

$$
\begin{aligned}
W &= \sigma\left( Sobel\left( Conv_{1\times 1}\left( f_{dir}^{avg}\left( F_{in} \right) \right) \right) \right) \\
F_{filter} &= W \odot f_{dir}^{avg}\left( F_{in} \right)
\end{aligned}
$$

Eqs. (1)-(2)

where $f_{dir}^{avg}$ denotes directional average pooling (1×5 or 5×1), $\sigma$ is the Sigmoid function, and $\odot$ indicates channel-wise multiplication, with the single-channel map $W$ applied to every channel of the pooled features.

The module employs 3×3 Depthwise Separable Convolutions (DSC) \cite{ref30} for efficient feature extraction. A residual connection combines the input and enhanced features, preserving essential crack details.
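As a rough illustration of the filtering process defined above, the following sketch applies the crack-like filtering unit to a single-channel map in plain Python. The zero padding, the stride of 1, the use of the Sobel gradient magnitude, and the scalars `conv_w` and `conv_b` that stand in for the learned 1×1 convolution are all assumptions made for illustration; the text does not specify them.

```python
import math


def avg_pool_1xk(x, k=5):
    """Horizontal (1 x k) average pooling, stride 1, zero padding (assumed)."""
    h, w = len(x), len(x[0])
    r = k // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            total = 0.0
            for d in range(-r, r + 1):
                if 0 <= j + d < w:
                    total += x[i][j + d]
            out[i][j] = total / k
    return out


def sobel_magnitude(x):
    """Gradient magnitude from the two 3x3 Sobel kernels, zero padding (assumed)."""
    kx = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
    ky = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]
    h, w = len(x), len(x[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            gx = gy = 0.0
            for di in range(3):
                for dj in range(3):
                    ii, jj = i + di - 1, j + dj - 1
                    if 0 <= ii < h and 0 <= jj < w:
                        gx += kx[di][dj] * x[ii][jj]
                        gy += ky[di][dj] * x[ii][jj]
            out[i][j] = math.hypot(gx, gy)
    return out


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def crack_like_filter(f_in, conv_w=1.0, conv_b=0.0, k=5):
    """Return (W, F_filter) for a single-channel H x W map."""
    pooled = avg_pool_1xk(f_in, k)                        # f_dir^avg(F_in)
    projected = [[conv_w * v + conv_b for v in row] for row in pooled]  # 1x1 conv stand-in
    edges = sobel_magnitude(projected)                    # Sobel(.)
    weight = [[sigmoid(v) for v in row] for row in edges]  # W
    filtered = [[wv * pv for wv, pv in zip(wr, pr)]       # W * pooled features
                for wr, pr in zip(weight, pooled)]
    return weight, filtered


# Toy input: a vertical line of ones in an otherwise empty 5 x 7 map.
f_in = [[0.0] * 7 for _ in range(5)]
for row in f_in:
    row[3] = 1.0
weight, filtered = crack_like_filter(f_in)
```

Because the gradient magnitude is non-negative, the weights in this sketch lie between 0.5 and 1; a signed Sobel response, which the paper may use instead, would cover the full range from 0 to 1.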

Fig. 2

Illustration of MSAR. The module contains three parallel branches: horizontal and vertical convolutions with crack-like filtering units to suppress texture interference. Features from the three branches are fused and combined with the input feature via a residual connection.

Global Structure Branch

To enhance global semantic modeling and crack continuity, we introduce the Context-Enhanced Transformer Block (CETB), whose optimized self-attention mechanism is inspired by FCHiLo \cite{ref31}. As depicted in Fig. 3(a), the CETB comprises a Context-Enhanced Attention Module (CEAM) and a lightweight Feedforward Network (FFN).

The CEAM employs an asymmetric structure to generate multi-scale contextual features for Queries ($Q$), Keys ($K$), and Values ($V$). Specifically, $Q$ is produced from the input feature map $X$ via a DSC, while $K$ and $V$ are derived from multi-scale average-pooled versions of $X$:

$$
\begin{aligned}
F_{g,2} &= f_{2\times 2}^{avg}\left( X \right), \quad F_{g,4} = f_{4\times 4}^{avg}\left( X \right) \\
F_{g} &= Concat\left( Align\left( F_{g,2} \right), Align\left( F_{g,4} \right) \right) \\
K/V &= ReLU\left( GN\left( DSC_{3\times 3}\left( F_{g} \right) \right) \right) \\
Q &= ReLU\left( GN\left( DSC_{3\times 3}\left( X \right) \right) \right)
\end{aligned}
$$

Eqs. (3)-(6)

where $f_{2\times 2}^{avg}$ and $f_{4\times 4}^{avg}$ denote 2×2 and 4×4 average pooling operations, respectively; $F_{g}$ represents the fused global features; $Align(\cdot)$ denotes feature map alignment; $Concat(\cdot)$ denotes concatenation along the channel dimension; $DSC_{3\times 3}(\cdot)$ represents a 3×3 depthwise separable convolution; $GN(\cdot)$ denotes Group Normalization; and $ReLU(\cdot)$ represents the activation function.
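The construction of the pooled global features can be sketched in plain Python on a single-channel map. The text does not define the alignment operation, so nearest-neighbour upsampling of the coarser map to the 2×2-pooled grid is used here purely as an assumed stand-in; the helper names are hypothetical.

```python
def avg_pool(x, k):
    """Non-overlapping k x k average pooling; H and W must be divisible by k."""
    h, w = len(x), len(x[0])
    return [
        [sum(x[i + a][j + b] for a in range(k) for b in range(k)) / (k * k)
         for j in range(0, w, k)]
        for i in range(0, h, k)
    ]


def upsample_nearest(x, factor):
    """Nearest-neighbour upsampling, an assumed stand-in for Align(.)."""
    return [[v for v in row for _ in range(factor)]
            for row in x for _ in range(factor)]


def global_context(x):
    """Return the two aligned maps that would be concatenated along channels."""
    f_g2 = avg_pool(x, 2)                 # F_{g,2}: H/2 x W/2
    f_g4 = avg_pool(x, 4)                 # F_{g,4}: H/4 x W/4
    aligned_g4 = upsample_nearest(f_g4, 2)  # bring F_{g,4} to the H/2 x W/2 grid
    return [f_g2, aligned_g4]


# Toy 8 x 8 input with values 0..63.
x = [[float(i * 8 + j) for j in range(8)] for i in range(8)]
f_g = global_context(x)

num_query_tokens = len(x) * len(x[0])        # Q keeps the full resolution
num_kv_tokens = len(f_g[0]) * len(f_g[0][0])  # K and V come from the pooled grid
```

Under this assumed alignment, the keys and values cover a quarter as many spatial positions as the queries, which is what makes the attention structure asymmetric and cheaper than full self-attention.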

Fig. 3

Illustration of the CETB. The block consists of the CEAM and FFN. The CEAM employs an asymmetric attention structure to generate $Q$, $K$, and $V$, followed by multi-head attention for global dependency modeling. A dual-path enhancement mechanism strengthens crack continuity and suppresses noise.

Furthermore, a dual-path feature enhancement mechanism is applied to the attention output. As shown in Fig. 3(d), one path produces a spatial weight map via a 1×1 convolution and a sigmoid function, whereas the other path extracts local features using a 3×3 convolution. Finally, the outputs of the two paths are multiplied, enabling the model to both capture crack continuity and suppress background noise under the weight constraints. The mechanism can be formulated in Eqs. (7)-(11).

$$
\begin{aligned}
head_{i} &= softmax\left( \frac{Q_{i} K_{i}^{T}}{\sqrt{d_{h}}} \right) V_{i} \\
F_{att} &= Conv_{1\times 1}\left( Concat\left( head_{1}, \dots, head_{H} \right) \right) \\
F_{1} &= \sigma\left( Conv_{1\times 1}\left( F_{att} \right) \right) \\
F_{2} &= ReLU\left( GN\left( Conv_{3\times 3}\left( F_{att} \right) \right) \right) \\
F_{out} &= F_{1} \odot F_{2}
\end{aligned}
$$

Eqs. (7)-(11)

where $head_{i}$ denotes the output of the $i$-th attention head, $d_{h}$ is the channel dimension of each head, $H$ here is the number of attention heads, $F_{att}$ represents the result after multi-head attention fusion, $\sigma$ denotes the Sigmoid function, and $Conv_{1\times 1}(\cdot)$ and $Conv_{3\times 3}(\cdot)$ represent convolution operations with kernel sizes of 1×1 and 3×3, respectively.
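To make the attention and gating steps concrete, here is a minimal single-head sketch in plain Python. It is a simplification under stated assumptions: the scalars `w_gate` and `w_feat` replace the learned 1×1 and 3×3 convolutions, Group Normalization is omitted, and tokens are represented as short lists, so it shows only the data flow and not the authors' implementation.

```python
import math


def softmax(row):
    m = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(q, k, v):
    """Scaled dot-product attention; q is N_q x d, k and v are N_kv x d."""
    d = len(q[0])
    out = []
    for q_i in q:
        scores = [sum(a * b for a, b in zip(q_i, k_j)) / math.sqrt(d) for k_j in k]
        weights = softmax(scores)
        out.append([sum(w * v_j[c] for w, v_j in zip(weights, v))
                    for c in range(len(v[0]))])
    return out


def dual_path(f_att, w_gate=1.0, w_feat=1.0):
    """F_out = F_1 * F_2 with scalar stand-ins for the two convolutions."""
    gate = [[1.0 / (1.0 + math.exp(-w_gate * x)) for x in row] for row in f_att]  # F_1
    feat = [[max(0.0, w_feat * x) for x in row] for row in f_att]                # F_2
    return [[g * f for g, f in zip(g_row, f_row)]
            for g_row, f_row in zip(gate, feat)]


# Four query tokens attend to two pooled key/value tokens (asymmetric lengths).
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
kv = [[1.0, 0.0], [0.0, 1.0]]
f_att = attention_head(q, kv, kv)
f_out = dual_path(f_att)
```

The gate only rescales the feature path, so in this sketch each output value is bounded by the corresponding feature value; this is the sense in which the weight map constrains the local features.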

The FFN also adopts the lightweight design of \cite{ref31}, as illustrated in Fig. 3(e).

Fig. 4

Illustration of the (a) CFB and (b) RFEB.

Feature Fusion and Refinement

To fully leverage the complementarity between the local texture branch and the global structure branch in the encoder, the CFB is designed to align and integrate features from both branches. As illustrated in Fig. 4(a), the CFB concatenates features along the channel dimension, followed by a 3×3 convolution with Group Normalization and ReLU activation for channel compression and nonlinear representation. An SE module \cite{ref32} is further introduced to highlight crack regions while suppressing irrelevant background responses. The fused features are then fed into the decoder.
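The channel reweighting performed by a squeeze-and-excitation module can be sketched as follows. The weight matrices `w1` and `w2` are hypothetical toy values standing in for the two learned fully connected layers, and biases are omitted; the sketch shows only the squeeze, excitation, and scaling steps.

```python
import math


def se_reweight(channels, w1, w2):
    """Squeeze-and-excitation on a list of C maps (each H x W).

    w1 has shape (C/r) x C and w2 has shape C x (C/r); both are toy stand-ins
    for learned fully connected layers.
    """
    # Squeeze: global average pooling per channel.
    squeezed = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
                for ch in channels]
    # Excitation: FC -> ReLU -> FC -> sigmoid.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    scales = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
              for row in w2]
    # Scale: multiply every channel by its scalar weight.
    scaled = [[[s * v for v in row] for row in ch]
              for s, ch in zip(scales, channels)]
    return scales, scaled


# Two 2 x 2 channels: one active, one empty.
channels = [[[1.0, 1.0], [1.0, 1.0]], [[0.0, 0.0], [0.0, 0.0]]]
w1 = [[1.0, 1.0]]        # reduce 2 channels to 1 hidden unit
w2 = [[2.0], [-2.0]]     # expand back to 2 channel weights
scales, scaled = se_reweight(channels, w1, w2)
```

With these toy weights the first channel is emphasized and the second is suppressed, which mirrors how the module is meant to highlight crack responses over background responses.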

In the decoder, the RFEB mitigates the progressive degradation of crack details and enhances the clarity and continuity of cracks. As shown in Fig. 4(b), the upsampled features are first processed by a residual block to preserve details and alleviate gradient vanishing, then concatenated with the corresponding CFB output. Subsequent convolutions and normalization refine the spatial features, improving edge clarity and structural continuity. Additionally, Dense Upsampling Convolution (DUC) is employed to replace conventional bilinear interpolation, preserving semantic information while better restoring fine crack structures.
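Dense Upsampling Convolution is commonly described as a convolution that predicts r×r values for every low-resolution position, followed by a rearrangement of those values onto the high-resolution grid. The sketch below shows only that rearrangement step for one output channel. The channel ordering follows the usual pixel-shuffle convention, and it is an assumption that the DUC used here follows the same layout.

```python
def pixel_shuffle(channels, r):
    """Rearrange r*r maps of size H x W into one map of size (r*H) x (r*W)."""
    if len(channels) != r * r:
        raise ValueError("expected r*r input channels")
    h, w = len(channels[0]), len(channels[0][0])
    out = [[0.0] * (w * r) for _ in range(h * r)]
    for c, ch in enumerate(channels):
        di, dj = divmod(c, r)  # sub-pixel offset encoded by the channel index
        for i in range(h):
            for j in range(w):
                out[i * r + di][j * r + dj] = ch[i][j]
    return out


# Four 1 x 2 channels become one 2 x 4 map when r = 2.
channels = [[[c * 10.0, c * 10.0 + 1.0]] for c in range(4)]
upsampled = pixel_shuffle(channels, 2)
```

Unlike bilinear interpolation, every high-resolution value here is predicted by the preceding convolution, so thin structures do not have to be recovered from a smoothed low-resolution map.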

Table 1
Comparison of precision, recall, F1, and mIoU of different models on the DeepCrack dataset
Model	Precision(%) $(\uparrow )$	Recall(%) $(\uparrow )$	F1(%) $(\uparrow )$	mIoU(%) $(\uparrow )$
Hybrid-Segmentor	79.69 ±3.37	53.62 ±3.13	55.87 ±0.77	43.98 ±0.87
U-Net	81.20 ±2.00	77.36 ±4.95	79.18 ±3.58	65.71 ±4.91
SA-UNet	85.62 ±0.69	73.51 ±1.77	77.32 ±0.80	64.41 ±0.98
UNet-FS	86.20 ±0.94	59.90 ±0.97	70.66 ±0.35	51.68 ±0.36
Ours	87.84 ±1.82	85.60 ±3.46	86.67 ±2.42	72.79 ±1.13

Experiment

Datasets and Experimental Setup

The experiments are conducted on three public crack datasets: DeepCrack \cite{ref33}, Crack500 \cite{ref34}, and CFD \cite{ref35}. To enhance the model's generalization ability, several data augmentation strategies are applied during training, including random horizontal and vertical flips, random cropping, and contrast enhancement. All training images are cropped into 224×224-pixel patches, where patches containing more than 1% crack pixels are treated as positive samples. A balanced sampling strategy is adopted to ensure sufficient participation of crack samples during training.
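The positive-patch rule and the balanced sampling can be sketched as below. The 1% threshold comes from the text; the equal split between positive and negative patches and the sampling with replacement are assumptions, since the exact balancing scheme is not described. A 10×10 toy mask is used instead of a 224×224 patch to keep the example readable.

```python
import random


def crack_ratio(mask_patch):
    """Fraction of pixels labelled as crack (1) in a binary mask patch."""
    total = sum(len(row) for row in mask_patch)
    return sum(sum(row) for row in mask_patch) / total


def is_positive(mask_patch, threshold=0.01):
    """A patch is positive when MORE than `threshold` of its pixels are crack."""
    return crack_ratio(mask_patch) > threshold


def balanced_sample(mask_patches, n, rng):
    """Draw n patches, half positive and half negative (assumed 50/50 split)."""
    positives = [p for p in mask_patches if is_positive(p)]
    negatives = [p for p in mask_patches if not is_positive(p)]
    half = n // 2
    chosen = [rng.choice(positives) for _ in range(half)]
    chosen += [rng.choice(negatives) for _ in range(n - half)]
    rng.shuffle(chosen)
    return chosen


def make_mask(num_crack_pixels, size=10):
    """Toy mask with the first `num_crack_pixels` pixels of row 0 set to 1."""
    mask = [[0] * size for _ in range(size)]
    for j in range(num_crack_pixels):
        mask[0][j] = 1
    return mask


patches = [make_mask(0), make_mask(1), make_mask(2), make_mask(5)]
batch = balanced_sample(patches, 6, random.Random(0))
```

In this toy setting a patch with exactly 1% crack pixels is still treated as negative, because the rule in the text requires more than 1%.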

Model performance is evaluated using four common metrics: precision, recall, F1-score, and mIoU. To minimize the impact of random factors on the experimental results, all models were independently trained five times using the same set of random seeds, and the mean and standard deviation are reported as the final results. All experiments were implemented in PyTorch on an NVIDIA GeForce RTX 2080 Ti GPU. The models were trained for 100 epochs using the Adam optimizer with a batch size of 1 and an initial learning rate of $1\times 10^{-4}$.
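For reference, the metrics can be computed from binary masks as in the following sketch. The text does not state how mIoU is averaged or which standard deviation estimator is used, so the crack-class IoU of a single mask pair and the sample standard deviation are assumptions; the five run scores are made-up placeholder values, not results from the paper.

```python
import statistics


def segmentation_metrics(pred, target):
    """Precision, recall, F1, and crack-class IoU for two binary masks."""
    tp = fp = fn = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                tp += 1
            elif p and not t:
                fp += 1
            elif t and not p:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return precision, recall, f1, iou


def mean_std(values):
    """Mean and sample standard deviation over repeated training runs."""
    return statistics.mean(values), statistics.stdev(values)


pred = [[1, 1, 0], [0, 1, 0]]
target = [[1, 0, 0], [0, 1, 1]]
precision, recall, f1, iou = segmentation_metrics(pred, target)

# Placeholder scores from five hypothetical runs.
run_scores = [0.70, 0.72, 0.74, 0.71, 0.73]
mean_score, std_score = mean_std(run_scores)
```

Reporting the mean together with the standard deviation, as in the tables, makes it possible to judge whether a gap between two models is larger than the run-to-run variation.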

Results and analysis

Four comparison models are evaluated, covering both general semantic segmentation models and crack-specific segmentation models: Hybrid-Segmentor \cite{ref36}, U-Net \cite{ref37}, Self-Attention-Based Efficient U-Net (SA-UNet) \cite{ref38}, and U-Net-based Fracture Segmentation Model (UNet-FS) \cite{ref39}.

Results on DeepCrack dataset

As shown in Table 1, CSF-Net achieves the best overall performance on the DeepCrack dataset, reaching 86.67% in F1-score and 72.79% in mIoU, which exceed those of SA-UNet by 9.35 and 8.38 percentage points, respectively. These results demonstrate that CSF-Net provides more accurate and stable crack segmentation in complex backgrounds.

To further analyze model performance on individual metrics, our model was compared against the best-performing baseline for precision and for recall. Specifically, its precision exceeds that of UNet-FS by 1.64 percentage points, indicating the model's superiority in suppressing false detection. Its recall improves by 8.24 percentage points over U-Net, showing the model's ability to preserve crack structural continuity.

Fig. 5(a)-(d) illustrates crack segmentation results on the DeepCrack dataset. Models such as Hybrid-Segmentor show weak responses to crack edges and tend to break continuous cracks into fragmented pieces. UNet-FS and SA-UNet struggle to distinguish real cracks from crack-like textures, resulting in degraded segmentation accuracy. In contrast, CSF-Net produces more complete cracks with clearer boundaries, effectively maintaining structural continuity.

Fig. 5

Comparative visualization of segmentation results on multiple datasets.

Table 2
Comparison of precision, recall, F1, and mIoU of different models on the CFD dataset
Model	Precision(%) $(\uparrow )$	Recall(%) $(\uparrow )$	F1(%) $(\uparrow )$	mIoU(%) $(\uparrow )$
Hybrid-Segmentor	62.21 ±1.86	41.20 ±0.29	45.10 ±0.63	31.80 ±0.56
U-Net	64.12 ±0.93	68.52 ±2.26	62.64 ±1.16	47.13 ±1.09
SA-UNet	65.03 ±0.93	70.72 ±0.56	67.74 ±0.74	51.65 ±0.23
UNet-FS	41.62 ±5.80	63.42 ±0.15	50.08 ±4.33	31.60 ±5.58
Ours	67.48 ±0.26	71.92 ±0.07	69.67 ±0.16	52.30 ±0.50

Table 3
Comparison of precision, recall, F1, and mIoU of different models on the Crack500 dataset
Model	Precision(%) $(\uparrow )$	Recall(%) $(\uparrow )$	F1(%) $(\uparrow )$	mIoU(%) $(\uparrow )$
Hybrid-Segmentor	64.71 ±7.15	40.75 ±7.87	39.83 ±0.90	29.04 ±0.38
U-Net	71.76 ±2.34	57.82 ±1.44	59.25 ±0.51	46.97 ±0.41
SA-UNet	81.45 ±0.38	67.27 ±0.52	71.43 ±0.19	57.75 ±0.11
UNet-FS	58.08 ±1.65	49.27 ±3.92	53.27 ±2.97	34.15 ±2.75
Ours	74.74 ±0.59	85.23 ±0.27	79.67 ±0.42	63.60 ±0.53

Results on CFD dataset

As shown in Table 2, CSF-Net achieves 67.48%, 71.92%, 69.67%, and 52.30% in precision, recall, F1-score, and mIoU on the CFD dataset, outperforming all comparison models across all metrics. As illustrated in Fig. 5(A)-(C), competing methods often struggle to accurately distinguish cracks from the background on this dataset, leading to missed or false detections. In contrast, CSF-Net effectively suppresses background interference and preserves crack continuity, producing clearer boundaries and more structurally complete predictions. These results further verify the robustness and cross-dataset generalization of the proposed model.

Results on Crack500 dataset

As shown in Table 3, CSF-Net achieves the highest recall, F1-score, and mIoU on the Crack500 dataset, reaching 85.23%, 79.67%, and 63.60%, respectively. Although its precision is 6.71 percentage points lower than that of SA-UNet (74.74% vs. 81.45%), CSF-Net improves recall, F1-score, and mIoU by 17.96, 8.24, and 5.85 percentage points over SA-UNet, indicating that our model maintains a clear advantage in overall performance. As illustrated in Fig. 5(I)-(III), Hybrid-Segmentor fails to capture complete crack structures, while SA-UNet misses crack details. U-Net and UNet-FS tend to misclassify crack-like textures as cracks. In contrast, our model achieves the best performance among all competing methods, demonstrating its robustness and strong generalization capability.

Discussion

To evaluate the effectiveness of each module in CSF-Net and the model's practical application potential, we conducted ablation experiments and model complexity analysis on the DeepCrack dataset.

Table 4
Ablation experiments of CSF-Net on the DeepCrack dataset
Model	Precision(%) $(\uparrow )$	Recall(%) $(\uparrow )$	F1(%) $(\uparrow )$	mIoU(%) $(\uparrow )$
Net1¹	85.80	82.94	84.35	69.50
Net2²	88.34	82.69	85.42	71.75
Net3³	84.08	85.64	84.85	71.86
Net4⁴	83.32	84.53	83.92	72.13
Ours	88.82	87.64	88.23	73.22
¹MSAR removed and replaced by max-pooling (2 × 2, stride 2).
²CEAM replaced by a standard self-attention mechanism.
³CFB replaced by a simple convolution-based fusion operation.
⁴RFEB removed and replaced by bilinear upsampling.

Ablation experiments

To assess the contribution of each component, we performed ablation experiments by replacing or removing key components of the network while keeping all other parameter settings fixed. As shown in Table 4, the complete model achieved the best results across all metrics, with precision, recall, F1-score, and mIoU of 88.82%, 87.64%, 88.23%, and 73.22%, respectively. After removing MSAR (Net1), precision and mIoU decreased from 88.82% and 73.22% to 85.80% and 69.50%, respectively. When the CEAM was replaced with a standard self-attention mechanism (Net2), recall and mIoU dropped by 4.95 and 1.47 percentage points, respectively. In addition, replacing the CFB with a simple convolution-based fusion (Net3) resulted in varying degrees of decline across all metrics. Finally, removing the RFEB (Net4) resulted in a decrease of 5.50 percentage points in precision and 1.09 percentage points in mIoU.

Overall, all proposed modules contribute positively to CSF-Net, and the collaborative interaction among components effectively enhances the model's overall performance.
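The per-module drops can be recomputed directly from Table 4. The sketch below copies the table values and subtracts each ablated variant from the full model; the function and variable names are illustrative only.

```python
METRICS = ("precision", "recall", "f1", "miou")

# Values copied from Table 4 (percent).
TABLE4 = {
    "Net1": (85.80, 82.94, 84.35, 69.50),  # MSAR removed
    "Net2": (88.34, 82.69, 85.42, 71.75),  # CEAM -> standard self-attention
    "Net3": (84.08, 85.64, 84.85, 71.86),  # CFB -> simple conv fusion
    "Net4": (83.32, 84.53, 83.92, 72.13),  # RFEB removed
    "Ours": (88.82, 87.64, 88.23, 73.22),
}


def drops_vs_full(table, full="Ours"):
    """Percentage-point drop of every ablated variant relative to the full model."""
    reference = table[full]
    return {
        name: {m: round(ref - val, 2) for m, ref, val in zip(METRICS, reference, row)}
        for name, row in table.items()
        if name != full
    }


drops = drops_vs_full(TABLE4)
for name, per_metric in drops.items():
    print(name, per_metric)
```

Listing the drops side by side makes it easy to see which metric each module affects most, for example precision for MSAR and recall for CEAM.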

Model Complexity Analysis

We conducted a model complexity analysis using three metrics, namely Params (M), GFLOPs, and FPS, to evaluate the computational efficiency and practical deployment potential. As shown in Table 5, although CSF-Net integrates multiple modules, it still maintains relatively low parameter counts and computational costs (22.88 M parameters and 30.14 GFLOPs). Compared with the lightweight SA-UNet (10.74 M parameters, 4.18 GFLOPs, 36.67 FPS), our model achieves higher segmentation accuracy at a higher computational cost, while running at a comparable speed of 34.36 FPS. In contrast to the high-complexity Hybrid-Segmentor (237.86 M parameters, 314.25 GFLOPs), our model maintains higher segmentation accuracy while reducing both parameter size and computational cost by roughly a factor of ten.

Table 5
Model complexity and inference speed comparison
Model	Params(M)	GFLOPs	FPS
Hybrid-Segmentor	237.86	314.25	10.96
U-Net	31.04	54.74	19.79
SA-UNet	10.74	4.18	36.67
UNet-FS	31.88	56.21	21.27
Ours	22.88	30.14	34.36
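The measurement protocol behind these numbers is not described, so the following is only a generic sketch of how such figures are typically obtained. It counts the weights of a standard convolution and of a depthwise separable convolution, and times a callable to estimate FPS. The 64-channel layer and the dummy `infer` callable are hypothetical examples, not part of CSF-Net.

```python
import time


def conv_params(c_in, c_out, k, bias=False):
    """Weights of a standard k x k convolution."""
    return k * k * c_in * c_out + (c_out if bias else 0)


def dsc_params(c_in, c_out, k, bias=False):
    """Weights of a depthwise k x k convolution followed by a 1 x 1 pointwise one."""
    depthwise = k * k * c_in + (c_in if bias else 0)
    pointwise = c_in * c_out + (c_out if bias else 0)
    return depthwise + pointwise


def measure_fps(infer, n_runs=50, warmup=5):
    """Average number of calls per second of `infer` after a short warm-up."""
    for _ in range(warmup):
        infer()
    start = time.perf_counter()
    for _ in range(n_runs):
        infer()
    elapsed = max(time.perf_counter() - start, 1e-12)  # guard against a zero interval
    return n_runs / elapsed


standard = conv_params(64, 64, 3)   # hypothetical 64 -> 64 channel layer
separable = dsc_params(64, 64, 3)
ratio = standard / separable

fps = measure_fps(lambda: sum(range(1000)))  # dummy workload in place of a model
```

When timing a real model on a GPU, the device has to be synchronized before each clock reading, otherwise asynchronous kernel launches make the measured FPS look higher than it is.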

Conclusion

In this article, we addressed two major challenges in crack segmentation: false detections caused by crack-like texture interference and missed detections resulting from insufficient modeling of crack continuity. To overcome these issues, we proposed CSF-Net, which integrates local texture features and global structural information through a dual-branch encoder design. The MSAR module enhances the model's response to cracks and reduces false detections caused by crack-like interference, while the CEAM effectively addresses discontinuity and blurring in the predictions. Furthermore, the CFB merges features across branches to balance detail preservation and structural continuity, and the RFEB further enhances edge clarity and structural integrity during the decoding stage.

Extensive experiments on three public benchmarks demonstrate that CSF-Net achieves superior performance, significantly reducing both false detection and discontinuities in crack predictions. The results confirm the model's robustness, accuracy, and strong generalization capability across diverse scenarios. CSF-Net thus presents a potent solution for practical engineering applications. Future work will focus on developing lightweight variants to enhance real-time performance for on-site deployment.

Declarations

Funding: Not applicable.

Competing Interests: The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Ethics Approval: Not applicable.

Consent to Participate: Not applicable.

Consent for Publication: Not applicable.

Data Availability: Data associated with this research are available and can be obtained by contacting the corresponding author.

Materials Availability: Not applicable.

Code Availability: Not applicable.

Author Contributions: J.L. designed the research framework, developed the CSF-Net model, and conducted the experiments. Y.L. contributed to methodology refinement, result validation, and manuscript revision. Y.T.L. prepared Figures 1–5 and assisted with data preprocessing and visualization. All authors discussed the results and reviewed the final manuscript.

References:

Zumrawi, Magdi ME (2016) Investigating causes of pavement deterioration in Khartoum State. Int J Civ Eng Technol 7(2): 203--214

Shi, Yong and Cui, Limeng and Qi, Zhiquan and Meng, Fan and Chen, Zhensong (2016) Automatic road crack detection using random structured forests. IEEE Transactions on Intelligent Transportation Systems 17(12): 3434--3445 IEEE

Tsai, Yi-Chang and Chatterjee, Anirban (2017) Comprehensive, quantitative crack detection algorithm performance evaluation system. Journal of Computing in Civil Engineering 31(5): 04017047 American Society of Civil Engineers

Salman, Muhammad and Mathavan, Senthan and Kamal, Khurram and Rahman, Mujib (2013) Pavement crack detection using the Gabor filter. IEEE, 2039--2044, 16th international IEEE conference on intelligent transportation systems (ITSC 2013)

Hu, Yong and Zhao, Chun-xia (2010) A novel LBP based methods for pavement crack detection. Journal of pattern Recognition research 5(1): 140--147 Journal of Pattern Recognition Research

Jiang, Chenglong and Tsai, Yichang James (2016) Enhanced crack segmentation algorithm using 3D pavement data. Journal of Computing in Civil Engineering 30(3): 04015050 American Society of Civil Engineers

Gopalakrishnan, Kasthurirangan and Khaitan, Siddhartha K and Choudhary, Alok and Agrawal, Ankit (2017) Deep convolutional neural networks with transfer learning for computer vision-based data-driven pavement distress detection. Construction and building materials 157: 322--330 Elsevier

Panella, Fabio and Lipani, Aldo and Boehm, Jan (2022) Semantic segmentation of cracks: Data challenges and architecture. Automation in Construction 135: 104110 Elsevier

Kheradmandi, Narges and Mehranfar, Vida (2022) A critical review and comparative study on image segmentation-based techniques for pavement crack detection. Construction and Building Materials 321: 126162 Elsevier

Liu, Chuanqi and Zhu, Chengguang and Xia, Xuan and Zhao, Jiankang and Long, Haihui (2022) FFEDN: Feature fusion encoder decoder network for crack detection. IEEE Transactions on Intelligent Transportation Systems 23(9): 15546--15557 IEEE

Qu, Zhong and Chen, Wen and Wang, Shi-Yan and Yi, Tu-Ming and Liu, Ling (2021) A crack detection algorithm for concrete pavement based on attention mechanism and multi-features fusion. IEEE Transactions on Intelligent Transportation Systems 23(8): 11710--11719 IEEE

Long, Jonathan and Shelhamer, Evan and Darrell, Trevor (2015) Fully convolutional networks for semantic segmentation. 3431--3440, Proceedings of the IEEE conference on computer vision and pattern recognition

Ronneberger, Olaf and Fischer, Philipp and Brox, Thomas (2015) U-net: Convolutional networks for biomedical image segmentation. Springer, 234--241, International Conference on Medical image computing and computer-assisted intervention

Ren, Yupeng and Huang, Jisheng and Hong, Zhiyou and Lu, Wei and Yin, Jun and Zou, Lejun and Shen, Xiaohua (2020) Image-based concrete crack detection in tunnels using deep fully convolutional networks. Construction and Building Materials 234: 117367 Elsevier

Li, Yongshang and Ma, Ronggui and Liu, Han and Cheng, Gaoli (2023) Real-time high-resolution neural network with semantic guidance for crack segmentation. Automation in Construction 156: 105112 Elsevier

Han, Chengjia and Ma, Tao and Huyan, Ju and Huang, Xiaoming and Zhang, Yanning (2021) CrackW-Net: A novel pavement crack image segmentation convolutional neural network. IEEE Transactions on Intelligent Transportation Systems 23(11): 22135--22144 IEEE

Liu, Fangyu and Wang, Linbing (2022) UNet-based model for crack detection integrating visual explanations. Construction and Building Materials 322: 126265 Elsevier

Zou, Qin and Zhang, Zheng and Li, Qingquan and Qi, Xianbiao and Wang, Qian and Wang, Song (2018) Deepcrack: Learning hierarchical convolutional features for crack detection. IEEE transactions on image processing 28(3): 1498--1512 IEEE

Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina (2018) BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805

Yang, Lei and Bai, Suli and Liu, Yanhong and Yu, Hongnian (2023) Multi-scale triple-attention network for pixelwise crack segmentation. Automation in Construction 150: 104853 Elsevier

Shi, Lin and Zhang, Ruijun and Wu, Yafeng and Cui, Dongyan and Yuan, Na and Liu, Jinyun and Ji, Zhanlin (2024) AHC-Net: a road crack segmentation network based on dual attention mechanism and multi-feature fusion. Signal, Image and Video Processing 18(6): 5311--5322 Springer

Sun, Xinzi and Xie, Yuanchang and Jiang, Liming and Cao, Yu and Liu, Benyuan (2022) DMA-Net: DeepLab with multi-scale attention for pavement crack segmentation. IEEE Transactions on Intelligent Transportation Systems 23(10): 18392--18403 IEEE

Srinivas, Aravind and Lin, Tsung-Yi and Parmar, Niki and Shlens, Jonathon and Abbeel, Pieter and Vaswani, Ashish (2021) Bottleneck transformers for visual recognition. 16519--16529, Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Houlsby, Neil and Weissenborn, Dirk (2020) Transformers for image recognition at scale. Online: https://ai.googleblog.com/2020/12/transformers-for-image-recognition-at.html

Wang, Jing and Yao, Haizhou and Hu, Jinbin and Ma, Yafei and Wang, Jin (2025) Dual-encoder network for pavement concrete crack segmentation with multi-stage supervision. Automation in Construction 169: 105884 Elsevier

Zhang, Zaiyan and Zhuang, Yangyang and Song, Weidong and Wu, Jiachen and Ye, Xin and Zhang, Hongyue and Xu, Yanli and Shi, Guoli (2025) ISTD-CrackNet: Hybrid CNN-transformer models focusing on fine-grained segmentation of multi-scale pavement cracks. Measurement 251: 117215 Elsevier

Zhang, Jianming and Li, Dianwen and Zeng, Zhigao and Zhang, Rui and Wang, Jin (2025) Dual-branch crack segmentation network with multi-shape kernel based on convolutional neural network and Mamba. Engineering Applications of Artificial Intelligence 150: 110536 Elsevier

Chollet, François (2017) Xception: Deep learning with depthwise separable convolutions. 1251--1258, Proceedings of the IEEE conference on computer vision and pattern recognition

Wang, Jin and Zeng, Zhigao and Sharma, Pradip Kumar and Alfarraj, Osama and Tolba, Amr and Zhang, Jianming and Wang, Lei (2024) Dual-path network combining CNN and transformer for pavement crack segmentation. Automation in Construction 158: 105217 Elsevier

Hu, Jie and Shen, Li and Sun, Gang (2018) Squeeze-and-excitation networks. 7132--7141, Proceedings of the IEEE conference on computer vision and pattern recognition

Liu, Yahui and Yao, Jian and Lu, Xiaohu and Xie, Renping and Li, Li (2019) DeepCrack: A deep hierarchical feature learning architecture for crack segmentation. Neurocomputing 338: 139--153 Elsevier

Yang, Fan and Zhang, Lei and Yu, Sijia and Prokhorov, Danil and Mei, Xue and Ling, Haibin (2019) Feature pyramid and hierarchical boosting network for pavement crack detection. IEEE Transactions on Intelligent Transportation Systems 21(4): 1525--1535 IEEE

Goo, June Moh and Milidonis, Xenios and Artusi, Alessandro and Boehm, Jan and Ciliberto, Carlo (2025) Hybrid-Segmentor: Hybrid approach for automated fine-grained crack segmentation in civil infrastructure. Automation in Construction 170: 105960 Elsevier

Gupta, Shreyansh and Shrivastwa, Shivam and Kumar, Sunny and Trivedi, Ashutosh (2023) Self-attention-based efficient U-Net for crack segmentation. Springer, 103--114, Computer Vision and Robotics: Proceedings of CVR 2022

Byun, Hoon and Kim, Jineon and Yoon, Dongyoung and Kang, Il-Seok and Song, Jae-Joon (2021) A deep convolutional neural network for rock fracture image segmentation. Earth science informatics 14(4): 1937--1951 Springer
