USF-Net: U-shaped Mamba Dehazing Network Enhanced by Wavelet Features and Spatial Position Fusion
Longxiang Fang1* and Wang Jiang1
1 Department of Statistics, Anhui Normal University, Wuhu, 241002, Anhui, China.
*Corresponding author(s). E-mail(s): lxfang@fudan.edu.cn; Contributing authors: wangj@ahnu.edu.cn;
† These authors contributed equally to this work.
Abstract
In this paper, we employ wavelet transform analysis to reveal a wavelet degradation prior for haze. Through this prior, we demonstrate that haze-related information primarily resides in low-frequency components, while its effect on high-frequency components manifests mainly as edge blurring and texture detail attenuation. Leveraging this insight, we propose a novel dehazing framework, USF-Net, which decouples the ill-posed image dehazing problem into two subtasks: position-guided channel-selective enhancement and spatial-aware multi-scale feature extraction. Specifically, we integrate Mamba blocks with spatially informed channel-weighted attention modules, achieving global structure reconstruction with linear complexity while effectively fusing spatial and channel information. Additionally, we incorporate position-aware multi-receptive-field attention modules to efficiently aggregate multi-scale spatial features, enhancing the network’s perception of spatial structures under varying haze concentrations. This design improves both local detail fidelity and global semantic understanding. Extensive experimental results demonstrate that our method achieves state-of-the-art performance on both synthetic datasets and real-world hazy images.
Keywords: Image Dehazing, Wavelet Degradation Prior, Spatial Information, Multi-scale Feature Fusion
1 Introduction
The presence of haze severely attenuates image information and degrades image quality. It not only impairs human visual perception by reducing image clarity, distorting color, and blurring details, but also adversely affects the accuracy of advanced computer vision tasks such as face recognition, image segmentation, and autonomous driving. Research on image dehazing is therefore important for improving image quality and for enhancing the robustness and practicality of vision systems.
Traditional image dehazing methods fall into two groups: those based on image enhancement and those relying on the atmospheric scattering model. The former leverage image enhancement techniques [1–5] to improve image contrast and clarity, thereby enhancing visual quality, but they do not address the physical formation mechanism of haze. The latter build upon the atmospheric scattering model [6], utilizing priors such as the dark channel prior [6] and the color attenuation prior [7] to estimate the transmission map and atmospheric light. Although these physics-based methods produce more physically plausible results, they often generalize poorly to complex scenarios and tend to introduce artifacts or lose fine details.
With the emergence of synthetic datasets and rapid advancements in deep learning, image dehazing methods have progressively shifted toward end-to-end deep neural network models [8–14], primarily encompassing convolutional neural network (CNN)-based and Transformer-based architectures. These approaches learn a direct mapping from hazy to clear images and have achieved notable progress in visual quality. However, several challenges persist. On one hand, CNN-based methods are constrained by their limited receptive fields and struggle to model multi-scale structural information in images. On the other hand, while Transformers [15] can capture global features through multi-head attention, their high computational cost and resource demands hinder practical deployment. More efficient techniques are therefore needed for image dehazing. Recently, visual state space models such as Mamba [16] have garnered significant attention because they model long-range dependencies with linear complexity, and they have performed strongly across various vision tasks, including image segmentation [17]. Nevertheless, research on Mamba for image dehazing remains at an early stage, with substantial room for improvement in both dehazing performance and model adaptability, which makes it a promising research direction.
Deep learning-based image dehazing methods have achieved notable success in recent years. However, the inherent ill-posed nature of image dehazing remains a challenging problem. Modeling and analyzing the physical characteristics of the haze degradation process can enhance the network’s perception of useful image information, thereby facilitating more efficient and robust dehazing. Nevertheless, few existing approaches explicitly exploit these physical properties, which increases the risk of overfitting. From this perspective, we introduce the wavelet transform to decompose an image into low-frequency subbands representing structural information and high-frequency subbands encoding textural details through multi-scale and time-frequency localized decomposition, as illustrated in Fig. 3. Inspired by this observation and prior works [18–23], we decompose the task into two subtasks: position-guided channel attention enhancement and spatial-aware multi-receptive-field feature extraction. This decomposition enables the network to aggregate effective information, thereby achieving high-fidelity image restoration.
This paper proposes USF-Net, a novel wavelet degradation prior-guided visual Mamba framework for image dehazing. The framework sequentially integrates UVM-Net and PosSE modules during the encoding stage: the former leverages state space modeling (SSM) to capture spatial sequential dependencies for efficient global feature extraction, while the latter incorporates position awareness through a channel attention mechanism to generate position-sensitive channel weights, thereby enhancing dehazing accuracy and flexibility in non-uniform haze scenarios. The decoding stage employs a symmetric upsampling path to progressively restore spatial resolution, supplemented by skip connections to incorporate shallow-level details. To further improve performance in complex haze conditions and detail recovery, we introduce the PosASPP module for multi-scale spatial position awareness, which enhances modeling of global context, depth variations, and spatial hierarchical information. This enables more accurate estimation of transmission maps and atmospheric light, ultimately reconstructing haze-free images with sharper edges and a more visually natural appearance. Figure 1 presents a comparative analysis between our USF-Net and state-of-the-art approaches in terms of both performance and computational complexity. The main contributions of this work are summarized as follows:
• We propose USF-Net, a novel image dehazing architecture that integrates the UVM-Net Block, PosSE, and PosASPP modules. By leveraging the physical properties of haze degradation, the dehazing process is decoupled into two subtasks, enabling stepwise, precise restoration.
• In the structural recovery phase, we design a multi-layer structural perception encoding unit, where each unit consists of a UVM-Net Block and a PosSE module in series. This configuration enhances spatial sequence modeling through state-space modeling, while incorporating channel dependency and positional information to strengthen effective feature representation and suppress regions influenced by atmospheric light.
• To further refine the recovery process, we introduce a multi-scale integration module based on CoordASPP at the lowest level. By capturing the multi-scale perception of the non-uniform haze distribution, this module facilitates the distinction between near and distant haze layers. Moreover, the multi-scale features comprehensively model haze structures, enabling accurate estimation of the transmission map. Consequently, the reconstructed images are more authentic, natural, and visually appealing.
[Figure 2 diagram. Labels recoverable from the figure: input H,W,3; feature sizes H/2,W/2,C; H/4,W/4,2C; H/8,W/8,4C; H/16,W/16,8C; H/32,W/32,16C; PosSE blocks; skip connections.]
Figure 2 USF-Net Architecture Overview. USF-Net employs an encoder-decoder framework, where the hazy image is initially decomposed via a wavelet transform. The decomposed features are then fed into the encoder, which progressively processes them through a series of interconnected UVM-Net and PosSE modules to efficiently aggregate spatial and channel information with linear complexity. Skip connections are incorporated to directly transmit shallow features to the decoding stage. In the decoder, a CoordAspp block is integrated following the initial convolutional layers, enabling gradual restoration of spatial resolution while effectively fusing multi-scale spatial information.
2 Related work
Single-image dehazing methods can generally be categorised as based on image enhancement, prior knowledge, or deep learning. Among image enhancement-based approaches, Retinex theory and its derivative algorithms have been highly influential. Land et al. [5] first proposed the Retinex perceptual model, which simulates stable colour perception in the human visual system under varying illumination conditions. Building on this, Jobson et al. [3] introduced the multiscale Retinex algorithm to simulate how the human visual system adapts to different levels of illumination. Rahman et al. [4] further refined the multiscale filter to overcome the limitations of single-scale Retinex. Furthermore, Seow et al. [1] proposed an image enhancement method that combines homomorphic filtering with colour ratio rules to achieve contrast enhancement and colour preservation through frequency-domain filtering and colour correction. Stark et al. [2] introduced a generalised histogram equalisation framework that adaptively adjusts image contrast through a parameterised mapping function and normalisation constraints. However, image enhancement methods generally suffer from poor generalisation, a lack of rigorous physical modelling, and inadequate noise handling, which limits their effectiveness in practical applications.
Physical model-based methods with prior knowledge. He et al. [6] proposed the widely recognised Dark Channel Prior (DCP) method, which uses the atmospheric scattering model for image restoration. However, it performs poorly in sky regions, where it is prone to colour shifts and excessive darkening. Zhu et al. [7] introduced the Color Attenuation Prior (CAP), which posits that the difference between brightness and saturation in a hazy image is positively correlated with scene depth. However, this assumption often fails under complex lighting or non-uniform haze conditions, leading to inaccurate depth estimation.
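To make the prior-based pipeline concrete, the following is a minimal NumPy sketch of the atmospheric scattering model and a DCP-style transmission estimate. It is an illustration under stated assumptions, not the implementation of [6]: the function names are hypothetical, the atmospheric light is assumed to be known, the patch size is kept small for brevity, and the refinement step used in practice is omitted.

```python
import numpy as np

def dark_channel(img, patch=3):
    """Minimum over the colour channels, then minimum over a patch x patch window."""
    h, w, _ = img.shape
    r = patch // 2
    min_rgb = img.min(axis=2)
    padded = np.pad(min_rgb, r, mode="edge")
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = padded[i:i + patch, j:j + patch].min()
    return out

def add_haze(clear, t, airlight):
    """Atmospheric scattering model: I = J * t + A * (1 - t)."""
    return clear * t[..., None] + airlight * (1.0 - t[..., None])

def estimate_transmission(hazy, airlight, omega=0.95, patch=3):
    """DCP-style estimate: t = 1 - omega * dark_channel(I / A)."""
    return 1.0 - omega * dark_channel(hazy / airlight, patch)

def recover(hazy, t, airlight, t_min=0.1):
    """Invert the scattering model, clamping t to avoid division by small values."""
    t = np.maximum(t, t_min)[..., None]
    return (hazy - airlight) / t + airlight
```

With the true transmission, `recover` inverts `add_haze` exactly. The estimate from `estimate_transmission` is only reliable where the clear image really has a near-zero dark channel, which is why the prior breaks down in sky regions.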
Deep learning-based dehazing techniques have evolved to achieve end-to-end haze-free image restoration through convolutional neural networks (CNNs) or Transformer architectures. Cai et al. [8] pioneered modelling the dehazing problem as an end-to-end learning task in DehazeNet, enabling CNNs to learn the image transmission mapping autonomously, without manually designed prior knowledge. In AOD-Net, Li et al. [11] reformulated the atmospheric scattering model equations, enabling the network to directly reconstruct haze-free images rather than estimating intermediate transmission values. Qin et al. [24] proposed FFA-Net, which integrates feature fusion attention mechanisms. It significantly enhances structural recovery and detail preservation by applying adaptive weighting to features through channel-level and pixel-level attention. Dong et al. [9] addressed non-uniform haze distribution by proposing the Multi-Scale Boosting Dehazing Network (MSBDN), which is based on image pyramids and produces haze-free images through progressive feature extraction and reconstruction. Wu et al. [25] developed AECR-Net based on the U-Net architecture. By modelling hazy images through an encoder-decoder structure, AECR-Net enhances the network’s ability to process low-contrast regions and blurred structures, thereby improving the clarity and naturalness of the final results.
Following the remarkable progress of CNN- and Transformer-based deep learning methods for image dehazing, the recently emerging State Space Models (SSMs) [26, 27] offer a novel alternative for efficient long-range dependency modeling. Originating from classical control theory and signal processing, SSMs aim to mathematically represent the evolution of system states over time. By leveraging their recursive structure and historical memory mechanism, SSMs have become an efficient and interpretable approach for addressing long-sequence dependency problems. Gu et al. [28] proposed a selective state space model (Mamba) capable of dynamically adjusting its parameters to implement content-aware retention and forgetting. Mamba also incorporates a hardware-aware parallel scanning algorithm, which gives it linear computational complexity during inference, faster processing than the Transformer, and a stronger capability for handling long sequences. In this paper, we adapt the Mamba model for image dehazing tasks.
3 Approach
In contrast to most existing dehazing methods, the proposed model integrates the degradation priors of wavelet decomposition, channel information, and multi-scale spatial features, while combining the complementary advantages of Mamba and CNN architectures. This integration enables the model to capture global contextual relationships, crucial fine-grained details, and spatial structural awareness, thereby enhancing both the accuracy and visual fidelity of image dehazing. In this section, we first introduce the essential preliminary knowledge, and subsequently elaborate on the overall pipeline and detailed architectural components of the proposed model.
3.1 Preliminaries
State Space Models (SSMs). State Space Models utilize state variables to describe a system’s internal state, and employ state equations and output equations to characterize the system’s evolution and outputs. In [21, 27], this system is expressed mathematically as a linear differential equation:
$$ a'(t) = X a(t) + Y b(t) \tag{1} $$
$$ c(t) = Z a(t) + W b(t) \tag{2} $$
where $b(t)$ and $c(t)$ represent the system input and output respectively, and $a(t)$ denotes the current state of the system. Matrices $X$, $Y$, $Z$, and $W$ are system parameters that characterize the relationships between these variables.
A continuous-time system of this form cannot be applied directly to discrete input sequences. To address this limitation, the system is discretized with the zero-order hold (ZOH) technique, which samples the input at a constant rate and holds each sample over the sampling interval [21, 29]. The process can be written as follows:
$$ a_t = \bar{X} a_{t-1} + \bar{Y} b_t \tag{3} $$
$$ c_t = Z a_t + W b_t \tag{4} $$
$$ \bar{X} = e^{X\Delta} \tag{5} $$
$$ \bar{Y} = X^{-1}\left(e^{X\Delta} - I\right) Y \tag{6} $$
where $\Delta$ denotes a learnable parameter representing the input resolution (the sampling step). The matrices $\bar{X}$ and $\bar{Y}$ are the discrete-form counterparts of matrices $X$ and $Y$, respectively.
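The discretization in Eqs. (5) and (6) and the recurrence in Eqs. (3) and (4) can be sketched in a few lines of NumPy. This is a simplified illustration rather than the Mamba implementation: it assumes a diagonal state matrix $X$ (so the matrix exponential reduces to an element-wise one), a fixed rather than input-dependent step $\Delta$, and a plain sequential loop instead of a parallel scan. The function names are hypothetical.

```python
import numpy as np

def zoh_discretize(x_diag, Y, delta):
    """ZOH discretization for a diagonal state matrix X = diag(x_diag).

    x_diag: (N,) non-zero diagonal entries, Y: (N, D_in), delta: scalar step.
    """
    x_bar = np.exp(x_diag * delta)                    # Eq. (5), element-wise for diagonal X
    y_bar = ((x_bar - 1.0) / x_diag)[:, None] * Y     # Eq. (6): X^{-1} (e^{X delta} - I) Y
    return x_bar, y_bar

def ssm_scan(x_bar, y_bar, Z, W, b):
    """Run a_t = X_bar a_{t-1} + Y_bar b_t and c_t = Z a_t + W b_t over a sequence.

    Z: (D_out, N), W: (D_out, D_in), b: (T, D_in). Returns c with shape (T, D_out).
    """
    a = np.zeros(x_bar.shape[0])
    outputs = []
    for b_t in b:
        a = x_bar * a + y_bar @ b_t                   # Eq. (3)
        outputs.append(Z @ a + W @ b_t)               # Eq. (4)
    return np.stack(outputs)
```

Each step costs a fixed amount of work, so the whole scan is linear in the sequence length. For a constant input, the discrete state converges to the steady state of the continuous system, $-X^{-1} Y b$, which is a convenient sanity check for the discretization.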
Fig. 3
Wavelet domain feature decomposition and haze degradation characteristic analysis. The image is decomposed into the low-frequency subband LL and the high-frequency subbands LH, HL, and HH, demonstrating the distribution of haze effects across different frequency-domain features.
Discrete Wavelet Transform (DWT). As illustrated in Fig. 3, given an RGB image $I \in \mathbb{R}^{H \times W \times C}$, the Haar wavelet transform [30] decomposes it into four subbands: LL (low-low), LH (low-high), HL (high-low), and HH (high-high):
$$ \{cA, cH, cV, cD\} = \mathrm{DWT}(I) \tag{7} $$
Here, $cA \in \mathbb{R}^{(H/2) \times (W/2) \times C}$ denotes the low-frequency component, which mainly contains the global structure and brightness information of the image, while $cH, cV, cD \in \mathbb{R}^{(H/2) \times (W/2) \times C}$ represent the high-frequency details in the horizontal, vertical, and diagonal directions, respectively, encoding local information such as textures and edges. These four frequency subbands can be recombined through the inverse wavelet transform to reconstruct the original image:
$$ I = \mathrm{IWT}(cA, cH, cV, cD) \tag{8} $$
It is worth noting that the wavelet decomposition and reconstruction process is a lossless operation.
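The lossless property can be checked with a short NumPy sketch of a single-level 2D Haar transform and its inverse. This is a minimal illustration assuming an orthonormal Haar filter and even height and width. The function names are hypothetical, and sign and orientation conventions for cH and cV differ between wavelet libraries.

```python
import numpy as np

def haar_dwt2(img):
    """Single-level 2D Haar DWT of an (H, W, C) array with even H and W."""
    a = img[0::2, 0::2]   # top-left pixel of each 2x2 block
    b = img[0::2, 1::2]   # top-right
    c = img[1::2, 0::2]   # bottom-left
    d = img[1::2, 1::2]   # bottom-right
    cA = (a + b + c + d) / 2.0   # low-frequency approximation
    cH = (a + b - c - d) / 2.0   # difference between the two rows of the block
    cV = (a - b + c - d) / 2.0   # difference between the two columns of the block
    cD = (a - b - c + d) / 2.0   # diagonal difference
    return cA, cH, cV, cD

def haar_iwt2(cA, cH, cV, cD):
    """Inverse of haar_dwt2: rebuilds the (2H', 2W', C) array from four subbands."""
    h, w = cA.shape[:2]
    out = np.empty((2 * h, 2 * w) + cA.shape[2:], dtype=cA.dtype)
    out[0::2, 0::2] = (cA + cH + cV + cD) / 2.0
    out[0::2, 1::2] = (cA + cH - cV - cD) / 2.0
    out[1::2, 0::2] = (cA - cH + cV - cD) / 2.0
    out[1::2, 1::2] = (cA - cH - cV + cD) / 2.0
    return out
```

Each subband has half the height and width of the input, matching the shapes in Eq. (7). Applying `haar_iwt2` to the output of `haar_dwt2` returns the input up to floating-point rounding.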
3.2 Overview
We present the overall framework of USF-Net, as illustrated in Fig. 2. The framework adopts a U-Net-like architecture [31] and begins the dehazing process with a Discrete Wavelet Transform (DWT) decomposition that separates low-frequency and high-frequency components. Before each downsampling operation, the PosSE module performs channel-wise weight reassignment to enhance the aggregation of effective spatial information. By incorporating the Mamba mechanism, the model efficiently captures long-range dependencies and accurately reconstructs global structures. A CoordAspp module is integrated at the bottleneck to perform comprehensive feature re-extraction and facilitate subsequent spatial recovery. This module employs convolutions with varying dilation rates and incorporates positional encoding, enabling it to capture both fine details and global semantics while reducing the number of parameters and accelerating processing. The decoder progressively restores spatial resolution and incorporates the corresponding encoder features via skip connections, compensating for detail loss during downsampling and improving the recovery of high-frequency textures and edge structures. Finally, the clear dehazed image is reconstructed through the Inverse Wavelet Transform (IWT). In the following sections, we elaborate on the design principles and implementation details of each core component.
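As a quick sanity check on the encoder layout, the feature-map sizes labelled in Fig. 2 (H/2,W/2,C down to H/32,W/32,16C) can be traced with a small helper. This is only a shape calculation based on those figure labels. The function name, the number of stages as a parameter, and the base width used in the usage example are assumptions for illustration.

```python
def encoder_feature_shapes(h, w, c, stages=5):
    """Return (height, width, channels) after each encoder stage.

    Stage k (1-based) halves the resolution k times and uses c * 2**(k - 1)
    channels, following the labels in Fig. 2.
    """
    shapes = []
    for k in range(1, stages + 1):
        scale = 2 ** k
        shapes.append((h // scale, w // scale, c * 2 ** (k - 1)))
    return shapes
```

For a 256 × 256 training patch and a hypothetical base width of 32 channels, the bottleneck that feeds the CoordAspp module would have shape (8, 8, 512).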
3.3 Position Squeeze-and-Excitation Block
In the USF-Net framework, the position squeeze-and-excitation block (PosSE) is integrated throughout the encoder and decoder, as well as their skip connections, to improve the interaction between channel features and spatial information. The module strengthens the network’s ability to identify significant regions and structural characteristics in complex scenes. As illustrated in Fig. 4, it consists of two components. First, a global information aggregation mechanism squeezes each channel over its spatial extent, yielding channel descriptors that are sensitive to the global context. Second, a coordinate-based attention mechanism introduces spatial location information, thereby enhancing the representation of spatial structure in the features. By reweighting the channels and injecting spatial location information, the model adaptively strengthens channels that are highly relevant to the current task while preserving spatial structural dependencies, which enhances its overall expressive capacity. Inspired by [20, 21], the module can be formally represented as follows:
$$ s' = \mathrm{Concat}\big(s, \mu(X_{\mathrm{coord}})\big) \tag{11} $$
$$ \tilde{X} = z \odot X \tag{13} $$
Figure 4 The structure of Position Squeeze-and-Excitation Block (PosSE).
Here, $X \in \mathbb{R}^{H \times W \times C}$ denotes the input feature map, $s \in \mathbb{R}^{C}$ represents the channel descriptor vector obtained by global average pooling, $X_{\mathrm{coord}} \in \mathbb{R}^{H \times W \times 2}$ denotes the normalized coordinate encoding matrix, $s' \in \mathbb{R}^{H \times W \times (C+2)}$ is the enhanced feature map, $W_1$ and $W_2$ are the fully connected layer weights, $\delta(\cdot)$ denotes the ReLU activation function, $\sigma(\cdot)$ denotes the Sigmoid function, and $\odot$ denotes element-wise multiplication.
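A minimal NumPy sketch of one possible reading of this block is given below. It rests on several assumptions that the extracted text does not settle. The channel descriptor $s$ is broadcast to every spatial position so that $s'$ has the stated shape $H \times W \times (C+2)$. The mapping $\mu$ is taken to be the identity. The coordinates are normalized to $[-1, 1]$. The gate follows the usual squeeze-and-excitation form $z = \sigma(W_2\,\delta(W_1 s'))$, which matches the symbols defined above but is not shown in the surviving equations. All function names are hypothetical.

```python
import numpy as np

def coord_grid(h, w):
    """Normalized coordinate encoding of shape (H, W, 2) with values in [-1, 1]."""
    ys, xs = np.meshgrid(np.linspace(-1.0, 1.0, h),
                         np.linspace(-1.0, 1.0, w), indexing="ij")
    return np.stack([xs, ys], axis=-1)

def pos_se(x, w1, w2):
    """Position-aware channel gating of an (H, W, C) feature map.

    w1: (R, C + 2) and w2: (C, R) play the roles of W1 and W2.
    """
    h, w, c = x.shape
    s = x.mean(axis=(0, 1))                             # squeeze: global average pooling, (C,)
    s_map = np.broadcast_to(s, (h, w, c))               # assumed broadcast to every position
    s_prime = np.concatenate([s_map, coord_grid(h, w)], axis=-1)  # Eq. (11), (H, W, C + 2)
    hidden = np.maximum(s_prime @ w1.T, 0.0)            # delta: ReLU
    z = 1.0 / (1.0 + np.exp(-(hidden @ w2.T)))          # sigma: Sigmoid, (H, W, C)
    return z * x                                        # Eq. (13)
```

Because the coordinates enter the gate, the channel weights in this reading vary with position, which is one way to obtain the position-sensitive channel weights described in the introduction.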
3.4 Atrous Spatial Pyramid Pooling based on CoordConv Block
In the USF-Net framework, the CoordConv-based atrous spatial pyramid pooling block (CoordAspp) is the final component of the encoder. This module aims to strengthen multi-scale modeling and improve the representation of fine-grained details. It introduces several receptive-field configurations, expanding the network’s perception of image content across different semantic levels while keeping model complexity in check. This enhances the adaptability and stability of the model when processing regions of varying complexity. As shown in Fig. 5, the module applies multiple dilated convolutions with different dilation rates in parallel branches and incorporates spatial positional information to capture spatial details at various scales [19, 21], thereby achieving more precise restoration of fine-grained features.
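$$ Y_1 = \mathrm{Conv}_{1 \times 1}\big(\mathrm{Concat}(X, G_x, G_y)\big) \tag{14} $$
$$ Y_i = \mathrm{Conv}_{3 \times 3}^{r_i}\big(\mathrm{Concat}(X, G_x, G_y)\big), \quad i = 2, 3, 4 \tag{15} $$
$$ Y_{\mathrm{pool}} = \mathrm{Upsample}\big(\mathrm{Conv}_{1 \times 1}(\mathrm{GAP}(X))\big) \tag{16} $$
$$ Y_{\mathrm{cat}} = \mathrm{Concat}(Y_1, Y_2, Y_3, Y_4, Y_{\mathrm{pool}}) \tag{17} $$
$$ Z = \mathrm{Conv}_{1 \times 1}\big(\mathrm{Concat}(Y_{\mathrm{cat}}, G_x, G_y)\big) \tag{18} $$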
Fig. 5
The structure of Atrous Spatial Pyramid Pooling based on CoordConv Block (CoordAspp).
where $X \in \mathbb{R}^{H \times W \times C}$ denotes the input feature map, $\mathrm{CoordConv}_{3 \times 3}^{r_i}$ represents the 3 × 3 dilated convolution with dilation rate $r_i$ applied to the coordinate-augmented input in Eq. (15) (e.g. r1 = 2, r2 = 4, r3 = 8, r4 = 16), and GAP(X) denotes global average pooling of the input feature, whose output is upsampled back to the original size. The features obtained from all branches are concatenated along the channel dimension and then fused through a 1 × 1 CoordConv operation to integrate cross-scale information.
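Equations (14) to (18) can be traced end to end with a small NumPy sketch. It is illustrative only: the coordinate maps $G_x$ and $G_y$ are assumed to be normalized grids in $[-1, 1]$, the convolutions are written as plain tensor products without bias, normalization, or activation, three dilated branches are used to match $i = 2, 3, 4$ in Eq. (15), and upsampling a 1 × 1 pooled map is implemented as a broadcast. Function and parameter names are hypothetical.

```python
import numpy as np

def coord_grid(h, w):
    """Coordinate maps (Gx, Gy) stacked into shape (H, W, 2), values in [-1, 1]."""
    ys, xs = np.meshgrid(np.linspace(-1.0, 1.0, h),
                         np.linspace(-1.0, 1.0, w), indexing="ij")
    return np.stack([xs, ys], axis=-1)

def conv1x1(x, k):
    """1x1 convolution as a product over channels; k has shape (C_in, C_out)."""
    return x @ k

def dilated_conv3x3(x, k, rate):
    """3x3 dilated convolution with zero padding; k has shape (3, 3, C_in, C_out)."""
    h, w, _ = x.shape
    p = np.pad(x, ((rate, rate), (rate, rate), (0, 0)))
    out = np.zeros((h, w, k.shape[-1]))
    for i in range(3):
        for j in range(3):
            out += p[i * rate:i * rate + h, j * rate:j * rate + w] @ k[i, j]
    return out

def coord_aspp(x, k1, k3_list, rates, k_pool, k_out):
    """CoordConv-style ASPP over an (H, W, C) feature map."""
    h, w, _ = x.shape
    g = coord_grid(h, w)
    xg = np.concatenate([x, g], axis=-1)                      # Concat(X, Gx, Gy)
    branches = [conv1x1(xg, k1)]                              # Eq. (14)
    branches += [dilated_conv3x3(xg, k, r)                    # Eq. (15)
                 for k, r in zip(k3_list, rates)]
    pooled = conv1x1(x.mean(axis=(0, 1)), k_pool)             # Conv1x1(GAP(X))
    branches.append(np.broadcast_to(pooled, (h, w, pooled.shape[-1])))  # Eq. (16)
    y_cat = np.concatenate(branches, axis=-1)                 # Eq. (17)
    return conv1x1(np.concatenate([y_cat, g], axis=-1), k_out)  # Eq. (18)
```

With three dilated branches and `C_out` channels per branch, the fused tensor before the last 1 × 1 convolution has `5 * C_out + 2` channels, so `k_out` must have that many input channels.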
4 Experiments
4.1 Dataset
This study uses the publicly available dehazing datasets RESIDE [15] and RS-Haze [32]. The RESIDE series, proposed by Li et al. [32], provides a standardised training and testing benchmark for single-image dehazing. RESIDE-IN comprises approximately 13,990 pairs of indoor synthetic images generated using NYU Depth v2 depth maps; RESIDE-OUT comprises approximately 313,950 pairs of outdoor images synthesised from ImageNet and BSD500 images using depth estimation and random atmospheric parameters; RESIDE-6K is a streamlined version of RESIDE that balances indoor and outdoor proportions and is commonly used for lightweight training and ablation studies. The RS-Haze dataset was constructed independently of RESIDE. It uses physically realistic simulation, incorporating light direction, aerosol density, and atmospheric light distribution, to generate hazy images that closely resemble real-world environments, and thus serves to validate the model’s generalisation capabilities. All experiments follow the training and testing configurations of DehazeFormer [15], so that dehazing performance is compared under identical conditions.
4.2 Implementation details
The proposed network is trained with a batch size of 16 for 400 epochs. The Adam optimizer is employed with an initial learning rate of $2 \times 10^{-4}$, which is reduced by a factor of 0.5 every 50 epochs. During training, input images are randomly cropped into patches of size 256 × 256 and augmented through random horizontal flipping and rotation to enhance the model’s generalization capability. The total loss function is a weighted combination of L1 loss, perceptual loss, and SSIM loss, with the corresponding weights set to $\lambda_1 = 1.0$, $\lambda_p = 0.04$, and $\lambda_s = 0.5$, respectively. All experiments are implemented on an NVIDIA H100 GPU using the PyTorch 1.12 framework.

Table 1
Quantitative comparison of various dehazing methods trained on the RESIDE datasets. A dash marks a value that is not reported.

| Methods | ITS PSNR | ITS SSIM | OTS PSNR | OTS SSIM | RESIDE-6K PSNR | RESIDE-6K SSIM | RS-Haze PSNR | RS-Haze SSIM | #Param | MACs |
|---|---|---|---|---|---|---|---|---|---|---|
| DCP [6] | 16.62 | 0.818 | 19.13 | 0.815 | 17.88 | 0.816 | 17.86 | 0.734 | - | - |
| DehazeNet [8] | 19.82 | 0.821 | 24.75 | 0.927 | 21.02 | 0.870 | 23.16 | 0.816 | 0.009M | 0.581G |
| MSCNN [9] | 19.84 | 0.833 | 22.06 | 0.908 | 20.31 | 0.863 | 22.80 | 0.823 | 0.008M | 0.525G |
| AOD-Net [11] | 20.51 | 0.816 | 24.14 | 0.920 | 20.27 | 0.855 | 24.90 | 0.830 | 0.002M | 0.115G |
| GFN [12] | 22.30 | 0.880 | 21.55 | 0.844 | 23.52 | 0.905 | 29.24 | 0.910 | 0.499M | 14.94G |
| GCANet [14] | 30.23 | 0.980 | - | - | 25.09 | 0.923 | 34.41 | 0.949 | 0.702M | 18.41G |
| GridDehazeNet [13] | 32.16 | 0.984 | 30.86 | 0.982 | 25.86 | 0.946 | 36.40 | 0.960 | 0.956M | 21.49G |
| MSBDN [16] | 33.67 | 0.985 | 33.48 | 0.982 | 28.56 | 0.966 | 38.57 | 0.965 | 31.35M | 41.54G |
| PFDN [33] | 32.68 | 0.976 | - | - | 28.15 | 0.962 | 36.04 | 0.955 | 11.27M | 50.46G |
| FFA-Net [24] | 36.39 | 0.989 | 33.57 | 0.984 | 29.96 | 0.973 | 39.39 | 0.969 | 4.456M | 287.8G |
| AECR-Net [25] | 37.17 | 0.990 | - | - | 28.52 | 0.964 | 35.69 | 0.959 | 2.611M | 52.20G |
| DehazeFormer-T [15] | 35.15 | 0.989 | 33.71 | 0.982 | 30.36 | 0.973 | 39.11 | 0.968 | 0.686M | 6.658G |
| DehazeFormer-S [15] | 36.82 | 0.994 | 34.36 | 0.983 | 30.62 | 0.976 | 39.57 | 0.970 | 1.283M | 13.13G |
| DehazeFormer-B [15] | 37.84 | 0.995 | 34.94 | 0.983 | 31.45 | 0.980 | 39.87 | 0.971 | 2.514M | 25.79G |
| DehazeFormer-M [15] | 38.46 | 0.994 | 34.29 | 0.983 | 30.89 | 0.977 | 39.71 | 0.971 | 4.634M | 48.64G |
| DehazeFormer-L [15] | 40.05 | 0.996 | - | - | - | - | - | - | 25.44M | 279.7G |
| USF-Net (Ours) | 42.31 | 0.988 | 36.58 | 0.988 | 33.76 | 0.986 | 41.47 | 0.977 | 26.12M | 203.68G |
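The training settings of Section 4.2 can be summarised in a framework-agnostic sketch. It only encodes the step schedule, the loss weighting, and the paired augmentation. The individual loss terms are taken as precomputed numbers, rotation is assumed to be by multiples of 90°, and the function names are hypothetical rather than taken from the authors’ code.

```python
import numpy as np

def learning_rate(epoch, base_lr=2e-4, gamma=0.5, step=50):
    """Step schedule: the rate is multiplied by gamma every `step` epochs."""
    return base_lr * gamma ** (epoch // step)

def total_loss(l1, perceptual, ssim_loss, w_l1=1.0, w_p=0.04, w_s=0.5):
    """Weighted sum of the three loss terms with the weights given in the text."""
    return w_l1 * l1 + w_p * perceptual + w_s * ssim_loss

def augment_pair(hazy, clear, rng, crop=256):
    """Apply the same random crop, horizontal flip and rotation to both images."""
    h, w, _ = hazy.shape
    top = int(rng.integers(0, h - crop + 1))
    left = int(rng.integers(0, w - crop + 1))
    hazy = hazy[top:top + crop, left:left + crop]
    clear = clear[top:top + crop, left:left + crop]
    if rng.random() < 0.5:                       # random horizontal flip
        hazy, clear = hazy[:, ::-1], clear[:, ::-1]
    k = int(rng.integers(0, 4))                  # assumed: rotation by k * 90 degrees
    return np.rot90(hazy, k), np.rot90(clear, k)
```

Over 400 epochs this schedule halves the rate seven times, ending at $2 \times 10^{-4} \cdot 0.5^{7}$. The hazy and clear patches must share the same crop, flip, and rotation, otherwise the pixel-wise losses compare misaligned content.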
4.3 Quantitative Comparison
In this section, we quantitatively compare USF-Net with a range of representative dehazing methods, including DCP [6], DehazeNet [8], MSCNN [9], AOD-Net [11], GFN [12], GCANet [14], GridDehazeNet [13], MSBDN [16], PFDN [33], FFA-Net [24], AECR-Net [25], DehazeFormer-T [15], DehazeFormer-S [15], DehazeFormer-B [15], DehazeFormer-M [15], and DehazeFormer-L [15]. The results summarised in Table 1 show that the proposed method achieves the highest PSNR on all four benchmarks and the highest SSIM on three of them (OTS, RESIDE-6K, and RS-Haze). USF-Net generalises well across diverse indoor and outdoor scenarios, as shown by its performance on the indoor training set (ITS), the outdoor training set (OTS), the RESIDE-6K testing set, which encompasses both indoor and outdoor scenes, and the remote sensing RS-Haze dataset. The RS-Haze results further suggest that the model is effective on haze that closely resembles real-world conditions, which supports its practical viability.
Comparison with Transformer-based methods. Recent Transformer-based dehazing approaches, particularly the DehazeFormer series [15], have demonstrated promising results by leveraging self-attention for long-range dependency modeling. However, the proposed USF-Net achieves significant improvements. According to Table 1, these are +2.26 dB over DehazeFormer-L on ITS, +2.22 dB over DehazeFormer-S on OTS, and +2.87 dB over DehazeFormer-M on the RESIDE-6K benchmark. These gains can be attributed to the following three key factors:
• DWT-based multiscale decomposition explicitly separates low-frequency structural information from high-frequency textural details, providing richer frequency-domain priors than raw spatial attention.
• PosSE channel attention adaptively recalibrates feature responses across channels, yielding more discriminative representations compared to uniform channel processing.
• The CoordASPP module captures multiscale dependencies via parallel dilated convolutions with varying dilation rates, effectively complementing Mamba’s sequence modeling capabilities.
Beyond its accuracy, USF-Net is also computationally efficient relative to the strongest baseline. In comparison with DehazeFormer-L, the proposed approach yields a 2.26 dB improvement while maintaining a comparable model size (26.12M vs. 25.44M parameters) and reducing computational cost by 27% (203.68G vs. 279.7G MACs). This efficiency is attributable to the linear complexity of the Mamba architecture and to the lightweight design of the enhancement modules. USF-Net also requires fewer MACs than other high-cost attention-based methods such as FFA-Net [24] (287.8G MACs), which favors its use on high-resolution images. Furthermore, USF-Net outperforms lightweight CNN approaches such as MSCNN [9] (0.525G MACs) and AOD-Net [11] (0.115G MACs) on ITS by +22.47 dB and +21.80 dB, respectively. Although these models are far cheaper, the accuracy gap indicates that USF-Net offers a favorable balance between accuracy and computational cost among high-accuracy methods.
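The headline figures in this comparison follow directly from Table 1 and can be recomputed with a few lines of Python. The dictionaries below copy the ITS PSNR and MACs values from the table. The helper names are illustrative.

```python
# ITS PSNR (dB) and MACs (G) copied from Table 1.
its_psnr = {"MSCNN": 19.84, "AOD-Net": 20.51, "DehazeFormer-L": 40.05, "USF-Net": 42.31}
macs_g = {"DehazeFormer-L": 279.7, "USF-Net": 203.68}

def psnr_gain(method, ours="USF-Net"):
    """PSNR improvement of `ours` over `method` on ITS, in dB."""
    return round(its_psnr[ours] - its_psnr[method], 2)

def macs_reduction_percent(method, ours="USF-Net"):
    """Relative MACs saving of `ours` with respect to `method`, in percent."""
    return round(100.0 * (macs_g[method] - macs_g[ours]) / macs_g[method], 1)

print(psnr_gain("DehazeFormer-L"))               # 2.26
print(psnr_gain("MSCNN"), psnr_gain("AOD-Net"))  # 22.47 21.8
print(macs_reduction_percent("DehazeFormer-L"))  # 27.2
```

The 27% figure is therefore a saving relative to DehazeFormer-L only. Relative to the lightweight CNNs, USF-Net is several hundred times more expensive in MACs.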
4.4 Qualitative Comparison
Fig. 6
Qualitative comparisons of image dehazing methods are shown on the SOTS mixed dataset. The top two rows display outdoor scenes, and the bottom two rows display indoor scenes. The first column is the hazy input, and the last column is the corresponding ground truth.
Fig. 7
Qualitative comparison of image dehazing methods on RS-Haze. The first column shows the hazy inputs, and the last column shows the corresponding ground truth.
For an in-depth qualitative analysis, we present some representative dehazing results. Given that we did not retrain the baseline methods on RESIDE-Full, we only provide test results on the RESIDE-6K and RS-Haze datasets. We selected several representative dehazing methods for qualitative comparison with USF-Net, as shown in Fig. 6. This figure displays four samples chosen from different scenarios, including both indoor and outdoor hazy images. For outdoor images, DCP and AOD-Net exhibit poor parameter estimation, particularly in the sky region. For indoor images, these methods are prone to color distortion and may produce artifacts such as halos and checkerboard patterns. While FFA-Net and DehazeFormer demonstrate promising dehazing results on both indoor and outdoor datasets, they still suffer from color distortion in low-illumination scenes and at edge details. In comparison, USF-Net restores these details more precisely. Specifically, for the buildings shown in the first row of the figure, the details and main structure recovered by USF-Net are closer to the ground truth image.
The dehazing performance on remote sensing images is shown in Fig. 7. We selected four hazy images with varying topographical features (mountains and rivers) and haze concentrations. DCP and AOD-Net are generally incapable of recovering clear results and show very poor dehazing effects. FFA-Net and DehazeFormer show significant improvements, with performance on non-uniform haze regions comparable to that of USF-Net. However, in terms of texture detail recovery and color fidelity for mountains and rivers, they are inferior to our proposed network. Notably, there is a small white geographical marker in the upper right of the sample in the fourth row of the figure, which our network restores slightly more clearly than DehazeFormer does.
Table 2
Ablation study on the RESIDE datasets.
| Methods | ITS PSNR (dB) | ITS SSIM | OTS PSNR (dB) | OTS SSIM | RESIDE-6K PSNR (dB) | RESIDE-6K SSIM | RS-Haze PSNR (dB) | RS-Haze SSIM | Overhead #Param | Overhead MACs |
|---|---|---|---|---|---|---|---|---|---|---|
| UVM-Net (Baseline) | 40.17 | 0.996 | 34.92 | 0.984 | 31.92 | 0.982 | 39.88 | 0.972 | 19.25M | 173.55G |
| + DWT Multi-scale | 41.23 | 0.997 | 35.68 | 0.986 | 32.74 | 0.984 | 40.52 | 0.974 | 21.18M | 185.32G |
| + PosSE | 40.89 | 0.996 | 35.41 | 0.985 | 32.38 | 0.983 | 40.27 | 0.973 | 19.47M | 174.83G |
| + CoordAspp | 41.05 | 0.997 | 35.52 | 0.985 | 32.51 | 0.983 | 40.35 | 0.973 | 23.61M | 192.47G |
| + DWT + PosSE | 41.87 | 0.998 | 36.15 | 0.987 | 33.28 | 0.985 | 41.03 | 0.976 | 21.65M | 186.89G |
| + DWT + CoordAspp | 41.64 | 0.997 | 35.94 | 0.987 | 33.05 | 0.987 | 40.81 | 0.975 | 25.73M | 201.25G |
| USF-Net (Full) | 42.31 | 0.998 | 36.58 | 0.988 | 33.76 | 0.986 | 41.47 | 0.977 | 26.12M | 203.68G |
4.5 Ablation study
To validate the contribution of each proposed component, we conducted a detailed ablation study. All gains below are PSNR improvements over the UVM-Net baseline on ITS. As shown in Table 2, the DWT multi-scale module provides the largest standalone gain (+1.06 dB), which supports our hypothesis that frequency-domain decomposition is important for extracting haze-related features. The position squeeze-and-excitation block (PosSE) and the ASPP context module (CoordAspp) contribute +0.72 dB and +0.88 dB, respectively. Combining all three modules yields a total improvement of +2.14 dB. This is smaller than the 2.66 dB sum of the standalone gains, so the contributions partly overlap, but every added module still improves the result (+1.70 dB for DWT + PosSE, +1.47 dB for DWT + CoordAspp, and +2.14 dB for the full model), which indicates that the modules are complementary. USF-Net achieves state-of-the-art results at a moderate overhead: 26.12M parameters and 203.68G MACs, compared with 19.25M and 173.55G for the baseline.
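The gains discussed in this subsection can be recomputed directly from Table 2. The following is a minimal sketch, not part of the authors' code: the names `TABLE2`, `psnr_gain`, and `overhead_pct` are illustrative, and the values are copied from the ITS PSNR, #Param, and MACs columns of the table.

```python
# Values copied from Table 2: (PSNR on ITS in dB, parameters in M, MACs in G).
# All identifiers here are illustrative assumptions, not from the paper's code.
TABLE2 = {
    "UVM-Net (Baseline)": (40.17, 19.25, 173.55),
    "+ DWT Multi-scale": (41.23, 21.18, 185.32),
    "+ PosSE": (40.89, 19.47, 174.83),
    "+ CoordAspp": (41.05, 23.61, 192.47),
    "+ DWT + PosSE": (41.87, 21.65, 186.89),
    "+ DWT + CoordAspp": (41.64, 25.73, 201.25),
    "USF-Net (Full)": (42.31, 26.12, 203.68),
}
BASELINE = "UVM-Net (Baseline)"
STANDALONE = ("+ DWT Multi-scale", "+ PosSE", "+ CoordAspp")


def psnr_gain(name):
    """PSNR gain on ITS over the baseline, in dB."""
    return TABLE2[name][0] - TABLE2[BASELINE][0]


def overhead_pct(name, column):
    """Relative increase over the baseline for column 1 (#Param) or 2 (MACs)."""
    return 100.0 * (TABLE2[name][column] / TABLE2[BASELINE][column] - 1.0)


# Gain of every variant over the baseline, rounded to the table's precision.
gains = {name: round(psnr_gain(name), 2) for name in TABLE2 if name != BASELINE}

# Sum of the three single-module gains versus the gain of the full model.
standalone_sum = round(sum(gains[name] for name in STANDALONE), 2)
full_gain = gains["USF-Net (Full)"]

if __name__ == "__main__":
    for name, value in gains.items():
        print(f"{name:<20s} +{value:.2f} dB")
    print(f"sum of standalone gains: {standalone_sum:.2f} dB")
    print(f"full model gain:         {full_gain:.2f} dB")
    print(f"extra parameters: {overhead_pct('USF-Net (Full)', 1):.1f}%")
    print(f"extra MACs:       {overhead_pct('USF-Net (Full)', 2):.1f}%")
```

Comparing `standalone_sum` with `full_gain` shows how much of each module's standalone gain survives when the modules are combined, and `overhead_pct` puts the added parameters and MACs in relative terms.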
5 Conclusion
The proposed single-image dehazing architecture, USF-Net, integrates CNN-based local feature extraction, Mamba’s efficient global modeling, and frequency-domain prior knowledge. Specifically, it achieves precise multi-scale frequency-domain feature modeling while maintaining linear complexity by introducing DWT multi-scale decomposition, channel attention modules, and position-aware multi-scale modules. On the ITS, OTS, RESIDE-6K, and RS-Haze benchmarks, USF-Net attains 42.31/36.58/33.76/41.47 dB PSNR, outperforming DehazeFormer-L by +2.26/+1.63/+2.31/+1.60 dB while reducing computational cost by 27%. Ablation studies show that combining the proposed modules yields a +2.14 dB gain over the baseline on ITS. These results set a strong reference point for haze removal and highlight the potential of hybrid architectures in image restoration, making USF-Net a promising backbone for future image restoration networks.
References
1.
Seow MJ, Asari VK (2006) Ratio rule and homomorphic filter for enhancement of digital colour image. Neurocomputing 69:954–958
2.
Stark JA (2000) Adaptive image contrast enhancement using generalizations of histogram equalization. IEEE Trans Image Process 9:889–896
3.
Jobson DJ, Rahman Z, Woodell GA (1997) Properties and performance of a center/surround retinex. IEEE Trans Image Process 6:451–462
4.
Rahman Z, Jobson DJ, Woodell GA (1996) Multiscale retinex for color image enhancement. In: Proceedings of the IEEE International Conference on Image Processing (ICIP), vol. 3, pp. 1003–1006
5.
Land EH (1977) The retinex theory of color vision. Sci Am 237:108–128
6.
He KM, Sun J, Tang XO (2011) Single image haze removal using dark channel prior. IEEE Trans Pattern Anal Mach Intell 33:2341–2353
7.
Zhu QS, Mai JM, Shao L (2015) A fast single image haze removal algorithm using color attenuation prior. IEEE Trans Image Process 24:3522–3533
8.
Cai B, Xu X, Jia K, Qing C, Tao D (2016) DehazeNet: An end-to-end system for single image haze removal. IEEE Trans Image Process 25:5187–5198
9.
Ren W, Liu S, Zhang H, Pan J, Cao X, Yang MH (2016) Single image dehazing via multi-scale convolutional neural networks. In: European Conference on Computer Vision (ECCV), pp. 154–169
10.
Zhang H, Patel VM (2018) Densely connected pyramid dehazing network. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3194–3203
11.
Li B, Peng X, Wang Z, Xu J, Feng D (2017) AOD-Net: All-in-one dehazing network. In: IEEE International Conference on Computer Vision (ICCV), pp. 4770–4778
12.
Ren W, Ma L, Zhang J, Pan J, Cao X, Liu W, Yang MH (2018) Gated fusion network for single image dehazing. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3253–3261
13.
Liu X, Ma Y, Shi Z, Chen J (2019) GridDehazeNet: Attention-based multi-scale network for image dehazing. In: IEEE International Conference on Computer Vision (ICCV), pp. 7314–7323
14.
Chen D, He M, Fan Q, Liao J, Zhang L, Hou D, Yuan L, Hua G (2019) Gated context aggregation network for image dehazing and deraining. In: IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1375–1383
15.
Song Y, He Z, Qian H, Du X (2023) Vision transformers for single image dehazing. IEEE Trans Image Process 32:1927–1941
16.
Dong H, Pan J, Xiang L, Hu Z, Wang F, Yang MH (2020) Multi-scale boosted dehazing network with dense feature fusion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2157–2167
17.
Nguyen E, Goel K, Gu A, Downs G, Shah P, Dao T, Baccus S, Ré C (2022) S4ND: Modeling images and videos as multidimensional signals with state spaces. In: Advances in Neural Information Processing Systems (NeurIPS), vol. 35, pp. 2846–2861
18.
Zheng Z, Wu C (2024) U-shaped vision mamba for single image dehazing. arXiv preprint arXiv:2402.04139
19.
Chen LC, Papandreou G, Kokkinos I, Murphy K, Yuille AL (2017) DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans Pattern Anal Mach Intell 40:834–848
20.
Hu J, Shen L, Sun G (2018) Squeeze-and-excitation networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7132–7141
21.
Liu R, Lehman J, Molino P, Petroski Such F, Frank E, Sergeev A, Yosinski J (2018) An intriguing failing of convolutional neural networks and the CoordConv solution. In: Advances in Neural Information Processing Systems (NeurIPS), vol. 31, pp. 8329–8340
22.
Tong LH, Liu Y, Li WJ et al (2024) Haze-aware attention network for single-image dehazing. Appl Sci 14:5391
23.
Liu Y, Tian Y, Zhao H, Yu H, Xie L, Wang Y, Ye Q, Jiao J, Liu Y (2024) VMamba: Visual state space model. In: Advances in Neural Information Processing Systems (NeurIPS), vol. 37, pp. 103031–103063
24.
Qin X, Wang Z, Bai Y, Xie X, Jia H (2020) FFA-Net: Feature fusion attention network for single image dehazing. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 11908–11915
25.
Wu H, Qu Y, Lin S et al (2021) Contrastive learning for compact single image dehazing. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10551–10560
26.
Gu A (2023) Modeling sequences with structured state spaces. PhD thesis, Stanford University. ProQuest Document ID: 2880853867
27.
Gu A, Johnson I, Goel K, Saab K, Dao T, Rudra A, Ré C (2021) Combining recurrent, convolutional, and continuous-time models with linear state space layers. In: Advances in Neural Information Processing Systems (NeurIPS), pp. 572–585
28.
Gu A, Dao T (2024) Mamba: Linear-time sequence modeling with selective state spaces. In: First Conference on Language Modeling, p. 1039
29.
Gu A, Goel K, Ré C (2021) Efficiently modeling long sequences with structured state spaces. In: International Conference on Learning Representations (ICLR), p. 1
30.
Tarafdar KK, Gadre VM (2025) TFDWT: Fast Discrete Wavelet Transform TensorFlow Layers. arXiv preprint arXiv:2504.04168
31.
Ronneberger O, Fischer P, Brox T (2015) U-Net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention (MICCAI), pp. 234–241
32.
Li B, Ren W, Fu D, Tao D, Feng D, Zeng W, Wang Z (2018) Benchmarking single-image dehazing and beyond. IEEE Trans Image Process 28:492–505
33.
Dong J, Pan J (2020) Physics-based feature dehazing networks. In: European Conference on Computer Vision (ECCV), pp. 150–165. Springer, Cham