Multiscale Cross-Attention of Hyperspectral and Multispectral Image Fusion Based on Transformer

Yuxuan Jiang^a, Bin Yang^a*, and Binxi Tan^a

^a School of Electrical Engineering, University of South China, Hengyang, Hunan, CO 421001, China

Corresponding author: BIN YANG (yangbin01420@163.com).

Abstract

Owing to the limitations of imaging sensors, hyperspectral images (HSIs) typically suffer from low spatial resolution. To obtain HSIs with high spatial resolution, fusing an HSI with a multispectral image (MSI) has become an effective and widely adopted technique. However, existing deep learning-based HSI-MSI fusion methods often struggle to capture both local details and global context, especially when features span multiple scales. To address these issues, we propose a novel Transformer-based multiscale cross-attention fusion network (MCA-Net), which integrates three key innovations. First, the heterogeneous convolution parallel attention enhancement module (HCPAEM) combines dilated depthwise separable convolutions with parallel attention mechanisms to enhance the representation of both local and global features. Second, the multiscale local-global feature extraction module (MLGFEM) integrates convolutional neural networks (CNNs), Transformers, and multiscale feature extraction strategies to model non-local and complementary information at multiple scales. Finally, the deep cross-attention fusion module (DCAFM) employs a deep cross-attention mechanism to model the correlation between HSI and MSI, promoting the comprehensive fusion of spatial-spectral features. To validate the effectiveness of MCA-Net, we conducted comparative experiments on five widely used HSI datasets: Pavia Center, Pavia University, Washington DC, Botswana, and Chikusei. Experimental results demonstrate significant improvements over state-of-the-art fusion methods. For instance, on the Washington DC dataset, compared with the best-performing comparison method, our method improves PSNR by 11.76% and reduces RMSE, ERGAS, and SAM by 44.4%, 44.71%, and 43.2%, respectively.

Keywords: Hyperspectral and multispectral image fusion; Deep learning; Transformer; Cross-attention mechanism

1. Introduction

Hyperspectral images (HSIs), characterized by a large number of contiguous narrow spectral bands spanning the visible to near-infrared regions, exhibit excellent spectral discrimination capabilities. As a result, HSIs have been widely applied in domains such as agriculture [1–2], geological exploration [3], and urban management [4]. Nevertheless, due to the technical limitations of imaging devices, current hyperspectral imaging systems are generally unable to achieve high spectral and high spatial resolution simultaneously, resulting in HSIs with relatively low spatial resolution. Compared with an HSI, a multispectral image (MSI) offers higher spatial resolution but lower spectral resolution. Therefore, effectively fusing a low-resolution hyperspectral image (LR-HSI) with a high-resolution multispectral image (HR-MSI) to generate a high-resolution hyperspectral image (HR-HSI) has become a research direction of great significance.

Existing HSI and MSI fusion methods can be broadly categorized into traditional methods and deep learning-based methods. Traditional methods can further be divided into pansharpening-based methods [5–7], matrix factorization-based methods [8–11], Bayesian-based methods [12–14], and tensor factorization-based methods [15–18]. Although traditional image fusion methods have made certain progress in the field of HSI-MSI fusion and demonstrated their respective advantages, they generally rely on manually designed prior knowledge, resulting in high time costs. Moreover, these methods often suffer from high computational complexity, sensitivity to prior information or initial values, and complicated model construction processes, which limit their efficiency and scalability in practical applications.

Benefiting from advances in deep learning, model parameters can now be learned automatically from training data through deep neural networks, without the need for predefined prior knowledge about the image. In particular, convolutional neural networks (CNNs) have shown remarkable performance in image feature extraction and have been widely applied to HSI-MSI fusion tasks [19–20]. For instance, Zhang et al. [21] proposed a CNN-based Spatial-Spectral Reconstruction Network (SSR-NET). In this method, a preliminary fused image is generated through Cross-Modality Message Insertion (CMMI), and spatial and spectral details are recovered by optimizing spatial and spectral edge losses using the Spatial Reconstruction Network (SpatRN) and Spectral Reconstruction Network (SpecRN), respectively. Zhu et al. [22] proposed a self-supervised unfolding network (MH-FUNet), which refines the fusion results from coarse to fine through an iterative optimization structure. A multi-scale fusion strategy combined with spectral and spatial attention mechanisms was also introduced to recover spatial and spectral details. However, CNN-based methods typically focus on local information when extracting image features, making it difficult to model global contextual relationships.

Transformer-based methods capture global spatial–spectral dependencies, effectively addressing the local receptive field limitations of CNNs. Hu et al. [23] introduced FusFormer, a Transformer-based HSI super-resolution method that enhances fusion through global feature interaction. Deng et al. [24] proposed PSRT, a pyramid-structured Transformer incorporating a shuffle-and-reshuffle module and window-based self-attention to enable global information exchange while preserving local features, significantly reducing computational cost. HMF-Former was presented by You et al. [25], in which a spatial–spectral transformer block (SSTB) is employed to capture spatial correlations and spectral self-similarities with linear complexity. A hyperspectral and multispectral image fusion method based on Scalable Spatial-Spectral Transformer Network (RSST) was proposed by He et al. [26], where spatial-spectral features are jointly modeled and missing information is reconstructed to enhance fusion quality. To achieve joint spatial–spectral fusion, Ma et al. [27] developed DCTransformer based on a dual-cross Transformer architecture. Most current Transformer-based HSI-MSI fusion methods first extract features from the HSI and MSI independently and then fuse them via feature concatenation. Another approach is to concatenate the HSI and MSI in advance and then feed the combined image into the Transformer to capture non-local dependencies. These methods struggle to capture local and global information simultaneously, and they provide insufficient cross-modal interaction between spatial and spectral features, which is detrimental to the fusion task.

Therefore, this article proposes a novel Transformer-based multiscale cross-attention fusion network for hyperspectral and multispectral image fusion. The proposed method mainly includes a heterogeneous convolution and parallel attention enhancement module (HCPAEM), a multiscale local-global feature extraction module (MLGFEM), a deep cross-attention fusion module (DCAFM), and a spectral-spatial reconstruction module (SSRM) [28]. First, the HCPAEM enhances local details and global information in the HSI and MSI by combining dilated depthwise separable convolutions with a parallel attention mechanism. Then, the MLGFEM extracts non-local and complementary features at different scales from the HSI and MSI, as well as from the MSI after successive up-sampling and down-sampling. Next, feature maps at the same scale are fed into the DCAFM, where a deep cross-attention mechanism explores the correlation information between the input images, thereby facilitating deep interaction and fusion of spatial and spectral features. Finally, the SSRM is introduced, in which a multidimensional refinement convolution block is designed to enhance fine details across the horizontal, vertical, and channel dimensions. The HR-HSI is then reconstructed through cascaded upsampling. The main contributions of this paper are summarized as follows:

(1) A heterogeneous convolution parallel attention enhancement module is proposed, in which dilated depthwise separable convolutions and parallel attention mechanisms are integrated to effectively enhance the ability to model both local and global features.

(2) A multiscale local-global feature extraction module is proposed, in which CNNs, Transformers, and multiscale feature extraction strategies are combined to effectively integrate local and global features. While alleviating the loss of local features across multiple scales, the advantage of the Transformer in modeling global dependencies is fully retained.

(3) A deep cross-attention fusion module is proposed to effectively explore the correlation information between HSI and MSI, enabling efficient interaction and deep fusion between spatial and spectral features.

2. The Method

2.1. Overview

The HSI-MSI fusion framework proposed in this article is illustrated in Fig. 1, consisting of four modules: the heterogeneous convolutional parallel attention enhancement module, the multiscale local-global feature extraction module, the deep cross-attention fusion module, and the spatial-spectral reconstruction module. In this framework, MSI↓, MSI↓↑↓, and HSI↑ represent the MSI after 2-fold down-sampling, MSI after consecutive 2-fold down-sampling and 2-fold up-sampling, and HSI after 2-fold up-sampling, respectively. The bilinear interpolation method is employed in this study to perform continuous up-sampling and down-sampling of MSI, as well as up-sampling of HSI.

Fig. 1

The architecture of the proposed MCA-Net

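In the network, the estimated HR-HSI is a tensor in $\mathbb{R}^{H \times W \times C}$, where H and W represent the height and width of the image, respectively, and C denotes the number of spectral channels. Additionally, we denote LR-HSI and HR-MSI as $X \in \mathbb{R}^{h \times w \times C}$ and $Y \in \mathbb{R}^{H \times W \times c}$, where h ≤ H, w ≤ W and c ≪ C. After the upsampling and downsampling processes, MSI↓, MSI↓↑↓, and HSI↑ are represented as $Y{\downarrow},\, Y{\downarrow\uparrow\downarrow} \in \mathbb{R}^{\frac{H}{2} \times \frac{W}{2} \times c}$ and $X{\uparrow} \in \mathbb{R}^{2h \times 2w \times C}$, respectively.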
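The resampling step can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function names `bilinear_resize` and `build_inputs` are hypothetical, the half-pixel alignment is an assumption, and MSI↓↑↓ is read here as down-sampling, up-sampling, then down-sampling again, each by a factor of 2.

```python
import numpy as np

def bilinear_resize(img, out_h, out_w):
    """Bilinear resampling of an (H, W, C) array with half-pixel alignment."""
    in_h, in_w, _ = img.shape
    # Source coordinates of each output pixel centre, clamped to the image.
    ys = np.clip((np.arange(out_h) + 0.5) * in_h / out_h - 0.5, 0, in_h - 1)
    xs = np.clip((np.arange(out_w) + 0.5) * in_w / out_w - 0.5, 0, in_w - 1)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, in_h - 1)
    x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None, None]  # vertical interpolation weights
    wx = (xs - x0)[None, :, None]  # horizontal interpolation weights
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
    bottom = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy

def build_inputs(X, Y):
    """Return (X_up, Y_down, Y_down_up_down) for LR-HSI X and HR-MSI Y."""
    h, w, _ = X.shape
    H, W, _ = Y.shape
    X_up = bilinear_resize(X, 2 * h, 2 * w)
    Y_down = bilinear_resize(Y, H // 2, W // 2)
    Y_dud = bilinear_resize(bilinear_resize(Y_down, H, W), H // 2, W // 2)
    return X_up, Y_down, Y_dud
```

With a 32×32 LR-HSI and a 128×128 HR-MSI, all three outputs share a 64×64 spatial size, which is what allows them to be processed as same-scale inputs later in the network.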

2.2. Heterogeneous convolutional parallel attention enhancement module

To further optimize the subsequent feature extraction and fusion operations, a heterogeneous convolutional parallel attention enhancement module is designed, as shown in Fig. 2 (a). By jointly optimizing hierarchical feature extraction and saliency modeling, the module achieves an efficient representation of spectral-spatial features. It first constructs a multiscale heterogeneous convolutional architecture: large-scale dilated convolutions build a wide-range spatial perception ability that, similar to a self-attention mechanism, captures long-range spatial dependencies; medium-scale dilated convolutions balance local and global features; and small-scale convolutions preserve dense sampling characteristics, capturing fine local details. The multiscale feature tensors are concatenated along the channel dimension, and the combination of cascaded feature processing and parallel multiscale extraction keeps the computational complexity low. On this basis, a parallel attention mechanism is introduced: channel attention extracts global statistical information about the image brightness distribution through global channel interaction, pixel attention dynamically adjusts local contrast features by encoding spatial dependencies, and simple pixel attention enhances the saliency of high-frequency texture details through a lightweight design. The outputs of the three branches are concatenated along the channel dimension and fused with a multilayer perceptron (MLP). This parallel design avoids the loss of global optimality caused by feature correction in traditional serial architectures and maintains the independence of global statistical information and local structural features.

Fig. 2

(a) Heterogeneous convolution and parallel attention enhancement module; (b) Simple pixel attention; (c) Pixel attention; (d) Channel attention.

First, batch normalization is applied to the input to mitigate internal covariate shift and improve generalization. This is followed by a pointwise convolution and a standard 5×5 spatial convolution for local feature extraction. To capture multiscale contextual information, parallel dilated depthwise separable convolution branches are introduced. The resulting feature maps are concatenated along the channel axis and passed through an MLP, which comprises two pointwise convolutional layers and employs GELU as the activation function. A residual connection then adds the MLP output back to the input features, enabling the integration of both local details and context. Taking LR-HSI as an example:

Here, PWConv denotes pointwise convolution, while DWDConv19 refers to a 7×7 depthwise dilated convolution with a dilation rate of 3, resulting in an effective receptive field of 19×19. Similarly, DWDConv13 employs a 5×5 kernel with the same dilation rate, achieving an equivalent 13×13 receptive field. DWDConv7 uses a 3×3 kernel under the same configuration, yielding an effective receptive field of 7×7.
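The receptive fields quoted for the three branches follow from the standard relation for a single dilated kernel, d·(k − 1) + 1, where k is the kernel size and d the dilation rate. A short check (the helper name `effective_receptive_field` is only illustrative):

```python
def effective_receptive_field(kernel_size: int, dilation: int) -> int:
    """Spatial extent covered by one dilated convolution kernel."""
    return dilation * (kernel_size - 1) + 1

# Kernel sizes of the three branches; all use a dilation rate of 3.
branches = {"DWDConv19": 7, "DWDConv13": 5, "DWDConv7": 3}
fields = {name: effective_receptive_field(k, dilation=3)
          for name, k in branches.items()}
print(fields)
```

The branch names therefore encode the receptive field rather than the kernel size.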

BatchNorm is applied to normalize x₃ through $x_4 = \mathrm{BatchNorm}(x_3)$, and the resulting x₄ is then fed into three parallel attention branches: a simple pixel attention, a pixel attention, and a channel attention.

Pixel attention focuses on key details in the image by adaptively assigning different weights to each pixel, thereby enhancing the detail features. The simple pixel attention module consists of two branches: PF_S and PA_S, as shown in Fig. 2 (b). PF_S is responsible for feature extraction, while PA_S serves as the pixel selection signal, controlling the feature flow from PF_S.

Pixel attention includes the PA_p branch, which can extract global pixel gating features, as shown in Fig. 2 (c).

A PWConv-GELU-PWConv block is used to fit the features, followed by a Sigmoid activation to generate global pixel-wise gating weights. These weights, denoted as PA_p, are then used to guide pixel selection in the x₄ feature space.

Channel attention enhances discriminative feature representation by assigning adaptive weights across channels, thereby highlighting important information while suppressing redundancy. As illustrated in Fig. 2 (d), the CA_c branch is designed to capture features across the entire channel dimension.

Global channel gating features are extracted through a sequence of Global Average Pooling (GAP), a PWConv-GELU-PWConv, and a Sigmoid activation. The resulting output, denoted as CA_c, serves as the global channel gating signal for the x₄ feature map.

The outputs from the three attention gates are first concatenated along the channel axis. To align the dimensionality with that of x₃, a PWConv-GELU-PWConv-based MLP is employed for channel reduction. The resulting features are then fused with x₃ via residual addition.

By employing dilated depthwise separable convolutions and tightly coupling multiscale feature enhancement with a parallel attention mechanism, this module achieves a good balance among dynamic range expansion, detail preservation, and computational efficiency.
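The three attention gates and their fusion can be sketched in NumPy as follows. This is an illustrative sketch under several assumptions, not the authors' code: pointwise convolutions are written as channel-wise matrix products, the simple pixel attention gate is assumed to use a Sigmoid, every gate keeps C channels, and the parameter names in `params` are hypothetical.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pwconv(x, w):
    # A pointwise (1x1) convolution on an (H, W, C_in) tensor is a matmul over channels.
    return x @ w

def parallel_attention(x4, params):
    # Simple pixel attention: feature branch PF_S gated by selection signal PA_S.
    spa = pwconv(x4, params["pf_s"]) * sigmoid(pwconv(x4, params["pa_s"]))
    # Pixel attention: PWConv-GELU-PWConv followed by Sigmoid gives gate PA_p.
    pa_p = sigmoid(pwconv(gelu(pwconv(x4, params["pa1"])), params["pa2"]))
    pa = x4 * pa_p
    # Channel attention: global average pooling, PWConv-GELU-PWConv, Sigmoid.
    gap = x4.mean(axis=(0, 1), keepdims=True)
    ca_c = sigmoid(pwconv(gelu(pwconv(gap, params["ca1"])), params["ca2"]))
    ca = x4 * ca_c
    # Concatenate the three gated outputs and reduce 3C -> C with an MLP.
    cat = np.concatenate([spa, pa, ca], axis=-1)
    return pwconv(gelu(pwconv(cat, params["mlp1"])), params["mlp2"])

rng = np.random.default_rng(0)
C = 8
x3 = rng.standard_normal((16, 16, C))
# Per-channel standardisation as a stand-in for BatchNorm.
x4 = (x3 - x3.mean(axis=(0, 1))) / (x3.std(axis=(0, 1)) + 1e-5)
shapes = {"pf_s": (C, C), "pa_s": (C, C), "pa1": (C, C), "pa2": (C, C),
          "ca1": (C, C), "ca2": (C, C), "mlp1": (3 * C, C), "mlp2": (C, C)}
params = {name: 0.1 * rng.standard_normal(shape) for name, shape in shapes.items()}
out = x3 + parallel_attention(x4, params)  # residual addition with x3
```

Because the three gates are computed from the same x₄ and only merged afterwards, none of them sees features already corrected by another gate, which is the property the parallel design is meant to preserve.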

2.3. Multiscale local-global feature extraction module

The multiscale local-global feature extraction module is designed to capture non-local and complementary information across multiple scales from X↑, Y↓, and Y↓↑↓, as shown in Fig. 3. Specifically, the first layer applies a 7×7 convolution (stride = 1, padding = 3), followed by normalization and activation. This is followed by a 3×3 convolution (stride = 1, padding = 1), a normalization layer, an activation function, and a Convformer to extract features at the initial scale. By recursively applying convolution and Convformer, local and global features of the input image are progressively extracted from high resolution to low resolution. Taking X↑ as an example, the expression is as follows:

Here, X represents the LR-HSI, X↑ represents the HSI after up-sampling, and LR₁, LR₂, and LR₃ correspond to HSI features extracted at three different scales.

Fig. 3

Multiscale local-global feature extraction module

Convformer integrates 3×3 and 1×1 convolutions with transformer by leveraging tensor stacking and decomposition, achieving an effective combination of convolution and transformer operations. In the feature extraction of HSI and MSI, CNN and Transformer each have their unique focus and advantages, and their combination enables the comprehensive capture of both local and global information. Convolution operations, through a local perception mechanism, focus on extracting local spatial features of the image, such as edges, textures, and local associations between different spectral bands. In contrast, Transformer operations rely on the self-attention mechanism, which captures global associations between different spectral bands and spatial regions of the image. By combining the local feature extraction capability of CNN with the global association modeling ability of Transformer, Convformer can comprehensively consider both local and global information in HSI and MSI processing, thereby enhancing the effectiveness of feature extraction. The formula for the self-attention mechanism is shown below:

$$Q = XW_q,\quad K = XW_k,\quad V = XW_v,\qquad \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

where W_q, W_k, and W_v are the weight matrices to be trained, and d_k is the dimension of the key vectors. By multiplying the input X with these weight matrices, the corresponding query Q, key K, and value V are obtained. Then, the dot product of Q and the transpose of K is computed, and the resulting values are scaled before being passed into the softmax function. This function maps the computed results to attention values between 0 and 1, which reflect the importance of each element of V.
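As a concrete illustration of scaled dot-product self-attention, the following NumPy sketch treats each row of X as one token. The scaling by the square root of the key dimension follows the standard Transformer formulation and is assumed here; the function names are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """X: (tokens, d_model). Returns the attended values and the attention map."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # scaled dot products
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 16))
W_q, W_k, W_v = (0.1 * rng.standard_normal((16, 8)) for _ in range(3))
out, weights = self_attention(X, W_q, W_k, W_v)
```

Each row of `weights` is a probability distribution over all tokens, which is how every output position can draw on information from the whole image rather than a local window.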

Multiscale local details and global information of the image are progressively extracted during the down-sampling process. Subsequently, three feature maps of the same scale are fed into the deep cross-attention fusion module, which promotes information interaction and fusion across both spatial and spectral domains.

2.4. Deep cross-attention fusion module

To effectively combine the feature representations of hyperspectral and multispectral data, and to achieve the interactive fusion of spectral and spatial information, the deep cross-attention fusion module is proposed, as shown in Fig. 4. The module aims to extract meaningful correlations between LR_i, HR_i, and HR_UD_i (i = 1, 2, 3) through a cross-attention mechanism, thereby enabling more efficient feature fusion.

Fig. 4

Deep cross-attention fusion module

Cross-attention operations are performed on LR_i, HR_i, and HR_UD_i (i = 1, 2, 3), as given by the following formula:

Here, i ∈ {1, 2, 3}, and k ∈ {1, 5, 7} is the size of the convolutional kernel. The query, key, and value matrices are defined by convolutional layers with different kernel sizes.

Subsequently, the association information from each layer is concatenated along the channel dimension and then processed by a 1×1 convolution to obtain the final fused features.
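A minimal sketch of cross-attention between same-scale feature maps is given below. Several simplifications are assumed: linear projections replace the convolutional layers with kernel sizes 1, 5, and 7, the pairing of query and context streams in `fuse` is hypothetical (the text does not specify it), and the final 1×1 convolution is written as a channel-wise matrix product.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query_feat, context_feat, W_q, W_k, W_v):
    """Queries come from one feature map, keys and values from another."""
    h, w, _ = query_feat.shape
    q = query_feat.reshape(h * w, -1) @ W_q
    k = context_feat.reshape(h * w, -1) @ W_k
    v = context_feat.reshape(h * w, -1) @ W_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    return (attn @ v).reshape(h, w, -1)

def fuse(lr, hr, hr_ud, projections, W_out):
    # Hypothetical pairing of (query stream, context stream).
    pairs = [(lr, hr), (lr, hr_ud), (hr, lr)]
    outs = [cross_attention(q, ctx, *projections[n]) for n, (q, ctx) in enumerate(pairs)]
    cat = np.concatenate(outs, axis=-1)  # concatenate along the channel dimension
    return cat @ W_out                   # 1x1 convolution as a channel matmul

rng = np.random.default_rng(0)
C = 8
lr, hr, hr_ud = (rng.standard_normal((8, 8, C)) for _ in range(3))
projections = [tuple(0.1 * rng.standard_normal((C, C)) for _ in range(3)) for _ in range(3)]
W_out = 0.1 * rng.standard_normal((3 * C, C))
fused = fuse(lr, hr, hr_ud, projections, W_out)
```

The difference from self-attention is only where the keys and values come from: the attention map relates every position of one modality to every position of the other, so spectral features can query spatial features and vice versa.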

2.5. Loss Function for MCA-Net

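We employ the straightforward and widely used root mean square error (RMSE) as the loss function. The RMSE is defined as:

$$\mathcal{L}_{\mathrm{RMSE}} = \sqrt{\frac{1}{HWC}\left\| I - G \right\|_F^2}$$

where I is the estimated HR-HSI and G represents the reference HR-HSI.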

3. Results

We tested the effectiveness of MCA-Net on five datasets.

Washington DC: The Washington DC dataset was acquired in 1995 by the Hyperspectral Digital Imagery Collection Experiment (HYDICE) sensor over the National Mall in Washington, D.C. The dataset covers a wavelength range from 0.4 to 2.5 µm with 210 bands. After excluding water vapor absorption bands, 191 bands remain. The spatial resolution is 2.5 m, and the image size is 1208×307 pixels.

Pavia Center: The Pavia Center dataset was collected by the ROSIS sensor during its flight over Pavia in northern Italy. The spectral range is from 0.43 to 0.86 µm, with a total of 115 bands. After processing, 102 bands were retained. The image size is 1096×715 pixels, and the spatial resolution is 1.3 meters.

Pavia University: The Pavia University dataset was acquired in 2003 by the Reflective Optics System Imaging Spectrometer (ROSIS) sensor over the Pavia University area in Italy. The image size is 610×340 pixels, with 115 spectral bands, of which 12 bands are discarded. The spectral range covered is from 0.43 to 0.86 µm, with a 10 nm interval, and the spatial resolution is 1.3 m.

Botswana: The Botswana dataset was acquired by NASA's EO-1 satellite between 2001 and 2004 using the Hyperion sensor over the Okavango Delta in Botswana. The dataset contains 242 bands, covering a spectral range from 0.4 to 2.5 µm with a 10 nm interval. After removing uncalibrated water absorption features and noise bands, 145 bands remain. The spatial resolution is 30 m, and the image size is 1496×256 pixels.

Chikusei: The Chikusei [29] dataset was acquired on July 29, 2014, in Chikusei, Japan, using the Headwall Hyperspec-VNIR-C sensor. This dataset includes both hyperspectral and multispectral data, with a spectral range from 363 nm to 1018 nm and a spatial resolution of 2.5 meters. The hyperspectral data consists of 128 spectral bands, while the multispectral data includes 5 spectral bands. The image size is 2517 × 2335 pixels.

3.1. Experimental Settings

This article simulates the LR-HSI and HR-MSI by degrading the HR-HSI. Specifically, the LR-HSI and HR-MSI are regarded as spatially and spectrally sub-sampled versions of the HR-HSI, and the simulation is performed under this assumption. First, a 5×5 Gaussian filter is applied to the HR-HSI to obtain a blurred HSI. The blurred image is then down-sampled by a factor of four to generate the LR-HSI. Finally, five bands are selected from the HR-HSI at fixed intervals to form the HR-MSI.
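The simulation procedure can be sketched as follows. The Gaussian standard deviation, the reflective border handling, the use of simple strided sub-sampling with a down-sampling factor of 4, and the choice of evenly spaced band indices that include both end bands are all assumptions made for this illustration; the function names are hypothetical.

```python
import numpy as np

def gaussian_kernel(size=5, sigma=1.0):
    ax = np.arange(size) - (size - 1) / 2.0
    g = np.exp(-(ax ** 2) / (2.0 * sigma ** 2))
    k = np.outer(g, g)
    return k / k.sum()  # normalise so the kernel sums to 1

def simulate_inputs(hr_hsi, ratio=4, n_msi_bands=5, sigma=1.0):
    """Return (LR-HSI, HR-MSI) simulated from an (H, W, C) HR-HSI."""
    H, W, C = hr_hsi.shape
    k = gaussian_kernel(5, sigma)
    padded = np.pad(hr_hsi, ((2, 2), (2, 2), (0, 0)), mode="reflect")
    blurred = np.zeros((H, W, C), dtype=np.float64)
    for dy in range(5):  # accumulate the 5x5 weighted neighbourhood
        for dx in range(5):
            blurred += k[dy, dx] * padded[dy:dy + H, dx:dx + W, :]
    lr_hsi = blurred[::ratio, ::ratio, :]  # spatial down-sampling
    band_idx = np.linspace(0, C - 1, n_msi_bands).round().astype(int)
    hr_msi = hr_hsi[:, :, band_idx]        # spectral sub-sampling
    return lr_hsi, hr_msi
```

For a 128×128×102 Pavia-style cube this yields a 32×32×102 LR-HSI and a 128×128×5 HR-MSI.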

In this article, the central 128×128 region of the image is designated as the test set, while the surrounding area is used as the training region. To strictly adhere to the non-overlap principle, a zero-value mask is first applied to the central test region to prevent overlap between the training and test data. Subsequently, random 128×128 subregions are selected from the training area for training.
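A sketch of this split, assuming the image is stored as an (H, W, C) NumPy array; the helper names `split_center` and `random_crop` are illustrative:

```python
import numpy as np

def split_center(image, size=128):
    """Cut out the central test patch and zero it in the training image."""
    H, W, _ = image.shape
    top, left = (H - size) // 2, (W - size) // 2
    test = image[top:top + size, left:left + size, :].copy()
    train = image.copy()
    train[top:top + size, left:left + size, :] = 0  # zero-value mask
    return train, test, (top, left)

def random_crop(train, size=128, rng=None):
    """Sample a random size x size training patch from the masked image."""
    rng = np.random.default_rng() if rng is None else rng
    H, W, _ = train.shape
    y = int(rng.integers(0, H - size + 1))
    x = int(rng.integers(0, W - size + 1))
    return train[y:y + size, x:x + size, :]
```

Because the test region is zeroed before sampling, a training crop that overlaps the centre only ever sees zeros there, so no test pixel leaks into training.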

The experiments were conducted on a Windows 10 operating system with an NVIDIA GeForce RTX 4090 GPU. The network was implemented in Python 3.10 using the PyTorch 2.1.0 deep learning framework. The Adam optimizer was used for training, with an initial learning rate of 1e-4.

3.2. Quantitative Metrics

This article uses six metrics to evaluate the network on the HSI and MSI fusion task: Root Mean Square Error (RMSE) [30], Peak Signal-to-Noise Ratio (PSNR) [31], the Relative Dimensionless Global Error in Synthesis (ERGAS) [32], Structural Similarity Index (SSIM) [33], Spectral Angle Mapper (SAM) [34], and Correlation Coefficient (CC) [35].

RMSE is used to evaluate the average deviation between each pixel in the fused image and the ground truth (GT). The smaller the value, the smaller the difference between the fused image and the true values, indicating better fusion performance. It is defined as:

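$$\mathrm{RMSE} = \sqrt{\frac{1}{M \times N}\sum_{i=1}^{M}\sum_{j=1}^{N}\left(G(i,j) - I(i,j)\right)^{2}}$$

where (i, j) denotes the pixel in the image, G and I represent the GT and the model-predicted fused image, respectively, and M×N represents the image size.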

PSNR is calculated from the mean squared error (MSE), i.e., the average of the squared differences between corresponding pixels, computed across all bands. It also measures the pixel-level error and assesses the quality of the reconstructed image. Its definition is as follows:

where G_i and I_i represent the i-th band of the GT and the fused HR-HSI, respectively, and C is the total number of pixels in that band. A higher PSNR value indicates less image distortion and therefore better image quality.
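The relation between RMSE and PSNR can be checked numerically. The sketch below uses a global form, 20·log₁₀(peak / RMSE), with an assumed peak value of 255; this is an assumption about the evaluation code, not a statement of the exact formula used. Under that assumption, the RMSE of 2.6160 reported for the proposed method on Pavia Center gives about 39.78 dB, which agrees with the PSNR listed in the same row of Table 2.

```python
import numpy as np

def rmse(gt, fused):
    """Root mean square error over all pixels and bands."""
    diff = gt.astype(np.float64) - fused.astype(np.float64)
    return float(np.sqrt(np.mean(diff ** 2)))

def psnr_from_rmse(rmse_value, peak=255.0):
    """PSNR in dB for a given RMSE and peak signal value (peak is an assumption)."""
    return float(20.0 * np.log10(peak / rmse_value))

print(round(psnr_from_rmse(2.6160), 2))
```

The peak value depends on the data range of each dataset, so the same check does not necessarily hold with 255 for images stored in a different range.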

SSIM evaluates the similarity between the fused image and the reference image by combining luminance, contrast, and structure. The SSIM calculation is as follows:

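$$\mathrm{SSIM}(G, I) = \frac{\left(2\mu_G\mu_I + c_1\right)\left(2\sigma_{GI} + c_2\right)}{\left(\mu_G^{2} + \mu_I^{2} + c_1\right)\left(\sigma_G^{2} + \sigma_I^{2} + c_2\right)}$$

where µ and σ² represent the mean and variance of each image, σ_GI is the covariance between the two images, and c₁ and c₂ are small constants used for stability. SSIM measures the structural similarity between the reference image and the fused image, with a range from 0 to 1. A value closer to 1 indicates higher structural similarity and lower structural loss.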

SAM measures the spectral similarity between the reference image and the fused image as the angle between their spectral vectors. It is defined as:

$$\mathrm{SAM} = \arccos\left(\frac{\langle G, I\rangle}{\|G\|_2\,\|I\|_2}\right)$$

where ⟨G, I⟩ represents the inner product between the GT and the fused image HR-HSI, and $\|G\|_2$ and $\|I\|_2$ represent the L₂ norms of GT and HR-HSI, respectively. A smaller SAM value indicates less spectral distortion during the fusion process.
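A minimal NumPy sketch of SAM, computed per pixel over the spectral dimension and averaged over the image. Reporting the angle in degrees and adding a small `eps` to avoid division by zero are assumptions made for this illustration.

```python
import numpy as np

def sam_degrees(gt, fused, eps=1e-12):
    """Mean spectral angle in degrees between two (H, W, C) images."""
    g = gt.reshape(-1, gt.shape[-1]).astype(np.float64)
    f = fused.reshape(-1, fused.shape[-1]).astype(np.float64)
    dot = np.sum(g * f, axis=1)  # inner product per pixel
    denom = np.linalg.norm(g, axis=1) * np.linalg.norm(f, axis=1) + eps
    angles = np.arccos(np.clip(dot / denom, -1.0, 1.0))
    return float(np.degrees(angles.mean()))
```

Scaling a spectrum by a constant leaves its angle unchanged, so SAM is insensitive to brightness differences and isolates distortion in the spectral shape.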

ERGAS measures the spectral quality of fused images. It is expressed as:

where r represents the down-sampling rate of the image. The smaller the ERGAS value, the smaller the overall structural and spectral differences between the GT and HR-HSI, indicating better image quality.
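One commonly used form of ERGAS is 100/r times the square root of the band-averaged ratio between the per-band MSE and the squared per-band mean of the reference. The sketch below implements that form; whether it matches the exact variant used in the experiments (for example, which image supplies the band means) is an assumption.

```python
import numpy as np

def ergas(gt, fused, ratio=4):
    """ERGAS for (H, W, C) images; ratio is the spatial down-sampling rate r."""
    g = gt.reshape(-1, gt.shape[-1]).astype(np.float64)
    f = fused.reshape(-1, fused.shape[-1]).astype(np.float64)
    mse_per_band = np.mean((g - f) ** 2, axis=0)
    mean_per_band = np.mean(g, axis=0)
    return float(100.0 / ratio * np.sqrt(np.mean(mse_per_band / mean_per_band ** 2)))
```

Normalizing each band's error by its mean keeps bright bands from dominating the score, which is why ERGAS is described as dimensionless.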

CC measures the linear correlation between the fused image and the reference image. A higher CC value indicates a stronger correlation, while a lower value indicates a weaker correlation. For a single band, CC is defined as:

$$\mathrm{CC} = \frac{\sum_{i=1}^{M}\sum_{j=1}^{N}\left(G(i,j) - \bar{G}\right)\left(I(i,j) - \bar{I}\right)}{\sqrt{\sum_{i=1}^{M}\sum_{j=1}^{N}\left(G(i,j) - \bar{G}\right)^{2}\,\sum_{i=1}^{M}\sum_{j=1}^{N}\left(I(i,j) - \bar{I}\right)^{2}}}$$

where $\bar{G}$ and $\bar{I}$ represent the mean values of GT and HR-HSI, respectively. When using CC to objectively evaluate hyperspectral images, the CC value for each band is first calculated, and then the average is taken. The closer the CC value is to 1, the better the quality of the estimated image HR-HSI.
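The band-wise averaging described above can be sketched as follows (the function name `cc` is illustrative, and constant bands, which would give a zero denominator, are not handled):

```python
import numpy as np

def cc(gt, fused):
    """Mean per-band Pearson correlation between two (H, W, C) images."""
    values = []
    for b in range(gt.shape[-1]):
        g = gt[..., b].ravel().astype(np.float64)
        f = fused[..., b].ravel().astype(np.float64)
        g = g - g.mean()
        f = f - f.mean()
        values.append(np.sum(g * f) / np.sqrt(np.sum(g ** 2) * np.sum(f ** 2)))
    return float(np.mean(values))
```

Since CC is invariant to a per-band gain and offset, a value near 1 indicates matching spatial patterns but not necessarily matching intensities, which is why it is reported alongside RMSE and PSNR.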

3.3. Ablation Experiments

To validate the effectiveness of each module in MCA-Net, four groups of ablation experiments were conducted on the Washington D.C. dataset. The subjective performance evaluations and objective metrics of the fused images are presented in Fig. 5 and Table 1. By combining objective metrics with subjective assessments, the performance of the fused images was comprehensively evaluated, providing a clear and intuitive demonstration of the role of each module in the network.

Figure 5. Fusion results of the ablation study on the Washington DC dataset. The first row shows the pseudo-RGB images of the fused results, while the second row displays the difference images between the fused images and the GT. (a) LR-HSI. (b) Without HCPAEM. (c) Without MLGFEM. (d) Without DCAFM. (e) Without SSRM. (f) Ours. (g) GT.

Table 1
Objective metrics on the Washington DC dataset for the ablation of each module. MCA-Net achieves the best result on every metric.
Component	RMSE ↓	PSNR ↑	ERGAS ↓	SAM ↓	SSIM ↑	CC ↑
Without HCPAEM	1.3658	42.7590	0.2377	0.4662	0.7186	0.9886
Without MLGFEM	1.3797	42.6708	0.2395	0.4645	0.7142	0.9882
Without DCAFM	1.7025	40.8451	0.2986	0.5713	0.6617	0.9821
Without SSRM	1.8292	40.2214	0.3171	0.6131	0.7042	0.9788
MCA-Net	0.7093	48.4503	0.1229	0.2375	0.8491	0.9970

(1) HCPAEM ablation studies: The HCPAEM module within MCA-Net plays a critical role in enhancing image features. By integrating dilated depthwise separable convolutions with a parallel attention mechanism, HCPAEM effectively strengthens both spectral and spatial features of the image. Compared to the network without HCPAEM, the proposed method achieves reductions of 48.07%, 48.30%, and 49.06% in RMSE, ERGAS, and SAM, respectively. Additionally, improvements of 13.31%, 18.16%, and 0.85% are observed in PSNR, SSIM, and CC, respectively.

(2) MLGFEM ablation studies: The MLGFEM module effectively extracts hierarchical feature representations by integrating image information from multiple spatial scales. In this set of experiments, MLGFEM was replaced by a strategy that first applies upsampling and downsampling to the original image, followed by feature extraction and fusion at a unified scale. Experimental results show that, compared to the baseline network without MLGFEM, the proposed method achieves reductions of 48.59%, 48.68%, and 48.87% in RMSE, ERGAS, and SAM, respectively. In addition, it leads to improvements of 13.54%, 18.89%, and 0.89% in PSNR, SSIM, and CC, respectively. These results clearly validate the effectiveness and robustness of the proposed module in extracting both global structural information and local details from the image.

(3) DCAFM ablation studies: The DCAFM module adopts a three-branch architecture, enabling deep fusion of spectral and spatial features by capturing meaningful correlations among the input images. To evaluate the practical contribution of DCAFM, it was replaced in this set of experiments with a simple image concatenation strategy. Experimental results show that, compared to the network without DCAFM, the proposed method achieves improvements of 18.62%, 28.32%, and 1.51% in PSNR, SSIM, and CC, respectively. Meanwhile, it significantly reduces RMSE, ERGAS, and SAM by 58.34%, 58.84%, and 58.43%, respectively, demonstrating clear performance advantages. These results strongly validate the effectiveness of the three-branch deep cross-attention fusion mechanism in facilitating efficient interactive integration of spatial and spectral features, while preserving fine spatial details throughout the fusion process.

(4) SSRM ablation studies: SSRM plays a critical role in high-resolution image reconstruction. To evaluate its effectiveness, in this set of experiments, the SSRM module is replaced by a simple strategy in which the fused images at three different scales are individually upsampled and then directly concatenated. The experimental results show that removing SSRM leads to a noticeable performance degradation across several key metrics. Specifically, compared to the network without SSRM, the proposed method achieves reductions of 61.22%, 61.24%, and 61.26% in RMSE, ERGAS, and SAM, respectively. In addition, improvements of 20.46%, 20.58%, and 1.86% are observed in PSNR, SSIM, and CC, respectively, further confirming the positive contribution of SSRM to the quality of image reconstruction.

In summary, as shown in Fig. 5 and Table 1, when any of the MLGFEM, DCAFM, or SSRM modules is removed from MCA-Net, the resulting fused images exhibit noticeable distortions in both spatial structure and spectral fidelity, accompanied by a significant decline in objective evaluation metrics. Although the removal of the HCPAEM module does not lead to obvious distortions, the fused results appear somewhat blurred, and the corresponding quantitative performance also degrades. These findings collectively demonstrate that each component of the proposed method plays an indispensable role in enhancing the overall quality of image fusion.
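The percentages quoted in the ablation study are relative changes with respect to the ablated network, and they can be recomputed directly from Table 1. A short check for the HCPAEM case (the helper name `relative_change` is illustrative):

```python
def relative_change(baseline: float, proposed: float) -> float:
    """Absolute relative change of `proposed` with respect to `baseline`, in percent."""
    return abs(proposed - baseline) / baseline * 100.0

# Values from Table 1: (without HCPAEM, MCA-Net).
table1 = {"RMSE": (1.3658, 0.7093), "ERGAS": (0.2377, 0.1229), "SAM": (0.4662, 0.2375),
          "PSNR": (42.7590, 48.4503), "SSIM": (0.7186, 0.8491), "CC": (0.9886, 0.9970)}
changes = {name: round(relative_change(base, ours), 2)
           for name, (base, ours) in table1.items()}
print(changes)
```

For error metrics (RMSE, ERGAS, SAM) the change is a reduction, and for PSNR, SSIM, and CC it is an improvement; the same helper applies to the other three ablation rows.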

3.4. Comparisons with state-of-the-art Methods

To validate the superiority of this model, comparative experiments were conducted on five public datasets using eight methods. These methods include two traditional algorithms: CNMF [36] and FUSE [37], as well as six deep learning-based fusion algorithms: TFNet [39], SSRNet [21], HSRNet [40], MCT [38], MDC [28], AMSF [41].

(1) Experiments on Pavia Center dataset

To better highlight the superiority of the fusion results, pseudo-RGB images were generated for channels 67, 29, and 1. The error maps between the fusion results and the ground truth (GT) are presented in the second row of Fig. 6. From Fig. 6, it can be observed that the HSRNet method exhibits spectral distortion. Additionally, the methods TFNet, SSRNet, and AMSF show noticeable texture loss at the edges of the walls. These results strongly indicate the effectiveness of the proposed method on the Pavia Center dataset.

The objective evaluation metrics of the fusion results obtained by the compared methods on the Pavia Center dataset are presented in Table 2. As shown in Table 2, the proposed MCA-Net outperforms all other algorithms and achieves the best performance across all six metrics. Compared to the second-ranking method on this dataset, the RMSE, ERGAS, and SAM are reduced by 7.97%, 6.89%, and 5.31%, respectively. Furthermore, the PSNR, SSIM, and CC are improved by 1.86%, 0.09%, and 0.04%, respectively.
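The relative changes quoted above follow from simple ratios against the second-ranking method. The short Python sketch below recomputes the three reductions from the Table 2 values for MCT and the proposed method; the helper names `relative_reduction` and `relative_gain` are illustrative, not part of the original work.

```python
def relative_reduction(second_best: float, ours: float) -> float:
    """Percentage reduction for lower-is-better metrics (RMSE, ERGAS, SAM)."""
    return (second_best - ours) / second_best * 100.0


def relative_gain(second_best: float, ours: float) -> float:
    """Percentage gain for higher-is-better metrics (PSNR, SSIM, CC)."""
    return (ours - second_best) / second_best * 100.0


# Table 2 values: MCT (second-ranking) versus the proposed method.
mct = {"RMSE": 2.8427, "ERGAS": 3.2184, "SAM": 3.6853}
ours = {"RMSE": 2.6160, "ERGAS": 2.9967, "SAM": 3.4895}

reductions = {k: round(relative_reduction(mct[k], ours[k]), 2) for k in mct}
ssim_gain = round(relative_gain(0.9862, 0.9871), 2)
```

The same two helpers apply to the other datasets, provided the direction of each metric (lower or higher is better) is respected.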

Figure 6. Fusion results of different methods on the Pavia Center dataset. The first row shows the pseudo-RGB image of the estimated HR-HSI (bands 67-29-1), and the second row shows the pseudo-color difference image between the estimated pseudo-RGB image and the reference image. (a) LR-HSI. (b) TFNet. (c) SSRNet. (d) HSRNet. (e) MCT. (f) MDC. (g) AMSF. (h) Ours. (i) GT.

Table 2
Objective metrics results of methods on the Pavia Center dataset. The best results are in bold, and the suboptimal results are underlined.
Method	RMSE $\downarrow$	PSNR $\uparrow$	ERGAS $\downarrow$	SAM $\downarrow$	SSIM $\uparrow$	CC $\uparrow$
CNMF	15.4877	23.7879	36.9857	4.2639	0.6310	0.7949
FUSE	21.4895	21.4826	12.3575	19.1129	0.6667	0.9921
TFNet	4.1304	35.8111	4.6620	4.7535	0.9762	0.9931
SSRNet	3.6121	36.9756	4.1182	4.0199	0.9823	0.9950
HSRNet	3.7053	36.7544	4.0921	4.3068	0.9809	0.9944
MCT	2.8427	39.0562	3.2184	3.6853	0.9862	0.9968
MDC	3.0896	38.3329	3.4808	4.0196	0.9847	0.9962
AMSF	3.9967	36.0967	4.4700	4.9377	0.9745	0.9935
Ours	2.6160	39.7781	2.9967	3.4895	0.9871	0.9972

(2) Experiments on Pavia University dataset

To further demonstrate the effectiveness of the method, the fusion results on the Pavia University dataset are analyzed visually. Figure 7 shows the pseudo-RGB images (R-67, G-29, B-1) obtained using various fusion algorithms, along with the error maps relative to the ground truth (GT). The proposed algorithm outperforms the other methods in terms of spatial and spectral fidelity on the airplane and surrounding buildings.

Figure 7. Fusion results of different methods on the Pavia University dataset. The first row shows the pseudo-RGB image of the estimated HR-HSI (bands 67-29-1), and the second row shows the pseudo-color difference image between the estimated pseudo-RGB image and the reference image. (a) LR-HSI. (b) TFNet. (c) SSRNet. (d) HSRNet. (e) MCT. (f) MDC. (g) AMSF. (h) Ours. (i) GT.

Table 3 presents the objective evaluation metrics of the fusion results obtained by the compared methods on the Pavia University dataset. As shown in Table 3, the proposed method achieves the best performance across all six evaluation metrics, with MCT ranking second on each of them. Compared to the second-ranking method on this dataset, the RMSE, ERGAS, and SAM are reduced by 7.27%, 5.4%, and 4.94%, respectively. Additionally, the PSNR, SSIM, and CC are improved by 1.51%, 0.16%, and 0.02%, respectively.

Table 3
Objective metrics results of methods on the Pavia University dataset. The best results are in bold, and the suboptimal results are underlined.
Method	RMSE $\downarrow$	PSNR $\uparrow$	ERGAS $\downarrow$	SAM $\downarrow$	SSIM $\uparrow$	CC $\uparrow$
CNMF	6.8154	31.7156	4.0917	2.1695	0.9464	0.9573
FUSE	2.4100	39.6765	1.7442	2.3986	0.9785	0.9957
TFNet	2.2459	40.7879	1.6003	2.4965	0.9687	0.9961
SSRNet	1.6846	43.2859	1.2391	1.9798	0.9804	0.9978
HSRNet	2.2026	40.9569	1.5125	2.2142	0.9758	0.9966
MCT	1.6680	43.3716	1.2342	1.9540	0.9805	0.9979
MDC	1.8293	42.5702	1.3331	2.1324	0.9780	0.9974
AMSF	3.3176	37.3991	2.1948	2.2713	0.9516	0.9918
Ours	1.5468	44.0268	1.1675	1.8574	0.9821	0.9981

(3) Experiments on Washington DC dataset

To further validate the effectiveness of the proposed method, the fusion results on the Washington DC dataset are analyzed visually. Figure 8 presents the results obtained using various fusion algorithms, including pseudo-color images (R-54, G-34, B-10) and error maps relative to the GT. As shown in Fig. 8, the fusion results of SSRNet and HSRNet exhibit noticeable blurring and artifacts, which significantly degrade their perceptual quality. In contrast, the proposed method, along with MDC and TFNet, preserves more detailed spatial information. Despite the large number of spectral bands in the Washington DC dataset, the proposed method achieves superior results in terms of both quantitative metrics and visual quality. These findings confirm that the proposed approach effectively preserves both spectral and spatial information.

Table 4 presents the objective evaluation metrics of the compared methods on the Washington DC dataset. The proposed method achieves the best performance on all metrics except SSIM and CC. Specifically, its SSIM score is 0.0928 lower than the best result, and its CC score is just 0.0006 below the top value. Notably, the proposed method surpasses the second-best approach in PSNR by 11.76%, highlighting its superior capability in fusing LR-HSI with HR-MSI. Furthermore, compared to the second-best results, the proposed method reduces RMSE, ERGAS, and SAM by 44.4%, 44.71%, and 43.2%, respectively, further validating its effectiveness and robustness on this dataset.
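For readers who want to reproduce this kind of evaluation, the sketch below implements three of the reported metrics (RMSE, PSNR, and SAM) using their common textbook definitions. It is an assumption-laden illustration: the peak value of 255, the global (rather than per-band) averaging, and the function names are not taken from this article, so the numbers it produces need not match the tables exactly.

```python
import math
from typing import List

Spectra = List[List[float]]  # one spectrum (list of band values) per pixel


def rmse(est: Spectra, ref: Spectra) -> float:
    """Root-mean-square error over all pixels and bands."""
    sq = [(e - r) ** 2 for pe, pr in zip(est, ref) for e, r in zip(pe, pr)]
    return math.sqrt(sum(sq) / len(sq))


def psnr(est: Spectra, ref: Spectra, peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB (peak value is an assumption)."""
    err = rmse(est, ref)
    return float("inf") if err == 0 else 20.0 * math.log10(peak / err)


def sam_degrees(est: Spectra, ref: Spectra) -> float:
    """Mean spectral angle between estimated and reference spectra, in degrees."""
    angles = []
    for pe, pr in zip(est, ref):
        dot = sum(e * r for e, r in zip(pe, pr))
        norm = math.sqrt(sum(e * e for e in pe)) * math.sqrt(sum(r * r for r in pr))
        if norm == 0:
            continue  # skip all-zero spectra, whose angle is undefined
        cos = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
        angles.append(math.degrees(math.acos(cos)))
    return sum(angles) / len(angles) if angles else 0.0


# Toy example: two pixels with two bands each.
reference = [[1.0, 0.0], [0.0, 2.0]]
estimate = [[1.0, 1.0], [0.0, 2.0]]
toy_rmse = rmse(estimate, reference)
toy_psnr = psnr(estimate, reference)
toy_sam = sam_degrees(estimate, reference)
```

In the toy example the first pixel is rotated by 45 degrees and the second is exact, so the mean spectral angle is about 22.5 degrees while the RMSE is 0.5.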

Table 4
Objective metrics results of methods on the Washington DC dataset. The best results are in bold, and the suboptimal results are underlined.
Method	RMSE $\downarrow$	PSNR $\uparrow$	ERGAS $\downarrow$	SAM $\downarrow$	SSIM $\uparrow$	CC $\uparrow$
CNMF	8.0639	29.9999	12.6720	7.1826	0.7711	0.9930
FUSE	3.0857	38.3436	8.3229	2.5327	0.9419	0.9976
TFNet	1.2758	43.3512	0.2223	0.4182	0.7147	0.9897
SSRNet	2.3774	37.9445	0.4149	0.8162	0.5931	0.9651
HSRNet	2.6960	36.8521	0.4670	0.7385	0.5106	0.9562
MCT	1.4389	42.3058	0.2496	0.4659	0.7436	0.9871
MDC	1.2832	43.3008	0.2231	0.4368	0.7305	0.9897
AMSF	1.6393	41.1732	0.2883	0.5589	0.6461	0.9834
Ours	0.7093	48.4503	0.1229	0.2375	0.8491	0.9970

Figure 8. Fusion results of different methods on the Washington DC dataset. The first row shows the pseudo-RGB image of the estimated HR-HSI (bands 54-34-10), and the second row shows the pseudo-color difference image between the estimated pseudo-RGB image and the reference image. (a) LR-HSI. (b) TFNet. (c) SSRNet. (d) HSRNet. (e) MCT. (f) MDC. (g) AMSF. (h) Ours. (i) GT.

(4) Experiments on Botswana dataset

Figure 9 presents the fused pseudo-color image and the error map for the Botswana dataset. In this dataset, the pseudo-color image is synthesized from channels 47, 14, and 3. Analysis of the error map reveals that the proposed method exhibits the smallest error among the evaluated approaches. In contrast, the fusion results obtained using TFNet, SSRNet, HSRNet, and MCT demonstrate significant spectral information errors, leading to a loss of detailed texture. The experimental findings indicate that the proposed algorithm effectively preserves the textural details of HR-MSI while maintaining the spectral information of LR-HSI, thereby highlighting the algorithm's robust generalization capability.

Figure 9. Fusion results of different methods on the Botswana dataset. The first row shows the pseudo-RGB image of the estimated HR-HSI (bands 47-14-3), and the second row shows the pseudo-color difference image between the estimated pseudo-RGB image and the reference image. (a) LR-HSI. (b) TFNet. (c) SSRNet. (d) HSRNet. (e) MCT. (f) MDC. (g) AMSF. (h) Ours. (i) GT.

The objective evaluation metrics of the compared methods on the Botswana dataset are presented in Table 5. The proposed method achieves the best values on all metrics except ERGAS and SSIM. Specifically, its ERGAS value is 0.3836 higher than the best value, while its SSIM value falls short of the best by 0.0248. However, significant improvements are observed in the RMSE and SAM metrics, with reductions of 24.73% and 14%, respectively, compared to the second-ranked method. Furthermore, the PSNR and CC metrics improve by 6.32% and 0.06%, respectively, compared to the second-ranked method.
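Because ERGAS is a lower-is-better metric while SSIM is higher-is-better, ranking the methods requires tracking the direction of each metric. The following Python sketch ranks the methods on three of the six metrics using values copied from Table 5; the dictionary layout and the `rank` helper are illustrative assumptions.

```python
from typing import Dict, List, Tuple

LOWER_IS_BETTER = {"RMSE": True, "ERGAS": True, "SSIM": False}

# Values copied from Table 5 (Botswana) for three of the six metrics.
table5: Dict[str, Dict[str, float]] = {
    "CNMF":   {"RMSE": 3.5281, "ERGAS": 2.5116, "SSIM": 0.9758},
    "FUSE":   {"RMSE": 3.3607, "ERGAS": 2.4946, "SSIM": 0.9783},
    "TFNet":  {"RMSE": 0.4876, "ERGAS": 2.7780, "SSIM": 0.8998},
    "SSRNet": {"RMSE": 0.4327, "ERGAS": 7.7878, "SSIM": 0.8693},
    "HSRNet": {"RMSE": 0.4056, "ERGAS": 1.6247, "SSIM": 0.9494},
    "MCT":    {"RMSE": 0.4495, "ERGAS": 3.8368, "SSIM": 0.8796},
    "MDC":    {"RMSE": 0.4102, "ERGAS": 3.7630, "SSIM": 0.9039},
    "AMSF":   {"RMSE": 0.4068, "ERGAS": 2.5165, "SSIM": 0.9303},
    "Ours":   {"RMSE": 0.3053, "ERGAS": 2.0083, "SSIM": 0.9535},
}


def rank(metric: str) -> List[Tuple[str, float]]:
    """Return (method, value) pairs sorted from best to worst for one metric."""
    rows = [(name, vals[metric]) for name, vals in table5.items()]
    return sorted(rows, key=lambda r: r[1], reverse=not LOWER_IS_BETTER[metric])


best_ergas = rank("ERGAS")[0]
ergas_gap = table5["Ours"]["ERGAS"] - best_ergas[1]    # positive: ours is larger
ssim_gap = rank("SSIM")[0][1] - table5["Ours"]["SSIM"]  # positive: ours is smaller
```

With these values, HSRNet ranks first on ERGAS and FUSE first on SSIM, and the two gaps evaluate to approximately 0.3836 and 0.0248.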

Table 5
Objective metrics results of methods on the Botswana dataset. The best results are in bold, and the suboptimal results are underlined.
Method	RMSE $\downarrow$	PSNR $\uparrow$	ERGAS $\downarrow$	SAM $\downarrow$	SSIM $\uparrow$	CC $\uparrow$
CNMF	3.5281	37.1799	2.5116	2.0301	0.9758	0.9979
FUSE	3.3607	37.6023	2.4946	2.1328	0.9783	0.9980
TFNet	0.4876	37.4658	2.7780	2.7772	0.8998	0.9979
SSRNet	0.4327	38.5023	7.7878	2.1363	0.8693	0.9984
HSRNet	0.4056	39.0649	1.6247	1.7254	0.9494	0.9986
MCT	0.4495	38.1713	3.8368	2.0840	0.8796	0.9983
MDC	0.4102	38.9665	3.7630	2.0271	0.9039	0.9985
AMSF	0.4068	39.0390	2.5165	1.9584	0.9303	0.9986
Ours	0.3053	41.5320	2.0083	1.4837	0.9535	0.9992

(5) Experiments on Chikusei dataset

Figure 10 presents the results of fusing HSI and MSI using various methods on the Chikusei dataset. In comparison with existing state-of-the-art methods, the proposed approach excels in preserving spatial edge details and spectral information in the pseudo-color RGB fusion images. In the corresponding residual images, other methods exhibit noticeable distortions in dense building areas. In contrast, the proposed method shows the smallest differences in the residual images, with its colors, detailed shapes, and object edges more closely resembling the ground truth (GT). Furthermore, these results indicate that the proposed method effectively retains important image features, thereby achieving better outcomes in the image fusion task.

Figure 10. Fusion results of different methods on the Chikusei dataset. The first row shows the pseudo-RGB image of the estimated HR-HSI (bands 25-15-10), and the second row shows the pseudo-color difference image between the estimated pseudo-RGB image and the reference image. (a) LR-HSI. (b) TFNet. (c) SSRNet. (d) HSRNet. (e) MCT. (f) MDC. (g) AMSF. (h) Ours. (i) GT.

Table 6
Objective metrics results of methods on the Chikusei dataset. The best results are in bold, and the suboptimal results are underlined.
Method	RMSE $\downarrow$	PSNR $\uparrow$	ERGAS $\downarrow$	SAM $\downarrow$	SSIM $\uparrow$	CC $\uparrow$
CNMF	2.2722	40.0050	3.4477	1.5624	0.9850	0.9904
FUSE	2.1157	40.9169	2.5334	1.7778	0.9884	0.9920
TFNet	0.8642	40.0880	2.8606	1.6969	0.9630	0.9988
SSRNet	0.7742	41.0434	2.6609	1.5648	0.9650	0.9990
HSRNet	0.8055	40.6994	2.2671	1.3613	0.9677	0.9989
MCT	0.8549	40.1821	2.5871	1.6330	0.9651	0.9988
MDC	0.7316	41.5350	2.5602	1.5225	0.9671	0.9991
AMSF	0.8979	39.7562	3.0255	1.7583	0.9602	0.9987
Ours	0.5651	43.7783	2.1066	1.2390	0.9761	0.9995

3.5. Time Efficiency Analysis

In this section, the computational efficiency of the proposed method is evaluated by measuring the test time of the deep learning-based methods on the Washington DC Mall dataset. Table 7 lists the test times. Compared with TFNet, SSRNet, and HSRNet, the proposed model is larger and incorporates multiscale feature extraction strategies and Transformer modules, which significantly increase the number of parameters and the computational complexity. As a result, MCA-Net requires 455 ms, longer than every compared method except MDC (477 ms). Similarly, the integration of Transformer blocks in MCT and MDC leads to increased computational costs. Overall, despite its higher computational cost, the proposed method delivers superior image reconstruction quality.

Table 7
Test time of the compared methods.
Methods	TFNet	SSRNet	HSRNet	MCT	MDC	AMSF	Ours
Time (ms)	46	36	59	274	477	353	455
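A test-time measurement of this kind can be sketched with a small timing harness. The example below is a generic illustration, not the authors' protocol: the warm-up count, the number of repeats, and the `dummy_model` stand-in are all assumptions, and the article does not describe how the times in Table 7 were collected.

```python
import time
from statistics import mean
from typing import Callable


def time_inference_ms(run_once: Callable[[], object],
                      warmup: int = 2, repeats: int = 10) -> float:
    """Average wall-clock time of `run_once` in milliseconds, after warm-up runs."""
    for _ in range(warmup):
        run_once()  # warm-up runs are executed but not timed
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_once()
        samples.append((time.perf_counter() - start) * 1000.0)
    return mean(samples)


def dummy_model() -> int:
    """Hypothetical stand-in for a single forward pass of a fusion network."""
    return sum(i * i for i in range(10_000))


elapsed_ms = time_inference_ms(dummy_model)
```

When the model runs on a GPU, the device would additionally need to be synchronized before each clock reading, since kernel launches are asynchronous.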

4. Conclusions

This article proposes a novel Transformer-based multiscale cross-attention fusion network (MCA-Net), which fully considers the modeling of both local and global information as well as the deep interaction between spectral and spatial features during the image fusion process. To effectively mitigate spatial distortion, the model introduces preprocessed hyperspectral images as auxiliary inputs to enhance the expression of spatial information. The heterogeneous convolution parallel attention enhancement module designed in MCA-Net integrates dilated depthwise separable convolutions with a parallel attention mechanism, effectively improving the representation of both local and global features. The multiscale local-global feature extraction module, combining convolutional neural networks (CNNs), Transformer structures, and multiscale feature extraction strategies, enables the comprehensive mining of non-local and complementary information. The deep cross-attention fusion module constructs a deep cross-attention mechanism to model the correlation between hyperspectral and multispectral images, facilitating the efficient fusion of spatial and spectral features. Finally, the spectral-spatial reconstruction module is introduced to further optimize the feature expression of the fused image, enhancing the overall fusion performance. Extensive experiments on five datasets show that, compared with other state-of-the-art methods, the proposed model achieves the best RMSE, PSNR, and SAM on every dataset.

Acknowledgement

This research was supported in part by the National Natural Science Foundation of China under Grant 61871210 and the 2024 Hunan Postgraduate Research Innovation Non-Funded Project (No. LXBZZ2024239).

Author Contribution

Bin Yang took primary responsibility for conceptualizing the article, developed its structure, and wrote the majority of the content. Yuxuan Jiang provided the methodology, examined the relevant research, and wrote the main manuscript text. Binxi Tan collected and organized the data and assisted with the preparation of tables and figures. All authors reviewed the manuscript.

References

1. Zhang, B., Chen, Y., Li, Z., Xiong, S., & Lu, X. (2024). SANet: A self-attention network for agricultural hyperspectral image classification. IEEE Transactions on Geoscience and Remote Sensing, 62, 1–15. https://doi.org/10.1109/TGRS.2023.3341473

2. Lu, F., Sun, H., Tao, L., & Wang, P. (2025). Data integration based on UAV multispectral and proximal hyperspectral sensing for maize canopy nitrogen estimation. Remote Sens, 17, 1411.

3. Habashi, J., Moghadam, H. J., Oskouei, M. M., et al. (2024). PRISMA hyperspectral remote sensing data for mapping alteration minerals in Sar-e-Châh-e-Shur region, Birjand, Iran. Remote Sens, 16(7), 1277.

4. Mukundan, A., Karmakar, R., Jouhar, J., et al. (2025). Advancing urban development: Applications of hyperspectral imaging in smart city innovations and sustainable solutions. Smart Cities, 8(2), 51. https://doi.org/10.3390/smartcities8020051

5. Chen, Z., Pu, H., Wang, B., & Jiang, G. M. (2014). Fusion of hyperspectral and multispectral images: A novel framework based on generalization of pan-sharpening methods. IEEE Geoscience and Remote Sensing Letters, 11(8), 1418–1422.

6. Aiazzi, B., Baronti, S., & Selva, M. (2007). Improving component substitution pansharpening through multivariate regression of MS + Pan data. IEEE Transactions on Geoscience and Remote Sensing, 45(10), 3230–3239. https://doi.org/10.1109/TGRS.2007.901007

7. Guo, Q., et al. (2017). Ehlers pan-sharpening performance enhancement using HCS transform for n-band data sets. International Journal of Remote Sensing, 38(17), 4974–5002. https://doi.org/10.1080/01431161.2017.1339926

8. Yokoya, N., Yairi, T., & Iwasaki, A. (2012). Coupled nonnegative matrix factorization unmixing for hyperspectral and multispectral data fusion. IEEE Transactions on Geoscience and Remote Sensing, 50(2), 528–537. https://doi.org/10.1109/TGRS.2011.2161320

9. Kahraman, S., Ertürk, A., & Ertürk, S. (2018). Graph regularized L1/2-sparsity constrained non-negative matrix factorization for hyperspectral and multispectral image fusion. In: 2018 9th workshop on hyperspectral image and signal processing: Evolution in remote sensing (WHISPERS), Amsterdam, Netherlands, pp 1–4.

10. Xue, J., Zhao, Y., Liao, W., & Chan, J. C. W. (2019). Hyper-Laplacian regularized nonlocal low-rank matrix recovery for hyperspectral image compressive sensing reconstruction. Information Sciences, 501, 406–420.

11. Karoui, M. S., Deville, Y., & Kreri, S. (2013). Joint nonnegative matrix factorization for hyperspectral and multispectral remote sensing data fusion. In: 2013 5th workshop on hyperspectral image and signal processing: Evolution in remote sensing (WHISPERS), Gainesville, FL, USA, pp 1–4. https://doi.org/10.1109/WHISPERS.2013.8080611

12. Akhtar, N., Shafait, F., & Mian, A. (2015). Bayesian sparse representation for hyperspectral image super resolution. In: IEEE conference on computer vision and pattern recognition (CVPR), Boston, MA, USA, pp 3631–3640.

13. Wei, Q., Dobigeon, N., & Tourneret, J. Y. (2015). Fast fusion of multi-band images based on solving a Sylvester equation. IEEE Transactions on Image Processing, 24, 4109–4121.

14. Sui, L., Li, L., Li, J., Chen, N., & Jiao, Y. (2019). Fusion of hyperspectral and multispectral images based on a Bayesian nonparametric approach. IEEE J Sel Top Appl Earth Obs Remote Sens, 12, 1205–1218.

15. Dian, R., Fang, L., & Li, S. (2017). Hyperspectral image super-resolution via non-local sparse tensor factorization. In: IEEE conference on computer vision and pattern recognition (CVPR), Honolulu, HI, USA, pp 5344–5353.

16. Li, S., Dian, R., Fang, L., & Bioucas-Dias, J. M. (2018). Fusing hyperspectral and multispectral images via coupled sparse tensor factorization. IEEE Transactions on Image Processing, 27, 4118–4130. https://doi.org/10.1109/TIP.2018.2836307

17. Zhang, K., Wang, M., Yang, S., & Jiao, L. (2018). Spatial–spectral-graph-regularized low-rank tensor decomposition for multispectral and hyperspectral image fusion. IEEE J Sel Top Appl Earth Obs Remote Sens, 11, 1030–1040.

18. Xu, T., et al. (2024). A coupled tensor double-factor method for hyperspectral and multispectral image fusion. IEEE Transactions on Geoscience and Remote Sensing, 62, 1–17.

19. Yu, H., Ling, Z., Zheng, K., Gao, L., Li, J., & Chanussot, J. (2024). Unsupervised hyperspectral and multispectral image fusion with deep spectral-spatial collaborative constraint. IEEE Transactions on Geoscience and Remote Sensing, 62, 1–14.

20. Ran, R., Deng, L. J., Jiang, T. X., Hu, J. F., Chanussot, J., & Vivone, G. (2023). GuidedNet: A general CNN fusion framework via high-resolution guidance for hyperspectral image super-resolution. IEEE Trans Cybern, 53, 4148–4161.

21. Zhang, X., Huang, W., Wang, Q., & Li, X. (2021). SSR-NET: Spatial–spectral reconstruction network for hyperspectral and multispectral image fusion. IEEE Transactions on Geoscience and Remote Sensing, 59, 5953–5965.

22. Zhu, Z., Wang, X., Li, G., & Zhong, Y. (2024). A self-supervised spaceborne multispectral and hyperspectral image fusion unrolling network. IEEE Transactions on Geoscience and Remote Sensing, 62, 1–12.

23. Hu, J. F., Huang, T. Z., Deng, L. J., Dou, H. X., Hong, D., & Vivone, G. (2022). Fusformer: A transformer-based fusion network for hyperspectral image super-resolution. IEEE Geoscience and Remote Sensing Letters, 19, 1–5.

24. Deng, S. Q., Deng, L. J., Wu, X., Ran, R., Hong, D., & Vivone, G. (2023). PSRT: Pyramid shuffle-and-reshuffle transformer for multispectral and hyperspectral image fusion. IEEE Transactions on Geoscience and Remote Sensing, 61, 1–15.

25. You, T., Wu, C., Bai, Y., Wang, D., Ge, H., & Li, Y. (2023). HMF-Former: Spatio-spectral transformer for hyperspectral and multispectral image fusion. IEEE Geoscience and Remote Sensing Letters, 20, 1–5.

26. He, Y., Li, H., Zhang, M., Liu, S., Zhu, C., Xin, B., Wang, J., & Wu, Q. (2025). Hyperspectral and multispectral remote sensing image fusion based on a retractable spatial–spectral transformer network. Remote Sens, 17, 1973.

27. Ma, Q., Jiang, J., Liu, X., & Ma, J. (2024). Reciprocal transformer for hyperspectral and multispectral image fusion. Inf Fusion, 104, 1–15.

28. Sun, L., et al. (2024). MDC-FusFormer: Multiscale deep cross-fusion transformer network for hyperspectral and multispectral image fusion. IEEE Transactions on Geoscience and Remote Sensing, 62, 1–14.

29. Yokoya, N., & Iwasaki, A. (2016). Airborne hyperspectral data over Chikusei. Space Appl Lab, Univ Tokyo, Japan, Tech Rep SAL-2016-05-27.

30. Takeyama, S., & Ono, S. (2020). Compressed hyperspectral pansharpening. In: 2020 IEEE international conference on image processing (ICIP), Abu Dhabi, United Arab Emirates.

31. Yokoya, N., Yairi, T., & Iwasaki, A. (2012). Coupled nonnegative matrix factorization unmixing for hyperspectral and multispectral data fusion. IEEE Transactions on Geoscience and Remote Sensing, 50, 528–537. https://doi.org/10.1109/TGRS.2011.2161320

32. Wald, L. (2000). Quality of high resolution synthesised images: Is there a simple criterion? In: Proc third conference fusion of earth data: Merging point measurements, raster maps and remotely sensed images, Sophia Antipolis, France, pp 99–103.

33. Wang, Z., Bovik, A. C., Sheikh, H. R., & Simoncelli, E. P. (2004). Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing, 13, 600–612. https://doi.org/10.1109/TIP.2003.819861

34. Kruse, F. A., et al. (1993). The spectral image processing system (SIPS): Interactive visualization and analysis of imaging spectrometer data. Remote Sensing of Environment, 283, 192–201.

35. Garzelli, A., & Nencini, F. (2009). Hypercomplex quality assessment of multi/hyperspectral images. IEEE Geoscience and Remote Sensing Letters, 6, 662–665.

36. Yokoya, N., Yairi, T., & Iwasaki, A. (2011). Coupled non-negative matrix factorization (CNMF) for hyperspectral and multispectral data fusion: Application to pasture classification. In: 2011 IEEE international geoscience and remote sensing symposium (IGARSS), Vancouver, BC, Canada, pp 1779–1782.

37. Wei, Q., Dobigeon, N., & Tourneret, J. Y. (2015). Fast fusion of multi-band images based on solving a Sylvester equation. IEEE Transactions on Image Processing, 24, 4109–4121.

38. Wang, X., Wang, X., Song, R., Zhao, X., & Zhao, K. (2023). MCT-Net: Multi-hierarchical cross transformer for hyperspectral and multispectral image fusion. Knowl-Based Syst, 264, 110362.

39. Liu, X., Liu, Q., & Wang, Y. (2020). Remote sensing image fusion based on two-stream fusion network. Inf Fusion, 55, 1–15.

40. Hu, J. F., Huang, T. Z., Deng, L. J., Jiang, T. X., Vivone, G., & Chanussot, J. (2022). Hyperspectral image super-resolution via deep spatio-spectral attention convolutional neural networks. IEEE Trans Neural Netw Learn Syst, 33, 7251–7265.

41. Liu, S., Shao, T., Liu, S., Li, B., & Zhang, Y. D. (2025). An asymptotic multiscale symmetric fusion network for hyperspectral and multispectral image fusion. IEEE Transactions on Geoscience and Remote Sensing, 63, 1–16.
