Introduction

Over the past few decades, the rapid development of hyperspectral technology has led to its widespread application in many fields, such as vegetation analysis1, soil property estimation2, environmental monitoring3, geophysical exploration4,5, and military applications6. Hyperspectral images, rich in spectral and spatial information, distinguish physical differences between surface materials better than natural images through extensive and dense spectral imaging7. This makes hyperspectral imaging a highly dynamic research field. Hyperspectral technology has also received extensive attention in remote sensing8. In remote sensing, hyperspectral data can provide more detailed spectral information for describing ground objects, and this continuous spectral information can be used to image ground or urban surfaces9. Therefore, hyperspectral data is widely used in land cover classification tasks and has already achieved good classification results. Owing to the high resolution and integrated spectrum-image nature of hyperspectral images, many researchers have studied their classification algorithms extensively. Hyperspectral image classification methods can mainly be divided into two categories: traditional methods and deep learning-based methods.

Hyperspectral image classification aims to assign a unique category label to each pixel in the image. Traditional hyperspectral image classification typically involves feature extraction followed by the use of a classifier to categorize the hyperspectral image. In the early stages of hyperspectral image classification research, many spectral-based feature extraction methods were proposed, such as Support Vector Machine (SVM)10, Multinomial Logistic Regression (MLR)11, Random Forest12, Random Subspace13, Sparse Representation14, and K-Nearest Neighbors (KNN)15. These traditional classification methods, which only utilize spectral features, are computationally simple and easy to implement. Among them, the KNN algorithm is appealing for its theoretical simplicity, while SVM exhibits superior classification performance in small-sample problems and large dataset classification tasks and is widely used in hyperspectral image classification. Moreover, some feature extraction or dimensionality reduction methods have also attracted attention, such as Independent Component Analysis (ICA)16, Principal Component Analysis (PCA)17, and Linear Discriminant Analysis (LDA)18. However, the classification results of the aforementioned pixel-based classifiers are not satisfactory. To better classify hyperspectral images, some effective spatial-spectral feature representation methods have been proposed19,20. With the large-scale increase in data and the complexity of application scenarios, traditional methods reveal their limitations: they mainly rely on spectral information for feature learning, focus on extracting shallow features, and neglect the spatial correlation of pixels, resulting in poor discrimination of ground object features. Pixels in spatially adjacent regions are very likely to belong to the same category, and fully utilizing the spatial features of hyperspectral images can further improve classification results. To optimize the extraction of spatial features from images, various morphological profile-based operations have been developed, such as Morphological Profiles (MPs)21, Extended MPs (EMPs)22, and Extended Multi-Attribute Profiles (EMAP)23.
The features extracted by mathematical morphology operators such as EMPs rely heavily on prior information, leading to poor adaptability to different datasets and insufficient robustness in complex environments. In addition, the discriminative power of the extracted shallow features is inadequate, and relying solely on either spectral or spatial information inevitably results in suboptimal classification outcomes.

In recent years, deep learning has developed rapidly and been applied in many fields such as image processing24 and natural language processing25, and using deep learning for hyperspectral image classification has become a hot research topic. In contrast to traditional machine learning-based classification approaches, deep learning methods possess significant advantages in feature extraction: they can delve into the intrinsic abstract features of data and more effectively uncover the complex internal structure of high-dimensional data, thereby improving image classification accuracy. Lin et al. proposed using stacked autoencoders (SAE) for hyperspectral image classification26, presenting a deep learning method that integrates spectral and spatial features. Zheng et al. used Deep Belief Networks (DBN) to classify EEG emotions27. Both SAE and DBN require transforming data into one-dimensional vectors to accommodate their network structures, so spatial information cannot be considered during feature extraction. Convolutional Neural Networks (CNN) can directly process two-dimensional or three-dimensional images, effectively alleviating this issue. Li et al.28 used a 1-D CNN for ECG signal classification. Wang et al. designed a 2-D CNN29 that effectively learns discriminative features from multi-source data and improves the generalization capability of change detection (CD) algorithms. To address the limitations of 2-D CNNs, Li et al.30 proposed a 3-D CNN to jointly extract spatial and spectral features. To better capture spatial-spectral information, Roy et al. proposed a hierarchical network that combines features extracted by 3-D and 2-D CNNs31. Manna et al. proposed an improved residual network to capture spectral and spatial features using an end-to-end training method32. CNN-based models have the unique advantage of feature representation among different channels and are proficient at extracting local features. However, their receptive fields are limited by the size of their convolution kernels, restricting their ability to extract and represent complex spatial and global features. Lei et al. proposed a dilated CNN model33, constructed by replacing the traditional CNN's convolution kernels with dilated convolution kernels, and tested it on the MNIST handwritten digit recognition dataset. Huang et al. proposed the densely connected DenseNet34, which enhanced feature propagation and achieved better classification performance by introducing dense connections into the network. Zhang et al. proposed an efficient remote sensing image scene classification structure35, named CNN-CapsNet, to fully exploit the advantages of both CNN and Capsule Network (CapsNet) models. Since the features obtained by single-scale convolution kernels are not sufficiently rich, many methods based on multi-scale convolution kernels have been used to extract more abundant features, thereby improving the classification performance of HSIs36,37,38.

Attention mechanisms have also shown great potential in the field of computer vision.
In cognitive science, humans tend to focus on the most important information and ignore the rest. Attention mechanisms can be seen as a mimicry of human vision and have been widely applied in many areas of computer vision39,40,41. Fu et al. proposed a Dual Attention Network (DANet)42 to extract spatial-spectral information. Ma et al. proposed a Dual-Branch Multi-Attention Network (DBMA)43 and achieved promising classification results. To further improve the classification performance of HSIs, Li et al. proposed a Dual-Branch Dual Attention Network (DBDA)44. Cui et al. proposed a novel Dual-Triple Attention Network (DTAN)45, which can effectively classify hyperspectral images by capturing cross-dimensional interactive information. Tang et al. proposed a model incorporating graph attention46, which allows the information flow to achieve reasonable bit allocation. Liu et al. proposed a spatial-channel attention mechanism47, which captures cross-dimensional interactions without dimensionality reduction and brings significant performance improvements with negligible computational overhead.

Hyperspectral image classification with deep learning has become a popular research topic; existing methods achieve good classification performance but still exhibit certain limitations. For example, (1) single-scale features are insufficient: traditional 1-D/2-D/3-D CNNs rely solely on fixed-size convolution kernels, making it difficult to simultaneously capture fine-grained textures and global semantics and leading to overfitting. (2) Cross-scale information transfer is weak: although current attention networks introduce multiple attention mechanisms, feature fusion mainly occurs at a single scale, resulting in the loss of discriminative information. (3) Computational and feature redundancies coexist: DenseNet and multi-scale convolutions improve accuracy but bring parameter inflation and redundant features, which are unfavorable for edge-device deployment. To enhance image classification accuracy, how can one effectively capture features at various scales within an image and select those that contribute to improved classification performance? And how can the flow and efficient transmission of information between different feature maps within a network model be strengthened? This paper integrates these concerns into its network design, aiming to boost the precision of image classification through the following contributions:

(1) To alleviate the loss and insufficiency of single-scale features, inspired by the complementary three-branch responsibilities of a PID controller, this paper proposes a Progressive Multi-scale Multi-attention Fusion (PMMF) network for hyperspectral image classification. The network adopts an "extract-stack-extract-fuse" paradigm, which repeatedly extracts features from different branches and then merges them. This design elegantly fuses multi-scale features, avoids the limited expressiveness of single-scale representations, and enriches the resulting feature maps with more information.

(2) To capture features more effectively and emphasize important ones, this paper introduces a Shortcut Weight Channel Attention (SWCA) module.
By connecting features across scales via shortcut pathways, SWCA refines the representations and simultaneously captures fine-grained textures and global semantics, enabling the network to focus on salient features and improving overall performance.

(3) When fusing features from the P, I, and D branches, this paper addresses weak cross-scale information transfer and the risk of information loss or redundancy by proposing a multi-attention fusion module. Each branch is assigned a distinct attention mechanism tailored to its specific characteristics, ensuring that features within every branch are fully exploited and leading to more effective image classification.

The remainder of this paper is structured as follows. Section "Proposed method" describes the basic structure and principles of PMMFNet and details the progressive multi-scale multi-attention network architecture, while the comparative experiments are presented in Section "Experimental setup and results". The last Section, "Conclusion", summarizes the work and discusses prospects for future research.

Proposed method

Spatial shuffle sample preprocessing

CNNs have a large number of parameters and require a significant number of samples for training, while the small-sample nature of hyperspectral images easily leads to overfitting. Among the many small-sample learning algorithms, spatial-spectral fusion is a commonly used technique. For a given sample pixel, the pixels within an N × N neighborhood around that pixel are selected and combined as a single sample. The spatial and spectral features of this sample are then extracted and fused before being fed into a designed classification algorithm. Based on the spatial structure of neighboring pixels, a spatial shuffle scheme for small-size patches has been proposed48. Using this approach with a basic CNN architecture can achieve relatively high classification accuracy.

Specifically, for an N × N neighborhood size, excluding the central pixel, there are N × N − 1 other pixels. Keeping the position of the central pixel unchanged, the remaining N × N − 1 pixels are shuffled randomly to obtain a new sequence, as illustrated in Fig. 1.

Fig. 1 Schematic diagram of spatial shuffle, where red font indicates the center pixel.

In this study, specifically when N = 5, there are a total of 24! ≈ 6.2 × 10²³ potential patterns. With M samples, this yields up to M × (N × N − 1)! possible samples, significantly increasing the sample quantity. Despite the continued presence of substantial similarities between samples, these patterns represent potential distributions of real-world features, and they provide deep learning models with greater learning capacity.
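As a concrete illustration, the NumPy sketch below shows one way to implement this shuffle, assuming each patch is stored as an N × N × D array with the labeled pixel at the central position; the function name and array layout are illustrative and are not taken from the authors' code.

```python
import numpy as np

def spatial_shuffle(patch, rng=None):
    """Return a copy of an N x N x D patch in which the centre pixel is kept
    fixed and the remaining N*N - 1 pixels are randomly permuted."""
    rng = np.random.default_rng() if rng is None else rng
    n, _, d = patch.shape
    centre = n * n // 2                         # index of the centre pixel (N assumed odd)
    flat = patch.reshape(n * n, d).copy()
    idx = np.delete(np.arange(n * n), centre)   # every position except the centre
    flat[idx] = flat[rng.permutation(idx)]      # shuffle only the neighbouring pixels
    return flat.reshape(n, n, d)

# Example: several shuffled views of one 5 x 5 x 200 training patch
patch = np.random.rand(5, 5, 200).astype(np.float32)
augmented = [spatial_shuffle(patch) for _ in range(8)]
```

Applying the function repeatedly to the same patch produces distinct permutations of the neighborhood while the center pixel, and hence the label, stays fixed.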
Overall framework

The network proposed in this paper is analogous to a PID controller. A PID controller consists of three components: Proportional, Integral, and Derivative, as shown in Fig. 2 (upper). PID control is an analog closed-loop control method with no steady-state error; it not only eliminates the steady-state error present in pure deviation control but also addresses stability and responsiveness. The goal of the PID control algorithm is to adjust the controller output based on the control error (i.e., the difference between the desired value and the actual value) so as to bring the controlled variable as close as possible to the desired value. The proportional term reflects the current magnitude of the error, the integral term accumulates the error, and the derivative term indicates the rate of change of the error. By adjusting the values of these three parameters, the stability and response speed of the system can be controlled.

Fig. 2 The analogy between the PID controller and the proposed network.

The P controller focuses on the current signal, while the I controller accumulates all past signals. Due to the cumulative inertia effect, when the signal reverses, the output of a simple PI controller tends to overshoot. Therefore, a D controller is introduced; if the signal decreases, the D component becomes negative, acting as a damper to reduce the overshoot. The implementation of a PI controller can be written as:

$$C_{{{\text{out}}}} [n] = k_{p} e[n] + k_{i} \sum\limits_{i = 0}^{n} {e[i]}$$

(1)

The D controller reduces overshoot by making the control output sensitive to changes in the input signal. In the proposed network, the context and detailed information are parsed separately through multiple convolutional layers, with both the detailed branch and the context branch consisting of 3 layers, without BN (Batch Normalization) and ReLU. The detailed branch parses all types of semantic information, even if not very accurately, while the context branch aggregates low-frequency contextual information and is semantically similar to applying a large average filter. When the system is disturbed (including setpoint changes and external disturbances) and the controlled variable deviates from the set value, the PID controller can automatically, stably, and quickly bring the system back to the set value.

The proposed PMMFNet is inspired by this complementary three-branch structure of the PID controller. The relationship between the P/I/D controller and the three branches is as follows. The I term, acting as the integrator of all signals in the PID, corresponds to the main trunk branch of the network, which integrates local and global context information to parse remote dependencies. The P term, focusing on the current signal, corresponds to the proportional branch (P), which analyzes and preserves detailed information from the low-level, high-resolution feature maps in advance. The D term, serving as a damper, corresponds to the derivative branch (D), which extracts deeper high-frequency features to predict boundary areas. On this basis, the SWCA (Shortcut Weight Channel Attention) module is designed and added to each branch of the network. This module reduces internal covariate shift, allows a larger learning rate to accelerate convergence, introduces nonlinearity, and improves computational efficiency. Finally, the P, I, and D branches are fused, merging the branches with different responsibilities to produce the final output and noticeably improving accuracy. The overall structure is shown in Fig. 2.
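For readers less familiar with PID control, the discrete-time update that the analogy refers to can be sketched in a few lines; this is purely illustrative and plays no role in the network implementation.

```python
def pid_step(error, prev_error, integral, kp, ki, kd, dt=1.0):
    """One discrete PID update combining the proportional, integral and derivative terms."""
    integral += error * dt                     # I: accumulation of past errors
    derivative = (error - prev_error) / dt     # D: rate of change, damps overshoot
    output = kp * error + ki * integral + kd * derivative   # P + I + D
    return output, integral
```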
As shown in Fig. 3, the network structure proposed in this paper integrates the SWCA module and branch fusion to enhance the feature extraction capability from images. Assuming the hyperspectral image has D bands and spatial dimensions W × H, the overall computational load of the model is reduced after spatial shuffle sample preprocessing. This preprocessing step removes redundant information and noise while preserving essential spectral details. Since the number of bands differs across datasets, inputs with fewer than 200 bands are zero-padded to 200, so the size of the feature maps fed into the network is consistently maintained at 25 × 200.

Fig. 3 The proposed progressive multi-scale multi-attention fusion network architecture of the Proportional-Integral-Derivative (PID) network.

In the overall network architecture, an SWCA module is introduced during the feature extraction stage of each branch. The I branch serves as the main backbone, while the inputs to the other two branches differ progressively. All three branches employ progressive multi-scale inputs and convolutional feature extraction with varying kernel sizes. This approach enhances feature propagation and noticeably improves feature representation by effectively capturing less prominent local features, thereby boosting classification accuracy. Furthermore, different branches aggregate features from different scales and utilize distinct attention mechanisms to further extract and learn features, facilitating effective feature reuse. Finally, the rich features obtained from the three branches are combined and passed through fully connected layers to accomplish the classification task.

The SWCA module

The SWCA module proposed in this paper is shown in Fig. 4. The feature map is first convolved, followed by batch normalization and activation, resulting in a new feature map. A skip connection is then used to concatenate the original feature map with the new one, enhancing the feature representation. Subsequently, a max pooling operation is applied to generate a new feature map F of size H × W × C. This input F is first processed with global average pooling and then stretched to obtain the vector x = [x1, x2, …, xm] (where m represents the batch size). This approach can reduce internal covariate shift, enabling the use of larger learning rates to speed up convergence. The activation introduces non-linearity, which aids in learning complex features. Moreover, the subsequent pooling operation reduces dimensionality and extracts key features, improving computational efficiency and robustness. The mean A_B and variance D_B^2 of x are computed, followed by normalization to obtain x_i. A one-dimensional convolution with a kernel size of k is then carried out, during which learnable parameters α and β are introduced. The computation of y_i, incorporating the aforementioned steps, is as follows:

$$D_{B}^{2} = \frac{1}{m}\sum\nolimits_{i = 1}^{m} {(x_{i} - A_{B} )}^{2}$$

(2)

$$x_{i} = \frac{{x_{i} - A_{B} }}{{\sqrt {D_{B}^{2} + \varphi } }}$$

(3)

$$y_{i} = \alpha \frac{{x_{i} - A_{B} }}{{\sqrt {D_{B}^{2} + \varphi } }} + \beta$$

(4)

$$\omega = \sigma \left( {C1D_{k} (y)} \right),\quad \sigma (x) = \frac{1}{{1 + e^{ - x} }}$$

(5)

Fig. 4 Operational flow of the SWCA module.

Here C1D denotes a one-dimensional convolution with k parameters; after passing its output through the σ (sigmoid) activation function, the weight ω is obtained. The weights are then multiplied element-wise with the original feature maps to produce the output feature maps. Finally, each element of the new feature map, obtained after one round of convolution, normalization, and activation, is added to the corresponding element of the original feature map, fully representing the features and yielding the final feature map.
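The PyTorch-style sketch below summarizes one possible reading of the SWCA computation (convolution, batch normalization and activation, shortcut concatenation, max pooling, global average pooling, normalization with learnable α and β, a k-sized one-dimensional convolution, sigmoid re-weighting, and a residual addition). The 1 × 1 fusion convolution and all layer sizes are our assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class SWCA(nn.Module):
    """Sketch of the Shortcut Weight Channel Attention block (assumed shapes and layers)."""
    def __init__(self, channels, k=3):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn = nn.BatchNorm2d(channels)
        self.act = nn.ReLU(inplace=True)
        self.reduce = nn.Conv2d(2 * channels, channels, 1)   # fuse the concatenated shortcut (assumption)
        self.pool = nn.MaxPool2d(2)                          # spatial max pooling after the shortcut
        self.norm1d = nn.BatchNorm1d(channels)               # normalization with learnable alpha/beta, Eqs. (2)-(4)
        self.conv1d = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)  # C1D_k of Eq. (5)

    def forward(self, x):
        f = self.act(self.bn(self.conv(x)))                  # convolution + BN + activation
        f = torch.cat([x, f], dim=1)                         # shortcut: concatenate original and new maps
        f = self.pool(self.reduce(f))                        # max pooling on the fused maps
        d = f.mean(dim=(2, 3))                               # global average pooling -> (batch, channels)
        d = self.norm1d(d)                                   # batch-wise normalization
        w = torch.sigmoid(self.conv1d(d.unsqueeze(1))).squeeze(1)   # 1-D conv + sigmoid -> channel weights
        out = f * w.unsqueeze(-1).unsqueeze(-1)              # re-weight channels element-wise
        return out + f                                       # residual addition to the feature map
```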
Multi-attention branch fusion

The multi-attention branch fusion structure proposed in this paper is shown in Fig. 5.

Fig. 5 Operational flow of the multi-attention branch fusion.

Through the integration of three different attention mechanisms, the fusion process involves both feature map addition and concatenation. Different branch configurations employ different sizes of convolutional kernels and pooling layers to extract features at different scales, effectively capturing local details and global contextual information. The network's branch structure enables parallel learning of diverse feature representations, enhancing efficiency and performance. The I branch remains the main branch, with the features from the P and D branches integrated into it, while the P and D branches serve as secondary branches with the features from the I branch also integrated into them. This approach increases the non-linearity of the network, helps the network learn more complex features, and allows it to capture the feature representations of the different branches. This study employs three different attention mechanisms: the Normalization-based Attention Module (NAM)49, the Convolutional Block Attention Module (CBAM)50, and the Global Attention Mechanism (GAM)51. The NAM attention mechanism automatically learns the significance weights of the various channels, suppresses noisy channels, enhances effective feature channels, highlights key spatial regions in the feature map, and attenuates background interference. It therefore suits branch P, which is designed to pre-save detailed information in the high-resolution feature maps of shallow layers and then extract features from the effective information. The CBAM attention mechanism works mainly through the collaborative effort of channel and spatial attention, which noticeably enhances the representation capability of convolutional neural networks; it is appropriate for branch I, which aggregates local and global context information to parse long-range dependencies. The GAM attention mechanism mainly improves the network's understanding of complex visual patterns; it corresponds to branch D's extraction of deeper high-frequency features for boundary region prediction. After multiple attention-matching tests, and because the three branches extract features at different scales and exhibit different feature characteristics, the most suitable attention mechanism was selected for each branch to further strengthen the feature representation; the experimental results also support this analysis. Finally, the feature maps obtained from the three branches are flattened and concatenated, and then transformed into the final classification prediction through the fully connected layer.
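A schematic sketch of this fusion head is given below. The attention blocks are simple stand-ins for NAM, CBAM, and GAM, and the use of element-wise addition for cross-branch integration, with concatenation only at the classifier input, is our reading of the description above rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class ChannelGate(nn.Module):
    """Placeholder attention block; the paper uses NAM, CBAM and GAM (refs. 49-51) instead."""
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                  nn.Conv2d(channels, channels, 1), nn.Sigmoid())

    def forward(self, x):
        return x * self.gate(x)

class MultiAttentionFusion(nn.Module):
    """Sketch of the three-branch fusion head; channel counts are illustrative."""
    def __init__(self, channels, num_classes):
        super().__init__()
        self.att_p = ChannelGate(channels)    # NAM on the proportional branch
        self.att_i = ChannelGate(channels)    # CBAM on the integral (main) branch
        self.att_d = ChannelGate(channels)    # GAM on the derivative branch
        self.fc = nn.LazyLinear(num_classes)  # flattened, concatenated features -> class scores

    def forward(self, f_p, f_i, f_d):
        # Secondary branches absorb the main branch; the main branch absorbs P and D.
        p = self.att_p(f_p + f_i)
        d = self.att_d(f_d + f_i)
        i = self.att_i(f_i + f_p + f_d)
        feats = torch.cat([p.flatten(1), i.flatten(1), d.flatten(1)], dim=1)
        return self.fc(feats)
```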
NAM attention branch (P)

The Proportional (P) branch preserves detailed information from the shallow, high-resolution feature maps. Figure 6 illustrates the detailed structure of the NAM (P) branch, which includes two SWCA modules and a series of convolutional layers; the final output is obtained by processing the feature map through the NAM attention mechanism. The input is the output obtained after the initial SWCA processing of the original feature map, which is still at the shallow stage of feature extraction. Within the NAM branch, two SWCA modules are used for feature extraction, followed by two 1 × 3 convolutions to regularize the feature map size (M × M). Subsequently, six convolutional layers with kernel sizes of 7 × 7, 7 × 7, 4 × 4, 3 × 3, 3 × 3, and 3 × 3 are applied for multi-scale spatial feature extraction. The result then passes through a channel-spatial attention mechanism to obtain the output, which is stacked with the output from the CBAM (Convolutional Block Attention Module) branch. Ultimately, the combined output passes through the NAM attention mechanism, the most suitable for this branch, to obtain the final output.

Fig. 6 Network structure of the NAM (P) branch.

The NAM, as shown in Fig. 6a and b, is an efficient and lightweight attention mechanism that adopts the module arrangement of CBAM while redesigning the channel and spatial attention sub-modules. In the channel attention sub-module, the scaling factor from batch normalization is utilized, as shown in the formula below. This scaling factor measures the variance across channels and indicates their importance.

$$B_{out} = BN(B_{in} ) = \gamma \frac{{B_{in} - \mu_{B} }}{{\sqrt {\sigma_{B}^{2} + \varepsilon } }} + \beta$$

(6)

where μB and σB are the mini-batch mean and standard deviation of B, and γ and β are trainable affine transformation parameters (scale and shift). The NAM channel attention sub-module is shown in Fig. 6a, where the output feature is Mc = sigmoid(Wγ(BN(F1))), with Wγ = γi / Σj γj representing the weights. Applying the scale factor of batch normalization to the spatial dimension is called pixel normalization. Correspondingly, in the NAM spatial attention sub-module shown in Fig. 6b, the output feature is Ms = sigmoid(Wλ(BN(F2))), where Wλ = λi / Σj λj represents the weights. The scaling factors γ and λ are associated with the channel and spatial attention sub-modules, respectively.
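As a concrete illustration of the channel sub-module, the sketch below re-weights channels with the normalized batch-normalization scaling factors, following Eq. (6) and the expression for Mc; it is a simplified reading of NAM49 rather than the authors' code, and the spatial sub-module would be built analogously from per-pixel scaling factors.

```python
import torch
import torch.nn as nn

class NAMChannelAttention(nn.Module):
    """Channel attention driven by the BatchNorm scaling factors (sketch of NAM's channel sub-module)."""
    def __init__(self, channels):
        super().__init__()
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, x):
        f = self.bn(x)                                   # Eq. (6): BN with learnable gamma/beta
        gamma = self.bn.weight.abs()
        w = gamma / gamma.sum()                          # W_gamma = gamma_i / sum_j gamma_j
        att = torch.sigmoid(f * w.view(1, -1, 1, 1))     # Mc = sigmoid(W_gamma(BN(F1)))
        return x * att                                   # re-weighted feature map
```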
CBAM attention branch (I)

The Integral (I) branch aggregates contextual information locally and globally to resolve long-range dependencies. Figure 7 shows the detailed structure of the CBAM (I) branch, which is the main backbone of the network throughout the entire feature extraction process and involves two consecutive SWCA modules for feature extraction. The branch uses three 1 × 3 convolutions to standardize the feature maps to an M × M size. It further employs six convolutional layers with kernel sizes of 7 × 7, 5 × 5, 4 × 4, 3 × 3, 3 × 3, and 3 × 3 for the joint extraction of spatial features across multiple scales. To enhance feature propagation and significantly improve feature representation, the CBAM branch integrates a CBAM module after each convolutional feature-extraction step that follows the reshaping. This addition slightly increases the computational and parameter overhead but greatly enhances model performance. As the primary branch of the network, the CBAM branch combines the outputs of the NAM and GAM branches and then applies CBAM attention to produce the final output. As shown in Fig. 7a, CBAM is a lightweight, versatile module. In the feature maps, CBAM sequentially infers attention maps along two independent dimensions (channel and spatial) and then multiplies the attention maps with the input feature maps to perform adaptive feature refinement.

Fig. 7 Network structure of the CBAM (I) branch.

As shown in Fig. 7b, the channel attention module compresses the feature map along the spatial dimensions to obtain a one-dimensional vector for further processing. During compression along the spatial dimensions, both average pooling and max pooling are considered. The channel attention module can be expressed as follows:

$$\begin{gathered} M_{c} (F) = \sigma \left( {MLP(AvgPool(F)) + MLP(MaxPool(F))} \right) \\ = \sigma \left( {W_{1} (W_{0} (F_{avg}^{c} )) + W_{1} (W_{0} (F_{\max }^{c} ))} \right) \\ \end{gathered}$$

(7)

Similarly, as shown in Fig. 7c, the spatial attention module compresses the channels of the feature map, performing average pooling and max pooling along the channel dimension. Max pooling extracts the maximum value along the channels at each of the H × W spatial positions, and average pooling extracts the corresponding mean value. The two resulting single-channel maps are then combined to obtain a 2-channel feature map. The spatial attention module can be expressed as follows:

$$\begin{aligned} M_{s} \left( F \right) = & \sigma \left( {f^{7 \times 7} \left( {\left[ {AvgPool(F);MaxPool\left( F \right)} \right]} \right)} \right) \\ & = \sigma \left( {f^{7 \times 7} \left[ {F_{{{\text{avg}}}}^{s} ;F_{\max }^{s} } \right]} \right) \\ \end{aligned}$$

(8)

where σ denotes the sigmoid operation and 7 × 7 is the size of the convolution kernel; a 7 × 7 kernel has been found to be more effective than a 3 × 3 kernel.
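A compact sketch of the two CBAM sub-modules defined by Eqs. (7) and (8) is given below; the reduction ratio and layer choices are the commonly used defaults and are assumptions rather than values reported in this paper.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Sketch of CBAM: channel attention (Eq. 7) followed by spatial attention (Eq. 8)."""
    def __init__(self, channels, reduction=16, kernel=7):
        super().__init__()
        self.mlp = nn.Sequential(                        # shared MLP: W1(W0(.))
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels))
        self.conv = nn.Conv2d(2, 1, kernel, padding=kernel // 2)

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))               # MLP(AvgPool(F))
        mx = self.mlp(x.amax(dim=(2, 3)))                # MLP(MaxPool(F))
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)            # Eq. (7): channel re-weighting
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.amax(dim=1, keepdim=True)], dim=1)         # [F_avg^s ; F_max^s]
        return x * torch.sigmoid(self.conv(s))                      # Eq. (8): spatial re-weighting
```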
GAM attention branch (D)

The Derivative (D) branch extracts high-frequency features to predict boundary regions. Figure 8 illustrates the detailed structure of the GAM (D) branch. The input of this branch is the output of the CBAM branch after two SWCA extractions, which has reached the deep stage of feature extraction. The GAM branch itself performs one round of feature extraction, followed by a 1 × 3 convolution. It then employs six convolutional layers with kernel sizes of 5 × 5, 5 × 5, 5 × 5, 5 × 5, 4 × 4, and 3 × 3, respectively, for the joint extraction of multi-scale spatial features. Subsequently, the output of the channel and spatial attention operations is concatenated with that of the CBAM branch. Finally, the combined features pass through the GAM module, the most suitable for this branch, to produce the final output.

Fig. 8 Network structure of the GAM (D) branch.

The GAM, as shown in Fig. 8a, enhances network performance by reducing information diffusion and amplifying global interactive representations. It consists of a channel attention submodule and a spatial attention submodule. As illustrated in Fig. 8b, the channel attention submodule preserves information across three dimensions using a three-dimensional arrangement and then employs a two-layer MLP to amplify cross-dimensional channel-space dependencies. The MLP is an encoder-decoder structure similar to that of BAM, with a compression ratio of r. As shown in Fig. 8c, the spatial attention submodule utilizes two convolutional layers to fuse spatial information, using the same compression ratio r as the channel attention submodule. Furthermore, to preserve the feature maps, pooling operations are omitted, since max pooling reduces information utilization and can negatively affect performance. Given the input feature map F1 ∈ R^{C×H×W}, the intermediate states F2 and F3 are defined as:

$$F_{2} = M_{C} (F_{1} ) \otimes F_{1}$$

(9)

$$F_{3} = M_{S} \left( {F_{2} } \right) \otimes F_{2}$$

(10)

F1 represents the input. After passing F1 through the three-dimensional arrangement and MLP operations, followed by a sigmoid activation function, we obtain Mc(F1), which is then multiplied by the original input F1 to get F2. F2 is processed through a 7 × 7 convolutional layer, undergoes channel reduction, and then passes through another 7 × 7 convolution to obtain a new feature map; after applying an activation function, this yields Ms(F2). Finally, Ms(F2) is multiplied element-wise with F2 to yield the final output F3.

Experimental setup and results

Data set

Similar to the approach in Refs. 52,53,54, this paper uses the Houston 2013, Salinas Valley (SV), and University of Pavia hyperspectral datasets to evaluate the proposed method. For each dataset, the values are first normalized to the range of 0–1. Then, for each class, 200 pixels along with their surrounding 5 × 5 neighborhoods are randomly selected as training samples, while the remaining pixels are used as test samples.

The Houston (HT) dataset has a size of 349 × 1905 pixels, encompassing 144 spectral bands ranging from 364 to 1046 nm. The ground truth labels include 15 land cover categories such as trees and soil, totaling 15,029 pixels. The selected categories and the number of samples are shown in Table 1.

Table 1 Sample selection for the Houston dataset.

The Salinas Valley (SV) dataset, after the removal of 20 water vapor and noise bands, retains 204 spectral bands. The dataset has a spatial dimension of 512 × 217 pixels with a spatial resolution of 3.7 m. The ground truth is categorized into 16 land cover classes, with specific land cover types and pixel counts detailed in Table 2.

Table 2 Sample selection for the Salinas Valley dataset.

The University of Pavia (UP) dataset, after the removal of noise and other bands, retains 103 spectral bands. The image size is 610 × 340 pixels, with a spatial resolution of 1.3 m. The spectral range spans from 0.43 to 0.86 µm. Approximately 20% of the pixels are labeled as ground truth, covering various urban structures, soil, natural targets, and shadows, among others. The specific pixel counts are shown in Table 3.

Table 3 Sample selection for the University of Pavia dataset.

Parameter setting

The model parameters are set based on experience: the optimizer is Adam, the learning rate is set to 0.00001, and the batch size for the IP and UP datasets is 256. The loss function is cross-entropy loss.

To evaluate the effectiveness of the proposed method, several approaches were employed for comparison, including support vector machines (SVM)4, multinomial logistic regression (MLR)5, extreme learning machines (ELM)10, random forests (RF)55, CNN49, PPF50, SpectralFormer (SF)51, RAMiT56, MSAA57, SAFM58, SSFTT59, morphFormer60 and GSC-ViT61. These methods were tested on the same training and test datasets. The overall accuracy (OA), average accuracy (AA), and Kappa coefficient are employed to evaluate the accuracy of the different classification methods.
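For reference, the three reported metrics can be computed from a confusion matrix as in the following sketch; this is the standard formulation, not the authors' evaluation script.

```python
import numpy as np

def oa_aa_kappa(y_true, y_pred, num_classes):
    """Overall accuracy, average (per-class) accuracy and Cohen's kappa from integer labels."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(cm, (y_true, y_pred), 1)                 # confusion matrix: rows = truth, cols = prediction
    oa = np.trace(cm) / cm.sum()                       # overall accuracy
    aa = np.mean(np.diag(cm) / cm.sum(axis=1))         # mean of per-class accuracies
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / cm.sum() ** 2   # expected agreement by chance
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa
```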
The classification performance of the HT dataset

As shown in Fig. 9, the performance of various methods on the HT dataset is presented. It can be observed that traditional machine learning methods do not perform well, exhibiting numerous classification errors. All methods show relatively good performance on the "grass_synthetic" category. Deep learning-based methods perform well overall, with both SAFM and the proposed method demonstrating strong performance across all categories. As shown in the enlarged area in the figure, apart from the morphFormer and GSC-ViT methods, the other methods produce a relatively large number of misclassifications, whereas the classification result of the proposed method is completely correct in this area. Notably, the method proposed in this paper stands out, demonstrating markedly superior classification performance.

Fig. 9 The classification performance of various methods on the HT dataset: (a) Original HSI (b) Ground Truth Map (c) MLR (d) SVM (e) RF (f) ELM (g) CNN2D (h) PPF (i) SF (j) RAMiT (k) MSAA (l) SAFM (m) SSFTT (n) morphFormer (o) GSC-ViT (p) PMMFNet.

As shown in Table 4, the classification accuracy of each method for every category, as well as the overall accuracy (OA), average accuracy (AA), and Kappa coefficient, are presented. From the table, it can be seen that all methods achieve an accuracy exceeding 98% for the "grass_synthetic" category. However, for the other categories, MLR performs significantly worse than the other methods, resulting in the lowest OA and Kappa coefficient. SF and SAFM show better performance, while the proposed method performs well across all categories, with an OA that is 1.09% higher than the second-ranked SAFM and nearly 40% higher than the worst-performing MLR. In the tennis court class, all methods demonstrate satisfactory classification performance; the classification accuracy of the proposed method, as well as of SSFTT, morphFormer, and GSC-ViT, which are specifically designed for hyperspectral image classification, reaches 100%. The results in the table demonstrate that the method proposed in this paper exhibits noticeably superior classification performance.

Table 4 Classification performance of various methods on the HT dataset.

The classification performance of the SV dataset

As shown in Fig. 10, the classification performance of various methods on the SV dataset is presented. It can be observed that traditional machine learning methods do not perform well, exhibiting numerous classification errors, whereas all methods perform well on the "Lettuce-romaine-5wk" category. Several deep learning-based methods show excellent performance on the "Vinyard-vertical-trellis" category. Both SF and SAFM demonstrate relatively good overall classification capabilities, and SSFTT and morphFormer perform exceptionally well on the Vinyard-untrained class. As shown in the magnified area of the figure, compared with the other methods, the method proposed in this paper achieves the best classification results, standing out with superior performance and better classification capability.

Fig. 10 The classification performance of various methods on the SV dataset: (a) Original HSI (b) Ground Truth Map (c) MLR (d) SVM (e) RF (f) ELM (g) CNN2D (h) PPF (i) SF (j) RAMiT (k) MSAA (l) SAFM (m) SSFTT (n) morphFormer (o) GSC-ViT (p) PMMFNet.

As shown in Table 5, the classification accuracy varies considerably among methods. The accuracy of every method in the Fallow-rough-plow category is close to 100%.
However, in the other categories, the performance of MLR and RF is significantly lower than that of the other methods, resulting in the lowest accuracy. The accuracies of SF, RAMiT, MSAA, and ELM are relatively similar, with SAFM performing slightly better. In the Vinyard-untrained class, the methods proposed in recent years all perform quite well, with classification accuracies close to or exceeding 90%, whereas the GSC-ViT method reaches a classification accuracy of only 80.31%. The method proposed in this paper performs excellently in all categories, with noticeably higher OA, AA, and Kappa coefficient values than the other methods. Based on the results in the table, it can be concluded that the proposed method has superior classification performance.

Table 5 Classification performance of various methods on the SV dataset.

The classification performance of the UP dataset

Figure 11 shows the classification performance of various methods on the UP dataset. It can be observed that in the categories with the largest number of pixels, such as Asphalt, Bare Soil, and Meadows, the classification errors of each method reflect its classification performance. MLR exhibits the most classification errors, followed by RF and ELM. SF, CNN, and RAMiT demonstrate good classification performance, while PPF, MSAA, and SAFM produce very few misclassifications. The classification results of SSFTT and morphFormer are satisfactory, although there are numerous misclassifications in the Self-Blocking Bricks class. As shown in the magnified area of the figure, the other methods exhibit more misclassifications, while the method presented in this paper performs well and demonstrates the overall best classification performance.

Fig. 11 The classification performance of various methods on the UP dataset: (a) Original HSI (b) Ground Truth Map (c) MLR (d) SVM (e) RF (f) ELM (g) CNN2D (h) PPF (i) SF (j) RAMiT (k) MSAA (l) SAFM (m) SSFTT (n) morphFormer (o) GSC-ViT (p) PMMFNet.

As shown in Table 6 below, the MLR method has the lowest OA overall, while SVM performs the best among the four traditional machine learning algorithms, though it is slightly inferior to the deep learning-based methods. The classification accuracies of PPF and SAFM reach 97.87% and 98.73%, respectively, demonstrating excellent classification performance, while the other methods show relatively lower accuracy. The classification accuracy of GSC-ViT on the Gravel class is only 96.54%, which affects its overall classification performance. The proposed method achieves an overall accuracy of up to 99.65%, indicating that it has the highest classification performance on this dataset.

Table 6 Classification performance of various methods on the UP dataset.

Classification performance with fewer samples

To compare the performance of various methods with fewer training samples, this paper sets the number of training samples per class to 50, 100, and 200, following the same protocol as above. The classification results of each method are computed, as shown in Fig. 12. It is clearly evident that as the sample size increases, there is a significant upward trend in classification performance across all datasets, in line with expectations. Traditional machine learning algorithms such as SVM maintain relatively strong classification performance due to their stable behavior.
However, ELM's performance is less stable on these three datasets. This is because the weight matrix from the input neurons to the hidden neurons and the hidden-layer thresholds are randomly generated, which can lead to ill-conditioned output matrices and unstable network structures, reducing robustness and classification performance. Deep learning-based methods noticeably outperform traditional machine learning methods across all training sample sizes on the three datasets. The classification accuracies of SSFTT, morphFormer, and GSC-ViT are relatively similar, while the proposed method consistently demonstrates the best classification performance.

Fig. 12 Classification performance of various methods with fewer samples on (a) HT, (b) SV, and (c) UP.

The impact of different sample sizes

Due to the limited number of samples in hyperspectral images, this paper uses random generation to create up to one hundred thousand samples within each class, aiming to increase the sample size and enhance classification performance. Generally speaking, the more samples there are, the better detailed features can be learned and extracted, but this also significantly increases the computational load. To evaluate the impact of sample size on classification performance, this paper sets four levels of sample sizes: 10,000, 50,000, 100,000, and 200,000. Using the network structure proposed in this paper, the final classification performance is assessed. The results, as shown in Fig. 13, indicate that as the sample size increases, the network's classification performance improves; when the sample size reaches 100,000, the classification accuracy tends to plateau. To balance classification and computing performance, the number of test samples in this paper is set at 100,000.

Fig. 13 Classification performance of the proposed method under different numbers of samples.

Classification performance of different attention branches

This paper employs three different attention branches, encompassing various sizes and scales, and compares the performance of the three attention mechanisms (NAM, GAM, and CBAM) across these branches. A total of six different combination structures were considered. Using the same methodology, the classification performance of each combination was calculated with a test sample size of 100,000. The results, as shown in Fig. 14, indicate that the choice of attention mechanism has a certain impact on classification performance because of the varying dimensions of the feature maps processed by each branch. Comparing the OA, AA, and Kappa coefficients of the six combinations, it is evident that the structure adopted in this paper, with branch 1 using NAM, branch 2 using CBAM, and branch 3 using GAM, is noticeably more stable than the other five combinations and also achieves much higher classification accuracy across the three datasets.

Fig. 14 The performance of different attention mechanisms in different branches of the PMMFNet method: (a) CBAM + GAM + NAM (b) CBAM + NAM + GAM (c) GAM + CBAM + NAM (d) GAM + NAM + CBAM (e) NAM + GAM + CBAM (f) NAM + CBAM + GAM (in this paper).

Classification performance of shuffle or no-shuffle

Because of the small sample size of hyperspectral datasets, there is a risk of overfitting when using few-shot learning methods with deep learning models.
Consequently, effectively expanding the sample size is a critical issue. This paper employs a spatial structure based on neighboring pixels together with a spatial shuffle scheme for small patches; on this foundation, using the network architecture presented in this paper, a relatively high classification accuracy is achieved. To compare the impact of shuffle versus no-shuffle on classification outcomes, this paper conducts tests under the condition of selecting 100,000 samples and training with 200 samples. The data in Table 7 below clearly indicate that the classification performance obtained after shuffle processing is noticeably superior to that obtained without shuffling.

Table 7 Shuffle versus no-shuffle on various datasets.

Classification performance of different patch sizes

To assess the effect of spatial context granularity on classification performance, we designed comparative experiments with different patch sizes (5 × 5, 7 × 7, and 9 × 9 pixels) under the condition of selecting 100,000 samples and training with 200 samples. The results in Table 8 below show that on the Houston dataset there is little difference in classification results among the three patch sizes; on the SV dataset, classification performance improves slightly as the size increases; and on the UP dataset, increasing the size introduces irrelevant noise, leading to a decrease in classification accuracy. Therefore, this paper selects a 5 × 5 patch size to balance accuracy and computational efficiency, achieving optimal results.

Table 8 Classification performance on different datasets corresponding to various patch sizes.

Conclusion

In this paper, we propose a Progressive Multi-Scale Multi-Attention Fusion Network, referred to as PMMFNet. Inspired by the PID controller, PMMFNet progressively extracts feature information from multiple scales through three complementary branches: Proportional (P), Integral (I), and Derivative (D). The SWCA module is designed to improve the model's ability to extract prominent features using skip connection-based feature fusion. Additionally, a multi-attention branch fusion module is proposed, which stacks features from different scales and selects the most appropriate attention mechanism for feature absorption, effectively enhancing feature reusability. Experimental results show that PMMFNet achieves high accuracy. Future work will focus on reducing the required sample size while further improving classification accuracy.

Data availability

The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.

References

Ding, Y. et al. Unsupervised self-correlated learning smoothy enhanced locality preserving graph convolution embedding clustering for hyperspectral images. IEEE Trans. Geosci. Remote Sens. 60, 1–16. https://doi.org/10.1109/TGRS.2022.3202865 (2022).Article Google Scholar Yang, X. & Yu, Y. Estimating soil salinity under various moisture conditions: An experimental study. IEEE Trans. Geosci. Remote Sens. 55(5), 2525–2533. https://doi.org/10.1109/TGRS.2016.2646420 (2017).Article ADS Google Scholar Yokoya, N., Chan, J.C.-W. & Segl, K. Potential of resolution enhanced hyperspectral data for mineral mapping using simulated EnMAP and sentinel-2 images. Remote Sens. 8(3), 172–189. https://doi.org/10.3390/rs8030172 (2016).Article ADS Google Scholar Li, S., Dian, R., Fang, L. & Bioucas-Dias, J. M. Fusing hyperspectral and multispectral images via coupled sparse tensor factorization. IEEE Trans.
Image Process. 27(8), 4118–4130. https://doi.org/10.1109/TIP.2018.2836307 (2018).Article ADS MathSciNet Google Scholar Zhang, S., Li, J., Wu, Z. & Plaza, A. Spatial discontinuity-weighted sparse unmixing of hyperspectral images. IEEE Trans. Geosci. Remote Sens. 56(10), 5767–5779. https://doi.org/10.1109/TGRS.2018.2825457 (2018).Article ADS Google Scholar Kumar, V. & Ghosh, J. K. Camouflage detection using MWIR hyperspectral images. J Indian Soc. Remote Sens. 45, 139–145. https://doi.org/10.1007/s12524-016-0555-8 (2017).Article Google Scholar Zhang, Q., Yuan, Q., Song, M., Yu, H. & Zhang, L. Co-operated spectral low-rankness prior and deep spatial prior for HSI unsupervised denoising. IEEE Trans. Image Process. 31, 6356–6368. https://doi.org/10.1109/TIP.2022.3211471 (2022).Article ADS PubMed Google Scholar Ding, Y., Zhang, Z., Kang, W., Yang, A., Zhao, J., Feng, J., Hong, D. & Zheng, Q. Adaptive homophily clustering: A structure homophily graph learning with adaptive filter for hyperspectral image. Preprint at arXiv:2501.01595 (2025).Heiden, U., Roessner, S., Segl, K. & Kaufmann, H. Analysis of spectral signatures of urban surfaces for their identification using hyperspectral HyMap data, in IEEE/ISPRS Joint Workshop on Remote Sensing and Data Fusion over Urban Areas (Cat. No.01EX482), Rome, Italy, 173–177, https://doi.org/10.1109/DFUA.2001.985871 (2001).Melgani, F. & Bruzzone, L. Classification of hyperspectral remote sensing images with support vector machines. IEEE Trans. Geosci. Remote Sens. 42(8), 1778–1790. https://doi.org/10.1109/TGRS.2004.831865 (2004).Article ADS Google Scholar Li, J., Bioucas-Dias, J. M. & Plaza, A. Semisupervised hyperspectral image segmentation using multinomial logistic regression with active learning. IEEE Trans. Geosci. Remote Sens. 48(11), 4085–4098. https://doi.org/10.1109/TGRS.2010.2060550 (2010).Article Google Scholar Liu, Z., Tang, B., He, X., Qiu, Q. & Liu, F. Class-specific random forest with cross-correlation constraints for spectral—spatial hyperspectral image classification. IEEE Geosci. Remote Sens. Lett. 14(2), 257–261. https://doi.org/10.1109/LGRS.2016.2637561 (2017).Article ADS Google Scholar Du, B. & Zhang, L. Target detection based on a dynamic subspace. Pattern Recognit. 47(1), 344–358. https://doi.org/10.1016/j.patcog.2013.07.005 (2014).Article ADS Google Scholar Yu, H. et al. Hyperspectral image classification based on adjacent constraint representation. IEEE Geosci. Remote Sens. Lett. 18(4), 707–711. https://doi.org/10.1109/LGRS.2020.2982706 (2021).Article ADS Google Scholar Le, L., Xie, Y. & Raghavan, V. V. KNN loss and deep KNN. Fund. Inf. 182(2), 95–110. https://doi.org/10.3233/FI-2021-2068 (2021).Article MathSciNet Google Scholar Villa, A., Benediktsson, J. A., Chanussot, J. & Jutten, C. Hyperspectral image classification with independent component discriminant analysis. IEEE Trans. Geosci. Remote Sens. 49(12), 4865–4876. https://doi.org/10.1109/TGRS.2011.2153861 (2011).Article ADS Google Scholar Licciardi, G., Marpu, P. R., Chanussot, J. & Bened-iktsson, J. A. Linear versus nonlinear PCA for the classification of hyperspectral data based on the extended morphological profiles. IEEE Geosci. Remote Sens. Lett. 9(3), 447–451. https://doi.org/10.1109/LGRS.2011.2172185 (2012).Article ADS Google Scholar Bandos, T. V., Bruzzone, L. & Camps-Valls, G. Classification of hyperspectral images with regularized linear discriminant analysis. IEEE Trans. Geosci. Remote Sens. 47(3), 862–873. 
https://doi.org/10.1109/TGRS.2008.2005729 (2009).
Ghamisi, P. et al. New frontiers in spectral-spatial hyperspectral image classification: The latest advances based on mathematical morphology, Markov random fields, segmentation, sparse representation, and deep learning. IEEE Geosci. Remote Sens. Mag. 6(3), 10–43. https://doi.org/10.1109/MGRS.2018.2854840 (2018).
Feng, J., Zhang, T., Zhang, J., Shang, R., Dong, W., Shi, G. & Jiao, L. S4DL: Shift-sensitive spatial–spectral disentangling learning for hyperspectral image unsupervised domain adaptation. IEEE Transactions on Neural Networks and Learning Systems. Preprint at arXiv:2408.15263 (2025).
Khurshid, H. & Khan, M. F. Classification of remotely sensed images using decimal coded morphological profiles. SIViP 10, 1001–1007. https://doi.org/10.1007/s11760-015-0851-8 (2016).
Benediktsson, J. A., Palmason, J. A. & Sveinsson, J. R. Classification of hyperspectral data from urban areas based on extended morphological profiles. IEEE Trans. Geosci. Remote Sens. 43(3), 480–491. https://doi.org/10.1109/TGRS.2004.842478 (2005).
Liu, S., Zhang, L., Cen, Y., Chen, L. & Wang, Y. A fast hyperspectral anomaly detection algorithm based on greedy bilateral smoothing and extended multi-attribute profile. Remote Sens. 13, 3954. https://doi.org/10.3390/rs13193954 (2021).
Ding, Y., Zhang, Z., Yang, A., Cai, Y., Xiao, X., Hong, D. & Yuan, J. SLCGC: A lightweight self-supervised low-pass contrastive graph clustering network for hyperspectral images. Preprint at arXiv:2502.03497 (2025).
Bordes, A., Glorot, X., Weston, J. & Bengio, Y. Joint learning of words and meaning representations for open-text semantic parsing. In Proc. Int. Conf. Artif. Intell. Stat., pp. 127–135 (2012).
Zheng, W.-L., Zhu, J.-Y., Peng, Y. & Lu, B.-L. EEG-based emotion classification using deep belief networks. In 2014 IEEE International Conference on Multimedia and Expo (ICME), Chengdu, China, pp. 1–6. https://doi.org/10.1109/ICME.2014.6890166 (2014).
Lin, Z., Chen, Y., Zhao, X. & Wang, G. Spectral-spatial classification of hyperspectral image using autoencoders. In 2013 9th International Conference on Information, Communications & Signal Processing, Tainan, pp. 1–5. https://doi.org/10.1109/ICICS.2013.6782778 (2013).
Li, D., Zhang, J., Zhang, Q. & Wei, X. Classification of ECG signals based on 1D convolution neural network. In 2017 IEEE 19th International Conference on e-Health Networking, Applications and Services (Healthcom), Dalian, China, pp. 1–6. https://doi.org/10.1109/HealthCom.2017.8210784 (2017).
Wang, Q., Yuan, Z., Du, Q. & Li, X. GETNET: A general end-to-end 2-D CNN framework for hyperspectral image change detection. IEEE Trans. Geosci. Remote Sens. 57(1), 3–13. https://doi.org/10.1109/TGRS.2018.2849692 (2019).
Li, Y., Zhang, H. & Shen, Q. Spectral–spatial classification of hyperspectral imagery with 3D convolutional neural network. Remote Sens. 9(1), 67. https://doi.org/10.1016/j.asr.2019.05.005 (2019).
Roy, S. K., Krishna, G., Dubey, S. R. & Chaudhuri, B. B. HybridSN: Exploring 3-D–2-D CNN feature hierarchy for hyperspectral image classification. IEEE Geosci. Remote Sens. Lett. 17(2), 277–281. https://doi.org/10.1109/LGRS.2019.2918719 (2020).
Roy, S. K., Manna, S., Song, T. & Bruzzone, L. Attention-based adaptive spectral-spatial kernel ResNet for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 59(9), 7831–7843 (2021).
Lei, X., Pan, H. & Huang, X. A dilated CNN model for image classification. IEEE Access 7, 124087–124095. https://doi.org/10.1109/ACCESS.2019.2927169 (2019).
Huang, G., Liu, Z., van der Maaten, L. & Weinberger, K. Q. Densely connected convolutional networks. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 2261–2269 (2017).
Zhang, W., Tang, P. & Zhao, L. Remote sensing image scene classification using CNN-CapsNet. Remote Sens. 11, 494. https://doi.org/10.3390/rs11050494 (2019).
Xu, F., Zhang, G., Song, C., Wang, H. & Mei, S. Multiscale and cross-level attention learning for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 61, 1–15. https://doi.org/10.1109/TGRS.2023.3235819 (2023).
Qiao, X., Roy, S. K. & Huang, W. Multiscale neighborhood attention transformer with optimized spatial pattern for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 61, 1–15. https://doi.org/10.1109/TGRS.2023.3314550 (2023).
Hu, B., Tu, Q., Ren, X., Liao, Z. C. & Plaza, A. Hyperspectral image classification via multiscale multiangle attention network. IEEE Trans. Geosci. Remote Sens. 62, 1–18. https://doi.org/10.1109/TGRS.2024.3370919 (2024).
Vaswani, A. et al. Attention is all you need. In Proc. Adv. Neural Inf. Process. Syst., pp. 5998–6008 (2017).
Hu, J., Shen, L. & Sun, G. Squeeze-and-excitation networks. In Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pp. 7132–7141 (2018).
Wang, Q., Wu, B., Zhu, P., Li, P., Zuo, W. & Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 11534–11542 (2020).
Fu, J. et al. Dual attention network for scene segmentation. In Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 3146–3154 (2019).
Ma, W., Yang, Q., Wu, Y., Zhao, W. & Zhang, X. Double-branch multi-attention mechanism network for hyperspectral image classification. Remote Sens. 11, 1307. https://doi.org/10.3390/rs11111307 (2019).
Li, R., Zheng, S., Duan, C., Yang, Y. & Wang, X. Classification of hyperspectral image based on double-branch dual-attention mechanism network. Remote Sens. 12, 582. https://doi.org/10.3390/rs12030582 (2020).
Cui, Y., Yu, Z., Han, J., Gao, S. & Wang, L. Dual-triple attention network for hyperspectral image classification using limited training samples. IEEE Geosci. Remote Sens. Lett. 19, 1–5. https://doi.org/10.1109/LGRS.2021.3067348 (2022).
Tang, Z. et al. Joint graph attention and asymmetric convolutional neural network for deep image compression. IEEE Trans. Circuits Syst. Video Technol. 33(1), 421–433. https://doi.org/10.1109/TCSVT.2022.3199472 (2023).
Liu, T. et al. Spatial channel attention for deep convolutional neural networks. Mathematics 10, 1750. https://doi.org/10.3390/math10101750 (2022).
Wang, Z., Cao, B. & Liu, J. Hyperspectral image classification via spatial shuffle-based convolutional neural network. Remote Sens. 15, 3960. https://doi.org/10.3390/rs15163960 (2023).
Ham, J., Chen, Y., Crawford, M. M. & Ghosh, J. Investigation of the random forest framework for classification of hyperspectral data. IEEE Trans. Geosci. Remote Sens. 43(3), 492–501. https://doi.org/10.1109/TGRS.2004.842481 (2005).
Paoletti, M. E., Haut, J. M., Plaza, J. & Plaza, A. Deep learning classifiers for hyperspectral imaging: A review. ISPRS J. Photogram. Remote Sens. https://doi.org/10.1016/j.isprsjprs.2019.09.006 (2019).
Hong, D. et al. SpectralFormer: Rethinking hyperspectral image classification with transformers. IEEE Trans. Geosci. Remote Sens. 60, 1–15. https://doi.org/10.1109/TGRS.2021.3130716 (2022).
Ding, H. et al. TBSSF-Net: Three-branch spatial-spectral fusion network for hyperspectral image classification. Opt. Express https://doi.org/10.1364/OE.550150 (2025).
Li, W., Wu, G., Zhang, F. & Du, Q. Hyperspectral image classification using deep pixel-pair features. IEEE Trans. Geosci. Remote Sens. 55(2), 844–853. https://doi.org/10.1109/TGRS.2016.2616355 (2017).
Ding, Y. et al. Self-supervised locality preserving low-pass graph convolutional embedding for large-scale hyperspectral image clustering. IEEE Trans. Geosci. Remote Sens. 60, 1–16. https://doi.org/10.1109/TGRS.2022.3198842 (2022).
He, Y. et al. Generation of 1 km high resolution Standardized Precipitation Evapotranspiration Index for drought monitoring over China using Google Earth Engine. Int. J. Appl. Earth Observ. Geoinf. 135, 104296 (2024).
Yu, W., Zhou, P., Yan, S. et al. InceptionNeXt: When Inception meets ConvNeXt. In Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 5672–5683 (2024).
Liu, M. et al. CM-UNet: Hybrid CNN-Mamba UNet for remote sensing image semantic segmentation. Preprint at arXiv:2405.10530 (2024).
Sun, L. et al. Spatially-adaptive feature modulation for efficient image super-resolution. In Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV) (2023).
Sun, L., Zhao, G., Zheng, Y. & Wu, Z. Spectral–spatial feature tokenization transformer for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 60, 1–14. https://doi.org/10.1109/TGRS.2022.3144158 (2022).
Roy, S. K. et al. Spectral–spatial morphological attention transformer for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 61, 1–15. https://doi.org/10.1109/TGRS.2023.3242346 (2023).
Zhao, Z., Xu, X., Li, S. & Plaza, A. Hyperspectral image classification using groupwise separable convolutional vision transformer network. IEEE Trans. Geosci. Remote Sens. 62, 1–17. https://doi.org/10.1109/TGRS.2024.3377610 (2024).
Acknowledgements
This research was supported in part by the National Natural Science Foundation of China under Grant 62071175, the Hunan Provincial Natural Science Foundation of China under Grant 2024JJ8359, the Hunan Provincial Department of Education Scientific Research under Grant 22B0376, the Hunan Province Traditional Chinese Medicine Scientific Research Project under Grant A2024003, the 2022 Doctoral Research Initiation Fund of Hunan University of Chinese Medicine under Grant 0001036, and the University-level Scientific Research Fund Project of Hunan University of Chinese Medicine under Grant 2024XJZA005.
Author information
Authors and Affiliations
School of Informatics, Hunan University of Chinese Medicine, Changsha, 410208, China: Hu Wang, Jun Liu, Yingying Peng & Zhihui Wang
AI TCM Lab Hunan, Changsha, 410208, China: Hu Wang, Jun Liu, Yingying Peng & Zhihui Wang
Second Surveying and Mapping Institute of Hunan Province, Changsha, 430103, China: Sixiang Quan & Hai Xiao
College of Electrical and Information Engineering, Hunan University, Changsha, 410082, China: Huali Li
Contributions
Methodology, H.W. and J.L.; software, H.W., S.Q., Z.W., H.X. and Y.P.; investigation, J.L., H.L., Z.W., H.X. and Y.P.; validation, H.W., S.Q., H.L., Z.W., H.X. and Y.P.; writing – original draft preparation, H.W.; writing – review and editing, S.Q., H.X., H.L., Z.W. and J.L. All authors have read and agreed to the published version of the manuscript.
Corresponding author
Correspondence to Jun Liu.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.