Introduction
Medical imaging analysis, including Computed Tomography (CT), is vital for detecting lung cancer. Medical diagnosis depends primarily on the careful examination of these medical images. However, a significant gap remains in the field of medical diagnosis: because of the images’ complexity and unique properties, analyzing and interpreting them is a specialized task typically reserved for experts. According to statistics from The Global Cancer Observatory (GCO), lung cancer remains the leading cause of cancer death, responsible for approximately 2.21 million fatalities, which accounts for 18% of all cancer-related deaths1. Each year, lung cancer claims the lives of over 7.6 million people globally. In Egypt, the number of lung cancer cases increased from 16,596 to 29,576 per year2. Delayed detection of lung cancer increases the death rate among lung cancer patients, so researchers work to enable early detection of lung cancer and save lives3. Artificial intelligence (AI) is an essential part of computer science that aims to help computers reason and solve problems4. Therefore, advancements in machine learning and deep learning integrated with AI have significantly impacted medical image processing and classification. In medicine, this integration helps identify abnormal nodules that radiologists might find challenging or time-consuming to diagnose5. Furthermore, these technologies enable radiologists to make predictions, allowing preventive measures to be taken before the disease manifests. Among the AI algorithms that help radiologists in diagnosis are Reinforcement Learning (RL) and active learning integrated with deep learning and machine learning. In recent years, deep learning has significantly enhanced the effectiveness of computer-aided diagnosis (CAD) algorithms for cancer screening6.
Nonetheless, a drawback is associated with numerous deep learning classification models, including convolutional neural networks (CNNs) and the Convolutional-Attention Network (CoAtNet). A CNN begins with the matrix of an input image and extracts the critical features layer by layer. The CNN architecture consists of three basic parts: the convolution layer, the pooling layer, and the fully connected layer. In the convolution layer, features are extracted by convolutional filtering of the input image. The pooling layer then reduces the dimension of the extracted features while preserving the main characteristics of the image. Finally, the fully connected layer classifies the medical image as cancer or non-cancer7. Advanced models utilizing CNNs have been successfully applied to a variety of medical image analysis tasks, including disease detection from X-ray images, further demonstrating their versatility in the medical domain8. CoAtNet refers to the Convolutional-Attention Network, a hybrid model that combines Convolutional Neural Networks with the Vision Transformer (ViT). The CNN component detects important image features such as edges, textures, and corners, while the transformer component captures relationships between regions of the picture that are far apart by using self-attention layers. CoAtNet has a number of stages. The initial stages use convolutional layers and max pooling layers to extract low-level features from images and reduce the feature maps. The middle stages combine convolutional layers and self-attention to extract local features of the organ and the distribution of lung tissues.
The last stages use self-attention layers to extract more global features and classify the dataset9. A critical challenge with these sophisticated deep learning and hybrid models is their dependency on vast, expert-annotated datasets to attain optimal accuracy. Meticulous annotation by expert radiologists is a persistently costly, time-consuming, and labor-intensive process, creating a significant labeling bottleneck, particularly in medical imaging analysis, where enormous amounts of unlabeled data exist but individual annotation is impractical. Therefore, active learning is used to selectively choose which data points from an unlabeled dataset should be labeled. It does this by iteratively selecting the most informative, uncertain samples of the dataset. By focusing on these specific samples, active learning aims to supply the model with the information it requires to generalize more effectively while reducing labeling cost and time10. Active learning makes use of query strategies such as least confidence, entropy, and margin. Moreover, current systems often rely on either hand-crafted features or deep-learning-derived features, but rarely exploit the synergistic potential of multiple, diverse feature extraction methods. A key challenge lies in effectively fusing, selecting, and leveraging these heterogeneous feature sets without introducing redundancy or noise, especially with the high dimensionality inherent in comprehensive medical image analysis. Many existing approaches either oversimplify feature representation or lack a systematic way to identify the most pertinent attributes across different feature spaces, leading to suboptimal classification performance and missed diagnostic cues crucial for early and accurate detection. In addition to active learning, reinforcement learning has improved the computer analysis of medical images in recent years.
It is used to analyze medical images, solve problems related to image analysis, and classify images. Reinforcement learning can also help make the best decisions when classifying a medical image. Deep Q-learning and the Deep Q-Network are types of deep reinforcement learning11. However, designing an RL framework that can effectively learn optimal policies within the constraints of medical image analysis, without requiring prohibitive amounts of trial and error in a clinical setting, remains a complex challenge. A critical unmet need is for diagnostic systems that can adapt and improve continually based on new, diagnostically challenging cases, simulating a radiologist’s evolving expertise. Explainable AI (XAI) is used in deep learning and machine learning (ML) to help radiologists make decisions. Many researchers have realized the significance of explanation, so explainable AI models for lung nodule diagnosis have been developed by predicting clinical features. Explainable AI can be used to detect useful features and information in medical images, which is why it has become so important in artificial intelligence12. The application of XAI is crucial for ensuring the interpretability of machine learning decisions in clinical practice, having been effectively demonstrated in diagnostic tasks such as stroke detection, where models like Random Forest are combined with SHAP XAI to highlight influential features13. ML techniques applied to medical images are crucial for efficient and cost-effective information extraction from these images. Such techniques greatly enhance the capacity of researchers and healthcare professionals to comprehend the underlying factors contributing to various illnesses.
Some notable methods include eXtreme Gradient Boosting (XGBoost), random forests (RF), K-nearest neighbors (KNN), and decision trees (DT). Despite the promising advancements in deep learning models, particularly hybrid architectures like CoAtNet, a significant hurdle to their widespread clinical adoption remains: their ‘black box’ nature. Radiologists often hesitate to fully trust AI recommendations without a clear, interpretable explanation of why a particular diagnosis was made. This trust deficit is exacerbated in multi-faceted hybrid models, where feature extraction is distributed and complex, making it difficult to pinpoint the exact visual cues influencing the decision. The lack of transparent reasoning not only hinders clinical integration but also complicates error analysis and model refinement, creating a critical need for robust Explainable AI (XAI) tailored for these sophisticated architectures. The goal of AI in medical imaging is not just to classify, but to integrate seamlessly into clinical workflows and enhance diagnostic confidence. A significant gap exists in creating and validating holistic AI-driven diagnostic pipelines that not only achieve high technical accuracy but also provide actionable, interpretable insights directly validated by expert radiologists. Many research efforts focus on individual components (e.g., a new classification model or an XAI technique) but fail to integrate these into a coherent system in which active learning efficiently curates data, multiple feature types are optimally exploited, hybrid deep learning models perform classification, and XAI outputs are clinically vetted and refined based on radiologist feedback.
This comprehensive integration, coupled with expert opinion, is essential to bridge the chasm between research prototypes and real-world clinical utility. The objective of this research is to establish a comprehensive machine-learning workflow integrating active reinforcement learning, preprocessing, feature extraction, feature selection, and classification techniques with XAI for precise detection of lung cancer from CT scan images. The study concentrates on refining the efficiency and efficacy of the classification model through active learning, diminishing the necessity for extensive labeled data. Additionally, there is a specific emphasis on preprocessing to enhance image quality, involving resizing, noise removal, and various methods for feature extraction. Exploration of feature selection techniques aims to identify the most pertinent attributes, optimizing model performance for the classification task. Two primary classification approaches are investigated: traditional machine learning algorithms with diverse feature sets, and deep learning models that integrate the CoAtNet architecture with explainable AI (XAI) using a large lung cancer CT image dataset. In addition, attention fusion is integrated with the CNN and CoAtNet models to improve accuracy. Evaluation metrics encompass accuracy, training and testing times, and Area Under the Curve (AUC) scores, providing valuable insights into the most suitable techniques for accurate lung cancer detection. The overarching goal of this research is to make a substantive contribution to the field of medical image analysis by identifying the most effective combination of methods to enhance diagnostic accuracy in lung cancer detection. This research significantly advances the domain of lung cancer diagnosis using CT images by addressing several pivotal challenges.
The key contributions of this paper include the following:

1. The paper focuses on leveraging a substantial dataset to improve the accuracy of lung cancer detection models. Utilizing a large dataset enables the model to learn from a diverse range of cases, fostering better generalization and ultimately achieving higher accuracy in both training and testing phases.
2. The paper also addresses the challenge of labeling data, traditionally a time-consuming and expensive process. To address this issue, it implements active reinforcement learning, intelligently selecting informative samples for manual labeling. This approach significantly reduces the time and resources required to label a large, unlabeled dataset, thereby enhancing the efficiency of the training process.
3. To improve the results of deep learning, the research employs various algorithms for feature extraction and selection. Careful selection and optimization of features enhance the effectiveness of the classification process, and attention fusion features are combined with different deep learning architectures.
4. Finally, CoAtNet is integrated with explainable AI to help experts diagnose, provide more useful information in CT images, and support diagnostic decisions, with radiologist evaluation used to enhance the results.

This study is organized as follows: section “Related works” discusses the related works. Section “Dataset for lung cancer” describes the dataset used in this study. Section “Methodology” describes the process. Section “Experiments and results” describes the experiments and outcomes. Sections “Conclusion” and “Future works” present the conclusion and future work of this paper.

Related works
Many researchers have put their heads together to develop an effective and efficient approach for detecting or forecasting lung cancer, as well as increasing the true-positive rate of lung nodule detection.
These techniques include algorithms for image processing, deep learning detection, and machine learning. These procedures are applied to CT images to determine the most efficient and accurate lung cancer detection outcomes. Luo et al.14 proposed the LLC-QE model, a new approach that integrates reinforcement learning and ensemble learning for classifying lung cancer. The Artificial Bee Colony (ABC) algorithm is used to reduce the probability of the model getting stuck in a local optimum. The feature vectors are extracted using a CNN model. To train and assess this model, the Lung Image Database Consortium and Image Database Resource Initiative (LIDC-IDRI) dataset, which contains predominantly cases without cancer, was used. Reinforcement learning is formulated as a series of interconnected decisions to reduce the imbalance of the dataset. LLC-QE achieved an F-measure of 89.8% and a geometric mean of 92.7%. Saha et al.15 introduced the new Volumetric Encoder-Decoder Residual Network (VER-Net), using three different transfer learning models to detect lung cancer in CT images. The dataset from Kaggle includes 1,653 CT images. The results show that the VER-Net model reached a high accuracy of 91% when compared to other models. Bhatia et al.16 tackled the trade-off between computational cost and performance by introducing a lightweight advanced Deep Neural Network (DNN). Aimed at resource-limited environments, the model was trained on the LUNA16 dataset (a curated subset of LIDC-IDRI containing 888 CT scans), with a focus on low memory use and noise reduction. It achieved a high accuracy of 98.2%, showing that streamlined, single-stream architectures can perform on par with more complex ensemble models. That said, as with Shatnawi and Abuein17, the small dataset makes it difficult to fully evaluate the model’s robustness across diverse clinical scenarios.
Additionally, while the approach is computationally efficient, it emphasizes performance metrics and does not include Explainable AI (XAI) visualizations to confirm the clinical relevance of the features it learned. Shatnawi and Abuein17 developed models for automatic prediction and classification of lung cancer CT scan images. They used a dataset of 1,000 CT scans from Kaggle, which includes four classes: 215 normal images, 187 large cell carcinomas, 338 adenocarcinomas, and 260 squamous cell carcinomas. The data was split into 70% for training and 30% for testing, ensuring a balanced dataset. This research employed several pretrained models, including ConvNeXtSmall, InceptionV3, ResNet50, EfficientNetB0, and VGG16, with testing accuracies of 87%, 76.9%, 94.5%, and 97.9%, respectively. Additionally, a customized CNN model achieved a testing accuracy of 100%, outperforming the other models. Ahmad Hassan et al.18 improved a Gestational Diabetes Mellitus (GDM) prediction model using a fusion technique that combines multiple algorithms with explainability. Given the significant risks linked to GDM, they proposed a new way to build a prediction model that merges traditional Machine Learning (ML) methods with cutting-edge Deep Learning (DL) algorithms. The hybrid model uses several ensemble methods and a meta-classifier to deliver reliable prediction performance. They applied data preprocessing techniques such as multiple imputation, feature engineering, and oversampling to tackle class imbalance before running the model. These efforts resulted in high performance: accuracy of 98.21%, precision of 97.72%, and AUC of 99.91%, all of which surpass earlier studies using the same data. The authors use explainable AI (XAI) methods to highlight the most important features.
This improves interpretability and supports proactive GDM management that can enhance maternal and fetal health. Taken together, the state-of-the-art methods in Table 1 show a clear trade-off for clinical practice concerning annotation and computation costs as well as interpretability. Ensemble-based systems like VER-Net15 rely on large, fully annotated datasets and carry heavy inference costs because of multi-stream processing. On the other hand, custom and lightweight CNNs16,17 achieve impressive computational efficiency and high reported accuracy, but they often rely on very limited datasets, sometimes with fewer than 1,000 images. This raises concerns about generalizability and potential overfitting. More complex pipelines, such as LLC-QE14, add significant architectural complexity through reinforcement learning, which can make the models harder to audit. In the medical prediction field, models like the GDM prediction fusion model18 have shown high performance and included XAI for interpretability, but these methods usually focus on tabular datasets, which may be imbalanced, rather than on image-based diagnostics. Importantly, none of these methods address the three main challenges in lung cancer CAD systems: (1) reducing reliance on large, annotated datasets, (2) maintaining computational efficiency for use in resource-limited settings, and (3) improving interpretability to build trust among radiologists. This gap motivates the proposed approach, which combines Active Learning with Reinforcement Learning and XAI-guided feature selection. This combination aims to lower labeling costs, improve diagnostic transparency, and achieve strong performance while keeping computational demands low.

Table 1 Comparison of State-of-the-Art (SOTA) methods highlighting data size, computational cost, and interpretability.

Dataset for lung cancer
This research uses a curated subset of the Data Science Bowl 2017 dataset19.
From the original 285,380 CT images from 2,101 patients, 30,020 CT images belonging to 790 patients were selected. This dataset was divided into labeled and unlabeled sets to assess model performance20. Figure 1 provides sample images.

Fig. 1 Sample of the Kaggle Data Science Bowl 2017 CT image dataset: (a) normal, (b) abnormal.

Methodology
The block diagram in Fig. 2 outlines the proposed model, Active Reinforcement learning and eXplainable AI with deep Attention Fusion features (ARXAF-Net), a multi-stage process for active reinforcement deep learning with XAI. In Stage 1, the preprocessing phase commences with two distinct datasets: a labeled image dataset and an unlabeled dataset. Stage 2 (a, b) presents an active reinforcement deep-learning process following preprocessing. Stage 3 focuses on feature extraction using traditional techniques and a deep feature learning model. In Stage 4, the process concludes with training and testing, which includes evaluating performance metrics for both training and testing datasets using traditional machine learning methods and deep learning models. Stage 5 integrates the best model of Stage 4 with explainable AI to help experts with diagnostic decision-making on CT images; the XAI output, produced using Gradient-weighted Class Activation Mapping (Grad-CAM), is then given to radiologists for evaluation.

Fig. 2 Block diagram of the proposed model.

ARXAF-Net algorithm overview
The ARXAF-Net model comprises five core stages that integrate preprocessing, active deep learning with reinforcement learning (RL), explainability (XAI), and robust performance evaluation, as shown in Algorithm 1.

Algorithm 1 ARXAF-Net: Active Deep Learning with RL and XAI.

Stage 1 - preprocessing of the dataset
In this initial stage, the total dataset contains 30,020 CT images, evenly balanced with 15,010 cancer images and 15,010 non-cancer images.
Preprocessing commences with two distinct datasets: a labeled image dataset and an unlabeled dataset, whose characteristics are summarized in Table 2. The labeled dataset consists of 6,080 images acquired from a total of 160 patients, balanced between 80 cancer patients and 80 non-cancer patients. Each patient contributed 38 images, ensuring equitable representation across the cohort. Specifically, 3,040 of these images are classified as cancerous, while the remaining 3,040 are non-cancerous, giving a balanced class distribution. To rigorously prevent patient-level data leakage, the dataset was split at the patient level into distinct training, validation, and test sets. This critical step ensures that no images from a single patient appear in more than one set. Furthermore, these splits were stratified to maintain consistent class balance across all partitions. As part of the preprocessing pipeline, all images are resized to a uniform resolution of 128 \(\times\) 128 pixels. Following this, the pixel intensity values are normalized across the entire dataset to ensure consistency and minimize inter-image variation.

Table 2 Labeled and unlabeled CT image counts.

Stage 2 - active reinforcement deep learning
Following initial preprocessing, a CoAtNet-based deep learning model is trained on the labeled dataset, resulting in a baseline classifier. The unlabeled dataset is then processed in batches of 200 images. For each batch, the model predicts class probabilities, and the entropy of each sample is computed to quantify prediction uncertainty:$$\begin{aligned} H(x) = -\sum _{c} p_c \log (p_c) \end{aligned}$$where \(p_c\) is the predicted probability for class \(c\).
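This entropy measure, together with the threshold-gated top-k selection applied to each batch, can be sketched in a few lines of NumPy. The function names and the toy probability batch below are illustrative, not part of the published implementation:

```python
import numpy as np

def entropy_uncertainty(probs, eps=1e-12):
    """Shannon entropy H(x) = -sum_c p_c log p_c for each row of class probabilities."""
    return -np.sum(probs * np.log(probs + eps), axis=1)

def select_top_k_uncertain(probs, k, threshold=0.0):
    """Return indices of the k most uncertain samples whose entropy exceeds the threshold."""
    h = entropy_uncertainty(probs)
    candidates = np.where(h >= threshold)[0]
    # rank candidates by descending entropy and keep the top k
    ranked = candidates[np.argsort(-h[candidates])]
    return ranked[:k]

# toy batch of predicted probabilities for the binary (non-cancer, cancer) task
batch = np.array([
    [0.99, 0.01],   # confident prediction -> low entropy
    [0.55, 0.45],   # uncertain -> high entropy
    [0.50, 0.50],   # maximally uncertain
])
print(select_top_k_uncertain(batch, k=2))  # → [2 1]
```

In the paper's pipeline the selected indices would point into each batch of 200 unlabeled images before they are handed to the RL agent.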
The top-k most uncertain samples, determined using a predefined entropy threshold, are selected and passed to a reinforcement learning (RL) agent for pseudo-labeling. The pseudo-labeling process is formalized as a Markov Decision Process (MDP):$$\begin{aligned} \text {MDP} = (S, A, P, R, \gamma ) \end{aligned}$$where:

- State \(S\): the flattened feature vector of the selected image.
- Action \(A\): label assignment (0 = non-cancer, 1 = cancer).
- Transition \(P(s'|s,a)\): determined by moving to the next image in the batch.
- Reward \(R(s,a)\): \(+1\) if the assigned label matches the predicted label, \(-1\) otherwise.
- Discount factor \(\gamma\): 0.95, controlling the importance of future rewards in Q-value updates.

In this framework, the RL agent performs label selection, not sample selection, which distinguishes it from standard uncertainty sampling. The transition \(P(s'|s,a)\) follows the sequential order of the top-k uncertain samples: after labeling sample \(i\), the agent moves to the state of sample \(i+1\). The RL agent learns a policy that maximizes label correctness and reduces noise in the pseudo-labeled dataset, something classical active learning cannot achieve. The Q-learning agent updates the Q-table iteratively using:$$\begin{aligned} Q(s,a) \leftarrow Q(s,a) + \alpha \big [ r + \gamma \max _{a'} Q(s',a') - Q(s,a) \big ] \end{aligned}$$where \(\alpha = 0.1\) is the learning rate, \(r\) is the reward, \(s'\) is the next state, and \(a'\) is the next action. The reward function is formally defined as:$$\begin{aligned} R(s,a) = {\left\{ \begin{array}{ll} +1, & \text {if the pseudo-label matches the CoAtNet model's predicted class}\\ -1, & \text {otherwise} \end{array}\right. } \end{aligned}$$This reward structure guides the agent to generate pseudo-labels consistent with high-confidence model predictions. Unlike rewards tied to global metrics, this per-sample feedback ensures the agent prioritizes data integrity.
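A single tabular Q-learning step under this MDP, using the paper's hyperparameters (\(\alpha = 0.1\), \(\gamma = 0.95\)) and the ±1 reward, can be sketched as follows. The dictionary-backed Q-table and the string state keys are illustrative assumptions, not the published implementation:

```python
ALPHA, GAMMA = 0.1, 0.95  # learning rate and discount factor from the paper

def reward(pseudo_label, model_prediction):
    """+1 if the agent's pseudo-label matches the CoAtNet prediction, -1 otherwise."""
    return 1.0 if pseudo_label == model_prediction else -1.0

def q_update(Q, s, a, r, s_next, n_actions=2):
    """One tabular step: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in range(n_actions))
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + ALPHA * (r + GAMMA * best_next - old)
    return Q[(s, a)]

# toy episode over two states (stand-ins for flattened feature vectors)
Q = {}
q_update(Q, s="img0", a=1, r=reward(1, 1), s_next="img1")  # matching label: +1
q_update(Q, s="img1", a=0, r=reward(0, 1), s_next="img2")  # mismatching label: -1
print(round(Q[("img0", 1)], 3), round(Q[("img1", 0)], 3))  # → 0.1 -0.1
```

In practice the state would be the image's flattened feature vector, and the episode would walk the top-k uncertain samples in order, as described above.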
The goal is to train an RL agent that:

1. Reduces noise in the pseudo-labeled dataset,
2. Maintains consistency with model confidence, and
3. Selects the most reliable and informative samples.

By encouraging high-quality pseudo-labeling, the reward function indirectly enhances the CoAtNet classifier’s accuracy and stability as it is retrained on a progressively cleaner dataset. After pseudo-labeling, the new samples are incorporated into the labeled pool:$$\begin{aligned} X_{\text {labeled}} \leftarrow X_{\text {labeled}} \cup X_{\text {pseudo}}, \quad y_{\text {labeled}} \leftarrow y_{\text {labeled}} \cup y_{\text {pseudo}} \end{aligned}$$At the end of each iteration, the CoAtNet model is retrained on the expanded dataset. This loop continues for up to 120 iterations, ensuring convergence of both the Q-values in the RL component and the overall model performance. Unlike standard active learning, which only selects uncertain samples, this framework allows the RL agent to learn an optimal labeling policy, reducing pseudo-label noise and improving the quality of the expanded training set. In summary, Stage 2 integrates uncertainty-based active learning with reinforcement learning under a formally defined MDP. The framework is characterized by well-defined states, actions, rewards, and Q-learning updates. Hyperparameters are explicitly set, including a batch size of 200, a Q-learning rate \(\alpha = 0.1\), a discount factor \(\gamma = 0.95\), and an exploration rate \(\epsilon = 0.1\). Through strategic interaction, the agent effectively identifies the most informative samples, leading to significant improvements in model accuracy and providing a clear advantage over conventional active learning, reinforcement learning alone, or random sampling approaches.

Stage 3-a feature extraction
Feature extraction from CT lung cancer images is a critical process; it is applied to the full dataset of 30,020 images.
Below, we discuss each of these feature extraction categories:

Texture features21: GLCM and LBP. Local Binary Patterns (LBP) yield 25 distinct features. The Gray-Level Co-occurrence Matrix (GLCM) contributes a further 15 features, including contrast, dissimilarity, homogeneity, energy, and correlation, each calculated for three different directions.

Shape features22: provide geometric characteristics of objects in CT images, including area, perimeter, and compactness.

Intensity-based features22: five features are calculated, namely mean intensity, standard deviation, median, skewness, and kurtosis.

Deep features with attention fusion: channel attention is first computed by applying global average pooling across the feature maps, followed by a two-layer Multi-Layer Perceptron (MLP) and a sigmoid activation to generate channel-wise weights. Spatial attention is then calculated by concatenating the average-pooled and max-pooled features along the channel dimension, passing them through a 7\(\times\)7 convolution, and applying a sigmoid activation. The resulting attention map is applied multiplicatively to the feature maps. Channel and spatial attention are applied sequentially, following the standard procedure described in the referenced method23.

Combined features (traditional + deep): all feature vectors (texture, shape, intensity, and CNN) are standardized to have zero mean and unit variance to ensure they are on comparable scales. The CNN features are first passed through an attention weighting mechanism, where each feature dimension is multiplied by a learned weight normalized via a sigmoid function. This approach emphasizes the most useful deep features and reduces the impact of irrelevant ones. Since the attention weights range from 0 to 1, no CNN feature can exceed the dynamic range of the standardized handcrafted features. This stops any single feature from dominating and keeps the fusion balanced.
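A minimal sketch of this standardization and sigmoid attention weighting, assuming per-image feature vectors of the stated sizes (40 texture, 3 shape, 5 intensity, 128 CNN) and random placeholder values in place of learned weights and real features:

```python
import numpy as np

rng = np.random.default_rng(0)

def standardize(x):
    """Zero-mean, unit-variance scaling per feature dimension."""
    return (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-8)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n = 10  # images in a toy batch
texture = rng.normal(size=(n, 40))
shape = rng.normal(size=(n, 3))
intensity = rng.normal(size=(n, 5))
cnn = rng.normal(size=(n, 128))
attn_logits = rng.normal(size=128)  # learned during training; random here for illustration

# attention weights lie in (0, 1), so weighted CNN features cannot exceed
# the dynamic range of the standardized handcrafted features
cnn_weighted = standardize(cnn) * sigmoid(attn_logits)
fused = np.concatenate(
    [standardize(texture), standardize(shape), standardize(intensity), cnn_weighted], axis=1
)
print(fused.shape)  # → (10, 176)
```

The concatenation yields the 176-dimensional early-fusion vector (40 + 3 + 5 + 128) described in the next paragraph.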
After applying the attention weights, we combine the CNN features with the standardized traditional features to create a single 176-dimensional feature vector for each image: 40 (texture) + 3 (shape) + 5 (intensity) + 128 (CNN) = 176 features. This combination reflects an early fusion method, where we merge all feature types before classification instead of combining them at the decision level. This method guarantees a balanced representation: the attention mechanism highlights important CNN features without allowing them to overpower others, while standardization ensures that all feature types play a similar role in the classification process.

Stage 3-b normalization of extracted features
Normalization is a crucial preprocessing step when dealing with diverse types of features in machine learning, bringing texture, shape, and intensity-based features to a consistent scale.

Stage 3-c feature selection
This paper focuses on the use of feature selection strategies to increase the performance of machine learning. Three different feature selection approaches were used to sort and prioritize a dataset’s most pertinent attributes by assigning a score to each attribute. Attributes with higher relevance ratings are retained, whereas those with lower significance are eliminated. The methods evaluated for their effectiveness are forest feature (FF), correlation-based feature selection (CFS), and recursive feature elimination (RFE).

Stage 4-a classification
To ensure a rigorous and unbiased evaluation, all dataset splits were performed strictly at the patient level rather than the image level. The dataset consists of 790 patients, each contributing approximately 38 CT images, for a total of 30,020 images. Two patient-level splits were performed: the first for ARL evaluation using only the initially labeled subset (160 patients), and the second for final ML/DL classification after all images were labeled (all 790 patients).
For the Active Reinforcement Learning (ARL) framework, 160 patients (6,080 images) were assigned to the labeled set, and the unlabeled pool comprises 630 patients (23,940 images). Subsequently, the labeled set was divided into training (70%: 112 patients, 4,256 images), validation (15%: 24 patients, 912 images), and testing (15%: 24 patients, 912 images). Furthermore, to make sure that the test set images were never used in labeling, training, or validation, a final held-out test set was carefully separated from all ARL iterations and the initial model training. This prevents any data leakage and guarantees that the model is evaluated on completely unseen patients, providing a realistic measure of its generalization performance. Three approaches were implemented as follows.

In the first approach, machine learning models classify cancerous versus normal cases after features are extracted from the CT input. Various classification techniques are employed, including XGBoost, RF, Bayesian network, and DT methods, applied to the extracted features, including the attention fusion features.

In the second approach, various deep learning architectures for feature extraction from the input images are assessed, namely a CNN with attention fusion, CoAtNet, and a Simple CNN.

Simple CNN: includes three convolutional layers (32, 64, and 128 filters of size 3\(\times\)3), each followed by max pooling, and a global average pooling layer that generates a 128-dimensional feature vector. There are 92,672 total trainable parameters.

CoAtNet: identifies local and global features by combining convolutional and attention mechanisms. To generate a 128-dimensional feature vector, the network employs three convolutional layers with 128 filters each (3\(\times\)3), interspersed with max pooling and dropout layers, followed by global average pooling.
The total number of trainable parameters is 296,448.

Simple CNN with attention fusion: this improves the Simple CNN by adding an attention fusion mechanism. The input image’s features are extracted by two CNN branches operating in parallel. These features are then combined using channel attention, which highlights the most informative feature channels, and spatial attention, which highlights significant spatial regions. The final output is produced by pooling the fused feature maps with global average pooling and passing them through fully connected layers (Dense 128 \(\rightarrow\) Dropout \(\rightarrow\) Dense 64 \(\rightarrow\) Dense 1).

Table 3 summarizes the specific hyperparameters for each model, including the number of layers, filter sizes, activation functions, dropout rates, optimizer, number of epochs, classifier type, loss function, and attention mechanism. This ensures reproducibility and helps clarify the technical implementation of the attention fusion mechanism. To identify the most suitable architecture for the study’s overall strategy, all models were trained under the same conditions and their performances were compared.

In the last approach, the best-performing deep learning model is combined with the traditional features.
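The quoted trainable-parameter counts for the convolutional backbones can be reproduced arithmetically, assuming single-channel inputs, 3\(\times\)3 kernels, and bias terms (max pooling, dropout, and global average pooling layers contribute no trainable parameters):

```python
def conv2d_params(in_ch, out_ch, k=3):
    """Weights plus biases of a k x k convolution layer: (k*k*in_ch + 1) * out_ch."""
    return (k * k * in_ch + 1) * out_ch

# Simple CNN: three 3x3 conv layers with 32, 64, and 128 filters on a 1-channel input
simple_cnn = conv2d_params(1, 32) + conv2d_params(32, 64) + conv2d_params(64, 128)

# CoAtNet variant described above: three 3x3 conv layers with 128 filters each
coatnet = conv2d_params(1, 128) + conv2d_params(128, 128) + conv2d_params(128, 128)

print(simple_cnn, coatnet)  # → 92672 296448
```

Under these assumptions, the sums match the 92,672 and 296,448 parameters reported for the Simple CNN and CoAtNet feature extractors, respectively.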
The performance of these techniques is gauged using the metrics in section “Stage 4-b performance metrics”.

Table 3 Hyper-parameters of the deep learning architectures.

Stage 4-b performance metrics
Different performance measures are used to assess the effectiveness of the machine and deep learning models, derived from true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN).

Accuracy (ACC): measures how accurate the model’s predictions are overall.$$\begin{aligned} ACC = \frac{TP + TN}{TP + TN + FP + FN} \end{aligned}$$(1)

Precision: measures the proportion of true positive predictions among all positive predictions.$$\begin{aligned} Precision = \frac{TP}{TP + FP} \end{aligned}$$(2)

Recall: measures the proportion of true positive predictions among all actual positives.$$\begin{aligned} Recall = \frac{TP}{TP + FN} \end{aligned}$$(3)

F1-Score: provides a balance between precision and recall.$$\begin{aligned} F1Score = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall} \end{aligned}$$(4)

AUC-ROC: indicates model performance, where higher values indicate better performance.

Stage 5 explainable AI with deep learning from a radiologist viewpoint
In this study, the model was evaluated by three board-certified radiologists using a total of 912 CT images. A custom Python tool displayed each scan alongside its Grad-CAM heatmap, allowing the radiologists to review both the raw image and the model’s highlighted regions. Their feedback concentrated on three key areas:

- whether the marked regions corresponded to clinically significant tumor areas;
- whether any concerning regions were missed;
- whether the model highlighted irrelevant features.

To evaluate the effect of explainable AI (XAI) on diagnostic accuracy, each radiologist assessed the scans twice: initially without Grad-CAM and then with the heatmaps displayed.
The incorporation of Grad-CAM enhanced radiologist accuracy from 96.7% to 99.9% and decreased reading time by approximately 25%, demonstrating notable improvements in both efficiency and confidence.Grad-CAM was also evaluated quantitatively. Across the same 912 CT images, the model’s attention maps achieved a mean IoU of 0.72 ± 0.08 against radiologist-annotated lesion masks, and the Pointing Game accuracy reached 0.91. Only a small number of cases (about 3–5, or 0.3–0.5%) involved lesions that were missed or not sufficiently highlighted, typically very small \((
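The localization metrics used in this quantitative evaluation, IoU against annotated lesion masks and the Pointing Game, can be sketched on toy binary masks as follows (array sizes and mask placement are illustrative):

```python
import numpy as np

def iou(pred_mask, gt_mask):
    """Intersection over union of two boolean masks."""
    inter = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return inter / union if union else 0.0

def pointing_game_hit(heatmap, gt_mask):
    """Hit if the heatmap's maximum activation falls inside the annotated lesion mask."""
    idx = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return bool(gt_mask[idx])

gt = np.zeros((8, 8), dtype=bool); gt[2:5, 2:5] = True      # 3x3 annotated lesion
pred = np.zeros((8, 8), dtype=bool); pred[3:6, 3:6] = True  # shifted model attention mask
heat = np.zeros((8, 8)); heat[3, 3] = 1.0                   # heatmap peak inside the lesion

print(round(iou(pred, gt), 3), pointing_game_hit(heat, gt))  # → 0.286 True
```

Pointing Game accuracy over a set of images is then the fraction of hits; a mean IoU would average the per-image IoU against the radiologist-annotated masks.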