Introduction

Immune checkpoint blockade (ICB) therapies have transformed cancer treatment by enabling the immune system to recognize and attack tumor cells1. Agents targeting immune checkpoints such as PD-1, PD-L1, and CTLA-4 have achieved remarkable success in treating various malignancies, including melanoma2, non-small cell lung cancer3, and renal cell carcinoma4. Nonetheless, clinical responses to ICB remain highly variable, with only a fraction of patients experiencing substantial benefits5,6. Predicting which patients will respond to ICB remains a central challenge in oncology, underscoring the need for reliable predictive biomarkers and models.

Bulk RNA sequencing (RNA-seq) has played a key role in identifying gene expression signatures that distinguish responders from non-responders to ICB therapies. For example, the Tumor Inflammation Signature (TIS) predicts responses to pembrolizumab in multiple solid tumors7, while the Cytolytic Activity Score (CYT), based on the expression of specific immune-related genes, correlates with improved survival in colorectal cancer8. Additionally, frameworks like Tumor Immune Dysfunction and Exclusion (TIDE) evaluate T-cell activity from RNA-seq data, aiding in patient stratification for ICB therapies9. These advances highlight the promise of bulk RNA-seq for generating biomarkers that advance personalized cancer treatment.

Despite these advances, predictive biomarkers often lose effectiveness when extended to diverse patient populations. Differences in demographics, tumor types, sequencing platforms, and data processing methods contribute to cohort heterogeneity, posing significant challenges to the generalizability of predictive models. For instance, PD-L1 expression was predictive in only 28.9% of cases, influenced by tissue types, cutoffs, and cell-based assessments10. Similarly, tumor mutational burden (TMB) shows inconsistent predictive power across cancer types, being effective in melanoma and non-small cell lung cancer but less so in colorectal cancer11. Moreover, Bareche et al. demonstrated that just over half of curated gene expression signatures (22 out of 37; 59%) maintained significant associations with pan-cancer immunotherapy response12, emphasizing the need for careful data integration, refined modeling strategies, and rigorous validation.

Machine learning (ML) and deep learning (DL) methods have emerged as powerful tools for improving ICB response prediction. Leveraging diverse data sources–including transcriptomics13, clinical records14, and imaging15–these methods have identified novel biomarkers and achieved enhanced generalizability16. For example, ML models can incorporate features such as tumor mutational burden (TMB)17, microsatellite instability (MSI)18, neoantigen profiles19, immune cell infiltration20, and gut microbiota composition21. Integrative approaches combining multiple data modalities address the limitations of single biomarkers, ultimately offering a more comprehensive understanding of patient responses.

However, ML- and DL-based approaches for ICB response prediction still face challenges related to cohort heterogeneity, overfitting, and the reliance on large, standardized training and validation datasets. Harmonizing data across cohorts and validating models in independent clinical trials are critical but remain difficult to achieve. Recent advances in self-supervised pretraining and transfer learning offer promising strategies to address these challenges.
Self-supervised pretraining allows models to learn generalized patterns from unlabeled data, while transfer learning adapts these representations to specific tasks, improving predictive accuracy and bolstering robustness against clinical and genomic variability.

In this study, we present IC2Bert, a novel computational framework designed to improve ICB response prediction across heterogeneous RNA-seq cohorts. By employing masked gene expression pretraining, IC2Bert captures nuanced, dataset-agnostic gene relationships through the reconstruction of randomly masked data. To further enhance generalizability and specificity, the model incorporates domain-specific supervised transfer learning, fine-tuning the pretrained representations on labeled, cohort-specific datasets while preserving broad, generalized knowledge. Using a Leave-One-Dataset-Out Cross-Validation (LODOCV) strategy–simulating real-world scenarios with unseen data–IC2Bert achieved substantially higher predictive accuracy and robustness than existing approaches, effectively addressing the challenges posed by cohort heterogeneity.

By combining masked gene expression pretraining with transfer learning, IC2Bert represents a significant advancement in predictive modeling for immunotherapy. It provides a promising tool for accurately predicting patient responses to ICB therapy using only bulk RNA-seq profiles. This capability may also facilitate the integration of models from other modalities, forming comprehensive, multimodal frameworks that further enhance personalized treatment planning and improve clinical outcomes. To support ongoing research and development, we have made IC2Bert and its source code publicly available on GitHub: https://github.com/data2intelligence/ic2bert.

Results

Overall training and evaluation strategy

We employed an iterative Leave-One-Dataset-Out Cross-Validation (LODOCV) framework (Fig. 1) to thoroughly evaluate IC2Bert's capacity for generalization across diverse datasets. In each iteration, one dataset served as the holdout set, while the remaining 12 were used for pretraining. Although the holdout dataset remained entirely excluded during pretraining, we subsequently applied transfer learning on its training subset to tailor the model to the target domain before evaluating performance on its test subset. This procedure was repeated for all 13 datasets, ensuring that each one served as the holdout dataset in turn.

Fig. 1 (A) Training and Validation (Iterative LODOCV) Scheme for IC2Bert. (B) Model Architecture of IC2Bert.

Each dataset was split into a training (80%) and test (20%) subset using stratified sampling. These internal splits serve distinct roles. During pretraining, IC2Bert learned generalized gene expression patterns through masked language modeling, using only the training subsets of the 12 pretraining datasets. The test subsets of these 12 datasets were completely held out. Most critically, the test subset of the holdout dataset was never used at any stage of the process–not during pretraining, not during transfer learning, and not for hyperparameter selection–and was reserved strictly for final performance evaluation. Following pretraining, we fine-tuned IC2Bert on the training subset of the holdout dataset and assessed its performance on the corresponding test subset. Each scenario was repeated \(N=10\) times with different training/test splits.
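This evaluation loop can be sketched as follows. The code is a minimal illustration, not IC2Bert's training code: the dataset dictionary, pretrain_fn, finetune_fn, and the sklearn-style predict_proba interface are hypothetical stand-ins, and only the splitting and data-flow logic mirrors the protocol described above.

```python
from typing import Callable, Dict, Tuple

import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def lodocv(datasets: Dict[str, Tuple[np.ndarray, np.ndarray]],
           pretrain_fn: Callable, finetune_fn: Callable,
           n_repeats: int = 10, seed: int = 0) -> Dict[str, list]:
    """datasets maps cohort name -> (expression matrix, binary response labels)."""
    scores = {name: [] for name in datasets}
    for holdout in datasets:                      # each cohort is held out once
        for rep in range(n_repeats):              # N = 10 different 80/20 splits
            splits = {
                name: train_test_split(x, y, test_size=0.2,
                                       stratify=y, random_state=seed + rep)
                for name, (x, y) in datasets.items()
            }
            # Masked pretraining sees only the training subsets of the 12
            # non-holdout cohorts; no labels and no holdout data are used.
            pretrain_x = [splits[name][0] for name in datasets if name != holdout]
            model = pretrain_fn(pretrain_x)
            # Transfer learning uses the holdout cohort's training subset only;
            # its test subset is reserved strictly for evaluation.
            x_tr, x_te, y_tr, y_te = splits[holdout]
            model = finetune_fn(model, x_tr, y_tr)
            scores[holdout].append(
                roc_auc_score(y_te, model.predict_proba(x_te)[:, 1]))
    return scores
```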
We quantified classification performance using the Area Under the Receiver Operating Characteristic Curve (AUROC). To further clarify the influence of transfer learning, we compared zero-shot predictions (no fine-tuning on the target dataset) with transfer learning results. We also evaluated an ensemble approach, in which predictions from 12 fine-tuned models (each pretrained on different datasets and fine-tuned on the target's training subset) were combined to generate predictions on the holdout test subset. In addition, we conducted sample size analyses (discussed later) to determine how varying the number of training samples affected transfer learning outcomes. For completeness, we performed transfer learning on the 12 training datasets themselves, examining how effectively the model adapted to domains included in the pretraining phase (Table 1).

Table 1 Patient statistics of ICB response cohorts.

Ablation study for the number of bins (\(N_{bins}\)) parameter of the tokenizer

We conducted an ablation study to determine how varying the number of bins (\(N_{bins}\)) in the gene expression tokenizer affects both masked token reconstruction accuracy during pretraining and downstream classification performance. Figure 2 illustrates the impact of different bin sizes on pretraining masked token reconstruction accuracy. With four bins, the model maintained approximately 95% accuracy throughout training, whereas at 128 bins, accuracy remained around 20% by epoch 600. This outcome is intuitive–fewer bins simplify the prediction task, reducing representational complexity and making it easier for the model to accurately recover masked tokens.

Figure 3 compares AUROC performance for transfer learning and zero-shot prediction at various bin sizes. Panel (A) displays results for datasets included in the pretraining phase, while Panel (B) shows outcomes for holdout datasets. Models pretrained with fewer bins generally exhibited better transfer learning performance. In contrast, zero-shot predictions were poor regardless of bin count, though a slight improvement was noted with larger bin sizes. The performance differences between pretraining and holdout datasets were negligible. As shown in Table 2, although \(N_{bins}=4\) achieved the highest mean performance (0.781), results with 8 and 16 bins were comparable (0.775 and 0.774, respectively).

Figure 4 presents a dataset-level comparison of AUROC for both transfer learning and zero-shot prediction. Panel (A) reports performance on pretraining datasets, while Panel (B) focuses on holdout datasets. As summarized in Table 3, transfer learning yielded AUROCs ranging from 0.668 to 0.943 (mean 0.781), whereas zero-shot predictions ranged from 0.420 to 0.585 (mean 0.518). Ensemble predictions–derived from 12 models fine-tuned on each pretraining dataset–improved slightly over zero-shot predictions, with AUROCs ranging from 0.493 to 0.807 (mean 0.593). Since an MLP model trained and tested on the same dataset produced AUROCs of 0.500 to 1.000 (mean 0.611), the transfer learning approach on the target dataset alone showed an overall performance enhancement. Collectively, these results confirm that pretraining, followed by targeted domain-specific transfer learning, significantly improves predictive accuracy and enables the model to better adapt to the unique characteristics of each dataset.

Fig. 2 Masked token reconstruction accuracy (validation) by number of bins (\(N_{bins}\)).
Fig. 3 Transfer learning versus zero-shot prediction AUROC by \(N_{bins}\).

Table 2 Dataset-specific ICB response prediction AUC according to \(N_{bins}\) (mean ± std).

Fig. 4 TL versus zero-shot prediction AUC by ICB datasets (\(N_{bins}=4\)).

Table 3 Comparison between single-dataset train/test and the IC2Bert strategy (mean ± std).

Performance comparison with other strategies

We evaluated two alternative strategies for predicting ICB response from bulk RNA-seq datasets and compared their performance to that of IC2Bert on the same test sets. The first approach employed gene set–based predictors derived from previously published literature12. Such methods typically use averages of expression values from predefined gene markers or apply sample-dependent averaging techniques such as single-sample gene set enrichment analysis (ssGSEA)22. We tested 34 curated gene sets (Supplementary Table S1) from a recent publication, identifying two signatures–Cytotoxic T cell and TGFB–that performed best among the curated predictors, and we also evaluated a composite score combining both signatures. As shown in Table 4, IC2Bert's pretraining and transfer-learning strategy substantially outperformed these gene set–based methods, achieving roughly a 28% improvement across datasets. Notably, on the small-sample Miao2018 cohort, IC2Bert reached an AUROC of \(0.943 \pm 0.100\), versus \(0.450 \pm 0.191\) for the top gene set predictor.

Next, we compared IC2Bert against three baseline neural architectures–DNN, 1D CNN, and the Scale-Invariant Neural network Classifier (SINC)23–as well as six state-of-the-art domain-generalization neural network (DGNN) models: Domain Adversarial Neural Network (DANN)24, Invariant Risk Minimization (IRM)25, MetaReg26, MMD-AAE27, CrossGrad28, and LFR29. DGNNs are designed to learn domain-invariant features and predict in a new domain without additional fine-tuning. As summarized in Table 5, the baseline models delivered only moderate cross-cohort performance (e.g., on Miao2018: DNN \(0.521 \pm 0.228\), CNN \(0.450 \pm 0.191\), SINC \(0.500 \pm 0.197\)), and the DGNN approaches offered only marginal gains over these baselines (e.g., DANN \(0.504 \pm 0.233\) on Miao2018) while still struggling on small-sample or distribution-shift cohorts (Table 6). In contrast, IC2Bert's domain-specific fine-tuning consistently mitigated performance degradation under these challenging conditions–achieving \(0.943 \pm 0.100\) on Miao2018 and delivering superior AUROCs across every test set–underscoring the critical importance of explicit target-domain adaptation for robust and generalizable ICB response prediction.

Table 4 Comparison with gene set–based predictors (mean ± std).

Table 5 Comparison of baseline architectures and the IC2Bert strategy (mean ± std).

Table 6 Comparison with domain generalization methods and the IC2Bert strategy (mean ± std).

We also conducted experiments to evaluate whether incorporating pretrained gene embeddings, derived from a large compendium of single-cell sequencing datasets, could enhance the transfer learning performance of the IC2Bert model. As shown in Figure S6 and Table S4, replacing IC2Bert's original gene embedding module with scGPT's pretrained embeddings led to a slight improvement in overall performance, although the gain was marginal.
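Conceptually, this substitution replaces the learnable gene-identity embedding table with frozen pretrained vectors followed by a trainable projection into the model width. The sketch below is purely illustrative: the array names, dimensions, and random placeholders are assumptions and do not reproduce IC2Bert's or scGPT's actual code.

```python
import jax
import jax.numpy as jnp

n_genes, d_pretrained, d_model = 983, 512, 128

# Stand-in for loaded pretrained gene embeddings (kept frozen, i.e., never updated).
pretrained = jax.random.normal(jax.random.PRNGKey(0), (n_genes, d_pretrained))

def init_params(key):
    # Only the projection into the 128-dimensional model space is trainable.
    return {"proj": jax.random.normal(key, (d_pretrained, d_model)) / jnp.sqrt(d_pretrained)}

def gene_identity_embeddings(params, gene_ids):
    """Look up frozen pretrained vectors and project them into the model space;
    these replace the original learnable gene-identity embeddings."""
    return pretrained[gene_ids] @ params["proj"]   # shape (len(gene_ids), d_model)

emb = gene_identity_embeddings(init_params(jax.random.PRNGKey(1)), jnp.arange(n_genes))
```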
This modest improvement may be attributed to architectural and input differences: IC2Bert operates on a preselected set of features, whereas the scGPT model is trained on genome-wide expression data. Thus, the impact of the embedding replacement may be limited in this context. These results suggest that pretrained gene embeddings could offer greater benefits when paired with an appropriate model architecture and under suitable data conditions.

Effect of training sample size on IC2Bert's transfer learning

We investigated how the number of balanced training samples influenced IC2Bert's fine-tuning performance. Figure 5 illustrates the relationship between AUC and training sample size for two representative cohorts, Miao2018 (Fig. 5A) and Atezo+Bev_McDermott2018 (Fig. 5B). As the number of training samples increased, performance improved substantially, even when starting from a minimal set (\(N_{train}=2\)). Notably, IC2Bert outperformed zero-shot predictions (\(N_{train}=0\)), demonstrating that it retained substantial predictive capability from its pretrained representations. In rare cases, zero-shot predictions approached the performance of models trained with one or a few samples, suggesting intrinsic robustness. Additional sample size analyses are provided in the supplementary figures (Figures S1 to S4), which revealed consistent trends across all cohort datasets. Overall, these results highlight IC2Bert's scalability and adaptability. With additional training samples, the model more effectively fine-tunes its representations to the target dataset, steadily boosting AUC and reinforcing its potential utility in diverse clinical scenarios.

Fig. 5 Effect of training sample size on performance. (A) Miao2018, (B) Atezo+Bev_McDermott2018 dataset.

Feature importance analysis

To gain insight into the genes that most influenced IC2Bert's predictions, we integrated two complementary methods of feature importance assessment: attention-based and permutation-based analyses. The attention-based approach reveals which genes the model inherently focuses on, while the permutation-based method quantifies the impact of each gene on predictive performance. Combining these approaches helps offset the biases each method may introduce, resulting in a more balanced and reliable measure of gene importance. Table 7 presents the top 10 genes with the highest combined importance scores under a 4-bin discretization scheme, which achieved the best mean AUC in the downstream ICB response prediction task. Each gene's ranking is accompanied by its frequency of appearance among the top 100, 50, 30, and 10 most important features across the 13 holdout datasets, as well as the normalized average importance score. Several of these genes, such as FCRL5, CD19, POU2AF1, and CD79A, are closely associated with B-cell development and function, while others–including CKS1B, TOP2A, PRC1, RRM2, MKI67, and NUSAP1–are well-known markers of cell cycle progression and proliferation.

Table 7 Top 10 important features across 13 holdout datasets (\(N_{bins}=4\)).

Discussion

Advancing predictive robustness through masked pretraining and fine-tuning

In this study, we introduced IC2Bert, a BERT-like transformer model that employs masked gene expression pretraining and domain-specific supervised fine-tuning to predict immune checkpoint blockade (ICB) responses from heterogeneous RNA-seq datasets.
Using a broad range of ICB response cohorts and a Leave-One-Dataset-Out Cross-Validation (LODOCV) framework, IC2Bert demonstrated strong generalizability. Achieving a mean AUROC of 0.781 ± 0.135, it significantly outperformed DNN baselines (\(\sim\)0.62 AUROC)30, the Scale-Invariant Neural network Classifier (SINC) (\(\sim\)0.62 AUROC)23, gene set–based predictors (\(\sim\)0.64 AUROC), and advanced domain generalization approaches (\(\sim\)0.62 AUROC), as shown in Figure S7. This superior performance primarily arises from combining masked pretraining with domain-specific fine-tuning, enabling the model to learn intricate, dataset-agnostic gene relationships without a heavy reliance on labeled data.

To evaluate whether large-cohort, foundation model-style pretraining improves generalization, we pretrained IC2Bert on The Cancer Genome Atlas (TCGA) dataset and tested its zero-shot prediction performance on ICB cohorts. The TCGA-pretrained model achieved an average AUROC of 0.512 ± 0.192, slightly lower than the model pretrained directly on ICB datasets (0.518 ± 0.200). While transfer learning from the TCGA-pretrained model yielded performance comparable to ICB-based pretraining, these findings indicate that foundational pretraining alone does not guarantee improved predictive power in the context of specific ICB response cohorts. Thus, domain-specific fine-tuning remains essential.

As shown in Figure S5 and Table S3, supplementing masked pretraining with a supervised loss can enhance zero-shot prediction but may reduce subsequent fine-tuning adaptability–highlighting the advantages of the unsupervised masked pretraining approach. A key insight from our ablation study was that using fewer bins (e.g., four) for tokenization improved downstream transfer learning performance. Furthermore, the need for domain-specific fine-tuning was underscored by the limitations of zero-shot predictions, illustrating the importance of adapting learned representations to individual patient cohorts. Taken together, these findings emphasize how IC2Bert's methodology–drawing on principles from foundational bulk RNA-seq31 and scRNA-seq32,33 models–captures complex gene expression patterns while preserving the flexibility required to accommodate the unique characteristics of diverse patient populations.

Adaptability of IC2Bert in transfer learning scenarios

IC2Bert's adaptability is particularly evident when examining how varying the number of fine-tuning samples affects model performance (Fig. 5). Even with a minimal number of training samples, IC2Bert rapidly approached performance levels comparable to those achieved with substantially larger sample sizes. This suggests that the pretrained representations possess strong predictive power and require only a small amount of domain-specific data to effectively calibrate the model's outputs to the target domain. As additional samples were introduced, performance gains became modest and stabilized, further indicating that IC2Bert efficiently assimilates domain-specific information without losing its core understanding of gene expression patterns. Such scalability is critical in real-world clinical settings, where data availability can be both limited and heterogeneous. By maintaining strong performance across a broad range of sample sizes, IC2Bert emerges as a highly flexible tool, well-suited for incremental updates as new data become available.
This robustness positions the model for practical use in dynamic clinical environments, accommodating varying patient populations, data quality, and cohort sizes, and ultimately improving its utility in personalized treatment planning.

Biological insights and clinical implications

The feature importance analysis (Table 7) provides a detailed view of the genes most influential in shaping ICB therapy outcomes. Several of the top-ranked genes–MS4A1 (CD20), FCRL5, and CD79A–are closely associated with B-cell functions34,35,36, suggesting that variations in humoral and antigen-presenting activity may critically modulate anti-tumor immunity. At the same time, proliferative markers such as MKI67, TOP2A, PRC1, and RRM2 underscore the importance of cell-cycle dynamics not only in tumor growth but also potentially within rapidly expanding immune subsets that contribute to treatment efficacy37,38,39,40. Notably, CDH1–an adhesion molecule governing epithelial integrity–appears prominently, further highlighting the interplay between tumor architecture, immune infiltration, and therapeutic responsiveness41. The appearance of POU2AF1, a transcriptional co-activator in B-cell development, suggests that the differentiation state of lymphoid populations could also influence patient outcomes42. By integrating these immunological, proliferative, and structural hallmarks, IC2Bert offers a multifaceted portrait of the molecular processes underlying ICB responsiveness.

To validate whether these gene-level signatures reflect differences in immune-cell composition, we applied CODEFACS43 to deconvolute bulk RNA-seq data into estimated cell fractions (Figure S8). Consistent with the model's feature ranking, responders exhibited significantly higher proportions of canonical effector populations–CD8 T cells, CD4 T cells, and M1-polarized macrophages–than non-responders. Interestingly, however, CODEFACS did not detect a statistically significant difference in overall B-cell abundance between the two groups. This apparent discrepancy may arise because IC2Bert captures B-cell functional states or specific subpopulations–such as antibody-secreting or tertiary lymphoid structure-associated B cells–that bulk deconvolution cannot distinguish. It may also reflect context-dependent roles of B cells, where spatial organization and activation status, rather than total numbers, drive therapeutic benefit.

From a clinical standpoint, these insights can refine patient stratification and inform treatment decisions. Oncologists may leverage the combined gene-level and deconvolution metrics to identify individuals most likely to benefit from ICB therapy, guiding more personalized and effective strategies. Ultimately, by linking complex genomic patterns to actionable cellular phenotypes, IC2Bert advances the goal of precision oncology.

Limitations, future directions, and multimodal integration

While our results are promising, certain limitations persist. Although the aggregated dataset spans 13 cohorts, some cancer types remain underrepresented, and increasing both the size and heterogeneity of training cohorts will be essential to bolster IC2Bert's generalizability. Clinically, bulk RNA-seq remains attractive thanks to well-established workflows, relative cost-effectiveness, and the ability to mine large retrospective collections for model development and validation. However, by averaging expression across all cells, bulk profiles obscure rare or spatially confined populations and eliminate information about cell–cell interactions.
Deconvolution approaches such as CODEFACS can partially reconstruct cell-type abundances but remain sensitive to tumor purity, stromal admixture, and the choice of reference signatures, which may limit their precision in individual patient settings.

Single-cell and spatial transcriptomic technologies address many of these shortcomings by directly profiling cellular heterogeneity, cell states, and tissue organization. These modalities can uncover key microanatomical features–such as exhausted T-cell niches or tertiary lymphoid structures–that are invisible to bulk assays yet highly relevant to ICB efficacy. Nevertheless, widespread clinical adoption of single-cell and spatial methods is currently hindered by higher per-sample costs, complex tissue requirements, specialized library preparations, and nonstandardized analytical pipelines. To reconcile breadth and depth, future studies should pursue multimodal integration, combining bulk RNA-seq with orthogonal data types–genomic mutations, proteomic signatures, or radiographic imaging–to derive richer, more robust predictors of therapeutic response.

Finally, interpretability remains an important challenge. Although IC2Bert identifies biologically plausible features, the underlying decision pathways are not fully transparent. Future work will focus on pretraining IC2Bert components on large single-cell or spatial atlases and applying advanced explainable-AI techniques to deconvolute latent model reasoning. By illuminating how specific cell states, spatial contexts, and molecular alterations drive predictions–and by uniting multiple data modalities–we can enhance both the mechanistic insight and the clinical utility of ICB response models, ultimately advancing more precise, personalized oncology interventions.

Conclusions

IC2Bert marks a notable advancement in immunotherapy response prediction by seamlessly integrating masked gene expression pretraining with domain-specific supervised fine-tuning. This approach effectively tackles the challenges posed by cohort heterogeneity and demonstrates robust, generalizable performance across diverse datasets. By relying solely on widely available bulk RNA-seq data, IC2Bert achieves high predictive accuracy and adaptability, making it a promising tool for personalized oncology. The publicly available IC2Bert model and source code set the stage for ongoing research and enhancement. Future directions include expanding the range of datasets, incorporating additional data modalities, and employing advanced interpretability techniques. We anticipate that these efforts will further elevate IC2Bert's clinical utility, ultimately guiding more precise therapy selection and improving patient outcomes in the rapidly evolving landscape of personalized cancer treatment.

Methods

Datasets

We employed 13 immune checkpoint blockade (ICB) response cohorts derived from 11 published studies, encompassing a total of 1,214 patients (31.8% responders). Metastatic melanoma was the most frequently studied cancer type, while other cancer types were represented either individually or in a single pan-cancer study. Most patients received anti-PD-1 or anti-PD-L1 monotherapy, with a smaller subset treated using combinations of PD-1/PD-L1 and CTLA-4 inhibitors or bevacizumab. The ICB+bevacizumab cohort was considered distinct from the ICB monotherapy cohorts, even if derived from the same study.
Table 1 provides detailed patient statistics for all 13 ICB response cohorts.

Feature selection

Due to the high dimensionality of gene expression datasets and the relatively small number of samples in each domain-specific dataset, reducing the feature space is often beneficial when employing machine learning (ML) or deep learning (DL) approaches for ICB response prediction. This reduction helps to decrease model complexity and training time. In this study, we curated 1,179 genes from 34 published gene sets associated with immune checkpoint blockade (ICB) response prediction, which have been identified as predictors of ICB response in certain datasets (Supplementary Table 1). Of these, 983 genes were consistently measured across all 13 ICB cohort datasets, and we used these 983 genes as input features for the IC2Bert model.

Model architecture

The IC2Bert model is an encoder-only transformer architecture specifically designed for processing gene expression data and tailored for immunotherapy response prediction. It begins by transforming continuous gene expression values into discretized tokens through a tokenization process, making them suitable for input into a transformer model. Each input token represents a particular gene and its corresponding expression bin, which are then processed through two primary embedding layers: an expression embedding layer that maps discretized expression tokens into 128-dimensional vectors and a gene embedding layer that maps 983 gene identifiers into 64-dimensional vectors. To align the gene embeddings with the expression embeddings, a gene projection layer projects the 64-dimensional gene embeddings to 128 dimensions using a linear transformation. The combined embeddings, capturing both biological context and discretized expression levels, are then fed into four self-attention blocks. Each block comprises a layer normalization, a multi-head self-attention mechanism with query, key, and value projections, and an output linear layer, followed by another layer normalization and a feed-forward network. Dropout and layer normalization are applied throughout for regularization and stability. After processing through the self-attention blocks, the model branches into two heads: a masked language modeling (MLM) head that predicts masked tokens by mapping the 128-dimensional representations to \(N_{bins} + 1\) classes, corresponding to the discretized expression bins plus a special mask token; and a simple classification head that predicts the binary immunotherapy response label by reducing the 128-dimensional pooled representation to a single logit. The mask token is a learned embedding that allows the model to learn contextual relationships during masked language modeling, where selected expression tokens are replaced with the mask token during pretraining and the model is trained to recover the original expression bin.

Gene expression tokenization

We used log-transformed expression values, \(log(TPM+1)\), as input for the tokenization process. To prepare the data for transformer-based modeling, continuous gene expression values were discretized into tokens using a quantile-based tokenizer. For each gene within a dataset, the tokenizer computed bin thresholds from the training subset and then applied the same thresholds to the corresponding test subset, ensuring consistent and unbiased discretization. The tokenizer partitions the range of expression values into a predefined number of bins (e.g., 4, 8, 16, 32, 64, or 128), assigning each expression value to its corresponding bin.
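A minimal illustration of this train-fitted, quantile-based binning is given below; it assumes a log(TPM+1) matrix of shape (samples, genes) and is not the BinnedExpressionTokenizer from the released code.

```python
import numpy as np

def fit_quantile_thresholds(train_expr: np.ndarray, n_bins: int = 4) -> np.ndarray:
    """Per-gene quantile bin edges computed from the training subset only.
    train_expr: (n_samples, n_genes) matrix of log(TPM+1) values."""
    interior = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]     # interior quantiles
    return np.quantile(train_expr, interior, axis=0)       # (n_bins - 1, n_genes)

def tokenize(expr: np.ndarray, thresholds: np.ndarray) -> np.ndarray:
    """Assign each expression value to an integer bin token in [0, n_bins - 1]."""
    tokens = np.empty(expr.shape, dtype=np.int32)
    for g in range(expr.shape[1]):                         # digitize gene by gene
        tokens[:, g] = np.digitize(expr[:, g], thresholds[:, g])
    return tokens

# Thresholds are fit on the training subset and reused unchanged on the test
# subset, so held-out samples never influence the discretization.
rng = np.random.default_rng(0)
x_train, x_test = rng.gamma(2.0, size=(80, 983)), rng.gamma(2.0, size=(20, 983))
thr = fit_quantile_thresholds(x_train, n_bins=4)
train_tokens, test_tokens = tokenize(x_train, thr), tokenize(x_test, thr)
```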
Each bin represents a discrete token that the model learns to interpret. This transformation enables the model to identify categorical patterns in gene activity while accommodating expression-level variability across samples. To examine the impact of bin granularity on predictive performance, we performed an ablation study across six bin sizes: 4, 8, 16, 32, 64, and 128. In addition, we adopted a hybrid approach in which each gene is assigned a unique learnable embedding vector to represent its identity, allowing the model to distinguish between genes independently of their expression values. These gene identity embeddings are projected into the same dimensional space as the expression embeddings and then combined via element-wise addition. This fused representation enables the model to jointly capture gene-specific and expression-level information in a unified input format.

Pretraining

Pretraining involves a masked language modeling (MLM) objective, where random tokens in the input are replaced with a special mask token or retained unchanged according to predefined probabilities. The model is then trained to recover the original values of these masked tokens using a cross-entropy loss function that focuses exclusively on the masked positions. This process enables the IC2Bert model to learn meaningful and context-aware representations of gene expression, capturing complex patterns and relationships in the underlying biological data.

To evaluate the effect of pretraining on different datasets, each pretraining run was initialized independently using random parameter initialization. No weights were shared or transferred across runs. Specifically, for each evaluation, the model was pretrained from scratch on the pretraining subsets of the 12 datasets (excluding the designated holdout) using the MLM objective. This design ensures that the performance reflects the contribution of each dataset combination, without the influence of cumulative training effects or prior knowledge from other runs.

Transfer learning

For downstream tasks, the pretrained IC2Bert model is fine-tuned on the training subset of the test domain. During fine-tuning, the model's parameters are adjusted to optimize binary cross-entropy or focal loss, depending on dataset-specific configurations. Hyperparameters such as batch size, learning rate, and dropout rate are dynamically adjusted based on dataset size and class imbalance.

To determine and refine optimal hyperparameters for each dataset, we employed a data-driven, context-sensitive approach informed by dataset size, class distribution, and cancer type. We began with baseline configurations grouped into small, medium, or large dataset tiers, then introduced adjustments in response to observed class imbalance, data heterogeneity, and domain-specific factors. For instance, smaller or heavily imbalanced datasets prompted the use of focal loss and extended patience values, whereas larger datasets benefited from strategies such as gradient centralization and mild regularization. Focal loss parameters were set dynamically, tailoring the gamma (\(\gamma\)) and alpha (\(\alpha\)) values to class ratios, and dataset-specific hyperparameters were tuned for batch size, learning rate, dropout, and related optimization controls.
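As an illustration of the imbalance-aware objective described above, the sketch below implements a standard binary focal loss, \(FL(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)\), in JAX; the exact parameterization and defaults used in IC2Bert's training code may differ.

```python
import jax
import jax.numpy as jnp

def binary_focal_loss(logits: jnp.ndarray, labels: jnp.ndarray,
                      gamma: float = 2.0, alpha: float = 0.25) -> jnp.ndarray:
    """Mean focal loss for binary labels in {0, 1} given raw logits.
    gamma down-weights easy examples; alpha re-weights the positive class."""
    p = jax.nn.sigmoid(logits)
    p_t = jnp.where(labels == 1, p, 1.0 - p)              # probability of the true class
    alpha_t = jnp.where(labels == 1, alpha, 1.0 - alpha)  # class-dependent weight
    eps = 1e-7                                            # numerical stability
    return jnp.mean(-alpha_t * (1.0 - p_t) ** gamma * jnp.log(p_t + eps))

# With gamma = 0 and alpha = 0.5, this reduces (up to a constant factor of 0.5)
# to ordinary binary cross-entropy.
loss = binary_focal_loss(jnp.array([2.0, -1.0, 0.5]), jnp.array([1, 0, 0]))
```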
This flexible framework enabled continuous, incremental refinement: initial general-purpose defaults were iteratively adapted as the training process revealed dataset-specific requirements. To ensure consistency and reproducibility, all selected hyperparameters were documented and stored, facilitating uniform transfer learning protocols across diverse datasets. Throughout training, we monitored performance metrics such as AUC, applying early stopping criteria and checkpointing to prevent overfitting and to revert to the most effective model state. By integrating these adaptive and systematic strategies, we enhanced both the stability and accuracy of ICB response predictions across a wide range of clinical datasets.

For the ensemble prediction, we used 12 models, each pretrained on 12 datasets (excluding one) and fine-tuned on the training subset of the held-out dataset. These 12 fine-tuned models were then used to generate ensemble predictions for the test subset of the held-out dataset. Importantly, the test subset was never used during either the pretraining or fine-tuning stages. This setup ensures a fair comparison between the performance of a single fine-tuned model and the ensemble of 12 models on the same held-out test subset.

Sample size analysis

To evaluate the impact of training sample size on transfer learning performance, we performed a systematic analysis using the holdout datasets. For each holdout dataset, we varied the number of training samples from 2 up to the maximum number of available balanced samples, incrementing in steps of 2 to maintain equal representation of the positive (responders) and negative (non-responders) classes. This approach ensures that class balance does not confound the effects of sample size on model performance. For each sample size, we conducted R=10 independent iterations, randomly selecting an equal number of positive and negative samples from the training set. The selected samples were used to fine-tune the pretrained IC2Bert model. The performance of the fine-tuned models was evaluated on a separate test set composed of samples not used during fine-tuning. The primary metric for assessing model performance was the Area Under the Receiver Operating Characteristic Curve (AUC). For each sample size and iteration, the model's AUC was calculated on the test set predictions.

Feature importance analysis

To identify the genes most influential in our model's predictions, we conducted a feature importance analysis using two complementary methods: attention-based importance and permutation-based importance.

Attention-based importance leverages the attention mechanisms inherent in transformer models. Specifically, we extracted the attention weights from each of the model's layers and heads and normalized them as shown below.

$$I_i^{attent}=\frac{|w_i^{attent}|}{\sum _j|w_j^{attent}|}$$

where i denotes the gene of interest and j runs over all genes in the feature space. The attention scores reflect how much the model focuses on each gene when making predictions. Higher attention scores indicate greater influence of a gene on the model's internal representation.

Permutation-based importance assesses the impact of each gene on the model's predictive performance. For each gene, we permuted its expression values across the dataset, effectively disrupting any relationship between that gene and the target variable.
We then measured the decrease in model performance, quantified by the drop in the area under the receiver operating characteristic curve (AUC), as shown below.

$$I_i^{perm}=\frac{1}{R} \sum _{r=1}^{R}\left(AUC_{baseline}-AUC_{perm(i, r)}\right), \;\;\; I_i^{perm} \leftarrow \frac{|I_i^{perm}|}{\sum _j|I_j^{perm}|}$$

where i denotes the gene of interest, r indexes the R permutation repetitions, and j runs over all genes in the feature space. Genes whose permutation leads to a significant performance drop are considered more important. To obtain a robust measure of feature importance, we combined the attention and permutation scores, as shown below.

$$I_i^{comb}=\alpha I_i^{attent}+(1-\alpha )I_i^{perm}$$

We assigned a weight of 40% to the attention-based importance and 60% to the permutation-based importance (\(\alpha = 0.4\)), reflecting their empirical influence on model performance. The combined score provides a balanced view of each gene's significance, considering both the model's internal focus and the practical impact on predictions.

Implementation details

IC2Bert is implemented in Python 3.10 and builds on the JAX ecosystem–using JAX v0.4.34 and Haiku v0.0.13 to define the transformer backbone, Optax v0.1.7 for optimization (AdamW with warmup-cosine schedules and gradient clipping), and Orbax v0.7.0 for checkpoint management. Data preprocessing leverages pandas v2.2.3 and numpy v1.24.3 for I/O and array operations, scikit-learn v1.5.2 for stratified splits, and a custom BinnedExpressionTokenizer for quantile-based binning. Progress and parallel analyses use tqdm v4.66.5 and joblib v1.4.2, while matplotlib v3.9.2 and seaborn v0.13.2 support visualization. All dependencies were installed via pip v24.2 in a Linux environment.

Hyperparameter selection

We used a sample-size–dependent hyperparameter selection strategy rather than an exhaustive grid search. At startup, each hold-out cohort is classified as “small” (\(