Cell type annotation for scATAC-seq via DNA large language model and graph domain adaptation

by Yan Liu, Sheng Guan, He Yan, Long-Chen Shen, Ji-Peng Qiang, Guo Wei

Single-cell ATAC-seq (scATAC-seq) enables the exploration of chromatin accessibility at single-cell resolution, offering critical insights into gene regulation. Accurate cell type annotation is a fundamental prerequisite in scATAC-seq analysis. While cross-modality annotation methods leverage scRNA-seq data for label transfer, they often suffer from modality mismatch and signal distortion. Intra-modality annotation, which utilizes only scATAC-seq reference data, has gained attention for its biological consistency. However, existing methods are limited by insufficient sequence representation and a lack of neighborhood modeling during domain adaptation. To address these limitations, we propose scLLMDA, a novel framework for scATAC-seq cell type annotation via a DNA large language model and graph-based domain adaptation (GDA). scLLMDA uses a pretrained DNA-specific language model to generate contextual embeddings of peak sequences, which are then integrated with accessibility information to represent individual cells. We construct similarity-based cell graphs for both the source and target datasets, and apply a graph neural network to align the two domains while preserving local structural context. Our approach captures rich sequence semantics and neighborhood dependencies, enabling more accurate and robust cell type annotation across datasets. Extensive experiments on multiple benchmarks demonstrate that scLLMDA outperforms existing methods in accuracy. The source code and implementation of scLLMDA are publicly available at: https://github.com/sheng-guan-2001/scLLMDA.
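
The two preprocessing ideas described above (combining peak-sequence embeddings with accessibility to represent a cell, then building a similarity-based cell graph) can be illustrated with a minimal sketch. This is not the scLLMDA implementation. It assumes the peak embeddings have already been produced by some pretrained DNA language model, uses an accessibility-weighted mean as one plausible way to integrate them, and uses cosine-similarity k-nearest neighbors as one plausible graph construction. The function names `cell_embeddings` and `knn_graph` are hypothetical.

```python
import numpy as np


def cell_embeddings(accessibility, peak_embeddings):
    """Accessibility-weighted mean of peak embeddings (an assumed integration scheme).

    accessibility:   (n_cells, n_peaks) binary or count matrix
    peak_embeddings: (n_peaks, d) matrix, assumed to come from a DNA language model
    returns:         (n_cells, d) matrix of cell representations
    """
    accessibility = np.asarray(accessibility, dtype=float)
    peak_embeddings = np.asarray(peak_embeddings, dtype=float)
    totals = accessibility.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # cells with no accessible peaks get a zero vector
    return (accessibility @ peak_embeddings) / totals


def knn_graph(features, k):
    """Indices of the k most cosine-similar cells for each cell, excluding itself."""
    features = np.asarray(features, dtype=float)
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for all-zero rows
    unit = features / norms
    similarity = unit @ unit.T
    np.fill_diagonal(similarity, -np.inf)  # a cell is never its own neighbor
    return np.argsort(-similarity, axis=1)[:, :k]
```

In a setup like the one the abstract describes, such a graph would be built separately for the source (labeled) and target (unlabeled) datasets before a graph neural network aligns the two domains. The alignment step itself is not sketched here.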