Main
In prokaryotes, the CRISPR-Cas system provides antiviral immunity by recognizing and disrupting intruding viral DNA through DNA sequence recognition1. This system has been harnessed for precise genome editing in various organisms and cell types1,2,3,4,5. Identifying naturally occurring Cas nucleases and engineering Cas enzyme variants with different features is crucial for various research and clinical applications. CRISPR-Cas nucleases form protein:DNA contacts to initiate target site recognition through a protospacer adjacent motif (PAM)6,7,8,9,10,11. The location (5′ or 3′ of the spacer) and sequence of the PAM differ among different types of CRISPR-Cas system11,12; however, efficient and accurate identification of PAM requirements in eukaryotic cells remains a bottleneck in the discovery and characterization of novel Cas nucleases and their engineered variants.

Various methods have been developed for PAM identification, including in silico13 and in vitro cleavage assays12,14,15,16, bacterial-based assays7,10 including the PAM screen achieved by NOT-gate repression (PAM-SCANR)17, cell-free transcription–translation (TXTL) systems18,19, fluorescence-based assays20 including the PAM definition by observable sequence excision (PAM-DOSE)21, human cell library-based approaches22,23 and a hybrid method that combines scalable human cell expression with an in vitro cleavage reaction, the high-throughput PAM determination assay (HT-PAMDA)24,25. Each method has its own advantages and limitations (reviewed elsewhere13,25). In general, in silico and bacterial depletion results may not be easily translated to mammalian cell contexts. In vitro cleavage assays have the advantage of handling large libraries15; however, in vitro methods require laborious protein purification, and the cleavage kinetics may not accurately reflect the conditions in a living organism.
Previous in vivo methods require introducing synthetic random oligos as PAM candidates into live cells, which makes it challenging to maintain high-diversity sequence libraries. Moreover, fluorescence (GFP or RFP)-based enrichments are associated with particularly low efficiency.

Repetitive sequences in the mammalian genome, flanked by diverse sequences, are a potential resource for characterizing the PAM preferences of naturally occurring and engineered Cas nucleases. Here we developed a direct PAM identification method called GenomePAM, which uses highly repetitive sequences in the mammalian genome. To characterize the PAM requirements of Cas enzymes, we identified genomic repeats flanked by highly diverse sequences where the constant sequence can be used as the protospacer in CRISPR-Cas genome editing experiments. The cleaved genomic regions can then be analysed using methods such as the genome-wide unbiased identification of double-strand breaks (DSBs) enabled by sequencing (GUIDE-seq)26, which enriches double-stranded oligodeoxynucleotide (dsODN)-integrated fragments by anchored multiplex PCR sequencing (AMP-seq)27. GenomePAM is highly efficient and accurate at characterizing the mammalian-cell-based PAM of an enzyme because every cell carries one complete candidate PAM library of identical complexity. GenomePAM can also simultaneously assess the potency of thousands of on-target sites across the genome and the fidelity of tens of thousands of potential off-target sites of a Cas nuclease, facilitating performance comparison of different Cas nucleases. Moreover, GenomePAM can be used to better understand and compare genome-wide chromatin accessibility profiles of different cell types.

Results

Method design
The human genome contains highly repetitive sequences28, most of which are not suitable for use as protospacers due to low-complexity flanking sequences.
However, a subset of these sequences can be used for PAM preference identification, provided they have the following features: (1) The number of unique flanking sequences of a given length in the human genome is comparable with, or not significantly smaller than, the number of potential PAMs to be tested. For example, the PAM of SpCas9 (Streptococcus pyogenes Cas9) and its variants may range from 1 to 3 bases; therefore, the number of unique 3-nt-long flanking sequences should preferably be 64 (=4³). In the case of SaCas9 (ref. 29) (Staphylococcus aureus Cas9) and its variants, the PAM may range from 3 to 4 bases; therefore, the number of unique flanking sequences of 4-nt length should preferably be ~256 (=4⁴). (2) The flanking sequences should have highly diverse or nearly completely random sequence compositions.

To characterize the PAM of SpCas9 and its variants, we analysed the human genome for all possible 20-nt-long sequences and their flanking sequence diversities. For example, there are 8,471 occurrences of the sequence 5′-GTGAGCCACTGTGCCTGGCC-3′ (part of an Alu sequence; hereafter referred to as ‘Rep-1’) distributed across the human genome (Fig. 1a; ~16,942 occurrences in a human diploid cell) with nearly random flanking sequences of 10-nt length at its 3′ end, making it a suitable candidate as the protospacer sequence for PAM identification (Fig. 1b). For type II Cas nucleases with 3′ PAMs, such as SpCas9 and SaCas9, Rep-1 can be directly used for PAM preference characterization. For type V Cas nucleases with their PAM at the 5′ end of the spacer, such as FnCas12a (ref. 12) (Francisella novicida Cas12a), the reverse complementary sequence 5′-GGCCAGGCACAGTGGCTCAC-3′ (‘Rep-1RC’) can be used as the protospacer sequence (Fig. 1b). Since Cas nucleases can often tolerate a few base mismatches (off targets), we calculated the numbers of 20-nt sequences with 1, 2, 3 and 4 base mismatches.
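The repeat-scanning idea described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names (`scan_flanks`, `flank_coverage`) and the toy genome string are hypothetical, only the forward strand is scanned, and a brute-force window scan like this would be far too slow for a real genome. It shows how 3′ flanking sequences of a fixed protospacer can be collected, grouped by mismatch count, and compared with the 4^k possible flanks of length k.

```python
from collections import Counter

REP1 = "GTGAGCCACTGTGCCTGGCC"  # Rep-1 protospacer quoted in the text
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase ACGT string."""
    return seq.translate(COMPLEMENT)[::-1]


def hamming(a: str, b: str) -> int:
    """Number of mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def scan_flanks(genome: str, protospacer: str, flank_len: int,
                max_mismatches: int = 0) -> dict:
    """Map mismatch count -> Counter of 3' flanks (forward strand only)."""
    n = len(protospacer)
    hits = {}
    for i in range(len(genome) - n - flank_len + 1):
        d = hamming(genome[i:i + n], protospacer)
        if d <= max_mismatches:
            flank = genome[i + n:i + n + flank_len]
            hits.setdefault(d, Counter())[flank] += 1
    return hits


def flank_coverage(flanks: Counter, flank_len: int) -> float:
    """Fraction of the 4**flank_len possible flanks that were observed."""
    return len(flanks) / 4 ** flank_len


# Hypothetical toy "genome": two perfect Rep-1 copies and one copy with a
# single T->C substitution at index 9, each followed by a 3-nt flank.
VARIANT = REP1[:9] + "C" + REP1[10:]
toy_genome = (REP1 + "AGG" + "TTTTT" + REP1 + "TGG" + "TTTTT"
              + VARIANT + "CGG" + "TTTTT")

hits = scan_flanks(toy_genome, REP1, flank_len=3, max_mismatches=1)
rep1_rc = reverse_complement(REP1)  # protospacer for 5'-PAM nucleases
```

For a nuclease with a 5′ PAM, the same scan could be run with `rep1_rc` as the protospacer and the flank taken from the 5′ side instead; that variant is left out here to keep the sketch short.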
For Rep-1, these numbers were 48,207, 206,767, 579,336 and 1,350,488, respectively, and >2 million in total in the human genome (hg38). Thus, using Rep-1 or Rep-1RC as the protospacer, there are potentially >4 million targets in a single human diploid cell. A list of example repeats, their occurrences, flanking sequence diversity and their use is shown in Extended Data Fig. 1.

Fig. 1: Method design.
a, Genome-wide distribution of the Rep-1 sequence in the human genome. b, SeqLogo plot showing nucleotide frequency at each position in the 10 bases at 3′ of Rep-1 and the 10 bases at 5′ of Rep-1RC (reverse complement of Rep-1). Rep-1 and Rep-1RC sequences can be used to characterize PAM preferences of Cas nucleases with 3′ PAM and 5′ PAM, respectively. c, GenomePAM workflow for the identification of PAM preferences using the highly repetitive genome sequence Rep-1 as the protospacer and the GUIDE-seq experiments to capture cleaved genomic sites of SpCas9. CMV, cytomegalovirus promoter. d, An example of GUIDE-seq output. Each line shows one SpCas9 cleavage site and the mismatch bases are colour coded. The flanking sequences and GUIDE-seq read counts of each site are shown on the right side. e, SeqLogo plot summary for SpCas9 PAM preferences using their corresponding read counts as weights and stratified by perfect-match and mismatch targets. f, GenomePAM Table reporting the enriched PAM sequences and counts, along with the numbers and percentages of corresponding genomic sites edited, and associated statistical significance. P values were derived from a two-sided chi-square test.

To leverage these genomic repeats to characterize the PAM of various CRISPR-Cas enzymes, the repeat sequence Rep-1 was chosen as the protospacer target. The corresponding spacer was cloned into a guide RNA (gRNA) expression cassette to be used along with a plasmid that encodes the candidate Cas nuclease.
To identify which repeats within the genome were cleaved in an experiment, we adapted the GUIDE-seq (ref. 26) method to capture cleaved genomic sites in HEK293T cells (Fig. 1c). Only those sites whose flanking sequences contain functional PAMs can be cleaved by the Cas nuclease. Cell toxicity caused by large numbers of DSBs in a single cell has been reported previously when using CRISPR to target the highly repetitive element LINE1 (ref. 30) or unique repeat sequences associated with the temozolomide mutational signature31. To assess this toxicity, we measured cell viability in four different conditions, including Lipofectamine 3000 transfection controls, and in two different cell lines (Methods). The results showed largely similar cell viability across different transfection conditions at 24 h and 48 h after transfection in HEK293T (Extended Data Fig. 2a) and HepG2 cells (Extended Data Fig. 2b). During the GUIDE-seq data analysis, the candidate PAM was set as unknown (‘NNNNNNNNNN’) and 13,908 sites across the genome were identified (Fig. 1d). The mismatch bases were typically located at positions 8–11 of the targets and were transitions of the intended bases (Fig. 1d). The resulting PAMs were then summarized using their corresponding read counts as weights and used for SeqLogo plotting (Fig. 1e), which was stratified by perfect-match and mismatch targets. Beyond the descriptive SeqLogo, the consistent genomic background sequences inspired us to create an iterative ‘seed-extension’ method. This approach identifies statistically significant enriched motifs and reports the percentages of edited genomic sites at each iteration step (Fig. 1f and Methods).

Performance of GenomePAM on SpCas9, SaCas9 and FnCas12a
To evaluate the performance of GenomePAM, we chose three Cas nucleases with well-established PAMs: SpCas9, SaCas9 and FnCas12a. We used Rep-1 for GenomePAM analysis of SpCas9 and SaCas9, and Rep-1RC for FnCas12a.
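The two analysis ideas mentioned above, weighting PAMs by read count for the SeqLogo and testing enrichment with a chi-square test, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the iterative seed-extension procedure itself is specified in the paper's Methods and is not reproduced here, the function names are hypothetical, and the site list and the 2×2 counts are made-up toy values. The test shown is a plain Pearson chi-square on a 2×2 table (1 degree of freedom, no continuity correction).

```python
import math
from collections import defaultdict


def weighted_pfm(sites, pam_len):
    """Per-position base frequencies, weighting each PAM by its read count.

    sites: iterable of (pam_sequence, read_count) pairs.
    Returns a list of dicts, one per PAM position, mapping base -> fraction.
    """
    totals = [defaultdict(float) for _ in range(pam_len)]
    for pam, reads in sites:
        for pos, base in enumerate(pam[:pam_len]):
            totals[pos][base] += reads
    pfm = []
    for column in totals:
        column_sum = sum(column.values())
        pfm.append({b: (column.get(b, 0.0) / column_sum if column_sum else 0.0)
                    for b in "ACGT"})
    return pfm


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic and P value for a 2x2 table.

    a: edited sites with the motif      b: unedited sites with the motif
    c: edited sites without the motif   d: unedited sites without the motif
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0, 1.0
    chi2 = n * (a * d - b * c) ** 2 / denom
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical toy data: (PAM, GUIDE-seq read count) for four cleaved sites.
toy_sites = [("AGG", 90), ("TGG", 60), ("CGG", 30), ("GAG", 20)]
pfm = weighted_pfm(toy_sites, pam_len=3)

# Hypothetical 2x2 counts for one candidate base at one PAM position.
chi2, p_value = chi_square_2x2(a=90, b=10, c=10, d=90)
```

In a seed-extension setting, a test of this kind would presumably be repeated after fixing the most significant base and conditioning on it, but the exact iteration and stopping rules should be taken from the Methods rather than from this sketch.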
The results showed that the PAM preferences for SpCas9, SaCas9 and FnCas12a were NGG at the 3′ side, NNGRRT (R is G or A) at the 3′ side and YYN (Y is T or C) at the 5′ side of the spacer, respectively, consistent with previous results7,10,12,29,32 (Fig. 2a–c; top, perfect match; bottom, mismatch; Fig. 2d–f, 4-base heat map of relative PAM cleavage value (PCV); Methods; GUIDE-seq results are listed in Supplementary Tables 1–3 and visualized in Supplementary Tables 4–6). GenomePAM Table analysis showed that, for the SpCas9 3′ PAM, the most significant single base was the G at position 3 (1,103 (65.6%) of a total of 1,681 targets in the human genome edited), the most significant two bases were GG at positions 2 and 3 (449 (94.1%) of a total of 477 targets edited), and no further bases were significant (Fig. 2g). For the SaCas9 3′ PAM, the GenomePAM Table analysis showed increasing significance for G at position 3, GR at positions 3–4, GRR at positions 3–5 and GRRT at positions 3–6, respectively (Fig. 2h). The corresponding percentages of edited targets were 44.6%, 62.1–73.9%, 80.0–93.3% and 96.7–98.0%, respectively (Fig. 2i). For the FnCas12a 5′ PAM, the GenomePAM Table analysis showed a Y at position −3 (5.3–6.0% of targets edited) and a YY at positions −2 to −3 (8.5–9.6% of targets edited). For the known PAM wobble bases6,33, namely the 2nd position of N[G/A]G in SpCas9 and the 6th position of NNGRR[T/A] in SaCas9, the proportions of the prominent bases ([G] in SpCas9 and [T] in SaCas9) in the mismatch target-associated PAMs were higher than those in the perfect-match-associated PAMs (Fig. 2j; left, SpCas9; right, SaCas9; both P