Introduction

Heterologous gene expression—expressing a gene in a non-native host organism—has become a cornerstone of modern biotechnology, enabling the production of recombinant proteins and enzymes in convenient expression systems. A fundamental challenge in this process is that different organisms exhibit distinct codon usage biases; the frequency with which synonymous codons are used can vary greatly between a gene’s native species and the host species1,2. To overcome this, codon optimization is routinely employed to modify a gene’s coding sequence without altering the amino acid sequence, replacing rare codons with those preferred by the host’s translational machinery. Traditional codon optimization methods—such as algorithms maximizing the codon adaptation index (CAI) or simply swapping out low-frequency codons—have been successful in many cases3,4,5. However, these conventional approaches have notable limitations: they inherently freeze the protein sequence1,4 and often neglect other critical sequence features that influence protein expression, such as mRNA secondary structure, codon context, regulatory motifs, and translation kinetics6,7,8,9. As a result, a sequence optimized purely on codon usage frequencies may still perform suboptimally in the host, and in some instances over-optimization can even reduce protein yield7,10,11.

Orthologous genes—genes in different species that encode the same or similar proteins—provide valuable insights into how nature resolves the challenge of adapting a coding sequence to different organismal contexts12. Even between moderately distant species, orthologs typically accumulate numerous non-synonymous substitutions (amino acid-changing mutations) and sometimes small insertions or deletions (indels) over the course of evolution13.
For instance, when comparing the coding sequences of Bacillus thuringiensis genes to their orthologs in Bacillus subtilis, only about 57% of codons encode the same amino acid (25% of codons are identical and about 32% are synonymous substitutions in which the amino acid is conserved). The remaining 43% of codons differ in amino acid, and insertions and deletions account for roughly a further 4% of positions. Similarly high levels of amino acid divergence are observed for more distantly related species (e.g., Escherichia coli vs. B. subtilis orthologs have over 50% non-synonymous codon differences). These observations underscore that nature’s solution to cross-species gene transfer is not limited to swapping codons for synonymous alternatives—it also involves adjusting the protein sequence itself when necessary.

In recent years, data-driven approaches have begun to address some shortcomings of purely frequency-based methods14,15. For example, deep learning models trained on large collections of synonymous gene variants and their expression levels have captured subtle sequence features (e.g., codon pair context, GC content distribution, mRNA structure propensity) associated with high protein yields. Deep generative models can design synthetic coding sequences that not only match a host’s codon usage bias but also minimize problematic structures or motifs, effectively learning the language of codons for a given organism. Indeed, several studies have demonstrated improved protein expression using neural network–designed gene sequences compared with those optimized by conventional algorithms16. Nevertheless, existing machine-learning-based gene design tools still typically restrict changes to synonymous substitutions, leaving the protein sequence itself unaltered.

In this study, we introduce a problem formulation at the DNA level: orthologous gene conversion framed as a sequence-to-sequence (seq2seq) translation task between species.
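The codon-level bookkeeping used in the ortholog comparisons above (identical, synonymous, non-synonymous, and indel positions) can be made concrete with a short sketch. This is a minimal illustration with toy aligned sequences; the codon table below covers only the codons used in the example, and a real pipeline would use the full genetic code and a proper sequence alignment.

```python
# Classify aligned codon pairs into the categories discussed above.
# Minimal sketch: the table covers only the codons in the toy example.
CODON_TO_AA = {
    "ATG": "M", "AAA": "K", "AAG": "K", "GAA": "E", "GAT": "D",
    "CTG": "L", "TTA": "L",
}

def classify_codon_pairs(src_codons, tgt_codons):
    counts = {"identical": 0, "synonymous": 0, "non_synonymous": 0, "indel": 0}
    for s, t in zip(src_codons, tgt_codons):
        if s == "---" or t == "---":            # gap in either sequence
            counts["indel"] += 1
        elif s == t:                             # same codon
            counts["identical"] += 1
        elif CODON_TO_AA[s] == CODON_TO_AA[t]:   # same amino acid, new codon
            counts["synonymous"] += 1
        else:                                    # amino acid changed
            counts["non_synonymous"] += 1
    return counts

src = ["ATG", "AAA", "CTG", "GAA", "---"]
tgt = ["ATG", "AAG", "TTA", "GAT", "AAA"]
print(classify_codon_pairs(src, tgt))
# -> {'identical': 1, 'synonymous': 2, 'non_synonymous': 1, 'indel': 1}
```

Dividing each count by the alignment length then yields the percentage breakdown quoted above.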
We develop OrthologTransformer, a general deep learning framework based on a seq2seq architecture17 that expands codon optimization into cross-species gene adaptation for prokaryotes. By training on orthologous gene pairs from diverse bacteria, OrthologTransformer learns to introduce nucleotide changes (including amino acid substitutions and indels) in a biologically informed manner, aiming to produce a gene sequence that looks native to the target host while preserving the protein’s function. Thus, OrthologTransformer is trained in a supervised seq2seq manner on naturally occurring ortholog pairs that already embody the balance between host adaptation and function preservation. Consequently, the model does not decide this balance via an explicit objective or hand-tuned weights; rather, it reproduces the patterns present in orthologs (synonymous changes, conservative amino acid substitutions, occasional indels). We show that OrthologTransformer-generated sequences more closely resemble true orthologs of the target species and improve heterologous expression outcomes. We then showcase a practical case study: the PETase enzyme, which enables bacteria to degrade PET plastic18. We use OrthologTransformer to convert the PETase gene from its native bacterium (Ideonella sakaiensis) to a target host species (B. subtilis), producing a gene sequence encoding a putative orthologous enzyme. We demonstrate that an OrthologTransformer-designed PETase gene enables B. subtilis to produce and secrete an active PET-degrading enzyme at levels surpassing those achieved by standard codon optimization. Our study focuses on DNA sequence generation and design; consequently, direct comparisons with protein-level generative models such as ProGen19 or ESM20 (which generate amino-acid sequences, often de novo or under high-level conditioning) are not appropriate.
OrthologTransformer conditions on an input gene and performs cross-species rewriting at DNA resolution on prokaryotic genes, a fundamentally different objective from unconditional or tag-conditioned protein generation.

Results

A deep learning model for orthologous gene conversion

To enable gene sequence adaptation beyond synonymous codon changes, we developed OrthologTransformer, a sequence-to-sequence deep learning model that converts a coding DNA sequence from a source species into a predicted orthologous sequence in a target species. The model was trained on a large collection of known orthologous gene pairs from diverse bacteria. We implemented OrthologTransformer using the Transformer architecture, as illustrated in Fig. 1a, which is well-suited for modeling long sequences with complex dependencies. In essence, OrthologTransformer learns to edit the input gene, inserting or removing codon tokens and changing codons as needed to produce an orthologous sequence for the target species. Because the training ortholog pairs were naturally aligned by protein function, the model learned to make changes that preserve function (as true orthologs do) rather than introducing random, disruptive mutations.

Fig. 1: Schematic of the OrthologTransformer model and downstream selection.
a Input: a coding DNA sequence from Species A (source), prepended with a source_species token \({s}_{{src}}\), is encoded. Output: the decoder, conditioned by a target_species token \({s}_{{tgt}}\), generates an orthologous coding sequence for Species B, permitting synonymous and conservative non-synonymous substitutions and indels where supported by ortholog supervision. The model features a 20-layer encoder-decoder structure, with each layer equipped with Add & Normalization layers and Multi-head Attention mechanisms. Species tokens (\({s}_{{src}}\), \({s}_{{tgt}}\)) are prepended to the input sequence, enabling species-specific sequence conversion.
b OrthologTransformer employs a two-stage learning approach consisting of pretraining and fine-tuning. In the pretraining phase, the model learns general sequence conversion patterns from many-to-many orthologous relationships across multiple species. In the fine-tuning phase, the model is specialized for specific one-to-one species pair conversions using targeted training data. c During candidate selection, a multi-objective Monte Carlo Tree Search (MCTS) routine jointly optimizes GC content and mRNA secondary-structure stability (MFE).

During training, we additionally conditioned the model on the identity of the target species. In practice, a special token indicating the target species was prepended to each input sequence. This enabled a single model to handle translations into many possible target species (Fig. 1b). Through this mechanism, OrthologTransformer learns each species’ particular codon usage biases and typical amino acid adaptations. The final trained model takes an input gene (from a specified source species) and a desired target species, and it generates a new coding sequence that could function as that gene’s ortholog in the target species.

Benchmarking OrthologTransformer’s performance

We first evaluated how well the trained OrthologTransformer could generate known orthologous sequences across diverse species. In these tests, the model was given a gene from one species and tasked with predicting its ortholog in another species, and we measured the similarity between the AI-generated sequence and the actual native ortholog. Our benchmark datasets ranged from one-to-one ortholog pairs between 2 species (658 sequence pairs) to many-to-many ortholog relationships across 2138 species (over 4.9 million sequence pairs from the OMA database21), as shown in Supplementary Table 1. We found that the model’s performance improved dramatically with the breadth of training data available.
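The species-token conditioning described above can be sketched as a simple preprocessing step: the coding sequence is split into codon tokens, species tokens are prepended, and everything is mapped to integer ids for the Transformer's embedding layer. Token spellings and the tiny vocabulary below are hypothetical; the actual tokenization scheme may differ.

```python
def build_encoder_input(src_species, tgt_species, cds):
    """Split a CDS into codon tokens and prepend species tokens
    (token spellings here are hypothetical)."""
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    return [f"<{src_species}>", f"<{tgt_species}>"] + codons

tokens = build_encoder_input("I_sakaiensis", "B_subtilis", "ATGGCAAAA")
print(tokens)  # ['<I_sakaiensis>', '<B_subtilis>', 'ATG', 'GCA', 'AAA']

# Map tokens to integer ids for the embedding layer (toy vocabulary).
vocab = {tok: i for i, tok in enumerate(
    ["<pad>", "<I_sakaiensis>", "<B_subtilis>", "ATG", "GCA", "AAA"])}
ids = [vocab[t] for t in tokens]
print(ids)  # [1, 2, 3, 4, 5]
```

Because the species identity is just another token, a single trained model can be steered toward any target species seen during training by swapping the prepended token.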
This progressive expansion of training data yielded remarkable improvements in ortholog conversion accuracy, as demonstrated in Fig. 2, where the codon sequence identity between the generated sequence and the target sequence increased dramatically from 0.15 when using a very limited dataset (ortholog pairs from 2 species) to 0.40 when leveraging the full 2138-species dataset.

Fig. 2: Performance improvement with increasing training dataset scale.
Evaluation of accuracy when converting from I. sakaiensis to B. subtilis. Each bar shows the conversion score (codon sequence identity) between the generated sequence and the target sequence at different dataset scales (ranging from 2 to 2138 bacterial species). Experiments were conducted using three approaches: models without alignment processing (Alignment-free, black), models with alignment processing (Alignment, gray), and models with fine-tuning (Finetuning, white). The numbers within the graph indicate the actual number of sequence pairs used in each dataset. The vertical axis represents the codon sequence identity between the generated sequence and the target sequence (range 0–1), while the horizontal axis shows the number of bacterial species used. Source data for this figure are available in the Source Data file.

As part of our computational validation, we conducted a large-scale benchmark designed to match the practical upper limit of existing baseline methods. Specifically, we extended the performance comparison to the maximum possible scale, covering all 45 bacterial species included in both the OMA database (2138 species) and the set of species supported by the existing deep-learning method CodonTransformer22 (164 species), and evaluating 450 source-to-target species combinations (10 source species per target species).
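The evaluation metric used throughout this benchmark, codon sequence identity, can be sketched as the fraction of positions at which the generated and target codons are identical. This is a simplified positional version; the paper's exact alignment handling may differ.

```python
def codon_identity(generated, target):
    """Fraction of positions with identical codons. Length mismatch
    counts against the score (overhang is treated as mismatch)."""
    n = max(len(generated), len(target))
    matches = sum(g == t for g, t in zip(generated, target))
    return matches / n

gen = ["ATG", "AAA", "GCT", "GAA"]
tgt = ["ATG", "AAG", "GCT", "GAA"]
print(codon_identity(gen, tgt))  # 3 of 4 positions identical -> 0.75
```

A score of 0.40, as reached with the full 2138-species dataset, thus means that two out of every five codons of the generated sequence exactly match the native target ortholog.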
CodonTransformer, a recent deep learning model, uses a Transformer architecture to optimize synonymous codon choices in a context-aware manner and is trained on large multispecies datasets22. The results of this extensive comparison are summarized in Fig. 3, using codon-level sequence identity to the target ortholog as the evaluation metric. In this large-scale benchmark, OrthologTransformer outperformed conventional frequency-based codon optimization and CodonTransformer across all evaluated source-to-target species combinations. OrthologTransformer consistently achieved significant improvements (typically with p values less than 1e-5) over synonym-focused approaches, indicating that capturing ortholog-like sequence patterns goes beyond what can be achieved by synonymous substitutions alone.

Fig. 3: Performance comparison of OrthologTransformer across 45 bacterial species.
This figure shows codon sequence identity to the target sequence for codon sequences generated by OrthologTransformer, frequency-based codon optimization (“Codon optimization”), and CodonTransformer22, along with the identity between the original source and target sequences (“Source vs Target”). A large-scale benchmark was conducted for 45 bacterial species included in both the OMA database (2138 species) and the set of species that the publicly available trained model of CodonTransformer can handle (164 species), as well as 450 source-to-target species combinations (10 source species per target species). The x-axis indicates the target species, and the y-axis indicates codon sequence identity; each plotted value is the average identity across the 10 source-to-target conversions for that target species (n = 10), and the target species are ordered from left to right by increasing mean identity. Data are presented as mean values ± SD (error bars indicate standard deviation).
Across all evaluated pairs, OrthologTransformer consistently achieves higher codon sequence identity than both frequency-based codon optimization and CodonTransformer. Source data for this figure are available in the Source Data file.

To provide concrete and biologically interpretable examples, we highlight nine representative species pairs spanning diverse genomic and physiological characteristics, where each species pair contains several thousand orthologous genes (Table 1). As shown in Table 1, the codon-level identity between the generated sequences and the true target sequences increases substantially, especially for conversions between species with markedly different genomic contexts. For example, when converting between B. subtilis (43.5% GC content) and I. sakaiensis (66.7% GC content), the identity to the target sequence roughly doubled, from 0.221 (original source sequence) to 0.424 (generated sequence). Likewise, between L. lactis and T. thermophilus, which differ substantially in optimal growth temperature (30 °C versus 65 °C), the sequence identity increased nearly threefold—from 0.157 (original source sequence) to 0.467 (generated sequence).

Table 1 Performance comparison of OrthologTransformer across 9 representative bacterial species pairs

Importantly, improvements are not limited to synonymous patterns. When examining sequence similarity at the amino acid level, OrthologTransformer preserves functional integrity while introducing biologically plausible amino-acid substitutions (that is, non-synonymous substitutions): Table 1 also reports BLOSUM-based amino-acid similarity, showing consistent improvements across multiple pairs. Many changes correspond to conservative substitutions (chemically similar amino acids), suggesting that the model preferentially introduces subtle, function-preserving protein-level edits when supported by ortholog supervision.
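The BLOSUM-based view of amino-acid substitutions can be sketched as follows: positive off-diagonal BLOSUM62 scores mark conservative substitutions. The matrix below contains only the handful of BLOSUM62 entries needed for the toy example, and the uniform diagonal value is a placeholder (real diagonal scores vary by residue); a full analysis would load the complete matrix, e.g., via Biopython's `substitution_matrices`.

```python
# A few real BLOSUM62 off-diagonal entries (symmetric); full matrix omitted.
BLOSUM62_SUBSET = {
    ("D", "E"): 2, ("K", "R"): 2, ("L", "I"): 2, ("A", "D"): -2,
}

def blosum_score(a, b):
    if a == b:
        return 4  # placeholder diagonal; real BLOSUM62 diagonals vary by residue
    return BLOSUM62_SUBSET.get((a, b), BLOSUM62_SUBSET.get((b, a)))

def conservative_fraction(seq1, seq2):
    """Fraction of substituted positions scoring >= 0 under BLOSUM62,
    i.e., the share of substitutions that are conservative."""
    subs = [(a, b) for a, b in zip(seq1, seq2) if a != b]
    if not subs:
        return 1.0
    return sum(blosum_score(a, b) >= 0 for a, b in subs) / len(subs)

print(conservative_fraction("ADEKL", "ADEKI"))  # only L->I (score 2): 1.0
print(conservative_fraction("AAEKL", "ADEKI"))  # A->D (-2) and L->I (2): 0.5
```

A high conservative fraction in the model's edits is what distinguishes biologically plausible protein-level changes from random mutations.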
In addition, OrthologTransformer captures species-specific sequence characteristics beyond raw identity: during conversion, GC content shifts toward that of the target genome, and CAI approaches the distribution of native target genes, as illustrated for representative pairs in Fig. 4, where the distributions of CAI and GC content for OrthologTransformer-designed sequences aligned much more closely with those of native target genes than did the original source sequences.

Fig. 4: Comparison of CAI and GC content distributions across different bacterial species pairs.
Each panel shows ortholog conversion results for a different bacterial species pair. The upper portion of each panel displays violin plots representing the distribution of the Codon Adaptation Index (CAI), while the lower portion shows the distribution of GC content. Blue represents the source sequences, green indicates the target sequences, orange corresponds to sequences generated by the pretrained model, and red denotes sequences generated by the fine-tuned model. Box plots (shown within the violins) indicate the median (center line), the interquartile range (bounds of the box; 25th–75th percentiles), and whiskers extending to the minima and maxima. Source data for this figure are available in the Source Data file.

Finally, these results translate into clear advantages over existing methods. Across the representative pairs, OrthologTransformer achieves approximately a 1.7-fold improvement in codon sequence identity compared with conventional codon optimization (Table 1). This suggests that synonymous-only optimization is insufficient to meet the contextual and evolutionary demands of host adaptation. In the subset of five species pairs where CodonTransformer was evaluated, OrthologTransformer still maintains a clear advantage, delivering an average improvement of ~1.8-fold. For example, when considering sequence conversion to E.
coli, CodonTransformer achieves only slight improvements, whereas OrthologTransformer demonstrates more than a two-fold enhancement. These findings highlight the limitations of relying solely on synonymous substitutions and demonstrate that OrthologTransformer, by incorporating non-synonymous substitutions and indels, provides a more advanced and effective approach to gene adaptation than existing synonym-focused methods. Furthermore, we conducted statistically rigorous tests (two-sided paired t-tests) against conventional methods using three clearly defined evaluation metrics: codon-level sequence identity, CAI (codon adaptation index) proximity, and GC content alignment between source, target, and model-generated sequences, and we report the statistical significance of the improvements (Supplementary Table 2). Across these three metrics, OrthologTransformer consistently outperformed baseline codon optimization and, where comparable, CodonTransformer, with significant gains across pairs.

Together, the large-scale benchmark spanning 45 species and 450 source-to-target combinations (Fig. 3) and the in-depth analysis of nine source-to-target pairs (Table 1 and Supplementary Table 2) demonstrate that OrthologTransformer provides a more effective and evolution-consistent route to cross-species gene adaptation than synonym-only optimization, by allowing non-synonymous substitutions and indels when they are supported by natural ortholog patterns.

Designing a PETase enzyme for B. subtilis

Having validated the model’s general performance, we next applied OrthologTransformer to a specific biotechnologically relevant challenge: adapting a plastic-degrading enzyme (PETase) from Ideonella sakaiensis (a Gram-negative bacterium) to function in Bacillus subtilis (a Gram-positive host). I. sakaiensis PETase naturally breaks down PET plastic18,23,24, but I. sakaiensis grows slowly and is not an ideal organism for industrial use. In contrast, B.
subtilis is a fast-growing, spore-forming bacterium amenable to large-scale fermentation25, making it an attractive host for PETase26,27,28—if the enzyme’s gene can be successfully expressed.

Using the model, we generated a set of candidate B. subtilis-adapted PETase sequences. To ensure we explored a broad design space, we employed additional computational optimization steps in conjunction with the model. In particular, we performed a multi-objective search using Monte Carlo Tree Search (MCTS) to refine the model’s outputs (Fig. 1c). This search aimed to optimize two key properties of the PETase coding sequence: (1) GC content around 36–37%, which is closer to the B. subtilis genome composition, and (2) mRNA secondary structure stability, to maintain sufficient RNA structure for mRNA longevity29. We also evaluated a variant strategy in which the OrthologTransformer model was fine-tuned specifically on orthologous gene data from I. sakaiensis and B. subtilis (to further specialize it for this species pair) before generating sequences. In total, we designed 12 distinct PETase gene variants by systematically varying (i) the breadth of training species (23, 54, or 2138), (ii) alignment processing (±), (iii) pair-specific fine-tuning (±), and (iv) MCTS multi-objective refinement (±). For interpretability, we group the variants by training breadth and data source: OrthoFinder-based (23/54 species; AI-S1, AI-M1–AI-M6) and OMA-based (2138 species; AI-L1–AI-L5). These twelve variants are summarized in Table 2, and the corresponding DNA sequences are provided in Supplementary Data 1. As illustrated in Fig. 5a, these modifications involved varying degrees of insertions, deletions, synonymous substitutions, and non-synonymous substitutions across the 12 AI-designed sequences.
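The MCTS refinement stage searches over candidate sequences against these two objectives. The tree search itself is beyond a short sketch, but the kind of multi-objective score it optimizes can be illustrated as below. All numbers (target GC, weight) are assumptions for illustration, and the folding-energy term is a crude stand-in; a real pipeline would compute MFE with an RNA-folding tool such as ViennaRNA.

```python
def gc_content(seq):
    """Fraction of G/C bases in a DNA sequence."""
    return sum(seq.count(b) for b in "GC") / len(seq)

def mfe_proxy(seq):
    """Crude stand-in for mRNA folding free energy (kcal/mol): more G/C
    gives a more negative value. A real pipeline would use an RNA folder."""
    return -0.5 * sum(seq.count(b) for b in "GC")

def design_score(seq, gc_target=0.365, w_mfe=0.01):
    """Lower is better: distance from host-like GC content plus a small
    stability bonus (more negative proxy MFE lowers the score). The two
    objectives trade off, as in the multi-objective search."""
    return abs(gc_content(seq) - gc_target) + w_mfe * mfe_proxy(seq)

# Rank toy candidate sequences by the combined score.
candidates = ["ATGGCAAAATTC", "ATGGCCGCATTA", "ATGAAAAAATTA"]
best = min(candidates, key=design_score)
print(best, round(gc_content(best), 3))  # the candidate with GC nearest ~0.365 wins
```

In the actual pipeline, MCTS explores edits to the model-generated sequence and uses a score of this kind to guide expansion and selection, rather than exhaustively ranking a fixed candidate list.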
The extent of these changes varied dramatically among the different versions: from minimal modifications in AI-L1 (no changes) to extensive remodeling in AI-M3 (160 insertions, 139 deletions, 72 synonymous substitutions, and 30 non-synonymous substitutions). Despite such extensive sequence modifications, structural predictions showed that the key functional domains remained intact.

Fig. 5: Predicted structures, global and local structural conservation, and sequence-level properties of AI-designed PETase variants.
a Predicted tertiary structures of twelve PETase variants (AI-S1–AI-L5) generated by OrthologTransformer with varying degrees of sequence modification. The wild-type PETase structure (PDB entry 5XJH) is shown on the left for reference. The four numbers below each structure denote the counts of the modifications introduced in each variant, in the following order: insertions/deletions/synonymous substitutions/non-synonymous substitutions. b Global and local structural conservation of AI-designed PETase variants. TM-score (global fold similarity), predicted structural stability, backbone RMSD, and per-residue pLDDT are shown for the AI-designed variants (AI-S1–AI-L5), wild-type (WT), and codon-optimized (CO) sequences. The AI-designed variants, particularly those trained on broader datasets, achieved a favorable balance across these measures, indicating preservation of the PETase fold while permitting small, evolution-consistent modifications and highlighting the benefit of multi-objective optimization relative to conventional codon optimization. c Sequence-level properties. GC content and RNA secondary-structure free energy (ΔG) are shown for AI-S1–AI-L5, WT, and CO. The AI-designed variants converge toward the GC composition of B. subtilis (the target host), whereas the wild-type I. sakaiensis PETase gene is substantially more GC-rich (~66.7%). The AI-designed sequences also exhibit favorable mRNA secondary-structure energetics.
Source data for (b, c) are available in the Source Data file.

Table 2 Experimental conditions for PETase sequence design variants

Several AI-designed sequences were predicted to exhibit markedly improved properties in B. subtilis. As shown in Fig. 5b, c, one design in particular, AI-L2, achieved an optimal balance of characteristics: the highest predicted structural stability, a TM-score of 0.98 for the modeled 3D structure, a GC content of 37.0% (the wild-type I. sakaiensis PETase gene has a substantially higher GC content), and a favorable mRNA secondary structure (predicted folding free energy ΔG ≈ −281 kcal/mol for the full-length mRNA). This high TM-score supports the conclusion that the amino-acid differences in AI-L2 relative to wild-type PETase do not disrupt the overall fold. To provide complementary local assessments, we also examined backbone RMSD and per-residue pLDDT. As summarized in Fig. 5b, c, the redesigned sequences preserve the PETase fold (consistently high TM-scores), show small RMSD deviations, and display high pLDDT confidence across the fold. Notably, AI-L2 combines a TM-score of ≈0.98 and the highest predicted structural stability with RMSD/pLDDT patterns consistent with near-identity to the wild-type structure, in line with its top functional performance.

Experimental validation of AI-designed PETase in B. subtilis

We synthesized a selection of the AI-designed PETase genes and tested their expression and activity in B. subtilis. Twelve constructs (AI-S1, AI-M1–AI-M6, AI-L1–AI-L5) were assembled, each encoding the designed PETase variant (or a control) under an inducible promoter on a B. subtilis shuttle plasmid. To facilitate secretion of the enzyme (since PET is an extracellular substrate), all constructs included an N-terminal signal peptide for Sec-pathway secretion and a C-terminal 6×His-tag for detection (Supplementary Fig. 1). We included two control genes: WT, the wild-type I.
sakaiensis PETase coding sequence, and CO, a purely codon-optimized sequence (identical amino acids, with every codon replaced by the most preferred B. subtilis synonymous codon). All PETase constructs were transformed into B. subtilis, and expression was induced in shaking-flask cultures.

mRNA transcription

All engineered B. subtilis strains successfully transcribed the full-length PETase mRNA. PCR amplification yielded the expected ~0.9 kb PETase transcript (including the signal peptide and His-tag regions) in every AI-designed strain as well as in the WT and CO controls (Supplementary Fig. 2). Quantitative real-time PCR (qPCR) confirmed that all designs produced substantial PETase transcript levels, although these varied by construct (Supplementary Fig. 3). Several variants (e.g., AI-S1, AI-M1, AI-M6, AI-L1, AI-L4) showed PETase mRNA levels comparable to or higher than those of the WT and CO controls, while a few (AI-M3, AI-M5, AI-L2) had somewhat lower transcript levels. Nonetheless, all AI-designed genes were robustly transcribed in B. subtilis.

Protein expression and secretion

Western blot analysis of culture supernatants (anti-His detection) showed that the PETase protein (~30 kDa) was present in many of the AI-designed strains (Supplementary Fig. 4), indicating successful expression and secretion. No PETase band was detected in the empty-vector negative control. Notably, variants AI-L1, AI-L2, and AI-L5 showed especially strong PETase bands in the supernatant, comparable to or exceeding the CO control, indicating that these designs achieved particularly high secreted enzyme levels. The presence of PETase in the medium confirms that the signal peptide functioned and the enzyme was exported from the cells (a crucial feature for PET degradation, since PET is located extracellularly as a solid substrate).

PET degradation activity of AI-designed PETase

Finally, we tested the functional activity of each enzyme using a PET degradation assay. In this assay, B.
subtilis cells expressing PETase were incubated with a film of additive-free PET plastic, and the breakdown products (terephthalic acid (TPA), mono(2-hydroxyethyl) terephthalate (MHET), and bis(2-hydroxyethyl) terephthalate (BHET)) were measured over time by HPLC (Supplementary Fig. 5). The results confirmed that the AI-designed PETase is functional. B. subtilis cells harboring the AI-S1–AI-L5 genes exhibited clear PET degradation activity, comparable to that of cells with the codon-optimized PETase (CO). PET hydrolysis was monitored by measuring its soluble breakdown products (TPA, MHET, and BHET) in the culture supernatant on days 1, 2, 3, and 7. Because of PETase’s known endo-activity, which predominantly generates MHET, the accumulation of MHET serves as the expected indicator of PETase activity in this assay system. MHET was detected as early as day 2 in some engineered strains (AI-M2, AI-M3, AI-L2, AI-L3, AI-L5, and WT), and by day 3 MHET had accumulated in all PETase-expressing cultures (Fig. 6a). Notably, the AI-L2 variant stood out, producing roughly three-fold more MHET than any other strain by day 3, reflecting significantly higher PET-degrading activity (p