Powerful tools are revealing the ‘control knobs’ of the genome

For all that scientists talk of ‘decoding the genome’, the messy reality is that the genome isn’t even written in a single language. Scientists are fluent in the three-nucleotide codons that make up the protein-coding genes in DNA, but these represent only about 2% of the genomic text. The remainder is written in an entirely distinct language, which researchers are yet to untangle.

“Whenever we sequence a human individual, we get about 3.5 million variants, and only 0.6% of those will be in coding regions,” says Nadav Ahituv, a geneticist at the University of California, San Francisco. That fraction is relatively easy to interpret, Ahituv says, but for the rest, “we really don’t understand what it’s doing — we don’t have a regulatory code”.

But researchers are making progress on decoding the regulatory region of the genome, and learning the underlying grammar of the elements that govern when and where genes are turned on and off. To do so, they have turned to a suite of methods known as massively parallel reporter assays (MPRAs). These tools measure how millions of isolated genetic elements or sequence variants influence the expression of a hand-picked ‘reporter’ gene. That helps researchers to identify the genome’s control knobs and untangle their function without being overwhelmed by all the other parts of the genome. “It’s reducing it down to this synthetic system,” says Ryan Tewhey, a geneticist at the Jackson Laboratory in Bar Harbor, Maine. “But you’re still maintaining enough complexity to probe [genomic] space that we don’t fully understand, and that’s kind of the sweet spot.”

MPRAs could help to clarify the genetic foundations of disease, reveal the changes wrought by evolution and guide development of next-generation therapeutics. They could even be used to train artificial-intelligence systems to design genetic circuits, with applications in health care and other sectors.
Customized regulatory elements could, for example, tighten the control of gene-based therapies and ensure that the treatments activate only in specific tissues and under particular conditions, minimizing the ever-present threat of off-target effects. “We’re really trying to engineer things that could be turned on very easily and simply, and even not by a drug,” says Ahituv.

The non-coding genome is not a complete black box, of course. Researchers have identified thousands of proteins, called transcription factors, that have established roles in gene expression, and have pinpointed the DNA sequences to which these proteins bind.

By mapping these sequences across the genome, scientists can predict which ones initiate gene expression — these are known as promoters — and which act as volume knobs, or enhancers, to amplify that expression under specific conditions. They can also look for signatures of regulatory elements by probing genomic regions in which the DNA is exposed and ready for transcription. Chromosomal DNA is wrapped around proteins, forming material called chromatin. Elements in densely packed chromatin are generally inaccessible to regulatory proteins and thus inactive.

But validating these predictions has historically entailed a painstaking process of testing how different mutations in individual enhancers or promoters affect nearby genes. “I realized that if we wanted to ever computationally really analyse and predict enhancers, we would not need a handful of enhancers — we would need hundreds of thousands,” says Alexander Stark, a computational biologist at the Research Institute of Molecular Pathology in Vienna. Over the past 15 years or so, Stark and others have developed a range of methods, known generally as MPRAs, that enable such functional assessments at a previously unimaginable scale.

The core principles of these assays were established in 2009 [1].
Jay Shendure and his colleagues at the University of Washington in Seattle used cloning to generate libraries of small, circular pieces of DNA known as plasmids in which a reporter gene was physically linked to hundreds of variations of a particular promoter. These variants represented every possible single-base mutation in that regulatory sequence, allowing the researchers to survey the role of each nucleotide individually. For identification purposes, each variant was linked to a unique DNA ‘barcode’. The researchers incubated their library in a test tube with the molecular components required for transcription and then sequenced the transcribed RNAs to determine the level of expression triggered by each promoter variant from the abundance of its associated barcode.

Today, MPRAs are typically performed in cultured cells. The libraries are introduced using either a plasmid-based ‘episomal’ approach, in which the DNA never incorporates into the host genome, or lentiviruses, which integrate the libraries into random chromosomal sites. Tewhey prefers the episomal method because of its high efficiency. “Typically, you get a lot more copies in any one cell, so you can test a lot more constructs,” he says. But lentiviruses have their advantages: for example, they can infect cell types that tend to be resistant to plasmid delivery, such as stem cells.

Although essential to making MPRAs work, barcoding also presents one of the technology’s major challenges. Many of the regulatory variants tested in an assay differ by only a nucleotide or two, whereas a typical barcode spans 20 bases, potentially creating an even larger perturbation. “The barcode effect outweighs the variant effect,” explains Hyejung Won, a geneticist at the University of North Carolina at Chapel Hill. As a result, researchers typically use anywhere from 10 to 100 barcodes for each tested sequence, so that the influence of any one barcode averages out. Furthermore, to interpret the results, researchers need to know which barcode is associated with each sequence variant.
But a DNA-exchange process known as recombination can jumble these associations.

As a solution, Stark and his colleagues developed an alternative MPRA format, called STARR-seq. This approach takes advantage of the fact that enhancers are frequently located in gene sequences and thus incorporated into the transcribed RNA [2]. “You clone only a single fragment library, and that fragment library is then its own barcode,” says Stark, adding that this approach can reduce assay cost and complexity.

Putting the regulome on the map

Modern MPRAs can be performed on a vast scale, powering truly genome-wide surveys of the regulatory landscape. “The biggest MPRA that we’ve done was in total close to two billion fragments,” says Bas van Steensel, a genomics researcher at the Oncode Institute and the Netherlands Cancer Institute in Amsterdam.

Many of these experiments have focused on mapping enhancer locations throughout the genome and learning how they drive gene expression in specific tissues. Stark says that as his team began performing STARR-seq in fruit-fly and mammalian cells, it discovered many elements that would have been invisible with other methods because they are generally kept inactive in densely packed chromatin. But in the simplified context of a STARR-seq assay, they can be detected readily. “They can still work as naked DNA,” says Stark.

Last year, researchers led by Ahituv and Shendure reported a systematic analysis of almost every known regulatory sequence in the human genome. They tested how different combinations of promoters and enhancers affected reporter-gene activity in three human cell lines [3]. Van Steensel’s team has used MPRAs to explore communication between genomic regulatory elements [4]. “Enhancers and promoters show a degree of compatibility — a degree of whether or not they can talk to each other,” he says.
“It’s not a day and night thing — these are graded levels of compatibility.”

Regulatory elements in DNA (brown and orange) are often tightly wrapped around proteins. Credit: Mol* image from the RCSB PDB (RCSB.org) of PDB ID 8XRJ (T. Kujirai et al./Nature Commun.)

Van Steensel’s team often assembles libraries by simply breaking chromosomal DNA into small fragments, which allows the researchers to explore the natural genomic landscape. But when the goal is to learn why a given element works the way it does, it’s sometimes preferable to synthesize DNA that contains combinations of known regulatory sequences. Ahituv says his team starts with neutral sequences. “Then we start putting ‘words’ on them, like transcription factor-binding sites, and playing around with the spacing, the order, the amount, the orientation, and seeing what works and what doesn’t.” Alternatively, his group might systematically mutate a specific enhancer or promoter to determine the consequences.

But MPRAs can survey only so much of the genomic terrain. Synthetic DNA libraries become difficult and expensive to manufacture as they get longer, and there are limits to the size of DNA that can be packaged into a plasmid or lentivirus. Van Steensel says that it is unusual for an MPRA library to contain sequences longer than 1,000 bases. That is a serious limitation, because natural genome regulatory regions can be much larger.

MPRAs thus provide a reductionist view of gene regulation. For example, there are open questions around how well different MPRA designs replicate the natural distribution of chromatin proteins on a given sequence in the genome itself, and the behaviour of genomic elements in MPRAs might not reflect how they work in their normal genomic context. “Reporter assays tell you what a sequence can do, not necessarily what it actually does in the genome,” says Tewhey.
As such, MPRAs require considerable validation — which can be done, for instance, using CRISPR-based gene-editing strategies or transgenic animal models.

Alternative assay designs are emerging that can yield more biologically meaningful results. Last year, the Ahituv lab described a method called Capture-C, in which researchers first identify regulatory elements that interact physically and then use those as the building blocks for an MPRA experiment [5]. The method has proved especially effective at isolating silencer elements that repress rather than stimulate gene expression, says Ahituv. “We found over 1,000 silencers in our assay, which historically have been very hard to characterize.”

Making sense of mutations

Gene regulation is a matter of location, timing and environmental conditions, and an MPRA performed on a static cell culture will inevitably overlook elements that are active only under certain conditions. Accordingly, many researchers are developing experimental variants to document how different triggers affect regulatory activity.

Anat Kreimer, a computational biologist at Rutgers University in Piscataway, New Jersey, is using MPRAs to understand the regulatory pathways underlying normal and aberrant brain development. In one series of studies, she and her colleagues collected MPRA data over multiple time points to document changes in enhancer activity as embryonic stem cells develop into the progenitors of mature neurons. “We came up with a computational framework to understand which transcription factors are relevant for neural differentiation,” Kreimer says. She, Ahituv and their colleagues have used those data to reconstruct a regulatory blueprint for brain development and show how both timing and the cellular environment modulate gene activation or repression by different regulatory domains [6].
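
For readers who want a concrete picture of a time-course readout like the one described above, here is a minimal Python sketch. It is not the computational framework used by Kreimer's group or any other lab mentioned here. It assumes a common convention in which each element's activity is the log2 ratio of RNA barcode reads to DNA barcode reads, with all barcodes for an element pooled by summing. The function name, the record layout, the pseudocount and the counts are all hypothetical.

```python
import math
from collections import defaultdict


def activity_by_timepoint(records, pseudocount=1.0):
    """Return {(element, timepoint): log2(RNA/DNA)}, pooling barcodes per element.

    records: iterable of (element, timepoint, barcode, rna_count, dna_count).
    The record layout and the pseudocount are illustrative assumptions.
    """
    rna = defaultdict(float)
    dna = defaultdict(float)
    for element, timepoint, _barcode, rna_count, dna_count in records:
        key = (element, timepoint)
        # Summing over barcodes dilutes the influence of any single barcode.
        rna[key] += rna_count
        dna[key] += dna_count
    return {
        key: math.log2((rna[key] + pseudocount) / (dna[key] + pseudocount))
        for key in rna
    }


# Made-up counts for one hypothetical enhancer, two barcodes, two time points.
records = [
    ("enhancer_A", "0h", "bc1", 30, 20),
    ("enhancer_A", "0h", "bc2", 33, 43),
    ("enhancer_A", "72h", "bc1", 200, 25),
    ("enhancer_A", "72h", "bc2", 311, 38),
]

scores = activity_by_timepoint(records)
for (element, timepoint), score in sorted(scores.items()):
    print(element, timepoint, round(score, 2))
```

In this toy data set, the enhancer's score rises between the two time points, the kind of change a time-course MPRA is designed to detect. Summing barcodes is only one option: analyses can instead score each barcode separately and take a median, which is more robust to a single outlier barcode.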