A new set of protocols and pipelines, implemented in an Alzheimer’s disease research project, will open up previously inaccessible regions of the genome

September 14, 2023

By Rose Miyatsu

Researchers at UC Santa Cruz’s Computational Genomics Lab and their collaborators have released new wet-lab and computational protocols that will make long-read sequencing feasible for large genomics projects. These protocols, which they have already implemented in a National Institutes of Health project for Alzheimer’s research, will allow researchers to characterize regions of the genome that were previously inaccessible with short-read technology. This could have major implications for future studies on human health, as a number of these unexplored regions contain medically relevant variation.

In a paper published today in Nature Methods, the researchers have made their data, sequencing protocols, and informatics pipelines freely available for others to use in future large-scale genomics projects.

The importance of long reads

In projects that involve DNA and RNA sequencing, researchers can choose among a number of methods for reading the genetic material, each with its own advantages and disadvantages. These methods fall into two categories: “short read” and “long read” technologies.

Short-read sequencing reads many small sections of DNA from the same sample, which are then pieced together using computational tools that identify where these small fragments overlap in order to reconstruct the full sequence. This method is usually highly accurate, but it has limitations. For example, short reads don’t work well for regions of DNA with a lot of repetition, and they are poor at identifying large differences between sequences, known as structural variants, that are longer than the total length of the short reads.
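As a rough illustration of that overlap-and-merge idea, the toy Python sketch below greedily stitches a few short reads together. The function names, reads, and minimum-overlap value are made up for this example; real assemblers, and the pipelines described in the paper, are far more sophisticated.

```python
# Toy illustration of overlap-based assembly: merge short reads by finding
# where the end of one read matches the start of the next. This is only a
# sketch of the core idea, not real assembly software.

def overlap_length(a: str, b: str, min_overlap: int = 3) -> int:
    """Return the length of the longest suffix of `a` that is a prefix of `b`."""
    for length in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0

def merge_reads(reads: list[str]) -> str:
    """Greedily stitch reads together using their overlaps (toy example only)."""
    assembled = reads[0]
    for read in reads[1:]:
        ovl = overlap_length(assembled, read)
        assembled += read[ovl:]  # append only the non-overlapping tail
    return assembled

# Short reads covering the sequence "GATTACAGATT", in order:
reads = ["GATTACA", "TACAGAT", "AGATT"]
print(merge_reads(reads))  # GATTACAGATT
```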

Long-read sequencing, such as nanopore sequencing, is a newer technology that sequences longer sections of DNA or RNA at once. This allows researchers to detect more variation and to sequence heavily repetitive sections of the genome that are inaccessible with short reads.

Projects such as the Telomere-to-Telomere (T2T) Consortium’s completion of a human genome and the assembly of a Human Pangenome reference have showcased how effective long reads can be for assembling sections of the genome that had previously been inaccessible with short-read technology, and how they could help us learn more about genetic disease and human variation. This has created a strong desire among the scientific community to use long reads in more large-scale projects to study medically relevant portions of the genome. 

However, there have been a number of barriers to doing so. For one, the current method for preparing DNA for long reads has lower yields, which means researchers have to set up multiple costly flow cells to get the same output. Additionally, long reads are slightly less accurate than short reads and must be sequenced at higher coverage to compensate, although the technology is steadily improving in accuracy. Projects such as T2T have overcome these problems by combining multiple sequencing techniques to get full coverage, but this is laborious and expensive, so researchers have been looking for ways to make studies using nanopore long-read sequencing alone feasible.

“It may sound simple, but creating scalable protocols to make them practical for large cohorts is a big deal,” said Benedict Paten, associate professor of biomolecular engineering and senior author on the paper. “Many groups are working on this, and I believe our paper is the first to establish a protocol that we have demonstrated is cost-efficient and practical enough to be used for large-scale genomic studies.”

Creating open-source resources 

The new protocols and pipelines outlined in the Nature Methods paper make it possible to detect variation using only Oxford Nanopore Technologies (ONT) reads, at a cost similar to that of short-read experiments. They can also accurately detect methylation, a chemical modification of DNA that can alter gene expression, which will also be important for medical studies.
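To give a sense of what a methylation summary can look like, the minimal Python sketch below counts, at each reference position, the fraction of reads whose basecaller flagged the site as methylated. The record format and the 0.8 probability threshold are assumptions for this illustration; the actual pipeline works from basecaller output such as modified-base tags in alignment files, using dedicated tools rather than this sketch.

```python
# Toy summary of per-site methylation from read-level modified-base calls.
# The input records and the 0.8 cutoff are hypothetical, for illustration only.

from collections import defaultdict

# (reference_position, probability_that_the_base_is_methylated) per read
read_calls = [
    (1000, 0.95), (1000, 0.90), (1000, 0.10),
    (1042, 0.05), (1042, 0.12),
]

THRESHOLD = 0.8  # hypothetical cutoff for calling a site methylated on a read

coverage = defaultdict(int)
methylated = defaultdict(int)
for pos, prob in read_calls:
    coverage[pos] += 1
    if prob >= THRESHOLD:
        methylated[pos] += 1

for pos in sorted(coverage):
    frac = methylated[pos] / coverage[pos]
    print(f"pos {pos}: {methylated[pos]}/{coverage[pos]} reads methylated ({frac:.0%})")
```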

The protocol has already been applied to cell lines and brain tissue samples in a pilot project for the NIH Center for Alzheimer’s and Related Dementias (CARD), and it has proven more effective than short-read sequencing at identifying structural variants that could be important to understanding disease.

The authors hope that the new protocols will be quickly adopted by the larger genomics community to make new discoveries. To better facilitate their use, they have worked to make the new resources as open as possible. In line with the mission of their parent organizations — the UC Santa Cruz Genomics Institute and Baskin School of Engineering — to promote open science, the Computational Genomics Lab has made the informatics pipeline available as an easy-to-run open-source software package. The cell line data is also available through the Terra workspace on the AnVIL platform.

The UC Santa Cruz Genomics Institute is housed under the Baskin School of Engineering and is one of the premier public institutions for storing, cataloging, assembling, validating, and analyzing huge volumes of genomic data. Its mission is to use genomics to positively impact health and nature. It creates advanced technologies and open-source genomics platforms to unravel evolutionary patterns, molecular processes, and the underpinnings of disease. The Genomics Institute is dedicated to openly and responsibly sharing what it learns and creates in order to contribute to creating a healthier world.