MIT’s Comprehensive Map of the SARS-CoV-2 Genome and Analysis of Nearly 2,000 COVID Mutations

Publish Date:
12 May, 2021

MIT researchers generated what they describe as the most complete gene annotation of the SARS-CoV-2 genome. Credit: MIT News

MIT researchers have determined the virus’s protein-coding gene set and analyzed how likely new mutations are to help the virus adapt.

In early 2020, a few months after the Covid-19 pandemic started, scientists were able to sequence the entire genome of SARS-CoV-2, the virus that causes the Covid-19 infection. Although many of its genes were already known at the time, the full complement of protein-coding genes had not yet been resolved.

Now, after conducting an extensive comparative genomics study, MIT researchers have generated what they describe as the most accurate and complete gene annotation of the SARS-CoV-2 genome. In their study, published in Nature Communications on May 11, 2021, they confirmed several protein-coding genes and found that a few others suggested as genes do not code for proteins.

“We were able to use this powerful comparative genomics approach for evolutionary signatures to discover the true functional protein-coding content of this hugely important genome,” said Manolis Kellis, the study’s senior author, a professor of computer science at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) and a member of the Broad Institute of MIT and Harvard.

The research team also analyzed nearly 2,000 mutations that have arisen in various SARS-CoV-2 isolates since the virus began infecting humans, allowing them to assess how important those mutations may be in altering the virus’s ability to evade the immune system or become more contagious.

Comparative Genomics

The SARS-CoV-2 genome consists of nearly 30,000 RNA bases. Scientists have identified several regions known to encode protein-coding genes, based on their similarity to protein-coding genes found in related viruses. A few other regions were suspected to encode proteins, but they were not definitively classified as protein-coding genes.

To determine which parts of the SARS-CoV-2 genome actually contain genes, the researchers conducted a type of study known as comparative genomics, in which they compare the genomes of similar viruses. The SARS-CoV-2 virus belongs to a subgenus of viruses called Sarbecovirus, most of which infect bats. The researchers conducted their analysis on SARS-CoV-2, SARS-CoV (which triggered the 2003 SARS outbreak) and 42 strains of bat sarbecoviruses.

Kellis has previously developed computational techniques to perform this type of analysis, which his team has also used to compare the human genome with genomes from other mammals. The techniques are based on analyzing whether certain DNA or RNA bases are conserved between species and comparing their evolution patterns over time.
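As a rough illustration of what "conserved between species" means, the sketch below scores each column of a tiny, made-up alignment by the fraction of sequences that share the most common base. This is a toy measure under simplifying assumptions, not the technique used in the study; the function name and the sequences are hypothetical.

```python
from collections import Counter


def column_conservation(aligned_seqs):
    """For each alignment column, return the fraction of sequences
    that carry the most common base at that position."""
    if not aligned_seqs:
        return []
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("all aligned sequences must have the same length")

    scores = []
    for i in range(length):
        column = [seq[i] for seq in aligned_seqs]
        most_common_count = Counter(column).most_common(1)[0][1]
        scores.append(most_common_count / len(column))
    return scores


# Hypothetical aligned RNA fragments from four related viruses.
toy_alignment = ["AUGGCU", "AUGGCA", "AUGACG", "AUGGCC"]
scores = column_conservation(toy_alignment)
print(scores)
```

In this toy alignment the first three positions are identical in every sequence, while the last position differs in all four, so it receives the lowest score.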

Using these techniques, the researchers confirmed six protein-coding genes in the SARS-CoV-2 genome in addition to the five that are well established in all coronaviruses. They also found that the region encoding a gene called ORF3a encodes an additional gene, which they call ORF3c. The gene has RNA bases that overlap with ORF3a but are read in a different reading frame. Such a gene within a gene is rare in large genomes, but is common in many viruses, whose genomes are under selective pressure to remain compact. The role of this new gene, as well as several other SARS-CoV-2 genes, is not yet known.
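To see how the same RNA bases can spell out two different proteins, the short sketch below translates one made-up sequence in two reading frames using the standard genetic code. The sequence is hypothetical and is not taken from ORF3a or ORF3c; the helper name `translate` is likewise just for illustration.

```python
from itertools import product

# Standard genetic code, listed in U, C, A, G order; "*" marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(rna, frame=0):
    """Translate an RNA string starting at offset `frame` (0, 1 or 2),
    ignoring any incomplete codon at the end."""
    protein = []
    for i in range(frame, len(rna) - 2, 3):
        protein.append(CODON_TABLE[rna[i:i + 3]])
    return "".join(protein)


# Hypothetical nine-base sequence read in two different frames.
toy_rna = "AUGGCUUAA"
frame0 = translate(toy_rna, frame=0)  # codons: AUG GCU UAA
frame1 = translate(toy_rna, frame=1)  # codons: UGG CUU
print(frame0, frame1)
```

Shifting the starting point by a single base changes every codon, which is why an overlapping gene can encode an entirely different protein from the gene it sits inside.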

The researchers also showed that five other regions proposed as possible genes do not code for functional proteins, and they ruled out the possibility that any more conserved protein-coding genes remain to be discovered.

“We have analyzed the entire genome and are confident that there are no other conserved protein-coding genes,” said Irwin Jungreis, lead author of the study and a researcher at CSAIL. “Experimental studies are needed to find out the functions of the uncharacterized genes, and by determining which ones are real, we let other researchers focus their attention on those genes instead of spending their time on something that doesn’t even get translated into protein.”

The researchers also noted that many previous papers used not only incorrect gene sets, but also sometimes conflicting gene names. To remedy the situation, they brought the SARS-CoV-2 community together and presented a set of recommendations for naming SARS-CoV-2 genes, in a separate article published in Virology a few weeks ago.

Rapid evolution

In the new study, the researchers also analyzed more than 1,800 mutations that have arisen in SARS-CoV-2 since it was first identified. For each gene, they compared how quickly that particular gene has evolved in the past with how much it has evolved since the start of the current pandemic.
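One simple way to picture that per-gene comparison is to ask how many recent mutations a gene would be expected to carry given its length and historical rate, then compare that with what was observed. The sketch below does this with invented numbers; the gene labels, rates, and the proportional model are assumptions made for illustration, not the statistics used in the study.

```python
def mutation_enrichment(observed_counts, historical_rates, lengths):
    """Return observed / expected mutation counts per gene, where the
    expected share is proportional to historical rate times gene length."""
    total_observed = sum(observed_counts.values())
    weights = {
        gene: historical_rates[gene] * lengths[gene]
        for gene in observed_counts
    }
    total_weight = sum(weights.values())

    ratios = {}
    for gene, observed in observed_counts.items():
        expected = total_observed * weights[gene] / total_weight
        ratios[gene] = observed / expected
    return ratios


# Hypothetical toy values; these are not measurements from the paper.
observed = {"gene_A": 30, "gene_B": 10}
rates = {"gene_A": 1.0, "gene_B": 1.0}
lengths = {"gene_A": 100, "gene_B": 100}

ratios = mutation_enrichment(observed, rates, lengths)
print(ratios)
```

A ratio well above 1 would mark a gene that is changing faster than its history predicts, and a ratio below 1 a gene that is changing more slowly.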

They found that in most cases, genes that evolved rapidly over a long period of time before the current pandemic have continued to do so, and genes that tended to evolve slowly have maintained that trend. However, the researchers also identified exceptions to these patterns, which may shed light on how the virus evolved as it adapted to its new human host, Kellis says.

In one example, the researchers identified a region of the nucleocapsid protein, which surrounds the viral genetic material, that had many more mutations than expected based on its historical evolutionary patterns. This protein region is also classified as a target of human B cells. Therefore, mutations in that region may help the virus evade the human immune system, Kellis says.

“The most accelerated region in the entire SARS-CoV-2 genome is smack in the middle of this nucleocapsid protein,” he says. “We speculate that those variants that do not mutate that region are recognized by the human immune system and eliminated, while those variants that randomly accumulate mutations in that region are in fact better able to bypass the human immune system and remain in circulation.”

The researchers also analyzed mutations that have arisen in variants of concern, such as the B.1.1.7 strain from England, the P.1 strain from Brazil, and the B.1.351 strain from South Africa. Many of the mutations that make those variants more dangerous are found in the spike protein and help the virus spread faster and avoid the immune system. However, each of those variants also has other mutations.

“Each of those variants has more than 20 other mutations, and it’s important to know which of those variants are likely to do something and which won’t,” Jungreis said. “So we used our comparative genomics evidence to get an initial estimate of which of these are likely to be important, based on which ones were in conserved positions.”
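The idea of ranking mutations by whether they fall in conserved positions can be sketched in a few lines. In the example below, each genome position has a conservation score between 0 and 1, and a mutation is flagged when its position scores at or above a cutoff. The scores, positions, and the 0.9 threshold are all hypothetical and are not the criteria used by the researchers.

```python
def flag_conserved_mutations(mutation_positions, conservation, threshold=0.9):
    """Map each mutated position to True if its conservation score
    meets the threshold, otherwise False."""
    return {
        pos: conservation[pos] >= threshold
        for pos in mutation_positions
    }


# Hypothetical per-position conservation scores (index = position).
toy_conservation = [1.0, 0.4, 0.95, 0.6]
# Hypothetical positions where a variant carries mutations.
toy_mutations = [0, 1, 2]

flags = flag_conserved_mutations(toy_mutations, toy_conservation)
print(flags)
```

Mutations flagged as True sit in positions that rarely changed across related viruses, making them more plausible candidates for having a functional effect.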

This data could help other scientists turn their attention to the mutations most likely to have significant effects on the infectivity of the virus, the researchers say. They have made the annotated gene set and their mutation classifications available at the University of California at Santa Cruz Genome Browser for other researchers who want to use it.

“We can now study the evolutionary context of these variants and understand how the current pandemic fits into that larger history,” says Kellis. “For strains that have many mutations, we can see which of these mutations are likely host-specific adaptations, and which mutations may be nothing to write home about.”

Reference: “SARS-CoV-2 Gene Content and COVID-19 Mutation Impact by Comparing 44 Sarbecovirus Genomes” by Irwin Jungreis, Rachel Sealfon and Manolis Kellis, May 11, 2021, Nature Communications.
DOI: 10.1038/s41467-021-22905-7

The research was funded by the National Human Genome Research Institute and the National Institutes of Health. Rachel Sealfon, a research scientist at the Flatiron Institute Center for Computational Biology, is also an author of the paper.