How and why to add a new uncultivated virus genome to a public database

15th December 2023

Advice on how to submit a new virus genome to public databases, to keep pace with changing viral taxonomy

As we learn more about microbiomes, we are discovering more and more new viruses. DNA and RNA sequencing are revealing new parts of the viral world we did not know existed.

Viruses discovered through sequencing often have not been, or cannot be, grown or cultured in the lab. This means that we only have the DNA or RNA sequence to use when sorting new viruses into groups of related viruses.

Taxonomy, the way that we group organisms or things together, can change over time as we learn more about them. In 2005, there was a five-rank structure to classify viruses. Since 2019, we have used a 15-rank structure, which introduces new ranks like class, phylum, kingdom and realm. Both genetic information and physical properties are used in the current principles of viral taxonomy.

For a virus to be classified into the official virus taxonomy framework developed and maintained by the International Committee on Taxonomy of Viruses (ICTV), it needs to be added to a public database. This is so that scientists can keep track of information about the viruses over time, in case they need to be reclassified. These public databases include GenBank at the National Center for Biotechnology Information (NCBI), the European Nucleotide Archive (ENA) and the DNA Data Bank of Japan (DDBJ).

It’s important for new viruses to be added to these public databases so that other researchers around the world know about them and can use them in their research. The sequences of new viruses are particularly important for comparative genomic analyses or genomic epidemiology.

Researchers across the world, including the Quadram Institute’s Dr Evelien Adriaenssens, have developed guidelines to help scientists submit new viruses to these databases and uncover more about the hidden viral world.

A high-quality sequence

For a virus to be classified as the first representative of a new species, it needs to have a high-quality genome sequence that is publicly available.

The sequence must be annotated, which means that the function of each part of the genome has been identified and labelled. As part of this annotation, all coding parts of the sequence should have been fully sequenced.

There are lots of potentially new viral sequences in other public repositories, but often these sequences are not annotated. If annotated, these sequences could be submitted to the approved public databases and be formally recognised as new viruses. Even if the data is in the public domain, it is recommended that you contact the original data depositors to let them know that you plan to use it.

As these new viruses have not been cultured in the lab, we can’t use the “complete genome” tag for the virus isolate or genome name. A genome can only be called complete if its completeness has been verified through experiments in the lab. Currently, the only alternative to “complete genome” in GenBank is “partial genome”, which should be used for these new uncultured virus genomes.
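As a rough illustration of the rule above, the choice of tag can be written as a small helper. This is a minimal sketch, not part of the published guidelines: the function names, the layout of the genome name and the isolate code `QIB-2023-0001` are all made up for this example.

```python
def genome_completeness_tag(experimentally_verified: bool) -> str:
    """Pick the completeness wording for a genome name.

    Only a genome whose completeness was verified in the lab may use
    "complete genome"; everything else falls back to "partial genome".
    """
    return "complete genome" if experimentally_verified else "partial genome"


def genome_name(isolate_code: str, experimentally_verified: bool = False) -> str:
    # Hypothetical naming layout, used here for illustration only.
    tag = genome_completeness_tag(experimentally_verified)
    return f"isolate {isolate_code}, {tag}"


# An uncultivated virus found by sequencing defaults to "partial genome".
print(genome_name("QIB-2023-0001"))
```

Defaulting `experimentally_verified` to `False` means the stricter tag has to be chosen deliberately.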

Metadata

When you submit a new virus to a public database, you don’t just submit the DNA or RNA sequence. You should also add other information about the virus and the sample it was found in.

It’s important that certain pieces of this information don’t reference the current taxonomy, or way things are grouped. This is so that the virus can be reclassified in the future as taxonomy changes and new groups are identified.

One field in the database where it is important not to mention the current taxonomy is a field called “Isolate”. If you enter “new coronavirus 5” here, the new virus may turn out not to be a coronavirus in the current or a future classification. Instead, you should use a unique name or code for the isolate field.
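A simple check before submission can catch isolate names that mention a taxon. The sketch below is only an illustration: the pattern is an incomplete guess at word endings that suggest a taxonomic name, the function name is hypothetical, and the example code `QI-gut-0042` is invented. The databases do not necessarily apply a rule like this themselves.

```python
import re

# Illustrative, incomplete pattern for word fragments that hint at a taxon
# name (for example "coronavirus" or "Coronaviridae"). This is an assumption
# made for this sketch, not an official rule set.
TAXON_HINTS = re.compile(r"vir(us|idae|inae|ales)", re.IGNORECASE)


def isolate_name_problems(name: str) -> list[str]:
    """Return a list of reasons why an isolate name may need changing."""
    problems = []
    if not name.strip():
        problems.append("isolate name is empty")
    if TAXON_HINTS.search(name):
        problems.append("isolate name appears to reference a taxon")
    return problems


print(isolate_name_problems("new coronavirus 5"))
print(isolate_name_problems("QI-gut-0042"))
```

A neutral code such as the second example stays valid even if the virus is later moved to a different group.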

Adding metadata is also important to make sure that the data is findable, accessible, interoperable and reusable (FAIR) so it is useful to others across the world.

Where the virus came from

Often we find new virus sequences through metagenomics. Metagenomics is when one sample containing several different organisms or viruses is sequenced together. It’s often used to study microbiomes.

When a new virus is discovered through metagenomics, it is good to add information about which metagenomic sample it came from when you submit it to a public database. Doing so allows for easier reuse of the data in large-scale analyses, such as the creation of human virome databases or investigations into viruses associated with certain species of plants.
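One way to make sure the source sample is not forgotten is to keep the metadata in a small record and check it before submitting. The sketch below assumes a made-up set of field names; the real submission forms at GenBank, ENA and DDBJ use their own fields, and every value shown is a placeholder.

```python
from dataclasses import asdict, dataclass


@dataclass
class VirusSubmission:
    # Hypothetical field names, chosen for this illustration only.
    isolate: str            # unique, taxonomy-free name or code
    sequence_file: str      # path to the annotated sequence
    source_metagenome: str  # identifier of the metagenomic sample
    sample_source: str      # what was sampled, e.g. host or environment


def missing_fields(record: VirusSubmission) -> list[str]:
    """List the metadata fields that were left blank."""
    return [key for key, value in asdict(record).items() if not value.strip()]


record = VirusSubmission(
    isolate="QI-gut-0042",        # placeholder code
    sequence_file="genome.gbk",   # placeholder file name
    source_metagenome="",         # forgotten on purpose
    sample_source="human gut",
)
print(missing_fields(record))
```

Here the check reports `source_metagenome` as missing, which is the link that later large-scale analyses would rely on.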

Future of virus taxonomy

The speed of sequencing viruses is fast outpacing our ability to classify them. We need to be able to keep track of these new sequences in an organised way so we can classify – and if necessary reclassify – viruses as we learn more about them.

Adding new virus data to public databases in a consistent way across the globe will help us learn more about the viral world now and in the future.

This blog is adapted from the article “Guidelines for public database submission of uncultivated virus genome sequences for taxonomic classification”.
