Genome annotation guideline

The purpose of this guide is to provide advice, tips and rules for annotating yeast genomes. Inspection of existing annotations available in public databases reveals a high heterogeneity in the way genetic features are presented and formatted. This leads to significant difficulties in data handling, such as genome comparisons or automated data extraction. In addition, most of the annotation guidelines available on the web are incomplete, proposing solutions only for the most frequent cases.

It is worth noting that the solutions proposed in this guideline are not absolute rules but only suggestions that we adopt for GRYC data. With this guideline, our objectives are:

A few definitions

Genome annotations rely on feature table formats that use:

The complete lists of standard features and qualifiers are available on the INSDC website. They form the basis of the three major feature table formats: GenBank (NCBI, USA), EMBL (ENA, UK) and DDBJ (DDBJ, Japan).

Note that as a European resource, we work with the EMBL file format and hence, the illustrations provided in this guideline are formatted in EMBL. However, each of the proposed rules can be applied in both the DDBJ and GenBank formats.

Lastly, there are other formats for describing genome annotations, such as the GFF3 format, as well as other standards/ontologies for characterizing genetic entities (e.g., the Sequence Ontology). We do not discuss these alternatives here, but some of them inspired our annotation guideline.

Concept of feature hierarchy

In genome annotation files, genetic elements are represented by feature entries (e.g., gene, CDS, tRNA). The GRYC database relies on a key concept, which is the definition of a hierarchy between these features. Note that this concept is not fully supported by the EMBL, GenBank and DDBJ formats, whereas it is central in GFF3.

The feature hierarchy we propose relies on 5 feature categories:

Types of genetic features considered

The INSDC website provides an exhaustive list of features covering all domains of genetic/genomic annotation. Here is the selection of features that we consider for annotating yeast genomes and that are considered in the GRYC database:

Feature name | Hierarchical level | Short description | Mandatory qualifiers
assembly_gap | Chromosomal region | Gapped region within a scaffold | /estimated_length; /gap_type; /linkage_evidence
CDS | Translated feature | Region that codes for a protein |
centromere | Chromosomal region | Region that contains the centromere |
gap | Chromosomal region | Region with undetermined base(s) (N) | /estimated_length
gene | Locus feature | Global location of a gene |
misc_binding | Regulatory feature | Site that binds another moiety | /bound_moiety
misc_feature | Locus feature | A locus of interest that cannot be described with any other locus feature |
misc_RNA | Transcribed feature | Other transcribed RNA product |
mobile_element | Locus feature | Mobile element locus | /mobile_element_type
mRNA | Transcribed feature | Messenger RNA feature |
ncRNA | Transcribed feature | RNA from a non-protein-coding gene (other than tRNA and rRNA) | /ncRNA_class
polyA_site | Regulatory feature | Site of post-transcriptional polyadenylation |
protein_bind | Regulatory feature | Non-covalent protein binding site on nucleic acid | /bound_moiety
regulatory | Regulatory feature | Any region that functions in regulation | /regulatory_class
repeat_region | Regulatory feature OR Locus feature | Region containing repeat units |
rRNA | Transcribed feature | Ribosomal RNA |
sig_peptide | Regulatory feature | Signal peptide coding sequence |
source | Chromosomal region | Global sequence description | /organism; /mol_type
telomere | Chromosomal region | Region identified as a telomere |
tRNA | Transcribed feature | Transfer RNA |

Naming convention

There are no absolute rules for naming the assembled sequences and the annotated genes. However, it is worth noting that the INSDC databases have some requirements concerning sequence and gene names. In addition, not respecting a convention leads to a high heterogeneity of sequence/gene names, which unnecessarily complicates analyses.

Defining a prefix for each strain

When we prepare our genomic data, we use a prefix to build both sequence and feature labels. This prefix must contain only capital letters and numbers. By convention, we often use the genus and the species names to build this prefix, followed by a number to distinguish the strains of the same species. For example, the first strain of Monosporozyma unispora would have the prefix MOUN0, the second strain would be MOUN1, and so on. When creating a BioProject to deposit genomic data in an INSDC database, it is generally possible to reserve this prefix to ensure that it will be used and maintained by the different public databases once the data have been submitted.
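The prefix convention can be illustrated with a minimal Python sketch. The function name `make_strain_prefix` is hypothetical, and the rule of taking two letters from the genus and two from the species is an assumption inferred from the MOUN0 example; the guideline itself does not fix the number of letters.

```python
import re


def make_strain_prefix(genus: str, species: str, strain_index: int) -> str:
    """Build a strain prefix such as MOUN0 from a binomial name.

    Assumption: two letters from the genus plus two from the species,
    as in the MOUN0 example; adapt this if your convention differs.
    """
    if strain_index < 0:
        raise ValueError("strain_index must be non-negative")
    prefix = (genus[:2] + species[:2]).upper() + str(strain_index)
    # The guideline requires capital letters and digits only.
    if not re.fullmatch(r"[A-Z0-9]+", prefix):
        raise ValueError(f"invalid characters in prefix: {prefix!r}")
    return prefix


print(make_strain_prefix("Monosporozyma", "unispora", 0))  # MOUN0
print(make_strain_prefix("Monosporozyma", "unispora", 1))  # MOUN1
```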

Sequence labels

Labels of genomic sequences are based on the prefix. For a chromosome-level assembly, sequences are enumerated by letters and chromosomes should be ordered by increasing length. Hence, the shortest chromosome should be chromosome A. For scaffold- and contig-level assemblies, sequences are numbered and should be ordered by decreasing length. For consistency with locus labels (see below), it is recommended to separate the prefix and the sequence letter/number with an underscore ("_").

Here are two examples:

Note: when considering a scaffold and contig level assembly, the letter "S" is added to the end of the label to avoid confusion with the locus number (see below).
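The ordering rules for sequence labels can be sketched in Python as follows. The function names are hypothetical, and two details are assumptions not stated in the guideline: scaffold numbering starts at 1, and numbers are not zero-padded.

```python
import string


def chromosome_labels(prefix: str, lengths: list[int]) -> list[str]:
    """Label chromosomes A, B, C... by increasing length (shortest is A).

    Labels are returned in the same order as the input lengths.
    """
    if len(lengths) > len(string.ascii_uppercase):
        raise ValueError("more chromosomes than available letters")
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    labels = [""] * len(lengths)
    for rank, index in enumerate(order):
        labels[index] = f"{prefix}_{string.ascii_uppercase[rank]}"
    return labels


def scaffold_labels(prefix: str, lengths: list[int]) -> list[str]:
    """Number scaffolds/contigs by decreasing length, with a trailing 'S'.

    Assumption: numbering starts at 1 and is not zero-padded.
    """
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    labels = [""] * len(lengths)
    for rank, index in enumerate(order):
        labels[index] = f"{prefix}_{rank + 1}S"
    return labels


print(chromosome_labels("YEAT0", [900, 300, 500]))
print(scaffold_labels("YEAT0", [100, 400, 250]))
```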

Once your data is submitted to a public database, sequence labels are generally replaced by accession numbers. However, the labels you have defined can be retained in the sequence description as well as in the source feature of the annotation.

Locus labels

According to the various INSDC database guidelines, a locus label (/locus_tag) must respect the following rules:

To comply with these rules, we generally use the sequence label (see previous section) followed by 5 digits. The value of the first locus of a given sequence is 00100. The values of the following loci are incremented by 22. In this way, it will be possible to insert missing loci into the annotation without disturbing the order of the locus labels.

Here are some examples:

This suggested locus naming rule allows up to about 4,500 loci to be labelled on the same chromosome/sequence, which is generally enough for yeast. If you need to label more loci on a single sequence, use a smaller increment or use 6 digits.
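As a rough illustration of this numbering scheme, the following Python sketch generates locus labels and computes how many fit on one sequence. The function names and parameter names are hypothetical; the default values (first value 100, step 22, 5 digits) are those given above.

```python
def locus_tags(sequence_label: str, count: int, first: int = 100,
               step: int = 22, digits: int = 5) -> list[str]:
    """Generate `count` locus labels for one sequence."""
    max_value = 10 ** digits - 1
    tags = []
    for i in range(count):
        value = first + i * step
        if value > max_value:
            raise ValueError("too many loci for this number of digits")
        # Zero-pad the locus number to the requested width.
        tags.append(f"{sequence_label}{value:0{digits}d}")
    return tags


def capacity(first: int = 100, step: int = 22, digits: int = 5) -> int:
    """Number of loci that can be labelled on a single sequence."""
    return (10 ** digits - 1 - first) // step + 1


print(locus_tags("YEAT0_A", 3))  # YEAT0_A00100, YEAT0_A00122, YEAT0_A00144
print(capacity())                # 4541, i.e. about 4,500
```

Because consecutive loci are 22 apart, a locus discovered later between YEAT0_A00100 and YEAT0_A00122 can receive any unused value in between without renumbering its neighbours.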

Detailed feature annotation

Protein coding genes

Protein-coding genes are the most frequent features in genome annotations. Such a gene generally consists of at least one CDS (coding DNA sequence) feature, which gives the coordinates of the sequence(s) of the gene that are translated into protein. The CDS often includes functional information about the gene. Protein-coding genes may include mRNA feature(s) that give the complete exon coordinates, including the untranslated regions (UTRs). The entire locus is represented by a gene feature. Its start and end coordinates are often marked with the symbols < and >, respectively (e.g., <1200..>5300), indicating that the locus may extend beyond them. In addition, a gene feature cannot have joined coordinates; multiple exons are given in the mRNA and CDS features. Lastly, structural and functional regulatory elements may also be described with regulatory features or other dedicated features.

For complex multi-exon genes, it is necessary to specify gene, mRNA and CDS features to allow the identification of possible alternative transcription/translation products as well as the location of untranslated regions.

Single protein-coding gene

Here, we consider the most frequent situation, a single protein-coding gene. The coding sequence is given by the CDS feature.


FT   gene            <1200..>5300
FT                   /locus_tag="YEAT0_A00100"
FT   regulatory      1200..1220
FT                   /locus_tag="YEAT0_A00100"
FT                   /regulatory_class="promoter"
FT   mRNA            join(1200..1350,1540..3560,4020..5300)
FT                   /locus_tag="YEAT0_A00100"
FT   CDS             join(1950..3560,4020..5125)
FT                   /gene="XXX1"
FT                   /locus_tag="YEAT0_A00100"
FT                   /note="Annotation comments..."
FT                   /product="Putative protein..."
FT                   /translation="MSPRTIA..."
FT   polyA_site      5299..5300
FT                   /locus_tag="YEAT0_A00100"

Simple CDS
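The nesting of gene, mRNA and CDS coordinates in such an entry can be checked with a small script. The following Python sketch is only an illustration: the function names are hypothetical, and the parser handles plain ranges and join() only (partial markers are ignored; complement() and single-base positions are not handled).

```python
import re


def parse_location(location: str) -> list[tuple[int, int]]:
    """Return (start, end) pairs from a simple range or a join() location.

    The partial markers < and > are ignored.
    """
    pairs = re.findall(r"<?(\d+)\.\.>?(\d+)", location)
    return [(int(start), int(end)) for start, end in pairs]


def is_within(inner: list[tuple[int, int]],
              outer: list[tuple[int, int]]) -> bool:
    """True if every inner segment lies inside at least one outer segment."""
    return all(
        any(o_start <= start and end <= o_end for o_start, o_end in outer)
        for start, end in inner
    )


gene = parse_location("<1200..>5300")
mrna = parse_location("join(1200..1350,1540..3560,4020..5300)")
cds = parse_location("join(1950..3560,4020..5125)")

print(is_within(mrna, gene))  # each exon lies inside the gene span
print(is_within(cds, mrna))   # each coding segment lies inside an exon
```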

Multiple transcription/translation protein coding gene

When annotating a locus with multiple products, which may result from multiple transcription or translation, each feature must be reported separately and must contain the same /locus_tag.


FT   gene            <1200..>5300
FT                   /locus_tag="YEAT0_A00100"
FT   regulatory      1200..1220
FT                   /locus_tag="YEAT0_A00100"
FT                   /regulatory_class="promoter"
FT   mRNA            join(1200..1350,1540..3560,4020..5300)
FT                   /locus_tag="YEAT0_A00100"
FT   mRNA            join(1200..1300,1540..3560,4020..5300)
FT                   /locus_tag="YEAT0_A00100"
FT   CDS             join(1290..1300,1540..3560,4020..5125)
FT                   /gene="XXX1"
FT                   /locus_tag="YEAT0_A00100"
FT                   /note="Long form..."
FT                   /product="Putative protein..."
FT                   /translation="MKAAREY..."
FT   CDS             join(1950..3560,4020..5125)
FT                   /gene="XXX1"
FT                   /locus_tag="YEAT0_A00100"
FT                   /note="Short form..."
FT                   /product="Putative protein..."
FT                   /translation="MSPRTIA..."
FT   polyA_site      5299..5300
FT                   /locus_tag="YEAT0_A00100"

Multiple CDS

Pseudo protein coding gene

Pseudogenes can be annotated in the same way as regular protein-coding genes, except that the CDS feature cannot contain /product and /translation qualifiers. It must also contain a qualifier indicating that it is a pseudogene. There are two possible qualifiers:


FT   gene            <1200..>5300
FT                   /locus_tag="YEAT0_A00100"
FT   mRNA            join(1200..1350,1540..5300)
FT                   /locus_tag="YEAT0_A00100"
FT   CDS             join(1290..1300,1540..4021,4020..5125)
FT                   /gene="XXX1"
FT                   /locus_tag="YEAT0_A00100"
FT                   /note="Annotation comments..."
FT                   /pseudogene="unknown"

pseudo CDS

Transposable elements (TEs)

Complete transposable element

For complete (or almost complete) transposable element features, the locus should be represented by a mobile_element feature. This feature type requires a mandatory qualifier, which is /mobile_element_type="mobile_element_type[:mobile_element_name]". Authorized values for this mobile_element_type are: "transposon", "retrotransposon", "integron", "insertion sequence", "non-LTR retrotransposon", "SINE", "MITE", "LINE", and "other". It is also possible to specify the name of the element after the type value (separated by a colon). For example, a Ty1 LTR retrotransposon can have the following qualifier: /mobile_element_type="retrotransposon:Ty1".

Terminal repeats such as LTRs or TIRs should be annotated with repeat_region features. It is recommended to use the qualifier /rpt_type=type (the value type is given without quotes). All possible values for this qualifier are provided on the INSDC website. For transposable elements, it can be long_terminal_repeat, inverted, non_ltr_retrotransposon_polymeric_tract, nested, terminal or other. The qualifier /rpt_family="text" can be used to provide more information about the repeat region, such as the TE family/name.

Below is an example of an LTR retrotransposon annotation:


FT   mobile_element  <5000..>11200
FT                   /locus_tag="YEAT0_A00144"
FT                   /mobile_element_type="retrotransposon:Ty1"
FT   repeat_region   5000..5250
FT                   /locus_tag="YEAT0_A00144"
FT                   /rpt_type=long_terminal_repeat
FT                   /rpt_family="LTR of a Ty1..."
FT   mRNA            <5000..>11200
FT                   /locus_tag="YEAT0_A00144"
FT   CDS             5300..6200
FT                   /locus_tag="YEAT0_A00144"
FT                   /gene="GAG"
FT                   /product="putative transposon..."
FT                   /note="retrotransposon GAG gene..."
FT                   /translation="MSNESKFDSKA..."
FT   CDS             5300..10980
FT                   /locus_tag="YEAT0_A00144"
FT                   /gene="GAG-POL"
FT                   /product="putative transposon..."
FT                   /note="retrotransposon GAG-POL gene..."
FT                   /ribosomal_slippage
FT                   /translation="MSNESKFDSKA..."
FT   repeat_region   10950..11200
FT                   /locus_tag="YEAT0_A00144"
FT                   /rpt_type=long_terminal_repeat
FT                   /rpt_family="LTR of a Ty1..."

Complete TE

Relics of transposable elements

When considering relics of transposable elements, such as solo LTRs, it is recommended not to use a mobile_element feature, but to use a repeat_region feature directly.


FT   repeat_region   15200..15510
FT                   /locus_tag="YEAT0_A00188"
FT                   /rpt_type=long_terminal_repeat
FT                   /rpt_family="solo LTR of a Ty1..."

TE relic

Non-protein-coding genes

Several features are available for annotating non-protein-coding genes, namely tRNA, rRNA, ncRNA and misc_RNA. They should be associated with a gene feature at the locus level. Only the ncRNA feature has a mandatory qualifier, /ncRNA_class="class", whose possible values are listed in the INSDC controlled vocabulary.

Here is an example of a transfer RNA:


FT   gene            21457..21530
FT                   /locus_tag="YEAT0_A00364"
FT   tRNA            21457..21530
FT                   /locus_tag="YEAT0_A00364"
FT                   /gene="tRNA-Asn(GTT)"
FT                   /product="transfer RNA-Asn(GTT)"

tRNA feature