Genome annotation guideline

The purpose of this guide is to provide advice, tips and rules for annotating yeast genomes. Inspection of existing annotations in public databases reveals a high heterogeneity in the way genetic features are presented and formatted. This leads to significant difficulties in data handling, for example in genome comparisons or automated data extraction. In addition, most of the annotation guidelines available on the web are incomplete and propose solutions only for the most frequent cases.

It is worth noting that the solutions proposed in this guideline are not absolute rules but only suggestions that we adopt for GRYC data. With this guideline, our objectives are:

To comply with the INSDC annotation rules, to ensure compatibility with the main public databases.

To propose solutions for each type of locus/genetic feature.

To converge towards standardised yeast genome annotations.

A few definitions

Genome annotations rely on feature table formats that use:

Features to identify the different types of genetic elements (e.g., gene, CDS, tRNA),
Qualifiers to annotate features (e.g., /product, /note).

The complete lists of standard features and qualifiers are available on the INSDC website. They form the basis of the three major feature table formats: GenBank (NCBI, USA), EMBL (ENA, UK) and DDBJ (DDBJ, Japan).

Note that as a European resource, we work with the EMBL file format and hence, the illustrations provided in this guideline are formatted in EMBL. However, each of the proposed rules can be applied in both the DDBJ and GenBank formats.

Finally, there are other formats for describing genome annotations, such as GFF3, as well as other standards/ontologies for characterizing genetic entities (e.g., the Sequence Ontology). We do not discuss these alternatives here, but some of them inspired our annotation guideline.

Concept of feature hierarchy

In genome annotation files, genetic elements are represented by feature entries (e.g., gene, CDS, tRNA). The GRYC database relies on a key concept, which is the definition of a hierarchy between these features. Note that this concept is not fully adapted to EMBL, GenBank and DDBJ formats, whereas it is central in GFF3.

The feature hierarchy we propose relies on 5 feature categories:

Chromosomal regions: features that delimit a region in a chromosome. This type of feature can include several loci and cannot be considered/used to define a single locus/gene (they cannot have a /locus_tag qualifier).

Locus features: features that define the exact location of a gene on a chromosome. In most cases, a locus feature includes other features that describe sub-elements of the gene, such as regulatory elements or coding regions.

Regulatory features: features that describe a structural or a functional regulatory element of a gene (e.g., a promoter, a polyadenylation site).
Transcribed features: features that describe a transcribed region of a locus.

Translated features: features that describe a translated region of a locus (and that must be included in a transcribed feature).
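
The inclusion rule above (a translated feature must lie within a transcribed feature) can be checked on location strings. The following is only a minimal sketch: the function names are hypothetical, and it assumes simple `start..end` or `join(...)` locations, ignoring strand (`complement(...)`) and single-base positions.

```python
import re

def segments(location):
    """Return [(start, end), ...] for a simple or join(...) location string.

    Partial markers (< and >) are tolerated and ignored.
    """
    pairs = re.findall(r"[<>]?(\d+)\.\.[<>]?(\d+)", location)
    return [(int(start), int(end)) for start, end in pairs]

def is_included(inner, outer):
    """True if every segment of `inner` lies within one segment of `outer`."""
    outer_segments = segments(outer)
    return all(
        any(o_start <= start and end <= o_end for o_start, o_end in outer_segments)
        for start, end in segments(inner)
    )

mrna = "join(1200..1350,1540..3560,4020..5300)"
cds = "join(1950..3560,4020..5125)"
print(is_included(cds, mrna))
```

Each CDS segment is compared against the mRNA exons rather than against the overall mRNA span, so a CDS segment that falls in an intron is reported as not included.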

Types of genetic features considered

The INSDC website provides an exhaustive list of features covering all domains of genetic/genomic annotation. Here is the selection of features that we use for annotating yeast genomes and that are supported in the GRYC database:

Feature name	Hierarchic level	Short description	Mandatory qualifiers
assembly_gap	Chromosomal region	Gapped region within a scaffold	/estimated_length; /gap_type; /linkage_evidence
CDS	Translated feature	Region that codes for a protein
centromere	Chromosomal region	Region that contains the centromere
gap	Chromosomal region	Region with undetermined base(s) (N)	/estimated_length
gene	Locus feature	Global location of a gene
misc_binding	Regulatory feature	Site that binds another moiety	/bound_moiety
misc_feature	Locus feature	A locus of interest that cannot be described with any other locus feature
misc_RNA	Transcribed feature	Other transcribed RNA product
mobile_element	Locus feature	Mobile element locus	/mobile_element_type
mRNA	Transcribed feature	Messenger RNA feature
ncRNA	Transcribed feature	RNA from a non-protein-coding gene (other than tRNA and rRNA)	/ncRNA_class
polyA_site	Regulatory feature	Site of post-transcriptional polyadenylation
protein_bind	Regulatory feature	Non-covalent protein binding site on nucleic acid	/bound_moiety
regulatory	Regulatory feature	Any region that functions in regulation	/regulatory_class
repeat_region	Regulatory feature OR Locus feature	Region containing repeat units
rRNA	Transcribed feature	Ribosomal RNA
sig_peptide	Regulatory feature	Signal peptide coding sequence
source	Chromosomal region	Global sequence description	/organism; /mol_type
telomere	Chromosomal region	Region identified as a telomere
tRNA	Transcribed feature	Transfer RNA
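
For automated checks, the table above can be transcribed into a lookup structure. This is a sketch only: the dictionary simply mirrors the table, and `missing_qualifiers` is a hypothetical helper name, not part of any existing tool.

```python
# Hierarchy level and mandatory qualifiers, transcribed from the table above.
FEATURES = {
    "assembly_gap": ("Chromosomal region", ["estimated_length", "gap_type", "linkage_evidence"]),
    "CDS": ("Translated feature", []),
    "centromere": ("Chromosomal region", []),
    "gap": ("Chromosomal region", ["estimated_length"]),
    "gene": ("Locus feature", []),
    "misc_binding": ("Regulatory feature", ["bound_moiety"]),
    "misc_feature": ("Locus feature", []),
    "misc_RNA": ("Transcribed feature", []),
    "mobile_element": ("Locus feature", ["mobile_element_type"]),
    "mRNA": ("Transcribed feature", []),
    "ncRNA": ("Transcribed feature", ["ncRNA_class"]),
    "polyA_site": ("Regulatory feature", []),
    "protein_bind": ("Regulatory feature", ["bound_moiety"]),
    "regulatory": ("Regulatory feature", ["regulatory_class"]),
    "repeat_region": ("Regulatory feature OR Locus feature", []),
    "rRNA": ("Transcribed feature", []),
    "sig_peptide": ("Regulatory feature", []),
    "source": ("Chromosomal region", ["organism", "mol_type"]),
    "telomere": ("Chromosomal region", []),
    "tRNA": ("Transcribed feature", []),
}

def missing_qualifiers(feature_key, qualifiers):
    """Return the mandatory qualifiers that are absent from `qualifiers`.

    `qualifiers` is any container of qualifier names (without the leading slash).
    """
    if feature_key not in FEATURES:
        raise KeyError(f"{feature_key!r} is not in the selected feature list")
    _level, mandatory = FEATURES[feature_key]
    return [name for name in mandatory if name not in qualifiers]

print(missing_qualifiers("ncRNA", {"locus_tag"}))
```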

Naming convention

There are no absolute rules for naming the assembled sequences and the annotated genes. However, it is worth noting that the INSDC databases have some requirements concerning sequence and gene names. In addition, not respecting a convention leads to a high heterogeneity of sequence/gene names, which unnecessarily complicates analyses.

Defining a prefix for each strain

When we prepare our genomic data, we use a prefix to build both sequence and feature labels. This prefix must contain only capital letters and numbers. By convention, we often use the genus and the species names to build this prefix, followed by a number to distinguish the strains of the same species. For example, the first strain of Monosporozyma unispora would have the prefix MOUN0, the second strain would be MOUN1, and so on. When creating a BioProject to deposit genomic data in an INSDC database, it is generally possible to reserve this prefix to ensure that it will be used and maintained by the different public databases once the data have been submitted.
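
As an illustration, a prefix of this kind could be generated as follows. The sketch assumes the pattern seen in the examples of this guideline (first two letters of the genus, first two letters of the species, then the strain number); this pattern is inferred from the examples, not a stated rule, and `strain_prefix` is a hypothetical name.

```python
import re

def strain_prefix(genus, species, strain_index=0):
    """Build a prefix such as MOUN0 from ('Monosporozyma', 'unispora', 0).

    Assumption: two letters of the genus + two letters of the species,
    followed by the strain number.
    """
    prefix = f"{genus[:2]}{species[:2]}".upper() + str(strain_index)
    if not re.fullmatch(r"[A-Z0-9]+", prefix):
        raise ValueError(f"{prefix!r} must contain only capital letters and numbers")
    return prefix

print(strain_prefix("Monosporozyma", "unispora"))     # first strain
print(strain_prefix("Monosporozyma", "unispora", 1))  # second strain
```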

Sequence labels

Labels of genomic sequences are based on the prefix. For a chromosome level assembly, sequences are enumerated by letters and chromosomes should be ordered by increasing length. Hence the shortest chromosome should be the chromosome A. For scaffold and contig level assemblies, sequences are numbered and should be ordered by decreasing length. For consistency with locus labels (see below), it is recommended to separate the prefix and sequence letter/number by an underscore ("_").

Here are two examples:

If the assembly of Kazachstania africana consists of 12 complete chromosomes, they are named from KAAF0_A (the shortest chromosome) to KAAF0_L (the largest chromosome).
If the assembly of Maudiozyma humilis consists of 17 scaffolds, they are named from MAHU0_01S (the largest scaffold) to MAHU0_17S (the shortest scaffold).

Note: when considering a scaffold and contig level assembly, the letter "S" is added to the end of the label to avoid confusion with the locus number (see below).
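
The two labelling schemes described in this section can be sketched as follows. The function names are hypothetical, and the zero-padding to at least two digits is an assumption taken from the MAHU0_01S example.

```python
import string

def chromosome_labels(prefix, lengths):
    """Label chromosomes A, B, C... by increasing length. Returns {label: length}."""
    if len(lengths) > 26:
        raise ValueError("more than 26 chromosomes: single letters are not enough")
    ordered = sorted(lengths)
    return {f"{prefix}_{string.ascii_uppercase[i]}": n for i, n in enumerate(ordered)}

def scaffold_labels(prefix, lengths):
    """Label scaffolds 01S, 02S... by decreasing length. Returns {label: length}."""
    # Assumption: pad to at least two digits, as in MAHU0_01S.
    width = max(2, len(str(len(lengths))))
    ordered = sorted(lengths, reverse=True)
    return {f"{prefix}_{i:0{width}d}S": n for i, n in enumerate(ordered, start=1)}

# Made-up lengths, for illustration only.
print(chromosome_labels("KAAF0", [1_200_000, 650_000, 980_000]))
print(scaffold_labels("MAHU0", [54_000, 910_000, 230_000]))
```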

Once your data is submitted to a public database, sequence labels are generally replaced by accession numbers. However, the labels you have defined can be conserved in the sequence description as well as in the source feature of the annotation.

Locus labels

From the various INSDC database guidelines, a locus label (/locus_tag) must respect the following rules:

It must contain only capital letters and numbers, apart from the underscore that follows the prefix
It starts with the assembly prefix (see above) followed by an underscore ("_")
The last part of the label, which is specific to the locus, should be simple, such as just a few digits. However, it may contain additional information such as the chromosome number/letter.
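
These rules can be approximated with a regular expression. The sketch below reflects our reading of the rules (capital letters and digits on both sides of a single underscore); it is not an official INSDC validator, and `is_valid_locus_tag` is a hypothetical name.

```python
import re

# Assumption: prefix and locus-specific part are both made of capital
# letters and digits, separated by exactly one underscore.
LOCUS_TAG = re.compile(r"[A-Z0-9]+_[A-Z0-9]+")

def is_valid_locus_tag(tag, prefix=None):
    """Check the general shape of a locus tag and, optionally, its prefix."""
    if not LOCUS_TAG.fullmatch(tag):
        return False
    return prefix is None or tag.startswith(prefix + "_")

print(is_valid_locus_tag("KAAF0_A00100"))
print(is_valid_locus_tag("kaaf0_A00100"))
```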

To comply with these "rules", we generally use the sequence label (see previous section) followed by 5 digits. The value of the first locus of a given sequence is 00100. The values of the following loci are incremented by 22. In this way, it will be possible to insert missing loci in the annotation without disturbing the order of the locus labels.

Here are some examples:

In the chromosome KAAF0_A, the first locus is labelled KAAF0_A00100, then KAAF0_A00122, and so on.
In the scaffold MAHU0_13S, the first locus is labelled MAHU0_13S00100, then MAHU0_13S00122, and so on.

This suggested locus naming rule makes it possible to label up to about 4,500 loci on the same chromosome/sequence, which is generally enough for yeast. If you need to label more loci on a single sequence, use a smaller increment step or 6 digits.
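
The numbering scheme and its capacity can be reproduced with a few lines of code. This is a sketch with hypothetical function names; the defaults (start at 100, step of 22, 5 digits) are the values given in this section.

```python
def locus_tags(sequence_label, count, start=100, step=22, digits=5):
    """Return `count` locus tags for one sequence, e.g. KAAF0_A00100, KAAF0_A00122..."""
    limit = 10 ** digits - 1
    if count > 0 and start + (count - 1) * step > limit:
        raise ValueError(f"{count} loci do not fit in {digits} digits with step {step}")
    return [f"{sequence_label}{start + i * step:0{digits}d}" for i in range(count)]

def capacity(start=100, step=22, digits=5):
    """Number of loci that can be labelled on one sequence."""
    return (10 ** digits - 1 - start) // step + 1

print(locus_tags("KAAF0_A", 3))
print(capacity())          # with the default values
print(capacity(digits=6))  # if 6 digits are used instead
```

With the default values, `capacity()` gives 4541, which matches the figure of about 4,500 loci mentioned above.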

Detailed feature annotation

Protein coding genes

Protein-coding genes are the most frequent features in genome annotations. They generally consist of at least one CDS (Coding DNA Sequence) feature, which gives the coordinates of the sequence(s) of the gene that are translated into protein. The CDS often includes functional information about the gene. Protein-coding genes may include one or more mRNA features that give the complete exon coordinates, including the untranslated regions (UTRs). The entire locus is represented by a gene feature. Its coordinates are often marked with the symbols < and >, indicating that the locus may extend beyond the given positions. In addition, a gene feature cannot have joined coordinates; multiple exons are provided in the mRNA and CDS features. Finally, structural and functional regulatory elements may also be described with regulatory features or other dedicated features.

For complex multi-exon genes, it is necessary to specify gene, mRNA and CDS features to allow the identification of possible alternative transcription/translation as well as the location of untranslated regions.

Single protein-coding gene

Here, we consider the most frequent situation, a single protein-coding gene. The coding sequence is given by the CDS feature.


FT   gene            <1200..>5300
FT                   /locus_tag="YEAT0_A00100"
FT   regulatory      1200..1220
FT                   /locus_tag="YEAT0_A00100"
FT                   /regulatory_class="promoter"
FT   mRNA            join(1200..1350,1540..3560,4020..5300)
FT                   /locus_tag="YEAT0_A00100"
FT   CDS             join(1950..3560,4020..5125)
FT                   /gene="XXX1"
FT                   /locus_tag="YEAT0_A00100"
FT                   /note="Annotation comments..."
FT                   /product="Putative protein..."
FT                   /translation="MSPRTIA..."
FT   polyA_site      5299..5300
FT                   /locus_tag="YEAT0_A00100"

Simple CDS

Multiple transcription/translation protein coding gene

When annotating a locus with multiple products, which may result from multiple transcription or translation, each feature must be reported separately and must contain the same /locus_tag.


FT   gene            <1200..>5300
FT                   /locus_tag="YEAT0_A00100"
FT   regulatory      1200..1220
FT                   /locus_tag="YEAT0_A00100"
FT                   /regulatory_class="promoter"
FT   mRNA            join(1200..1350,1540..3560,4020..5300)
FT                   /locus_tag="YEAT0_A00100"
FT   mRNA            join(1200..1300,1540..3560,4020..5300)
FT                   /locus_tag="YEAT0_A00100"
FT   CDS             join(1290..1300,1540..3560,4020..5125)
FT                   /gene="XXX1"
FT                   /locus_tag="YEAT0_A00100"
FT                   /note="Long form..."
FT                   /product="Putative protein..."
FT                   /translation="MKAAREY..."
FT   CDS             join(1950..3560,4020..5125)
FT                   /gene="XXX1"
FT                   /locus_tag="YEAT0_A00100"
FT                   /note="Short form..."
FT                   /product="Putative protein..."
FT                   /translation="MSPRTIA..."
FT   polyA_site      5299..5300
FT                   /locus_tag="YEAT0_A00100"

Multiple CDS
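
The requirement that all features of a locus carry the same /locus_tag can be checked automatically. The following sketch uses a deliberately simplified reader for EMBL feature table lines (one qualifier per line, no wrapped values); the function names and the shortened example are ours and only serve as an illustration.

```python
def parse_features(ft_text):
    """Parse 'FT' lines into a list of (key, location, qualifiers) tuples.

    Simplification: qualifier values are assumed to fit on a single line.
    """
    features = []
    for line in ft_text.splitlines():
        if not line.startswith("FT"):
            continue
        body = line[2:].strip()
        if body.startswith("/"):
            # Qualifier line: /name="value", /name=value or /name
            name, _, value = body[1:].partition("=")
            if features:
                features[-1][2][name] = value.strip('"')
        else:
            parts = body.split(None, 1)
            if len(parts) == 2:  # feature line: key + location
                features.append((parts[0], parts[1], {}))
    return features

def distinct_locus_tags(features):
    """Return the set of /locus_tag values (None if a feature has none)."""
    return {qualifiers.get("locus_tag") for _, _, qualifiers in features}

EXAMPLE = """\
FT   gene            1200..5300
FT                   /locus_tag="YEAT0_A00100"
FT   mRNA            join(1200..1350,1540..3560,4020..5300)
FT                   /locus_tag="YEAT0_A00100"
FT   CDS             join(1950..3560,4020..5125)
FT                   /gene="XXX1"
FT                   /locus_tag="YEAT0_A00100"
"""

tags = distinct_locus_tags(parse_features(EXAMPLE))
print("consistent" if len(tags) == 1 and None not in tags else f"inconsistent: {tags}")
```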

Pseudo protein coding gene

Pseudogenes can be annotated in the same way as regular protein-coding genes, except that the CDS feature cannot contain the /product and /translation qualifiers. It must also contain a qualifier indicating that it is a pseudogene. There are two possible qualifiers:

/pseudo (without text value) should be used when a gene is pseudo for a technical reason, such as an uncompleted assembly, an assembly mistake or an assembly gap.
/pseudogene="TYPE" should be used when the gene is "really" a pseudogene. The "TYPE" is mandatory and must be one of the following values: processed, unprocessed, unitary, allelic, unknown.


FT   gene            <1200..>5300
FT                   /locus_tag="YEAT0_A00100"
FT   mRNA            join(1200..1350,1540..5300)
FT                   /locus_tag="YEAT0_A00100"
FT   CDS             join(1290..1300,1540..4021,4020..5125)
FT                   /gene="XXX1"
FT                   /locus_tag="YEAT0_A00100"
FT                   /note="Annotation comments..."
FT                   /pseudogene="unknown"

pseudo CDS
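
The pseudogene rules given above can be expressed as a small check on the qualifiers of a CDS. This is a sketch: `pseudo_cds_problems` is a hypothetical helper, and the qualifiers are assumed to be available as a dictionary of names (without the leading slash) and values.

```python
PSEUDOGENE_TYPES = {"processed", "unprocessed", "unitary", "allelic", "unknown"}

def pseudo_cds_problems(qualifiers):
    """Return a list of problems for a CDS annotated as a pseudogene."""
    problems = []
    for forbidden in ("product", "translation"):
        if forbidden in qualifiers:
            problems.append(f"/{forbidden} is not allowed on a pseudo CDS")
    has_pseudo = "pseudo" in qualifiers
    has_pseudogene = "pseudogene" in qualifiers
    if not (has_pseudo or has_pseudogene):
        problems.append("missing /pseudo or /pseudogene")
    if has_pseudogene and qualifiers["pseudogene"] not in PSEUDOGENE_TYPES:
        problems.append(f"unknown /pseudogene type: {qualifiers['pseudogene']!r}")
    return problems

cds = {"gene": "XXX1", "locus_tag": "YEAT0_A00100", "pseudogene": "unknown"}
print(pseudo_cds_problems(cds))
```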

Transposable elements (TEs)

Complete transposable element

For complete (or almost complete) transposable element features, the locus should be represented by a mobile_element feature. This feature type requires a mandatory qualifier, which is /mobile_element_type="mobile_element_type[:mobile_element_name]". Authorized values for this mobile_element_type are: "transposon", "retrotransposon", "integron", "insertion sequence", "non-LTR retrotransposon", "SINE", "MITE", "LINE", and "other". It is also possible to specify the name of the element after the type value (separated by a colon). For example, a Ty1 LTR retrotransposon can have the following qualifier: /mobile_element_type="retrotransposon:Ty1".

Terminal repeats such as LTRs or TIRs should be annotated with repeat_region features. It is recommended to use the qualifier /rpt_type=type (the value type is given without quotes). All possible values for this qualifier are listed in the INSDC feature table documentation. For transposable elements, it can be long_terminal_repeat, inverted, non_ltr_retrotransposon_polymeric_tract, nested, terminal or other. The qualifier /rpt_family="text" can be used to provide more information about the repeat region, such as the TE family/name.
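
The /mobile_element_type value can be split and checked against the authorized types listed above. This is a minimal sketch with a hypothetical function name; the set of types is copied from this section.

```python
MOBILE_ELEMENT_TYPES = {
    "transposon", "retrotransposon", "integron", "insertion sequence",
    "non-LTR retrotransposon", "SINE", "MITE", "LINE", "other",
}

def parse_mobile_element_type(value):
    """Split 'type[:name]' and check that the type is authorized.

    Returns (type, name), where name is None if it is not given.
    """
    element_type, _, name = value.partition(":")
    if element_type not in MOBILE_ELEMENT_TYPES:
        raise ValueError(f"unauthorized mobile element type: {element_type!r}")
    return element_type, name or None

print(parse_mobile_element_type("retrotransposon:Ty1"))
```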

Below is an example of an LTR retrotransposon annotation:


FT   mobile_element  <5000..>11200
FT                   /locus_tag="YEAT0_A00144"
FT                   /mobile_element_type="retrotransposon:Ty1"
FT   repeat_region   5000..5250
FT                   /locus_tag="YEAT0_A00144"
FT                   /rpt_type=long_terminal_repeat
FT                   /rpt_family="LTR of a Ty1..."
FT   mRNA            <5000..>11200
FT                   /locus_tag="YEAT0_A00144"
FT   CDS             5300..6200
FT                   /locus_tag="YEAT0_A00144"
FT                   /gene="GAG"
FT                   /product="putative transposon..."
FT                   /note="retrotransposon GAG gene..."
FT                   /translation="MSNESKFDSKA..."
FT   CDS             5300..10980
FT                   /locus_tag="YEAT0_A00144"
FT                   /gene="GAG-POL"
FT                   /product="putative transposon..."
FT                   /note="retrotransposon GAG-POL gene..."
FT                   /ribosomal_slippage
FT                   /translation="MSNESKFDSKA..."
FT   repeat_region   10950..11200
FT                   /locus_tag="YEAT0_A00144"
FT                   /rpt_type=long_terminal_repeat
FT                   /rpt_family="LTR of a Ty1..."

Complete TE

Relics of transposable elements

When considering relics of transposable elements, such as solo LTRs, it is recommended not to use a mobile_element feature, but to use a repeat_region feature directly.


FT   repeat_region   15200..15510
FT                   /locus_tag="YEAT0_A00188"
FT                   /rpt_type=long_terminal_repeat
FT                   /rpt_family="solo LTR of a Ty1..."

TE relic

Non-protein-coding genes

Several features are available for annotating non-protein-coding genes, namely tRNA, rRNA, ncRNA and misc_RNA. They should be associated with a gene feature at the locus level. Only ncRNA has a mandatory qualifier, /ncRNA_class="class", whose possible values are listed in the INSDC controlled vocabulary for ncRNA classes.

Here is an example of a transfer RNA:


FT   gene            21457..21530
FT                   /locus_tag="YEAT0_A00364"
FT   tRNA            21457..21530
FT                   /locus_tag="YEAT0_A00364"
FT                   /gene="tRNA-Asn(GTT)"
FT                   /product="transfer RNA-Asn(GTT)"

tRNA feature