The purpose of this guide is to provide advice, tips and rules for annotating yeast genomes. Inspection of existing annotations available in public databases reveals a high heterogeneity in the way genetic features are presented and formatted. This leads to significant difficulties in data handling, such as genome comparisons or automated data extraction. In addition, most of the annotation guidelines available on the web are incomplete, proposing solutions only for the most frequent cases.
It is worth noting that the solutions proposed in this guideline are not absolute rules but only suggestions that we adopt for GRYC data. With this guideline, our objectives are:
- To comply with the INSDC annotation rules, to ensure compatibility with the main public databases.
- To propose solutions for each type of locus/genetic feature.
- To converge towards standardised yeast genome annotations.
Genome annotations rely on feature table formats that use:
Note that as a European resource, we work with the EMBL file format and hence, the illustrations provided in this guideline are formatted in EMBL. However, each of the proposed rules can be applied in both the DDBJ and GenBank formats.
Last, there are other formats to describe genome annotations, such as the GFF3 format, as well as other standards/ontologies to characterize genetic entities (e.g., the Sequence Ontology). We do not discuss these alternatives here, but some of them inspired our annotation guideline.
In genome annotation files, genetic elements are represented by feature entries (e.g., gene, CDS, tRNA). The GRYC database relies on a key concept, which is the definition of a hierarchy between these features. Note that this concept is not fully adapted to EMBL, GenBank and DDBJ formats, whereas it is central in GFF3.
The feature hierarchy we propose relies on 5 feature categories:
The INSDC website provides an exhaustive list of features covering all domains of genetic/genomic annotation. Here is the selection of features that we consider for annotating yeast genomes and that are considered in the GRYC database:
Feature name | Hierarchic level | Short description | Mandatory qualifiers |
---|---|---|---|
assembly_gap | Chromosomal region | Gapped region within a scaffold | /estimated_length; /gap_type; /linkage_evidence |
CDS | Translated feature | Region that codes for a protein | |
centromere | Chromosomal region | Region that contains the centromere | |
gap | Chromosomal region | Region with undetermined base(s) (N) | /estimated_length |
gene | Locus feature | Global location of a gene | |
misc_binding | Regulatory feature | Site that binds another moiety | /bound_moiety |
misc_feature | Locus feature | A locus of interest that cannot be described with any other locus feature | |
misc_RNA | Transcribed feature | Other transcribed RNA product | |
mobile_element | Locus feature | Mobile element locus | /mobile_element_type |
mRNA | Transcribed feature | Messenger RNA feature | |
ncRNA | Transcribed feature | RNA from a non-protein-coding gene (other than tRNA and rRNA) | /ncRNA_class |
polyA_site | Regulatory feature | Site of post-transcriptional polyadenylation | |
protein_bind | Regulatory feature | Non-covalent protein binding site on nucleic acid | /bound_moiety |
regulatory | Regulatory feature | Any region that functions in regulation | /regulatory_class |
repeat_region | Regulatory feature OR Locus feature | Region containing repeat units | |
rRNA | Transcribed feature | Ribosomal RNA | |
sig_peptide | Regulatory feature | Signal peptide coding sequence | |
source | Chromosomal region | Global sequence description | /organism; /mol_type |
telomere | Chromosomal region | Region identified as a telomere | |
tRNA | Transcribed feature | Transfer RNA | |
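The mandatory qualifiers listed in the table can be checked automatically. The sketch below transcribes the last column of the table into a dictionary; the function name `missing_qualifiers` and the representation of qualifiers as a plain dictionary are assumptions made for illustration, not part of any INSDC tool.

```python
# Mandatory qualifiers per feature key, transcribed from the table above.
# Feature keys absent from this dictionary have no mandatory qualifier.
MANDATORY_QUALIFIERS = {
    "assembly_gap": {"estimated_length", "gap_type", "linkage_evidence"},
    "gap": {"estimated_length"},
    "misc_binding": {"bound_moiety"},
    "mobile_element": {"mobile_element_type"},
    "ncRNA": {"ncRNA_class"},
    "protein_bind": {"bound_moiety"},
    "regulatory": {"regulatory_class"},
    "source": {"organism", "mol_type"},
}


def missing_qualifiers(feature_key, qualifiers):
    """Return the sorted mandatory qualifiers that are absent from `qualifiers`.

    `qualifiers` is any collection of qualifier names (hypothetical
    representation: a dict mapping qualifier name to value).
    """
    required = MANDATORY_QUALIFIERS.get(feature_key, set())
    return sorted(required - set(qualifiers))


# A source feature that only provides /organism is reported as incomplete.
print(missing_qualifiers("source", {"organism": "Monosporozyma unispora"}))
```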
There are no absolute rules for naming the assembled sequences and the annotated genes. However, it is worth noting that the INSDC databases have some requirements concerning sequence and gene names. In addition, not respecting a convention leads to a high heterogeneity of sequence/gene names, which unnecessarily complicates analyses.
When we prepare our genomic data, we use a prefix to build both sequence and feature labels. This prefix must contain only capital letters and numbers. By convention, we often use the genus and the species names to build this prefix, followed by a number to distinguish the strains of the same species. For example, the first strain of Monosporozyma unispora would have the prefix MOUN0, the second strain would be MOUN1, and so on. When creating a BioProject to deposit genomic data in an INSDC database, it is generally possible to reserve this prefix to ensure that it will be used and maintained by the different public databases once the data have been submitted.
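As an illustration, the prefix convention can be sketched as follows. The guideline only states that the prefix is built from the genus and species names plus a strain number; taking the first two letters of each name is an assumption inferred from the MOUN0 example, and `build_prefix` is a hypothetical helper.

```python
import re


def build_prefix(genus, species, strain_index, letters=2):
    """Build a label prefix such as MOUN0 (hypothetical helper).

    Assumption: the prefix is made of the first `letters` letters of the
    genus and of the species, upper-cased, followed by the strain number
    (0 for the first strain, 1 for the second, and so on).
    """
    prefix = (genus[:letters] + species[:letters]).upper() + str(strain_index)
    # The guideline requires capital letters and numbers only.
    if not re.fullmatch(r"[A-Z0-9]+", prefix):
        raise ValueError(f"invalid prefix: {prefix!r}")
    return prefix


print(build_prefix("Monosporozyma", "unispora", 0))
print(build_prefix("Monosporozyma", "unispora", 1))
```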
Labels of genomic sequences are based on the prefix. For a chromosome-level assembly, sequences are enumerated by letters and chromosomes should be ordered by increasing length. Hence, the shortest chromosome should be chromosome A. For scaffold- and contig-level assemblies, sequences are numbered and should be ordered by decreasing length. For consistency with locus labels (see below), it is recommended to separate the prefix and sequence letter/number by an underscore ("_").
Here are two examples:
Once your data is submitted to a public database, sequence labels are generally replaced by accession numbers. However, the labels you have defined can be conserved in the sequence description as well as in the source feature of the annotation.
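The sequence labelling convention described above (letters by increasing length for chromosomes, numbers by decreasing length for scaffolds and contigs) can be sketched as follows. The function names are hypothetical, and starting the scaffold numbering at 1 without zero padding is an assumption, since the guideline does not specify it.

```python
import string


def label_chromosomes(prefix, lengths):
    """Label chromosomes PREFIX_A, PREFIX_B, ... from shortest to longest."""
    if len(lengths) > len(string.ascii_uppercase):
        raise ValueError("more than 26 chromosomes: single letters are not enough")
    ordered = sorted(lengths)  # increasing length, so the shortest gets A
    return {f"{prefix}_{letter}": length
            for letter, length in zip(string.ascii_uppercase, ordered)}


def label_scaffolds(prefix, lengths, first=1):
    """Label scaffolds/contigs PREFIX_1, PREFIX_2, ... from longest to shortest.

    Assumption: numbering starts at `first` (1 by default).
    """
    ordered = sorted(lengths, reverse=True)  # decreasing length
    return {f"{prefix}_{number}": length
            for number, length in enumerate(ordered, start=first)}


print(label_chromosomes("MOUN0", [900_000, 300_000, 600_000]))
print(label_scaffolds("MOUN1", [12_000, 85_000, 40_000]))
```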
From the various INSDC database guidelines, a locus label (/locus_tag) must respect the following rules:
Here are some examples:
This suggested locus naming rule allows up to about 4,500 loci to be labelled on the same chromosome/sequence, which is generally enough for yeast. If you need to label more loci on a single sequence, use a smaller increment step or 6 digits.
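The capacity figure can be checked with a short calculation. In the sketch below, the tag layout (prefix, underscore, sequence letter, zero-padded number) is inferred from the locus tags used in the examples of this guideline, and the increment step of 22 is an assumption chosen because it yields roughly 4,500 tags with 5 digits. The function names are hypothetical.

```python
def capacity(step=22, digits=5):
    """Number of locus tags available on one sequence for a given step."""
    return (10**digits - 1) // step


def locus_tags(prefix, sequence_id, count, step=22, digits=5):
    """Generate `count` locus tags such as YEAT0_A00022, YEAT0_A00044, ...

    Assumptions: the first tag uses the value `step`, and the numbers are
    zero-padded to `digits` digits.
    """
    if count > capacity(step, digits):
        raise ValueError("not enough numbers: use a smaller step or more digits")
    return [f"{prefix}_{sequence_id}{index * step:0{digits}d}"
            for index in range(1, count + 1)]


print(capacity())           # 5 digits, step 22
print(capacity(digits=6))   # 6 digits, same step
print(locus_tags("YEAT0", "A", 3))
```

Leaving gaps between consecutive numbers makes it possible to insert newly discovered loci later without renumbering their neighbours.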
Protein-coding genes are the most frequent features in genome annotations. A protein-coding gene generally consists of at least one CDS (Coding DNA Sequence) feature, which gives the coordinates of the sequence(s) of the gene that are translated into protein and often carries functional information about the gene. Protein-coding genes may include mRNA feature(s) that give the complete exon coordinates, including the untranslated regions (UTRs). The entire locus is represented by a gene feature. Its coordinates are often marked with the symbols < and >, indicating that the locus may extend beyond them. In addition, a gene feature cannot have joined coordinates; multiple exons are provided in the mRNA and CDS features. Last, structural and functional regulatory elements may also be described with regulatory features or other dedicated features.
For complex multi-exon genes, it is necessary to specify gene, mRNA and CDS features to allow the identification of possible alternative transcription/translation as well as the location of untranslated regions.
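To make the relationship between gene, mRNA and CDS coordinates concrete, here is a minimal sketch that extracts the spans of a location string and checks that one feature lies within another. It is a deliberate simplification: strand (complement), single-base positions and remote references are ignored, and the names `parse_location` and `is_within` are hypothetical.

```python
import re

# One span: start..end, with optional < or > partial-location symbols.
_SPAN = re.compile(r"<?(\d+)\.\.>?(\d+)")


def parse_location(location):
    """Return the list of (start, end) spans found in a location string."""
    spans = [(int(start), int(end)) for start, end in _SPAN.findall(location)]
    if not spans:
        raise ValueError(f"no start..end span found in {location!r}")
    return spans


def is_within(child_location, parent_location):
    """True if the child feature lies entirely inside the parent's extent."""
    child = parse_location(child_location)
    parent = parse_location(parent_location)
    return (min(start for start, _ in child) >= min(start for start, _ in parent)
            and max(end for _, end in child) <= max(end for _, end in parent))


cds = "join(1950..3560,4020..5125)"
mrna = "join(1200..1350,1540..3560,4020..5300)"
print(parse_location(cds))
print(is_within(cds, mrna))
```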
Here, we consider the most frequent situation, a single protein-coding gene. The coding sequence is given by the CDS feature.
FT   gene            <1200..>5300
FT                   /locus_tag="YEAT0_A00100"
FT   regulatory      1200..1220
FT                   /locus_tag="YEAT0_A00100"
FT                   /regulatory_class="promoter"
FT   mRNA            join(1200..1350,1540..3560,4020..5300)
FT                   /locus_tag="YEAT0_A00100"
FT   CDS             join(1950..3560,4020..5125)
FT                   /gene="XXX1"
FT                   /locus_tag="YEAT0_A00100"
FT                   /note="Annotation comments..."
FT                   /product="Putative protein..."
FT                   /translation="MSPRTIA..."
FT   polyA_site      5299..5300
FT                   /locus_tag="YEAT0_A00100"
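Feature entries like the ones above can be read into a simple data structure for automated checks. The parser below is only a sketch: it assumes one feature key or one qualifier per FT line and does not handle locations or qualifier values wrapped over several lines. The function name and the dictionary layout are hypothetical.

```python
def parse_features(text):
    """Parse simple EMBL FT lines into a list of feature dictionaries."""
    features = []
    for line in text.splitlines():
        if not line.startswith("FT"):
            continue  # not a feature table line
        body = line[2:].strip()
        if not body:
            continue
        if body.startswith("/"):
            # Qualifier line: /name="value", /name=value or a bare /name flag.
            if not features:
                raise ValueError("qualifier found before any feature key")
            name, _, value = body[1:].partition("=")
            features[-1]["qualifiers"][name] = value.strip('"') if value else True
        else:
            # Feature line: key followed by its location.
            parts = body.split(None, 1)
            if len(parts) != 2:
                raise ValueError(f"feature line without location: {line!r}")
            features.append({"key": parts[0], "location": parts[1], "qualifiers": {}})
    return features


example = '''FT   gene            21457..21530
FT                   /locus_tag="YEAT0_A00364"
FT   tRNA            21457..21530
FT                   /locus_tag="YEAT0_A00364"
'''
for feature in parse_features(example):
    print(feature["key"], feature["location"], feature["qualifiers"])
```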
When annotating a locus with multiple products, which may result from alternative transcription or translation, each feature must be reported separately and must contain the same /locus_tag.
FT   gene            <1200..>5300
FT                   /locus_tag="YEAT0_A00100"
FT   regulatory      1200..1220
FT                   /locus_tag="YEAT0_A00100"
FT                   /regulatory_class="promoter"
FT   mRNA            join(1200..1350,1540..3560,4020..5300)
FT                   /locus_tag="YEAT0_A00100"
FT   mRNA            join(1200..1300,1540..3560,4020..5300)
FT                   /locus_tag="YEAT0_A00100"
FT   CDS             join(1290..1300,1540..3560,4020..5125)
FT                   /gene="XXX1"
FT                   /locus_tag="YEAT0_A00100"
FT                   /note="Long form..."
FT                   /product="Putative protein..."
FT                   /translation="MKAAREY..."
FT   CDS             join(1950..3560,4020..5125)
FT                   /gene="XXX1"
FT                   /locus_tag="YEAT0_A00100"
FT                   /note="Short form..."
FT                   /product="Putative protein..."
FT                   /translation="MSPRTIA..."
FT   polyA_site      5299..5300
FT                   /locus_tag="YEAT0_A00100"
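The rule that all features of a locus share the same /locus_tag is easy to verify automatically. The sketch below assumes that features are listed in file order, that each gene feature precedes the other features of its locus, and that features are represented as dictionaries with "key" and "qualifiers" entries; this representation and the function name are hypothetical.

```python
def mismatched_locus_tags(features):
    """Return (key, found_tag, expected_tag) for features that disagree
    with the /locus_tag of the gene feature that precedes them."""
    problems = []
    expected = None
    for feature in features:
        tag = feature["qualifiers"].get("locus_tag")
        if feature["key"] == "gene":
            expected = tag  # a gene feature opens a new locus
        elif expected is not None and tag != expected:
            problems.append((feature["key"], tag, expected))
    return problems


locus = [
    {"key": "gene", "qualifiers": {"locus_tag": "YEAT0_A00100"}},
    {"key": "mRNA", "qualifiers": {"locus_tag": "YEAT0_A00100"}},
    {"key": "mRNA", "qualifiers": {"locus_tag": "YEAT0_A00100"}},
    {"key": "CDS", "qualifiers": {"locus_tag": "YEAT0_A00122"}},  # deliberate error
]
print(mismatched_locus_tags(locus))
```

Loci that are not introduced by a gene feature (for example, a mobile_element locus) would need an additional rule, which this sketch does not cover.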
Pseudogenes can be annotated in the same way as regular protein-coding genes, except that the CDS feature cannot contain /product and /translation qualifiers. It must also contain a qualifier indicating that it is a pseudogene. There are two possible qualifiers:
FT   gene            <1200..>5300
FT                   /locus_tag="YEAT0_A00100"
FT   mRNA            join(1200..1350,1540..5300)
FT                   /locus_tag="YEAT0_A00100"
FT   CDS             join(1290..1300,1540..4021,4020..5125)
FT                   /gene="XXX1"
FT                   /locus_tag="YEAT0_A00100"
FT                   /note="Annotation comments..."
FT                   /pseudogene="unknown"
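The pseudogene rules can be expressed as a small check on the qualifiers of a CDS feature. This sketch assumes that the two qualifiers in question are the INSDC qualifiers /pseudo and /pseudogene, and that qualifiers are held in a plain dictionary; the function name is hypothetical.

```python
def pseudogene_cds_errors(qualifiers):
    """List the rule violations for the qualifiers of one pseudogene CDS."""
    errors = []
    # Assumption: either /pseudo or /pseudogene flags the CDS as a pseudogene.
    if "pseudo" not in qualifiers and "pseudogene" not in qualifiers:
        errors.append("missing /pseudo or /pseudogene qualifier")
    for forbidden in ("product", "translation"):
        if forbidden in qualifiers:
            errors.append(f"/{forbidden} is not allowed on a pseudogene CDS")
    return errors


cds_qualifiers = {
    "gene": "XXX1",
    "locus_tag": "YEAT0_A00100",
    "note": "Annotation comments...",
    "pseudogene": "unknown",
}
print(pseudogene_cds_errors(cds_qualifiers))
```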
For complete (or almost complete) transposable element features, the locus should be represented by a mobile_element feature. This feature type requires a mandatory qualifier, which is /mobile_element_type="mobile_element_type[:mobile_element_name]". Authorized values for this mobile_element_type are: "transposon", "retrotransposon", "integron", "insertion sequence", "non-LTR retrotransposon", "SINE", "MITE", "LINE", and "other". It is also possible to specify the name of the element after the type value (separated by a colon). For example, a Ty1 LTR retrotransposon can have the following qualifier: /mobile_element_type="retrotransposon:Ty1".
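The structure of the /mobile_element_type value can be validated with a few lines of code. The sketch below uses the list of authorized types given above; the function name `parse_mobile_element_type` is hypothetical.

```python
# Authorized type values, as listed in the paragraph above.
MOBILE_ELEMENT_TYPES = {
    "transposon", "retrotransposon", "integron", "insertion sequence",
    "non-LTR retrotransposon", "SINE", "MITE", "LINE", "other",
}


def parse_mobile_element_type(value):
    """Split a /mobile_element_type value into (type, name or None)."""
    element_type, _, name = value.partition(":")
    if element_type not in MOBILE_ELEMENT_TYPES:
        raise ValueError(f"unauthorized mobile element type: {element_type!r}")
    return element_type, (name or None)


print(parse_mobile_element_type("retrotransposon:Ty1"))
print(parse_mobile_element_type("SINE"))
```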
Terminal repeats such as LTRs or TIRs should be annotated with repeat_region features. It is recommended to use the qualifier /rpt_type=type (the value type is given without quotes). All possible values for this qualifier are provided here. For transposable elements, it can be long_terminal_repeat, inverted, non_ltr_retrotransposon_polymeric_tract, nested, terminal or other. The qualifier /rpt_family="text" can be used to provide more information about the repeat region, such as the TE family/name.
Below is an example of an LTR retrotransposon annotation:
FT   mobile_element  <5000..>11200
FT                   /locus_tag="YEAT0_A00144"
FT                   /mobile_element_type="retrotransposon:Ty1"
FT   repeat_region   5000..5250
FT                   /locus_tag="YEAT0_A00144"
FT                   /rpt_type=long_terminal_repeat
FT                   /rpt_family="LTR of a Ty1..."
FT   mRNA            <5000..>11200
FT                   /locus_tag="YEAT0_A00144"
FT   CDS             5300..6200
FT                   /locus_tag="YEAT0_A00144"
FT                   /gene="GAG"
FT                   /product="putative transposon..."
FT                   /note="retrotransposon GAG gene..."
FT                   /translation="MSNESKFDSKA..."
FT   CDS             5300..10980
FT                   /locus_tag="YEAT0_A00144"
FT                   /gene="GAG-POL"
FT                   /product="putative transposon..."
FT                   /note="retrotransposon GAG-POL gene..."
FT                   /ribosomal_slippage
FT                   /translation="MSNESKFDSKA..."
FT   repeat_region   10950..11200
FT                   /locus_tag="YEAT0_A00144"
FT                   /rpt_type=long_terminal_repeat
FT                   /rpt_family="LTR of a Ty1..."
When considering relics of transposable elements, such as solo LTRs, it is recommended not to use a mobile_element feature type, but to use a repeat_region feature directly.
FT   repeat_region   15200..15510
FT                   /locus_tag="YEAT0_A00188"
FT                   /rpt_type=long_terminal_repeat
FT                   /rpt_family="solo LTR of a Ty1..."
Several features are available for annotating non-protein-coding genes, namely tRNA, rRNA, ncRNA and misc_RNA. They should be associated with a gene feature at the locus level. Only ncRNA has a mandatory qualifier, /ncRNA_class="class", whose possible values are explained here.
Here is an example of a transfer RNA:
FT   gene            21457..21530
FT                   /locus_tag="YEAT0_A00364"
FT   tRNA            21457..21530
FT                   /locus_tag="YEAT0_A00364"
FT                   /gene="tRNA-Asn(GTT)"
FT                   /product="transfer RNA-Asn(GTT)"
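Entries such as this tRNA example can also be produced programmatically. The sketch below writes a feature in the EMBL column layout (feature key starting at column 6, location and qualifiers at column 22). It does not wrap long lines, and the function name, the dictionary representation and the set of unquoted qualifiers are assumptions made for illustration.

```python
# Qualifiers whose value is written without quotes (assumption: only
# /rpt_type is needed for the features discussed in this guideline).
UNQUOTED = {"rpt_type"}


def format_feature(key, location, qualifiers):
    """Format one feature as EMBL FT lines (no wrapping of long lines)."""
    lines = [f"FT   {key:<16}{location}"]
    for name, value in qualifiers.items():
        if value is True:
            text = f"/{name}"  # flag qualifier without value
        elif name in UNQUOTED:
            text = f"/{name}={value}"
        else:
            text = f'/{name}="{value}"'
        lines.append("FT" + " " * 19 + text)
    return "\n".join(lines)


print(format_feature("gene", "21457..21530", {"locus_tag": "YEAT0_A00364"}))
print(format_feature("tRNA", "21457..21530", {
    "locus_tag": "YEAT0_A00364",
    "gene": "tRNA-Asn(GTT)",
    "product": "transfer RNA-Asn(GTT)",
}))
```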