Annotation_guide

Structural annotation - So you want to annotate protein-coding genes in your genome?
Version 1.0 - August 2023

Authors: Alice Dennis, Jèssica Gómez, Leanne Haggerty, Lucile Soler, Aureliano Bombarley, Henrik Lantz, Florian Maumus, Hugues Roest Crollius, Fergal Martin, Jean-Marc Aury, Christian deGuttry, Robert Waterhouse, and the ERGA Annotation committee.

STEP 1 - Before you start

Step 1a: Be sure the assembly is done and you are working with a frozen/stable version!

Table 1: Genome assembly evaluation before annotation. Rationale: Low consensus accuracy, incomplete assemblies, and contamination all lead to poor annotation. It is thus essential to evaluate your genome before you start the annotation process.

1. Consensus Accuracy and assembly completeness evaluated (suggestion: Merqury)

2. Gene space completeness evaluated by:
a. Conserved gene space (suggestion: BUSCO)
b. RNA-Seq mapping (suggestion: STAR/Minimap2)
3. Organelle/Contamination screening and removal:
a. Organelle genomes (suggestion: Minimap2)
b. Contamination (suggestion: BlobTools)
4. Uncollapsed duplication for the consensus haploid assembly (suggestion: purge_dups)
5. Full TE completeness evaluated with the LAI (LTR Assembly Index; suggestion: LTR_retriever)
6. Does your genome meet EBP standards?

Step 1b: Is your assembly done? If yes, go to Step 1c; if no, go to Table 1.

Step 1c: Is the assembly publicly available? Public release is necessary for annotation by Ensembl. If yes, go to Step 1d; if no, go to Step 2.

Step 1d: Is the public assembly linked to ERGA? If yes, go to Step 1f; if no, go to Step 1e.

Step 1f: This will make the assembly available for annotation at Ensembl Rapid Release, provided that relevant transcriptomic data are also publicly available (ENA).

Step 1e: Instructions on how to link your project to ERGA
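The decision path through Steps 1b to 1f can be sketched as a small function. This is only an illustration of the flow described above; the function name `next_step` and its three boolean parameters are hypothetical and not part of any ERGA or Ensembl tooling.

```python
def next_step(assembly_done: bool, publicly_available: bool, linked_to_erga: bool) -> str:
    """Return where this guide sends you next, following Steps 1b-1d.

    The argument names are illustrative assumptions; the returned labels
    are the step names used in this guide.
    """
    if not assembly_done:
        # Step 1b: an unfinished assembly goes back to the Table 1 checks.
        return "Table 1"
    if not publicly_available:
        # Step 1c: without public release, continue with your own annotation.
        return "Step 2"
    if linked_to_erga:
        # Step 1d: a linked public assembly can be picked up by Ensembl.
        return "Step 1f"
    # Step 1d: a public but unlinked assembly must first be linked to ERGA.
    return "Step 1e"


if __name__ == "__main__":
    print(next_step(assembly_done=True, publicly_available=True, linked_to_erga=False))
```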

STEP 2 - Do you want to do your own annotation?

Step 2a: Do you want to do your own annotation? If yes, go to Step 2b

Step 2b: Gather all available Evidence data: Transcriptomic and protein datasets to support the annotation process.

Table 2: Evaluation of your evidence data: the accuracy of the genome annotation process is very sensitive to the amount and quality of your evidence data.

1. RNA-Seq transcriptomic data
a. Mapping evaluation (suggestion: STAR).
b. Transcript models (suggestion: StringTie).
c. Gene space completeness (suggestion: BUSCO).
2. Protein dataset
a. Gene space completeness (suggestion: BUSCO).
b. Percentage of full protein alignments (suggestion: Spaln).
3. IsoSeq transcriptomic data
a. Mapping evaluation (suggestion: Minimap2).
b. Transcript models (suggestion: StringTie).
c. Gene space completeness (suggestion: BUSCO).

Done? If yes, go to Step 2c.

Step 2c: Repeat prediction. ERGA recommends: RepeatModeler2, RepeatMasker, ProtExcluder, TEclass, PASTEC, TEdenovo.

Done? If yes, go to Step 2d.

Step 2d: Ab initio training and prediction. ERGA recommends: AUGUSTUS and GeneMark-ET/EP/ETP.

Step 2e: Gene modelling. ERGA recommends: TSEBRA (BRAKER-based predictions), EVidenceModeler, and MAKER.

Done? For an evaluation of your evidence data go back to Table 2. Once done, you are ready for the final quality and contamination check and you can go to Step 3a.

STEP 3 - Evaluate your annotation

Step 3a: Evaluate your annotation. There is no temporal order for the following suggestions:

Step 3b: MAKER eAED scores.

Step 3c: Gene family analysis.

Step 3d: Genome visualization: 1. IGV; 2. Apollo (manual curation); 3. EasyGB (JBrowse for a simple dataset).

Step 3e: Generate basic gene model summary statistics and compare with related species.
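As a rough illustration of the summary statistics meant here, the sketch below counts genes and computes mean gene length, mean exon length, and mean exons per transcript from GFF3 lines. It assumes standard nine-column GFF3 with `gene` and `exon` feature types and a `Parent` attribute on exons; the function name `gene_model_summary` and the example records are hypothetical, and a dedicated GFF3 toolkit would be a more robust choice for real data.

```python
from collections import defaultdict
from statistics import mean


def gene_model_summary(gff3_lines):
    """Compute a few basic gene model statistics from GFF3 lines."""
    gene_lengths = []
    exon_lengths = []
    exons_per_transcript = defaultdict(int)

    for line in gff3_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a standard GFF3 feature line
        feature_type, attributes = cols[2], cols[8]
        # GFF3 coordinates are 1-based and inclusive.
        length = int(cols[4]) - int(cols[3]) + 1

        if feature_type == "gene":
            gene_lengths.append(length)
        elif feature_type == "exon":
            exon_lengths.append(length)
            attrs = dict(
                field.split("=", 1) for field in attributes.split(";") if "=" in field
            )
            # An exon may list several parent transcripts, separated by commas.
            for parent in attrs.get("Parent", "").split(","):
                if parent:
                    exons_per_transcript[parent] += 1

    counts = list(exons_per_transcript.values())
    return {
        "genes": len(gene_lengths),
        "transcripts_with_exons": len(counts),
        "mean_gene_length": mean(gene_lengths) if gene_lengths else 0,
        "mean_exon_length": mean(exon_lengths) if exon_lengths else 0,
        "mean_exons_per_transcript": mean(counts) if counts else 0,
    }


if __name__ == "__main__":
    # Made-up records for illustration only.
    example = [
        "##gff-version 3",
        "chr1\tdemo\tgene\t1\t1000\t.\t+\t.\tID=g1",
        "chr1\tdemo\tmRNA\t1\t1000\t.\t+\t.\tID=t1;Parent=g1",
        "chr1\tdemo\texon\t1\t200\t.\t+\t.\tID=e1;Parent=t1",
        "chr1\tdemo\texon\t801\t1000\t.\t+\t.\tID=e2;Parent=t1",
    ]
    print(gene_model_summary(example))
```

Running the same function on the annotation of a related species gives numbers that can be compared side by side.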

Step 3f: BUSCO, visual inspection in browser in context with evidence.

Step 3g: Use mapped reads to estimate: 1. How many apparently transcribed regions don't have annotation?; 2. How many genes or exons are supported by read data?
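The two questions in this step reduce to interval overlaps between annotated exons and read-covered regions. The sketch below is a minimal version for a single sequence, assuming exons and read alignment blocks have already been extracted as 1-based inclusive `(start, end)` tuples. Counting any overlap as "support" is a simplifying assumption, and the function names are hypothetical; a real analysis would normally apply coverage thresholds and handle strand and multiple sequences.

```python
def overlaps(a, b):
    """True if two inclusive (start, end) intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def merge_intervals(intervals):
    """Merge overlapping or directly adjacent intervals into regions."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def read_support_summary(exons, read_blocks):
    """Count exons with read support and transcribed regions without annotation."""
    transcribed = merge_intervals(read_blocks)
    supported = [e for e in exons if any(overlaps(e, r) for r in transcribed)]
    unannotated = [r for r in transcribed if not any(overlaps(r, e) for e in exons)]
    return {
        "exons_total": len(exons),
        "exons_supported": len(supported),
        "unannotated_transcribed_regions": len(unannotated),
    }


if __name__ == "__main__":
    # Made-up coordinates for illustration only.
    exons = [(100, 200), (500, 600)]
    read_blocks = [(150, 180), (170, 250), (900, 950)]
    print(read_support_summary(exons, read_blocks))
```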

Step 3h: Compare gene content to related species with similar annotation approach.

Happy with the metrics for each of the parameters on which the annotation has been evaluated? Remember that some of them may depend on the phylum. This is your DIY annotation v1. Again, this is a stopping place. Do not go forward until this is complete. If you are happy with the metrics, you can move to Step 4.

STEP 4 - Finalise your annotation

Step 4a: Create proper file formats (ENA GFF3 format recommendations). Consider replacing the identifiers produced by the different gene annotation tools (e.g., gene-1) with more meaningful identifiers (SpeciesCode+AssemblyVersion+Chr/Scf/Ctg-XXX+G+YYYYYY).
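One possible reading of the identifier pattern above can be sketched as follows. The details are assumptions made for illustration: the parts are joined without separators, XXX is taken as a three-digit sequence number and YYYYYY as a six-digit gene number, and the species code `ExSp` is a made-up placeholder. The function names are hypothetical as well.

```python
def make_gene_id(species_code, assembly_version, seq_type, seq_number, gene_number):
    """Build an identifier of the form SpeciesCode+AssemblyVersion+Chr/Scf/Ctg+XXX+G+YYYYYY.

    Zero-padding widths (3 and 6) and the lack of separators are assumptions.
    """
    if seq_type not in {"Chr", "Scf", "Ctg"}:
        raise ValueError(f"unexpected sequence type: {seq_type!r}")
    return f"{species_code}{assembly_version}{seq_type}{seq_number:03d}G{gene_number:06d}"


def build_id_map(old_ids, species_code, assembly_version, seq_type, seq_number):
    """Map tool-generated gene IDs on one sequence to new IDs, in the order given."""
    return {
        old_id: make_gene_id(species_code, assembly_version, seq_type, seq_number, n)
        for n, old_id in enumerate(old_ids, start=1)
    }


if __name__ == "__main__":
    # "ExSp" and the gene-N IDs are placeholders, not real identifiers.
    print(build_id_map(["gene-1", "gene-2"], "ExSp", "1", "Chr", 1))
```

Keeping the old-to-new mapping as a table makes it possible to trace each final identifier back to the tool output it came from.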

Step 4b: Provide this annotation to Ensembl as a second track (via GFF3 submission to ENA) and go back to Step 1f.

