Are you embarking on a reference genome project?

Do you want to learn about the steps required for success?

Then join the growing family of ERGA Community Genomes!

ERGA aims to coordinate the production of high-quality annotated genome assemblies that represent eukaryotic biodiversity in Europe. A key part of this is building capacity across European researchers and institutes. ERGA supports the growing community of scientists in biodiversity genomics by providing guidelines, workflows, and best practices that explain and facilitate the many steps of the complex workflow for reference genome generation.

The guidelines below cover many of the main steps along the genome generation workflow, providing step-by-step advice and answers to frequently asked questions to help researchers navigate the complexities and find out where to turn for additional assistance:

PLEASE NOTE: This is a work in progress. The initial beta version of these guidelines was developed with input from the ERGA Committee Coordinators. Development is ongoing and will include all ERGA Committees.

Contributors: Tom Brown, Diego de Panis, Joao Pimenta, Christian de Guttry, Ann Mc Cartney, Rita Monteiro, Javier Palma, Luisa Marins, Astrid Böhne, Robert Waterhouse, Camila Mazzoni.

1. Pre sampling

Letter of Support

Do you wish to indicate in your grant proposal that you know where and how to find support for the genome generation pipeline? Because it can be difficult to obtain funding for research in an area where you have no prior experience, your application can be backed by a letter of support from our chairs. If you would like this type of assistance, please indicate so on this form.

Your grant's genomic section

I. Are you in need of assistance in preparing a complete, convincing, and coherent grant application with realistic budget estimates for your project?

It can be challenging to prepare your first grant in this area of research, and ERGA offers its hub of knowledge to assist you on your first journey into the world of reference genomes. An online meeting can be arranged where you can benefit from the experience of researchers who have already been through this process several times.

II. Has the grant already been written, but you are unsure about the content of the reference genome generation section?

The expertise of ERGA's Committees can assist you in this endeavour. Upon request, we will conduct a brief review of the genomic section of your grant proposal to help ensure it is of a high standard.

Check the current status of the reference genome for the species you wish to sequence

Is another research group already producing the reference genome for your species? On one of the following portals, you can check to see if anyone has already produced the reference genome for your species:

ENA: The European Nucleotide Archive (ENA) operates as a public archive for nucleotide sequence data. By bringing together databases for raw sequence data, assembly information and functional annotation, the ENA provides a comprehensive and integrated resource for this fundamental source of biological information.
ERGA data portal: This portal allows you to see the species for which high-quality reference genomes are being produced, or have already been produced, by ERGA-affiliated projects.
GoaT: Genomes on a Tree presents genome-relevant metadata for all eukaryotic taxa across the tree of life. Metadata in GoaT include genome assembly attributes, genome sizes, C values, and chromosome numbers from multiple sources. GoaT also collects information from various BioGenome projects about the species they plan to sequence and/or have already started sequencing.
Other BioGenome consortia: Darwin Tree of Life (DToL), Catalan Initiative for the Earth BioGenome Project (CBP), The Earth BioGenome Project (EBP), the Vertebrate Genomes Project (VGP).

2. Sample acquisition strategy

You are planning your field work and would like to know where to start? Here are some key considerations to help you get started:

a. Permits

Sample collection should comply with local and EU regulations. Before collecting, make sure you have all the required permits authorising the collection, export and sequencing of the species, including open-access data deposition.

i. ABS - Nagoya: The Nagoya Protocol is an international agreement that was adopted in 2010 as a supplementary agreement to the Convention on Biological Diversity. The Nagoya Protocol sets out rules and procedures regarding the access to genetic resources, and the sharing of the benefits derived from their usage. It also provides guidance on how to ensure the fair and equitable sharing of the benefits arising from the utilisation of traditional knowledge associated with genetic resources. The Protocol has been ratified by over 100 countries, and has been widely praised as an effective tool for promoting the conservation and sustainable use of biological diversity. To proceed, researchers should first verify whether their country has signed the relevant agreement, and subsequently, they should reach out to the designated focal point. Researchers should outline their intentions to conduct genome sequencing and subsequent release, while also requesting the necessary permit.

ii. Sample collection permit: This permit allows the holder to collect wildlife samples, with the understanding that all samples are used for scientific research and educational purposes only. The holder must abide by all applicable international, national, and local legislation when collecting wildlife samples. No sample may be collected without prior approval from the relevant authority. Some general rules are listed below, with more detailed information presented in the ERGA Sampling Code of Conduct:

1. Ensuring that all samples collected are properly labelled and documented.
2. Providing adequate protection of the samples while in transport.
3. Maintaining records of all samples collected.
4. Obtaining authorization to transport the samples to the appropriate research facility.
5. The holder of this permit is responsible for disposing of all unwanted samples in a manner that is respectful to the environment and in compliance with all applicable laws.
6. The holder of this permit must provide a copy of the collected samples to the appropriate authorities upon request.
7. The undersigned agrees to abide by the conditions of this permit.

b. Traditional Knowledge and Biocultural labels

It is important to determine whether an Indigenous population or a local community is involved in the project, or whether the species is of special concern to them. If so, a label should be requested. The Labels allow communities to express local and specific conditions for sharing and engaging in future research and relationships in ways that are consistent with existing community rules, governance and protocols for using, sharing and circulating knowledge and data. The primary objectives are to enhance and legitimise locally based decision-making and Indigenous governance frameworks for determining ownership, access, and culturally appropriate conditions for sharing historical, contemporary, and future collections of cultural heritage and Indigenous data. For more information, check the ‘Local Contexts’ website.

c. Sample Collection

What sampling procedures do you follow in the field? For the generation of reference genomes, the preferred preservation method is flash freezing in liquid nitrogen. Many kinds of biological material can be stored in liquid nitrogen, including cells, tissue samples and entire individuals. Because liquid nitrogen freezes samples rapidly, it allows them to be stored for long periods while minimising DNA/RNA degradation. It is essential to collect and process samples according to the requirements specified by the sequencing facility. Species that can be kept alive can be transported to a lab, where the material is processed and flash frozen immediately after dissection to prevent DNA and RNA degradation. The sample should be dissected on a plate placed on ice, to keep it cold, and then flash frozen in liquid nitrogen.

d. Taxonomic Validation

Did an expert taxonomist confirm the identity of the collected species? Taxonomic validation is a complex and important process that is necessary for accurately classifying organisms by their physical and genetic traits. There have been cases where a reference genome was produced and the species later turned out not to be the one that was targeted. To prevent this, we recommend DNA barcoding the sample whenever possible.

e. Vouchering

A voucher specimen is a representative sample of the collected species. A voucher preserves as much as possible of the physical remains of an organism, serving as a verifiable and permanent record of wildlife. The sample is typically collected in the field and preserved in a herbarium or museum collection. Set aside a voucher specimen and take scaled pictures, following the requirements of the respective collection facility (link of some facilities as an example). Note that e-vouchers, which consist of digital documentation and images, are also permissible in certain cases.

f. Biobanking

Biobanking refers to the storage of biological samples for research purposes. Animal and plant tissue biobanking can be used to track genetic changes over time, which helps in understanding the evolution of species. Material for biobanking should be deposited in biobank repositories. In addition to tissue biobanking, DNA biobanking is also possible. Ideally, the tissue and DNA come from the same specimen that will be sequenced, but for very small specimens a different individual can be used. The material should preferably be deposited in a repository in its country of origin. If national infrastructure is not available, or in addition to it, the LIB Biobank at Museum Koenig, Bonn, can centrally store any ERGA project samples. For contact information and sample requirements, please see the LIB Biobank deposition guidelines.

g. Storage

Samples must be kept as cold as possible to prevent DNA degradation prior to sequencing. If possible, place the sample tubes into dry ice, a charged LN2 Dry Shipper (< -150°C) or a -80°C freezer. Please note that wet ice and -20°C freezers are not appropriate for the storage of tubes containing samples intended for genome sequencing.

h. Material Transfer Agreements

MTAs are agreements between two parties, typically a provider and a recipient, that govern the transfer of biological samples. MTAs are used to ensure that the provider of the material is adequately compensated for the use of the material, and that the recipient of the material is legally and ethically responsible for its use. Sample providers should be aware of any MTA, for example when sending biological material between their research facility and sequencing centres/biobanks. Please check the requirement with your sequencing centre and biobank. More information can be found in the CETAF Code of Conduct and Best Practices (Example MTA without change in ownership).

i. Shipping

All samples must be shipped on dry ice or in a dry shipper. Please make sure that the shipping company refills the dry ice often enough, including at borders. Pay particular attention to the regulations of non-EU countries in Europe.

j. ERGA manifest

Do you wish to learn what metadata you need to submit with your sample in order to register it with ERGA as an ERGA Community genome? This is the ERGA sample manifest. Fields marked in bold are the mandatory variables.

k. ENA mandatory fields:

The European Nucleotide Archive (ENA) operates as a public archive for nucleotide sequence data. This is the ENA checklist of minimum requirements to register a physical sample.

3. DNA/RNA extraction

Have you acquired your samples, and are you ready to extract DNA and RNA? Ideally, high molecular weight (HMW) DNA and RNA should not be shipped, but extracted on site. If shipping is unavoidable, they must be handled very carefully prior to delivery.

DNA

a. DNA extraction protocols: DNA extraction is the process of isolating DNA from cells, tissues or other biological samples.

b. High Molecular Weight DNA extraction protocols: Please see the following section, Library preparation, for a list of recommended protocols for extracting and preparing HMW DNA for sequencing.

RNA extraction protocols

RNA extraction involves separating ribonucleic acid (RNA) from a cell or a tissue sample.

DNA concentration, integrity, and purity

i. DNA concentration: typically measured in nanograms per microlitre (ng/µL). It can be determined using techniques such as Qubit assays.

ii. DNA integrity: a measure of the quality of the DNA. It can be determined using gel electrophoresis or PCR-based methods. It is important to ensure that the DNA is intact and not degraded, as degradation can affect the accuracy of results.

iii. DNA purity: a measure of the level of impurities in DNA samples. It is important to ensure that the DNA is free from contaminants, as these can affect the accuracy of results. DNA purity can be assessed using spectrophotometry-based methods such as NanoDrop.
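The concentration and purity measurements described above lend themselves to a couple of simple calculations. The sketch below is illustrative only: the function names are hypothetical, and the absorbance ratios A260/A280 and A260/A230 are the ones commonly reported by spectrophotometers rather than values prescribed by ERGA. Acceptable ranges depend on your sequencing facility, so none are hard-coded here.

```python
def total_dna_ng(concentration_ng_per_ul: float, volume_ul: float) -> float:
    """Total DNA mass in nanograms: concentration (ng/µL) times volume (µL)."""
    if concentration_ng_per_ul < 0 or volume_ul < 0:
        raise ValueError("concentration and volume must be non-negative")
    return concentration_ng_per_ul * volume_ul


def purity_ratios(a230: float, a260: float, a280: float) -> dict:
    """Absorbance ratios commonly used as purity indicators.

    a230, a260 and a280 are absorbance readings at 230, 260 and 280 nm.
    """
    if a230 <= 0 or a280 <= 0:
        raise ValueError("absorbance readings must be positive")
    return {"A260/A280": a260 / a280, "A260/A230": a260 / a230}


if __name__ == "__main__":
    # Hypothetical readings, for illustration only.
    print(total_dna_ng(50.0, 40.0), "ng of DNA in the tube")
    print(purity_ratios(a230=0.5, a260=1.0, a280=0.5))
```

Compare the resulting values against the input requirements of the library preparation protocol you intend to use.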

4. Library preparation

You’ve extracted your DNA and are wondering how to obtain the DNA/RNA sequencing data required to assemble or annotate your genome? Library preparation is a key step in the sequencing process and will determine the quality of your assembly and annotation. The DNA must be processed properly to obtain accurate and reliable results. Here you can find our recommended library preparation protocols for PacBio, Oxford Nanopore Technologies, Chromatin Conformation Capture (Hi-C) sequencing and whole-transcript sequencing, among others.

PacBio HiFi

PacBio HiFi reads are typically made up of DNA fragments around 10-15 kb in size and have an accuracy of over 99%. They are constructed by circularising DNA and sequencing it repeatedly to create a highly accurate Circular Consensus Sequence (CCS). This protocol has a history of producing high-quality de novo reference genomes for a wide range of species.

ONT

Oxford Nanopore Technologies offers an alternative way to read long pieces of DNA, by measuring the electrical fluctuations caused by nucleotides passing through a membrane pore. The reads sequenced here can be much longer than with PacBio HiFi (typically over 30 kb, and established ultra-long library protocols yield reads of over 200 kb in length) but come with a higher error rate. As the hardware and base-calling software have improved over time, the modal error rate has fallen from over 15% to almost 1%.

Hi-C Arima or Dovetail genomics

3-dimensional Chromatin Conformation Capture libraries allow us to gain insight into the organisation of the genome into Topologically Associating Domains (TADs), eu- and heterochromatin, and chromosomes. In the generation of a reference genome, we leverage the fact that regions close together in the linear sequence are more likely to be close together in 3D space in order to order and orient our smaller assembled sequences (contigs and scaffolds) into chromosomes. Hi-C protocols generally follow the steps of isolating nuclei, cross-linking chromatin in its 3D conformation, digesting the DNA at either enzyme motif sites (Arima) or DNase-exposed areas of the genome (Dovetail), and then sequencing the two cross-linked regions via paired-end sequencing on an Illumina device.

Illumina shotgun sequencing

Useful for error-correction of the final assembly, or identifying sequences from parental lines when performing a trio-binned assembly, Whole Genome Sequencing (WGS or Shotgun Sequencing) aims to sequence the entire genome in short fragments (typically 100/150bp paired-end libraries) with high accuracy (Q30 or 99.9% accuracy).
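The equivalence between Q30 and 99.9% accuracy comes from the Phred quality scale, where Q = -10 · log10(p) and p is the probability that a base call is wrong. The following minimal sketch (function names are illustrative) converts between the two, which is also handy for comparing the error rates quoted for other technologies in this section.

```python
import math


def phred_to_error_probability(q: float) -> float:
    """Probability that a base call is wrong for a given Phred score Q."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Phred score Q corresponding to a per-base error probability p."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(p)


if __name__ == "__main__":
    for q in (10, 20, 30, 40):
        accuracy = 1 - phred_to_error_probability(q)
        print(f"Q{q}: accuracy {accuracy:.4%}")
    # A 1% error rate corresponds to Q20; a 15% error rate is below Q10.
    print(round(error_probability_to_phred(0.01), 2))
    print(round(error_probability_to_phred(0.15), 2))
```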

RNA-seq

RNA-seq is recommended to support the annotation of your reference genome. Sequencing of RNA-seq libraries is typically performed on an Illumina instrument after RNA has been extracted from your tissues of interest (usually brain or gonad for genome annotation), converted to cDNA and finally amplified before loading onto an instrument.

Iso-seq

The PacBio Iso-seq protocol offers full-length sequencing of transcripts, which is particularly powerful when annotating alternate isoforms in the genome. The sequencing is performed on a PacBio instrument and again leverages the repeated sequencing of circular cDNA to create a high-accuracy consensus sequence for each transcript.

5. DNA sequencing data

You’ve finished the DNA sequencing for your genome and want some guidance with your assembly to ensure you meet ERGA quality standards? The Sequencing and Assembly Committee will prepare a number of workflows that you can download and run to assemble your genome.

I’m having trouble with my assembly

Assembling a partially pentaploid, highly repetitive, AT-rich genome? The Sequencing & Assembly Committee (SAC) would love to hear about your genome and can advise on what to do next. Contact assembly@erga-biodiversity.eu to arrange a presentation at the fortnightly committee meeting and get feedback from our members.

6. RNA sequencing data - “An assembly is nothing without an annotation”

After you have produced a reference-quality genome assembly, you should think about annotating the key features of your genome. This includes, but is not limited to, finding and recording the locations of: Repeat sequences; Transposable Elements; Telomeres and Centromeres; Protein-coding sequences; MicroRNA sequences (miRNA); Non-coding RNA sequences (ncRNA).

The Annotation Committee has prepared a number of workflows that you can download and run to annotate your genome.

The Annotation Committee can guide you with some of these steps. For ERGA Community genomes, we also recommend uploading your genome to ENA, where Ensembl can annotate your genome using publicly available transcript data.

7. Assembly completed

You have produced a genome assembly and want to associate it with ERGA as a Community genome? Here we detail the next steps required to obtain the ERGA label for your genome and some recommendations for what to do next as part of our best practices:

How do I know if my assembly is good enough?

First, your assembly should meet the EBP metrics. The Sequencing and Assembly Committee (SAC) will be able to guide you through the post-assembly QC process: either submit an ERGA Assembly Report (EAR) or present your genome at a SAC meeting.
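The EBP metrics themselves are not listed here, so consult the EBP documentation or the SAC for the current thresholds. As one illustration of the kind of contiguity statistic commonly reported in post-assembly QC, the sketch below computes N50: the sequence length at which sequences of that length or longer contain at least half of the total assembly length. The function name and example lengths are hypothetical.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one sequence length is required")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first length at which the cumulative sum reaches half of the
        # total assembly length is the N50.
        if running >= half_total:
            return length
    raise AssertionError("unreachable for non-empty input")


if __name__ == "__main__":
    # Toy contig lengths in base pairs, for illustration only.
    contigs = [10, 20, 30, 40]
    print("N50:", n50(contigs))
```

In practice, the lengths would be read from your assembly FASTA file, and N50 would be considered alongside the other metrics in the report rather than on its own.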

Open-access genomes for all

If you have a high-quality genome and want to associate it with ERGA, it needs to be of EBP quality and in the public domain. We recommend uploading your genome to ENA and then contacting the SAC. Once your genome has the “Seal of Approval”, we will link your publicly available genome to the ERGA Community Genomes BioProject.

8. Annotation completed

You have an assembly and annotation that you wish to associate with ERGA as a Community genome? Here we detail the next steps required to obtain the ERGA label for your genome and some recommended next steps:

How do I know if my assembly and annotation are good enough?

First, your assembly should meet the EBP metrics. The ERGA Annotation Committee will be able to guide you through the post-assembly QC process: either submit an EAR or present your genome at a SAC meeting. Your annotation should be in a format that can be downloaded and used by all (e.g. GFF3) and linked to your assembly.
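GFF3, the example format mentioned above, is a plain-text format with nine tab-separated columns per feature: seqid, source, type, start, end, score, strand, phase and attributes. The following minimal sketch parses a single feature line as a quick sanity check. It is not a full validator, and the function name and example record are hypothetical.

```python
from urllib.parse import unquote

GFF3_COLUMNS = (
    "seqid", "source", "type", "start", "end",
    "score", "strand", "phase", "attributes",
)


def parse_gff3_line(line: str):
    """Parse one GFF3 feature line into a dict; return None for comments/blanks."""
    if not line.strip() or line.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(GFF3_COLUMNS):
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    record = dict(zip(GFF3_COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    # Attributes are semicolon-separated tag=value pairs with URL-style escapes.
    attributes = {}
    for pair in record["attributes"].split(";"):
        if pair:
            key, _, value = pair.partition("=")
            attributes[unquote(key)] = unquote(value)
    record["attributes"] = attributes
    return record


if __name__ == "__main__":
    example = "chr1\texample\tgene\t100\t900\t.\t+\t.\tID=gene1;Name=demo%20gene"
    print(parse_gff3_line(example))
```

The seqid values in the annotation should match the sequence names in the assembly, which is one practical way of checking that the annotation is linked to your assembly.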

How do I get the ERGA label?

You need to upload your assembly, annotation and all sequenced data to ENA in order to be associated with the ERGA BioProject. Once your genome and data are available, contact the SAC to get the “Seal of Approval” and have your genome linked to the ERGA BioProject. If you wish to make use of the Ensembl rapid annotation, all associated transcript sequencing data also need to be published on ENA.

What next?

Now that you have a high-quality genome, there is a host of analyses that can be performed, including Population Genomics, Phylogenomics, Comparative Genomics & Functional Genomics. The Data Analysis Committee has produced a guide on how to conduct a variety of Downstream Analyses.

9. Downstream analysis

You have an ERGA reference genome and you would like to analyse the data? Here we suggest the next steps for planning your downstream analysis to the highest scientific standards, recommending frameworks and pipelines for tackling your research questions with your reference genome. High-quality reference genomes are an essential tool to detect genic and intergenic regions and identify genetic variants (e.g. SNPs, CNVs, and structural variants), which are crucial for understanding processes in the different fields of genomic research.

The Data Analysis committee (DAC) can provide additional help through its subcommittees devoted to the different fields of genomic research: Population Genomics, Phylogenomics, Comparative Genomics & Functional Genomics. You can contact the subcommittee relevant for your research question and meet with several experts in the field. You can also take the opportunity to present your research to the ERGA community and get relevant feedback to develop your research. DAC also offers opportunities for training through its conferences and workshops organised in collaboration with the Training and Knowledge Transfer committee.

DAC Subcommittees

i. Population Genomics: this subcommittee brings together researchers who specialize in studying genetic variation and evolutionary processes within populations. This field combines the principles of genetics, genomics, and population biology to understand how genetic diversity arises, spreads, and changes over time. The main objective of this group is to support the investigation of the genetic factors influencing the composition and dynamics of populations and species. Through collaborative efforts and interdisciplinary approaches, the subcommittee intends to contribute to the broader field of genomics and its applications in various areas of biodiversity.

ii. Phylogenomics: this subcommittee brings together researchers who focus on studying evolutionary relationships and the diversification of organisms using genomic data. The subcommittee's main purpose is to support research on the accurate and robust reconstruction of phylogenetic trees or evolutionary histories using genomic information. Through collaboration with research teams, the subcommittee intends to provide valuable insights into the tree of life and clarify the evolutionary history of European species.

iii. Comparative Genomics: this subcommittee brings together researchers devoted to studying and comparing the genomes of different organisms to gain insights into their evolutionary relationships, genetic variation, and functional elements. Comparative genomics combines genomics, bioinformatics, and evolutionary biology to explore the similarities and differences in the genetic makeup of various species. The subcommittee's main purpose is to support the analysis and interpretation of genomic data from multiple organisms, identifying shared and unique genomic characteristics in order to infer evolutionary relationships, gene function, and evolutionary processes. In this way, the subcommittee intends to advance our understanding of the genomic landscape across species and to contribute to various aspects of biological research.

iv. Functional Genomics: this subcommittee brings together researchers devoted to understanding the functional elements and activities of genomes, clarifying the functions and interactions of genes, non-coding elements, and regulatory networks, as well as their roles in various biological processes and disease conditions. The main objective of this group is to support research on how genomic information is translated into functional outcomes, exploring the relationships between DNA sequences, gene expression patterns, protein production, and cellular processes.
