Connection #8 - Bioinformatics: reassembling the book of life
- luisamarins19
The European Reference Genome Atlas (ERGA) and the European node of the International Barcode of Life (iBOL Europe), two international communities of scientists brought together under the Biodiversity Genomics Europe Project, are joining forces for “Connections,” a series of blog posts that explore the fascinating world of Biodiversity Genomics and the intersection of their communities.
In our previous posts, we compared DNA to a book: barcodes help us identify which book we are holding, while reference genomes enable us to read every page. But here is the twist: by the time DNA leaves the wet lab, the book is broken, as if the pages had been run through a paper shredder. DNA extraction, library preparation, and sequencing all turn the long DNA sequence into millions of pieces (check Connections blog #3 for an overview of these steps of the genomic workflow). Bioinformatics is the art of turning that pile of shreds back into something we can read, search, and compare; it is what turns barcodes and reference genomes into usable knowledge.

Bioinformatics was born when molecular biology met computing. Its early milestones include the first sequence alignments and substitution matrices, dynamic-programming algorithms, searchable databases, and the first "find-it-fast" tools that supercharged homology searches. As sequencing scaled, assembly algorithms emerged, followed by hybrid approaches for long-read platforms. Alongside the algorithms came a family of file formats (FASTA/FASTQ/BAM/CRAM/GFF/GTF), workflow engines, and the hard-won lesson that reproducibility matters more than quick fixes.
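To make the "dynamic programming" idea concrete, here is a minimal Python sketch of global alignment scoring in the Needleman-Wunsch style. The match/mismatch/gap values and the example sequences are illustrative assumptions; real aligners use substitution matrices and heavily optimised implementations.

```python
# A minimal sketch of dynamic-programming alignment scoring
# (Needleman-Wunsch style). Scoring values are illustrative.

def align_score(a: str, b: str, match=1, mismatch=-1, gap=-2) -> int:
    """Return the optimal global alignment score of sequences a and b."""
    # dp[i][j] = best score for aligning a[:i] with b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        dp[i][0] = dp[i - 1][0] + gap          # a[:i] aligned to gaps
    for j in range(1, len(b) + 1):
        dp[0][j] = dp[0][j - 1] + gap          # gaps aligned to b[:j]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            diag = dp[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            dp[i][j] = max(diag,                 # pair a[i-1] with b[j-1]
                           dp[i - 1][j] + gap,   # gap in b
                           dp[i][j - 1] + gap)   # gap in a
    return dp[len(a)][len(b)]

# Two made-up fragments, standing in for sequencing reads
print(align_score("GATTACA", "GCATGCA"))
```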
For barcoding, the task is targeted: extract a standard marker (the book's "abstract"), check its quality, align it against a trusted database, and report the closest match with a measure of confidence. Think of well-indexed catalogues and fast look-ups, ideal for monitoring and quick assessments. For reference genomes, the task is editorial: correct sequencing errors, assemble the millions of pieces into chromosomes, phase haplotypes, polish with multiple evidence tracks (long reads, linked reads, Hi-C, RNA-seq), and annotate genes and repeats. That finished "book" enables studies of population genomics, local adaptation, and conservation genomics.
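As a toy illustration of the barcoding look-up, the sketch below compares a query marker against a tiny in-memory reference set and reports the closest match. The species names and sequences are invented, and the similarity measure is deliberately naive; real pipelines query curated databases such as BOLD and align sequences properly before scoring.

```python
# A toy sketch of barcode assignment: compare a query marker against
# a small reference set and report the closest match. All names and
# sequences below are made up for illustration.

reference_barcodes = {
    "Species A": "ATGGCCTTAGTAGGT",
    "Species B": "ATGGCGTTGGTAGCT",
    "Species C": "TTGGCCTAAGTACGT",
}

def identity(query: str, ref: str) -> float:
    """Fraction of agreeing positions; length differences count as mismatches."""
    matches = sum(q == r for q, r in zip(query, ref))
    return matches / max(len(query), len(ref))

def assign(query: str):
    """Return the best-matching species and its identity score."""
    best_name = max(reference_barcodes, key=lambda name: identity(query, reference_barcodes[name]))
    return best_name, identity(query, reference_barcodes[best_name])

species, score = assign("ATGGCCTTAGTAGCT")
print(f"Best match: {species} ({score:.0%} identity)")
```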

Modern analyses involve dozens of steps: quality control, trimming, deduplication, mapping, variant calling, assembly, scaffolding, and annotation, all wrapped in containers and workflows so that a colleague can re-run them next Tuesday and get the same answer. Good metadata is the structure that holds all the pages together: sample, permit, locality, preservation, instrument, kit, and version numbers (check this episode of the Genomic Connections Podcast to learn more about the importance of metadata). Without that structure, even the finest assembly becomes little more than a curiosity.
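As a sketch of what "metadata as structure" can look like in practice, here is a minimal Python record with illustrative field names and values; real projects follow community standards (for example, ENA sample checklists) rather than an ad-hoc schema like this one.

```python
# A minimal sketch of a structured metadata record. Field names and
# values are illustrative, not a community standard.

from dataclasses import dataclass, asdict
import json

@dataclass
class SampleMetadata:
    sample_id: str         # unique identifier for the specimen
    permit: str            # collection/export permit reference
    locality: str          # where the sample was collected
    preservation: str      # how the tissue was preserved
    instrument: str        # sequencing platform
    kit: str               # library preparation kit
    pipeline_version: str  # version of the analysis workflow

record = SampleMetadata(
    sample_id="BGE-0001",
    permit="PERMIT-2023-042",
    locality="Serra da Estrela, Portugal",
    preservation="flash-frozen",
    instrument="PacBio Revio",
    kit="SMRTbell prep kit 3.0",
    pipeline_version="v2.1.0",
)

# Serialise alongside the sequence data so the "book" keeps its structure
print(json.dumps(asdict(record), indent=2))
```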
A few field notes from the trenches
Everyone has a story of a 2 a.m. run that failed because a file was called final_FINAL_reallyFinal.fastq.gz. We have all been rescued by checksums, saved by containerised toolchains, and learned never to delete intermediate files before the MultiQC report is green. We name scripts after pets, we comment our code (eventually), and we celebrate the day a 500 GB BAM shrinks elegantly into a reproducible VCF.
Why does this matter to BGE? For iBOL Europe, robust bioinformatics means clean barcode libraries, sound assignments, and credible trend analyses. For ERGA, it means reference genomes that stand up to re-analysis and can power subsequent population, functional, and comparative genomics: the applications stakeholders care about, from conservation planning to bioeconomy uses.

Bioinformatics is not an afterthought: it is a research field in its own right! It is the bridge from sequencer output to decisions. Treat pipelines as publishable methods, treat metadata as data, and treat your future self as a collaborator who deserves clarity. In the next post, we will show how these computational foundations are applied in practice, from monitoring to policy and management, without losing sight of the big picture (or the pages).