Scalable, accessible, and reproducible reference genome assembly and evaluation in Galaxy.

Title: Scalable, accessible, and reproducible reference genome assembly and evaluation in Galaxy.
Publication Type: Journal Article
Year of Publication: 2023
Authors: Larivière D, Abueg L, Brajuka N, Gallardo-Alba C, Grüning B, Ko BJune, Ostrovsky A, Palmada-Flores M, Pickett BD, Rabbani K, Balacco JR, Chaisson M, Cheng H, Collins J, Denisova A, Fedrigo O, Gallo GRoberto, Giani AMaria, Gooder GMacDonald, Jain N, Johnson C, Kim H, Lee C, Marques-Bonet T, O'Toole B, Rhie A, Secomandi S, Sozzoni M, Tilley T, Uliano-Silva M, van den Beek M, Waterhouse RM, Phillippy AM, Jarvis ED, Schatz MC, Nekrutenko A, Formenti G
Journal: bioRxiv
Date Published: 2023 Jun 30
Abstract

Improvements in genome sequencing and assembly are enabling high-quality reference genomes for all species. However, the assembly process is still laborious, computationally and technically demanding, lacks standards for reproducibility, and is not readily scalable. Here we present the latest Vertebrate Genomes Project assembly pipeline and demonstrate that it delivers high-quality reference genomes at scale across a set of vertebrate species arising over the last ~500 million years. The pipeline is versatile and combines PacBio HiFi long reads and Hi-C-based haplotype phasing in a new graph-based paradigm. Standardized quality control is performed automatically to troubleshoot assembly issues and assess biological complexities. We make the pipeline freely accessible through Galaxy, accommodating researchers even without local computational resources and enhancing reproducibility by democratizing the training and assembly process. We demonstrate the flexibility and reliability of the pipeline by assembling reference genomes for 51 vertebrate species from major taxonomic groups (fish, amphibians, reptiles, birds, and mammals).

DOI: 10.1101/2023.06.28.546576
Alternate Journal: bioRxiv
PubMed ID: 37425881
PubMed Central ID: PMC10327048
Grant List: U01 CA253481 / CA / NCI NIH HHS / United States
U24 CA231877 / CA / NCI NIH HHS / United States
U24 HG010263 / HG / NHGRI NIH HHS / United States
U41 HG006620 / HG / NHGRI NIH HHS / United States