MGnify Catalogue of marine genomes
Result ID
AtlantECO-KO-33
Description
This latest release (v2.0) of the marine catalogue contains data from 1,628 studies, including genomes from major sampling expeditions such as Tara Oceans, Malaspina, GO-Ship, and Geotraces, amongst others. An important advancement in this version is that the catalogue is representative of the marine genomes available in the public archives at the time of generation. It includes not only genomes generated and submitted by MGnify, but also MAGs generated by other groups and submitted to the INSDC, as well as isolate genomes (and MAGs) from marine data curated by MarDB. All genomes undergo the same quality filtering at QS50 (QS, quality score, defined as completeness - 5 × contamination), resulting in a total of 50,866 genomes (50,634 MAGs and 232 isolates) in v2.0 of the MGnify Genomes marine catalogue.
This catalogue was generated using v2.3.0 of the MGnify Genomes catalogue pipeline, which is available on GitHub and WorkflowHub. Genomes are clustered into 13,223 species-level clusters, with a representative genome defined for each cluster. GTDB-Tk (Genome Taxonomy DataBase Toolkit) is used to assign a taxonomy to each cluster, which also shows what proportion of genomes in the catalogue can be considered novel with respect to the current release of GTDB.
As with all MGnify Genomes catalogues, the data can be accessed and queried through both the MGnify website and the MGnify API. The catalogue can be browsed as a list of genomes, with filtering on metadata fields, or as a taxonomic tree that provides access to individual cluster representative genome records. Each genome record contains comprehensive genome statistics, summaries of annotations, and an interactive genome browser for exploring the annotation tracks and their genomic context. All results files can be downloaded via HTTP or FTP from the catalogue directory.
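As a rough illustration of the quality filter described above, the following minimal Python sketch computes QS as completeness - 5 × contamination and keeps genomes above the cutoff. It is not the catalogue pipeline's code: the `GenomeQuality` record, the function names, and the example accessions are hypothetical, and reading "QS50" as a strict QS > 50 is an assumption.

```python
from dataclasses import dataclass

# Assumption: "QS50" is read here as a strict cutoff of QS > 50.
QS_THRESHOLD = 50.0


@dataclass(frozen=True)
class GenomeQuality:
    """Hypothetical record holding per-genome quality estimates (in percent)."""
    accession: str
    completeness: float
    contamination: float


def quality_score(genome: GenomeQuality) -> float:
    """QS as defined in the description: completeness - 5 x contamination."""
    return genome.completeness - 5.0 * genome.contamination


def passes_filter(genome: GenomeQuality, threshold: float = QS_THRESHOLD) -> bool:
    """Return True when the genome's QS is strictly above the threshold."""
    return quality_score(genome) > threshold


def filter_genomes(genomes, threshold: float = QS_THRESHOLD):
    """Keep only the genomes that pass the QS cutoff, preserving input order."""
    return [g for g in genomes if passes_filter(g, threshold)]


if __name__ == "__main__":
    # Made-up example values, not taken from the catalogue.
    candidates = [
        GenomeQuality("example_A", completeness=90.0, contamination=5.0),
        GenomeQuality("example_B", completeness=70.0, contamination=5.0),
    ]
    for g in filter_genomes(candidates):
        print(g.accession, quality_score(g))
```

Because contamination is weighted five times more heavily than completeness, a fairly complete genome can still be excluded if its contamination estimate is high.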
There are also two sequence-based search options, both available via the website or API. The first is a COBS (COmpact Bit-sliced Signature index)-based query for searching gene sequences against the catalogues. The second is a k-mer-based search using Sourmash, which allows whole genomes or sets of genomes to be queried against the catalogues.
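To give a feel for what a k-mer-based comparison measures, here is a toy Python sketch that compares two sequences by their exact k-mer sets. It does not use Sourmash or reproduce its method: real tools work on hashed, subsampled sketches of k-mers and account for reverse complements, both of which are omitted here. The function names and the short example sequences are hypothetical.

```python
def kmers(sequence: str, k: int) -> set:
    """Return the set of all overlapping substrings of length k."""
    seq = sequence.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: shared k-mers divided by all distinct k-mers."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def containment(query: set, target: set) -> float:
    """Fraction of the query's k-mers that also occur in the target."""
    if not query:
        return 0.0
    return len(query & target) / len(query)


if __name__ == "__main__":
    # Made-up sequences; real searches use much longer k (for example 21 or 31).
    query = kmers("ACGT", k=3)
    target = kmers("ACGTAC", k=3)
    print("Jaccard:", jaccard(query, target))
    print("Containment:", containment(query, target))
```

Containment is often the more useful measure when a small query is compared with a much larger target, since Jaccard similarity is pulled down by the size difference alone.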