Gap-analysis of DNA barcoding reference libraries for two taxon inventories

Background All-inclusive DNA barcoding libraries in the storage and analysis platform of the BoLD (Barcode of Life Data) system are essential for the study of marine biodiversity and are pertinent for regulatory purposes, including ecosystem monitoring and assessment, such as in the context of the EU Water Framework Directive (WFD) and the Marine Strategy Framework Directive (MSFD). Here we investigate knowledge gaps in the lists of DNA barcoded organisms within the Cnidaria (Anthozoa and Hydrozoa) and Ascidiacea reference libraries of the European Register of Marine Species (ERMS) dataset (402 ascidian and 1200 cnidarian species). ERMS records were checked, species by species, against publicly available sequence information and the other data stored in BoLD pages.

Results Just 22.9% and 29.2% of the listed ascidian and cnidarian species, respectively, are among BoLD's barcoded species, and of these seemingly barcoded species only 58.4% and 52.3%, respectively, were noted to have complete BoLD pages. Thus, only 11.44% of the tunicate and 17.07% of the cnidarian data in the ERMS lists are of high quality. Deep analyses revealed seven common types of gaps in the list of the barcoded species, in addition to a wide range of discrepancies, misidentifications, discordances and errors, primarily in the GenBank-mined data and in the BIN assignments. [...] taxonomic groups exist and, in addition, quality elements (quality assurance and quality control) were not employed when using the [...] for national monitoring projects, regulatory compliance and other purposes. Even though BoLD is the most trustworthy DNA barcoding reference library, worldwide DNA barcoding projects are needed to close these gaps of mistakes, verifications, missing data and unreliable sequences. Tight quality control and quality assurance are needed to close the knowledge gaps of the European recommended ERMS reference library.

Following the global analysis of Weigand et al. [5], we aim here to investigate potential gaps in already DNA barcoded organisms (based on publicly available data in the BoLD database) listed in two reference libraries of the ERMS inventory. We discuss the necessity of quality control (QC) when building and curating a barcode reference library, and provide recommendations for filling the gaps in the barcode library of European aquatic taxa.

Methods
Each BoLD page consists of six sections: 1. A short taxon description with a link to a species-specific website; 2. Statistics data, including the number of records, specimens with sequences, specimens with barcodes, subspecies, subspecies with barcodes, public records, publicly available subspecies and public BIN clustering; 3. Worldwide specimen depositories; 4. Origin of sequences (GenBank ID numbers or sequencing laboratories); 5. Collection sites, including countries; 6. Species image gallery.
Species checklists of two distant taxa (the Anthozoa and Hydrozoa of the phylum Cnidaria and the class Ascidiacea of the phylum Chordata) were downloaded from the European Register of Marine Species dataset (ERMS [12]). These taxa were not analysed in Weigand et al. [5] and are used here for deep analyses of the quality control (QC) of the barcoded species from these two lists. Taxonomic conformity and correct spelling were checked manually, species by species, against the World Register of Marine Species database (WoRMS [18]) and assessed following the recommendations of Costello et al. [19].
Finalized species-level checklists were re-ordered and compared with the BoLD list. For the analyses of the barcoded species (the COI marker) we used the checklist on BoLD created by Dirk Steinke, titled 'Marine Animals Europe' (BoLD checklist code: CL-MARAE; last updated on March 20th, 2017). The full list contains 27,634 records of marine species belonging to 10 phyla: Annelida, Arthropoda (class Decapoda and superorder Peracarida), Brachiopoda, Chordata (class Ascidiacea of the subphylum Tunicata, and class Pisces), Cnidaria (classes Anthozoa and Hydrozoa), Echinodermata, Mollusca (classes Bivalvia and Gastropoda), Nemertea, Priapulida, and Sipuncula. Datasets were generated for two checklists (the cnidarians and the ascidians; updated 18 July 2019) that were compared, species by species, with all publicly available sequence information in the BoLD system and with the other data stored in BoLD pages. Working species by species allows the discrimination and analysis of the records in BoLD: the number of sequences per species, the number of BINs per species, the specimens in public depositories, the number of barcodes publicly stored in BoLD, including those mined from the GenBank database at the U.S. National Library of Medicine (NCBI [20]), and the number of privately stored barcodes per species in BoLD. No geographical data were considered for the taxa analysed.
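The species-by-species comparison of a checklist against a list of barcoded species can be illustrated with a minimal sketch. It is not the procedure used in this study (which was performed manually); the function names and the placeholder species names are hypothetical, and the name normalization shown (whitespace and case only) is a simplifying assumption that does not replace a taxonomic check against WoRMS.

```python
def normalize(name: str) -> str:
    """Collapse repeated whitespace and ignore case when comparing names."""
    return " ".join(name.split()).lower()


def coverage(checklist, barcoded):
    """Return (matched names, missing names, percent of checklist matched)."""
    wanted = {normalize(n) for n in checklist}
    have = {normalize(n) for n in barcoded}
    matched = sorted(wanted & have)
    missing = sorted(wanted - have)
    percent = 100.0 * len(matched) / len(wanted) if wanted else 0.0
    return matched, missing, percent


# Placeholder names only; these are NOT real ERMS or BoLD entries.
checklist_names = ["Exampla prima", "Exampla secunda", "Exampla tertia", "Exampla quarta"]
barcoded_names = ["exampla prima", "Exampla  tertia", "Otherus quintus"]

matched, missing, percent = coverage(checklist_names, barcoded_names)
print(f"{len(matched)} of {len(checklist_names)} checklist species matched ({percent:.1f}%)")
print("missing:", missing)
```

Exact string matching of this kind misses synonyms and misspellings, which is why a manual or WoRMS-assisted name check precedes the comparison.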

Results
The analysis was performed on the BoLD [3] database for the 1603 species extracted from the ERMS taxonomic checklists, including 402 ascidian species and 1200 hydrozoan and anthozoan species (Supporting Information Table 3). Checking against the BoLD database (July 18th, 2019 inventory), we found only 88 (22.9%) of the ascidian species and just 351 (29.2%) of the cnidarian species in the list of BoLD's barcoded species (Table 1). Analyses were then performed on all the available BoLD records at three levels: the statistics level, the repository level and the sequence level. At the statistics level, data for each species were individually inspected for the number of BoLD specimens (records), records that hold full sequences, records with just barcodes, and the numbers of public records and public BINs. At the repository level, we searched for the number of records mined from GenBank and the number of repository facilities where voucher specimens were deposited. At the sequence level, we recorded the publicly available sequences and scored the quality of the sequence trace files (good, medium or low) (Table 1, Supporting Information Tables 2-3). Table 1 note: ** = total number of species with more than one BIN in BoLD.
Ascidians (Table 1, Supporting Information Table 1) Of the 88 ERMS species referenced in the BoLD database, only 50 (12.4% of the whole ERMS list of ascidians) have more than five specimens barcoded. The COI gene sequence was assigned to 81 species of this list and just 78 of the species (19.4% of total species) have full descriptions in the BoLD pages, including number of specimen records, sequences, specimens with barcodes, species names, public records, public species and public BINs. As for the COI sequences, we assigned three types of records: records with no sequences, records containing sequences downloaded from GenBank (hence with no trace files and without trusted curation), and records containing new BoLD-related sequences (with trace files). Many species contained all three record types. Public BINs were assembled, and 32 species contained more than a single BIN (Table 1).
Cnidarians (Table 1, Supporting Information Table 2) Of the 351 species found in the BoLD database, only 153 (12.7% of the whole ERMS list of 1200 species) have more than five specimens barcoded. The COI gene sequence was assigned to 310 species of this list and just 297 species (6.5% of total species) had full descriptions in the BoLD pages, including number of specimen records, sequences, specimens with barcodes, species names, public records, public species and public BINs. As for the COI sequences, we assigned three types of records: records with no sequences, records containing sequences downloaded from GenBank (hence with no trace files and without trusted curation), and records containing new BoLD-related sequences (with trace files). In 278 species COI was sequenced with trace files available; however, in 205 species sequences were mined from GenBank. A total of 68 public BINs were assembled and 65 species contained more than a single BIN (Table 1).
We then assigned seven common types of gaps in the list of barcoded species: (a) records with no data available (empty pages on the BoLD website): 10 tunicate species and 41 cnidarians; (b) records with partial public data and no COI sequences: 10 tunicate species and 41 cnidarians; (c) no publicly available records: 17 tunicate species and 73 cnidarians; (d) records with sequences mined from GenBank, many with gaps in the sequences and all without trace files: 74 tunicate species and 205 cnidarians; (e) records dispersed between more than a single BIN: 29 tunicate species and 65 cnidarians; (f) records with no BIN: 25 tunicates and 120 cnidarians; (g) records with sequences dispersed within several BINs: 29 tunicate species and 65 cnidarians (examples of the seven knowledge gap types are detailed in Table 2). In summary, only 52.3% and 58.4% of the seemingly 'barcoded species' of cnidarians and tunicates, respectively, on the BoLD website have complete BoLD pages, corresponding to just 11.44% of the tunicate species and 17.07% of the cnidarian species appearing in the ERMS lists.
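The three record types described above can be expressed as a small classification sketch. The `Record` fields and the toy records are hypothetical and do not reflect BoLD's export format; the sketch only shows how a species can be flagged as holding all three record types at once.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional

ALL_TYPES = {"no_sequence", "genbank_mined", "bold_with_trace"}


@dataclass
class Record:
    species: str
    sequence: Optional[str]      # None when the record holds no sequence
    mined_from_genbank: bool     # assumed flag: sequence imported from GenBank
    has_trace_file: bool         # assumed flag: trace file stored with the record


def record_type(r: Record) -> str:
    """Assign one of the three record types; anything else is 'other'."""
    if not r.sequence:
        return "no_sequence"
    if r.mined_from_genbank:
        return "genbank_mined"
    if r.has_trace_file:
        return "bold_with_trace"
    return "other"


# Toy records with placeholder species names (not real data).
records = [
    Record("Exampla prima", None, False, False),
    Record("Exampla prima", "ACGT", True, False),
    Record("Exampla prima", "ACGT", False, True),
    Record("Exampla secunda", "ACGT", True, False),
]

types_by_species = defaultdict(set)
for r in records:
    types_by_species[r.species].add(record_type(r))

# Species that contain all three record types at once.
all_three = sorted(s for s, t in types_by_species.items() if t == ALL_TYPES)
print(all_three)
```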
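As an illustration of how such gap types could be flagged from per-species summary counts, the sketch below maps a hypothetical summary record to the letters used above. The field names and decision rules are assumptions made for illustration, not the criteria applied in this study; with counts alone, types (e) and (g) cannot be told apart, so they are reported together.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SpeciesSummary:
    n_records: int          # all records on the species page
    n_public_records: int   # records open to the public
    n_coi_sequences: int    # records with a COI sequence
    n_genbank_mined: int    # sequences mined from GenBank (no trace files)
    n_bins: int             # number of BINs the records fall into


def gap_flags(s: SpeciesSummary) -> List[str]:
    """Return gap-type letters for one species under the assumed rules."""
    if s.n_records == 0:
        return ["a"]                # empty page: no data available
    flags = []
    if s.n_public_records == 0:
        flags.append("c")           # no publicly available records
    elif s.n_coi_sequences == 0:
        flags.append("b")           # partial public data, no COI sequences
    if s.n_genbank_mined > 0:
        flags.append("d")           # GenBank-mined sequences
    if s.n_bins == 0:
        flags.append("f")           # records with no BIN
    elif s.n_bins > 1:
        flags.append("e/g")         # records or sequences spread over several BINs
    return flags


# Toy summaries (invented counts, for illustration only).
empty_page = SpeciesSummary(0, 0, 0, 0, 0)
no_coi = SpeciesSummary(5, 3, 0, 0, 0)
private_only = SpeciesSummary(4, 0, 2, 0, 1)
multi_bin = SpeciesSummary(10, 10, 10, 6, 3)

for summary in (empty_page, no_coi, private_only, multi_bin):
    print(gap_flags(summary))
```

A species can carry several flags at once, which matches the observation that the gap types overlap across the barcoded species.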

Discussion
The transformation of marine ecosystems and global biodiversity loss [21][22] pose challenges to the developing marine and water strategy directives (such as WFD and MSFD), as well as to the development of marine policy and management approaches [23][24][25]. Clearly, the current knowledge of marine biodiversity is critical to the rigour of the science that underpins policy and management assessments, and to the future of all marine ecosystems.
Reliable barcode reference libraries are of particular importance [5]. It is evident that the level of taxonomic detection and the degree of accuracy are directly contingent on the newly developed molecular-biology depositories and on the completeness of the reference libraries and the reliability of the DNA records [26][27][28][29]. Given our increased reliance on molecular taxonomy as a robust tool [4], strengthening the existing reference libraries is a necessity for a wide range of scientific and applied purposes, such as monitoring, eDNA and metabarcoding approaches, all targeting matched identifications and the assessment of biodiversity and abundance [5,14-17,29].
To test whether the available barcoding data are applicable, we analysed two major reference libraries, the ascidians and the cnidarians from the ERMS recommended list. Working on members of two taxa, the Anthozoa and Hydrozoa of the phylum Cnidaria and the class Ascidiacea of the phylum Chordata, our analyses showed that important reference libraries lack reliable barcodes for these dominant marine macroinvertebrate species. The results further revealed a wide range of difficulties and inconsistencies, including issues of taxonomic congruency of the COI barcode records on the one hand and possible cryptic diversity (sensu Leite et al. [28]), which should be studied further, on the other hand. These clearly affect the completeness of the ERMS list, as only 52.3% and 58.4% of the cnidarians and tunicates, respectively, of the short list of seemingly well 'barcoded species' were noted to have confirmed complete BoLD pages, further highlighting that only 11.44% of the tunicate data and 17.07% of the cnidarian data in the ERMS lists are reliable and fully supported (July 2019 state).
Further, two major results have emerged. The first indicated that the BIN assignments revealed a sizeable number of discordances, many of which are probably related to species misidentifications or synonyms [28], or to the deficiency of the BIN clustering algorithm in correctly discriminating species [30]. The second pointed towards the low power of GenBank results as compared to BoLD, with discrepancies that are already noted in the literature [2,31] and are characterized by contamination of the query sequences, discordances and misidentifications. The BoLD and GenBank data storage systems are highly intermingled. About 11% of the COI barcode records on BoLD are mined from GenBank, while 75% of the COI barcodes on the GenBank system originate from the BoLD system [31]. Yet our results point to the many weaknesses associated with the GenBank data, which are less informative, do not present extended data elements such as trace electropherograms, specimen images, voucher numbers or BIN assignments, and are usually poorly curated (see also Bridge et al. [26]) as compared to BoLD.
Human-made artefacts during the barcoding processes, primarily in the GenBank storage data, affect the reliability of DNA barcoding in correctly assigning a given specimen to a species. The overall knowledge gaps in the DNA barcodes recorded in the present study are considerably high. Clearly, the two types of gaps, the first being records not found in BoLD and the second being errors and missing data in already DNA barcoded organisms that are found in the BoLD database, may dramatically impact the accuracy of any DNA-based assessment or biomonitoring approach that relies on BoLD's data. Many additional commonly accumulated errors include inaccurate taxonomic identifications of specimens by non-expert taxonomists (in addition to the lack of voucher specimens), sequence contamination, incomplete reference data and insufficient quality of the uploaded molecular data [5,30]. Working on reference libraries of DNA barcodes of marine organisms (invertebrate and fish taxa), Weigand et al. [5] recorded numerous identification errors, sequence contamination and incomplete reference data (missing trace files or primer information), as well as inadequate data management. The results of the present study, as supported by other recent studies [2,5,17,28,30,32], reveal that we are still far from possessing decent representative reference libraries for important taxonomic groups. In addition, new DNA barcode data are continuously made available for the already barcoded species and for additional species from the reference libraries, including additional auditing and annotation processes, altogether helping to close knowledge gaps and purge accumulated erroneous data.
Our global assessment of the completeness of the ERMS library shows that the available COI barcode data adequately cover just a small fraction of this reference library, raising an alarm about a similar status in other reference libraries, such as AZTI's Marine Biotic Index (AMBI; [33]).