Gap analysis of DNA barcoding in ERMS reference libraries for ascidians and cnidarians

Comprehensive DNA-barcoding libraries in the Barcode of Life Data (BOLD) system, a storage and analysis platform, are essential for the study of marine biodiversity and are pertinent for regulatory purposes, including ecosystem monitoring and assessment, such as in the context of the EU Water Framework Directive (WFD) and the Marine Strategy Framework Directive (MSFD). Here, we investigate knowledge gaps in the lists of DNA-barcoded organisms within two inventories, Cnidaria (Anthozoa and Hydrozoa) and Ascidiacea, from the reference libraries of the European Register of Marine Species (ERMS) dataset (402 ascidian and 1200 cnidarian species). ERMS records were checked, species by species, against publicly available sequence information and other data stored in the BOLD system. As the available COI barcode data adequately cover just a small fraction of the ERMS reference library, it is important to apply quality control to existing data, to close the knowledge gaps, and to purge errors from BOLD. Results revealed that just 22.9% and 29.2% of the listed ascidian and cnidarian species, respectively, are barcoded in BOLD, and that only 58.4% and 52.3% of these seemingly barcoded species, respectively, have complete BOLD pages. Thus, only 11.44% of the tunicate and 17.07% of the cnidarian data in the ERMS lists are of high quality. Deeper analyses revealed seven common types of gaps in the list of barcoded species, in addition to a wide range of discrepancies, misidentifications, discordances, and errors, primarily in the GenBank-mined data and in the BIN assignments. Knowledge gaps thus exist in the barcoding of important marine taxonomic groups; in addition, quality management elements (quality assurance and quality control) were not employed when using the list for national monitoring projects, regulatory compliance, and other purposes.
Even though BOLD is the most trustworthy DNA-barcoding reference library, worldwide DNA-barcoding projects are needed to close these gaps, which stem from mistakes, missing verifications, missing data, and unreliable sequencing labs. Tight quality control and quality assurance are important for closing the barcoding knowledge gaps in the ERMS reference library recommended for Europe.


*Correspondence: guy@ocean.org.il; Israel Oceanographic and Limnological Research, National Institute of Oceanography, 3108001 Haifa, Israel

Background
The accurate evaluation of biodiversity in any given ecosystem is a keystone element, even an imperative, in numerous biological and applied disciplines, including ecology, conservation biology, food regulatory compliance, forensics, and ecosystem monitoring and assessment [1,2]. In response to these needs, DNA-based taxon identification relying on genetic barcode markers, above all the Cytochrome c Oxidase I (COI) gene, is commonly used to assess biodiversity, including species identification, species boundaries, and diversity analyses. Besides COI, other markers are in use for animal barcoding, such as the mitochondrial Cytb and 12S genes or the nuclear 18S gene. While COI is commonly used for most animals, other markers are used for different taxa: RuBisCO (Ribulose-1,5-bisphosphate carboxylase/oxygenase) is used for plants, the internal transcribed spacer (ITS) rRNA region is often used for fungi, the 16S rRNA gene is widely used for identifying prokaryotes, and the 18S rRNA gene is mostly used for detecting microbial eukaryotes. Further markers can be found under "primers list" in BOLD, where a few thousand primers are available for the different markers used in species identification. BOLD [3], a cloud-based data storage and analysis platform that can also be employed as a curation tool, currently contains (as of April 2020) about 8 million barcodes, encompassing > 310,000 animal, plant, and fungal species. The rationale for using the COI marker gene in species barcoding is that intraspecific diversity for this gene is usually lower than interspecific diversity, which makes it effective for species identification; its use is further motivated by the difficulties associated with traditional morphological taxonomy [4].
The major benefit of using BOLD emerges when an unknown sequence is compared against the database to determine its closest species match, an evaluation that depends strictly on the correctness and reliability of the data stored in BOLD and on the quality of the barcode libraries [5]. It should be noted that while barcodes may be excellent tools for identifying species that are already in BOLD, they may have poor predictive power for species that are not yet represented, unless those species have close relatives already in BOLD.
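The idea of a closest species match can be illustrated with a minimal sketch. It assumes pre-aligned sequences of equal length and uses the uncorrected p-distance with an illustrative 3% cut-off; the function names, the toy library, and the threshold are hypothetical choices for this example and do not reproduce the identification engine used by BOLD.

```python
from typing import Dict, Optional, Tuple


def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    # Compare only sites where both sequences carry an unambiguous base.
    pairs = [(x, y) for x, y in zip(a.upper(), b.upper())
             if x in "ACGT" and y in "ACGT"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)


def closest_match(query: str,
                  library: Dict[str, str],
                  max_distance: float = 0.03) -> Tuple[Optional[str], float]:
    """Return (species, distance) for the nearest reference sequence.

    The species is None when even the nearest reference is farther than
    max_distance (an assumed, illustrative threshold).
    """
    best_species, best_dist = min(
        ((species, p_distance(query, seq)) for species, seq in library.items()),
        key=lambda item: item[1],
    )
    if best_dist > max_distance:
        return None, best_dist
    return best_species, best_dist


# Toy reference library with placeholder names and made-up sequences.
toy_library = {"Species A": "ACGTACGTAC", "Species B": "ACGTTTTTAC"}
print(closest_match("ACGTACGTAC", toy_library))
print(closest_match("ACGTTTGTAC", toy_library))
```

In this sketch a query with no sufficiently close reference returns no name at all, which mirrors the point made above: the result can only be as good as the reference library it is compared against.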
As BOLD is a curation tool, it is assumed that all barcode sequences stored in the database are backed by vouchered specimens and thoroughly identified by expert taxonomists. Yet, being a public database, BOLD, like any similar curation tool, inevitably accrues erroneous data, sometimes in significant amounts [2,6]. Taxonomic misidentifications and taxonomic conundrums, cryptic species complexes and the difficulty of delimiting them, and technical faults such as deficient DNA extraction, PCR-based errors, and foreign DNA contamination (including bacterial sequences, especially COI sequences) are just some of the causes that may unavoidably generate erroneous data and inaccurate sequences [2,6-10]. These difficulties may dramatically affect the accuracy of barcoding. For example, the Barcode Index Number (BIN) system serves as a persistent registry for animal operational taxonomic units (OTUs), which are recognized through sequence variation in the COI DNA barcode region. It aids species taxonomy by flagging possible synonyms among specimens that are likely to belong to the same species; however, it has been claimed that it can lead to a lack of unambiguous species-level identification in the BOLD system and to taxonomic conflicts through the assignment of more than a single species name per BIN [11].
The European Register of Marine Species (ERMS [12]) is an authoritative taxonomic checklist of species found in all European marine environments (an all-taxon marine species inventory covering the area from the Canaries and Azores to Greenland and northwest Russia, including the Mediterranean Sea and the Baltic Sea), from the deep sea across all continental shelf areas up to the splash zone above the high tide mark, and in estuaries down to 0.5 psu salinity. During 1997-1999, ERMS was published on the internet and subsequently as a book, containing a list of about 30,000 marine species of the kingdoms Animalia, Plantae, Fungi, and Protoctista occurring in the European marine environment [13]. This marine species inventory is projected to serve as the standard reference and technological tool for marine research and for management of the marine environment in Europe.
Until recently, the standardized methodologies available to practitioners for biological monitoring and management in marine environments were restricted to traditional morphological taxonomy, a set of tedious and time-consuming methods that require expert taxonomists with skills attained only through years of practice. This line of analysis is currently being complemented, and may in the future even be replaced, by molecular approaches such as DNA barcoding and metabarcoding of bulk samples or environmental DNA (eDNA) [5,14-17]. The success of these approaches depends strictly on complete and reliable DNA barcode reference libraries. Thus, it is of special interest to identify gaps in existing or developing DNA barcode reference libraries, primarily those pertinent to the EU Water Framework Directive (WFD) and the Marine Strategy Framework Directive (MSFD). A recent global study from this perspective [5] revealed that barcoding coverage varies strongly among taxonomic groups and among geographic regions, pointed to many missing species and unreliable data (e.g., errors in species identification, discordance among taxonomists) that are relevant to monitoring, and highlighted the need to improve quality assurance of the barcode reference libraries.
Following the global analysis by Weigand et al. [5], we aim here to investigate potential gaps among the already DNA-barcoded organisms (based on publicly available data in the BOLD database) listed in the ascidian and cnidarian (Anthozoa and Hydrozoa) reference libraries of the ERMS inventory. We discuss the necessity of quality control (QC) when building and curating a barcode reference library, and provide recommendations for filling the gaps in the barcode library of European aquatic taxa.

Methods
Each BOLD page consists of six sections: (1) a short taxon description with a link to a species-specific website; (2) statistics, including the number of records, specimens with sequences, specimens with barcodes, subspecies, subspecies with barcodes, public records, publicly available subspecies, and public BIN clustering; (3) worldwide specimen repositories; (4) the origin of sequences (GenBank ID numbers or sequencing laboratories); (5) collection sites, including countries; and (6) a species image gallery.
Species checklists of two distant taxa (the Anthozoa and Hydrozoa of the phylum Cnidaria, and the class Ascidiacea of the phylum Chordata) were downloaded from the European Register of Marine Species dataset (ERMS [12]). These taxa were not analyzed in Weigand et al. [5] and are used here for in-depth analyses of the quality control (QC) of the barcoded species in the two lists. Taxonomic conformity and correct spelling were checked using the BOLD toolbox and manually, species by species, on each species' BOLD page, against the World Register of Marine Species database (WoRMS [18]), and assessed following the recommendations of Costello et al. [19]. The finalized species-level checklists were re-ordered and compared with the BOLD list. For the analyses of the barcoded species (COI marker), we used the BOLD checklist created by Dirk Steinke, titled 'Marine Animals Europe' (BOLD checklist code: CL-MARAE; last updated on March 20th, 2017). The full list contains 27,634 records of marine species belonging to 10 phyla: Annelida, Arthropoda (class Decapoda and superorder Peracarida), Brachiopoda, Chordata (class Ascidiacea-subphylum Tunicata and class Pisces), Cnidaria (classes Anthozoa and Hydrozoa), Echinodermata, Mollusca (classes Bivalvia and Gastropoda), Nemertea, Priapulida, and Sipuncula. Datasets were generated for the two checklists (cnidarians and ascidians; updated 18 July 2019) and compared, species by species, with all publicly available sequence information in the BOLD system and with the other data stored on the BOLD pages. Working species by species allows records in BOLD to be discriminated and analyzed with respect to the number of sequences per species, the number of BINs per species, the specimens in public depositories, the number of barcodes publicly stored in BOLD, including those mined from the GenBank database at the National Center for Biotechnology Information (NCBI [20]), and the number of privately stored barcodes per species in BOLD.
Geographical data were not considered for the taxa analyzed. The analyses were performed on all available BOLD records at three levels: the statistics level, the repository level, and the sequence level. At the statistics level, data for each species were individually inspected for the number of BOLD specimens; the records holding validated sequences, i.e., those containing the trace files necessary to qualify the barcode status and to provide quality control for sequences (both forward and reverse trace files and/or final edited sequences); the records with just barcodes; and the numbers of public records and public BINs. At the repository level, we recorded the number of records mined from GenBank and the number of repository facilities where voucher specimens were deposited. At the sequence level, we inspected the publicly available sequences and scored the quality of the sequence trace files, together with the mark (good, medium, or low) assigned by the BOLD system (Table 1, Additional files 1, 2: Tables S2, S3).
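The species-by-species comparison described above can be sketched in a few lines. The following is only an illustration under stated assumptions: the `BoldSummary` fields are hypothetical stand-ins for values read from a BOLD species page, the species names are placeholders, and the counts are invented for the example rather than taken from the study.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Union


@dataclass
class BoldSummary:
    """Hypothetical per-species summary transcribed from a BOLD page."""
    records: int = 0           # specimen records listed on the page
    with_trace_files: int = 0  # sequences backed by trace files
    from_genbank: int = 0      # sequences mined from GenBank
    public_bins: int = 0       # public BINs listed for the species


def coverage(checklist: Iterable[str],
             bold: Dict[str, BoldSummary]) -> Dict[str, Union[int, float]]:
    """Compare a species checklist with per-species BOLD summaries."""
    names = set(checklist)
    if not names:
        raise ValueError("checklist is empty")
    # Species that appear in BOLD with at least one record.
    listed = {n for n in names if n in bold and bold[n].records > 0}
    # Species with at least one sequence validated by trace files.
    validated = {n for n in listed if bold[n].with_trace_files > 0}
    return {
        "species": len(names),
        "listed_in_bold": len(listed),
        "percent_listed": 100 * len(listed) / len(names),
        "percent_validated": 100 * len(validated) / len(names),
    }


# Placeholder checklist and invented counts, for illustration only.
checklist = ["sp. A", "sp. B", "sp. C", "sp. D"]
bold_pages = {
    "sp. A": BoldSummary(records=12, with_trace_files=9, public_bins=1),
    "sp. B": BoldSummary(records=3, from_genbank=3),
}
print(coverage(checklist, bold_pages))
```

Keeping the per-species summary as an explicit record makes it easy to extend the same comparison to the repository and sequence levels by adding fields, rather than rewriting the comparison itself.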

Results
The analysis was performed on the BOLD Systems database (version 4) [3] for the 1602 species extracted from the ERMS taxonomic checklists, comprising 402 ascidian species and 1200 hydrozoan and anthozoan species (Additional file 3: Table S3). Checking against the BOLD database (July 18th, 2019 inventory), we found only 88 (22.9%) of the ascidian species and just 351 (29.2%) of the cnidarian species in the list of BOLD's barcoded species (Table 1).
Ascidians (Table 1, Additional file 3: Table S1). Of the 88 ERMS species referenced in the BOLD database, only 50 species (12.4% of the whole ERMS list of ascidians) had more than five specimens barcoded in BOLD. COI gene sequences were found for 81 species of this list, and just 78 of the species (19.4% of total species) have full descriptions on the BOLD pages, including the number of specimen records, sequences, specimens with barcodes, species names, public records, public species, and public BINs. As for the COI sequences, we assigned three types of records: records with no sequences, records containing sequences downloaded from GenBank (hence with no trace files and without reliable curation), and records containing new BOLD-related sequences (with trace files). Thus, in 77 species the COI gene was sequenced and the records contain trace files, while for 74 species the COI sequences were mined from GenBank. A total of 68 public BINs are assigned to the 88 species, and 32 species contained more than a single BIN (Table 1).
Cnidarians (Table 1, Additional file 1: Table S2). Of the 351 species found in the BOLD database, only 153 species (12.7% of the whole ERMS list of 1200) had more than five specimens barcoded. COI gene sequences were found for 310 species of this list, and just 297 species (6.5% of total species) had full descriptions on the BOLD pages, including the number of specimen records, sequences, specimens with barcodes, species names, public records, public species, and public BINs. As for the COI sequences, we assigned three types of records: records with no sequences, records containing sequences downloaded from GenBank (hence with no trace files and without reliable curation), and records containing new BOLD-related sequences (with trace files). In 278 species, COI was sequenced and the records contained trace files; however, for 205 species, sequences were mined from GenBank. A total of 231 public BINs were found, and 65 species contained more than a single BIN (Table 1).

Types of gaps
We assigned seven common types of gaps (Table 2) in the list of barcoded species: (a) records with no data available, i.e., empty pages on the BOLD website (10 tunicate species and 41 cnidarians); (b) records with partial public data and no COI sequences (10 tunicate species and 41 cnidarians); (c) no publicly available records (17 tunicate species and 73 cnidarians); (d) records with sequences mined from GenBank, many with gaps in the sequences and all without trace files (74 tunicate species and 205 cnidarians); (e) records containing more than a single BIN per species (29 tunicate species and 65 cnidarians); (f) species with no BIN (25 tunicates and 120 cnidarians); and (g) species with sequences dispersed within several BINs (29 tunicate species and 65 cnidarians). Examples of the seven knowledge gap types are detailed in Table 2. In summary, only 52.3% and 58.4% of the seemingly 'barcoded species' on the BOLD website, for cnidarians and tunicates, respectively, have complete BOLD pages, which corresponds to about 11.44% of the tunicate species and 17.07% of the cnidarian species of the ERMS lists.
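As a rough illustration of how such gap types could be flagged automatically, the sketch below maps a per-species summary to the letters (a) to (f). The `SpeciesPage` fields and the decision rules are assumptions made for this example, not the criteria applied manually in the study; type (g) is left out because separating it from (e) would require per-sequence BIN assignments that this simplified record does not carry.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SpeciesPage:
    """Hypothetical summary of one BOLD species page (illustrative fields)."""
    records: int            # all specimen records
    public_records: int     # publicly visible records
    coi_sequences: int      # COI sequences of any origin
    genbank_sequences: int  # COI sequences mined from GenBank
    bins: int               # public BINs assigned to the species


def gap_types(page: SpeciesPage) -> List[str]:
    """Return the gap-type letters that apply under the assumed rules."""
    if page.records == 0:
        return ["a"]  # empty page: nothing further can be checked
    gaps = []
    if page.public_records == 0:
        gaps.append("c")  # records exist but none are public
    elif page.coi_sequences == 0:
        gaps.append("b")  # partial public data, no COI sequence
    if page.genbank_sequences > 0:
        gaps.append("d")  # GenBank-mined sequences lack trace files
    if page.bins > 1:
        gaps.append("e")  # more than a single BIN for the species
    if page.bins == 0:
        gaps.append("f")  # no BIN assigned
    return gaps


# Invented example pages, for illustration only.
print(gap_types(SpeciesPage(0, 0, 0, 0, 0)))
print(gap_types(SpeciesPage(12, 12, 12, 4, 2)))
```

A species can fall into several categories at once, which is why the function returns a list rather than a single label; an empty list marks a page with none of the modeled gaps.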

Discussion
In the current study, we analyzed the availability of barcoding data in two major reference libraries, those of the ascidians and the cnidarians from the ERMS list, which were not covered by a previous gap analysis for the monitoring of aquatic biota in Europe. By working on members of two taxa, the Anthozoa and Hydrozoa of the phylum Cnidaria and the class Ascidiacea of the phylum Chordata, our analyses showed that important reference libraries lack reliable barcodes for these dominant marine macroinvertebrate species. The results further revealed a wide range of difficulties and inconsistencies, including problems of taxonomic congruency of the COI barcode records on the one hand and possible cryptic diversity (sensu Leite et al. [21]) on the other, both of which should be studied further. These issues clearly affect the completeness of the ERMS list, since only 52.3% and 58.4% of the cnidarians and tunicates, respectively, in the short list of seemingly well 'barcoded species' were found to have complete BOLD pages, highlighting that only 11.44% of the tunicate data and 17.07% of the cnidarian data in the ERMS lists are reliable and fully supported (July 2019 status).
The literature further discloses two relevant findings. The first indicates that BIN assignments present a sizeable number of discordances, many of which relate to species misidentifications or synonyms [21], to taxonomic conflicts through the assignment of more than a single species name per BIN [11], or to the inability of the BIN clustering algorithm to discriminate species correctly [22]. The second points towards the weaker quality control of GenBank compared to BOLD, with discrepancies that were already pointed out earlier [2,23] and that are characterized by sequence discordances and misidentifications. The BOLD and GenBank data storage systems are highly intermingled: about 11% of COI barcode records on BOLD are mined from GenBank, while 75% of the COI barcodes on GenBank originated from BOLD [23]. Our results further point to many weaknesses of the GenBank data, which are less informative than BOLD data, do not present extended data elements such as trace electropherograms, specimen images, voucher numbers, or BIN assignments, and are usually poorly curated (see also Bridge et al. [24]).
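The two kinds of BIN discordance mentioned here, several species names sharing one BIN and one species spread over several BINs, can be detected with a simple grouping step. This is a minimal sketch assuming the input is a list of (species name, BIN identifier) pairs, one per barcoded specimen; the function name, the placeholder species names, and the placeholder BIN identifiers are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def bin_discordance(
    records: Iterable[Tuple[str, str]]
) -> Tuple[Dict[str, Set[str]], Dict[str, Set[str]]]:
    """Group (species, BIN) pairs and report both kinds of discordance.

    Returns (shared, split):
      shared maps a BIN to its species names when it holds more than one;
      split maps a species to its BINs when it occurs in more than one.
    """
    species_by_bin: Dict[str, Set[str]] = defaultdict(set)
    bins_by_species: Dict[str, Set[str]] = defaultdict(set)
    for species, bin_id in records:
        species_by_bin[bin_id].add(species)
        bins_by_species[species].add(bin_id)
    shared = {b: names for b, names in species_by_bin.items() if len(names) > 1}
    split = {sp: bins for sp, bins in bins_by_species.items() if len(bins) > 1}
    return shared, split


# Placeholder specimens and BIN labels, for illustration only.
specimens = [
    ("sp. A", "BIN-001"),
    ("sp. B", "BIN-001"),
    ("sp. C", "BIN-002"),
    ("sp. C", "BIN-003"),
]
shared, split = bin_discordance(specimens)
print(shared)
print(split)
```

Neither flag proves an error on its own: a shared BIN may reflect synonyms or misidentification, and a split species may reflect cryptic diversity, so flagged cases still call for taxonomic review.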
Human-made artifacts introduced during the barcoding process, in particular in the GenBank-stored data, affect the reliability with which DNA barcoding assigns a given specimen to a species. The overall knowledge gaps in DNA barcodes recorded in the present study are considerable. Clearly, the two types of gaps, first the records not found in BOLD and second the errors and missing data in records already deposited in BOLD, may dramatically affect the accuracy of any DNA-based assessment or biomonitoring approach relying on BOLD's datasets. Other common errors include inaccurate taxonomic identification of specimens by non-expert taxonomists (amplified by the lack of voucher specimens), sequence contamination, incomplete reference data, and insufficient quality of the uploaded molecular data [5,22]. Working on reference libraries of DNA barcodes of marine organisms (invertebrate and fish taxa), Weigand et al. [5] recorded numerous identification errors, sequence contamination, incomplete reference data (missing trace files or primer information), and inadequate data management. The results of the present study, supported by other recent studies [2,5,17,21,22,25], reveal that we are still far from holding representative and reliable reference libraries for important taxonomic groups such as those analyzed here. At the same time, new DNA barcode data are continuously being made available, both for already barcoded species and for additional species from the reference libraries, along with additional auditing and annotation processes, all of which help to close the knowledge gaps and purge accumulated erroneous data.
Our assessment of the completeness of the two selected taxa from the ERMS library (i.e., Ascidiacea and Cnidaria) shows that the available COI barcode data may adequately cover just a small fraction of this reference library, raising an alarm about a similar status in other reference libraries, including those used with indices such as AZTI's Marine Biotic Index (AMBI [26]). Global change in marine ecosystems and marine biodiversity loss [27,28] pose great challenges to emerging marine and water strategy directives (such as the WFD and MSFD), as well as to the development of marine policy and management approaches [29-31]. In view of these challenges, reliable barcode reference libraries are of particular importance [5], which underlines the necessity of quality control (QC) when building and curating a barcode reference library. Our analysis further shows the necessity of filling the gaps in the barcode libraries used. It is evident that the level of taxonomic detection and the degree of accuracy achieved by newly developed molecular tools are directly contingent on the completeness of the reference libraries and the reliability of the DNA records [21,24,32,33]. Given our increased reliance on molecular taxonomy as a robust tool [4], strengthening the existing reference libraries is necessary for a wide range of scientific and applied purposes, such as monitoring, eDNA, and metabarcoding approaches, all of which depend on matched identifications for assessments of biodiversity and abundance [5,14-17,33].

Conclusions
DNA-barcoding-based approaches offer advantages such as reduced ambiguity and improved accuracy of species identification, with ultimate verification of results against repository documentation [14,25]. Quality management elements (such as quality assurance and quality control) should be employed when using the list for monitoring and other purposes, and for closing the knowledge gaps. Purging errors from BOLD, the most reliable DNA-barcoding reference library, will contribute significantly to future biodiversity monitoring efforts, to eDNA and metabarcoding approaches and their assessments, and to various regulatory compliance and forensic purposes, among others [1,2,32]. Given the increasing use of high-throughput sequencing approaches and automated pipelines, the data quality of DNA barcodes should be given higher priority.