Criteria for Reporting and Evaluating ecotoxicity Data (CRED): comparison and perception of the Klimisch and CRED methods for evaluating reliability and relevance of ecotoxicity studies

Kase, Robert; Korkaric, Muris; Werner, Inge; Ågerstrand, Marlene

doi:10.1186/s12302-016-0073-x

Research
Open access
Published: 29 February 2016

Criteria for Reporting and Evaluating ecotoxicity Data (CRED): comparison and perception of the Klimisch and CRED methods for evaluating reliability and relevance of ecotoxicity studies

Robert Kase¹,
Muris Korkaric²,
Inge Werner¹ &
…
Marlene Ågerstrand³

Environmental Sciences Europe volume 28, Article number: 7 (2016) Cite this article

66k Accesses
45 Citations
3 Altmetric
Metrics details

An Erratum to this article was published on 11 April 2016

Abstract

Background

The regulatory evaluation of ecotoxicity studies for environmental risk and/or hazard assessment of chemicals is often performed using the method established by Klimisch and colleagues in 1997. The method was, at that time, an important step toward improved evaluation of study reliability, but lately it has been criticized for lack of detail and guidance, and for not ensuring sufficient consistency among risk assessors.

Results

A new evaluation method was thus developed: Criteria for Reporting and Evaluating ecotoxicity Data (CRED). The CRED evaluation method aims at strengthening consistency and transparency of hazard and risk assessment of chemicals by providing criteria and guidance for reliability and relevance evaluation of aquatic ecotoxicity studies. A two-phased ring test was conducted to compare and characterize the differences between the CRED and Klimisch evaluation methods. A total of 75 risk assessors from 12 countries participated. Results show that the CRED evaluation method provides a more detailed and transparent evaluation of reliability and relevance than the Klimisch method. Ring test participants perceived it to be less dependent on expert judgement, more accurate and consistent, and practical regarding the use of criteria and time needed for performing an evaluation.

Conclusions

We conclude that the CRED evaluation method is a suitable replacement for the Klimisch method, and that its use may contribute to an improved harmonization of hazard and risk assessments of chemicals across different regulatory frameworks.

Background

The availability of reliable and relevant ecotoxicity data is a prerequisite for hazard and risk assessment of chemicals in regulatory frameworks, e.g., marketing authorization of plant production products [1], biocides [2], and human pharmaceuticals [3], and within the Water Framework Directive (WFD) [4] and REACH legislations [5]. In most of these frameworks, ecotoxicity studies used for risk assessment must undergo an evaluation of reliability and relevance, as described in detail in the REACH guidance. REACH defines reliability as “the inherent quality of a test report or publication relating to preferably standardized methodology and the way the experimental procedure and results are described to give evidence of the clarity and plausibility of the findings.” Relevance is defined as “the extent to which data and tests are appropriate for a particular hazard identification or risk characterisation” [5].

Similarly, the United States Environmental Protection Agency (US EPA) has published guidelines for screening, reviewing, and using published open literature toxicity data in ecological risk assessments [6]; however, these lack detailed guidance regarding relevance evaluation. Available European guidance documents, e.g., the REACH guidance document, the European Commission’s Technical Guidance Document (TGD) [7], and WFD [8], do not provide detailed information on how to evaluate reliability and relevance of a study. This lack of guidance causes study evaluations to depend strongly on expert judgement, which in turn can result in discrepancies among assessments and disagreements among risk assessors on whether or not a study can be used for regulatory purposes. Evaluation criteria and minimum reporting requirements for scientific test results should ensure that regulatory decisions are taken on a thorough and verifiable basis [9]. In this context, a need for applying robust and science-based principles in ecotoxicology has been stipulated [10, 11].

Currently, the evaluation method proposed by Klimisch et al. in 1997 [12] forms the backbone of many regulatory procedures where reliability of ecotoxicity studies needs to be determined. The Klimisch method provides a system where ecotoxicity studies categorized as either “reliable without restrictions” or “reliable with restrictions” are generally considered adequate for use in environmental hazard and risk assessments, while those categorized as “not reliable” are not accepted for regulatory use. Studies categorized as “not assignable” lack the detailed information needed to evaluate reliability. Depending on the framework and the reasons for lower reliability, studies classified as “not reliable” and “not assignable” may be used as supporting information. Even though the Klimisch method is widely used, some shortcomings have become evident and the method has been increasingly subject to criticism. Recent studies have demonstrated that the Klimisch method does not guarantee consistent evaluation results among different risk assessors [13, 14]. It provides only limited criteria for reliability evaluation and no specific guidance for relevance evaluation [15, 16]. Insufficient guidance increases the inconsistency of evaluation results. As a result of this, an ecotoxicity study may be categorized as “reliable with restrictions” by one risk assessor and as “not reliable” by another, leading to disagreement regarding the inclusion of the study in a data set used for hazard or risk assessment. Such inconsistent evaluation results can directly influence the outcome of a hazard or risk assessment for a specific chemical, which in turn may result e.g., in unnecessary risk mitigation measures, or underestimated risks for the environment. It is therefore essential that all available studies are evaluated using a science-based method that promotes consistency and transparency.

The Klimisch method favors studies performed according to Good Laboratory Practice (GLP) [13, 16, 17] and validated ecotoxicity protocols, such as those provided by the Organisation for Economic Co-operation and Development (OECD) and the US EPA. This can lead to a situation where risk assessors automatically categorize such studies as reliable without restrictions, even if obvious flaws (e.g., control mortality above accepted level, selection of non-relevant endpoints) exist. The preference for GLP studies has led to marketing authorization dossiers which rely almost exclusively on contract laboratory data provided by the registrants, while excluding peer-reviewed studies from the scientific literature. This has been openly criticized [18], and several regulatory frameworks [1, 5, 6, 8] have recently recommended to take all available information into account. This is important since hazard and risk assessments and the derivation of environmental quality criteria (EQC) or standards (EQS) for individual chemicals often suffer from limited data availability. In addition, the inclusion of more peer-reviewed studies offers the potential to save resources, both from an economic and ethical (number of animals used) point of view.

Multiple evaluation methods are available to risk assessors [6, 21–27]; however, most focus exclusively on the evaluation of reliability with the exception of those provided by Ågerstrand et al. [15] and Beronius et al. [24], while the equally important relevance aspects of a study are not considered. A comparison of four methods for reliability evaluation of ecotoxicity studies showed that choice of method affected the outcome [13]. In addition, a review of methods used to evaluate toxicity studies concluded that only five of 30 methods investigated had been rigorously tested by risk assessors in order to evaluate their applicability [28]. Other attempts to score study reliability were developed by industry and other institutions [23]; however, to our knowledge they—like most other methods—were not implemented in regulatory guidance documents. A method that provides increased transparency and consistency in evaluations is therefore needed to reduce uncertainties regarding the assessment of (key) ecotoxicity studies and provide more harmonized hazard and risk assessments.

The Criteria for Reporting and Evaluating ecotoxicity Data (CRED) project resulted from a 2012 initiative focused on the need for improvement of the Klimisch method [16]. The project aims at strengthening the transparency, efficiency, and robustness of environmental hazard and risk assessments of chemicals, and increasing the utilization of peer-reviewed ecotoxicity studies for substance evaluations through improved reporting. To this end, the CRED evaluation method and the CRED reporting recommendations were developed. The CRED evaluation method provides reliability and relevance evaluation criteria and detailed guidance material [19, 20]. These tools can be used for prospective and retrospective hazard and risk assessment of chemicals, as well as in the peer-review process.

This publication presents results of a ring test aimed at comparing the CRED and Klimisch evaluation methods with regard to study categorization, consistency of evaluations, the risk assessors’ perception of the methods, and practical aspects such as time requirements and the use of proposed evaluation criteria. It was our aim to demonstrate that the increased guidance provided by the CRED evaluation method would reduce inconsistency and increase transparency of study evaluations while being practical in use.

Methods

Development of the CRED evaluation method

The CRED evaluation method is based on OECD ecotoxicity test guidelines (e.g., OECD guidelines 201, 210 and 211), existing evaluation methods (e.g., [15, 25]), as well as practical expertise in evaluating studies for regulatory purposes. The method was presented and discussed at several expert meetings including the Society of Environmental Toxicology and Chemistry (SETAC) Global Environmental Risk Assessment Advisory Group, and the SETAC Global Pharmaceutical Advisory Group. Feedback from these meetings and ring test participants, as well as results of the ring test, was incorporated into the development of the final version of the CRED evaluation method [19]. General characteristics of the Klimisch and CRED evaluation methods are provided in Table 1.

Table 1 General characteristics of the Klimisch and the final CRED evaluation methods

Full size table

Ring test

The ring test was performed in two phases. In phase I (November–December 2012), each participant evaluated reliability and relevance of two out of eight ecotoxicity studies (see below and Table 2) using the Klimisch method. In phase II (March–April 2013), each participant evaluated two different studies from the same set of eight publications according to the CRED evaluation method: It is important to note that the ring test was not performed with the final CRED method [19] but with a draft version of this method (Additional file 1: part A, B). However, differences between these two versions are small (details in Additional file 1: part C Table C1). Studies were assigned based on each participant’s area of expertise, and each study was evaluated by different participants in phases I and II with no overlap within institutes to ensure independent study evaluation.

Table 2 List of ecotoxicity studies evaluated in the ring test

Full size table

Since the relevance of a study can vary depending on the regulatory context, risk assessors were informed prior to the evaluation to assume that the studies should be evaluated for their potential use in derivation of EQC within the Water Framework Directive, where all endpoints with known population relevant effects are accepted [4]. Each participant summarized the reliability evaluation into one of the categories established by Klimisch et al. [1]: R1 = Reliable without restrictions, R2 = Reliable with restrictions, R3 = Not reliable, and R4 = Not assignable. Relevance was summarized using the following categories: C1 = Relevant without restrictions, C2 = Relevant with restrictions, and C3 = Not relevant. It should be noted that Klimisch [12] did not suggest any categories for relevance.

Following the evaluation in phases I and II, risk assessors completed a questionnaire (Additional file 1: part A, B) in which they were asked to report their experience using the methods, and their perception of uncertainty regarding the resulting reliability and relevance assessments, and to list evaluation criteria they found to be missing. Based on a consistency analysis for each criterion following phase II, the wording of the CRED evaluation criteria was optimized if consistency (see consistency analysis below) was below 50 %. In addition, criteria identified as missing in these questionnaires were included in the final CRED evaluation method [19, 20], resulting in 20 (instead of 19) reliability criteria and 13 (instead of 11) relevance criteria (SI Table C1). In the draft method, 14 of 19 reliability criteria and 9 of 11 relevance criteria were designated as critical; however, participants explicitly avoided the use of critical criteria and voiced their preference for expert judgement, because a specific criterion may be critical for one test design but not for another. Moreover, the relevance category C4 (=Not assignable) was added to the final CRED evaluation method.

Ring test participants

Announcements of the ring test were made via email, at the annual meeting of SETAC Europe in 2012 and at several international regulatory meetings. The 75 ring test participants came from 12 countries and 35 organizations. They represented 9 regulatory agencies, 17 consulting companies and advisory groups, and 9 industry and stakeholder organizations. Participants from regulatory agencies were from the following countries: Canada, Denmark, Germany, France, the Netherlands, Sweden, the United Kingdom, and the USA. Regulators from the European Chemicals Agency (ECHA) and other international expert groups, and members of the Multilateral Meeting, where EU risk assessors meet to harmonize national EQC proposals, also joined the exercise. In addition, the following institutions, consulting companies, and advisory groups participated: CEFAS (United Kingdom), CEHTRA (France), CERI (Japan), Deltares (The Netherlands), DHI (Singapore), ECT (Germany), Eurofins AG (Switzerland), GAB Consult (Germany), ITEM (Germany), SETAC Pharmaceutical Advisory Group (PAG), SETAC Global Ecological Risk Assessment Group (ERAAG), Swiss Centre for Applied Ecotoxicology Eawag-EPFL (Switzerland), TSGE (United Kingdom), and wca (United Kingdom). The following industry and stakeholder organizations participated: Astrazeneca (United Kingdom), Bayer (Germany), BASF (Germany), Givaudan International SA (Switzerland), Golder Associates Inc. (United States of America), ECETOC (EU), Harlan (Switzerland), Monsanto Europe (Belgium), and Pfizer (United States of America).

Of a total of 75 participants, 62 participated in phase I (where the Klimisch method was used) and 54 in phase II (where the CRED evaluation method was used), with 76 % of participants taking part in both phases of the ring test. More than 80 % of the participants in phase I had previously used the Klimisch method. The majority of the participants (58 % in phase I and 62 % in phase II) had more than 5 years of experience in performing ecotoxicity study evaluations. The distribution of years of experience among participants in phase I and phase II was similar: 9 and 8 % with 0–1 year, 6 and 10 % with 1–2 years, 27 and 20 % with 2–5 years, 14 and 15 % with 5–10 years, and 44 and 47 % with above 10 years of experience in phase I and II, respectively.

Selection of ecotoxicity studies to be evaluated

Eight studies were evaluated in the ring test. Seven of these were published in peer-reviewed journals, and one was an industry study report from a contract laboratory (Table 2). Studies were selected by the coordinators of the ring test (who did not participate) to cover different taxonomic groups (cyanobacteria, algae, higher plants, crustacean, and fish), test designs (acute and long-term), and chemical substance classes. Five studies were potential key studies for derivation of EQC proposals in national dossiers. Toxicity endpoints evaluated were biomass, hatching success, sex ratio, growth, and immobilization. It should be noted that the eight studies represent a small selection, which cannot be representative of the broad range of ecotoxicity studies that exist. Based on previous evaluations they were expected to be categorized as either “reliable with restrictions” or “not reliable,” the two categories which are often the most difficult to assign, and which distinguish between inclusion and exclusion of studies in a hazard or risk assessment. Two studies reported the same experimental data, one in the form of an industry study report (study E) and the other as a peer-reviewed publication (study F).

Reliability and relevance evaluations

A total of 121 study evaluations were performed by 62 risk assessors in phase I (Klimisch evaluation method), and 104 study evaluations by 54 participants in phase II (CRED evaluation method). Five participants (three in phase I and two in phase II) submitted only one of the two requested questionnaires.

Each study was evaluated by 9–20 risk assessors (Additional file 1: part D, Tables D1, D2), with a mean of 15 per study in phase I and 13 in phase II for both reliability and relevance evaluations. The number of evaluations per study differed because some participants did not submit their completed questionnaires. Reliability and relevance categories assigned to each study were tested for significant differences between evaluation methods using the exact (permutation) version of the Chi-square test in R [37]. False discovery rate was checked according to Benjamini et al. [38]. This is a standard statistical test for categorical data comparison, which evaluates the difference between the numbers of observations (evaluations) in groups of data points (categories).

Arithmetic means and standard deviations of conclusive categories assigned to each study were calculated in order to quantify shifts in evaluation results between methods (Additional file 1: part D, Table D3). Conclusive categories are assigned when a conclusion about the reliability and relevance of a study can be drawn (for reliability: R1, R2, R3; for relevance: C1, C2, C3). Conclusive categories were weighted equally. Non-conclusive categories (R4 and C4) contain those studies where a conclusive category could not be assigned due to a lack of information.

Consistency analysis

To determine whether one of the tested evaluation methods would produce more consistent results than the other method, the percentage of evaluations in each conclusive reliability (R1–R3) or relevance (C1–C3) category was calculated for each study. The difference between the highest percentage and the average of the two lower percentages was used as a measure of consistency (Additional file 1: part D, Figure D1). For example, 100 % consistency is reached when all participants selected the same evaluation category for a given study. In contrast, if all evaluation categories were selected in equal proportions the consistency is zero. This method was chosen because it neither requires a large number of evaluations per study, nor an equal number of evaluations per evaluation method.

Practicality analysis

Two main aspects were investigated to compare the practicality of the two evaluation methods: (i) the time required to evaluate a study and (ii) the relationship between the number of fulfilled CRED criteria and assignment of reliability or relevance categories as an indication of the usefulness of the selected criteria for assigning categories. The time required to complete an evaluation using either the Klimisch or CRED evaluation method was reported by participants and compared. Time slots (<20, 20–40, 40–60, 60–180, or >180 min) were chosen to cover a broad temporal scale. Time requirements below 60 min were considered to be indicative of an efficient evaluation system, and this period was therefore examined more closely, hence the choice of smaller time intervals. Results are reported as the percentage of participants per time slot. To determine the relationship between the number of fulfilled CRED criteria and assigned reliability or relevance categories, the number of fulfilled criteria was recorded for each study evaluation. Results are presented by category as arithmetic mean of the percentage of criteria fulfilled, the range, and standard deviation (SD).

Perception of the evaluation methods

To determine the perception of the two evaluation methods by ring test participants, questionnaires that were completed by risk assessors participating in both phases of the ring test were analyzed. Participants rated their level of agreement with five different statements regarding the accuracy, applicability, consistency, and dependence on expert judgement of each evaluation method. Two additional statements were added during phase II to evaluate the participants’ perception with regard to the transparency of the CRED evaluation method in comparison to the Klimisch method, and to the usefulness of the guidance material provided with the CRED evaluation method (Additional file 1: part D, Table D4). Results were analyzed using the non-parametric pair rank-sum test according to Wilcoxon [39].

Participants were also asked to state the confidence in their evaluation results, ranging from “very confident” to “not confident,” for both evaluation methods using five response alternatives. Data were analyzed using the exact (permutation) version of the Chi-square test in R [40], and checked for false discovery rate according to Benjamini et al. [41]. The percentage of responses was calculated for the Klimisch (n = 121) and CRED evaluation methods (n = 103).

Results

Reliability evaluation

When using the Klimisch method, 8 % of all evaluations categorized the selected studies as “reliable without restrictions,” 45 % as “reliable with restrictions,” 42 % as “not reliable,” and 6 % as “not assignable.” When using the CRED evaluation method, 2 % of all evaluations categorized the studies as “reliable without restrictions,” 24 % as “reliable with restrictions,” 54 % as “not reliable,” and 20 % as “not assignable” (Fig. 1a; Additional file 1: part D, Table D1).

For five of eight studies, reliability categories assigned based on the CRED evaluation method did not differ significantly from those obtained using the Klimisch method (Fig. 1b; Additional file 1: part D Table D1), while significant differences were found for three studies (D, E, G). Two of these (D, E) were categorized as less reliable when using the CRED evaluation method, with the largest difference found for the industry study report (study E). In general, the use of the Klimisch evaluation method resulted in stronger discrepancies among categorizations into the first two (“reliable without restrictions “ and “reliable with restrictions”) versus the last two reliability categories (“not reliable,” “not assignable”) (Fig. 1; Additional file 1: part D, Table D1). For example, studies A, B, and G were evaluated by approximately half of the ring test participants as R1 or R2 (57, 40, and 45 %, respectively), and half as R3 or R4 (43, 60, and 55 %, respectively). This means that the categorization of these studies as regulatory usable or not usable would likely be intensively discussed among experienced risk assessors. In comparison, the use of the CRED evaluation method resulted in studies A, B, and G being assigned to R1 or R2 by only 20 % (studies A and B) and 30 % (study G) of participants, while 80 % (studies A and B) and 70 % (study G) of participants chose categories R3 or R4. A much higher percentage (40 %) of participants chose category “not assignable” (R4) for study G. Similar patterns were observed for studies D and F.

In order to illustrate methodological differences between the two evaluation methods a more detailed analysis of the evaluations for studies D, E, and G, which showed significant differences between evaluation methods, is provided below.

Study D [32]

This study reports on the toxicity of three insecticides using a standard algal growth inhibition test with Desmodesmus subspicatus, according to ISO 8692. The risk assessors were asked to evaluate the reliability of a 72-h no observed effect concentration (NOEC) for growth of the insecticide deltamethrin. Using the Klimisch method, 11 of 17 (65 %) ring test participants categorized this study as “reliable with restrictions” and 6 (35 %) as “not reliable.” This was in contrast to the results obtained using the CRED evaluation method which resulted in only 1 of 10 participants (10 %) categorizing the study as “reliable with restrictions,” 6 (60 %) to be “not reliable,” and 3 (30 %) as “not assignable.” The arithmetic means of conclusive categories (R1, R2, R3, not R4) assigned were 2.4 using the Klimisch method and 2.9 using CRED evaluation method (Additional file 1: part D, Table D3).

Independent of the evaluation method used, it was frequently observed by the participants that this study was missing analytical data on exposure concentrations. Of the participants using the Klimisch method, only one participant noted that the exposure concentrations for the test substance (deltamethrin) exceeded its maximum solubility in water; however, still categorized it as “reliable with restrictions.” When the CRED evaluation method was used, nearly all participants who categorized this study as “not reliable” (60 %) listed water exposure concentration above the test substance’s water solubility as one of the reasons. It is surprising that this issue was not detected by participants using the Klimisch evaluation method, since solubility of the test substance is mentioned in both evaluation methods as a factor which may influence test results. We conclude that the more systematic way of performing the evaluations, using a list of criteria (CRED) rather than criteria in text format (Klimisch), may have been sufficient to prompt a more thorough study review regarding this criterion. Thus, important flaws of a study might be more easily discovered if the evaluation is guided by the CRED evaluation method. The additional guidance material provided by the CRED evaluation method [19, 20] might also add to the risk assessor’s confidence when categorizing studies.

Study E [33]

This study is an industry study in the form of a GLP report providing fish toxicity data for Danio rerio exposed to estrone, a steroidal hormone and metabolite of estradiol. Ring test participants were asked to evaluate the reliability of a 40-day NOEC for sex ratio. Four of nine ring test participants (44 %) using the Klimisch method categorized this study as “reliable without restrictions” and 11 (56 %) as “reliable with restrictions.” With the CRED evaluation method, 3 of 19 participants (16 %) categorized this study as “reliable without restrictions,” 4 (21 %) as “reliable with restrictions,” and the remaining 12 (63 %) as “not reliable.” Independent of the method used, study E was never categorized as “not assignable.” The arithmetic means of conclusive categories (R1–R3) assigned were 1.6 when using the Klimisch method and 2.5 when using the CRED evaluation method (Additional file 1: part D, Table D3).

Using the Klimisch method, some risk assessors remarked that information on test substance purity and solubility as well as raw data in general was missing, yet none of them categorized it as “not reliable” or “not assignable.” In contrast, participants using the CRED evaluation method discovered flaws in the study design related to dosing and potential loss of the test substance. In addition, it was frequently noted that replication and control data provided were insufficient, e.g., due to missing solvent control data. Another issue raised with study E was the uneven number of fish used per treatment group. As for study D, these results suggest that the CRED evaluation method helped risk assessors to detect flaws in study design and reporting.

Study G [35]

This study reports fish toxicity data for Oncorhynchus mykiss with nonylphenol as a test substance. Participants were asked to evaluate the reliability of a 60-day NOEC for hatching success. This study was categorized as either “reliable with restrictions” by 9 of 20 participants (45 %) or “not reliable” by 11 participants (55 %) using the Klimisch method. Using the CRED evaluation method, it was categorized by 3 of 10 participants (30 %) as “reliable with restrictions,” by 3 (30 %) as “not reliable,” and by 4 (40 %) as “not assignable.” The arithmetic means of conclusive categories (R1–R3) assigned were 2.6 when using the Klimisch method and 2.5 when using the CRED evaluation method (Additional file 1: part D, Table D3).

The main flaw of this study was the use of the solvent dimethylsulfoxide (0.15 %) above the OECD-recommended concentration in test controls and treatments, and the relatively high concentration of 4 % formaldehyde as a disinfectant for fish eggs. Ring test participants using the CRED evaluation method reported additionally that information on the test method was missing, for example, exposure concentrations in the flow through system, purity of the tested substance, and details on feeding of organisms. In this case, the CRED evaluation method appeared to raise awareness regarding the distinction between conclusive (R1–R3) and non-conclusive (R4) categories, the latter referring to the absence of information rather than the inherent quality of the study itself.

Relevance evaluation

Overall, the ring test showed that both evaluation methods provide similar results regarding the relevance evaluation of a study, even though a differentiation between “relevant without restrictions” and “relevant with restrictions” is not foreseen in the Klimisch system. Differences occured when deciding whether a study is to be categorized as either “relevant without restrictions” or “relevant with restrictions.” Both the CRED and Klimisch evaluation methods resulted in a high percentage of studies categorized as “relevant without restrictions” (C1) and “relevant with restrictions” (C2) (Fig. 2a; Additional file 1: part D Table D2). Only 7 % of assessors using the Klimisch method and 8 % of assessors using the CRED evaluation method categorized the studies as “not relevant” (C3) for the intended purpose. The proportion of “relevant without restrictions” was 32 % when using the Klimisch and 57 % when using the CRED evaluation method.

For one of the eight studies (study H), results of the relevance evaluations differed significantly between the Klimisch and CRED evaluation methods (Fig. 2b; Additional file 1: part D Table D2) with a higher percentage of studies categorized as “relevant without restrictions” when using the CRED evaluation method. The arithmetic mean of conclusive categories (C1–C3) assigned to this study was 1.9 using the Klimisch method and 1.6 using the CRED evaluation method (Additional file 1: part D Table D3). In line with this evaluation result, 16 of 20 (80 %) ring test participants indicated problems using the Klimisch evaluation method, especially with regard to uncertainties on how to evaluate relevance and the lack of relevance criteria. The fact that study H was not a guideline study was stated more often when the Klimisch evaluation method was used, and more deficiencies of the study were listed. In contrast, 4 of 9 (44 %) participants who used the CRED evaluation method mentioned neither uncertainties nor the lack of criteria as limiting factors. Thus, the use of CRED relevance criteria appeared to result in a lower number of questions regarding the relevance of this study.