
Clustering and prioritization to design a risk-based monitoring program in groundwater sources for drinking water



The number of chemical parameters included in the monitoring programs of water utilities has increased over the last decade. In accordance with the European Drinking Water Directive, utilities aim at a tailored risk-based monitoring (RBM) program. Here, such an RBM program was developed for the largest Dutch water utility, which mostly uses groundwater as a source. Data from target analyses and high-resolution mass spectrometry-based suspect screening were used to cluster the different source waters. Targets were prioritized based on (preliminary) drinking water guideline values or the threshold of toxicological concern. Suspects were prioritized for further identity confirmation based on semi-quantitative occurrence concentrations combined with in vitro toxicity information. Finally, an RBM program was suggested for each cluster of source waters.


Out of 731 target chemicals, 153 were detected at least once over a 5-year period. Roughly 10% of the detected non-target screening features matched suspects. In total, 108 source waters were grouped into 7 clusters. Source waters with low numbers and concentrations of organic chemicals were located in areas with all land-use types, while clusters of source waters with higher numbers of chemicals were related to infiltrated surface water. For perfluorinated chemicals, 25 suspects matched features detected in source waters and 7 matched features detected in drinking water. For the target chemicals, simple treatment showed the lowest removal efficiencies and sorption-based techniques relatively high removal efficiencies. The chemical composition of all drinking waters resembled that of non-contaminated source waters. (Preliminary) guideline values were available for 45 of the retrieved target chemicals and were used to prioritize monitoring frequencies. These chemicals individually posed no appreciable concern to human health. Suspects were prioritized for further identity confirmation based on semi-quantitative occurrence in produced water, detection frequencies and information on toxic potency. Once confirmed and assessed as relevant, the suspects could be added to target monitoring.


This approach provided a feasible workflow for RBM of target chemicals for clusters of groundwater sources, connected to a feed of new relevant chemicals based on suspect screening.


Worldwide, drinking water regulations prescribe drinking water quality standards for a selection of chemicals. The EU Drinking Water Directive (EU DWD) for example lists standards for 26 chemical parameters. Most drinking water utilities monitor a broad set of parent chemicals and their transformation products, using target, non-target [1] and bioanalytical methods [2]. The EU DWD stipulates that drinking water monitoring is performed in a more flexible way, provided that protection of public health is ensured. The aim is to reduce obsolete analyses and concentrate on relevant issues, following the principle of ‘hazard analysis and critical control point’ (HACCP) [3] and the water safety plan approach as developed by WHO (World Health Organization) [4].

Compared to surface water, groundwater is less intensively studied and monitored [5,6,7]. Groundwater can, however, be highly influenced by anthropogenic activities related to the land-use [8], by infiltrating surface water [9], by historical contamination [10] or by activities in the subsoil [11]. The susceptibility of the groundwater aquifers to these pressures depends on soil characteristics and groundwater hydrology [12, 13]. Chemical properties, such as persistence and mobility, are reflected in spatio-temporal patterns of chemical occurrence in groundwater after emissions. The chemical properties also influence removal efficiencies during drinking water production, depending on the water treatment techniques applied [14].

Water utility Vitens supplies drinking water to a large area in the Netherlands, using groundwater as its major source. The set of organic chemical parameters in its monitoring program tripled over the last decade. In accordance with the EU DWD, the water utility aims to prioritize measured chemicals and to develop a tailored risk-based monitoring program. In the literature, several prioritization methods for chemicals of emerging concern (CEC) have been developed [15] that make use of target monitoring data, non-target and suspect screening data [1, 16,17,18], exposure models [19, 20] or chemo-informatics [21].

The aim here is to develop a risk-based monitoring program for the drinking water sources in the service area of the water utility. We use available target and suspect monitoring data and characteristics of the supply zones. We use clustering techniques to cluster the supply zones based on target and suspect data. We prioritize targets based on (preliminary) drinking water guideline values ((p)GLVs) or the threshold of toxicological concern (TTC). Based on this information, we suggest a risk-based monitoring program for each cluster of supply zones. We prioritize the suspects for further identity confirmation based on semi-quantitative concentrations combined with in vitro toxicity information.

Material and methods

Typology drinking water supply zones

The data used originate from 141 source waters, i.e., mixed water from one or multiple pumping wells prior to drinking water treatment, in the central, eastern and northern parts of the Netherlands. Two drinking water supply zones are mainly fed by river bank filtrate; the other supply zones use groundwater as a source. Per source water, the percentage of infiltrated surface water is given, expressed in four classes, i.e., (i) 5–10%, (ii) 10–20%, (iii) 20–50% and (iv) 50–70%. The supply zones are classified following the ABIKOU typology (Stuyfzand [13]), in which A corresponds to phreatic groundwater in sandy soil, B to (semi-)confined groundwater, I to artificially infiltrated surface water and U to riverbank infiltrated surface water. The land-use in the 25-year infiltration zone is defined as the percentages of urban, agricultural and nature area in the total recharge area.

The water is treated at 96 production stations. Treatment consists mostly of commonly used drinking water treatment techniques such as flocculation, sand filtration, aeration, water softening and pH adjustment, and occasionally also includes reverse osmosis (RO) and active carbon filtration.

Analytical chemistry

We use monitoring data generated by the Vitens drinking water laboratory. This laboratory works according to strictly defined QA/QC criteria, takes part in round robin tests, follows standard procedures when available (EN ISO/IEC 17025:2005, NEN 6265, NEN-EN-ISO 19458, NEN 6414, NEN 6421, NEN-EN 872, NEN-ISO 7888, NEN-EN-ISO 581, NEN-EN ISO 10304–1, NEN-EN-ISO 9562, NEN-EN-1484), and is officially accredited by the Dutch Board for Accreditation. Vitens routinely monitors both the source and produced waters for 731 target chemicals using several methods (see Additional file 2: Targets, where limits of detection per target chemical are also given). The current monitoring frequency in source water is at least once per year. The frequency depends on both the estimated susceptibility of the supply zone and whether the parameter is explicitly mentioned in current legislation. Here, we use routine target monitoring data produced in the period 2010 to 2016. The dataset consists of 553,440 entries for source water, including 8954 entries above reporting limits, and 760,339 entries for drinking water, including 5352 entries above reporting limits. For each parameter, the frequency of detection and the variability (averages and 90th percentiles) of the detected concentrations over 2010–2016 are derived, averaged over all samples from source waters and averaged per cluster of source waters and drinking water.

In addition, in 2016 the source and produced waters of all supply zones are monitored once using high-resolution mass spectrometry coupled to high-pressure liquid chromatography. Vitens is equipped with an AB Sciex Q-TOF (API Triple TOF 5600+). Samples were directly injected without preconcentration; separate injections were made for positive and negative electrospray ionization (ESI) mode. To each 10 mL sample, 20 µL EDTA, 10 µL formic acid and 80 µL internal standard were added. A reversed-phase Waters Xselect-T3 column was used with a gradient of acidified water and acetonitrile. Parent compounds were measured in the 100–1300 Da range, while fragmentation was measured in “high resolution”, i.e., 50–1300 Da. For both positive and negative ionization, multiple standards with masses between 143 and 394 were used, i.e., dimetridazole-d3, fenuron, desethylatrazine-13C3, atrazine-d5, carbamazepine-d10, sulfamethoxazole-d4, atenolol-d7, fluoxetine-d6, ciprofloxacin-d8, tamoxifen-d5 and diflufenican-d3 for the positive mode and 2-nitrophenol-d4, acesulfame-d4, 2,4-dinitrophenol-d3, mecoprop-d6, bentazon-d6, bromacil-d3, neburon, hydrochlorothiazide-d2, sulfadimethoxin-d6, bezafibrate-d6 and diflufenican-d3 in the negative mode. All measurements are performed in duplicate. Based on drinking water spiked with 381 known chemicals, limits of detection are determined for the vast majority of compounds and range from <10 to 50 ng/L. All results are expressed in terms of internal standard equivalents (I.S.-eqs.); for both positive and negative ionization mode, neburon was used to express concentrations semi-quantitatively. The limit of detection of the suspects differs with regard to their ionization potential [18, 22]. The expression of the concentration in terms of I.S.-eqs. is semi-quantitative; a preliminary study showed that the responses of 80% of 53 suspects vary within two orders of magnitude [18].

Suspect screening

In total, the current dataset consists of 41,267 detected entries in source water and 12,123 detected entries in drinking water. A total of 12,294 features (7503 using positive ionization mode and 4791 using negative ionization mode) are matched to suspects from NORMAN SusDat (14,632 entries, www.norman-network, October 2016 version) and Sjerps et al. [18] (5219 entries) suspect lists. The latter consists of industrial chemicals (> 100 ton), pharmaceuticals, veterinary pharmaceuticals, pesticides and biocides which are authorized on the European market. See Dulio et al. [23] and Schymanski and Williams [24] for more background on the activities of the NORMAN network and the importance of open science in the evolution of suspect screening. Specific attention is paid to perfluorinated chemicals, for which the Norman PFAS suspect list (PFASTRIER) was used comprising 691 CAS numbers.

All suspect data are filtered according to their accurate mass; suspects with a mass difference <2 ppm between feature and suspect are further processed. Next, data are filtered on their predicted retention time, based on 173 compounds following methods described by McEachran et al. [25]. A tolerance of <3 min was applied to reduce false-positive hits. In the present study, confidence levels of the retrieved suspects, according to the scheme by Schymanski et al. [26], are not defined and the identity of the suspects is not further confirmed. For each parameter, the frequency of detection and variability (averages and 90th percentiles) of the semi-quantitative concentrations are given.
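The two filters described above can be illustrated with a minimal sketch. It is not the laboratory's actual software: the `Suspect` record, the function names and the example masses and retention times are hypothetical, and the sketch assumes that the mass difference is expressed relative to the theoretical suspect mass.

```python
from dataclasses import dataclass


@dataclass
class Suspect:
    name: str
    mass: float          # theoretical mass of the measured ion (Da)
    predicted_rt: float  # predicted retention time (min)


def ppm_error(observed_mass, theoretical_mass):
    """Relative mass difference in parts per million."""
    return abs(observed_mass - theoretical_mass) / theoretical_mass * 1e6


def match_feature(feature_mass, feature_rt, suspects,
                  max_ppm=2.0, max_rt_diff=3.0):
    """Return the names of all suspects passing both the mass and RT filter."""
    hits = []
    for s in suspects:
        mass_ok = ppm_error(feature_mass, s.mass) < max_ppm
        rt_ok = abs(feature_rt - s.predicted_rt) < max_rt_diff
        if mass_ok and rt_ok:
            hits.append(s.name)
    return hits


# Hypothetical suspect list and one measured feature
suspects = [
    Suspect("suspect_A", 200.0000, 5.0),   # passes both filters
    Suspect("suspect_B", 200.0003, 12.0),  # mass fits, retention time does not
    Suspect("suspect_C", 200.0010, 5.5),   # retention time fits, mass does not
]
hits = match_feature(200.0001, 6.0, suspects)
```

Because a feature keeps every suspect that passes both filters, one feature can return several candidate identities.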


To cluster the source water samples, average concentrations of each target chemical are calculated over a period of 6 years for each sampling location. Average concentrations are based on detected concentrations above the reporting limit (RL); when all measurements at a sampling location over 2010–2016 are below RL, the concentration is expressed as 0.5*RL, based on the lowest RL for the target chemical in the dataset. The following are excluded from the dataset: a) CH4, DOC, TOC; b) chemicals that are not found above RL in any of the source water samples; c) chemicals that are detected in fewer than 100 water samples and d) source water samples for which fewer than 50 chemicals are detected. All chemical concentrations are log-transformed. This results in a subset of 108 source water samples and 152 target chemicals.
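The averaging rule can be sketched as follows. This is an illustration rather than the code used in the study (the analyses were run in R); the function names and example values are hypothetical, a measurement is treated as detected only when it is strictly above its RL, and a base-10 logarithm is assumed because the text does not state the base.

```python
import math


def site_average(measurements, lowest_rl):
    """Average concentration for one chemical at one sampling location.

    measurements: list of (value, reporting_limit) pairs.
    lowest_rl: lowest RL for this chemical in the whole dataset.
    """
    detects = [value for value, rl in measurements if value > rl]
    if detects:
        # Average of the detected concentrations only
        return sum(detects) / len(detects)
    # All measurements below RL: substitute half of the lowest RL
    return 0.5 * lowest_rl


def log_site_average(measurements, lowest_rl):
    """Log-transformed average (base 10 is an assumption)."""
    return math.log10(site_average(measurements, lowest_rl))


# Hypothetical measurements in µg/L: two detects and one non-detect
with_detects = [(0.05, 0.01), (0.005, 0.01), (0.03, 0.01)]
# Hypothetical location where the chemical was never detected
all_below_rl = [(0.005, 0.01), (0.01, 0.02)]

avg_detected = site_average(with_detects, lowest_rl=0.01)
avg_censored = site_average(all_below_rl, lowest_rl=0.01)
```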

The 108 source waters are clustered using k-means clustering. This commonly used unsupervised learning algorithm partitions a number of observations into k clusters based on their similarity. To relate the clusters of source water to information on a large number of chemicals, we reduce the dimensionality of the dataset using principal component analysis (PCA). The chemicals that are detected in only one water sample are excluded. The major axes of variation extracted with PCA are interpreted based on the loading of each chemical. The clusters of source waters are projected on the reduced dimensions of PCA. In addition, the clusters are also projected on a plane of two metrics which represent the overall abundance of target chemicals, i.e., total concentrations and number of all detected chemicals. Finally, the clusters of source waters are compared to surface water influence, the proportion of land-use types (urban, agriculture, nature), and the ABIKOU class. For the sake of presentation, the clusters are numbered based on their median values of total concentration of all detected chemicals.
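For readers unfamiliar with k-means, the following is a minimal standard-library sketch of Lloyd's algorithm, the usual way of fitting k-means. It is not the R code used in the study; the function names, the random initialization and the toy data (two chemicals, six samples) are assumptions made for illustration.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid update."""
    rng = random.Random(seed)
    # Initialize centroids with k randomly chosen observations
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: each sample goes to its nearest centroid
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: each centroid becomes the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical log-transformed concentrations of two chemicals in six samples
samples = [
    [-3.0, -3.1], [-2.9, -3.0], [-3.1, -2.9],  # low concentrations
    [0.0, 0.1], [0.1, 0.0], [-0.1, 0.1],       # high concentrations
]
labels, centroids = kmeans(samples, k=2)
```

The cluster numbers returned by k-means are arbitrary, which is why a separate ordering rule, such as the median total concentration used here, is needed for presentation.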

Similarly, drinking water samples are clustered based on target chemicals. The above-mentioned exclusions result in a subset of 101 drinking water samples and 112 target chemicals. Chemicals that are detected in only one water sample are excluded, leaving 72 chemicals. K-means clusters are related to the treatment class applied to each drinking water (Table 1).

Table 1 Treatment technology classes of the drinking water production locations

Using the PCA loadings of detected target chemicals in source water, the PCA scores of 101 drinking water samples are calculated and plotted on the PCA plane based on source water. The PCA scores of drinking water are derived by multiplying the concentrations of target chemicals in drinking water with the PCA loadings computed from target chemicals in source water. Known pairs of source water and produced drinking water are connected by arrows. In this way, the chemical composition of drinking water can be projected on the same 2D plane as source water, enabling a visualization of change in water quality due to treatment.
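The projection step can be sketched as a dot product between a sample and each loading vector. The function name, the toy loadings and the optional centering on source water means are assumptions; the text only states that concentrations are multiplied by the source water loadings, and centering is included here because PCA scores are conventionally computed on centered data.

```python
def project(sample, loadings, center=None):
    """Project one sample onto PCA axes defined by per-chemical loadings.

    sample: (log-transformed) concentrations, one value per chemical.
    loadings: one list of per-chemical weights for each PCA axis.
    center: per-chemical means of the source water data (assumed optional).
    """
    if center is None:
        center = [0.0] * len(sample)
    centered = [x - m for x, m in zip(sample, center)]
    return [sum(c * w for c, w in zip(centered, axis)) for axis in loadings]


# Hypothetical loadings for three chemicals on two PCA axes
source_loadings = [
    [0.6, 0.8, 0.0],  # axis 1
    [0.0, 0.0, 1.0],  # axis 2
]
source_means = [1.0, 1.0, 1.0]
drinking_water_sample = [1.0, 2.0, 3.0]

scores = project(drinking_water_sample, source_loadings, source_means)
```

Applying the same loadings to both water types is what places source and drinking water on a common pair of axes, so that an arrow between a paired source and drinking water shows the shift caused by treatment.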

Source water samples are also clustered based on suspect chemicals. After the same exclusion procedure as target chemicals, source water samples and 1297 suspect chemicals are used for k-means clustering. Prior to the analysis, suspect chemical concentrations are log-transformed and in order to avoid zero-values before the log-transformation a value of 0.001 μg/L I.S.-eq was added. Since the number of suspects is too large compared to the number of water samples to conduct PCA, we reduce the number of suspects from 1297 to 162 by selecting only those suspects that are detected in more than 5 water samples and with 90th percentiles above 0.01 µg/L I.S.-eq (see Additional file 1: Fig. S1).

All statistical analyses are conducted using R version 3.4.1.

Analysis of treatment efficiencies

Removal efficiencies are derived for all detected target chemicals in source water for locations with a comparable combination of treatment techniques. For each drinking water production location and per target chemical, each individual measurement of the concentration in the (mixed) source water is compared to the corresponding individual measurement of the concentration in the produced drinking water. The calculated removal efficiencies are expressed per group of production locations with similar treatment techniques and over all production locations. Table 1 defines the techniques as used in the various treatment classes discerned, i.e., simple, sorption, size exclusion or a combination, and gives the number of production locations for a specific treatment class. For parameters for which concentrations in drinking water are <RL, RL is assumed as a realistic worst-case approach. Removal efficiencies are calculated as (Csource − Cdrinking water)/Csource.
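As a worked illustration of the formula and the worst-case RL substitution, consider the sketch below. The function name and the example concentrations are hypothetical; a drinking water result below RL is represented by `None` or by a value smaller than RL.

```python
def removal_efficiency(c_source, c_drinking, rl):
    """Fraction removed: (Csource - Cdrinking water) / Csource.

    If the drinking water result is below the reporting limit (None or < rl),
    RL itself is used, which gives a conservative (lower) removal estimate.
    """
    if c_source <= 0:
        raise ValueError("source concentration must be positive")
    if c_drinking is None or c_drinking < rl:
        c_drinking = rl
    return (c_source - c_drinking) / c_source


# Hypothetical paired measurements in µg/L
removed_below_rl = removal_efficiency(0.10, None, rl=0.01)   # at least 90% removed
partly_removed = removal_efficiency(0.10, 0.04, rl=0.01)     # 60% removed
increased = removal_efficiency(0.10, 0.15, rl=0.01)          # negative: increase
```

A negative value indicates that the concentration is higher in drinking water than in the paired source water measurement.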

Prioritization and risk-based monitoring for target chemicals

When (preliminary) drinking water guideline values ((p)GLVs) are available for target chemicals present in source waters or produced drinking water, these are used for further prioritization [27]. Chemicals are prioritized for all supply zones and per cluster by comparing averages and 90th percentiles of the concentration in source water and produced drinking water to the (p)GLVs. The ratio of both is expressed as the benchmark quotient (BQ) [28]. For those target chemicals for which no (p)GLVs are available, the concentrations in produced drinking and source water are compared to the TTC (threshold of toxicological concern) value [27]. The TTC is a pragmatic and generic screening level for preliminary and precautionary risk assessment, which is protective for health effects for the vast majority of chemicals and can be used to prioritize chemicals for a further and more in-depth toxicological risk assessment based on chemical-specific toxicological data.
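The benchmark quotient can be sketched as a simple ratio. The function and constant names are hypothetical; the TTC value of 0.1 µg/L is the one given in the caption of Fig. 4, and the cut-off of 0.001 is the one named in the results for the most frequently monitored chemicals. The full criteria of Table 2 are not reproduced here, so only that single cut-off is shown.

```python
TTC_UG_PER_L = 0.1  # TTC value quoted in the caption of Fig. 4


def benchmark_quotient(concentration, glv=None, ttc=TTC_UG_PER_L):
    """BQ = concentration / (p)GLV, or concentration / TTC when no (p)GLV exists.

    All values must be in the same unit (here µg/L).
    """
    reference = glv if glv is not None else ttc
    return concentration / reference


def exceeds_monitoring_cutoff(bq, cutoff=0.001):
    """True when the BQ is above the cut-off for the most frequent monitoring."""
    return bq > cutoff


# Hypothetical 90th percentile concentrations in µg/L
bq_with_glv = benchmark_quotient(0.05, glv=10.0)  # chemical with a (p)GLV
bq_with_ttc = benchmark_quotient(0.00005)         # chemical without a (p)GLV
```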

We suggest that all target chemicals that are not detected in any source or produced water can be monitored at a lower frequency, in accordance with the monitoring obligations related to the EU Water Framework Directive. Higher frequencies are recommended for all chemicals that are found in produced or source water, according to Table 2. This risk-based monitoring program for target chemicals is defined per cluster of source waters based on the criteria for monitoring frequency.

Table 2 Criteria for frequency of monitoring of target chemicals in

Prioritization for identity confirmation for suspect chemicals

For both target and suspect chemicals, octanol–water partition coefficient (log Kow) and half-life (DT50) values are gathered via EPI Suite [29]. When available, experimental data are preferred over modeled data. DT50 values are predicted according to Biowin 3, which is built on measured biodegradability data of over 200 substances for which molecular fragments are described. Likely biodegradation half-lives are expressed by a score system, i.e., 5 reflects hours, 4 reflects days, 3 reflects weeks, 2 reflects months and 1 reflects years [30]. For further analysis of the suspects in relation to log Kow and DT50, all features that match to more than 5 suspects are neglected for further analyses, to reduce uncertainty.

For prioritization for further identity confirmation of the suspects, in relation to their toxicity, multiple features that match to a similar suspect are reduced to one entry. Minimum and 5th percentile AC50 values, i.e., the concentration at which 50% of the maximum response is achieved per chemical per in vitro bioassay, are gathered from EPA’s ToxCast database [31, 32]. ToxCast chemical codes are linked to CAS numbers of the suspects retrieved. AC50 values are extracted for all in vitro assays in which a specific suspect is tested. For more details, we refer to Brunner et al. [33]. The features are prioritized for further confirmation based on the ratio of average I.S.-eq occurrence in all produced waters divided by the minimum AC50 per feature.

Results and discussion

Clustering of source waters based on targets

Out of 731 measured target chemicals, 153 chemicals were detected at least once.

PCA axes 1 and 2 of target chemicals in source water explain, respectively, 14.8 and 9.4% of the total variance. Axis 1 is associated with negative loading of almost all chemicals and therefore reflects cleanness of water (Additional file 1: Fig. S2a). This axis is highly and negatively correlated with the number of detected target chemicals (Spearman correlation coefficient ρ = −0.64, p < 0.001) and the total concentrations of the target chemicals (ρ = −0.46, p < 0.001). Source water which is influenced by a large amount of surface water scores low on this axis (Additional file 1: Fig. S2a). PCA axis 2 reflects the type of chemicals present in the sample, since most of the pesticides, pharmaceuticals or artificial sweeteners are positively related to this axis while industrial chemicals are negatively related (Additional file 1: Fig. S2b). Accordingly, the scores of samples on this axis are positively correlated with the proportion of agricultural land-use (ρ = 0.44, p < 0.001) and negatively correlated with the proportion of urban land-use (ρ = −0.63, p < 0.001). Further axes also explain a relatively low part of the total variance, i.e., 8.1 and 5.1% of the total variance is explained by PCA 3 and 4, respectively.
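The Spearman coefficients reported above are Pearson correlations computed on ranks. The sketch below shows that calculation with tied values receiving their average rank; it does not compute p-values, and the function names and example numbers are hypothetical rather than taken from the dataset.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = rank
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical PCA axis 1 scores and numbers of detected chemicals per sample
axis1_scores = [2.1, 0.5, -0.3, -1.8]
n_detected = [3, 10, 12, 40]
rho = spearman(axis1_scores, n_detected)
```

In this toy example the ordering is perfectly reversed, so ρ equals −1; the weaker values reported for the real data reflect scatter around the monotonic trend.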

The clustering of the source waters based on target chemicals is depicted in Fig. 1a, b (see Additional file 2: Sources for clustering of the individual source waters and their properties). A k-value of 7 is chosen because the variance explained by the clusters starts to plateau at k-values between 7 and 10. Cluster 7, which comprises the relatively non-vulnerable source waters with low concentrations and a low number of target chemicals, occurs in all land-use types. However, source waters whose recharge area consists solely of nature are all grouped into cluster 7 (Additional file 1: Fig. S3a). Source waters in clusters 3 and 4, in which higher numbers of target chemicals are found, consist of more than 50% infiltrated surface water. Two wells influenced by point source contamination with chlorinated hydrocarbons are clustered separately in cluster 1. See Additional file 1: Fig. S3 for more information on clustering of source waters related to the supply zone typology in terms of land-use and influence of surface water infiltration.
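The criterion used to choose k can be illustrated by computing the share of total variance explained by a given cluster assignment, i.e., one minus the ratio of within-cluster to total sum of squares. The sketch is an assumption about how this quantity is defined, since the text does not give the formula; the function name and toy data are hypothetical.

```python
def explained_variance(points, labels):
    """Share of total variance explained by clusters: 1 - WSS / TSS."""

    def sum_of_squares(rows):
        center = [sum(col) / len(rows) for col in zip(*rows)]
        return sum((x - m) ** 2 for row in rows for x, m in zip(row, center))

    total = sum_of_squares(points)
    within = 0.0
    for lab in set(labels):
        within += sum_of_squares([p for p, l in zip(points, labels) if l == lab])
    return 1.0 - within / total


# Hypothetical one-dimensional data with two obvious groups
points = [[0.0], [2.0], [10.0], [12.0]]
ev_two_clusters = explained_variance(points, [0, 0, 1, 1])
ev_one_cluster = explained_variance(points, [0, 0, 0, 0])
```

Repeating this for increasing k and plotting the result gives the curve whose plateau is used to pick the number of clusters.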

Fig. 1

Clustering of source water target data (a) and suspect data (c), plotted on PCA axis 1 and 2, and plotted according to total concentration and number of detected chemicals per sample for target data in µg/L (b) and suspect data in I.S. eq/L (d)

Clustering of source waters based on suspects

In all 141 individual source waters, 1398 features are retrieved that match 3590 suspects as described. Detected suspects do not show a different pattern in hydrophobicity and toxicity compared to non-detected suspects (Additional file 1: Fig. S4). Features can match up to 36 different suspects; on average, features match 3 different suspects in both the positive and negative ionization mode (Additional file 1: Fig. S5). The majority of the suspects retrieved will therefore be false positives. Using smaller suspect lists will lead to fewer hits and fewer false positives, but potentially also to false negatives. Similar to the clustering based on the detected target chemicals, 7 clusters of source waters were distinguished based on the detected suspects (Fig. 1c, d, Additional file 2: Suspects).

PCA axes 1 and 2 of suspect chemicals in source water explain, respectively, 19.8% and 7.4% of the total variance (Fig. 1c, d). For source waters in cluster 7, again the relatively non-vulnerable source waters, all land-use types are present in their recharge areas; however, recharge areas with a high proportion of agricultural area are less frequently present. A high number of suspect chemicals is found in source waters from clusters 1, 3 and 4, which are influenced by more than 50% infiltrated surface water. See Additional file 1: Fig. S3 for more information on clustering of source waters related to typology in terms of land-use, influence of surface water infiltration and structure of the subsoil.

A comparison of clustering based on target and suspect chemicals (Table 3) shows that approximately half of the source waters (56 source waters out of 108, grouped as cluster 7 for both) can be considered as relatively non-vulnerable to anthropogenic influences in terms of both target chemical composition and non-target chemical composition. Seven of the source waters, i.e., cluster 3 and 4 for the targets and cluster 1, 3, and 4 of the suspects, are similar with relatively high levels of surface water infiltration. There is a large overlap between cluster 6 based on targets and cluster 6 based on suspects, which consists of source waters with a high percentage of agricultural land-use.

Table 3 Clustering of 108 out of 141

However, it is also clear that suspect screening gives complementary information to the target analyses [18], as many other source waters clustered differently based on either target or suspect data. One example is a group of 5 source waters from cluster 7 based on the suspects, consisting of relatively clean waters, that occur in clusters 1 and 2 according to the targets, consisting of relatively contaminated waters. Conversely, 5 source waters from clusters 1 and 2 based on the suspects, consisting of relatively contaminated water, fall in clusters 6 and 7 based on the targets, consisting of relatively clean waters. One explanation for these differences is that chemicals that are not well ionized or that are very volatile cannot easily be detected via liquid chromatography–high-resolution mass spectrometry (LC–HRMS).

Perfluorinated chemicals

For the perfluorinated chemicals, 25 suspects from the NORMAN PFAS suspect list match features in the source waters. Depending on the suspect, these are found in the source water of 1 to 33 different supply zones, while 7 suspects are also retrieved in drinking water, at 1 to 14 different production stations (Additional file 1: Table S1). Only four of these 25 retrieved suspect perfluorinated chemicals are REACH registered. For 17 chemicals the registration status is “pre-registered”. For these chemicals, information on which companies actually produce or use them cannot be retrieved. Furthermore, the actual use is known for only a few chemicals. They are mainly employed as surfactants. A total of 14 of these chemicals could not be found in scientific literature. However, there are two papers dealing with the global emission of several C4-C14 PFCAs (Wang et al. 2014ab). Mean concentrations of perfluoroalkyl substances in WWTP effluent and sludge are reported between 1–800 ng/L and 1–100 ng/g, respectively [34]. This study found at least two chemicals from the list in WWTP effluents in concentrations of 5 ng/L for CAS nr 355-46-4 and of 80 ng/L for CAS nr 335-67-1 [34].

Analysis of treatment efficiencies

When samples of produced drinking water are plotted on the PCA plane derived from target chemicals in source water (Fig. 1a), they coincide with cluster 7 of the non-vulnerable source waters (Fig. 2). Both simple and sorption techniques, combined with mixing of individual source waters, have a positive effect on water quality.

Fig. 2

PCA scores for source water (black) and drinking water samples (red) plotted on the PCA axes as derived in Fig. 1a

The mean removal efficiencies for simple, sorption and size exclusion treatment techniques differ significantly (ANOVA, p < 0.01, Fig. 3a). Large variability in removal efficiency does occur within locations with the same treatment techniques. As expected, drinking water treatment based on only simple treatment techniques shows the lowest removal rates, while sorption-based techniques (granulated activated and powder-activated charcoal) show relatively high removal efficiencies. Techniques for size exclusion include reverse osmosis and nanofiltration; at the production locations of the water utility they generally treat only half of the drinking water volume, which is then mixed with differently treated water. The removal rates presented in Fig. 3a are based on concentrations in mixed drinking water, which explains the relatively low removal efficiencies. Removal efficiencies for target chemicals treated with sorption techniques, i.e., active carbon filtration, show as expected [35] a significant correlation with hydrophobicity (p < 0.01, Fig. 3b); however, the explained variance is low (R² = 0.02) and variability is high. Several target chemicals, at few points in time and few production locations, are introduced or show an increase in concentration during drinking water treatment as a result of transformation processes. This holds for 65 chemicals and for 69 production locations.

Fig. 3

a Distribution of removal efficiencies expressed as fraction retained in produced drinking water as compared to the (mixed) source water per production location, including removal efficiencies based on <RL in drinking water, for target chemicals per treatment type. Boxes extend from the 25th to the 75th percentile and whiskers extend from the 1st to the 99th percentile; size exclusion is applied to only half of the produced drinking water volume. b Relation between hydrophobicity and removal efficiencies for individual target chemicals at production stations where sorptive techniques are included (p < 0.01, R² = 0.02)

Prioritization and risk-based monitoring for target chemicals

For the prioritization of target chemicals, (p)GLVs are used that were derived earlier from available chemical-specific in vivo toxicological data [27]; they are available for 45 of the 153 target chemicals found in source and drinking water. For all these target chemicals, concentrations in drinking water correspond to a benchmark quotient below 0.1 (Fig. 4). These target chemicals therefore individually pose no noteworthy threat to human health, which is in line with earlier conclusions [27, 28, 36,37,38].

Fig. 4

Provisional drinking water guideline values [27] compared to mean and 90th percentile concentrations found in drinking water and source water. Concentrations of chemicals without pGLVs are compared to the TTC value of 0.1 µg/L (in grey). The black line represents a benchmark quotient of 1, while the dotted line represents a benchmark quotient of 0.1

In drinking water, 19 chemicals with a pGLV and 22 chemicals without a pGLV have a BQ>0.001 based on the 90th percentile concentration. According to Table 2, these chemicals are advised to be most frequently monitored in drinking and source water. For source water, 32 chemicals with an available pGLV and 81 chemicals without a pGLV have a BQ>0.001 based on the 90th percentile concentration. Again, these chemicals are advised to be most frequently monitored in drinking and source water, and when possible to derive a pGLV if this is absent. For each cluster of source waters, according to the established criteria for frequency of monitoring of target chemicals in source and drinking water (Table 2), a suggestion for a risk-based monitoring program for the target chemicals is given (Additional file 2: Targets).

Prioritization for identity confirmation for suspect chemicals

As features can match multiple suspects, further effort is needed to confirm identity based on, e.g., isotopic patterns and MS2 fragmentation data [26] and ultimately matching the suspect’s retention time and spectra to a reference standard. In view of the efforts demanded, automation of structural identification based on MS2 data, cross-laboratory exchange of information and open science will be needed to achieve this [24]. Structured, semi-automated workflows are being developed for prioritization and confirmation [1, 39,40,41].

Due to the still laborious identity confirmation of the 3590 suspects retrieved, here we prioritize suspects for which it is warranted to further confirm their identity. The unequivocal confirmation of the identity of the suspects itself is not the aim here.

Of the 3590 retrieved suspects, 1017 have a type of use classification [18], and for 2398 and 2819 of the suspects information is available on, respectively, log Kow and DT50 according to EPI Suite [29]. For 2400 of the retrieved suspects, AC50 data are available in the EPA ToxCast database.

Average concentrations and frequencies of detection in relation to log Kow and DT50 show no clear pattern that more hydrophobic and degradable suspects are better removed (Additional file 1: Fig. S7). Such a pattern would be expected [42], but is probably disturbed by false positives occurring in the dataset.

Data on average I.S.-eq. and AC50 values per feature are given in Fig. 5 for source and produced drinking water. As expected, the number and concentrations of suspects are higher in source water than in produced drinking water. Many suspects retrieved in the source waters are not found in finished drinking water. Only a limited number of suspects are found in finished drinking water but not in the source water; these are potentially transformation products formed during drinking water production [43].

Fig. 5

Average suspect concentration versus in vitro toxicity as based on minimum AC50 for suspects in source (a) and produced (b) water

The features and their possible suspects are prioritized for further confirmation using the average I.S.-eq. concentration in produced water, divided by the minimum AC50 over all possible suspects for that feature, multiplied by the detection frequency (Table 4 and Additional file 2: Suspects). When the suspect behind a feature is to be confirmed, all possible suspects for that feature have to be considered. Expressing concentrations semi-quantitatively in I.S.-eq. introduces large uncertainties of multiple orders of magnitude; this uncertainty is removed when concentrations can be expressed based on reference standards [18]. Once a suspect is confirmed, and after a period of more intensive monitoring to collect a sufficient body of data, it can be added to the target monitoring when appropriate, based on quantitative occurrence information and toxicity data and following the methodology described for prioritization and risk-based monitoring of target chemicals.
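The prioritization ratio described above can be sketched as follows. This is an illustration only, not the authors' implementation: the class and field names are hypothetical, the detection frequency is assumed to be a fraction between 0 and 1, and the score is treated as a relative ranking value, so the I.S.-eq. and AC50 units only need to be consistent across features. All numbers are invented.

```python
from dataclasses import dataclass, field


@dataclass
class Feature:
    feature_id: str
    mean_is_eq_produced: float   # average I.S.-eq. concentration in produced water
    detection_frequency: float   # assumed: fraction of samples with a detection (0..1)
    suspect_ac50s: dict = field(default_factory=dict)  # candidate suspect -> AC50


def priority_score(feature):
    """(average I.S.-eq. / minimum AC50 over all candidate suspects) * detection frequency."""
    min_ac50 = min(feature.suspect_ac50s.values())
    return feature.mean_is_eq_produced / min_ac50 * feature.detection_frequency


def rank_features(features, top_n=20):
    """Rank features with at least one AC50 value; highest score first."""
    scorable = [f for f in features if f.suspect_ac50s]
    return sorted(scorable, key=priority_score, reverse=True)[:top_n]


# Hypothetical features; F4 has no AC50 data and cannot be scored.
features = [
    Feature("F1", 0.5, 0.8, {"suspect_a": 10.0, "suspect_b": 2.0}),
    Feature("F2", 0.1, 1.0, {"suspect_c": 0.1}),
    Feature("F3", 2.0, 0.1, {"suspect_d": 50.0}),
    Feature("F4", 5.0, 1.0, {}),
]

ranked = rank_features(features)
for f in ranked:
    print(f.feature_id, round(priority_score(f), 4))
```

Taking the minimum AC50 over all candidate suspects is the conservative choice: a feature is ranked by its most potent possible identity until confirmation shows otherwise. Features without any AC50 value drop out of this ranking and would need a separate rule.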

Table 4 Top 20 prioritized suspects for further confirmation of identity


We presented a feasible workflow for designing risk-based monitoring for drinking water utilities and demonstrated it in practice. The monitoring program is specified for target chemicals per cluster of groundwater supply zones, and is connected to a feed of new relevant chemicals from LC–HRMS suspect screening. This requires that resources are available to perform the HRMS screening and to maintain capacity for data interpretation.

Out of 731 measured target chemicals, 153 are detected at least once across all source and produced drinking waters over a 5-year period. Of the 12,294 HRMS features detected, 1398 match to 3590 suspects. Many suspects retrieved in the source waters are not found in finished drinking water, while only a limited number of suspects are found in finished drinking water but not in the source water. We prioritized suspects for further identity confirmation based on the ratio of occurrence in produced water to potency. Once confirmed and assessed as relevant, the suspects can be added to the target monitoring.

For perfluorinated chemicals, 25 out of 691 suspects match features in source waters while 7 suspects are also retrieved in drinking water. Limited information is available for the 25 retrieved suspect perfluorinated chemicals both in the EU REACH registration and in scientific literature.

The 108 source waters are grouped into 7 clusters based on target and suspect information. Approximately half of the source waters can be considered relatively non-vulnerable to anthropogenic influences. Clusters of source waters in which higher numbers of chemicals are detected relate to high levels of infiltrated surface water. Produced drinking water clusters with the non-vulnerable source waters. For each cluster of source waters, a risk-based monitoring program for target chemicals is suggested, following the proposed risk-based criteria for the monitoring frequency of target chemicals in source and drinking water (Table 2).

Both simple and sorption treatment techniques, combined with mixing of individual source waters, have a positive effect on water quality. Mean removal efficiencies for simple, sorption and size exclusion drinking water treatment technologies differ significantly. Simple treatment alone shows the lowest removal efficiencies, while sorption-based techniques show relatively high removal efficiencies.

For prioritization of target chemicals, (p)GLVs are available for 45 of the 153 retrieved chemicals. Individually, these chemicals pose no appreciable concern to human health.

Availability of data and materials

The datasets used and/or analyzed during the current study are available in the supplementary information or from the corresponding author on reasonable request.



BQ: Benchmark quotient
CEC: Chemicals of emerging concern
DWD: EU Drinking Water Directive
HACCP: Hazard analysis and critical control point
HRMS: High-resolution mass spectrometry
Log Kow: Octanol–water partition coefficient
PCA: Principal component analysis
(p)GLV: (Preliminary) guideline values
RBM: Risk-based monitoring
RL: Reporting limit
RO: Reverse osmosis
TTC: Threshold of toxicological concern
WHO: World Health Organization


  1. Hollender J, Schymanski EL, Singer HP, Ferguson PL (2017) Nontarget Screening with High Resolution Mass Spectrometry in the Environment: Ready to Go? Environ Sci Technol 51:11505–11512
  2. Leusch FDL, Neale PA, Hebert A, Scheurer M, Schriks MCM (2017) Analysis of the sensitivity of in vitro bioassays for androgenic, progestagenic, glucocorticoid, thyroid and estrogenic activity: Suitability for drinking and environmental waters. Environ Int 99:120–130
  3. Van Wezel A, Mons M, Van Delft W (2010) New methods to monitor emerging chemicals in the drinking water production chain. J Environ Monit 12:80–89
  4. Kot M, Castleden H, Gagnon GA (2015) The human dimension of water safety plans: a critical review of literature and information gaps. Environ Rev 23:24–29
  5. Jurado A, Vàzquez-Suñé E, Carrera J, López de Alda M, Pujades E, Barceló D (2012) Emerging organic contaminants in groundwater in Spain: a review of sources, recent occurrence and fate in a European context. Sci Total Environ 440:82–94
  6. Lapworth DJ, Baran N, Stuart ME, Ward RS (2012) Emerging organic contaminants in groundwater: A review of sources, fate and occurrence. Environ Pollut 163:287–303
  7. Loos R, Locoro G, Comero S, Contini S, Schwesig D, Werres F, Balsaa P, Gans O, Weiss S, Blaha L, Bolchi M, Gawlik BM (2010) Pan-European survey on the occurrence of selected polar organic persistent pollutants in ground water. Water Res 44:4115–4126
  8. ter Laak TL, Puijker LM, van Leerdam JA, Raat KJ, Kolkman A, de Voogt P, van Wezel AP (2012) Broad target chemical screening approach used as tool for rapid assessment of groundwater quality. Sci Total Environ 427–428:308–313
  9. Sui Q, Cao X, Lu S, Zhao W, Qiu Z, Yu G (2015) Occurrence, sources and fate of pharmaceuticals and personal care products in the groundwater: a review. Emerg Contaminants 1:14–24
  10. Eggen T, Moeder M, Arukwe A (2010) Municipal landfill leachates: a significant source for new and emerging pollutants. Sci Total Environ 408:5147–5157
  11. Bonte M, Stuyfzand PJ, Hulsmann A, van Beelen P (2011) Underground thermal energy storage: Environmental risks and policy developments in the Netherlands and European Union. Ecol Soc
  12. Mendizabal I, Baggelaar PK, Stuyfzand PJ (2012) Hydrochemical trends for public supply well fields in The Netherlands (1898–2008), natural backgrounds and upscaling to groundwater bodies. J Hydrol 450–451:279–292
  13. van Wezel A, Puijker L, Vink C, Versteegh A, de Voogt P (2009) Odour and flavour thresholds of gasoline additives (MTBE, ETBE and TAME) and their occurrence in Dutch drinking water collection areas. Chemosphere 76:672–676
  14. Van Wezel AP, Ter Laak TL, Fischer A, Bäuerlein PS, Munthe J, Posthuma L (2017) Mitigation options for chemicals of emerging concern in surface waters operationalising solutions-focused risk assessment. Environ Sci Water Res Technol 3:403–414
  15. Guillén D, Ginebreda A, Farré M, Darbra RM, Petrovic M, Gros M, Barceló D (2012) Prioritization of chemicals in the aquatic environment based on risk assessment: analytical, modeling and regulatory perspective. Sci Total Environ 440:236–252
  16. von der Ohe PC, Dulio V, Slobodnik J, De Deckere E, Kühne R, Ebert RU, Ginebreda A, De Cooman W, Schüürmann G, Brack W (2011) A new risk assessment approach for the prioritization of 500 classical and emerging organic microcontaminants as potential river basin specific pollutants under the European Water Framework Directive. Sci Total Environ 409:2064–2077
  17. Moschet C, Wittmer I, Simovic J, Junghans M, Piazzoli A, Singer H, Stamm C, Leu C, Hollender J (2014) How a complete pesticide screening changes the assessment of surface water quality. Environ Sci Technol 48:5423–5432
  18. Sjerps RMA, Vughs D, van Leerdam JA, ter Laak TL, van Wezel AP (2016) Data-driven prioritization of chemicals for various water types using suspect screening LC-HRMS. Water Res 93:254–264
  19. Arnot JA, Brown TN, Wania F, Breivik K, McLachlan MS (2012) Prioritizing chemicals and data requirements for screening-level exposure and risk assessment. Environ Health Perspect 120:1565–1570
  20. Wambaugh JF, Setzer RW, Reif DM, Gangwal S, Mitchell-Blackwood J, Arnot JA, Joliet O, Frame A, Rabinowitz J, Knudsen TB, Judson RS (2013) High-throughput models for exposure-based chemical prioritization in the ExpoCast project. Environ Sci Technol 47:8479–8488
  21. Guha N, Guyton KZ, Loomis D, Barupal DK (2016) Prioritizing chemicals for risk assessment using chemoinformatics: Examples from the IARC monographs on pesticides. Environ Health Perspect 124:1823–1829
  22. Alygizakis NA, Samanipour S, Hollender J, Ibáñez M, Kaserzon S, Kokkali V, Van Leerdam JA, Mueller JF, Pijnappels M, Reid MJ, Schymanski EL, Slobodnik J, Thomaidis NS, Thomas KV (2018) Exploring the potential of a global emerging contaminant early warning network through the use of retrospective suspect screening with high-resolution mass spectrometry. Environ Sci Technol 52:5135–5144
  23. Dulio V, van Bavel B, Brorström-Lundén E, Harmsen J, Hollender J, Schlabach M, Slobodnik J, Thomas K, Koschorreck J (2018) Emerging pollutants in the EU: 10 years of NORMAN in support of environmental policies and regulations. Environ Sci Eur
  24. Schymanski EL, Williams AJ (2017) Open science for identifying “known unknown” chemicals. Environ Sci Technol 51:5357–5359
  25. McEachran AD, Mansouri K, Newton SR, Beverly BEJ, Sobus JR, Williams AJ (2018) A comparison of three liquid chromatography (LC) retention time prediction models. Talanta 182:371–379
  26. Schymanski EL, Jeon J, Gulde R, Fenner K, Ruff M, Singer HP, Hollender J (2014) Identifying small molecules via high resolution mass spectrometry: communicating confidence. Environ Sci Technol 48:2097–2098
  27. Baken KA, Sjerps RMA, Schriks M, van Wezel AP (2018) Toxicological risk assessment and prioritization of drinking water relevant contaminants of emerging concern. Environ Int 118:293–303
  28. Schriks M, Heringa MB, van der Kooi MME, de Voogt P, van Wezel AP (2010) Toxicological relevance of emerging contaminants for drinking water quality. Water Res 44:461–476
  29. US EPA (2020) Estimation Programs Interface Suite™ for Microsoft® Windows, v 4.11. United States Environmental Protection Agency, Washington, DC, USA
  30. Aronson D, Boethling R, Howard P, Stiteler W (2006) Estimating biodegradation half-lives for use in chemical screening. Chemosphere 63:1953–1960
  31. Toxicity ForeCaster (ToxCast™) Data. Accessed 16 Nov 2017
  32. Richard AM, Judson RS, Houck KA, Grulke CM, Volarath P, Thillainadarajah I, Yang C, Rathman J, Martin MT, Wambaugh JF, Knudsen TB, Kancherla J, Mansouri K, Patlewicz G, Williams AJ, Little SB, Crofton KM, Thomas RS (2016) ToxCast chemical landscape: paving the road to 21st century toxicology. Chem Res Toxicol 29:1225–1251
  33. Brunner AM, Dingemans MM, Baken KA, Van Wezel AP (2019) Prioritizing anthropogenic chemicals in drinking water and sources through combined use of mass spectrometry and ToxCast toxicity data. J Hazard Mater 364:332–338
  34. Kwon HO, Kim HY, Park YM, Seok KS, Oh JE, Choi SD (2017) Updated national emission of perfluoroalkyl substances (PFASs) from wastewater treatment plants in South Korea. Environ Pollut 220:298–306
  35. Westerhoff P, Yoon Y, Snyder S, Wert E (2005) Fate of endocrine-disruptor, pharmaceutical, and personal care product chemicals during simulated drinking water treatment processes. Environ Sci Technol 39:6649–6663
  36. Bruce GM, Pleus RC, Snyder SA (2010) Toxicological relevance of pharmaceuticals in drinking water. Environ Sci Technol 44:5619–5626
  37. de Jongh CM, Kooij PJF, de Voogt P, ter Laak TL (2012) Screening and human health risk assessment of pharmaceuticals and their transformation products in Dutch surface waters and drinking water. Sci Total Environ 427–428:70–77
  38. Houtman CJ, Kroesbergen J, Lekkerkerker-Teunissen K, van der Hoek JP (2014) Human health risk assessment of the mixture of pharmaceuticals in Dutch drinking water and its sources based on frequent monitoring data. Sci Total Environ 496:54–62
  39. Gros M, Blum KM, Jernstedt H, Renman G, Rodríguez-Mozaz S, Haglund P, Andersson PL, Wiberg K, Ahrens L (2017) Screening and prioritization of micropollutants in wastewaters from on-site sewage treatment facilities. J Hazard Mater 328:37–45
  40. Kaserzon SL, Heffernan AL, Thompson K, Mueller JF, Ramos MJ (2017) Rapid screening and identification of chemical hazards in surface and drinking water using high resolution mass spectrometry and a case-control filter. Chemosphere 182:656–664
  41. Pochodylo AL, Helbling DE (2017) Emerging investigators series: prioritization of suspect hits in a sensitive suspect screening workflow for comprehensive micropollutant characterization in environmental samples. Environ Sci Water Res Technol 3:54–65
  42. Reemtsma T, Berger U, Arp HPH, Gallard H, Knepper TP, Neumann M, Quintana JB, de Voogt P (2016) Mind the gap: persistent and mobile organic compounds—water contaminants that slip through. Environ Sci Technol 50:10308–10315
  43. Bader T, Schulz W, Kümmerer K, Winzenbacher R (2017) LC-HRMS data processing strategy for reliable sample comparison exemplified by the assessment of water treatment processes. Anal Chem 89:13219–13226



Funding for this study was provided by the Joint Research Programme of the Dutch water utilities.



Author information




RS, AMB and YF collected, analyzed and interpreted the data, and wrote the draft manuscript. JdM contributed to data analysis. PB contributed to the perfluorinated chemicals sections. AvW conceptualized the project, supervised and was a major contributor in writing the manuscript. BB, MdJ, and MS provided data and gave critical input to the study design and manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Annemarie van Wezel.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.


Supplementary Information

Additional file 1.

Additional figures and tables.

Additional file 2.

Additional tables.

Rights and permissions

Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.


About this article


Cite this article

Sjerps, R.M.A., Brunner, A.M., Fujita, Y. et al. Clustering and prioritization to design a risk-based monitoring program in groundwater sources for drinking water. Environ Sci Eur 33, 32 (2021).



  • Risk-based monitoring program
  • Contaminants of emerging concern
  • Groundwater quality
  • Drinking water
  • Suspect screening
  • Risk assessment