Application of bagging and boosting ensemble machine learning techniques for groundwater potential mapping in a drought-prone agriculture region of eastern India
Environmental Sciences Europe volume 36, Article number: 155 (2024)
Abstract
Groundwater is a primary source of drinking water for billions worldwide. It plays a crucial role in irrigation, domestic, and industrial uses, and significantly contributes to drought resilience in various regions. However, excessive groundwater extraction has left many areas vulnerable to potable water shortages. Therefore, assessing groundwater potential zones (GWPZ) is essential for implementing sustainable management practices to ensure the availability of groundwater for present and future generations. This study aims to delineate areas with high groundwater potential in the Bankura district of West Bengal using four machine learning methods: Random Forest (RF), Adaptive Boosting (AdaBoost), Extreme Gradient Boosting (XGBoost), and Voting Ensemble (VE). The models were trained on 161 data points, comprising 70% of the full inventory dataset, to learn the relationship between the conditioning factors and the presence or absence of groundwater in the region. Among the methods, Random Forest (RF) and Extreme Gradient Boosting (XGBoost) proved to be the most effective in mapping groundwater potential, suggesting their applicability in other regions with similar hydrogeological conditions. RF performed strongly, with a precision of 0.919, recall of 0.971, F1-score of 0.944, and accuracy of 0.943, indicating a strong capability to predict groundwater zones with few false positives and false negatives. Adaptive Boosting (AdaBoost) matched RF on every metric (precision: 0.919, recall: 0.971, F1-score: 0.944, accuracy: 0.943), highlighting its effectiveness in predicting groundwater potential areas. Extreme Gradient Boosting (XGBoost) slightly outperformed both, with higher precision (0.944), F1-score (0.958), and accuracy (0.957) at the same recall (0.971), suggesting a more refined model performance.
The Voting Ensemble (VE) approach also showed enhanced performance, mirroring XGBoost's metrics (precision: 0.944, recall: 0.971, F1-score: 0.958, accuracy: 0.957). This indicates that combining the strengths of individual models leads to better predictions. The groundwater potentiality zoning across the Bankura district varied significantly, with areas of very low potentiality accounting for 41.81% and very high potentiality at 24.35%. The uncertainty in predictions ranged from 0.0 to 0.75 across the study area, reflecting the variability in groundwater availability and the need for targeted management strategies.
In summary, this study highlights the critical need for assessing and managing groundwater resources effectively using advanced machine learning techniques. The findings provide a foundation for better groundwater management practices, ensuring sustainable use and conservation in Bankura district and beyond.
Introduction
Groundwater is an indispensable and crucial resource, fundamental to the survival of almost all forms of life, and essential for various developmental processes. Globally, nearly 2.5 billion people depend on groundwater for their daily needs, emphasizing its vital role [11, 20, 21, 38, 94]. India stands as the world's largest groundwater user, consuming approximately 230 km3 annually, which accounts for a quarter of the global total [28, 34, 41, 88]. Groundwater supports more than 60% of agricultural irrigation and 85% of drinking water needs in India [3, 68, 82, 85]. The country’s rapid population growth and changing rainfall patterns have intensified reliance on groundwater for agriculture and other sectors [54, 61, 70]. During weak monsoon years, groundwater often becomes the primary source of water for drinking and irrigation. Additionally, it serves as a crucial buffer against unpredictable monsoon rains, playing a key role in maintaining food security and reducing poverty by enhancing crop yields and mitigating crop failures [61, 73]. The increasing dependence on groundwater is also a response to the declining availability of surface water resources. Dams, reservoirs, and rivers, traditionally major sources of water, are under severe stress due to over-extraction, pollution, and climate variability. Surface water resources are diminishing, exacerbating water scarcity and leading to increased groundwater extraction as a compensatory measure [25]. Furthermore, water quality issues such as contamination from agricultural runoff, industrial effluents, and domestic waste have significantly degraded surface water sources, making groundwater an even more critical resource for safe drinking water and agricultural irrigation [68, 80].
Climate change has globally impacted groundwater, yet the exact magnitude of these changes remains uncertain due to insufficient observations [43, 44]. The persistent deterioration of groundwater resources threatens water supply, food security, economic prosperity, and the ability to achieve sustainable development [25]. Currently, around two billion people worldwide lack access to potable water, and almost half the global population experiences significant water shortages at various times of the year [45]. Climate change also directly or indirectly affects the hydrological cycle, which in turn affects the amount and quality of groundwater, via changes in precipitation levels and air temperature [24, 46].
This study aims to identify groundwater potential zones in the Bankura district of West Bengal, where groundwater potential is defined as the extent of groundwater storage that can be exploited beneficially for human use [51]. Bankura district, situated between the eastern fringe of the Chhotanagpur plateau and the Hooghly–Damodar alluvial tract, exhibits distinct groundwater availability patterns. The western part, characterized by a granite-gneiss basement complex and moderate slopes, has low groundwater potential and is prone to drought due to inhibited natural recharge [1]. Conversely, the eastern part, composed of older alluvium from the Hooghly–Damodar basin, has high groundwater potential but faces significant depletion due to over-extraction for irrigation. Groundwater potential mapping of the district is essential for devising a comprehensive water resource management strategy to combat drought in the west and regulate groundwater depletion in the east. To achieve this, we employed the “Groundwater Potential Zone Mapping (GWPZM)” method, which predicts spatial estimates of groundwater potential zones and assesses the likelihood of finding groundwater. GWPZM considers various surface features such as geomorphology, lithology, lineament density, slope, rainfall, and vegetation indices [29, 30, 48, 56, 59, 66, 70, 74, 87]. Additional factors like soil clay content, evapotranspiration, cropping intensity, and irrigated area indirectly affect groundwater availability. Supervised classification methods are trained to establish the correlation between the groundwater-influencing factors and the groundwater inventory locations. These associations are then used, across several models, to predict groundwater potential throughout the study region [36].
Conventional approaches to evaluating groundwater potential are time-consuming and expensive, often relying on extensive field mapping and exploration. Recent advances in statistical techniques, multi-criteria decision analysis, Remote Sensing (RS), and Geographic Information Systems (GIS) have facilitated the creation of groundwater potential zone maps cost-effectively [8, 16, 31, 47, 49, 62, 76, 85, 91].
In recent years, significant advancements in artificial intelligence (AI) and machine learning (ML) have been applied to the study of groundwater potential zones [26, 52, 54, 58, 69, 72, 78, 81, 92]. In this study, we employed four advanced ML algorithms, namely Random Forest (RF), Adaptive Boosting (AdaBoost), Extreme Gradient Boosting (XGBoost), and Voting Ensemble (VE), to evaluate groundwater potential in the Bankura district. These models were trained using 20 groundwater-influencing parameters to create precise maps of groundwater potential zones (GWPZ) for the study area. A key challenge in ML is overfitting, where a model learns not only the true patterns but also the noise in the data, leading to poor performance on new data. To address overfitting, we used cross-validation, regularization, and ensemble methods, aiming for robust models that perform effectively on new data. The primary objectives of this study are: (1) to create a groundwater inventory map for the designated study area, and (2) to develop and implement the four ML models (RF, AdaBoost, XGBoost, and VE), together with Remote Sensing (RS) and Geographic Information Systems (GIS), to build a precise Groundwater Potential Zone (GWPZ) map for the study area.
This study contributes valuable geographic data on groundwater recharge potential, aiding informed decision-making regarding water management in a region prone to drought and lacking adequate data.
Database of the study area
Study area
Bankura, located in western West Bengal between latitudes 22°38′N and 23°38′N and longitudes 86°36′E and 87°46′E, spans approximately 6882 km2 (Fig. 1c). It serves as a geographical link between the elevated Chhotanagpur plateau to the west and the flat alluvial plains of West Bengal to the east. Shaped like an isosceles triangle, the district is bounded by the Damodar River to the north. It has two subdivisions, 22 Community Development blocks, and a population of around 3.6 million. The district exhibits diverse geological and topographical features. The western region is part of the Chhotanagpur plateau, characterized by granitic extrusions exposed since the Precambrian period. This region's hard granite and gneiss rocks limit groundwater recharge due to low porosity and permeability. Conversely, the eastern area, composed of recent alluvial deposits from the Damodar–Dwarakeswar rivers, facilitates greater groundwater recharge due to higher porosity. The central region primarily consists of laterites.
Physiographically, Bankura is divided into three units: the Western Upland (150–300 m elevation, peaking at 448 m), the Central Rolling Plain (50–150 m elevation), and the Eastern and Northern Alluvial Plain (30–50 m elevation) (Fig. 1). The district experiences a tropical monsoon climate with dry summers, an average annual temperature of 25 °C, and a temperature range from 6 °C in January to 45 °C in May. The mean annual rainfall is 130 cm, with 80% occurring during the monsoon from June to September. The region also experiences frequent Nor’westers during winter and early summer, contributing significantly to groundwater recharge. Culturally, there is a distinct disparity between the western and eastern parts of Bankura. The eastern region, benefiting from higher groundwater potential, supports profitable agriculture with major crops including rice, wheat, and potatoes. In contrast, the western part relies more on canal irrigation and surface water, leading to lower crop yields and agricultural output. Over-reliance on groundwater in the eastern region has led to significant water table depletion, increasing the depth required to access groundwater (Fig. 2l).
Groundwater inventory map
Generating groundwater inventory points is a fundamental step in developing robust models for delineating groundwater potential zones, particularly in regions where data availability is limited. The methodology begins with an analysis of aquifer distribution and depth within the district, using data sourced from the Central Ground Water Board (CGWB). This dataset serves as a foundational resource for understanding the spatial characteristics of groundwater resources. Given the constraints of data scarcity, we generated the groundwater inventory points systematically. This included comprehensive data gathering on aquifer properties, with careful consideration given to the district's geomorphological and lithological formations, which exert significant influence on groundwater dynamics. An analysis of drainage density and proximity to streams provided critical insights into the hydrological network, aiding in the identification of areas suitable and unsuitable for groundwater potential. Furthermore, elevation data were used to discern varying levels of groundwater potential across different regions. Despite the data constraints, a total of 231 points were selected throughout the district to ensure the models were trained with a representative sample of available data. Care was taken to ensure that the points were distributed randomly and free of spatial bias. Subsequently, every point was allocated a binary label, where '0' represents sites without potential and '1' signifies sites with potential for groundwater presence. This binary assignment is based on a detailed evaluation process. Each point is assessed for aquifer presence and depth using CGWB data, with significant aquifer presence indicating potential sites.
Points located in areas with low drainage density and proximity to streams are considered more likely to have groundwater potential due to better recharge conditions. Points at lower elevations that support groundwater recharge and storage are identified as potential sites. Based on the collective impact of these factors, each point is assigned a binary value. Points that exhibit favourable conditions across these criteria are assigned a value of ‘1’ indicating the potential for groundwater presence, while those that do not meet these criteria are designated with ‘0’.
In order to validate the models’ efficiency, the dataset is divided into training and testing subsets, maintaining a ratio of 70% (161 points) for training and 30% (70 points) for testing. This partitioning method enables the model's training on a major chunk of the data while providing a distinct set for unbiased evaluation and validation.
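As an illustration of the 70/30 partition described above, the following minimal sketch shuffles a list of inventory points and splits it. The function name `split_inventory`, the seed, and the placeholder labels are assumptions made for this example; the source does not state how the split was drawn or whether it was stratified by class.

```python
import random

def split_inventory(points, train_fraction=0.7, seed=42):
    """Shuffle the inventory points and split them into training and testing subsets."""
    shuffled = list(points)
    random.Random(seed).shuffle(shuffled)  # fixed seed (assumed) for reproducibility
    # Round the training count down: int(231 * 0.7) gives 161.
    n_train = int(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]

# Hypothetical inventory: 231 (point_id, label) pairs with placeholder labels.
inventory = [(i, i % 2) for i in range(231)]
train, test = split_inventory(inventory)
print(len(train), len(test))  # 161 70
```

A plain shuffle does not guarantee equal class proportions in both subsets; a stratified split would be needed for that.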
Groundwater controlling factors and potentiality parameters (GWPP)
Groundwater occurrence and movement are influenced by various factors, including geological structure, topography, lithology, geomorphology, fracture characteristics, porosity, recharge, water table distribution, drainage, slope, landform, land use/cover, climate, and human activities [10, 16, 35, 40, 61, 71, 75]. Recent studies have employed integrated approaches combining Remote Sensing (RS), Geographic Information System (GIS), and Machine Learning (ML) for groundwater potentiality mapping [60, 63, 75, 81, 89]. This study utilized 20 influential groundwater conditioning factors derived from satellite data, geological maps, and field surveys. Topographic variables (elevation, slope, profile curvature, TWI) were obtained from SRTM DEM (30 m), while landcover factors (NDVI, mNDWI) were derived from Landsat-8 OLI/TIRS (30 m). LULC data came from Dynamic World (10 m), and hydrological factors (drainage density, distance to streams) were extracted from SRTM DEM (30 m). Annual rainfall data were sourced from CHIRPS (5 km), evapotranspiration from MODIS MOD16A2 (500 m), soil texture from ESDAC, and clay content from Open Land Map (250 m). Geological aspects like lithology, geomorphology, and lineament were derived from a digital map of the district, BHUKOSH GSI (Geological Survey of India). Water table depth point data were obtained from a Digital map of the district, Central Ground Water Board (CGWB), and then interpolated using ‘idw interpolation’ tool in ArcGIS 10.4.1. The population count thematic layer was obtained from the Global Human Settlement Layer (GHSL) (100 m x 100 m). Cropping intensity and irrigated area data are collected from the Agriculture Census (Government of India, Ministry of Agriculture & Farmers Welfare), and the thematic layer was prepared using ‘choropleth map’ in ArcGIS 10.4.1. 
Before implementing supervised machine learning classification models, all raster layers were resampled to a 30 m resolution using the ‘resample’ tool within the GIS platform. Table 1 provides a detailed overview of the sources of the parameters used in the groundwater potentiality zoning, and their descriptions are provided as Supplementary Information (S1). The spatial distribution of the selected conditioning factors is presented in Fig. 2a–t.
Multicollinearity test
Multicollinearity is a statistical phenomenon that arises when there is a significant degree of correlation among the predictor variables in a regression model [2, 7, 12, 14]. This correlation can have a significant impact on the reliability of statistical inferences about the data. A total of 20 groundwater influencing factors were chosen for the study region, and tolerance and the variance inflation factor (VIF) were used to evaluate the extent of multicollinearity among them. This assessment was carried out before the machine learning modelling to prevent noise in the groundwater models. VIF is commonly used as an indicator of multicollinearity in the geosciences, specifically in groundwater modelling research [6, 9, 30, 58, 72, 86]. Typically, a VIF value above 5 indicates a strong correlation between variables, suggesting multicollinearity. However, in this investigation, a VIF threshold of 10 was used to choose parameters for the final modelling. Likewise, a tolerance value below 0.1 indicates the presence of multicollinearity. Hence, it is highly recommended to exclude influencing factors from the model if their VIF values are > 10 and their tolerance is < 0.1. The tolerance and VIF values of the selected groundwater influencing factors were calculated using Eqs. 1 and 2:
\[{T}_{i}=1-{R}_{i}^{2}\tag{1}\]
\[{VIF}_{i}=\frac{1}{{T}_{i}}=\frac{1}{1-{R}_{i}^{2}}\tag{2}\]
where “\({T}_{i}\) is the tolerance for the ith predictor variable, \({VIF}_{i}\) is the VIF for the ith predictor variable, \({R}_{i}^{2}\) is the coefficient of determination when the ith variable is regressed against all the other predictor variables”.
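The screening rule can be sketched in a few lines, assuming the standard definitions \({T}_{i}=1-{R}_{i}^{2}\) and \({VIF}_{i}=1/{T}_{i}\) and the thresholds stated above (VIF > 10 and tolerance < 0.1). The function names and the \({R}^{2}\) values below are hypothetical and are not taken from Table 3.

```python
def tolerance_and_vif(r_squared):
    """Return (tolerance, VIF) for a predictor, given its R^2 against the other predictors."""
    if not 0.0 <= r_squared < 1.0:
        raise ValueError("R^2 must lie in [0, 1)")
    tolerance = 1.0 - r_squared
    return tolerance, 1.0 / tolerance

def is_collinear(r_squared, vif_limit=10.0, tolerance_limit=0.1):
    """Flag a predictor for exclusion using the thresholds quoted in the text."""
    tolerance, vif = tolerance_and_vif(r_squared)
    return vif > vif_limit and tolerance < tolerance_limit

# Hypothetical R^2 values, for illustration only.
example_r2 = {"slope": 0.35, "elevation": 0.80, "factor_x": 0.95}
for name, r2 in example_r2.items():
    tol, vif = tolerance_and_vif(r2)
    print(f"{name}: tolerance={tol:.2f}, VIF={vif:.2f}, exclude={is_collinear(r2)}")
```

Because VIF is the reciprocal of tolerance, the two thresholds describe the same cut-off: an \({R}^{2}\) above 0.9.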
Methods
Implementation of machine learning algorithms
For the groundwater potential zonation of Bankura district, we employed three tree-based machine learning models, namely Random Forest (RF), Adaptive Boosting (AdaBoost), and Extreme Gradient Boosting (XGBoost), along with a hybrid ensemble model known as Voting Ensemble (VE). A distinctive feature of our approach is the deliberate omission of typical data pre-processing steps such as feature scaling. Tree-based models are insensitive to the magnitude of the input features because each split compares the values of a single feature against a threshold, so rescaling a feature does not change how the samples are partitioned. Unlike machine learning models that demand standardization or normalization of the input data, tree-based models are therefore largely unaffected by feature scale. The algorithms were selected for their strong capacity to classify and predict outcomes with precision. The methodological flow diagram (Fig. 3) shows the complete workflow of the research.
Random forest (RF)
Random Forest, developed by Leo Breiman and Adele Cutler [18, 23, 79], acts as a meta-estimator that aggregates predictions from multiple decision trees to produce the final output. It trains several base estimators on different subsets of the dataset and uses averaging to enhance predictive accuracy and prevent overfitting. In binary classification with a random forest, it is important to understand the Gini index (Eq. 3), which determines how nodes branch in a decision tree:
\[Gini=1-\sum_{i=1}^{C}{p}_{i}^{2}\tag{3}\]
Here, \(C\) refers to the number of classes in the dataset, and \({p}_{i}\) is the proportion of occurrences of class \(i\). The Gini index ranges from 0 to \(1-1/C\) (0.5 for the two classes considered here), where a lower value suggests lesser impurity (homogeneity) and a higher value signifies higher impurity (heterogeneity). The main goal in employing the Gini index is to construct a tree structure that adeptly distinguishes between different classes in the dataset.
The decision function of the random forest model (Eq. 4) aggregates the decisions made by the individual decision trees:
\[F\left(X\right)=mode\left\{{f}_{1}\left(X\right),{f}_{2}\left(X\right),\dots ,{f}_{N}\left(X\right)\right\}\tag{4}\]
In this context, \(F\left(X\right)\) denotes the prediction for a feature vector \(X\), and ‘\(mode\)’ signifies the class most frequently predicted by the trees. The total number of trees is represented by the variable \(N\). Each \({f}_{i}\left(X\right)\) denotes the output of the \(i\)th decision tree, which is trained on a randomly chosen subset of features and a bootstrapped sample extracted from the dataset.
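The two ingredients described above, node impurity and majority voting, can be illustrated with a short sketch. It assumes the usual definition of the Gini index, one minus the sum of squared class proportions; the function names are hypothetical, and the code shows the arithmetic only, not a full random forest.

```python
from collections import Counter

def gini_index(labels):
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())

def forest_predict(tree_votes):
    """Majority vote: the class most frequently predicted by the trees.

    Ties are resolved in favour of the class that appears first in the list.
    """
    return Counter(tree_votes).most_common(1)[0][0]

print(gini_index([1, 1, 1, 1]))         # pure node: 0.0
print(gini_index([0, 0, 1, 1]))         # evenly mixed binary node: 0.5
print(forest_predict([1, 0, 1, 1, 0]))  # three of five trees vote 1: 1
```

A split is attractive when the child nodes it creates have a lower weighted Gini index than the parent node.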
Adaptive boosting (AdaBoost)
AdaBoost, an efficient boosting algorithm introduced by Yoav Freund and Robert Schapire [33, 65], is a member of the ensemble learning family. The algorithm works by iteratively training a sequence of weak learners, typically decision stumps, and raising the weights assigned to misclassified examples. This allows the algorithm to improve its predictions over several rounds. By combining the weak learners, a strong classifier is created, ultimately increasing the overall predictive accuracy.
The process begins by assigning an initial weight to each sample, \({\omega }_{i}^{\left(1\right)}=\frac{1}{N}\), where \(N\) refers to the number of samples. The weak classifier \({h}_{t}\) is then trained on the training set using the weights \({\omega }_{i}^{\left(t\right)}\), and the weighted error (\({err}_{t}\)) and classifier weight (\({\alpha }_{t}\)) are computed (Eqs. 5–6):
\[{err}_{t}=\frac{\sum_{i=1}^{N}{\omega }_{i}^{\left(t\right)}I\left({y}_{i}\ne {h}_{t}\left({x}_{i}\right)\right)}{\sum_{i=1}^{N}{\omega }_{i}^{\left(t\right)}}\tag{5}\]
\[{\alpha }_{t}=\mathrm{ln}\left(\frac{1-{err}_{t}}{{err}_{t}}\right)\tag{6}\]
Here, \(t\) ranges from \(1\) to \(T\), where \(T\) represents the total number of iterations. The true class label of sample \(i\) is represented by \({y}_{i}\), the prediction produced by the \(t\)th weak classifier for the same sample is represented by \({h}_{t}\left({x}_{i}\right)\), and the weight assigned to the sample at iteration \(t\) is represented by \({\omega }_{i}^{\left(t\right)}\). The indicator function \(I\left({y}_{i}\ne {h}_{t}\left({x}_{i}\right)\right)\) evaluates to 1 if a misclassification occurs, meaning the true class label differs from the label predicted by the weak classifier, and to 0 otherwise.
After calculating the weighted error and the weight of the weak classifier at each iteration \(t\), the AdaBoost algorithm proceeds to update the sample weights. The update assigns larger weights to the incorrectly classified samples, concentrating the upcoming iterations on examples that the present ensemble of weak classifiers was unable to classify correctly. The following formula (Eq. 7) expresses the sample weight update:
\[{\omega }_{i}^{\left(t+1\right)}={\omega }_{i}^{\left(t\right)}\mathrm{exp}\left({\alpha }_{t}I\left({y}_{i}\ne {h}_{t}\left({x}_{i}\right)\right)\right)\tag{7}\]
The final prediction of the binary classification problem is obtained by combining the weak classifiers (Eq. 8):
\[H\left(x\right)=sign\left(\sum_{t=1}^{T}{\alpha }_{t}{h}_{t}\left(x\right)\right)\tag{8}\]
In this formulation, \(H\left(x\right)\) is the final prediction of the ensemble, and the prediction produced by the \(t\)th weak classifier for a sample is denoted by \({h}_{t}\left(x\right)\). The \(sign\)() function ensures that the final prediction is either + 1 or − 1, representing the positive and negative classes, respectively.
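A single boosting round can be sketched as follows, assuming the textbook discrete AdaBoost scheme with labels in {+1, −1}: weighted error, classifier weight \(\mathrm{ln}\left(\left(1-err\right)/err\right)\), and an exponential weight increase for misclassified samples. The function names are hypothetical, the renormalization of weights is added here for readability, and the SAMME.R variant configured in this study works with class probabilities and differs in detail.

```python
import math

def adaboost_round(weights, y_true, y_pred):
    """One boosting round: return (weighted error, classifier weight, updated weights)."""
    miss = [1 if y != h else 0 for y, h in zip(y_true, y_pred)]
    err = sum(w * m for w, m in zip(weights, miss)) / sum(weights)
    if not 0.0 < err < 1.0:
        raise ValueError("weighted error must lie strictly between 0 and 1")
    alpha = math.log((1.0 - err) / err)
    # Only misclassified samples (miss == 1) have their weight increased.
    raised = [w * math.exp(alpha * m) for w, m in zip(weights, miss)]
    total = sum(raised)
    return err, alpha, [w / total for w in raised]  # renormalized to sum to 1

def adaboost_predict(alphas, stump_outputs):
    """Sign of the alpha-weighted sum of weak-classifier outputs (+1 or -1)."""
    score = sum(a * h for a, h in zip(alphas, stump_outputs))
    return 1 if score >= 0 else -1

# Four samples with uniform initial weights; the last one is misclassified.
err, alpha, new_weights = adaboost_round(
    [0.25] * 4, y_true=[1, 1, -1, -1], y_pred=[1, 1, -1, 1]
)
print(round(err, 3), round(alpha, 3), [round(w, 3) for w in new_weights])
```

In this toy round the misclassified sample ends up carrying half of the total weight, so the next weak learner is pushed to classify it correctly.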
Extreme gradient boosting (XGBoost)
XGBoost, developed by Tianqi Chen and the contributors of the DMLC (Distributed Machine Learning Community) [19, 39], is an efficient and optimized implementation of the gradient boosting technique that improves on its predecessor in performance and versatility when solving complicated machine learning tasks. Both XGBoost and AdaBoost fall under the ensemble learning umbrella, where weak learners are trained successively to produce a robust model. In contrast to AdaBoost's emphasis on decision stumps as weak learners, XGBoost employs a more flexible approach, frequently leveraging decision trees, and integrates regularization to prevent overfitting. Its prediction is the sum of the outputs of its weak learners (Eq. 9):
\[F\left(X\right)=\sum_{m=1}^{M}{f}_{m}\left(X\right)\tag{9}\]
Here, the final prediction \(F\left(X\right)\) is a result of the combined contributions of each weak learner, denoted as \({f}_{m}\left(X\right)\), where \(M\) represents the total number of weak learners. Each tree is trained to reduce the loss left by its predecessors, leading to an adaptive process that improves the overall accuracy of the model.
One key factor contributing to XGBoost's efficacy is the integration of regularization within the objective function. The objective consists of two components: the loss term assesses the model's fit to the training data, while the regularization term penalizes excessive model complexity. The objective function is expressed as follows (Eq. 10):
\[Obj=\sum_{i=1}^{n}L\left({y}_{i},{\widehat{y}}_{i}\right)+\sum_{m=1}^{M}\Omega \left({f}_{m}\right)\tag{10}\]
Here, \(n\) is the number of training samples, \(L\left({y}_{i},{\widehat{y}}_{i}\right)\) denotes the loss function, which measures the disparity between the actual label \({y}_{i}\) and the predicted label \({\widehat{y}}_{i}\), and \(\Omega \left({f}_{m}\right)\) stands for the regularization term applied to the mth tree.
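For orientation, the regularization term is commonly written in the XGBoost literature in the form below. Whether the study spells it out this way is an assumption, and the symbols \({T}_{m}\) (number of leaves in the mth tree), \({w}_{j}\) (score of leaf \(j\)), \(\gamma\) (penalty per leaf) and \(\lambda\) (L2 penalty on leaf scores) are introduced here only for illustration.

```latex
\Omega\left(f_{m}\right) = \gamma\, T_{m} + \frac{1}{2}\,\lambda \sum_{j=1}^{T_{m}} w_{j}^{2}
```

Under this form, a \(\gamma\) of 0 removes the per-leaf penalty, so tree size is then constrained only by the remaining settings such as maximum depth and minimum child weight.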
Voting ensemble (VE)
A Voting Ensemble is a type of multi-model ensemble that combines the predictions of several base models to achieve better performance than any single classifier used in the ensemble. In this ensemble paradigm, several independent base models are trained using the training set. The predictions of the base models are then aggregated, and the final ensemble model predicts the label favoured by the vote. The mathematical formulation of a Voting Ensemble depends on the voting mechanism used, whether hard voting or soft voting.
In the case of hard voting, the final prediction is established by the majority vote. If there are \(M\) base models, and each model \(i\) produces a binary prediction \({h}_{i}\left(x\right)\) for input \(x\), the final prediction \(H\left(x\right)\) is given by (Eq. 11):
\[H\left(x\right)=mode\left\{{h}_{1}\left(x\right),{h}_{2}\left(x\right),\dots ,{h}_{M}\left(x\right)\right\}\tag{11}\]
In the case of soft voting, the final prediction depends on the weighted average of the predicted probabilities. If each model \(i\) offers a probability distribution over classes \({p}_{i,k}\left(x\right)\) for each class \(k\), the final predicted probability distribution \(P{\left(x\right)}_{k}\) can be expressed as (Eq. 12):
\[P{\left(x\right)}_{k}=\frac{\sum_{i=1}^{M}{\omega }_{i}{p}_{i,k}\left(x\right)}{\sum_{i=1}^{M}{\omega }_{i}}\tag{12}\]
Here, \({\omega }_{i}\) is the weight applied to each model, and \(P{\left(x\right)}_{k}\) is the predicted probability for class \(k\). The Voting Ensemble stands as a methodological architecture that exploits the capabilities of its constituent models, expressing a cooperative and meta-approach to predictive modelling.
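The difference between the two voting schemes can be shown with a small sketch. It assumes hard voting takes the most frequent label and soft voting takes a weighted average of per-class probabilities followed by an argmax; the function names and the probability values are hypothetical.

```python
from collections import Counter

def hard_vote(labels):
    """Majority label across the base models (ties go to the first label seen)."""
    return Counter(labels).most_common(1)[0][0]

def soft_vote(probabilities, weights=None):
    """Weighted average of per-model class probabilities, plus the winning class index."""
    if weights is None:
        weights = [1.0] * len(probabilities)  # equal weights, as when weights is None
    total = sum(weights)
    n_classes = len(probabilities[0])
    averaged = [
        sum(w * p[k] for w, p in zip(weights, probabilities)) / total
        for k in range(n_classes)
    ]
    predicted = max(range(n_classes), key=averaged.__getitem__)
    return averaged, predicted

# Hypothetical [P(class 0), P(class 1)] outputs of three base models for one pixel.
model_probs = [[0.4, 0.6], [0.7, 0.3], [0.2, 0.8]]
print(hard_vote([1, 0, 1]))                          # 1
print(soft_vote(model_probs)[1])                     # equal weights favour class 1
print(soft_vote(model_probs, weights=[1, 3, 1])[1])  # up-weighting model 2 favours class 0
```

The last call shows why the choice of weights matters: the same three probability vectors yield a different class once the second model is given more weight.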
Optimization of hyperparameters
Fine-tuning the hyperparameters is an essential step in machine learning model refinement to optimize algorithmic configurations and enhance predictive accuracy. GridSearchCV provides a systematic strategy for navigating a large hyperparameter space and identifying the best-performing combination for a specific algorithm. It performs an exhaustive search over a predetermined grid of hyperparameter values, evaluating model performance for every combination. In our model development process, we used scikit-learn's ‘GridSearchCV’ class with a fivefold cross-validation strategy to systematically search for the most favourable hyperparameter combinations.
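The logic of an exhaustive grid search with fivefold cross-validation can be sketched without any library, as below. This is an illustrative re-implementation rather than scikit-learn's code: the helper names, the example grid, and the placeholder scoring function `toy_score` are assumptions, and scikit-learn's own splitter differs in details such as stratification.

```python
from itertools import product

def parameter_grid(grid):
    """Yield every combination of hyperparameter values in the grid."""
    keys = sorted(grid)
    for values in product(*(grid[key] for key in keys)):
        yield dict(zip(keys, values))

def kfold_indices(n_samples, n_splits=5):
    """Yield (train_idx, val_idx) pairs for contiguous, near-equal folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

def grid_search(grid, n_samples, score_fn, n_splits=5):
    """Return the combination with the highest mean cross-validated score."""
    best_params, best_score = None, float("-inf")
    for params in parameter_grid(grid):
        scores = [score_fn(params, tr, va) for tr, va in kfold_indices(n_samples, n_splits)]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score

def toy_score(params, train_idx, val_idx):
    """Placeholder for 'fit on train_idx, score on val_idx'; not a real model evaluation."""
    return -(abs(params["max_depth"] - 2) + abs(params["learning_rate"] - 0.05))

example_grid = {"max_depth": [2, 4, 6], "learning_rate": [0.05, 0.1, 0.3]}
best_params, best_score = grid_search(example_grid, n_samples=161, score_fn=toy_score)
print(best_params)
```

The cost grows multiplicatively: this 3 × 3 grid with five folds already requires 45 model fits, which is why the grid is kept to a predetermined set of candidate values.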
The hyperparameters specified in Table 2 for each classifier are crucial in determining its predictive behaviour. In Random Forest, a maximum depth of 6 limits overfitting, while a maximum feature fraction of 0.4 introduces variation among trees. With a maximum sample fraction of 0.5, each tree is trained on a random 50% subset of the whole training set, creating diversity and preventing over-reliance on individual data points. Samples per leaf and split (both set at 2) add granularity to decision-making, while the choice of 50 trees balances computational efficiency against the model's ability to capture complicated patterns. These hyperparameters collectively tune the Random Forest to balance complexity, diversity, and predictive accuracy.
The hyperparameters selected for the AdaBoost and XGBoost algorithms reflect the distinct configurations of their respective boosting frameworks. For AdaBoost, which uses the SAMME.R algorithm, a learning rate of 0.3 governs the contribution of each weak learner, and the ensemble is built incrementally from 50 estimators, or weak learners. XGBoost, by contrast, has a more elaborate setup. A colsample_bytree of 0.3 sets the fraction of features sampled when generating each tree, bringing diversity among trees. A gamma value of 0 means that no minimum loss reduction is required to make a further split, so tree complexity is not penalized through this parameter. The learning rate is set to 0.05, giving slow, cautious learning, while the maximum depth of 2 and minimum child weight of 1 limit the tree structure, preventing overfitting. With 150 estimators, XGBoost aims to produce a strong ensemble with a fine-tuned balance between complexity and predictive performance.
The Voting Ensemble, which combines Random Forest, AdaBoost, and XGBoost as base estimators, is a multi-model ensemble. With “voting” set to “soft”, the final prediction is based on a weighted average of the estimated class probabilities. The “weights” parameter is “None”, so equal priority is assigned to each model. This is suitable when models contribute comparably; where performance differs markedly between models, manual weight assignment could be investigated for optimization. These careful hyperparameter choices yield a collection of finely tuned models, supporting dependable and site-specific predictions of locations with groundwater potential.
Model interpretability
Using interpretability tools to explain how black box models operate has increased significantly in recent years, particularly those based on tree ensembles such as Random Forest and XGBoost [84]. In the present study, we employed SHAP (SHapley Additive exPlanations) as a post hoc technique to compute Shapley values and expound upon the predictions generated by the machine learning (ML) models under examination. This approach allows us to strike a balance between achieving optimal accuracy and fostering a meaningful level of interpretability in the model. Shapley values, rooted in game theory [83], offer insights into the individual contributions of features to the model's outcomes. The computation of Shapley values involves analysing the average marginal impact of input variables on prediction outcomes by assessing permutations of feature values to derive their respective Shapley values. Features deemed more influential are assigned higher Shapley values, reflective of their pronounced impact on the model's results. To carry out this analysis, we leverage the KernelExplainer implementation of Kernel SHAP, a model-agnostic and efficient method for estimating SHAP values applicable to any model.
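To make the idea of an average marginal contribution concrete, the sketch below computes exact Shapley values by enumerating every ordering of a small feature set. The value function `toy_value` and its numbers are hypothetical stand-ins for a model's output on a subset of features. Kernel SHAP does not enumerate orderings in this way; it approximates the same quantities by sampling, so this is an illustration of the definition rather than of the KernelExplainer algorithm.

```python
from itertools import permutations

def shapley_values(features, value_fn):
    """Exact Shapley values: mean marginal contribution over all feature orderings."""
    totals = {feature: 0.0 for feature in features}
    orderings = list(permutations(features))
    for order in orderings:
        coalition = frozenset()
        for feature in order:
            extended = coalition | {feature}
            totals[feature] += value_fn(extended) - value_fn(coalition)
            coalition = extended
    return {feature: total / len(orderings) for feature, total in totals.items()}

def toy_value(coalition):
    """Hypothetical model output when only the features in `coalition` are known."""
    value = 0.0
    if "rainfall" in coalition:
        value += 0.3
    if "slope" in coalition:
        value += 0.1
    if "rainfall" in coalition and "lithology" in coalition:
        value += 0.2  # interaction effect, shared between the two features
    return value

phi = shapley_values(["rainfall", "slope", "lithology"], toy_value)
print({name: round(value, 3) for name, value in phi.items()})
```

The values sum to the output for the full feature set (0.6 here), and the interaction term is split equally between the two features that produce it. Exact enumeration scales factorially with the number of features, which is why sampling-based estimators are used for a 20-factor model.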
Estimation of uncertainty
Integrating uncertainty estimation is essential in groundwater potential mapping as it enables a more nuanced interpretation of the findings. It aids in pinpointing regions where predictions are less dependable and where extra data may be required to enhance the accuracy of the model. Uncertainty estimates improve the reliability and trustworthiness of our groundwater potential maps by providing a measure of prediction confidence. This allows decision-makers to make better-educated choices based on our findings.
We added uncertainty measures by calculating the per-pixel standard deviation of the groundwater potential probabilities generated by each model (Random Forest, AdaBoost, and XGBoost). To accomplish this, we utilized the ensemble probabilities generated by the Voting Ensemble model as the average value for computing the standard deviation. This approach allows us to measure the variability in predictions at each pixel, allowing a precise assessment of the consistency across different models. By comparing the probabilities of each model with the average of the ensemble, we derived a thorough assessment of the uncertainty linked to each prediction. This assessment highlighted places where the models exhibited consistent agreement, as well as areas with higher levels of predictive variability.
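The per-pixel calculation can be sketched as follows. This is an illustration under stated assumptions, not the study's code: the probabilities are hypothetical, the ensemble probability is taken as the equal-weight mean of the three models, and the population form of the standard deviation (division by the number of models) is assumed.

```python
import math


def pixel_uncertainty(model_probs, ensemble_prob):
    """Standard deviation of model probabilities around the ensemble value."""
    squared = [(p - ensemble_prob) ** 2 for p in model_probs]
    return math.sqrt(sum(squared) / len(model_probs))


# Hypothetical (RF, AdaBoost, XGBoost) probabilities for three pixels.
pixels = [(0.90, 0.88, 0.92), (0.20, 0.60, 0.85), (0.10, 0.12, 0.08)]

uncertainty = []
for probs in pixels:
    ensemble = sum(probs) / len(probs)  # assumed equal-weight soft vote
    uncertainty.append(pixel_uncertainty(probs, ensemble))
```

The second pixel, where the models disagree strongly, receives a much larger value than the first and third, where they agree.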
Evaluation of model performance
Multicollinearity results
The variance inflation factor (VIF) is used to evaluate multicollinearity and to select variables for regression analysis. However, the VIF alone does not identify which factors influence the dependent variable. To determine the degree of multicollinearity in the groundwater potential zonation modelling and assess the importance of the selected factors that govern groundwater potentiality, we computed both the VIF and the tolerance value. A collinearity issue is typically detected when the VIF exceeds 10 and the tolerance value falls below 0.1. The VIF and tolerance values for all groundwater influencing factors in this research, as indicated in Table 3, lie within these limits (VIF below 10 and tolerance above 0.1). This suggests that none of the selected groundwater influencing factors display any problems of multicollinearity.
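For reference, the relationship between the two diagnostics can be sketched in a few lines. The tolerance of a factor is 1 − R², where R² comes from regressing that factor on all the other factors, and the VIF is the reciprocal of the tolerance. The R² inputs below are hypothetical, and the function names are illustrative only.

```python
def tolerance_and_vif(r_squared):
    """Return (tolerance, VIF) for a factor regressed on the other factors."""
    tolerance = 1.0 - r_squared
    return tolerance, 1.0 / tolerance


def is_collinear(r_squared, vif_limit=10.0, tol_limit=0.1):
    """Apply the usual rule of thumb: VIF above 10 or tolerance below 0.1."""
    tolerance, vif = tolerance_and_vif(r_squared)
    return vif > vif_limit or tolerance < tol_limit


# Hypothetical R^2 values for two conditioning factors.
moderate = tolerance_and_vif(0.50)   # (0.5, 2.0)
flag_moderate = is_collinear(0.50)   # False
flag_strong = is_collinear(0.95)     # True: VIF is about 20
```

Because the VIF is the reciprocal of the tolerance, the two thresholds (10 and 0.1) express the same cut-off.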
Interpretability of machine learning by Shapley method
Shapley values, derived from cooperative game theory, offer a robust method for interpreting the contribution of each player (in our context, each input feature) towards the prediction produced by a coalition (a subset of features). This is particularly useful in complex ensemble methods where understanding individual contributions is key to improving overall model performance and transparency.
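The Shapley value of feature \(j\) can be calculated as:
\[{\phi }_{j}\left(val\right)=\sum_{S\subseteq \left\{1,\dots ,p\right\}\setminus \left\{j\right\}}\frac{\left|S\right|!\left(p-\left|S\right|-1\right)!}{p!}\left(val\left(S\cup \left\{j\right\}\right)-val\left(S\right)\right)\]
where \(S\) is a subset of the features used in the model, \(x\) is the vector of feature values of the instance to be explained, \(p\) is the number of features, and \(val\left(S\right)\) is the prediction for the feature values of \(x\) in set \(S\), marginalized over the features that are not included in set \(S\).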
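To make the subset-weighted average of marginal contributions concrete, the sketch below computes exact Shapley values for a toy model with three features. The coalition values in `toy_val` are hypothetical, and the exhaustive enumeration is only practical for a handful of features, which is why approximations such as Kernel SHAP are used in practice.

```python
from itertools import combinations
from math import factorial


def shapley_values(value, p):
    """Exact Shapley values for features 0..p-1 given a coalition value function."""
    phi = []
    for j in range(p):
        others = [i for i in range(p) if i != j]
        total = 0.0
        for size in range(len(others) + 1):
            # Weight |S|! (p - |S| - 1)! / p! for subsets S of this size.
            weight = factorial(size) * factorial(p - size - 1) / factorial(p)
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (value(s | {j}) - value(s))
        phi.append(total)
    return phi


# Hypothetical val(S): prediction when only the features in S are known.
toy_val = {
    frozenset(): 0.0,
    frozenset({0}): 0.4,
    frozenset({1}): 0.1,
    frozenset({2}): 0.0,
    frozenset({0, 1}): 0.6,
    frozenset({0, 2}): 0.4,
    frozenset({1, 2}): 0.1,
    frozenset({0, 1, 2}): 0.7,
}

phi = shapley_values(toy_val.__getitem__, p=3)
```

The three values sum to `val` of the full feature set minus `val` of the empty set (0.7 here), which is the efficiency property of Shapley values.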
Statistical measures criteria
Precision, recall, F1-Score, and accuracy are important measures used in the domain of machine learning and statistics to assess the effectiveness of classification models [32, 42, 60, 63, 67, 75, 78]. Each metric provides a different perspective on the effectiveness of a model, especially in scenarios involving binary or multi-class classification (Eqs. 14 to 17). Here is a detailed look at each:
(a) Precision: Precision is the ratio of correctly predicted positive observations to all predicted positive outcomes [42, 63, 67]. It can be considered as a measure of a classifier's exactness. High precision is directly correlated with a low false positive rate. It is especially beneficial when the cost of incorrect positive results is significant. The equation for precision is presented in Eq. 14.
\[Precision=\frac{TP}{TP+FP}\tag{14}\]
(b) Recall: Recall is the ratio of correctly predicted positive observations to all observations in the actual positive class. It measures a classifier’s completeness [60, 67, 78]. High recall reflects that the classifier is returning most of the relevant results. It is especially important in scenarios where missing a positive instance is costly. The expression for recall is shown in Eq. 15.
\[Recall=\frac{TP}{TP+FN}\tag{15}\]
(c) F1-Score: The harmonic mean of precision and recall is known as the F1-Score. It takes both false positives and false negatives into consideration [36, 67, 75] and quantifies the balance between precision and recall. The equation for the F1 score is shown in Eq. 16.
\[F1\text{-}Score=2\times \frac{Precision\times Recall}{Precision+Recall}\tag{16}\]
(d) Accuracy: Accuracy is the most intuitive performance metric; it is simply the ratio of correctly predicted observations to the total observations (Elvis et al. 2022; [63, 67]). It is a measure of how many predictions a model got right. However, it can be misleading in cases where there is a significant imbalance between classes. The formula to calculate accuracy is shown in Eq. 17:
\[Accuracy=\frac{TP+TN}{TP+TN+FP+FN}\tag{17}\]
where ‘TP’ represents correctly predicted potential points, ‘TN’ stands for correctly predicted non-potential points, ‘FP’ indicates non-potential points falsely predicted as potential, and ‘FN’ signifies potential points incorrectly predicted as non-potential.
While accuracy is the simplest performance metric, it might not always be the best indicator of a model's performance, especially in imbalanced datasets. The other metrics provide a more nuanced understanding of the models’ strengths and weaknesses in different aspects of their predictive capabilities.
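The four measures can be computed directly from the confusion-matrix counts, as in the sketch below. The counts are hypothetical and are not taken from the study's validation data; they were chosen so that the results are easy to compare with values of the magnitude discussed in this paper.

```python
def classification_metrics(tp, fp, fn, tn):
    """Precision, recall, F1-score, and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
    }


# Hypothetical counts for a 70-point validation set (assumption, not source data).
metrics = classification_metrics(tp=34, fp=3, fn=1, tn=32)
rounded = {name: round(value, 3) for name, value in metrics.items()}
# rounded == {"precision": 0.919, "recall": 0.971, "f1": 0.944, "accuracy": 0.943}
```

With these illustrative counts, a single additional false positive or false negative changes each metric by one to three percentage points, which shows how sensitive the scores are on a validation set of this size.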
Accuracy assessment of the model
Validating model accuracy is a crucial stage in data analysis, and data validation is a key approach to achieving this goal [57, 93]. Among the several ways to check the accuracy of machine learning algorithms, the receiver operating characteristic (ROC) curve is a reliable way to assess how well the models work, especially for binary classification tasks [22, 75]. The curve plots the true positive rate (TPR) against the false positive rate (FPR) at different threshold values. The y-axis shows TPR, also called sensitivity or recall, which measures the number of correctly predicted positive observations relative to the total number of observations in the actual positive class. The x-axis shows FPR, derived as (1 − specificity), which quantifies the proportion of observations mistakenly predicted as positive relative to the total number of actual negative observations [77]. “TN” stands for true negative, “FP” represents false positive, “TP” indicates true positive, and “FN” signifies false negative [13, 17]:
\[TPR=\frac{TP}{TP+FN},\qquad FPR=\frac{FP}{FP+TN}\]
Moreover, the study used the area under the ROC curve (AUC) to quantitatively evaluate the performance of the ML algorithms within the study area [67, 89]. AUC-ROC is a widely adopted measure of model performance. A value of 1 indicates perfect discrimination between positive and negative observations, whereas a value of 0.5 corresponds to random guessing [6]. The values can be classified into several ranges: 0.5–0.6 (poor), 0.6–0.7 (average), 0.7–0.8 (good), 0.8–0.9 (very good), and 0.9–1.0 (excellent).
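A compact way to see what the AUC measures is its rank interpretation: it equals the probability that a randomly chosen positive point receives a higher score than a randomly chosen negative point, with ties counted as one half. The sketch below uses that equivalence on hypothetical labels and scores; how values on a band boundary are assigned is an assumption, since the ranges in the text share their endpoints.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


def auc_rating(auc):
    """Map an AUC to the qualitative bands; lower bounds are treated as inclusive."""
    bands = [(0.9, "excellent"), (0.8, "very good"), (0.7, "good"),
             (0.6, "average"), (0.5, "poor")]
    for lower, label in bands:
        if auc >= lower:
            return label
    return None  # below 0.5 is not covered by the bands


# Hypothetical validation points: 1 = potential, 0 = non-potential.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
auc = roc_auc(labels, scores)
rating = auc_rating(auc)
```

Here eight of the nine positive/negative pairs are ranked correctly, so the AUC is 8/9 and falls in the "very good" band.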
Results
Groundwater potential zones
Groundwater potential zones (GWPZ) in the Bankura district, West Bengal, were evaluated using four machine learning algorithms: Random Forest (RF), Extreme Gradient Boosting (XGBoost), Adaptive Boosting (AdaBoost), and a Voting Ensemble (VE). As seen in Fig. 4, the classified area is divided into five groundwater potential zones according to groundwater potentiality (i.e., ‘very low potentiality’, ‘low potentiality’, ‘moderate potentiality’, ‘high potentiality’, and ‘very high potentiality’).
The Random Forest (RF) model showed (Fig. 5) promising results with a high accuracy rate in predicting groundwater potential zones. The model's ability to handle large datasets and its feature importance mechanism provided valuable insights into key factors influencing groundwater availability. The study region is classified as having very low 2531.33 sq. km (36.73%), low 1791.35 sq. km (26%), moderate 643.64 sq. km (9.34%), high 556.15 sq. km (8.07%), very high 1368.44 sq. km (19.86%).
Extreme Gradient Boosting (XGBoost) performed exceptionally well (Fig. 5), slightly outperforming RF in terms of precision. This model demonstrated high efficiency in managing complex nonlinear relationships between the variables. The study region is classified as having very low 3492.53 sq. km (50.68%), low 920.8 sq. km (13.36%), moderate 398.39 sq. km (5.78%), very high 335.64 sq. km (4.87%), high 1743.54 sq. km (25.30%).
Adaptive Boosting (AdaBoost), while slightly less accurate than RF and XGBoost (Fig. 5), offered valuable insights due to its focus on improving the prediction of minority classes, which is crucial in imbalanced datasets often found in environmental studies. The study region is classified as having very low 2611.63 sq. km (37.9%), low 1791.35 sq. km (26%), moderate 734.1 sq. km (10.65%), high 310.00 sq. km (4.50%), very high 1664.07 sq. km (24.15%).
The Voting Ensemble (VE) model, which combined predictions from RF, XGBoost, and AdaBoost (Fig. 5), showed an improvement in overall accuracy and stability compared to the individual models. This improvement highlighted the benefit of leveraging the strengths of individual models to enhance predictive performance. The study region is classified as having very low 2881.26 sq. km (41.81%), low 1437.8 sq. km (20.87%), moderate 556.27 sq. km (8.07%), high 337.51 sq. km (4.90%), very high 1678.07 sq. km (24.35%). The study concluded that while all models were effective in predicting GWPZ, the VE model's integration of different algorithms provided the most reliable and robust predictions. This approach is suggested for practical applications in groundwater management and planning in the Bankura district.
The arid western areas of the district have zones with “very low” potential for groundwater (Table 4). This region is characterized by an ancient granite gneiss geological formation, which results in a gently rolling landscape that hinders water infiltration. Most of the district's areas with limited groundwater potential are located in the central and western regions. The area is characterized by a dissected plateau and unclassified crystalline geological formations, mostly composed of gneiss. The location has limited potential for the occurrence and movement of groundwater owing to its underdeveloped soil cover and steep slope at high elevation. This condition is exacerbated by the heightened aridity and lower precipitation. Areas with high groundwater potential are concentrated in the elevated interfluvial zones in the eastern parts of the district. This region is covered by a higher, undulating alluvial plain. The location is favourable for water percolation and has a high potential for groundwater owing to the presence of many lineaments, a relatively low elevation and slope, and a thick soil cover. River basins that contain large amounts of sedimentary deposits and thick layers of soil have a "very high" capacity for groundwater, which makes the region suitable for significant groundwater storage. The groundwater table is elevated there because of the particularly gentle slope and topography, the abundant precipitation, and the drainage density.
Collinearity analysis
The correlation matrix in Fig. 6 shows the interdependencies among the explanatory variables used for groundwater potential mapping. The coefficients within the matrix, ranging from −1 to 1, represent the strength and direction of the linear relationship between each pair of variables. Positive coefficients indicate that the two variables increase together, whereas negative coefficients denote an inverse association. A notable positive correlation is seen between geomorphology and drainage density (0.54), signifying that specific geomorphic units are associated with higher drainage density. Conversely, a considerable negative correlation of −0.82 appears between cropping intensity and irrigated areas. Similarly, the correlation of 0.66 between NDVI and evapotranspiration implies that places with denser vegetation may have higher evapotranspiration rates. Furthermore, the correlation of 0.54 between NDVI and clay content reveals a moderate link: soils with higher clay content generally have a higher water-holding capacity and therefore retain more soil moisture for vegetation use. However, such soils also have slower drainage rates and may be prone to waterlogging. The positive correlation of 0.50 between rainfall and cropping intensity indicates that places with higher rainfall may see more intensive cropping.
Such information, retrieved from the correlation matrix, reveals the intricate interplay of the environmental elements in the context of the groundwater potential mapping. The detection of prominent correlations aids in the identification of potential multicollinearity issues, a vital aspect of predictive modelling. The examination of such correlations permits a robust selection of relevant features, hence boosting the robustness and interpretability of the models. It is important to note that there is no universally established hard criterion for what constitutes an acceptable correlation value. The correlation values can vary based on the nature of the dataset and the specific objectives of the investigation. In our investigation of groundwater potential mapping, the maximum observed correlation is 0.73, while the minimum is − 0.82. Given the restricted number of explanatory factors and the absence of a set threshold, we elected to keep all features for the machine learning models. The integration of all features provides a holistic perspective of the factors impacting groundwater potential, enabling the models to exploit the available knowledge for accurate projections.
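As an illustration of how such a matrix is obtained, the sketch below computes pairwise Pearson coefficients for three variables. The sample values are hypothetical and do not come from the study's data; the helper names are illustrative.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def correlation_matrix(columns):
    """Nested dict of pairwise coefficients, keyed by variable name."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical values sampled at five locations.
data = {
    "ndvi": [0.21, 0.35, 0.48, 0.60, 0.72],
    "evapotranspiration": [610, 640, 700, 690, 760],
    "slope": [12.0, 9.5, 6.0, 7.0, 2.5],
}
matrix = correlation_matrix(data)
```

Each diagonal entry is 1, the matrix is symmetric, and the sign of an off-diagonal entry indicates whether the two variables rise together or move in opposite directions.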
Importance of the GWPZ conditioning factors
Feature importance is a fundamental concept in machine learning that helps in determining and measuring the contribution of each input variable, or feature, to a model's prediction (Fig. 7). A comprehensive understanding of the importance of each feature aids in selecting, optimizing, and interpreting the model by revealing which features have the most effects on the model’s output. In the context of groundwater potential mapping, we employed three distinct tree-based classifiers and a multi-model voting classifier, implemented through the scikit-learn Python library.
The feature importance scores of the Random Forest model are calculated from the mean decrease in Gini (MDG) impurity metric, which quantifies the total decrease in node impurity brought about by a feature. Features that lead to more homogeneous classes (low impurity) are considered to be more relevant. The RF scores show that geomorphology holds the highest value (0.446), followed by elevation (0.202) and soil type (0.080), clearly outlining the crucial role of these variables in the groundwater potential mapping process in the district. The feature importance scores determined by the AdaBoost algorithm refer to the weighted contribution of each feature to the total classification accuracy. They are derived by summing the weights assigned to the weak learners, highlighting features that regularly aid in correcting misclassifications. Notably, distance to stream and drainage density are allocated the highest importance (0.20 each), showing their large influence on the model's prediction accuracy. Features such as mNDWI, elevation, lithology, soil, and irrigated area also show notable importance, collectively contributing to the enhanced delineation of locations with diverse groundwater potential. XGBoost determines the importance of a feature in a similar way, by evaluating the number of times a feature is used to split the data across all boosting rounds, weighted by the improvement it brings to the model's prediction. The variable importance scores suggest that geomorphology plays the most significant role (> 0.175), followed by elevation, soil, and lithology with scores ranging from 0.075 to 0.125. Other features, such as rainfall, drainage density, NDVI, clay content, and lineament density, also indicate substantial importance, varying from 0.05 to 0.06.
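The impurity decrease that underlies the Random Forest scores can be sketched for a single split as follows. The class counts are hypothetical; in a full forest these per-split decreases are accumulated per feature over all nodes and trees, which this sketch does not attempt to reproduce.

```python
def gini(counts):
    """Gini impurity of a node given its per-class sample counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def gini_decrease(parent, left, right):
    """Impurity of the parent minus the size-weighted impurity of its children."""
    n = sum(parent)
    weighted_children = (sum(left) / n) * gini(left) + (sum(right) / n) * gini(right)
    return gini(parent) - weighted_children


# Hypothetical node with 50 non-potential and 50 potential points.
parent = [50, 50]
strong = gini_decrease(parent, left=[45, 5], right=[5, 45])    # informative split
weak = gini_decrease(parent, left=[30, 20], right=[20, 30])    # uninformative split
```

The informative split lowers the impurity by 0.32 and the uninformative one by only 0.02, so a feature that repeatedly produces splits of the first kind accumulates a higher importance score.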
The Voting Classifier combines the predictions of RF, AdaBoost, and XGBoost models, and the feature importance scores are averaged to create an ensemble view. The feature importance scores of the Voting Ensemble imply that geomorphology holds the highest relevance (0.223), followed by elevation (0.135) and drainage density (0.100). The ensemble approach gives a balanced analysis of characteristics, capturing their cumulative influence on the groundwater potential mapping process.
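A minimal sketch of this averaging step is given below. The per-model scores are hypothetical, cover only three of the conditioning factors (so they do not sum to one), and are not the values reported for the study's models.

```python
def average_importance(per_model):
    """Average each feature's importance score across the base models."""
    features = list(per_model[0])
    n_models = len(per_model)
    return {f: sum(model[f] for model in per_model) / n_models for f in features}


# Hypothetical importance scores from three base models.
rf = {"geomorphology": 0.45, "elevation": 0.20, "drainage_density": 0.05}
ada = {"geomorphology": 0.05, "elevation": 0.10, "drainage_density": 0.20}
xgb = {"geomorphology": 0.18, "elevation": 0.10, "drainage_density": 0.05}

ensemble = average_importance([rf, ada, xgb])
ranking = sorted(ensemble, key=ensemble.get, reverse=True)
```

Averaging dampens the influence of any one model: a feature ranked highly by a single model but low by the others ends up with a moderate ensemble score.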
Geomorphology consistently appears as a key influencer across the models, highlighting its vital significance in groundwater potential mapping. Elevation and drainage density also feature strongly, signifying their major impact on predictive accuracy. Distance to stream, mNDWI, and soil show notable importance, contributing to the detailed characterization of locations with diverse groundwater potential. This collective understanding of feature importance provides a complete analysis of the complex interrelationships among multiple explanatory variables, facilitating effective decision-making and resource management in the context of groundwater potential assessment.
Model evaluation
Four distinct machine learning algorithms were applied to predict groundwater potential in the district. These algorithms include Random Forest (RF), Adaptive Boosting (AdaBoost), Extreme Gradient Boosting (XGBoost), and Voting Ensemble (VE). Table 5 presents the validation metrics for each of the four machine learning models used in our study. These metrics provide a quantitative assessment of the models' performance in predicting groundwater potentiality.
Random Forest (RF) shows high precision (0.919) and recall (0.971), indicating a strong ability to correctly predict groundwater zones with minimal false positives and negatives. Its F1-score (0.944) and accuracy (0.943) are also high, suggesting overall effectiveness. Adaptive Boosting (AdaBoost) mirrors RF in performance across all metrics (precision: 0.919, recall: 0.971, F1-score: 0.944, accuracy: 0.943), indicating a similar capability in accurately predicting groundwater potential areas. Extreme Gradient Boosting (XGBoost) has slightly higher values in all metrics (precision: 0.944, recall: 0.971, F1-score: 0.958, accuracy: 0.957), suggesting a more refined model performance. Voting Ensemble (VE), like XGBoost, shows enhanced performance (precision: 0.944, recall: 0.971, F1-score: 0.958, accuracy: 0.957), indicating that the ensemble approach effectively combines the strengths of individual models for better predictions.
The receiver operating characteristic (ROC) curve is a significant tool for analysing the classification performance of machine learning models. It provides insights into a model's efficacy in discriminating between positive and negative instances. The area under the ROC curve (AUC) is a statistic that quantifies the overall performance of a model, with a higher AUC value suggesting more discriminatory power. The AUC scores for our machine learning models were as follows: XGBoost achieved the highest AUC of 0.993, followed closely by RF (also 0.993 at three decimal places), Voting Ensemble (0.991), and AdaBoost (0.984). These AUC values represent the models' ability to differentiate between groundwater potential and non-potential locations (Fig. 8).
Discussion
This study focuses on predicting groundwater potential zones in the Bankura district by utilizing Machine Learning (ML) algorithms such as Random Forest (RF), Adaptive Boosting (AdaBoost), Extreme Gradient Boosting (XGBoost), and Voting Ensemble (VE). These algorithms were applied to analyse twenty groundwater conditioning factors, encompassing topographical, hydrological, geological, and anthropogenic aspects crucial for assessing groundwater availability.
Random Forest (RF) was highly effective in handling the nonlinear relationships and interactions among the various conditioning factors. It provided robust predictions by averaging over multiple decision trees, thereby reducing the risk of overfitting [27, 52, 73, 86, 90]. Adaptive Boosting (AdaBoost) enhanced the performance of weak learners and was particularly useful in adjusting the significance of misclassified observations, leading to improved model accuracy, especially in complex terrains [36, 50, 65]. The ability of Extreme Gradient Boosting (XGBoost) to manage large and diverse datasets made it well suited to this study. It excelled in capturing intricate patterns and dependencies among conditioning factors, which is essential in groundwater studies [4, 5, 22, 39, 53]. The Voting Ensemble (VE) approach combined the strengths of the individual models while mitigating their weaknesses. This ensemble technique ensured a more balanced and reliable prediction by leveraging the collective insights from RF, AdaBoost, and XGBoost. We also examined the significance of individual features in the predictions generated by the VE approach, thereby determining their relative importance. The variables representing climate and soil data clearly do not have equal effects or uniform importance in predicting groundwater potential. It is therefore crucial to identify the most significant features, so that domain experts can understand which factors matter more than others. In this study, we used the SHAP (SHapley Additive exPlanations) technique as a post hoc approach to gain insight into prediction outcomes and identify input features that greatly influence model outputs. Features with larger Shapley values have a proportionately stronger influence on the predicted groundwater potential zones, and features with smaller values a weaker one.
To determine the global significance of each variable, we computed its mean absolute Shapley value over the dataset. In Fig. 9, the red bars indicate features that have a positive correlation with groundwater potential and therefore a beneficial impact on the estimated groundwater potential. Based on Fig. 9, the four most important features are geomorphology (mean SHAP value = +0.35), followed by elevation (mean SHAP value = +0.05), drainage density (mean SHAP value = +0.03), and mNDWI (mean SHAP value = +0.02). The importance of geomorphology in estimating groundwater potential zones may be attributed to its capacity to provide essential information about the physical characteristics of the Earth's surface. This information is crucial in determining the flow of water over the land and its interactions with the subsurface.
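The global ranking step can be sketched as below: per-sample Shapley values are converted to absolute values and averaged per feature. The rows of `shap_rows` are hypothetical per-location values, not the study's SHAP output, and the function name is illustrative.

```python
def mean_abs_shap(shap_rows, feature_names):
    """Rank features by the mean absolute Shapley value across samples."""
    n = len(shap_rows)
    scores = {
        name: sum(abs(row[i]) for row in shap_rows) / n
        for i, name in enumerate(feature_names)
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical Shapley values for three locations (one row per location).
feature_names = ["geomorphology", "elevation", "drainage_density"]
shap_rows = [
    [0.40, -0.06, 0.02],
    [-0.30, 0.04, -0.03],
    [0.35, 0.05, 0.04],
]
ranking = mean_abs_shap(shap_rows, feature_names)
```

Taking absolute values before averaging prevents positive and negative contributions at different locations from cancelling out, so a feature that pushes predictions strongly in both directions still ranks as important.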
The integration of these advanced machine learning algorithms with a comprehensive set of conditioning factors offered a nuanced and highly accurate method for assessing groundwater potential zones in the Bankura district. This approach can significantly aid in sustainable groundwater management and planning. The varied conditioning factors provided a holistic view of the groundwater scenario, encompassing natural and anthropogenic influences. The study demonstrates the efficacy of machine learning techniques in environmental and resource management, setting a precedent for future research in similar domains. Future studies could further refine this methodology by incorporating real-time data, climate change projections, and dynamic land-use patterns to enhance the predictive capability and applicability of the models in sustainable groundwater management.
The uncertainty map provides a spatial representation of predictive uncertainty in groundwater potential mapping across the Bankura district (Fig. 10). Regions with high uncertainty, depicted in red, indicate significant variability in model predictions, suggesting considerable disagreement among the models. This variability is likely due to sparse data, complex geological conditions, or other factors that introduce higher uncertainty. In contrast, areas with low uncertainty, shown in green, indicate strong agreement among the models, suggesting more reliable predictions due to better data quality and consistent geological and hydrological conditions. Geographically, higher uncertainty is observed in the western and southern parts of the district, such as near Hirbandh, Khatra, and Raipur, possibly due to rugged terrain and limited data. The eastern regions, including Sonamukhi, Patrasayer, and Indus, exhibit lower uncertainty, likely due to more homogeneous conditions and better data coverage. Central areas around Bankura show moderate to low uncertainty. This map highlights the need for additional data collection in high-uncertainty areas to improve model reliability and support more informed decision-making for groundwater management in low-uncertainty regions.
The research area has been divided into three physiographic divisions. The Western Hilly Terrain is characterized by rugged topography and is underlain by ancient crystalline rocks from the Archean era. The Eastern Plain Land is a flat and expansive area in the eastern region; the land in the Bishnupur, Indus, and Kotulpur blocks is mostly a flat plain, consisting of a large expanse of fertile land suitable for cultivation, along with a smaller area of undulating terrain. The Undulating Marginal Terrain occupies the middle section of the district, where the steep terrain in the west gradually transitions into a plain alluvial land with scattered hillocks and mounds, covering most of the region in Paschim Bardhhaman district and the alluvial plain. The drainage system in Bankura consists of the Damodar and Kangsabati sub-basins. It is drained by many large rivers that primarily flow in a northwest-to-southeast direction throughout the region. From north to south, the major rivers in this region are the Damodar, Sali (a tributary of the Damodar), Gandheswari, Dwarkeswar, Sialbati, and Kangsabati. The predominant soil types in this region are Entisols, which mostly consist of alluvial soil; Alfisols, which consist of older alluvial soil and red soil; and Ultisols, which predominantly consist of lateritic soil.
The main agricultural products cultivated in the district are rice (specifically Aus, Aman, and Boro varieties), wheat, various pulses, oilseeds (such as sunflower), mustard, jute, sugarcane, and other Rabi crops. Seasonal vegetables include potato, tomato, cabbage, cauliflower, and pumpkin. This area has extensive irrigation capacity that remains untapped. The cultivated crops mostly rely on rainfall. The cultivation of Kharif, Rabi, and Boro paddy and vegetables mostly relies on groundwater extraction using Deep Tube Wells (DTW) and Shallow Tube Wells (STW). The irrigation sources in the district comprise 21,475 tanks, 473 RLI, 429 DTW, 6261 DW, and 30,005 STW. Durgapur, Asansol, and Raniganj form a significant industrial belt with a well-established large-scale industrial infrastructure. This region is particularly known for its focus on the growth of small and medium-sized industries. The Mejia thermal power plant (MTPS) operated by DVC is the central hub for industrial growth in the region. The Barjora industrial belt is seeing rapid growth, with the establishment of several sponge iron companies, alloy industries, and plastic industries. Coal is the primary mineral resource that is currently being extracted for commercial purposes in the Barjora and Mejia blocks of the study area.
The Bankura district is characterized by diverse lithological units of varying geological ages. The district is composed of two main geological formations. The first is Archean-age crystalline granite gneiss, which is found in the western and southwestern parts of the district. The second is lower Gondwana-age sedimentary sandstone and shale, which cover the northern and northwestern parts of the district. Both the Archean and Gondwana formations are intersected by dolerite dykes that are equivalent to the Rajmahals. The region has a linear stretch where Pleistocene-aged laterite and older alluvium deposits are exposed, and there is an extensive accumulation of alluvium in the eastern region. The subsurface geology of the western sector of the research region is mostly composed of crystalline rocks. Lithological correlation shows that groundwater is present in the weathered mantle, which varies in thickness from 6 to 15 m, under water-table conditions. Lateritic gravels often overlie the weathered basement rock in various places, creating favourable conditions for rainwater percolation. This sector includes the Bankura-I, Chatna, Gangajalghati, Hirbandh, Indpur, Khatra, Ranibandh, and Saltora blocks. In the middle section, which is covered by laterite and older alluvium, groundwater is present in a fairly thick to thin aquifer under semi-confined to unconfined conditions. The water-bearing formation in this section is characterized by heterogeneity and complicated aquifer geometry. It is suitable for constructing open-dug wells with a depth ranging from 10 to 15 m and a diameter of 3 m. This sector encompasses the whole or some portions of the Bankura II, Mejia, Taldangra, Simlapal, Raipur, and Sarenga blocks. Groundwater is found in a confined state beneath a layer of clay, with a thickness that is typically about 10 m. The Kotulpur and Joypur blocks are located in the eastern alluvial section of the Indus region.
The district's hydrogeological state is influenced by its complex geological composition. In regions where hard crystalline and Gondwana rocks are present, groundwater is found in an unconfined state within the weathered residuum up to a depth of about 15 mbgl. It is also present in semi-confined to confined conditions inside fracture zones at depths ranging from 30 to 60 mbgl.
The district contains groundwater that exists in both unconfined and confined conditions. In the western sector, which is mostly composed of crystalline rocks, groundwater is found in the weathered mantle of various thicknesses. In this section, groundwater from zones with secondary porosity is extracted using bore wells, with a yield ranging from 45 to 150 lpm. The eastern portion of the district is characterized by the presence of alluvium, while the central-southern part contains older alluvium and laterites. Groundwater exploration in the area has revealed that the thickness of the alluvial sediments gradually increases from 36 m at the western margin to 150 m in the easternmost part. The aquifers have a potential depth range of 30–95 mbgl. The wells have a discharge rate that varies from 20 to 124 m3/h, and the drawdown ranges from 6 to 13 m, depending on the shape and size of the aquifers. The depth to the water level in the older alluvium ranges from 6 to 15 mbgl during the pre-monsoon period. The dug wells in the laterites often dry up during the summer season. However, wells that tap both the laterites and the lithomarge layers below retain water even throughout the summer months. Several artesian wells are present beside the banks of the Dwarkeswar, Jaipanda, and Silai rivers. The tube wells in the area have a depth ranging from 30 to 75 m and a diameter of 38 to 50 mm. They have a free flow discharge rate of 23 to 30 lpm along the Dwarkeswar River, as well as on both banks of the Jaipanda River in the Bishnupur and Taldangra blocks. The existing auto-flow tube wells in the area have a depth ranging from 45 to 75 m and a free flow discharge rate ranging from 126 to 252 lpm. The maximum recorded pressure head is 1.10 mbgl. Small-scale irrigation is accomplished with the help of these wells.
Nag et al. [64] and Biswas et al. [16] previously provided a comprehensive analysis of the hydrogeology, groundwater development, and groundwater management in the study area, and several studies have mapped groundwater potential zones across the whole of Bankura District. Mahala [55] used ten factors (geology, geomorphology, pedology, soil texture, lineament density, relief, slope, drainage, aridity, and land use/land cover) to delineate groundwater potential, with relative weights derived from Multi-Influencing Factor (MIF) and Weighted Index Overlay Analysis (WIOA). In that study, the very poor, poor, good, and very good zones cover 438 sq. km (6%), 2689 sq. km (39%), 3177 sq. km (46%), and 586 sq. km (9%) of the district, respectively. Nag et al. [64] used eight factors (geomorphology, soil, geology, drainage density, slope, lineament, land use/land cover, and rainfall), with relative weights derived from MIF analysis. They classified 1248.81 sq. km (18.36%) and 1737.15 sq. km (25.55%) of the study area as ‘very good’ and ‘good’, respectively, about 1988.66 sq. km (29.24%) as ‘moderate’, and 1432.44 sq. km (21.06%) and 392.94 sq. km (5.79%) as ‘poor’ and ‘very poor’, respectively.
Goswami and Ghosal [37] used eleven regulating factors and two Multi-Criteria Decision Making (MCDM) approaches, the Analytical Hierarchy Process (AHP) and MIF [15, 30]. They found that about 1722.72 sq. km (25%) of the district has low to very low potential, whereas 2549.63 sq. km (37%) in the AHP-based map and 2205.09 sq. km (32%) in the MIF-based normalized map show medium potential. The receiver operating characteristic (ROC) curve and the area under the curve (AUC) were then used to check the reliability of the two models and their suitability for the study region. The comparison shows that the MIF method (AUC 0.879) is more reliable and more suitable than AHP (AUC 0.767) for Bankura District. Another study [15, 16, 87] used nine groundwater-controlling parameters (geology, geomorphology, slope, land use/land cover (LULC), rainfall, drainage density, soil texture, topographical wetness index (TWI), and lineament density) to delineate groundwater potential zones with AHP, a knowledge-driven statistical technique. Its GWPZ map places 920.62 sq. km (13.36%), 2329.12 sq. km (33.80%), 1450.53 sq. km (21.05%), 1310.65 sq. km (19.02%), and 879.96 sq. km (12.77%) of Bankura District under very good, good, moderate, poor, and very poor conditions, respectively. The result was evaluated with the ROC and AUC technique using well-yield data, and the resulting AUC (0.757) supports the reliability of that work.
Compared with these earlier studies, all the machine learning algorithms in the present study achieve higher accuracy. In the Voting Ensemble (VE) model, the study region is classified as very low (2881.26 sq. km, 41.81%), low (1437.8 sq. km, 20.87%), moderate (556.27 sq. km, 8.07%), high (337.51 sq. km, 4.90%), and very high (1678.07 sq. km, 24.35%). A higher AUC value indicates greater discriminatory power. In the present study, XGBoost achieved the highest AUC (0.993), followed closely by RF (0.993 at the reported precision), Voting Ensemble (0.991), and AdaBoost (0.984).
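The AUC values compared above can be read as the probability that a model scores a randomly chosen groundwater-present site higher than a randomly chosen groundwater-absent site. The sketch below illustrates that rank-based definition in plain Python. The function name `roc_auc`, the labels, and both score lists are hypothetical and are not taken from the study's data or code.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical validation sites: 1 = groundwater present, 0 = absent.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
model_a = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]  # every present site outranks every absent site
model_b = [0.9, 0.4, 0.7, 0.3, 0.5, 0.6, 0.2, 0.1]  # some present sites are ranked too low

auc_a = roc_auc(labels, model_a)
auc_b = roc_auc(labels, model_b)
print(f"model_a AUC = {auc_a:.3f}, model_b AUC = {auc_b:.3f}")
```

The pairwise loop is quadratic in the number of sites, which is acceptable for a validation set of a few hundred points. Library implementations typically use a sort-based equivalent.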
There is significant scope for developing groundwater for agricultural, domestic, and industrial use through various abstraction structures, provided the optimal command area of each structure is taken into account. At the same time, a highly efficient water management strategy is suggested to organize and oversee the region's resources. Rainwater harvesting structures are recommended in all blocks to improve the sustainability of groundwater in the study region.
Conclusion
The research indicates that combining Remote Sensing (RS) and Geographic Information System (GIS) techniques is an effective method for locating potential groundwater zones; this allows for the delineation of advantageous locations for groundwater withdrawal. To determine groundwater potential zones at the district level, the technique relies on integrating significant groundwater conditioning factors such as elevation, slope, profile curvature, TWI, drainage density, distance to stream, rainfall, NDVI, mNDWI, clay content, evapotranspiration, water table depth, population, cropping intensity, irrigated area, LULC, soil, lineament density, lithology, and geomorphology.
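Several of the conditioning factors listed above (NDVI, mNDWI, and TWI) are derived indices rather than direct measurements. As a rough illustration only, the sketch below uses their commonly cited definitions: NDVI = (NIR - Red) / (NIR + Red), mNDWI = (Green - SWIR) / (Green + SWIR), and TWI = ln(a / tan β). The function names, the band inputs, and the `min_slope_deg` clamp are assumptions, since this section does not state which sensors or exact formulations were used.

```python
import math


def normalized_difference(a, b):
    """Return (a - b) / (a + b), or 0.0 when both inputs are zero."""
    total = a + b
    return 0.0 if total == 0 else (a - b) / total


def ndvi(nir, red):
    """Normalized Difference Vegetation Index from NIR and red reflectance."""
    return normalized_difference(nir, red)


def mndwi(green, swir):
    """Modified Normalized Difference Water Index from green and SWIR reflectance."""
    return normalized_difference(green, swir)


def twi(specific_catchment_area, slope_deg, min_slope_deg=0.1):
    """Topographic Wetness Index, ln(a / tan(beta)).

    The slope is clamped to min_slope_deg (an assumed value) so that
    flat cells do not cause a division by zero.
    """
    beta = math.radians(max(slope_deg, min_slope_deg))
    return math.log(specific_catchment_area / math.tan(beta))


# Hypothetical single-pixel values, for illustration only.
print(round(ndvi(0.5, 0.1), 3))    # dense vegetation gives a clearly positive NDVI
print(round(mndwi(0.1, 0.3), 3))   # dry land gives a negative mNDWI
print(round(twi(100.0, 45.0), 3))  # steep cell with moderate upslope area
```

In practice these functions would be applied cell by cell to raster bands and a digital elevation model. The scalar form is used here only to keep the example self-contained.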
The research also indicates that the region can be classified into five groundwater potential classes, ranging from very low to very high. Very low zones are poorly suited to groundwater prospecting, whereas high to very high zones are the most suitable.
To address the limited transparency of our black box model's predictions, we used post hoc explanation tools such as SHAP. However, we encountered specific mathematical difficulties when using Shapley values to determine feature importance, especially with respect to causal reasoning. Moreover, Shapley values do not automatically conform to human logic, which raises doubts about their usefulness as a tool for providing natural explanations. Using surrogate tools to explain black box models, instead of building models with intrinsic interpretability, may not be consistent with rigorous scientific norms. The future direction of this study is therefore to create more sophisticated models that combine high accuracy with inherent interpretability. Nevertheless, this work provides a basis for further investigation and improvement of prediction models to support well-informed decisions in water resource planning and management.
This research highlights the efficiency of ensemble approaches in mitigating the limitations of individual algorithms, thus enhancing the accuracy and reliability of groundwater potential mapping. This in turn supports informed decision-making by policymakers and stakeholders when implementing targeted interventions to optimize water allocation, mitigate drought impacts, and enhance agricultural resilience. Further study could consider incorporating additional variables such as lithological properties (e.g., permeability, porosity) and structures (e.g., faults, fractures) that affect groundwater flow.
Hydrological variables such as groundwater recharge rates and the existence of surface water bodies (e.g., lakes, rivers) provide insights into groundwater availability. Climate parameters including temperature extremes, seasonal precipitation variability, and indices (e.g., SPI, SPEI) illustrate groundwater sensitivity to climatic circumstances. Land surface factors such as temperature and soil moisture content offer insights into surface conditions affecting groundwater dynamics. Anthropogenic variables such as proximity to pumping wells, agricultural activities, and urbanization influence groundwater availability. Remote sensing data like thermal infrared images and radar/LiDAR provide insights into subsurface structures controlling groundwater behaviour. Integrating these variables strengthens models and deepens our understanding of groundwater dynamics.
Additionally, considerations of socio-economic factors, environmental sustainability, and community engagement are also paramount for the successful implementation of groundwater management strategies in agricultural landscapes.
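The concern raised in this Conclusion about Shapley-value attributions can be made concrete with a toy calculation. The sketch below computes exact Shapley values by averaging each factor's marginal contribution over all orderings. The scoring function `toy_value`, its weights, and the choice of three factors are entirely hypothetical and are not the study's model or its SHAP output.

```python
from itertools import permutations


def shapley_values(features, value):
    """Exact Shapley values: mean marginal contribution over all orderings."""
    totals = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        coalition = frozenset()
        for f in order:
            with_f = coalition | {f}
            totals[f] += value(with_f) - value(coalition)
            coalition = with_f
    return {f: t / len(orderings) for f, t in totals.items()}


def toy_value(coalition):
    """Hypothetical potential score given the set of known factors."""
    score = 0.0
    if "rainfall" in coalition:
        score += 0.2
    if "lineament_density" in coalition:
        score += 0.1
    # Interaction term: only contributes when both factors are present.
    if {"rainfall", "lithology"} <= coalition:
        score += 0.3
    return score


factors = ["rainfall", "lithology", "lineament_density"]
phi = shapley_values(factors, toy_value)
for name, contribution in phi.items():
    print(f"{name}: {contribution:.3f}")
```

In this toy case the 0.3 interaction term is split evenly between rainfall and lithology, even though lithology contributes nothing on its own. The attributions sum to the total score, but they do not say which factor is causally responsible, which is one way to read the limitation described above.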
Data availability
Data available on request from the authors.
Code availability
Made available on request from the authors.
References
Ahmed N, Hoque MA, Pradhan B, Arabameri A (2021) Spatio-temporal assessment of groundwater potential zone in the drought-prone area of Bangladesh using GIS-Based bivariate models. Nat Resour Res 30(5):3315–3337. https://doi.org/10.1007/s11053-021-09870-0
Al-Abadi AM, Alsamaani JJ (2020) Spatial analysis of groundwater flowing artesian condition using machine learning techniques. Groundw Sustain Dev 11:100418. https://doi.org/10.1016/j.gsd.2020.100418
Al-Abadi AM, Fryar AE, Rasheed AA, Pradhan B (2021) Assessment of groundwater potential in terms of the availability and quality of the resource: a case study from Iraq. Environ Earth Sci. https://doi.org/10.1007/s12665-021-09725-0
AlAyyash S, Al-Fugara A, Shatnawi R, Al-Shabeeb AR, Al-Adamat R, Al-Amoush H (2023) Combination of metaheuristic optimization algorithms and machine learning methods for groundwater potential mapping. Sustainability 15(3):2499. https://doi.org/10.3390/su15032499
Al-Fugara A, Ahmadlou M, Shatnawi R, AlAyyash S, Al-Adamat R, Al-Shabeeb AA, Soni S (2020) Novel hybrid models combining meta-heuristic algorithms with support vector regression (SVR) for groundwater potential mapping. Geocarto Int 37(9):2627–2646. https://doi.org/10.1080/10106049.2020.1831622
Al-Kindi KM, Janizadeh S (2022) Machine Learning and Hyperparameters algorithms for identifying groundwater AFLAJ potential mapping in Semi-Arid ecosystems using LIDAR, Sentinel-2, GIS data, and analysis. Remote Sensing 14(21):5425. https://doi.org/10.3390/rs14215425
Al-Ozeer AZ, Al-Abadi AM, Hussain TA, Fryar AE, Pradhan B, Alamri A, Maulud KNA (2021) Modeling of groundwater potential using cloud computing platform: a case study from Nineveh plain, Northern Iraq. Water 13(23):3330. https://doi.org/10.3390/w13233330
Alrawi I, Chen J, Othman AA (2022) Groundwater potential zone mapping: Integration of Multi-Criteria Decision Analysis (MCDA) and GIS techniques for the Al-Qalamoun region in Syria. ISPRS Int J Geo Inf 11(12):603. https://doi.org/10.3390/ijgi11120603
Arabameri A, Pal SC, Rezaie F, Nalivan OA, Chowdhuri I, Saha A, Lee S, Moayedi H (2021) Modeling groundwater potential using novel GIS-based machine-learning ensemble techniques. J Hydrol Reg Stud 36:100848. https://doi.org/10.1016/j.ejrh.2021.100848
Arabameri A, Rezaei K, Cerdà A, Lombardo L, Rodrigo-Comino J (2019) GIS-based groundwater potential mapping in Shahroud plain, Iran. a comparison among statistical (bivariate and multivariate), data mining and MCDM approaches. Sci Total Environ 658:160–177. https://doi.org/10.1016/j.scitotenv.2018.12.115
Arabameri A, Santosh M, Moayedi H, Tiefenbacher JP, Pal SC, Nalivan OA, Costache R, Ahmed N, Hoque MA, Chakrabortty R, Cerdà A (2022) Application of the novel state-of-the-art soft computing techniques for groundwater potential assessment. Arab J Geosci. https://doi.org/10.1007/s12517-021-09005-y
Aslam B, Maqsoom A, Hassan U, Maqsoom S, Alaloul WS, Musarat MA, Khan S (2023) Comparison between machine learning and bivariate statistical models for groundwater recharge zones. Res Sq 1:1
Bai Z, Liu Q, Liu Y (2022) Groundwater potential mapping in Hubei region of China using machine learning, ensemble learning, deep learning and AutoML methods. Nat Resour Res 31(5):2549–2569. https://doi.org/10.1007/s11053-022-10100-4
Benjmel K, Amraoui F, Aydda A, Tahiri A, Yousif M, Pradhan B, Abdelrahman K, Fnais MS, Abioui M (2022) A multidisciplinary approach for groundwater potential mapping in a fractured Semi-Arid terrain (Kerdous Inlier, Western Anti-Atlas, Morocco). Water 14(10):1553. https://doi.org/10.3390/w14101553
Biswas S, Mukhopadhyay BP, Bera A (2020) Delineating groundwater potential zones of agriculture dominated landscapes using GIS based AHP techniques: a case study from Uttar Dinajpur district, West Bengal. Environ Earth Sci. https://doi.org/10.1007/s12665-020-09053-9
Biswas T, Pal SC, Ruidas D, Islam ARMT, Saha A, Costache R, Shit M (2023) Modelling of groundwater potential zone in hard rock-dominated drought-prone region of eastern India using integrated geospatial approach. Environ Earth Sci. https://doi.org/10.1007/s12665-023-10768-8
Braham M, Boufekane A, Bourenane H, Amara BN, Bensalem R, Oubaiche EH, Bouhadad Y (2022) Identification of groundwater potential zones using remote sensing, GIS, machine learning and electrical resistivity tomography techniques in Guelma basin, Northeastern Algeria. Geocarto Int 37(26):12042–12072. https://doi.org/10.1080/10106049.2022.2063408
Breiman L (2001) Random forests. Mach Learn 45:5–32. https://doi.org/10.1023/A:1010933404324
Chen T, Guestrin C (2016) XGBoost: a scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13–17, 2016. ACM. pp. 785–794. arXiv:1603.02754. https://doi.org/10.1145/2939672.2939785
Chen W, Li Y, Tsangaratos P, Shahabi H, Ilia I, Xue W, Bian H (2020) Groundwater spring potential mapping using artificial intelligence approach based on kernel logistic regression, random forest, and alternating decision tree models. Appl Sci 10(2):425. https://doi.org/10.3390/app10020425
Chen Y, Chen W, Pal SC, Saha A, Chowdhuri I, Adeli B, Janizadeh S, Dineva A, Wang X, Mosavi A (2021) Evaluation efficiency of hybrid deep learning algorithms with neural network decision tree and boosting methods for predicting groundwater potential. Geocarto Int 37(19):5564–5584. https://doi.org/10.1080/10106049.2021.1920635
Choudhary S, Pingale SM, Khare D (2022) Delineation of groundwater potential zones of upper Godavari sub-basin of India using bi-variate, MCDM and advanced machine learning algorithms. Geocarto Int 37(27):15063–15093. https://doi.org/10.1080/10106049.2022.2093992
Cutler A, Cutler DR, Stevens JR (2012) Random forests. Springer eBooks, Berlin, pp 157–175. https://doi.org/10.1007/978-1-4419-9326-7_5
Dao PU, Heuzard AG, Le TXH, Zhao J, Yin R, Shang C, Fan C (2024) The impacts of climate change on groundwater quality: a review. Sci Total Environ 912:169241. https://doi.org/10.1016/j.scitotenv.2023.169241
Dar FA, Ramanathan A, Mir RA, Pir RA (2024) Groundwater scenario under climate change and anthropogenic stress in Ladakh Himalaya India. J Water Climate Change. https://doi.org/10.2166/wcc.2024.307
Das RJ, Saha S (2022) Spatial mapping of groundwater potentiality applying ensemble of computational intelligence and machine learning approaches. Groundw Sustain Dev 18:100778. https://doi.org/10.1016/j.gsd.2022.100778
Dey B, Abir KAM, Ahmed R, Salam MA, Redowan M, Miah MD, Iqbal M (2023) Monitoring groundwater potential dynamics of north-eastern Bengal Basin in Bangladesh using AHP-Machine learning approaches. Ecol Indicators 154:110886. https://doi.org/10.1016/j.ecolind.2023.110886
Díaz-Alcaide S, Martínez-Santos P (2019) Review: Advances in groundwater potential mapping. Hydrogeol J 27(7):2307–2324. https://doi.org/10.1007/s10040-019-02001-3
Dilekoğlu MF, Aslan V (2021) Determination of groundwater potential distribution of ceylanpinar plain (Turkey) in Upper Mesopotamia by using geographical information techniques and Fuzzy-AHP with MCDM. Water Sci Technol Water Supply 22(1):372–390. https://doi.org/10.2166/ws.2021.268
Duan H, Deng Z, Deng F, Wang D (2016) Assessment of groundwater potential based on multicriteria decision making model and decision tree algorithms. Math Probl Eng 2016:1–11. https://doi.org/10.1155/2016/2064575
Elbeih SF (2015) An overview of integrated remote sensing and GIS for groundwater mapping in Egypt. Ain Shams Eng J 6(1):1–15. https://doi.org/10.1016/j.asej.2014.08.008
Elmahdy SI, Ali T, Mohamed MM (2021) Regional mapping of groundwater potential in Ar Rub al Khali, Arabian Peninsula using the classification and regression trees model. Remote Sensing 13(12):2300. https://doi.org/10.3390/rs13122300
Freund Y, Schapire RE (1997) A decision-theoretic generalization of on-line learning and an application to boosting. J Comput Syst Sci 55(1):119–139. https://doi.org/10.1006/jcss.1997.1504
Ghosh A, Adhikary PP, Bera B, Bhunia GS, Shit PK (2022) Assessment of groundwater potential zone using MCDA and AHP techniques: case study from a tropical river basin of India. Appl Water Sci. https://doi.org/10.1007/s13201-021-01548-5
Gómez-Escalonilla V, Marie-Louise V, Destro E, Isseini M, Origgi G, Daïra D, Martínez-Santos P, Holecz F (2021) Delineation of groundwater potential zones by means of ensemble tree supervised classification methods in the Eastern Lake Chad basin. Geocarto Int 37(25):8924–8951. https://doi.org/10.1080/10106049.2021.2007298
Gómez-Escalonilla V, Martínez-Santos P, Martín-Loeches M (2022) Preprocessing approaches in machine-learning-based groundwater potential mapping: an application to the Koulikoro and Bamako regions, Mali. Hydrol Earth Syst Sci 26(2):221–243. https://doi.org/10.5194/hess-26-221-2022
Goswami T, Ghosal S (2022) Understanding the suitability of two MCDM techniques in mapping the groundwater potential zones of semi-arid Bankura District in eastern India. Groundw Sustain Dev 17:100727. https://doi.org/10.1016/j.gsd.2022.100727
Grönwall J, Danert K (2020) Regarding groundwater and drinking water access through a human rights lens: self-supply as a norm. Water 12(2):419. https://doi.org/10.3390/w12020419
Guo X, Gui X, Xiong H, Hu X, Li Y, Cui H, Qiu Y, Ma C (2023) Critical role of climate factors for groundwater potential mapping in arid regions: Insights from random forest, XGBoost, and LightGBM algorithms. J Hydrol 621:129599. https://doi.org/10.1016/j.jhydrol.2023.129599
Hakim WL, Nur AS, Rezaie F, Panahi M, Lee C, Lee S (2022) Convolutional neural network and long short-term memory algorithms for groundwater potential mapping in Anseong, South Korea. J Hydrol Reg Stud 39:100990. https://doi.org/10.1016/j.ejrh.2022.100990
Halder S, Roy MB, Roy PK (2021) Tropical plateau basin prioritisation for sustainable groundwater management using classical algorithms. Arab J Geosci. https://doi.org/10.1007/s12517-021-08496-z
Ijlil S, Essahlaoui A, Mohajane M, Essahlaoui N, Mili EM, Van Rompaey A (2022) Machine learning algorithms for modeling and mapping of groundwater pollution risk: a study to reach water security and sustainable development (SDG) goals in a mediterranean aquifer system. Remote Sensing 14(10):2379. https://doi.org/10.3390/rs14102379
IPCC (2012) Summary for policymakers. In: Qin D, Dokken DJ, Ebi KL, Mastrandrea MD, Mach KJ, Plattner GK, Allen SK, Tignor M, Midgley PM (eds) Managing the risks of extreme events and disasters to advance climate change adaptation, a special report of working groups i and ii of the intergovernmental panel on climate change. Cambridge University Press, Cambridge, pp 3–21
IPCC (2019) IPCC WGII Sixth Assessment Report, Chapter 4. https://www.ipcc.ch/report/ar6/wg2/downloads/report/IPCC_AR6_WGII_FOD_Chapter04.pdf. Accessed 01 Dec 2023
IPCC (2021) Climate Change 2021: The Physical Science Basis. In: Masson-Delmotte V, Zhai P, Pirani A, Connors SL, Péan C, Berger S, Caud N, Chen Y, Goldfarb L, Gomis MI, Huang M, Leitzell K, Lonnoy E, Matthews JBR, Maycock TK, Waterfield T, Yelekçi O, Yu R, Zhou B (eds) Contribution of Working Group I to the Sixth Assessment Report of the Intergovernmental Panel on Climate Change. Cambridge University Press, Cambridge, UK and New York, NY, USA. https://doi.org/10.1017/9781009157896. Accessed 01 Dec 2023
IPCC (2023) Summary for Policymakers. In: Lee H, Romero J (eds) Climate Change 2023: Synthesis Report. Contribution of Working Groups I, II and III to the Sixth Assessment Report of the Intergovernmental Panel on Climate Change. IPCC, Geneva, Switzerland. https://doi.org/10.59327/IPCC/AR6-9789291691647.001
Islam F, Tariq A, Guluzade R, Zhao N, Shah SU, Ullah M, Hussain ML, Ahmad MN, Alasmari A, Alzuaibr FM, Askary AE, Aslam M (2023) Comparative analysis of GIS and RS based models for delineation of groundwater potential zone mapping. Geomatics Nat Hazards Risk. https://doi.org/10.1080/19475705.2023.2216852
Jha MK, Chowdhury A, Chowdary VM, Peiffer S (2006) Groundwater management and development by integrated remote sensing and geographic information systems: prospects and constraints. Water Res Manage 21(2):427–467. https://doi.org/10.1007/s11269-006-9024-4
Kayal P, Majumder S, Chowdhury IR (2022) Modeling the spatial pattern of potential groundwater zone using MCDM-AHP and geospatial technique in sub-tropical plain region: a case study of Islampur sub-division, West Bengal, India. Sustain Water Res Manag. https://doi.org/10.1007/s40899-022-00759-1
Khan ZA, Jhamnani B (2023) Identification of groundwater potential zones of Idukki district using remote sensing and GIS-based machine-learning approach. Water Sci Technol Water Supply 23(6):2426–2446. https://doi.org/10.2166/ws.2023.134
Khosravi K, Khozani ZS, Cooper JR (2021) Predicting stable gravel-bed river hydraulic geometry: a test of novel, advanced, hybrid data mining algorithms. Environ Model Softw 144:105165. https://doi.org/10.1016/j.envsoft.2021.105165
Kumar M, Singh P, Singh P (2023) Machine learning and GIS-RS-based algorithms for mapping the groundwater potentiality in the Bundelkhand region, India. Ecol Inf 74:101980. https://doi.org/10.1016/j.ecoinf.2023.101980
Kundu M, Zafor A, Maiti R (2023) Assessing the nature of potential groundwater zones through machine learning (ML) algorithm in tropical plateau region, West Bengal, India. Acta Geophys. https://doi.org/10.1007/s11600-023-01042-3
Liu R, Li G, Wei L, Xu Y, Gou X, Luo S, Yang X (2022) Spatial prediction of groundwater potentiality using machine learning methods with grey wolf and sparrow search algorithms. J Hydrol 610:127977. https://doi.org/10.1016/j.jhydrol.2022.127977
Mahala A (2021) Delineating the status of groundwater in a plateau fringe region using multi-influencing factor (MIF) and GIS: a study of Bankura district, West Bengal, India. Springer Hydrogeology, pp 215–237. https://doi.org/10.1007/978-3-030-62397-5_11
Mahamat AO, Bounab A (2023) The use of explanatory statistics for mapping groundwater potential zones in a semiarid area: case of the Waddai province, eastern Chad. J Afr Earth Sc 205:105012. https://doi.org/10.1016/j.jafrearsci.2023.105012
Mallick J, Talukdar S, Ahmed M (2022) Combining high resolution input and stacking ensemble machine learning algorithms for developing robust groundwater potentiality models in Bisha watershed, Saudi Arabia. Appl Water Sci. https://doi.org/10.1007/s13201-022-01599-2
Mallick J, Talukdar S, Alsubih M, Almesfer MK, Shahfahad HHT, Rahman A (2021) Integration of statistical models and ensemble machine learning algorithms (MLAs) for developing the novel hybrid groundwater potentiality models: a case study of semi-arid watershed in Saudi Arabia. Geocarto Int 37(22):6442–6473. https://doi.org/10.1080/10106049.2021.1939439
Mandal T, Saha S, Das J, Sarkar A (2021) Groundwater depletion susceptibility zonation using TOPSIS model in Bhagirathi river basin, India. Modeling Earth Syst Environ 8(2):1711–1731. https://doi.org/10.1007/s40808-021-01176-7
Maskooni EK, Naghibi SA, Hashemi H, Berndtsson R (2020) Application of advanced machine learning algorithms to assess groundwater potential using remote sensing-derived data. Remote Sensing 12(17):2742. https://doi.org/10.3390/rs12172742
Masroor M, Sajjad H, Kumar P, Saha TK, Rahaman MH, Choudhari P, Kulimushi LC, Pal S, Saito O (2023) Novel ensemble machine learning modeling approach for groundwater potential mapping in Parbhani district of Maharashtra, India. Water 15(3):419. https://doi.org/10.3390/w15030419
Mitra R, Roy D (2022) Delineation of groundwater potential zones through the integration of remote sensing, geographic information system, and multi-criteria decision-making technique in the sub-Himalayan foothills region, India. Int J Energ Water Res 7(4):581–601. https://doi.org/10.1007/s42108-022-00181-5
Morgan H, Madani A, Hussien HM, Nassar T (2023) Using an ensemble machine learning model to delineate groundwater potential zones in desert fringes of East Esna-Idfu area, Nile valley, Upper Egypt. Geosci Lett. https://doi.org/10.1186/s40562-023-00261-2
Nag SK, Chowdhury P, Das S, Mukherjee A (2021) Deciphering prospective groundwater zones in Bankura district, West Bengal: a study using GIS platform and MIF techniques. Int J Energ Water Res 5(3):323–341. https://doi.org/10.1007/s42108-020-00110-4
Nguyen PT, Ha DH, Jaafari A, Nguyen HD, Van Phong T, Al-Ansari N, Prakash I, Van Le H, Pham BT (2020) Groundwater potential mapping combining artificial neural network and real ADABoost ensemble technique: the DakNong province case-study, Vietnam. Int J Environ Res Public Health 17(7):2473. https://doi.org/10.3390/ijerph17072473
Osiakwan GM, Gibrilla A, Kabo-Bah AT, Appiah-Adjei EK, Anornu GK (2022) Delineation of groundwater potential zones in the Central region of Ghana using GIS and fuzzy analytic hierarchy process. Modeling Earth Syst Environ 8(4):5305–5326. https://doi.org/10.1007/s40808-022-01380-z
Ouali L, Kabiri L, Namous M, Hssaisoune M, Abdelrahman K, Fnais MS, Kabiri H, Hafyani ME, Oubaassine H, Arioua A, Bouchaou L (2023) Spatial prediction of groundwater withdrawal potential using shallow, hybrid, and deep learning algorithms in the Toudgha Oasis, southeast Morocco. Sustainability 15(5):3874. https://doi.org/10.3390/su15053874
Paria B, Pani A, Mishra P, Behera B (2021) Irrigation-based agricultural intensification and future groundwater potentiality: experience of Indian states. SN Appl Sci. https://doi.org/10.1007/s42452-021-04417-7
Park S, Kim J (2021) The predictive capability of a novel ensemble tree-based algorithm for assessing groundwater potential. Sustainability 13(5):2459. https://doi.org/10.3390/su13052459
Paul S, Roy D (2023) Geospatial modeling and analysis of groundwater stress-prone areas using GIS-based TOPSIS, VIKOR, and EDAS techniques in Murshidabad district, India. Modeling Earth Syst Environ. https://doi.org/10.1007/s40808-022-01589-y
Pham QB, Kumar M, Di Nunno F, Elbeltagi A, Granata F, Islam ARMT, Talukdar S, Nguyen XC, Ahmed AN, Anh DT (2022) Groundwater level prediction using machine learning algorithms in a drought-prone area. Neural Comput Appl 34(13):10751–10773. https://doi.org/10.1007/s00521-022-07009-7
Pham QB, Pandey M, Mishra VN, Singh KK, Ahmadi K, Janizadeh S, Yến TTH, Linh NTT, Nguyen D (2023) Assessment of groundwater potential modeling using support vector machine optimization based on Bayesian multi-objective hyperparameter algorithm. Appl Soft Comput 132:109848. https://doi.org/10.1016/j.asoc.2022.109848
Prasad P, Loveson VJ, Kotha M, Yadav R (2020) Application of machine learning techniques in groundwater potential mapping along the west coast of India. Gisci Remote Sensing 57(6):735–752. https://doi.org/10.1080/15481603.2020.1794104
Rahaman MH, Sajjad H, Roshani Masroor M, Bhuyan N, Rehman S (2022) Delineating groundwater potential zones using geospatial techniques and fuzzy analytical hierarchy process (FAHP) ensemble in the data-scarce region: evidence from the lower Thoubal river watershed of Manipur, India. Arab J Geosci. https://doi.org/10.1007/s12517-022-09946-y
Rasool U, Yin X, Xu Z, Rasool MA, Senapathi V, Hussain M, Siddique J, Trabucco JC (2022) Mapping of groundwater productivity potential with machine learning algorithms: a case study in the provincial capital of Baluchistan, Pakistan. Chemosphere 303:135265. https://doi.org/10.1016/j.chemosphere.2022.135265
Ravichandran R, Ayyavoo R, Rajangam L, Madasamy N, Murugaiyan B, Sumathi S (2022) Identification of groundwater potential zone using analytical hierarchical process (AHP) and multi-criteria decision analysis (MCDA) for Bhavani river basin, Tamil Nadu, southern India. Groundwater Sustain Dev 18:100806. https://doi.org/10.1016/j.gsd.2022.100806
Saha R, Baranval NK, Das I, Kumaranchat VK, Reddy KS (2022) Application of machine learning and geospatial techniques for groundwater potential mapping. J Ind Soc Remote Sensing 50(10):1995–2010. https://doi.org/10.1007/s12524-022-01582-z
Sahour H, Sultan M, Abdellatif B, Emil MK, Abotalib AZ, Abdelmohsen K, Vazifedan M, Mohammad AT, Hassan SM, Metwalli MR, Bastawesy ME (2022) Identification of shallow groundwater in arid lands using multi-sensor remote sensing data and machine learning algorithms. J Hydrol 614:128509. https://doi.org/10.1016/j.jhydrol.2022.128509
Sameen MI, Pradhan B, Lee S (2018) Self-learning random forests model for mapping groundwater yield in data-scarce areas. Nat Res Res 28(3):757–775. https://doi.org/10.1007/s11053-018-9416-1
Seifu TK, Ayenew T, Woldesenbet TA, Alemayehu T (2022) Identification of groundwater potential sites in the drought-prone area using geospatial techniques at Fafen-Jerer sub-basin, Ethiopia. Geol Ecol Landscapes. https://doi.org/10.1080/24749508.2022.2141993
Seifu TK, Eshetu KD, Woldesenbet TA, Alemayehu T, Ayenew T (2023) Application of advanced machine learning algorithms and geospatial techniques for groundwater potential zone mapping in Gambela plain, Ethiopia. Hydrol Res. https://doi.org/10.2166/nh.2023.083
Shandu ID, Atif I (2023) An integration of geospatial modelling and machine learning techniques for mapping groundwater potential zones in Nelson Mandela Bay, South Africa. Water 15(19):3447. https://doi.org/10.3390/w15193447
Shapley LS (1953) A value for n-person games. Princeton University Press
Srivastava AK, Safaei N, Khaki S, Lopez G, Zend W, Ewert F, Gaiser T, Rahimi J (2022) Winter wheat yield prediction using convolutional neural networks from environmental and phenological data. Sci Rep. https://doi.org/10.1038/s41598-022-06249-w
Tamiru H, Wagari M, Tadese B (2022) An integrated artificial intelligence and GIS spatial analyst tools for delineation of groundwater potential zones in complex terrain: Fincha Catchment, Abay Basi, Ethiopia. Air Soil Water Res 15:117862212110459. https://doi.org/10.1177/11786221211045972
Tegegne AM (2022) Applications of convolutional neural network for classification of land cover and groundwater potentiality zones. J Eng 2022:1–8. https://doi.org/10.1155/2022/6372089
Thakuriah G (2023) Geographic information system and analytical hierarchical process approach for groundwater potential zone of lower Kulsi basin, India. Sustain Water Res Manage. https://doi.org/10.1007/s40899-023-00870-x
Thành NT, Thunyawatcharakul P, Ngu NH, Chotpantarat S (2022) Global review of groundwater potential models in the last decade: Parameters, model techniques, and validation. J Hydrol 614:128501. https://doi.org/10.1016/j.jhydrol.2022.128501
Trabelsi F, Ali SBH, Lee S (2022) Comparison of novel hybrid and benchmark machine learning algorithms to predict groundwater potentiality: case of a drought-prone region of Medjerda Basin, northern Tunisia. Remote Sensing 15(1):152. https://doi.org/10.3390/rs15010152
Vafadar S, Rahimzadegan M, Asadi R (2023) Evaluating the performance of machine learning methods and Geographic Information System (GIS) in identifying groundwater potential zones in Tehran-Karaj plain, Iran. J Hydrol 624:129952. https://doi.org/10.1016/j.jhydrol.2023.129952
Wang D, Qian J, Ma L, Zhao W, Gao D, Hou X, Ma H (2022) Characterizing groundwater distribution potential using GIS-based machine learning model in Chihe River basin, China. Environ Earth Sci. https://doi.org/10.1007/s12665-022-10444-3
Wang Z, Wang J, Han J (2022) Spatial prediction of groundwater potential and driving factor analysis based on deep learning and geographical detector in an arid endorheic basin. Ecol Ind 142:109256. https://doi.org/10.1016/j.ecolind.2022.109256
Yariyan P, Avand M, Omidvar E, Pham QB, Linh NTT, Tiefenbacher JP (2021) Optimization of statistical and machine learning hybrid models for groundwater potential mapping. Geocarto Int 37(13):3877–3911. https://doi.org/10.1080/10106049.2020.1870164
Yousefi S, Sadhasivam N, Pourghasemi HR, Nazarlou HG, Golkar F, Tavangar S, Santosh M (2020) Groundwater spring potential assessment using new ensemble data mining techniques. Measurement 157:107652. https://doi.org/10.1016/j.measurement.2020.107652
Funding
This publication was supported by the Deanship of Scientific Research at King Faisal University, Saudi Arabia (Grant 5794). Amit Kumar Srivastava is funded by the German Federal Ministry of Education and Research (BMBF) in the framework of the funding measure ‘Soil as a Sustainable Resource for the Bioeconomy – BonaRes’, project BonaRes (Module A): BonaRes Center for Soil Research, subproject ‘Sustainable Subsoil Management – Soil3’ (Grant 031B0151A). We also acknowledge funding by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany’s Excellence Strategy – EXC 2070 – 390732324 and COINS (Grant 01LL2204C).
Author information
Contributions
All authors contributed to the manuscript. All authors read and approved the final manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Competing interests
The authors declare no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Halder, K., Srivastava, A.K., Ghosh, A. et al. Application of bagging and boosting ensemble machine learning techniques for groundwater potential mapping in a drought-prone agriculture region of eastern India. Environ Sci Eur 36, 155 (2024). https://doi.org/10.1186/s12302-024-00981-y
DOI: https://doi.org/10.1186/s12302-024-00981-y