Pesticides, cancer, and oxidative stress: an application of machine learning to NHANES data

Background The large-scale application of pyrethroids and organophosphorus pesticides has brought great benefits for pest control. However, the rise in cancer incidence in recent years has raised public concern about the health risks of pesticides. Hence, we used data from the National Health and Nutrition Examination Survey (NHANES) to assess the association between pesticide exposure and the risk of several cancers, along with the combined impact of oxidative stress. Six cancers and six common pesticides were included in the analysis. The levels of eight oxidative stress markers and two inflammatory markers were used for stratified analysis. Multiple logistic regression was applied to estimate odds ratios and 95% confidence intervals. Machine learning prediction models were established to evaluate the importance of different exposure factors.
Results According to the data analyzed, each pesticide increased the risk of three to four of the six cancers on average. Iron, aspartate aminotransferase (AST), and gamma-glutamyl transferase levels correlated positively with cancer risk in most cases of pesticide exposure. Apart from demographic factors, AST, iron, and 3-phenoxybenzoic acid contributed strongly to the random forest model, which was consistent with our expectations. The receiver operating characteristic curve showed that the prediction model had adequate discriminative ability (area under the curve = 0.742).
Conclusion Our results indicated that exposure to specific pesticides increased the risk of cancer, which may be mediated by various oxidative stress mechanisms. Additionally, some biochemical indicators could potentially be used in screening for cancer prevention.


Background
Cancer has long been considered a serious health hazard. The accumulation of different gene alterations in the cancer genome, a hallmark of all malignancies, drives the initiation, promotion, and progression of cancer [1]. Surgery, radiotherapy, chemotherapy, hormone treatment, immunotherapy, and targeted therapy are the few available therapeutic options. However, they are not sufficient to address the current problem [2]. Hence, identifying a reliable approach to cancer prevention has long been a pressing issue. Understanding the pathogenic factors of cancer is essential for its successful prevention.
Pesticides are beneficial to public life as they increase food production yields and decrease foodborne and vector-borne infections [3]. Nevertheless, owing to their chemical properties, long-term exposure to pesticides affects human health, for example by increasing the risk of cancer [4]. Over 650 of the 800 pesticides used globally affect the human endocrine system [5]. Most pesticides can act on the human endocrine system because of their long half-lives and lipophilic characteristics. Therefore, pesticides are also considered endocrine-disrupting chemicals (EDCs). They occur as volatile organic compounds (VOCs) and semi-volatile organic compounds (SVOCs) in the gas phase [6][7][8], and also adhere to particulate matter in the solid phase [9]. Exposure to pesticide-derived EDCs depends on many factors, including personal lifestyle, agricultural and industrial applications, living area, and geographical location. Studies have shown that EDCs pose a significant risk for endocrine system-related diseases, including hormone-related cancers [10]. Oxidative stress can be simply defined as an imbalance between the production of free radicals, which leads to cellular lipid peroxidation, and the body's antioxidant defense [11]. The toxicity of many exogenous drugs is related to the production of free radicals, which are also involved in the pathophysiology of diseases. Studies in humans and animals support pesticide-induced oxidative stress as a mechanism of pesticide toxicity in vivo [12,13].
With the increase in cancer incidence, the long-term health hazards caused by environmental pollution have attracted growing concern. Some cohort and epidemiological studies have shown that exposure to pesticides, particularly organochlorine and organophosphorus pesticides, accelerates cancer development [14,15]. However, previous research has mostly been limited to specific regions and time periods. Our study is the first to use the National Health and Nutrition Examination Survey (NHANES) together with machine learning methods and nationally representative samples to assess the risk of pesticide exposure for cancers. It provides important evidence that pesticide exposure is a risk factor for the development of various cancers, which may be mediated by oxidative stress.

Study design
The National Health and Nutrition Examination Survey (NHANES) aims to assess the nutritional and physical health of Americans. The National Center for Health Statistics (NCHS) Ethics Review Board approved the NHANES data collection and investigation protocol, and all participants provided informed consent. The NHANES survey design, questionnaires, and analytical procedures are described in detail on the CDC website.

Study population
We excluded participants younger than 20 years and those with unclear or unknown socioeconomic factors (education level, race, family income level, and marital status). Participants with incomplete, unknown, or unclear cancer, pesticide exposure, or laboratory test data were also excluded. Data were extracted from the "demographics", "medical condition", "standard biochemistry profile", "complete blood count with 5-Part differential-whole blood", and "Pyrethroids, Herbicides, & OP Metabolites-Urine" files. We used complex sample weights to make the estimates applicable to the US population.
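The role of the sample weights can be illustrated with a minimal sketch. It assumes each participant carries a single weight equal to the number of people he or she represents; the function name and the example values are hypothetical and are not taken from NHANES, and the variance estimation required by the complex survey design is not shown.

```python
def weighted_prevalence(has_condition, weights):
    """Weighted proportion of participants with a condition.

    has_condition: list of booleans, one per participant.
    weights: list of sample weights (people represented by each participant).
    """
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    case_weight = sum(w for flag, w in zip(has_condition, weights) if flag)
    return case_weight / total_weight


# Hypothetical participants and weights, for illustration only.
flags = [True, False, False, True]
weights = [10_000, 30_000, 40_000, 20_000]
prevalence = weighted_prevalence(flags, weights)  # 30,000 / 100,000
```

In this toy example the unweighted proportion is 2/4, whereas the weighted proportion is 0.3, which shows why unweighted counts alone cannot be read as population estimates.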

Definition
We used the answers to MCQ220 (ever told you had cancer or malignancy) and MCQ230 (what kind of cancer) in the "medical condition" questionnaire as the criteria for determining whether a participant had a given cancer. NHANES collects data on 30 types of cancer. Seven cancer types whose cases exceeded 0.5% of the total sample were selected for analysis (breast, colon, cervical, prostate, melanoma, non-melanoma skin cancer, and other types of skin cancer). The population with a given type of cancer was defined as the "case group", and the population without this type of cancer was defined as the "control group". The demographic characteristics of the case and control groups differed significantly.
Pesticide exposure was judged from the concentration of each compound in the participant's urine, as recorded in the Pyrethroids, Herbicides, & OP Metabolites-Urine data file. If the concentration reached or exceeded the detection limit, the participant was considered exposed to the pesticide; if it was below the detection limit, the participant was considered not exposed. In this module, the main detectable metabolites of organophosphorus pesticides and chlorpyrifos, namely 2,4-D (μg/L), 4-fluoro-3-phenoxybenzoic acid (μg/L), 3-phenoxybenzoic acid (μg/L), oxypyrimidine (μg/L), paranitrophenol (μg/L), and dichlorovnl-dimeth prop carboacid (μg/L), were selected as exposure factors.
The NHANES demographic questionnaire divides race/ethnicity into five groups: non-Hispanic Blacks, non-Hispanic Whites, other Hispanics, Mexican Americans, and other races (including multiple races). Socioeconomic factors, including marital status, the ratio of family income to poverty (PIR), and education level, were defined. Education level consisted of three categories: high school or below, college, and college graduate or above. Household income was divided into three levels according to the PIR, with 130% and 338% as the boundaries.
In addition, the three sex-specific cancers (cervical, breast, and prostate) were extracted and analysed separately.
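The exposure rule described above can be sketched as follows. This is an illustration only: the function name, the analyte labels, and the numeric values are hypothetical, and the detection limits used in practice are those reported by NHANES for each analyte and survey cycle.

```python
def is_exposed(concentration_ug_per_l, detection_limit_ug_per_l):
    """Return True when the urinary concentration reaches or exceeds the detection limit."""
    return concentration_ug_per_l >= detection_limit_ug_per_l


# Hypothetical (concentration, detection limit) pairs in ug/L, not NHANES data.
sample = {
    "analyte_a": (0.45, 0.10),  # above the limit: exposed
    "analyte_b": (0.07, 0.10),  # below the limit: not exposed
    "analyte_c": (0.10, 0.10),  # exactly at the limit: exposed
}
exposure = {name: is_exposed(conc, lod) for name, (conc, lod) in sample.items()}
```

Dichotomizing at the detection limit gives one binary exposure factor per analyte, which is the form used in the regression models.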
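As a small illustration of the income grouping, the sketch below maps a PIR value to one of three levels using the 130% and 338% boundaries (1.30 and 3.38 when the PIR is expressed as a ratio). The level names and the assignment of values lying exactly on a boundary are assumptions made for this example; the text does not specify them.

```python
def income_level(pir):
    """Classify a poverty income ratio into three levels.

    Boundaries of 1.30 and 3.38 correspond to 130% and 338% of the poverty
    threshold. Treating boundary values as the higher level is an assumption.
    """
    if pir < 0:
        raise ValueError("PIR cannot be negative")
    if pir < 1.30:
        return "low"
    if pir < 3.38:
        return "middle"
    return "high"


# Hypothetical PIR values for illustration.
levels = [income_level(p) for p in (0.8, 2.0, 4.5)]
```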

Covariates
In this study, the selected covariates included age, sex (not applicable to prostate, cervical, and breast cancer), race, family income, education level, and marital status. Sex, race, education level, family income, and marital status were treated as categorical variables, and age was treated as a continuous variable. For more details on pesticide exposure, cancer, and covariates, visit http://www.cdc.gov/nchs/nhanes/.

Machine learning and model interpretation
We compared eight machine learning algorithms (Boosting Tree, Decision Tree, Logistic Regression, Multilayer Perceptron (mlp), Naive Bayes, K Nearest Neighbor, Random Forest, and Support Vector Machine with a radial basis function kernel (svm_rbf)) based on the area under the curve (AUC) and accuracy. The best-performing algorithm was used to build the final prediction model. To verify the prediction performance, we randomly split the data into training and test sets and used five-fold cross-validation to optimize hyperparameters. The ROC curve was used to evaluate the prediction performance. We repeated the whole procedure, from data partitioning to model construction, 500 times with different random seeds to evaluate prediction performance and stability. We used SHAP values as a method for explaining black-box machine learning models in this study. SHAP provides both local and global interpretability and has a solid theoretical foundation compared with other methods. All analyses were conducted using R software version 4.2.1 (the R Foundation for Statistical Computing, USA). Two-sided P < 0.05 was considered statistically significant.
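The five-fold partitioning with a fixed random seed can be sketched as follows. The analysis itself was carried out in R; this Python sketch only illustrates the idea, and the function name, fold assignment scheme, and sample size are assumptions made for the example.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Split sample indices into k (training, validation) pairs.

    Indices are shuffled with a seeded generator so that a run is
    reproducible, and every sample appears in exactly one validation fold.
    """
    if k < 2 or k > n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i, validation in enumerate(folds):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((training, validation))
    return splits


# Hypothetical data set of 20 samples, for illustration only.
splits = k_fold_indices(n_samples=20, k=5, seed=42)
```

Repeating the procedure with different values of `seed` corresponds to the repeated runs used to assess the stability of the prediction performance.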

Statistical analysis
We conducted statistical analysis of the data using IBM SPSS Statistics 24. Table 1 presents descriptive statistics on the traits of individuals with or without any type of cancer. Chi-square tests and t-tests were used to compare categorical and continuous variables, respectively. Employing complex sample weights enabled us to address selection, oversampling, and nonresponse biases while estimating demographic variables and the overall prevalence of the cancers. Age, race, education level, marital status, and PIR were adjusted for when evaluating trends in cancer prevalence and pesticide use. Covariates were adjusted for using a logistic regression model. We estimated the odds ratios (ORs), along with P-values and 95% confidence intervals (CIs). Results were deemed significant if the two-tailed P-value was less than 0.05. The pROC and random forest tools in R were used to assess the importance of the various exposure factors and the precision of the prediction models.
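For readers less familiar with odds ratios, the sketch below computes a crude (unadjusted, unweighted) OR with a Wald 95% CI from a 2x2 table. It is a simplification rather than the survey-weighted multivariable model used in this study, and the counts are hypothetical.

```python
import math


def odds_ratio_ci(exposed_cases, exposed_controls,
                  unexposed_cases, unexposed_controls, z=1.96):
    """Crude odds ratio and Wald confidence interval from a 2x2 table.

    The standard error is computed on the log-odds scale; z = 1.96 gives
    an approximate 95% interval. All four cells must be positive.
    """
    cells = (exposed_cases, exposed_controls, unexposed_cases, unexposed_controls)
    if min(cells) <= 0:
        raise ValueError("all cells must be positive")
    odds_ratio = (exposed_cases * unexposed_controls) / (exposed_controls * unexposed_cases)
    se_log_or = math.sqrt(sum(1.0 / c for c in cells))
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts for illustration only.
or_value, ci_low, ci_high = odds_ratio_ci(30, 70, 20, 80)
```

An OR above 1 with a CI that excludes 1 indicates a positive association between exposure and outcome; an OR below 1 indicates an inverse association.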

Characteristics of study participants and correlation analysis
A total of 4310 eligible participants were screened for this study (Fig. 1). After weighting, each participant represented approximately 40,415 individuals. Chi-square analysis was used to evaluate the correlation between demographic factors as covariates, exposure factors (pesticide exposure), and outcome variables (cancer risk). After confirming the significance of the demographic factors, we analysed the correlation between the exposure factors and outcome variables. The results are presented in Tables 1 and 2.
Notably, patients with cancer were generally older than healthy individuals. There were statistically significant differences in demographic factors such as race, education, marital status, and family income. Prostate cancer was more common among Mexican Americans (3.5%) and non-Hispanic Whites (13.8%) than other cancers. Most patients with skin cancer were non-Hispanic Whites (99.4%). Cervical cancer was more common in people of other races (9.5%) than in those of the aforementioned ethnic groups.
Compared to other cancers, prostate cancer (78.6%) and other types of skin cancer (76.6%) were far more common among married or cohabiting individuals, whereas cervical cancer (57.2%) was more common among single individuals.
According to our data, there was a significant correlation between pesticide exposure and cancer. The results showed that the vast majority of patients with cancer had a history of exposure to 2,4-D, 3-phenoxybenzoic acid, and paranitrophenol, which was particularly marked in prostate cancer (91.7% for 2,4-D; 90.5% for 3-phenoxybenzoic acid; and 98.6% for paranitrophenol).
In summary, we computed the P-values and examined the relationship between demographic factors and cancer. According to our findings, there was a strong relationship among pesticide exposure, demographic factors, and cancer (P < 0.001). Covariates, including age, education level, family income, and marital status, were significantly correlated with cancer risk, and exposure factors (whether exposed to a certain pesticide) and outcome variables (whether suffering from a certain cancer) were also significantly correlated.

Multivariate logistic regression model for predicting cancer risk
A multiple logistic regression model was established to assess the effects of pesticide exposure on cancer risk. To increase the accuracy of the model, we incorporated the covariates stepwise and present the results as two multiple logistic regression models: model 1 without covariate adjustment and model 2 with covariate adjustment (Table 3).
After controlling for covariates, each pesticide increased the risk of three to four types of cancer on average. Pesticides generally increased the risk of cancer, and prostate and cervical cancers were particularly susceptible. However, exposure to 4-fluoro-3-phenoxybenzoic acid and dichlorovnl-dimeth prop carboacid was not associated with an increased risk of prostate cancer after adjustment for covariates (model 1: OR = 1.195, 95% CI 1.192-1.199; model 2: OR = 0.911, 95% CI 0.908-0.914; P < 0.01) or of cervical cancer (model 1: OR = 0.813, 95% CI 0.811-0.815; P < 0.01).

The impact of oxidative stress on predictive models
To investigate the potential mechanisms, we stratified cancer risk according to the concentration of oxidative stress indicators (Table 4). Based on oxidative stress indicator levels, we divided the association between cancer and pesticide exposure into subgroups. We found that many regression relationships showed trend changes in the subgroup analysis, with a significant positive correlation (such as ALT, AST, and GGT) or negative correlation (such as iron and uric acid) between the indicator concentration and cancer risk.
Elevated AST levels were accompanied by an increased risk of cancer, showing a significant positive correlation (for example, in patients with colon, prostate, breast, and cervical cancers exposed to 2,4-D, and in patients with melanoma, prostate cancer, and breast cancer exposed to oxypyrimidine). Iron, in contrast, exhibited a significant negative trend in most regression relationships (for example, in patients with colon cancer and melanoma exposed to 4-fluoro-3-phenoxybenzoic acid, and in patients with colon cancer, melanoma, and other types of skin cancer exposed to dichlorovnl-dimeth prop carboacid).
Notably, we did not observe an increase in the risk of melanoma with exposure to any of the chemical substances, as no OR value exceeded 1 in any regression model related to melanoma (Additional file 1: Table S1).

Machine learning reveals the importance of different variables
Machine learning is an advanced form of pattern recognition that enables machines to make judgments by analysing large amounts of data. By comparing the predictive performance of different machine learning models, we found that the random forest model showed the highest accuracy (0.707) and AUC (0.720) (Fig. 2A, B), and its ROC curve also lay above those of the other models (Fig. 2C), showing the best predictive performance. Therefore, the random forest model was chosen as the final model for evaluating oxidative stress indicators.
After selecting the random forest model, we conducted hyperparameter optimization and analysed the importance of variables (Fig. 3A). Iron, creatinine, ALT, AST, albumin, and GGT levels were the most important variables in terms of mean decrease in accuracy. However, in terms of mean decrease in Gini, paranitrophenol, 3-phenoxybenzoic acid, and 2,4-D demonstrated high relevance. Iron, creatinine, ALT, AST, and GGT contributed significantly to the prediction model. The importance ranking of the six pesticides was low. To visually explain the selected variables, we used SHAP to illustrate how these variables affect cancer risk in the model. Creatinine had the largest positive effect, while iron had the largest negative effect on cancer risk (Fig. 3C). Receiver operating characteristic analysis showed that our prediction model had an AUC of 0.742, which indicates that the model can reasonably evaluate the contribution of different variables to cancer risk (Fig. 3B).
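Because the AUC is sometimes read as an accuracy, a short sketch of its meaning may help: the AUC equals the probability that a randomly chosen case receives a higher predicted score than a randomly chosen control, with ties counted as one half. The function below uses this pairwise definition; it is an illustration with hypothetical labels and scores, not the pROC implementation used in the study.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve from the pairwise (Mann-Whitney) definition.

    labels: 1 for cases, 0 for controls.
    scores: predicted risk, higher meaning more likely to be a case.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    concordant = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                concordant += 1.0
            elif p == n:
                concordant += 0.5
    return concordant / (len(positives) * len(negatives))


# Hypothetical predictions for four participants.
auc = roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])  # 3 of 4 pairs are concordant
```

Under this reading, an AUC of 0.742 means that the model ranks a case above a control in about 74% of case-control pairs, which is a different quantity from the proportion of correctly classified participants.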

Discussion
The International Programme on Chemical Safety defines endocrine disruptors (EDCs) as exogenous substances with the potential to alter numerous endocrine and hormonal processes in the human body, causing a wide range of abnormalities and affecting hormone synthesis, metabolism, and excretion during homeostasis and development. EDCs interfere with a series of endocrine functions through enzyme- and receptor-mediated mechanisms or epigenetic effects, thereby inducing various reproductive and metabolic problems [7,16]. In recent years, more and more human health problems have been reported to be related to EDCs. Additionally, owing to their unique physical and chemical properties, EDCs can be widely present in the air or attached to particle surfaces. With their widespread use, insecticides bring potential pathogenic risks [9]. The United States used 857 million pounds of conventional insecticides in 2007 (EPA, 2020). With 7-9 million pounds applied, chlorpyrifos is the most frequently employed pesticide in the agricultural market. 2,4-D is the second most commonly used herbicide in the agricultural sector and the most commonly used herbicide in the commercial, home, and garden sectors. Malathion is the most popular insecticide in the commercial sector and the second most popular insecticide in the home and garden sectors. Pyrethroids are the most popular insecticides in the home and garden markets. In 2007, 22% of all pesticides used worldwide were applied in the United States [17][18][19][20].

Table 3 Association between pesticide exposure and cancer
Epidemiological studies on the effects of pesticides have mainly been conducted among farmers, highlighting the association between pesticide exposure and an increased risk of cancers at specific sites, neurological diseases (Parkinson's disease, Alzheimer's disease, and amyotrophic lateral sclerosis), and reproductive disorders (spontaneous abortions, stillbirth, and impaired sperm quality). Currently, most existing studies have associated pesticides with endocrine dysfunction and immune dysfunction. The endocrine-disrupting properties of pesticides (such as lipophilicity and persistence) are often considered the main causes of cancer induction. However, few studies have considered the role of oxidative stress as the main toxicological mechanism of pesticides in this process. Besides, previous studies have been mostly limited to specific regions and times, with few large-scale, long-term, and cross-sectional studies [10,21].
In this study, we built a machine learning prediction model in R using eight years of nationally representative data from the NHANES database. During this process, we found that most cancers, such as colon, skin, cervical, and prostate cancers, were at increased risk under the influence of three or more pesticides, whereas a few cancers, such as breast cancer and melanoma, were affected by only one or two pesticides, indicating that different mechanisms underlie this process.
In addition, the risk of cancer increased as the concentration of different oxidative stress markers increased, which points to a mechanism by which pesticides damage cells through oxidative stress and promote cancer initiation. Previous studies have shown that organophosphorus pesticides can induce inflammation [10,22], affect lymphocyte function [23], alter the interaction between microorganisms [24] and the immune system [23], increase oxidative stress [25], disrupt estrogen pathways [26,27], damage brain function [28], and increase the risk of cell carcinogenesis. Notably, only a small fraction of the pesticides that increased the risk of cancer also raised the inflammatory markers, suggesting that the main mode of damage caused by pesticides to the human body is oxidative stress.
Certain physical and chemical characteristics of organophosphorus pesticides, such as their high fat solubility, transmembrane properties, and extended half-life, allow them to persist in the body for a long time, thus posing a hidden risk of carcinogenesis [29]. We can therefore speculate that the metabolites of organophosphorus pesticides may act on the endoplasmic reticulum [30], interfere with endocrine hormones, or act as cancer promoters or inducers of cytochrome P450 enzymes [31], which may lead to the formation of genotoxic DNA adducts and, in turn, to reproductive system-related cancers, including prostate and cervical cancers.
Molecular studies have also supported the idea that pesticides increase cancer risk. Endosulfan can upregulate β-actin [32]. The expression of catenin and interleukin-6 promotes colitis [33]. The insecticide chlorpyrifos can also promote the development of colon cancer by activating the EGFR/ERK1/2 growth signalling pathway [34]. Dennis et al. showed that the combined effect of paraquat exposure and light damage significantly increased the probability of agricultural workers suffering from skin cancer and melanoma. These findings demonstrate that pesticides can cause mutant cells to clone and multiply, offering them the opportunity to further alter the genome through high proliferative activity or the emergence of additional carcinogenic sites.
Since GGT is both an oxidant of AST and an antioxidant of iron, it has a positive relationship with some cancer risks and a negative relationship with others. Meanwhile, iron is an antioxidant, and as its concentration increases, the risk of cancer decreases. For some cancers, however, higher iron concentrations were associated with higher cancer risk. This phenomenon may be explained by ferroptosis, a recently described form of oxidatively regulated cell death. Ferroptosis is associated with severe impairment of mitochondrial morphology, bioenergetics, and metabolism, and high iron concentrations increase disease risk [35,36]. Moreover, we observed that 2,4-D increased AST levels in four cancers (colon, prostate, breast, and cervical) and oxypyrimidine increased AST levels in three (non-melanoma skin cancer, prostate cancer, and breast cancer). Pesticides may therefore promote oxidative stress by damaging the liver, thereby inducing cancer.
Breast cancer, non-melanoma skin cancer, and other types of skin cancer exhibited an increase in the number of lymphocytes under exposure to 3-phenoxybenzoic acid, which increased the incidence of chronic inflammation. Several studies have demonstrated that inflammatory reactions cause oxidative stress and lower the antioxidant capacity of cells. Fatty acids and proteins in the cell membrane react with an abundance of free radicals, irreversibly impairing their function. Free radicals can cause DNA damage and mutations that may lead to cancer and other age-related disorders. However, there is no evidence that other pesticides have comparable impacts over time.
The random forest prediction model led to a conclusion consistent with the findings presented above. The model's reliance on exposure factors, including pesticides, led us to hypothesise that some pesticides may not directly cause cancer but instead cause oxidative stress by harming organs, which in turn may indirectly cause cancer. As is well known, changes in ALT, AST, and GGT levels often coincide with liver injury, and these indicators contributed strongly to our machine learning prediction model. This suggests that the role of pesticides as endocrine disruptors may be exerted through liver damage [37,38].
Although no consistent trend was found in the regression analysis, the high contribution of lymphocytes and neutrophils to the Gini coefficient in the random forest prediction model suggested that chronic inflammation may be involved in the process of pesticide-induced cancer.
Our study also has some limitations. First, NHANES is a cross-sectional survey and does not provide longitudinal follow-up information. Owing to the retrospective nature of this study, future research should focus more on the longitudinal effects of pesticide exposure, providing more convincing clinical evidence through long-term follow-up data. Cellular toxicology and animal experiments are also necessary. Second, some of the data came from self-report questionnaires and may be subject to recall and self-report biases. Finally, other potential confounding factors, such as lifestyle factors, genetic factors, and other environmental exposures, need to be analyzed in future studies.

Conclusion
In summary, our analysis, based on nationally representative surveys, suggests that pesticides may induce oxidative stress by damaging organs (such as the liver), thereby increasing the risk of cancer. Different cancers showed distinct sensitivities to pesticides. Iron, creatinine, ALT, AST, albumin, and GGT were highly sensitive to changes in cancer risk under pesticide exposure, which makes them potential markers for cancer prediction.

Fig. 1 Flowchart of the screening process for the selection of eligible participants in NHANES 2007-2014

Model 1: multiple logistic regression prediction model without adjustment for covariates. Model 2: multiple logistic regression prediction model after adjusting for covariates (including age, race, education level, marital status, and PIR). Bold indicates that the OR values showed a clear and stable positive or negative correlation with the levels of oxidative stress markers

Fig. 2 Comparison of different machine learning models. A Comparison of prediction performance of machine learning models based on accuracy and AUC values. boost_tree: Boosting Tree; decision_tree: Decision Tree; logistic_reg: Logistic Regression; mlp: Multilayer Perceptron; naive_Bayes: Naive Bayes; nearest_neighbor: K Nearest Neighbor; rand_forest: Random Forest; svm_rbf: Support Vector Machine with a radial basis function kernel. B ROC curves of eight machine learning models

Table 1 Correlation analysis between demographic factors and outcome variables. Figures for mean age are expressed as means; other figures are expressed as percentages

Table 2 Correlation analysis between exposure factors and outcome variables. Figures for mean age are expressed as means; other figures are expressed as percentages

Table 4 Association of pesticide exposure with cancer based on subgroups of oxidative stress marker level