Engineering Design Projects (ES 100), the capstone course at the Harvard John A. Paulson School of Engineering and Applied Sciences, challenges seniors to engineer a creative solution to a real-world problem.
Statistical Modeling of Heavy Metal Contamination in United States Well Water
Jonas LaPier, S.B. ’21, environmental science and engineering
Please give a brief summary of your project.
Millions of Americans rely on private drinking water wells, which are not regulated and are seldom tested for contaminants. Heavy metal contamination is often imperceptible, even at unsafe concentrations, so it can remain in drinking water supplies unnoticed. My project aims to create a machine learning model (a random forest) of arsenic, cadmium, lead, and manganese contamination in U.S. groundwater. The model is trained on five decades of water sampling data from throughout the U.S. Environmental conditions such as temperature, precipitation, groundwater pH, soil chemistry, and geology, along with human influence from industrial emissions, are then used to predict which areas of the U.S. are most susceptible to contamination from each metal.
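For readers unfamiliar with the idea, here is a minimal sketch of how an ensemble of randomized trees can turn environmental predictors into a contamination likelihood. It is not the project's code (which was written in R) and it is not a full random forest: it is a toy bagged ensemble of one-split trees, each fit on a bootstrap sample with a random subset of features. The data are synthetic, and the rule that "low pH means contaminated" is an invented assumption used only to give the toy model something to learn. All function and variable names are hypothetical.

```python
import random
from collections import Counter


def majority(labels):
    """Most common label in a list."""
    return Counter(labels).most_common(1)[0][0]


def fit_stump(rows, labels, feature_ids):
    """Return (feature, threshold, left_label, right_label) with the fewest training errors."""
    fallback = majority(labels)
    # Fallback stump: always predicts the majority label.
    best = (len(labels) + 1, feature_ids[0], float("inf"), fallback, fallback)
    for f in feature_ids:
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            l_lab, r_lab = majority(left), majority(right)
            errors = sum(y != l_lab for y in left) + sum(y != r_lab for y in right)
            if errors < best[0]:
                best = (errors, f, t, l_lab, r_lab)
    return best[1:]


def fit_forest(rows, labels, n_trees=41, features_per_tree=2, seed=0):
    """Fit each one-split tree on a bootstrap sample and a random feature subset."""
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        feats = rng.sample(range(n_features), features_per_tree)
        forest.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feats))
    return forest


def predict_probability(forest, row):
    """Fraction of trees that vote 'contaminated' (label 1) for this location."""
    votes = sum((l if row[f] <= t else r) for f, t, l, r in forest)
    return votes / len(forest)


# Synthetic wells: (groundwater pH, annual precipitation in mm, mean temperature in C).
data_rng = random.Random(42)
rows = [
    (data_rng.uniform(5.0, 9.0), data_rng.uniform(200.0, 1500.0), data_rng.uniform(0.0, 25.0))
    for _ in range(100)
]
# Invented labeling rule (an assumption, not a finding): acidic groundwater is "contaminated".
labels = [1 if ph < 6.5 else 0 for ph, _, _ in rows]

forest = fit_forest(rows, labels)
p_acidic = predict_probability(forest, (5.2, 800.0, 12.0))
p_alkaline = predict_probability(forest, (8.5, 800.0, 12.0))
print(f"acidic site: {p_acidic:.2f}, alkaline site: {p_alkaline:.2f}")
```

Evaluating `predict_probability` over a grid of locations is, in spirit, how a prediction map is produced. A production model would use deeper trees and a tested library implementation.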
How will this project help solve the problem you identified?
Prediction maps from the model indicate regions of the U.S. where contamination is more likely. A public awareness campaign could inform well owners in these areas that they should have their wells tested if they have not already. Should the federal government implement a national testing plan, or allocate resources to states for testing or education, the maps point to the states where contamination is most likely.
What were the biggest challenges of this project?
Improving the models' predictions is more of an art than anything else, and it takes a lot of creativity to guess what will make a difference. For example, I experimented quite a bit with different calculations to incorporate anthropogenic emissions, and they never seemed to improve the models. Another day, I added new groundwater data from a familiar source and performed a simple interpolation, and those variables are now some of the best predictors in the models. It is very hit or miss.
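The interview does not say which interpolation was used. As one plausible illustration, the sketch below uses inverse-distance weighting, a common way to estimate a groundwater value at an unsampled location from nearby measurements. The function name, the coordinates, and the pH values are all hypothetical.

```python
import math


def idw_interpolate(samples, x, y, power=2.0):
    """Inverse-distance-weighted estimate at (x, y).

    samples is a list of (x, y, value) measurements. Closer samples get
    larger weights, proportional to 1 / distance**power.
    """
    if not samples:
        raise ValueError("need at least one sample")
    weighted_sum = 0.0
    weight_total = 0.0
    for sx, sy, value in samples:
        distance = math.hypot(x - sx, y - sy)
        if distance == 0.0:
            return value  # exactly on a sample point: use its value directly
        weight = 1.0 / distance ** power
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total


# Two hypothetical wells with measured groundwater pH.
wells = [(0.0, 0.0, 6.0), (2.0, 0.0, 8.0)]
midpoint_ph = idw_interpolate(wells, 1.0, 0.0)    # equidistant from both wells
near_first_ph = idw_interpolate(wells, 0.5, 0.0)  # pulled toward the first well
print(midpoint_ph, near_first_ph)
```

The interpolated surface can then be sampled at each map location and fed to the model as one more predictor.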
What did you learn through this experience?
I am now extremely familiar with this modeling approach, and I can read similar literature and point out flaws in published approaches. More generally, I have become very familiar with the R programming language and with data manipulation, skills I will be able to use in any number of applications down the road.