Going into this project, my interest is in understanding the relationships between streamflow and other variables in hydrologic systems (precipitation, temperature shifts, snowmelt, basin characteristics, etc.) and how we can make better streamflow predictions based on our understanding of these interactions.
A recent implementation of the Random Forests (RF) model has allowed me to take a close look at these relationships.
I built a basic RF model that currently takes 4 predictors:
1. Mean daily streamflow from the previous day (a common predictor variable) at
the same gage
2. Daily total precipitation from the previous day (a common predictor variable),
drawn from the closest GHCN station*
3. The month of the predicted streamflow (to somewhat account for seasonality)
4. The sum of precipitation over the n previous days (I vary n and observe the
performance of the model)
The output is the predicted mean daily streamflow.
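As a rough illustration of how the four predictors above could be assembled from daily records, here is a minimal sketch in plain Python. The function name `build_rows`, the dictionary keys, and the sample values are all hypothetical, and the sketch assumes that the "n previous days" window ends on the day before the predicted day; the actual model may define it differently.

```python
from datetime import date, timedelta


def build_rows(dates, flow, precip, n_days):
    """Pair each day's streamflow (target) with its four predictors.

    dates, flow and precip are equal-length lists of consecutive daily
    values. Days without a full n_days history are skipped.
    """
    rows = []
    for t in range(max(n_days, 1), len(dates)):
        predictors = {
            "flow_prev": flow[t - 1],                    # predictor 1
            "precip_prev": precip[t - 1],                # predictor 2
            "month": dates[t].month,                     # predictor 3
            # predictor 4: assumed window is days t-n_days .. t-1
            "precip_sum_n": sum(precip[t - n_days:t]),
        }
        rows.append((predictors, flow[t]))
    return rows


# Made-up values for illustration only (not data from the gages)
start = date(2020, 1, 30)
dates = [start + timedelta(days=i) for i in range(6)]
flow = [10.0, 12.0, 15.0, 14.0, 13.0, 12.5]
precip = [0.0, 5.0, 8.0, 0.0, 1.0, 0.0]

rows = build_rows(dates, flow, precip, n_days=3)
```

Each element of `rows` holds one day's predictors and its target streamflow. Rows in this shape could then be passed to any RF implementation, for example scikit-learn's `RandomForestRegressor`, and rebuilt with different `n_days` values to compare model performance.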
I ran the model on 2 randomly chosen USGS gages (No. 14145500 and No.
14137000) in 2 sub-watersheds of HUC 17, Pacific Northwest (17-8 and 17-9).
The two sub-watersheds differ in size and geometry but are located in the
same region, which facilitates the comparison.
Some preliminary observations
At both gages, the previous day's streamflow has the highest predictive power. The selection of input variables seems to have a considerable impact on the accuracy of the output.
The error tends to increase as the magnitude of streamflow increases (plot below). The next step is to look into how to reduce the error and its correlation with flow magnitude (ideally, the errors would occur at random).
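One simple way to put a number on this pattern is the Pearson correlation between the observed streamflow and the absolute prediction error. The sketch below is only an illustration: the function names and the sample values are hypothetical, and Pearson correlation is just one possible diagnostic, not necessarily the one behind the plot.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def error_magnitude_correlation(observed, predicted):
    """Correlation between observed flow and absolute prediction error."""
    abs_err = [abs(o - p) for o, p in zip(observed, predicted)]
    return pearson(observed, abs_err)


# Made-up values: the absolute error grows in proportion to the flow
observed = [10.0, 20.0, 40.0, 80.0, 160.0]
predicted = [11.0, 18.0, 44.0, 72.0, 176.0]

r = error_magnitude_correlation(observed, predicted)
```

A value of `r` near 1 means the error grows with flow magnitude, while a value near 0 is closer to the random-error behaviour the model should aim for. Note that `pearson` divides by zero if either input is constant.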
Moving forward. These are very preliminary results, and I think the model will improve with the addition of other predictors (snowmelt, temperature, streamflow from nearby gages, and possibly climate indices). To model a large watershed like HUC 17, one approach would be to optimize the input variables by growing a different RF for each sub-watershed. I'm also exploring some data-decomposition techniques to improve the quality of the input data.