I've spent the past several days getting relatively simple machine learning models running on the stream flow data. We've processed data from several sources that includes stream flow, precipitation, and snow data, but I've decided to focus my initial efforts on the stream flow data to make sure I understand it before further complicating the problem with additional data. One key assumption I wanted to validate was that there would be significant relationships between stream flow stations on the same stream. For example, if one station measures a significant increase in stream flow then a station further downstream would be likely to measure an increase in stream flow a few hours in the future. If this is the case then we can improve predictions of stream flow at a particular station by incorporating information from stations that are upstream.
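One way to check for this kind of delayed relationship is to correlate the upstream series against the downstream series shifted by a range of lags. The following is a minimal sketch on synthetic data, not the project's actual code: the function name `lagged_correlation`, the 8-step delay, and the noise level are all illustrative assumptions.

```python
import numpy as np

def lagged_correlation(upstream, downstream, max_lag):
    """Pearson correlation between upstream[t] and downstream[t + lag]
    for each lag in 0..max_lag (lags are counted in samples)."""
    upstream = np.asarray(upstream, dtype=float)
    downstream = np.asarray(downstream, dtype=float)
    if len(upstream) != len(downstream):
        raise ValueError("series must have the same length")
    n = len(upstream)
    result = {}
    for lag in range(max_lag + 1):
        a = upstream[: n - lag]   # upstream values at time t
        b = downstream[lag:]      # downstream values at time t + lag
        result[lag] = float(np.corrcoef(a, b)[0, 1])
    return result

# Synthetic data (assumption): the downstream station sees the upstream
# signal 8 samples later (2 hours at 15-minute intervals) plus small noise.
rng = np.random.default_rng(0)
true_delay = 8
upstream = rng.normal(size=500)
downstream = np.concatenate([np.zeros(true_delay), upstream[:-true_delay]])
downstream = downstream + 0.1 * rng.normal(size=500)

corr = lagged_correlation(upstream, downstream, max_lag=24)
best_lag = max(corr, key=corr.get)
print(best_lag, round(corr[best_lag], 3))
```

If real station pairs show a clear peak at some positive lag, that would support the assumption; a flat curve would suggest the relationship is weaker than expected or is being hidden by pre-processing.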
However, I've now implemented several simple models, and so far it looks like predictions are only slightly improved by incorporating stream flow measurements from multiple stations rather than one. This strongly violates my expectations, so at this point I'm spending time digging into the data to make sure that the sorts of relationships I expect to find in the data set actually exist. So far the results of that investigation are mixed. I think there is a good chance that the lack of performance gain from incorporating data from multiple stream flow stations is an artifact of the pre-processing or of details of the problem definition. For example, stations measure stream flow at 15-minute intervals, but I'm aggregating measurements into non-overlapping 3-hour windows, which may be too coarse a temporal resolution to capture a delay of only a few hours between stations. There are several other small decisions like this that could have a significant effect on the relationship between the stations that measure stream flow.
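To make the windowing concrete, here is a small sketch of aggregating 15-minute readings into non-overlapping windows with pandas. The data is synthetic, the helper name `aggregate` is hypothetical, and using the mean as the aggregate is an assumption; the actual pipeline may aggregate differently.

```python
import numpy as np
import pandas as pd

# Synthetic 15-minute readings covering one day (96 samples).
index = pd.date_range("2020-01-01", periods=96, freq="15min")
flow = pd.Series(np.arange(96, dtype=float), index=index, name="flow")

def aggregate(series, hours):
    """Mean of the series over non-overlapping windows of the given length."""
    return series.resample(pd.Timedelta(hours=hours)).mean()

three_hour = aggregate(flow, 3)  # 12 readings per window
one_hour = aggregate(flow, 1)    # 4 readings per window

print(len(three_hour), len(one_hour))
```

With 3-hour windows, a delay of one or two hours between stations mostly falls inside a single window, so comparing results at a finer resolution such as 1 hour is one way to test whether the window size is masking the relationship.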
A major component of the project as I've currently formulated it is modelling relationships between the locations where data is collected in order to improve predictions. If it turns out that relationships between stream flow stations don't improve predictive performance, I would need to rethink that part of the formulation.