Hands-On Neural Networks with Keras

Irrelevant features and labels

Eventually, using enough data from Hawaii, China, and some other places in the world, we notice a clear, globally generalizable pattern that we can use to predict the weather. Everybody is happy, until one day your prediction model tells you it is going to be a bright sunny day, and a tornado comes knocking on your door. What happened? Where did we go wrong? It turns out that, when it comes to tornadoes, our two-feature binary classification model does not incorporate enough information about the problem (namely, the dynamics of tornadoes) to let us approximate a function that reliably predicts this specific, devastating outcome. In fact, our model never even tried to predict tornadoes; we only ever collected data for sunny and rainy days.

A climatologist here might say, "Well, then start collecting data on altitude, humidity, wind speed, and wind direction, and add some labeled instances of tornadoes to your data." Indeed, this would help us fend off future tornadoes; that is, until an earthquake hits the continental shelf and causes a tsunami. This illustrative example shows that, whatever model you choose to use, you need to keep tracking the relevant features and have enough data for each prediction class (sunny, rainy, tornado, and so on) to achieve good predictive accuracy. Having a good prediction model simply means that you have discovered a mechanism capable of using the data you have collected so far to induce a set of predictive rules that the phenomenon seemingly obeys.
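The climatologist's suggestion can be sketched with a toy dataset: we widen the feature matrix from two columns to five and extend the label set from two classes to three. All feature names and values below are hypothetical, purely for illustration; the one-hot step at the end prepares the labels in the shape a Keras classifier with a softmax output would expect:

```python
import numpy as np

# Original setup: two features per sample (say, temperature in Celsius and
# pressure in hPa) and binary labels: 0 = sunny, 1 = rainy. Toy values.
X_old = np.array([[30.0, 1013.0],
                  [22.0, 1002.0]])
y_old = np.array([0, 1])

# Expanded setup: add altitude (m), humidity (fraction), and wind speed
# (km/h) as features, and a third label class: 2 = tornado. Toy values.
X_new = np.array([[30.0, 1013.0,   5.0, 0.40,  10.0],
                  [22.0, 1002.0, 120.0, 0.85,  25.0],
                  [24.0,  985.0, 300.0, 0.90, 140.0]])
y_new = np.array([0, 1, 2])

# One-hot encode the labels (shape: n_samples x n_classes), as a
# multi-class softmax classifier would expect them during training.
n_classes = y_new.max() + 1
y_onehot = np.eye(n_classes)[y_new]
print(y_onehot.shape)
```

The key point is that both dimensions grow: each sample carries more information (columns of `X_new`), and the label space now names the outcome we previously could not even represent.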