20210811 We might observe that the variable risk_mm
contains the amount of rain recorded tomorrow. We can treat this as a
risk variable, being a measure of the impact, or the severity, of
the prediction. That is, it measures the amount of risk associated
with a “positive” outcome (that it will rain tomorrow).
A risk variable is an output variable and should not be used as an input to any machine learning model: it is not an independent variable. In other circumstances it might itself be treated as the target variable. If you do accidentally include it as an input, it will usually be selected as the strongest predictor of the outcome, often as a perfect predictor. Give it a try. Whenever you see such a perfect model, start questioning whether the variable can legitimately be used in this way.
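The effect of including a risk variable as an input can be sketched on synthetic data. Everything here is an assumption for illustration: the data frame toy, its generated values, and the choice of rpart as the model builder are not the book's dataset or workflow. The target is derived from risk_mm using a 1mm threshold so that the leak is easy to see.

```r
library(rpart)

set.seed(42)
n <- 500

# Synthetic stand-in for the weather data (hypothetical values only).
risk_mm <- round(rexp(n, rate = 0.3), 1)

toy <- data.frame(
  # A genuine but imperfect predictor, loosely related to rainfall.
  humidity_3pm  = pmin(100, pmax(0, 40 + 4 * risk_mm + rnorm(n, 0, 15))),
  # Pure noise.
  pressure_3pm  = rnorm(n, 1015, 7),
  # The risk variable, mistakenly left among the inputs.
  risk_mm       = risk_mm,
  # The target is fully determined by the risk variable.
  rain_tomorrow = factor(ifelse(risk_mm > 1, "Yes", "No"))
)

# Build a classification tree using every other column as an input.
model <- rpart(rain_tomorrow ~ ., data = toy, method = "class")

# The variable chosen for the first (root) split.
root_var <- as.character(model$frame$var[1])

# Accuracy on the training data itself.
predicted <- as.character(predict(model, toy, type = "class"))
train_acc <- mean(predicted == as.character(toy$rain_tomorrow))

print(root_var)
print(train_acc)
```

A tree that splits first on the risk variable and classifies every training observation correctly is a strong hint that an output variable has leaked into the inputs.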
We can explore our risk variable here:
# Note the risk variable - measures the severity of the outcome.

risk <- "risk_mm"
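Once the name of the risk variable is recorded, one way to keep it out of the model inputs is to remove it, together with the target, from the list of column names. This is a minimal sketch: the data frame toy, its columns and values, and the names target, inputs, and form are hypothetical and stand in for whatever the real dataset provides.

```r
# Hypothetical miniature dataset standing in for ds.
toy <- data.frame(
  min_temp      = c(8.0, 14.1),
  humidity_3pm  = c(29, 36),
  risk_mm       = c(3.6, 0.0),
  rain_tomorrow = c("Yes", "No")
)

target <- "rain_tomorrow"
risk   <- "risk_mm"

# Inputs are all columns except the target and the risk variable.
inputs <- setdiff(names(toy), c(target, risk))

# Build a model formula from the remaining inputs.
form <- reformulate(inputs, response = target)

print(inputs)
print(form)
```

Building the formula from an explicit list of inputs, rather than using every column, makes it harder to include an output variable by accident.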
For this risk variable, note that we expect it to have a value of 0 for
all observations when the target variable has the value "No".
# Review the distribution of the risk variable for non-targets.

ds %>%
  filter(rain_tomorrow == "No") %>%
  select(risk_mm) %>%
  summary()

##     risk_mm
##  Min.   :0.00000
##  1st Qu.:0.00000
##  Median :0.00000
##  Mean   :0.07444
##  3rd Qu.:0.00000
##  Max.   :1.00000
Note that a little rain (defined as 1mm or less) is regarded as no rain. This is worth keeping in mind, and it is a discovery from the data that we might not have expected. As data scientists we should expect to find the unexpected.
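The 1mm threshold noted above can be checked directly. The following is a sketch on a small hand-made data frame: toy and its values are hypothetical, as are the names wet_non_targets, derived, and threshold_check. It counts the non-target observations that still recorded some rain, then tests whether the target agrees with a rule of more than 1mm of rain.

```r
library(dplyr)

# Hypothetical miniature stand-in for ds.
toy <- data.frame(
  risk_mm       = c(0, 0, 0.4, 1.0, 1.2, 5.6, 0, 23.0),
  rain_tomorrow = c("No", "No", "No", "No", "Yes", "Yes", "No", "Yes")
)

# Count the non-target observations that nonetheless recorded rain.
wet_non_targets <- toy %>%
  filter(rain_tomorrow == "No", risk_mm > 0) %>%
  nrow()

# Derive the target from a 1mm threshold and count disagreements.
threshold_check <- toy %>%
  mutate(derived = if_else(risk_mm > 1, "Yes", "No")) %>%
  summarise(mismatches = sum(derived != rain_tomorrow))

print(wet_non_targets)
print(threshold_check)
```

On real data, a non-zero mismatch count would suggest that the target was defined by some other rule, or that some observations were recorded inconsistently.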
A similar analysis for the target observations is more in line with expectations.
# Review the distribution of the risk variable for targets.

ds %>%
  filter(rain_tomorrow == "Yes") %>%
  select(risk_mm) %>%
  summary()

##     risk_mm
##  Min.   :  1.1
##  1st Qu.:  2.4
##  Median :  5.2
##  Mean   : 10.3
##  3rd Qu.: 11.8
##  Max.   :474.0
Copyright © 1995-2022 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0