Variable Roles

		Data Science Desktop Survival Guide by Graham Williams



20180723 Now that we have a basic idea of the size, shape, and contents of the dataset, and have performed some basic data type identification and conversion, we are in a position to identify the roles played by the variables within the dataset. First we record the list of available variables so that we can reference them below.

# Note the available variables.

vars <- names(ds) %T>% print()

##  [1] "date"            "location"        "min_temp"        "max_temp"   ...
##  [5] "rainfall"        "evaporation"     "sunshine"        "wind_gust_di...
##  [9] "wind_gust_speed" "wind_dir_9am"    "wind_dir_3pm"    "wind_speed_9...
## [13] "wind_speed_3pm"  "humidity_9am"    "humidity_3pm"    "pressure_9am...
## [17] "pressure_3pm"    "cloud_9am"       "cloud_3pm"       "temp_9am"   ...
## [21] "temp_3pm"        "rain_today"      "risk_mm"         "rain_tomorrow"
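The tee pipe used above can be unpacked with a small sketch. It assumes the magrittr package is installed, and it uses a tiny made-up data frame (the values are purely illustrative) in place of the weather dataset. The name vars_base is hypothetical and exists only for the comparison.

```r
library(magrittr)

# A toy stand-in for ds; the column values are illustrative only.
ds <- data.frame(date          = as.Date("2020-01-01"),
                 location      = "Canberra",
                 rain_tomorrow = "no")

# The tee pipe passes names(ds) to print() for its side effect and
# then returns names(ds) itself, so vars holds the variable names.
vars <- names(ds) %T>% print()

# A base R equivalent: assign first, then print.
vars_base <- names(ds)
print(vars_base)

# Both approaches record the same character vector.
identical(vars, vars_base)
```

The tee form is only a convenience for recording and displaying a value in one statement; the two-step base R form does the same job.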

By this stage of the project we will usually have identified a business problem that is the focus of attention. In our case we will assume it is to build a predictive analytics model that estimates the chance of rain tomorrow given today's weather observations. The variable rain_tomorrow is therefore the target variable: given today's observations, this is what we want to predict. The dataset we have is then a training dataset of historic observations, and the task is to identify any patterns among the other observed variables that suggest that it rains the following day.

# Note the target variable.

target <- "rain_tomorrow"

# Place the target variable at the beginning of the vars.

vars <- c(target, vars) %>% unique() %T>% print()

##  [1] "rain_tomorrow"   "date"            "location"        "min_temp"   ...
##  [5] "max_temp"        "rainfall"        "evaporation"     "sunshine"   ...
##  [9] "wind_gust_dir"   "wind_gust_speed" "wind_dir_9am"    "wind_dir_3pm...
## [13] "wind_speed_9am"  "wind_speed_3pm"  "humidity_9am"    "humidity_3pm...
## [17] "pressure_9am"    "pressure_3pm"    "cloud_9am"       "cloud_3pm"  ...
## [21] "temp_9am"        "temp_3pm"        "rain_today"      "risk_mm"

We have taken the opportunity here to move the target variable to the front of the vector of variables recorded in vars. This follows a common convention in which the first variable in a dataset is the target (the dependent variable) and the remaining variables (the independent variables) are used to build a model to predict that target.
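Why the reordering works can be seen in a minimal base R sketch. It uses a shortened, illustrative subset of the variable names rather than the full dataset, and the name inputs is a hypothetical one introduced here for the remaining (independent) variables.

```r
# An illustrative subset of the variable names, with the target last.
vars <- c("date", "location", "min_temp", "rain_today", "risk_mm",
          "rain_tomorrow")

target <- "rain_tomorrow"

# Prepending the target creates a duplicate; unique() keeps only the
# first occurrence of each name, so the trailing copy is dropped.
vars <- unique(c(target, vars))
print(vars)

# Every variable other than the target is a candidate input at this
# stage, before any further decisions about roles are made.
inputs <- setdiff(vars, target)
print(inputs)
```

Because unique() and setdiff() both preserve the order of their first argument, the remaining variables keep their original relative order.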

Copyright © 2000-2020 Togaware Pty Ltd. Creative Commons ShareAlike V4.