Data Science Desktop Survival Guide by Graham Williams

## Variable Roles

20180723 Now that we have a basic idea of the size, shape, and contents of the dataset, and have performed some basic data type identification and conversion, we are in a position to identify the roles played by the variables within the dataset. First we record the list of available variables so that we can reference them below.

```r
# Note the available variables.

vars <- names(ds) %T>% print()
```

```
## [1] "date" "location" "min_temp" "max_temp" ...
## [5] "rainfall" "evaporation" "sunshine" "wind_gust_di...
## [9] "wind_gust_speed" "wind_dir_9am" "wind_dir_3pm" "wind_speed_9...
## [13] "wind_speed_3pm" "humidity_9am" "humidity_3pm" "pressure_9am...
## [17] "pressure_3pm" "cloud_9am" "cloud_3pm" "temp_9am" ...
## [21] "temp_3pm" "rain_today" "risk_mm" "rain_tomorrow"
```

By this stage of the project we will usually have identified the business problem that is the focus of attention. In our case we assume it is to build a predictive model that estimates the chance of rain tomorrow given today's weather observations. The variable rain_tomorrow is therefore the target variable: given today's observations, it is what we want to predict. The dataset we have is then a training dataset of historic observations, and the task is to identify any patterns among the other observed variables that suggest it will rain the following day.

```r
# Note the target variable.

target <- "rain_tomorrow"

# Place the target variable at the beginning of the vars.

vars <- c(target, vars) %>% unique() %T>% print()
```

```
## [1] "rain_tomorrow" "date" "location" "min_temp" ...
## [5] "max_temp" "rainfall" "evaporation" "sunshine" ...
## [9] "wind_gust_dir" "wind_gust_speed" "wind_dir_9am" "wind_dir_3pm...
## [13] "wind_speed_9am" "wind_speed_3pm" "humidity_9am" "humidity_3pm...
## [17] "pressure_9am" "pressure_3pm" "cloud_9am" "cloud_3pm" ...
## [21] "temp_9am" "temp_3pm" "rain_today" "risk_mm" ....
```

We have taken the opportunity here to move the target variable to the front of the vector of variables recorded in vars. This follows a common convention in which the first variable in a dataset is the target (the dependent variable) and the remaining variables (the independent variables) are those used to build a model to predict that target.
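
To make the reordering concrete, the following is a minimal sketch in base R. It assumes a small hypothetical data frame standing in for the weather dataset, with made-up values and only four of the columns. It illustrates that unique() keeps the first occurrence of each name, which is what moves the target to the front, and how the remaining variables might then be recorded as candidate inputs. The names inputs and form are illustrative assumptions and are not defined in the text above.

```r
# Hypothetical stand-in for ds: a few columns with made-up values.
ds <- data.frame(
  date          = as.Date(c("2018-07-21", "2018-07-22", "2018-07-23")),
  min_temp      = c(4.1, 6.3, 2.8),
  rain_today    = factor(c("no", "yes", "no")),
  rain_tomorrow = factor(c("yes", "no", "no"))
)

vars   <- names(ds)
target <- "rain_tomorrow"

# unique() keeps the first occurrence of each element, so prepending the
# target and then dropping duplicates moves it to the front.
vars <- unique(c(target, vars))
print(vars)

# The remaining variables are the candidate inputs (independent variables).
inputs <- setdiff(vars, target)
print(inputs)

# One way to express "predict the target from everything else" as a formula.
form <- as.formula(paste(target, "~ ."))
print(form)

# Optionally reorder the columns so the target comes first in the dataset.
ds <- ds[vars]
print(names(ds))
```

Keeping the target name in a variable rather than hard-coding it means the same steps can be reused if a different target is chosen later.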