Data Science Desktop Survival Guide by Graham Williams

## Identify Variable Types

20180726 Metadata is data about the data. We now record metadata about our dataset for use in later processing and analysis. In one sense the metadata is simply a convenient store of this information.

We identify the variables that will be used to build analytic models that provide different kinds of insight into our data. Above we identified the variable roles, such as the target, a risk variable, and the ignored variables. From an analytic modelling perspective we now identify the variables that are the model inputs. We record them both as a vector of characters (the variable names) and as a vector of integers (the variable indices).

```r
library(magrittr)  # provides the tee pipe %T>%
inputs <- setdiff(vars, target) %T>% print()
```

```
##  "min_temp" "max_temp" "rainfall" "evaporation"...
##  "sunshine" "wind_gust_dir" "wind_gust_speed" "wind_dir_9am...
##  "wind_dir_3pm" "wind_speed_9am" "wind_speed_3pm" "humidity_9am...
##  "humidity_3pm" "pressure_9am" "cloud_9am" "cloud_3pm" ...
##  "rain_today"
```
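As a minimal sketch of the same step on toy data, the following uses base R only. The names `vars_toy`, `target_toy` and `inputs_toy`, and the variable names they hold, are hypothetical stand-ins rather than the values used in this chapter. Wrapping the assignment in parentheses prints the result, which has a similar effect to piping into `print()` with `%T>%`.

```r
# Hypothetical variable names, for illustration only.
vars_toy   <- c("rain_tomorrow", "min_temp", "max_temp", "rainfall")
target_toy <- "rain_tomorrow"

# setdiff() keeps the elements of its first argument that are not in
# its second, preserving their order. The outer parentheses cause the
# assigned value to be printed.
(inputs_toy <- setdiff(vars_toy, target_toy))
```

The inputs are then every variable of interest except the target.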

The integer indices are determined from the base::names() of the variables in the original dataset. Note the use of USE.NAMES=FALSE in base::sapply() to turn off the inclusion of names in the result, keeping it as a simple integer vector.

```r
inputi <- sapply(inputs,
                 function(x) which(x == names(ds)),
                 USE.NAMES=FALSE)
inputi
```

```
##  3 4 5 6 7 8 9 10 11 12 13 14 15 16 18 19 22
```
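One possible alternative, not the approach used in this chapter, is base::match(), which returns the position of each name directly. The sketch below compares the two on a toy data frame. The names `ds_toy` and `inputs_toy` and the column names are hypothetical.

```r
# Toy data frame standing in for ds (hypothetical columns).
ds_toy <- data.frame(id = 1:3,
                     a  = c(1.5, 2.0, 3.1),
                     b  = c("x", "y", "z"),
                     y  = c(0, 1, 0))
inputs_toy <- c("a", "b")

# The sapply()/which() approach from the text.
idx_sapply <- sapply(inputs_toy,
                     function(x) which(x == names(ds_toy)),
                     USE.NAMES = FALSE)

# match() gives the position of the first match for each name.
idx_match <- match(inputs_toy, names(ds_toy))

# Both approaches yield the same unnamed integer vector here.
identical(idx_sapply, idx_match)
```

The two differ when a name is missing from the dataset: match() returns NA for that name, whereas which() returns a zero-length vector, so sapply() can no longer simplify its result to a plain vector.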

For convenience we record the number of observations:

```r
nobs <- nrow(ds) %T>% print()
```

```
##  172430
```

Here we simply report on the dimensions of various data subsets, primarily to confirm that the datasets appear as we expect:

```r
dim(ds)
```

```
##  172430 24
```

```r
dim(ds[vars])
```

```
##  172430 18
```

```r
dim(ds[inputs])
```

```
##  172430 17
```
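Checks like these can also be made programmatically rather than by eye. The following is a minimal sketch on toy data, assuming hypothetical names (`ds_toy`, `vars_toy`, `target_toy`, `inputs_toy`, `nobs_toy`) that mirror the objects above. base::stopifnot() raises an error if any condition is not TRUE.

```r
# Toy data standing in for ds (hypothetical columns).
ds_toy <- data.frame(id = 1:4,
                     x1 = c(0.1, 0.4, 0.3, 0.9),
                     x2 = c(5, 3, 8, 1),
                     y  = c("no", "yes", "no", "yes"))

target_toy <- "y"
vars_toy   <- c("x1", "x2", "y")
inputs_toy <- setdiff(vars_toy, target_toy)
nobs_toy   <- nrow(ds_toy)

# Subsetting by column keeps every observation, and the inputs are
# the variables of interest less the single target.
stopifnot(
  nrow(ds_toy[vars_toy])   == nobs_toy,
  ncol(ds_toy[vars_toy])   == length(vars_toy),
  ncol(ds_toy[inputs_toy]) == length(vars_toy) - 1
)
```

The same pattern matches the output above, where the subset of inputs has one column fewer than the subset of variables.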