Data Science Desktop Survival Guide
by Graham Williams

Identify Variable Types

20180726 Metadata is data about the data. We now record data about our dataset that we will use later when processing and analysing it. In one sense the metadata is simply a convenient store.

We identify the variables that will be used to build analytic models that provide different kinds of insight into our data. Above we identified the variable roles such as the target, a risk variable and the ignored variables. From an analytic modelling perspective we also identify the variables that are the model inputs. We record them both as a vector of characters (the variable names) and a vector of integers (the variable indices).

inputs <- setdiff(vars, target) %T>% print()
##  [1] "min_temp"        "max_temp"        "rainfall"        "evaporation"...
##  [5] "sunshine"        "wind_gust_dir"   "wind_gust_speed" "wind_dir_9am...
##  [9] "wind_dir_3pm"    "wind_speed_9am"  "wind_speed_3pm"  "humidity_9am...
## [13] "humidity_3pm"    "pressure_9am"    "cloud_9am"       "cloud_3pm"  ...
## [17] "rain_today"
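
As a small self-contained illustration of the same step, the sketch below uses a toy set of variable names. The names and the target are assumptions made for the example and are not the dataset used in this chapter. It shows that base::setdiff() drops the target and keeps the remaining names in their original order.

```r
# Toy variable names (hypothetical, for illustration only).
vars   <- c("min_temp", "max_temp", "rainfall", "rain_tomorrow")
target <- "rain_tomorrow"

# setdiff() returns the elements of vars that are not in target,
# preserving the order in which they appear in vars.
inputs <- setdiff(vars, target)
print(inputs)
```

Note that setdiff() also removes duplicates, which is harmless here because variable names are expected to be unique.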

The integer indices are determined from the base::names() of the variables in the original dataset. Note the use of USE.NAMES=FALSE in base::sapply() to drop the names from the result, so that it remains a simple unnamed vector.

inputi <- sapply(inputs,
                 function(x) which(x == names(ds)),
                 USE.NAMES=FALSE)
inputi
##  [1]  3  4  5  6  7  8  9 10 11 12 13 14 15 16 18 19 22
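
An alternative worth knowing is base::match(), which returns the position of each name directly. The sketch below compares the two approaches on a toy data frame whose columns are assumptions made for the example. It is not a replacement for the code above, only a cross-check that both give the same indices.

```r
# Toy data frame (hypothetical columns, for illustration only).
ds <- data.frame(date          = as.Date("2020-01-01") + 0:2,
                 min_temp      = c(8.0, 14.0, 13.7),
                 max_temp      = c(24.3, 26.9, 23.4),
                 rain_tomorrow = c("yes", "yes", "no"))
inputs <- c("min_temp", "max_temp")

# The sapply() approach: find the position of each name in turn.
inputi <- sapply(inputs,
                 function(x) which(x == names(ds)),
                 USE.NAMES=FALSE)

# The match() approach: look up all of the names in one call.
inputm <- match(inputs, names(ds))

print(inputi)
print(inputm)

# Indexing the column names by position recovers the input names.
print(names(ds)[inputi])
```

The two differ when a name is missing from the dataset: match() returns NA for that name, whereas which() returns an empty vector and sapply() then returns a list rather than a simple integer vector.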

For convenience we record the number of observations:

nobs <- nrow(ds) %T>% print()
## [1] 172430

Here we simply report on the dimensions of various data subsets, primarily to confirm that the dataset appears as we expect:

dim(ds)
## [1] 172430     24

dim(ds[vars])
## [1] 172430     18

dim(ds[inputs])
## [1] 172430     17
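
The same confirmation can be made explicit with assertions rather than by eye. The following is a minimal sketch on a toy data frame, with column names assumed for the example. It uses base::stopifnot() to check that each subset has the expected number of rows and columns.

```r
# Toy dataset and metadata (hypothetical, for illustration only).
ds <- data.frame(id = 1:3,
                 x1 = c(0.1, 0.2, 0.3),
                 x2 = c("a", "b", "c"),
                 y  = c(0, 1, 0))
vars   <- c("x1", "x2", "y")
target <- "y"
inputs <- setdiff(vars, target)
nobs   <- nrow(ds)

# Each check must be TRUE, otherwise stopifnot() raises an error.
stopifnot(nrow(ds[vars])   == nobs,
          ncol(ds[vars])   == length(vars),
          ncol(ds[inputs]) == length(vars) - 1)

print(dim(ds[inputs]))
```

Subsetting by column never changes the number of rows, so only the column counts differ between the subsets, as in the output above where the inputs have one column fewer than the full set of variables.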


Copyright © 2000-2020 Togaware Pty Ltd. Creative Commons ShareAlike V4.