Data Science Desktop Survival Guide
by Graham Williams
Identify Variable Types
20180726 Metadata is data about the data. We now record metadata about our dataset for use in later processing and analysis. In one sense the metadata is simply a convenient store of information about the dataset.
We identify the variables that will be used to build analytic models that provide different kinds of insight into our data. Above we identified the variable roles such as the target, a risk variable and the ignored variables. From an analytic modelling perspective we now identify the variables that are the model inputs. We record them both as a vector of characters (the variable names) and a vector of integers (the variable indices).
inputs <- setdiff(vars, target) %T>% print()
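To make this step concrete, the following is a minimal sketch in base R. The toy data frame and its column names are hypothetical stand-ins for `ds`, and the way `vars` is built here is an assumption about the earlier sections rather than a reproduction of them.

```r
# Hypothetical toy dataset standing in for `ds`; column names are illustrative only.
ds <- data.frame(
  date          = as.Date("2018-07-26") + 0:3,
  min_temp      = c(8.0, 14.0, 13.7, 13.3),
  rainfall      = c(0.0, 3.6, 3.6, 39.8),
  risk_mm       = c(3.6, 3.6, 39.8, 2.8),
  rain_tomorrow = factor(c("yes", "yes", "yes", "no"))
)

# Variable roles (assumed for this sketch).
target <- "rain_tomorrow"
risk   <- "risk_mm"
ignore <- c("date", risk)

# Variables retained for modelling, with the target listed first.
vars <- c(target, setdiff(names(ds), c(target, ignore)))

# Model inputs: every retained variable except the target.
inputs <- setdiff(vars, target)
print(inputs)
```

Because `setdiff()` keeps the order of its first argument, the inputs appear in the same order as they do in `vars`.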
The integer indices are determined from the base::names() of the variables in the original dataset. Note the use of USE.NAMES=FALSE in base::sapply() to turn off the inclusion of names in the result, keeping it a simple integer vector.
inputi <- sapply(inputs,
                 function(x) which(x == names(ds)),
                 USE.NAMES=FALSE)
inputi
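As a small self-contained illustration of the index lookup, the sketch below uses a hypothetical four-column data frame and compares the sapply() approach with base::match(), which is offered here only as an equivalent alternative and not as the method used in this guide.

```r
# Hypothetical data frame and inputs, for illustration only.
ds     <- data.frame(a = 1:3, b = 4:6, c = 7:9, y = c(0, 1, 0))
inputs <- c("a", "c")

# Approach from the text: look up each name's position in names(ds).
inputi <- sapply(inputs,
                 function(x) which(x == names(ds)),
                 USE.NAMES = FALSE)

# Without USE.NAMES = FALSE the result carries the input names.
inputi_named <- sapply(inputs, function(x) which(x == names(ds)))

# Alternative (not from the original text): match() returns the same positions.
inputi_match <- match(inputs, names(ds))

print(inputi)
print(names(inputi_named))
print(identical(inputi, inputi_match))
```

One difference worth noting: for a name that is not in the dataset, match() returns NA, whereas which() returns a zero-length result and sapply() then falls back to returning a list.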
For convenience we record the number of observations:
nobs <- nrow(ds) %T>% print()
Here we simply report on the dimensions of various data subsets, primarily to confirm that the dataset appears as we expect:
dim(ds)
dim(ds[vars])
dim(ds[inputs])
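The same confirmation can be expressed as explicit checks. This is only a sketch under assumed names: the toy data frame and the contents of `vars` are hypothetical, and the assertions simply encode what the dimensions are expected to be.

```r
# Hypothetical dataset and roles, for illustration only.
ds <- data.frame(id = 1:5,
                 x1 = c(0.1, 0.4, 0.3, 0.9, 0.5),
                 x2 = c(10, 12, 9, 14, 11),
                 y  = factor(c("no", "yes", "no", "yes", "no")))
target <- "y"
vars   <- c(target, "x1", "x2")   # `id` is treated as an ignored variable
inputs <- setdiff(vars, target)
nobs   <- nrow(ds)

# Subsetting by column name keeps every row and only the named columns.
stopifnot(
  nrow(ds[vars])   == nobs,
  ncol(ds[vars])   == length(vars),
  nrow(ds[inputs]) == nobs,
  ncol(ds[inputs]) == length(vars) - 1   # the inputs exclude only the target
)

dim(ds)
dim(ds[vars])
dim(ds[inputs])
```

If any expectation fails, stopifnot() raises an error naming the failed condition, which makes a mismatch harder to overlook than a printed dimension.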