Go to TogaWare.com Home Page. Data Science Desktop Survival Guide
by Graham Williams
Duck Duck Go

Target as a Factor

20180726 We often build classification models. For such models we want to ensure the target is categoric. Often it is 0/1 and hence is loaded as numeric. We could tell our model algorithm of choice to explicitly do classification or else set the target using base::as.factor() in the formula. Nonetheless it is generally cleaner to do this here and note that this code has no effect if the target is already categoric.

# Ensure the target is categoric.

ds[[target]] %<>% as.factor()

# Confirm the distribution.

ds[target] %>% table()
## .
##     no    yes 
## 135353  37077

We can visualise the distribution of the target variable using ggplot2. The dataset is piped to ggplot2::ggplot() whereby the target is associated through ggplot2::aes_string() (the aesthetics) with the x-axis of the plot. To this we add a graphics layer using ggplot2::geom_bar() to produce the bar chart, with bars having width= 0.2 and a fill= color of "grey". The resulting plot can be seen in Figure 9.1.

ds %>%
  ggplot(aes_string(x=target)) +
  geom_bar(width=0.2, fill="grey") +
  theme(text=element_text(size=14))

Figure 9.1: Target variable distribution. Plotting the distribution is useful to gain an insight into the number of observations in each category. As is the case here we often see a skewed distribution.

\includegraphics[width=\textwidth]{figures/onepager/data:plot_target_distribution-1}


Support further development by purchasing the PDF version of the book.
Other online resources include the GNU/Linux Desktop Survival Guide.
Books available on Amazon include Data Mining with Rattle and Essentials of Data Science.
Popular open source software includes rattle and wajig.
Hosted by Togaware, a pioneer of free and open source software since 1984.
Copyright © 2000-2020 Togaware Pty Ltd. . Creative Commons ShareAlike V4.