10.52 Rain

20180723 The two remaining character variables are rain_today and rain_tomorrow. We review their distributions by using dplyr::select() to choose those variables from the dataset whose names start with rain_, and then using base::sapply() to apply base::table() to each selected column. This counts how often each value of a variable occurs within the dataset.

# Review the distribution of observations across levels.

ds %>%
  select(starts_with("rain_")) %>%
  sapply(table)
##     rain_today rain_tomorrow
## No      171174        171165
## Yes      48919         48929
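
To see the mechanics in isolation, the same pipeline can be tried on a small made-up dataset. This is only a sketch: the tibble toy and its values are invented for illustration and are not part of the weather dataset. It shows that sapply() collapses the per-column tables into a single matrix when every column has the same set of values.

```r
library(dplyr)

# Hypothetical toy data standing in for ds; the values are invented.
toy <- tibble(
  rain_today    = c("No", "No", "Yes", "No"),
  rain_tomorrow = c("No", "Yes", "Yes", "No"),
  temp          = c(21.3, 18.9, 15.2, 22.0)
)

# Select only the rain_ columns and count the values in each one.
counts <- toy %>%
  select(starts_with("rain_")) %>%
  sapply(table)

# Rows are the observed values, columns are the variables.
print(counts)
```

If the selected columns did not share the same values, sapply() would typically return a list of tables rather than a matrix.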

Noting that No and Yes are the only values these two variables take, it makes sense to convert them both to factors. We will keep the ordering as alphabetic, so a simple call to base::factor() will convert them from character to factor.
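
As a quick illustration of the alphabetic ordering, the following sketch applies base::factor() to a short character vector. The vector x is invented for the example. By default the levels are the sorted unique values, so No becomes the first level and Yes the second.

```r
# Toy character vector; the values are invented for illustration.
x <- c("Yes", "No", "No", "Yes", "No")

# With no levels argument, factor() sorts the unique values to form the levels.
f <- factor(x)

# Levels in sorted order.
print(levels(f))

# Underlying integer codes follow the level order.
print(as.integer(f))
```

A different ordering would need the levels to be supplied explicitly through the levels argument of factor().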

# Note the names of the rain variables.

ds %>% 
  select(starts_with("rain_")) %>% 
  names() ->
vnames

# Confirm these are currently character variables.

ds[vnames] %>% sapply(class)
##    rain_today rain_tomorrow 
##   "character"   "character"

# Convert these variables from character to factor.

ds[vnames] %<>% 
  lapply(factor) %>% 
  data.frame() %>% 
  as_tibble()

# Confirm they are now factors.

ds[vnames] %>% sapply(class)
##    rain_today rain_tomorrow 
##      "factor"      "factor"
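
The same conversion can also be written with dplyr::mutate() and dplyr::across(). This is offered as an alternative sketch rather than the approach used above. It assumes dplyr 1.0.0 or later, and the tibble toy is invented for illustration.

```r
library(dplyr)

# Hypothetical toy data standing in for ds; the values are invented.
toy <- tibble(
  rain_today    = c("No", "Yes", "No"),
  rain_tomorrow = c("Yes", "No", "No")
)

# Note the names of the rain variables.
vnames <- toy %>%
  select(starts_with("rain_")) %>%
  names()

# Convert each named column from character to factor in place.
toy <- toy %>%
  mutate(across(all_of(vnames), factor))

# Confirm they are now factors.
print(sapply(toy[vnames], class))
```

This form keeps the result as a tibble throughout, so no round trip through data.frame() and as_tibble() is needed.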


Copyright © 1995-2022 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0