Data Science Desktop Survival Guide
by Graham Williams


20180723 Above we converted rain_today and rain_tomorrow to factors. They have just two values as we confirm here, in addition to a small number of missing values (NA).

ds %>%
  select(rain_today, rain_tomorrow) %>%
  summary()
##  rain_today    rain_tomorrow
##  No  :135371   No  :135353  
##  Yes : 37058   Yes : 37077  
##  NA's:  4318   NA's:  4317

As binary valued factors, and particularly as the values suggest, they are both candidates for being treated as logical variables (sometimes called Boolean). The values No/Yes can be represented as FALSE/TRUE, which R supports directly through the class logical. Some functions treat logical variables specially, though not all do. If this suits our purposes then the following can be used to perform the conversion to logical.

ds %<>%
  mutate(rain_today    = rain_today    == "Yes",
         rain_tomorrow = rain_tomorrow == "Yes")
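
As a minimal sketch of what the comparison does, the following uses a small made-up factor (not the actual dataset) in place of rain_today. It suggests that comparing a factor to "Yes" yields a logical vector and that missing values remain NA rather than becoming FALSE.

```r
# Hypothetical toy factor standing in for rain_today.
rain <- factor(c("No", "Yes", NA, "No", "Yes"), levels = c("No", "Yes"))

# Comparing against "Yes" gives TRUE/FALSE, with NA carried through.
rain_lgl <- rain == "Yes"

class(rain_lgl)
print(rain_lgl)
```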

Best to now check that the distribution itself has not changed.

ds %>%
  select(rain_today, rain_tomorrow) %>%
  summary()
##  rain_today      rain_tomorrow  
##  Mode :logical   Mode :logical  
##  FALSE:110319    FALSE:110316   
##  TRUE :31880     TRUE :31877    
##  NA's :3261      NA's :3267

Observe that the TRUE (Yes) values are much less frequent than the FALSE (No) values, and we also note the missing values.
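
The check that the distribution has not changed could also be made programmatically. The sketch below is only an illustration on made-up data: the vectors before and after are hypothetical names, and the idea is simply to compare the counts of each value, including NA, on either side of the conversion.

```r
# Hypothetical toy data: a factor before conversion and its logical version.
before <- factor(c("No", "Yes", NA, "No", "No", "Yes"))
after  <- before == "Yes"

# Counts of Yes/TRUE, No/FALSE, and missing values should all agree.
stopifnot(
  sum(before == "Yes", na.rm = TRUE) == sum(after,  na.rm = TRUE),
  sum(before == "No",  na.rm = TRUE) == sum(!after, na.rm = TRUE),
  sum(is.na(before)) == sum(is.na(after))
)
```

If any count differed, stopifnot() would raise an error identifying the failing comparison.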

The majority of days not having rain can be cross checked with the rainfall variable. In the previous summary of its distribution we note that rainfall has a median of zero, consistent with fewer days of actual rain. As data scientists we perform various cross checks on the hunt for oddities in the data.
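
One way such a cross check might look is sketched here on made-up values. The vectors rainfall and rain_today are hypothetical stand-ins for the dataset columns, and the rule that a rain day corresponds to rainfall above zero is an assumption made purely for illustration.

```r
# Hypothetical toy values standing in for the rainfall and rain_today columns.
rainfall   <- c(0, 0, 0, 5.2, 0, 1.4, 0, NA)
rain_today <- c(FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, NA)

# A median of zero is consistent with most days having no rain.
med <- median(rainfall, na.rm = TRUE)

# Proportion of days flagged as having rain.
prop_rain <- mean(rain_today, na.rm = TRUE)

# Cross tabulate (assumed rule: rain day means rainfall > 0) to spot oddities.
table(rainfall > 0, rain_today, useNA = "ifany")
```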

As data scientists we will also want to understand why there are missing values. Is it simply a matter of rare failures to capture the observation or, for example, is there a particular location not recording rainfall? We would explore that now before moving on.
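
A first exploration might compute the proportion of missing values per location. The sketch below uses a small made-up data frame, and the column name location is an assumption for illustration. Base R is used here so the example is self-contained.

```r
# Hypothetical toy data frame; the location column is an assumed name.
ds_toy <- data.frame(
  location   = c("A", "A", "B", "B", "B", "C"),
  rain_today = c("No", NA, "Yes", "No", "No", NA)
)

# Proportion of missing rain_today observations for each location.
na_rate <- tapply(is.na(ds_toy$rain_today), ds_toy$location, mean)

# Locations with the highest missing rate come first.
sort(na_rate, decreasing = TRUE)
```

A location with a rate near 1 would suggest a site that does not record the variable at all, rather than occasional failures.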

For our purposes going forward we will retain these two variables as factors. One reason for doing so is that we will illustrate missing value imputation using randomForest::na.roughfix(), and this function does not handle logical data, whereas keeping the variables as factors allows the missing values to be imputed. Of course we could instead skip these variables for the imputation.
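
To give a feel for what imputation on a factor involves, the following base R sketch replaces missing values with the most frequent level. It is a simplified stand-in and not the randomForest implementation: impute_mode is a hypothetical helper, the data are made up, and ties here go to the first level rather than being broken at random.

```r
# Hypothetical helper: replace NA in a factor with its most frequent level.
impute_mode <- function(x) {
  stopifnot(is.factor(x))
  counts <- table(x)                      # NA values are excluded by default
  x[is.na(x)] <- names(counts)[which.max(counts)]
  x
}

# Made-up factor standing in for rain_tomorrow.
rain_tomorrow <- factor(c("No", "Yes", NA, "No", NA, "No"))
imputed <- impute_mode(rain_tomorrow)

table(imputed, useNA = "ifany")
```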

Copyright © 2000-2020 Togaware Pty Ltd. Creative Commons ShareAlike V4.