10.22 Missing Value Imputation

20201026 See Section 10.27 to replace missing values with specific values and Section ?? to drop rows in a dataset containing missing values.

Missing value imputation is useful but must be done with care, as it can be akin to inventing new data. We may be tempted to impute as a quick fix to avoid the warnings about missing data that ggplot2 (Wickham et al. 2024) would otherwise give us, for example. The function randomForest::na.roughfix() performs a simple imputation: missing values in a numeric column are replaced with the column median, and missing values in a factor column are replaced with the most frequent level. The function only operates on numeric and factor columns, so we exclude the first two columns of the dataset (date and location) from the imputation:

# Count the number of missing values.

ds %>% is.na() %>% sum()

## [1] 644978

# No missing values in the first two columns (date and location)

ds[1:2] %>% is.na() %>% sum()

## [1] 0

# Impute missing values.

ds[3:24] %<>% randomForest::na.roughfix()

# Confirm that no missing values remain.

ds %>% is.na() %>% sum()

## [1] 0
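
To see what randomForest::na.roughfix() does to each column type, the following minimal sketch applies it to a small made-up data frame. The data frame toy, its column names, and its values are hypothetical and are not taken from the dataset above. The sketch assumes the randomForest package is installed. As documented for that package, numeric gaps are filled with the column median and factor gaps with the most frequent level.

```r
# A hypothetical toy data frame with one numeric and one factor column,
# each containing a single missing value.
toy <- data.frame(
  temp = c(10, 12, NA, 20, 30),
  wind = factor(c("N", "N", "S", NA, "N"))
)

# The median of the observed temperatures (10, 12, 20, 30).
median(toy$temp, na.rm = TRUE)
## [1] 16

# Impute: the numeric NA becomes the column median and the
# factor NA becomes the most frequent level ("N").
fixed <- randomForest::na.roughfix(toy)

# The previously missing entries now hold the imputed values.
fixed$temp[3]
as.character(fixed$wind[4])

# Confirm that no missing values remain.
sum(is.na(fixed))
```

Because every gap in a column receives the same value, this kind of imputation shrinks the variability of the column, which is one reason to treat the imputed data with care.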

References

Wickham, Hadley, Winston Chang, Lionel Henry, Thomas Lin Pedersen, Kohske Takahashi, Claus Wilke, Kara Woo, Hiroaki Yutani, Dewey Dunnington, and Teun van den Brand. 2024. ggplot2: Create Elegant Data Visualisations Using the Grammar of Graphics. https://ggplot2.tidyverse.org.
