Data Review

		Data Science Desktop Survival Guide by Graham Williams



20180721 Having ingested the dataset and normalised the variable names, we can now explore it further. Using dplyr::glimpse() gives us some initial insight:

# Review the dataset.

glimpse(ds)

## Rows: 176,747
## Columns: 24
## $ date            <date> 2008-12-01, 2008-12-02, 2008-12-03, 2008-12-04,...
## $ location        <chr> "Albury", "Albury", "Albury", "Albury", "Albury"...
## $ min_temp        <dbl> 13.4, 7.4, 12.9, 9.2, 17.5, 14.6, 14.3, 7.7, 9.7...
## $ max_temp        <dbl> 22.9, 25.1, 25.7, 28.0, 32.3, 29.7, 25.0, 26.7, ...
## $ rainfall        <dbl> 0.6, 0.0, 0.0, 0.0, 1.0, 0.2, 0.0, 0.0, 0.0, 1.4...
## $ evaporation     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ sunshine        <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ wind_gust_dir   <ord> W, WNW, WSW, NE, W, WNW, W, W, NNW, W, N, NNE, W...
## $ wind_gust_speed <dbl> 44, 44, 46, 24, 41, 56, 50, 35, 80, 28, 30, 31, ...
## $ wind_dir_9am    <ord> W, NNW, W, SE, ENE, W, SW, SSE, SE, S, SSE, NE, ...
## $ wind_dir_3pm    <ord> WNW, WSW, WSW, E, NW, W, W, W, NW, SSE, ESE, ENE...
## $ wind_speed_9am  <dbl> 20, 4, 19, 11, 7, 19, 20, 6, 7, 15, 17, 15, 28, ...
## $ wind_speed_3pm  <dbl> 24, 22, 26, 9, 20, 24, 24, 17, 28, 11, 6, 13, 28...
## $ humidity_9am    <int> 71, 44, 38, 45, 82, 55, 49, 48, 42, 58, 48, 89, ...
## $ humidity_3pm    <int> 22, 25, 30, 16, 33, 23, 19, 19, 9, 27, 22, 91, 9...
## $ pressure_9am    <dbl> 1007.7, 1010.6, 1007.6, 1017.6, 1010.8, 1009.2, ...
## $ pressure_3pm    <dbl> 1007.1, 1007.8, 1008.7, 1012.8, 1006.0, 1005.4, ...
## $ cloud_9am       <int> 8, NA, NA, NA, 7, NA, 1, NA, NA, NA, NA, 8, 8, N...
## $ cloud_3pm       <int> NA, NA, 2, NA, 8, NA, NA, NA, NA, NA, NA, 8, 8, ...
## $ temp_9am        <dbl> 16.9, 17.2, 21.0, 18.1, 17.8, 20.6, 18.1, 16.3, ...
## $ temp_3pm        <dbl> 21.8, 24.3, 23.2, 26.5, 29.7, 28.9, 24.6, 25.5, ...
## $ rain_today      <fct> No, No, No, No, No, No, No, No, No, Yes, No, Yes...
## $ risk_mm         <dbl> 0.0, 0.0, 0.0, 1.0, 0.2, 0.0, 0.0, 0.0, 1.4, 0.0...
## $ rain_tomorrow   <fct> No, No, No, No, No, No, No, No, Yes, No, Yes, Ye...

Observe the variety of data types here, ranging from Date (date) and character (chr) through numeric (dbl) and integer (int) to factor (fct) and ordered factor (ord). The data mostly looks as expected, though it is odd that evaporation and sunshine appear to be entirely missing (NA), at least in the first 10 or so observations shown. We begin to question other aspects of the data too. For example, is date an ongoing sequence of days as it appears to be here? Does location have values other than Albury? What is the distribution of the different variables?
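As a minimal sketch of how we might begin answering these questions with dplyr, the code below checks for breaks in the daily sequence, counts the observations per location, and reports the proportion of missing values per variable. To keep the example self-contained it builds a tiny stand-in for ds using a few of the same variable names; the Sydney rows and all of the values beyond the first few Albury dates are invented purely for illustration, and the names gaps, locations, and missing are our own. With the real dataset we would skip the stand-in and run the remaining steps directly.

```r
library(dplyr)

# Hypothetical stand-in for ds (values invented for illustration only).
# Skip this step when the real dataset is already loaded.
ds <- tibble(
  date        = as.Date("2008-12-01") + c(0, 1, 2, 4, 0, 1),
  location    = c(rep("Albury", 4), rep("Sydney", 2)),
  rainfall    = c(0.6, 0.0, 0.0, 1.0, 0.0, 2.4),
  evaporation = c(NA, NA, NA, NA, 3.2, 4.0)
)

# Is date an unbroken sequence of days within each location?
# Any row returned here follows a gap (or a repeat) in the sequence.
gaps <- ds %>%
  group_by(location) %>%
  arrange(date, .by_group = TRUE) %>%
  mutate(step = as.integer(date - lag(date))) %>%
  filter(!is.na(step), step != 1) %>%
  ungroup()

# Does location have values other than Albury?
locations <- ds %>% count(location, sort = TRUE)

# What proportion of each variable is missing?
missing <- ds %>% summarise(across(everything(), ~ mean(is.na(.x))))

print(gaps)
print(locations)
print(missing)

# A first look at the distribution of each variable.
summary(ds)
```

An empty gaps table would suggest that every location has one observation per consecutive day. The missing table tells us whether variables like evaporation are missing everywhere or only for some of the observations.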

These are all questions we will start asking ourselves in the context of “living and breathing” our data. Our aim should be to glean all we can about the data that we are dealing with. Data science is very much about understanding, not blindly processing. The excitement is in the discovery of patterns in the data and the narrative the data is seeking to tell.

Copyright © 2000-2020 Togaware Pty Ltd. Creative Commons ShareAlike V4.