10.38 Numeric
20180723 Summaries of numeric data are provided by base::summary(). Below we identify the numeric variables and summarise each. As data scientists, we again look for any oddities and seek to explain them.
ds %>%
  sapply(is.numeric) %>%
  which() %>%
  names() %T>%
  print() ->
numi
## [1] "min_temp" "max_temp" "rainfall" "evaporation"
## [5] "sunshine" "wind_gust_speed" "wind_speed_9am" "wind_speed_3pm"
## [9] "humidity_9am" "humidity_3pm" "pressure_9am" "pressure_3pm"
## [13] "cloud_9am" "cloud_3pm" "temp_9am" "temp_3pm"
## [17] "risk_mm"
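The pipeline above can be read step by step: sapply(is.numeric) gives one TRUE/FALSE per column, which() keeps the positions of the TRUE entries, and names() turns those positions into column names. The magrittr tee operator %T>% prints the names while still passing them on to the assignment. As a minimal sketch, the same result can be obtained in base R without pipes. The data frame toy and its values are hypothetical stand-ins for ds, used only so that the example runs on its own.

```r
# Hypothetical toy data standing in for ds (not the weather dataset).
toy <- data.frame(
  location = c("Canberra", "Sydney", "Perth"),
  min_temp = c(-1.2, 9.8, 12.4),
  rainfall = c(0.0, 3.2, 0.6),
  rain_today = c("No", "Yes", "No"),
  stringsAsFactors = FALSE
)

# One logical value per column: TRUE where the column is numeric.
is_num <- sapply(toy, is.numeric)

# which() keeps the TRUE positions; names() recovers the column names.
numi <- names(which(is_num))

print(numi)
## [1] "min_temp" "rainfall"
```

Storing the names rather than the positions keeps numi valid if the columns are later reordered.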
ds[numi] %>%
  summary()
## min_temp max_temp rainfall evaporation
## Min. :-8.7 Min. :-4.10 Min. : 0.000 Min. : 0.00
## 1st Qu.: 7.5 1st Qu.:18.00 1st Qu.: 0.000 1st Qu.: 2.60
## Median :11.9 Median :22.70 Median : 0.000 Median : 4.80
## Mean :12.1 Mean :23.29 Mean : 2.254 Mean : 5.54
## 3rd Qu.:16.8 3rd Qu.:28.30 3rd Qu.: 0.600 3rd Qu.: 7.40
## Max. :33.9 Max. :48.90 Max. :474.000 Max. :138.70
## NA's :2760 NA's :2555 NA's :5037 NA's :96586
## sunshine wind_gust_speed wind_speed_9am wind_speed_3pm
## Min. : 0.00 Min. : 2.00 Min. : 0.00 Min. : 0.0
## 1st Qu.: 4.90 1st Qu.: 31.00 1st Qu.: 7.00 1st Qu.:13.0
## Median : 8.50 Median : 39.00 Median :13.00 Median :19.0
## Mean : 7.65 Mean : 40.18 Mean :14.07 Mean :18.7
## 3rd Qu.:10.70 3rd Qu.: 48.00 3rd Qu.:19.00 3rd Qu.:24.0
## Max. :14.50 Max. :135.00 Max. :87.00 Max. :87.0
## NA's :104989 NA's :14376 NA's :3351 NA's :6406
## humidity_9am humidity_3pm pressure_9am pressure_3pm
## Min. : 0.0 Min. : 0.00 Min. : 979.1 Min. : 978.9
## 1st Qu.: 56.0 1st Qu.: 35.00 1st Qu.:1013.0 1st Qu.:1010.5
## Median : 69.0 Median : 51.00 Median :1017.7 Median :1015.3
## Mean : 68.4 Mean : 50.89 Mean :1017.8 Mean :1015.3
## 3rd Qu.: 83.0 3rd Qu.: 65.00 3rd Qu.:1022.6 3rd Qu.:1020.2
## Max. :100.0 Max. :100.00 Max. :1041.1 Max. :1040.1
## NA's :3828 NA's :7472 NA's :21098 NA's :21089
## cloud_9am cloud_3pm temp_9am temp_3pm
## Min. :0.00 Min. :0.00 Min. :-6.20 Min. :-5.10
## 1st Qu.:1.00 1st Qu.:2.00 1st Qu.:12.20 1st Qu.:16.60
## Median :5.00 Median :5.00 Median :16.70 Median :21.20
## Mean :4.57 Mean :4.58 Mean :16.97 Mean :21.75
## 3rd Qu.:7.00 3rd Qu.:7.00 3rd Qu.:21.60 3rd Qu.:26.50
## Max. :9.00 Max. :9.00 Max. :40.20 Max. :48.20
## NA's :80359 NA's :86239 NA's :2832 NA's :6472
## risk_mm
## Min. : 0.000
## 1st Qu.: 0.000
## Median : 0.000
## Mean : 2.254
## 3rd Qu.: 0.600
## Max. :474.000
## NA's :5038
Reviewing this information, we can make some obvious observations. Temperatures, for example, appear to be in degrees Celsius rather than Fahrenheit. Rainfall looks to be in millimetres. Some distributions are quite skewed, with a small minimum and median but a large maximum: rainfall, for example, has a median of 0 and a mean of 2.254 but a maximum of 474. As data scientists we will further explore the distributions as in Chapter 9.
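The observation about skewed distributions can be turned into a quick check. The following is a rough sketch rather than a formal test: it flags a variable as right skewed when its mean is more than twice its median. That threshold is an arbitrary assumption, and the heuristic only makes sense for variables with a non-negative median. The data frame toy and its values are hypothetical and do not come from the weather dataset.

```r
# Hypothetical toy data: one long-tailed variable and one roughly symmetric.
toy <- data.frame(
  rainfall = c(0, 0, 0, 0, 0.2, 0.6, 1.4, 12, 85),
  temp_9am = c(12.1, 14.3, 15.0, 16.7, 16.9, 17.2, 18.8, 20.4, 21.6)
)

# For each column compute the median, mean, and maximum, ignoring NAs.
# The result is a matrix with one row per statistic, one column per variable.
skew_check <- sapply(toy, function(x) {
  c(median = median(x, na.rm = TRUE),
    mean   = mean(x, na.rm = TRUE),
    max    = max(x, na.rm = TRUE))
})

# Assumed rule of thumb: a mean more than twice the median suggests a
# long right tail.
right_skewed <- names(which(skew_check["mean", ] > 2 * skew_check["median", ]))

print(right_skewed)
## [1] "rainfall"
```

On the real data the same idea would be applied to ds[numi], after which the flagged variables are natural candidates for a closer look with a histogram or box plot.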
Copyright © 1995-2021 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0.