9.3 Random Sample
20200317
A common task is to randomly sample rows from a dataset. The dplyr::sample_frac() function will randomly choose a specified fraction (e.g. 20%) of the rows of the dataset:
## # A tibble: 41,699 × 24
## date location min_temp max_temp rainfall evaporation sunshine
## <date> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2020-04-15 Tuggeranong 7.2 24.7 0 NA NA
## 2 2021-06-14 PerthAirport 10.4 18.3 7.2 2 7
## 3 2019-07-09 Tuggeranong -3.1 13.9 0 NA NA
## 4 2011-07-06 MountGambier 6.4 13 8.4 0.6 1.3
## 5 2021-04-14 Townsville 21.8 29.6 0 NA NA
## 6 2009-05-25 Sale 2.2 19.8 0 0.8 3.3
## 7 2014-01-16 Wollongong 17.7 24.9 0 NA NA
## 8 2010-05-02 Ballarat 8 15.3 0 NA NA
## 9 2019-07-11 Albany 11.2 16.4 0 2.4 NA
## 10 2013-10-27 PearceRAAF 12.9 29.4 0 NA 12.4
## # ℹ 41,689 more rows
## # ℹ 17 more variables: wind_gust_dir <ord>, wind_gust_speed <dbl>,
## # wind_dir_9am <ord>, wind_dir_3pm <ord>, wind_speed_9am <dbl>,
## # wind_speed_3pm <dbl>, humidity_9am <int>, humidity_3pm <int>,
## # pressure_9am <dbl>, pressure_3pm <dbl>, cloud_9am <int>, cloud_3pm <int>,
## # temp_9am <dbl>, temp_3pm <dbl>, rain_today <fct>, risk_mm <dbl>,
## # rain_tomorrow <fct>
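The code that produced this output is not shown above, so what follows is only a minimal sketch of such a call, not the original. It assumes the dplyr package is installed and uses the built-in `iris` data frame as a stand-in for the weather dataset; the name `smpl` is illustrative.

```r
library(dplyr)

# iris (150 rows, 5 columns) stands in for the weather dataset in the text.
smpl <- iris %>% sample_frac(0.2)

# A 20% sample of 150 rows has 30 rows; every column is retained.
dim(smpl)
```

Recent dplyr releases mark sample_frac() as superseded by slice_sample(), so `slice_sample(prop = 0.2)` should be an equivalent choice for new code.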
The next time you randomly sample the dataset, the resulting sample will be different:
## # A tibble: 41,699 × 24
## date location min_temp max_temp rainfall evaporation sunshine
## <date> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2009-02-01 Sydney 21.2 28.1 0 9.2 11.1
## 2 2018-12-08 NorfolkIsland 17.2 22.5 0 8 NA
## 3 2012-09-08 MountGinini -3.2 2.6 0 NA NA
## 4 2010-08-27 Mildura 8.8 15.3 0.6 2.8 7.2
## 5 2009-07-10 Sydney 9.5 18.6 8.8 1.8 8.4
## 6 2021-09-05 Cairns 20.8 29.2 0.6 NA NA
## 7 2020-12-21 NorfolkIsland 18.9 24.7 0 7.4 NA
## 8 2015-04-15 CoffsHarbour 15.3 25.6 0 NA NA
## 9 2017-04-21 MelbourneAirport 15.4 19.1 7.6 4 0.1
## 10 2009-12-17 NorfolkIsland 17.2 22.7 0 11 9.5
## # ℹ 41,689 more rows
## # ℹ 17 more variables: wind_gust_dir <ord>, wind_gust_speed <dbl>,
## # wind_dir_9am <ord>, wind_dir_3pm <ord>, wind_speed_9am <dbl>,
## # wind_speed_3pm <dbl>, humidity_9am <int>, humidity_3pm <int>,
## # pressure_9am <dbl>, pressure_3pm <dbl>, cloud_9am <int>, cloud_3pm <int>,
## # temp_9am <dbl>, temp_3pm <dbl>, rain_today <fct>, risk_mm <dbl>,
## # rain_tomorrow <fct>
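One way to see this variation for yourself is to draw two samples without fixing a seed and compare which rows were chosen. This is a sketch under the same assumptions as before: dplyr is installed, `iris` stands in for the weather dataset, and the `id` column and the names `ds`, `first`, and `second` are illustrative.

```r
library(dplyr)

# Add an explicit row identifier so the two samples can be compared.
ds <- iris %>% mutate(id = row_number())

# Two independent 20% samples, with no seed set in between.
first  <- ds %>% sample_frac(0.2)
second <- ds %>% sample_frac(0.2)

# Almost certainly FALSE: each call draws a fresh set of rows.
identical(first$id, second$id)
```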
To ensure the same random sample each time, use base::set.seed():
## # A tibble: 41,699 × 24
## date location min_temp max_temp rainfall evaporation sunshine
## <date> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2017-03-15 Mildura 19.8 34.1 0 8 10.5
## 2 2020-12-25 Launceston 5.7 20.6 0 NA NA
## 3 2012-04-29 Albany 16 25.2 0.2 1.2 2.7
## 4 2021-12-31 BadgerysCreek 13 29.9 0 NA NA
## 5 2021-08-29 WaggaWagga 3.7 15.3 2.2 NA NA
## 6 2009-06-18 Woomera 6.3 17.3 0 2.2 6.7
## 7 2018-10-14 AliceSprings 21.5 38.5 0.2 NA NA
## 8 2009-06-16 MelbourneAirport 4.7 13.4 0 0.8 4.7
## 9 2010-05-05 Bendigo 2.7 14.7 4.6 2.4 NA
## 10 2020-10-07 Perth 7.9 23.2 0 5.4 8.6
## # ℹ 41,689 more rows
## # ℹ 17 more variables: wind_gust_dir <ord>, wind_gust_speed <dbl>,
## # wind_dir_9am <ord>, wind_dir_3pm <ord>, wind_speed_9am <dbl>,
## # wind_speed_3pm <dbl>, humidity_9am <int>, humidity_3pm <int>,
## # pressure_9am <dbl>, pressure_3pm <dbl>, cloud_9am <int>, cloud_3pm <int>,
## # temp_9am <dbl>, temp_3pm <dbl>, rain_today <fct>, risk_mm <dbl>,
## # rain_tomorrow <fct>
Setting the same seed and sampling again returns the identical sample:
## # A tibble: 41,699 × 24
## date location min_temp max_temp rainfall evaporation sunshine
## <date> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2017-03-15 Mildura 19.8 34.1 0 8 10.5
## 2 2020-12-25 Launceston 5.7 20.6 0 NA NA
## 3 2012-04-29 Albany 16 25.2 0.2 1.2 2.7
## 4 2021-12-31 BadgerysCreek 13 29.9 0 NA NA
## 5 2021-08-29 WaggaWagga 3.7 15.3 2.2 NA NA
## 6 2009-06-18 Woomera 6.3 17.3 0 2.2 6.7
## 7 2018-10-14 AliceSprings 21.5 38.5 0.2 NA NA
## 8 2009-06-16 MelbourneAirport 4.7 13.4 0 0.8 4.7
## 9 2010-05-05 Bendigo 2.7 14.7 4.6 2.4 NA
## 10 2020-10-07 Perth 7.9 23.2 0 5.4 8.6
## # ℹ 41,689 more rows
## # ℹ 17 more variables: wind_gust_dir <ord>, wind_gust_speed <dbl>,
## # wind_dir_9am <ord>, wind_dir_3pm <ord>, wind_speed_9am <dbl>,
## # wind_speed_3pm <dbl>, humidity_9am <int>, humidity_3pm <int>,
## # pressure_9am <dbl>, pressure_3pm <dbl>, cloud_9am <int>, cloud_3pm <int>,
## # temp_9am <dbl>, temp_3pm <dbl>, rain_today <fct>, risk_mm <dbl>,
## # rain_tomorrow <fct>
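The reproducibility seen in the two identical outputs above can be sketched as follows. The seed value 42 is an arbitrary assumption (the seed behind the output above is not shown), and `iris` again stands in for the weather dataset.

```r
library(dplyr)

set.seed(42)  # 42 is an arbitrary choice of seed
first <- iris %>% sample_frac(0.2)

set.seed(42)  # reset to the same seed before sampling again
second <- iris %>% sample_frac(0.2)

# TRUE: the same seed yields the same rows in the same order.
identical(first, second)
```

The seed has to be set immediately before each sampling call. Setting it once and then sampling twice gives two different samples, because the first call advances the state of the random number generator.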
Copyright © 1995-2022 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0