9.3 Random Sample

20200317

A common task is to randomly sample rows from a dataset. The dplyr::sample_frac() function randomly chooses a specified fraction (e.g., 20%) of the rows:

ds %>% sample_frac(0.2)
## # A tibble: 41,699 × 24
##    date       location     min_temp max_temp rainfall evaporation sunshine
##    <date>     <chr>           <dbl>    <dbl>    <dbl>       <dbl>    <dbl>
##  1 2020-04-15 Tuggeranong       7.2     24.7      0          NA       NA  
##  2 2021-06-14 PerthAirport     10.4     18.3      7.2         2        7  
##  3 2019-07-09 Tuggeranong      -3.1     13.9      0          NA       NA  
##  4 2011-07-06 MountGambier      6.4     13        8.4         0.6      1.3
##  5 2021-04-14 Townsville       21.8     29.6      0          NA       NA  
##  6 2009-05-25 Sale              2.2     19.8      0           0.8      3.3
##  7 2014-01-16 Wollongong       17.7     24.9      0          NA       NA  
##  8 2010-05-02 Ballarat          8       15.3      0          NA       NA  
##  9 2019-07-11 Albany           11.2     16.4      0           2.4     NA  
## 10 2013-10-27 PearceRAAF       12.9     29.4      0          NA       12.4
## # ℹ 41,689 more rows
## # ℹ 17 more variables: wind_gust_dir <ord>, wind_gust_speed <dbl>,
## #   wind_dir_9am <ord>, wind_dir_3pm <ord>, wind_speed_9am <dbl>,
## #   wind_speed_3pm <dbl>, humidity_9am <int>, humidity_3pm <int>,
## #   pressure_9am <dbl>, pressure_3pm <dbl>, cloud_9am <int>, cloud_3pm <int>,
## #   temp_9am <dbl>, temp_3pm <dbl>, rain_today <fct>, risk_mm <dbl>,
## #   rain_tomorrow <fct>
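
Recent dplyr releases document sample_frac() as superseded by dplyr::slice_sample(). The following is a minimal sketch, assuming dplyr 1.0.0 or later is installed. The tibble toy is a hypothetical stand-in for ds, made up here so the example is self-contained:

```r
library(dplyr)

# Hypothetical stand-in for ds: 100 rows with an id and a numeric value.
toy <- tibble(id = 1:100, value = (1:100) / 10)

# Sample 20% of the rows, the same request as sample_frac(0.2).
frac_sample <- toy %>% slice_sample(prop = 0.2)

# Sample a fixed number of rows instead of a fraction.
n_sample <- toy %>% slice_sample(n = 5)

nrow(frac_sample)
nrow(n_sample)
```

By default both functions sample without replacement, so no row appears twice in the result.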

Each call draws a fresh sample, so the next time you sample the dataset the resulting rows will be different:

ds %>% sample_frac(0.2)
## # A tibble: 41,699 × 24
##    date       location         min_temp max_temp rainfall evaporation sunshine
##    <date>     <chr>               <dbl>    <dbl>    <dbl>       <dbl>    <dbl>
##  1 2009-02-01 Sydney               21.2     28.1      0           9.2     11.1
##  2 2018-12-08 NorfolkIsland        17.2     22.5      0           8       NA  
##  3 2012-09-08 MountGinini          -3.2      2.6      0          NA       NA  
##  4 2010-08-27 Mildura               8.8     15.3      0.6         2.8      7.2
##  5 2009-07-10 Sydney                9.5     18.6      8.8         1.8      8.4
##  6 2021-09-05 Cairns               20.8     29.2      0.6        NA       NA  
##  7 2020-12-21 NorfolkIsland        18.9     24.7      0           7.4     NA  
##  8 2015-04-15 CoffsHarbour         15.3     25.6      0          NA       NA  
##  9 2017-04-21 MelbourneAirport     15.4     19.1      7.6         4        0.1
## 10 2009-12-17 NorfolkIsland        17.2     22.7      0          11        9.5
## # ℹ 41,689 more rows
## # ℹ 17 more variables: wind_gust_dir <ord>, wind_gust_speed <dbl>,
## #   wind_dir_9am <ord>, wind_dir_3pm <ord>, wind_speed_9am <dbl>,
## #   wind_speed_3pm <dbl>, humidity_9am <int>, humidity_3pm <int>,
## #   pressure_9am <dbl>, pressure_3pm <dbl>, cloud_9am <int>, cloud_3pm <int>,
## #   temp_9am <dbl>, temp_3pm <dbl>, rain_today <fct>, risk_mm <dbl>,
## #   rain_tomorrow <fct>

To obtain the same random sample each time, call base::set.seed() with a fixed value before sampling:

set.seed(72346)
ds %>% sample_frac(0.2)
## # A tibble: 41,699 × 24
##    date       location         min_temp max_temp rainfall evaporation sunshine
##    <date>     <chr>               <dbl>    <dbl>    <dbl>       <dbl>    <dbl>
##  1 2017-03-15 Mildura              19.8     34.1      0           8       10.5
##  2 2020-12-25 Launceston            5.7     20.6      0          NA       NA  
##  3 2012-04-29 Albany               16       25.2      0.2         1.2      2.7
##  4 2021-12-31 BadgerysCreek        13       29.9      0          NA       NA  
##  5 2021-08-29 WaggaWagga            3.7     15.3      2.2        NA       NA  
##  6 2009-06-18 Woomera               6.3     17.3      0           2.2      6.7
##  7 2018-10-14 AliceSprings         21.5     38.5      0.2        NA       NA  
##  8 2009-06-16 MelbourneAirport      4.7     13.4      0           0.8      4.7
##  9 2010-05-05 Bendigo               2.7     14.7      4.6         2.4     NA  
## 10 2020-10-07 Perth                 7.9     23.2      0           5.4      8.6
## # ℹ 41,689 more rows
## # ℹ 17 more variables: wind_gust_dir <ord>, wind_gust_speed <dbl>,
## #   wind_dir_9am <ord>, wind_dir_3pm <ord>, wind_speed_9am <dbl>,
## #   wind_speed_3pm <dbl>, humidity_9am <int>, humidity_3pm <int>,
## #   pressure_9am <dbl>, pressure_3pm <dbl>, cloud_9am <int>, cloud_3pm <int>,
## #   temp_9am <dbl>, temp_3pm <dbl>, rain_today <fct>, risk_mm <dbl>,
## #   rain_tomorrow <fct>

Resetting the seed to the same value and sampling again returns exactly the same rows:

set.seed(72346)
ds %>% sample_frac(0.2)
## # A tibble: 41,699 × 24
##    date       location         min_temp max_temp rainfall evaporation sunshine
##    <date>     <chr>               <dbl>    <dbl>    <dbl>       <dbl>    <dbl>
##  1 2017-03-15 Mildura              19.8     34.1      0           8       10.5
##  2 2020-12-25 Launceston            5.7     20.6      0          NA       NA  
##  3 2012-04-29 Albany               16       25.2      0.2         1.2      2.7
##  4 2021-12-31 BadgerysCreek        13       29.9      0          NA       NA  
##  5 2021-08-29 WaggaWagga            3.7     15.3      2.2        NA       NA  
##  6 2009-06-18 Woomera               6.3     17.3      0           2.2      6.7
##  7 2018-10-14 AliceSprings         21.5     38.5      0.2        NA       NA  
##  8 2009-06-16 MelbourneAirport      4.7     13.4      0           0.8      4.7
##  9 2010-05-05 Bendigo               2.7     14.7      4.6         2.4     NA  
## 10 2020-10-07 Perth                 7.9     23.2      0           5.4      8.6
## # ℹ 41,689 more rows
## # ℹ 17 more variables: wind_gust_dir <ord>, wind_gust_speed <dbl>,
## #   wind_dir_9am <ord>, wind_dir_3pm <ord>, wind_speed_9am <dbl>,
## #   wind_speed_3pm <dbl>, humidity_9am <int>, humidity_3pm <int>,
## #   pressure_9am <dbl>, pressure_3pm <dbl>, cloud_9am <int>, cloud_3pm <int>,
## #   temp_9am <dbl>, temp_3pm <dbl>, rain_today <fct>, risk_mm <dbl>,
## #   rain_tomorrow <fct>
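
The reproducibility can also be checked programmatically rather than by eye. This is a minimal sketch using a hypothetical toy tibble in place of ds: resetting the seed before each call should give identical samples, while a further call without resetting continues from the generator's current state.

```r
library(dplyr)

# Hypothetical stand-in for ds.
toy <- tibble(id = 1:100, value = (1:100) / 10)

set.seed(72346)
first <- toy %>% sample_frac(0.2)

set.seed(72346)
second <- toy %>% sample_frac(0.2)

# No seed reset here: the generator carries on from where it left off.
third <- toy %>% sample_frac(0.2)

# Compare the seeded pair, then the seeded and unseeded draws.
identical(first, second)
identical(first, third)
```

The seed value itself is arbitrary. What matters is that it is set immediately before the sampling call, since any other use of the random number generator in between advances its state.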


Copyright © 1995-2022 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0