9.3 Random Sample

20200317

A common task is to randomly sample rows from a dataset. The dplyr::sample_frac() function will randomly choose a specified fraction (e.g. 20%) of the rows of the dataset:

ds %>% sample_frac(0.2)
## # A tibble: 45,374 × 24
##    date       location     min_temp max_temp rainfall evaporation sunshine
##    <date>     <chr>           <dbl>    <dbl>    <dbl>       <dbl>    <dbl>
##  1 2018-08-06 Williamtown       4.9     16.7      0          NA       NA  
##  2 2023-03-11 MountGambier      8.3     20.9      0.2        NA       NA  
##  3 2010-01-25 Ballarat         10       27.9      0          NA       NA  
##  4 2009-04-13 Albany           17.5     23        0           4        9.1
##  5 2017-11-28 WaggaWagga       17.4     32.7      0.6         5.2     NA  
##  6 2010-12-11 Witchcliffe       9.3     23.5      0          NA       NA  
##  7 2014-07-30 Wollongong       12.7     22.4     NA          NA       NA  
##  8 2015-12-17 Hobart           15.4     26.5      0           9.4     10.9
##  9 2012-10-02 Bendigo           2.3     22        0          NA       NA  
## 10 2010-07-01 Penrith           1.4     16.6      0          NA       NA  
## # ℹ 45,364 more rows
## # ℹ 17 more variables: wind_gust_dir <ord>, wind_gust_speed <dbl>,
## #   wind_dir_9am <ord>, wind_dir_3pm <ord>, wind_speed_9am <dbl>,
## #   wind_speed_3pm <dbl>, humidity_9am <int>, humidity_3pm <int>,
## #   pressure_9am <dbl>, pressure_3pm <dbl>, cloud_9am <int>, cloud_3pm <int>,
## #   temp_9am <dbl>, temp_3pm <dbl>, rain_today <fct>, risk_mm <dbl>,
## #   rain_tomorrow <fct>
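
The same call can be tried on a small dataset where the row count is easy to check. The following is a minimal sketch, assuming dplyr is installed; `toy` is a hypothetical 100-row stand-in for `ds` and is not part of the original example. It also shows dplyr::slice_sample(), which the dplyr documentation recommends over the now superseded sample_frac() for new code.

```r
library(dplyr)

# Hypothetical stand-in for ds: 100 rows with an id and a random value.
toy <- tibble(id = 1:100, value = rnorm(100))

# Randomly keep 20% of the rows (sampling without replacement by default).
s1 <- toy %>% sample_frac(0.2)
nrow(s1)

# slice_sample() is the newer dplyr verb for the same task.
s2 <- toy %>% slice_sample(prop = 0.2)
nrow(s2)
```

For this 100-row toy dataset both calls should return 20 rows, each row appearing at most once in a sample.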

Each call draws a new random sample, so the next time you sample the dataset the resulting rows will be different:

ds %>% sample_frac(0.2)
## # A tibble: 45,374 × 24
##    date       location    min_temp max_temp rainfall evaporation sunshine
##    <date>     <chr>          <dbl>    <dbl>    <dbl>       <dbl>    <dbl>
##  1 2017-11-12 Cobar           17.3     32.2      0          NA       NA  
##  2 2011-03-03 NorahHead       18.4     30.8      0          NA       NA  
##  3 2019-03-09 MountGinini      8.9     19.4      3          NA       NA  
##  4 2020-07-06 Hobart           7.9     13.2      0           1       NA  
##  5 2018-03-29 Portland        12.2     21        0          NA       NA  
##  6 2010-11-08 GoldCoast       20.8     26.6      2.2        NA       NA  
##  7 2021-02-03 Sale             9       20.2      0          NA       NA  
##  8 2008-09-12 Adelaide        10.6     21.3      0           4.4      9.4
##  9 2014-03-31 Townsville      24.3     30.4      0           5.8     10.9
## 10 2014-01-03 WaggaWagga      19.4     35.1      0           6.4      6  
## # ℹ 45,364 more rows
## # ℹ 17 more variables: wind_gust_dir <ord>, wind_gust_speed <dbl>,
## #   wind_dir_9am <ord>, wind_dir_3pm <ord>, wind_speed_9am <dbl>,
## #   wind_speed_3pm <dbl>, humidity_9am <int>, humidity_3pm <int>,
## #   pressure_9am <dbl>, pressure_3pm <dbl>, cloud_9am <int>, cloud_3pm <int>,
## #   temp_9am <dbl>, temp_3pm <dbl>, rain_today <fct>, risk_mm <dbl>,
## #   rain_tomorrow <fct>

To obtain the same random sample each time, set the random seed with base::set.seed() before sampling:

set.seed(72346)
ds %>% sample_frac(0.2)
## # A tibble: 45,374 × 24
##    date       location         min_temp max_temp rainfall evaporation sunshine
##    <date>     <chr>               <dbl>    <dbl>    <dbl>       <dbl>    <dbl>
##  1 2019-12-14 MelbourneAirport     13.4     22        0           9.4      2.5
##  2 2014-02-02 SalmonGums           16.4     26.2      7.2        NA       NA  
##  3 2014-11-10 MountGambier          6.8     20.4      0           5.6      9.9
##  4 2020-12-05 BadgerysCreek        16.4     24.7     NA          NA       NA  
##  5 2022-10-13 SydneyAirport        15.4     22.3      0           3.2      3.4
##  6 2010-04-02 Adelaide             16.1     26.7      0          NA       10.8
##  7 2010-10-08 Walpole              11       27.4      0          NA       NA  
##  8 2014-01-23 Bendigo              13.8     33.9      0           8.2     NA  
##  9 2018-05-03 MountGinini           5.4     13.7      0          NA       NA  
## 10 2018-12-06 Witchcliffe          11.9     20.5     15.4        NA       NA  
## # ℹ 45,364 more rows
## # ℹ 17 more variables: wind_gust_dir <ord>, wind_gust_speed <dbl>,
## #   wind_dir_9am <ord>, wind_dir_3pm <ord>, wind_speed_9am <dbl>,
## #   wind_speed_3pm <dbl>, humidity_9am <int>, humidity_3pm <int>,
## #   pressure_9am <dbl>, pressure_3pm <dbl>, cloud_9am <int>, cloud_3pm <int>,
## #   temp_9am <dbl>, temp_3pm <dbl>, rain_today <fct>, risk_mm <dbl>,
## #   rain_tomorrow <fct>

Resetting the seed to the same value and repeating the call returns exactly the same rows:

set.seed(72346)
ds %>% sample_frac(0.2)
## # A tibble: 45,374 × 24
##    date       location         min_temp max_temp rainfall evaporation sunshine
##    <date>     <chr>               <dbl>    <dbl>    <dbl>       <dbl>    <dbl>
##  1 2019-12-14 MelbourneAirport     13.4     22        0           9.4      2.5
##  2 2014-02-02 SalmonGums           16.4     26.2      7.2        NA       NA  
##  3 2014-11-10 MountGambier          6.8     20.4      0           5.6      9.9
##  4 2020-12-05 BadgerysCreek        16.4     24.7     NA          NA       NA  
##  5 2022-10-13 SydneyAirport        15.4     22.3      0           3.2      3.4
##  6 2010-04-02 Adelaide             16.1     26.7      0          NA       10.8
##  7 2010-10-08 Walpole              11       27.4      0          NA       NA  
##  8 2014-01-23 Bendigo              13.8     33.9      0           8.2     NA  
##  9 2018-05-03 MountGinini           5.4     13.7      0          NA       NA  
## 10 2018-12-06 Witchcliffe          11.9     20.5     15.4        NA       NA  
## # ℹ 45,364 more rows
## # ℹ 17 more variables: wind_gust_dir <ord>, wind_gust_speed <dbl>,
## #   wind_dir_9am <ord>, wind_dir_3pm <ord>, wind_speed_9am <dbl>,
## #   wind_speed_3pm <dbl>, humidity_9am <int>, humidity_3pm <int>,
## #   pressure_9am <dbl>, pressure_3pm <dbl>, cloud_9am <int>, cloud_3pm <int>,
## #   temp_9am <dbl>, temp_3pm <dbl>, rain_today <fct>, risk_mm <dbl>,
## #   rain_tomorrow <fct>
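
Rather than comparing the printed rows by eye, the effect of the seed can be checked programmatically. This is a minimal sketch, assuming dplyr is installed; `toy`, `first`, `second`, and `third` are hypothetical names introduced only for illustration.

```r
library(dplyr)

# Hypothetical stand-in for ds.
toy <- tibble(id = 1:100)

# Same seed before each call: the two samples should match exactly.
set.seed(72346)
first <- toy %>% sample_frac(0.2)

set.seed(72346)
second <- toy %>% sample_frac(0.2)

# No reseeding here: the generator has moved on, so a new sample is drawn.
third <- toy %>% sample_frac(0.2)

identical(first, second)
identical(first, third)
```

The first comparison should be TRUE and the second FALSE. The seed value itself is arbitrary; what matters is that set.seed() is called with the same value immediately before each sampling step that needs to be reproduced.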


Copyright © 1995-2022 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0