9.3 Random Sample

20200317

A common task is to randomly sample rows from a dataset. The dplyr::sample_frac() function will randomly choose a specified fraction (e.g. 20%) of the rows of the dataset:

ds %>% sample_frac(0.2)

## # A tibble: 45,374 × 24
##    date       location      min_temp max_temp rainfall evaporation sunshine
##    <date>     <chr>            <dbl>    <dbl>    <dbl>       <dbl>    <dbl>
##  1 2014-10-21 Portland           8.5     22.4      0           4.6     12  
##  2 2014-02-21 GoldCoast         24.7     29.8      0          NA       NA  
##  3 2017-06-04 PerthAirport      12.8     22.6      0           1.8      9.1
##  4 2012-09-16 Richmond           4.6     23.3      0          NA       NA  
##  5 2019-07-21 WaggaWagga         2.5     17        0          NA       NA  
##  6 2018-07-07 AliceSprings       0.3     17.4      0          NA       NA  
##  7 2023-03-01 Ballarat           9.9     21.7      0.2        NA       NA  
##  8 2018-10-10 NorfolkIsland     16.3     21        0.2         3.2     NA  
##  9 2009-02-04 Nuriootpa         12.8     35.6      0          11.4     13.1
## 10 2011-07-08 WaggaWagga        -4.7      8.3      0           1.2      2.4
## # ℹ 45,364 more rows
## # ℹ 17 more variables: wind_gust_dir <ord>, wind_gust_speed <dbl>,
## #   wind_dir_9am <ord>, wind_dir_3pm <ord>, wind_speed_9am <dbl>,
## #   wind_speed_3pm <dbl>, humidity_9am <int>, humidity_3pm <int>,
## #   pressure_9am <dbl>, pressure_3pm <dbl>, cloud_9am <int>, cloud_3pm <int>,
## #   temp_9am <dbl>, temp_3pm <dbl>, rain_today <fct>, risk_mm <dbl>,
## #   rain_tomorrow <fct>
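
The behaviour is easier to check on a small dataset. The sketch below assumes a hypothetical 100-row data frame named toy standing in for ds; it is not part of the weather dataset used above. It also shows dplyr::slice_sample(), which dplyr (1.0.0 and later) provides as the newer interface for the same task.

```r
suppressPackageStartupMessages(library(dplyr))

# Hypothetical toy dataset standing in for ds: 100 rows, two columns.
toy <- data.frame(id = 1:100, value = (1:100)^2)

# sample_frac() keeps the requested fraction of rows, chosen at random
# and without replacement by default: 20 of the 100 rows here.
s1 <- toy %>% sample_frac(0.2)
nrow(s1)

# slice_sample() with prop= requests the same fraction of rows.
s2 <- toy %>% slice_sample(prop = 0.2)
nrow(s2)
```

Both calls return 20 rows, although the particular rows chosen will generally differ between the two.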

Each call draws a fresh random sample, so the next time you sample the dataset the result will be different:

ds %>% sample_frac(0.2)

## # A tibble: 45,374 × 24
##    date       location      min_temp max_temp rainfall evaporation sunshine
##    <date>     <chr>            <dbl>    <dbl>    <dbl>       <dbl>    <dbl>
##  1 2015-04-27 Watsonia           9.8     15        4.4         2.8      3.3
##  2 2012-09-12 SydneyAirport     10.8     22.9      0           3.8     10.2
##  3 2019-05-21 Katherine         NA       NA       NA          NA       NA  
##  4 2022-07-09 Bendigo            1.5     12.7      0.4        NA       NA  
##  5 2013-11-09 Newcastle         19.5     36.2      0          NA       NA  
##  6 2023-03-24 WaggaWagga        13.2     28.6     16.6        NA       NA  
##  7 2014-09-06 Penrith           11.2     17.4      1.2        NA       NA  
##  8 2012-04-23 Perth             11.3     23.1      0           5       10.5
##  9 2020-07-28 Walpole           10.4     16       20          NA       NA  
## 10 2021-06-07 Dartmoor          12.5     18.4      2.2        NA       NA  
## # ℹ 45,364 more rows
## # ℹ 17 more variables: wind_gust_dir <ord>, wind_gust_speed <dbl>,
## #   wind_dir_9am <ord>, wind_dir_3pm <ord>, wind_speed_9am <dbl>,
## #   wind_speed_3pm <dbl>, humidity_9am <int>, humidity_3pm <int>,
## #   pressure_9am <dbl>, pressure_3pm <dbl>, cloud_9am <int>, cloud_3pm <int>,
## #   temp_9am <dbl>, temp_3pm <dbl>, rain_today <fct>, risk_mm <dbl>,
## #   rain_tomorrow <fct>

To obtain the same random sample each time, set the random seed with base::set.seed() before sampling:

set.seed(72346)
ds %>% sample_frac(0.2)

## # A tibble: 45,374 × 24
##    date       location         min_temp max_temp rainfall evaporation sunshine
##    <date>     <chr>               <dbl>    <dbl>    <dbl>       <dbl>    <dbl>
##  1 2019-12-14 MelbourneAirport     13.4     22        0           9.4      2.5
##  2 2014-02-02 SalmonGums           16.4     26.2      7.2        NA       NA  
##  3 2014-11-10 MountGambier          6.8     20.4      0           5.6      9.9
##  4 2020-12-05 BadgerysCreek        16.4     24.7     NA          NA       NA  
##  5 2022-10-13 SydneyAirport        15.4     22.3      0           3.2      3.4
##  6 2010-04-02 Adelaide             16.1     26.7      0          NA       10.8
##  7 2010-10-08 Walpole              11       27.4      0          NA       NA  
##  8 2014-01-23 Bendigo              13.8     33.9      0           8.2     NA  
##  9 2018-05-03 MountGinini           5.4     13.7      0          NA       NA  
## 10 2018-12-06 Witchcliffe          11.9     20.5     15.4        NA       NA  
## # ℹ 45,364 more rows
## # ℹ 17 more variables: wind_gust_dir <ord>, wind_gust_speed <dbl>,
## #   wind_dir_9am <ord>, wind_dir_3pm <ord>, wind_speed_9am <dbl>,
## #   wind_speed_3pm <dbl>, humidity_9am <int>, humidity_3pm <int>,
## #   pressure_9am <dbl>, pressure_3pm <dbl>, cloud_9am <int>, cloud_3pm <int>,
## #   temp_9am <dbl>, temp_3pm <dbl>, rain_today <fct>, risk_mm <dbl>,
## #   rain_tomorrow <fct>

Resetting the seed to the same value and sampling again returns exactly the same rows:

set.seed(72346)
ds %>% sample_frac(0.2)

## # A tibble: 45,374 × 24
##    date       location         min_temp max_temp rainfall evaporation sunshine
##    <date>     <chr>               <dbl>    <dbl>    <dbl>       <dbl>    <dbl>
##  1 2019-12-14 MelbourneAirport     13.4     22        0           9.4      2.5
##  2 2014-02-02 SalmonGums           16.4     26.2      7.2        NA       NA  
##  3 2014-11-10 MountGambier          6.8     20.4      0           5.6      9.9
##  4 2020-12-05 BadgerysCreek        16.4     24.7     NA          NA       NA  
##  5 2022-10-13 SydneyAirport        15.4     22.3      0           3.2      3.4
##  6 2010-04-02 Adelaide             16.1     26.7      0          NA       10.8
##  7 2010-10-08 Walpole              11       27.4      0          NA       NA  
##  8 2014-01-23 Bendigo              13.8     33.9      0           8.2     NA  
##  9 2018-05-03 MountGinini           5.4     13.7      0          NA       NA  
## 10 2018-12-06 Witchcliffe          11.9     20.5     15.4        NA       NA  
## # ℹ 45,364 more rows
## # ℹ 17 more variables: wind_gust_dir <ord>, wind_gust_speed <dbl>,
## #   wind_dir_9am <ord>, wind_dir_3pm <ord>, wind_speed_9am <dbl>,
## #   wind_speed_3pm <dbl>, humidity_9am <int>, humidity_3pm <int>,
## #   pressure_9am <dbl>, pressure_3pm <dbl>, cloud_9am <int>, cloud_3pm <int>,
## #   temp_9am <dbl>, temp_3pm <dbl>, rain_today <fct>, risk_mm <dbl>,
## #   rain_tomorrow <fct>
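
Reproducibility can also be confirmed programmatically rather than by comparing printed output. The following is a minimal sketch that again assumes a hypothetical toy data frame in place of ds; the seed value is the one used above.

```r
suppressPackageStartupMessages(library(dplyr))

# Hypothetical toy dataset standing in for ds.
toy <- data.frame(id = 1:100)

# Two samples, each drawn immediately after setting the same seed.
set.seed(72346)
first <- toy %>% sample_frac(0.2)

set.seed(72346)
second <- toy %>% sample_frac(0.2)

# A third sample drawn without resetting the seed.
third <- toy %>% sample_frac(0.2)

# The seeded samples select the same rows in the same order.
identical(first$id, second$id)

# The third sample continues the random number stream instead.
identical(first$id, third$id)
```

The first comparison returns TRUE. The second is expected to return FALSE, since the seed only fixes the starting point of the random number stream and so needs to be set again before every draw that must be repeatable.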

Copyright © 1995-2022 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0