9.3 Random Sample

20200317

A common task is to randomly sample rows from a dataset. The dplyr::sample_frac() function will randomly choose a specified fraction (e.g. 20%) of the rows of the dataset:

ds %>% sample_frac(0.2)
## # A tibble: 38,286 x 24
##    date       location      min_temp max_temp rainfall evaporation sunshine
##    <date>     <chr>            <dbl>    <dbl>    <dbl>       <dbl>    <dbl>
##  1 2013-04-15 Walpole           13.9     23.6      0          NA       NA  
##  2 2017-09-03 Nhil               8.6     14.7      2.2        NA       NA  
##  3 2017-09-13 Brisbane          14.8     27.1      0           5.4      9.4
##  4 2019-01-17 Nhil              23       37.8      0          NA       NA  
##  5 2009-07-22 SalmonGums         5.1     16.6      3.8        NA       NA  
##  6 2015-09-22 Townsville        16.2     29.1      0           6.8     10.1
##  7 2010-12-24 BadgerysCreek     16.8     22.7      0          NA       NA  
##  8 2012-03-21 Cobar             19.2     31.7      0           7.2     NA  
##  9 2013-07-11 Woomera            7.5     20.3      0           3.2      8.4
## 10 2012-10-02 Williamtown        9.3     20.3      0.8         3.6     10.3
## # … with 38,276 more rows, and 17 more variables: wind_gust_dir <ord>,
## #   wind_gust_speed <dbl>, wind_dir_9am <ord>, wind_dir_3pm <ord>,
## #   wind_speed_9am <dbl>, wind_speed_3pm <dbl>, humidity_9am <int>,
## #   humidity_3pm <int>, pressure_9am <dbl>, pressure_3pm <dbl>,
## #   cloud_9am <int>, cloud_3pm <int>, temp_9am <dbl>, temp_3pm <dbl>,
## #   rain_today <fct>, risk_mm <dbl>, rain_tomorrow <fct>
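
The size of the sample follows from the fraction and the number of rows. As a minimal sketch, assuming a small made-up tibble named toy (hypothetical, standing in for ds), the following checks the row count and also shows dplyr::slice_sample(), which dplyr 1.0.0 introduced as the successor to sample_frac():

```r
library(dplyr)

# Hypothetical toy data standing in for ds (not part of the original example).
toy <- tibble(id = 1:100, value = (1:100) * 2)

# Sample 20% of the rows, without replacement by default.
s1 <- toy %>% sample_frac(0.2)

# The newer interface (dplyr >= 1.0.0) expresses the fraction as prop=.
s2 <- toy %>% slice_sample(prop = 0.2)

nrow(s1)
nrow(s2)
```

Both calls should return 20 of the 100 rows, each row appearing at most once.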

Because the rows are chosen at random, sampling the dataset again will return a different sample:

ds %>% sample_frac(0.2)
## # A tibble: 38,286 x 24
##    date       location      min_temp max_temp rainfall evaporation sunshine
##    <date>     <chr>            <dbl>    <dbl>    <dbl>       <dbl>    <dbl>
##  1 2013-11-07 Albany            13.5     17.4      1.8           2      6.3
##  2 2014-10-12 MountGinini        9.1     18.7      0            NA     NA  
##  3 2018-07-07 WaggaWagga         2.6     10.6      6             2     NA  
##  4 2015-12-29 PearceRAAF        19.2     38.8      0            NA     10.1
##  5 2018-11-05 BadgerysCreek     14.5     28.6      0            NA     NA  
##  6 2019-11-23 Sydney            18.7     21.9      6.2           5      0  
##  7 2015-04-19 GoldCoast         20.3     28.9      0            NA     NA  
##  8 2009-07-23 AliceSprings       4.3     21.1      0             5     10.5
##  9 2015-01-22 Launceston        19.4     29.4      0            NA     NA  
## 10 2014-12-26 Canberra          15       24.6      4.8          NA     NA  
## # … with 38,276 more rows, and 17 more variables: wind_gust_dir <ord>,
## #   wind_gust_speed <dbl>, wind_dir_9am <ord>, wind_dir_3pm <ord>,
## #   wind_speed_9am <dbl>, wind_speed_3pm <dbl>, humidity_9am <int>,
## #   humidity_3pm <int>, pressure_9am <dbl>, pressure_3pm <dbl>,
## #   cloud_9am <int>, cloud_3pm <int>, temp_9am <dbl>, temp_3pm <dbl>,
## #   rain_today <fct>, risk_mm <dbl>, rain_tomorrow <fct>

To obtain the same random sample each time, set the random number seed with base::set.seed() before sampling:

set.seed(72346)
ds %>% sample_frac(0.2)
## # A tibble: 38,286 x 24
##    date       location     min_temp max_temp rainfall evaporation sunshine
##    <date>     <chr>           <dbl>    <dbl>    <dbl>       <dbl>    <dbl>
##  1 2009-05-24 Watsonia         10       20.6      0           5.8      0  
##  2 2014-02-05 PerthAirport     20       33.3      0          12       12.7
##  3 2010-11-03 Cobar             9.2     23        0           5.4     NA  
##  4 2009-09-17 Wollongong       15.7     29.7      0.2        NA       NA  
##  5 2009-12-26 PearceRAAF       14.6     37.4      0          NA       11.9
##  6 2019-07-19 Melbourne         7.5     15.7      1.6         3.6      9.1
##  7 2018-12-17 Sale             14.2     27.7      0.2        NA       NA  
##  8 2013-08-29 Launceston        3.5     17.8      0          NA       NA  
##  9 2011-01-23 WaggaWagga       18.2     33.9      3.4         9.8     12.4
## 10 2010-07-20 Dartmoor          2.6     13.2      1.2         1.2      7  
## # … with 38,276 more rows, and 17 more variables: wind_gust_dir <ord>,
## #   wind_gust_speed <dbl>, wind_dir_9am <ord>, wind_dir_3pm <ord>,
## #   wind_speed_9am <dbl>, wind_speed_3pm <dbl>, humidity_9am <int>,
## #   humidity_3pm <int>, pressure_9am <dbl>, pressure_3pm <dbl>,
## #   cloud_9am <int>, cloud_3pm <int>, temp_9am <dbl>, temp_3pm <dbl>,
## #   rain_today <fct>, risk_mm <dbl>, rain_tomorrow <fct>

Resetting the seed to the same value and sampling again returns the identical rows:

set.seed(72346)
ds %>% sample_frac(0.2)
## # A tibble: 38,286 x 24
##    date       location     min_temp max_temp rainfall evaporation sunshine
##    <date>     <chr>           <dbl>    <dbl>    <dbl>       <dbl>    <dbl>
##  1 2009-05-24 Watsonia         10       20.6      0           5.8      0  
##  2 2014-02-05 PerthAirport     20       33.3      0          12       12.7
##  3 2010-11-03 Cobar             9.2     23        0           5.4     NA  
##  4 2009-09-17 Wollongong       15.7     29.7      0.2        NA       NA  
##  5 2009-12-26 PearceRAAF       14.6     37.4      0          NA       11.9
##  6 2019-07-19 Melbourne         7.5     15.7      1.6         3.6      9.1
##  7 2018-12-17 Sale             14.2     27.7      0.2        NA       NA  
##  8 2013-08-29 Launceston        3.5     17.8      0          NA       NA  
##  9 2011-01-23 WaggaWagga       18.2     33.9      3.4         9.8     12.4
## 10 2010-07-20 Dartmoor          2.6     13.2      1.2         1.2      7  
## # … with 38,276 more rows, and 17 more variables: wind_gust_dir <ord>,
## #   wind_gust_speed <dbl>, wind_dir_9am <ord>, wind_dir_3pm <ord>,
## #   wind_speed_9am <dbl>, wind_speed_3pm <dbl>, humidity_9am <int>,
## #   humidity_3pm <int>, pressure_9am <dbl>, pressure_3pm <dbl>,
## #   cloud_9am <int>, cloud_3pm <int>, temp_9am <dbl>, temp_3pm <dbl>,
## #   rain_today <fct>, risk_mm <dbl>, rain_tomorrow <fct>
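
Rather than comparing the printed rows by eye, the two samples can be compared programmatically. The following is a minimal sketch, assuming a small made-up tibble named toy (hypothetical, standing in for ds) and reusing the seed from above:

```r
library(dplyr)

# Hypothetical toy data standing in for ds (not part of the original example).
toy <- tibble(id = 1:100, value = (1:100) * 2)

# Set the seed, then draw the first sample.
set.seed(72346)
first <- toy %>% sample_frac(0.2)

# Reset the same seed before drawing again.
set.seed(72346)
second <- toy %>% sample_frac(0.2)

# TRUE when both draws contain the same rows in the same order.
identical(first, second)
```

The seed needs to be set immediately before each draw that is to be reproduced, since every call that uses the random number generator advances its state.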


Copyright © 1995-2021 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0.