Data Science Desktop Survival Guide
by Graham Williams

Random Sample

20200317

A common task is to randomly sample rows from a dataset. The dplyr::sample_frac() function will randomly choose a specified fraction (e.g. 20%) of the rows of the dataset:

ds %>% sample_frac(0.2)
## # A tibble: 35,349 x 24
##    date       location min_temp max_temp rainfall evaporation sunshine
##    <date>     <chr>       <dbl>    <dbl>    <dbl>       <dbl>    <dbl>
##  1 2010-09-08 Cairns       23.5     30.2      0           7        7.8
##  2 2011-10-25 Mildura       8.7     21.5     13.4         4.4     12.2
##  3 2017-06-24 Launces~     -0.5     11.4      8.4        NA       NA  
....
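
As a self-contained sketch, the same call can be tried on a dataset that ships with R. Here the built-in iris data frame stands in for ds (an assumption for illustration; it is not the weather dataset used above), and ds_demo is a hypothetical name. With 150 rows, a 20% sample should contain 30 rows. In dplyr 1.0.0 and later, sample_frac() is superseded by slice_sample(), shown alongside for comparison:

```r
library(dplyr)

# iris is a stand-in for ds (assumption: any data frame works here).
ds_demo <- as_tibble(iris)

# Randomly choose 20% of the rows, without replacement by default.
smpl <- ds_demo %>% sample_frac(0.2)
nrow(smpl)

# The same idea in dplyr >= 1.0.0, where sample_frac() is superseded.
smpl2 <- ds_demo %>% slice_sample(prop = 0.2)
nrow(smpl2)
```

Both calls keep every column and only reduce the number of rows.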

Because no random seed has been set, the next time we randomly sample the dataset the resulting sample will be different:

ds %>% sample_frac(0.2)
## # A tibble: 35,349 x 24
##    date       location min_temp max_temp rainfall evaporation sunshine
##    <date>     <chr>       <dbl>    <dbl>    <dbl>       <dbl>    <dbl>
##  1 2011-03-22 Badgery~     19.2     32       13.2        NA       NA  
##  2 2011-05-26 Newcast~      9.9     17.2      0          NA       NA  
##  3 2013-07-04 AliceSp~      0.3     23.2      0           4       10.5
....

To ensure the same random sample each time, set the seed with base::set.seed() immediately before sampling:

set.seed(72346)
ds %>% sample_frac(0.2)
## # A tibble: 35,349 x 24
##    date       location min_temp max_temp rainfall evaporation sunshine
##    <date>     <chr>       <dbl>    <dbl>    <dbl>       <dbl>    <dbl>
##  1 2010-07-20 Brisbane     13.4     20.7      1           0.8     NA  
##  2 2015-08-24 Walpole       6       15.7      3.2        NA       NA  
##  3 2012-08-02 Cobar         1.6     17.2      0           2.4     NA  
....

set.seed(72346)
ds %>% sample_frac(0.2)
## # A tibble: 35,349 x 24
##    date       location min_temp max_temp rainfall evaporation sunshine
##    <date>     <chr>       <dbl>    <dbl>    <dbl>       <dbl>    <dbl>
##  1 2010-07-20 Brisbane     13.4     20.7      1           0.8     NA  
##  2 2015-08-24 Walpole       6       15.7      3.2        NA       NA  
##  3 2012-08-02 Cobar         1.6     17.2      0           2.4     NA  
....
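
Rather than comparing the printed rows by eye, the two samples can be compared directly. This is a minimal sketch, again using the built-in iris data frame as a stand-in for ds (an assumption) and reusing the seed from above; ds_demo, s1 and s2 are hypothetical names:

```r
library(dplyr)

# iris is a stand-in for ds (assumption for illustration only).
ds_demo <- as_tibble(iris)

# Reset the seed immediately before each draw.
set.seed(72346)
s1 <- ds_demo %>% sample_frac(0.2)

set.seed(72346)
s2 <- ds_demo %>% sample_frac(0.2)

# TRUE when both draws selected the same rows in the same order.
identical(s1, s2)
```

The seed needs to be reset before each draw, since any other call that uses the random number generator in between changes its state and so changes the sample.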


Copyright © 2000-2020 Togaware Pty Ltd. Creative Commons ShareAlike V4.