Data Science Desktop Survival Guide
by Graham Williams

A Data Frame as a Dataset

20210103 A data frame is essentially a rectangular table (or matrix) of data consisting of rows (observations) and columns (variables). We can use base::print.data.frame() to view the table, here choosing the first 10 observations of the first 6 variables of the ds dataset.

# Display the table structure of the ingested dataset.

ds[1:10, 1:6] %>% print.data.frame()
##          date location min_temp max_temp rainfall evaporation
## 1  2008-12-01   Albury     13.4     22.9      0.6          NA
## 2  2008-12-02   Albury      7.4     25.1      0.0          NA
## 3  2008-12-03   Albury     12.9     25.7      0.0          NA
## 4  2008-12-04   Albury      9.2     28.0      0.0          NA
## 5  2008-12-05   Albury     17.5     32.3      1.0          NA
## 6  2008-12-06   Albury     14.6     29.7      0.2          NA
## 7  2008-12-07   Albury     14.3     25.0      0.0          NA
## 8  2008-12-08   Albury      7.7     26.7      0.0          NA
## 9  2008-12-09   Albury      9.7     31.9      0.0          NA
## 10 2008-12-10   Albury     13.1     30.1      1.4          NA
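
As a minimal, self-contained sketch of the same idea, the following builds a small data frame by hand and inspects its shape. The name toy is hypothetical and its values are copied from the first rows shown above. It merely stands in for ds, which is ingested elsewhere in the guide.

```r
# A hypothetical toy data frame standing in for ds:
# 3 observations (rows) of 4 variables (columns).
toy <- data.frame(
  date     = as.Date(c("2008-12-01", "2008-12-02", "2008-12-03")),
  location = c("Albury", "Albury", "Albury"),
  min_temp = c(13.4, 7.4, 12.9),
  max_temp = c(22.9, 25.1, 25.7)
)

# Count the observations and the variables, and list the variable names.
nrow(toy)
ncol(toy)
names(toy)

# Index as [rows, columns]: the first 2 observations of the first 3 variables.
sub <- toy[1:2, 1:3]
print.data.frame(sub)
```

The [rows, columns] indexing is the same form used for ds[1:10, 1:6] above, only on a smaller table.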

Alternatively, we might use dplyr::sample_n() to sample 10 random observations and dplyr::select() to choose 5 random variables:

# Display a random selection of observations and variables.

ds %>%
  sample_n(10) %>%
  select(sample(1:ncol(ds), 5)) %>%
  print.data.frame()
##    wind_gust_speed max_temp rainfall wind_speed_3pm rain_today
## 1               30     25.1      0.2             11         No
## 2               72     30.7      1.0             33         No
## 3               56     14.9      1.6             20        Yes
## 4               33     28.8      0.0             20         No
## 5               37     31.3      0.0             20         No
## 6               35     35.7      0.0             15         No
## 7               24     15.5      0.0             15         No
## 8               22     22.5      0.0              6         No
## 9               31     20.0      0.4              4         No
## 10              35     21.9      0.0             17         No
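
Because the selection is random, each run will generally display different observations and variables. One way to make the selection repeatable is to fix the random seed first. The sketch below assumes base R only rather than dplyr, and uses a hypothetical data frame named toy in place of ds. The seed value 42 is arbitrary.

```r
# Hypothetical stand-in for ds: 20 observations of 6 numeric variables.
toy <- as.data.frame(matrix(seq_len(120), nrow = 20, ncol = 6))

# Fix the random seed so the same rows and columns are chosen on every run.
set.seed(42)
rows <- sample(nrow(toy), 10)  # Indices of 10 random observations.
cols <- sample(ncol(toy), 5)   # Indices of 5 random variables.

# Subset by the sampled row and column indices, then display the result.
smp <- toy[rows, cols]
print.data.frame(smp)
```

Calling set.seed() with the same value before sampling again reproduces the same selection.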

This tabular form, with rows as observations and columns as variables, is common in data science, and we refer to it as our dataset.


Copyright © 2000-2020 Togaware Pty Ltd. Creative Commons ShareAlike V4.