3.1 A Data Frame as a Dataset

20210103 A data frame is essentially a rectangular table (or matrix) of data consisting of rows (observations) and columns (variables). We can use base::print.data.frame() to view such a table, here choosing the first 10 observations of the first 6 variables of the ds dataset.

# Display the table structure of the ingested dataset.

ds[1:10,1:6] %>% print.data.frame()

##          date location min_temp max_temp rainfall evaporation
## 1  2008-12-01   Albury     13.4     22.9      0.6          NA
## 2  2008-12-02   Albury      7.4     25.1      0.0          NA
## 3  2008-12-03   Albury     12.9     25.7      0.0          NA
## 4  2008-12-04   Albury      9.2     28.0      0.0          NA
## 5  2008-12-05   Albury     17.5     32.3      1.0          NA
## 6  2008-12-06   Albury     14.6     29.7      0.2          NA
## 7  2008-12-07   Albury     14.3     25.0      0.0          NA
## 8  2008-12-08   Albury      7.7     26.7      0.0          NA
## 9  2008-12-09   Albury      9.7     31.9      0.0          NA
## 10 2008-12-10   Albury     13.1     30.1      1.4          NA
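The row and column indexing used above can be tried on a small stand-in table. The following is a minimal sketch in base R, assuming a hypothetical data frame named toy built from the first three observations printed above; it is not the ds dataset itself.

```r
# A small stand-in for ds, built from the first three rows shown above.
# The names `toy` and `sub` are hypothetical and used only for illustration.
toy <- data.frame(
  date     = as.Date(c("2008-12-01", "2008-12-02", "2008-12-03")),
  location = c("Albury", "Albury", "Albury"),
  min_temp = c(13.4, 7.4, 12.9),
  max_temp = c(22.9, 25.1, 25.7),
  rainfall = c(0.6, 0.0, 0.0)
)

# Rows are observations and columns are variables.
dim(toy)    # number of rows, then number of columns
names(toy)  # the variable names

# Square brackets take [rows, columns], so this keeps the first two
# observations of the first three variables.
sub <- toy[1:2, 1:3]
print.data.frame(sub)
```

The same [rows, columns] pattern is what selects the first 10 observations and first 6 variables of ds.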

Alternatively, we might sample 10 random observations with dplyr::sample_n() and keep 5 randomly chosen variables with dplyr::select():

# Display a random selection of observations and variables.

ds %>%
  sample_n(10) %>%
  select(sample(1:ncol(ds), 5)) %>%
  print.data.frame()

##    humidity_3pm min_temp wind_gust_speed max_temp rainfall
## 1            75     11.5              39     16.8     20.2
## 2            68     13.6              59     17.9      8.2
## 3            23     21.0              74     34.1      0.0
## 4            54     15.9              46     26.0      6.8
## 5            43      1.8              22     20.0      0.0
## 6            36      9.9              43     26.8      0.0
## 7            16     18.9              39     28.2      0.0
## 8            40     12.9              35     31.0      0.0
## 9            12      6.5              37     32.2      0.0
## 10           44     10.2              46     18.4      0.4
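Because both the observations and the variables are drawn at random, each run shows a different table. The sketch below shows one way to make such a sample repeatable with base::set.seed(). It assumes dplyr 1.0.0 or later is installed and uses dplyr::slice_sample(), which the dplyr documentation recommends in place of the superseded sample_n(). The data frame toy and the seed value are hypothetical and chosen only for illustration.

```r
library(dplyr)

# Hypothetical stand-in for ds, using values from the first table above.
toy <- data.frame(
  location = c("Albury", "Albury", "Albury", "Albury"),
  min_temp = c(13.4, 7.4, 12.9, 9.2),
  max_temp = c(22.9, 25.1, 25.7, 28.0),
  rainfall = c(0.6, 0.0, 0.0, 0.0)
)

# Fix the random seed so the same rows and columns are chosen on every run.
# The value 42 is arbitrary.
set.seed(42)

# Choose two variable names at random, then two observations at random.
vars <- sample(names(toy), 2)

smp <- toy %>%
  slice_sample(n = 2) %>%
  select(all_of(vars))

print.data.frame(smp)
```

Drawing the variable names first and passing them through all_of() keeps the selection explicit, which can be easier to read than sampling column positions inside select().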

This tabular form, with rows as observations and columns as variables, is common in data science, and we refer to it as our dataset.

Copyright © 1995-2022 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0