10.42 ID Variables

20180723 From our observations so far we note that the variable date acts as an identifier, as does the variable location. Given a date and a location we have a single observation of the remaining variables, so together these two variables identify each observation. Identifiers are not usually used as independent variables when building predictive models.

# Note any identifiers.

id <- c("date", "location")
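One way to check that these two variables do identify each observation is to confirm that no (date, location) pair is repeated, and then to leave the identifiers out of the candidate input variables. The following is a minimal sketch only: the data frame toy and its columns min_temp and rain_mm are made up to stand in for ds, and the names unique_ids and vars are illustrative assumptions.

```r
# A small made-up dataset standing in for ds (assumption for illustration).
toy <- data.frame(
  date     = as.Date(c("2018-07-01", "2018-07-02", "2018-07-01", "2018-07-02")),
  location = c("Canberra", "Canberra", "Melbourne", "Melbourne"),
  min_temp = c(-1.2, 0.4, 6.1, 7.3),
  rain_mm  = c(0.0, 1.2, 3.4, 0.0)
)

# Note any identifiers.
id <- c("date", "location")

# TRUE if every (date, location) pair appears at most once.
unique_ids <- anyDuplicated(toy[id]) == 0

# Candidate input variables: everything except the identifiers.
vars <- setdiff(names(toy), id)

print(unique_ids)
print(vars)
```

If the check were FALSE for the real dataset, the pair would not be a true identifier and the duplicated rows would be worth investigating before modelling.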

To get a sense of how this works, the following lists a random sample of 10 locations together with the number of days, and the approximate number of years, over which observations have been collected for each.

ds[id] %>%
  group_by(location) %>%
  count() %>%
  rename(days=n) %>%
  mutate(years=round(days/365)) %>%
  as.data.frame() %>%
  sample_n(10)
##            location days years
## 1      AliceSprings 3984    11
## 2        WaggaWagga 3953    11
## 3      MountGambier 3983    11
## 4          Ballarat 3984    11
## 5          Dartmoor 3953    11
## 6  MelbourneAirport 3953    11
## 7         Melbourne 4137    11
## 8     BadgerysCreek 3936    11
## 9     NorfolkIsland 3953    11
## 10         Canberra 4380    12

The data for each location ranges in length from 7 years up to 12 years, though most have 11 years of data.

ds[id] %>%
  group_by(location) %>%
  count() %>%
  rename(days=n) %>%
  mutate(years=round(days/365)) %>%
  ungroup() %>%
  select(years) %>%
  summary()
##      years     
##  Min.   : 7.0  
##  1st Qu.:11.0  
##  Median :11.0  
##  Mean   :10.8  
##  3rd Qu.:11.0  
##  Max.   :12.0
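The summary gives the spread of record lengths, but a frequency table shows directly how many locations share each length. The following is a hedged sketch in base R rather than the dplyr pipeline used above. The location names A to D and their day counts are made up for illustration and are not taken from ds.

```r
# Made-up day counts per location (assumption for illustration only).
days <- c(A = 2555, B = 3984, C = 3953, D = 4380)

# One row per (date, location) observation, mimicking the shape of ds[id].
toy <- data.frame(
  location = rep(names(days), times = days),
  date     = as.Date("2008-01-01") + sequence(unname(days)) - 1
)

# Days observed per location, converted to whole years.
years <- round(table(toy$location) / 365)

# Number of locations for each length of record.
years_freq <- table(as.vector(years))
print(years_freq)
```

Applied to the real data, the same tabulation would show how strongly the record lengths cluster around the median.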


Copyright © 1995-2021 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0.