ID Variables

		Data Science Desktop Survival Guide by Graham Williams

20180723 From our observations so far we note that the variable date acts as an identifier, as does the variable location. Given a date and a location we have an observation of the remaining variables, so together these two variables identify each observation. Identifiers are not usually used as independent variables when building predictive models.

# Note any identifiers.

id <- c("date", "location")
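As a minimal sketch of how the identifiers might be used, the following checks that each date and location pair occurs only once and then sets the identifiers aside when listing candidate input variables. The small data frame and the names dup_count and inputs are assumptions made up for illustration; they stand in for the full dataset ds and are not part of the original template.

```r
# Toy stand-in for the weather dataset ds (values are made up).
ds <- data.frame(
  date       = as.Date(c("2018-07-01", "2018-07-02", "2018-07-01", "2018-07-02")),
  location   = c("Canberra", "Canberra", "Sydney", "Sydney"),
  min_temp   = c(-1.2, 0.4, 8.1, 9.0),
  rain_today = c("no", "no", "yes", "no")
)

# Note any identifiers.
id <- c("date", "location")

# Count repeated (date, location) pairs. Zero means the identifiers
# jointly pick out a single observation.
dup_count <- sum(duplicated(ds[id]))

# Candidate input variables: everything except the identifiers.
inputs <- setdiff(names(ds), id)

print(dup_count)
print(inputs)
```

A non-zero dup_count would suggest that date and location together do not uniquely identify the observations and that the data deserves a closer look.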

We might get a sense of how this works with the following, which lists a random sample of locations together with the number of days, and the approximate number of years, of observations collected for each location.

ds[id] %>%
  group_by(location) %>%
  count() %>%
  rename(days=n) %>%
  mutate(years=round(days/365)) %>%
  as.data.frame() %>%
  sample_n(10)

##         location days years
## 1    Williamtown 3649    10
## 2       Ballarat 3680    10
## 3   MountGambier 3679    10
## 4          Moree 3649    10
## 5      Newcastle 3680    10
....

The data for each location ranges in length from 6 years up to 11 years, though most have 10 years of data.

ds[id] %>%
  group_by(location) %>%
  count() %>%
  rename(days=n) %>%
  mutate(years=round(days/365)) %>%
  ungroup() %>%
  select(years) %>%
  summary()

##      years       
##  Min.   : 6.000  
##  1st Qu.:10.000  
##  Median :10.000  
##  Mean   : 9.878  
##  3rd Qu.:10.000  
##  Max.   :11.000
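The year counts above divide the number of observations by 365, which assumes that a location has an observation for every day. As a hedged alternative sketch, the elapsed time between the first and last observation can be computed for each location instead. The toy data frame and the names spans, first, last and span_years are assumptions for illustration only.

```r
library(dplyr)

# Toy stand-in for ds (made-up dates): the records are sparse, so the
# number of rows says little about the period covered.
ds <- data.frame(
  date     = as.Date(c("2010-01-01", "2012-01-01", "2014-01-01",
                       "2010-01-01", "2011-01-01")),
  location = c("Canberra", "Canberra", "Canberra", "Sydney", "Sydney")
)

spans <- ds %>%
  group_by(location) %>%
  summarise(days  = n(),          # Number of observations recorded.
            first = min(date),    # Earliest observation date.
            last  = max(date)) %>% # Latest observation date.
  mutate(span_years = round(as.numeric(difftime(last, first, units = "days")) / 365.25, 1)) %>%
  as.data.frame()

print(spans)
```

Comparing days with span_years for a location gives a rough indication of how complete its record is.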

Copyright © 2000-2020 Togaware Pty Ltd. Creative Commons ShareAlike V4.