Data Science Desktop Survival Guide by Graham Williams

## ID Variables

20180723 From our observations so far we note that the variable `date` acts as an identifier, as does the variable `location`. Given a date and a location we have a single observation of the remaining variables, so these two variables together identify each observation. Identifiers are not usually used as independent variables when building predictive models.
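As a minimal sketch of how one might check that a date and a location together pick out exactly one observation, the following counts the rows for each pair and keeps any pair that occurs more than once. The data frame `toy` and its `rainfall` column are hypothetical stand-ins used only for illustration; they are not the dataset used in this chapter.

```r
library(dplyr)

# Hypothetical toy data standing in for the real dataset.
toy <- data.frame(
  date     = as.Date(c("2018-07-01", "2018-07-02", "2018-07-01", "2018-07-02")),
  location = c("Ballarat", "Ballarat", "Moree", "Moree"),
  rainfall = c(0.0, 1.2, 3.4, 0.0)
)

# Count the observations for each (date, location) pair and keep duplicates.
dups <- toy %>%
  count(date, location) %>%
  filter(n > 1)

# Zero rows here means the pair uniquely identifies every observation.
nrow(dups)
```

If the count is not zero, the two variables alone would not be sufficient as identifiers for that data.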

```r
# Note any identifiers.

id <- c("date", "location")
```

We can get a sense of how this works with the following, which lists a random sample of locations and how long observations have been collected for each.

```r
ds[id] %>%
  group_by(location) %>%
  count() %>%
  rename(days=n) %>%
  mutate(years=round(days/365)) %>%
  as.data.frame() %>%
  sample_n(10)
```
```
##        location days years
## 1   Williamtown 3649    10
## 2      Ballarat 3680    10
## 3  MountGambier 3679    10
## 4         Moree 3649    10
## 5     Newcastle 3680    10
....
```

The data for each location ranges in length from 6 years up to 11 years, though most have 10 years of data.

```r
ds[id] %>%
  group_by(location) %>%
  count() %>%
  rename(days=n) %>%
  mutate(years=round(days/365)) %>%
  ungroup() %>%
  select(years) %>%
  summary()
```
```
##      years
##  Min.   : 6.000
##  1st Qu.:10.000
##  Median :10.000
##  Mean   : 9.878
##  3rd Qu.:10.000
##  Max.   :11.000
```
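Since identifiers are not usually used as independent variables, one possible way to set them aside is to remove their names from the list of candidate input variables. The sketch below is only an illustration under assumptions: the data frame `toy` and the columns `min_temp` and `rainfall` are hypothetical, and `vars` is simply a name chosen here for the remaining variables.

```r
# Hypothetical toy data standing in for the real dataset.
toy <- data.frame(
  date     = as.Date(c("2018-07-01", "2018-07-02")),
  location = c("Ballarat", "Moree"),
  min_temp = c(3.1, 7.4),
  rainfall = c(0.0, 1.2)
)

# The identifiers, as noted earlier in this section.
id <- c("date", "location")

# Candidate input variables: every column name that is not an identifier.
vars <- setdiff(names(toy), id)
vars
```

Keeping the identifier names in their own vector means the same exclusion can be repeated wherever a list of modelling variables is built.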