Data Science Desktop Survival Guide
by Graham Williams
ID Variables
20180723 From our observations so far we note that the variable date acts as an identifier, as does the variable location. Given a date and a location we have a single observation of the remaining variables, so together these two variables identify each observation. Identifiers are not usually used as independent variables when building predictive models.
# Note any identifiers.
id <- c("date", "location")
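The claim that a date together with a location picks out one observation can be checked directly, and the identifiers can then be set aside when listing candidate input variables. The following is a minimal sketch using base R only. The data frame toy and its columns min_temp and rainfall are hypothetical stand-ins for ds, and the name inputs is an assumption made for illustration.

```r
# Hypothetical toy data standing in for ds (the real dataset has more columns).
toy <- data.frame(
  date     = as.Date(c("2018-01-01", "2018-01-02", "2018-01-01", "2018-01-02")),
  location = c("Canberra", "Canberra", "Sydney", "Sydney"),
  min_temp = c(12.1, 13.4, 18.0, 19.2),
  rainfall = c(0.0, 1.2, 3.4, 0.0)
)

id <- c("date", "location")

# anyDuplicated() returns 0 when no row of the identifier columns repeats,
# so 0 means each (date, location) pair labels exactly one observation.
dup <- anyDuplicated(toy[id])

# Candidate input variables: every column that is not an identifier.
inputs <- setdiff(names(toy), id)

print(dup)
print(inputs)
```

A non-zero value of dup would give the index of the first repeated pair, which suggests the chosen variables do not fully identify the observations.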
We can get a sense of how this works with the following, which lists a random sample of 10 locations together with the number of days, and the approximate number of years, over which observations have been collected at each location.
# Days and approximate years of data for a random sample of 10 locations.
ds[id] %>%
  group_by(location) %>%
  count() %>%
  rename(days=n) %>%
  mutate(years=round(days/365)) %>%
  as.data.frame() %>%
  sample_n(10)
The data for each location ranges in length from 4 years up to 9 years, though most have 8 years of data.
# Summarise the approximate years of data across all locations.
ds[id] %>%
  group_by(location) %>%
  count() %>%
  rename(days=n) %>%
  mutate(years=round(days/365)) %>%
  ungroup() %>%
  select(years) %>%
  summary()
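The summary reports the range and quartiles of the years but not how many locations fall at each length. A tally of locations per number of years is one way to see which length is most common. The sketch below assumes dplyr is installed. The data frame toy and the names years_tally, days and locations are hypothetical, chosen only for illustration.

```r
library(dplyr)

# Hypothetical stand-in for ds[id]: three locations with different record lengths.
toy <- bind_rows(
  data.frame(date = seq(as.Date("2010-01-01"), by = "day", length.out = 730),
             location = "A"),
  data.frame(date = seq(as.Date("2010-01-01"), by = "day", length.out = 730),
             location = "B"),
  data.frame(date = seq(as.Date("2010-01-01"), by = "day", length.out = 1460),
             location = "C")
)

years_tally <- toy %>%
  count(location, name = "days") %>%    # one row per location with its day count
  mutate(years = round(days / 365)) %>% # approximate years, as in the text
  count(years, name = "locations")      # how many locations have each length

print(years_tally)
```

Applied to ds[id] in place of toy, the row with the largest locations value would show the most common record length.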