Data Science Desktop Survival Guide
by Graham Williams
Chapter: Data Template
The business understanding phase of a data science project aims to understand the business problem and to then liaise with the business data technicians to identify the data available. This is followed by the data understanding phase where we work with the business data technicians to access and ingest the data into R. We are then in a position to initiate our journey of discovery driven by the data. By living and breathing the data in the context of the business problem we gain our bearings and feed our intuitions as we journey.
In this chapter we present the common series of steps that initialise the data phase of data science—the data setup. Through this chapter we extract the basic shape and characteristics of the dataset. We prepare the dataset for exploration and wrangling. At the end of this chapter we will have a template for the repeatable end-to-end processing of the data. As you become proficient with R and data science you will develop your own habits and idiosyncrasies which you will incorporate into your own template.
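As a minimal sketch of what extracting the basic shape and characteristics of a dataset can look like, the following uses only base R. The built-in iris dataset is a stand-in chosen for illustration, and the object name ds is an assumption rather than anything fixed by the chapter.

```r
# Stand-in dataset: iris ships with base R (datasets package).
ds <- iris

# Basic shape: number of observations (rows) and variables (columns).
dim(ds)

# Variable names and the class of each variable.
names(ds)
sapply(ds, class)

# A compact view of the structure, then a per-variable summary.
str(ds)
summary(ds)
```

Swapping iris for a newly ingested dataset leaves every other line unchanged, which is the property the template aims for.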
The template concept, developed extensively in Williams (2017), consists of canonical programming code that can be reused with little or no modification on a new dataset. The intention is that to get started with a new dataset only a few initial lines of code within the template need to be modified. Only minimal change is then required for the remainder of the code within the template. For the software engineer, the concept of a template is a stepping stone toward developing functions in R that are general and reusable. For us though, rather than delving into the intricacies of the R language we immerse ourselves in using R to achieve our outcomes, learning R as we proceed and moving into more sophisticated software engineering practices.
The template consists of programming code that can be reused with little or no modification on a new dataset. The intention is that to get started with a new dataset only a few lines at the top of the template need to be modified. Minimal (if any) change is then required for the remainder of the code. In many respects the concept of a template is a stepping stone toward writing functions in R.
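The idea can be sketched as follows, assuming the built-in iris dataset as a stand-in. The names dsname, target, inputs and describe_dataset() are hypothetical and chosen only for illustration. Only the first two assignments are specific to the dataset, and the final function shows one way the generic lines might later be wrapped up for reuse.

```r
# ---- Dataset-specific lines: modify these for a new dataset ----
dsname <- "iris"      # name of the dataset object (hypothetical choice)
target <- "Species"   # variable of interest (hypothetical choice)

# ---- Generic lines: intended to run unchanged ----
ds     <- get(dsname)            # generic reference to the dataset
vars   <- names(ds)              # all variable names
inputs <- setdiff(vars, target)  # every variable other than the target
nobs   <- nrow(ds)               # number of observations

cat("Dataset:", dsname, "with", nobs, "observations and",
    length(vars), "variables\n")

# ---- Stepping stone: the same generic lines as a function ----
describe_dataset <- function(ds, target) {
  vars <- names(ds)
  list(nobs   = nrow(ds),
       vars   = vars,
       inputs = setdiff(vars, target))
}

info <- describe_dataset(ds, target)
```

Pointing dsname and target at a different dataset is then the only change needed to rerun the rest of the script.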