Data Science Desktop Survival Guide
by Graham Williams
Chapter: Literate Data Science
20200602 A data scientist's role is to tell the stories supported by the data. The narrative that we tell is one of our key deliverables and as such we need our narrative to be well supported by the data. In telling the narrative the analysis needs to be transparent, repeatable, and reproducible. We also capture and share our activities for quality assurance and for peer review. We will find ourselves repeating our work on other datasets in other scenarios and with other organisations. Documenting what we do helps when we come back to the code at a later time. Others will also want to reproduce our work and we should do all we can to facilitate that process. In short, we need to clearly communicate what we do so that we and others can understand and can continue the journey.
A general rule of thumb tells us that we should spend about a quarter of our time documenting our projects, capturing what we have done. Even more important is to capture this as we are doing the work, rather than leaving it as a chore to write up later. This does present an overhead and risks interrupting the flow of our work, but the investment pays off in the longer term. Tools can be used to support the capture of our work with minimal interruption to our workflow.
To support the narrative and to encourage our efforts to be transparent, repeatable and reproducible we introduce the concept of literate programming (Knuth, 1984). The concept is to intermix our narrative with the underlying analyses of the data (our code) within the one document. By introducing the concept here we aim to provide a solid foundation for the data scientist. We won't always have the time or the patience to deliver a carefully crafted narrative telling the story derived from the data but we should strive to do so.
We will use knitr to support literate data science. This package combines the document typesetting power of the free and open source LaTeX software with the statistical power of R. Literate data science is also well supported by tools that are able to process the source document into a beautifully formatted PDF. This book is itself produced using knitr.
In addition to these packages we also need to install the LaTeX software. LaTeX is a typesetting markup language which, combined with knitr, allows us to intermix R code with our narrative and to program certain parts of the narrative using R. LaTeX is free and open source software, and installation instructions are available from the LaTeX Project (https://latex-project.org).
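As an illustration of intermixing R code with the narrative, a minimal knitr source document might look like the sketch below. It is not taken from the book. It assumes R's built-in cars dataset, and the chunk label summary-cars and the section title are hypothetical names chosen for the example.

```latex
\documentclass{article}
\begin{document}

\section*{A First Look at the Data}

% An R code chunk: knitr runs the code and places the code
% and its output at this point in the document.
<<summary-cars, echo=TRUE>>=
summary(cars$speed)
@

% Inline R: parts of the narrative are computed from the data.
The dataset contains \Sexpr{nrow(cars)} observations with a mean
speed of \Sexpr{round(mean(cars$speed), 1)} mph.

\end{document}
```

Assuming the file is saved as report.Rnw (a hypothetical file name) and a LaTeX installation is available, calling knitr::knit2pdf("report.Rnw") from R should run the chunk, substitute the inline values, and compile the result into a PDF. Because the numbers in the text come from the code, the narrative stays consistent with the data whenever the analysis is rerun.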