Data scientists write programs to ingest, fuse, clean, wrangle,
visualise, analyse, and model data. Programming over data is a core
task for the data scientist. We will primarily use R (R Core Team, 2020),
and in particular the tidyverse, as our programming language and assume
basic familiarity with R as may be gained from the many resources
available on the Internet, particularly from
https://cran.r-project.org/manuals.html.
The development of the tidyverse has been instrumental in bringing
R into the modern data science era, and the resources provided by
RStudio and the tidyverse community are extensive. In particular, as
you develop your data analyses, be sure to have the RStudio
cheatsheets for the tidyverse in front of you. You will find them
invaluable. Visit https://rstudio.com/resources/cheatsheets/.
Programmers of data write sentences, which we call code. Code
instructs a computer to perform specific tasks. A collection of
sentences written in a language is what we might call a
program. Through programming by example and
learning by immersion we will share programs that deliver
insights and outcomes from our data.
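
As a minimal sketch of such a program, the following few sentences of R
use the tidyverse to summarise a dataset. It assumes the dplyr package is
installed, and it uses the mtcars dataset that ships with R purely as a
stand-in for our own data. The name summary_by_cyl is chosen here for
illustration only.

```r
# Assumption: the dplyr package (part of the tidyverse) is installed.
library(dplyr)

# mtcars is a small dataset distributed with R; it stands in for "our data".
summary_by_cyl <- mtcars %>%
  filter(hp > 100) %>%               # keep cars with more than 100 horsepower
  group_by(cyl) %>%                  # group the remaining cars by cylinder count
  summarise(n = n(),                 # number of cars in each group
            mean_mpg = mean(mpg))    # average miles per gallon in each group

print(summary_by_cyl)
```

Each line of the pipeline reads as one sentence: take the data, keep some
rows, group them, then summarise each group.
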
R is a large and complex ecosystem for the practice of data
science. There is much freely available information on the Internet
from which we can continually learn and borrow useful code segments
that illustrate almost any task we might think of. We introduce here
the basics for getting started with R, libraries and packages which
extend the language, and the concepts of functions, commands, and
operators.
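
The ideas of packages, functions, and operators can be sketched in a few
lines of R. This is an illustrative example rather than a complete
account: it assumes the dplyr package is installed, and the variable
names (avg, a, b, piped) are invented here for the illustration.

```r
# A package extends the language. library() attaches an installed package.
# Assumption: dplyr has been installed, e.g. with install.packages("dplyr").
library(dplyr)

# A function is called by name, with its arguments in parentheses.
# The `<-` operator assigns the result to a name.
avg <- mean(c(2, 4, 6))

# Operators are functions with a special syntax: these two lines are equivalent.
a <- 1 + 2
b <- `+`(1, 2)

# Packages can supply new operators. The pipe %>% (from magrittr, re-exported
# by dplyr) passes its left-hand side as the first argument of the function
# on its right.
piped <- c(2, 4, 6) %>% mean()

# Check that the two forms agree in each case.
print(identical(a, b))
print(identical(avg, piped))
```

Both checks print TRUE: the operator and function forms of addition give
the same result, as do the direct and piped calls to mean().
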