Data Science Desktop Survival Guide
by Graham Williams
Chapter: Data Visualisation
20200608 One of the most important tasks for any data scientist is to visualise data. Presenting data visually will often lead to new insights and discoveries, as well as providing glimpses of any issues with the data itself. Sometimes these will stand out in a visual presentation whilst being hidden in textual views. A visual presentation is also an effective means for communicating insight to others.
The simplest of plots is a Scatter Plot, which places a point on the graph for each observation. A Line Graph joins the points. A Bar Chart draws a bar up to each value. A Box Plot summarises the distribution of the values of a variable. A Pie Chart is not usually recommended, since angles and areas are harder to compare than lengths. Faceted plots consist of multiple graphs of the same data split across the levels of one or more factors. Plots can be coloured or annotated with text. Indeed, we are free to innovate with visualisations, though we need to be aware of the principles of effective communication.
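As a rough illustration of the basic plot types described above, the following sketch uses only the traditional graphics functions and the sample datasets that ship with R. The choice of datasets (mtcars and pressure) and of variables is an assumption made purely for the example.

```r
# A minimal sketch of four basic plot types using base R graphics.
# The mtcars and pressure datasets ship with R and are used for illustration only.

op <- par(mfrow = c(2, 2))  # Arrange the four plots in a 2 x 2 grid.

# Scatter plot: one point per observation.
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon",
     main = "Scatter Plot")

# Line graph: the same kind of points, joined in order.
plot(pressure$temperature, pressure$pressure, type = "l",
     xlab = "Temperature (deg C)", ylab = "Pressure (mm Hg)",
     main = "Line Graph")

# Bar chart: one bar per category, here the count of cars by cylinders.
counts <- table(mtcars$cyl)
barplot(counts, xlab = "Cylinders", ylab = "Number of cars",
        main = "Bar Chart")

# Box plot: the distribution of a variable, here split by a factor.
boxplot(mpg ~ cyl, data = mtcars,
        xlab = "Cylinders", ylab = "Miles per gallon",
        main = "Box Plot")

par(op)  # Restore the previous graphics parameters.
```

When run non-interactively (for example with Rscript) the plots are written to the default PDF device rather than displayed on screen.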
R is capable of producing excellent publication-ready graphics in many formats, including PostScript and PDF (vector formats that scale without loss of quality), PNG (a raster format well suited to line graphics and text), and JPG (a raster format better suited to photographic or colourful images). R offers a comprehensive suite of tools to visualise data, including graphics (the traditional system), grid and lattice (higher-level graphics), and ggplot2, the most recent of these.
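One way to write a plot to a particular format is to open the corresponding graphics device, draw the plot, and then close the device. The sketch below assumes the base R device functions pdf() and png(); the file names, sizes, and the use of a temporary directory are illustrative choices only.

```r
# A minimal sketch of saving the same plot as PDF (vector) and PNG (raster).
# File names and dimensions here are illustrative assumptions.

draw_plot <- function() {
  plot(cars$speed, cars$dist,
       xlab = "Speed (mph)", ylab = "Stopping distance (ft)",
       main = "Speed and Stopping Distance")
}

# PDF: width and height are specified in inches.
pdf_file <- file.path(tempdir(), "scatter.pdf")
pdf(pdf_file, width = 6, height = 4)
draw_plot()
dev.off()

# PNG: width and height are in pixels, res is the nominal pixels per inch.
# The png device is not available in every R build, so check first.
png_file <- file.path(tempdir(), "scatter.png")
if (capabilities("png")) {
  png(png_file, width = 1200, height = 800, res = 200)
  draw_plot()
  dev.off()
}
```

The functions postscript() and jpeg() follow the same open, draw, close pattern for the other two formats mentioned.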
ggplot2 is based on the idea of a grammar of graphics: a grammar for writing sentences that describe graphics. The default themes for producing plots are based on the collective wisdom of many visual presentation researchers. Using this package we construct a plot beginning with the dataset along with the aesthetics (e.g., the x-axis and y-axis) and then add geometric elements, statistical operations, scales, facets, coordinates, and numerous other components to the plot.
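The layered construction described above might look something like the following sketch. It assumes the ggplot2 package is installed and uses the mpg dataset bundled with it; the particular variables, labels, and theme are chosen only for illustration.

```r
# A minimal sketch of building a plot layer by layer with ggplot2.
# The mpg dataset ships with ggplot2 and is used for illustration only.
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy)) +       # Dataset and aesthetics.
  geom_point(aes(colour = class)) +               # Geometric element: points.
  geom_smooth(method = "lm", formula = y ~ x,     # Statistical operation:
              se = FALSE) +                       #   a fitted linear trend.
  scale_y_continuous(limits = c(0, NA)) +         # Scale: start the y-axis at 0.
  facet_wrap(~ drv) +                             # Facets: one panel per drive type.
  labs(x = "Engine displacement (litres)",        # Annotation: axis labels and
       y = "Highway miles per gallon",            #   legend title.
       colour = "Vehicle class") +
  theme_minimal()                                 # Replace the default theme.

print(p)  # Render the plot on the current graphics device.
```

Each `+` adds one component to the plot object, so components can be added, removed, or reordered independently while exploring the data.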
The official ggplot2 documentation is extensive and accessible.