Data Science Desktop Survival Guide
Graham Williams
Preface
About this Book
Organisation of the Book
Acknowledgements
Waiver and Copyright
1
Data Science
1.1
Data Wrangling
1.2
Data Analyst
1.3
Data Scientist
1.4
Data Science Through Doing
2
Introducing R and Rattle
2.1
Installing R
2.2
Installing R Libraries
2.3
Installing R with CRAN on Debian/Ubuntu
2.4
Installing Rattle
2.5
Installing Rattle on Linux
2.6
Installing Rattle on Linux Debian/Ubuntu
2.7
Installing Rattle on Linux Homebrew
2.8
Installing Rattle on Linux Snap
2.9
Installing Rattle on Linux Zip Local User
2.10
Installing Rattle on Linux Zip All Users
2.11
Installing Rattle on macOS
2.12
Installing Rattle on macOS Homebrew
2.13
Installing Rattle on Windows
2.14
Installing Rattle on Windows Inno
2.15
Installing Rattle on Windows Zip
2.16
Installing Rattle on Windows Troubleshooting
2.17
Installing Rattle V5 Deprecated
2.18
Installing RStudio
2.19
Introducing RStudio
2.20
Editing Code
2.21
Executing Code
2.22
RStudio Review
2.23
Packages and Libraries
2.24
A Glimpse of the Dataset
2.25
Attaching a Package
2.26
Simplifying Commands
2.27
Working with the Library
2.28
Getting Help
3
R Constructs
3.1
A Data Frame as a Dataset
3.2
A Tibble as a Dataset
3.3
Assignment
3.4
Case Statement
3.5
Commands
3.6
Functions
3.7
Operators
3.8
Pipe Operator
3.9
Pipeline
3.10
Pipeline Construction
3.11
Pipeline Identity Operator
3.12
Pipeline Syntactic Sugar
3.13
Pipes: Assignment Pipe
3.14
Pipes: Exposition Pipe
3.15
Pipes: Tee Pipe
3.16
Pipes: Tee Pipe Intermediate Results
3.17
Pipes: Tee Pipe Load CSV Files
3.18
Pipes and Plots
3.19
Variables
4
R Tasks
4.1
Create Directory
4.2
File Exists
4.3
Format Date
5
Dates
5.1
Dates Setup
5.2
Convert String to Date
5.3
Extract Date from Date Time Object
5.4
Extract Time from Date Time Object
5.5
Extract Year, Month and Day
5.6
Format Dates
5.7
Format Strings
6
Strings
6.1
Strings Setup
6.2
Case Conversion
6.3
Concatenate Strings
6.4
Concatenate NULL
6.5
Glue Strings Together
6.6
Glue Pipelines
6.7
Last Character of a String
6.8
Length of String
6.9
Random Strings
6.10
Regexp Pattern Matching
6.11
Regexp Quantifiers
6.12
Regexp Character Classes
6.13
Sub Strings in Base R
6.14
Sub Strings in the Tidyverse
6.15
Substitute Strings
6.16
Trim and Pad
6.17
Wrapping and Words
7
R Read, Write, and Create
7.1
Data Creation Setup
7.2
Clipboard Data
7.3
CSV Data Reading
7.4
CSV Data Writing
7.5
Excel Data Reading
7.6
Excel Data Writing
7.7
MATLAB Data
7.8
Random Dataset
7.9
Read Strings from a File
7.10
TSV Data
8
Data Template
8.1
Dataset Setup
8.2
Data Glimpse
8.3
Normalise Variable Names
8.4
Variables and Model Target
8.5
Identify Numeric Variables
8.6
Normalise Factor Levels
8.7
Modelling Roles
8.8
Variable Types
8.9
Formula to Describe the Goal
8.10
Missing Values
8.11
Random Seed
8.12
Train, Tune, and Test Datasets
8.13
Review the Dataset
8.14
Data Template
9
Data Exploration
9.1
Exploration Setup
9.2
Counting Groups
9.3
Random Sample
9.4
Constant Variables
9.5
Correlated Numeric Variables
9.6
Missing Values in Rattle
9.7
Selecting Columns
9.8
Selecting Rows
9.9
Shuffle Rows
10
Data Wrangling
10.1
Wrangling Setup
10.2
Wrangling Data Review
10.3
Add Columns
10.4
Add Columns Using Variables
10.5
Add Counts
10.6
Binning
10.7
Cleanup
10.8
Combine Rows
10.9
Counting Groups
10.10
Dollar to Numeric Conversion
10.11
Drop Columns
10.12
Drop Obs with Missing Values
10.13
Extract Column as Vector
10.14
Filter Rows Having Missing Values
10.15
Imputation
10.16
Impute Constant
10.17
Impute Mean/Median/Mode
10.18
Impute Zero/Missing
10.19
Indicator Variables
10.20
Join Categorics
10.21
Lag Variable Calculations
10.22
Missing Value Imputation
10.23
Modify Columns
10.24
Normalise Variables
10.25
Pivot Pairwise Binary Table
10.26
Rename Variables
10.27
Replace Missing Values
10.28
Rescale Data in Rattle
10.29
Rescale Data using Rank
10.30
Rescale Data using Recenter in Rattle
10.31
Subset of Rows Within Groups
10.32
Transforming Data in Rattle
10.33
Data Source
10.34
Data Ingestion
10.35
The Shape of the Dataset
10.36
A Glimpse of the Dataset
10.37
Introducing Template Variables
10.38
Locating Datasets in Memory
10.39
Changing Datasets in Memory
10.40
Reviewing Variable Names
10.41
Effect on Data Storage
10.42
Special Case Variable Name Transformations
10.43
Data Review
10.44
Dataset Head and Tail
10.45
Random Observations
10.46
Characters
10.47
Factors
10.48
Location
10.49
Evaporation and Sunshine
10.50
Wind Directions
10.51
Ordered Factor
10.52
Rain
10.53
Numeric
10.54
Logical
10.55
Variable Roles
10.56
Risk Variable
10.57
ID Variables
10.58
Ignore IDs and Outputs
10.59
Ignore Missing
10.60
Ignore Excessive Level Variables
10.61
Dealing with Correlations
10.62
Removing Ignored Variables
10.63
Feature Selection
10.64
Missing Targets
10.65
Omitting Observations
10.66
Normalise Factors
10.67
Target as a Factor
10.68
Identify Variable Types
10.69
Identify Numeric and Categoric Variables
10.70
Save the Dataset
10.71
A Template for Data Preparation
11
Data Visualisation
11.1
Visualisation Setup
11.2
Visualisation Data
11.3
Visualisation Data Review
11.4
Arranging Plots Quickly with Blanket
11.5
Bar Chart
11.6
Bar Chart Colour No Legend
11.7
Bar Chart Dodge
11.8
Bar Chart Dodge Labelled Colour Brewer
11.9
Bar Chart Faceted Background
11.10
Bar Chart Flipped Colour Mean no Legend
11.11
Bar Chart Flipped Colour Mean Confidence Intervals
11.12
Bar Chart Flipped Sorted Axes
11.13
Bar Chart Flipped Text Annotations
11.14
Bar Chart Flipped Text Annotations Commas
11.15
Bar Chart Labels
11.16
Bar Chart Narrow Bars Economist Theme
11.17
Bar Chart Ordered X Axis
11.18
Bar Chart Stacked
11.19
Bar Chart Supplied Values
11.20
Bar Chart Texts
11.21
Bar Chart Wide Bars
11.22
Bar Chart Wide and Borders
11.23
Bar Chart Showcase Solar
11.24
Benford Plot
11.25
Box Plot
11.26
Box Plot in Rattle
11.27
Colour Names
11.28
Colour Ranges
11.29
Cumulative Plot
11.30
Density Plot
11.31
Density Plot in Rattle
11.32
Dot Plot
11.33
Faceted Location Scatter Plot
11.34
Faceted Location Thin Lines
11.35
Faceted Wind Direction
11.36
Filter Data Within Plot
11.37
Labels and Titles
11.38
Labels with Comma
11.39
Labels with Dates Economist Theme
11.40
Labels with Dollars
11.41
Labels Removed
11.42
Labels Rotated
11.43
Legend Position
11.44
Legend Removal
11.45
Line Chart Basic
11.46
Line Chart Density Distribution
11.47
Line Chart Skewed Distributions
11.48
Line Chart Log X Axis
11.49
Line Chart Log Breaks
11.50
Line Chart Log Ticks
11.51
Line Chart Log Custom Labels
11.52
Mosaic Plot
11.53
Multiple Plots in a Single Plot
11.54
Pairs Plot: Using ggpairs()
11.55
Pie Chart
11.56
Plotting Regions
11.57
Rose Chart
11.58
Rose Chart Discussion
11.59
Save Plot to File
11.60
Scatter Plot
11.61
Scatter Plot Colour Shape Theme BW
11.62
Scatter Plot Colour Choice
11.63
Scatter Plot Smooth Gam
11.64
Scatter Plot Smooth Loess
11.65
Text Path
11.66
Text Path Clock Multiple Plots
11.67
Text Path Density Plot Text Location
11.68
Text Path Smooth Plot
11.69
Themes in Rattle
11.70
Violin Plot
11.71
Violin Plot Embedded Box Plot
11.72
Violin Plot Faceted Location
11.73
XFig Support
12
Statistics
12.1
Analysis of Variance ANOVA
12.2
Correlation Test
12.3
F-Test Two-Sample
12.4
Kolmogorov-Smirnov Test
12.5
t-Test Two-Sample
12.6
Wilcoxon Rank Sum Test
12.7
Wilcoxon Signed Rank Test
13
Spatial Data and Maps
13.1
Geocodes
13.2
Understanding Spatial Data
13.3
Plotting Shapefiles
13.4
Google Maps: Geocoding
14
Model Template
14.1
ML Setup
14.2
ML Data and Variables
14.3
ML Modelling Setup
14.4
ML Data Glimpse
14.5
Model Building
14.6
Predict Class
14.7
Predict Probability
14.8
Evaluating the Model
14.9
Accuracy and Error Rate
14.10
Confusion Matrix
14.11
ROC Chart
14.12
Risk Chart
14.13
Biased Estimate from the Training Dataset
14.14
Step 8: Save the Model to File
14.15
Boilerplate
14.16
Command Summary
14.17
Model Template Further Reading
14.18
Model Template Example
15
ML Scenarios
15.1
Machine Learning Setup
15.2
Multi Arm Bandit
15.3
Reinforcement Learning
15.4
Supervised Learning
15.5
Unsupervised Learning
15.6
Hyper-Parameter Tuning
16
ML Activities
16.1
Classification
16.2
Cluster Analysis
16.3
Outlier Detection
16.4
Prediction
16.5
Text Mining
17
ML Applications
17.1
Airport Gate Assignment
17.2
Article Summarisation
17.3
Facial Recognition
17.4
Gene Detection
17.5
Group Recommendations
17.6
Program Comprehension
17.7
Pull Request Generation
17.8
Reservoir Inflow
17.9
Sentiment Analysis
17.10
Topic Modelling
18
ML Algorithms
18.1
Algorithms Setup
18.2
Algorithms Data and Variables
18.3
Algorithms Data Review
18.4
Collaborative Filtering
18.5
Convolutional Neural Network CNN
18.6
Decision Trees
18.7
Deep Convolutional Generative Adversarial Network
18.8
Generative Adversarial Network
18.9
K Means Clustering
18.10
K Nearest Neighbours
18.11
Linear Regression
18.12
Logistic Regression
18.13
Long Short-Term Memory Neural Networks LSTM
18.14
Multi Layer Perceptron
18.15
Neural Networks
18.16
Naive Bayes
18.17
One-Class Support Vector Machine
18.18
Recurrent Neural Networks
18.19
Residual Neural Network
18.20
Support Vector Machine
19
Cluster Analysis
19.1
Clustering Setup
19.2
Biclustering
19.3
K-Means Clustering
20
Decision Trees
20.1
Decision Trees Setup
20.2
Decision Trees Modelling Setup
20.3
Rattle Startup
20.4
Rattle Weather Dataset
20.5
Rattle Summary of Dataset
20.6
Rattle Model Tab
20.7
Rattle Build Tree
20.8
Interpret RPart Decision Tree
20.9
Rattle View Decision Tree
20.10
Rattle Error Matrix
20.11
Rattle Risk Chart
20.12
Rattle ROC Curve
20.13
Rattle Hand Plots
20.14
Rattle Score Dataset
20.15
Rattle Log
20.16
Rattle GUI to R
20.17
Build a Decision Tree Model
20.18
Summary of the Model
20.19
Complexity Parameter
20.20
Complexity Parameter Plot
20.21
Complexity Parameter Behaviour
20.22
Complexity Parameter 0
20.23
Complexity Parameter Table
20.24
Variable Importance
20.25
Node Details and Surrogates
20.26
Decision Tree Performance
20.27
Rules from Decision Tree
20.28
Rules Using Rpart Plot
20.29
Plot Decision Trees
20.30
Plot Decision Tree Uniformly
20.31
Plot Decision Tree with Extra Information
20.32
Fancy Rpart Plot
20.33
RPart Plot Default Tree
20.34
RPart Plot Favourite
20.35
Enhanced Plot: With Colour
20.36
Enhanced Plots: Label all Nodes
20.37
Enhanced Plots: Labels Below Nodes
20.38
Enhanced Plots: Split Labels
20.39
Enhanced Plots: Interior Labels
20.40
Enhanced Plots: Number of Observations
20.41
Enhanced Plots: Add Percentage of Observations
20.42
Enhanced Plots: Classification Rate
20.43
Enhanced Plots: Add Percentage of Observations
20.44
Enhanced Plots: Misclassification Rate
20.45
Enhanced Plots: Probability per Class
20.46
Enhanced Plots: Add Percentage Observations
20.47
Enhanced Plots: Only Probability Per Class
20.48
Enhanced Plots: Probability of Second Class
20.49
Enhanced Plots: Add Percentage Observations
20.50
Enhanced Plots: Only Probability of Second Class
20.51
Enhanced Plots: Probability of the Class
20.52
Enhanced Plots: Overall Probability
20.53
Enhanced Plots: Percentage of Observations
20.54
Enhanced Plots: Show the Node Numbers
20.55
Enhanced Plots: Show the Node Indices
20.56
Enhanced Plots: Line up the Leaves