Data Science Desktop Survival Guide
by
Graham Williams
This book is copyright by the author and licensed under the Creative Commons Attribution-ShareAlike 4.0 license. It is made available to serve as a useful resource for users of Free and Open Source Data Science Software, and in particular the R statistical software. The procedures and applications presented in this book have been included for their instructional value. They have been tested at various times over many years but are not guaranteed for any particular purpose. We also note that the functionality of different packages changes over time, and whilst we make an effort to update the material, the sheer volume presents a challenge. The publisher, togaware.com, does not offer any warranties or representations, nor does it accept any liabilities with respect to the programs and applications. This book is a continual work in progress and is updated regularly. All readers are invited to send corrections, comments, suggestions, and updates to me at Graham.Williams@togaware.com. Your feedback is most welcome and will be acknowledged within the book. A PDF version of this book is available for a small fee, which goes towards supporting the development and availability of the book. Please visit https://onepager.togaware.com to receive the PDF version in return. The HTML version contains the same text and remains freely available from https://onepager.togaware.com/.
Printed January 25, 2021
Copyright © 1995-2021 by Graham Williams
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License (https://creativecommons.org/licenses/by-sa/4.0/).
Contents
Chapter: Preface
Chapter: Data Science
Data Wrangling
Data Scientist
Data Science Through Doing
Chapter: Introducing R
Tooling For R Programming
Introducing RStudio
Editing Code
Executing Code
RStudio Review
Packages and Libraries
A Glimpse of the Dataset
Attaching a Package
Simplifying Commands
Working with the Library
Getting Help
Chapter: R Constructs
A Data Frame as a Dataset
Assignment
Case Statement
Commands
Functions
Operators
Pipe Operator
Pipeline
Pipeline Construction
Pipeline Identity Operator
Pipeline Syntactic Sugar
Pipes: Assignment Pipe
Pipes: Exposition Pipe
Pipes: Tee Pipe
Pipes: Tee Pipe Intermediate Results
Pipes: Tee Pipe Load CSV Files
Pipes and Plots
Variables
Chapter: R Tasks
Create Directory
File Exists
Format Date
Chapter: R Strings
Strings Setup
Case Conversion
Concatenate Strings
Concatenate NULL
Glue Strings Together
Glue Pipelines
Length of String
Random Strings
Regexp Pattern Matching
Regexp Quantifiers
Regexp Character Classes
Sub-Strings in Base R
Sub-Strings in Tidy
Trim and Pad
Wrapping and Words
Chapter: R Read, Write, and Create
Create Data Setup
Clipboard Data
CSV Data
Excel Data Read
Excel Data Write
MATLAB Data
Random Dataset
Read Strings from a File
TSV Data
Chapter: Data Template
Dataset Setup
Data Glimpse
Normalise Variable Names
Variables and Model Target
Identify Numeric Variables
Normalise Factor Levels
Modelling Roles
Variable Types
Formula to Describe the Goal
Missing Values
Random Seed
Training, Tuning, and Test Datasets
Review the Dataset
Data Template
Chapter: Data Exploration
Exploration Setup
Counting Groups
Random Sample
Constant Variables
Correlated Numeric Variables
Selecting Columns
Selecting Rows
Shuffle Rows
Chapter: Data Wrangling
Wrangling Setup
Wrangling Data Review
Add Columns
Add Columns Using Variables
Add Counts
Combine Rows
Counting Groups
Dollar to Numeric Conversion
Extract Column as Vector
Filter Rows Having Missing Values
Missing Value Imputation
Modify Columns
Normalise Variables
Rename Variables
Replacing Missing Values
Subset of Rows Within Groups
Data Source
Data Ingestion
The Shape of the Dataset
A Glimpse of the Dataset
Introducing Template Variables
Locating Datasets in Memory
Changing Datasets in Memory
Reviewing Variable Names
Effect on Data Storage
Special Case Variable Name Transformations
Data Review
Dataset Head and Tail
Random Observations
Characters
Factors
Location
Evaporation and Sunshine
Wind Directions
Ordered Factor
Rain
Numeric
Logical
Variable Roles
Risk Variable
ID Variables
Ignore IDs and Outputs
Ignore Missing
Ignore Excessive Level Variables
Dealing with Correlations
Removing Ignored Variables
Feature Selection
Missing Targets
Missing Values
Omitting Observations
Normalise Factors
Target as a Factor
Identify Variable Types
Identify Numeric and Categoric Variables
Save the Dataset
A Template for Data Preparation
Chapter: Data Visualisation
Visualisation Setup
Visualisation Data Review
Bar Chart Basic
Bar Chart Colour No Legend
Bar Chart Faceted Background
Bar Chart Narrow Bars
Bar Chart Supplied Values
Bar Chart Wide Bars
Bar Chart Wide and Borders
Box Plot Distributions
Colour Names
Colour Ranges
Faceted Location Scatter Plot
Faceted Location Thin Lines
Faceted Wind Direction
Flipped Bar Chart
Flipped Mean Confidence Intervals
Flipped Sorted Axes
Flipped Text Annotations
Flipped Text Annotations Commas
Labels
Labels with Comma
Labels with Dollars
Labels Removed
Labels Rotated
Line Chart Basic
Line Chart Density Distribution
Line Chart Skewed Distributions
Line Chart Log X Axis
Line Chart Log Breaks
Line Chart Log Ticks
Line Chart Log Custom Labels
Pie Chart
Plotting Regions
Save Plot to File
Scatter Plot
Scatter Plot Colour
Scatter Plot Colour Alternative
Scatter Plot Smooth Gam
Scatter Plot Smooth Loess
Transparent Plots
Violin Plot
Violin Plot Embedded Box Plot
Violin Plot Faceted Location
XFig Support
Chapter: Statistics
Analysis of Variance ANOVA
Chapter: ML Template
ML Setup
ML Modelling Setup
ML Data Glimpse
Model Building
Predict Class
Predict Probability
Evaluating the Model
Accuracy and Error Rate
Confusion Matrix
ROC Chart
Risk Chart
Biased Estimate from the Training Dataset
Step 8: Save the Model to File
Boilerplate
Command Summary
Further Reading
Model Template
Chapter: ML Scenarios
Machine Learning Setup
Reinforcement Learning
Supervised Learning
Unsupervised Learning
Chapter: ML Activities
Classification
Cluster Analysis
Outlier Detection
Prediction
Text Mining
Chapter: ML Applications
Airport Gate Assignment
Article Summarisation
Gene Detection
Group Recommendations
Program Comprehension
Pull Request Generation
Reservoir Inflow
Sentiment Analysis
Topic Modelling
Chapter: ML Algorithms
Algorithms Setup
Algorithms Data and Variables
Algorithms Data Review
Collaborative Filtering
Convolutional Neural Network CNN
Decision Trees
K Means Clustering
K Nearest Neighbours
Linear Regression
Logistic Regression
Long Short-Term Memory Neural Networks LSTM
Multi Layer Perceptron
Neural Networks
Naïve Bayes
One-Class Support Vector Machine
Recurrent Neural Networks
Residual Neural Network
Support Vector Machine
Chapter: Cluster Analysis
Clustering Setup
Biclustering
References
Chapter: Decision Trees
Decision Trees Setup
Decision Trees Modelling Setup
Rattle Startup
Rattle Weather Dataset
Rattle Summary of Dataset
Rattle Model Tab
Rattle Build Tree
Interpret RPart Decision Tree
Rattle View Decision Tree
Rattle Error Matrix
Rattle Risk Chart
Rattle ROC Curve
Rattle Hand Plots
Rattle Score Dataset
Rattle Log
Rattle GUI to R
Build a Decision Tree Model
Summary of the Model
Complexity Parameter
Complexity Parameter Plot
Complexity Parameter Behaviour
Complexity Parameter 0
Complexity Parameter Table
Variable Importance
Node Details and Surrogates
Decision Tree Performance
Rules from Decision Tree
Rules Using Rpart Plot
Plot Decision Trees
Plot Decision Tree Uniformly
Plot Decision Tree with Extra Information
Fancy Rpart Plot
RPart Plot Default Tree
RPart Plot Favourite
Enhanced Plot: With Colour
Enhanced Plots: Label all Nodes
Enhanced Plots: Labels Below Nodes
Enhanced Plots: Split Labels
Enhanced Plots: Interior Labels
Enhanced Plots: Number of Observations
Enhanced Plots: Add Percentage of Observations
Enhanced Plots: Classification Rate
Enhanced Plots: Add Percentage of Observations
Enhanced Plots: Misclassification Rate
Enhanced Plots: Probability per Class
Enhanced Plots: Add Percentage Observations
Enhanced Plots: Only Probability Per Class
Enhanced Plots: Probability of Second Class
Enhanced Plots: Add Percentage Observations
Enhanced Plots: Only Probability of Second Class
Enhanced Plots: Probability of the Class
Enhanced Plots: Overall Probability
Enhanced Plots: Percentage of Observations
Enhanced Plots: Show the Node Numbers
Enhanced Plots: Show the Node Indices
Enhanced Plots: Line up the Leaves
Enhanced Plots: Angle Branch Lines
Enhanced Plots: Do Not Abbreviate Factors
Enhanced Plots: Add a Shadow to the Nodes
Enhanced Plots: Draw Branches as Dotted Lines
Enhanced Plots: Other Options
Party Tree
Conditional Decision Tree
Conditional Decision Tree Performance
CTree Plot
Weka Decision Tree
Weka Decision Tree Performance
Weka Decision Tree Plot Using Party
The Original C5.0 Implementation
C5.0 Summary
C5.0 Decision Tree Performance
C5.0 Rules Model
C5.0 Rules Summary
C5.0 Rules Performance
Regression Trees
Visualise Regression Trees
Visualise Regression Trees: Uniform
Visualise Regression Trees: Extra Information
Fancy Plot of Regression Tree
Enhanced Plot of Regression Tree: Default
Enhanced Plot of Regression Tree: Favourite
Party Regression Tree
Conditional Regression Tree
CTree Plot
Weka Regression Tree
Chapter: Computer Vision
Computer Vision Setup
Resnet Models
Chapter: Graph Data
Graph Setup
Graph Embedding
Graph Terminology
Knowledge Graph
Link Prediction
Reasoning over Knowledge Graphs
Chapter: Privacy
Privacy Setup
Privacy Computer Vision
Differential Privacy
Chapter: Literate Data Science
KnitR Setup
Basic LaTeX Template
RStudio with KnitR
Template for a Narrative
RStudio Compile PDF
Compiled PDF
SweaveOpts Undefined
Including R Commands
KnitR Basic Example
Inline R Code
Formatting Tables Using Kable
Formatting Options
Improvements Using BookTabs
Formatting Tables Using XTable
Formatting Numbers with XTable
Adding a Caption and Reference Label
Sophisticated Captions
Including Figures
Sample Figure
Adjusting Aspect
Choosing Dimensions
Setting Output Width
Add a Caption and Label
Animation: Basic Example
Adding a Flowchart
Adding Bibliographies
Referencing Chunks in LaTeX
Truncating Long Lines
Truncating Too Many Lines
Selective Lines of Code
Knitr Options
Knitr Resources
Chapter: Coding with Style
Style Matters
Naming Files
Multiple File Scripts
Naming Objects
Naming Functions
Comments
Layout
If-Else Issue
Indentation
Alignment
Sub-Block Alignment
Function Guidelines
Function Definition Layout
Function Call Layout
Functions from Packages
Assignment
Miscellaneous
Good Practise
Style Resources
Chapter: Resources
Bibliography
Index
Support further development by purchasing the PDF version of the book.
Other online resources include the GNU/Linux Desktop Survival Guide.
Books available on Amazon include Data Mining with Rattle and Essentials of Data Science.
Popular open source software includes rattle and wajig.
Hosted by Togaware, a pioneer of free and open source software since 1984.
Copyright © 2000-2020 Togaware Pty Ltd. Creative Commons ShareAlike V4.